Task Mapping Methodology for Heterogeneous Multicore System


Student thesis: Doctoral Thesis





Award date: 15 Mar 2017


As Moore's law slows down, or is even expected by some observers to end, it is becoming increasingly difficult to improve the performance of transistors. On the one hand, technologies such as FinFET have been introduced to reduce leakage current and overcome other short-channel effects as technology scales down; on the other hand, parallel computing and heterogeneous computing have attracted growing attention as ways to further exploit the computing capability of existing processors. A heterogeneous multicore system is a computing system equipped with different types of processors, typically including Central Processing Units (CPUs), General-Purpose Graphics Processing Units (GPGPUs), and Field-Programmable Gate Arrays (FPGAs). To fully utilize the computing resources of a heterogeneous system, target applications must be mapped to the most appropriate processors to achieve the best performance or performance/energy ratio. Efficient and automatic mapping, however, remains an open problem that limits the adoption of heterogeneous systems.

In this thesis, we propose a novel library-based heterogeneous system architecture and programming model, and develop a hybrid task partitioning scheme based on this model. With the proposed partitioning scheme, one can predict the performance of an application on a given system at runtime after a very short pre-training process. The training results are stored in a knowledge database for future use, and a set of pre-designed libraries and data structures is provided as well. The partitioning decision is determined by both the system configuration and the properties of the application.
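The idea of a knowledge database feeding a runtime partitioning decision can be sketched as follows. This is an illustrative assumption, not the thesis's actual implementation: the class names, the linear time model, and the proportional split are all hypothetical stand-ins for the real prediction model.

```python
# Hypothetical sketch: a small "knowledge database" of timing samples
# gathered during a short pre-training run, used at runtime to predict
# per-device performance and split the work accordingly.
from collections import defaultdict

class KnowledgeDB:
    def __init__(self):
        # device name -> list of (problem_size, measured_seconds) samples
        self.samples = defaultdict(list)

    def record(self, device, size, seconds):
        self.samples[device].append((size, seconds))

    def predict(self, device, size):
        # Fit time ~ a*size + b by least squares over the stored samples
        # (an assumed model; the real predictor may be more elaborate).
        pts = self.samples[device]
        n = len(pts)
        sx = sum(s for s, _ in pts)
        sy = sum(t for _, t in pts)
        sxx = sum(s * s for s, _ in pts)
        sxy = sum(s * t for s, t in pts)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        return a * size + b

def partition(db, devices, size):
    # Split the work in proportion to each device's predicted throughput,
    # so that all devices are expected to finish at the same time.
    rates = {d: size / db.predict(d, size) for d in devices}
    total = sum(rates.values())
    return {d: rates[d] / total for d in devices}
```

For example, if pre-training measured a GPU as five times faster than the CPU on this kernel, `partition` would hand roughly 5/6 of the work to the GPU and 1/6 to the CPU.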

When many different applications execute on a heterogeneous system, task scheduling becomes a problem, because traditional scheduling solutions for homogeneous systems fail to produce an optimal plan. To solve this problem, we propose a new task scheduling algorithm that takes the properties of both the target applications and the system into consideration. At runtime, scheduling and partitioning work in combination to generate the optimal mapping. The proposed scheduling algorithm scales with the number of processors/nodes in the whole system.
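A minimal sketch of what "taking both application and system properties into consideration" can mean, assuming per-processor predicted execution times are available. This greedy list scheduler is an illustrative stand-in, not the thesis's actual algorithm:

```python
# Illustrative sketch (not the thesis's algorithm): a greedy,
# heterogeneity-aware list scheduler for independent tasks.  Each task
# carries a predicted execution time per processor, so a task that runs
# poorly on the GPU can still be placed on the CPU.

def schedule(tasks, processors):
    """tasks: list of dicts mapping processor name -> predicted seconds.
    processors: list of processor names.
    Returns (assignment as [(task_index, processor)], makespan)."""
    ready = {p: 0.0 for p in processors}   # time at which each processor frees up
    assignment = []
    # Longest-first ordering helps the greedy choice balance the load.
    for i, cost in sorted(enumerate(tasks),
                          key=lambda item: -min(item[1].values())):
        # Place the task where it is predicted to finish earliest.
        best = min(processors, key=lambda q: ready[q] + cost[q])
        ready[best] += cost[best]
        assignment.append((i, best))
    return assignment, max(ready.values())
```

Because the candidate processors are compared by predicted finish time rather than by queue length alone, the scheme naturally extends to more processors or nodes: adding a device only adds one more entry to `ready` and to each task's cost dictionary.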

At the implementation level, we focus on how to design accelerators with the best performance or performance/energy ratio. In this thesis, we take the Nvidia GPU and the CUDA programming model as an example. Two key issues are discussed: thread allocation and memory access arrangement, and we provide guidelines for deciding the best solution for each. With a proper implementation of the application, the accelerator's performance becomes an important factor in the prediction model, and thus affects the final system performance.
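The thread-allocation side of this guideline can be sketched as an occupancy calculation: pick the CUDA block size that keeps the most threads resident per streaming multiprocessor (SM), given the kernel's register and shared-memory usage. The hardware limits below are illustrative values for a hypothetical GPU, not any specific Nvidia part, and the memory-access (coalescing) guideline is not shown here:

```python
# Sketch of an occupancy-driven block-size choice.  All SM limits are
# assumed values for a hypothetical GPU; real limits come from the
# device properties of the target hardware.

SM_REGISTERS   = 65536   # registers per SM (assumed)
SM_SHARED_MEM  = 49152   # bytes of shared memory per SM (assumed)
SM_MAX_THREADS = 2048    # max resident threads per SM (assumed)
SM_MAX_BLOCKS  = 16      # max resident blocks per SM (assumed)

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Resident threads per SM for one candidate block size."""
    by_regs = SM_REGISTERS // (regs_per_thread * block_size)
    by_smem = SM_SHARED_MEM // smem_per_block if smem_per_block else SM_MAX_BLOCKS
    by_threads = SM_MAX_THREADS // block_size
    blocks = min(by_regs, by_smem, by_threads, SM_MAX_BLOCKS)
    return blocks * block_size

def best_block_size(regs_per_thread, smem_per_block,
                    candidates=(64, 128, 256, 512, 1024)):
    # The resource limit that binds (registers, shared memory, or the
    # block cap) changes with block size, so the best choice is found by
    # scanning the candidates rather than by a closed-form rule.
    return max(candidates,
               key=lambda b: occupancy(b, regs_per_thread, smem_per_block))
```

For instance, a kernel using 32 registers per thread and no shared memory is capped at 16 blocks per SM for 64-thread blocks (only 1024 resident threads), while 128-thread blocks reach the full 2048, which is why very small blocks are usually a poor choice.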

The proposed mapping scheme for heterogeneous computing systems is validated using a set of widely accepted real-world benchmarks. These range from computation-intensive to communication-intensive workloads and come from different application domains. The experimental results confirm the advantage of our proposed mapping approach, which offers an automatic and satisfactory mapping solution.