1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE ‘09. A Generic.


1 Presenter: Ming-Shiun Yang. Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE '09. A Generic Platform for Estimation of Multi-threaded Program Performance on Heterogeneous Multiprocessors

2 This paper deals with a methodology for software estimation to enable design space exploration of heterogeneous multiprocessor systems. Starting from a fork-join representation of the application specification, a high-level description of the multiprocessor target architecture, and a mapping of application components onto architecture resource elements, it estimates the performance of the application on the target multiprocessor architecture. The proposed methodology includes the effect of basic compiler optimizations and integrates lightweight memory simulation and instruction mapping for complex instructions to improve the accuracy of software estimation. To estimate performance degradation due to contention for shared resources such as memory and bus, synthetic access traces coupled with an interval analysis technique are employed. The methodology has been validated on a real heterogeneous platform. Results show that the estimation can predict performance with average errors of around 11%.

3 There are many possible mappings between an application and a hardware architecture.
• How do we know whether the mapping we chose is the best one?
  ◦ We need a performance estimator to estimate the performance of each mapping.
The estimation therefore takes three inputs:
• Application specification
  ◦ Tasks, data and communication
• Architecture specification
  ◦ Processors, memory and bus
• Mapping description
  ◦ Mapping of application components onto architecture components
[Figure: tasks A, B, C and D scheduled on processors P1 and P2 along a time axis from 0 to 10, with the saved time marked.]
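The role of the estimator in choosing a mapping can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the per-task times in `exec_time`, the toy `estimate` function (which simply takes the load of the busiest processor and ignores communication and contention), and the exhaustive search in `best_mapping`. It is not the estimator described in the paper.

```python
from itertools import product

# Hypothetical execution times (arbitrary units) of each task on each processor.
exec_time = {
    "A": {"P1": 2, "P2": 4},
    "B": {"P1": 3, "P2": 3},
    "C": {"P1": 2, "P2": 2},
    "D": {"P1": 3, "P2": 2},
}

def estimate(mapping):
    """Toy estimator: tasks are independent, so the finish time is the
    total load of the busiest processor."""
    load = {}
    for task, proc in mapping.items():
        load[proc] = load.get(proc, 0) + exec_time[task][proc]
    return max(load.values())

def best_mapping(tasks, procs):
    """Enumerate every task-to-processor mapping and keep the fastest one."""
    candidates = (dict(zip(tasks, choice))
                  for choice in product(procs, repeat=len(tasks)))
    return min(candidates, key=estimate)

tasks, procs = list(exec_time), ["P1", "P2"]
all_on_p1 = {t: "P1" for t in tasks}
best = best_mapping(tasks, procs)
print(estimate(all_on_p1), estimate(best))
```

With these made-up numbers, running everything on P1 takes 10 units, while the best two-processor mapping takes 5. Exhaustive enumeration is only workable for small task sets, which is one reason a fast estimator matters for design space exploration.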

4 [Figure: inputs used in this paper.] Application specification: fork-join task graph (represents the parallel phases of computation); SUIF compiler; software profiling. Architecture specification: HMDES (High Level Machine Description); the target processor description includes pipeline stages, memory, link, …

5 1. Estimation of mapped task on a processor. 2. Estimation of communication and synchronization delays of multi-threaded tasks. 3. Estimation of contention delays of shared resources.

6 1. Estimation of mapped task on a processor. 2. Estimation of communication and synchronization delays of multi-threaded tasks. 3. Estimation of contention delays of shared resources.

7 Introduction. [Figure: a multi-threaded application forks tasks onto Processor #1, Processor #2, …, Processor #n.] Example of a multi-threaded application running on multiple processors. Issues: communication, synchronization, resource contention, …

8 Estimation Input: Application Specification
• Fork-join task graph: a task graph consisting of alternating sequential and parallel phases, where each parallel phase consists of independent tasks.
  ◦ Vertex: a task, which is a unit of work in a parallel program.
  ◦ Edge: precedence between a pair of tasks.
[Figure: examples of fork-join task graphs, with sequential and parallel phases marked.]
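A fork-join task graph of this kind can be sketched as a small data structure. The class `TaskGraph`, its methods, and the task names below are hypothetical, chosen only to illustrate vertices, precedence edges, and the alternation of sequential and parallel phases. The paper's actual representation is not shown in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class TaskGraph:
    # Maps each task (vertex) to the set of tasks that must run after it (edges).
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_edge(self, before: str, after: str) -> None:
        """Record that `before` must finish before `after` starts."""
        self.edges.setdefault(before, set()).add(after)
        self.edges.setdefault(after, set())

    def phases(self) -> List[List[str]]:
        """Group tasks level by level: tasks in one level have no
        precedence among themselves and may run in parallel."""
        indegree = {t: 0 for t in self.edges}
        for successors in self.edges.values():
            for s in successors:
                indegree[s] += 1
        level = sorted(t for t, d in indegree.items() if d == 0)
        result = []
        while level:
            result.append(level)
            ready = []
            for t in level:
                for s in self.edges[t]:
                    indegree[s] -= 1
                    if indegree[s] == 0:
                        ready.append(s)
            level = sorted(ready)
        return result

g = TaskGraph()
for worker in ("t1", "t2", "t3"):
    g.add_edge("pre", worker)    # fork: sequential phase -> parallel phase
    g.add_edge(worker, "post")   # join: parallel phase -> sequential phase
print(g.phases())
```

For this graph, `phases()` yields a sequential phase (`pre`), a parallel phase (`t1`, `t2`, `t3`), and a final sequential phase (`post`).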

9 [Figure: estimation flow for a task on a processor, including register allocation.] Legend: bb: basic block; L: latency; T: total time; F: frequency.
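The legend suggests that the total time of a task is accumulated over its basic blocks. The formula itself did not survive in the transcript, so the following is a sketch under the assumption that T is the sum of L(bb) × F(bb) over all basic blocks; the block names, latencies, and frequencies are made up.

```python
def task_time(blocks):
    """Total time T, assuming T = sum over basic blocks of latency L times
    execution frequency F (an assumed formula, not quoted from the slide)."""
    return sum(latency * frequency for latency, frequency in blocks.values())

# Hypothetical profile: basic block -> (latency in cycles, execution count).
blocks = {
    "bb0": (12, 1),     # function entry
    "bb1": (30, 100),   # loop body, executed 100 times
    "bb2": (8, 1),      # function exit
}
print(task_time(blocks))  # 12*1 + 30*100 + 8*1 = 3020
```

Under this assumption, the latencies would come from the processor description and the frequencies from software profiling.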

10 Estimated processors
• Cradle PE
• Leon3 with FPU
• SS-mips
[Figure: estimated and actual execution cycles.] Average error rate: 14%

11 1. Estimation of mapped task on a processor. 2. Estimation of communication and synchronization delays of multi-threaded tasks. 3. Estimation of contention delays of shared resources.

12 An application may be composed of sequential pre/post-processing and nested fork-joins. A fork-join may be iterated many times. [Figure annotation: the task time includes the shared resource contention delay.]
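The composition described on this slide can be sketched as a recursive estimate. The classes `Task`, `Seq`, and `ForkJoin`, the `sync` overhead field, and all cycle counts are assumptions for illustration. The sketch further assumes that the branches of a fork-join are mapped to different processors, so an iteration takes as long as its slowest branch.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Task:
    cycles: int          # estimated execution cycles on the mapped processor
    contention: int = 0  # estimated shared-resource contention delay

@dataclass
class Seq:
    parts: List["Node"]  # executed one after another

@dataclass
class ForkJoin:
    branches: List["Node"]  # assumed to run in parallel on distinct processors
    iterations: int = 1     # how many times the fork-join is repeated
    sync: int = 0           # assumed fork/join overhead per iteration

Node = Union[Task, Seq, ForkJoin]

def estimate(node: Node) -> int:
    if isinstance(node, Task):
        return node.cycles + node.contention
    if isinstance(node, Seq):
        return sum(estimate(p) for p in node.parts)
    if isinstance(node, ForkJoin):
        slowest = max(estimate(b) for b in node.branches)
        return node.iterations * (slowest + node.sync)
    raise TypeError(f"unknown node type: {type(node).__name__}")

app = Seq([
    Task(100),                                        # pre-processing
    ForkJoin([
        Task(400, contention=20),
        Seq([Task(150),
             ForkJoin([Task(50), Task(70)], iterations=2)]),  # nested fork-join
    ], iterations=10, sync=5),
    Task(80),                                         # post-processing
])
print(estimate(app))  # 100 + 10 * (420 + 5) + 80 = 4430
```

The nested fork-join contributes 2 × 70 = 140 cycles, so the second branch takes 290 cycles and the first branch (420 cycles including contention) dominates each iteration.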

13 1. Estimation of mapped task on a processor. 2. Estimation of communication and synchronization delays of multi-threaded tasks. 3. Estimation of contention delays of shared resources.

14 Interval analysis
• Generate access rate data

15 Spreading of intervals. [Figure: accesses of processors P1, P2, P3 and P4 at the next time slot.]
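The transcript keeps only the titles of the interval analysis slides, so the following is one possible reading and not the paper's algorithm. It assumes that time is divided into fixed slots, that each processor's synthetic trace gives its number of shared-memory accesses per slot, and that the bus serves at most `capacity` accesses per slot, with unserved accesses spreading into the following slots. The traces, the capacity, and the function `spread` are all hypothetical.

```python
def spread(demand, capacity):
    """Serve at most `capacity` accesses per slot; carry the rest forward.
    Returns the accesses actually served in each slot (may be longer than
    `demand` when a backlog remains after the last slot)."""
    served, backlog = [], 0
    for requested in demand:
        pending = backlog + requested
        done = min(pending, capacity)
        served.append(done)
        backlog = pending - done
    while backlog > 0:  # drain what is left after the trace ends
        done = min(backlog, capacity)
        served.append(done)
        backlog -= done
    return served

# Hypothetical synthetic traces: accesses per time slot for each processor.
traces = {
    "P1": [3, 1, 1],
    "P2": [2, 2, 0],
    "P3": [1, 0, 1],
    "P4": [2, 1, 0],
}
demand = [sum(slot) for slot in zip(*traces.values())]  # [8, 4, 2]
served = spread(demand, capacity=4)
extra_slots = len(served) - len(demand)
print(served, extra_slots)
```

In this toy model the contention delay shows up as the extra slots needed to drain the backlog (one extra slot here).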

16 [Figure; axis label: time interval.]

17 [Figure; axis label: time interval.]

18 Architecture of the Cradle CT3400 heterogeneous multi-processor chip.
• 4 processors
• 4 DSEs (Digital Signal Engines)

19 JPEG application. [Figure: parallelism in the JPEG application and the mapping description.] Legend: P: processor; D: DSE (Digital Signal Engine).

20 Estimated cycles for 8 mappings of the JPEG application on the Cradle architecture.
• The estimated cycles without contention delay must be lower than the others.

21 Conclusion
• The paper presented a framework for retargetable performance estimation of multi-threaded applications on heterogeneous multiprocessors.
• The estimated performance includes the shared resource contention delay and the task execution time on a uniprocessor.
Comment
• The mapping of multi-threaded application components to the hardware architecture is very important for improving performance.
• The estimation error is still fairly high (the slides report average errors of 11% to 14%).

