
1 MDR: PERFORMANCE MODEL DRIVEN RUNTIME FOR HETEROGENEOUS PARALLEL PLATFORMS AUTHORS: JACQUES A. PIENAAR, ANAND RAGHUNATHAN, SRIMAT CHAKRADHAR SOURCE: INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2011 PRESENTER: 陳彥廷 2012/02/23

2 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

3 INTRODUCTION  Heterogeneous parallel computing platforms, which contain multiple processing elements (PEs) with distinct architectures, offer the potential for significant improvements in performance and power efficiency.  Programming frameworks: TBB, CUDA, OpenCL; applications are expressed as PO-DAGs (parallel operator directed acyclic graphs).

4 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

5 SLAC  Suitability: which processing element is inherently better suited (faster) to execute a given task.  Locality: whether the data required for a task is present in the local memory of the processing element.  Availability: when a processing element will next become available to execute a given task.  Criticality: how a given task's execution is likely to affect the overall program execution time.

6 PO-DAG  [Figure: example PO-DAG; schedule length 37 with a PE2-only mapping vs. 28 with utilization-based scheduling]

7 PO-DAG (SUITABILITY)  [Figure: schedule length improves from 28 to 22 when suitability is considered]

8 PO-DAG (LOCALITY)  [Figure: schedule length improves from 22 to 21 when locality is considered]

9 PO-DAG (AVAILABILITY)  [Figure: schedule length improves from 21 to 16 when availability is considered]

10 PO-DAG (CRITICALITY)  [Figure: schedule length improves from 16 to 14 when criticality is considered]

11 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

12 MOTIVATION  Programmers must manually partition the workload among the PEs.

13 MOTIVATION (CONT.)  Suitability: The programmer has to create a parameterized model of the kernel for each platform. MDR relieves the programmer of this cumbersome task.  Locality: To take locality into account, a programmer would have to keep track of the location of the most recent copy of each data item and have some means of computing its transfer time. MDR automates these steps, relieving the programmer of the need to perform them.

14 MOTIVATION (CONT.)  Availability: MDR is able to model, with reasonable accuracy, when a PE will next become available. It keeps track of all scheduled, executing, and completed tasks and uses performance models to estimate their execution times.  Criticality: Criticality adds a more global consideration to the scheduling process; it depends on the graph structure, in addition to computation and communication costs.

15 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

16 MDR SYSTEM  MDR is a heterogeneous runtime framework that models task execution, orchestrates data movement, and intelligently schedules tasks using online performance models, thereby considering suitability, locality, availability, and criticality.  MDR utilizes graph-based application-level performance models, online history-based models for kernel-level execution time, and analytical models for communication time.

17 MDR SYSTEM  The MDR scheduler can be viewed as a meta-scheduler on top of PE-specific schedulers.  MDR does not attempt to manage parallelism within each PE; this is left to PE-specific runtimes such as TBB and CUDA.

18 MDR ABSTRACTION AND FRAMEWORK  Application  Computational kernel  Data  Scheduler
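
The slide lists the four MDR abstractions without showing an interface. As a hedged illustration, the sketch below expresses them as minimal Python classes; the names (DataObject, Kernel, Task, Application) and fields are illustrative assumptions, not MDR's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Minimal sketch of the four MDR abstractions (names are illustrative).

@dataclass
class DataObject:
    """A unit of data whose size and current location the runtime tracks."""
    name: str
    nbytes: int
    location: str = "cpu"          # PE currently holding the latest copy

@dataclass
class Kernel:
    """A computational kernel with one variant per PE type."""
    name: str
    variants: Dict[str, Callable]  # e.g. {"cpu": cpu_fn, "gpu": gpu_fn}

@dataclass
class Task:
    """One node of the PO-DAG: a kernel instance with its inputs and outputs."""
    kernel: Kernel
    inputs: List[DataObject]
    outputs: List[DataObject]

@dataclass
class Application:
    """The PO-DAG: tasks plus dependence edges implied by shared data objects."""
    tasks: List[Task] = field(default_factory=list)

    def add_task(self, task: Task) -> None:
        self.tasks.append(task)
```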

19 MDR PIPELINE  [Figure: the MDR execution pipeline]

20 PERFORMANCE MODELS  Communication model  Kernel model  Graph-level model

21 COMMUNICATION MODEL  The communication model consists of two parts: a data object model and a byte-level communication model.  The byte-level model is constructed by MDR itself.  MDR uses the data object model during execution; programmers creating custom data types are expected to define the data object model in terms of the byte-level model.
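
As a hedged illustration of this two-level structure, the sketch below assumes a simple affine byte-level model (latency plus bytes divided by bandwidth) and a data-object model for a dense matrix expressed in terms of it; the constants and function names are placeholders, not values from the paper.

```python
# Byte-level model: transfer time ~= latency + bytes / bandwidth.
# MDR constructs this model itself; the numbers below are placeholders.
def byte_transfer_time(nbytes: int, latency_s: float = 15e-6,
                       bandwidth_bps: float = 5e9) -> float:
    return latency_s + nbytes / bandwidth_bps

# Data-object model: a custom data type expresses its transfer cost in
# terms of the byte-level model, e.g. a dense matrix of doubles.
def matrix_transfer_time(rows: int, cols: int, elem_bytes: int = 8) -> float:
    return byte_transfer_time(rows * cols * elem_bytes)

# Example: estimated cost of moving a 1024x1024 double matrix to another PE.
print(f"{matrix_transfer_time(1024, 1024):.4f} s")
```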

22 KERNEL MODEL  The kernel model is used to predict a task's execution time.  A signature is a per-kernel, programmer-specified n-tuple of ordinal characteristics that separates kernel instances from a performance-modeling point of view.  The signature is used to identify the most similar previous executions of a kernel; their recorded execution times are used to predict the current instance's execution time.
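
The slide does not specify how the "most similar" previous execution is found. The sketch below assumes one plausible rule, a nearest-neighbour match on the signature tuple under L1 distance, with a separate history per (kernel, PE); the class and method names are illustrative, not MDR's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[int, ...]   # per-kernel n-tuple of ordinal characteristics

class KernelModel:
    """History-based execution-time predictor, keyed by (kernel, PE)."""
    def __init__(self) -> None:
        self.history: Dict[Tuple[str, str], List[Tuple[Signature, float]]] = defaultdict(list)

    def record(self, kernel: str, pe: str, sig: Signature, seconds: float) -> None:
        """Store a measured execution time for one kernel instance."""
        self.history[(kernel, pe)].append((sig, seconds))

    def predict(self, kernel: str, pe: str, sig: Signature, default: float = 1.0) -> float:
        """Return the time of the most similar recorded instance (L1 distance)."""
        runs = self.history.get((kernel, pe))
        if not runs:
            return default   # no history yet: fall back to a default estimate
        best_sig, best_time = min(
            runs, key=lambda r: sum(abs(a - b) for a, b in zip(r[0], sig)))
        return best_time

# Example: predict a matrix-multiply on the GPU from two earlier runs.
km = KernelModel()
km.record("gemm", "gpu", (512, 512, 512), 0.004)
km.record("gemm", "gpu", (2048, 2048, 2048), 0.210)
print(km.predict("gemm", "gpu", (1024, 1024, 1024)))
```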

23 GRAPH-LEVEL MODEL  pred(t): the predecessors of a task t.  succ(t): the successors of a task t.  prod(t): the set of data objects produced by t.  PE[t]: the set of PEs on which t can execute.  c(t,p): the expected runtime of task t on PE p.  c(d,i → j): the expected cost of copying data object d from PE i to PE j.

24 GRAPH-LEVEL MODEL (CONT.)  EST: earliest starting time  LTE: length to end
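
The slide only names these two graph-level quantities. Using the notation from the previous slide, one common recursive formulation is sketched below as a hedged reconstruction; these are not necessarily the paper's exact equations, and the representative per-task cost c̄(t) is an assumption.

```latex
% Hedged reconstruction of EST and LTE; MDR's exact definitions may differ.
% \bar{c}(t) denotes a representative cost of task t, e.g. its mean c(t,p) over PE[t];
% communication costs c(d, i \to j) can be folded into the bracketed edge terms.
\begin{align*}
\mathrm{EST}(t) &=
  \begin{cases}
    0 & \text{if } \mathrm{pred}(t) = \emptyset\\[2pt]
    \max\limits_{t' \in \mathrm{pred}(t)} \bigl[\mathrm{EST}(t') + \bar{c}(t')\bigr] & \text{otherwise}
  \end{cases}\\[6pt]
\mathrm{LTE}(t) &=
  \begin{cases}
    \bar{c}(t) & \text{if } \mathrm{succ}(t) = \emptyset\\[2pt]
    \bar{c}(t) + \max\limits_{t'' \in \mathrm{succ}(t)} \mathrm{LTE}(t'') & \text{otherwise}
  \end{cases}
\end{align*}
```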

25 WORK-LIST BASED SCHEDULER  dep(t): the set of data objects that task t depends on.  ρ: the priority used to order ready tasks; tasks that are ready to execute are processed in decreasing order of ρ.  arg min: the argument of the minimum, i.e., the value of x for which f(x) attains its smallest value M. For example, if X1 = 3, X2 = 1, X3 = 6 and y = min Xi, then y = 1; for z = arg min Xi, z = 2.
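
To tie the SLAC factors to the arg min rule above, here is a minimal runnable sketch of one greedy work-list step: ready tasks are processed in decreasing ρ (criticality), and each task goes to the PE with the smallest estimated finish time, combining availability, transfer cost (locality), and the kernel model (suitability). The names (rho, eligible_pes, est_exec_time, transfer_cost, pe_available_at) are illustrative assumptions, not MDR's internal API.

```python
from typing import Dict, List

def schedule_ready_tasks(ready: List[dict],
                         pe_available_at: Dict[str, float],
                         est_exec_time,      # (task, pe) -> predicted seconds
                         transfer_cost):     # (task, pe) -> seconds to move dep(t) to pe
    """Greedy work-list step: assign each ready task to the PE that
    minimises its estimated finish time."""
    # Criticality: handle the most critical ready tasks first (decreasing rho).
    for task in sorted(ready, key=lambda t: t["rho"], reverse=True):
        # arg min over eligible PEs of expected finish time.
        best_pe = min(task["eligible_pes"],
                      key=lambda p: pe_available_at[p]
                                    + transfer_cost(task, p)
                                    + est_exec_time(task, p))
        start = pe_available_at[best_pe] + transfer_cost(task, best_pe)
        pe_available_at[best_pe] = start + est_exec_time(task, best_pe)
        print(f"{task['name']} -> {best_pe} (finish ~{pe_available_at[best_pe]:.3f}s)")

# Tiny usage example with made-up costs: the GPU is faster but incurs a copy.
ready = [{"name": "t1", "rho": 5.0, "eligible_pes": ["cpu", "gpu"]},
         {"name": "t2", "rho": 2.0, "eligible_pes": ["cpu", "gpu"]}]
schedule_ready_tasks(
    ready,
    pe_available_at={"cpu": 0.0, "gpu": 0.0},
    est_exec_time=lambda t, p: 0.02 if p == "gpu" else 0.10,
    transfer_cost=lambda t, p: 0.05 if p == "gpu" else 0.0,
)
```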

26 WORK-LIST BASED SCHEDULER (CONT.)

27 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

28 EXPERIMENT METHOD  We evaluate MDR on three different platforms, which span a large segment of the GPGPU computing spectrum, from the low-power commodity Atom-ION combination to the server-class Xeon-Tesla pairing.  These diverse platforms allow us to test MDR across greatly varying performance characteristics and to investigate the portability, scalability, and performance robustness of the runtime.

29 PLATFORMS  [Table: the three evaluation platforms]

30 BENCHMARKS  CFD: a computational fluid dynamics solver for unstructured grids using the three-dimensional Euler equations for inviscid, compressible flow.  NW: Needleman-Wunsch is an algorithm commonly used in bioinformatics for global alignment of two sequences.  LU and CHO: LU- and Cholesky-decomposition are procedures for decomposing a matrix into a product of lower and upper triangular matrices.  SSI: Semantic indexing is a popular technique used to access and organize large amounts of unstructured data.

31 BASELINE SCHEDULERS  CPU-only  GPU-only  Random  Round-robin  Utilization  GPU-first: prefers the GPU when faced with a choice (i.e., when both CPU and GPU are idle)

32 RESULTS

33 [Figure: results on the netbook, laptop, and server platforms]

34 OVERHEADS  Any runtime will introduce overheads when compared to an identically scheduled static execution.  For MDR we measured the overhead added by the runtime framework and found that it varied between 1% and 7%.  In cases where GPU-only was the best scheduling decision, we found the MDR scheduler to add, at worst, a 3% runtime overhead compared to using the GPU-only scheduler with the MDR runtime.  We used Intel VTune to profile the execution of applications under MDR on the server-class platform and found that 97.2% of the execution time was spent in application code.

35 RESULTS (CONT.)  The performance benefit of the intelligent decisions made by MDR offsets the overhead incurred by the more complex scheduling function.  MDR achieves up to a 4.2X speedup (1.5X on average) over the best of the CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers.

36 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

37 RELATED WORK  Merge allows the programmer to specify predicates, which are used to select the most appropriate function variant and PE.  MapCG allows the CPU and GPU to be used concurrently to perform tasks, but the authors found no performance improvement from using both over the better of CPU-only and GPU-only for their benchmarks.

38 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

39 CONCLUSION  We presented a programming model and runtime framework for executing computations that can be represented as PO-DAGs on heterogeneous parallel platforms.  We described a runtime framework in which performance models are used to drive key runtime decisions that consider the impact of the SLAC factors.  We believe that the proposed heterogeneous runtime framework will be especially useful as larger applications with a complex mixture of computations are developed on heterogeneous parallel platforms.

40 THANKS FOR LISTENING!

