
1 MDR: PERFORMANCE MODEL DRIVEN RUNTIME FOR HETEROGENEOUS PARALLEL PLATFORMS AUTHORS: JACQUES A. PIENAAR, ANAND RAGHUNATHAN, SRIMAT CHAKRADHAR SOURCE: INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2011 PRESENTER: 陳彥廷 2012/02/23

2 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

3 INTRODUCTION  Heterogeneous parallel computing platforms, which contain multiple processing elements (PEs) with distinct architectures, offer the potential for significant improvements in performance and power efficiency.  Programming frameworks: TBB, CUDA, OpenCL; applications are expressed as PO-DAGs (parallel operator directed acyclic graphs).

4 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

5 SLAC  Suitability: which processing element is inherently better suited (faster) to execute a given task.  Locality: whether the data required for a task is present in the local memory of the processing element.  Availability: when a processing element will next become available to execute a given task.  Criticality: how a given task's execution is likely to affect the overall program execution time.

6 PO-DAG  [Figure: example PO-DAG; schedule length 37 with a PE2-only mapping vs. 28 with utilization-based scheduling]

7 PO-DAG (SUITABILITY)  [Figure: schedule length improves from 28 to 22 when suitability is considered]

8 PO-DAG (LOCALITY)  [Figure: schedule length improves from 22 to 21 when locality is considered]

9 PO-DAG (AVAILABILITY)  [Figure: schedule length improves from 21 to 16 when availability is considered]

10 PO-DAG (CRITICALITY)  [Figure: schedule length improves from 16 to 14 when criticality is considered]

11 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

12 MOTIVATION  Programmers must manually partition the workload among the PEs.

13 MOTIVATION (CONT.)  Suitability: The programmer has to create a parameterized model of the kernel for each platform. MDR relieves the programmer of this cumbersome task.  Locality: To take locality into account, a programmer would have to keep track of the location of the most recent copy of each data item and have some means of computing its transfer time. MDR automates these steps, relieving the programmer of the need to perform them.

14 MOTIVATION (CONT.)  Availability: MDR is able to model, with reasonable accuracy, when a PE will next become available. It keeps track of all scheduled, executing, and completed tasks and uses performance models to estimate their execution times.  Criticality: Criticality adds a more global consideration to the scheduling process; it depends on the graph structure, in addition to computation and communication costs.

15 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

16 MDR SYSTEM  MDR is a heterogeneous runtime framework that models task execution, orchestrates data movement, and intelligently schedules tasks using online performance models, thereby considering suitability, locality, availability, and criticality.  MDR utilizes graph-based application-level performance models, online history-based models for kernel-level execution time, and analytical models for communication time.

17 MDR SYSTEM  The MDR scheduler can be viewed as a meta-scheduler on top of PE-specific schedulers.  MDR does not attempt to manage parallelism within each PE; this is left to PE-specific runtimes such as TBB and CUDA.

18 MDR ABSTRACTION AND FRAMEWORK  Application  Computational kernel  Data  Scheduler
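
The slide lists the four MDR abstractions without showing an interface. As a hedged illustration, the sketch below expresses them as minimal Python classes; the names (DataObject, Kernel, Task, Application) and fields are illustrative assumptions, not MDR's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Minimal sketch of the four MDR abstractions (names are illustrative).

@dataclass
class DataObject:
    """A unit of data whose size and current location the runtime tracks."""
    name: str
    nbytes: int
    location: str = "cpu"          # PE currently holding the latest copy

@dataclass
class Kernel:
    """A computational kernel with one variant per PE type."""
    name: str
    variants: Dict[str, Callable]  # e.g. {"cpu": cpu_fn, "gpu": gpu_fn}

@dataclass
class Task:
    """One node of the PO-DAG: a kernel instance with its inputs and outputs."""
    kernel: Kernel
    inputs: List[DataObject]
    outputs: List[DataObject]

@dataclass
class Application:
    """The PO-DAG: tasks plus dependence edges implied by shared data objects."""
    tasks: List[Task] = field(default_factory=list)

    def add_task(self, task: Task) -> None:
        self.tasks.append(task)
```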

19 MDR PIPELINE  [Figure: the MDR execution pipeline]

20 PERFORMANCE MODELS  Communication model  Kernel model  Graph-level model

21 COMMUNICATION MODEL  The communication model consists of two parts: a data object model and a byte-level communication model.  The byte-level model is constructed by MDR itself.  MDR uses the data object model during execution; programmers creating custom data types are expected to define the data object model in terms of the byte-level model.
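
As a hedged illustration of this two-level structure, the sketch below assumes a simple affine byte-level model (latency plus bytes divided by bandwidth) and a data-object model for a dense matrix expressed in terms of it; the constants and function names are placeholders, not values from the paper.

```python
# Byte-level model: transfer time ~= latency + bytes / bandwidth.
# MDR constructs this model itself; the numbers below are placeholders.
def byte_transfer_time(nbytes: int, latency_s: float = 15e-6,
                       bandwidth_bps: float = 5e9) -> float:
    return latency_s + nbytes / bandwidth_bps

# Data-object model: a custom data type expresses its transfer cost in
# terms of the byte-level model, e.g. a dense matrix of doubles.
def matrix_transfer_time(rows: int, cols: int, elem_bytes: int = 8) -> float:
    return byte_transfer_time(rows * cols * elem_bytes)

# Example: estimated cost of moving a 1024x1024 double matrix to another PE.
print(f"{matrix_transfer_time(1024, 1024):.4f} s")
```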

22 KERNEL MODEL  The kernel model is used to predict a task's execution time.  A signature is a per-kernel, programmer-specified n-tuple of ordinal characteristics that separates kernel instances from a performance-modeling point of view.  The signature is used to identify the most similar previous executions of a kernel; their recorded execution times are used to predict the current instance's execution time.
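
The slide does not specify how the "most similar" previous execution is found. The sketch below assumes one plausible rule, a nearest-neighbour match on the signature tuple under L1 distance, with a separate history per (kernel, PE); the class and method names are illustrative, not MDR's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[int, ...]   # per-kernel n-tuple of ordinal characteristics

class KernelModel:
    """History-based execution-time predictor, keyed by (kernel, PE)."""
    def __init__(self) -> None:
        self.history: Dict[Tuple[str, str], List[Tuple[Signature, float]]] = defaultdict(list)

    def record(self, kernel: str, pe: str, sig: Signature, seconds: float) -> None:
        """Store a measured execution time for one kernel instance."""
        self.history[(kernel, pe)].append((sig, seconds))

    def predict(self, kernel: str, pe: str, sig: Signature, default: float = 1.0) -> float:
        """Return the time of the most similar recorded instance (L1 distance)."""
        runs = self.history.get((kernel, pe))
        if not runs:
            return default   # no history yet: fall back to a default estimate
        best_sig, best_time = min(
            runs, key=lambda r: sum(abs(a - b) for a, b in zip(r[0], sig)))
        return best_time

# Example: predict a matrix-multiply on the GPU from two earlier runs.
km = KernelModel()
km.record("gemm", "gpu", (512, 512, 512), 0.004)
km.record("gemm", "gpu", (2048, 2048, 2048), 0.210)
print(km.predict("gemm", "gpu", (1024, 1024, 1024)))
```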

23 GRAPH-LEVEL MODEL  pred(t): the predecessors of a task t.  succ(t): the successors of a task t.  prod(t): the set of data objects produced by t.  PE[t]: the set of PEs on which t can execute.  c(t,p): the expected runtime of task t on PE p.  c(d,i → j): the expected cost of copying data object d from PE i to PE j.

24 GRAPH-LEVEL MODEL (CONT.)  EST: earliest starting time  LTE: length to end
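
The slide only names these two graph-level quantities. Using the notation from the previous slide, one common recursive formulation is sketched below as a hedged reconstruction; these are not necessarily the paper's exact equations, and the representative per-task cost c̄(t) is an assumption.

```latex
% Hedged reconstruction of EST and LTE; MDR's exact definitions may differ.
% \bar{c}(t) denotes a representative cost of task t, e.g. its mean c(t,p) over PE[t];
% communication costs c(d, i \to j) can be folded into the bracketed edge terms.
\begin{align*}
\mathrm{EST}(t) &=
  \begin{cases}
    0 & \text{if } \mathrm{pred}(t) = \emptyset\\[2pt]
    \max\limits_{t' \in \mathrm{pred}(t)} \bigl[\mathrm{EST}(t') + \bar{c}(t')\bigr] & \text{otherwise}
  \end{cases}\\[6pt]
\mathrm{LTE}(t) &=
  \begin{cases}
    \bar{c}(t) & \text{if } \mathrm{succ}(t) = \emptyset\\[2pt]
    \bar{c}(t) + \max\limits_{t'' \in \mathrm{succ}(t)} \mathrm{LTE}(t'') & \text{otherwise}
  \end{cases}
\end{align*}
```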

25 WORK-LIST BASED SCHEDULER  dep(t): the set of data objects that task t depends on.  ρ: the priority used to order ready tasks; tasks that are ready to execute are processed in decreasing order of ρ.  arg min: the argument of the minimum, i.e., the value of x for which f(x) attains its smallest value M. For example, if X1 = 3, X2 = 1, X3 = 6 and y = min Xi, then y = 1; for z = arg min Xi, z = 2.
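
To tie the SLAC factors to the arg min rule above, here is a minimal runnable sketch of one greedy work-list step: ready tasks are processed in decreasing ρ (criticality), and each task goes to the PE with the smallest estimated finish time, combining availability, transfer cost (locality), and the kernel model (suitability). The names (rho, eligible_pes, est_exec_time, transfer_cost, pe_available_at) are illustrative assumptions, not MDR's internal API.

```python
from typing import Dict, List

def schedule_ready_tasks(ready: List[dict],
                         pe_available_at: Dict[str, float],
                         est_exec_time,      # (task, pe) -> predicted seconds
                         transfer_cost):     # (task, pe) -> seconds to move dep(t) to pe
    """Greedy work-list step: assign each ready task to the PE that
    minimises its estimated finish time."""
    # Criticality: handle the most critical ready tasks first (decreasing rho).
    for task in sorted(ready, key=lambda t: t["rho"], reverse=True):
        # arg min over eligible PEs of expected finish time.
        best_pe = min(task["eligible_pes"],
                      key=lambda p: pe_available_at[p]
                                    + transfer_cost(task, p)
                                    + est_exec_time(task, p))
        start = pe_available_at[best_pe] + transfer_cost(task, best_pe)
        pe_available_at[best_pe] = start + est_exec_time(task, best_pe)
        print(f"{task['name']} -> {best_pe} (finish ~{pe_available_at[best_pe]:.3f}s)")

# Tiny usage example with made-up costs: the GPU is faster but incurs a copy.
ready = [{"name": "t1", "rho": 5.0, "eligible_pes": ["cpu", "gpu"]},
         {"name": "t2", "rho": 2.0, "eligible_pes": ["cpu", "gpu"]}]
schedule_ready_tasks(
    ready,
    pe_available_at={"cpu": 0.0, "gpu": 0.0},
    est_exec_time=lambda t, p: 0.02 if p == "gpu" else 0.10,
    transfer_cost=lambda t, p: 0.05 if p == "gpu" else 0.0,
)
```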

26 WORK-LIST BASED SCHEDULER (CONT.)

27 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

28 EXPERIMENT METHOD  We evaluate MDR on three different platforms, which span a large segment of the GPGPU computing spectrum, from the low-power commodity Atom-ION combination to the server-class Xeon-Tesla pairing.  These diverse platforms allow us to test MDR across greatly varying performance characteristics and to investigate the portability, scalability, and performance robustness of the runtime.

29 PLATFORMS  [Table: the three evaluation platforms]

30 BENCHMARKS  CFD: a computational fluid dynamics solver for unstructured grids using the three-dimensional Euler equations for inviscid, compressible flow.  NW: Needleman-Wunsch is an algorithm commonly used in bioinformatics for global alignment of two sequences.  LU and CHO: LU- and Cholesky-decomposition are procedures for decomposing a matrix into a product of lower and upper triangular matrices.  SSI: Semantic indexing is a popular technique used to access and organize large amounts of unstructured data.

31 BASELINE SCHEDULERS  CPU-only  GPU-only  Random  Round-robin  Utilization  GPU-first: prefers the GPU when faced with a choice (i.e., when both CPU and GPU are idle)

32 RESULTS

33 [Figure: results on the netbook, laptop, and server platforms]

34 OVERHEADS  Any runtime will introduce overheads when compared to an identically scheduled static execution.  For MDR we measured the overhead added by the runtime framework and found that it varied between 1% and 7%.  In cases where GPU-only was the best scheduling decision, we found the MDR scheduler to add, at worst, a 3% runtime overhead compared to using the GPU-only scheduler with the MDR runtime.  We used Intel VTune to profile the execution of applications under MDR on the server-class platform and found that 97.2% of the execution time was spent in application code.

35 RESULTS (CONT.)  The performance benefit of the intelligent decisions made by MDR offsets the overhead incurred by the more complex scheduling function.  MDR achieves up to a 4.2X speedup (1.5X on average) over the best of the CPU-only, GPU-only, round-robin, GPU-first, and utilization-driven schedulers.

36 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

37 RELATED WORK  Merge allows the programmer to specify predicates, which are used to select the most appropriate function variant and PE.  MapCG allows the CPU and GPU to be used concurrently to perform tasks, but the authors found no performance improvement from using both over the better of CPU-only and GPU-only for their benchmarks.

38 OUTLINE  Introduction  SLAC  Motivation  MDR System  Experiment  Related work  Conclusion

39 CONCLUSION  We presented a programming model and runtime framework for executing computations that can be represented as PO-DAGs on heterogeneous parallel platforms.  We described a runtime framework in which performance models are used to drive key runtime decisions that consider the impact of the SLAC factors.  We believe that the proposed heterogeneous runtime framework will be especially useful as larger applications with a complex mixture of computations are developed on heterogeneous parallel platforms.

40 THANKS FOR LISTENING!

