Profile Guided Deployment of Stream Programs on Multicores. S. M. Farhad, The University of Sydney. Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz.



2 Outline
Motivation
  - Multicore trend
  - Stream programming
Research Questions
  - How to profile communication overhead on multicores?
  - How to deploy stream programs?
Related Works

3 Motivation
[Figure: number of cores per chip over time (courtesy: Scott '08), alongside the rise of stream programming languages: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind]

4 Stream Programming Paradigm
Programs are expressed as stream graphs:
  - Streams: infinite sequences of data elements
  - Actors: functions applied to streams
[Figure: an actor consuming an input stream and producing an output stream]

5 Properties of Stream Programs
Regular and repeating computation
Independent actors with explicit communication
  - Producer/consumer dependencies
[Figure: FM radio stream graph: AtoD → FMDemod → Splitter → LPF 1/LPF 2/LPF 3 → HPF 1/HPF 2/HPF 3 → Joiner → Adder → Speaker]

6 StreamIt Language
An implementation of stream programming
Hierarchical structure: each construct has a single input stream and a single output stream, and the parallel computation inside a construct may be any other StreamIt construct
Constructs: filter, pipeline, splitjoin (splitter/joiner), feedback loop (joiner/splitter)
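The hierarchical composition described above can be sketched in Python. This is an illustrative model of StreamIt's single-input/single-output constructs, not StreamIt syntax; all names here are hypothetical.

```python
# Minimal sketch of StreamIt-style hierarchical constructs: every construct
# maps one input stream to one output stream, so constructs compose freely.

def filter_(work):
    """A filter applies a work function to each stream element."""
    def run(stream):
        return [work(x) for x in stream]
    return run

def pipeline(*stages):
    """A pipeline chains constructs: the output of one feeds the next."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

def splitjoin(split, branches, join):
    """Split the stream, run branches in parallel, join the results."""
    def run(stream):
        parts = split(stream, len(branches))
        outs = [b(p) for b, p in zip(branches, parts)]
        return join(outs)
    return run

def roundrobin_split(stream, n):
    """Round-robin splitter, as in StreamIt's splitjoin."""
    return [stream[i::n] for i in range(n)]

def roundrobin_join(parts):
    """Round-robin joiner: interleave the branch outputs."""
    out = []
    for tup in zip(*parts):
        out.extend(tup)
    return out

# Example graph: scale each element, then two parallel branches.
graph = pipeline(
    filter_(lambda x: 2 * x),
    splitjoin(roundrobin_split,
              [filter_(lambda x: x + 1), filter_(lambda x: x - 1)],
              roundrobin_join),
)
print(graph([1, 2, 3, 4]))  # → [3, 3, 7, 7]
```

Because every construct exposes the same stream-to-stream interface, any of them can appear wherever a filter could, which is the hierarchical property the slide names.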

7 Outline
Motivation
  - Multicore trend
  - Stream programming
Research Questions
  - How to profile communication overhead on multicores?
  - How to deploy stream programs?
Related Works

8 How to Estimate the Communication Overhead on Multicores?

9 Problems in Measuring Communication Overhead on Multicores
Reasons:
  - Multicores are not communication-exposed architectures
  - Complex cache hierarchy
  - Cache coherence protocols
Consequence:
  - The communication cost cannot be measured directly
  - Instead, estimate the communication cost by measuring the execution times of actors

10 Measuring the Communication Overhead of an Edge
[Figure: actors i and k placed together on Processor 1 (no communication cost) versus i on Processor 1 and k on Processor 2 (with communication cost)]

11 How to Minimize the Required Number of Experiments
[Figure: a pipeline A → B → C whose two edges (1, 2) are coloured by graph coloring, requiring 2 + 1 profiling steps; and a graph over A–F partitioned across Processor 1 and Processor 2 so that first the even-numbered edges and then the odd-numbered edges cross the partition]

12 Observation 1: There is no loop of three actors in a stream graph
[Figure: actors i, k, l split across Processor 1 and Processor 2]

13 Observation 2: Edges that share no adjacent nodes do not interfere with each other
[Figure: stream graph over A–F; the blue-coloured edges are measured together on processors P-1 to P-4]

14 Removing Interference
  - Convert the stream graph to a line graph
  - Add interference edges
  - Apply a vertex coloring algorithm
[Figure: stream graph over actors A–F and its line graph over edge nodes AB, BC, BD, CE, DE, EF]
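The three steps above can be sketched as follows. The stream graph is the one from the slide (A–F); the greedy colouring is a stand-in for whatever vertex-coloring algorithm the tool actually uses, so the step count it produces is illustrative.

```python
# Build the line graph of the stream graph (one node per stream edge, with
# an interference edge between any two stream edges sharing an actor), then
# vertex-colour it. Edges with the same colour do not interfere and can be
# profiled in the same step.
from itertools import combinations

# Stream graph from the slide: A->B, B->C, B->D, C->E, D->E, E->F.
stream_edges = [("A","B"), ("B","C"), ("B","D"), ("C","E"), ("D","E"), ("E","F")]

def line_graph(edges):
    """Adjacency of the line graph: two stream edges interfere
    iff they share an endpoint actor."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    return adj

def greedy_colour(adj):
    """Give each node the smallest colour unused by its neighbours."""
    colour = {}
    for node in adj:
        used = {colour[n] for n in adj[node] if n in colour}
        c = 0
        while c in used:
            c += 1
        colour[node] = c
    return colour

colours = greedy_colour(line_graph(stream_edges))
steps = max(colours.values()) + 1
print(steps)  # → 3 profiling steps for this graph
print(sorted(e for e, c in colours.items() if c == 0))  # → [('A', 'B'), ('C', 'E')]
```

Each colour class is one profiling experiment: all edges of that colour are forced across the processor partition simultaneously, which is how the number of steps drops far below the number of edges.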

15 Processor Leveling Graph
[Figure: for the blue-coloured edges, the stream graph over A–F collapses into the processor leveling graph with nodes A, {B, C, D, E}, and F]

16 Coloring the Processor Leveling Graph
[Figure: the leveling graph nodes A, {B, C, D, E}, and F assigned across Processor 1 and Processor 2]

17 Measuring the Communication Cost
[Figure: for the blue-coloured edges, the stream graph over A–F and its processor leveling graph A, {B, C, D, E}, F mapped onto Processor 1 and Processor 2]

18 Profiling Performance
[Table: per-benchmark profiling results with columns Benchmark, Total Edges, Prof. Steps, Steps/Edge (%), Err (%), for SAR, MatrixMult, MergeSort, FMRadio, DCT, RadixSort, FFT, MPEG, Channel, and BeamFormer; geometric mean: 17% steps/edge, 15% error]

19 Outline
Motivation
  - Multicore trend
  - Stream programming
Research Questions
  - How to profile communication overhead?
  - How to deploy stream programs?
Related Works

20 Deployment of Stream Programs
[Figure: pipeline A (5) → B (40) → C (40) → D (5); A and B are mapped to Processor 1, C and D to Processor 2]
Load on Processor 1 = (5 + 40) + 5 = 50
Load on Processor 2 = (40 + 5) + 5 = 50
Makespan = 50, Speedup = 90/50 = 1.8
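The load and makespan arithmetic on this slide can be sketched in a few lines. The actor work values are from the slide; the per-edge communication costs are an assumption chosen to reproduce the slide's totals, and the crossing-edge accounting (cost charged to both endpoint processors) is likewise an illustrative reading, not the paper's exact model.

```python
# Given an actor-to-processor mapping, a processor's load is its actors'
# work plus the cost of stream edges that cross the partition; the makespan
# is the maximum load, and speedup is total work divided by makespan.

work = {"A": 5, "B": 40, "C": 40, "D": 5}
edges = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 5}  # illustrative costs

def makespan(mapping):
    load = {}
    for actor, proc in mapping.items():
        load[proc] = load.get(proc, 0) + work[actor]
    for (u, v), cost in edges.items():
        if mapping[u] != mapping[v]:
            # A crossing edge burdens both endpoints' processors.
            load[mapping[u]] += cost
            load[mapping[v]] += cost
    return max(load.values())

# The slide's communication-aware mapping: one crossing edge (B, C).
good = {"A": 1, "B": 1, "C": 2, "D": 2}
m = makespan(good)
print(m, sum(work.values()) / m)  # → 50 1.8
```

The ILP formulation mentioned in the summary searches over such mappings for the one minimizing this makespan.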

21 Deploying Stream Programs without Considering Communication
[Figure: the same pipeline, but A and C are mapped to Processor 1, and B and D to Processor 2]
Load on Processor 1 = (5 + 40) + ( ) = 100
Load on Processor 2 = (40 + 5) + ( ) = 100
Makespan = 100, Speedup = 90/100 = 0.9
Increase over the communication-aware mapping = (100 − 50) × 100% / 50 = 100%

Deployment Performance
[Table: per-benchmark makespans with columns Benchmark, m (µs), ḿ (µs), (ḿ − m)/m %, for SAR, MatrixMult, MergeSort, FMRadio, DCT, RadixSort, FFT, MPEG, Channel, and BeamFormer]

23 Speedups obtained for 2, 4, and 6 processors

24 Summary
We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
We propose an ILP-based approach that minimizes the makespan
We conducted experiments:
  - The number of profiling steps is on average only 17% of the number of edges
  - The profiling scheme shows only 15% error on average in the random-mapping test
  - The approach obtains a speedup of 3.11x for 4 processors and 4.02x for 6 processors

25 Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by Approximation [Farhad '11]

Questions?

27 Minimizing Errors in the Profiling Process
Errors are likely in any profiling process:
  - We chose an architecture with a uniform cache hierarchy
  - We pin the threads using the likwid-pin tool

28 Cache Topology of the Processor
[Figure: six cores (Core #0 – Core #5), each with a 64 kB L1 and a 512 kB L2 cache, sharing a 6 MB L3 cache; 800 MHz hexa-core AMD Phenom(tm) II X6 1090T]