Communication Overhead Estimation on Multicores
S. M. Farhad, The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz

Outline
Motivation
  - Multicore trend
  - Stream programming
Profiling communication overhead
Related works

Motivation
[Figure: # cores/chip, annotated with programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, C T M, Rstream, Rapidmind, Stream Programming. Courtesy: Scott '08]

Stream Programming Paradigm
Programs are expressed as stream graphs
  - Streams: infinite sequences of data elements
  - Actors: functions applied to streams
[Figure: a stream graph with an actor and a stream labeled]

Properties of Stream Programs
Regular and repeating computation
Independent actors with explicit communication
  - Producer/consumer dependencies
[Figure: example stream graph with actors AtoD, FMDemod, Splitter, LPF 1, LPF 2, LPF 3, HPF 1, HPF 2, HPF 3, Joiner, Adder, Speaker]

StreamIt Language
An implementation of stream programming
Hierarchical structure
Each construct has a single input/output stream
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter). Each component may be any StreamIt language construct.]

How to Estimate the Communication Overhead?

Problems in Measuring Communication Overhead
Reasons:
  - Multicores are not communication-exposed architectures
  - Complex cache hierarchy
  - Cache coherence protocols
Consequence:
  - The communication cost cannot be measured directly
  - Estimate the communication cost by measuring the execution time of actors

Measuring the Communication Overhead of an Edge
[Figure: actors i and k placed together on Processor 1 (no communication cost), compared with actors i and k placed on Processor 1 and Processor 2 (with communication cost)]
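The comparison on this slide can be sketched as a difference of two timings. This is a minimal sketch, assuming the overhead of an edge (i, k) is estimated as the execution time measured with i and k on different processors minus the time measured with both on the same processor. The function name, the clamping at zero, and the sample timings are hypothetical and not taken from the slides.

```python
def estimate_edge_overhead(time_same_processor, time_split):
    """Estimate the communication overhead of one edge (assumed model).

    time_same_processor: execution time with both actors on one processor
    time_split: execution time with the two actors on different processors
    """
    # Assumption: the extra time observed when the edge crosses processors
    # is attributed entirely to communication over that edge. Negative
    # differences (measurement noise) are clamped to zero.
    return max(0.0, time_split - time_same_processor)


# Hypothetical timings in microseconds, for illustration only.
overhead = estimate_edge_overhead(time_same_processor=120.0, time_split=150.0)
print(overhead)
```

The estimate is only as good as the two timings, so in practice each configuration would likely be run several times and averaged.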

How to Minimize the Required Number of Experiments
Pipeline: graph coloring requires 2+1 experiments
[Figure: a pipeline of actors A to F with numbered edges, mapped onto two processors in two ways: one mapping with the even edges across the partition, one with the odd edges across the partition]
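The two pipeline mappings in the figure can be sketched as follows. This is a minimal sketch, assuming edges are numbered 1, 2, ... along the pipeline and that one experiment lets only the even edges cross the partition while another lets only the odd edges cross. The function name `pipeline_partition` and the processor numbering are hypothetical.

```python
def pipeline_partition(num_actors, crossing_parity):
    """Assign pipeline actors to processors 1 and 2.

    Edge e (1-based) connects actor e-1 and actor e (0-based).
    Edges whose number has the given parity (0 = even, 1 = odd)
    cross the partition; all other edges stay on one processor.
    """
    assignment = [1]  # the first actor is placed on processor 1
    for edge in range(1, num_actors):
        previous = assignment[-1]
        if edge % 2 == crossing_parity:
            assignment.append(3 - previous)  # switch processor: edge crosses
        else:
            assignment.append(previous)      # same processor: no crossing
    return assignment


actors = "ABCDEF"
even = pipeline_partition(len(actors), crossing_parity=0)
odd = pipeline_partition(len(actors), crossing_parity=1)
print(dict(zip(actors, even)))  # even edges cross the partition
print(dict(zip(actors, odd)))   # odd edges cross the partition
```

Because neighbouring edges never cross the partition in the same experiment, the slowdown seen around a crossing edge is not mixed with that of an adjacent edge.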

Observation 1: There is no loop of three actors in a stream graph
[Figure: actors i, k, and l placed on Processor 1 and Processor 2]

Observation 2: There is no interference of adjacent nodes between edges
[Figure: stream graph with actors A to F; for the blue colored edges, the actors are mapped onto processors P-1, P-2, P-3, and P-4]

Remove Interference
Convert to a line graph
Add interference edges
Use vertex coloring algorithm
[Figure: stream graph with actors A to F and edges AB, BC, BD, CE, DE, EF, next to its line graph with vertices AB, BC, BD, CE, DE, EF]
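The three steps above can be sketched on the example graph from the figure. This is a minimal sketch under a loudly stated assumption: the slides do not define interference edges, so the code assumes two stream-graph edges interfere when an actor of one is adjacent to an actor of the other. The greedy coloring is one possible vertex coloring algorithm, not necessarily the one used by the authors, and `color_edges` is a hypothetical name.

```python
from itertools import combinations


def color_edges(edges):
    """Greedy vertex coloring of the line graph plus interference edges."""
    # Adjacency between actors in the stream graph (treated as undirected).
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Conflict graph: one vertex per stream-graph edge.
    conflicts = {e: set() for e in edges}
    for e1, e2 in combinations(edges, 2):
        # Line-graph edge: the two stream-graph edges share an actor.
        shares_actor = bool(set(e1) & set(e2))
        # ASSUMED interference rule: an actor of e1 is adjacent to an actor of e2.
        interferes = any(b in neighbors[a] for a in e1 for b in e2)
        if shares_actor or interferes:
            conflicts[e1].add(e2)
            conflicts[e2].add(e1)

    # Greedy coloring: each edge gets the smallest color unused by its conflicts.
    colors = {}
    for e in edges:
        used = {colors[n] for n in conflicts[e] if n in colors}
        colors[e] = next(c for c in range(len(edges)) if c not in used)
    return colors, conflicts


edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
colors, conflicts = color_edges(edges)
for e in edges:
    print("".join(e), colors[e])
```

Under this assumed rule, AB and EF are the only two edges that do not conflict, so they end up with the same color and could be profiled in the same experiment.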

Processor Leveling Graph
[Figure: stream graph with actors A to F; for the blue colored edge, the processor leveling graph has the nodes A, {B, C, D, E}, and F]
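One way to obtain the grouping shown in the figure is to merge every pair of actors joined by an edge that is not being profiled. This is a minimal sketch, assuming the blue edges are AB and EF (which is consistent with the groups A, {B, C, D, E}, F in the figure) and assuming the graph is built by contracting all other edges. The function name and the union-find approach are illustrative choices, not taken from the slides.

```python
def contract_unselected(actors, edges, selected):
    """Merge actors joined by unselected edges (simple union-find)."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Actors connected by an unselected edge must share a processor.
    for u, v in edges:
        if (u, v) not in selected:
            parent[find(u)] = find(v)

    groups = {}
    for a in actors:
        groups.setdefault(find(a), []).append(a)
    group_of = {a: tuple(groups[find(a)]) for a in actors}

    # Edges of the contracted graph: the selected edges between groups.
    contracted = [(group_of[u], group_of[v]) for u, v in edges if (u, v) in selected]
    return sorted(set(group_of.values())), contracted


actors = "ABCDEF"
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
selected = {("A", "B"), ("E", "F")}  # assumed to be the blue edges
nodes, contracted = contract_unselected(actors, edges, selected)
print(nodes)
print(contracted)
```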

Coloring the Processor Labelling Graph
[Figure: the graph with nodes A, {B, C, D, E}, and F colored with two processors: A and F share one processor, {B, C, D, E} is placed on the other]
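The coloring in the figure can be sketched as a two-coloring by breadth-first search. This is a minimal sketch, assuming the goal is to place the two ends of every remaining edge on different processors. The node name "BCDE" stands for the merged group {B, C, D, E}, and the processor numbers are arbitrary labels that may be swapped relative to the figure.

```python
from collections import deque


def two_color(nodes, edges):
    """Assign processor 1 or 2 to each node so every edge crosses processors.

    Returns None if no such assignment exists (the graph is not bipartite).
    """
    adjacent = {n: [] for n in nodes}
    for u, v in edges:
        adjacent[u].append(v)
        adjacent[v].append(u)

    processor = {}
    for start in nodes:
        if start in processor:
            continue
        processor[start] = 1
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in adjacent[node]:
                if other not in processor:
                    processor[other] = 3 - processor[node]  # opposite processor
                    queue.append(other)
                elif processor[other] == processor[node]:
                    return None  # odd cycle: two processors are not enough
    return processor


nodes = ["A", "BCDE", "F"]
edges = [("A", "BCDE"), ("BCDE", "F")]
result = two_color(nodes, edges)
print(result)
```

A and F land on the same processor and the merged group on the other, so both AB and EF cross the partition in a single experiment.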

Measuring the Communication Cost
[Figure: for the blue colored edge, the stream graph with actors A to F and the graph with nodes A, {B, C, D, E}, and F mapped onto Processor 1 and Processor 2]

Profiling Performance
[Table with columns: Benchmark, Total Edge, Prof Steps, Steps/Edge (%), Err (%). Rows: SAR, MatrixMult, MergeSort, FMRadio, DCT, RadixSort, FFT, MPEG, Channel, BeamFormer (39513), GM (17%, 15%)]

Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by Approximation [Farhad '11]

Questions?