Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2012


Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2012. Authors: Haitao Wei, Junqing Yu, Huafei Yu, Mingkang Qin, Guang R. Gao. Chih-Sheng Lin

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 2

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 3

Multi-core Architectures Multi-core architectures have become the mainstream solution and industry standard, from servers to desktop platforms and handheld devices ▫IBM’s Cell, NVIDIA’s GPUs, ICT’s Godson, MIT’s Raw A multi-core processor ▫increases computation ability ▫pushes the performance burden onto the compiler and programmer, who must effectively exploit coarse-grained parallelism across the cores 4

Stream Programming Model The stream programming model is one such approach Stream languages ▫StreamIt, Brook, CUDA, SPUR, and Cg ▫are motivated by applications in media-processing domains ▫are based on synchronous dataflow (SDF) or regular stream flow graphs (RSFGs) 5

Regular Stream Flow Graph (RSFG) Node ▫a computation task (actor) ▫has an independent instruction stream and address space ▫fires repeatedly in a periodic schedule Arc (edge) ▫the communication (flow of data) between nodes ▫carried through a communication channel 6
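An RSFG can be represented directly in code. The sketch below (Python, with hypothetical names; DFBrook's actual internal representation is not shown on these slides) models actors with a per-firing work estimate and edges with SDF push/pop rates:

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    name: str
    work: int          # estimated computation cost per firing (illustrative)

@dataclass
class Edge:
    src: str
    dst: str
    push: int          # tokens produced per source firing
    pop: int           # tokens consumed per destination firing

@dataclass
class StreamGraph:
    actors: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_actor(self, name, work):
        self.actors[name] = Actor(name, work)

    def add_edge(self, src, dst, push, pop):
        self.edges.append(Edge(src, dst, push, pop))

# A three-actor pipeline: source -> filter -> sink
g = StreamGraph()
g.add_actor("source", work=2)
g.add_actor("filter", work=5)
g.add_actor("sink", work=1)
g.add_edge("source", "filter", push=1, pop=1)
g.add_edge("filter", "sink", push=1, pop=1)
```

Because the push/pop rates are fixed constants, the firing counts of a periodic schedule can be derived statically, which is what makes compile-time software pipelining possible.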

Software Pipelining Software pipelining ▫an efficient method to exploit the coarse-grained parallelism in stream programs ▫treats the whole program as a loop and each periodic schedule as one iteration of that loop Stream programs can be easily and naturally mapped to communication-exposed multi-core architectures ▫but the gains from parallel execution can be overshadowed by the cost of communication and synchronization 7
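The loop-and-iteration view can be made concrete with a small trace: with one actor per core and each core running one pipeline stage behind its upstream neighbor, the pipeline fills during a prologue and then retires one full iteration per step. A minimal sketch (actor names are invented for illustration):

```python
# Steady-state trace of a software-pipelined 3-actor stream program on
# three cores: at time step t, core k runs actor k on iteration t - k.
actors = ["A", "B", "C"]      # pipeline stages, one per core
steps = 5

trace = []
for t in range(steps):
    row = []
    for core, actor in enumerate(actors):
        it = t - core          # the stage offset shifts the iteration index
        row.append(f"{actor}{it}" if it >= 0 else "--")
    trace.append(row)

for row in trace:
    print(" ".join(row))       # step 2 row: "A2 B1 C0" -- pipeline full
```

After the two-step prologue all three cores are busy simultaneously, which is exactly the coarse-grained parallelism the schedule is after.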

Software Pipelining (Cont.) The performance metric of software pipelining ▫the initiation rate of successive iterations Rate-optimal schedule ▫the schedule with the maximum initiation rate (minimum initiation interval) Resource limitations ▫processor capability, the memory size of each PE, interconnect bandwidth, and direct memory access (DMA) 8
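The initiation interval (II) is bounded below by the most heavily loaded PE, and the initiation rate is its reciprocal. A minimal sketch (the workloads and assignment are invented for the example):

```python
def min_initiation_interval(work, assignment, num_pes):
    """Lower bound on II: the most heavily loaded PE dictates how often
    a new iteration can be initiated."""
    load = [0] * num_pes
    for actor, pe in assignment.items():
        load[pe] += work[actor]
    return max(load)

work = {"A": 2, "B": 5, "C": 1}
# Placing the heavy actor B alone balances the load across two PEs
assignment = {"A": 0, "B": 1, "C": 0}
ii = min_initiation_interval(work, assignment, num_pes=2)  # max(3, 5) = 5
rate = 1 / ii   # initiations per time unit
```

A rate-optimal schedule is one whose II meets this bound under the best possible assignment; the resource limits listed above can force the achievable II higher.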

Goal To orchestrate an efficient software-pipelining schedule that achieves the optimal computation rate while minimizing communication cost and satisfying the resource constraints of the system 9

CMRO and ROMC CMRO (Communication-Minimized Rate-Optimal) ▫minimizes the communication cost at the optimal computation rate ▫formulated as a unified Integer Linear Programming (ILP) problem ROMC (Rate-Optimal with Memory Constraints) ▫formulated as a unified integer quadratic programming problem ▫transformed into an ILP problem using a stage-adjustment optimization 10

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 11

DFBrook Stream Language DFBrook: an extension of Brook for SDF 12

Target Architecture – Godson-T A communication-exposed multi-core platform 13

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 14

CMRO Schedule – Problem Definition 15

CMRO Schedule – Problem Definition (Cont.) 16

Example of Stream Graph and DDG Stream Graph Data Dependency Graph 17

CMRO Problem 18

Continuing with the previous example SGMS (Stream Graph Modulo Scheduling) ▫does not take the cost of communication into account 19

Continuing with the previous example CMRO 20

ILP Formulation - Space 21

ILP Formulation - Space(Cont.) 22

ILP Formulation - Space(Cont.) 23

ILP Formulation - Space(Cont.) 24

ILP Formulation - Time 25

ILP Formulation – Time(Cont.) 26

ILP Formulation for CMRO Problem 27

Rate-Optimal Schedule with Memory Constraints (ROMC) 28

ROMC (Cont.) Considerations ▫all buffers used by an instance are allocated statically in the memory of the processor to which the instance is assigned ▫in a software-pipelined schedule, multiple buffers are introduced to cover the stage distance between two connected instances 29
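The second consideration follows the usual software-pipelining convention (assumed here; the paper's exact formula is on an image slide): if a producer runs at stage s_p and its consumer at stage s_c, then s_c - s_p + 1 iterations' worth of data are live on the edge at once, and each needs its own buffer copy:

```python
def buffers_needed(stage_prod, stage_cons):
    """Stage distance determines how many iterations are in flight on
    the edge simultaneously; each needs its own buffer copy so earlier
    data is not overwritten before it is consumed."""
    assert stage_cons >= stage_prod
    return stage_cons - stage_prod + 1

# Producer at stage 0, consumer at stage 2: iterations i, i-1, and i-2
# are live at the same time, so three buffers are required.
assert buffers_needed(0, 2) == 3
```

This is why memory constraints bite: spreading a graph across more stages raises the initiation rate but multiplies the statically allocated buffers on each PE.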

Example of Buffer Allocation Schemes 30

ROMC(Cont.) 31

Solving ROMC Problem 32

Solving ROMC Problem 33

Stage Assignment and Adjustment Optimization Process 34

Stage Assignment and Adjustment Optimization Process (Cont.) 35 Key: the stage of a DMA node can be adjusted to reduce the buffer usage of victim processors

Buffer Usage Calculation 36 The number of input buffers in each PE’s memory

Buffer Usage Calculation(Cont.) 37 The number of output buffers in each PE’s memory
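Combining the two slides, a PE's static buffer usage can be tallied by walking the edges: input buffers are charged to the consumer's PE and output buffers to the producer's PE, each multiplied by the stage-distance buffer count. A hypothetical accounting sketch (the paper's exact formulas are on image slides, so names and conventions here are assumptions):

```python
def pe_buffer_usage(pe, assign, stage, edges, buf_bytes):
    """Total buffer bytes statically allocated in one PE's local memory."""
    total = 0
    for src, dst in edges:
        nbuf = stage[dst] - stage[src] + 1       # stage-distance buffer count
        size = nbuf * buf_bytes[(src, dst)]
        if assign[src] == pe:                    # output buffers: producer's PE
            total += size
        if assign[dst] == pe:                    # input buffers: consumer's PE
            total += size
    return total

# One edge A -> B across two PEs, stage distance 2, 1 KB per buffer:
# each side holds 3 copies, i.e. 3 KB of local memory.
assign = {"A": 0, "B": 1}
stage = {"A": 0, "B": 2}
edges = [("A", "B")]
buf_bytes = {("A", "B"): 1024}
```

Summing this quantity per PE and bounding it by MinMem/MaxMem is what turns the ROMC formulation into a constrained problem.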

Stage Adjustment Optimization 38

Stage Adjustment Optimization(Cont.) 39

Stage Adjustment Optimization(Cont.) 40
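The adjustment idea from slide 35 can be sketched as a search over legal stages for a DMA node: raising its stage shifts buffer copies from the consumer's memory toward the producer's, so a stage can often be found that fits both PEs' remaining memory. This is an illustrative reconstruction under assumed conventions, not the paper's algorithm:

```python
def choose_dma_stage(s_prod, s_cons, mem_free_prod, mem_free_cons, buf_bytes):
    """Pick a stage for the DMA node between a producer and a consumer.
    The producer-side buffers grow with the DMA stage while the
    consumer-side buffers shrink; return a feasible split that leaves
    the most balanced slack, or None if no stage fits."""
    best = None
    for s_dma in range(s_prod, s_cons + 1):
        out_bufs = (s_dma - s_prod + 1) * buf_bytes   # on the producer's PE
        in_bufs = (s_cons - s_dma + 1) * buf_bytes    # on the consumer's PE
        if out_bufs <= mem_free_prod and in_bufs <= mem_free_cons:
            slack = min(mem_free_prod - out_bufs, mem_free_cons - in_bufs)
            if best is None or slack > best[0]:
                best = (slack, s_dma)
    return None if best is None else best[1]

# The producer PE has 4 KB free but the "victim" consumer PE only 2 KB;
# with stage span 0..3 and 1 KB buffers, stage 2 is the first split
# whose consumer-side share fits in 2 KB.
stage = choose_dma_stage(0, 3, mem_free_prod=4096, mem_free_cons=2048,
                         buf_bytes=1024)
```

When no stage fits, the scheduler must fall back to a different assignment, which is the case the ROMC ILP handles globally.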

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 41

Experiment Infrastructure and Methodology Scheduler ▫implemented for DFBrook to generate code for the software-pipelined schedules Experimental platform ▫the Godson-T architecture simulator Solving ILPs ▫the commercial solver CPLEX 42

Comparison 43

Comparison(Cont.) 44

ROMC Schedule Performance Number of processors = 9 MinMem = 16KB for all benchmarks MaxMem = 512KB for imgsmth, Gauss and aveMotion; 32KB for others 45

ROMC vs Conservative Estimate Method (CEM) *: both schedulers find a feasible solution +: only ROMC finds a solution; the solution produced by CEM fails to meet the memory constraints 46

Scalability (over single processor) 47

ROMC ILP Solving Time (in CPU seconds) In 70% of the cases, the ROMC scheduler obtains an optimal solution in less than 6 minutes 48

CMRO ILP Solving Time 49

CMRO Performance Improvement 50

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 51

Related Works The scheduling of stream graphs ▫Ptolemy: models of computation and scheduling on SDF ▫regular stream flow graphs (RSFGs) can be scheduled statically at compile time Stream compilation ▫coarse-grained task, data, and pipeline parallelism have been exploited for StreamIt on the Raw architecture 52

Related Works (Cont.) Software pipelining is a well-known loop-optimization technique that has recently been used to schedule stream programs ▫an LP formulation for the minimum buffer requirements of rate-optimal software pipelining of RSFGs SGMS for StreamIt applications on multi-core architectures ▫focuses on balancing the work partition but does not consider the cost of communication 53

Outline Introduction Background ▫DFBrook Stream Language ▫Architecture – Godson-T Software Pipelining Scheduling with Resource Constraints Experiments and Evaluation Related Works Conclusion 54

Conclusion A unified ILP formulation that combines the requirement of rate-optimal software pipelining with minimal inter-core communication overhead Consideration of memory constraints Implementation on the DFBrook language and the Godson-T architecture Good performance improvement compared with other schedulers 55

Thank you for listening! 56