Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 2012
Authors: Haitao Wei, Junqing Yu, Huafei Yu, Mingkang Qin, Guang R. Gao
Chih-Sheng Lin
Outline
Introduction
Background
▫DFBrook Stream Language
▫Architecture – Godson-T
Software Pipelining Scheduling with Resource Constraints
Experiments and Evaluation
Related Works
Conclusion
Multi-core Architectures
Multi-core architectures have become the mainstream solution and industry standard, from servers to desktop platforms and handheld devices
▫Examples: IBM’s Cell, Nvidia’s GPUs, ICT’s Godson, MIT’s Raw multi-core processor
▫They increase computational capability
▫They push the performance burden onto the compiler and the programmer, who must effectively exploit coarse-grained parallelism across the cores
Stream Programming Model
The stream programming model is one approach to exploiting this coarse-grained parallelism
Stream languages
▫StreamIt, Brook, CUDA, SPUR and Cg
▫are motivated by applications in media processing domains
▫are based on synchronous dataflow (SDF) or regular stream flow graphs (RSFG)
Regular Stream Flow Graph (RSFG)
Node
▫a computation task (actor)
▫has an independent instruction stream and address space
▫fires repeatedly in a periodic schedule
Arc (Edge)
▫the communication (flow of data) between nodes
▫carried through a communication channel
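The periodic schedule of an SDF-style graph fires each node a fixed number of times per period, so that every arc carries as many items as are consumed. A minimal sketch of how those firing counts can be derived from the balance equations follows. The function name `repetition_vector`, the arc tuple layout, and the example rates are illustrative assumptions, not taken from the slides.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(nodes, arcs):
    """Return the smallest integer firing count per node.

    arcs is a list of (src, dst, produced, consumed) tuples, where
    `produced` items are pushed per firing of src and `consumed` items
    are popped per firing of dst (hypothetical layout for illustration).
    """
    rate = {nodes[0]: Fraction(1)}  # relative firing rate, fixed up later
    pending = list(arcs)
    while pending:
        progressed = False
        for arc in list(pending):
            src, dst, prod, cons = arc
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                # Balance equation: items produced == items consumed.
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates on arc %r" % (arc,))
            else:
                continue  # neither endpoint known yet, retry later
            pending.remove(arc)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    if len(rate) != len(nodes):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {n: int(r * scale) for n, r in rate.items()}


# A pushes 2 items per firing, B pops 3 and pushes 1, C pops 2.
print(repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)]))
# prints {'A': 3, 'B': 2, 'C': 1}
```

In one period A fires 3 times and produces 6 items, which B consumes in 2 firings; B produces 2 items, which C consumes in 1 firing.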
Software Pipelining
Software pipelining
▫an efficient method to exploit the coarse-grained parallelism in stream programs
▫treats the whole program as a loop and one periodic schedule as one iteration of that loop
Stream programs can be easily and naturally mapped to communication-exposed multi-core architectures
▫but the gains from parallel execution can be overshadowed by the cost of communication and synchronization
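To make the "periodic schedule as loop iteration" view concrete, the sketch below lists which iteration of each node runs in each period of a software-pipelined schedule, using the usual convention that a node assigned to stage s runs iteration t - s in period t. The stage numbers and node names are hypothetical.

```python
def pipelined_schedule(stage, periods):
    """Return, for each period t, the (node, iteration) pairs that run.

    stage maps each node to its pipeline stage; a node at stage s
    executes iteration t - s during period t and is idle while t < s.
    """
    table = []
    for t in range(periods):
        table.append([(n, t - s) for n, s in stage.items() if t - s >= 0])
    return table


for t, row in enumerate(pipelined_schedule({"A": 0, "B": 1, "C": 2}, 4)):
    print(t, row)
# prints:
# 0 [('A', 0)]
# 1 [('A', 1), ('B', 0)]
# 2 [('A', 2), ('B', 1), ('C', 0)]
# 3 [('A', 3), ('B', 2), ('C', 1)]
```

Periods 0 and 1 form the prologue; from period 2 onward every node is busy in every period, each working on a different iteration.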
Software Pipelining (Cont.)
The performance metric of software pipelining
▫the initiation rate of successive iterations
Rate-optimal schedule
▫the schedule with the maximum initiation rate (minimum initiation interval)
Resource limitations
▫processor capability, the size of the memory attached to each PE, interconnect bandwidth, and direct memory access (DMA)
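A common lower bound on the initiation interval comes from processor capability alone: the total work per iteration must fit on the available PEs, and no single node can be split. The sketch below computes that bound under the assumptions that each node instance runs on exactly one PE and that communication and dependences are ignored. The work values and the function name are illustrative, and the exact bound used in the paper is not shown on the slides.

```python
def min_initiation_interval(work, num_pes):
    """Resource-only lower bound on the initiation interval.

    work maps each node instance to its execution time per iteration
    (hypothetical numbers). Assumes an instance is never split across PEs.
    """
    total = sum(work.values())
    balanced = -(-total // num_pes)  # integer ceiling of total / num_pes
    return max(balanced, max(work.values()))


work = {"A": 10, "B": 20, "C": 30, "D": 40}
print(min_initiation_interval(work, 2))  # prints 50
print(min_initiation_interval(work, 4))  # prints 40
```

With 4 PEs the bound is set by the longest node (40) rather than by the balanced load (25), which is why adding processors stops helping once a single node dominates.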
Goal
To orchestrate an efficient software pipelining schedule that obtains the optimal computation rate while minimizing the communication cost and satisfying the resource constraints of the system
CMRO and ROMC
CMRO (Communication Minimized Rate-Optimal)
▫minimizes the communication cost at the optimal computation rate
▫formulated as a unified Integer Linear Programming (ILP) problem
ROMC (Rate-Optimal with Memory Constraints)
▫formulated as a unified integer quadratic programming problem
▫transformed into an ILP problem by using stage adjustment optimization
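The CMRO objective can be illustrated on a tiny graph without an ILP solver. The sketch below is not the paper's formulation: it exhaustively enumerates node-to-PE assignments, keeps those whose per-PE load fits a given initiation interval, and returns the one with the least data crossing PEs. It covers only the space (assignment) part and ignores stage and time constraints. All names, work values, and data volumes are hypothetical.

```python
from itertools import product


def best_assignment(work, arcs, num_pes, ii):
    """Brute-force search for a minimum-communication assignment.

    work: node -> execution time per iteration
    arcs: list of (src, dst, volume) tuples
    ii:   initiation interval that every PE's load must fit into
    Returns (cost, assignment) or None if no assignment fits.
    """
    nodes = list(work)
    best = None
    for pes in product(range(num_pes), repeat=len(nodes)):
        assign = dict(zip(nodes, pes))
        load = [0] * num_pes
        for node, pe in assign.items():
            load[pe] += work[node]
        if max(load) > ii:
            continue  # violates the rate requirement
        # Only arcs whose endpoints sit on different PEs cost anything.
        cost = sum(vol for src, dst, vol in arcs if assign[src] != assign[dst])
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


work = {"A": 10, "B": 20, "C": 30, "D": 40}
arcs = [("A", "B", 8), ("B", "C", 4), ("C", "D", 2)]
print(best_assignment(work, arcs, 2, 50)[0])  # prints 10
print(best_assignment(work, arcs, 2, 60)[0])  # prints 2
```

At the rate-optimal interval of 50 the only balanced split is {A, D} versus {B, C}, costing 10 units of traffic. Relaxing the interval to 60 allows {A, B, C} versus {D} at a cost of 2, which shows the tension between rate and communication that CMRO resolves in favor of the optimal rate.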
DFBrook Stream Language
DFBrook: an extension of Brook for SDF
Target Architecture – Godson-T
Communication-exposed multi-core platform
CMRO Schedule – Problem Definition
CMRO Schedule – Problem Definition (Cont.)
Example of Stream Graph and DDG
[Figure: a stream graph and its data dependency graph (DDG)]
CMRO Problem
Continued with the previous example
SGMS (Stream Graph Modulo Schedule)
▫does not take communication into account
Continued with the previous example
CMRO
ILP Formulation – Space
ILP Formulation – Space (Cont.)
ILP Formulation – Space (Cont.)
ILP Formulation – Space (Cont.)
ILP Formulation – Time
ILP Formulation – Time (Cont.)
ILP Formulation for CMRO Problem
Rate-Optimal Schedule with Memory Constraints (ROMC)
ROMC (Cont.)
Considerations
▫All the buffers used by an instance are allocated statically in the memory of the processor to which the instance is assigned
▫In the software pipelining schedule, multiple buffers are introduced to cover the stage distance between two connected instances
Example of Buffer Allocation Schemes
ROMC (Cont.)
Solving ROMC Problem
Solving ROMC Problem (Cont.)
Stage Assignment and Adjustment Optimization Process
Stage Assignment and Adjustment Optimization Process (Cont.)
Key: The stage of a DMA node can be adjusted to reduce the buffer usage of victim processors
Buffer Usage Calculation
The number of input buffers in each PE’s memory
Buffer Usage Calculation (Cont.)
The number of output buffers in each PE’s memory
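The effect of moving a DMA node between stages can be sketched with a simplified buffer model. The counting rule here is an assumption for illustration, not the formula from the slides: each side of a cross-PE arc is taken to need one buffer per stage of separation from the DMA node, plus one. Function names and the free-slot numbers are hypothetical.

```python
def buffer_usage(prod_stage, dma_stage, cons_stage):
    """Return (output buffers on producer PE, input buffers on consumer PE).

    Assumed rule (not from the source): stage distance to the DMA node + 1.
    """
    if not prod_stage <= dma_stage <= cons_stage:
        raise ValueError("stages must satisfy producer <= DMA <= consumer")
    return dma_stage - prod_stage + 1, cons_stage - dma_stage + 1


def feasible_dma_stages(prod_stage, cons_stage, prod_free, cons_free):
    """List the DMA stages whose buffer counts fit both PEs' free slots."""
    stages = []
    for s in range(prod_stage, cons_stage + 1):
        out_bufs, in_bufs = buffer_usage(prod_stage, s, cons_stage)
        if out_bufs <= prod_free and in_bufs <= cons_free:
            stages.append(s)
    return stages


print(buffer_usage(0, 1, 4))  # prints (2, 4)
print(buffer_usage(0, 3, 4))  # prints (4, 2)
print(feasible_dma_stages(0, 4, 3, 3))  # prints [2]
```

Under this model, moving the DMA node to a later stage shifts buffers from the consumer's PE to the producer's PE, so a PE that exceeds its memory limit can be relieved at the expense of its neighbor.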
Stage Adjustment Optimization
Stage Adjustment Optimization (Cont.)
Stage Adjustment Optimization (Cont.)
Experiment Infrastructure and Methodology
Scheduler
▫implemented in DFBrook to generate code for the software pipelining schedules
Experimental Platform
▫Godson-T architecture simulator
Solving ILPs
▫the commercial solver CPLEX
Comparison
Comparison (Cont.)
ROMC Schedule Performance
Number of processors = 9
MinMem = 16KB for all benchmarks
MaxMem = 512KB for imgsmth, Gauss and aveMotion; 32KB for others
ROMC vs Conservative Estimate Method (CEM)
*: both schedulers find a feasible solution
+: only ROMC finds a solution; the solution produced by CEM does not meet the memory constraints
Scalability (over a single processor)
ROMC ILP Solving Time (in CPU seconds)
In 70% of the cases, the ROMC scheduler obtains an optimal solution in less than 6 minutes
CMRO ILP Solving Time
CMRO Performance Improvement
Related Works
The scheduling of stream graphs
▫Ptolemy: model of computation and scheduling on SDF
▫A Regular Stream Flow Graph (RSFG) can be statically scheduled at compile time
Stream compilation
▫Coarse-grained task, data, and pipeline parallelism have been exploited for StreamIt on the Raw architecture
Related Works (Cont.)
Software pipelining is a well-known technique for loop optimization and has recently been used to schedule stream programs
▫LP formulation for the minimum buffer requirements of rate-optimal software pipelining of RSFGs
SGMS for StreamIt applications on multi-core architectures
▫focuses on balancing the work partition but does not consider the cost of communication
Conclusion
A unified ILP formulation that combines the requirements of rate-optimal software pipelining and minimum inter-core communication overhead
Consideration of memory constraints
Implementation on the DFBrook language and the Godson-T architecture
Good performance improvement compared with other schedules
Thank you for listening