UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d’Arquitectura de Computadors
Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran,
PACT 2002, Charlottesville, Virginia – September 2002

Clustered Architectures
 Current/future challenges in processor design:
   Delay in the transmission of signals
   Power consumption
   Architecture complexity
 Clustering: divide the system into semi-independent units
   Each unit is a cluster
   Fast intra-cluster interconnects
   Slow inter-cluster interconnects
 Common trend in commercial VLIW processors: TI’s C6x, Analog’s TigerSHARC, HP’s LX, Equator’s MAP1000

Architecture Overview (block diagram): CLUSTER 1 … CLUSTER n, each with its own local register file, FUs and MEM units, sharing an L1 cache and connected through register buses.

Instruction Scheduling
 For non-clustered architectures: resources, dependences
 For clustered architectures, in addition: cluster assignment
   Minimize inter-cluster communication delays
   Exploit communication locality
 This work focuses on modulo scheduling (MS) for clustered VLIW architectures, a technique to schedule loops

Talk Outline
 Previous work
 Proposed algorithm
   Overview
   Graph partitioning
   Pseudo-scheduling
 Performance evaluation
 Conclusions

MS for Clustered Architectures
 In previous work, two different approaches were proposed:
 Two steps (cluster assignment, then scheduling; II++ on failure)
   Data Dependence Graph partitioning: each instruction is assigned to a cluster
   Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
 One step (cluster assignment + scheduling; II++ on failure)
   There is no initial cluster assignment
   The scheduler is free to choose any cluster

Goal of the Work
 Both approaches have benefits
 Two steps
   Global vision of the Data Dependence Graph
   Workload is better split among the different clusters
   The number of communications is reduced
 One step
   Local vision of the partial schedule
   Cluster assignment is performed with information from the partial schedule
 Goal: obtain an algorithm that takes advantage of the benefits of both approaches

Baseline
 Baseline scheme: GP [Aletà et al., Micro34]
   Cluster assignment performed with a graph partitioning algorithm
   Feedback between the partitioner and the scheduler
   Results outperformed previous approaches
   Still, little information is available for cluster assignment
 New algorithm: a better partition
   Pseudo-schedules are used to guide the partition
   Global vision of the Data Dependence Graph
   More information to perform cluster assignment

Algorithm Overview (flow chart)
 II := MII; compute an initial partition
 Start scheduling: select the next operation (j++) and schedule Op j based on the current partition
 If Op j cannot be scheduled there, move Op j to another cluster
 If it still cannot be scheduled, II++, refine the partition, and restart the scheduling
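
A rough illustration of how the boxes of this chart fit together as a driver loop is sketched below (this is not the authors' implementation); the initial-partition builder, the per-operation placer and the partition refiner are passed in as hypothetical callables and correspond to the components described on the following slides.

```python
# Hypothetical sketch of the driver loop in the flow chart above (not the
# authors' code).  initial_partition, try_place and refine_partition are
# placeholder callables for the components described on the next slides.

def modulo_schedule_driver(ops, mii, clusters,
                           initial_partition, try_place, refine_partition):
    ii = mii                                         # II := MII
    partition = initial_partition(ii)                # compute initial partition
    while True:
        schedule = {}                                # op -> (cluster, cycle)
        failed = False
        for op in ops:                               # select next operation (j++)
            # schedule Op j in the cluster chosen by the current partition
            slot = try_place(op, partition[op], schedule, ii)
            if slot is None:                         # move Op j to another cluster
                for c in clusters:
                    if c != partition[op]:
                        slot = try_place(op, c, schedule, ii)
                        if slot is not None:
                            break
            if slot is None:                         # still not able to schedule
                failed = True
                break
            schedule[op] = slot
        if not failed:
            return schedule, ii                      # all operations placed
        ii += 1                                      # II++
        partition = refine_partition(partition, ii)  # refine partition
```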

Graph Partitioning Background
 Problem statement: split the nodes into a predetermined number of sets while optimizing some objective function
 Multilevel strategy: coarsen the graph by iteratively fusing pairs of nodes into new macro-nodes
 Enhancing heuristics:
   Avoid excess load in any one set
   Reduce the execution time of the loops

Graph Coarsening
 Previous definitions: matching, slack
 Iterate until there are as many (macro-)nodes as clusters:
   Edges are weighted according to:
     the impact on execution time of adding a bus delay to the edge
     the slack of the edge
   Select the maximum-weight matching
   Nodes linked by edges in the matching are fused into a single macro-node
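
A runnable toy version of one coarsening pass is sketched below; the edge weights and the greedy matching (which only approximates a true maximum-weight matching) are illustrative assumptions, not the paper's exact weighting function.

```python
# Toy sketch of one matching-based coarsening pass (not the authors'
# implementation).  Edge weights stand in for the slack/criticality-based
# weights described above; the matching is found greedily, which only
# approximates a true maximum-weight matching.

def coarsen_once(nodes, weighted_edges):
    """Fuse pairs of nodes linked by a greedily chosen heavy matching."""
    matched = set()
    mapping = {n: n for n in nodes}               # node -> macro-node id
    for w, u, v in sorted(weighted_edges, reverse=True):
        if u not in matched and v not in matched:
            matched.update((u, v))
            mapping[u] = mapping[v] = u + v       # fuse into a macro-node
    return mapping, sorted(set(mapping.values()))

if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    # (weight, u, v): a heavier edge is more important to keep in one cluster
    edges = [(15, "A", "B"), (3, "B", "C"), (3, "A", "D"), (1, "C", "D")]
    mapping, macro_nodes = coarsen_once(nodes, edges)
    print(mapping)      # {'A': 'AB', 'B': 'AB', 'C': 'CD', 'D': 'CD'}
    print(macro_nodes)  # ['AB', 'CD']
```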

Coarsening Example (figure): initial graph → find matching → final (coarsened) graph.

Coarsening Example (II) (figure), 1st step: the partition induced in the original graph (initial graph, induced partition, final graph).

Reducing Execution Time
 An estimation of execution time is needed → pseudo-schedules
 Information obtained: II, SC (stage count), lifetimes, spills
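
As a worked illustration of how these quantities feed an execution-time estimate, the snippet below uses the classic software-pipelining cycle count; the exact cost model is not given on the slide, and the spill term is an assumed add-on for illustration.

```python
# Hedged sketch: estimating loop execution time from pseudo-schedule data.
# (n + SC - 1) * II is the standard software-pipelining cycle count for n
# iterations; the spill penalty term is an assumption for illustration.

def estimated_cycles(ii, sc, n_iterations, n_spills=0, spill_penalty=2):
    return (n_iterations + sc - 1) * ii + n_spills * spill_penalty

print(estimated_cycles(ii=2, sc=3, n_iterations=100))  # -> 204
```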

Building Pseudo-schedules
 Dependences
   Respected if possible
   Otherwise, a penalty on register pressure and/or execution time is assessed
 Cluster assignment: the partition is strictly followed
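
A runnable toy interpretation of how such a pseudo-schedule could be built is sketched below; the operation names, latencies and the penalty rule are illustrative assumptions, not the paper's exact procedure.

```python
# Toy sketch of pseudo-schedule construction (not the authors' code): the
# partition is strictly followed, dependences are respected when a modulo
# slot allows it, and otherwise the operation is placed anyway and a
# penalty is recorded.  Latencies, bus delay and FU counts are assumptions.

def pseudo_schedule(ops, deps, partition, ii, latency=3, bus_delay=1, fus=1):
    """ops: names in scheduling order; deps: (producer, consumer) edges;
    partition: op -> cluster.  Returns (op -> cycle, number of penalties)."""
    cycle_of, penalties, busy = {}, 0, {}
    for op in ops:
        # earliest cycle that respects already-placed producers
        earliest = 0
        for src, dst in deps:
            if dst == op and src in cycle_of:
                delay = latency + (bus_delay if partition[src] != partition[op]
                                   else 0)
                earliest = max(earliest, cycle_of[src] + delay)
        # look for a free modulo slot in the assigned cluster
        cycle = earliest
        while busy.get((partition[op], cycle % ii), 0) >= fus:
            cycle += 1
            if cycle - earliest > ii:       # all II modulo slots are full:
                cycle = earliest            # place anyway ...
                penalties += 1              # ... and record a penalty instead
                break
        key = (partition[op], cycle % ii)
        busy[key] = busy.get(key, 0) + 1
        cycle_of[op] = cycle
    return cycle_of, penalties

# toy use: A feeds B and C, D is independent, 2 clusters with 1 FU each
sched, pen = pseudo_schedule(ops=["A", "D", "B", "C"],
                             deps=[("A", "B"), ("A", "C")],
                             partition={"A": 1, "B": 1, "C": 1, "D": 2}, ii=2)
print(sched, pen)   # {'A': 0, 'D': 0, 'B': 3, 'C': 3} with 1 penalty
```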

Pseudo-schedule: Example (figure)
 Configuration: 2 clusters, 1 FU per cluster, 1 bus of latency 1, II = 2, instruction latency = 3
 With the induced partition, A is placed at cycle 0, B at cycle 3 and D at cycle 4, but C cannot be scheduled (cycle 6: NO)

Pseudo-schedule: Example, continued (figure)
 In the revised pseudo-schedule, C is placed at cycle 4 alongside A (cycle 0), B (cycle 3) and D (cycle 4)

Heuristic Description
 While there is improvement, iterate:
   Different partitions are obtained by moving nodes among clusters
   Partitions that overload the resources of any cluster are discarded
   The partition minimizing execution time is chosen
   In case of a tie, the one that minimizes register pressure is selected
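
As a concrete illustration (not the authors' code), the loop below enumerates single-node moves and keeps the best-scoring legal partition; estimate and overloaded are hypothetical hooks that would be backed by the pseudo-schedule-based estimates above.

```python
# Hedged sketch of the move-based refinement loop described above.
# estimate(partition) is assumed to return (exec_time, reg_pressure), e.g.
# derived from a pseudo-schedule, and overloaded(partition) to flag
# partitions that overload some cluster; both are hypothetical hooks.

def refine(partition, clusters, estimate, overloaded):
    best, best_cost = dict(partition), estimate(partition)
    improved = True
    while improved:                            # iterate while there is improvement
        improved = False
        for op in list(best):
            for c in clusters:
                if c == best[op]:
                    continue
                candidate = dict(best)
                candidate[op] = c              # move one node to another cluster
                if overloaded(candidate):      # discard overloaded partitions
                    continue
                cost = estimate(candidate)     # tuple compare: time, then pressure
                if cost < best_cost:
                    best, best_cost, improved = candidate, cost, True
    return best

# toy use: the estimate simply rewards balancing the two clusters
balance = lambda p: (abs(sum(1 for c in p.values() if c == 1) -
                         sum(1 for c in p.values() if c == 2)), 0)
print(refine({"A": 1, "B": 1}, [1, 2], balance, overloaded=lambda p: False))
```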

Algorithm Overview (the flow chart is shown again, now focusing on the scheduling step)

The Scheduling Step
 To schedule the partition we use URACAM [Codina et al., PACT’01]
   Figure of merit
   Dynamic transformations to improve the partial schedule:
     register communications: bus → memory
     spill code on the fly: register pressure → memory
 If an instruction cannot be scheduled in the cluster assigned by the partition:
   try all the other clusters
   select the best one according to a figure of merit
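
A small sketch of that fallback logic is given below; can_place, commit and figure_of_merit are hypothetical hooks standing in for the scheduler's own primitives, and this is not URACAM itself.

```python
# Hypothetical sketch of the fallback described above: try the preassigned
# cluster first; if that fails, score every other feasible cluster with a
# caller-supplied figure of merit and keep the best one.

def place_operation(op, assigned, clusters, can_place, commit, figure_of_merit):
    if can_place(op, assigned):
        commit(op, assigned)                 # cluster chosen by the partition
        return assigned
    feasible = [c for c in clusters if c != assigned and can_place(op, c)]
    if not feasible:
        return None                          # triggers II++ / partition refinement
    best = max(feasible, key=lambda c: figure_of_merit(op, c))
    commit(op, best)
    return best
```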

Algorithm Overview (the flow chart is shown again, now focusing on partition refinement)

Partition Refinement
 II has increased, so a better partition may be found for the new II:
   New slots have been generated in each cluster
   More lifetimes are available
   A larger number of bus communications is allowed
 The coarsening process is repeated
   Only edges between nodes in the same set can appear in the matching
   After coarsening, the induced partition is the last partition that could not be scheduled
 The execution-time-reduction heuristic is then reapplied
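
A minimal sketch of that matching constraint is shown below, reusing the coarsen_once pass sketched after the Graph Coarsening slide; the names are illustrative, not the authors'.

```python
# Hedged sketch of the constrained re-coarsening described above: after
# II++, only edges whose endpoints sat in the same set of the failed
# partition may enter the matching, so the induced partition starts from
# the partition that could not be scheduled.

def recoarsen(nodes, weighted_edges, failed_partition):
    same_set = [(w, u, v) for (w, u, v) in weighted_edges
                if failed_partition[u] == failed_partition[v]]
    return coarsen_once(nodes, same_set)
```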

Benchmarks and Configurations
 Benchmarks: all of SPECfp95, using the ref input set
 Two schedulers evaluated:
   GP (previous work)
   Pseudo-schedule based (PSP)
 Resources (table): INT, FP and MEM units per cluster, for the unified and clustered configurations
 Latencies (table): INT and FP latencies for ARITH, MUL/ABS, DIV/SQR/TRG and MEM operations

GP vs PSP (charts): 32 registers split into 2 clusters, 1 bus (L = 1); 32 registers split into 4 clusters, 1 bus (L = 1)

GP vs PSP (charts): 64 registers split into 4 clusters, 1 bus (L = 2); 32 registers split into 4 clusters, 1 bus (L = 2)

Conclusions
 A new algorithm to perform MS for clustered VLIW architectures
   Cluster assignment based on multilevel graph partitioning
 The partitioning algorithm is improved
   Based on pseudo-schedules
   Reliable information is available to guide the partition
 Outperforms previous work: 38.5% speedup for some configurations

UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Any questions?

GP vs PSP (charts): 64 registers split into 2 clusters, 1 bus (L = 1); 64 registers split into 4 clusters, 1 bus (L = 1)

Different Alternatives
 Two steps (cluster assignment, then scheduling; II++ on failure)
   Global vision when assigning clusters
   The schedule follows the assignment exactly
   Re-scheduling does not take the additional available resources into account
 One step (cluster assignment + scheduling; II++ on failure)
   Local vision when assigning and scheduling
   Assignment is based on current resource usage
   No global view of the graph
 This proposal: global and local views of the graph
   If an operation cannot be scheduled, depending on the reason: re-schedule, or re-compute the cluster assignment

Clustered Architectures
 Current/future challenges in processor design:
   Delay in the transmission of signals
   Power consumption
   Architecture complexity
 Solutions:
   VLIW architectures
   Clustering: divide the system into semi-independent units
     Fast intra-cluster interconnects
     Slow inter-cluster interconnects
 Common trend in commercial VLIW processors: TI’s C6x, Analog’s TigerSHARC, HP’s LX, Equator’s MAP1000

Example (I) (figure), 1st step: coarsening the graph. Initial graph → find matching → new graph → find matching → final graph (the figure annotates the graphs with weights 15, 3 and 1).

Coarsening Example (I) (figure), 1st step: the partition induced in the original graph (initial graph, induced partition, coarsened graph).

Reducing Execution Time
 Heuristic description: while there is improvement, iterate:
   Different partitions are obtained by moving nodes among clusters
   Partitions that overload the resources of any cluster are discarded
   The partition minimizing execution time is chosen
   In case of a tie, the one that minimizes register pressure is selected
 An estimation of execution time is needed → pseudo-schedules

Pseudo-schedules
 Building pseudo-schedules
   Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assumed
   Cluster assignment: the partition is strictly followed
 Valuable information can be estimated: II, length of the pseudo-schedule, register pressure → execution time