
UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning.


1 UNIVERSITAT POLITÈCNICA DE CATALUNYA, Departament d’Arquitectura de Computadors
Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran, antonio}@ac.upc.es, kaeli@ece.neu.edu
PACT 2002, Charlottesville, Virginia – September 2002

2 Clustered Architectures
- Current/future challenges in processor design
  - Delay in the transmission of signals
  - Power consumption
  - Architecture complexity
- Clustering: divide the system into semi-independent units
  - Each unit is a cluster
  - Fast intra-cluster interconnects
  - Slow inter-cluster interconnects
- Common trend in commercial VLIW processors
  - TI’s C6x
  - Analog’s TigerSHARC
  - HP’s LX
  - Equator’s MAP1000

3 Architecture Overview
[Figure: CLUSTER 1 through CLUSTER n, each with a LOCAL REGISTER FILE, FU and MEM units; the clusters are linked by register buses, and the diagram shows a single L1 CACHE.]

4 Instruction Scheduling
- For non-clustered architectures
  - Resources
  - Dependences
- For clustered architectures
  - Cluster assignment
  - Minimize inter-cluster communication delays
  - Exploit communication locality
- This work focuses on modulo scheduling for clustered VLIW architectures
  - Technique to schedule loops

5 Talk Outline
- Previous work
- Proposed algorithm
  - Overview
  - Graph partitioning
  - Pseudo-scheduling
- Performance evaluation
- Conclusions

6 Modulo Scheduling (MS) for Clustered Architectures
- In previous work, two different approaches were proposed:
- Two steps
  - Data Dependence Graph partitioning: each instruction is assigned to a cluster
  - Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
- One step
  - There is no initial cluster assignment
  - The scheduler is free to choose any cluster
[Figure: two flowcharts, each with an II++ loop. Two steps: Cluster Assignment followed by Scheduling. One step: Cluster Assignment + Scheduling.]

7 Goal of the Work
- Both approaches have benefits
  - Two steps
    - Global vision of the Data Dependence Graph
    - Workload is better split among different clusters
    - Number of communications is reduced
  - One step
    - Local vision of partial scheduling
    - Cluster assignment is performed with information of the partial scheduling
- Goal: obtain an algorithm taking advantage of the benefits of both approaches

8 Baseline
- Baseline scheme: GP [Aletà et al., Micro34]
  - Cluster assignment performed with a graph partitioning algorithm
  - Feedback between the partitioning and the scheduler
  - Results outperformed previous approaches
  - Still little information available for cluster assignment
- New algorithm: better partition
  - Pseudo-schedules are used to guide the partition
  - Global vision of the Data Dependence Graph
  - More information to perform cluster assignment

9 Algorithm Overview
[Flowchart, read as steps]
1. II := MII
2. Compute initial partition
3. Start scheduling
4. Select next operation (j++)
5. Schedule Op j based on the current partition
6. Able to schedule? YES: go to 4. NO: move Op j to another cluster
7. Able to schedule? YES: go to 4. NO: II++, refine partition, go to 3
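The control flow of the flowchart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `modulo_schedule`, `try_schedule`, `refine_partition`, `toy_try_schedule` and the `max_ii` cutoff are hypothetical names introduced here, and the toy resource model (one functional unit per cluster, so `II` slots per cluster) is chosen only to make the loop runnable.

```python
from typing import Callable, Dict, List, Optional, Tuple

Partition = Dict[str, int]               # operation -> cluster index
Schedule = Dict[str, Tuple[int, int]]    # operation -> (cluster, modulo slot)


def modulo_schedule(
    ops: List[str],
    mii: int,
    partition: Partition,
    refine_partition: Callable[[Partition, int], Partition],
    try_schedule: Callable[[str, int, int, Schedule], bool],
    n_clusters: int,
    max_ii: int = 64,                    # hypothetical safety cutoff
) -> Optional[Tuple[int, Schedule]]:
    """Driver loop: schedule with the partition, fall back, else II++ and refine."""
    ii = mii
    partition = dict(partition)          # do not mutate the caller's partition
    while ii <= max_ii:
        schedule: Schedule = {}
        scheduled_all = True
        for op in ops:
            # First try the cluster chosen by the partition.
            placed = try_schedule(op, partition[op], ii, schedule)
            if not placed:
                # Move the operation to another cluster if one accepts it.
                for cluster in range(n_clusters):
                    if cluster != partition[op] and try_schedule(op, cluster, ii, schedule):
                        partition[op] = cluster
                        placed = True
                        break
            if not placed:
                scheduled_all = False
                break
        if scheduled_all:
            return ii, schedule
        ii += 1                          # II++
        partition = refine_partition(partition, ii)
    return None


def toy_try_schedule(op: str, cluster: int, ii: int, schedule: Schedule) -> bool:
    """Toy model (assumption): one FU per cluster, hence `ii` slots per cluster."""
    used = {slot for (c, slot) in schedule.values() if c == cluster}
    for slot in range(ii):
        if slot not in used:
            schedule[op] = (cluster, slot)
            return True
    return False
```

With three operations all assigned to cluster 0, two clusters and MII = 1, the toy model fails at II = 1 and succeeds at II = 2 after one operation has been moved to cluster 1.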

10 Algorithm Overview (flowchart repeated from slide 9)

11 Graph Partitioning Background
- Problem statement
  - Split the nodes into a predetermined number of sets while optimizing some objective functions
- Multilevel strategy
  - Coarsen the graph
    - Iteratively fuse pairs of nodes into new macro-nodes
  - Enhancing heuristics
    - Avoid excess load in any one set
    - Reduce execution time of the loops

12 Graph Coarsening
- Previous definitions
  - Matching
  - Slack
- Iterate until the graph has the same number of nodes as clusters:
  - The edges are weighted according to
    - Impact on execution time of adding a bus delay to the edge
    - Slack of the edge
  - Then, select the maximum weight matching
    - Nodes linked by edges in the matching are fused into a single macro-node
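The coarsening loop can be illustrated with a short Python sketch. Two simplifications are assumptions made here, not taken from the slides: the matching is picked greedily by descending edge weight (a common stand-in for a true maximum weight matching), and the weight between two macro-nodes is the sum of the original edge weights crossing them. The edge weights are taken as given, so the slack and bus-delay impact computation is not modelled; `coarsen` is a hypothetical name.

```python
from typing import Dict, FrozenSet, List, Tuple


def coarsen(
    nodes: List[str],
    edges: Dict[Tuple[str, str], int],
    n_clusters: int,
) -> List[FrozenSet[str]]:
    """Fuse matched node pairs until only `n_clusters` macro-nodes remain."""
    macro: List[FrozenSet[str]] = [frozenset([n]) for n in nodes]
    while len(macro) > n_clusters:
        owner = {n: m for m in macro for n in m}
        # Weight between macro-nodes: sum of crossing original edges (assumption).
        weights: Dict[FrozenSet[FrozenSet[str]], int] = {}
        for (u, v), w in edges.items():
            a, b = owner[u], owner[v]
            if a != b:
                key = frozenset([a, b])
                weights[key] = weights.get(key, 0) + w
        if not weights:
            break  # disconnected graph: nothing left to fuse along an edge
        matched = set()
        fused: List[FrozenSet[str]] = []
        budget = len(macro) - n_clusters  # never fuse below n_clusters nodes
        # Greedy matching: heaviest edges first, each macro-node used at most once.
        for key, _ in sorted(weights.items(), key=lambda kv: -kv[1]):
            if budget == 0:
                break
            a, b = tuple(key)
            if a in matched or b in matched:
                continue
            matched.update((a, b))
            fused.append(a | b)
            budget -= 1
        macro = fused + [m for m in macro if m not in matched]
    return macro
```

For a chain A-B-C-D with weights 4, 1, 4 and two clusters, the heavy edges A-B and C-D are matched first, so the induced partition is {A, B} and {C, D} and only the light edge is cut.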

13 Coarsening Example
[Figure: an initial graph with weighted edges, the matching found on it, and the final coarsened graph.]

14 Coarsening Example (II)
1st STEP: Partition induced in the original graph
[Figure: initial graph, final graph, and induced partition.]

15 Reducing Execution Time
- Estimation of execution time needed: pseudo-schedules
- Information obtained
  - II
  - SC (stage count)
  - Lifetimes
  - Spills

16 Building pseudo-schedules
- Dependences
  - Respected if possible
  - Else a penalty on register pressure and/or execution time is assessed
- Cluster assignment
  - Partition strictly followed

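17 Pseudo-schedule: example
- Configuration: 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2, instruction latency = 3
[Figure: induced partition of nodes A, B, C, D over Cluster 1 and Cluster 2; a cycle-by-cycle pseudo-schedule for cycles 0 to 7 with A at cycle 0, B at cycle 3 and D at cycle 4, where placing C at cycle 6 is rejected ("C? NO"); and a Cluster 1 / Cluster 2 table containing A, D and B.]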

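18 Pseudo-schedule: example
[Figure: same induced partition; the pseudo-schedule now runs to cycle 8 with A at cycle 0, B at cycle 3, D at cycle 4 and C at cycle 8; the Cluster 1 / Cluster 2 table contains A,C, D and B.]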

19 Heuristic description
- While there is improvement, iterate:
  - Different partitions are obtained by moving nodes among clusters
  - Partitions that overload resources in any of the clusters are discarded
  - The partition minimizing execution time is chosen
  - In case of a tie, the one that minimizes register pressure is selected
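The iteration above can be sketched as a simple local search in Python. Several pieces are assumptions for illustration only: `estimate` stands in for the pseudo-schedule and is expected to return an `(execution_time, register_pressure)` pair, "overload" is reduced to a per-cluster operation count against a hypothetical `capacity`, and only single-node moves are tried.

```python
from typing import Callable, Dict, Tuple

Partition = Dict[str, int]  # operation -> cluster index


def overloaded(partition: Partition, capacity: int) -> bool:
    """Simplified overload test (assumption): too many operations in one cluster."""
    counts: Dict[int, int] = {}
    for cluster in partition.values():
        counts[cluster] = counts.get(cluster, 0) + 1
    return any(n > capacity for n in counts.values())


def reduce_execution_time(
    partition: Partition,
    n_clusters: int,
    capacity: int,
    estimate: Callable[[Partition], Tuple[int, int]],
) -> Partition:
    """Move one node at a time while the estimated cost keeps improving."""
    best = dict(partition)
    best_cost = estimate(best)
    improved = True
    while improved:
        improved = False
        candidate, candidate_cost = None, best_cost
        for op, current in best.items():
            for cluster in range(n_clusters):
                if cluster == current:
                    continue
                trial = dict(best)
                trial[op] = cluster
                if overloaded(trial, capacity):
                    continue  # discard partitions that overload a cluster
                cost = estimate(trial)
                # Tuple comparison: execution time first, register pressure on ties.
                if cost < candidate_cost:
                    candidate, candidate_cost = trial, cost
        if candidate is not None:
            best, best_cost = candidate, candidate_cost
            improved = True
    return best
```

Returning the cost as a tuple makes the tie-break on register pressure fall out of ordinary lexicographic comparison, which keeps the selection rule in one place.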

20 Algorithm Overview (flowchart repeated from slide 9)

21 The Scheduling Step
- To schedule the partition we use URACAM [Codina et al., PACT’01]
  - Figure of merit
  - Uses dynamic transformations to improve the partial schedule
    - Register communications: bus ↔ memory
    - Spill code on-the-fly: register pressure ↔ memory
- If an instruction cannot be scheduled in the cluster assigned by the partition
  - Try all other clusters
  - Select the best one according to a figure of merit
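The fallback rule in the last bullet can be sketched as follows. This is not URACAM itself: `choose_cluster`, `can_schedule` and `merit` are hypothetical names, the slides do not define the figure of merit, and the sketch assumes that a higher merit value is better.

```python
from typing import Callable, Optional


def choose_cluster(
    op: str,
    assigned: int,
    n_clusters: int,
    can_schedule: Callable[[str, int], bool],
    merit: Callable[[str, int], float],
) -> Optional[int]:
    """Prefer the partition's cluster; otherwise pick the best feasible one."""
    if can_schedule(op, assigned):
        return assigned
    # Try all other clusters and keep those where the operation fits.
    feasible = [
        c for c in range(n_clusters)
        if c != assigned and can_schedule(op, c)
    ]
    if not feasible:
        return None  # the caller would increase II and refine the partition
    # Assumption: the figure of merit is a score where higher is better.
    return max(feasible, key=lambda c: merit(op, c))
```

Returning `None` rather than raising keeps the "unable to schedule" outcome explicit, which is the condition that triggers II++ in the flowchart.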

22 Algorithm Overview (flowchart repeated from slide 9)

23 Partition Refinement
- II has increased
  - A better partition can be found for the new II
    - New slots have been generated in each cluster
    - More lifetimes are available
    - A larger number of bus communications is allowed
- Coarsening process is repeated
  - Only edges between nodes in the same set can appear in the matching
  - After coarsening, the induced partition will be the last partition that could not be scheduled
- The "Reducing Execution Time" heuristic is reapplied

24 Benchmarks and Configurations
- Benchmarks: all the SPECfp95 using the ref input set
- Two schedulers evaluated:
  - GP (previous work)
  - Pseudo-schedule (PSP)

Resources    INT/cluster  FP/cluster  MEM/cluster
Unified      4            4           4
2-cluster    2            2           2
4-cluster    1            1           1

Latencies    INT  FP
MEM          2    2
ARITH        1    3
MUL/ABS      2    6
DIV/SQR/TRG  6    18

25 GP vs PSP
[Figure: two result panels. 32 registers split into 2 clusters, 1 bus (L=1); 32 registers split into 4 clusters, 1 bus (L=1).]

26 GP vs PSP
[Figure: two result panels. 64 registers split into 4 clusters, 1 bus (L=2); 32 registers split into 4 clusters, 1 bus (L=2).]

27 Conclusions
- A new algorithm to perform MS for clustered VLIW architectures
  - Cluster assignment based on multilevel graph partitioning
- The partition algorithm is improved
  - Based on pseudo-schedules
  - Reliable information available to guide the partition
- Outperforms previous work
  - 38.5% speedup for some configurations

28 Any questions?

29 GP vs PSP
[Figure: two result panels. 64 registers split into 2 clusters, 1 bus (L=1); 64 registers split into 4 clusters, 1 bus (L=1).]

30 Different Alternatives
- Two steps: Cluster Assignment, then Scheduling, with an II++ loop
  - Global vision when assigning clusters
  - Schedule follows the assignment exactly
  - Re-scheduling does not take into account the additional resources available
- One step: Cluster Assignment + Scheduling, with an II++ loop
  - Local vision when assigning and scheduling
  - Assignment is based on current resource usage
  - No global view of the graph
- Combined: Cluster Assignment and Scheduling, with an II++ loop and a decision point ("?")
  - Global and local views of the graph
  - If unable to schedule, depending on the reason
    - Re-schedule
    - Re-compute cluster assignment

31 Clustered Architectures
- Current/future challenges in processor design
  - Delay in the transmission of signals
  - Power consumption
  - Architecture complexity
- Solutions:
  - VLIW architectures
  - Clustering: divide the system into semi-independent units
    - Fast intra-cluster interconnects
    - Slow inter-cluster interconnects
- Common trend in commercial VLIW processors
  - TI’s C6x
  - Analog’s TigerSHARC
  - HP’s LX
  - Equator’s MAP1000

32 Example (I)
1st STEP: Coarsening the graph
[Figure: initial graph with weighted edges; find matching; new graph; find matching; final graph.]

33 Coarsening Example (I)
1st STEP: Partition induced in the original graph
[Figure: initial graph, coarsened graph, and induced partition.]

34 Reducing Execution Time
- Heuristic description
  - Different partitions are obtained by moving nodes among clusters
  - Partitions overloading resources in any of the clusters are discarded
  - The partition minimizing execution time is chosen
  - In case of a tie, the one that minimizes register pressure is selected
- Estimation of execution time needed: pseudo-schedules

35 Pseudo-schedules
- Building pseudo-schedules
  - Dependences
    - Respected if possible
    - Else a penalty on register pressure and/or execution time is assumed
  - Cluster assignment
    - Partition strictly followed
- Valuable information can be estimated
  - II
  - Length of the pseudo-schedule
  - Register pressure
[Figure label: Execution time]

