Variable-Based Multi-Module Data Caches for Clustered VLIW Processors
Enric Gibert (1,2), Jaume Abella (1,2), Jesús Sánchez (1), Xavier Vera (1), Antonio González (1,2)
(1) Intel Barcelona Research Center, Intel Labs, Barcelona
(2) Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
Variable-Based Multi-Module Data Caches for Clustered VLIW Processors (PACT’05)

Issue #1: Energy Consumption
First-class design goal
Heterogeneity
– ↓ supply voltage and/or ↑ threshold voltage
Cache memory (ARM10)
– D-cache: 24% of dynamic energy
– I-cache: 22% of dynamic energy
Heterogeneity can be exploited in the D-cache for VLIW processors
[Figure: processor front-end and back-end blocks, contrasted as higher performance / higher energy versus lower performance / lower energy]
Issue #2: Wire Delays
From capacity-bound to communication-bound
One possible solution: clustering
Unified-cache clustered VLIW processor
– Used as a baseline throughout this work
[Figure: clusters 1 to n, each with a register file and FUs, linked by global communication buses and connected to a single cache through memory buses]
Contributions
GOAL: exploit heterogeneity in the L1 D-cache for clustered VLIW processors
Power-efficient distributed L1 data cache
– Divide the data cache into two modules and assign each to a cluster (modules may be heterogeneous)
– Map variables statically between cache modules
– Develop instruction scheduling techniques
Results summary
– A heterogeneous distributed data cache is a good design point
– Distributed data cache vs. unified data cache: distributed caches outperform unified schemes in EDD and ED
– No single distributed cache configuration is the best: a reconfigurable distributed cache allows additional improvements
Talk Outline
– Variable-Based Multi-Module Data Cache
– Distributed Cache Configurations
– Instruction Scheduling
– Results
– Conclusions
Variable-Based Multi-Module Cache
Memory instructions have a preferred cluster (cluster affinity)
“Wrong” cluster assignment affects performance, not correctness
Access from the non-preferred cluster:
– Stall clusters
– Empty communication buses
– Send request
– Access memory
– Send reply back
– Resume execution
Logical address space
– Split into a FIRST SPACE and a SECOND SPACE, each with its own stack, heap data and global data
– SP1 and SP2: distributed stack frames
[Figure: cluster 1 (FU, RF) with the FIRST MODULE holding var X and cluster 2 (FU, RF) with the SECOND MODULE holding var Y, joined by register buses and backed by the L2 D-cache; example instructions "load X" and "load *p"]
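The access sequence on this slide can be sketched in a few lines. This is only an illustration of the local versus non-preferred-cluster case: the function name `access_steps` and the encoding of spaces as the strings "FIRST" and "SECOND" are assumptions made here, and no cycle counts are modelled because the slide gives none.

```python
# Steps listed on the slide for an access issued from the "wrong" cluster.
REMOTE_STEPS = [
    "stall clusters",
    "empty communication buses",
    "send request",
    "access memory",
    "send reply back",
    "resume execution",
]

def access_steps(inst_cluster, var_space):
    """Return the steps needed for one memory access.

    inst_cluster: 1 or 2, the cluster where the instruction was scheduled.
    var_space: "FIRST" or "SECOND", the space the variable is mapped to
               (hypothetical encoding; FIRST is served by cluster 1's module).
    """
    home_cluster = 1 if var_space == "FIRST" else 2
    if inst_cluster == home_cluster:
        return ["access memory"]      # local module, no bus traffic
    return list(REMOTE_STEPS)         # still correct, only slower

local = access_steps(1, "FIRST")
remote = access_steps(2, "FIRST")
```

Either placement yields the same loaded value; only the number of steps differs, which is the "performance, not correctness" point.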
Distributed Cache Configurations
Module types (8KB): FAST and SLOW, each with 1 R/W port; SLOW has higher latency (↑) and lower energy (↓)
Configurations (module at cluster 1 + module at cluster 2):
– FAST+NONE
– FAST+FAST
– SLOW+NONE
– SLOW+SLOW
– FAST+SLOW
[Figure: FIRST MODULE attached to cluster 1 (FU, RF) and SECOND MODULE attached to cluster 2 (FU, RF), joined by register buses and backed by the L2 D-cache]
Instructions-to-Variables Graph (IVG)
– Built with profiling information
– Variables = global, local, heap
[Figure: memory instructions LD1 to LD5, ST1 and ST2 linked to variables V1 to V4; the variables are split between the FIRST and SECOND spaces, and each instruction is drawn next to the cache of cluster 1 or cluster 2]
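A minimal sketch of how such a graph could be built, assuming the profile is available as a list of (instruction, variable) access records. The record format, the function name `build_ivg` and the sample accesses are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

def build_ivg(profile):
    """Build a bipartite graph: instruction -> {variable: access count}."""
    ivg = defaultdict(lambda: defaultdict(int))
    for inst, var in profile:
        ivg[inst][var] += 1           # edge weight = profiled access count
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {inst: dict(counts) for inst, counts in ivg.items()}

# Hypothetical profile: LD1 always touches V1, LD2 touches V1 and V2.
profile = [("LD1", "V1"), ("LD1", "V1"), ("LD2", "V1"),
           ("LD2", "V2"), ("ST1", "V3")]
ivg = build_ivg(profile)
```

Keeping access counts on the edges lets a later step weigh how strongly each instruction is tied to each variable.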
Greedy Mapping / Scheduling Algorithm
– Initial mapping all to space
– Assign affinities to instructions
  – Express a preferred cluster for memory instructions: [0,1]
  – Propagate affinities from memory insts. to other insts.
– Schedule code + refine mapping
[Flowchart boxes: Compute IVG; Compute mapping; Compute affinities using IVG + propagate affinities (shown twice); Schedule code]
Computing and Propagating Affinity
[Figure: data dependence graph with nodes add1 to add7, mul1, LD1 to LD4 and ST1; edge latencies L=1 and L=3; per-node slack values of 0, 2 and 5. LD1 to LD4 and ST1 are linked to variables V1 to V4, which are mapped to the FIRST space (affinity = 0) or the SECOND space (affinity = 1); one instruction is annotated with a propagated affinity of 0.4. The graph is drawn over cluster 1 (FU, RF, FIRST MODULE) and cluster 2 (FU, RF, SECOND MODULE), joined by register buses]
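The following sketch illustrates one way the two steps could work. The first function follows the slide's convention that affinity 0 means the FIRST space and 1 the SECOND. The propagation rule is an assumption made for illustration only (a mean of neighbouring affinities weighted by 1/(1+slack)); the slides show slack values next to the nodes but do not state the actual formula. All names, the variable mapping and the sample counts are hypothetical.

```python
def memory_affinity(var_counts, mapping):
    """Fraction of profiled accesses that go to variables in the SECOND space.

    Returns 0.0 when every access targets FIRST and 1.0 when every access
    targets SECOND.
    """
    total = sum(var_counts.values())
    second = sum(n for v, n in var_counts.items() if mapping[v] == "SECOND")
    return second / total

def propagate(neighbors):
    """ASSUMED rule: mean of neighbour affinities, weighted by 1/(1+slack).

    neighbors: list of (affinity, slack) pairs; low slack means a critical
    dependence, so that neighbour pulls harder.
    """
    weights = [1.0 / (1 + slack) for _, slack in neighbors]
    weighted = sum(a * w for (a, _), w in zip(neighbors, weights))
    return weighted / sum(weights)

# Hypothetical mapping of variables to spaces.
mapping = {"V1": "FIRST", "V2": "FIRST", "V3": "SECOND", "V4": "SECOND"}
aff_ld1 = memory_affinity({"V1": 10}, mapping)          # all FIRST
aff_ld3 = memory_affinity({"V3": 3, "V2": 1}, mapping)  # 3 of 4 in SECOND
aff_add = propagate([(aff_ld1, 0), (aff_ld3, 2)])       # non-memory inst.
```

Under this assumed rule the critical (slack 0) neighbour dominates, so the non-memory instruction ends up closer to LD1's cluster than to LD3's.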
Cluster Assignment
Cluster affinity + affinity range used to:
– Define a preferred cluster
– Guide the instruction-to-cluster assignment process
Strongly preferred cluster
– Schedule instruction in that cluster
Weakly preferred cluster
– Schedule instruction where global comms. are minimized
[Figure: example with affinity range (0.3, 0.7). Affinities ≤ 0.3 fall on the side of cluster 1 and affinities ≥ 0.7 on the side of cluster 2. Instructions IA, IB and IC and variables V1 and V2 are shown with example affinities 0, 0.9 and 0.4; the placement of IC is marked with "?"]
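The decision rule on this slide can be sketched as follows, assuming affinity 0 points at cluster 1 and affinity 1 at cluster 2. The function name `assign_cluster` and the `comm_cost` estimate are hypothetical: the slides do not say how global communications are counted, so the cost is passed in as a precomputed number per cluster.

```python
def assign_cluster(affinity, comm_cost, low=0.3, high=0.7):
    """Pick a cluster for one instruction.

    affinity:  value in [0, 1]; 0 prefers cluster 1, 1 prefers cluster 2.
    comm_cost: dict {cluster: estimated global communications if the
               instruction is placed there} (hypothetical estimate).
    low, high: bounds of the affinity range, (0.3, 0.7) in the slide example.
    """
    if affinity <= low:
        return 1                      # strongly prefers cluster 1
    if affinity >= high:
        return 2                      # strongly prefers cluster 2
    # Weak preference: minimise global communications; ties go to the
    # lowest-numbered cluster because the keys are scanned in sorted order.
    return min(sorted(comm_cost), key=comm_cost.get)

strong_first = assign_cluster(0.0, {1: 5, 2: 0})
strong_second = assign_cluster(0.9, {1: 0, 2: 5})
weak = assign_cluster(0.4, {1: 3, 2: 1})
```

With the range set to (0,1), only instructions with affinity exactly 0 or 1 are pinned to a cluster, and every other instruction is placed by communication cost.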
Evaluation Framework
IMPACT compiler infrastructure + 16 Mediabench benchmarks
Cache parameters
– CACTI + SIA projections + ARM10 datasheets
– Data cache consumes 1/3 of the processor energy
– Leakage accounts for 50% of the total energy
Results outline
– Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø
  – Affinity range
  – EDD and ED comparison (the lower, the better)
  – F+Ø used as baseline throughout the presentation
– Comparison with a unified cache scheme
  – FAST and SLOW unified schemes
  – State-of-the-art scheduling techniques for these schemes
– Reconfigurable distributed cache
Cache modules (8KB):
        Ports    Latency
FAST    1 R/W    L = 2
SLOW    1 R/W    L = 4
SLOW relative to FAST: latency x2, energy by 1/3
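The slides do not spell out the two metrics. The sketch below assumes the usual reading, ED = energy × delay and EDD = energy × delay², and shows how a scheme could be normalized against the F+Ø baseline so that values below 1 are improvements. The energy and delay figures are made-up placeholders, not results from the talk.

```python
def ed(energy, delay):
    """Energy-delay product (assumed meaning of ED)."""
    return energy * delay

def edd(energy, delay):
    """Energy-delay-squared product (assumed meaning of EDD)."""
    return energy * delay ** 2

def normalized(metric, scheme, baseline):
    """Metric of a scheme relative to the baseline; lower is better."""
    return metric(*scheme) / metric(*baseline)

# Placeholder (energy, delay) pairs in arbitrary units.
baseline = (1.0, 1.0)     # stands in for F+Ø
candidate = (0.8, 1.1)    # a scheme that saves energy but runs slower

ed_ratio = normalized(ed, candidate, baseline)
edd_ratio = normalized(edd, candidate, baseline)
```

Because EDD squares the delay, it penalizes the slower candidate more than ED does, which is why the two metrics can rank configurations differently.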
Affinity Range
Affinity plays a key role in cluster assignment
– 36%-44% better in EDD than no-affinity
– 32% better in ED than no-affinity
(0,1) affinity range is the best
– ~92% of memory instructions access a single variable
– Binary affinity for memory instructions
[Chart: EDD of FAST+FAST, FAST+SLOW and SLOW+SLOW for different affinity ranges, including NO AFFINITY]
EDD Results
                               Memory ports: sensitive    Memory ports: insensitive
Memory latency: sensitive      FAST+FAST                  FAST+NONE
Memory latency: insensitive    SLOW+SLOW                  SLOW+NONE
ED Results
Comparison With Unified Cache
Table columns: BEST DISTRIBUTED, UNIFIED FAST, UNIFIED SLOW
– EDD, best distributed: 0.89 (FAST+SLOW)
– ED, best distributed: 0.89 (SLOW+SLOW)
Distributed schemes are better than unified schemes
– 29-31% better in EDD and 19-29% better in ED
Instruction scheduling for the unified schemes: Aletà et al. (PACT’02)
[Figure: clusters 1 and 2 (FUs, RF) sharing a unified FAST cache, and clusters 1 and 2 sharing a unified SLOW cache]
Reconfigurable Distributed Cache
The OS can set each module in one state:
– FAST mode / SLOW mode / Turned-off
The OS reconfigures the cache on a context switch
– Depending on the applications scheduled in and scheduled out
Two different VDD and VTH values for the cache
– Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
Simple heuristic to show potential
– For each application, choose the estimated best cache configuration
       BEST DISTRIBUTED    RECONFIGURABLE SCHEME
EDD    0.89 (FAST+SLOW)    0.86
ED     0.89 (SLOW+SLOW)    0.86
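The heuristic on this slide amounts to a per-application lookup performed at context-switch time. The sketch below assumes each application comes with a table of estimated metric values (EDD or ED, lower is better) per configuration; the function names, the application names and all numbers are hypothetical, and how the estimates are produced is outside this sketch.

```python
def best_configuration(estimates):
    """Return the configuration name with the lowest estimated metric."""
    return min(estimates, key=estimates.get)

def configure_on_context_switch(app, per_app_estimates):
    """Decide the state of each cache module for the application scheduled in.

    Configuration names follow the slides ("FAST+SLOW", "FAST+NONE", ...);
    NONE is translated to the turned-off state.
    """
    first, second = best_configuration(per_app_estimates[app]).split("+")
    to_state = {"FAST": "FAST", "SLOW": "SLOW", "NONE": "OFF"}
    return {"module1": to_state[first], "module2": to_state[second]}

# Placeholder estimates, normalized to some baseline.
per_app_estimates = {
    "app_a": {"FAST+NONE": 1.00, "FAST+SLOW": 0.90, "SLOW+SLOW": 0.95},
    "app_b": {"FAST+NONE": 0.85, "FAST+SLOW": 0.92, "SLOW+SLOW": 1.10},
}
state_a = configure_on_context_switch("app_a", per_app_estimates)
state_b = configure_on_context_switch("app_b", per_app_estimates)
```

Since the choice is made once per application, the 1-2 cycle reconfiguration overhead cited on the slide would be paid only at context switches.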
Conclusions
Distributed Variable-Based Multi-Module Cache
– Affinity is crucial for achieving good performance
  – 36-44% better in EDD and 32% better in ED than no-affinity
– Heterogeneity (FAST+SLOW) is a good design point
  – 4-11% better in EDD and from 6% worse to 10% better in ED
– No single cache configuration is the best
  – Reconfigurable cache modules exploit an additional 3-4%
Distributed schemes vs. unified schemes
– All distributed schemes outperform unified ones
  – 29-31% better in EDD, 19-29% better in ED
Q&A