Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache (CGO’03, San Francisco, March 2003)

1 Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache
CGO’03, San Francisco, March 2003
Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

2 Motivation
• Capacity-bound vs. communication-bound
• Clustered microarchitectures
  – Simpler + faster
  – Power consumption
  – Communications not homogeneous
• Clustering → embedded/DSP domain

3 Clustered Microarchitectures
[Figure: three organizations of a four-cluster processor. In each one, CLUSTER 1 to CLUSTER 4 have their own register file and functional units (FUs) and are connected by register-to-register communication buses. The first organization uses a single L1 cache, reached through memory buses and backed by an L2 cache. The other two distribute the L1 cache into L1 cache modules backed by an L2 cache; the last one also shows memory buses.]

4 Contributions
• Distribution of data cache
  – Architecture design + data mapping: word-interleaved scheme [ICS’02]
  – Appropriate scheduling techniques [MICRO’02]
  – Memory coherence
• Scheduling techniques for memory coherence
  – Local software-based techniques
  – Applied to word-interleaved cache
      Complex configuration (with Attraction Buffers; refer to paper)
      Simple configuration (without Attraction Buffers)
  – Applicable to any other cache configuration

5 Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
• Evaluation
• Conclusions

6 Word-Interleaved Distribution
[Figure: CLUSTER 1 to CLUSTER 4, each with a register file, functional units and its own cache module, connected by register-to-register communication buses and backed by an L2 cache.]
• A cache block (TAG, W0 ... W7) is split into four subblocks, one per cache module, each carrying the tag:
  – Subblock 1: TAG, W0, W4
  – Subblock 2: TAG, W1, W5
  – Subblock 3: TAG, W2, W6
  – Subblock 4: TAG, W3, W7
• Access types: local hit, remote hit, local miss, remote miss
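The word-interleaved mapping on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `home_cluster` and `classify` are hypothetical, clusters are numbered from 0 (so cluster 0 here is CLUSTER 1 on the slide), and the interleaving factor is taken to be 4 bytes, one of the two values used in the evaluation.

```python
NUM_CLUSTERS = 4        # four clusters, as in the talk
INTERLEAVE_BYTES = 4    # assumed interleaving factor (the talk uses 2 or 4 bytes)


def home_cluster(addr, interleave=INTERLEAVE_BYTES, clusters=NUM_CLUSTERS):
    """Return the 0-based cluster whose cache module holds byte address addr."""
    return (addr // interleave) % clusters


def classify(addr, issuing_cluster, is_cached):
    """Name the access type for an access issued from issuing_cluster."""
    where = "local" if home_cluster(addr) == issuing_cluster else "remote"
    outcome = "hit" if is_cached else "miss"
    return f"{where} {outcome}"


# Words W0 and W4 of a block share a module, as do W1 and W5, and so on.
words_per_module = {
    c: [w for w in range(8) if home_cluster(w * INTERLEAVE_BYTES) == c]
    for c in range(NUM_CLUSTERS)
}
```

With these assumptions, `words_per_module` reproduces the subblock layout of the slide: words 0 and 4 in the first module, 1 and 5 in the second, and so on.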

7 Scheduling Techniques
• Data layout: the cache module of cluster 1 holds a[0], a[4]; cluster 2 holds a[1], a[5]; cluster 3 holds a[2], a[6]; cluster 4 holds a[3], a[7]
• Original loop:
    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }
• Loop unrolled by 4, one load per cluster:
    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]      (stride 16 bytes)
      ld r32, a[i+1]    (stride 16 bytes)
      ld r33, a[i+2]    (stride 16 bytes)
      ld r34, a[i+3]    (stride 16 bytes)
      ...
    }
• Techniques: modulo scheduling, loop unrolling, assignment of latencies, padding + profiling
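A small sketch of why unrolling helps on a word-interleaved cache: once the loop is unrolled by the number of clusters, each load touches a single cache module in every iteration. The helper `clusters_touched` and its parameters are hypothetical, and the sketch assumes array elements are exactly one interleaving unit wide.

```python
def clusters_touched(offset, step, n_words, base_word=0, clusters=4):
    """Set of 0-based clusters accessed by `ld a[i + offset]` for i = 0, step, 2*step, ...

    base_word is the word index at which the array starts; padding changes it.
    """
    return {(base_word + i + offset) % clusters for i in range(0, n_words, step)}


# Original loop (step 1): the single load wanders over all four modules.
original = clusters_touched(offset=0, step=1, n_words=16)

# Unrolled by 4 (step 4): the load of a[i+2] always lands in the same module.
unrolled = clusters_touched(offset=2, step=4, n_words=16)

# A misaligned array shifts the module each load prefers; padding can undo the shift.
shifted = clusters_touched(offset=0, step=4, n_words=16, base_word=1)
```

Under these assumptions `original` covers every cluster while `unrolled` and `shifted` each contain a single cluster, which is what lets the scheduler place each load next to its data.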

8 Cluster Assignment
• Non-memory instructions
  – Minimize register communications
  – Maximize workload balance
• Memory instructions → 2 heuristics:
  – PrefClus heuristic
      Preferred cluster = most accessed cluster
      Profiling + padding
  – MinComs heuristic
      Minimize register communications
      Maximize workload balance
      Post-pass phase to increase local accesses

9 Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
• Evaluation
• Conclusions

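10 Memory Coherence Problem
[Figure: CLUSTER 1 to CLUSTER 4 with their cache modules (cluster 1 holds a[0], a[4]; cluster 4 holds a[3], a[7]), connected through memory buses to the next memory level.]
• Schedule:
                CLUSTER 1        CLUSTER 2   CLUSTER 3   CLUSTER 4
    cycle i     -                -           -           store to a[0]
    cycle i+1   -                -           -           -
    cycle i+2   -                -           -           -
    cycle i+3   -                -           -           -
    cycle i+4   load from a[0]   -           -           -
• The store to a[0] is issued in cluster 4 and must cross a memory bus to update a[0] in the module of cluster 1; the load in cluster 1 reads a[0] locally
• NON-DETERMINISTIC BUS LATENCY, so the update may arrive after the read. Sources:
  – Remote accesses
  – Misses
  – Replacements
  – Others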
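The hazard on this slide can be mimicked with a toy model. This is a minimal sketch, not the simulator used in the talk: `run` is a hypothetical helper, the bus latency is passed in as a plain number, and an update that arrives in the same cycle as the load is assumed to be applied first.

```python
def run(bus_latency, store_cycle=0, load_cycle=4):
    """Return the value seen by a local load of a[0] scheduled at load_cycle.

    A remote store issued at store_cycle updates the home module only after
    bus_latency cycles, which the compiler cannot know at schedule time.
    """
    module = {"a[0]": "old"}
    arrival = store_cycle + bus_latency
    loaded = None
    for cycle in range(max(arrival, load_cycle) + 1):
        if cycle == arrival:
            module["a[0]"] = "new"      # remote store finally reaches the module
        if cycle == load_cycle:
            loaded = module["a[0]"]     # local load reads whatever is there
    return loaded


fast_bus = run(bus_latency=2)   # update arrives before the load
slow_bus = run(bus_latency=6)   # update arrives after the load: stale read
```

The schedule is identical in both calls; only the bus latency differs, and with the slower bus the load returns the old value.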

11 Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
• Evaluation
• Conclusions

12 Solutions Outline
• Local scheduling solutions → applied at a loop granularity
  – Memory Dependent Chains (MDC)
  – Data Dependence Graph Transformations (DDGT)
      Store replication
      Load-store synchronization
• Software-based solutions
• Applicable to other configurations
  – Replicated distributed cache
  – MultiVLIW [MICRO00]
  – …

13 Memory Dependent Chains
• Sets of aliased instructions:
  – Memory Dependent Chains (MDC)
• Instructions in same set:
  – Assigned to same cluster
• Restrictions on cluster assignment
  – PrefClus: average preferred cluster
  – MinComs: minimize communications when scheduling first node
[Figure: DDG with nodes n1 (load), n2 (load), n3 (add), n4 (store), n6 (load), n7 (div), n8 (add); edges labeled MF = memory-flow, MA = memory-anti, RF = register-flow.]
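One way to picture how the chains could be built is as connected components over the memory dependence edges of the DDG. The sketch below uses union-find for that; it is an assumption about the mechanics, not the authors' implementation. The node names come from the slide's figure, but the edge list is hypothetical because the figure's edges did not survive extraction.

```python
def memory_dependent_chains(mem_nodes, mem_edges):
    """Group memory instructions linked by memory dependences (MF, MA, MO)."""
    parent = {n: n for n in mem_nodes}

    def find(n):
        # Walk to the set representative, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, _kind in mem_edges:
        parent[find(src)] = find(dst)

    chains = {}
    for n in mem_nodes:
        chains.setdefault(find(n), set()).add(n)
    return sorted(sorted(chain) for chain in chains.values())


# Memory instructions from the figure; the edges are illustrative only.
mem_nodes = ["n1", "n2", "n4", "n6"]
mem_edges = [("n1", "n4", "MA"), ("n4", "n6", "MF")]
chains = memory_dependent_chains(mem_nodes, mem_edges)
```

Every instruction in one chain would then be pinned to a single cluster, so all accesses that may alias go through the same cache module in program order.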

14 Memory Dependent Chains
[Figure: same four-cluster system and schedule as in the memory coherence problem (store to a[0] at cycle i, load from a[0] at cycle i+4). With MDC, the store to a[0] and the load from a[0] are assigned to the same cluster.]

15 DDGT: Store Replication
• Overcome MEM_FLOW (MF) and MEM_OUT (MO) dependences
[Figure: store replication. MF case: store A → load B becomes store A, store A’, store A’’, store A’’’ → load B. MO case: store A → store B becomes store A, A’, A’’, A’’’ and store B, B’, B’’, B’’’. One copy of each store is the local instance; the others are remote instances.]
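As a rough sketch of the graph rewrite, store replication can be modeled as replacing one store node by one instance per cluster and copying every edge that touched the original onto each instance. This is one reading of the slide, not a confirmed description of the authors' algorithm: `replicate_store`, the `@c` naming and the rule that only the instance in the data's home cluster performs the update are all assumptions.

```python
def replicate_store(edges, store, clusters=4):
    """Replace `store` by one instance per cluster, copying its edges to each."""
    instances = [f"{store}@c{c}" for c in range(clusters)]
    new_edges = []
    for src, dst, kind in edges:
        if src == store:
            new_edges += [(inst, dst, kind) for inst in instances]
        elif dst == store:
            new_edges += [(src, inst, kind) for inst in instances]
        else:
            new_edges.append((src, dst, kind))
    return instances, new_edges


def executing_instance(instances, addr, interleave=4):
    """Assumed run-time rule: only the instance in the home cluster updates memory."""
    home = (addr // interleave) % len(instances)
    return instances[home]


# "r4" stands for whatever produces the stored value (hypothetical node).
edges = [("r4", "stA", "RF"), ("stA", "ldB", "MF")]
instances, new_edges = replicate_store(edges, "stA")
```

Note how the single register-flow edge into the store becomes four, one per cluster, which is consistent with the extra register communications this transformation is said to cost.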

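16 DDGT: Store Replication
[Figure: CLUSTER 1 to CLUSTER 4 with their cache modules (cluster 1 holds a[0], a[4]; cluster 4 holds a[3], a[7]), connected through memory buses to the next memory level.]
• Schedule (one cell of the cycle i+1 row was lost in extraction and is left blank):
                CLUSTER 1        CLUSTER 2        CLUSTER 3   CLUSTER 4
    cycle i     -                -                -           store to a[0]
    cycle i+1   store to a[0]    -                            -
    cycle i+2   -                -                -           -
    cycle i+3   -                store to a[0]    -           -
    cycle i+4   load from a[0]   -                -           -
• The store instances are labeled as the local instance and the remote instances
• Increases the number of register communications!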

17 DDGT: ld-st Synchronization
• Overcome MEM_ANTI (MA) dependences
[Figure: load-store synchronization. Before: load A has an MA edge to store B and an RF edge to an add. After: the MA edge is replaced by a SYNC edge into store B.]
• Special cases:
  – Store is already REG_FLOW dependent on the load
  – Impossible recurrences
[Figure: special case with load A, store C (RF-dependent on load A) and store B, linked by MA and MO edges. After load-store sync., a fake consumer ("fake cons") of load A is added through an RF edge and a SYNC edge is introduced.]
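The transformation can be sketched as an edge rewrite on the DDG. The details below are assumptions rather than the authors' algorithm: the SYNC edge is taken to run from a register-flow consumer of the load to the store, a fake consumer is created only when the load has no consumer, and the recurrence check mentioned on the slide is not modeled. The function name `sync_load_store` is hypothetical.

```python
def sync_load_store(edges, load, store):
    """Replace the MA edge load -> store by a SYNC edge from a consumer of the load."""
    out = [e for e in edges if e != (load, store, "MA")]
    consumers = [dst for src, dst, kind in edges if src == load and kind == "RF"]

    if store in consumers:
        # Special case: the store already waits for the loaded value.
        return out

    if not consumers:
        # No real consumer: add a fake one so there is something to wait on.
        fake = f"fake_cons({load})"
        out.append((load, fake, "RF"))
        consumers = [fake]

    out.append((consumers[0], store, "SYNC"))
    return out


basic = sync_load_store([("ldA", "add", "RF"), ("ldA", "stB", "MA")], "ldA", "stB")
already = sync_load_store([("ldA", "stB", "RF"), ("ldA", "stB", "MA")], "ldA", "stB")
no_consumer = sync_load_store([("ldA", "stB", "MA")], "ldA", "stB")
```

The intuition behind this reading is that a consumer cannot issue until the loaded value has arrived, so ordering the store after the consumer also orders it after the load has read memory.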

18 MDC Solution: Case Study
• Impact on compute time
  – May increase the II_res
• Impact on stall time
  – May increase remote accesses
[Figure: example DDG with load A, load B, store C and an add, linked by MA, MF and RF edges, with modulo reservation tables (MRT) over clusters C1 to C4 showing II_res = 2 and II_res = 3. Annotations: one memory instruction always accesses data in cluster 1 and another always accesses data in cluster 2; latency LH = 1 cycle, latency RH = 5 cycles; schedule marks at cycle 1 and cycle 3; extra stall cycles = 3 cycles / iteration.]
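The effect on II_res can be illustrated with a resource-bound calculation restricted to memory units. This is a simplified sketch: `res_ii` is a hypothetical helper, it counts only the three memory instructions named on the slide, and it assumes one memory unit per cluster as in the evaluation setup, so its numbers are not expected to match the II_res values in the slide's figure.

```python
from collections import Counter
from math import ceil


def res_ii(assignment, mem_units_per_cluster=1):
    """Lower bound on the initiation interval imposed by memory units.

    assignment maps each memory instruction to the cluster it is scheduled in.
    """
    per_cluster = Counter(assignment.values())
    return max(ceil(n / mem_units_per_cluster) for n in per_cluster.values())


# Free assignment: each memory instruction can use a different cluster.
free = res_ii({"load A": 0, "load B": 1, "store C": 2})

# MDC: all three are in one chain, so they compete for one memory unit.
chained = res_ii({"load A": 0, "load B": 0, "store C": 0})
```

In this toy setting the chained assignment triples the memory-unit bound, which is the kind of compute-time cost the slide attributes to MDC.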

19 DDGT Solution: Case Study
• Impact on compute time
  – More instructions (II_res)
      Store replication
      Fake consumers (few)
      Register communications
• Impact on stall time
  – Small
  – New dependences may decrease slack of some memory instructions
[Figure: load A and store B linked by MA and MF edges, with X = set of memory instructions; modulo reservation tables (MRT) over clusters C1 to C4 showing II_res = 2 and II_res = 3, the latter holding replicated copies of store B.]

20 Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
• Evaluation
• Conclusions

21 Evaluation Framework
• IMPACT C compiler: compile + optimize + memory disambiguation
• Mediabench benchmark suite (profile and execution inputs):
    Benchmark    Profile       Execution
    epicdec      test_image    titanic
    g721dec      clinton       S_16_44
    g721enc      clinton       S_16_44
    gsmdec       clinton       S_16_44
    gsmenc       clinton       S_16_44
    jpegdec      testimg       monalisa
    jpegenc      testimg       monalisa
    mpeg2dec     mei16v2       tek6
    pegwitdec    pegwit        techrep
    pegwitenc    pgptest       techrep
    pgpdec       pgptext       techrep
    pgpenc       pgptest       techrep
    rasta        ex5_c1

22 Evaluation Framework
Word-interleaved cache clustered VLIW processor:
    # clusters             4
    Functional units       1 FP / cluster + 1 integer / cluster + 1 memory / cluster
    Register buses         4 buses running at ½ the core freq.
    Memory buses           4 buses running at ½ the core freq.
    Cache configuration    8KB, 2-way set-associative, 32-byte blocks; L2 always hits
    Cache latencies        Local hit = 1, remote hit = 5, local miss = 10, remote miss = 15
    Algorithm              PrefClus and MinComs
    Interleaving factor    2 or 4 bytes depending on benchmark
BASELINE: same architecture but complete freedom when assigning instructions to clusters

23 Local vs. Remote Accesses

24 Execution Time

25 Other Configurations
• Configuration 1: more pressure on register buses
                      Latency   # Buses
    Memory buses      2         4
    Register buses    4         2
  – MDC outperforms DDGT in all cases → MDC requires fewer register communications
• Configuration 2: more pressure on memory buses
                      Latency   # Buses
    Memory buses      4         2
    Register buses    2         4
  – DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…

26 Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  – Memory Dependent Chains (MDC)
  – DDG Transformations (DDGT)
• Evaluation
• Conclusions

27 Conclusions
• Memory coherence problem
  – Two software-based solutions: MDC and DDGT
  – Applied to a clustered VLIW processor with a word-interleaved cache
• MDC vs. DDGT
  – Results depend on the architecture configuration
      MDC outperforms DDGT in most cases
      DDGT better by up to 20% in a specific configuration
  – Sets of memory dependent instructions are small
  – DDGT → freedom in cluster assignment
      Increases local accesses by 15% → reduces stall time

28 Questions?

