UPC MICRO35 Istanbul Nov. 2002

Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor

Enric Gibert (1), Jesús Sánchez (1,2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
- Capacity-bound vs. communication-bound
- Clustered microarchitectures
  - Simpler + faster
  - Power consumption
  - Communications not homogeneous
- Clustering: embedded/DSP domain

Clustered Microarchitectures
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file and functional units (FUs), connected by register-to-register communication buses; memory buses connect the clusters to the L1 cache, which is backed by an L2 cache.]
GOAL: distribute the memory hierarchy!

Contributions
- Distribution of the data cache:
  - Interleaved cache clustered VLIW processor
- Hardware enhancement:
  - Attraction Buffers
- Effective instruction scheduling techniques
  - Modulo scheduling
  - Loop unrolling + smart assignment of latencies + padding

Talk Outline
- MultiVLIW
- Interleaved-cache clustered VLIW processor
- Instruction scheduling algorithms and techniques
- Hardware enhancement: Attraction Buffers
- Simulation framework
- Results
- Conclusions

MultiVLIW
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and its own cache module, connected by register-to-register communication buses and backed by a shared L2 cache. Each cache module stores whole cache blocks (TAG+STATE+DATA).]
Cache-coherence protocol required!

Interleaved Cache
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and a cache module, connected by register-to-register communication buses and backed by a shared L2 cache. A cache block (TAG, W0 W1 W2 W3 W4 W5 W6 W7) is split word by word into one subblock per cache module: TAG W0 W4 (subblock 1), TAG W1 W5, TAG W2 W6, TAG W3 W7.]
Access types: local hit, remote hit, local miss, remote miss
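The word-to-module mapping and the four access types can be sketched in C. This is an illustrative sketch, not the authors' hardware: it assumes the 4-cluster, 4-byte-interleaving configuration and the latencies listed on the Evaluation Framework slide, uses 0-based cluster numbers (CLUSTER 1 is cluster 0), and the names `home_cluster`, `classify_access` and `latency_cycles` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed configuration (Evaluation Framework slide): 4 clusters,
 * 4-byte interleaving factor. */
enum { NUM_CLUSTERS = 4, INTERLEAVE_BYTES = 4 };

typedef enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS } access_kind;

/* Cluster (0-based) whose cache module holds the word at addr:
 * consecutive words go to consecutive modules. */
int home_cluster(uint32_t addr) {
    return (int)((addr / INTERLEAVE_BYTES) % NUM_CLUSTERS);
}

/* Classify an access issued by issuing_cluster; hit is nonzero when the
 * word is present in its home cache module. */
access_kind classify_access(uint32_t addr, int issuing_cluster, int hit) {
    assert(issuing_cluster >= 0 && issuing_cluster < NUM_CLUSTERS);
    int local = (home_cluster(addr) == issuing_cluster);
    if (hit) return local ? LOCAL_HIT : REMOTE_HIT;
    return local ? LOCAL_MISS : REMOTE_MISS;
}

/* Latencies in cycles as listed for the interleaved cache. */
int latency_cycles(access_kind kind) {
    switch (kind) {
    case LOCAL_HIT:   return 1;
    case REMOTE_HIT:  return 5;
    case LOCAL_MISS:  return 10;
    case REMOTE_MISS: return 15;
    }
    return -1; /* unreachable for valid kinds */
}
```

With a 16-byte-aligned int array a[], a[0] and a[4] (byte offsets 0 and 16) both map to cluster 0, while a[3] (offset 12) maps to cluster 3, matching the a[0]a[4] ... a[3]a[7] layout used on the later slides.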

BASE Scheduling Algorithm
[Flowchart]
- START: sort nodes
- Next node: select possible clusters. How many?
  - 0: II = II + 1
  - >0: keep the clusters with the best profit in output edges. How many?
    - 1: schedule it
    - >1: take the least loaded, then schedule it
- Schedule it:
  - successful: next node
  - not successful: II = II + 1
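The cluster-selection part of the flowchart can be sketched as a single function. This is a minimal sketch under assumptions: the slide does not say how "profit in output edges" or "load" are measured, so both are plain integers here, and `choose_cluster` is a hypothetical name. A return value of -1 stands for the "0 possible clusters" branch, where the caller would increase II and retry.

```c
#include <assert.h>

enum { NUM_CLUSTERS = 4 };

/* Pick a cluster for the current node.
 * feasible[c] != 0 : the node can be placed in cluster c
 * profit[c]        : profit in output edges (assumed integer score)
 * load[c]          : current workload of cluster c (assumed integer)
 * Returns the 0-based cluster, or -1 when no cluster is possible. */
int choose_cluster(const int feasible[NUM_CLUSTERS],
                   const int profit[NUM_CLUSTERS],
                   const int load[NUM_CLUSTERS]) {
    int best = -1;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (!feasible[c]) continue;
        if (best < 0 ||
            profit[c] > profit[best] ||
            (profit[c] == profit[best] && load[c] < load[best])) {
            best = c; /* better profit, or same profit and less loaded */
        }
    }
    return best;
}
```

Folding the "best profit" filter and the "least loaded" tie-break into one pass keeps the sketch short; a real scheduler would still have to check that the node actually fits in the modulo reservation table before committing.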

Scheduling Algorithm
For word-interleaved cache clustered processors
Scheduling steps:
1. Loop unrolling
2. Assignment of latencies to memory instructions
   - latencies trade off stall time against compute time
3. Order instructions (DDG nodes)
4. Cluster assignment and scheduling

STEP 1: Loop Unrolling
[Figure: array a[] interleaved across the cache modules. CLUSTER 1: a[0] a[4]; CLUSTER 2: a[1] a[5]; CLUSTER 3: a[2] a[6]; CLUSTER 4: a[3] a[7].]

Original loop (ld r3, a[i]: 25% local accesses):
    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }

Unrolled by 4, one load per cluster (ld r31, a[i]; ld r32, a[i+1]; ld r33, a[i+2]; ld r34, a[i+3]: 100% local accesses):
    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]    (stride 16 bytes)
      ld r32, a[i+1]  (stride 16 bytes)
      ld r33, a[i+2]  (stride 16 bytes)
      ld r34, a[i+3]  (stride 16 bytes)
      ...
    }

Selective unrolling:
- No unrolling
- Unroll x N
- OUF unrolling: Optimum Unrolling Factor (OUF), strides multiple of NxI
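The slide states only that the unrolled strides should become a multiple of NxI; it does not give a formula for the OUF. As a hedged sketch, assuming N is the number of clusters and I the interleaving factor in bytes, the smallest unrolling factor with that property is NxI / gcd(NxI, stride). The function names below are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest unrolling factor U >= 1 such that U * stride_bytes is a
 * multiple of n_clusters * interleave_bytes. Assumption: this is what
 * the "Optimum Unrolling Factor" aims for; the slide gives no formula. */
unsigned unroll_factor(unsigned stride_bytes, unsigned n_clusters,
                       unsigned interleave_bytes) {
    assert(stride_bytes > 0 && n_clusters > 0 && interleave_bytes > 0);
    unsigned span = n_clusters * interleave_bytes; /* NxI in bytes */
    return span / gcd_u(span, stride_bytes);
}
```

For the loop on this slide (4-byte stride, 4 clusters, 4-byte interleaving) the result is 4, which reproduces the i+=4 unrolling with a 16-byte stride per load; a loop that already has a 16-byte stride needs no unrolling.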

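STEP 2: Latency Assignment
[Figure: data dependence graph with nodes n1 load, n2 load, n3 add, n4 store, n5 sub, n6 load, n7 div, n8 add and two recurrences, REC1 and REC2, each closed by a distance=1 edge. Edges are memory dependences and register-flow dependences. Loads start with latency L=15 and are lowered step by step; the MII labels of the two recurrences across the animation frames are 33 and 22, then 28 and 22, then 23 and 22, and finally 9 and 10.]

Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles

STEP 1 (B = II / stall)
Load | Latency change | II | stall | B
n1   | To LM          | 5  | 1     | 5
n1   | To RH          | 10 | 3     | 3.3
n1   | To LH          | 14 | 6.8   | 2.06
n2   | To LM          | 5  | 0.25  | 20
n2   | To RH          | 10 | 0.75  | 13.3
n2   | To LH          | 14 | 2.95  | 4.75

STEP 2
Load | Latency change | II | stall | B
n1   | To LM          | 5  | 1     | 5
n1   | To RH          | 10 | 3     | 3.3
n1   | To LH          | 14 | 6.8   | 2.06
n2   | To LM          | -  | -     | -
n2   | To RH          | 5  | 0.5   | 10
n2   | To LH          | 9  | 2.7   | 3.3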
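The numbers on this slide are consistent with a benefit B equal to the II change divided by the stall change (for example 5 / 0.25 = 20 for n2 lowered to LM). The sketch below recomputes B for the first table and picks the candidate with the largest B. It is an illustration under assumptions: the struct, the function names and the "pick the maximum" rule are hypothetical, and a zero stall change is mapped to the largest double, mirroring the MAX entry on the backup slide.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

typedef struct {
    const char *load;    /* DDG node, e.g. "n2" */
    const char *target;  /* new assumed latency class: "LM", "RH", "LH" */
    double ii_change;    /* II column of the slide */
    double stall_change; /* stall column of the slide */
} candidate;

/* First table of the slide (STEP 1). */
const candidate SLIDE_CANDIDATES[6] = {
    {"n1", "LM", 5.0, 1.0},  {"n1", "RH", 10.0, 3.0},  {"n1", "LH", 14.0, 6.8},
    {"n2", "LM", 5.0, 0.25}, {"n2", "RH", 10.0, 0.75}, {"n2", "LH", 14.0, 2.95},
};

/* Benefit ratio B = II change / stall change. */
double benefit(const candidate *c) {
    if (c->stall_change == 0.0) return DBL_MAX; /* no stall cost at all */
    return c->ii_change / c->stall_change;
}

/* Index of the candidate with the largest benefit (first one on ties). */
size_t best_candidate(const candidate *cands, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (benefit(&cands[i]) > benefit(&cands[best])) best = i;
    }
    return best;
}
```

On the slide data the winner is n2 lowered to LM (B = 20), which agrees with the second table, where that row is no longer available and the remaining n2 rows have been recomputed.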

STEPS 3 and 4
- Step 3: Order instructions
- Step 4: Cluster assignment and scheduling

Scheduling Restrictions
[Figure: CLUSTER 1 to CLUSTER 4 with their cache modules (CLUSTER 1 holds a[0] a[4], CLUSTER 4 holds a[3] a[7]), connected through memory buses to the NEXT MEMORY LEVEL.]

Example schedule:
          | CLUSTER 1      | CLUSTER 2 | CLUSTER 3 | CLUSTER 4
cycle i   | -              | -         | -         | store to a[0]
cycle i+1 | -              | -         | -         | -
cycle i+2 | -              | -         | -         | -
cycle i+3 | load from a[0] | -         | -         | -

NON-DETERMINISTIC BUS LATENCY!

STEPS 3 and 4
- Step 3: Order instructions
- Step 4: Cluster assignment and scheduling
  - Non-memory instructions: same as BASE
    - Minimize register communications + maximize workload
  - Memory instructions:
    - Memory instructions in the same chain go to the same cluster
    - IPBC (Interleaved Preferred Build Chains)
      - Average preferred cluster of the chain
      - Padding (to an NxI boundary) gives meaningful preferred cluster information
        - Stack frames
        - Dynamically allocated data
    - IBC (Interleaved Build Chains)
      - Minimize register communications of the 1st instruction of the chain
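The slide describes IPBC only as using the "average preferred cluster of the chain". One plausible reading, shown here as a sketch and not as the authors' exact rule, is to add up the profiled accesses of every memory instruction in the chain per home cluster and place the whole chain in the cluster with the largest total. The function name, the profile layout and the example numbers are all hypothetical.

```c
#include <assert.h>

enum { NUM_CLUSTERS = 4 };

/* Hypothetical profile of a 3-instruction chain: EXAMPLE_CHAIN[k][c] is
 * the number of profiled accesses of chain member k whose data lives in
 * cluster c (0-based). */
const unsigned EXAMPLE_CHAIN[3][NUM_CLUSTERS] = {
    {90, 10, 0, 0},
    {60, 40, 0, 0},
    {20, 80, 0, 0},
};

/* Preferred cluster of a whole chain: the cluster that receives the most
 * profiled accesses summed over all members (lowest index on ties). */
int chain_preferred_cluster(const unsigned accesses[][NUM_CLUSTERS],
                            int members) {
    assert(members > 0);
    unsigned long total[NUM_CLUSTERS] = {0};
    for (int k = 0; k < members; k++) {
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            total[c] += accesses[k][c];
        }
    }
    int best = 0;
    for (int c = 1; c < NUM_CLUSTERS; c++) {
        if (total[c] > total[best]) best = c;
    }
    return best;
}
```

In the example the third instruction on its own would prefer cluster 1, but the chain as a whole (170 accesses against 130) is placed in cluster 0, so that instruction pays remote accesses in exchange for keeping memory-dependent instructions together.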

Memory Dependent Chains
[Figure: data dependence graph with nodes n1 load, n2 load, n3 add, n4 store, n5 sub, n6 load, n7 div, n8 add, two distance=1 edges, memory dependences and register-flow dependences. Labels: Preferred = 1, Preferred = 2; assigned latencies L=1, L=8, L=1, L=5, L=1.]

Memory latencies: LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles

     | n1, n2, n4 | n6
IPBC | cluster 1  | cluster 2
IBC  | same as n4 | minimize register communications

order = {n5, n4, n3, n2, n1, n8, n7, n6}

Attraction Buffers
- Cost-effective mechanism to increase local accesses
[Figure: CLUSTER 1 to CLUSTER 4, each with a cache module (a[0] a[4]; a[1] a[5]; a[2] a[6]; a[3] a[7]) and an Attraction Buffer (ABuffer). The loads ld r3, a[3]; ld r3, a[7]; ... (stride 16 bytes) are scheduled in a cluster whose cache module does not hold a[3] a[7].]
- Without Attraction Buffers: local accesses = 0%
- With a[3] a[7] copied into the ABuffer: local accesses = 50%
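The 0% and 50% figures can be reproduced with a small simulation. This is a sketch under stated assumptions, not the authors' design: 4 clusters, 4-byte interleaving and 32-byte blocks (so a subblock holds two words, such as a[3] and a[7]); a remote access copies the whole remote subblock into the issuing cluster's buffer; the buffer has 8 entries with FIFO replacement; coherence is ignored. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_CLUSTERS = 4, INTERLEAVE_BYTES = 4, BLOCK_BYTES = 32, AB_ENTRIES = 8 };

typedef struct {
    uint32_t tag[AB_ENTRIES]; /* ids of the remote subblocks held */
    int valid[AB_ENTRIES];
    int next;                 /* FIFO replacement pointer (assumption) */
} attraction_buffer;

int home_cluster(uint32_t addr) {
    return (int)((addr / INTERLEAVE_BYTES) % NUM_CLUSTERS);
}

/* A subblock is identified by its cache block and its home cluster. */
uint32_t subblock_id(uint32_t addr) {
    return (addr / BLOCK_BYTES) * NUM_CLUSTERS + (uint32_t)home_cluster(addr);
}

/* Returns 1 when the access is served locally (own cache module or
 * Attraction Buffer); on a remote access the subblock is attracted. */
int access_word(attraction_buffer *ab, int cluster, uint32_t addr) {
    if (home_cluster(addr) == cluster) return 1;
    uint32_t id = subblock_id(addr);
    for (int i = 0; i < AB_ENTRIES; i++) {
        if (ab->valid[i] && ab->tag[i] == id) return 1;
    }
    ab->tag[ab->next] = id;
    ab->valid[ab->next] = 1;
    ab->next = (ab->next + 1) % AB_ENTRIES;
    return 0;
}

/* Percentage of local accesses for count loads issued by one cluster,
 * starting at byte address start with a fixed stride. */
int local_percent(int cluster, uint32_t start, uint32_t stride,
                  int count, int use_ab) {
    assert(count > 0);
    attraction_buffer ab = {{0}, {0}, 0};
    int local = 0;
    for (int k = 0; k < count; k++) {
        uint32_t addr = start + (uint32_t)k * stride;
        if (use_ab) local += access_word(&ab, cluster, addr);
        else        local += (home_cluster(addr) == cluster);
    }
    return 100 * local / count;
}
```

For loads of a[3], a[7], a[11], ... issued from cluster 0 (start offset 12, stride 16), every access is remote without the buffer. With it, each miss on a[3], a[11], ... brings in the subblock that also contains the next element, so every second access becomes local.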

Evaluation Framework
- IMPACT C compiler
- Mediabench benchmark suite

Benchmark | Profile    | Execution
epicdec   | test_image | titanic
epicenc   | test_image | titanic
g721dec   | clinton    | S_16_44
g721enc   | clinton    | S_16_44
gsmdec    | clinton    | S_16_44
gsmenc    | clinton    | S_16_44
jpegdec   | testimg    | monalisa
jpegenc   | testimg    | monalisa
mpeg2dec  | mei16v2    | tek6
pegwitdec | pegwit     | techrep
pegwitenc | pgptest    | techrep
pgpdec    | pgptext    | techrep
pgpenc    | pgptest    | techrep
rasta     | ex5_c1     |

Evaluation Framework

                    | Unified cache  | MultiVLIW      | Interleaved cache
# clusters          | 4              | 4              | 4
Functional units    | 1 FP / cluster + 1 integer / cluster + 1 memory / cluster (all configurations)
Register buses      | 4 buses running at 1/2 the core freq. (all configurations)
Cache configuration | 8KB, 2-way set-associative, 32 byte blocks; L2 always hits (all configurations)
Cache latencies     | Hit=5, Miss=14 | Hit=1, Miss=10 | Local Hit=1, Remote Hit=5, Local Miss=10, Remote Miss=15
Algorithm           | BASE           | IBC            | IPBC + IBC
Interleaving factor | -              | -              | 4 bytes

Local Accesses
[Chart. Legend: OUF = Optimum UF, P = Padding, NC = No Chains]

Why Remote Accesses?
- Double precision accesses (mpeg2dec)
- Unclear preferred cluster information
  - Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
  - Different alignment (epicenc, jpegdec, jpegenc)
  - Strides not multiple of NxI (selective unrolling, …)
- Memory dependent chains (epicdec, pgpdec, pgpenc, rasta)

    for (k=0; k<MAX; k++) {
      for (i=k; i<MAX; i++)
        load a[i]
    }

Stall Time

Cycle Count Results

Conclusions
- Interleaved cache clustered VLIW processor
- Effective instruction scheduling techniques
  - Smart assignment of latencies
  - Loop unrolling + padding (27% more local hits)
- Source of remote accesses and stall time
- Attraction Buffers (stall time reduced by up to 34%)
- Cycle count results:
  - MultiVLIW (7% slowdown but simpler hardware)
  - Unified cache (11% speedup)

Questions?

Question: Latency Assignment
MII(REC1)=20, MII(DDG)=10

Node | II | stall | B (ratio) | B (subtract)
n1   | 15 | 4     | 3.75      | 11
n2   | 10 | 5     | 2         | 5
n3   | 5  | 1     | 5         | 4
n4   | 5  | 1     | 5         | 4
n5   | 10 | 0     | MAX       | 10

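Question: Padding

    #include <stdlib.h>

    #define MAX 1024  /* not given on the slide; assumed value */

    void foo(int *array, int *accum) {
      *accum = 0;
      for (int i = 0; i < MAX; i++)
        *accum += array[i];
    }

    int main(void) {
      int *a, value;
      a = malloc(MAX * sizeof(int));           /* dynamically allocated data */
      if (a == NULL) return 1;
      for (int i = 0; i < MAX; i++) a[i] = i;  /* initialize before summing */
      foo(a, &value);                          /* value lives in main's stack frame */
      free(a);
      return 0;
    }

[Figure: resulting data layout. CLUSTER 1: a[0] a[4] ...; CLUSTER 2: accum a[1] a[5] ...; CLUSTER 3: a[2] a[6] ...; CLUSTER 4: a[3] a[7] ...]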

Question: Coherence
- Memory Dependent Chains
  - Modified data: present in only one Attraction Buffer
  - Data present in multiple Attraction Buffers: replicated in read-only manner
  - Local scheduling technique: at end of loop, flush Attraction Buffers contents
[Figure: CLUSTER 1 to CLUSTER 4, each with an ABuffer; a[2] is replicated in the ABuffers of CLUSTER 1, CLUSTER 2 and CLUSTER 4.]
