ICS’02 UPC
An Interleaved Cache Clustered VLIW Processor
E. Gibert, J. Sánchez* and A. González*
Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC)
* Also at Intel Barcelona Research Center
June 2002

Motivation
• Capacity-bound vs. communication-bound
• Solution: clustered microarchitectures
  – Partition some hardware resources
  – Simpler + faster
  – Power consumption
  – Communications not homogeneous
• Goal: clustering the memory hierarchy in statically scheduled processors

Talk Outline
• State-of-the-art: MultiVLIW
• Interleaved Cache Clustered VLIW
• Scheduling Algorithms
• Enhancement: Attraction Buffers
• Experimental Framework
• Results
• Conclusions

State-of-the-art: MultiVLIW
• Sánchez and González [MICRO’00]
[Figure: clusters 1 to n, each with a register file (Reg. File), functional units (F.U.) and an L1 data cache. Labels: coherency network, register-to-register buses, next memory level.]

Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters, each with a register file (Reg. File), functional units (FUs) and a cache module. The cache module of cluster 1 holds TAG, W0, W4; cluster 2 holds TAG, W1, W5; cluster 3 holds TAG, W2, W6; cluster 4 holds TAG, W3, W7. A cache block is shown as TAG plus words W0 to W7, with one subblock marked ("Subblock 1"). Labels: register-to-register buses, memory buses, next memory level.]
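
The word placement in the figure (W0 and W4 in cluster 1, W1 and W5 in cluster 2, and so on) is consistent with a simple modulo mapping. The sketch below assumes 4-byte words, the 4-byte interleaving factor and the 4 clusters given in the experimental setup; `home_cluster` is a hypothetical helper name, not something from the talk.

```python
def home_cluster(address, n_clusters=4, interleave_bytes=4):
    """Return the 0-based cluster whose cache module holds this byte address.

    Assumption: consecutive interleave_bytes-sized chunks are assigned to
    clusters round-robin, as the W0..W7 layout in the figure suggests.
    """
    return (address // interleave_bytes) % n_clusters


# Words W0..W7 of one 32-byte block, 4 bytes each; clusters printed 1-based
# to match the slide labels (CLUSTER 1 .. CLUSTER 4).
word_to_cluster = {f"W{k}": home_cluster(4 * k) + 1 for k in range(8)}
print(word_to_cluster)
```

Under these assumptions, words k and k+4 of a block always land in the same module, which is why each module stores two words per cache block.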

Modulo Scheduling
• Extract ILP from loops → overlap execution of iterations
[Figure: loop L with operations A, B, C; successive iterations (A B C, A’ B’ C’, A’’ B’’ C’’) are overlapped in time. Labels: II, SC, Kernel.]
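
To make the II, SC and kernel labels concrete, here is a minimal sketch that folds a one-iteration schedule into a kernel. It assumes II is the initiation interval and SC the stage count, as is usual in modulo scheduling; the start cycles for A, B and C are invented for illustration, and `fold_schedule` is a hypothetical name.

```python
def fold_schedule(start_cycle, ii):
    """Fold a flat one-iteration schedule into a modulo-scheduled kernel.

    start_cycle maps each operation to its start cycle within one iteration.
    Returns (stage_count, kernel), where kernel[r] lists the (op, stage)
    pairs issued in row r of the kernel.
    """
    stages = {op: cycle // ii for op, cycle in start_cycle.items()}
    stage_count = max(stages.values()) + 1
    kernel = [[] for _ in range(ii)]
    for op, cycle in sorted(start_cycle.items(), key=lambda kv: kv[1]):
        kernel[cycle % ii].append((op, stages[op]))
    return stage_count, kernel


# Illustrative values only: A, B, C start at cycles 0, 1, 2 and II = 1.
sc, kernel = fold_schedule({"A": 0, "B": 1, "C": 2}, ii=1)
print(sc, kernel)
```

An operation in stage s of the kernel belongs to the iteration that started s initiation intervals earlier, which is how A of one iteration ends up running alongside B and C of earlier ones.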

Base Scheduling Algorithm
• Used for Unified Cache
[Flowchart: START → Sort nodes → Next node → Select possible clusters → How many? If 0: II=II+1. If >0: Best profit in output edges → How many? If 1: Schedule it. If >1: Least loaded → Schedule it.]
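
One reading of the flowchart is a three-step filter per node: keep the clusters where the node can be placed, then those with the best profit in output edges, then break ties by load. The sketch below follows that reading only; the slide does not define how profit or load are computed, so `profit` and `load` are hypothetical inputs supplied by the caller.

```python
def pick_cluster(feasible, profit, load):
    """Choose a cluster for one node, following one reading of the flowchart.

    feasible: clusters where the node can be scheduled at the current II.
    profit:   cluster -> profit in output edges (hypothetical metric).
    load:     cluster -> current load (hypothetical metric).
    Returns None when no cluster is feasible; the caller is then expected
    to increase II by one and restart scheduling.
    """
    if not feasible:
        return None
    best = max(profit[c] for c in feasible)
    top = [c for c in feasible if profit[c] == best]
    if len(top) == 1:
        return top[0]
    # Several clusters tie on profit: take the least loaded one.
    return min(top, key=lambda c: load[c])
```

For example, with clusters 1 and 2 tied on profit, `pick_cluster([0, 1, 2], {0: 1, 1: 3, 2: 3}, {0: 0, 1: 5, 2: 2})` selects cluster 2 because it carries less load.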

Interleaved Cache Scheduling Algorithm
• Unroll loop to maximize instructions with a stride multiple of NxI → access ONE cache module
• Assign latencies to memory instructions
• Assign memory instructions to clusters:
  – IPBC (Interleaved Pre-Build Chains) → minimize stall time
  – IBC (Interleaved Build Chains) → minimize compute time
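
For a single memory instruction, the smallest unroll factor that makes its stride a multiple of NxI follows from a gcd. This is only a per-instruction sketch; the slide says the loop is unrolled to maximize the number of such instructions, which needs a choice across all memory instructions that is not modelled here. `unroll_factor` is a hypothetical name, and N = 4, I = 4 bytes are taken from the experimental setup.

```python
from math import gcd


def unroll_factor(stride_bytes, n_clusters=4, interleave_bytes=4):
    """Smallest U such that U * stride_bytes is a multiple of N x I."""
    span = n_clusters * interleave_bytes
    return span // gcd(span, abs(stride_bytes))


# A 4-byte stride needs unrolling by 4 to reach a 16-byte stride;
# an 8-byte stride needs unrolling by 2.
print(unroll_factor(4), unroll_factor(8))
```

After unrolling by this factor, each copy of the instruction advances by a multiple of NxI bytes per iteration and therefore always touches the same cache module.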

Memory Dependent Instructions
[Figure: dependence graph of load, add and store nodes grouped into two memory dependent chains (chain 1 and chain 2); memory nodes are annotated Preferred=1 or Preferred=2.]
• IPBC → preferred info is used
• vs. IBC → minimize register comms.
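
A rough way to picture memory dependent chains is as connected components of the memory-dependence graph, with each chain placed in a single cluster. The slides do not say how a chain's cluster is derived from the preferred-cluster annotations, so the majority vote below is purely an assumption for illustration, as are the function names and the example operation names.

```python
from collections import Counter, defaultdict


def build_chains(mem_ops, dependences):
    """Group memory operations into chains (connected components)."""
    parent = {op: op for op in mem_ops}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in dependences:
        parent[find(a)] = find(b)

    chains = defaultdict(list)
    for op in mem_ops:
        chains[find(op)].append(op)
    return list(chains.values())


def chain_cluster(chain, preferred):
    """Assumed policy: most frequent preferred cluster, lowest id on ties."""
    votes = Counter(preferred[op] for op in chain)
    best = max(votes.values())
    return min(c for c, v in votes.items() if v == best)
```

With operations `ld1, st1, ld2, st2, ld3` and dependences `(ld1, st1)`, `(ld2, st2)`, `(st2, ld3)`, this yields two chains; a chain whose members prefer clusters 2, 2 and 1 would be assigned to cluster 2 under the assumed vote.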

Enhancement: Attraction Buffers
[Figure: within a cluster, the ADDRESS is sent both to the CACHE MODULE (TAG, W2, W6, with a tag comparison and word select) and to the ATTRACTION BUFFER (ABuffer). Each produces hit and data signals, and local logic selects the local data.]
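
The toy model below illustrates the idea as far as the slides describe it: a load to a word owned by another cluster is a remote access the first time, and later loads to the same remote subblock are served from the attraction buffer. Everything beyond that is assumed for the sketch: the buffer capacity, LRU replacement, subblock granularity keyed by (block, home cluster), all accesses hitting in cache, and no stores or coherence handling. `ClusterModel` is a hypothetical name.

```python
from collections import OrderedDict


class ClusterModel:
    """Toy model of one cluster: local cache module plus attraction buffer."""

    def __init__(self, cluster_id, n_clusters=4, interleave_bytes=4,
                 block_bytes=32, abuffer_entries=2):
        self.cluster_id = cluster_id          # 0-based cluster index
        self.n_clusters = n_clusters
        self.interleave_bytes = interleave_bytes
        self.block_bytes = block_bytes
        self.abuffer_entries = abuffer_entries  # assumed capacity
        self.abuffer = OrderedDict()            # keys kept in LRU order

    def load(self, address):
        """Classify a load as 'local', 'abuffer' or 'remote'."""
        home = (address // self.interleave_bytes) % self.n_clusters
        if home == self.cluster_id:
            return "local"
        key = (address // self.block_bytes, home)  # remote subblock id
        if key in self.abuffer:
            self.abuffer.move_to_end(key)
            return "abuffer"
        self.abuffer[key] = True                   # attract the subblock
        if len(self.abuffer) > self.abuffer_entries:
            self.abuffer.popitem(last=False)       # evict least recently used
        return "remote"


# Cluster 4 on the slides (index 3 here), with a 4-byte array a[] at address 0.
c4 = ClusterModel(cluster_id=3)
trace = [c4.load(0), c4.load(16), c4.load(12), c4.load(0)]
print(trace)
```

In this trace, `a[0]` is fetched remotely once, after which `a[4]` (same subblock) and a repeated `a[0]` are served from the attraction buffer, while `a[3]` is local to the cluster.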

An Example
N = 4 clusters, I = 4 bytes

Original loop:
    for (i=0; i<MAX; i++) {
        ld r3, a[i]
        r4 = OP(r3)
        st r4, b[i]
    }

Unroll x4, giving 16-byte strides (an NxI multiple):
    for (i=0; i<MAX; i+=4) {
        ld r31, a[i]      (stride 16)
        ld r32, a[i+1]
        ld r33, a[i+2]
        ld r34, a[i+3]
        r41 = OP(r31)
        r42 = OP(r32)
        r43 = OP(r33)
        r44 = OP(r34)
        st r41, b[i]
        st r42, b[i+1]
        st r43, b[i+2]
        st r44, b[i+3]
    }

[Figure: array elements a[0], a[1], a[2], a[3], ... distributed across the cache modules of clusters 1 to 4. Cluster 4 is shown with its local module and ABuffer, holding a[3], a[7], a[0] and a[4], for the instruction ld r31, a[0].]

Enhancement: Attraction Buffers
• Why remote accesses? Why Attraction Buffers?
  – Double precision accesses → low benefit
  – Indirect accesses: a[b[i]] → low benefit
  – “Unclear” preferred cluster → big benefit
        for (i=0; i<MAX; i++)
          for (k=i; k<i+MAX; k+=4)
            ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]
  – Memory dependent chains → big benefit
  – IBC: preferred cluster info is not used → big benefit

Experimental Framework
• IMPACT C compiler
• Modulo scheduling on hyperblock loops
  – BASE for a Unified Cache
  – IPBC and IBC for an Interleaved Cache
  – IPBC and IBC for the MultiVLIW
  – The same unrolling factor has been used for all architecture configurations!
• Mediabench benchmark suite

Experimental Framework
Number of clusters:                       4
Functional units:                         1 FP / cluster + 1 int / cluster + 1 mem / cluster
Cache configuration:                      8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses:           4 buses that run at ½ the core frequency
Memory buses:                             4 buses that run at ½ (or ¼) the core frequency
Next memory level:                        4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache):  4 bytes
Latencies:                                1-10 (Unified Cache + MultiVLIW)
                                          1-(5/6)-10-15 (Interleaved Cache)

Results (I)
• IPBC vs. IBC → similar cycle count results
• MultiVLIW vs. Interleaved → similar results, BUT… lower complexity!

Results (II)
• Memory dependent chains
  – Interleaved cache → workload imbalance + remote accesses
  – MultiVLIW → workload imbalance
  – Working on techniques to overcome scheduling restrictions

Results (III)
• Local hits are increased by 15%
• Stall time reduced by 30%

Conclusions
• Scheduling Algorithms
  – Good latency assignment process (stall time accounts for 9% of execution time)
  – Coherence kept through memory dependent chains (5% cycle count degradation)
• Attraction Buffers
  – Effective to increase local hits (15% average) + reduce stall time (30% average)
  – Reduce remote hits to previously accessed subblocks (70% average)
• Cycle count results
  – Similar to Unified Cache and MultiVLIW

Questions
