
Slide 1: Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Rajeev Balasubramonian, School of Computing, University of Utah, July 1st 2004

Slide 2: Billion-Transistor Chips
- Partitioned architectures: small computational units connected by a communication fabric
- Small computational units with limited functionality -> fast clocks, low design effort, low power
- Numerous computational units -> high parallelism

Slide 3: The Communication Bottleneck
- Wire delays do not scale down at the same rate as logic delays [Agarwal, ISCA'00] [Ho, Proc. IEEE'01]
- Projected 30-cycle delay to go across the chip in 10 years
- 1-cycle inter-hop latency in the RAW prototype [Taylor, ISCA'04]

Slide 4: Cache Design
- Centralized cache: address transfer to the L1D takes 6 cycles, the RAM access takes 6 cycles, and the data transfer back takes 6 cycles
- 18-cycle access in total (12 cycles of which are communication)

Slide 5: Cache Design
- Centralized cache: 6-cycle address transfer + 6-cycle RAM access + 6-cycle data transfer = 18-cycle access (12 cycles for communication)
- Decentralized cache alternative: L1D banks placed close to the clusters (shown alongside for comparison)

Slide 6: Research Goals
- Identify bottlenecks in cache access
- Design cluster prefetch, a latency-hiding mechanism
- Evaluate and compare centralized and decentralized designs

Slide 7: Outline
- Motivation
- Evaluation platform
- Cluster prefetch
- Centralized vs. decentralized caches
- Conclusions

Slide 8: Clustered Microarchitectures
- Centralized front-end
- Instructions are dynamically steered to clusters based on dependences and load (a minimal steering sketch follows below)
- Out-of-order issue and 1-cycle bypass within a cluster
- Hierarchical interconnect (figure: instruction fetch, clusters connected by a crossbar and a ring, LSQ, L1D)
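The slides do not spell out the steering policy, so the following C sketch only illustrates one common dependence-and-load heuristic under assumed bookkeeping (producer_cluster and issue_queue_occupancy are hypothetical names): send an instruction to the cluster that produces its source operands when possible, otherwise to the least-loaded cluster.

```c
#include <limits.h>

#define NUM_CLUSTERS 16

/* Hypothetical bookkeeping: which cluster produces each register value,
   and how many instructions are queued at each cluster. */
int producer_cluster[256];              /* register -> cluster that writes it */
int issue_queue_occupancy[NUM_CLUSTERS];

/* Dependence-and-load steering: prefer the cluster producing a source
   operand (to exploit the 1-cycle intra-cluster bypass), otherwise fall
   back to the least-loaded cluster. */
int steer(int src1_reg, int src2_reg)
{
    int c1 = (src1_reg >= 0) ? producer_cluster[src1_reg] : -1;
    int c2 = (src2_reg >= 0) ? producer_cluster[src2_reg] : -1;

    if (c1 >= 0 && c1 == c2)
        return c1;                      /* both operands local to one cluster */
    if (c1 >= 0 && c2 < 0)
        return c1;
    if (c2 >= 0 && c1 < 0)
        return c2;

    /* No clear dependence winner: balance load across clusters. */
    int best = 0, best_occ = INT_MAX;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (issue_queue_occupancy[c] < best_occ) {
            best_occ = issue_queue_occupancy[c];
            best = c;
        }
    }
    return best;
}
```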

Slide 9: Simulation Parameters
- Simplescalar-based simulator
- In-flight instruction window of 480
- 16 clusters, each with 60 registers, 30 issue queue entries, and one FU of each kind
- Inter-cluster latencies between 2 and 10 cycles
- Primary focus on SPEC-FP programs

Slide 10: Steps Involved in Cache Access
- Instruction dispatch
- Effective address computation
- Effective address transfer (to the LSQ/L1D)
- Memory disambiguation
- RAM access
- Data transfer (back to the cluster)

Slide 11: Lifetime of a Load (figure)

Slide 12: Load Address Prediction (example timeline without prediction)
- The load dispatches at cycle 0, its effective address reaches the LSQ at cycle 27, the cache access occurs at cycle 68, and the data transfer back to the cluster completes at cycle 94

Slide 13: Load Address Prediction (example timeline with prediction)
- Without prediction: dispatch at cycle 0, effective address transfer at cycle 27, cache access at cycle 68, data transfer at cycle 94
- With an address predictor: the cache access starts at cycle 0 using the predicted address and the data transfer completes at cycle 26, while the actual effective address still arrives at cycle 27 to verify the prediction

Slide 14: Memory Dependence Speculation
- To allow early cache access, loads must issue before the addresses of earlier stores are resolved
- High-confidence store address predictions are employed for disambiguation
- Stores that have never forwarded results within the LSQ are ignored
- Cluster prefetch: the combination of load address prediction and memory dependence speculation (a sketch of the disambiguation check follows below)
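As a rough illustration of the rule described on this slide, the C sketch below decides whether a load with a predicted address may access the cache before earlier store addresses resolve. The record layout and function names (inflight_store_t, can_issue_load_early) are assumptions, and the 8-byte conflict granularity is chosen arbitrarily for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record for an older in-flight store, as seen by the check. */
typedef struct {
    bool     addr_predicted;      /* high-confidence address prediction available */
    uint64_t predicted_addr;
    bool     never_forwarded;     /* store has never forwarded data in the LSQ */
} inflight_store_t;

/* May a load with predicted address 'load_addr' access the cache before the
   addresses of earlier stores are resolved?  Conflicts are detected at an
   assumed 8-byte granularity. */
bool can_issue_load_early(uint64_t load_addr,
                          const inflight_store_t *older_stores, int n)
{
    for (int i = 0; i < n; i++) {
        if (older_stores[i].never_forwarded)
            continue;             /* ignored: predicted not to pose a conflict */
        if (!older_stores[i].addr_predicted)
            return false;         /* no confident store address: be conservative */
        if ((older_stores[i].predicted_addr >> 3) == (load_addr >> 3))
            return false;         /* predicted conflict: wait for the store */
    }
    return true;
}
```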

Slide 15: Implementation Details
- A centralized table maintains a stride and last address per entry; the stride is established after five consecutive accesses and cleared after five mispredicts (see the predictor sketch below)
- A separate centralized table maintains a single bit per entry to flag stores that pose conflicts
- Each misprediction flushes all subsequent instructions
- Storage overhead: 18KB
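A minimal C sketch of the stride-based address predictor described on this slide. The table size, indexing, and field names are assumptions, and the exact counting of confirmations may differ from the real design; only the "five consecutive accesses to establish a stride, five mispredicts to clear it" policy comes from the slide.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES    4096    /* hypothetical table size */
#define CONFIRMATIONS   5       /* stride trusted after five matching accesses */
#define MAX_MISPREDICTS 5       /* stride cleared after five mispredicts */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;        /* consecutive accesses matching the stride */
    int      mispredicts;
} addr_pred_entry_t;

static addr_pred_entry_t table[PRED_ENTRIES];

/* Predict the next effective address for a load/store PC; returns false
   until the entry has built up enough confidence in its stride. */
bool predict_addr(uint64_t pc, uint64_t *pred)
{
    addr_pred_entry_t *e = &table[(pc >> 2) % PRED_ENTRIES];
    if (e->confidence < CONFIRMATIONS)
        return false;
    *pred = e->last_addr + e->stride;
    return true;
}

/* Train the entry once the actual effective address has been computed. */
void train_addr(uint64_t pc, uint64_t actual)
{
    addr_pred_entry_t *e = &table[(pc >> 2) % PRED_ENTRIES];
    int64_t new_stride = (int64_t)(actual - e->last_addr);

    if (new_stride == e->stride) {
        if (e->confidence < CONFIRMATIONS)
            e->confidence++;
        e->mispredicts = 0;
    } else if (e->confidence >= CONFIRMATIONS &&
               ++e->mispredicts >= MAX_MISPREDICTS) {
        e->confidence = 0;       /* clear the stride after five mispredicts */
        e->stride = new_stride;
        e->mispredicts = 0;
    } else if (e->confidence < CONFIRMATIONS) {
        e->stride = new_stride;  /* still learning the stride */
        e->confidence = 1;
    }
    e->last_addr = actual;
}
```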

Slide 16: Performance Results
- Overall IPC improvement: 21%

Slide 17: Results Analysis
- Roughly half the programs improve IPC by more than 8%
- Load address prediction rate: 65%
- Store address prediction rate: 79%
- Stores deemed likely not to pose conflicts: 59%
- Average number of mispredictions: 12K per 100M instructions

Slide 18: Decentralized Cache (Replicated Cache Banks)
- Loads do not travel far
- Stores and cache refills are broadcast to all banks (a broadcast sketch follows below)
- Memory disambiguation is not accelerated
- Overheads: interconnect for broadcast and cache refill, power for redundant writes, distributed LRU, etc.
- (Figure: replicated L1D/LSQ banks, one per group of clusters)
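To make the replication overhead concrete, here is a toy C sketch (hypothetical bank count, sizes, and mapping) of the broadcast that a committed store or incoming refill would require: every replicated bank is written, while a load would read only the bank local to its cluster.

```c
#include <string.h>

#define NUM_BANKS  4            /* hypothetical number of replicated L1D banks */
#define BANK_BYTES (32 * 1024)  /* hypothetical bank capacity */

/* Toy replicated-bank model: each bank keeps its own copy of the data. */
typedef struct { unsigned char data[BANK_BYTES]; } l1d_bank_t;
static l1d_bank_t banks[NUM_BANKS];

/* A committed store (or an incoming refill) is written into every bank,
   which is the interconnect/power overhead listed on the slide. */
static void broadcast_store(unsigned long long addr, const void *src, int size)
{
    unsigned long long offset = addr % BANK_BYTES;  /* toy direct mapping */
    if (offset + (unsigned long long)size > BANK_BYTES)
        return;                                     /* keep the toy write in bounds */
    for (int b = 0; b < NUM_BANKS; b++)
        memcpy(&banks[b].data[offset], src, size);  /* redundant writes */
}
```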

Slide 19: Comparing Centralized & Decentralized Caches

                                Centralized   Decentralized
  IPC without cluster prefetch     1.43           1.52
  IPC with cluster prefetch        1.73           1.79

Slide 20: Sensitivity Analysis
- Results verified for processor models with varying resources and interconnect latencies
- Evaluations on SPEC-Int: the address prediction rate is only 38% -> modest speedups:
  - twolf (7%), parser (9%)
  - crafty, gcc, vpr (3-4%)
  - rest (< 2%)

Slide 21: Related Work
- Modest speedups with decentralized caches: Racunas and Patt [ICS '03] for dynamic clustered processors; Gibert et al. [MICRO '02] for VLIW clustered processors
- Gibert et al. [MICRO '03]: compiler-managed L0 buffers for critical data

Slide 22: Conclusions
- Address prediction and memory dependence speculation can hide the latency to cache banks: a prediction rate of 66% for SPEC-FP and an IPC improvement of 21%
- Additional benefits from decentralization are modest
- Future work: build better predictors; study the impact on power consumption [WCED '04]
