1 Cache Pipelining with Partial Operand Knowledge
Erika Gunadi and Mikko H. Lipasti
Department of Electrical and Computer Engineering
University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm

2 Cache Power Consumption
Increasing on-chip cache size → increasing cache power consumption
Increasing clock frequency → increasing dynamic power
Lots of prior work to reduce cache power consumption

3 Prior Work
Cache subbanking, bitline segmentation [Su et al. 1995, Ghose et al. 2001]
Cache decomposition [Huang et al. 2001]
Block buffering [Su et al. 1995]
Reducing leakage power:
Drowsy caches [Flautner et al. 2002, Kim et al. 2002]
Cache decay [Kaxiras et al. 2001]
Gated Vdd [Powell et al. 2000]

4 Cache Subbanking
Proposed by Su et al. 1995
Fetches only the requested subline
Partitions the data array vertically into several subbanks
Studied further by Ghose et al. 2001
Partitions the data array both vertically and horizontally
Activates only the requested subbanks (sketched below)
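A minimal sketch of the subbank-selection idea, assuming hypothetical sizes (64B blocks split into 8B sublines); the names and field widths are illustrative, not the paper's:

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BYTES   64  /* assumed: 64B cache blocks          */
#define SUBLINE_BYTES  8  /* assumed: 8B sublines -> 8 subbanks */
#define NUM_SUBBANKS  (BLOCK_BYTES / SUBLINE_BYTES)

/* Only the subbank holding the requested subline is activated;
   the remaining subbanks stay idle, saving their dynamic power. */
static void subbanked_read(uint32_t addr, bool activate[NUM_SUBBANKS])
{
    unsigned subbank = (addr % BLOCK_BYTES) / SUBLINE_BYTES;
    for (unsigned i = 0; i < NUM_SUBBANKS; i++)
        activate[i] = (i == subbank);
}
```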

5 Bit-sliced ALU
Originally proposed by Hsu et al. 1985
Slices addition operations, e.g. one 32-bit addition → four 8-bit additions
Avoids waiting for the full-width addition
Bypasses partial operand results to consumers
Has been successfully implemented in the Pentium 4 staggered adder
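A toy illustration of the slicing, assuming the 8-bit slices named on the slide; each slice's sum is final as soon as its carry-in arrives, so it can be bypassed before the full 32-bit result exists:

```c
#include <stdint.h>

/* 32-bit add performed as four 8-bit slices, low slice first. */
static uint32_t bitsliced_add(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    for (int slice = 0; slice < 4; slice++) {
        unsigned sa = (a >> (8 * slice)) & 0xFF;
        unsigned sb = (b >> (8 * slice)) & 0xFF;
        unsigned s  = sa + sb + carry;       /* 8-bit slice add       */
        carry = s >> 8;                      /* carry into next slice */
        sum |= (uint32_t)(s & 0xFF) << (8 * slice);
        /* (s & 0xFF) is this slice's final value and could be
           bypassed to a consumer here, one slice per cycle.          */
    }
    return sum;
}
```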

6 Outline
Motivation
Prior Work
Bit-sliced Cache
Experimental Results
Conclusion

7 Power Consumption in Cache
Row decoding consumes up to 40% of active power

8 Bit-sliced Cache
Extends the cache subbanking technique
Saves decoding power by enabling only the row decoders that are actually accessed
Serializes subarray decoding with row decoding
Uses low-order index bits to select the row decoder (sketched below)
Requires minimal changes to the subbanking technique
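A rough sketch of the decoder gating under assumed sizes (4 subarrays, 64 rows each); the point is only the two-step ordering, low-order bits first:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SUBARRAYS    4   /* assumed: 4 data subarrays     */
#define ROWS_PER_ARRAY  64   /* assumed: 64 rows per subarray */

/* Step 1: low-order index bits choose one subarray.
   Step 2: only that subarray's row decoder is enabled and driven
   by the remaining index bits; the other decoders never fire.    */
static void bitsliced_decode(uint32_t index,
                             bool decoder_en[NUM_SUBARRAYS],
                             unsigned *row)
{
    unsigned subarray = index % NUM_SUBARRAYS;        /* low bits  */
    *row = (index / NUM_SUBARRAYS) % ROWS_PER_ARRAY;  /* high bits */
    for (unsigned i = 0; i < NUM_SUBARRAYS; i++)
        decoder_en[i] = (i == subarray);
}
```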

9 Pipelining the Cache Access
Cache access time increases because subarray decoding is serialized with row decoding
Pipeline the access to hide the added delay
Need to balance the latency of each stage, so choose the operations for each stage carefully
Provides more throughput: the same throughput as a conventional cache with n ports

10 Pipelined Cache Access Steps
Cycle 1: Start subarray decoding for data and tag
Cycle 2: Activate the necessary row decoders; read the tag array while waiting
Cycle 3: Read the data array; concurrently, do a partial tag comparison
Cycle 4: Compare the rest of the tag bits; use the tag comparison result to select the data
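A toy walk through the four stages for a direct-mapped array, with hypothetical sizes and a hypothetical 16-bit partial-tag split; in hardware each commented step is its own pipeline stage and a new access can enter every cycle:

```c
#include <stdint.h>
#include <stdbool.h>

#define SETS 64                      /* assumed number of sets */

static uint32_t tag_array[SETS];
static uint64_t data_array[SETS];

static bool pipelined_access(uint32_t index, uint32_t tag, uint64_t *out)
{
    unsigned set = index % SETS;

    /* Cycle 1: subarray decode starts for both data and tag.       */
    /* Cycle 2: row decoders fire; tag array is read while waiting. */
    uint32_t stored = tag_array[set];

    /* Cycle 3: data array read; low-order tag bits compared early. */
    uint64_t data  = data_array[set];
    bool low_match = (stored & 0xFFFF) == (tag & 0xFFFF);

    /* Cycle 4: remaining tag bits compared; result selects data.   */
    bool high_match = (stored >> 16) == (tag >> 16);

    if (low_match && high_match) { *out = data; return true; }
    return false;
}
```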

11 Bit-sliced Cache

12 Bit-sliced Cache + Bit-sliced ALU
Optimal performance benefit: the cache access starts sooner, as soon as the first slice of the address is available
Limits the number of subarrays according to the number of bits per slice, since the subarray-select bits must fit in the first slice (see the sketch below)
When the bit slice is too small, the optimal power saving cannot be achieved
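A sketch tying the two techniques together under stated assumptions (8-bit slices, 64B blocks, 4 subarrays): slice 0 of the effective-address add already contains the subarray-select bits, so decoding can start before the full address is computed:

```c
#include <stdint.h>

#define BLOCK_OFFSET_BITS 6  /* assumed: 64B blocks                   */
#define SUBARRAY_BITS     2  /* assumed: 4 subarrays -> 2 select bits */

/* Slice 0 (low 8 bits) of base+offset is exact, since its carry-in
   is zero; bits [7:6] of it already name the subarray. This is why
   the slice width bounds how many subarrays can be selected early.  */
static unsigned early_subarray_select(uint32_t base, uint32_t offset)
{
    unsigned slice0 = ((base & 0xFF) + (offset & 0xFF)) & 0xFF;
    return (slice0 >> BLOCK_OFFSET_BITS) & ((1u << SUBARRAY_BITS) - 1);
}
```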

13 Pipelining with Bit-sliced Cache
[Pipeline diagrams comparing the instruction sequence add R3, R2, R1; addi R3, R3, 4; lw R1, 0(R3); lw R4, 4(R3) on three designs: a pipelined execution stage with a pipelined cache, a bit-sliced execution stage with a bit-sliced cache, and a bit-sliced execution stage with a pipelined cache]

14 Cache Model Simulation
Estimates energy consumption and cache latency
Uses a modified version of CACTI 3.0
Parameters: Ntbl, Ndbl, Ntwl, Ndwl
Enumerates all possible configurations and chooses the one with the best weighted value of cycle time and energy consumption (see the sketch below)
Simulates various cache sizes (8KB-512KB) with 64B blocks
Direct-mapped, 2-way, 4-way, and 8-way associativity
Uses 0.18 µm technology
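A sketch of that configuration search; the cost functions and the 50/50 weighting below are placeholders, not CACTI's actual models:

```c
#include <float.h>
#include <stdio.h>

/* Placeholder delay/energy models standing in for modified CACTI 3.0;
   the real models depend on technology parameters and cache geometry. */
static double cycle_time(int ndwl, int ndbl, int ntwl, int ntbl)
{
    return 1.0 / ndwl + 1.0 / ndbl + 0.5 / ntwl + 0.5 / ntbl;
}
static double energy(int ndwl, int ndbl, int ntwl, int ntbl)
{
    return ndwl * ndbl + 0.5 * ntwl * ntbl;
}

#define WEIGHT 0.5   /* assumed weighting of cycle time vs. energy */

int main(void)
{
    double best = DBL_MAX;
    int cfg[4] = {0};
    /* Enumerate all (Ndwl, Ndbl, Ntwl, Ntbl) array partitionings. */
    for (int ndwl = 1; ndwl <= 8; ndwl *= 2)
    for (int ndbl = 1; ndbl <= 8; ndbl *= 2)
    for (int ntwl = 1; ntwl <= 8; ntwl *= 2)
    for (int ntbl = 1; ntbl <= 8; ntbl *= 2) {
        double v = WEIGHT * cycle_time(ndwl, ndbl, ntwl, ntbl)
                 + (1.0 - WEIGHT) * energy(ndwl, ndbl, ntwl, ntbl);
        if (v < best) {
            best = v;
            cfg[0] = ndwl; cfg[1] = ndbl; cfg[2] = ntwl; cfg[3] = ntbl;
        }
    }
    printf("best: Ndwl=%d Ndbl=%d Ntwl=%d Ntbl=%d (value %.3f)\n",
           cfg[0], cfg[1], cfg[2], cfg[3], best);
    return 0;
}
```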

15 Processor Simulation
Estimates the performance benefit
Uses a heavily modified SimpleScalar 3.0
Supports a bit-sliced execution stage
Supports speculative slice execution
Benchmarks: eight SPEC2000 integer benchmarks with the full reference input set
Fast-forwards 500M instructions, simulates 100M

16 Machine Configuration
4-wide fetch, issue, commit
128-entry ROB
32-entry scheduler
20-stage pipeline
64K-entry gshare branch predictor
L1 I-Cache: 32KB, 2-way, 64B blocks
L1 D-Cache: 8KB, 4-way, 64B blocks
L2 Cache: 512KB, 8-way, 128B blocks

17 Energy Consumption / Access

18 Cycle Time Comparison

19 Speed Up Comparison

20 Speed Up Comparison

21 Conclusion
Bit-sliced cache:
Achieves significant power reduction without adding much complexity
Adds some delay to the access latency
Pipelined bit-sliced cache:
Reduces cycle time
Provides more bandwidth
Delivers measurable speedup (with a bit-sliced ALU)

22 Questions? Thank you


Download ppt "Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison."

Similar presentations


Ads by Google