1 ‘99 ACM/IEEE International Symposium on Computer Architecture
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
Sangyeun Cho, U of Minnesota / Samsung
Pen-Chung Yew, U of Minnesota
Gyungho Lee, U of Texas at San Antonio

2 Roadmap
Need for Higher Bandwidth Caches
Multi-Ported Data Caches
Data Decoupling: Motivation, Approach, Implementation Issues
Quantitative Evaluation
Conclusions
ISCA ‘99 May 1, 1999 Cho, Yew, and Lee

3 Wide-Issue Superscalar Processors
Current Generation: Alpha 21264, Intel’s Merced
Future Generation (IEEE Computer, Sept. ‘97): Superspeculative Processors, Trace Processors

4 Multi-Ported Data Caches
Cache Built with Multi-Ported Cells
Replicated Cache: Alpha 21164
Interleaved Cache: MIPS R10K
Time-Division Multiplexing: Alpha 21264

5 Replicated Cache
Pros: simple design; symmetric read ports
Cons: doubled area; exclusive writes for data coherence

6 Time-Division Multiplexed Cache
Pros: true 2-port cache
Cons: hardware design complexity; not scalable beyond 2 ports

7 Interleaved Cache
Pros: scalable
Cons: asymmetric ports; bank conflicts; constraints on the number of banks

8 Window Logic Complexity
Identified as the major source of hardware complexity (Palacharla et al., ISCA ‘97)
More severe for the memory window: it is difficult to partition, and a thick network is needed to connect reservation stations (RSs) and load/store units (LSUs)

9 Data Decoupling
A divide-and-conquer approach: instructions are partitioned before entering the reservation stations
Narrower networks; fewer ports to each cache

10 Data Decoupling: Operating Issues
Memory Stream Partitioning: hardware classification vs. compiler classification
Load Balancing: Are there enough instructions in the different groups? Are they well interleaved?

11 Case for Decoupling Stack Accesses
Easily Identifiable
Hardware mechanism: a simple 1-bit predictor with enough context information works well (>99.9% accuracy).
Compiler mechanism: helps reduce the prediction table space needed for good performance, but is not essential.
Many of Them: 30% of loads, 48% of stores
Well-Interleaved: continuous supply of stack references within a reasonable window size
Details in: Cho, Yew, and Lee, “Access Region Locality for High-Bandwidth Processor Memory System Design,” CSTR #99-004, Univ. of Minnesota.

12 Data Decoupling: Mechanism
Dynamically predict access regions to partition memory instructions
Exploits access region locality
Refers to context information, e.g., global branch history or call-site identifier
Dynamically verify region predictions
The TLB (i.e., page table) carries verification information, so that a memory access is reissued on a misprediction.
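The predict-then-verify mechanism above can be sketched in a few lines. This is an illustrative model only: the table size, the context hash, and all names are assumptions for the sketch, not the paper's exact design.

```python
# Hypothetical sketch of a 1-bit access-region predictor indexed by
# context information (call-site id XOR'd with global branch history).
TABLE_SIZE = 2048  # illustrative; corresponds to a 2 Kbit table of 1-bit entries

def context_index(call_site, branch_history):
    """Fold the context information into a prediction-table index."""
    return (call_site ^ branch_history) % TABLE_SIZE

class RegionPredictor:
    def __init__(self):
        # One bit per entry: True = stack (local-variable) access,
        # False = non-stack (heap/global) access.
        self.table = [False] * TABLE_SIZE

    def predict(self, call_site, branch_history):
        """Predict the access region before the address is known."""
        return self.table[context_index(call_site, branch_history)]

    def verify(self, call_site, branch_history, actually_stack):
        """After address translation, the TLB entry reveals the true
        region. On a misprediction the access would be reissued to the
        correct memory pipeline; here we just update the 1-bit entry."""
        idx = context_index(call_site, branch_history)
        mispredicted = self.table[idx] != actually_stack
        self.table[idx] = actually_stack  # 1-bit predictor: last outcome wins
        return mispredicted
```

With strong access region locality, the same context almost always maps to the same region, which is why even a 1-bit entry reaches very high accuracy.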

13 Data Decoupling: Mechanism, Cont’d
Access Region Locality (figure)

14 Data Decoupling: Mechanism, Cont’d
Dynamic partitioning accuracy (chart): prediction tables of 1 KB, 2 KB, 4 KB, 8 KB, and unlimited size, across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.

15 Data Decoupling: Optimizations
Fast Forwarding: uses the offset (relative to $sp) to resolve dependences early; can shorten latency.
  st r3, 8($sp) ... ld r4, 8($sp)  (address matched on the $sp offset)
Access Combining: combines accesses to adjacent locations; can save bandwidth.
  st r3, 4($sp) + st r4, 8($sp) → st {r3,r4}, {4,8}($sp)
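A minimal sketch of the two optimizations, modeling the pending stack stores as a list of ($sp offset, value) pairs. The class name, the queue structure, and the word size are illustrative assumptions, not the hardware's actual organization.

```python
# Illustrative model: fast forwarding matches a load's $sp offset against
# pending stores; access combining merges 2-way adjacent stores into one
# wide access before they reach the local variable cache.
class StackAccessQueue:
    def __init__(self):
        self.pending_stores = []  # (sp_offset, value), oldest first

    def store(self, sp_offset, value):
        self.pending_stores.append((sp_offset, value))

    def load(self, sp_offset):
        """Fast forwarding: a load whose $sp offset matches a pending
        store takes the value directly, bypassing the cache access."""
        for off, val in reversed(self.pending_stores):  # youngest match wins
            if off == sp_offset:
                return val
        return None  # no match: the load would go to the LVC

    def drain_combined(self, word_size=4):
        """Access combining (2-way): merge pending stores whose offsets
        differ by one word into a single wide access."""
        combined = []
        stores = sorted(self.pending_stores)
        i = 0
        while i < len(stores):
            if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + word_size:
                combined.append((stores[i], stores[i + 1]))  # one wide access
                i += 2
            else:
                combined.append((stores[i],))
                i += 1
        self.pending_stores = []
        return combined
```

The sketch mirrors the slide's examples: a load of 8($sp) is satisfied by the earlier st r3, 8($sp), and stores to 4($sp) and 8($sp) drain as a single combined access.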

16 Benchmark Programs

17 Program’s Memory Accesses
(Chart: memory access breakdown for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.)

18 Program’s Frame Size Distribution
Stack references tend to access a small region: the average dynamic frame size was around 3 words, and the average static frame size around 7 words.

19 Base Machine Model

20 Program’s Bandwidth Requirements
Performance suffers greatly with fewer than 3 cache ports, for both integer and FP programs.
We study 3 cases: a cache with 2, 3, or 4 ports.

21 Impact of LVC Size
2 KB and 4 KB LVCs achieve high hit rates (~99.9%).
Set associativity matters less once the LVC is 2 KB or larger.
A small, simple LVC works well.

22 Fast Data Forwarding

23 Access Combining
Effective (over 8% improvement) when LVC bandwidth is scarce, e.g., in the (3+1) and (3+2) configurations.
2-way combining is enough.

24 Performance of Various Config.’s

25 Performance of 126.gcc (chart: configurations (N+0) through (N+5))

26 Performance of 130.li (chart: configurations (N+0) through (N+5))

27 Performance of 102.swim (chart: configurations (N+0) through (N+5))

28 Other Findings
LVC hit latency has less impact than data cache latency, because many loads hit in the LVAQ and issue is out of order.
Adding the LVC reduced conflict misses in 130.li (by 24%) and 147.vortex (by 7%).
May reduce bandwidth requirements on the bus to the L2 cache.

29 Overall Performance
(Chart: overall performance for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.)

30 Conclusions
Superscalar processors will be around, but their design complexity will call for architectural solutions, and memory bandwidth becomes critical.
Data Decoupling is a way to:
Decrease the hardware complexity of the memory issue logic and cache.
Provide additional bandwidth for decoupled stack accesses.

