1 ‘99 ACM/IEEE International Symposium on Computer Architecture
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
Sangyeun Cho, U of Minnesota / Samsung
Pen-Chung Yew, U of Minnesota
Gyungho Lee, U of Texas at San Antonio

2 Roadmap
Need for Higher Bandwidth Caches
Multi-Ported Data Caches
Data Decoupling: Motivation, Approach, Implementation Issues
Quantitative Evaluation
Conclusions
ISCA ‘99 May 1, 1999 Cho, Yew, and Lee

3 Wide-Issue Superscalar Processors
Current Generation: Alpha 21264, Intel’s Merced
Future Generation (IEEE Computer, Sept. ‘97): Superspeculative Processors, Trace Processors

4 Multi-Ported Data Caches
Cache Built with Multi-Ported Cells
Replicated Cache: Alpha 21164
Interleaved Cache: MIPS R10K
Time-Division Multiplexing: Alpha 21264

5 Replicated Cache
Pros: simple design; symmetric read ports
Cons: doubled area; exclusive writes for data coherence

6 Time-Division Multiplexed Cache
Pros: true 2-port cache
Cons: hardware design complexity; not scalable beyond 2 ports

7 Interleaved Cache
Pros: scalable
Cons: asymmetric ports; bank conflicts; constraints on the number of banks

8 Window Logic Complexity
Identified as the major source of hardware complexity (Palacharla et al., ISCA ‘97)
More severe for the memory window: it is difficult to partition, and a thick network is needed to connect reservation stations (RSs) and load/store units (LSUs)

9 Data Decoupling
A divide-and-conquer approach: instructions are partitioned before entering the reservation stations
Narrower networks; fewer ports to each cache

10 Data Decoupling: Operating Issues
Memory Stream Partitioning: hardware classification vs. compiler classification
Load Balancing: Are there enough instructions in the different groups? Are they well interleaved?

11 Case for Decoupling Stack Accesses
Easily Identifiable
Hardware mechanism: a simple 1-bit predictor with enough context information works well (>99.9% accuracy).
Compiler mechanism: helps reduce the prediction table space needed for good performance, but is not essential.
Many of Them: 30% of loads, 48% of stores
Well-Interleaved: continuous supply of stack references within a reasonable window size
Details in: Cho, Yew, and Lee, “Access Region Locality for High-Bandwidth Processor Memory System Design,” CSTR #99-004, Univ. of Minnesota.

12 Data Decoupling: Mechanism
Dynamically predict access regions to partition memory instructions
Exploits access region locality
Refers to context information, e.g., global branch history or call-site identifier
Dynamically verify region predictions
The TLB (i.e., page table) carries verification information, so that a memory access is reissued on a misprediction.
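The predict-then-verify mechanism above can be sketched in a few lines. This is an illustrative model only: the table size, the context hash, and all names are assumptions for the sketch, not the paper's exact design.

```python
# Hypothetical sketch of a 1-bit access-region predictor indexed by
# context information (call-site id XOR'd with global branch history).
TABLE_SIZE = 2048  # illustrative; corresponds to a 2 Kbit table of 1-bit entries

def context_index(call_site, branch_history):
    """Fold the context information into a prediction-table index."""
    return (call_site ^ branch_history) % TABLE_SIZE

class RegionPredictor:
    def __init__(self):
        # One bit per entry: True = stack (local-variable) access,
        # False = non-stack (heap/global) access.
        self.table = [False] * TABLE_SIZE

    def predict(self, call_site, branch_history):
        """Predict the access region before the address is known."""
        return self.table[context_index(call_site, branch_history)]

    def verify(self, call_site, branch_history, actually_stack):
        """After address translation, the TLB entry reveals the true
        region. On a misprediction the access would be reissued to the
        correct memory pipeline; here we just update the 1-bit entry."""
        idx = context_index(call_site, branch_history)
        mispredicted = self.table[idx] != actually_stack
        self.table[idx] = actually_stack  # 1-bit predictor: last outcome wins
        return mispredicted
```

With strong access region locality, the same context almost always maps to the same region, which is why even a 1-bit entry reaches very high accuracy.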

13 Data Decoupling: Mechanism, Cont’d
Access Region Locality (figure)

14 Data Decoupling: Mechanism, Cont’d
Dynamic partitioning accuracy (chart): prediction tables of 1 KB, 2 KB, 4 KB, 8 KB, and unlimited size, across go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.

15 Data Decoupling: Optimizations
Fast Forwarding: uses the offset (relative to $sp) to resolve dependences early; can shorten latency.
  st r3, 8($sp) ... ld r4, 8($sp)  (address matched on the $sp offset)
Access Combining: combines accesses to adjacent locations; can save bandwidth.
  st r3, 4($sp) + st r4, 8($sp) → st {r3,r4}, {4,8}($sp)
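A minimal sketch of the two optimizations, modeling the pending stack stores as a list of ($sp offset, value) pairs. The class name, the queue structure, and the word size are illustrative assumptions, not the hardware's actual organization.

```python
# Illustrative model: fast forwarding matches a load's $sp offset against
# pending stores; access combining merges 2-way adjacent stores into one
# wide access before they reach the local variable cache.
class StackAccessQueue:
    def __init__(self):
        self.pending_stores = []  # (sp_offset, value), oldest first

    def store(self, sp_offset, value):
        self.pending_stores.append((sp_offset, value))

    def load(self, sp_offset):
        """Fast forwarding: a load whose $sp offset matches a pending
        store takes the value directly, bypassing the cache access."""
        for off, val in reversed(self.pending_stores):  # youngest match wins
            if off == sp_offset:
                return val
        return None  # no match: the load would go to the LVC

    def drain_combined(self, word_size=4):
        """Access combining (2-way): merge pending stores whose offsets
        differ by one word into a single wide access."""
        combined = []
        stores = sorted(self.pending_stores)
        i = 0
        while i < len(stores):
            if i + 1 < len(stores) and stores[i + 1][0] == stores[i][0] + word_size:
                combined.append((stores[i], stores[i + 1]))  # one wide access
                i += 2
            else:
                combined.append((stores[i],))
                i += 1
        self.pending_stores = []
        return combined
```

The sketch mirrors the slide's examples: a load of 8($sp) is satisfied by the earlier st r3, 8($sp), and stores to 4($sp) and 8($sp) drain as a single combined access.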

16 Benchmark Programs

17 Program’s Memory Accesses
(Chart: memory access breakdown for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.)

18 Program’s Frame Size Distribution
Stack references tend to access a small region: the average dynamic frame size was around 3 words, and the average static frame size around 7 words.

19 Base Machine Model

20 Program’s Bandwidth Requirements
Performance suffers greatly with fewer than 3 cache ports, for both integer and FP programs.
We study 3 cases: a cache with 2, 3, or 4 ports.

21 Impact of LVC Size
2 KB and 4 KB LVCs achieve high hit rates (~99.9%).
Set associativity matters less once the LVC is 2 KB or larger.
A small, simple LVC works well.

22 Fast Data Forwarding

23 Access Combining
Effective (over 8% improvement) when LVC bandwidth is scarce, e.g., in the (3+1) and (3+2) configurations.
2-way combining is enough.

24 Performance of Various Config.’s

25 Performance of 126.gcc (chart: configurations (N+0) through (N+5))

26 Performance of 130.li (chart: configurations (N+0) through (N+5))

27 Performance of 102.swim (chart: configurations (N+0) through (N+5))

28 Other Findings
LVC hit latency has less impact than data cache latency, because many loads hit in the LVAQ and issue is out of order.
Adding the LVC reduced conflict misses in 130.li (by 24%) and 147.vortex (by 7%).
May reduce bandwidth requirements on the bus to the L2 cache.

29 Overall Performance
(Chart: overall performance for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, and mgrid, with integer and FP averages.)

30 Conclusions
Superscalar processors will be around, but their design complexity will call for architectural solutions, and memory bandwidth becomes critical.
Data Decoupling is a way to:
Decrease the hardware complexity of the memory issue logic and cache.
Provide additional bandwidth for decoupled stack accesses.

