
Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee


1 Access Region Locality for High-Bandwidth Processor Memory System Design
Sangyeun Cho (Samsung/U of Minnesota), Pen-Chung Yew (U of Minnesota), Gyungho Lee (Iowa State U)
32nd Annual International Symposium on Microarchitecture

2 Big Picture
(Slide footer: MICRO-32, November 17, 1999. Cho, Yew, and Lee.)

3 On-Chip D-Cache Bandwidth Problem

4 Wide-Issue Superscalar Processors
- Current Generation
  - Alpha 21264
  - Intel's Merced
- Future Generation (IEEE Computer, Sept. '97)
  - Superspeculative Processors
  - Trace Processors

5 Multi-Ported Data Cache
- Replicated Cache
  - Alpha 21164
- Time-Division Multiplexed Cache
  - Alpha 21264
- Interleaved Cache
  - MIPS R10K

6 Window Logic Complexity
- Pointed out as the major source of hardware complexity (Palacharla et al., ISCA '97)
- More severe for the memory window
  - Difficult to partition
  - Thick network needed to connect RSs and LSUs

7 Data Decoupling

8 Data Decoupling: What is it?
- A divide-and-conquer approach
  - Instruction stream is partitioned before entering the RS
  - Narrower networks
  - Fewer ports to each cache
  - Needs a mechanism for proper partitioning

9 Data Decoupling: Operating Issues
- Memory Stream Partitioning
  - Hardware classification
  - Compiler classification
- Load Balancing
  - Enough instructions in different groups?
  - Are they well interleaved?

10 Access Region Locality & Access Region Prediction

11 Access Region: Overview
- An access region R is a pair R = (L, U)
  - L: lower bound on address
  - U: upper bound on address
- For two regions R = (A, B) and Q = (C, D), if (D < A) or (B < C),
  - R and Q are said to be exclusive, or non-overlapping.
- Locations in exclusive regions are independent.
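The exclusivity test on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the `Region` class, the `exclusive` function, and the example address ranges are all assumed names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    lower: int  # L: lower bound on address
    upper: int  # U: upper bound on address


def exclusive(r: Region, q: Region) -> bool:
    """Return True when r and q share no address (the slide's D < A or B < C)."""
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical address ranges, chosen only for illustration.
data = Region(0x1000_0000, 0x1FFF_FFFF)
stack = Region(0x7000_0000, 0x7FFF_FFFF)
overlapping = Region(0x1FFF_0000, 0x2000_FFFF)
```

The check is symmetric, so the order of the two arguments does not matter.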

12 Access Region and Mem. Instructions

13 Partitioning Memory Space
- One way of partitioning memory space into regions:
  - Data Region / Heap Region / Stack Region
- This work assumes this partitioning.
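As a rough sketch of the Data / Heap / Stack partitioning, an address can be mapped to its region by comparing it against two boundaries. The boundary values and the names `HEAP_START`, `STACK_START`, and `classify` below are assumptions made for illustration; the slides do not give a concrete memory layout.

```python
from enum import Enum


class RegionKind(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical layout: static data lowest, heap above it, stack at the top.
HEAP_START = 0x1001_0000
STACK_START = 0x7000_0000


def classify(addr: int) -> RegionKind:
    """Map an address to one of the three regions under the assumed layout."""
    if addr >= STACK_START:
        return RegionKind.STACK
    if addr >= HEAP_START:
        return RegionKind.HEAP
    return RegionKind.DATA
```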

14 Partitioning Memory Space, Cont'd
[Chart: share of accesses per region, in %]
- Many accesses are toward the Data and Stack regions.
- Some programs don't access the Heap region at all.

15 Partitioning Memory Space, Cont'd
[Chart: window size = 32]
- Accesses to the Data region are less bursty than others.
- Programs such as ijpeg have clustered region accesses.

16 Partitioning Memory Space, Cont'd
[Chart: window size = 64]
- With a large window, Stack accesses become less bursty.
- Data and Stack regions have quite stable, constant demand.

17 Partitioning Memory Space, Cont'd
[Chart: per-benchmark results for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid. Surviving data labels, not attributable to specific bars: 1.9%, 1.8%, 51.1%, 50.4%, 1.6%, 16.2%, 45.4%, 31.6%.]
- Many instructions access a single region (~98%).
- Multi-region-accessing instructions account for 0% to 9.6% of dynamic memory references.

18 Access Region Locality
- "A memory reference instruction typically accesses a single region at run time"
  - Only about 2% of all static memory instructions access more than a single region.
- "(Thus) the region it accesses is highly predictable"
  - Simple predictors with a small look-up table achieve high prediction accuracy.
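The two statistics quoted on these slides (the share of static instructions that touch more than one region, and the share of dynamic references they account for) could be measured from a trace roughly as follows. The trace format of `(pc, region)` pairs and the function name `region_locality` are assumptions for illustration, and the toy trace is made up.

```python
from collections import defaultdict


def region_locality(trace):
    """Return (static multi-region fraction, dynamic multi-region fraction)."""
    regions_by_pc = defaultdict(set)
    count_by_pc = defaultdict(int)
    for pc, region in trace:
        regions_by_pc[pc].add(region)
        count_by_pc[pc] += 1
    # Static instructions observed touching more than one region.
    multi = {pc for pc, regions in regions_by_pc.items() if len(regions) > 1}
    static_multi = len(multi) / len(regions_by_pc)
    dynamic_multi = sum(count_by_pc[pc] for pc in multi) / len(trace)
    return static_multi, dynamic_multi


# Made-up trace: only the instruction at 0x408 touches two regions.
trace = [
    (0x400, "stack"), (0x404, "data"), (0x400, "stack"),
    (0x408, "heap"), (0x408, "data"), (0x40C, "data"),
]
static_multi, dynamic_multi = region_locality(trace)
```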

19 Predicting Regions: Unlimited Case
- One predictor per memory instruction
- Predictor types:
  - 1-bit history saver (0: Data, 1: Stack)
  - 2-bit saturating counter
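The two predictor types can be sketched as below. This is an illustrative model only: the class names, the initial state of 0, and the convention that counter values 2 and 3 predict Stack are assumptions (the last being the usual saturating-counter convention), not details taken from the slides.

```python
class OneBitPredictor:
    """Remembers the last region seen: 0 = Data, 1 = Stack."""

    def __init__(self):
        self.bit = 0  # assumed initial state

    def predict(self):
        return self.bit

    def update(self, actual):
        self.bit = actual


class TwoBitPredictor:
    """2-bit saturating counter; values 2 and 3 predict Stack (assumed)."""

    def __init__(self):
        self.counter = 0  # assumed initial state

    def predict(self):
        return 1 if self.counter >= 2 else 0

    def update(self, actual):
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, updating after each one."""
    hits = 0
    for actual in outcomes:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(outcomes)
```

On a stream that switches region once and then stays put, such as `[1, 1, 1, 1, 0, 0, 0, 0]`, the 1-bit predictor in this sketch adapts after a single miss, while the 2-bit counter needs two misses per switch.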

20 Predicting Regions: Adding Context
- Run-time context
  - Caller's ID (CID): in the Link Register
  - Global Branch History (GBH)
  - Hybrid of the above

21 Predicting Regions: Utilizing Static Info.
- Some instructions' access regions are revealed through architecture and compiler conventions:
  - Use of the Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  - Use of the Global Pointer ($GP) suggests that the region is non-Stack.
  - For others, assume non-Stack.
- Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
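The register-based rules on this slide amount to a small decode-time lookup. The sketch below is an assumed rendering of those rules; the function name `static_region_hint` and the string encoding of registers are hypothetical.

```python
def static_region_hint(base_register: str) -> str:
    """Guess the access region from the base register of a load or store."""
    reg = base_register.lower()
    if reg in ("$sp", "$fp"):
        return "stack"
    # $gp suggests non-stack; every other base register defaults to non-stack.
    return "non-stack"
```

For example, a load such as `lw $t0, 8($sp)` would be classified as a stack access from its base register alone, without consulting a prediction table.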

22 Region Pred. Result: Unlimited Case
[Chart: prediction accuracy per benchmark for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid. Legend text as extracted: "Simple 1-bit w/ GBH w/ CID Static w/ Hybrid".]
- 1-bit predictors do better than 2-bit predictors (not shown).
- Hybrid context bits achieve the best prediction rate on average.

23 Predicting Regions: Limited-Size ARPT
- The low n bits of the PC, XOR'ed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
  - Table entries are initialized to 0's
  - 1 denotes a stack access
  - Decoding information is exploited to save ARPT space
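A minimal model of the ARPT indexing described above might look like this. It is a sketch under assumptions: the class and method names are invented, each entry is modeled as a single bit, and the hybrid context is passed in as an opaque integer because the slides do not say how the CID and GBH bits are combined.

```python
class ARPT:
    """Access Region Prediction Table with 2**index_bits one-bit entries."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)  # entries initialized to 0

    def index(self, pc: int, context: int) -> int:
        # Low n bits of the PC XOR'ed with the low n context bits.
        return (pc & self.mask) ^ (context & self.mask)

    def predict(self, pc: int, context: int) -> int:
        return self.table[self.index(pc, context)]  # 1 denotes a stack access

    def update(self, pc: int, context: int, is_stack: bool) -> None:
        self.table[self.index(pc, context)] = 1 if is_stack else 0


arpt = ARPT(index_bits=4)
arpt.update(0x1234, 0b1010, is_stack=True)
```

Because the context is folded into the index, the same instruction reached under a different context maps to a different entry and can carry a different prediction.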

24 Region Prediction Result: ARPT
[Chart: prediction accuracy per benchmark for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid. Series: Unlimited, 8 KB, 4 KB, 2 KB, 1 KB.]
- Over 99.9% accuracy with a 4 KB or larger ARPT, without compiler hints.
- Compiler hints relieve the pressure due to smaller sizes.

25 Dynamic Data Decoupling

26 Dynamic Data Decoupling

27 Dynamic Data Decoupling, Cont'd
- Dynamically predicting access regions to classify memory instructions:
  - Utilize the Access Region Prediction Table (ARPT).
  - Utilize any region information revealed through instruction decoding.
- Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
- Dynamically verifying the region prediction:
  - Let the TLB (i.e., page table) contain verification information such that a memory access is reissued on mis-predictions.
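The predict, dispatch, verify, and reissue flow can be sketched as a toy simulation. Everything concrete here is an assumption for illustration: the `STACK_START` boundary stands in for the region information the slides place in the TLB, a dictionary keyed by PC stands in for the ARPT, and the function name `run` and the trace format of `(pc, address)` pairs are invented.

```python
STACK_START = 0x7000_0000  # hypothetical boundary; stands in for TLB-held region info


def run(trace):
    """Simulate region prediction with verification over (pc, address) pairs."""
    table = {}  # pc -> last verified region bit (1 = stack); missing means 0
    pipelines = {"stack": [], "non-stack": []}
    reissued = 0
    for pc, addr in trace:
        predicted = table.get(pc, 0)
        actual = 1 if addr >= STACK_START else 0  # verification step
        if predicted != actual:
            reissued += 1  # access is reissued to the other pipeline
            table[pc] = actual
        # Record the pipeline that finally services the access.
        pipelines["stack" if actual else "non-stack"].append(pc)
    return pipelines, reissued


trace = [
    (0x400, 0x7FFF_0000), (0x404, 0x1000_0000),
    (0x400, 0x7FFF_0008), (0x404, 0x1000_0004),
]
pipelines, reissued = run(trace)
```

In this toy trace only the first access by the instruction at `0x400` is mispredicted; once the table is corrected, later accesses go straight to the right pipeline.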

28 Base Machine Model

29 Overall Performance
[Chart: performance per benchmark for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Int.Avg, FP.Avg, tomcatv, swim, su2cor, mgrid]
- Results are shown over the (2+0) configuration.

30 Conclusions
- Access Region Locality says:
  - Memory instructions access few regions at run time.
  - Accessed regions are accurately predictable.
- Access Region Locality leads to Access Region Prediction techniques.
- Access Region Prediction allows Dynamic Data Decoupling, shown to achieve performance comparable to very wide data caches.

31 Now Any Questions?

32 Impact of LVC Size
[Chart: LVC sizes 0.5K, 1K, 2K, 4K]
- 2 KB and 4 KB LVCs achieve high hit rates (~99.9%).
- Set associativity is less important if the LVC is 2 KB or more.
- A small, simple LVC works well.

