Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring. Lei Jin and Sangyeun Cho, Dept. of Computer Science, University of Pittsburgh


1 Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh

2 CMPMSI’07 02/11/07 Multicore distributed L2 caches
- L2 caches are typically sub-banked and distributed
  - IBM Power4/5: 3 banks
  - Sun Microsystems T1: 4 banks
  - Intel Itanium2 (L3): many “sub-arrays”
- (Distributed L2 caches + switched NoC) → NUCA
- Hardware-based management schemes
  - Private caching
  - Shared caching
  - Hybrid caching
[Figure: one tile, consisting of a processor core, a local L2 cache, and a router]

3 Private and shared caching
Private caching:
  (+) short hit latency (always local)
  (-) high on-chip miss rate
  (-) long miss resolution time
  (-) complex coherence enforcement
Shared caching:
  (+) low on-chip miss rate
  (+) straightforward data location
  (+) simple coherence (no replication)
  (-) long average hit latency

4 Other approaches
- Hybrid/flexible schemes
  - “Core clustering” [Speight et al., ISCA2005]
  - “Flexible CMP cache sharing” [Huh et al., ICS2004]
  - “Flexible bank mapping” [Liu et al., HPCA2004]
- Improving shared caching
  - “Victim replication” [Zhang and Asanovic, ISCA2005]
- Improving private caching
  - “Cooperative caching” [Chang and Sohi, ISCA2006]
  - “CMP-NuRAPID” [Chishti et al., ISCA2005]

5 Motivation
[Figure: miss rate versus hit latency]
What is the optimal balance between miss rate and hit latency?

6 Talk roadmap
- Data mapping, a key property [Cho and Jin, MICRO2006]
- Two-dimensional (2D) page coloring algorithm
- Evaluation and results
- Conclusion and future work

7 Data mapping
- Data mapping: memory data → location in L2 cache
- Private caching
  - Data mapping determined by program location
  - Mapping created at miss time
  - No explicit control
- Shared caching
  - Data mapping determined by address: slice number = (block address) % (N_slice)
  - Mapping is static
  - No explicit control

8 Change mapping granularity
- Block granularity: slice number = (block address) % (N_slice)
- Page granularity: slice number = (page address) % (N_slice)
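
The difference between the two granularities can be illustrated with a minimal sketch. The block size (64 bytes), page size (4 KB), and slice count (16) are assumptions chosen for illustration, and the helper names are hypothetical; none of them come from the slides.

```python
BLOCK_SIZE = 64    # bytes per cache block (assumed)
PAGE_SIZE = 4096   # bytes per page (assumed)
N_SLICE = 16       # number of L2 slices (assumed)


def slice_by_block(addr: int) -> int:
    """Block granularity: slice number = (block address) % N_slice."""
    return (addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(addr: int) -> int:
    """Page granularity: slice number = (page address) % N_slice."""
    return (addr // PAGE_SIZE) % N_SLICE


def slices_touched(page_number: int, mapping) -> set:
    """Set of slices that hold the blocks of one page under a given mapping."""
    base = page_number * PAGE_SIZE
    return {mapping(base + off) for off in range(0, PAGE_SIZE, BLOCK_SIZE)}


if __name__ == "__main__":
    # Block granularity spreads one page over many slices;
    # page granularity keeps the whole page in a single slice.
    print(sorted(slices_touched(5, slice_by_block)))
    print(sorted(slices_touched(5, slice_by_page)))
```

Under these assumptions, a page maps to exactly one slice at page granularity, so whoever chooses the page address also chooses the slice.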

9 OS controlled page mapping
[Figure: memory pages of Program 1 and Program 2 in the virtual address space are mapped to the physical address space through OS page allocation]
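
One way an OS could steer a page to a chosen slice is to keep a free list per color and pick a physical frame whose number maps to the desired slice. The sketch below assumes that policy; the class `ColoredAllocator`, its method names, and the 16-slice configuration are hypothetical and are not taken from the slides.

```python
from collections import defaultdict, deque

N_SLICE = 16  # number of L2 slices (assumed)


class ColoredAllocator:
    """Toy page allocator that hands out physical frames by slice (color)."""

    def __init__(self, num_frames: int):
        # One FIFO free list per color; color = frame number % N_SLICE.
        self.free = defaultdict(deque)
        for pfn in range(num_frames):
            self.free[pfn % N_SLICE].append(pfn)
        self.page_table = {}  # virtual page number -> physical frame number

    def map_page(self, vpn: int, color: int) -> int:
        """Map a virtual page to a free physical frame of the requested color."""
        if not self.free[color]:
            raise MemoryError(f"no free frame with color {color}")
        pfn = self.free[color].popleft()
        self.page_table[vpn] = pfn
        return pfn


if __name__ == "__main__":
    alloc = ColoredAllocator(num_frames=64)
    print(alloc.map_page(vpn=100, color=3))
    print(alloc.map_page(vpn=101, color=3))
```

A real allocator would also need a fallback when the preferred color is exhausted; this sketch simply raises an error.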

10 2D page coloring: the problem
[Figure: cost of placing page P at each color, driven by page accesses and page misses]
- Network latency / hop = 3 cycles
- Memory latency = 300 cycles
- Cost(color #) = (# access × # hop × 3 cycles) + (# miss × 300 cycles)
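
The cost formula on this slide can be evaluated with a small sketch. It treats each color as one L2 slice on a 4×4 mesh and counts hops as Manhattan distance; the mesh width, the tile numbering, and the function names are assumptions made for illustration. Only the 3-cycle hop latency and the 300-cycle memory latency come from the slide.

```python
HOP_LATENCY = 3      # cycles per network hop (from the slide)
MEM_LATENCY = 300    # cycles per off-chip miss (from the slide)
MESH_WIDTH = 4       # 4x4 mesh of tiles (assumed)


def hops(src_tile: int, dst_tile: int) -> int:
    """Manhattan distance between two tiles numbered row by row."""
    sx, sy = src_tile % MESH_WIDTH, src_tile // MESH_WIDTH
    dx, dy = dst_tile % MESH_WIDTH, dst_tile // MESH_WIDTH
    return abs(sx - dx) + abs(sy - dy)


def cost(color: int, core: int, n_access: int, n_miss: int) -> int:
    """Cost(color) = (#access x #hop x 3 cycles) + (#miss x 300 cycles)."""
    return n_access * hops(core, color) * HOP_LATENCY + n_miss * MEM_LATENCY


if __name__ == "__main__":
    # A page accessed 1000 times from core 0: the local slice has no
    # network cost, but a remote slice can still win if it misses less.
    print(cost(color=0, core=0, n_access=1000, n_miss=10))
    print(cost(color=5, core=0, n_access=1000, n_miss=2))
```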

11 2D coloring algorithm
- Collect L2 reference trace
- Derive conflict information [Sherwood et al., ICS1999]
[Figure: pages A, B, and C and a trace of four references (Reference 1 to Reference 4)]

12 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1)
Reference Matrix:
    A B C
  A 0 0 0
  B 0 0 0
  C 0 0 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 0 0 0
  C 0 0 0

13 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1)
Reference Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C 1 0 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 0 0 0
  C 0 0 0

14 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1), Page B (Reference 2)
Reference Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C 1 0 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 0 0 0
  C 0 0 0

15 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1), Page B (Reference 2)
Reference Matrix:
    A B C
  A 0 1 0
  B 1 0 0
  C 1 1 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 0 0 0
  C 0 0 0

16 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3)
Reference Matrix:
    A B C
  A 0 1 0
  B 0 0 0
  C 1 1 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C 0 0 0

17 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
Reference Matrix:
    A B C
  A 0 1 0
  B 0 0 0
  C 1 1 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C

18 2D coloring algorithm (cont’d)
- Derive conflict information
Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
Reference Matrix:
    A B C
  A 0 1 1
  B 0 0 1
  C 1 1 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C

19 2D coloring algorithm (cont’d)
- 2D page coloring
Trace: Page A (Reference 1), Page B (Reference 2), Page B (Reference 3), Page C (Reference 4)
Reference Matrix:
    A B C
  A 0 1 1
  B 0 0 1
  C 0 0 0
Conflict Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C 1 1 0
Access Counter:
    A B C
    1 2 1
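
The worked example on the preceding slides is consistent with the following update rule, sketched here as an inference from the slides rather than as the authors' exact procedure (the method of Sherwood et al. may differ in detail). On each reference to page P: add row P of the reference matrix to row P of the conflict matrix, clear row P, then set column P to 1 in every other row. The function name `derive_conflicts` is hypothetical.

```python
def derive_conflicts(trace, pages):
    """Replay a page reference trace and build the three structures
    shown on the slides: reference matrix, conflict matrix, access counter.

    reference[q][p] == 1 means page p was referenced since the last
    reference to page q. This update rule is inferred from the example.
    """
    idx = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    reference = [[0] * n for _ in range(n)]
    conflict = [[0] * n for _ in range(n)]
    access = [0] * n

    for page in trace:
        p = idx[page]
        access[p] += 1
        # Pages touched since P's last reference become conflicts of P.
        for q in range(n):
            conflict[p][q] += reference[p][q]
            reference[p][q] = 0
        # Record that P has now been referenced since every other page's
        # last reference.
        for q in range(n):
            if q != p:
                reference[q][p] = 1

    return reference, conflict, access


if __name__ == "__main__":
    ref, conf, acc = derive_conflicts(["A", "B", "B", "C"], ["A", "B", "C"])
    for name, matrix in (("Reference", ref), ("Conflict", conf)):
        print(name, matrix)
    print("Access", acc)
```

Replaying the four-reference trace A, B, B, C reproduces the final reference matrix, conflict matrix, and access counter shown on slide 19.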

20 2D coloring algorithm (cont’d)
- 2D page coloring
Conflict Matrix:
    A B C
  A 0 0 0
  B 1 0 0
  C 1 1 0
Access Counter:
    A B C
    1 2 1
Cost(color, page#) = α × (#Conflict(color) × mem latency) + (1 − α) × (#Access × #hop(color) × hop delay)
Optimal color(page#) = {C | Cost(C) = MIN[Cost(color, page#)] for all colors}
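
The α-weighted cost and the choice of the minimum-cost color can be sketched as below. The latencies reuse the 300-cycle and 3-cycle figures from slide 10; the per-color conflict and hop counts in the usage example are made-up inputs, and the function names are hypothetical.

```python
MEM_LATENCY = 300  # cycles (slide 10)
HOP_DELAY = 3      # cycles per hop (slide 10)


def cost(alpha, n_conflict, n_access, n_hop):
    """Cost = alpha * (#Conflict * mem latency)
            + (1 - alpha) * (#Access * #hop * hop delay)."""
    return (alpha * n_conflict * MEM_LATENCY
            + (1 - alpha) * n_access * n_hop * HOP_DELAY)


def optimal_color(alpha, n_access, conflicts_by_color, hops_by_color):
    """Return the color with the minimum cost for one page."""
    return min(
        conflicts_by_color,
        key=lambda c: cost(alpha, conflicts_by_color[c], n_access,
                           hops_by_color[c]),
    )


if __name__ == "__main__":
    # Illustrative inputs: color 0 is local but heavily conflicted,
    # color 2 is two hops away and conflict free.
    conflicts = {0: 50, 1: 5, 2: 0}
    hop_counts = {0: 0, 1: 1, 2: 2}
    for alpha in (0.5, 1 / 64):
        print(alpha, optimal_color(alpha, 100, conflicts, hop_counts))
```

With these inputs, a large α favors the distant conflict-free color, while a small α favors the local color, which is the trade-off that α is meant to tune.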

21 Experimental setup
- Experiments were carried out using a simulator derived from the SimpleScalar toolset.
- The simulator models a 16-core tile-based CMP.
- Each core has a private 32KB I/D L1 and a 256KB slice of the globally shared L2 (4MB in total).
[Figure: experiment flow: profiling (trace), 2D coloring (page mapping), timing simulation, with tuning of α]

22 Optimal page mapping
[Figure: optimal page mapping for gcc, showing the number of pages per (x, y) tile position, for α = 1/64 and α = 1/256]

23 Access distribution
[Figure: access distribution for α from 1/32 to 1/2048]

24 Relative performance

25 Value of α

26 Conclusions
- With careful data placement, there is large room for performance improvement.
- Dynamic mapping schemes that use hardware-assisted information may be able to achieve a similar performance improvement.
- This method can also be applied to other optimization targets.

27 Current and future work
- Dynamic mapping schemes
  - Performance
  - Power
- Multiprogrammed and parallel workloads

28 Thank you & Questions?

29 Private caching
1. L1 miss
2. L2 access (local L2)
   - Hit
   - Miss
3. Access directory
   - A copy on chip
   - Global miss
  (+) short hit latency (always local)
  (-) high on-chip miss rate
  (-) long miss resolution time
  (-) complex coherence enforcement

30 Shared caching
1. L1 miss
2. L2 access
   - Hit
   - Miss
  (+) low on-chip miss rate
  (+) straightforward data location
  (+) simple coherence (no replication)
  (-) long average hit latency

31 Performance
[Figure: performance improvement over shared caching; labeled values: 141%, 150%]

