Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh
CMPMSI’07 02/11/07 Multicore distributed L2 caches: L2 caches are typically sub-banked and distributed (IBM Power4/5: 3 banks; Sun Microsystems T1: 4 banks; Intel Itanium2 (L3): many “sub-arrays”). (Distributed L2 caches + switched NoC) → NUCA. Hardware-based management schemes: private caching, shared caching, hybrid caching. [Figure labels: Local L2 Cache, Processor Core, Router]
Private and shared caching. Private caching: short hit latency (always local); high on-chip miss rate; long miss resolution time; complex coherence enforcement. Shared caching: low on-chip miss rate; straightforward data location; simple coherence (no replication); long average hit latency.
Other approaches. Hybrid/flexible schemes: “Core clustering” [Speight et al., ISCA2005]; “Flexible CMP cache sharing” [Huh et al., ICS2004]; “Flexible bank mapping” [Liu et al., HPCA2004]. Improving shared caching: “Victim replication” [Zhang and Asanovic, ISCA2005]. Improving private caching: “Cooperative caching” [Chang and Sohi, ISCA2006]; “CMP-NuRAPID” [Chishti et al., ISCA2005].
Motivation: What is the optimal balance between miss rate and hit latency? [Figure labels: Miss rate, Hit latency]
Talk roadmap: Data mapping, a key property [Cho and Jin, MICRO2006]; Two-dimensional (2D) page coloring algorithm; Evaluation and results; Conclusion and future work.
Data mapping: the location of memory data in the L2 cache. Private caching: data mapping is determined by program location; the mapping is created at miss time; no explicit control. Shared caching: data mapping is determined by address, with slice number = (block address) % N_slice; the mapping is static; no explicit control.
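The two mapping rules on this slide can be illustrated with a minimal sketch. The block size, the slice count, and the function names are assumptions chosen for illustration and do not come from the talk; the private-caching function simply models "determined by program location" as "the slice local to the core that missed".

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)
N_SLICE = 16     # number of L2 slices on the chip (assumed value)


def shared_slice(phys_addr: int) -> int:
    """Shared caching: the slice is a static function of the block address."""
    block_address = phys_addr // BLOCK_SIZE
    return block_address % N_SLICE


def private_slice(requesting_core: int) -> int:
    """Private caching: the block goes to the local slice of the core
    that missed, so the mapping is only created at miss time."""
    return requesting_core


if __name__ == "__main__":
    # Consecutive blocks are interleaved across slices under shared caching.
    print([shared_slice(a) for a in range(0, 4 * BLOCK_SIZE, BLOCK_SIZE)])
    # Under private caching the same block may land in any core's slice.
    print(private_slice(3))
```

In both rules software has no say in where a block lands: one depends only on the address bits, the other only on which core happened to run the program.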
Conclusions: With careful data placement, there is huge room for performance improvement. Dynamic mapping schemes that use hardware-provided information may be able to achieve a similar performance improvement. This method can also be applied to other optimization targets.
Current and future work: Dynamic mapping schemes (performance, power); multiprogrammed and parallel workloads.
Private caching: 1. L1 miss. 2. Local L2 access (hit or miss). 3. On a miss, access the directory (a copy is on chip, or global miss). Properties: short hit latency (always local); high on-chip miss rate; long miss resolution time; complex coherence enforcement.
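The three steps of the private-caching lookup can be sketched as a toy model. This is an illustration under stated assumptions, not the simulator used in the talk: the function name, the per-core `set` standing in for each L2 slice, and the `dict` directory are hypothetical, and evictions and coherence actions are ignored.

```python
def private_lookup(core, block, l2, directory):
    """Toy private-caching lookup after an L1 miss.

    l2        -- list of sets, one per core (hypothetical stand-in for slices)
    directory -- dict: block -> set of cores holding a copy (hypothetical)
    """
    # Step 2: access the local L2 slice only.
    if block in l2[core]:
        return "local L2 hit"
    # Step 3: local miss, so consult the directory.
    owners = directory.get(block)
    if owners:
        # A copy is on chip: replicate it into the local slice.
        l2[core].add(block)
        owners.add(core)
        return "on-chip copy"
    # Global miss: fetch from memory and create the mapping now.
    l2[core].add(block)
    directory[block] = {core}
    return "global miss"


if __name__ == "__main__":
    l2 = [set() for _ in range(4)]
    directory = {}
    for core in (0, 0, 1, 1):
        print(core, private_lookup(core, 0x10, l2, directory))
```

After the run above the same block occupies two slices, which is the replication that raises the on-chip miss rate and makes coherence enforcement more complex.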
Shared caching: 1. L1 miss. 2. L2 access (hit or miss). Properties: low on-chip miss rate; straightforward data location; simple coherence (no replication); long average hit latency.
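For comparison, the shared-caching lookup can be sketched in the same toy style. The slice count, the function name, and the use of one `set` per slice are assumptions for illustration; the home slice is taken to be the block address modulo the number of slices, and evictions are ignored.

```python
N_SLICE = 4  # number of L2 slices (assumed value)


def shared_lookup(core, block, l2):
    """Toy shared-caching lookup after an L1 miss.

    l2 -- list of sets, one per slice (hypothetical stand-in for slices)
    Returns (home slice, outcome).
    """
    # The home slice depends only on the block address, not on the core.
    home = block % N_SLICE
    where = "local" if home == core else "remote"
    if block in l2[home]:
        return home, where + " L2 hit"
    # Miss: the block is installed in its single home slice.
    l2[home].add(block)
    return home, "miss"


if __name__ == "__main__":
    l2 = [set() for _ in range(N_SLICE)]
    print(shared_lookup(0, 5, l2))  # first touch
    print(shared_lookup(0, 5, l2))  # same core, home slice is elsewhere
    print(shared_lookup(1, 5, l2))  # core co-located with the home slice
```

Each block lives in exactly one slice, so there is nothing to keep coherent at the L2 level, but most hits are served by a slice that is not local to the requesting core.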