Dynamically Partitioned Hybrid Last Level Cache for Chip-Multiprocessors
Master's Thesis by Islam Atta
Supervised by Dr. Ihab Talkhan
Agenda
- Domain: Chip-Multiprocessors (CMP)
- Challenges
- Hybrid Last Level Cache
- Dynamic Partitioning
- Evaluation
- Conclusion
- Further Discussion Topics
© Copyright Islam Atta, Cairo University, 2010
Chip-Multiprocessor (CMP)
Why CMPs?
- Advances in circuit-integration technology have made multi-core design the mainstream in CPU design; CMPs will dominate commercial processor designs for at least the next decade.
- Moore's law is about to become an "annual doubling of the number of processor cores" on a single chip. "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." — Gordon E. Moore, Intel co-founder, Electronics Magazine, 1965
Why is this important?
- Increasing the clock frequency and the number of transistors per chip runs into power and temperature limits.
- The solution is thread-level parallelism: a large increase in performance with minimal difference in power and no increase in temperature.
History & roadmap for general-purpose processors:
- IBM started in 2001; Intel and AMD joined in 2005 with dual-core chips.
- Sun followed in 2006 with an 8-core (32-thread) processor.
- Intel Polaris is a prototype for 80 cores in 2011.
Challenges & Constraints
Shared-resource management:
- Power management
- Network-on-chip (NoC)
- On-chip memory
Constraints:
- Slow main memory
- Limited off-chip bandwidth
Scalable CMPs usually consist of several nodes, each with a private L1 cache and a slice of a distributed last-level cache (LLC). The LLC affects area, power, and performance.
LLC: Shared or Private?
- Shared LLC for 4 cores: very flexible, but slow.
- Private LLC for each core: faster, but raises coherency problems.
Possible Solution: HYBRID
- The faster access of private caches
- The flexibility of shared caches
NUCA
Cache slices that are closer to a core have faster access than those further away, forming a Non-Uniform Cache Access (NUCA) architecture.
Idea behind NUCA
Distributed memory banks connected in a logical mesh.
Optimal design issues:
- Heterogeneous or homogeneous
- Sharing access control
- No replication, or coherency control
One way to identify the best solution is to study the application requirements!
Example: P1 has access to banks 00, 01, 10, 11; P2 has access to banks 10, 11, 12, 13, 20, 21, 22, 23; banks 10 & 11 are shared by P1 & P2.
Application Domain Variation
- Caching requirements vary across different applications, so a fixed NUCA organization is unevenly utilized among them.
- Furthermore, some applications' requirements vary across their run time, shifting between more sharing and more privacy.
Adaptation at Run-time
Our attempt: build an adaptive LLC that resizes its private and shared partitions according to each core's application requirements at run time.
Hybrid Cache Organization
- Physically combined: each node = processing core + L1 cache, with an LLC slice tightly coupled to the node.
- All LLC slices are connected through the NoC.
LLC Components
- Shared partition: cache lines accessible by all cores.
- Private partition: cache lines accessible only by the local core.
- Directory cache: address tags and node IDs that point to the private slice where a cache line resides.
Caching Mechanism
Searching (in order):
1. Local private/shared slice.
2. Home shared slice.
3. Home directory cache (entry: tag + node ID). Hit: the line is fetched from the remote private slice. Miss: the request is sent off-chip.
Fill policy:
- Hit in a remote private slice → move the line to the home shared slice.
- Victim evicted from a private slice → move it to the home shared slice.
- Hit in the local/home shared slice → move the line to the local private slice.
- Off-chip fill → add the line to the local private slice and add an entry in the home directory cache.
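The search order and fill policy above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the dictionary-based slices, the `off_chip` callback, and the placeholder node ID are all hypothetical.

```python
def lookup(addr, local, home_shared, directory, remote_private, off_chip):
    """Sketch of the three-step search: local slice, home shared slice,
    then the home directory cache (assumed dict interfaces, not SESC code)."""
    if addr in local:                       # 1. local private/shared slice
        return local[addr]
    if addr in home_shared:                 # 2. home shared slice
        data = home_shared.pop(addr)
        local[addr] = data                  # fill policy: promote to local private
        return data
    if addr in directory:                   # 3. directory cache: tag -> node ID
        node = directory.pop(addr)
        data = remote_private[node].pop(addr)
        home_shared[addr] = data            # remote-private hit -> home shared
        return data
    data = off_chip(addr)                   # miss: fetch from main memory
    local[addr] = data                      # fill into the local private slice
    directory[addr] = "local-node-id"       # record the owner (placeholder ID)
    return data
```

Each step mirrors one arrow in the slide's flow: promotions move lines toward the requesting core, while remote hits and victims collect in the home shared slice.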
Dynamic Partitioning
- Partitioning on a per-node basis: the aggregate distributed LLC is treated as ONE UNIT, but each node's partition sizes may differ (heterogeneous vs. homogeneous).
- No replication, so no coherency protocol is required.
Dynamic Partitioning (2)
Way-partitioning: with total associativity j, the private partition gets i ways and the shared partition gets j − i. The separation boundary is moved by modifying the private/shared associativity.
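Way-partitioning can be illustrated with a minimal sketch. The class name and methods are hypothetical; the point is only that repartitioning reduces to moving one boundary index i within the total associativity j.

```python
class WayPartitionedSet:
    """One cache set with total associativity j: ways [0, i) are private,
    ways [i, j) are shared. Repartitioning just moves the boundary i."""
    def __init__(self, total_ways, private_ways):
        assert 0 <= private_ways <= total_ways
        self.j = total_ways
        self.i = private_ways

    def private_way_ids(self):
        return list(range(self.i))

    def shared_way_ids(self):
        return list(range(self.i, self.j))

    def repartition(self, delta):
        """Grow the private partition by delta ways (shrink if negative),
        clamped so neither partition gets a negative number of ways."""
        self.i = max(0, min(self.j, self.i + delta))
```

Because only the boundary moves, no data needs to be copied at repartition time; lines in reassigned ways are simply evicted or re-labeled by the replacement policy.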
Decision Making
- WHEN to repartition? Periodically, when the number of misses exceeds a threshold (the miss trigger).
- WHICH partitions to increase/decrease? By comparing hits in the shadow tags.
Shadow Tags
- One set of shadow tags per LLC node.
- On replacement, the ejected cache line's tag is stored in the corresponding shadow tag.
- On a miss, the requested address is compared with the shadow tags; if they match, the shadow-tag hit counter is incremented.
- At repartition time, the shadow-tag hit counts are compared to decide which partition to grow.
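The shadow-tag mechanism and the repartition decision can be sketched together. This is an assumed simplification (a bounded deque of evicted tags, a single-way adjustment per interval), not the thesis's exact policy.

```python
from collections import deque

class ShadowTags:
    """Per-node shadow tags: hold tags of recently evicted lines and count
    misses that would have hit if the partition had more ways."""
    def __init__(self, capacity):
        self.tags = deque(maxlen=capacity)  # evicted tags, oldest dropped
        self.hits = 0

    def on_evict(self, tag):
        self.tags.append(tag)

    def on_miss(self, tag):
        if tag in self.tags:                # a "would-have-hit" miss
            self.hits += 1

def decide_repartition(private_st, shared_st, miss_count, miss_trigger):
    """Once the miss counter crosses the trigger, grow whichever partition
    shows more shadow-tag hits: +1 way to private, -1 way to private."""
    if miss_count < miss_trigger:
        return 0                            # no repartition this interval
    if private_st.hits > shared_st.hits:
        return +1
    if shared_st.hits > private_st.hits:
        return -1
    return 0
```

The returned delta would then be fed to the way-partition boundary, answering both the WHEN (miss trigger) and WHICH (shadow-tag comparison) questions from the previous slide.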
Experimental Methodology
Simulator: SESC (SuperESCalar), a cycle-accurate detailed system simulator for the MIPS architecture.
Benchmarks: SPLASH-2 (Stanford ParalleL Applications for SHared memory), used in studies of centralized and distributed shared-address-space multiprocessors.
Performance Evaluation
Static hybrid vs. totally shared cache (baseline): average improvement of 10%.
Performance Evaluation (2)
Dynamic partitioning decision analysis; partitions resized:
- Private only
- Shared only
- Both
- None (the initial value was already good)
Performance Evaluation (3)
Shared vs. static vs. dynamic:
- The hybrid outperforms the shared cache, and dynamic outperforms static, with variance in the improvement.
- Improvement over shared: max 31%, average 16%.
- Improvement over static: max 15%, average 7%.
Conclusion
- An optimized cache hierarchy for CMPs affects overall system performance; a hybrid cache is a good option.
- We evaluated a statically partitioned hybrid LLC on a cycle-accurate simulator, then proposed a dynamically partitioned hybrid LLC built on top of the static counterpart.
- Based on the evaluation, dynamic partitioning is beneficial for dealing with differing application requirements and brings cache access closer to optimal.
Further Discussion
- Possible future work
- Example of related work
- Separation boundary revisited
- Experimental setup (CACTI 5.2)
- Performance evaluation: bank accesses per core for the shared cache; optimal partition size for the static hybrid cache; parameters for the dynamic hybrid
- Summary of thesis work
Possible Future Work
- The evaluation was based only on multi-threaded benchmarks; modify SESC to enable execution of multi-programmed workloads.
- Re-examine scalability for many-core configurations with larger numbers of cores (16, 32, …).
- Plug the proposed dynamic scheme on top of different NUCA cache organizations.
Related Work: Cooperative Caching
- L2 is private. On a miss, the other L2s are searched before resorting to an off-chip access.
- On replacement, instead of evicting a line off-chip, it is placed in another L2 that has room.
- A coherency protocol is required.
- All of this is performed through a centralized Cooperation Engine.
Important Definitions
- Home node: the output of the address-mapping function.
- Remote node: a cache line requested by one core is found in the private slice of a remote node.
- Local node: a cache line is found in the local private/shared partition.
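A typical address-mapping function interleaves cache lines across the nodes. The function below is an illustrative assumption (line-granularity modulo interleaving with a 64-byte line), not necessarily the mapping used in the thesis.

```python
def home_node(addr, num_nodes, line_bytes=64):
    """Map a physical address to its home LLC node by interleaving
    consecutive cache lines across nodes (hypothetical mapping)."""
    line = addr // line_bytes      # drop the byte-offset bits within a line
    return line % num_nodes        # round-robin lines over the nodes
```

Under this mapping, consecutive lines land on consecutive nodes, which spreads the shared-slice and directory-cache load evenly across the mesh.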
Separation Boundary Revisited 2-bit Valid flag
Experimental Setup (CACTI 5.2)
Performance Evaluation (4) Bank-Access pattern per core
Performance Evaluation (5) Optimal partition size for static hybrid cache
Performance Evaluation (6) Miss Trigger Repartition Factor
Summary of Thesis Work
First steps:
- Literature survey on CMPs; identify a hot topic (the cache hierarchy).
- Survey all possible solutions; propose a novel solution.
Implementation:
- Investigate an appropriate simulator; study SESC.
- Identify the modifications required to implement the static and dynamic hybrid LLC.
Experimentation:
- Study the bank-access pattern of the SPLASH-2 applications.
- Identify the optimal setup for the static hybrid; compare it to the shared cache.
- Identify the optimal setup for the dynamic hybrid.
- Compare the shared, static, and dynamic hybrid LLCs.
Final documentation.