
1 Dynamically Partitioned Hybrid Last Level Cache for Chip-Multiprocessors
By Islam Atta (Master's thesis), supervised by Dr. Ihab Talkhan

2 Agenda
Domain: Chip-multiprocessors (CMP)
Challenges
Hybrid Last Level Cache
Dynamic partitioning
Evaluation
Conclusion
Further discussion topics

3 Chip-Multiprocessor (CMP)
Why CMPs?
Advances in circuit-integration technology have made multi-core design the mainstream in CPU design.
CMPs will dominate commercial processor designs for at least the next decade.
Moore's law is becoming an "annual doubling of the number of processor cores" on a single chip.
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." (Gordon E. Moore, Intel co-founder, Electronics Magazine, 1965)
Why this matters: pushing clock frequency and transistor count per chip runs into power and temperature limits; the solution is parallelism across multiple cores, which gives a large increase in performance with minimal increase in power and no increase in temperature.
History and roadmap for general-purpose processors: IBM started in 2001; Intel and AMD joined in 2005 with dual-core parts; Sun followed in 2006 with 8 cores (32 threads); Intel's Polaris is an 80-core prototype.

4 Challenges & Constraints
Shared-resource management: power management, the network-on-chip (NoC), and on-chip memory.
Constraints: slow main memory and limited off-chip bandwidth.
Scalable CMPs usually consist of several nodes, each with a private L1 cache and a slice of a distributed last-level cache (LLC).
The LLC affects area, power, and performance.

5 LLC: Shared or Private?
Shared LLC for 4 cores: very flexible, but slow.
Private LLC for each core: faster, but with coherency problems.

6 Possible Solution: HYBRID
Combine the faster access of private caches with the flexibility of shared caches.

7 NUCA
Cache slices that are closer to a core have lower access latency than those farther away, forming a Non-Uniform Cache Access (NUCA) architecture.

8 Idea behind NUCA
Distributed memory banks connected in a logical mesh.
Design issues for an optimal organization:
Heterogeneous or homogeneous banks
Sharing and access control
No replication, or replication with coherency control
One way to identify the best solution is to study the application requirements.
Example from the mesh figure: P1 has access to banks 00, 01, 10, 11; P2 has access to banks 10, 11, 12, 13, 20, 21, 22, 23; banks 10 and 11 are shared by P1 and P2.
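
For illustration, here is a minimal Python sketch of the NUCA idea under simple assumptions: access latency grows with Manhattan distance in the mesh, and each core is restricted to a set of banks, as in the example above. The latency constants, the latency model, and the bank-ID notation (row digit then column digit) are assumptions made for this sketch, not values from the thesis.

    BASE_LATENCY = 10   # cycles to reach the nearest bank (illustrative value)
    HOP_LATENCY = 2     # extra cycles per mesh hop (illustrative value)

    def access_latency(core_pos, bank_id):
        """Non-uniform access: latency grows with Manhattan distance to the bank."""
        row, col = int(bank_id[0]), int(bank_id[1])
        hops = abs(core_pos[0] - row) + abs(core_pos[1] - col)
        return BASE_LATENCY + HOP_LATENCY * hops

    # Access-control sets taken from the example above.
    reachable = {
        "P1": {"00", "01", "10", "11"},
        "P2": {"10", "11", "12", "13", "20", "21", "22", "23"},
    }

    print("Banks shared by P1 and P2:", sorted(reachable["P1"] & reachable["P2"]))
    # A core at mesh position (0, 0) reaches bank 00 faster than bank 11.
    print(access_latency((0, 0), "00"), access_latency((0, 0), "11"))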

9 Application Domain Variation
Caching requirements vary across applications, and a fixed NUCA organization suffers when utilization differs among them.
Furthermore, some applications' requirements vary over their own run time.
Applications range from favoring more sharing to favoring more privacy.

10 Adaptation at Run-time
Our attempt: build an adaptive LLC that resizes its private and shared partitions according to each core's application requirements at run time.

11 Hybrid Cache Organization
Physically combined: each node = processing core + L1 cache, with an LLC slice tightly coupled to the node.
All LLC slices are connected through the NoC.

12 LLC Components
Shared partition: cache lines accessible by all cores.
Private partition: cache lines accessible only by the local core.
Directory cache: holds address tags and node IDs that point to the private slice where a cache line resides.
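
As a concrete illustration of these components, the sketch below models one node's LLC slice in Python; the class and field names are assumptions made for this example, not the simulator's data structures.

    from dataclasses import dataclass, field

    @dataclass
    class LLCSlice:
        node_id: int
        private_lines: dict = field(default_factory=dict)  # addr -> data, local core only
        shared_lines: dict = field(default_factory=dict)    # addr -> data, visible to all cores
        directory: dict = field(default_factory=dict)        # addr tag -> node ID of the private slice holding the line

    @dataclass
    class Node:
        core_id: int
        l1: dict = field(default_factory=dict)   # private L1 cache of the core
        llc: LLCSlice = None                      # LLC slice tightly coupled to this node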

13 Caching Mechanism
Implementation scheme: searching and fill policy.
Search order: local private/shared slice, then the home shared slice, then the home directory cache; a directory hit locates the line in a remote private slice, while a miss sends the request off-chip.
Fill policy:
Hit in a remote private slice: move the line to the home shared slice.
Victim evicted from a private slice: move it to the home shared slice.
Hit in the local/home shared slice: move the line to the local/home private slice.
Off-chip fill: add the line to the local private slice and add an entry in the home directory cache.
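
Putting the search order and fill policy together, here is a minimal functional sketch that builds on the hypothetical LLCSlice structure above. The address-mapping function, the directory update when a shared hit is promoted into the requester's private slice, and the helper names are assumptions for illustration; the simulator's actual scheme may differ in detail.

    def home_node(addr, num_nodes):
        return addr % num_nodes                   # assumed simple interleaving

    def fetch_off_chip(addr):
        return f"data@{addr:#x}"                  # stand-in for a main-memory access

    def llc_access(addr, local, nodes):
        """local: requesting node's LLCSlice; nodes: all LLCSlices, with nodes[i].node_id == i."""
        home = nodes[home_node(addr, len(nodes))]

        # 1. Search the local private/shared slice.
        if addr in local.private_lines or addr in local.shared_lines:
            return "local hit"

        # 2. Search the home shared slice; on a hit, move the line to the local
        #    private slice (and record the new owner in the home directory).
        if addr in home.shared_lines:
            local.private_lines[addr] = home.shared_lines.pop(addr)
            home.directory[addr] = local.node_id
            return "home shared hit"

        # 3. Search the home directory cache; a hit means a remote private slice
        #    holds the line, which then moves to the home shared slice.
        if addr in home.directory:
            owner = nodes[home.directory.pop(addr)]
            home.shared_lines[addr] = owner.private_lines.pop(addr)
            return "remote private hit"

        # 4. Miss: fetch off-chip, add the line to the local private slice, and
        #    add an entry for it in the home directory cache.
        local.private_lines[addr] = fetch_off_chip(addr)
        home.directory[addr] = local.node_id
        return "off-chip miss"

    # Capacity and replacement are omitted here; on replacement, a private-slice
    # victim would move to its home shared slice, per the fill policy above.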

14 Dynamic Partitioning
Partitioning is done on a per-node basis, so slices can be heterogeneous rather than homogeneous.
No replication, so no coherency protocol is required.
The aggregate distributed LLC is treated as one unit.

15 Dynamic Partitioning (2)
Way-partitioning: with total associativity j, the private partition gets i ways and the shared partition gets j - i ways (total associativity = private + shared).
The separation boundary is moved by changing the shared/private way assignment.
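
A minimal sketch of way-partitioning for a single cache set, assuming the boundary simply splits the j ways into a private region and a shared region; the names, the placeholder replacement choice, and the structure are illustrative, not the thesis implementation.

    class WayPartitionedSet:
        def __init__(self, total_ways, private_ways):
            self.ways = [None] * total_ways      # one cache line (or None) per way
            self.boundary = private_ways         # i: ways [0, i) private, [i, j) shared

        def ways_of(self, partition):
            if partition == "private":
                return range(0, self.boundary)
            return range(self.boundary, len(self.ways))

        def insert(self, partition, line):
            """Place a line in its partition, evicting within that partition only."""
            candidates = list(self.ways_of(partition))
            if not candidates:                   # a partition can shrink to zero ways
                return line
            for w in candidates:
                if self.ways[w] is None:         # free way inside the partition
                    self.ways[w] = line
                    return None
            victim_way = candidates[0]           # placeholder for LRU within the partition
            victim, self.ways[victim_way] = self.ways[victim_way], line
            return victim                        # the fill policy decides where the victim goes

        def repartition(self, new_private_ways):
            """Move the separation boundary; total associativity stays fixed."""
            self.boundary = new_private_ways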

16 Decision Making
WHEN to repartition? Periodically, whenever the number of misses crosses a threshold (the miss trigger).
WHICH partitions to increase or decrease? By comparing hit counts in the shadow tags.
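
The sketch below illustrates one way this decision could be wired up, assuming per-node miss and shadow-tag hit counters; the counter names, threshold, and step size are illustrative assumptions, not the thesis parameters.

    MISS_TRIGGER = 10_000     # misses between repartition checks (illustrative)
    STEP = 1                  # ways moved across the boundary per decision (illustrative)

    def maybe_repartition(node):
        """node carries miss/shadow-hit counters and the current way boundary."""
        if node.miss_count < MISS_TRIGGER:        # WHEN: wait for the miss trigger
            return
        node.miss_count = 0

        # WHICH: grow the partition whose shadow tags caught more would-be hits.
        if node.shadow_hits_private > node.shadow_hits_shared:
            node.private_ways = min(node.total_ways, node.private_ways + STEP)
        elif node.shadow_hits_shared > node.shadow_hits_private:
            node.private_ways = max(0, node.private_ways - STEP)

        node.shadow_hits_private = node.shadow_hits_shared = 0   # new measurement epoch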

17 Shadow Tags
Kept per LLC node.
On replacement, the ejected cache line's tag is placed in the corresponding shadow tags.
On a miss, the requested address is compared with the shadow tags; a match increments the hit-in-shadow-tags counter.
At repartitioning time, the shadow-tag hit counts are compared.
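
A minimal sketch of the shadow-tag bookkeeping described above, for one LLC node; the flat-list organization and capacity handling are simplifying assumptions (real shadow tags would mirror the cache's set/way structure).

    class ShadowTags:
        def __init__(self, capacity):
            self.capacity = capacity
            self.tags = []            # tags of recently ejected lines, oldest first
            self.hits = 0             # the "hit in shadow tags" counter

        def on_eviction(self, tag):
            """A cache line was replaced: remember its tag in the shadow tags."""
            if tag in self.tags:
                self.tags.remove(tag)
            self.tags.append(tag)
            if len(self.tags) > self.capacity:
                self.tags.pop(0)      # the shadow tags themselves have limited room

        def on_miss(self, tag):
            """On a cache miss, a shadow-tag match means a larger partition would have hit."""
            if tag in self.tags:
                self.hits += 1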

18 Experimental Methodology
Simulator: SESC (SuperESCalar), a cycle-accurate, detailed system simulator modeling a MIPS architecture.
Benchmarks: SPLASH-2 (Stanford ParalleL Applications for SHared memory), used in studies of centralized and distributed shared-address-space multiprocessors.

19 Performance Evaluation
Static hybrid vs. a totally shared cache (base): an average improvement of 10%.

20 Performance Evaluation (2)
Dynamic partitioning decision analysis; the observed cases are:
Private only
Shared only
Both
None (the initial value was already good)

21 Performance Evaluation (3)
Shared vs. static vs. dynamic: hybrid outperforms shared, and dynamic outperforms static.
The improvement varies across benchmarks: over shared, up to 31% with an average of 16%; over static, up to 15% with an average of 7%.

22 Conclusion
An optimized cache hierarchy strongly affects overall CMP system performance, and a hybrid cache is a good option.
We evaluated a statically partitioned hybrid LLC on a cycle-accurate simulator.
We then proposed a dynamically partitioned hybrid LLC built on top of its static counterpart.
The evaluation shows that dynamic partitioning helps the cache cope with differing application requirements and approach optimal cache access.

23 Further Discussion
Possible future work
Example of related work
Separation boundary revisited
Experimental setup (CACTI 5.2)
Performance evaluation: bank accesses per core for the shared cache, optimal partition size for the static hybrid cache, parameters for the dynamic hybrid
Summary of thesis work

24 Possible Future Work
The evaluation used only multi-threaded benchmarks; modify SESC to enable execution of multi-programmed workloads as well.
Re-examine scalability for many-core configurations with larger core counts (16, 32, …).
Plug the proposed dynamic scheme on top of different NUCA cache organizations.

25 Related Work: Cooperative Caching
Each L2 is private.
On a miss, the other L2s are searched before going off-chip.
On replacement, instead of evicting the victim off-chip, it is placed in another L2 that has room.
A coherency protocol is required, and all of this is coordinated through a centralized cooperation engine.
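
For illustration only, a rough sketch of the spill-on-replacement idea (not the cited design's actual protocol, which adds a coherency protocol and a centralized cooperation engine): a private L2 victim is placed in a peer L2 that has room instead of being dropped, and peers are probed on a miss before going off-chip.

    L2_CAPACITY = 1024     # lines per private L2 (assumed)

    def spill_victim(addr, data, my_id, l2s):
        """l2s: node_id -> {addr: data}. Returns the hosting peer, or None for off-chip."""
        for node_id, cache in l2s.items():
            if node_id != my_id and len(cache) < L2_CAPACITY:
                cache[addr] = data        # a peer L2 with spare room hosts the victim
                return node_id
        return None                        # nobody has room: write back off-chip

    def cooperative_lookup(addr, l2s):
        for node_id, cache in l2s.items():
            if addr in cache:              # hit in the local or a peer L2
                return node_id
        return None                        # miss everywhere: go off-chip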

26 Important Definitions
Home node: the node produced by the address-mapping function for a given address.
Remote node: a node whose private slice holds a cache line requested by another core.
Local node: the requesting node, when the cache line is found in its own private or shared partition.
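
A small sketch of an address-mapping function, under the assumption of line-granularity interleaving across nodes (the line size, node count, and mapping are illustrative, not necessarily the thesis parameters); every node computes the same home node for a given address.

    LINE_SIZE = 64     # bytes per cache line (assumed)
    NUM_NODES = 16     # nodes in the CMP (assumed)

    def home_node_of(addr):
        line = addr // LINE_SIZE      # strip the byte offset within the line
        return line % NUM_NODES       # interleave consecutive lines across the nodes

    # Two addresses in the same cache line share one home node.
    assert home_node_of(0x1000) == home_node_of(0x103F)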

27 Separation Boundary Revisited
2-bit Valid flag

28 Experimental Setup (CACTI 5.2)

29 Performance Evaluation (4)
Bank-Access pattern per core

30 Performance Evaluation (5)
Optimal partition size for static hybrid cache

31 Performance Evaluation (6)
Miss trigger and repartition factor

32 Summary of Thesis Work
First steps: literature survey on CMPs; identify a hot topic (the cache hierarchy); survey possible solutions; propose a novel solution.
Implementation: investigate an appropriate simulator; study SESC; identify the modifications required to implement the static and dynamic hybrid LLC.
Experimentation: study the bank-access pattern of SPLASH-2 applications; identify the optimal setup for the static hybrid; compare the static hybrid to the shared cache; identify the optimal setup for the dynamic hybrid; compare shared, static, and dynamic hybrid LLCs.
Final documentation.

