Presentation transcript:

Dynamically Partitioned Hybrid Last Level Cache for Chip-Multiprocessors
Master's thesis by Islam Atta, supervised by Dr. Ihab Talkhan
© Copyright Islam Atta, Cairo University, 2010

Agenda
- Domain: chip-multiprocessors (CMPs)
- Challenges
- Hybrid last-level cache
- Dynamic partitioning
- Evaluation
- Conclusion
- Further discussion topics

Chip-Multiprocessor (CMP)
Why CMPs?
- Advances in circuit integration technology have made multi-core design the mainstream in CPU design.
- CMPs will dominate commercial processor designs for at least the next decade.
- Moore's law is about to become an "annual doubling of the number of processor cores" on a single chip.
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year." (Gordon E. Moore, Intel co-founder, Electronics Magazine, 1965)
Why this matters:
- Increasing the clock frequency and the number of transistors per chip runs into power and temperature limits.
- The solution is thread-level parallelism across multiple cores: a large increase in performance with minimal difference in power and no increase in temperature.
History and roadmap for general-purpose processors:
- IBM started in 2001.
- Intel and AMD joined in 2005 with dual-core designs.
- Sun raised the bar in 2006 with 8 cores (32 threads).
- Intel's Polaris prototype targets 80 cores in 2011.

Challenges & Constraints
Shared resources to manage:
- Power management
- Network-on-chip (NoC)
- On-chip memory
Constraints:
- Slow main memory
- Limited off-chip bandwidth
Scalable CMPs usually consist of several nodes, where each node has a private L1 cache and a slice of a distributed last-level cache (LLC). The LLC affects area, power, and performance.

LLC: Shared or Private?
- A shared LLC for all 4 cores is very flexible, but slow.
- A private LLC for each core is faster, but raises coherency problems.

Possible Solution: HYBRID
Combine the faster access of private caches with the flexibility of shared caches.

NUCA
Cache slices that are closer to a core have faster access than those further away, forming a Non-Uniform Cache Access (NUCA) architecture.
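To make the non-uniformity concrete, here is a minimal latency model, assuming a 2D mesh where each hop adds a fixed cost; the function name, coordinates, and cycle counts are illustrative, not taken from the thesis:

```cpp
#include <cstdlib>

// Illustrative NUCA latency model (not from the thesis): a bank access
// costs a fixed bank latency plus one network hop per Manhattan-distance
// step between the requesting core's node and the bank's node in the mesh.
struct MeshCoord { int row, col; };

int nucaAccessLatency(MeshCoord core, MeshCoord bank,
                      int bankLatency /*cycles*/, int hopLatency /*cycles*/) {
    int hops = std::abs(core.row - bank.row) + std::abs(core.col - bank.col);
    return bankLatency + hops * hopLatency;
}
```

With, say, a 10-cycle bank and 2-cycle hops, a core at (0, 0) pays 10 cycles for its local bank but 18 cycles for the bank at (2, 2); that spread is the non-uniformity NUCA exposes.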

Idea behind NUCA
- Distributed memory banks connected in a logical mesh.
- Open design issues: heterogeneous or homogeneous banks; sharing access control; no replication or coherency control.
- One way to identify the best solution is to study the application requirements.
Example from the slide's figure: P1 has access to banks 00, 01, 10, 11; P2 has access to banks 10, 11, 12, 13, 20, 21, 22, 23; banks 10 and 11 are shared by P1 and P2.

Application Domain Variation
Caching requirements vary across different applications: some need more sharing, others more privacy, so a fixed NUCA organization is utilized unevenly across applications. Furthermore, some applications' requirements vary across their own run-time.

Adaptation at Run-time
Our attempt: build an adaptive LLC that adjusts the sizes of its private and shared partitions to each core's application requirements at run-time.

Hybrid Cache Organization
Physically combined: each node = processing core + L1 cache, with an LLC slice tightly coupled to the node. All LLC slices are connected through the NoC.

LLC Components
- Shared partition: cache lines accessible by all cores.
- Private partition: cache lines accessible only by the local core.
- Directory cache: contains address tags and node IDs that point to the private slice where a cache line is resident.
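A directory-cache entry, as described above, needs little more than a tag and an owner ID. The struct below is a hypothetical layout; the field widths are assumptions, not the thesis's actual format:

```cpp
#include <cstdint>

// Hypothetical directory-cache entry, following the slide's description:
// an address tag plus the ID of the node whose private slice holds the line.
struct DirEntry {
    uint64_t tag;      // address tag of the cached line
    uint8_t  ownerId;  // node whose private slice holds the line
    bool     valid;    // entry in use
};
```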

Caching Mechanism
Search order on an access:
1. Local private/shared slice.
2. Shared slice at the home node.
3. Directory cache at the home node: a hit points to a remote private slice; a miss sends the request off-chip.
Fill policy:
- Miss served off-chip: add the line to the local private slice and add an entry (tag + node ID) in the home directory cache.
- Hit in a remote private slice: move the line to the home shared slice.
- Victim evicted from a private slice: move it to the home shared slice.
- Hit in the local/home shared slice: move the line to the local/home private slice.
The full scheme combines all three fill policies.
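A minimal sketch of this search order, with container-based stand-ins for the real cache structures (all names and types are illustrative; the actual SESC-based implementation differs):

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Sketch of the three-step lookup described above, for one node's view.
struct NodeLLC {
    std::unordered_set<uint64_t> localLines;     // local private + shared slice
    std::unordered_set<uint64_t> homeShared;     // shared slice at the home node
    std::unordered_map<uint64_t, int> dirCache;  // tag -> owner node ID
    int myId = 0;

    enum class Where { Local, HomeShared, RemotePrivate, OffChip };

    Where lookup(uint64_t tag) {
        if (localLines.count(tag)) return Where::Local;       // step 1
        if (homeShared.count(tag)) return Where::HomeShared;  // step 2
        if (dirCache.count(tag))   return Where::RemotePrivate; // step 3: directory hit
        // Directory miss: fetch off-chip, fill the local private slice,
        // and record ownership in the home node's directory cache.
        localLines.insert(tag);
        dirCache.emplace(tag, myId);
        return Where::OffChip;
    }
};
```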

Dynamic Partitioning
- Partitioning on a per-node basis, so partition sizes can be heterogeneous rather than homogeneous across nodes.
- No replication, so no coherency protocol is required.
- The aggregate distributed LLC is treated as ONE UNIT.

Dynamic Partitioning (2)
Way-partitioning: with total associativity j, the private partition gets i ways and the shared partition gets j - i ways (total associativity = private + shared). The separation boundary is moved by modifying the shared/private associativity.
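A sketch of one way to realize this, assuming the boundary is simply an index into the ways of each set (the structure, names, and the trivial victim policy are illustrative, not the thesis's implementation):

```cpp
#include <cstdint>
#include <vector>

// Way-partitioned set: ways [0, privateWays) belong to the private
// partition, ways [privateWays, totalWays) to the shared partition.
struct PartitionedSet {
    std::vector<uint64_t> tags;  // one tag per way
    int privateWays;             // the separation boundary

    PartitionedSet(int totalWays, int privWays)
        : tags(totalWays, 0), privateWays(privWays) {}

    // Pick a victim way inside the requested partition only.
    int victimWay(bool inPrivate) const {
        int begin = inPrivate ? 0 : privateWays;
        int end   = inPrivate ? privateWays : static_cast<int>(tags.size());
        return begin == end ? -1 : begin;  // simplest policy; real hardware would use LRU
    }

    // Repartitioning just moves the boundary index.
    void repartition(int newPrivateWays) { privateWays = newPrivateWays; }
};
```

Repartitioning then only moves the boundary index; whether resident lines on the wrong side of the new boundary are evicted immediately or reclaimed lazily is a separate policy choice.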

Decision Making
- WHEN to repartition? Periodically, based on a number-of-misses threshold (the miss trigger).
- WHICH partitions to increase/decrease? By comparing hits in the shadow tags.

Shadow Tags
- One set of shadow tags per LLC node.
- On replacement, the ejected cache line's tag is recorded in the corresponding shadow tag.
- On a miss, the address is compared against the shadow tags; a match increments a hits-in-shadow-tags counter.
- At repartition time, the hits in the shadow tags are compared to decide which partition to grow.
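Putting the miss trigger and the shadow-tag comparison together, the decision logic might look like the following sketch (the threshold value, counter names, and the +1/-1 encoding are assumptions for illustration):

```cpp
#include <cstdint>

// Sketch of the repartitioning decision: when misses cross a threshold,
// compare the hits recorded in the private and shared shadow tags and grow
// whichever partition would have caught more of the recent misses.
struct RepartitionCtl {
    uint64_t misses = 0, privShadowHits = 0, sharedShadowHits = 0;
    uint64_t missTrigger = 4096;  // assumed threshold: one repartition epoch

    // Returns +1 to give the private partition one more way, -1 to give the
    // shared partition one more way, 0 to leave the boundary unchanged.
    int onMiss(bool hitPrivShadow, bool hitSharedShadow) {
        ++misses;
        if (hitPrivShadow)   ++privShadowHits;
        if (hitSharedShadow) ++sharedShadowHits;
        if (misses < missTrigger) return 0;   // not yet time to repartition
        int decision = 0;
        if (privShadowHits > sharedShadowHits)      decision = +1;
        else if (sharedShadowHits > privShadowHits) decision = -1;
        misses = privShadowHits = sharedShadowHits = 0;  // start a new epoch
        return decision;
    }
};
```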

Experimental Methodology
Simulator: SESC (SuperESCalar), a cycle-accurate, detailed system simulator for the MIPS architecture.
Benchmarks: SPLASH-2 (Stanford ParalleL Applications for SHared memory), used in studies of centralized and distributed shared-address-space multiprocessors.

Performance Evaluation
Static hybrid vs. totally shared cache (the baseline): an average improvement of 10%.

Performance Evaluation (2)
Dynamic partitioning decision analysis: across benchmarks, the repartitioning decisions resized the private partition only, the shared partition only, both, or neither (the initial value was already good).

Performance Evaluation (3)
Shared vs. static vs. dynamic: the hybrid designs outperform the shared cache, and dynamic outperforms static, with variance in the improvement across benchmarks.
- Maximum improvement: 31% over shared, 15% over static.
- Average improvement: 16% over shared, 7% over static.

Conclusion
- An optimized cache hierarchy for CMPs affects overall system performance; a hybrid cache is a good option.
- We evaluated a statically partitioned hybrid LLC on a cycle-accurate simulator.
- We then proposed a dynamically partitioned hybrid LLC built on top of the static counterpart.
- Based on the evaluation, dynamic partitioning is beneficial for dealing with different application requirements to achieve optimal cache access.

Further Discussion
- Possible future work
- Example of related work
- Separation boundary revisited
- Experimental setup (CACTI 5.2)
- Performance evaluation: bank accesses per core for the shared cache; optimal partition size for the static hybrid cache; parameters for the dynamic hybrid
- Summary of thesis work

Possible Future Work
- Evaluation was based on multi-threaded benchmarks only; modify SESC to enable execution of multi-programmed workloads.
- Scalability can be re-examined for many-core configurations with larger numbers of cores (16, 32, …).
- Plug the proposed dynamic scheme on top of different NUCA cache organizations.

Related Work: Cooperative Caching
- L2 is private.
- On a miss, search the other L2s rather than going off-chip.
- On replacement, instead of evicting a line off-chip, place it in another L2 that has room.
- A coherency protocol is required, and all of this is performed through a centralized cooperation engine.

Important Definitions
- Home node: the output of the address-mapping function for a given cache line.
- Remote node: a node whose private slice holds a cache line requested by another core.
- Local node: the node where a requested cache line is found in the local private/shared partition.
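For illustration, a common choice of address-mapping function interleaves cache lines across nodes; the sketch below assumes line-granularity modulo interleaving, which may differ from the thesis's actual function:

```cpp
#include <cstdint>

// Illustrative address-mapping function for the "home node" definition
// above: interleave cache lines across nodes by taking the line address
// modulo the node count.
int homeNode(uint64_t addr, int numNodes, int lineBytes = 64) {
    uint64_t lineAddr = addr / lineBytes;         // drop the offset within the line
    return static_cast<int>(lineAddr % numNodes); // round-robin across nodes
}
```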

Separation Boundary Revisited: 2-bit valid flag.

Experimental Setup (CACTI 5.2)

Performance Evaluation (4) Bank-Access pattern per core

Performance Evaluation (5) Optimal partition size for the static hybrid cache

Performance Evaluation (6) Miss trigger and repartition factor

Summary of Thesis Work
First steps:
- Literature survey on CMPs; identify a hot topic (the cache hierarchy); survey all possible solutions; propose a novel solution.
Implementation:
- Investigate an appropriate simulator; study SESC; identify the modifications required to implement the static and dynamic hybrid LLC.
Experimentation:
- Study the bank-access pattern of the SPLASH-2 applications; identify the optimal setup for the static hybrid; compare the static hybrid to the shared cache; identify the optimal setup for the dynamic hybrid; compare the shared, static, and dynamic hybrid LLCs.
Final documentation.