Presentation transcript:

Tera-Scale CMP

• Intel's Tera-scale computing project: 100 cores, >100 threads, a "datacenter-on-a-chip"
• Sun's Niagara2 (T2): 8 cores, 64 threads
• Key design issues
  - Architecture challenges and tradeoffs
  - Packaging and off-chip memory bandwidth
  - Software and runtime environment

Many-Core CMPs – High-Level View

(Figure: cores with private L1 I/D caches and L2 banks)
What are the key architecture issues in many-core CMPs?
• On-die interconnect
• Cache organization & cache coherence
• I/O and memory architecture

The General Block Diagram

Legend: FFU = Fixed Function Unit, Mem C = Memory Controller, PCI-E C = PCI-Express Controller, R = Router, ShdU = Shader Unit, Sys I/F = System Interface, TexU = Texture Unit

On-Die Interconnect

2D embedding of a 64-core 3D-mesh network: when the 3D mesh is laid out in two dimensions, the longest topological distance grows from 9 hops to 18!

On-Die Interconnect

• Must satisfy bandwidth and latency requirements within the power/area budget
• Ring or 2D mesh/torus are good candidate topologies
  - Reasonable wiring density, router complexity, and design complexity
  - Multiple source/destination pairs can be switched together; avoids stopping and buffering packets, saves power, helps throughput
• Crossbars and general routers are power hungry
• Fault-tolerant interconnect: provide spare modules, allow fault-tolerant routing
• Partition for performance isolation

Performance Isolation in a 2D Mesh

• Performance isolation in a 2D mesh via partitioning
  - Example: 3 rectangular partitions
  - Intra-partition communication is confined within the partition
  - Traffic generated in one partition does not affect the others
• Virtualization of network interfaces
  - Exposes the interconnect as an abstraction to applications
  - Allows programmers to fine-tune an application's inter-processor communication

Many-Core CMPs

(Figure: cores with private L1 I/D caches and L2 banks)
How should the on-die cache be organized with so many cores?
• Shared vs. private
• Cache capacity vs. accessibility
• Data replication vs. block migration
• Cache partitioning

CMP Cache Organization

Capacity vs. Accessibility: A Tradeoff

• Capacity – favors a shared cache
  - No data replication, no cache coherence needed
  - Longer access time, contention issues
  - Flexible sharing of cache capacity
  - Fair sharing among cores via cache partitioning
• Accessibility – favors private caches
  - Fast local access with data replication, but capacity may suffer
  - Must maintain coherence among the private caches
  - Equal partition, inflexible
• Many proposals try to take advantage of both
  - Capacity sharing on private caches – cooperative caching
  - Utility-based cache partitioning on a shared cache

Analytical Data Replication Model

• Reuse distance histogram f(x): number of accesses with reuse distance x
• Cache size S: total number of hits = area beneath f(x) up to S
• Replication reduces effective capacity, so cache misses increase
• Of the hits that remain, a fraction R/S go to replicas, and a fraction L of those replica hits are local
• P: miss penalty (cycles); G: local-hit gain (cycles)
• Net memory access cycle increase: the slide's formula (a hedged reconstruction follows below)
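The slide ends at "Net memory access cycle increase:" with the formula itself lost in extraction. A plausible reconstruction from the quantities defined above, treating R as the cache capacity devoted to replicas (an assumption; the slide's exact expression is not recoverable), is

\[
\Delta_{\text{cycles}} \;\approx\; P \int_{S-R}^{S} f(x)\,dx \;-\; G\,L\,\frac{R}{S} \int_{0}^{S-R} f(x)\,dx ,
\]

where the first term counts the extra misses caused by the lost capacity (each costing P cycles) and the second counts the cycles saved by local replica hits (each saving G cycles).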

Get Histogram f(x) for OLTP

• Step 1: Stack simulation – collect the discrete reuse-distance histogram
• Step 2: Matlab curve fitting – find a mathematical expression for f(x)
(Figure: measured histogram and fitted curve; x-axis scaled by 10^6)
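A minimal sketch of the Step 1 stack simulation in Python, assuming a simple trace of cache-block addresses; the trace, block granularity, and data structures here are illustrative rather than the course's actual tool.

```python
from collections import Counter

def reuse_distance_histogram(trace):
    """LRU stack simulation: for each access, the reuse distance is the
    number of distinct blocks touched since the previous access to the
    same block (infinite on the first touch)."""
    stack = []              # most recently used block at the end
    hist = Counter()        # reuse distance -> number of accesses
    cold = 0                # first-touch (infinite-distance) accesses
    for block in trace:
        if block in stack:
            depth = len(stack) - 1 - stack.index(block)  # 0 = most recent
            hist[depth] += 1
            stack.remove(block)
        else:
            cold += 1
        stack.append(block)
    return hist, cold

# Example: a tiny synthetic trace of cache-block addresses
hist, cold = reuse_distance_histogram([1, 2, 3, 1, 2, 4, 1])
print(dict(hist), "cold misses:", cold)
```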

Data Replication Effects

• Model parameters: G = 15, P = 400, L = 0.5; cache sizes S = 2M, 4M, 8M; x-axis is the replication fraction R/S
• Best replication fraction: 0% for S = 2M, 40% for S = 4M, 65% for S = 8M
• The impact of data replication varies with cache size
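A sketch of how such a sweep over the replication fraction R/S could be carried out with the reconstructed model above; the f(x) below is a purely hypothetical exponential fit (the real OLTP fit is not in the transcript), so the printed "best" fractions will not match the slide's 0%/40%/65% results.

```python
import math

# Purely hypothetical reuse-distance curve f(x); the real OLTP fit from the
# Matlab step is not available here. x and S are measured in cache blocks,
# assuming 64-byte blocks (so a 2MB cache holds 32768 blocks).
def f(x):
    return 5e4 * math.exp(-x / 2e4)

def hits_up_to(limit, step=256):
    # Riemann-sum approximation of the area under f(x) from 0 to limit
    return sum(f(x) * step for x in range(0, int(limit), step))

def net_cycle_increase(S, R, P=400, G=15, L=0.5):
    # Reconstructed model (see the earlier slide): replicas consume R blocks
    # of capacity, turning some hits into misses; a fraction R/S of the
    # remaining hits land on replicas, and a fraction L of those are local.
    extra_misses = hits_up_to(S) - hits_up_to(S - R)
    local_replica_hits = L * (R / S) * hits_up_to(S - R)
    return P * extra_misses - G * local_replica_hits

for S in (32768, 65536, 131072):          # 2MB, 4MB, 8MB caches in 64B blocks
    best_pct = min(range(0, 100, 5),
                   key=lambda pct: net_cycle_increase(S, S * pct / 100))
    print(f"S = {S * 64 >> 20}MB: best replication fraction ~ {best_pct}%")
```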

Many-Core CMPs

(Figure: cores with private L1 I/D caches and L2 banks)
How do we maintain cache coherence with so many cores and caches?
• Snooping bus: broadcast requests
• Directory-based: maintain per-block state and sharing information
• Review Culler's book

Simplicity: Shared L2, Write-Through L1

• Existing designs: IBM Power4 & Power5, Sun Niagara & Niagara 2
• Small number of cores, multiple L2 banks, crossbar interconnect
• Still need L1 coherence!
  - Inclusive L2: the L2 directory records the L1 sharers (Power4 & 5)
  - Non-inclusive L2: shadow L1 directory (Niagara)
• L2 (shared) coherence among multiple CMPs
• Private L2 caches are assumed in the discussion that follows

Other Considerations

• Broadcast
  - Snooping bus: loading, speed, space, power, scalability, etc.
  - Ring: slow traversal, ordering, scalability
• Memory-based directory
  - Huge directory space
  - A directory cache adds an extra penalty
• Shadow L2 directory: copies of all local L2 tags
  - Aggregated associativity = cores x ways/core; e.g., 64 x 16 = 1024 ways
  - High power

Directory-Based Approach

• The directory must maintain the state and location of every cached block
• The directory is checked when data cannot be accessed locally, e.g., on a cache miss or a write to a shared block
• The directory may route the request to a remote cache to fetch the requested block
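A minimal sketch of the bookkeeping a full-map directory performs, assuming a simple MSI-style protocol; the states, message strings, and structure are illustrative, not any specific machine's protocol.

```python
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "I"                                  # I (uncached), S (shared), M (modified)
    sharers: set[int] = field(default_factory=set)    # core IDs holding the block

class Directory:
    """Full-map directory: one entry per memory block, MSI-style bookkeeping."""
    def __init__(self):
        self.entries: dict[int, DirEntry] = {}

    def read_miss(self, block: int, core: int) -> list[str]:
        e = self.entries.setdefault(block, DirEntry())
        msgs = []
        if e.state == "M":
            owner = next(iter(e.sharers))
            msgs.append(f"fetch/downgrade block {block} from core {owner}")
        e.state = "S"
        e.sharers.add(core)
        msgs.append(f"send block {block} to core {core}")
        return msgs

    def write_miss(self, block: int, core: int) -> list[str]:
        e = self.entries.setdefault(block, DirEntry())
        msgs = [f"invalidate block {block} at core {c}" for c in e.sharers - {core}]
        e.sharers = {core}
        e.state = "M"
        msgs.append(f"grant exclusive ownership of block {block} to core {core}")
        return msgs

# Example: core 0 reads, core 1 reads, then core 1 writes (invalidating core 0)
d = Directory()
print(d.read_miss(0x40, core=0))
print(d.read_miss(0x40, core=1))
print(d.write_miss(0x40, core=1))
```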

Sparse Directory Approach

• Holds state for all cached blocks
• Low-cost set-associative design
• No backup
• Key issues:
  - Centralized vs. distributed
  - Indirect accesses
  - Extra invalidations due to conflicts
  - Presence bits vs. duplicated blocks
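A rough sketch of how a set-associative sparse directory forces invalidations when a set overflows; the set count, associativity, index hash, and random replacement below are illustrative assumptions.

```python
import random

class SparseDirectory:
    """Set-associative directory with no backup: when a set is full, an
    existing entry is evicted and its cached copies must be invalidated."""
    def __init__(self, num_sets=1024, ways=8):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [dict() for _ in range(num_sets)]   # block -> set of sharer core IDs

    def _index(self, block):
        return block % self.num_sets                    # simple modulo indexing (illustrative)

    def allocate(self, block, core):
        s = self.sets[self._index(block)]
        invalidated = []
        if block not in s and len(s) == self.ways:
            victim = random.choice(list(s))             # random replacement (illustrative)
            invalidated = [(victim, c) for c in s.pop(victim)]
        s.setdefault(block, set()).add(core)
        return invalidated            # (block, core) pairs that must be invalidated

# Example: overfill one directory set and observe the forced invalidations
d = SparseDirectory(num_sets=4, ways=2)
for i in range(3):
    block = i * 4                     # blocks 0, 4, 8 all index directory set 0
    victims = d.allocate(block, core=i)
    if victims:
        print("conflict eviction -> invalidate:", victims)
```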

Conflict Issues in the Coherence Directory

• The coherence directory must be a superset of all cached blocks
• Uneven distribution of cached blocks across directory sets causes invalidations
• Potential solutions:
  - High set associativity – costly
  - Directory + victim directory
  - Randomization and skewed associativity
  - Bigger directory – costly
  - Others?

Impact of Invalidations Due to Directory Conflicts

• 8-core CMP, 1MB 8-way private L2 per core (8MB total)
• Set-associative directory; number of directory entries = total number of cache blocks
• Each cached block occupies a directory entry
(Figure data labels: 75%, 96%, 72%, 93%)

Presence-Bit Issues in the Directory

• Presence bits (or not?)
  - Extra space, useless for multiprogrammed workloads
  - The coherence directory must still cover all cached blocks (consider the case of no sharing)
• Potential solutions
  - Coarse-granularity presence bits: imprecise, not suitable for CMPs
  - Sparse presence vectors – record core IDs
  - Allow duplicated block addresses, each with a few core IDs per shared block; enables multiple hits on a directory search
  - Others?
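To make the space argument concrete, a back-of-the-envelope comparison of a full presence-bit vector against a limited-pointer scheme; the core count, entry count, and tag width below are illustrative assumptions.

```python
import math

def directory_bits(num_cores, num_entries, tag_bits, num_pointers=None):
    """Directory storage (in bits) for a full presence-bit vector versus
    a limited-pointer scheme that records up to num_pointers core IDs."""
    state_bits = 2
    if num_pointers is None:                           # full bit-vector
        per_entry = tag_bits + state_bits + num_cores
    else:                                              # limited pointers (core IDs)
        per_entry = tag_bits + state_bits + num_pointers * math.ceil(math.log2(num_cores))
    return num_entries * per_entry

# Illustrative: 64 cores, 128K directory entries, 26-bit tags
full = directory_bits(64, 128 * 1024, tag_bits=26)
ptr2 = directory_bits(64, 128 * 1024, tag_bits=26, num_pointers=2)
print(f"full vector: {full / 8 / 1024:.0f} KB, 2-pointer: {ptr2 / 8 / 1024:.0f} KB")
```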

Valid Blocks

(Figure results: presence bits – multiprogrammed workloads -> not needed, multithreaded -> needed; skewing and the "10w-1/4" configuration help; no difference for "64v")

Challenge in Memory Bandwidth

• Off-chip memory bandwidth must increase to sustain chip-level IPC
  - Need power-efficient, high-speed off-die I/O
  - Need power-efficient, high-bandwidth DRAM access
• Potential solutions:
  - Embedded DRAM
  - Integrated DRAM or GDDR inside the processor package
  - 3D stacking of multiple DRAM/processor dies
  - Many technology issues to overcome

Memory Bandwidth Fundamentals

• BW = number of bits x bit rate
  - A typical DDR2 bus is 16 bytes (128 bits) wide and operates at 800 Mb/s per pin; its memory bandwidth is 16 bytes x 800 Mb/s = 12.8 GB/s
• Latency and capacity
  - Fast but small-capacity on-chip SRAM (caches)
  - Slow but large-capacity off-chip DRAM
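The same arithmetic as a small helper; the DDR2 numbers come from the slide, and other configurations would simply plug in different widths and per-pin rates.

```python
def memory_bandwidth_gbps(bus_width_bits: int, per_pin_rate_mbps: float) -> float:
    """BW = number of bits x per-pin bit rate, reported in GB/s."""
    bits_per_second = bus_width_bits * per_pin_rate_mbps * 1e6
    return bits_per_second / 8 / 1e9

# DDR2 example from the slide: 128-bit (16-byte) bus at 800 Mb/s per pin
print(memory_bandwidth_gbps(128, 800))   # -> 12.8 GB/s
```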

Memory Bus vs. System Bus Bandwidth

• Scaling of bus capability has usually involved increasing the bus width and the bus speed at the same time

Integrated CPU with Memory Controller

• Eliminates the off-chip controller delay
  - Fast, but harder to adapt to new DRAM technologies
• The entire burden of pin count and interconnect speed needed to sustain growing memory bandwidth requirements now falls on the CPU package alone

Challenge in Memory Bandwidth and Pin Count

Challenge in Memory Bandwidth

• Historical trend in memory bandwidth demand
  - Current generation: GB/s range
  - Next generation: >100 GB/s, possibly reaching 1 TB/s

New Packaging