Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)
C. Liu, A. Sivasubramaniam, M. Kandemir
The Pennsylvania State University

Outline
– CMPs and L2 organization
– Shared Processor-based Split L2
– Evaluation using SpecOMP/Specjbb
– Summary of Results

Why CMPs?
– Can exploit coarser granularity of parallelism
– Better use of anticipated billion-transistor designs
  – Multiple and simpler cores
– Commercial and research prototypes
  – Sun MAJC
  – Piranha
  – IBM Power 4/5
  – Stanford Hydra
  – …

Higher pressure on memory system
– Multiple active threads => larger working set
– Solution?
  – Bigger cache
  – Faster interconnect
– What if we have to go off-chip? The cores need to share the limited pins.
– The impact of off-chip accesses may be much worse than incurring a few extra cycles on-chip.
– On-chip caches therefore need close scrutiny.

On-chip Cache Hierarchy
– Assume 2 levels
  – L1 (I/D) is private
  – What about L2?
– L2 is the last line of defense before going off-chip, and is the focus of this paper.

Private (P) L2
[Diagram: per-core L1 (I$/D$) backed by a private L2 bank per core, connected by an interconnect with a coherence protocol to off-chip memory]
Advantages:
– Less interconnect traffic
– Insulates L2 units
Disadvantages:
– Duplication
– Load imbalance

Shared-Interleaved (SI) L2
[Diagram: per-core L1 (I$/D$) over an interconnect with a coherence protocol to address-interleaved shared L2 banks]
Advantages:
– No duplication
– Balance the load
Disadvantages:
– Interconnect traffic
– Interference between cores
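In an interleaved shared L2, consecutive line addresses map to successive banks. A minimal sketch of that bank selection follows, using the deck's default 64-byte L2 lines and 8 banks; the exact bit layout (low-order line-address bits) is an assumption, not taken from the paper.

```python
# Illustrative address-interleaved bank selection for a shared L2.
# Line size and bank count match the default configuration slide;
# the mapping itself is an assumption.

LINE_SIZE = 64  # bytes per L2 line
NUM_BANKS = 8   # one bank per core in the default setup

def l2_bank(addr: int) -> int:
    """Pick an L2 bank from the low-order bits of the line address."""
    line_addr = addr // LINE_SIZE   # strip the block offset
    return line_addr % NUM_BANKS    # consecutive lines -> consecutive banks

assert l2_bank(0x0000) == 0
assert l2_bank(0x0040) == 1  # the next 64-byte line lands in the next bank
```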

Desirables
– Approach the behavior of private L2 when the sharing is not significant
– Approach the behavior of private L2 when load is balanced or when there is interference
– Approach the behavior of shared L2 when there is significant sharing
– Approach the behavior of shared L2 when demands are uneven

Shared Processor-based Split L2
[Diagram: per-core L1 (I$/D$) over an interconnect to many small L2 splits, fronted by a split table and split-select logic]
– Processors/cores are allocated L2 splits.

Lookup
– Look up all splits allocated to the requesting core simultaneously.
– If not found, look in all other splits (extra latency).
  – If found there, move the block over to one of the requester's splits (chosen randomly), removing it from the other split.
  – Else, go off-chip and place the block in one of the requester's splits (chosen randomly).
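As a sketch, the lookup steps above can be written out in Python. The class and helper names are illustrative and the off-chip fetch is stubbed out; this is not the paper's implementation.

```python
import random

def fetch_off_chip(addr):
    """Stand-in for an off-chip memory access (illustrative stub)."""
    return f"block@{addr:#x}"

class SplitL2:
    """Sketch of the shared processor-based split L2 lookup policy.

    splits[i] models one split as {block address: block}; alloc[core]
    lists the split indices currently allocated to that core.
    """
    def __init__(self, num_splits, alloc):
        self.splits = [dict() for _ in range(num_splits)]
        self.alloc = alloc

    def lookup(self, core, addr):
        own = self.alloc[core]
        # 1. Probe all splits allocated to the requesting core at once.
        for s in own:
            if addr in self.splits[s]:
                return self.splits[s][addr]          # local hit
        # 2. Local miss: probe every other split (extra latency).
        for s in range(len(self.splits)):
            if s not in own and addr in self.splits[s]:
                # 3. Remote hit: migrate the block into a randomly chosen
                #    split of the requester and remove it from the remote
                #    split, preserving the single-copy-in-L2 property.
                block = self.splits[s].pop(addr)
                self.splits[random.choice(own)][addr] = block
                return block
        # 4. Global miss: fetch off-chip, place in a random owned split.
        block = fetch_off_chip(addr)
        self.splits[random.choice(own)][addr] = block
        return block

# Two cores, four splits: core 0 owns splits 0-1, core 1 owns splits 2-3.
l2 = SplitL2(4, {0: [0, 1], 1: [2, 3]})
l2.lookup(0, 0x1000)   # miss -> off-chip fill into one of core 0's splits
l2.lookup(1, 0x1000)   # remote hit -> block migrates to core 1's splits
```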

Note …
– As in the private case, a core's placements cannot evict blocks useful to another core (it places blocks only in its own splits).
– As in the shared case, a core can look at (shared) blocks of other cores, at a slightly higher cost that is still far below an off-chip access.
– There is at most 1 copy of a block in L2.

Shared Split Uniform (SSU)
[Diagram: each core is allocated an equal number of L2 splits via the split table and split-select logic]

Shared Split Non-Uniform (SSN)
[Diagram: cores are allocated different numbers of L2 splits via the split table and split-select logic]

Split Table
[Diagram: a table with one row per processor (P0–P3) and one column per split; an X marks each split allocated to that processor]
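One compact way to model this table is a bitmask per processor, with bit i set when split i is allocated to that processor. The allocation pattern below is illustrative, not the exact table in the figure.

```python
# One mask per core; bit i set => that core owns split i.
NUM_SPLITS = 16

split_table = {
    0: 0b0000_0000_0000_1111,  # P0 owns splits 0-3
    1: 0b0000_0000_0011_0000,  # P1 owns splits 4-5
    2: 0b0000_0000_0100_0000,  # P2 owns split 6
    3: 0b1111_1111_1000_0000,  # P3 owns splits 7-15
}

def owned_splits(core: int) -> list[int]:
    """Expand a core's mask into the list of split indices it may use."""
    mask = split_table[core]
    return [i for i in range(NUM_SPLITS) if mask & (1 << i)]

assert owned_splits(1) == [4, 5]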

Evaluation
– Using the Simics complete-system simulator
– Benchmarks: SpecOMP + Specjbb, with the reference dataset
– Several billion instructions were simulated
– A bus interconnect was simulated, with MESI coherence

Default configuration
  # of proc       8            L2 Assoc          4-way
  L1 Size         8KB          L2 Latency        10 cycles
  L1 Line Size    32 Byte      # L2 Splits       8 (SI, SSU)
  L1 Assoc        4-way        # L2 Splits       16 (SSN)
  L1 Latency      1 cycle      MEM Access        120 cycles
  L2 Size         2MB total    Bus Arbitration   5 cycles
  L2 Line Size    64 Byte      Replacement       Strict LRU
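As a quick sanity check, standard cache geometry (sets = size / (associativity × line size)) is consistent with these parameters; the arithmetic below is implied by the table rather than stated on the slide.

```python
def num_sets(size_bytes: int, assoc: int, line_bytes: int) -> int:
    """Standard cache geometry: sets = size / (associativity * line size)."""
    return size_bytes // (assoc * line_bytes)

assert num_sets(8 * 1024, 4, 32) == 64       # each 8KB L1: 64 sets
assert num_sets(128 * 1024, 4, 64) == 512    # each 128KB L2 split: 512 sets
assert 16 * 128 * 1024 == 2 * 1024 * 1024    # 16 splits of 128K = 2MB total
```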

Benchmarks (SpecOMP + Specjbb)
[Table: per-benchmark instruction counts (millions) and L1/L2 miss counts and miss rates for ammp, applu, apsi, art_m, fma3d, galgel, swim, mgrid, and specjbb]

SSN Terminology
– With a total L2 of 2MB (16 splits of 128K each) to be allocated to 8 cores, SSN-152 refers to:
  – 512K (4 splits) allocated to 1 CPU
  – 256K (2 splits) allocated to each of 5 CPUs
  – 128K (1 split) allocated to each of 2 CPUs
– Determining how much to allocate to each CPU (and when) is postponed to future work. Here, we use a profile-based approach based on L2 demands.
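The naming convention can be decoded mechanically: each digit of SSN-152 is the number of CPUs at one allocation size. The sketch below assumes the digit-to-split-count mapping (4, 2, 1 splits) implied by this one example.

```python
SPLIT_KB = 128  # each split is 128K (2MB / 16 splits)

def decode_ssn(name: str, splits_per_class=(4, 2, 1)):
    """Expand an SSN-<abc> name into a per-CPU split allocation."""
    counts = [int(d) for d in name.split("-")[1]]
    alloc = []
    for n_cpus, n_splits in zip(counts, splits_per_class):
        alloc.extend([n_splits] * n_cpus)
    return alloc

alloc = decode_ssn("SSN-152")      # -> [4, 2, 2, 2, 2, 2, 1, 1]
assert len(alloc) == 8             # covers all 8 CPUs
assert sum(alloc) == 16            # uses all 16 splits (2MB)
assert alloc[0] * SPLIT_KB == 512  # the one large allocation is 512K
```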

Application behavior
– Intra-application heterogeneity
  – Spatial (among CPUs): allocate non-uniform splits to different CPUs.
  – Temporal (for each CPU): change the number of splits allocated to a CPU at different points in time.
– Inter-application heterogeneity
  – Different applications running at the same time can have different L2 demands.

Definition
– SHF (Spatial Heterogeneity Factor)
– THF (Temporal Heterogeneity Factor)
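The formula images on this slide did not carry over. Based on the "Meaning" backup slide (heterogeneity measured as a standard deviation of L2 load, weighted by L1 accesses), a plausible reconstruction is

$$\mathrm{SHF} \;\propto\; A_{L1} \cdot \sigma\bigl(\ell_1, \ell_2, \ldots, \ell_P\bigr)$$

where $\ell_i$ is the L2 load imposed by CPU $i$, $\sigma$ is the standard deviation across the $P$ CPUs, and $A_{L1}$ is the total number of L1 accesses; THF would take the same form with $\ell_t$ measured per epoch for a fixed CPU. This is a hedged guess at the form, not the paper's exact definition.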

Spatial Heterogeneity Factor

Temporal Heterogeneity Factor

Results: SI

Results: SSU

Results: SSN

Summary of Results
– When P does better than S (e.g., apsi), SSU/SSN does as well as (if not better than) P.
– When S does better than P (e.g., swim, mgrid, specjbb), SSU/SSN does as well as (if not better than) S.
– In nearly all cases (except applu), some configuration of SSU/SSN does the best.
– On average, we get over 11% improvement in IPC over the best S/P configuration(s).

Inter-application Heterogeneity
– Different applications have different L2 demands.
– These applications could even be running concurrently on different CPUs.

Inter-application results
– ammp+apsi: low + high
– ammp+fma3d: both low
– swim+apsi: both high, imbalanced + balanced
– swim+mgrid: both high, imbalanced + imbalanced

Inter-application: ammp+apsi
– SSN: 1.25MB dynamically allocated to apsi, 0.75MB to ammp. The graph shows the rough 5:3 allocation.
– Better overall IPC value.
– Low miss rate for apsi, without affecting the miss rate of ammp.

Concluding Remarks
– Shared Processor-based Split L2 is a flexible way of approaching the behavior of shared or private L2, whichever is preferable.
– It accommodates spatial and temporal heterogeneity in L2 demands, both within an application and across applications.
– It becomes even more important as off-chip access costs grow.

Future Work
– How to configure the split sizes: statically, dynamically, or with a combination of the two?

Backup Slides

Meaning
– SHF and THF capture the heterogeneity, between CPUs (spatial) or over epochs (temporal), of the load imposed on the L2 structure.
– Weighting by L1 accesses reflects the effect on the overall IPC.
  – If the overall accesses are low, there will not be a significant impact on IPC even if the standard deviation is high.

Results: P

Results: SI

Results: SSU

Results
– Except for applu, the shared split L2 configurations perform the best.
– In swim, mgrid, and specjbb, the high L1 miss rate means higher pressure on L2, which results in significant IPC improvements (30.9% to 42.5%).

Why does private L2 do better in some cases?
– L2 performance depends on:
  – The degree of sharing
  – The imbalance of the load imposed on L2
– For applu and swim+apsi:
  – Only 12% of the blocks are shared at any time, mainly between 2 CPUs.
  – There is not much spatial/temporal heterogeneity.

Why do we use IPC instead of execution time?
– We could not run any of the benchmarks to completion, since we are using the "reference" dataset.
– Another possible indicator is the number of iterations of a certain loop (for example, the dominating loop) executed per unit of time. We did this and found a direct correlation between the IPC value and the number of iterations.

                                       Private                 SSU
                                Average time    IPC     Average time    IPC
  apsi loop calling dctdx()
  (main loop)                   3,349m cycles    …      …,048m cycles   3.79

Results

Closer look: specjbb
– SSU is over 31% better than the private L2.
– There is a direct correlation between the L2 misses and the IPC values.
– P's IPC never exceeds 2.5, while SSU sometimes pushes over 3.0.

Sensitivity: Larger L2
– 2MB -> 4MB -> 8MB
  – Miss rates go down, so differences arising from miss rates diminish; swim still gets considerable savings.
  – If application working sets keep growing, the split shared L2 will still help.
  – More L2 splits -> finer granularity -> could help SSN.

Sensitivity: Longer memory access
– 120 cycles -> 240 cycles
– Benefits are amplified.