Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison

What is Server Consolidation?
- Multiple server applications are deployed onto virtual machines (VMs) running on a single, more powerful server.
- Feasibility:
  - Virtualization technology (VT) – hardware and software support
  - Many-core CMPs – Sun's Niagara (32 threads); Intel's Tera-scale project (100s of tiles)

CMP Running Consolidated Servers

Characteristics
- Isolating the function of VMs
- Isolating the performance of consolidated servers
- Facilitating dynamic reassignment of VM resources (processors, memory)
- Supporting inter-VM memory sharing (content-based page sharing)

How Should the Memory System Be Optimized?
- Minimize average memory access time (AMAT) by servicing misses within a VM
- Minimize interference among separate VMs to isolate performance
- Facilitate dynamic reassignment of cores, caches, and memory to VMs
- Support inter-VM page sharing

Current CMP Memory Systems
- Global broadcast – not viable for such a large number of tiles
- Global directory – forces memory accesses to cross the chip, failing to minimize AMAT or isolate performance
- Statically distributing the directory among tiles – better, but complicates memory allocation, VM reassignment, and scheduling, and limits sharing opportunities

DRAM Directory with Directory Cache (DRAM-DIR)
- Main directory in DRAM; directory cache at the memory controller (entry sketched below)
- Each tile can be a sharer of the data
- Every miss issues a request to the directory
1. Fails to minimize AMAT – significant latency to reach the directory, even when the data is cached nearby
2. Allows the performance of one VM to affect others – due to interconnect and directory contention
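
As a rough illustration of what DRAM-DIR keeps per block, the sketch below assumes a 64-tile CMP and a full presence-bit map; the field names and encoding are illustrative guesses, not the paper's exact format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a full-map DRAM-DIR entry for a 64-tile CMP (illustrative). */
typedef struct {
    uint64_t sharers; /* one presence bit per tile                        */
    uint8_t  owner;   /* tile holding the exclusive/modified copy, if any */
    bool     dirty;   /* is the copy in DRAM stale?                       */
} dram_dir_entry_t;

/* True if tile 'tile' (0..63) is recorded as a possible sharer. */
static inline bool dir_has_sharer(const dram_dir_entry_t *e, unsigned tile)
{
    return (e->sharers >> tile) & 1u;
}
```

Because this state sits behind the memory controller, even an L1-to-L1 transfer between neighboring tiles first makes the round trip to the controller, which is the AMAT problem noted above.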

Duplicate Tag Directory (TAG-DIR)
- Centrally located
- Fails to minimize AMAT
- Directory contention
- Challenging as the number of cores grows (64 cores with 16-way caches => a 1024-way duplicate-tag lookup, as computed below)
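
The 1024-way figure is simply the product of core count and cache associativity; a one-line illustration using the numbers from the slide:

```c
/* A duplicate-tag lookup must compare against the tags of every core's
 * corresponding set, so it behaves like a (cores x ways)-way CAM.       */
enum { CORES = 64, CACHE_WAYS = 16, DUP_TAG_WAYS = CORES * CACHE_WAYS }; /* = 1024 */
```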

Static Cache Bank Directory (STATIC-BANK-DIR)
- Home tile is determined by the block address or page frame number (sketched below)
- The home tile maintains sharers and coherence state
- A local miss queries the home tile
- A replacement at the home tile invalidates all copies
- Fails to minimize AMAT or isolate VMs (even worse, because of the invalidations)
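
A minimal sketch of how a static home tile can be derived from the block address, assuming a 64-tile chip, 64-byte blocks, and plain modulo interleaving; these are assumptions, not the paper's exact mapping.

```c
#include <stdint.h>

#define NUM_TILES    64  /* assumed 64-tile CMP    */
#define BLOCK_OFFSET  6  /* assumed 64-byte blocks */

/* STATIC-BANK-DIR: the home tile is a fixed function of the physical address,
 * so a block's directory state may sit far outside the VM that uses it.       */
static inline unsigned static_home_tile(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_OFFSET;  /* cache-block number          */
    return (unsigned)(block % NUM_TILES);    /* interleave across all tiles */
}
```

Because the mapping ignores VM boundaries, a VM confined to a few tiles still sends most of its directory traffic to home tiles owned by other VMs.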

Solution: Two-Level Virtual Hierarchy
- Level-one directory for intra-VM coherence
  - Minimizes memory access time
  - Isolates performance
- Two alternative global level-two protocols for inter-VM coherence
  - Allow inter-VM sharing caused by migration, reconfiguration, and page sharing
  - The two options are VH_A and VH_B

Level-One Intra-VM Directory Protocol
- The home tile lies within the VM
- Who is the home tile? A VM need not span a power-of-two number of tiles, and tiles can be reassigned dynamically
- Home tiles are therefore found through a per-tile VM Config Table (64 entries), sketched below
- Each directory entry holds a 64-bit sharer vector
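
A sketch of that dynamic home-tile lookup, assuming the 64-entry table is indexed by six block-address bits; the struct layout, index choice, and names are illustrative assumptions.

```c
#include <stdint.h>

#define CONFIG_ENTRIES 64  /* 64-entry per-tile VM Config Table (from the slide) */
#define BLOCK_OFFSET    6  /* assumed 64-byte blocks                             */

/* Each entry names a home tile inside this tile's VM. A VM spanning, say,
 * 6 tiles simply repeats those 6 tile IDs across the 64 entries, so the VM
 * size need not be a power of two.                                           */
typedef struct {
    uint8_t vm_config_table[CONFIG_ENTRIES];
} tile_state_t;

static inline unsigned dynamic_home_tile(const tile_state_t *t, uint64_t paddr)
{
    unsigned idx = (unsigned)((paddr >> BLOCK_OFFSET) % CONFIG_ENTRIES); /* 6 address bits */
    return t->vm_config_table[idx];  /* home tile within this VM */
}
```

Reassigning tiles to another VM then largely means rewriting these small per-tile tables (plus handling blocks cached under the old mapping) rather than changing a chip-wide static interleaving.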

Level 2 – Option 1: VH_A
- Directory in DRAM, with a directory cache at the memory controller
- Each entry contains a full 64-bit tile vector
- Why not a home-tile ID? The level-one home tile changes whenever a VM is reconfigured, so the level-two entry tracks the tiles themselves.

Brief Summary
- The level-one intra-VM protocol handles most of the coherence
- The level-two protocol is used only for inter-VM sharing and dynamic reconfiguration of VMs
- Can we reduce the complexity of the level-two protocol?

Level 2 – Option 2: VH_B
- A single bit per memory block tracks whether the block has any cached copies
- If the bit is set, a miss that may involve inter-VM sharing is broadcast (see the sketch below)
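
A sketch of that level-two decision as the slide describes it, with hypothetical helper names standing in for the on-chip network and the memory controller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 1024  /* toy memory size, for illustration only */

/* VH_B level-two state: one bit per block meaning "some tile may cache this". */
static bool cached_on_chip[NUM_BLOCKS];

/* Stubs standing in for the on-chip network and the memory controller. */
static void broadcast_to_all_tiles(uint64_t blk) { printf("broadcast block %llu\n", (unsigned long long)blk); }
static void reply_from_memory(uint64_t blk)      { printf("memory reply, block %llu\n", (unsigned long long)blk); }

/* Invoked only when the level-one (intra-VM) directory cannot satisfy a miss. */
void vhb_level2_miss(uint64_t blk)
{
    if (cached_on_chip[blk]) {
        broadcast_to_all_tiles(blk);  /* a copy may exist in another VM: broadcast */
    } else {
        reply_from_memory(blk);       /* no cached copies anywhere: memory answers */
        cached_on_chip[blk] = true;   /* remember a copy now exists on chip        */
    }
}
```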

Advantages of a Level-Two Broadcast
- Reduces protocol complexity: eliminates many transient states
- Enables the level-one protocol to be inexact
  - Use a limited or coarse-grain sharer vector (see the sketch below)
  - Or keep no level-one state at all and broadcast within the VM
  - No home-tile tag needed for private data
  - Victimize a directory tag without invalidating sharers
  - Access memory on a prediction without first checking the home tile
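
One way to read "limited or coarse-grain vector" is that each level-one sharer bit covers a group of tiles rather than a single tile; the sketch below assumes groups of four, and the granularity and names are illustrative.

```c
#include <stdint.h>

#define TILES_PER_GROUP 4  /* assumed granularity: one sharer bit covers 4 tiles */

/* Coarse-grain sharer vector for a level-one directory entry: bit i set means
 * "some tile in group i may hold a copy". Over-approximation only causes extra
 * intra-VM probes; it never loses a sharer.                                     */
static inline uint16_t mark_sharer(uint16_t vec, unsigned tile)
{
    return (uint16_t)(vec | (1u << (tile / TILES_PER_GROUP)));
}

static inline int may_share(uint16_t vec, unsigned tile)
{
    return (vec >> (tile / TILES_PER_GROUP)) & 1u;
}
```

The broadcast-backed level two is also what makes missing level-one state (a victimized tag, or no tag for private data) safe: correctness can be recovered by broadcasting rather than by keeping precise bookkeeping.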

Uncontended L1-to-L1 Sharing Latency

Normalized Runtime: Homogeneous Workloads
- STATIC-BANK-DIR and VH_A consume tag space in static or dynamic home tiles
- VH_B: no home-tile tags for private data

Memory System Stall Cycles

Cycles per Transaction: Mixed Workloads
- VH_B gives the best overall performance (lowest cycles per transaction)
- DRAM-DIR: 45%–55% hit rate in the 8 MB directory cache (no partitioning)
- STATIC-BANK-DIR: slightly better for OLTP but worse for JBB in mixed1; allows interference, but also lets OLTP use other VMs' resources

Conclusion
- Future memory systems should be optimized for workload consolidation as well as for single workloads
  - Maximize the shared-memory accesses serviced within a VM
  - Minimize interference among separate VMs
  - Facilitate dynamic reassignment of resources