Memshare: a Dynamic Multi-tenant Key-value Cache

Memshare: a Dynamic Multi-tenant Key-value Cache. Asaf Cidon*, Daniel Rushton†, Stephen M. Rumble‡, Ryan Stutsman†. *Stanford University, †University of Utah, ‡Google Inc.

Cache is 100x Faster Than the Database: a web server request takes about 10 ms against the database but only about 100 µs against the cache.

Cache Hit Rate Drives Cloud Performance. Small improvements to the cache hit rate make a big difference: at a 98% cache hit rate, +1% hit rate → ~35% speedup (Facebook study [Atikoglu '12]).
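
To see why small hit-rate gains matter so much, the mean request latency can be written as a function of the hit rate. A minimal back-of-the-envelope calculation, using the 100 µs cache and 10 ms database latencies from the earlier slide (exact numbers depend on the workload):

    \bar{t}(h) = h \cdot t_{\mathrm{hit}} + (1 - h) \cdot t_{\mathrm{miss}}
    \bar{t}(0.98) = 0.98 \cdot 0.1\,\mathrm{ms} + 0.02 \cdot 10\,\mathrm{ms} = 0.298\,\mathrm{ms}
    \bar{t}(0.99) = 0.99 \cdot 0.1\,\mathrm{ms} + 0.01 \cdot 10\,\mathrm{ms} = 0.199\,\mathrm{ms}

Going from a 98% to a 99% hit rate cuts the mean latency by roughly a third, in line with the ~35% figure above: misses dominate latency, so each additional point of hit rate is worth more the closer the cache gets to 100%.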

Static Partitioning → Low Hit Rates. Cache providers statically partition their memory among applications. Examples: Facebook, Amazon ElastiCache, Memcachier.

Partitioned Memory Over Time. [Figure: memory allocated to App A, App B, and App C over time, under a static partition vs. no partition.]

Partitioned vs. No Partition Hit Rates

Application | Hit Rate (Partitioned) | Hit Rate (No Partition)
Combined    | 87.8%                  | 88.8%
A           | 97.6%                  | 96.6%
B           | 98.8%                  | 99.1%
C           | 30.1%                  | 39.2%

Partitioned Memory: Pros and Cons. Disadvantages: lower hit rate due to low utilization, higher TCO. Advantages: isolated performance and predictable hit rate; "fairness": customers get what they pay for.

Memshare: the Best of Both Worlds. Optimize memory allocation to maximize the overall hit rate, while providing a minimal guaranteed memory allocation and performance isolation.

Multi-tenant Cache Design Challenges: (1) decide each application's memory allocation to optimize the overall hit rate; (2) enforce the memory allocation among applications.

Estimate Hit Rate Curve Gradient to Optimize Hit Rate

Estimating the Hit Rate Gradient. Track the access frequency of recently evicted objects to determine the gradient of the hit rate curve at the current working point. This can be further improved with full hit rate curve estimation: SHARDS [Waldspurger 2015, 2017], AET [Hu 2016].
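
The slides do not show code for this, but the idea can be sketched as a small per-application "shadow queue" (ghost list) that remembers only the keys of recently evicted objects and counts misses that land on them. The class and member names below (ShadowQueue, record_eviction, record_miss, gradient) are illustrative, not Memshare's actual API; a minimal sketch:

    #include <cstddef>
    #include <deque>
    #include <string>
    #include <unordered_set>

    // Per-application shadow queue: remembers the keys of the most recently
    // evicted objects (here a fixed number of keys, as a stand-in for a fixed
    // number of bytes). A miss that hits the shadow queue would have been a
    // cache hit if the application had slightly more memory, so the shadow
    // hit ratio approximates the local slope (gradient) of the hit rate curve
    // at the current allocation.
    class ShadowQueue {
     public:
      explicit ShadowQueue(std::size_t capacity) : capacity_(capacity) {}

      // Called when the cache evicts an object belonging to this application.
      void record_eviction(const std::string& key) {
        if (keys_.insert(key).second) {
          order_.push_back(key);
          if (order_.size() > capacity_) {   // keep only the most recent evictions
            keys_.erase(order_.front());
            order_.pop_front();
          }
        }
      }

      // Called on a cache miss for this application; returns true if the key
      // was recently evicted (a "shadow hit").
      bool record_miss(const std::string& key) {
        ++misses_;
        if (keys_.count(key) != 0) {
          ++shadow_hits_;
          return true;
        }
        return false;
      }

      // Fraction of misses that would have been hits with a little more memory.
      double gradient() const {
        return misses_ == 0 ? 0.0 : static_cast<double>(shadow_hits_) / misses_;
      }

     private:
      std::size_t capacity_;
      std::deque<std::string> order_;          // eviction order, oldest first
      std::unordered_set<std::string> keys_;   // fast membership test
      std::size_t shadow_hits_ = 0;
      std::size_t misses_ = 0;
    };

An allocator built on top of this can periodically compare gradients across applications and shift memory toward the application with the steepest gradient, since it gains the most hits per additional byte.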

Multi-tenant Cache Design Challenges: (1) decide each application's memory allocation to optimize the overall hit rate; (2) enforce the memory allocation among applications. Not so simple.

Slab Allocation Primer. [Diagram: a Memcached server's memory divided into slabs for App 1 and App 2, with LRU queues.]

Goal: Move 4KB from App 2 to App 1. Problems: we need to evict an entire 1MB slab; the slab contains many small objects, some of them hot; and App 1 can only use the extra space for objects of a certain size class. This is problematic even for a single application, see Cliffhanger [Cidon 2016].

Instead of Slabs: Log-structured Memory. [Diagram: memory is organized as a sequence of log segments; newly written objects are appended at the log head.]

Applications are Physically Intermixed. [Diagram: objects from App 1 and App 2 share the same log segments.]
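
A log-structured allocator in this spirit can be sketched in a few lines. The types and names below (Segment, ObjectHeader, Log, append) are illustrative simplifications, not Memshare's actual implementation; a minimal sketch:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <new>
    #include <vector>

    // One fixed-size log segment; objects from any application are packed
    // into it contiguously.
    struct Segment {
      static constexpr std::size_t kSize = 1 << 20;   // e.g. 1 MB segments
      std::unique_ptr<char[]> data{new char[kSize]};
      std::size_t used = 0;
    };

    // Header stored in front of every object so the cleaner can later tell
    // which application owns it and how recently it was used (its "rank").
    struct ObjectHeader {
      uint32_t app_id;
      uint32_t size;
      uint64_t last_access;   // source of the rank for LRU-like eviction
    };

    class Log {
     public:
      // Append an object for app_id; if the head segment is full, seal it and
      // open a new head. Returns a pointer to the stored header (objects are
      // assumed to be smaller than a segment in this sketch).
      ObjectHeader* append(uint32_t app_id, const void* value, uint32_t size,
                           uint64_t now) {
        const std::size_t need = sizeof(ObjectHeader) + size;
        if (segments_.empty() || head().used + need > Segment::kSize) {
          segments_.emplace_back();                  // open a new log head
        }
        char* dst = head().data.get() + head().used;
        ObjectHeader* hdr = new (dst) ObjectHeader{app_id, size, now};
        std::memcpy(dst + sizeof(ObjectHeader), value, size);
        head().used += need;
        return hdr;
      }

     private:
      Segment& head() { return segments_.back(); }
      std::vector<Segment> segments_;   // sealed segments plus the current head
    };

Because any application's objects can land in any segment, growing one application and shrinking another no longer requires evicting a whole slab of one size class; instead, the cleaner (next slides) decides whose objects survive when segments are compacted.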

Memshare's Sharing Model. Reserved memory: each application's guaranteed static memory. Pooled memory: the application's share of the pooled memory. Target memory = reserved memory + pooled memory.
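
In symbols, with T_i the target memory of application i, R_i its reserved memory, and P the total pooled memory (how the pool is split among applications is not spelled out on this slide, so the weights w_i are an assumption):

    T_i = R_i + w_i \cdot P, \qquad \textstyle\sum_i w_i = 1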

Cleaning Priority Determines Eviction Priority. Q: When does Memshare evict? A: Newly written objects evict old objects, but not on the critical path: the cleaner keeps 1% of the cache empty and tries to keep each application's actual memory allocation equal to its target memory. No application is ever denied new space; the cleaner is the enforcement engine.

Cleaner Pass (n = 2): the cleaner selects n candidate segments and copies the objects it decides to keep into n - 1 survivor segments. [Diagram: candidate and survivor segments holding App 1 and App 2 objects.]

Cleaner Pass (n = 2): after the pass, the retained objects occupy the n - 1 survivor segments, the candidates are reclaimed, and 1 free segment is gained.

Cleaner Pass (n = 4): Twice the Work. With n = 4, a pass reads 4 candidate segments and writes 3 survivor segments to gain 1 free segment, roughly twice the cleaning work per freed segment compared with n = 2.
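
The cost can be written down directly; counting segments read plus segments written per freed segment (a small derivation added here for clarity):

    \mathrm{work}(n) = \underbrace{n}_{\text{read}} + \underbrace{n - 1}_{\text{written}} = 2n - 1, \qquad \mathrm{work}(2) = 3, \quad \mathrm{work}(4) = 7

So the copying cost per freed segment roughly doubles each time n doubles, while (as the later slides argue) a larger n lets the cleaner make better eviction decisions.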

Application Need: How Far is the Actual Memory Allocation from the Target Memory? [Diagram: log segments holding App 1 and App 2 objects.]
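
The slides use need values such as 0.8 and 1.4 below. A definition consistent with those values and with the question in the title (an assumption here, not a formula quoted from the slides) is the ratio of target memory to actual memory consumption:

    \mathrm{need}_i = \frac{T_i}{A_i}

where A_i is application i's current memory footprint. A need above 1 means the application holds less memory than its target, so the cleaner should preferentially keep its objects; a need below 1 means it is over its target.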

Within Each Application, Evict by Rank. [Diagram: log segments holding objects with ranks 1, 12, 5, 7, 15, 4, 3, 8.] To implement LRU, set rank = last access time; within an application, the lowest-ranked objects are evicted first and the highest-ranked survive.

Cleaning: Max Need and then Max Rank. Example: n candidate segments are compacted into n - 1 survivor segments; the candidates hold objects of Rank 2, Rank 1, Rank 0, and Rank 3, and the applications' needs are App 1 = 0.8, App 2 = 1.4, App 3 = 0.9. Max need? → App 2. Max rank (among App 2's objects)? → Rank 2.

Cleaning: Max Need and then Max Rank. Second example: the candidates hold objects of Rank 1, Rank 0, Rank 3, and Rank 2, and the needs are now App 1 = 0.9, App 2 = 0.8, App 3 = 1.2. Max need? → App 3. Max rank (among App 3's objects)? → Rank 1.
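
Putting need and rank together, the survivor selection inside a cleaner pass can be sketched as a sort over the candidate objects. The struct and function names below are illustrative, and the need values use the target/actual ratio assumed earlier rather than a definition taken from the slides; a minimal sketch:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // An object encountered while reading the n candidate segments.
    struct Candidate {
      uint32_t app_id;
      uint64_t rank;    // e.g. last access time for LRU-like behavior
      uint32_t size;    // bytes
    };

    // need[app_id] = target memory / actual memory for that application.
    // Survivors are taken first from the application with the highest need,
    // and within an application from its highest-rank objects, until the
    // n - 1 survivor segments are full; everything else is evicted.
    std::vector<Candidate> select_survivors(std::vector<Candidate> candidates,
                                            const std::vector<double>& need,
                                            std::size_t survivor_bytes) {
      std::sort(candidates.begin(), candidates.end(),
                [&](const Candidate& a, const Candidate& b) {
                  if (need[a.app_id] != need[b.app_id])
                    return need[a.app_id] > need[b.app_id];   // max need first
                  return a.rank > b.rank;                     // then max rank
                });
      std::vector<Candidate> survivors;
      std::size_t used = 0;
      for (const Candidate& c : candidates) {
        if (used + c.size > survivor_bytes) break;   // survivor space is full
        survivors.push_back(c);
        used += c.size;
      }
      return survivors;
    }

With rank = last access time, the per-application selection is greedy in recency, so as n grows it approaches per-application LRU, which is exactly the accuracy/cost trade-off discussed on the next slides.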

Trading Off Eviction Accuracy and Cleaning Cost. Eviction accuracy is determined by n: for example, with rank = time of last access, the policy approaches ideal LRU as n grows toward the total number of segments. Intuition: n plays a role similar to cache associativity. CPU consumption is also determined by n. "In practice Memcached is never CPU-bound in our data centers. Increasing CPU to improve the hit rate would be a good trade off." - Nathan Bronson, Facebook

Implementation. Implemented in C++ on top of Memcached, reusing Memcached's hash table, transport, and request processing, and adding a log-structured memory allocator.

Partitioned vs. Memshare

Application | Hit Rate (Partitioned) | Hit Rate (Memshare, 50% Reserved)
Combined    | 87.8%                  | 89.2%
A           | 97.6%                  | 99.4%
B           | 98.8%                  |
C           | 30.1%                  | 34.5%

Reserved vs. Pooled Behavior. [Chart: combined hit rates of 90.2%, 89.2%, and 88.8% under different reserved/pooled settings, broken down by App A, App B, and App C.]

State-of-the-art Hit Rate. Misses reduced by 40%; combined hit rate increased by 6 percentage points (85% → 91%).

State-of-the-art Hit Rate Even for Single-Tenant Applications

Policy                         | Memcached | Memshare (100% Reserved)
Average single-tenant hit rate | 88.3%     | 95.5%

Cleaning Overhead is Minimal Modern servers have 10GB/s or more!

Related Work. Optimizing memory allocation using shadow queues: Cliffhanger [Cidon 2016]. Log-structured single-tenant key-value stores: RAMCloud [Rumble 2014] and MICA [Lim 2014]. Taxing idle memory: ESX Server [Waldspurger 2002].

Summary. The first multi-tenant key-value cache that optimizes each application's share for the highest overall hit rate while providing minimal guarantees, built on a novel log-structured design that uses the cleaner as the enforcement mechanism.

Appendix

Idle Tax for Selfish Applications. Some sharing models do not support pooled memory; each application is selfish (for example, Memcachier's cache-as-a-service). Idle tax: reserved memory can be reassigned if it is idle. Tax rate: determines the portion of idle memory that can be reassigned. If all memory is active, target memory = reserved memory.
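
One way to write this down, as a sketch consistent with the statements on this slide rather than a formula taken from it (τ is the tax rate, 0 ≤ τ ≤ 1, and I_i is application i's idle memory):

    T_i = R_i - \tau \cdot I_i

An application with no idle memory (I_i = 0) keeps its full reservation, matching "if all memory is active: target memory = reserved memory", and the taxed portion of idle memory is what can be reassigned to other applications.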

Partitioned vs. Idle Tax

Application | Hit Rate (Partitioned) | Hit Rate (Memshare, Idle Tax)
Combined    | 87.8%                  | 88.8%
A           | 97.6%                  | 99.4%
B           | 98.8%                  | 98.6%
C           | 30.1%                  | 31.3%

State-of-the-art Hit Rate

Nearly Identical Latency