MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

Presentation transcript:

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim (Carnegie Mellon University), Dongsu Han (KAIST), David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs)

Goal: Fast In-Memory Key-Value Store
- Improve per-node performance (ops/sec/node): less expensive, easier hotspot mitigation, lower latency for multi-key queries
- Target: small key-value items (fit in a single packet)
- Non-goals: cluster architecture, durability

Q: How Good (or Bad) Are Current Systems?
- Workload: YCSB [SoCC 2010], single-key operations
- In-memory storage; logging turned off in our experiments
- End-to-end performance over the network, single server node

End-to-End Performance Comparison
[Chart: throughput (M operations/sec) per system. Published results; logging on for RAMCloud/Masstree. Our measurements use Intel DPDK (kernel-bypass I/O), no logging, write-intensive workload.]
Performance collapses under heavy writes.

End-to-End Performance Comparison
[Chart: throughput (M operations/sec), with a reference line for the maximum packets/sec attainable using UDP; MICA's bars are annotated 4x and 13.5x.]

MICA Approach
MICA: redesigning in-memory key-value storage. It applies a new software architecture and new data structures to general-purpose hardware in a holistic way.
[Diagram: client -> NIC -> server CPUs and memory, highlighting three components: 1. parallel data access, 2. request direction, 3. key-value data structures (cache & store).]

1. Parallel Data Access
Modern CPUs have many cores (8, 15, ...). How do we exploit CPU parallelism efficiently?
[Diagram: same server-node overview, highlighting component 1 (parallel data access).]

Parallel Data Access Schemes
Concurrent Read/Write (CRCW): any core can read or write any part of memory (Memcached, RAMCloud, MemC3, Masstree).
+ Good load distribution: load can be spread across all cores.
- Limited CPU scalability: lock contention and expensive cache-line transfers (MESI coherence) caused by concurrent writes to the same memory location, plus cross-NUMA latency.
Exclusive Read/Write (EREW): data is partitioned by key hash (sharding / horizontal partitioning), and only one core accesses a given partition (H-Store/VoltDB).
+ Good CPU scalability: avoids synchronization and inter-core communication.
- Potentially low performance under skewed key popularity, because a hot item cannot be served by multiple cores; in MICA, this penalty turns out to be small.
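
To make the EREW idea concrete, here is a minimal sketch (not MICA's actual code) of keyhash-based partitioning in C; the partition count, struct contents, and hash handling are assumptions for illustration.

```c
/* Minimal sketch of keyhash-based (EREW) partitioning -- illustrative only,
 * not MICA's actual code.  Assumes a 64-bit key hash and a power-of-two
 * number of partitions, with one partition owned by each core. */
#include <stdint.h>

#define NUM_PARTITIONS 16   /* one partition per core; assumed power of two */

struct partition {
    /* The per-core circular log and hash index would live here.  Each
     * partition is read and written only by its owning core, so no locks
     * or atomic operations are needed on this data. */
    uint64_t num_items;
};

static struct partition partitions[NUM_PARTITIONS];

/* Map a key hash to the single core/partition that owns it. */
static inline uint32_t partition_of(uint64_t keyhash) {
    return (uint32_t)(keyhash & (NUM_PARTITIONS - 1));
}
```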

In MICA, Exclusive Outperforms Concurrent
[Chart: throughput (Mops), EREW vs. CRCW; end-to-end performance with kernel-bypass I/O.]
Why is CRCW so slow here when it does not look that slow in other software? Because MICA has simpler structures and higher throughput, it exposes system bottlenecks more clearly; there is also a gap between end-to-end throughput and local benchmarks. Even in CRCW mode, MICA outperforms Masstree.

2. Request Direction
Sending requests to the appropriate CPU cores for better data-access locality. Exclusive access benefits from correct delivery: each request must be sent to the corresponding partition's core.
[Diagram: same server-node overview, highlighting component 2 (request direction).]

Request Direction Schemes
[Diagram: clients sending requests for Key 1 and Key 2 through the NIC to server CPU cores, under flow-based vs. object-based affinity.]
Flow-based affinity: packets are classified using the 5-tuple, which is well supported by multi-queue NICs and useful for flow-based protocols (e.g., HTTP over TCP).
+ Good locality for flows.
- Suboptimal for small key-value processing: a single client can request keys handled by different cores, and commodity NICs do not understand variable-length key requests in the packet payload.
Object-based affinity: classification depends on the request content (the key).
+ Good locality for key access; MICA overcomes commodity NICs' limited programmability by using client assist.
- Client assist or special hardware support is needed for efficiency.
Q: What happens if a client sends a request to the wrong core? A: The receiving core can redirect it to the right one.
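
As one way to make the client-assist idea concrete (an illustrative sketch, not MICA's actual wire format or NIC configuration), the client could encode the target partition in a header field that commodity NICs already steer on, such as the UDP destination port; BASE_PORT and the port-per-core mapping below are hypothetical.

```c
/* Illustrative client-assist sketch.  Assumption: the server installs one
 * NIC flow-steering rule per core so that packets with UDP destination port
 * BASE_PORT + i land on core i's receive queue.  The client derives the
 * port from the key hash, so each request reaches the core that owns the
 * key's partition (no server-side redirection needed in the common case). */
#include <stdint.h>

#define BASE_PORT      9000U   /* hypothetical base UDP port */
#define NUM_PARTITIONS 16U     /* must match the server's partition count */

static inline uint16_t request_udp_port(uint64_t keyhash) {
    return (uint16_t)(BASE_PORT + (keyhash & (NUM_PARTITIONS - 1)));
}
```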

Crucial to Use NIC HW for Request Direction
[Chart: throughput (Mops), EREW with 50% GET, comparing NIC hardware-based request direction against software-only direction (as in CPHash, Chronos); all configurations use exclusive access for parallel data access.]

3. Key-Value Data Structures
Significant impact on key-value processing speed. A new design is required for very high op/sec for both reads and writes, in "cache" and "store" modes.
[Diagram: same server-node overview, highlighting component 3 (key-value data structures).]

MICA’s “Cache” Data Structures
Each partition has:
- A circular log (for memory allocation)
- A lossy concurrent hash index (for fast item access)
These exploit Memcached-like cache semantics: lost data is easily recoverable (though not free). They favor fast processing while providing good memory efficiency and item eviction.

Circular Log
- Allocates space for key-value items of any length
- Combines conventional logs with circular queues: simple garbage collection / free-space defragmentation
- A new item is appended at the tail of a fixed-size log
- Insufficient space for the new item? Evict the oldest item at the head (FIFO)
- LRU is supported by reinserting recently accessed items
[Diagram: fixed-size log with head and tail pointers, before and after an append that evicts the oldest item.]
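
A minimal sketch of the append-and-evict behavior described above, assuming a simple length-prefixed item format; the buffer size, header layout, and byte-wise wrap-around copies are illustrative choices, not MICA's actual implementation.

```c
/* Minimal sketch of a per-partition circular log: items are appended at the
 * tail; when space runs out, the oldest items at the head are evicted (FIFO).
 * Layout and sizes are illustrative, not MICA's actual format. */
#include <stdint.h>

#define LOG_SIZE (1u << 24)   /* 16 MiB per partition (arbitrary choice) */

struct circular_log {
    uint8_t  buf[LOG_SIZE];
    uint64_t head;   /* logical offset of the oldest live item */
    uint64_t tail;   /* logical offset where the next item is written */
};

/* Byte-wise copies that wrap around the end of the buffer. */
static void log_write(struct circular_log *log, uint64_t off,
                      const void *src, uint32_t len) {
    const uint8_t *s = (const uint8_t *)src;
    for (uint32_t i = 0; i < len; i++)
        log->buf[(off + i) % LOG_SIZE] = s[i];
}

static void log_read(const struct circular_log *log, uint64_t off,
                     void *dst, uint32_t len) {
    uint8_t *d = (uint8_t *)dst;
    for (uint32_t i = 0; i < len; i++)
        d[i] = log->buf[(off + i) % LOG_SIZE];
}

/* Append an item (assumed to fit in the log); returns its logical offset,
 * which the hash index stores. */
static uint64_t log_append(struct circular_log *log,
                           const void *item, uint32_t len) {
    uint64_t need = sizeof(uint32_t) + len;   /* length header + payload */

    /* Evict the oldest items at the head until the new item fits (FIFO). */
    while (log->tail + need - log->head > LOG_SIZE) {
        uint32_t old_len;
        log_read(log, log->head, &old_len, sizeof(old_len));
        log->head += sizeof(old_len) + old_len;
    }

    uint64_t off = log->tail;
    log_write(log, off, &len, sizeof(len));
    log_write(log, off + sizeof(len), item, len);
    log->tail = off + need;
    return off;
}
```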

Lossy Concurrent Hash Index
- Indexes key-value items stored in the circular log
- Set-associative table: hash(key) selects a bucket, and each bucket holds a fixed number of entries
- Full bucket? Evict the oldest entry from it
- Result: fast indexing of new key-value items
[Diagram: hash(key) mapping to one of buckets 0 to N-1 in the hash index, with entries pointing to (key, value) items in the circular log.]
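
A minimal sketch of a lossy, set-associative index over the circular log; the bucket width, tag size, and round-robin overwrite (used here as a stand-in for evicting the oldest entry) are assumptions, not MICA's actual layout.

```c
/* Minimal sketch of a lossy, set-associative hash index.  Each bucket holds
 * a fixed number of entries mapping a short key-hash tag to an offset in the
 * circular log.  When a bucket is full, an existing entry is overwritten
 * (lossy), which is acceptable under cache semantics. */
#include <stdint.h>

#define NUM_BUCKETS        (1u << 16)   /* arbitrary; power of two */
#define ENTRIES_PER_BUCKET 8

struct index_entry {
    uint16_t tag;        /* a few bits of the key hash, for cheap filtering */
    uint64_t log_offset; /* where the full item lives in the circular log */
};

struct bucket {
    struct index_entry entries[ENTRIES_PER_BUCKET];
    uint32_t next_victim;   /* slot to overwrite next when the bucket is full */
};

static struct bucket hash_index[NUM_BUCKETS];

static void index_insert(uint64_t keyhash, uint64_t log_offset) {
    struct bucket *b = &hash_index[keyhash & (NUM_BUCKETS - 1)];
    /* Overwrite the next victim slot in round-robin order; if the bucket was
     * full, an older entry is simply lost (lossy index). */
    struct index_entry *e = &b->entries[b->next_victim];
    e->tag = (uint16_t)(keyhash >> 48);
    e->log_offset = log_offset;
    b->next_victim = (b->next_victim + 1) % ENTRIES_PER_BUCKET;
}

static int index_lookup(uint64_t keyhash, uint64_t *log_offset_out) {
    const struct bucket *b = &hash_index[keyhash & (NUM_BUCKETS - 1)];
    uint16_t tag = (uint16_t)(keyhash >> 48);
    for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (b->entries[i].tag == tag) {
            /* The caller must still compare the full key stored in the log,
             * since tags can collide (a production index would also track
             * which slots are valid; omitted here for brevity). */
            *log_offset_out = b->entries[i].log_offset;
            return 1;
        }
    }
    return 0;
}
```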

MICA’s “Store” Data Structures
- Required to preserve stored items; achieve similar performance by trading some memory
- Circular log -> segregated fits: 16 bytes of metadata per item, potentially more vulnerable to memory fragmentation than the circular log
- Lossy index -> lossless index (with bulk chaining): needs roughly 10% additional buckets provisioned for chaining
- See our paper for details
Note that the cache mode achieves both speed and memory efficiency by using the new data structures.
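
A rough sketch of the bulk-chaining idea for the lossless index (illustrative only; the paper's actual design differs in layout and allocation): when a bucket fills up, instead of evicting an entry, the bucket is linked to a spare bucket drawn from an extra pool sized at roughly 10% of the main table. All names and sizes below are assumptions.

```c
/* Rough sketch of bulk chaining for a lossless index.  Instead of evicting
 * from a full bucket, the bucket is linked to a spare bucket taken from a
 * small, preallocated extra pool. */
#include <stdint.h>
#include <stddef.h>

#define ENTRIES_PER_BUCKET 8
#define SPARE_POOL_SIZE    1024   /* ~10% extra buckets (assumed) */

struct ll_entry {
    uint16_t tag;
    uint64_t log_offset;
};

struct ll_bucket {
    struct ll_entry entries[ENTRIES_PER_BUCKET];
    uint32_t count;
    struct ll_bucket *next;   /* chained spare bucket, or NULL */
};

static struct ll_bucket spare_pool[SPARE_POOL_SIZE];
static uint32_t spare_used;

static struct ll_bucket *alloc_spare_bucket(void) {
    return (spare_used < SPARE_POOL_SIZE) ? &spare_pool[spare_used++] : NULL;
}

/* Insert without ever losing an entry; returns 0 only if the spare pool
 * is exhausted (a real store would size the pool to make this rare). */
static int ll_insert(struct ll_bucket *b, uint16_t tag, uint64_t off) {
    while (b->count == ENTRIES_PER_BUCKET) {
        if (b->next == NULL) {
            b->next = alloc_spare_bucket();   /* chain a spare bucket */
            if (b->next == NULL)
                return 0;
        }
        b = b->next;
    }
    b->entries[b->count].tag = tag;
    b->entries[b->count].log_offset = off;
    b->count++;
    return 1;
}
```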

Evaluation Going back to end-to-end evaluation… Throughput & latency characteristics

Throughput Comparison
[Chart: throughput (Mops) across workloads; end-to-end performance with kernel-bypass I/O.]
MICA shows similar performance regardless of skew or write ratio; there is a large performance gap to the other systems, which do badly at high write ratios.

Throughput-Latency on Ethernet
[Chart: average latency (μs) vs. throughput (Mops), 50% GET; error bars show the 5th and 95th percentiles.]
MICA delivers 200x+ throughput at 24-52 μs average latency, compared with original Memcached using standard socket I/O; both use UDP.

MICA: Redesigning In-Memory Key-Value Storage
- New software architecture and data structures: 1. parallel data access, 2. request direction, 3. key-value data structures (cache & store)
- Consistently high performance for diverse workloads: 65.6+ Mops/node even for heavy skew/write
- Good latency on Ethernet
Source code: github.com/efficient/mica

References
[DPDK] http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/packet-processing-is-enhanced-with-software-from-intel-dpdk.html
[FacebookMeasurement] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proc. SIGMETRICS 2012.
[Masstree] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In Proc. EuroSys 2012.
[MemC3] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proc. NSDI 2013.
[Memcached] http://memcached.org/
[RAMCloud] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In Proc. SOSP 2011.
[YCSB] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc. SoCC 2010.