MICA: A Holistic Approach to Fast In-Memory Key-Value Storage


1 MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim¹, Dongsu Han², David G. Andersen¹, Michael Kaminsky³
¹Carnegie Mellon University, ²KAIST, ³Intel Labs

2 Goal: Fast In-Memory Key-Value Store
Improve per-node performance (op/sec/node): less expensive, easier hotspot mitigation, lower latency for multi-key queries.
Target: small key-value items (fit in a single packet).
Non-goals: cluster architecture, durability.

3 Q: How Good (or Bad) are Current Systems?
Workload: YCSB [SoCC 2010], single-key operations, in-memory storage, logging turned off in our experiments.
Setup: end-to-end performance over the network, single server node.

4 End-to-End Performance Comparison
[Chart: throughput (M operations/sec) per system. Published results; logging on for RAMCloud/Masstree; our measurements use Intel DPDK (kernel-bypass I/O) with no logging; write-intensive workload.]

5 End-to-End Performance Comparison
[Same chart as slide 4, annotated: performance collapses under heavy writes.]

6 End-to-End Performance Comparison
[Same chart, with a line marking the maximum packets/sec attainable using UDP; MICA's advantage over the other systems is annotated at 4x and 13.5x.]

7 MICA Approach
MICA redesigns in-memory key-value storage, applying a new software architecture and new data structures to general-purpose hardware in a holistic way.
[Diagram: a client's CPU and NIC send requests to a server node (NIC, CPUs, memory); three components are labeled: 1. parallel data access, 2. request direction, 3. key-value data structures (cache & store).]

8 Parallel Data Access
Modern CPUs have many cores (8, 15, …). How can we exploit CPU parallelism efficiently?
[Diagram as on slide 7, highlighting 1. parallel data access.]

9 Parallel Data Access Schemes
[Diagram: under CRCW, all CPU cores share one memory region; under EREW, each core has its own partition.]
Concurrent Read, Concurrent Write (CRCW): any core can read or write any part of memory.
+ Good load distribution: load can be spread across multiple cores (used by Memcached, RAMCloud, MemC3, Masstree).
- Limited CPU scalability: lock contention, expensive cacheline transfers (MESI cache-coherence traffic) caused by concurrent writes to the same memory location, and cross-NUMA latency.
Exclusive Read, Exclusive Write (EREW): data is partitioned by the hash of keys (sharding / horizontal partitioning), and only one core accesses a given partition.
+ Good CPU scalability: avoids synchronization and inter-core communication (used by H-Store/VoltDB).
- Potentially low performance under skewed key popularity: a hot item cannot be served by multiple cores. In MICA, this turns out not to be so bad.
A sketch of keyhash-based EREW partitioning follows below.
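To make the EREW scheme concrete, here is a minimal C sketch of keyhash-based partitioning. Everything here is an assumption for illustration (the hash function, the names, and the partition count are not MICA's actual code); the point is only that the owning core is a pure function of the key hash.

#include <stddef.h>
#include <stdint.h>

#define NUM_PARTITIONS 16  /* assumed: typically one partition per core */

/* FNV-1a, standing in for MICA's actual key hash function */
static uint64_t key_hash(const void *key, size_t len) {
    const uint8_t *p = key;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* EREW: the partition (and hence the owning core) is derived purely
 * from the key hash, so each core touches only its own memory and
 * needs no locks on the data path. */
static unsigned partition_of(uint64_t keyhash) {
    return (unsigned)(keyhash % NUM_PARTITIONS);
}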

10 In MICA, Exclusive Outperforms Concurrent
[Chart: throughput (Mops) for EREW vs. CRCW; end-to-end performance with kernel-bypass I/O.]
Why does CRCW look so slow here when it is not that slow in other software? Because MICA has simpler structures and higher throughput, it exposes system bottlenecks more clearly, and end-to-end throughput differs from local benchmarks. Note that MICA in CRCW mode still outperforms Masstree in CRCW mode.

11 Request Direction
Request direction sends requests to the appropriate CPU cores for better data access locality. Exclusive access benefits from correct delivery: each request must be sent to the core owning the corresponding partition.
[Diagram as on slide 7, highlighting 2. request direction.]

12 Request Direction Schemes
Flow-based affinity: the NIC classifies packets by their 5-tuple; well supported by multi-queue NICs.
+ Good locality for flows and flow-based protocols (e.g., HTTP over TCP).
- Suboptimal for small key-value processing: a single client can request keys handled by different cores, and commodity NICs cannot parse variable-length key requests in the packet payload.
Object-based affinity: classification depends on the request content (the key).
+ Good locality for key access.
- Needs client assist or special hardware support for efficiency. MICA overcomes commodity NICs' limited programmability by using client assist (sketched below).
Q: What happens if a client sends a request to the wrong core? A: The receiving CPU can redirect it to the right one.
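A minimal C sketch of the client-assist idea, under the assumption that the server exposes one UDP port per core and configures its NIC to steer each port to that core's receive queue; the port layout, constants, and names are illustrative, not MICA's actual protocol.

#include <stddef.h>
#include <stdint.h>

#define SERVER_BASE_PORT 10000  /* assumed layout: one UDP port per core */
#define NUM_SERVER_CORES 16

uint64_t key_hash(const void *key, size_t len);  /* as in the earlier sketch */

/* The client derives the destination UDP port from the key hash, so the
 * NIC delivers the packet directly to the core that owns the key's
 * partition; no server-side redirection is needed in the common case. */
static uint16_t dest_port_for_key(const void *key, size_t len) {
    uint64_t h = key_hash(key, len);
    return (uint16_t)(SERVER_BASE_PORT + (unsigned)(h % NUM_SERVER_CORES));
}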

13 Crucial to Use NIC HW for Request Direction
[Chart: throughput (Mops) with EREW, 50% GET, using exclusive access for parallel data access; NIC hardware-based request direction vs. software-only direction (as in CPHash, Chronos).]

14 Key-Value Data Structures
Data structures have a significant impact on key-value processing speed; a new design is required to reach very high op/sec for both reads and writes. MICA offers "cache" and "store" modes.
[Diagram as on slide 7, highlighting 3. key-value data structures.]

15 MICA’s “Cache” Data Structures
Each partition has a circular log (for memory allocation) and a lossy concurrent hash index (for fast item access). Both exploit Memcached-like cache semantics, where lost data is easily recoverable (though not free): they favor fast processing while providing good memory efficiency and item eviction.

16 Circular Log
Allocates space for key-value items of any length by combining conventional logs with circular queues, giving simple garbage collection and free-space defragmentation. A new item is appended at the tail of a fixed-size log. Insufficient space for the new item? Evict the oldest items at the head (FIFO). LRU can be supported by reinserting recently accessed items at the tail. A sketch follows below.
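A minimal C sketch of the append path, under two stated assumptions: the names and layout are illustrative rather than MICA's actual code, and the buffer size is a power of two with the buffer mapped twice back-to-back in virtual memory, so a copy that wraps past the end stays contiguous (a common trick for circular logs).

#include <stdint.h>
#include <string.h>

struct circular_log {
    uint8_t *buf;    /* backing buffer; power-of-two size, double-mapped */
    uint64_t size;   /* fixed log size */
    uint64_t head;   /* offset of the oldest live item */
    uint64_t tail;   /* offset where the next item is appended */
};

/* assumed helper: reads the item header at 'offset' to find its length */
uint64_t item_size_at(struct circular_log *log, uint64_t offset);

static uint64_t log_append(struct circular_log *log,
                           const void *item, uint64_t len) {
    /* insufficient space? evict the oldest items at the head (FIFO) */
    while (log->tail + len - log->head > log->size)
        log->head += item_size_at(log, log->head);

    uint64_t off = log->tail;            /* offsets grow monotonically */
    memcpy(&log->buf[off & (log->size - 1)], item, len);
    log->tail += len;
    return off;                          /* stored in the hash index */
}

Reinserting a recently accessed item at the tail, for LRU-like behavior, is simply another log_append of the same item.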

17 Lossy Concurrent Hash Index
Indexes the key-value items stored in the circular log using a set-associative table: hash(key) selects one of N buckets, and each bucket entry points to a (key, value) item in the circular log. Full bucket? Evict the oldest entry from it. This keeps indexing of new key-value items fast; a sketch follows below.
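A minimal C sketch of the lossy insert path; the bucket geometry, the eviction order, and all names are assumptions for illustration (for instance, round-robin stands in for evicting the oldest entry). Losing an entry is acceptable under cache semantics because the item can be fetched again from the backing store.

#include <stdint.h>

#define BUCKET_WAYS 8            /* assumed set associativity */
#define NUM_BUCKETS (1u << 16)   /* assumed table size (power of two) */

struct index_entry {
    uint64_t keyhash;      /* for quick mismatch rejection on lookup */
    uint64_t log_offset;   /* item's place in the circular log; 0 = empty */
};

struct bucket {
    struct index_entry ways[BUCKET_WAYS];
    uint32_t next_victim;  /* round-robin eviction pointer */
};

static struct bucket table[NUM_BUCKETS];

static void index_insert(uint64_t keyhash, uint64_t log_offset) {
    struct bucket *b = &table[keyhash & (NUM_BUCKETS - 1)];
    for (int i = 0; i < BUCKET_WAYS; i++) {
        if (b->ways[i].log_offset == 0) {        /* free slot */
            b->ways[i].keyhash = keyhash;
            b->ways[i].log_offset = log_offset;
            return;
        }
    }
    /* full bucket: overwrite an existing entry (lossy eviction) */
    struct index_entry *e = &b->ways[b->next_victim++ % BUCKET_WAYS];
    e->keyhash = keyhash;
    e->log_offset = log_offset;
}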

18 MICA's "Store" Data Structures
Store mode is required to preserve stored items; it achieves similar performance by trading memory:
Circular log -> segregated fits: 16 bytes of metadata per item, potentially vulnerable to memory fragmentation compared to the circular log.
Lossy index -> lossless index (with bulk chaining): needs about 10% additional buckets provisioned for chaining (sketched below). See our paper for details.
Note that the cache mode stays both fast and memory-efficient by using the new data structures.
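A minimal C sketch of the bulk-chaining idea, with all names and the spare-pool interface assumed for illustration: rather than evicting from a full bucket as the lossy index does, the bucket links in a whole spare bucket, so no index entry is ever lost.

#include <stddef.h>
#include <stdint.h>

#define BUCKET_WAYS 8

struct index_entry {
    uint64_t keyhash;
    uint64_t log_offset;   /* 0 = empty slot */
};

struct chained_bucket {
    struct index_entry ways[BUCKET_WAYS];
    struct chained_bucket *next;   /* linked spare bucket, or NULL */
};

/* assumed helper: takes a bucket from the ~10% pre-provisioned spares */
struct chained_bucket *alloc_spare_bucket(void);

static void lossless_insert(struct chained_bucket *b,
                            uint64_t keyhash, uint64_t log_offset) {
    for (;;) {
        for (int i = 0; i < BUCKET_WAYS; i++) {
            if (b->ways[i].log_offset == 0) {   /* free slot found */
                b->ways[i].keyhash = keyhash;
                b->ways[i].log_offset = log_offset;
                return;
            }
        }
        if (b->next == NULL)
            b->next = alloc_spare_bucket();     /* chain a spare, never evict */
        b = b->next;
    }
}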

19 Evaluation
Going back to the end-to-end evaluation: throughput and latency characteristics.

20 Throughput Comparison
[Chart: throughput (Mops); end-to-end performance with kernel-bypass I/O. MICA shows similar performance regardless of skew or write ratio; there is a large performance gap to the other systems, which do badly at high write ratios. Speaker notes: (1) 95% vs. 50% GET performance, (2) the performance gap, (3) store-mode performance.]

21 Throughput-Latency on Ethernet
[Chart: average latency (μs) vs. throughput (Mops) at 50% GET; error bars show the 5th and 95th percentiles. MICA sustains 200x+ the throughput of original Memcached at 24-52 μs latency. Memcached uses standard socket I/O; both use UDP.]

22 MICA
MICA redesigns in-memory key-value storage with a new software architecture and new data structures (1. parallel data access, 2. request direction, 3. key-value data structures in cache & store modes), delivering consistently high performance for diverse workloads: 65.6+ Mops/node even under heavy skew and writes, with good latency on Ethernet.
Source code: github.com/efficient/mica

23 References
[DPDK] Intel Data Plane Development Kit.
[FacebookMeasurement] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proc. SIGMETRICS 2012.
[Masstree] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In Proc. EuroSys 2012.
[MemC3] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proc. NSDI 2013.
[Memcached] Memcached.
[RAMCloud] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In Proc. SOSP 2011.
[YCSB] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc. SoCC 2010.

