MICA: A Holistic Approach to Fast In-Memory Key-Value Storage — presentation transcript
1 MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim (Carnegie Mellon University), Dongsu Han (KAIST), David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs)
2 Goal: Fast In-Memory Key-Value Store
Improve per-node performance (op/sec/node): less expensive, easier hotspot mitigation, lower latency for multi-key queries
Target: small key-value items (fit in a single packet)
Non-goals: cluster architecture, durability
3 Q: How Good (or Bad) Are Current Systems?
Workload: YCSB [SoCC 2010] — single-key operations, in-memory storage, logging turned off in our experiments
End-to-end performance over the network, single server node
4-5 End-to-End Performance Comparison
[Chart: throughput (M operations/sec) across systems. Published results; logging on for RAMCloud/Masstree; MICA uses Intel DPDK (kernel-bypass I/O) with no logging; write-intensive workload.]
Takeaway: performance collapses under heavy writes in existing systems.
7 MICA Approach
MICA: redesigning in-memory key-value storage. It applies a new software architecture and new data structures to general-purpose hardware in a holistic way.
[Diagram: client → NIC → server node (CPU, memory), annotated with (1) parallel data access, (2) request direction, (3) key-value data structures (cache & store)]
8 Parallel Data Access
Modern CPUs have many cores (8, 15, …). How can we exploit CPU parallelism efficiently?
[Diagram: same architecture, highlighting (1) parallel data access]
9 Parallel Data Access Schemes
Concurrent Read/Write (CRCW): any core can read or write any part of memory, distributing load across all cores (Memcached, RAMCloud, MemC3, Masstree)
+ Good load distribution
- Limited CPU scalability: lock contention, and expensive cacheline transfers (MESI cache coherence protocol) caused by concurrent writes to the same memory location
- Cross-NUMA latency
Exclusive Read/Write (EREW): partition data by key hash (sharding / horizontal partitioning); only one core accesses a given partition (H-Store/VoltDB), as sketched below
+ Good CPU scalability: avoids synchronization and inter-core communication
- Potentially slow under skewed key popularity: a hot item cannot be served by multiple cores (in MICA, this turns out not to be so bad)
[Diagram: CRCW — all cores share one memory region; EREW — each core owns its own partition]
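A minimal sketch of keyhash-based EREW partitioning, assuming FNV-1a as the hash function and a fixed partition count (both illustrative; MICA's actual hash and partition configuration differ):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_PARTITIONS 16  /* hypothetical: one partition per core */

/* FNV-1a, standing in for MICA's key hash. */
static uint64_t keyhash(const void *key, size_t len) {
    const uint8_t *p = key;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Each key maps to exactly one partition, and each partition is owned by
 * exactly one core, so no locks or cross-core cacheline transfers occur. */
static unsigned partition_of(const char *key) {
    return (unsigned)(keyhash(key, strlen(key)) % NUM_PARTITIONS);
}

int main(void) {
    printf("key 'foo' -> partition %u\n", partition_of("foo"));
    return 0;
}
```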
10 In MICA, Exclusive Outperforms Concurrent
[Chart: throughput (Mops), EREW vs. CRCW; end-to-end performance with kernel-bypass I/O]
Why does CRCW look so slow here when it does not in local benchmarks? MICA's simpler structures and higher throughput expose the synchronization bottleneck clearly; end-to-end throughput differs from a local benchmark, and even MICA's CRCW mode outperforms Masstree's.
11 Request Direction
Sending requests to the appropriate CPU cores for better data-access locality. Exclusive access benefits from correct delivery: each request must be sent to the corresponding partition's core.
[Diagram: same architecture, highlighting (2) request direction]
12 Request Direction Schemes
Flow-based affinity: classification using the 5-tuple; well supported by multi-queue NICs
+ Good locality for flows and flow-based protocols (e.g., HTTP over TCP)
- Suboptimal for small key-value processing: a single client can request keys handled by different cores, and commodity NICs do not understand variable-length key requests in packet payloads
Object-based affinity: classification depends on request content
+ Good locality for key access
- Needs client assist or special HW support for efficiency; MICA overcomes commodity NICs' limited programmability by using client assist (sketched below)
Q: What happens if a client sends a request to the wrong core? A: The receiving core can redirect it to the right one.
[Diagram: two clients sending Key 1 / Key 2 requests through the NIC to server cores under each scheme]
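A minimal sketch of client assist, under the assumption that the server exposes one UDP port per core and the NIC steers each port to that core's RX queue; the port layout and hash are illustrative, not MICA's wire protocol:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BASE_PORT    9000  /* hypothetical per-core UDP port range */
#define SERVER_CORES 16

/* FNV-1a, standing in for the key hash shared by client and server. */
static uint64_t keyhash(const void *key, size_t len) {
    const uint8_t *p = key;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* The client derives the destination port from the key hash, so the NIC
 * delivers the request directly to the core owning the key's partition,
 * without the NIC ever parsing the variable-length key in the payload. */
static uint16_t dest_port_for_key(const char *key) {
    return (uint16_t)(BASE_PORT + keyhash(key, strlen(key)) % SERVER_CORES);
}

int main(void) {
    printf("GET 'foo' -> send to UDP port %u\n", dest_port_for_key("foo"));
    return 0;
}
```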
13 Crucial to Use NIC HW for Request Direction
[Chart: throughput (Mops); EREW, 50% GET; NIC-based direction vs. software-only direction (CPHash, Chronos)]
All configurations use exclusive access for parallel data access.
14 Key-Value Data Structures
Significant impact on key-value processing speed. A new design is required for very high op/sec for both reads and writes. Two modes: “cache” and “store”.
[Diagram: same architecture, highlighting (3) key-value data structures]
15 MICA’s “Cache” Data Structures
Each partition has:
- a circular log (for memory allocation)
- a lossy concurrent hash index (for fast item access)
These exploit Memcached-like cache semantics: lost data is easily recoverable (not free, though). They favor fast processing while providing good memory efficiency and item eviction.
16 Circular Log
Allocates space for key-value items of any length: conventional logs + circular queues, with simple garbage collection / free-space defragmentation.
A new item is appended at the tail (the log size is fixed). Insufficient space for the new item? Evict the oldest item at the head (FIFO). LRU can be supported by reinserting recently accessed items. A toy implementation follows.
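A toy circular log with FIFO eviction, assuming item sizes are multiples of 8 bytes and no larger than the log itself; the real MICA log stores variable-length keys/values with richer item headers:

```c
#include <stdint.h>
#include <stdlib.h>

struct log_item { uint32_t size; /* total bytes incl. header; multiple of 8 */ };

struct circular_log {
    uint8_t *buf;
    uint64_t size;        /* fixed log size, a power of two */
    uint64_t head, tail;  /* monotonically increasing byte offsets */
};

/* Append at the tail; evict the oldest items at the head (FIFO) until
 * there is room.  Items never straddle the buffer end: a pad record is
 * written instead, and eviction skips it like any other item. */
static struct log_item *log_append(struct circular_log *log, uint32_t size) {
    for (;;) {
        uint64_t off = log->tail & (log->size - 1);
        int pad = (off + size > log->size);
        uint64_t need = pad ? (log->size - off) + size : size;
        while (log->tail + need - log->head > log->size)  /* evict oldest */
            log->head += ((struct log_item *)
                (log->buf + (log->head & (log->size - 1))))->size;
        if (!pad) break;
        struct log_item *p = (struct log_item *)(log->buf + off);
        p->size = (uint32_t)(log->size - off);   /* pad record to the end */
        log->tail += p->size;
    }
    struct log_item *it =
        (struct log_item *)(log->buf + (log->tail & (log->size - 1)));
    it->size = size;
    log->tail += size;
    return it;
}

int main(void) {
    struct circular_log log = { .buf = malloc(1024), .size = 1024 };
    for (int i = 0; i < 100; i++)
        log_append(&log, 64);   /* old items fall off the head */
    free(log.buf);
    return 0;
}
```

Because head and tail only increase, this toy version can treat any index entry with an offset below head as stale.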
17 Lossy Concurrent Hash Index
Indexes key-value items stored in the circular log. Set-associative table: hash(key) selects a bucket (bucket 0 … bucket N-1); each bucket holds a fixed number of entries pointing to key/value items in the circular log. Full bucket? Evict the oldest entry from it. This gives fast indexing of new key-value items; a toy version appears below.
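A toy lossy, set-associative index keyed by log offset, assuming 8-way buckets, a power-of-two bucket count, and offset 0 reserved to mean "empty"; MICA's index additionally stores key-hash tags and version counters for optimistic concurrent reads:

```c
#include <stdint.h>
#include <stdlib.h>

#define BUCKET_WAYS 8

struct bucket {
    /* Item offsets into the circular log; 0 marks an empty slot
     * (assume the log never hands out offset 0). */
    uint64_t offsets[BUCKET_WAYS];
};

struct hash_index {
    struct bucket *buckets;
    uint64_t num_buckets;   /* a power of two */
};

/* Insert into the bucket chosen by the key hash.  If the bucket is full,
 * evict the oldest entry in it: since the log allocates offsets in
 * increasing order, the smallest offset is the oldest item.  The entry
 * is simply dropped -- lossy by design. */
static void index_insert(struct hash_index *idx,
                         uint64_t keyhash, uint64_t log_offset) {
    struct bucket *b = &idx->buckets[keyhash & (idx->num_buckets - 1)];
    unsigned victim = 0;
    for (unsigned i = 0; i < BUCKET_WAYS; i++) {
        if (b->offsets[i] == 0) { victim = i; break; }      /* empty slot */
        if (b->offsets[i] < b->offsets[victim]) victim = i; /* oldest */
    }
    b->offsets[victim] = log_offset;
}

int main(void) {
    struct hash_index idx = {
        .buckets = calloc(1024, sizeof(struct bucket)),
        .num_buckets = 1024,
    };
    index_insert(&idx, 0xdeadbeefULL, 8);  /* hash and offset are dummies */
    free(idx.buckets);
    return 0;
}
```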
18 MICA’s “Store” Data Structures
Required to preserve stored items; achieve similar performance by trading memory:
- Circular log -> segregated fits: 16 bytes of metadata per item, potentially more vulnerable to memory fragmentation than the circular log
- Lossy index -> lossless index (with bulk chaining): about 10% additional buckets must be provisioned for chaining
See our paper for details. Note that the cache mode achieves both speed and memory efficiency by using the new data structures. A bulk-chaining sketch follows.
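A minimal sketch of a bulk-chained (lossless) bucket, assuming a spare-bucket pool provisioned at roughly 10% of the main table; the field layout and pool are illustrative, not MICA's:

```c
#include <stdint.h>
#include <stddef.h>

#define BUCKET_WAYS 7   /* one slot's worth of space used for the chain pointer */

struct chained_bucket {
    uint64_t offsets[BUCKET_WAYS];   /* 0 = empty slot */
    struct chained_bucket *next;     /* extra bucket from the pool, or NULL */
};

/* Hypothetical spare pool, sized at ~10% of the main table. */
static struct chained_bucket spare_pool[128];
static size_t spare_used;

static struct chained_bucket *alloc_spare_bucket(void) {
    return spare_used < 128 ? &spare_pool[spare_used++] : NULL;
}

/* Store mode: instead of evicting when a bucket fills (as the lossy
 * cache-mode index does), chain a whole spare bucket, so lookups still
 * cost roughly one cache line per hop. */
static int index_insert_lossless(struct chained_bucket *b, uint64_t off) {
    for (;;) {
        for (unsigned i = 0; i < BUCKET_WAYS; i++)
            if (b->offsets[i] == 0) { b->offsets[i] = off; return 0; }
        if (b->next == NULL && (b->next = alloc_spare_bucket()) == NULL)
            return -1;   /* spare pool exhausted */
        b = b->next;
    }
}

int main(void) {
    static struct chained_bucket table[16];
    for (uint64_t off = 8; off < 8 + 64 * 100; off += 64)
        index_insert_lossless(&table[0], off);  /* force chaining */
    return 0;
}
```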
19 Evaluation
Going back to the end-to-end evaluation: throughput & latency characteristics.
20 Throughput Comparison
[Chart: throughput (Mops) across systems and workloads; end-to-end performance with kernel-bypass I/O]
MICA: similar performance regardless of skew or write ratio. Large performance gap over the other systems, which are bad at high write ratios.
21 Throughput-Latency on Ethernet
[Chart: throughput (Mops) vs. average latency (μs); 50% GET; error bars: 5th/95th percentile]
MICA achieves 200x+ the throughput of original Memcached at 24-52 μs average latency. Original Memcached uses standard socket I/O; both use UDP.
22 MICA
Redesigning in-memory key-value storage: a new software architecture & data structures, applied holistically across (1) parallel data access, (2) request direction, and (3) key-value data structures (cache & store).
Consistently high performance for diverse workloads: 65.6+ Mops/node even under heavy skew/writes, with good latency on Ethernet.
Source code: github.com/efficient/mica
23 References
[DPDK] Intel. Data Plane Development Kit (DPDK). http://dpdk.org/
[FacebookMeasurement] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload Analysis of a Large-Scale Key-Value Store. In Proc. SIGMETRICS 2012.
[Masstree] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In Proc. EuroSys 2012.
[MemC3] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In Proc. NSDI 2013.
[Memcached] Memcached: a distributed memory object caching system. http://memcached.org/
[RAMCloud] Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum. Fast Crash Recovery in RAMCloud. In Proc. SOSP 2011.
[YCSB] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking Cloud Serving Systems with YCSB. In Proc. SoCC 2010.