Cache Craftiness for Fast Multicore Key-Value Storage


1 Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

2 Let’s build a fast key-value store
KV store systems are important: Google Bigtable, Amazon Dynamo, Yahoo! PNUTS
Single-server KV performance matters: reduce cost, easier management
Goal: fast KV store for a single multi-core server
Assume all data fits in memory (e.g. Redis, VoltDB)

3 Feature wish list Clients send queries over network
Persist data across crashes
Range queries
Perform well on various workloads, including hard ones!

4 Hard workloads
Skewed key popularity: hard! (load imbalance)
Small key-value pairs: hard! Per-query processing overhead dominates, so the server must process queries efficiently
Many puts
Arbitrary keys, string or integer: some systems handle integer keys much better than string keys; we want to handle both well

5 First try: fast binary tree
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
Network and disk are not the bottlenecks: high-bandwidth NIC, multiple disks
3.7 million queries/second! Can we do better? What bottleneck remains? DRAM!

6 Cache craftiness goes 1.5X farther
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
Cache-craftiness: careful use of cache and memory
To address the DRAM bottleneck, we applied cache-craftiness, which takes us 1.5X farther; that factor is the focus of the rest of the talk.

7 Contributions
Masstree achieves millions of queries per second across various hard workloads:
Skewed key popularity
Various read/write ratios
Variable, relatively long keys
Data >> on-chip cache
New ideas: trie of B+trees, permuter, etc.
Full system: new ideas + best practices (network, disk, etc.)

8 Experiment environment
A 16-core server: three active DRAM nodes, 64 GB DRAM, a single 10Gb Network Interface Card (NIC), four SSDs
A cluster of load generators
The performance numbers in this talk are measured on this setup.

9 Potential bottlenecks in Masstree
[Diagram: a single multi-core server; potential bottlenecks are the network, DRAM, and the disks that receive the logs]

10 NIC bottleneck can be avoided
Single 10Gb NIC, multiple queues, scales to many cores
Target: 100B KV pairs => 10M req/sec
Use the network stack efficiently: pipeline requests, avoid copying cost

11 Disk bottleneck can be avoided
10M puts/sec => 1 GB of logs/sec! Too much for a single disk
Multiple disks: split the log (see paper for details)
Write throughput vs. cost: mainstream disk: MB/sec, ~1 $/GB; high-performance SSD: up to 4.4 GB/sec, > 40 $/GB
(FusionIO ioDrive: 750 MB/s, $7,100; the FusionIO ioDrive Octal is the fastest SSD)
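As a concrete illustration of the split-log idea, here is a minimal, hypothetical sketch (not Masstree's actual logging code) that gives each core its own log file and spreads those files round-robin across the available disks; the mount points and the 16-core count are assumptions taken from the experiment setup.

```cpp
// Hypothetical sketch: split the update log across several disks by giving
// each core its own log file, assigned round-robin to the available disks,
// so sequential log writes are spread out instead of hitting one disk.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int ncores = 16;                      // assumed, per the setup slide
    const std::vector<std::string> disks = {    // hypothetical mount points
        "/disk0", "/disk1", "/disk2", "/disk3"};

    for (int core = 0; core < ncores; ++core) {
        std::string path = disks[core % disks.size()]
                         + "/log-core" + std::to_string(core) + ".bin";
        std::printf("core %2d -> %s\n", core, path.c_str());
    }
    return 0;
}
```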

12 DRAM bottleneck – hard to avoid
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
Cache-craftiness goes 1.5X farther, including the cost of network and disk.

13 DRAM bottleneck – w/o network/disk
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
Cache-craftiness goes 1.7X farther!
To focus on the DRAM bottleneck, set aside network and disk; cache-craftiness then does 1.7X better. This factor of 1.7 is what the rest of the talk is about: each of the steps we took before arriving at Masstree.

14 DRAM latency – binary tree
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores; diagram of a binary tree]
O(log2 N) serial DRAM latencies per lookup!
We start with a binary tree, a reasonable baseline; the numbers do not include the cost of network or logging.
2.7 us/lookup => 380K lookups/core/sec
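To make the cost concrete, here is a minimal sketch of the pointer-chasing pattern behind the O(log2 N) figure: each step loads a node whose address depends on the previous load, so the cache misses cannot overlap. The node layout is illustrative, not the talk's exact structure.

```cpp
// Why a plain binary tree costs ~log2(N) serial DRAM latencies per lookup:
// every step dereferences a pointer produced by the previous load.
#include <cstdint>

struct BinNode {
    uint64_t key;
    uint64_t value;
    BinNode* left;
    BinNode* right;
};

// Each iteration is one dependent cache miss when the node is not cached.
const BinNode* lookup(const BinNode* n, uint64_t key) {
    while (n) {
        if (key == n->key) return n;
        n = (key < n->key) ? n->left : n->right;   // serial dependency
    }
    return nullptr;
}
```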

15 DRAM latency – Lock-free 4-way tree
Concurrency: same as the binary tree
One cache line per node => 3 KV / 4 children
Half the levels of the binary tree => half the serial DRAM latencies
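One way such a node could be laid out so it fits a single 64-byte cache line is sketched below; the exact field layout is an assumption, not the talk's actual structure (values could live in the child slots at the leaves, for example).

```cpp
// Illustrative 4-way tree node sized to one 64-byte cache line:
// up to 3 keys and 4 children, plus a little metadata.
#include <cstdint>

struct alignas(64) FourNode {
    uint64_t  keys[3];       // 24 bytes: up to 3 keys per node
    FourNode* child[4];      // 32 bytes: 4 children (or leaf values)
    uint32_t  nkeys;         //  4 bytes: number of keys in use
    uint32_t  meta;          //  4 bytes: spare metadata, pads to 64 bytes
};

static_assert(sizeof(FourNode) == 64, "node should fit one cache line");
```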

16 4-tree beats binary tree by 40%
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
But the 4-tree has no balance guarantee, which can mean terrible performance => we want a balanced tree that still has wide fanout => B+tree

17 4-tree may perform terribly!
Unbalanced: O(N) serial DRAM latencies, e.g. after sequential inserts the tree degenerates to O(N) levels
Want a balanced tree with wide fanout

18 B+tree – Wide and balanced
Concurrent main-memory B+tree [OLFIT]
Optimistic concurrency control: version technique
Lookup/scan is lock-free
Puts hold ≤ 3 per-node locks
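A hedged sketch of the version technique as we understand the OLFIT-style scheme: writers bump a per-node version around their update, and lock-free readers re-check the version after reading and retry if it changed. Field names and the odd/even convention are illustrative assumptions, not the paper's exact protocol.

```cpp
// Optimistic, version-validated reads: no reader locks, writers bump the
// version so readers can detect a concurrent modification and retry.
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> version{0};   // odd = writer in progress (assumed)
    uint64_t keys[15];
    void*    children[16];
};

// Writers: make the version odd before modifying, even again afterwards.
void write_lock(Node* n)   { n->version.fetch_add(1, std::memory_order_acquire); }
void write_unlock(Node* n) { n->version.fetch_add(1, std::memory_order_release); }

// Readers: retry until they observe a stable, even version across the read.
template <typename F>
auto optimistic_read(const Node* n, F&& read_fields) {
    for (;;) {
        uint64_t v1 = n->version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                       // writer active, retry
        auto result = read_fields(n);               // read node contents
        std::atomic_thread_fence(std::memory_order_acquire);
        uint64_t v2 = n->version.load(std::memory_order_relaxed);
        if (v1 == v2) return result;                // no concurrent change
    }
}
```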

19 Wide fanout B+tree is 11% slower!
[Chart: throughput (req/sec, millions), 140M short KV, put-only]
Fanout = 15: fewer levels than the 4-tree, but at least as many cache lines fetched from DRAM
4-tree: each internal node is full; B+tree: nodes are ~75% full
So serial DRAM latencies >= 4-tree, and the B+tree is 11% slower

20 B+tree – Software prefetch
Same idea as [pB+-trees]
Masstree: B+tree with fanout 15 => each node spans 4 cache lines
Always prefetch the whole node when it is accessed
Result: one DRAM latency per node instead of 2, 3, or 4 (fetching 4 lines costs about the same as 1)
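A minimal sketch of the prefetch idea, assuming a 4-cache-line node and the GCC/Clang __builtin_prefetch intrinsic; the node layout is illustrative rather than Masstree's exact one.

```cpp
// Issue prefetches for all four cache lines of a node as soon as its address
// is known, so the DRAM fetches overlap instead of being paid one at a time.
#include <cstdint>

struct alignas(64) BPlusNode {
    uint64_t keys[15];        // fanout 15 -> 120 bytes of keys
    void*    children[16];    // 128 bytes of child pointers
    uint32_t nkeys;
    uint32_t version;
};                            // 256 bytes = 4 cache lines

inline void prefetch_node(const BPlusNode* n) {
    const char* p = reinterpret_cast<const char*>(n);
    for (int line = 0; line < 4; ++line)
        __builtin_prefetch(p + 64 * line);   // lines fetched in parallel
}
```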

21 B+tree with prefetch
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
Beats the 4-tree by 9%
Balanced beats unbalanced!

22 Concurrent B+tree problem
Lookups retry in case of a concurrent insert: inserting into a sorted node moves keys, exposing an intermediate state
Lock-free 4-tree: not a problem, since keys do not move around, but it is unbalanced
[Diagram: insert(B) into a node holding A C D shifts keys, so a concurrent lookup can observe the intermediate state]

23 B+tree optimization - Permuter
Permuter: a 64-bit integer that defines the key order within a tree node
Keys are stored unsorted; the permuter defines their order
Lookup uses the permuter to search keys, so a concurrent lookup does not need to retry
Insert appears atomic to lookups
[Diagram: insert(B) writes B into a free slot, then publishes a new permuter]
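A simplified, hedged sketch of how such a permuter could work; the exact bit packing is an assumption, not Masstree's kpermuter layout. The key point is that an insert fills a free slot first and then publishes the new order with a single atomic store, so lookups never see a half-updated node.

```cpp
// Low 4 bits: number of live keys. Bits 4*(i+1)..4*(i+1)+3: the slot holding
// the i-th key in sorted order (supports up to 15 keys per node).
#include <atomic>
#include <cstdint>

using permuter_t = uint64_t;

inline int perm_size(permuter_t p)        { return p & 0xF; }
inline int perm_slot(permuter_t p, int i) { return (p >> (4 * (i + 1))) & 0xF; }

// Build a new permuter that inserts `slot` at sorted position `pos`.
inline permuter_t perm_insert(permuter_t p, int pos, int slot) {
    int n = perm_size(p);
    permuter_t q = n + 1;                         // new count
    for (int i = 0, j = 0; i < n + 1; ++i) {
        int s = (i == pos) ? slot : perm_slot(p, j++);
        q |= permuter_t(s) << (4 * (i + 1));
    }
    return q;
}

struct LeafNode {
    uint64_t keys[15];
    void*    values[15];
    std::atomic<permuter_t> permutation{0};
};

// Writer protocol (not shown): fill keys[slot]/values[slot] in a free slot
// first, then publish the new permuter with a release store, so readers see
// either the old order or the complete new one.
void* lookup(const LeafNode* n, uint64_t key) {
    permuter_t p = n->permutation.load(std::memory_order_acquire);
    for (int i = 0; i < perm_size(p); ++i) {
        int s = perm_slot(p, i);
        if (n->keys[s] == key) return n->values[s];
    }
    return nullptr;
}
```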

24 B+tree with permuter
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
Improves throughput by 4%

25 Performance drops dramatically when key length increases
[Chart: throughput (req/sec, millions) vs. key length; short values, 16 cores, no logging; keys differ in the last 8B]
Why? A common problem with most B+tree designs: the key suffix is stored indirectly, so each key comparison compares the full key and costs an extra DRAM fetch

26 Masstree – Trie of B+trees
Trie: a tree where each level is indexed by a fixed-length key fragment
Masstree: a trie with fanout 2^64, where each trie node is a B+tree
Compresses key prefixes!
[Diagram: B+tree indexed by k[0:7], whose leaves point to B+trees indexed by k[8:15], then k[16:23], ...]
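A small sketch of how a long key could be cut into the fixed 8-byte fragments that index successive trie layers; big-endian packing keeps integer comparison consistent with byte-wise key order. The helper itself is illustrative, not Masstree's code.

```cpp
// Return the 8-byte slice of `key` used by trie layer `layer`:
// layer 0 covers bytes 0..7, layer 1 covers bytes 8..15, and so on.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

uint64_t key_slice(const std::string& key, int layer) {
    unsigned char buf[8] = {0};                  // zero-pad short tails
    size_t off = size_t(layer) * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off,
                    std::min<size_t>(8, key.size() - off));
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)                  // big-endian: preserves order
        v = (v << 8) | buf[i];
    return v;
}
```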

27 Case Study: Keys share a P-byte prefix – Better than a single B+tree
Masstree: the first P/8 trie levels each contain only one node, followed in effect by a single B+tree over 8B key slices
DRAM accesses: Masstree O(log N) vs. a single B+tree O(P log N)

28 Masstree performs better for long keys with prefixes
[Chart: throughput (req/sec, millions) vs. key length; short values, 16 cores, no logging]
8B key comparisons instead of full-key comparisons

29 Does trie of B+trees hurt short key performance?
[Chart: throughput (req/sec, millions), 140M short KV, 16 cores]
No: it is 8% faster! More efficient code, since internal nodes handle 8B keys only

30 Evaluation
I have shown some evaluation in previous slides; several other interesting questions remain about the final design:
How does Masstree compare to other systems?
How does Masstree compare to partitioned trees? How much do we pay for handling skewed workloads?
How does Masstree compare with a hash table? How much do we pay for supporting range queries?
Does Masstree scale on many cores?

31 Masstree performs well even with persistence and range queries
[Chart: throughput (req/sec, millions), 20M short KV, uniform dist., 16 cores, w/ network; comparison points at 0.04 and 0.22]
Memcached: not persistent, no range queries; Redis: no range queries
The comparison is somewhat unfair: both have a richer data and query model

32 Multi-core – Partition among cores?
One option: multiple instances, each with its own unique set of keys (Memcached, Redis, VoltDB)
Masstree: a single shared tree, so each core can access all keys, which reduces imbalance
[Diagram: per-core partitioned trees vs. one shared tree]

33 A single Masstree performs better for skewed workloads
[Chart: throughput (req/sec, millions) vs. skew δ, 140M short KV, 16 cores, w/ network; one partition receives δ times more queries than the others]
Partitioned trees have no remote DRAM access and no concurrency control (it was disabled for this comparison), but they suffer load imbalance
Example point: 1 partition receives 40% of the queries and the other 15 receive 4% each; the partitioned system is then 80% idle

34 Cost of supporting range queries
Without range queries, one could use a hash table:
No resize cost: pre-allocate a large hash table
Lock-free: update with cmpxchg
Supports 8B keys only: efficient code
30% full, so each lookup averages 1.1 hash probes
Measured in the Masstree framework: 2.5X the throughput of Masstree
=> Supporting range queries costs about 2.5X in performance
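A hedged sketch of our reading of those bullet points: a preallocated, open-addressed table for 8-byte keys whose slots are claimed with a single compare-exchange. This illustrates the described comparison point, not the authors' actual benchmark code.

```cpp
// Fixed-size, lock-free hash table for 8B keys: no resizing, slots claimed
// with compare_exchange. Assumes nonzero keys and a table kept well under
// full (the slide's table is ~30% full), so probes terminate quickly.
#include <atomic>
#include <cstdint>
#include <vector>

struct Slot {
    std::atomic<uint64_t> key{0};     // 0 means empty
    std::atomic<uint64_t> value{0};
};

class FixedHash {
    std::vector<Slot> slots_;
public:
    explicit FixedHash(size_t n) : slots_(n) {}   // n assumed a power of two

    bool put(uint64_t key, uint64_t value) {
        size_t mask = slots_.size() - 1;
        for (size_t i = key * 0x9E3779B97F4A7C15ull; ; ++i) {   // linear probe
            Slot& s = slots_[i & mask];
            uint64_t k = s.key.load(std::memory_order_acquire);
            if (k == key) { s.value.store(value); return true; }
            if (k == 0) {
                uint64_t expected = 0;            // claim empty slot atomically
                if (s.key.compare_exchange_strong(expected, key)) {
                    s.value.store(value);
                    return true;
                }
                if (expected == key) { s.value.store(value); return true; }
            }
        }
    }

    bool get(uint64_t key, uint64_t& out) const {
        size_t mask = slots_.size() - 1;
        for (size_t i = key * 0x9E3779B97F4A7C15ull; ; ++i) {
            const Slot& s = slots_[i & mask];
            uint64_t k = s.key.load(std::memory_order_acquire);
            if (k == key) { out = s.value.load(); return true; }
            if (k == 0) return false;             // hit an empty slot: absent
        }
    }
};
```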

35 Scales to 12X on 16 cores
[Chart: throughput per core (req/sec/core, millions) vs. number of cores; short KV, w/o logging; 'perfect scalability' shown for reference]
Gets scale to 12X; puts scale similarly
Limited by the shared memory system; may scale better with fewer DRAM nodes

36 Related work
[OLFIT]: optimistic concurrency control, but does not support variable-length keys
[pB+-trees]: B+tree with software prefetch, but not concurrent
[pkB-tree]: stores a fixed number of differing key bits inline, but not concurrent
[PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT], but higher latency (100us vs. Masstree's 1us)
Masstree builds on several previous ideas: it is the first system to combine them, with new optimizations (trie of B+trees, permuter)

37 Summary
Masstree: a general-purpose, high-performance, persistent KV store
5.8 million puts/sec, 8 million gets/sec
Cache-craftiness improves performance by 1.5X
More comparisons with other systems in the paper

38 Thank you!

