Cache Craftiness for Fast Multicore Key-Value Storage

Presentation transcript:

Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

Let's build a fast key-value store
Key-value (KV) store systems are important: Google Bigtable, Amazon Dynamo, Yahoo! PNUTS. Single-server KV performance matters because it reduces cost and eases management. Goal: a fast KV store for a single multi-core server, assuming all data fits in memory (as in Redis and VoltDB).

Feature wish list
- Clients send queries over the network
- Data persists across crashes
- Range queries
- Good performance on various workloads, including hard ones!

Hard workloads
- Skewed key popularity: hard because of load imbalance
- Small key-value pairs: hard because per-query processing overhead dominates, so the server must process queries very efficiently
- Many puts
- Arbitrary keys: strings (e.g. www.wikipedia.org/...) or integers; some prior work handles integer keys much better than string keys, and we want to handle both well

First try: fast binary tree
140M short KV pairs, put-only, 16 cores. The network and disk are not bottlenecks here (high-bandwidth NIC, multiple disks), and we get 3.7 million queries/second. Can we do better? What bottleneck remains? DRAM.

Cache craftiness goes 1.5X farther
Same workload: 140M short KV pairs, put-only, 16 cores. To address the DRAM bottleneck we applied cache craftiness, by which we mean careful use of cache and memory. This 1.5X improvement is the focus of the rest of the talk.

Contributions
Masstree achieves millions of queries per second across various hard workloads: skewed key popularity, various read/write ratios, variable and relatively long keys, and data much larger than the on-chip cache. It combines new ideas (a trie of B+trees, the permuter, etc.) with best practices for the full system (network, disk, etc.).

Experiment environment
A 16-core server with three active DRAM nodes, a single 10Gb Network Interface Card (NIC), four SSDs, and 64 GB of DRAM, driven by a cluster of load generators. The performance numbers in this talk were measured on this setup.

Potential bottlenecks in Masstree
Within a single multi-core server: the network, DRAM, and the disks used for logging.

NIC bottleneck can be avoided
A single 10Gb NIC with multiple queues scales to many cores. Target: with 100-byte KV pairs, 10M requests/sec. Use the network stack efficiently: pipeline requests and avoid copying costs.

Disk bottleneck can be avoided
10M puts/sec means about 1 GB of logs per second, more than a single disk can absorb. With multiple disks, the log is split across them (see the paper for details).
Write throughput and cost:
- Mainstream disk: 100-300 MB/sec, ~1 $/GB
- High-performance SSD: up to 4.4 GB/sec, > 40 $/GB (e.g. Fusion-io ioDrive, 750 MB/s, $7,100: http://www.sabrepc.com/p-2510-fusion-io-fs1-004-160-es-0001-160gb-iodrive-single-level-cell.aspx ; FusionIO ioDrive Octal: http://www.fastestssd.com/featured/ssd-rankings-the-fastest-solid-state-drives/)

DRAM bottleneck – hard to avoid
140M short KV pairs, put-only, 16 cores. Cache craftiness goes 1.5X farther, even including the cost of network and disk.

DRAM bottleneck – without network/disk
Same workload, with network and disk taken out of the picture: cache craftiness goes 1.7X farther. This factor of 1.7 is what the rest of the talk is about; the following slides walk through each step we took before arriving at Masstree.

DRAM latency – binary tree
140M short KV pairs, put-only, 16 cores; the numbers do not include the cost of network or logging. We start with a binary tree, which is a reasonable design. A lookup walks O(log2 N) nodes, and each step is a serial DRAM latency: about 2.7 µs per lookup, or 380K lookups/core/sec (10M keys; cf. VoltDB).
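To make the DRAM cost concrete, here is a minimal, hypothetical C++ sketch of a binary-tree lookup (not the talk's code); the point is that each level's load depends on the previous one, so the cache misses cannot overlap.

```cpp
// Hypothetical sketch (not Masstree code): a plain binary search tree lookup.
// Each step dereferences a pointer chosen by the previous comparison, so the
// CPU cannot issue the next node's load until the current one returns from
// DRAM, i.e. roughly log2(N) serial DRAM latencies per lookup.
#include <cstdint>
#include <string>

struct Node {
    uint64_t key;
    std::string value;
    Node* left;
    Node* right;
};

const std::string* lookup(const Node* n, uint64_t key) {
    while (n) {
        if (key == n->key)
            return &n->value;
        // Dependent load: the address of the next node is unknown until
        // n has been fetched, so the misses serialize.
        n = (key < n->key) ? n->left : n->right;
    }
    return nullptr;
}
```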

DRAM latency – lock-free 4-way tree
Concurrency is handled the same way as in the binary tree. Each node occupies one cache line and holds 3 key-value pairs and 4 children, so the tree has half as many levels as the binary tree and a lookup incurs half as many DRAM latencies.
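As a rough illustration of the "3 KV / 4 children in one cache line" idea, here is a hedged sketch; the exact field layout is my assumption (values are assumed to live out of line, reached through pointers), not the talk's actual structure.

```cpp
// Minimal sketch of a cache-line-sized 4-way node (assumed layout, not the
// paper's structures). With 8-byte keys and pointers, 3 keys + 4 child
// pointers fit in a single 64-byte cache line, so one DRAM fetch reveals a
// 4-way choice instead of a binary one.
#include <cstdint>

struct alignas(64) FourNode {
    uint64_t keys[3];        // 24 bytes: separator keys, kept sorted
    FourNode* child[4];      // 32 bytes: subtrees for (<k0), [k0,k1), [k1,k2), [k2,...)
    uint32_t nkeys;          //  4 bytes: how many of keys[] are in use
    uint32_t version;        //  4 bytes: spare word for concurrency metadata
};                           // total: 64 bytes = one cache line

static_assert(sizeof(FourNode) == 64, "node should occupy one cache line");
```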

4-tree beats binary tree by 40%
140M short KV pairs, put-only, 16 cores. But the 4-tree does not actually work well: it is not balanced, and an unbalanced tree can perform terribly. We want a balanced tree that still has wide fanout, which leads to the B+tree.

4-tree may perform terribly!
Because it is unbalanced, the worst case is O(N) levels and O(N) serial DRAM latencies, e.g. under sequential inserts. We want a balanced tree with wide fanout.

B+tree – wide and balanced
A concurrent main-memory B+tree in the style of [OLFIT]. Optimistic concurrency control with a version technique: lookups and scans are lock-free, and puts hold at most 3 per-node locks.
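The version technique can be sketched roughly as follows. Field and helper names are mine, and this is only the general OLFIT-style pattern, not Masstree's actual code.

```cpp
// Hedged sketch of the optimistic "version technique" for lock-free reads.
// A writer sets a dirty bit while modifying the node and bumps the version
// when done; a reader snapshots the version, reads the node, and retries if
// the version changed (or was dirty) in between.
#include <atomic>
#include <cstdint>

struct VersionedNode {
    std::atomic<uint64_t> version{0};   // low bit = "dirty" (write in progress)
    // ... keys, children, etc.
};

// Returns a stable version snapshot, waiting out an in-progress write.
inline uint64_t stable_version(const VersionedNode& n) {
    uint64_t v;
    do {
        v = n.version.load(std::memory_order_acquire);
    } while (v & 1);                     // spin while the dirty bit is set
    return v;
}

// Reader pattern: repeat the node read until it observes an unchanged version.
template <typename ReadFn>
void optimistic_read(const VersionedNode& n, ReadFn read_fields) {
    for (;;) {
        uint64_t v = stable_version(n);
        read_fields();                   // read keys/pointers without locking
        if (n.version.load(std::memory_order_acquire) == v)
            return;                      // nothing changed: the read was consistent
    }
}
```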

Wide-fanout B+tree is 11% slower!
140M short KV pairs, put-only. With fanout 15 the B+tree has fewer levels than the 4-tree, but it fetches at least as many cache lines from DRAM: the 4-tree's internal nodes are full, while the B+tree's nodes are only ~75% full, so the number of serial DRAM latencies is at least that of the 4-tree.

B+tree – software prefetch
Same technique as [pB+-trees]. The Masstree B+tree has fanout 15, so a node spans 4 cache lines; we always prefetch the whole node when it is accessed. Result: one DRAM latency per node instead of 2, 3, or 4, so a 4-line node costs about the same as a 1-line node.
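A minimal sketch of the prefetch step, assuming a 4-cache-line node and using the GCC/Clang `__builtin_prefetch` builtin; the node size constant and helper name are illustrative.

```cpp
// Sketch of the prefetch idea (assumed node size, not the actual Masstree
// layout). A fanout-15 B+tree node with 8-byte keys spans about 4 cache
// lines; issuing prefetches for all 4 lines as soon as the node's address is
// known lets the lines travel from DRAM in parallel, so the node costs ~1
// DRAM latency instead of up to 4 dependent ones.
#include <cstddef>

constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kNodeBytes = 256;   // ~4 cache lines for a fanout-15 node

inline void prefetch_node(const void* node) {
    const char* p = static_cast<const char*>(node);
    for (std::size_t off = 0; off < kNodeBytes; off += kCacheLine)
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/3);
}
```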

B+tree with prefetch
140M short KV pairs, put-only, 16 cores. It beats the 4-tree by 9%: balanced beats unbalanced.

Concurrent B+tree problem
Lookups must retry when they race with a concurrent insert, because an insert shifts keys within a node and a lookup can observe the intermediate state. The lock-free 4-tree does not have this problem, since its keys never move around, but it is unbalanced.

B+tree optimization – permuter
Keys are stored unsorted within a node, and a permuter, a 64-bit integer, defines their order. A lookup uses the permuter to search the keys and never needs to retry: an insert writes the new key into an unused slot and then publishes a new permuter value, so the insert appears atomic to lookups.
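One plausible encoding of such a 64-bit permuter is sketched below. The exact bit layout is my assumption, chosen to match the description (up to 15 slots plus a count), not necessarily Masstree's.

```cpp
// Hedged sketch of a permuter: a 64-bit integer that orders up to 15 unsorted
// key slots. The low 4 bits hold the number of live keys; each following
// 4-bit field names the slot that holds the i-th smallest key. An insert
// fills a free slot first, then publishes a new permuter value with a single
// atomic store, so concurrent lookups see either the old or the new ordering.
#include <atomic>
#include <cstdint>

struct Permuter {
    std::atomic<uint64_t> bits{0};

    static int size(uint64_t p) { return p & 0xF; }
    static int slot_at(uint64_t p, int i) { return (p >> (4 * (i + 1))) & 0xF; }

    // Build a new permuter value with `slot` inserted at sorted position `pos`.
    // Precondition: size(p) < 15 and 0 <= pos <= size(p).
    static uint64_t insert(uint64_t p, int pos, int slot) {
        uint64_t lo_mask = (uint64_t(1) << (4 * (pos + 1))) - 1;     // count + slots before pos
        uint64_t lo = p & lo_mask;
        uint64_t hi = (p & ~lo_mask) << 4;                           // shift later slots up
        return (hi | (uint64_t(slot) << (4 * (pos + 1))) | lo) + 1;  // +1 bumps the count
    }
};
```

With this encoding, a lookup loads `bits` once and binary-searches by dereferencing `slot_at(p, i)`; because the whole ordering is published with one 64-bit store, a reader never observes a half-updated node.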

B+tree with permuter
140M short KV pairs, put-only, 16 cores. The permuter improves throughput by 4%.

Performance drops dramatically as key length increases
Short values, 50% updates, 16 cores, no logging; the keys differ only in their last 8 bytes. Why the drop? This is a common problem with most B+tree designs: the tree stores key suffixes indirectly, so each key comparison compares the full key and incurs an extra DRAM fetch.

Masstree – trie of B+trees
A trie is a tree in which each level is indexed by a fixed-length key fragment. Masstree is a trie with fanout 2^64 in which each trie node is itself a B+tree: the first layer is a B+tree indexed by key bytes k[0:7], the next by k[8:15], then k[16:23], and so on. This compresses shared key prefixes.
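A hedged sketch of the key slicing this implies (the helper name and packing are mine): each layer sees an 8-byte, big-endian-packed fragment, so integer comparison agrees with lexicographic key order and each layer's B+tree only ever compares 8-byte integers.

```cpp
// Sketch: slice a variable-length key into fixed 8-byte fragments, one per
// trie layer. Layer 0 is indexed by bytes 0-7, layer 1 by bytes 8-15, etc.
// Big-endian packing makes integer order match byte-wise key order.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

uint64_t key_fragment(const std::string& key, unsigned layer) {
    unsigned char buf[8] = {0, 0, 0, 0, 0, 0, 0, 0};      // pad short fragments with zeros
    std::size_t off = std::size_t(layer) * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off, std::min<std::size_t>(8, key.size() - off));
    uint64_t frag = 0;
    for (int i = 0; i < 8; ++i)                            // big-endian pack
        frag = (frag << 8) | buf[i];
    return frag;
}
// e.g. key_fragment("www.wikipedia.org", 0) packs "www.wiki";
//      key_fragment("www.wikipedia.org", 1) packs "pedia.or".
```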

Case study: keys share a P-byte prefix – better than a single B+tree
The shared prefix occupies ⌈P/8⌉ trie levels, each containing only one node, below which sits a single B+tree over 8-byte key fragments. DRAM accesses per lookup: Masstree O(log N) versus O(P log N) for a single B+tree over the full keys.
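A hedged back-of-the-envelope for this comparison, under the added assumption that the ⌈P/8⌉ shared-prefix nodes stay cache-resident because every query passes through the same few nodes:

```latex
% Assumption (mine): the \lceil P/8 \rceil single-node prefix layers are
% cache-resident, since every query traverses them.
\begin{align*}
  \text{Masstree: }\;
    &\underbrace{\lceil P/8\rceil}_{\text{prefix layers (cached)}}
      + \underbrace{O(\log N)}_{\text{bottom B+tree, 8-byte compares}}
      \;\Longrightarrow\; O(\log N)\ \text{DRAM accesses}\\
  \text{Single B+tree: }\;
    &\underbrace{O(\log N)}_{\text{levels}}
      \times \underbrace{O(P)}_{\text{bytes compared per level}}
      \;\Longrightarrow\; O(P\log N)\ \text{memory traffic}
\end{align*}
```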

Masstree performs better for long keys with shared prefixes
Short values, 50% updates, 16 cores, no logging. Each comparison is an 8-byte key-fragment comparison instead of a full-key comparison.

Does the trie of B+trees hurt short-key performance?
No: on 140M short KV pairs, put-only, 16 cores, it is 8% faster. The code is more efficient because internal nodes handle 8-byte keys only.

Evaluation
The previous slides already showed some evaluation; several other questions remain about the final design:
- How does Masstree compare to other systems?
- How does Masstree compare to partitioned trees, i.e. how much do we pay for handling skewed workloads?
- How does Masstree compare with a hash table, i.e. how much do we pay for supporting range queries?
- Does Masstree scale on many cores?

Masstree performs well even with persistence and range queries
20M short KV pairs, uniform key distribution, read-only, 16 cores, with network. Memcached is not persistent and has no range queries; Redis has no range queries. The comparison is admittedly unfair, since both offer a richer data and query model.

Multi-core – partition among cores?
One option is multiple instances, each owning a unique subset of the keys (as in memcached, Redis, and VoltDB). Masstree instead uses a single shared tree: every core can access all keys, which reduces load imbalance.

A single Masstree performs better for skewed workloads
140M short KV pairs, read-only, 16 cores, with network; in the skewed workload one partition receives δ times more queries than the others. Partitioned trees avoid remote DRAM accesses and need no concurrency control (it was disabled for the partitioned runs), but they suffer from load imbalance and leave cores idle (partitioned: 80% idle time; 1 partition: 40%; 15 partitions: 4%).

Cost of supporting range queries
If range queries are not needed, one can use a hash table instead: pre-allocate a large table so there is no resize cost, make it lock-free by updating with cmpxchg, and support only 8-byte keys for efficient code. At 30% occupancy each lookup takes 1.1 hash probes on average. Measured in the Masstree framework, such a hash table achieves 2.5X the throughput of Masstree, so supporting range queries costs about 2.5X in performance.
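For concreteness, here is a hedged sketch of such a stripped-down table (my code, not the paper's measured implementation): pre-allocated so it never resizes, 8-byte keys only, and lock-free inserts via compare-exchange.

```cpp
// Hedged sketch of a pre-allocated, lock-free, 8-byte-key hash table used as
// a comparison point. Inserts claim an empty slot with a single
// compare-exchange, so no locks are needed; the table is sized up front and
// never resized.
#include <atomic>
#include <cstdint>
#include <vector>

class FixedHash {
public:
    explicit FixedHash(std::size_t capacity) : slots_(capacity) {}

    // Returns false only if the (pre-sized) table is full.
    bool put(uint64_t key, uint64_t value) {
        std::size_t mask = slots_.size() - 1;            // assumes power-of-two capacity
        for (std::size_t i = hash(key) & mask, probes = 0;
             probes < slots_.size(); i = (i + 1) & mask, ++probes) {
            uint64_t expected = kEmpty;
            // Claim an empty slot, or reuse the slot that already holds this key.
            if (slots_[i].key.compare_exchange_strong(expected, key) || expected == key) {
                slots_[i].value.store(value, std::memory_order_release);
                return true;
            }
        }
        return false;
    }

private:
    static constexpr uint64_t kEmpty = ~uint64_t(0);     // reserved "no key" marker
    struct Slot {
        std::atomic<uint64_t> key{kEmpty};
        std::atomic<uint64_t> value{0};
    };
    static std::size_t hash(uint64_t k) {
        return std::size_t(k * 0x9E3779B97F4A7C15ull);   // simple multiplicative hash
    }
    std::vector<Slot> slots_;
};
```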

Scales to 12X on 16 cores
Short KV pairs, without logging. Gets scale to 12X of single-core throughput on 16 cores (the chart plots per-core throughput against perfect scalability), and puts scale similarly. Scaling is limited by the shared memory system and might be better with fewer DRAM nodes.

Related work
- [OLFIT]: optimistic concurrency control, but no support for variable-length keys
- [pB+-trees]: B+tree with software prefetch, but not concurrent
- [pkB-tree]: stores a fixed number of differing bits inline, but not concurrent; Masstree is 20% faster on 1 core
- [PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT], but much higher latency (around 100 µs versus Masstree's 1 µs)
Masstree builds on these previous ideas and is the first system to combine them, together with new optimizations: the trie of B+trees and the permuter.

Summary
Masstree is a general-purpose, high-performance, persistent KV store: 5.8 million puts/sec and 8 million gets/sec. Cache craftiness improves performance by 1.5X. More comparisons with other systems are in the paper.

Thank you!