Distributed Systems CS 425 / ECE 428 Fall 2012

NoSQL/Key-value Stores. Slides by Prof. Indranil Gupta. Based mostly on the Cassandra NoSQL presentation, the Cassandra 1.0 documentation at datastax.com, the Cassandra Apache project wiki, and HBase documentation. © 2012, I. Gupta

Batch vs. Real-time. Batch computations (MapReduce/Hadoop) run a single computation over the entire data set, e.g., calculating PageRank; the focus is throughput. Real-time computations read or update a data set, e.g., performing a web search or showing a Facebook page; the focus is availability and latency [, throughput, consistency]. Systems: Dynamo (Amazon), BigTable (Google)/HBase (Apache), Cassandra (Facebook -> Apache).

Cassandra. Originally designed at Facebook, later open-sourced. (Slide shows logos of some of its myriad users.)

Why a Key-value Store? (Business) Key -> Value: (twitter.com) tweet id -> information about the tweet; (kayak.com) flight number -> information about the flight, e.g., availability; (yourbank.com) account number -> information about the account; (amazon.com) item number -> information about the item. Search is usually built on top of a key-value store.

Isn't that just a database? Yes. Relational databases (RDBMSs) have been around for ages; MySQL is the most popular among them. Data is stored in tables; tables are schema-based, i.e., structured; data is queried using SQL. Example SQL query: SELECT user_id FROM users WHERE username = 'jbellis'

Issues with today's workloads. Data: large and unstructured; lots of random reads and writes; foreign keys rarely needed. Need: incremental scalability, speed, no single point of failure, low TCO and admin cost; scale out, not up.

CAP Theorem. Proposed by Eric Brewer (Berkeley), subsequently proved by Gilbert and Lynch. In a distributed system you can satisfy at most 2 out of the 3 guarantees: Consistency (all nodes have the same data at any time), Availability (the system allows operations all the time), Partition-tolerance (the system continues to work in spite of network partitions). Cassandra: eventual (weak) consistency, availability, partition-tolerance. Traditional RDBMSs: strong consistency over availability under a partition.

Cassandra Data Model. Column families: like SQL tables, but may be unstructured (client-specified) and can have index tables; hence "column-oriented databases" / "NoSQL". No schemas: some columns may be missing from some entries ("Not Only SQL"). Supports get(key) and put(key, value) operations. Often write-heavy workloads.
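To make the schema-less, column-family idea concrete, here is a minimal in-memory sketch (illustrative only, not Cassandra's actual API or storage engine); the class name and methods are hypothetical:

```python
# Hypothetical sketch of a column-family store: each row key maps to a sparse
# dict of columns, so different rows may have different columns and no fixed
# schema is enforced.
from collections import defaultdict

class ColumnFamily:
    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)          # row_key -> {column_name: value}

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return row if column is None else row.get(column)

users = ColumnFamily("users")
users.put("jbellis", "email", "j@example.com")   # this row has an email column
users.put("alice", "city", "Urbana")             # this row does not
print(users.get("jbellis"))                      # {'email': 'j@example.com'}
```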

Let's go inside: key -> server mapping. How do you decide which server(s) a key-value pair resides on?

Cassandra uses a ring-based DHT, but without routing (remember this?). (Figure: a Chord-style ring with m = 7 and nodes N16, N32, N45, N80, N96, N112; key K13 has a primary replica and backup replicas on successive nodes, and a coordinator, typically one per DC, issues the read/write for K13.)
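A rough sketch of the ring idea, under assumptions of my own (MD5-based hashing, fixed node names, 3 replicas) rather than Cassandra's actual partitioner: a key is stored on the first node at or after its hash position, plus the next nodes clockwise as backups.

```python
# Sketch of ring-based key-to-server mapping (the idea, not Cassandra's code).
import hashlib
from bisect import bisect_left

M = 7                                   # ring has 2^7 = 128 positions, as in the figure
NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical node names

def ring_hash(s, m=M):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** m)

ring = sorted((ring_hash(n), n) for n in NODES)

def replicas(key, n_replicas=3):
    h = ring_hash(key)
    idx = bisect_left([pos for pos, _ in ring], h) % len(ring)
    # primary replica is the first node at/after the key's position,
    # backups are the next nodes clockwise around the ring
    return [ring[(idx + i) % len(ring)][1] for i in range(n_replicas)]

print(replicas("K13"))   # e.g., ['n4', 'n2', 'n1'] -- primary first, then backups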

Writes. Need to be lock-free and fast (no reads or disk seeks). The client sends a write to one front-end node in the Cassandra cluster (the coordinator), which (via the partitioning function) sends it to all replica nodes responsible for the key. Always writable: hinted handoff. If any replica is down, the coordinator writes to all other replicas and keeps the write until the down replica comes back up; when all replicas are down, the coordinator (front end) buffers the writes (for up to an hour). Provides atomicity for a given key (i.e., within a ColumnFamily). One ring per datacenter; the coordinator can also send the write to one replica per remote datacenter.
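The following is a hedged sketch of the coordinator's write path with hinted handoff, not Cassandra's implementation; the class, the up/down flags, and replay_hints are all assumptions made for illustration.

```python
# Illustrative coordinator: forward a write to every replica of the key;
# if a replica is down, keep the write as a "hint" and replay it when
# that replica comes back up.
class Coordinator:
    def __init__(self, replica_nodes):
        self.replicas = replica_nodes        # node -> is_up flag (assumed)
        self.hints = []                      # (down_node, key, value) tuples

    def write(self, key, value):
        for node, is_up in self.replicas.items():
            if is_up:
                self.send_write(node, key, value)
            else:
                self.hints.append((node, key, value))   # hinted handoff

    def replay_hints(self, node):
        """Called when 'node' is detected as back up."""
        for hint in [h for h in self.hints if h[0] == node]:
            self.send_write(node, hint[1], hint[2])
            self.hints.remove(hint)

    def send_write(self, node, key, value):
        print(f"write {key}={value} -> {node}")

c = Coordinator({"replicaA": True, "replicaB": False, "replicaC": True})
c.write("K13", "v1")              # replicaB is down, so a hint is stored
c.replicas["replicaB"] = True
c.replay_hints("replicaB")        # hinted write delivered on recovery
```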

Writes at a replica node. On receiving a write: 1. log it in the on-disk commit log; 2. make changes to the appropriate memtables (in-memory representations of multiple key-value pairs). Later, when a memtable is full or old, flush it to disk as: a data file, an SSTable (Sorted String Table), i.e., a list of key-value pairs sorted by key; an index file, an SSTable of (key, position in data SSTable) pairs; and a Bloom filter. Compaction: data updates accumulate over time, so SSTables and logs need to be compacted (merging key updates, etc.). Reads need to touch the log and multiple SSTables, so they may be slower than writes.
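A minimal sketch of this replica write path (illustrative only, not Cassandra's storage engine; the memtable size limit and file path are assumptions): append to a commit log, apply to an in-memory memtable, and flush the memtable to a key-sorted SSTable when full.

```python
import json, os, tempfile

class Replica:
    def __init__(self, log_path, memtable_limit=4):
        self.log_path = log_path
        self.memtable = {}                      # in-memory key -> value
        self.memtable_limit = memtable_limit
        self.sstables = []                      # each: list of (key, value), sorted by key

    def write(self, key, value):
        with open(self.log_path, "a") as log:   # 1. append to the commit log first
            log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.memtable[key] = value              # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))   # Sorted String Table
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                # newest data first,
            return self.memtable[key]
        for sstable in reversed(self.sstables): # then newer-to-older SSTables
            for k, v in sstable:
                if k == key:
                    return v
        return None

r = Replica(os.path.join(tempfile.gettempdir(), "commit.log"))
for i in range(5):
    r.write(f"key{i}", i)
print(r.read("key3"), len(r.sstables))          # 3 1  (one SSTable has been flushed)
```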

Bloom Filter. A compact way of representing a set of items; checking for existence in the set is cheap. Some probability of false positives: an item not in the set may check as being in the set. Never false negatives. Implemented as a large bit map: on insert, set all k hashed bit positions; on check-if-present, return true only if all k hashed bits are set. The false-positive rate is low: with k = 4 hash functions, 100 items, and 3200 bits, the FP rate is about 0.02%. (Figure: a key K is hashed by Hash1 … Hashk into positions of the bit map.)
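A minimal Bloom filter sketch using the slide's parameters (m = 3200 bits, k = 4 hash functions); the MD5-based hashing scheme is an assumption for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=3200, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # derive k hash positions by salting the item with the hash index
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                  # insert: set all k hashed bits

    def __contains__(self, item):
        # present only if all k hashed bits are set (hence no false negatives)
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for i in range(100):                            # 100 items, as in the slide
    bf.add(f"key{i}")
print("key5" in bf)                             # True (never a false negative)
print("missing" in bf)                          # almost always False
```

The expected false-positive rate is roughly (1 - e^(-kn/m))^k; with n = 100, m = 3200, k = 4 this comes to about 0.02%, matching the figure on the slide.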

Deletes and Reads. Delete: don't delete the item right away; add a tombstone to the log. Compaction will later remove the tombstone and delete the item. Read: similar to writes, except the coordinator can contact the closest replica (e.g., in the same rack). The coordinator also fetches from multiple replicas and checks consistency in the background, initiating a read repair if any two values differ. This makes reads slower than writes (but still fast). Read repair: uses gossip (remember this?).
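A small sketch of tombstone-based deletion (illustrative only): a delete is just another write that records a tombstone marker, and the item physically disappears only when compaction later drops tombstoned entries.

```python
TOMBSTONE = object()                   # sentinel marking a deleted key

def delete(table, key):
    table[key] = TOMBSTONE             # don't remove; mark as deleted

def read(table, key):
    value = table.get(key)
    return None if value is TOMBSTONE else value

def compact(table):
    """Merge step: drop tombstoned keys when rewriting an SSTable."""
    return {k: v for k, v in table.items() if v is not TOMBSTONE}

table = {"a": 1, "b": 2}
delete(table, "a")
print(read(table, "a"))                # None: reads see the tombstone as "not found"
print(compact(table))                  # {'b': 2}: compaction physically removes 'a'
```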

Cassandra uses Quorums (remember this?). Reads: wait for R replicas (R specified by clients); in the background, check the consistency of the remaining N-R replicas and initiate read repair if needed (N = total number of replicas for this key). Writes come in two flavors: block until a quorum is reached, or async (write to any node). Quorum Q = N/2 + 1. With R = read replica count and W = write replica count: if W + R > N and W > N/2, you have consistency. Allowed combinations: (W=1, R=N), (W=N, R=1), or (W=Q, R=Q).
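A worked example of the quorum condition stated above: when W + R > N, any read quorum overlaps any write quorum, so a read sees the latest completed write; W > N/2 additionally serializes concurrent writes.

```python
def consistent(N, W, R):
    """The slide's consistency condition for N replicas, W write acks, R read responses."""
    return (W + R > N) and (W > N / 2)

N = 3
Q = N // 2 + 1                          # quorum = 2
print(consistent(N, W=Q, R=Q))          # True  (W=2, R=2: quorum reads and writes)
print(consistent(N, W=N, R=1))          # True  (write-all, read-one)
print(consistent(N, W=1, R=1))          # False (stale reads are possible)
```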

Cassandra uses Quorums. In reality, a client can choose one of these levels for a read/write operation:
ZERO: do not wait for the write (N/A for reads)
ANY: any node (may not be a replica)
ONE: at least one replica
QUORUM: quorum across all replicas in all datacenters
LOCAL_QUORUM: quorum in the coordinator's DC
EACH_QUORUM: quorum in every DC
ALL: all replicas in all DCs

Cluster Membership (remember this?). Protocol: nodes periodically gossip their membership list; on receipt, the local membership list is updated. Cassandra uses gossip-based cluster membership. (Figure and animation by Dongyun Jin and Thuy Ngyuen: each node's membership list holds, per member, an address, a heartbeat counter, and the local time it was last updated; the example shows node 2 merging a received list at local time 70, with clocks being asynchronous.)
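A sketch of the gossip merge step (illustrative only), using the membership-list values from the figure: entries with a higher heartbeat counter overwrite older ones, and the local time of the update is refreshed.

```python
def merge(local, received, now):
    """local/received: dict node_id -> (heartbeat, local_time_last_updated)."""
    for node, (hb, _) in received.items():
        if node not in local or hb > local[node][0]:
            local[node] = (hb, now)      # newer heartbeat seen; refresh local time
    return local

# node 2's list and a gossip message it receives at local time 70 (as in the figure)
node2_list  = {1: (10118, 64), 2: (10110, 64), 3: (10090, 58), 4: (10111, 65)}
gossip_msg  = {1: (10120, 66), 2: (10103, 62), 3: (10098, 63), 4: (10111, 65)}
print(merge(node2_list, gossip_msg, now=70))
# {1: (10120, 70), 2: (10110, 64), 3: (10098, 70), 4: (10111, 65)}
```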

Cluster Membership, contd. (remember this?). Suspicion mechanisms: the accrual failure detector outputs a value PHI representing suspicion; applications set an appropriate threshold (PHI = 5 gives a 10-15 sec detection time). PHI is calculated per member from the inter-arrival times of its gossip messages: PHI(t) = -log10( P(t_now - t_last) ), where P is derived from the CDF of observed inter-arrival times. PHI basically determines the detection timeout, but it is sensitive to actual inter-arrival time variations of the gossiped heartbeats. Cassandra uses gossip-based cluster membership. Figure and animation by Dongyun Jin and Thuy Ngyuen.
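The PHI formula can be made concrete under a simplifying assumption. The sketch below assumes exponentially distributed heartbeat inter-arrival times (the real accrual detector estimates the distribution from observed gaps); with a mean gap of about 1 s, PHI crosses 5 after roughly 11-12 s of silence, consistent with the 10-15 s detection time quoted above.

```python
import math

def phi(t_now, t_last, mean_interarrival):
    gap = t_now - t_last
    # probability of seeing no heartbeat for at least 'gap', under an exponential model
    p_later = math.exp(-gap / mean_interarrival)
    return -math.log10(p_later)

mean = 1.0                                   # assumed mean gossip gap (seconds)
for gap in (1, 5, 10, 12):
    print(gap, round(phi(gap, 0, mean), 2))  # PHI grows the longer we hear nothing
```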

Vs. SQL. MySQL is the most popular RDBMS (and has been for a while). On > 50 GB of data: MySQL writes 300 ms avg, reads 350 ms avg; Cassandra writes 0.12 ms avg, reads 15 ms avg.

Cassandra Summary. While RDBMSs provide ACID (Atomicity, Consistency, Isolation, Durability), Cassandra provides BASE (Basically Available, Soft-state, Eventual consistency); it prefers availability over consistency. Other NoSQL products: MongoDB, Riak (look them up!). Next: HBase, which prefers (strong) consistency over availability.

HBase. Google's BigTable was the first "blob-based" storage system; Yahoo! open-sourced an implementation of it as HBase, a major Apache project today. Facebook uses HBase internally. API: Get/Put(row); Scan(row range, filter) for range queries; MultiPut.
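To illustrate the shape of that API, here is an in-memory stand-in (not the real HBase client; the class and column names are hypothetical) showing Get/Put on a row, Scan over a row-key range with a filter, and MultiPut.

```python
class Table:
    def __init__(self):
        self.rows = {}                           # row_key -> {column: value}

    def put(self, row, columns):
        self.rows.setdefault(row, {}).update(columns)

    def get(self, row):
        return self.rows.get(row, {})

    def scan(self, start, stop, flt=lambda row, cols: True):
        for row in sorted(self.rows):            # rows are kept sorted by key
            if start <= row < stop and flt(row, self.rows[row]):
                yield row, self.rows[row]

    def multi_put(self, puts):
        for row, columns in puts:
            self.put(row, columns)

t = Table()
t.multi_put([("row1", {"cf:a": 1}), ("row2", {"cf:a": 2}), ("row9", {"cf:a": 9})])
print(list(t.scan("row1", "row5")))              # range query returns row1 and row2
```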

HBase Architecture. A small group of servers runs Zab, a Paxos-like protocol; data is stored in HDFS. Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

HBase Storage hierarchy. An HBase table is split into multiple regions, replicated across servers. Each region has one Store per ColumnFamily (a subset of columns with similar query patterns). Each Store has a Memstore (in-memory updates to the Store, flushed to disk when full) and StoreFiles (where the data lives), made up of blocks in the HFile format, the analogue of the SSTable from Google's BigTable.

HFile. (Figure: HFile layout for a census table example, with keys such as SSN:000-00-0000 and fields such as Demographic and Ethnicity.) Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/

Strong Consistency: HBase Write-Ahead Log. Write to the HLog before writing to the MemStore, so the server can recover from failure. Source: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

Log Replay. After recovery from failure, or upon bootup (HRegionServer/HMaster): replay any stale logs (using timestamps to find out where the database is w.r.t. the logs); replaying means adding the edits to the MemStore. Why one HLog per HRegionServer rather than one per region? It avoids many concurrent writes, which on the local file system may involve many disk seeks.
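A sketch of write-ahead logging and replay (illustrative only, not HBase's code; sequence numbers and the JSON-lines log format are assumptions): every edit is appended to a single log before being applied to the MemStore, and after a crash only edits newer than the last flushed sequence number are replayed.

```python
import json, os, tempfile

def append_edit(log_path, seq, row, column, value):
    with open(log_path, "a") as log:                 # 1. HLog first
        log.write(json.dumps([seq, row, column, value]) + "\n")

def replay(log_path, last_flushed_seq):
    """Rebuild the MemStore from edits the flushed files don't yet contain."""
    memstore = {}
    with open(log_path) as log:
        for line in log:
            seq, row, column, value = json.loads(line)
            if seq > last_flushed_seq:               # stale edits are skipped
                memstore.setdefault(row, {})[column] = value
    return memstore

path = os.path.join(tempfile.gettempdir(), "hlog.jsonl")
open(path, "w").close()                              # start with an empty log
append_edit(path, 1, "row1", "cf:a", "x")
append_edit(path, 2, "row1", "cf:b", "y")
print(replay(path, last_flushed_seq=1))              # only seq 2 is re-applied
```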

Cross-data center replication. Based on the HLog. Zookeeper (actually a file system for control information) tracks: 1. /hbase/replication/state 2. /hbase/replication/peers/<peer cluster number> 3. /hbase/replication/rs/<hlog>

Summary. Key-value stores and NoSQL systems are faster than traditional RDBMSs, but provide weaker guarantees.