Cassandra – A Decentralized Structured Storage System


1 Cassandra – A Decentralized Structured Storage System
A. Lakshman¹, P. Malik¹ (¹Facebook), SIGOPS ’10. Summarized and presented by Sang-il Song, IDS Lab., Seoul National University

2 The Rise of NoSQL
Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009, when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases. The name attempted to label the emergence of a growing number of distributed data stores that often did not attempt to provide ACID guarantees. Refer to

3 NoSQL Database
Based on Key-Value: memcached, Dynamo, Voldemort, Tokyo Cabinet
Based on Column: Google BigTable, Cloudata, HBase, Hypertable, Cassandra
Based on Document: MongoDB, CouchDB
Based on Graph: Neo4j, FlockDB, InfiniteGraph

4 NoSQL BigData Database
Based on Key-Value: memcached, Dynamo, Voldemort, Tokyo Cabinet
Based on Column: Google BigTable, Cloudata, HBase, Hypertable, Cassandra
Based on Document: MongoDB, CouchDB
Based on Graph: Neo4j, FlockDB, InfiniteGraph

5 Refer to http://blog.nahurst.com/visual-guide-to-nosql-systems

6 Contents
Introduction: Remind: Dynamo; Cassandra
Data Model
System Architecture: Partitioning, Replication, Membership, Bootstrapping
Operations: WRITE, READ, Consistency Level
Performance Benchmark
Case Study
Conclusion

7 Remind: Dynamo
Distributed Hash Table
BASE: Basically Available, Soft-state, Eventually consistent
Client-tunable consistency/availability via the NRW configuration:
  W=N, R=1: read-optimized strong consistency
  W=1, R=N: write-optimized strong consistency
  W+R ≤ N: weak eventual consistency
  W+R > N: strong consistency
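As a minimal illustration of the NRW rule above (not Dynamo's actual code; the function name and classification strings are invented for this sketch), a read and a write are guaranteed to overlap in at least one replica exactly when R + W > N:

```python
# Hypothetical helper illustrating the NRW rule; names are illustrative only.
def consistency_mode(n: int, r: int, w: int) -> str:
    """Classify an NRW configuration: N replicas, R nodes read, W write acks."""
    if r + w > n:
        if w == n and r == 1:
            return "read-optimized strong consistency"
        if w == 1 and r == n:
            return "write-optimized strong consistency"
        return "strong consistency (read and write quorums overlap)"
    return "weak eventual consistency (quorums may miss each other)"

if __name__ == "__main__":
    print(consistency_mode(3, 2, 2))  # strong: 2 + 2 > 3
    print(consistency_mode(3, 1, 1))  # weak:   1 + 1 <= 3
```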

8 Cassandra
A Dynamo-Bigtable "lovechild":
  Column-based data model (from Bigtable)
  Distributed Hash Table (from Dynamo)
  Tunable tradeoff: consistency vs. latency
Properties:
  No single point of failure
  Linearly scalable
  Flexible partitioning and replica placement
  High availability (eventual consistency)

9 Data Model
Cluster > Key Space > Column Family > Column
  Key Space corresponds to a database or table space
  Column Family corresponds to a table
  Column is the unit of data stored in Cassandra
Example rows (one row key per user):
  Row “userid1”, CF “User”: Username=uname1, Email=…, Tel=…
  Row “userid2”, CF “User”: Username=uname2, Email=…, Tel=…; CF “Article”: ArticleId=userid2-1, ArticleId=userid2-2, ArticleId=userid2-3
  Row “userid3”, CF “User”: Username=uname3, Email=…, Tel=…
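A rough sketch of that hierarchy as nested Python dictionaries. This is purely illustrative: the Email column name and the placeholder values are assumptions, and real Cassandra stores columns with timestamps rather than as a plain dict:

```python
# Cluster > Key Space > Column Family > Row Key > Columns (name/value pairs).
# Placeholder values ("...") stand in for data not shown on the slide.
keyspace = {
    "User": {                                   # column family, roughly a table
        "userid1": {"Username": "uname1", "Email": "...", "Tel": "..."},
        "userid2": {"Username": "uname2", "Email": "...", "Tel": "..."},
        "userid3": {"Username": "uname3", "Email": "...", "Tel": "..."},
    },
    "Article": {                                # second column family
        # The slide repeats the column name "ArticleId"; a dict needs distinct keys.
        "userid2": {
            "ArticleId-1": "userid2-1",
            "ArticleId-2": "userid2-2",
            "ArticleId-3": "userid2-3",
        },
    },
}

# A row is addressed by (column family, row key); each row holds a sparse,
# independently sized set of columns.
print(keyspace["User"]["userid2"]["Username"])   # -> uname2
```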

10 System Architecture: Partitioning, Replication, Membership, Bootstrapping

11 Partitioning Algorithm
Distributed Hash Table: data and servers are located in the same address space
Consistent Hashing
  Key-space partitioning: arrangement of the keys on the ring
  Overlay networking: routing mechanism
[Diagram: nodes N1, N2, N3 placed on the ring from low to high hash values; hash(key1) falls between N1 and N2, so N2 is deemed the coordinator of key1.]
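A minimal consistent-hashing sketch under the assumptions in the diagram: nodes and keys are hashed onto the same ring, and the first node clockwise from a key's hash is its coordinator. The hash function and node names are illustrative, not Cassandra's implementation:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a node name or key onto the ring (illustrative: MD5 truncated to 32 bits)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs: data and servers share one address space.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def coordinator(self, key: str) -> str:
        """Walk clockwise from hash(key); the first node reached coordinates the key."""
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_right(positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))   # the node clockwise from hash("key1"); N2 in the diagram
```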

12 Partitioning Algorithm (cont’d)
Challenges
  Non-uniform data and load distribution
  Oblivious to the heterogeneity in the performance of nodes
Solutions
  Assign nodes to multiple positions in the circle (as Dynamo does); see the sketch below
  Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded ones (as Cassandra does)
[Diagram: before/after view of nodes N1, N2, N3 on the ring.]
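A sketch of the first remedy, Dynamo-style virtual nodes: each physical node claims several positions ("tokens") on the ring so that keys and load spread more evenly. The token count and hash are arbitrary choices for this sketch; Cassandra itself takes the second route and moves lightly loaded nodes instead:

```python
import hashlib

def ring_hash(value: str) -> int:
    # Same illustrative hash as the previous sketch.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

def virtual_ring(nodes, tokens_per_node=8):
    """Each physical node claims several virtual positions on the ring."""
    ring = []
    for node in nodes:
        for i in range(tokens_per_node):
            ring.append((ring_hash(f"{node}#vnode{i}"), node))
    return sorted(ring)

vring = virtual_ring(["N1", "N2", "N3"])
# 3 nodes x 8 tokens = 24 interleaved ranges instead of 3 large ones, which
# smooths out non-uniform data and load distribution.
print(len(vring))  # 24
```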

13 Replication
Replica placement policies: RackUnaware, RackAware, DatacenterShard (datacenter-aware replication)
[Diagram: ring of nodes A through J; one node is the coordinator of data 1, and the remaining replicas are placed relative to it according to the chosen policy.]
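A sketch of the simplest policy, RackUnaware, as the Cassandra paper describes it: the coordinator stores the key and its N-1 clockwise successors on the ring hold the replicas. The replication factor and the reuse of the illustrative hash from the partitioning sketch are assumptions:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Same illustrative hash as the partitioning sketch.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

def replicas(ring_entries, key, replication_factor=3):
    """RackUnaware-style placement: the coordinator plus its N-1 clockwise successors."""
    positions = [pos for pos, _ in ring_entries]
    start = bisect.bisect_right(positions, ring_hash(key)) % len(ring_entries)
    return [ring_entries[(start + i) % len(ring_entries)][1]
            for i in range(replication_factor)]

ten_node_ring = sorted((ring_hash(n), n) for n in "ABCDEFGHIJ")
print(replicas(ten_node_ring, "data1"))   # coordinator of "data1" plus two successors
```

The rack- and datacenter-aware policies differ mainly in which successors they accept, so they can be seen as filters layered on the same ring walk.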

14 Cluster Membership
A gossip protocol is used for cluster membership:
  Super lightweight, with mathematically provable properties
  State is disseminated in O(log N) rounds
  Every T seconds, each member increments its heartbeat counter and selects one other member to send its list to
  The receiving member merges the list with its own
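A toy version of that gossip round (names and structure invented for this sketch, not Cassandra's implementation): every T seconds a member bumps its own heartbeat, sends its membership list to one randomly chosen peer, and the receiver keeps the highest heartbeat it has seen for each member:

```python
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.peers = []                          # other Member objects, wired up below
        self.heartbeats = {name: 0}              # member name -> highest heartbeat seen

    def gossip_round(self):
        """Run once every T seconds: bump own counter, gossip to one random peer."""
        self.heartbeats[self.name] += 1
        random.choice(self.peers).merge(self.heartbeats)

    def merge(self, remote):
        """Keep the freshest (highest) heartbeat counter per member."""
        for name, beat in remote.items():
            if beat > self.heartbeats.get(name, -1):
                self.heartbeats[name] = beat

s1, s2, s3 = Member("server1"), Member("server2"), Member("server3")
s1.peers, s2.peers, s3.peers = [s2, s3], [s1, s3], [s1, s2]
for _ in range(5):                               # a few gossip rounds
    for m in (s1, s2, s3):
        m.gossip_round()
print(s1.heartbeats)   # server1's view of everyone's latest heartbeat
```

With n members, state introduced at one node reaches all others in O(log n) rounds in expectation, which is the dissemination bound quoted above.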

15 Gossip Protocol
[Diagram: timeline t1 through t6 showing server 1's heartbeat state ("server1: t1", …) propagating to the other servers over successive gossip rounds.]

16 Accrual Failure Detector
Valuable for system management, replication, and load balancing
Designed to adapt to changing network conditions
The output value, PHI, represents a suspicion level: PHI(t) = −log10(P_later(t − t_last)), where t_last is the arrival time of the most recent heartbeat and P_later estimates, from the observed inter-arrival times, the probability that a heartbeat will still arrive after a gap of (t − t_last)
Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions
In Cassandra, the average time taken to detect a failure is about 15 seconds with the PHI threshold set at 5
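A rough phi-accrual sketch, assuming exponentially distributed heartbeat inter-arrival times (a common simplification); the window size, the example timings, and the class name are all assumptions of this sketch:

```python
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # recent heartbeat inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now: float):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of the probability that a heartbeat still arrives."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Exponential model: P(next heartbeat gap > elapsed) = exp(-elapsed / mean)
        p_later = math.exp(-elapsed / mean)
        return -math.log10(max(p_later, 1e-12))

det = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):       # heartbeats arriving once per second
    det.heartbeat(t)
print(det.phi(3.5))    # ~0.2: a heartbeat arrived recently, the node looks alive
print(det.phi(15.0))   # ~5.2: long silence, above the threshold of 5 used above
```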

17 Bootstrapping
A new node gets assigned a token such that it can alleviate a heavily loaded node
[Diagram: the new node N3 joins the ring between N1 and N2, taking over part of the heavily loaded node's key range.]

18 WRITE
Interface
  Simple: put(key, col, value)
  Complex: put(key, [col:val, …, col:val])
  Batch
WRITE operation (sketched below)
  Commit log for durability: configurable fsync, sequential writes only
  MemTable: no disk access (no reads, no seeks)
  SSTables are final: read-only, with indexes
Always writable
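A highly simplified sketch of that write path: append to a commit log for durability, apply to an in-memory MemTable, and flush to an immutable SSTable once the MemTable grows. The file name, flush threshold, and dict-based tables are assumptions of this sketch, not Cassandra's formats:

```python
import json

class WritePath:
    def __init__(self, commitlog_path="commitlog.txt", memtable_limit=4):
        self.commitlog = open(commitlog_path, "a")   # durability: sequential appends only
        self.memtable = {}                           # in-memory: no disk reads, no seeks
        self.sstables = []                           # flushed, immutable (read-only) tables
        self.memtable_limit = memtable_limit

    def put(self, key, col, value):
        # 1. Append to the commit log (how often to fsync would be configurable).
        self.commitlog.write(json.dumps([key, col, value]) + "\n")
        self.commitlog.flush()
        # 2. Apply to the MemTable; no read is required, so the node is always writable.
        self.memtable.setdefault(key, {})[col] = value
        # 3. Flush to an SSTable once the MemTable is large enough; SSTables are final.
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(self.memtable))
            self.memtable.clear()

wp = WritePath()
wp.put("userid1", "Username", "uname1")   # durable and immediately visible in memory
```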

19 READ
Interface
  get(key, column)
  get_slice(key, SlicePredicate)
  get_range_slices(keyRange, SlicePredicate)
READ operation (sketched below)
  Practically lock-free
  SSTable proliferation: a read may have to consult several SSTables
  Row cache and key cache
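A matching read-path sketch: consult the MemTable and SSTables produced by the write-path sketch above, merging them so that newer values win, with a simple row cache in front. Bloom filters and the key/column indexes that real Cassandra uses are omitted; everything here is illustrative:

```python
class ReadPath:
    def __init__(self, memtable, sstables):
        self.memtable = memtable      # live in-memory table from the write path
        self.sstables = sstables      # flushed, immutable tables, oldest first
        self.row_cache = {}           # key -> fully merged row (illustrative row cache)

    def get(self, key, col):
        row = self.get_row(key)
        return row.get(col) if row else None

    def get_row(self, key):
        if key in self.row_cache:                 # row cache hit: skip the merge entirely
            return self.row_cache[key]
        row = {}
        for sstable in self.sstables:             # oldest SSTables first ...
            row.update(sstable.get(key, {}))
        row.update(self.memtable.get(key, {}))    # ... then the MemTable, so newer wins
        if row:
            self.row_cache[key] = row
        return row or None

# Usage, paired with the write-path sketch:
#   rp = ReadPath(wp.memtable, wp.sstables)
#   print(rp.get("userid1", "Username"))   # -> uname1
```

The loop over SSTables is why the slide mentions SSTable proliferation: the more tables a row is spread across, the more merging each read has to do.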

20 Consistency Level
Tuning the consistency level for each WRITE/READ operation

Write operation:
  ZERO: fire and forget (“Hail Mary”)
  ANY: 1 replica
  ONE: 1 replica
  QUORUM: (N/2) + 1 replicas
  ALL: all replicas

Read operation:
  ZERO: N/A
  ANY: N/A
  ONE: 1 replica
  QUORUM: (N/2) + 1 replicas
  ALL: all replicas

21 Performance Benchmark
Random and sequential writes: limited by bandwidth
Facebook Inbox Search: two kinds of search, term search and interactions; 50+ TB on a 150-node cluster

Latency stat   Search interactions   Term search
Min            7.69 ms               7.78 ms
Median         15.69 ms              18.27 ms
Max            26.13 ms              44.41 ms

22 vs. MySQL with 50 GB of data
MySQL: ~300 ms per write, ~350 ms per read
Cassandra: ~0.12 ms per write, ~15 ms per read (figures as reported in the original Cassandra presentation)

23 Case Study
Cassandra as primary data store
  Datacenter- and rack-aware replication
  ~1,000,000 ops/s
  High sharding and low replication
Inbox Search
  100 TB
  5,000,000,000 writes per day

24 Conclusions
Cassandra: scalability, high performance, wide applicability
Future work: compression, atomicity, secondary indexes

