
1 Cassandra – A Decentralized Structured Storage System
A. Lakshman and P. Malik, Facebook. SIGOPS '10.
2011. 03. 18. Summarized and presented by Sang-il Song, IDS Lab., Seoul National University

2 The Rise of NoSQL
- Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.
- The name attempted to label the emergence of a growing number of distributed data stores that often did not attempt to provide ACID guarantees.

3 NoSQL Databases
- Key-value based: memcached, Dynamo, Voldemort, Tokyo Cabinet
- Column based: Google BigTable, Cloudata, HBase, Hypertable, Cassandra
- Document based: MongoDB, CouchDB
- Graph based: Neo4j, FlockDB, InfiniteGraph



6 Contents
- Introduction: Remind: Dynamo; Cassandra
- Data Model
- System Architecture: Partitioning, Replication, Membership, Bootstrapping
- Operations: WRITE, READ, Consistency level
- Performance Benchmark
- Case Study
- Conclusion

7 Remind: Dynamo
- Distributed Hash Table
- BASE: Basically Available, Soft-state, Eventually consistent
- Client-tunable consistency/availability via the NRW configuration:
  W=N, R=1: read-optimized strong consistency
  W=1, R=N: write-optimized strong consistency
  W+R ≤ N: weak eventual consistency
  W+R > N: strong consistency
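
To make the NRW trade-off concrete, here is a minimal sketch (not from the slides; the function name and output strings are illustrative) that classifies an (N, R, W) configuration the way the list above does.

```python
# Illustrative sketch of Dynamo-style NRW tuning; not a real API.
def classify_nrw(n: int, r: int, w: int) -> str:
    """Classify an (N, R, W) configuration as in the list above."""
    if w == n and r == 1:
        return "read-optimized strong consistency"
    if w == 1 and r == n:
        return "write-optimized strong consistency"
    if r + w > n:
        return "strong consistency (read and write quorums overlap)"
    return "weak eventual consistency (quorums may miss each other)"

for n, r, w in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 1)]:
    print(f"N={n} R={r} W={w}: {classify_nrw(n, r, w)}")
```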

8 Cassandra
- Dynamo-Bigtable lovechild: column-based data model, distributed hash table, tunable tradeoff between consistency and latency
- Properties: no single point of failure, linearly scalable, flexible partitioning and replica placement, high availability (eventual consistency)

9 Data Model
- Cluster
- A Key Space corresponds to a database or table space
- A Column Family corresponds to a table
- A Column is the unit of data stored in Cassandra
- Example: row key "userid1" has Column Family "User" with columns Username = uname1, Email, and Tel; row key "userid2" additionally has Column Family "Article" with ArticleId columns valued userid2-1, userid2-2, userid2-3; row key "userid3" has Column Family "User" with columns Username = uname3, Email, and Tel
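
As a rough mental model (a sketch, not Cassandra code), the example above can be viewed as nested maps: row key, then column family, then column name, then value.

```python
# Sketch: the data model viewed as nested dictionaries.
# row key -> column family -> column name -> value (timestamps omitted).
rows = {}

def put(row_key, column_family, column, value):
    rows.setdefault(row_key, {}).setdefault(column_family, {})[column] = value

put("userid1", "User", "Username", "uname1")
put("userid2", "User", "Username", "uname2")
put("userid2", "Article", "userid2-1", "...")   # articles stored as columns
put("userid2", "Article", "userid2-2", "...")

print(rows["userid2"]["Article"].keys())         # this row's Article columns
```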

10 System Architecture
- Partitioning
- Replication
- Membership
- Bootstrapping

11 Partitioning Algorithm
- Distributed Hash Table: data and servers are located in the same address space
- Consistent Hashing: key-space partition (arrangement of the keys on the ring) and overlay networking (routing mechanism)
(Figure: nodes N1, N2, N3 on the hash ring; hash(key1) falls in the range owned by N2, so N2 is deemed the coordinator of key1)
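
A minimal consistent-hashing sketch (illustrative only, not Cassandra's implementation): node names are hashed onto a ring, and a key is coordinated by the first node found moving clockwise from the key's hash.

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    # Hash node names and keys into the same address space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((ring_hash(n), n) for n in nodes)
        self.points = [t for t, _ in self.tokens]

    def coordinator(self, key: str) -> str:
        # The first node clockwise from hash(key) is deemed the coordinator.
        i = bisect_right(self.points, ring_hash(key)) % len(self.points)
        return self.tokens[i][1]

ring = Ring(["N1", "N2", "N3"])
print(ring.coordinator("key1"))
```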

12 Partitioning Algorithm (cont'd)
- Challenges: non-uniform data and load distribution; oblivious to the heterogeneity in the performance of nodes
- Solutions: assign nodes to multiple positions in the circle (like Dynamo); analyze load information on the ring and have lightly loaded nodes move to alleviate heavily loaded nodes (like Cassandra)
(Figure: the ring before and after rebalancing)
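
A sketch of the first mitigation (Dynamo-style virtual nodes, not Cassandra's actual token management): each physical node takes several positions on the ring, which evens out key ownership.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=16):
    # Each node is hashed under several derived names ("virtual nodes").
    return sorted((ring_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def owner(ring, key):
    points = [t for t, _ in ring]
    return ring[bisect_right(points, ring_hash(key)) % len(ring)][1]

ring = build_ring(["N1", "N2", "N3"])
load = Counter(owner(ring, f"key{i}") for i in range(9000))
print(load)   # ownership is spread more evenly than with one token per node
```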

13 Replication
- RackUnaware
- RackAware
- DatacenterAware
(Figure: nodes A through J on the ring; the coordinator of data1 holds the first replica)
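
A sketch of the simplest policy, RackUnaware (node names and replica count are illustrative): the coordinator keeps the first copy, and the next N-1 distinct nodes clockwise on the ring keep the rest.

```python
def replica_nodes(ring_order, coordinator_index, n):
    # ring_order: nodes listed in token order; RackUnaware placement only.
    n = min(n, len(set(ring_order)))
    replicas, i = [], coordinator_index
    while len(replicas) < n:
        node = ring_order[i % len(ring_order)]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

# If the coordinator of data1 is at index 1 ("B"), N=3 places replicas on B, C, D.
print(replica_nodes(["A", "B", "C", "D", "E"], coordinator_index=1, n=3))
```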

14 Cluster Membership
- A Gossip protocol is used for cluster membership
- Super lightweight, with mathematically provable properties
- State is disseminated in O(log N) rounds
- Every T seconds each member increments its heartbeat counter and selects one other member to send its list to
- A member merges the received list with its own list

15 Gossip Protocol (example)
- t1: server1 = {server1: t1}
- t2: server1 = {server1: t1}; server2 = {server2: t2}
- t3: server1 = {server1: t1, server2: t2}; server2 = {server2: t2}
- t4: server1 = server2 = {server1: t4, server2: t2}
- t5: server1 = {server1: t4, server2: t2, server3: t5}; server2 = {server1: t4, server2: t2}; server3 = {server3: t5}
- t6: server1 = {server1: t6, server2: t2, server3: t5}; server2 = server3 = {server1: t6, server2: t6, server3: t5}
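
A minimal sketch of one gossip round as in the trace above (randomized peer choice; illustrative, not Cassandra's Gossiper): each node bumps its own heartbeat, sends its list to one peer, and both sides keep the larger heartbeat per member.

```python
import random

def gossip_round(views, t):
    # views: {node: {member: heartbeat}}; t: this round's timestamp.
    for node in views:
        views[node][node] = t                            # increment own heartbeat
        peer = random.choice([n for n in views if n != node])
        merged = {m: max(views[node].get(m, 0), views[peer].get(m, 0))
                  for m in views[node].keys() | views[peer].keys()}
        views[node] = dict(merged)
        views[peer] = dict(merged)                       # the peer merges too

views = {"server1": {}, "server2": {}, "server3": {}}
for t in range(1, 7):
    gossip_round(views, t)
print(views)   # all views converge on recent heartbeats for every member
```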

16 Accrual Failure Detector
- Valuable for system management, replication, and load balancing
- Designed to adapt to changing network conditions
- The value output, PHI, represents a suspicion level: PHI(t_now) = -log10(P_later(t_now - t_last)), where P_later(t) is the probability, estimated from observed heartbeat inter-arrival times, that the next heartbeat arrives more than t after the previous one
- Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions
- In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5
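
A toy version of the accrual detector (a sketch only; the real detector estimates the inter-arrival distribution more carefully, here it is assumed exponential): PHI grows the longer a node stays silent and is compared against a threshold such as 5.

```python
import math

class PhiDetector:
    def __init__(self):
        self.intervals, self.last = [], None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # phi = -log10(P_later(now - last)); exponential model for simplicity.
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        p_later = math.exp(-(now - self.last) / mean)
        return -math.log10(max(p_later, 1e-12))

d = PhiDetector()
for t in range(10):
    d.heartbeat(now=float(t))                 # heartbeats arriving every second
print(d.phi(now=9.5), d.phi(now=20.0))        # low while timely, high after silence
```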

17 Bootstrapping
- A new node gets assigned a token such that it can alleviate a heavily loaded node
(Figure: the ring before and after the new node joins)
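
One way to read that token-choice idea, as a sketch (illustrative; real bootstrapping also streams data to the new node): give the joining node a token in the middle of the most heavily loaded node's range, so it takes over roughly half of that range.

```python
def bootstrap_token(token_of, load_of):
    # token_of: node -> ring token (int); load_of: node -> observed load.
    hot = max(load_of, key=load_of.get)                  # most heavily loaded node
    tokens = sorted(token_of.values())
    prev = tokens[(tokens.index(token_of[hot]) - 1) % len(tokens)]
    # Midpoint of the hot node's range (ring wrap-around ignored for brevity).
    return (prev + token_of[hot]) // 2

print(bootstrap_token({"N1": 100, "N2": 600, "N3": 900},
                      {"N1": 10, "N2": 70, "N3": 20}))   # 350 splits N2's range
```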

18 WRITE
- Interface: simple put(key, col, value); complex put(key, [col:val, ..., col:val]); batch
- WRITE operation: commit log for durability (configurable fsync, sequential writes only); MemTable (no disk access, no reads or seeks); SSTables are final (read-only, with indexes); always writable
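
A sketch of that write path (file name and flush threshold are made up): append to a commit log for durability, apply to an in-memory MemTable with no disk reads, and flush full MemTables to immutable, sorted SSTables.

```python
import json

class WritePath:
    def __init__(self, commitlog="commitlog.jsonl", memtable_limit=2):
        self.commitlog, self.limit = commitlog, memtable_limit
        self.memtable, self.sstables = {}, []

    def put(self, key, col, value):
        with open(self.commitlog, "a") as log:            # sequential append only
            log.write(json.dumps([key, col, value]) + "\n")
        self.memtable.setdefault(key, {})[col] = value    # no disk read or seek
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # An SSTable here is just a sorted, read-only snapshot of the MemTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

w = WritePath()
w.put("userid1", "Username", "uname1")
w.put("userid2", "Username", "uname2")                    # second row triggers a flush
print(w.sstables)
```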

19 READ
- Interface: get(key, column); get_slice(key, SlicePredicate); get_range_slices(KeyRange, SlicePredicate)
- READ path: practically lock-free; SSTable proliferation; row cache; key cache
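
A sketch of the corresponding read (caches and indexes omitted; the example rows are made up): consult the MemTable first, then SSTables from newest to oldest, so the most recent value for a column wins.

```python
def get(key, column, memtable, sstables):
    # Newest data first: memtable, then SSTables in reverse flush order.
    for table in [memtable] + list(reversed(sstables)):
        if column in table.get(key, {}):
            return table[key][column]
    return None

memtable = {"userid1": {"Username": "uname1-new"}}
sstables = [{"userid1": {"Username": "uname1-old", "Tel": "555-0100"}}]
print(get("userid1", "Username", memtable, sstables))     # uname1-new (memtable wins)
print(get("userid1", "Tel", memtable, sstables))          # 555-0100 (from the SSTable)
```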

20 Consistency Level
- The consistency level is tunable for each WRITE/READ operation
- Write operation: ZERO = hail mary (return immediately); ANY = 1 replica; ONE = 1 replica; QUORUM = (N/2)+1 replicas; ALL = all replicas
- Read operation: ZERO = N/A; ANY = N/A; ONE = 1 replica; QUORUM = (N/2)+1 replicas; ALL = all replicas
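
A small sketch mapping the write-side levels above to the number of replica acknowledgements the coordinator waits for, given replication factor n (ZERO is read here as returning without waiting).

```python
def required_acks(level: str, n: int) -> int:
    # Write-side interpretation of the levels listed above.
    return {"ZERO": 0, "ANY": 1, "ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

for level in ("ZERO", "ANY", "ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, n=3))   # QUORUM with n=3 waits for 2 acks
```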

21 Performance Benchmark
- Random and sequential writes: limited by bandwidth
- Facebook Inbox Search: two kinds of search (term search and interactions); 50+ TB on a 150-node cluster
- Latency (interactions / term search): min 7.69 ms / 7.78 ms; median 15.69 ms / 18.27 ms; max 26.13 ms / 44.41 ms

22 vs. MySQL with 50 GB of data
- MySQL: ~300 ms write, ~350 ms read
- Cassandra: ~0.12 ms write, ~15 ms read

23 Case Study
- Cassandra as the primary data store
- Datacenter- and rack-aware replication
- ~1,000,000 ops/s
- High sharding and low replication
- Inbox Search: 100 TB; 5,000,000,000 writes per day

24 Conclusions
- Cassandra: scalability, high performance, wide applicability
- Future work: compression, atomicity, secondary indexes

