Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19.


1 Cassandra and Sigmod contest Cloud computing group Haiping Wang 2009-12-19

2 Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010

3 Cassandra overview Highly scalable, distributed Eventually consistent Structured key-value store Dynamo + bigtable P2P Random reads and random writes Java

4 Data Model [Diagram: one row KEY holding three column families. ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name) holds columns tid1 to tid4, each with a value and a timestamp t1 to t4. ColumnFamily2 (Name: WordList, Type: Super, Sort: Time) holds super column aloha (columns C1 to C4 with values V1 to V4 and timestamps T1 to T4) and super column dude (columns C2 and C6). ColumnFamily3 (Name: System, Type: Super, Sort: Name) holds super columns hint1 to hint4.] Column families are declared up front. Columns and super columns are added and modified dynamically.

5 Cassandra Architecture

6 Cassandra API Data structures Exceptions Service API ConsistencyLevel(4) Retrieval methods(5) Range query: returns matching keys(1) Modification methods(3) Others

7 Cassandra commands

8 Partitioning and replication(1) Consistent hashing DHT Balance Monotonicity Spread Load Virtual nodes Coordinator Preference list

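9 Partitioning and replication(2) [Diagram: a hash ring covering the range 0 to 1 (1/2 marked halfway) with nodes A to F placed on it; h(key1) and h(key2) mark where two keys land, and each key is replicated on N=3 nodes.]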
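A minimal sketch of the ring lookup, assuming one position per node (no virtual nodes) and MD5 as the hash; the class and function names are hypothetical and not taken from Cassandra or Dynamo. It shows how a preference list of N=3 nodes can be read off the ring by walking clockwise from h(key).

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a ring position (MD5 is an arbitrary choice for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sorted (position, node) pairs form the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def preference_list(self, key: str):
        """Walk clockwise from h(key) and collect the first N nodes."""
        start = bisect.bisect_right(self.positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["A", "B", "C", "D", "E", "F"], replicas=3)
owners = ring.preference_list("key1")  # first entry acts as the coordinator
```

Removing a node that is not in a key's preference list leaves that list unchanged, which is the monotonicity property consistent hashing is chosen for.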

10 Data Versioning Always writeable Multiple versions – put() may return before the update reaches all replicas – get() may return many versions Vector clocks Reconciliation during reads by clients

11 Vector clock List of (node, counter) pairs E.g. [x,2][y,3] vs. [x,3][y,4][z,1] (the second descends from the first); [x,1][y,3] vs. [z,1][y,3] (concurrent) Use timestamp E.g. D([x,1]:t1,[y,1]:t2) Remove the oldest entry when the clock reaches a size threshold

12 Vector clock Return all the objects at the leaves D3,4([Sx,2],[Sy,1],[Sz,1]) Single new version
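The comparison rule behind vector clocks can be sketched as follows, assuming a clock is represented as a plain dict from node name to counter (a representation chosen for this example only). One clock descends from another when every counter is at least as large; otherwise the two versions are concurrent and must be reconciled.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a >= b on every node)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def compare(a: dict, b: dict) -> str:
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: the per-node maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# The two pairs from the vector clock slide.
first = compare({"x": 2, "y": 3}, {"x": 3, "y": 4, "z": 1})
second = compare({"x": 1, "y": 3}, {"z": 1, "y": 3})
# Merging two concurrent leaves yields the clock of the single new version.
reconciled = merge({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1})
```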

13 Execution of operations Two strategies – Route through a generic load balancer that picks a node based on load: easy, the client does not have to link any store-specific code – Send directly to the responsible node: achieves lower latency

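14 Put() operation [Diagram: the client sends the object to the coordinator; the coordinator attaches a vector clock, forwards the object to replicas P1, P2, ..., PN-1, and waits for W-1 responses.]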

15 Cluster Membership Gossip protocol State disseminated in O(log N) rounds Every T seconds each node increases its heartbeat counter and sends its membership list to another node The receiver merges the received list with its own
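A minimal sketch of the heartbeat step and the merge operation, assuming the membership list is a dict from node name to heartbeat counter; the function names are hypothetical, and real gossip implementations also track timestamps for failure detection.

```python
def tick(name: str, table: dict) -> dict:
    """A node bumps its own heartbeat counter before gossiping."""
    updated = dict(table)
    updated[name] = updated.get(name, 0) + 1
    return updated


def merge_membership(local: dict, remote: dict) -> dict:
    """Keep the highest heartbeat counter seen for each node."""
    merged = dict(local)
    for node, heartbeat in remote.items():
        if heartbeat > merged.get(node, -1):
            merged[node] = heartbeat
    return merged


# Node A gossips its list to node B.
a_view = tick("A", {"A": 3, "B": 5})
b_view = merge_membership({"B": 7, "C": 1}, a_view)
```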

16

17

18

19 Failure Data center(s) failure – Multiple data centers Temporary failure Permanent failure – Merkle tree

20 Temporary failure

21 Merkle tree
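To illustrate how a Merkle tree helps replicas detect divergence, here is a small sketch assuming each replica hashes the same non-empty, equally long, ordered list of values with SHA-256; all names are hypothetical. Equal roots mean the replicas agree without exchanging any data; only when the roots differ are the leaves inspected.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(values):
    """Return the tree levels, from leaf hashes (index 0) up to the root (last)."""
    level = [_h(v.encode("utf-8")) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a_values, b_values):
    """Indices whose values differ; empty if the two roots already match."""
    a_levels, b_levels = build_levels(a_values), build_levels(b_values)
    if a_levels[-1][0] == b_levels[-1][0]:
        return []
    return [i for i, (x, y) in enumerate(zip(a_levels[0], b_levels[0])) if x != y]


out_of_sync = differing_leaves(["a", "b", "c", "d"], ["a", "b", "X", "d"])
```

A fuller version would descend the tree level by level so that only the differing subtrees are exchanged; this sketch compares the leaf level directly to stay short.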

22 Bloom filter A space-efficient probabilistic data structure used to test whether an element is a member of a set False positives are possible; false negatives are not
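A minimal Bloom filter sketch, assuming seeded SHA-256 as the hash family and one byte per bit for readability; the sizes and names are illustrative and not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))


keys = BloomFilter()
keys.add("K1")
keys.add("K5")
```

A read can skip a data file entirely when `might_contain` returns False for the requested key.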

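23 Compactions [Diagram: three sorted data files (K1 K2 K3; K2 K10 K30; K4 K5 K10) are merge-sorted into one sorted data file (K1 K2 K3 K4 K5 K10 K30), after which the input files are deleted. The new data file comes with an index file (K1 offset, K5 offset, K30 offset) that is loaded in memory, and a Bloom filter.]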
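The merge step of a compaction can be sketched like this, assuming each data file is a sorted list of (key, value) pairs, files are passed oldest first, and `None` stands in for a deletion marker; these representations are assumptions of the example. Integer keys are used so that 10 sorts after 5.

```python
import heapq

TOMBSTONE = None  # hypothetical deletion marker


def compact(sstables):
    """Merge sorted tables (oldest first); the newest version of each key wins."""
    tagged = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    # Sorting by (key, -age) puts the newest version of each key first.
    for key, _, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


result = compact([
    [(1, "a"), (2, "b"), (3, "c")],
    [(2, "b2"), (10, "j"), (30, "z")],
    [(4, "d"), (5, "e"), (10, "j2")],
])
```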

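24 Write [Diagram: a write for Key (CF1, CF2, CF3) is first appended, binary serialized, to the commit log on a dedicated disk, then applied to the memtable of each column family (Memtable(CF1), Memtable(CF2), ...). A memtable is flushed based on data size, number of objects, and lifetime. Each flush produces a data file on disk together with a block index (K128 offset, K256 offset, K384 offset) and a Bloom filter; the index is kept in memory.]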
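A toy version of the write path, assuming a single column family, an in-memory list standing in for the commit log, and a flush triggered only by the number of objects; the class and its threshold are hypothetical.

```python
class ColumnFamilyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []  # stands in for the append-only log on its own disk
        self.memtable = {}    # in-memory table for this column family
        self.sstables = []    # flushed, sorted, immutable tables (newest last)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memtable out in key order, then start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # logged entries are now safe on disk


store = ColumnFamilyStore(flush_threshold=3)
for k, v in [("k2", "b"), ("k1", "a"), ("k3", "c"), ("k4", "d")]:
    store.write(k, v)
```

Because a write only appends to the log and updates memory, it needs no disk seek, which is why the log is given a dedicated disk.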

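25 Read [Diagram: the client sends a query to the Cassandra cluster. The closest replica (Replica A) returns the full result, while Replica B and Replica C receive a digest query and return a digest response. The result goes back to the client, and a read repair is performed if the digests differ.]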
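The digest comparison and read repair can be sketched as below, assuming each replica is a dict from key to (value, timestamp), the first replica in the list is the closest one, and the key exists on every replica; MD5 and all names are choices made for this example.

```python
import hashlib


def digest(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def read(key, replicas):
    """Return the value for key, repairing stale replicas when digests differ."""
    closest, others = replicas[0], replicas[1:]
    value, _ = closest[key]
    if all(digest(r[key][0]) == digest(value) for r in others):
        return value  # every digest matches the full result
    # Mismatch: take the version with the newest timestamp and push it everywhere.
    newest = max((r[key] for r in replicas), key=lambda vt: vt[1])
    for r in replicas:
        r[key] = newest
    return newest[0]


replica_a = {"user1": ("old", 1)}
replica_b = {"user1": ("new", 2)}
replica_c = {"user1": ("old", 1)}
answer = read("user1", [replica_a, replica_b, replica_c])
```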

26 Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010

27 Sigmod contest 2009 Task overview API Data structure Architecture Test

28 Task overview Index system for main memory data Running on a multi-core machine Many threads with multiple indices Serialize execution of user-specified transactions Basic functions: exact match queries, range queries, updates, inserts, deletes
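As a rough illustration of the basic functions listed above, here is a sketch of a thread-safe in-memory index, assuming unique keys, a sorted key list for range queries, and one coarse lock instead of real transactions; it is not the contest API, and every name is hypothetical.

```python
import bisect
import threading


class MemoryIndex:
    def __init__(self):
        self._keys = []       # sorted keys, for range queries
        self._payloads = {}   # key -> payload, for exact-match queries
        self._lock = threading.Lock()

    def insert(self, key, payload):
        """Insert a new key or update the payload of an existing one."""
        with self._lock:
            if key not in self._payloads:
                bisect.insort(self._keys, key)
            self._payloads[key] = payload

    def get(self, key):
        with self._lock:
            return self._payloads.get(key)

    def delete(self, key):
        with self._lock:
            if key in self._payloads:
                del self._payloads[key]
                self._keys.pop(bisect.bisect_left(self._keys, key))

    def range(self, low, high):
        """All (key, payload) pairs with low <= key <= high, in key order."""
        with self._lock:
            lo = bisect.bisect_left(self._keys, low)
            hi = bisect.bisect_right(self._keys, high)
            return [(k, self._payloads[k]) for k in self._keys[lo:hi]]


index = MemoryIndex()
for k in (5, 1, 9, 3):
    index.insert(k, f"payload-{k}")
index.delete(9)
```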

29 API

30 Record

31 HashTable

32 HashShared

33 TxnState

34 IdxState Keeps track of an index Created by openIndex() Destroyed by closeIndex() Inherited by IdxStateType Contains pointers to – a hashtable – a FixedAllocator – an Allocator – an array with the type of action

35 Architecture

36 IndexManager

37 DeadLockDetector

38 Transactor a HashOnlyGet object with type TxnState

39 Allocator Allocate the memory for the payloads Use pools and linked list Pool sized --the max length of payload is 100 The payloads with the same payload are in the same list

40 Unit Tests three threads, run over three indices the primary thread – create the primary index – inserts, deletes and accesses data in the primary index the second thread – simultaneously runs some basic tests over a separate index the third thread – ensure the transactional guarantees – Continuously queries the primary index

41 Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010

42 Task overview Implement a simple distributed query executor with the help of the in-memory index Given centralized query plans, translate them into distributed query plans Given a parsed SQL query, return the right results Data stored on disk, the indexes are all in memory Measure the total time costs

43 SQL query form SELECT alias_name.field_name, ... FROM table_name AS alias_name, ... WHERE condition1 AND ... AND conditionN Condition: alias_name.field_name = fixed value; alias_name.field_name > fixed value; alias_name.field_name1 = alias_name.field_name2
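To make the query form concrete, here is a naive single-node evaluator sketch, assuming tables are lists of row dicts keyed by alias and conditions arrive already parsed as tuples; this representation is invented for the example (it is not the contest's parsed-query structure), and fixed values are assumed never to be tuples. It handles the three condition kinds with a plain nested-loop join.

```python
from itertools import product


def _value(row, ref):
    """Resolve an (alias, field) reference against one combined row."""
    alias, field = ref
    return row[alias][field]


def run_query(tables, select, conditions):
    """tables: alias -> list of row dicts; select: [(alias, field)];
    conditions: [(op, (alias, field), rhs)] with op "=" or ">", where rhs is
    a fixed value or another (alias, field) reference (a join condition)."""
    aliases = list(tables)
    results = []
    for combo in product(*(tables[a] for a in aliases)):
        row = dict(zip(aliases, combo))
        keep = True
        for op, left, rhs in conditions:
            right = _value(row, rhs) if isinstance(rhs, tuple) else rhs
            keep = _value(row, left) == right if op == "=" else _value(row, left) > right
            if not keep:
                break
        if keep:
            results.append(tuple(_value(row, ref) for ref in select))
    return results


tables = {
    "p": [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    "o": [{"pid": 1, "total": 5}, {"pid": 2, "total": 50}, {"pid": 2, "total": 7}],
}
# SELECT p.name, o.total FROM people AS p, orders AS o
# WHERE p.id = o.pid AND o.total > 6
rows = run_query(
    tables,
    select=[("p", "name"), ("o", "total")],
    conditions=[("=", ("p", "id"), ("o", "pid")), (">", ("o", "total"), 6)],
)
```

A real executor would use the in-memory indexes instead of scanning the full cross product.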

44 Initialization phase

45 Connection phase

46 Query phase

47 Closing phase

48 Tests An initial computation On synthetic and real-world datasets Tested on a single machine Tested on an ad-hoc cluster of peers Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD

49 Benchmarks (stage 1) Assume a partition always covers the entire table and the data is not replicated. Unit tests Benchmarks – On a single node, selects with an equality condition on the primary key – On a single node, selects with an equality condition on an indexed field – On a single node, 2 to 5 joins on tables of different sizes – On a single node, 1 join and a "greater than" condition on an indexed field – On three nodes, one join on two tables of different sizes, the two tables being on two different nodes

50 Benchmarks (stage 2) Tables are now stored on multiple nodes Part of a table, or the whole table, may be replicated on multiple nodes Queries will be sent in parallel, with up to 50 simultaneous connections Benchmarks – Selects with an equality condition on the primary key, the values being uniformly distributed – Selects with an equality condition on the primary key, the values being non-uniformly distributed – Multiple joins on tables spread across different nodes

51 Important Dates

52 Thank you!!!

