Introduction & Data Modeling

1 Cassandra Training: Introduction & Data Modeling

2 Aims
By the end of today you should know:
- How Cassandra organises data
- How to configure replicas
- How to choose between consistency and availability
- How to efficiently model data for both reads and writes
- That you need to consider Active-Active scenarios
- Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or
Introduction to Cassandra

3 Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples

4 Agenda - Introduction
- Elevator Pitch
- Brewer’s Theorem & Tuneable Consistency
- Distributed Hash Table 101
- Write path
- Read path
- TTL, Deletion & Tombstones
- Background Processes
- Data Model in 5 mins
- Thrift vs CQL
- Maintaining Consistency
- Scaling Cassandra

5 Agenda – Advanced Topics
- Data Modelling Key Concepts
- Time Series Modelling
- Wide rows
- Compound Keys
- Code example
- Performance Tuning Levers
- What is DataStax Enterprise?
- Multi DC Support
- Virtual Nodes
- Nodetool

6 Elevator Pitch: What?
- Write-path optimised
- Eventually consistent (within milliseconds)
- Distributed hash table
- Highly durable
- Tunable consistency

7 Elevator Pitch: Why?
- Linear horizontal scaling of reads and writes
- Data is important and should always be there
- Often we don’t need a consistency guarantee: let me choose my tradeoff

8 Elevator Pitch: How?
- Data is partitioned internally across nodes
- Writes only have to reach the commit log (and the in-memory memtable) before they are acknowledged
- Data is stored read-optimised to minimise read and write work: no indexes to update, no query to plan
- Agreement (consistency) is specified per query

9 Elevator Pitch: What it’s Not
- No support for transactions: atomicity and isolation are mostly not available
- Not a silver bullet: it is easy to design a poorly-performing data model

10 DHT 101
- Each physical node is assigned a token
- Each node owns the range from the previous node's token (exclusive) up to its own token (inclusive)

11 Cassandra Write Path
- The coordinator sends the update to two nodes (the replication factor in this example), starting at the owning node and working clockwise
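
The ownership and clockwise-walk rules above can be sketched in a few lines of Java. This is a toy model, not Cassandra's implementation: the class name `TokenRing`, the small `long` tokens and the node names are all assumptions made for illustration, and the placement mimics a simple "next nodes clockwise" strategy only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy token ring: each node owns the range (previous token, its own token]. */
class TokenRing {
    private final TreeMap<Long, String> nodesByToken = new TreeMap<>();

    void addNode(String name, long token) {
        nodesByToken.put(token, name);
    }

    /** Returns the replicas for a token: the owner first, then its clockwise neighbours. */
    List<String> replicasFor(long token, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (nodesByToken.isEmpty()) {
            return replicas;
        }
        // The owner is the first node whose token is >= the key's token,
        // wrapping around to the lowest token at the end of the ring.
        Long current = nodesByToken.ceilingKey(token);
        if (current == null) {
            current = nodesByToken.firstKey();
        }
        int wanted = Math.min(replicationFactor, nodesByToken.size());
        while (replicas.size() < wanted) {
            replicas.add(nodesByToken.get(current));
            current = nodesByToken.higherKey(current);   // next node clockwise
            if (current == null) {
                current = nodesByToken.firstKey();        // wrap around the ring
            }
        }
        return replicas;
    }
}
```

With four hypothetical nodes at tokens 0, 25, 50 and 75, a key whose token is 30 is owned by the node at 50, and with a replication factor of 2 the second copy goes to the node at 75.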

12 Cassandra Write Path
- A 128-bit hash of the partition key is used to compute the key's token
- Keys are therefore distributed randomly around the ring
- If a replica is unavailable, the coordinator keeps a hint and replays the write later (hinted handoff)
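
As a rough illustration of the key -> token step, the sketch below hashes a partition key with MD5 (a 128-bit digest, which is what the Random Partitioner is based on) and treats the result as a non-negative integer. The class name `KeyHasher` and the UTF-8 encoding of the key are assumptions for this example; it is not meant to reproduce Cassandra's token values exactly.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class KeyHasher {
    /** Maps a partition key to a non-negative token using a 128-bit MD5 digest. */
    static BigInteger tokenFor(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            // Interpret the 16 bytes as a signed integer and take the absolute value
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the digest looks random, keys that sort next to each other (such as consecutive user ids) land at unrelated positions on the ring, which is what spreads load evenly.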

13 Cassandra Write Path: Concepts
- The Snitch: proximity of nodes
- Random Partitioner: maps key -> token
- Replication Factor: how many replicas
- Gossip: discovery protocol

14 Cassandra Write Path
- SSTables are written sequentially and are immutable
- Data for one row may reside across several SSTables
- SSTables are periodically compacted together

15 Cassandra Read Path
- The data read command is sent to the closest replica, as chosen by the snitch
- Digest commands are sent to as many other replicas as the consistency level (CL) requires
- Read Repair Chance 10%: for that fraction of reads, digests are requested from all replicas

16 Start & Interrogate C*
vagrant box add dse.box
mkdir ~/vagrant
curl > ~/vagrant/dse.tar.gz
cd ~/vagrant && tar xzvf dse.tar.gz
cd dse && vagrant up
vagrant ssh node1
nodetool ring

17 Cassandra Read Path: Read Mechanics
- Find candidate SSTables using Bloom filters
- Seek through the SSTables
- Memory mapped files
- Check the memtable
-> Minimise the number of SSTables for best efficiency
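
To show why a Bloom filter lets a read skip SSTables cheaply, here is a minimal sketch. The class `SimpleBloomFilter`, its sizing and its hashing scheme are assumptions chosen for brevity; Cassandra's own filters are more carefully tuned.

```java
import java.util.BitSet;

/** Minimal Bloom filter: may report false positives, never false negatives. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    private int position(String key, int i) {
        // Double hashing: derive the i-th bit position from two cheap hashes
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;   // forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;   // definitely absent: this SSTable can be skipped
            }
        }
        return true;            // possibly present: worth a disk seek
    }
}
```

One filter per SSTable is kept in memory; a read only seeks into the SSTables whose filter answers "possibly present" for the row key.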

18 Deletion & Tombstones
- Deleted data is not removed immediately; it is marked as removed with a tombstone
- Tombstones stop deleted data coming back as zombie data from a replica that missed the delete (this is a distributed system)
- Tombstones are collected after a few days (configurable)
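
A small sketch of how a tombstone wins over stale data when replicas are reconciled. The `Cell` class, its field names and the tie-break rule (a delete wins when timestamps are equal) are assumptions made for this example rather than Cassandra's actual classes.

```java
/** A single column value or a tombstone, tagged with its write timestamp. */
class Cell {
    final String value;       // null when this cell is a tombstone
    final long timestamp;     // write time in milliseconds
    final boolean tombstone;

    private Cell(String value, long timestamp, boolean tombstone) {
        this.value = value;
        this.timestamp = timestamp;
        this.tombstone = tombstone;
    }

    static Cell live(String value, long timestamp) {
        return new Cell(value, timestamp, false);
    }

    static Cell deleted(long timestamp) {
        return new Cell(null, timestamp, true);
    }

    /** Last write wins: the cell with the higher timestamp survives. */
    static Cell reconcile(Cell a, Cell b) {
        if (a.timestamp != b.timestamp) {
            return a.timestamp > b.timestamp ? a : b;
        }
        return a.tombstone ? a : b;   // on a tie, prefer the delete
    }

    /** A tombstone may only be dropped once the grace period has elapsed. */
    boolean purgeable(long now, long gcGraceMillis) {
        return tombstone && now - timestamp >= gcGraceMillis;
    }
}
```

If the tombstone were dropped before every replica had seen it, a replica still holding the old value would win the next reconciliation and the deleted data would reappear. That is the reason for the grace period.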

19 Brewer’s Theorem
- Distributed data: only 2 of these 3 at a time
  - Consistency
  - Availability
  - Partition Tolerance

20 Brewer’s Theorem
- CA: normal operation, no partition; consistency and availability are both provided

21 Brewer’s Theorem
- AP: a partition occurs; maintaining two mutable, disconnected copies of the state breaks consistency, but availability is preserved

22 Brewer’s Theorem
- CP: a partition occurs; to maintain consistency we need to take one side offline, sacrificing availability

23 Tuneable Consistency
- Cassandra Consistency Level (CL)
- Specify the number of nodes that must agree on each read/write
- Choose consistency or availability: CL.LOCAL_QUORUM, CL.ONE
- Once a partition heals, eventual consistency brings both sides back into agreement
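
The consistency/availability tradeoff comes down to simple arithmetic on the replication factor. The helper below is a sketch under that standard reasoning; the class and method names are hypothetical.

```java
class ConsistencyMath {
    /** Number of replicas that form a quorum for a given replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read replica set must overlap every write replica set. */
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    /** How many replicas may be down while a request at this level still succeeds. */
    static int tolerableFailures(int requiredReplicas, int replicationFactor) {
        return replicationFactor - requiredReplicas;
    }
}
```

With RF=3, reading and writing at quorum (2 + 2 > 3) guarantees overlap while tolerating one replica down. Reading and writing at ONE (1 + 1 <= 3) tolerates two replicas down but may return stale data.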

24 Background Processes
- SSTables are compacted periodically
- Size-Tiered Compaction: the default; no guarantee about when a given SSTable will be compacted
- Leveled Compaction:
  - better chance of tombstones being compacted away
  - more continual compaction, 2x I/O, with an impact on online traffic
  - use for update-heavy workloads
  - creates many SSTables

25 Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples

26 Data Model: Keyspace
- Analogous to a database/schema
- Segregates applications
- Replication is configured at this level

27 Data Model: Column Family
- Analogous to a table
- Contains many rows
- Caches are configurable at this level

28 Data Model: Row
- Each row has a partition key, which is hashed
- Has many columns: up to 2Bn
- Columns don’t have to be defined ahead of time
- Rows in the same CF can have different columns
- No sorting by rows; model ordering within rows

29 Data Model: Columns
- Sorted by name before being written to an SSTable
- Name and value are typed
- Values can be type-validated
- Each column update is timestamped
- Can have a TTL

30 Data Model: Counter Columns
- Distributed counters
- Can give false counts

31 Data Model: Super Columns – Don’t Use
- Blob of columns stored inside a single column
- Have to read and write the whole blob
- Memory intensive
- Conflicts are resolved for the whole blob (bad)

32 Secondary Indices
- Can define an index on a column
- Cassandra will maintain an inverted index
- Use sparingly
- Low-cardinality columns only
- Often better to maintain your own view

33 Thrift vs CQL
Thrift
- Original interface, hash-style syntax
CQL
- SQL-like syntax but highly limited
- Sent over Thrift, but there are plans for its own protocol

34 Maintaining Consistency
- Consistency Level is used on read & write operations
- ONE, TWO, LOCAL_QUORUM, ALL, ANY
- Do you really need a consistency guarantee?

35 Scaling Cassandra
- Imagine RF=3, CL=QUORUM, Nodes=6
- Each query waits synchronously on 2 nodes
- Each write will touch all 3 replicas, though asynchronously
- To scale writes, add more nodes
- To scale reads, add more replicas

36 Advanced Topics
- Data Modelling
- Wide Rows & Clustering
- Performance
- Solr 4 & Hadoop Integration

37 Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples

38 Data Modelling
- Concepts that Drive Data Modeling
- Time-series Modeling
- Wide Rows (Composite Columns)
- Compound Keys & CQL3

39 Data Modelling - Concepts
- Rows in the same CF will live on different nodes
  - High cost of multi-get
  - De-normalise your data into rows
- Don’t put a sustained load on a single row
  - It will heat up the replica nodes

40 Data Modelling - Concepts
- Writes to a single row are atomic & isolated
- Columns are ordered
  - Column range slicing is efficient
- Data that is mutated often needs compaction tuning

41 Wide Rows: Efficient Reads
- Store data how you want to fetch it
- Fetching is most efficient over few rows
- Store what you want to fetch in few rows

42 Time Series
- Use the timestamp as the column name: columns are ordered
- Range slicing is efficient
- Can limit row length by using a date in the partition key, e.g

43 Composite Columns
- Composite column e.g. time1:log_class, time1:log_message, time2:log_class, time2:log_message

44 Time Series
- Writing to a single row creates hotspots
- Use round robin over rows, e.g :1, :2, etc…
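
One way to combine the date-bucketed partition key with the round-robin suffix is sketched below. The key format `day:shard`, the class name `TimeSeriesKeys` and the example date are assumptions; the point is only that writers rotate over a fixed set of rows while readers fetch all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class TimeSeriesKeys {
    private final int shardsPerDay;
    private final AtomicLong counter = new AtomicLong();

    TimeSeriesKeys(int shardsPerDay) {
        this.shardsPerDay = shardsPerDay;
    }

    /** Row key for the next write: the day bucket plus a round-robin shard suffix. */
    String nextWriteKey(String day) {
        long shard = counter.getAndIncrement() % shardsPerDay + 1;
        return day + ":" + shard;
    }

    /** All row keys a reader must fetch to see the whole day. */
    List<String> readKeys(String day) {
        List<String> keys = new ArrayList<>();
        for (int shard = 1; shard <= shardsPerDay; shard++) {
            keys.add(day + ":" + shard);
        }
        return keys;
    }
}
```

The cost of spreading writes this way is a multi-get on reads, so the shard count should stay small: just enough to keep any single replica set from running hot.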

45 Compound Keys: Compound Key in CQL3
- The partition key is the row key
- Compound Key = Partition Key + Composite Key
- e.g. partition key = , composite key = time1 => time1:name, time1:msg, time2:name, time2:msg
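
The mapping from a CQL3 row to the underlying wide-row cells can be pictured with a small sketch. It assumes cell names are simply `clusteringValue:columnName` strings, which simplifies the real composite encoding; the class name `CompoundKeyLayout` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class CompoundKeyLayout {
    /** Flattens one CQL row into cell names of the form clusteringValue:columnName. */
    static Map<String, String> cellsFor(String clusteringValue, Map<String, String> columns) {
        Map<String, String> cells = new TreeMap<>();   // TreeMap keeps cell names sorted
        for (Map.Entry<String, String> column : columns.entrySet()) {
            cells.put(clusteringValue + ":" + column.getKey(), column.getValue());
        }
        return cells;
    }
}
```

All rows that share a partition key end up as sorted cells in one physical row, which is why a range query on the clustering column is a cheap column slice.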

46 Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples

47 Working with CQL
cqlsh -3 192.168.33.21
CREATE KEYSPACE my_app_data
  WITH strategy_class = SimpleStrategy
  AND strategy_options:replication_factor = 2;
DESCRIBE KEYSPACE my_app_data;

48 Compound Keys
USE my_app_data;
CREATE COLUMNFAMILY logs (
  day text,           -- partition key
  log_id timeuuid,    -- clustering column
  log_class text,
  log_message text,
  PRIMARY KEY (day, log_id)
);
DESCRIBE COLUMNFAMILIES;

49 Compound Keys
INSERT INTO logs (day, log_id, log_class, log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke')
  USING CONSISTENCY ONE;
INSERT INTO logs (day, log_id, log_class, log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke again')
  USING CONSISTENCY QUORUM;

50 Compound Keys
SELECT * FROM logs USING CONSISTENCY ONE WHERE day=' ';
SELECT * FROM logs USING CONSISTENCY QUORUM WHERE day=' ' AND log_id > ' :00:00';
TRY WITH CL.TWO: vagrant suspend node2
Setting CL and range querying columns, losing consistency

51 Compound Keys
See the raw Cassandra data:
cassandra-cli -h
use my_app_data;
list logs;

52 Code Example - Clients: Hector
- Solid Java client
- In use in production
- Round robin, node discovery

53 Code Example - Clients: Astyanax
- Netflix open source library
- Simpler APIs

54 Code Example
- Example: Storing Payment Methods

55 Code Example: Requirements
- Store 1-10 payment methods
- Use a single row

56 Code Example: Non-CQL
Define a composite column class

public static final class Composite {
    @Component(ordinal = 0) String paymentUuid;
    @Component(ordinal = 1) String field;
    // constructors and getters omitted
}

57 Code Example: Writing Data

UUID paymentUUID = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
String sPaymentUUID = paymentUUID.toString();
// one column per field, all in the user's row; the final null means no TTL
batch.withRow(PAYMENTS_CF, userId)
    .putColumn(new Composite(sPaymentUUID, "pvtoken"), paymentInfo.pvToken, null)
    .putColumn(new Composite(sPaymentUUID, "name"), paymentInfo.name, null)
    .putColumn(new Composite(sPaymentUUID, "number"), paymentInfo.number, null);

58 Code Example: Reading Data
Need some logic to handle record boundaries

// handle the payment info boundary
if (lastSeen != null && !column.getName().getPaymentUuid().equals(lastSeen)) {
    payments.add(payment);
    payment = new PaymentInfo();
    payment.paymentUUID = UUID.fromString(column.getName().paymentUuid);
}
lastSeen = column.getName().getPaymentUuid();
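
For readers without the Astyanax types to hand, the boundary handling can be tried out with plain Java collections. This is a self-contained sketch, not the code from the training: `PaymentRowReader` and its nested `Column` class are hypothetical stand-ins for the client's column objects, and it starts a new record when the UUID changes rather than closing the previous one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PaymentRowReader {
    /** One column of the wide row: composite name (paymentUuid, field) plus its value. */
    static final class Column {
        final String paymentUuid;
        final String field;
        final String value;

        Column(String paymentUuid, String field, String value) {
            this.paymentUuid = paymentUuid;
            this.field = field;
            this.value = value;
        }
    }

    /** Groups columns, already sorted by composite name, into one field map per payment. */
    static List<Map<String, String>> group(List<Column> sortedColumns) {
        List<Map<String, String>> payments = new ArrayList<>();
        Map<String, String> current = null;
        String lastSeen = null;
        for (Column column : sortedColumns) {
            // A change of paymentUuid marks the boundary between two records
            if (!column.paymentUuid.equals(lastSeen)) {
                current = new LinkedHashMap<>();
                current.put("paymentUuid", column.paymentUuid);
                payments.add(current);
                lastSeen = column.paymentUuid;
            }
            current.put(column.field, column.value);
        }
        return payments;
    }
}
```

Adding the record to the result as soon as its first column is seen avoids a separate step to flush the last payment after the loop.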

59 Code Example
- A Bit Messy

60 Code Example: CQL3
- Need to define a schema
- Cassandra needs it to split up the row for us

61 Code Example: Schema
create table paymentinfo_cql (
  user text,
  paymentid timeuuid,
  name text,
  number text,
  pvtoken text,
  primary key (user, paymentid)
);

62 Code Example: Inserting Data
insert into paymentinfo_cql (
  user, paymentid, name, number, pvtoken
) values (
  '%1$s','%2$s','%3$s','%4$s','%5$s'
)

63 Code Example: Reading Data
select * from paymentinfo_cql where user='%s'

64 Multi Datacentre Support
- Cassandra RF=2 (availability), Solr RF=1 (offline search)
- RFs are set per keyspace and per logical datacentre

65 Multi Datacentre Support
- Both DCs participate in the same ring
- Cassandra walks clockwise as normal to fulfil the RFs

66 Performance Tuning Levers: Memory Mapped Files
- SSTables are memory mapped
- Visible as high virtual memory consumption
- Reads are fastest when the working set fits in free RAM

67 Performance Tuning Levers: Row Cache
- Saves locating SSTables, seeking and reconciliation
- Off-heap, so there is an IPC marshalling penalty
- Holds the whole row in memory
- Good for small numbers of hot rows (Gaussian distribution)

68 Performance Tuning Levers: Key Cache
- Saves seeking through SSTables
- Beneficial for large SSTables (size-tiered compaction)
- On-heap

69 Performance Tuning Levers
- Cache hit rates are exposed over JMX

70 Performance Tuning Levers
- Take care when using memory that might be stolen from the read path (virtual memory)

71 DataStax Enterprise: Solr 4.0 Integration
- Near-realtime indexing
- Columns are available to Solr to index
- Indexes are maintained in the original file format
- Supports distributed search
- Use the Cassandra API or the Solr API

72 DataStax Enterprise: Hadoop Integration
- DataStax implements HDFS on Cassandra: CFS
- Use the H* or C* API
- No ETL
- Map operations are sent to replicas
- Reduce results go back to the task owner

73 Virtual Nodes
- Problem #1: Adding New Nodes

74 Virtual Nodes
- Wish to add a node
- Ring already loaded
- Minimise streaming caused by moves
- Could put it in between 2 existing nodes, but that only helps a small range (this sucks)

75 Virtual Nodes
- Double the size of the ring
- Minimise streaming caused by moves
- Don’t want to have to buy 2 x servers each time (also sucks)

76 Virtual Nodes
- Choose to rebalance the ring
- Load already warranted expansion
- Now adding streaming load

77 Virtual Nodes
- Problem #2: Replacing Failed Nodes

78 Virtual Nodes
- Node fails
- Remaining replica heats up

79 Virtual Nodes
- Bootstrap another
- Now node 20 starts streaming => FIRE!

80 Virtual Nodes
- The Solution

81 Virtual Nodes
- Slice each node into 256 token ranges

82 Virtual Nodes
- Randomly distribute tokens to other nodes

83 Virtual Nodes
- Each colour represents a node
- Each node owns an even, random distribution of the ring
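
The effect of random token assignment can be checked with a toy simulation. Everything here is an assumption made for illustration (the class `VirtualNodeRing`, `long` tokens drawn from a seeded `java.util.Random`, and "neighbour" meaning the owner of the next token clockwise); it is not how Cassandra allocates tokens internally.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class VirtualNodeRing {
    private final TreeMap<Long, String> ownerByToken = new TreeMap<>();

    VirtualNodeRing(List<String> nodes, int tokensPerNode, long seed) {
        Random random = new Random(seed);
        for (String node : nodes) {
            int assigned = 0;
            while (assigned < tokensPerNode) {
                // putIfAbsent returns null only when the token was still free
                if (ownerByToken.putIfAbsent(random.nextLong(), node) == null) {
                    assigned++;
                }
            }
        }
    }

    int tokenCount(String node) {
        int count = 0;
        for (String owner : ownerByToken.values()) {
            if (owner.equals(node)) {
                count++;
            }
        }
        return count;
    }

    /** Other nodes owning the range clockwise-adjacent to any of this node's ranges. */
    Set<String> neighboursOf(String node) {
        Set<String> neighbours = new TreeSet<>();
        for (Map.Entry<Long, String> entry : ownerByToken.entrySet()) {
            if (!entry.getValue().equals(node)) {
                continue;
            }
            Map.Entry<Long, String> next = ownerByToken.higherEntry(entry.getKey());
            if (next == null) {
                next = ownerByToken.firstEntry();   // wrap around the ring
            }
            if (!next.getValue().equals(node)) {
                neighbours.add(next.getValue());
            }
        }
        return neighbours;
    }
}
```

With one token per node, a node has a single clockwise neighbour. With 256 random tokens per node, its ranges sit next to ranges of practically every other node, so the work of rebuilding it is shared across the cluster.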

84 Virtual Nodes
- Replacing a node
- Can stream from every node

85 Nodetool & Opscenter
- Do stuff with your deployment
- watch "nodetool ring"
- Useful overview of the ring: tokens, health
- Opscenter

86 Aims
By the end of today you should know:
- How Cassandra organises data
- How to configure replicas
- How to choose between consistency and availability
- How to efficiently model data for both reads and writes
- That you need to consider Active-Active scenarios
- Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or

87 Questions
Code Example: htraining.s3.amazonaws.com/cassandra-training.pptx

