Dynamo Recap Hadoop & Spark

1 Dynamo Recap Hadoop & Spark
6.830 Lecture 19 11/15/2017

2 Dynamo Recap
Key ideas:
Get/put KV-store with single-key atomicity
All data replicated on N nodes
Data stored in a "ring" representing the space of hash values from, say, 0 to 2^128
"Overlay network": nodes are not actually in a physical ring, but are just machines on the Internet
Each node occupies one (or multiple) random locations on the ring
A key hashed to location k is stored on the N successors in the ring (figure: ring with N=3)
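A minimal Scala sketch of this placement rule (node and key names are hypothetical; MD5 is used only because its 128-bit output matches the 2^128 hash space):

  import java.security.MessageDigest

  // Hash a string onto the ring of values 0 .. 2^128 - 1 (MD5 digest read as unsigned).
  def ringHash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

  // A key is stored on the N nodes whose positions follow its hash clockwise.
  def preferenceList(key: String, nodes: Seq[String], n: Int): Seq[String] = {
    val ring = nodes.map(node => (ringHash(node), node)).sortBy(_._1)
    val k = ringHash(key)
    val (after, before) = ring.partition(_._1 >= k)
    (after ++ before).map(_._2).take(n)  // wrap around past the largest position
  }

  preferenceList("cart:42", Seq("A", "B", "C", "D", "E"), n = 3)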

3 Joining the Ring
Administrators explicitly add / remove nodes
When a node joins, it contacts a list of "seed nodes"
Other nodes periodically "gossip" to learn about the ring structure
When a node i learns about a new node j, i sends j any keys j is responsible for
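A rough sketch of how gossiped views converge (the version-counter representation of a membership view is an assumption for illustration, not Dynamo's actual wire format):

  // Each node's view maps node id -> freshest version it has heard of.
  // Merging two views keeps the newer entry for every node, so repeated
  // gossip rounds spread knowledge of a newly joined node to everyone.
  def mergeViews(mine: Map[String, Long], theirs: Map[String, Long]): Map[String, Long] =
    (mine.keySet ++ theirs.keySet).map { node =>
      node -> math.max(mine.getOrElse(node, 0L), theirs.getOrElse(node, 0L))
    }.toMap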

4 Quorum Reads
N replicas; write to W nodes, read from R nodes
Always try to read or write to all N nodes, but only require W writes or R reads to succeed
If R + W > N, then all reads will see at least one copy of the most recent write
Need some way to ensure that if fewer than N nodes are written to, the write eventually propagates
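The R + W > N intersection rule can be checked exhaustively for small N; a toy verification in Scala, with parameters matching the example on the next slides:

  val nodes = (1 to 3).toSet  // N = 3 replicas
  val r = 2; val w = 2        // R + W = 4 > N
  // Every possible write set of size W intersects every read set of size R,
  // so a read is guaranteed to see at least one copy of the latest write.
  val alwaysOverlap = nodes.subsets(w).forall { ws =>
    nodes.subsets(r).forall { rs => (ws & rs).nonEmpty }
  }
  println(alwaysOverlap)      // true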

5 Sloppy Quorum & Hinted Handoff
If fewer than N writes succeed, continue around the ring, past the N successors
(figure: with N=3, 2 out of 3 writes to key k succeed, so the coordinator continues around the ring and writes to B, tagged with the hint Owner=E)
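A minimal sketch of the handoff rule (the ring layout and the isUp predicate are hypothetical):

  // Walk clockwise from the key's position; if a preferred replica is down,
  // write to the next live node instead, attaching a hint naming the
  // intended owner so the data can be handed back when it recovers.
  def sloppyWrite(ring: List[String], keyPos: Int, n: Int,
                  isUp: String => Boolean): List[(String, Option[String])] = {
    val clockwise = ring.drop(keyPos) ++ ring.take(keyPos)
    val (preferred, rest) = clockwise.splitAt(n)
    val live = preferred.filter(isUp).map(node => (node, None))
    val hinted = rest.filter(isUp)
      .zip(preferred.filterNot(isUp))  // hint = the down owner
      .map { case (sub, owner) => (sub, Some(owner)) }
    live ++ hinted
  }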

6 Sloppy Quorum → Divergence
If the network is partitioned, hinted handoff can lead to divergent replicas
E.g., suppose N=3, W=2, R=2, and the network partitions
(figure: Client 1's write to key k succeeds sloppily on two nodes in its partition)

7 Sloppy Quorum → Divergence
If the network is partitioned, hinted handoff can lead to divergent replicas
(figure: Client 2's write to k also succeeds sloppily on the other side of the partition; two different versions of key k, k1 and k2, now exist)

8 Vector Clocks
Each node keeps a monotonic version counter that increments for every write it masters
Each data item has a clock, consisting of a list of the most recent version it includes from each master
(figure: key k on a ring of nodes A through F)

9 Vector Clocks
(figure: node C masters a write to key k, creating version 1 with clock [C,1])

10 Vector Clocks
(figure: client C1 writes k again via node C, creating version 2 with clock [C,2])

11 Vector Clocks
(figure: client C2 writes k via node D, starting from version 1, creating version 3 with clock [C,1][D,1]; clocks [C,2] and [C,1][D,1] are incomparable and can't be totally ordered)
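A minimal Scala sketch of these clocks and the incomparability check (string node ids are an assumption for illustration):

  case class VectorClock(versions: Map[String, Int]) {
    // A node bumps its own counter when it masters a write.
    def increment(node: String): VectorClock =
      VectorClock(versions.updated(node, versions.getOrElse(node, 0) + 1))
    // This clock descends from `other` iff it is >= in every component.
    def descendsFrom(other: VectorClock): Boolean =
      other.versions.forall { case (n, v) => versions.getOrElse(n, 0) >= v }
  }

  val v1 = VectorClock(Map.empty).increment("C")  // [C,1]
  val v2 = v1.increment("C")                      // [C,2]
  val v3 = v1.increment("D")                      // [C,1][D,1]
  // Neither descends from the other: the versions are incomparable.
  println(v2.descendsFrom(v3) || v3.descendsFrom(v2))  // false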

12 Read Repair
Possible for a client to read 2 incomparable versions
Need reconciliation; options:
Latest writer wins
Application-specific reconciliation (e.g., shopping cart union)
After reconciliation, perform a write-back, so replicas know about the new state
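For example, the shopping-cart union mentioned above might look like this (the cart contents are made up, and the clocks refer to the vector-clock example above):

  val cart1 = Set("book", "pen")   // version with clock [C,2]
  val cart2 = Set("book", "lamp")  // version with clock [C,1][D,1]
  val merged = cart1 union cart2   // Set(book, pen, lamp)
  // The merged value is then written back with a clock that dominates both,
  // so every replica converges on the reconciled state.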

13 Anti-entropy
Once a partition heals, or a node recovers, need a way to patch up
Could rely on gossip & hinted handoff
But the system also periodically compares with the other nodes responsible for each key range
Comparison done via hashing
(figure: for the E-to-A range, B and C are also responsible)
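A toy version of the hash comparison (Dynamo hashes ranges hierarchically with Merkle trees so only differing subranges are transferred; the flat digest here is a simplification):

  import java.security.MessageDigest

  // Digest a key range deterministically; replicas with equal digests can
  // skip the range, and only mismatched ranges need to exchange data.
  def rangeDigest(kvs: Map[String, String]): Seq[Byte] = {
    val md = MessageDigest.getInstance("SHA-256")
    kvs.toSeq.sorted.foreach { case (k, v) => md.update(s"$k=$v;".getBytes("UTF-8")) }
    md.digest.toSeq
  }

  def needsSync(mine: Map[String, String], theirs: Map[String, String]): Boolean =
    rangeDigest(mine) != rangeDigest(theirs)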

14 MapReduce Word Count Example

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word; called once per word across all docs
  // values: a "1" for each occurrence of this word in all docs
  int result = length(values);
  Emit(AsString(result));
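For comparison, the same computation expressed in Spark's Scala API (the path is a placeholder, and this assumes the same SparkContext handle, spark, as the next slide's example):

  val counts = spark.textFile("hdfs://...")
    .flatMap(_.split("\\s+"))  // map: one record per word
    .map(word => (word, 1))    // intermediate (word, 1) pairs
    .reduceByKey(_ + _)        // reduce: sum the counts per word
  counts.collect()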

15 Spark Example

// Find error lines
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

// Count all errors
errors.count()

// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

16 Lineage Graph
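The slide's figure is not in the transcript, but the lineage it depicts is the chain of transformations above (lines → errors → filtered/mapped RDDs). A hedged sketch of inspecting it, continuing the previous example:

  // Every RDD remembers the transformations that produced it; a lost
  // partition is rebuilt by replaying just this lineage, not the whole job.
  println(errors.toDebugString)  // prints the chain of parent RDDs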

17 Types of Dependencies
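The figure is not in the transcript; in the RDD paper's terms, dependencies are either narrow (each parent partition feeds at most one child partition) or wide (a shuffle is required). A small illustration, with a made-up dataset:

  val pairs = spark.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val narrow = pairs.mapValues(_ + 1)  // narrow: no data movement between partitions
  val wide   = pairs.groupByKey()      // wide: requires a shuffle across partitions
  // Wide dependencies are where the scheduler places stage boundaries.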

18 Scheduling Stages
(figure legend: not-cached RDD vs. cached RDD)

19 Spark Performance

