Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina.

Similar presentations


Presentation on theme: "CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina."— Presentation transcript:

1 CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina

2 CS 347Lecture 12 In CS245: Centralized DB system Software: Application SQL Front End Query Processor Transaction Proc. File Access P M... Simplifications: single front end one place to keep locks if processor fails, system fails,...

3 CS 347Lecture 13 In CS347 Multiple processors ( + memories) Heterogeneity and autonomy of “components”

4 CS 347Lecture 14 Multiple processors Opportunity for parallelism Opportunity for reliability Synchronization issues  To illustrate synchronization problems: Two Generals Problem

5 CS 347Lecture 15 The one general problem (Trivial!)  Battlefield G Troops

6 CS 347Lecture 16 The two general problem: messengers Blue armyRed army Blue G Red G Enemy

7 CS 347Lecture 17 Blue and red army must attack at same time Blue and red generals synchronize through messengers Messengers can be lost Rules:

8 CS 347Lecture 18 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough??

9 CS 347Lecture 19 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough?? ack (red goes at 9am)

10 CS 347Lecture 110 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough?? ack (red goes at 9am) got ack

11 CS 347Lecture 111 Stated problem is Impossible! Theorem: There is no protocol that uses a finite number of messages that solves the two-generals problem (as stated here) Alternatives??

12 CS 347Lecture 112 Probabilistic Approach? Send as many messages as possible, hope one gets through... BGRG attack at 9am assume blue starts... attack at 9am

13 CS 347Lecture 113 Eventual Commit Eventually both sides attack... BGRG attack ASAP assume blue starts... on my way! retransmits

14 CS 347Lecture 114 Eventual Commit One message sent every time unit Probability of success one message is p What is probability that red commits by time t? BG RG attack ASAP on my way! retransmits

15 CS 347Lecture 115 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p

16 CS 347Lecture 116 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p C(2) = p + (1-p)p

17 CS 347Lecture 117 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p C(2) = p + (1-p)p C(3) = p + (1-p)p + (1-p) 2 p C(4) = p + (1-p)p + (1-p) 2 p + (1-p) 3 p

18 Eventual Commit CS 347Lecture 118 C(t) t p

19 CS 347Lecture 119 Eventual Commit BG RG attack ASAP on my way! retransmits How expensive is protocol? E = expected number of messages Homework: compute E (function of p)

20 CS 347Lecture 120 2-Phase Eventual Commit Eventually both sides attack... BGRG ready to attack? assume blue starts... yes, at your disposal attack ASAP ack retransmits phase 1 phase 2

21 CS 347Lecture 121 Commit Protocols Will study commit protocols like these...

22 CS 347Lecture 122 Heterogeneity Select new investments Application RDBMS Files Stock ticker tape Portfolio History of dividends, ratios,...

23 CS 347Lecture 123 Autonomy Example: unable to get statistics for query optimization Example: blue general may have mind of his (or her) own!

24 CS 347Lecture 124 So, in CS347 we study data management with multiple processors and possible autonomy, heterogeneity –Impact on: Data organization Query processing Access structures Concurrency control Recovery

25 CS 347Lecture 125 Renewed Interest in Distributed/Parallel Data Processing! –Massive web data, manage with many computers –How to crawl and search the web? –Peer-to-peer systems manage huge amounts of data –Data from many sources (e.g., comparison shopping): how to integrate? –Sensor Networks: data generated an many sensors/devices, need to analyze –Multi-player games (e.g., World of Warcraft): tons of distributed data

26 CS 347Lecture 126 It’s the Economy, Stupid! Example: Multi-player games Data state P P P P P P P P P P

27 CS 347Lecture 127 It’s the Economy, Stupid! Example: Multi-player games Data state P P P P P P P P P P

28 CS 347Lecture 128 Logistics LECTURES: Mondays and Wednesdays 12:50pm to 2:05pm, Gates B01 INSTRUCTOR: Brian Cooper; Email: brianfrankcooper@gmail.com; Office Hours: Mondays, Wednesdays 11:30 to 12:30. TEACHING ASSISTANT: –Vasilis Verroios, Email: verroios@stanford.edu Piazza forum (tentative): https://piazza.com/class#spring2015/cs347. SECRETARY: Marianne Siroker; Office: Gates Hall 435; Email: siroker@cs.stanford.edu; Phone: (650) 723-0872

29 CS 347Lecture 129 Logistics TEXTBOOK: No required textbook. You'll be expected to read several research papers. CLASS WEB PAGE: http://www.stanford.edu/class/cs347 Will contain homework assignments, course news, etc. Be sure to check it periodically. ASSIGNMENTS: about 5 homeworks GRADING: Homeworks: 20%, Midterm 30%, Final: 50%.

30 CS 347Lecture 130 Tentative Syllabus 2014 (Part I) DATETOPIC Monday March 30Introduction [N01] Wednesday April 1 Data Fragmentation [N02] Monday April 6Query processing [N03] Wednesday April 8Query processing & Optimization [N04] Monday April 13Concurrency Control, Failures [N05] Wednesday April 15Reliable Data Management [N06] Monday April 20Reliable Data Management [N06] Wednesday April 22Guest Lecture – Cloudera Impala Monday April 27Replicated Data Management [N07] Wednesday April 29Midterm

31 CS 347Lecture 131 Tentative Syllabus 2014 (Part II) DATETOPIC Monday May 4Partitions, Entity Resolution [N11] Wednesday May 6Peer to Peer Systems [N08] Monday May 11Map-Reduce & Pig [N09] Wednesday May 13Map-Reduce & Pig [N09] Monday May 18Other Open Source Systems [N09b] Wednesday May 20Other Open Source Systems [N09b] Wednesday May 27Distributed IR [N10] Monday June 1Time [N12] Wednesday June 3Publish/Subscribe Systems [N13] Monday June 88:30 am!!! FINAL EXAM

32 Interesting New Systems Storm (from Twitter) S4 (from Yahoo) Cassandra (key-value store) Hive (SQL over Hadoop) Pregel (graph execution) Kestrel (queues?) ZooKeeper (replicated data) Sparkl or Spark (Berkeley?) H-Base HyRacks (UC Irvine) CS 347Lecture 132 MemCache-D Pnuts (Yahoo) Dynamo (Amazon) Spanner (Google) Paxos G-Store (UC Santa Barbara) Elastras (UC Santa Barbara) Tao (Facebook) Kafka (LinkedIn) Impala (Cloudera)

33 CS 347Lecture 133 Concepts you should be familiar with: CS245: query plan, cost estimation, join algorithms, recovery, logging,… Interconnection networks (bus, mesh, hypercube,…) Computer networks (LAN, WAN,…)

34 CS 347Lecture 134 Introductory topics Database architectures Client-server systems Distributed vs. parallel DB systems Cloud Computing

35 CS 347Lecture 135 DB architectures (1) Shared memory PPP... M

36 CS 347Lecture 136 DB architectures (2) Shared disk... P M PP MM

37 CS 347Lecture 137 DB architectures (2B) Shared data storage (disk or file?)... P M PP MM storage area network (SAN) Hadoop/Google file system

38 CS 347Lecture 138 DB architectures (3) Shared nothing P M P M P M...

39 CS 347Lecture 139 DB architectures (4) Hybrid example M PPP... M PPP

40 CS 347Lecture 140 DB architectures (4) Hybrid example 2 P M P M P M P M... WAN LAN #1 RR LAN #2

41 CS 347Lecture 141 DB architectures (4) Hybrid Tandem-like also in: Microsoft SQLServer Parallel Data Warehouse P M P M P M P M...

42 CS 347Lecture 142 DB architectures (5) Unusual? Datacycle (Broadcast disks) PPP MMM Entire DB broadcast

43 CS 347Lecture 143 (5) Unusual Sorting network Sort net P M P M...

44 CS 347Lecture 144 (5) Unusual — processor per track or processor per disk M P P’P’ P’P’ P’P’... “small” processors + “tiny” memories Related idea in Oracle Exadata "DB machine"

45 CS 347Lecture 145 (6) Unusual — sensor networks P’P’ M M P B M P B M P B M P B M P B data collection node sensor battery

46 CS 347Lecture 146 Issues for selecting architecture Reliability Scalability Geographic distribution of data Data “clusters” Performance Cost

47 CS 347Lecture 147 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server

48 CS 347Lecture 148 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server

49 CS 347Lecture 149 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server

50 CS 347Lecture 150 Transaction Servers Clients ship transactions consisting of 1 or more SQL commands E.g., Open DataBase Connectivity (ODBC) (standard API)

51 CS 347Lecture 151 Data Servers Client requests pages or records Popular for OODB systems

52 CS 347Lecture 152 Issues Object granularity Where is data cached? Where is locking done?

53 CS 347Lecture 153 Basic Tradeoff Offloading work to clients Data transmitted CC SS Get pages Reserve hotel room

54 CS 347Lecture 154 Note: Similar issues arise when we partition software/functionality within server Reserve hotel room P M P M P M... Where is data cached? Where is locking done?

55 CS 347Lecture 155 Parallel or distributed DB system? More similarities than differences!

56 CS 347Lecture 156 Typically, parallel DBs: –Fast interconnect –Homogeneous software –High performance is goal –Transparency is goal

57 CS 347Lecture 157 Typically, distributed DBs: –Geographically distributed –Data sharing is goal (may run into heterogeneity, autonomy) –Disconnected operation possible

58 Cloud Computing Is CC just a marketing term?? –utility (like power) –data or CPU cycles? –many processors, many storage units –business model CS 347Lecture 158

59 Is CC a subset, superset, disjoint from, or overlaps with: grid computing distributed computing Web 2.0 Cluster Computing Peer-to-peer computing software as a service client-server computing data center as a computer massively parallel computing CS 347Lecture 1 59 (A) CC (B) CC (C) CC (D) CC

60 Clash of the Clouds (Economist April 4, 2009) CS 347Lecture 160

61 Silver lining (cloud price war) (Economist, August 30, 2014) CS 347Lecture 161

62 CC Issues Customer lock-in Privacy Standards Software licensing CS 347Lecture 162

63 CS 347Lecture 163 Next How to describe distributed data Query processing in parallel DBs Query processing in distributed DBs

64 CS 347Lecture 164 Query processing in parallel DBs: Typically: we can distribute/partition/ sort…. data to make certain DB operations (e.g., Join) fast

65 CS 347Lecture 165 Query processing in distributed DBs: Typically: we are given data distribution; we need to find query processing strategy to minimize cost (e.g., communication cost)


Download ppt "CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina."

Similar presentations


Ads by Google