Presentation on theme: "CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina."— Presentation transcript:
CS 347Lecture 11 CS 347: Parallel and Distributed Data Management Notes 01: Introduction Brian Cooper Based on Material by Hector Garcia-Molina
CS 347Lecture 12 In CS245: Centralized DB system Software: Application SQL Front End Query Processor Transaction Proc. File Access P M... Simplifications: single front end one place to keep locks if processor fails, system fails,...
CS 347Lecture 13 In CS347 Multiple processors ( + memories) Heterogeneity and autonomy of “components”
CS 347Lecture 14 Multiple processors Opportunity for parallelism Opportunity for reliability Synchronization issues To illustrate synchronization problems: Two Generals Problem
CS 347Lecture 15 The one general problem (Trivial!) Battlefield G Troops
CS 347Lecture 16 The two general problem: messengers Blue armyRed army Blue G Red G Enemy
CS 347Lecture 17 Blue and red army must attack at same time Blue and red generals synchronize through messengers Messengers can be lost Rules:
CS 347Lecture 18 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough??
CS 347Lecture 19 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough?? ack (red goes at 9am)
CS 347Lecture 110 How Many Messages Do We Need? BGRG attack at 9am assume blue starts... Is this enough?? ack (red goes at 9am) got ack
CS 347Lecture 111 Stated problem is Impossible! Theorem: There is no protocol that uses a finite number of messages that solves the two-generals problem (as stated here) Alternatives??
CS 347Lecture 112 Probabilistic Approach? Send as many messages as possible, hope one gets through... BGRG attack at 9am assume blue starts... attack at 9am
CS 347Lecture 113 Eventual Commit Eventually both sides attack... BGRG attack ASAP assume blue starts... on my way! retransmits
CS 347Lecture 114 Eventual Commit One message sent every time unit Probability of success one message is p What is probability that red commits by time t? BG RG attack ASAP on my way! retransmits
CS 347Lecture 115 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p
CS 347Lecture 116 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p C(2) = p + (1-p)p
CS 347Lecture 117 Eventual Commit BG RG attack ASAP on my way! retransmits C(1) = p C(2) = p + (1-p)p C(3) = p + (1-p)p + (1-p) 2 p C(4) = p + (1-p)p + (1-p) 2 p + (1-p) 3 p
CS 347Lecture 119 Eventual Commit BG RG attack ASAP on my way! retransmits How expensive is protocol? E = expected number of messages Homework: compute E (function of p)
CS 347Lecture 120 2-Phase Eventual Commit Eventually both sides attack... BGRG ready to attack? assume blue starts... yes, at your disposal attack ASAP ack retransmits phase 1 phase 2
CS 347Lecture 121 Commit Protocols Will study commit protocols like these...
CS 347Lecture 122 Heterogeneity Select new investments Application RDBMS Files Stock ticker tape Portfolio History of dividends, ratios,...
CS 347Lecture 123 Autonomy Example: unable to get statistics for query optimization Example: blue general may have mind of his (or her) own!
CS 347Lecture 124 So, in CS347 we study data management with multiple processors and possible autonomy, heterogeneity –Impact on: Data organization Query processing Access structures Concurrency control Recovery
CS 347Lecture 125 Renewed Interest in Distributed/Parallel Data Processing! –Massive web data, manage with many computers –How to crawl and search the web? –Peer-to-peer systems manage huge amounts of data –Data from many sources (e.g., comparison shopping): how to integrate? –Sensor Networks: data generated an many sensors/devices, need to analyze –Multi-player games (e.g., World of Warcraft): tons of distributed data
CS 347Lecture 126 It’s the Economy, Stupid! Example: Multi-player games Data state P P P P P P P P P P
CS 347Lecture 127 It’s the Economy, Stupid! Example: Multi-player games Data state P P P P P P P P P P
CS 347Lecture 128 Logistics LECTURES: Mondays and Wednesdays 12:50pm to 2:05pm, Gates B01 INSTRUCTOR: Brian Cooper; Email: firstname.lastname@example.org; Office Hours: Mondays, Wednesdays 11:30 to 12:30. TEACHING ASSISTANT: –Vasilis Verroios, Email: email@example.com Piazza forum (tentative): https://piazza.com/class#spring2015/cs347. SECRETARY: Marianne Siroker; Office: Gates Hall 435; Email: firstname.lastname@example.org; Phone: (650) 723-0872
CS 347Lecture 129 Logistics TEXTBOOK: No required textbook. You'll be expected to read several research papers. CLASS WEB PAGE: http://www.stanford.edu/class/cs347 Will contain homework assignments, course news, etc. Be sure to check it periodically. ASSIGNMENTS: about 5 homeworks GRADING: Homeworks: 20%, Midterm 30%, Final: 50%.
CS 347Lecture 130 Tentative Syllabus 2014 (Part I) DATETOPIC Monday March 30Introduction [N01] Wednesday April 1 Data Fragmentation [N02] Monday April 6Query processing [N03] Wednesday April 8Query processing & Optimization [N04] Monday April 13Concurrency Control, Failures [N05] Wednesday April 15Reliable Data Management [N06] Monday April 20Reliable Data Management [N06] Wednesday April 22Guest Lecture – Cloudera Impala Monday April 27Replicated Data Management [N07] Wednesday April 29Midterm
CS 347Lecture 131 Tentative Syllabus 2014 (Part II) DATETOPIC Monday May 4Partitions, Entity Resolution [N11] Wednesday May 6Peer to Peer Systems [N08] Monday May 11Map-Reduce & Pig [N09] Wednesday May 13Map-Reduce & Pig [N09] Monday May 18Other Open Source Systems [N09b] Wednesday May 20Other Open Source Systems [N09b] Wednesday May 27Distributed IR [N10] Monday June 1Time [N12] Wednesday June 3Publish/Subscribe Systems [N13] Monday June 88:30 am!!! FINAL EXAM
Interesting New Systems Storm (from Twitter) S4 (from Yahoo) Cassandra (key-value store) Hive (SQL over Hadoop) Pregel (graph execution) Kestrel (queues?) ZooKeeper (replicated data) Sparkl or Spark (Berkeley?) H-Base HyRacks (UC Irvine) CS 347Lecture 132 MemCache-D Pnuts (Yahoo) Dynamo (Amazon) Spanner (Google) Paxos G-Store (UC Santa Barbara) Elastras (UC Santa Barbara) Tao (Facebook) Kafka (LinkedIn) Impala (Cloudera)
CS 347Lecture 133 Concepts you should be familiar with: CS245: query plan, cost estimation, join algorithms, recovery, logging,… Interconnection networks (bus, mesh, hypercube,…) Computer networks (LAN, WAN,…)
CS 347Lecture 134 Introductory topics Database architectures Client-server systems Distributed vs. parallel DB systems Cloud Computing
CS 347Lecture 135 DB architectures (1) Shared memory PPP... M
CS 347Lecture 136 DB architectures (2) Shared disk... P M PP MM
CS 347Lecture 137 DB architectures (2B) Shared data storage (disk or file?)... P M PP MM storage area network (SAN) Hadoop/Google file system
CS 347Lecture 138 DB architectures (3) Shared nothing P M P M P M...
CS 347Lecture 139 DB architectures (4) Hybrid example M PPP... M PPP
CS 347Lecture 140 DB architectures (4) Hybrid example 2 P M P M P M P M... WAN LAN #1 RR LAN #2
CS 347Lecture 141 DB architectures (4) Hybrid Tandem-like also in: Microsoft SQLServer Parallel Data Warehouse P M P M P M P M...
CS 347Lecture 142 DB architectures (5) Unusual? Datacycle (Broadcast disks) PPP MMM Entire DB broadcast
CS 347Lecture 143 (5) Unusual Sorting network Sort net P M P M...
CS 347Lecture 144 (5) Unusual — processor per track or processor per disk M P P’P’ P’P’ P’P’... “small” processors + “tiny” memories Related idea in Oracle Exadata "DB machine"
CS 347Lecture 145 (6) Unusual — sensor networks P’P’ M M P B M P B M P B M P B M P B data collection node sensor battery
CS 347Lecture 146 Issues for selecting architecture Reliability Scalability Geographic distribution of data Data “clusters” Performance Cost
CS 347Lecture 147 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server
CS 347Lecture 148 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server
CS 347Lecture 149 Client-Server Systems (or how to partition software) Application Front End Query Processor Transaction Processing File Access client server
CS 347Lecture 150 Transaction Servers Clients ship transactions consisting of 1 or more SQL commands E.g., Open DataBase Connectivity (ODBC) (standard API)
CS 347Lecture 151 Data Servers Client requests pages or records Popular for OODB systems
CS 347Lecture 152 Issues Object granularity Where is data cached? Where is locking done?
CS 347Lecture 153 Basic Tradeoff Offloading work to clients Data transmitted CC SS Get pages Reserve hotel room
CS 347Lecture 154 Note: Similar issues arise when we partition software/functionality within server Reserve hotel room P M P M P M... Where is data cached? Where is locking done?
CS 347Lecture 155 Parallel or distributed DB system? More similarities than differences!
CS 347Lecture 156 Typically, parallel DBs: –Fast interconnect –Homogeneous software –High performance is goal –Transparency is goal
CS 347Lecture 157 Typically, distributed DBs: –Geographically distributed –Data sharing is goal (may run into heterogeneity, autonomy) –Disconnected operation possible
Cloud Computing Is CC just a marketing term?? –utility (like power) –data or CPU cycles? –many processors, many storage units –business model CS 347Lecture 158
Is CC a subset, superset, disjoint from, or overlaps with: grid computing distributed computing Web 2.0 Cluster Computing Peer-to-peer computing software as a service client-server computing data center as a computer massively parallel computing CS 347Lecture 1 59 (A) CC (B) CC (C) CC (D) CC
Clash of the Clouds (Economist April 4, 2009) CS 347Lecture 160