
PSU's CS 587: 22. Parallel, Distributed Access


Slide 1: 22. Parallel, Distributed Access
- Parallel vs distributed DBMSs
- Parallel DBMSs
  - Measuring success: scalability
  - Hardware architectures
  - Types of parallelism, partitioning
  - Parallelizing operators: scan, select, sort, aggregate, join
- Distributed DBMSs
  - Classifying distributed DBMSs: hetero- and homogeneous, client-server and collaborating-server, horizontal and vertical fragmentation
  - Distributed catalog management
  - Distributed query processing: general queries, joins, optimization
  - Replication
    - Synchronous: voting, read-any write-all
    - Asynchronous: peer-to-peer, primary site (Capture and Apply)
  - Distributed locking, deadlock detection, transactions

Slide 2: Parallel vs. Distributed DBMSs
- A DBMS may be spread among the nodes of a network
- Parallel DBMS
  - Nodes are under central control
  - Nodes are typically in one location
  - Motivated by performance
- Distributed DBMS
  - Nodes are autonomous and may run different DBMSs
  - Nodes are physically distributed
  - Motivated by the need for remote access to data in spite of network failures

Slide 3: Motivation for Parallel DBMSs: What if one CPU is not enough?
- For a large web application, one CPU is never enough
  - A large web application may need to scan a 1 TB file or serve hundreds of customers at once
- At 50 MB/second, scanning a 1 TB file takes about 5.5 hours
- Using 1000 CPUs scanning 1000 disks, the same scan takes about 20 seconds
- Similarly, hundreds of customers accessing a database can push response times into minutes
- Solution: parallel DBMSs
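
The arithmetic behind these numbers, as a quick sanity check (the 50 MB/s per-disk scan rate and perfect 1000-way speedup are the slide's assumptions):

```python
TB_IN_MB = 10**6          # 1 TB expressed in MB
RATE = 50                 # assumed sequential scan rate per disk, MB/second
DISKS = 1000

one_disk = TB_IN_MB / RATE            # seconds with a single CPU and disk
many_disks = one_disk / DISKS         # assuming perfect, linear speedup

print(f"1 disk:     {one_disk:,.0f} s  (~{one_disk / 3600:.1f} hours)")
print(f"{DISKS} disks: {many_disks:,.0f} s")
# 1 disk:     20,000 s  (~5.6 hours)
# 1000 disks: 20 s
```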

Slide 4: DBMS: The || Success Story
- DBMSs, and information retrieval, are the most (only?) successful applications of parallelism
  - Every major DBMS vendor has some || server
  - Large DBMS back ends for web applications are all parallelized
  - Large IR applications are parallelized
- Reasons for DBMS success:
  - Relational algebra has few operators: just parallelize each one and compose them
  - Bulk processing (= partition ||-ism)
  - Natural pipelining in operator trees

Slide 5: How do we measure the success of parallel algorithms and architectures?
- Scalability. There are two kinds:
  - Speedup: when the number of nodes grows by a factor of n, performance on the same workload improves by a factor of n
  - Scaleup: when the number of nodes and the size of the database both grow by a factor of n, performance does not change
- Performance can be measured in terms of
  - Elapsed time, or average time per transaction
  - Number of transactions per second (throughput)
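
A minimal sketch of the two metrics in terms of measured elapsed time (the function names and example numbers are mine, not the slides'):

```python
def speedup(time_on_1_node: float, time_on_n_nodes: float) -> float:
    """Same workload, n times the hardware. Ideal value: n."""
    return time_on_1_node / time_on_n_nodes

def scaleup(time_small_on_1: float, time_n_x_larger_on_n: float) -> float:
    """Workload and hardware both grown by n. Ideal value: 1.0."""
    return time_small_on_1 / time_n_x_larger_on_n

print(speedup(1200.0, 150.0))   # 8.0: linear speedup on 8 nodes
print(scaleup(60.0, 66.0))      # ~0.91: slight degradation as the system grows
```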

Slide 6: Architecture Issue: Shared What?
- Shared memory (e.g., SUN, many others): easy to program, expensive to build, difficult to scale beyond ~30 processors
- Shared disk (e.g., clusters)
- Shared nothing (e.g., NCR Teradata, IBM SP): hard to program, cheap to build, easy to scale to 1000s of processors

Slide 7: How would you parallelize this plan?
- Assume Sailors and Reservations are already sorted on sid
- [Plan diagram: a sort-merge join of Sailors and Reservations on sid, with the selection rank < 5 applied on the fly]

Slide 8: Different Types of DBMS ||-ism
- Pipeline parallelism: inter-operator
  - Each operator may run concurrently on a different node (may exploit pipelining or bushy plans)
- Partition parallelism: intra-operator
  - Multiple nodes work to compute a given operation (scan, sort, join)
- We'll focus on intra-operator ||-ism
- Challenge: how to partition the data!

Slide 9: Three Types of Data Partitioning
- Range partitioning: good for equijoins, range queries, group-by
- Hash partitioning: good for equijoins, group-by
- Round robin partitioning: good for spreading load
- [Figure: tuples with keys A...Z assigned to five nodes under each of the three schemes]
- Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits more from "good" partitioning
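
A minimal sketch of the three schemes as routing functions (the five-node setup and splitter vector are illustrative assumptions):

```python
import bisect
import itertools

N_NODES = 5
SPLITTERS = ["E", "J", "N", "S"]   # assumed range boundaries: A..E, F..J, K..N, O..S, T..Z

def range_partition(key: str) -> int:
    # Node 0 gets A..E, node 1 gets F..J, and so on
    return bisect.bisect_left(SPLITTERS, key[0].upper())

def hash_partition(key: str) -> int:
    return hash(key) % N_NODES

_rr = itertools.count()
def round_robin_partition(_key: str) -> int:
    # Ignores the key entirely; just spreads tuples evenly
    return next(_rr) % N_NODES

for name in ["Adams", "Garcia", "Osborne", "Zhang"]:
    print(name, range_partition(name), hash_partition(name), round_robin_partition(name))
```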

Slide 10: How to Parallelize the Scan Operator
- Given a partitioning, what is the parallel version of the scan operator?
  - Scan each partition in parallel, then merge the results
- What kind of partitioning makes the scan scalable?
  - Equal-sized partitions

Slide 11: Parallelizing Selection
- What is the parallel version of the selection operator? Consider:
  - Range partition, range selection
  - Hash partition, range selection
  - Hash partition, equality selection
  - Round robin partition
- Which of the above is scalable for speedup?
  - Range partition, range selection
  - Hash partition, range selection
  - Hash partition, equality selection
  - Round robin partition

Slide 12: Parallel Sorting: Two Algorithms
- Distributive sort: range partition the data, then sort each range
  - Is it scalable? Only if the partitions are equal sized
  - How do we choose the partitioning vector?
- Parallel external sort-merge: sort the data at each node, then merge
  - Problem: how to merge in parallel
  - One solution: range partition the data
  - Is it scalable?

Slide 13: Parallel Sorting, ctd.
- How do we choose the partitioning vector? Sampling
- What if sampling is not possible?
  - For example, when the input is another process, such as another operator in a query plan
  - In this case we can use external sort-merge and set up a tree of merging steps
  - See US Patent #5,852,826
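
A minimal sketch of the distributive sort from slide 12, with the partitioning vector chosen by sampling as suggested here (node count and sample size are illustrative assumptions):

```python
import random

def choose_splitters(data, n_nodes, sample_size=1000):
    """Pick n_nodes - 1 range boundaries from a random sample of the keys."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // n_nodes
    return [sample[(i + 1) * step] for i in range(n_nodes - 1)]

def distributive_sort(data, n_nodes=4):
    splitters = choose_splitters(data, n_nodes)
    partitions = [[] for _ in range(n_nodes)]
    for x in data:                               # range-partition phase (a network shuffle in reality)
        partitions[sum(x > s for s in splitters)].append(x)
    for p in partitions:                         # each node sorts its own range (in parallel in reality)
        p.sort()
    return [x for p in partitions for x in p]    # ranges are ordered, so concatenation is sorted

data = [random.randint(0, 10**6) for _ in range(10_000)]
assert distributive_sort(data) == sorted(data)
```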

Slide 14: Parallel Aggregates
- Problem: compute an aggregate function in parallel
- For each aggregate function we need a decomposition (S is the whole relation, s(i) the partition at node i):
  - sum(S) = sum over i of sum(s(i))
  - count(S) = sum over i of count(s(i))
  - avg(S) = (sum over i of sum(s(i))) / (sum over i of count(s(i)))
- Is it scalable?
- Sometimes it's not so simple: median(S)
- Similar decompositions work for groups
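
A minimal sketch of the sum/count/avg decomposition (the per-node data is invented):

```python
partitions = [[3, 1, 4], [1, 5, 9, 2], [6, 5]]   # s(i): data held at each node

# Each node computes a small local summary in parallel...
local = [(sum(p), len(p)) for p in partitions]

# ...and a single node combines the summaries.
total_sum = sum(s for s, _ in local)
total_cnt = sum(c for _, c in local)

print(total_sum, total_cnt, total_sum / total_cnt)   # 36 9 4.0
# Note: median(S) has no such small local summary, which is why it is harder.
```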

Slide 15: Parallel Hash Join
- Recall Phase 1 of hash join (partitioning)
- Do Phase 1 in parallel, for R and then for S:
  - Perform the partitioning at each node
  - Send hash bucket i's output to node i, bucket j's to node j, etc.
- Phase 2 is simpler
  - Perform the join of Ri and Si locally at each node i
- [Figure: Phase 1 of hash join: the original relations (R, then S) are read through B main-memory buffers, and hash function h writes B-1 partitions back to disk]
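
A minimal sketch of both phases on three nodes, with tuples keyed on their first field (all names here are illustrative, not a real system's API):

```python
N_NODES = 3

def route(tuples, key):
    """Phase 1: hash each tuple's join key and 'send' bucket i to node i."""
    buckets = [[] for _ in range(N_NODES)]
    for t in tuples:
        buckets[hash(t[key]) % N_NODES].append(t)
    return buckets

def parallel_hash_join(R, S):
    R_at = route(R, key=0)          # assume the join key is field 0 of R
    S_at = route(S, key=0)          # and field 0 of S
    out = []
    for i in range(N_NODES):        # Phase 2: each node joins its own Ri and Si locally
        index = {}
        for r in R_at[i]:
            index.setdefault(r[0], []).append(r)
        for s in S_at[i]:
            out.extend(r + s for r in index.get(s[0], []))
    return out

R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (3, "y"), (3, "z")]
print(parallel_hash_join(R, S))     # matches on keys 1 and 3
```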

Slide 16: Dataflow Network for || Join: A ⋈ B
- Good use of split and merge operators makes it easier to build parallel versions of sequential join code
- [Figure: dataflow network for A ⋈ B using split and merge operators across Node I and Node J]

Slide 17: Complex Parallel Query Plans
- Complex queries: inter-operator parallelism
  - Pipelining between operators: note that sort and phase 1 of hash join block the pipeline!
  - Bushy trees
- [Figure: bushy plan in which A ⋈ B is computed on sites 1-4, R ⋈ S on sites 5-8, and the final join on sites 1-8]

Slide 18: Distributed Databases
- Data is stored at several sites, each managed by a DBMS that can run independently
- Distributed data independence: users should not have to know where data is located (extends the physical and logical data independence principles)
- Distributed transaction ACIDity: users should be able to write atomic, durable Xacts that access multiple sites

Slide 19: Types of Distributed Databases
- Homogeneous: every site runs the same type of DBMS
- Heterogeneous: different sites run different DBMSs (different RDBMSs, or even non-relational DBMSs)
- [Figure: DBMS1, DBMS2, DBMS3 connected via a gateway]

Slide 20: Network Architectures
- Definitions
  - Client: requests a service
  - Server: provides a service
- Client-server architecture: one of each
  - Fat client: applications run on the client
  - Thin client: applications run on the server
  - Business applications use thin clients (fat clients are too hard to manage)
  - Personal applications use fat clients (does anyone use Google Docs?)
- 3-tier architecture: the middle tier runs middleware
  - Middleware includes business logic and DBMS coordination
  - The DBMS back end may involve multiple collaborating servers

Slide 21: Availability
- Definition: the percentage of time the system can answer queries
  - Sometimes only local queries can be answered
- What detracts from availability?
  - Node crashes
  - Network crashes
- During a (perhaps partial) network crash, the DBMS at each node should continue to operate

Slide 22: Storing Data - Fragmentation
- Horizontal fragmentation: each fragment holds a subset of the rows
  - Fragments are usually disjoint
  - Example
  - Motivation: it is more efficient to place data where the queries are located, if possible
- Vertical fragmentation: each fragment holds a subset of the columns
  - Remember normalization?
  - Not as common in DDBMSs as horizontal fragmentation

Slide 23: Storing Data - Replication
- Disadvantages (redundancy!)
  - Wasted space
  - Possibly inconsistent data values
- Advantages
  - Increased availability
  - Faster query evaluation
- How do we keep track of all replicas of the data?
- [Figure: fragments R1, R2, R3 replicated across Site A and Site B]

Slide 24: Distributed Catalog Management
- The catalog contains the schema, authorization, statistics, and location of each relation/replica/fragment
- Problem: where do we store the catalog?
  - Example: I want to query Sales. Where do I find it? Location info is in the catalog, but where is the catalog?
- Solutions
  - Store the entire catalog at every site: catalog updates are too expensive
  - Store the entire catalog at a single master site: single point of failure, performance bottleneck
  - Store catalog info for all replicas and fragments of a relation at the birth site of each relation: how do I find the birth site of a relation?

Slide 25: Distributed Catalog Management (ctd.)
- Intergalactic standard solution:
  - Name each relation with a global name that includes its birth site, e.g., <birth-site, local-name>
  - Keep the catalog entry for a relation at its birth site
- How many network I/Os are required to update a catalog entry?
- If a data item's location changes, does the birth site change? How many network I/Os are required?

Slide 26: Distributed Queries

SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7

- Horizontally fragmented: tuples with rating < 5 at Shanghai, tuples with rating >= 5 at Tokyo
  - Must compute SUM(age) and COUNT(age) at both sites
  - If the WHERE clause contained just S.rating > 6, only one site would be involved
- Vertically fragmented: sid and rating at Shanghai, sname and age at Tokyo, tid at both
  - How do we evaluate the query?
- Replicated: Sailors copies at both sites
  - Choice of site is based on local costs and network costs
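
A minimal sketch of the horizontally fragmented case: each site returns a partial SUM and COUNT, and the query site combines them (the per-site numbers are invented):

```python
# Each site evaluates the predicate locally and returns (SUM(age), COUNT(age)).
shanghai = (310.0, 8)    # rating-4 sailors (the rating < 5 fragment)
tokyo    = (520.0, 12)   # rating-5 and rating-6 sailors (the rating >= 5 fragment)

total_sum = shanghai[0] + tokyo[0]
total_cnt = shanghai[1] + tokyo[1]
print(total_sum / total_cnt)   # AVG(S.age) over both fragments: 41.5
```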

Slide 27: Distributed Joins

SELECT *
FROM Sailors S, Reserves R
WHERE S.sid = R.sid AND rank = 5

- Setup: Sailors (500 pages) at London, Reserves (1000 pages) at Paris
  - D is the cost to read/write a page; S is the cost to ship a page
- Fetch as needed, page-oriented NL join, Sailors as outer:
  - Cost: 500 D + 500 * 1000 * (D + S) / 10
  - The /10 is because of the rank = 5 condition
  - S is very large, so the cost is essentially 50,000 S
- Ship to one site: ship Reserves to London
  - Cost: 1000 S + 4500 D (sort-merge join; local cost = 3 * (500 + 1000))
  - Essentially 1000 S
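
A small sketch reproducing the two cost formulas (the absolute values of D and S are made up; only the fact that S is much larger than D matters):

```python
D = 1.0       # cost to read/write one page (assumed unit)
S = 50.0      # cost to ship one page (assumed to dwarf D)
SAILORS, RESERVES = 500, 1000

fetch_as_needed = SAILORS * D + SAILORS * RESERVES * (D + S) / 10
ship_to_one_site = RESERVES * S + 3 * (SAILORS + RESERVES) * D

print(f"fetch as needed:  {fetch_as_needed:,.0f}")    # 2,550,500
print(f"ship to one site: {ship_to_one_site:,.0f}")   # 54,500
```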

Slide 28: Semijoin Technique
- At London, apply rank = 5 to Sailors, project the result onto the join column sid, and ship this to Paris. Cost: 5 S
  - 10% of pages survive rank = 5, and roughly 10% survive the projection onto the join column
- At Paris, join the Sailors projection with Reserves
  - The result is called the reduction of Reserves with respect to Sailors
- Ship the reduction of Reserves to London. Cost: 100 S
- At London, join Sailors with the reduction of Reserves
- Total cost: 105 S
- Idea: trade the cost of computing and shipping the projection for the cost of shipping the full Reserves relation
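
A minimal sketch of the semijoin flow (tuple layouts and data are invented; the point is that only the sid projection and the reduction cross the network):

```python
def semijoin(sailors, reserves):
    # At London: sids of rank-5 sailors (this small set is what gets shipped to Paris)
    qualifying_sids = {sid for (sid, rank) in sailors if rank == 5}

    # At Paris: the reduction of Reserves w.r.t. Sailors (shipped back to London)
    reduction = [r for r in reserves if r[0] in qualifying_sids]

    # Back at London: final join against the (small) reduction
    rank5 = {sid: rank for (sid, rank) in sailors if rank == 5}
    return [(sid, rank5[sid], bid) for (sid, bid) in reduction]

sailors = [(1, 5), (2, 3), (3, 5)]          # (sid, rank)
reserves = [(1, 101), (2, 102), (3, 103)]   # (sid, bid)
print(semijoin(sailors, reserves))          # only sids 1 and 3 cross the network
```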

Slide 29: Bloomjoin
- At London
  - Create a 64K-bit vector V (one 8K page)
  - Select a hash function h from sids to [0, 64K-1]
  - For every sailor with rank = 5, set V(h(her sid)) = true
  - Ship V to Paris. Cost: 1 S
- At Paris
  - If a reservation has V(h(its sid)) = true, ship it to London
  - Cost: 100 S, perhaps a bit more if the hash function is not effective, but 64K is a pretty big bit vector
- At London
  - Join Sailors with the reduced Reservations
- Total cost: 101 S
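
A minimal sketch of the bit-vector step with a single hash function, as on the slide (the hash function and data are illustrative):

```python
BITS = 64 * 1024                   # 64K-bit vector = one 8K page

def h(sid: int) -> int:            # a hash function from sids to [0, 64K-1]
    return (sid * 2654435761) % BITS

# At London: build the bit vector from the rank-5 sailors
sailors = [(1, 5), (2, 3), (3, 5)]
V = bytearray(BITS // 8)
for sid, rank in sailors:
    if rank == 5:
        V[h(sid) // 8] |= 1 << (h(sid) % 8)

# At Paris: ship only the reservations whose sid hashes to a set bit
reserves = [(1, 101), (2, 102), (3, 103)]
reduced = [r for r in reserves if V[h(r[0]) // 8] & (1 << (h(r[0]) % 8))]
print(reduced)    # [(1, 101), (3, 103)]; in general a few false positives may slip through
```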

Slide 30: Distributed Query Optimization
- Cost-based approach: consider plans and pick the cheapest, similar to centralized optimization
  - Difference 1: communication costs must be considered
  - Difference 2: local site autonomy must be respected
  - Difference 3: new distributed join methods
- The query site constructs a global plan, with suggested local plans describing the processing at each site
- If a site can improve its suggested local plan, it is free to do so

Slide 31: Updating Distributed Data
- Synchronous replication: all copies of a modified relation (fragment) must be updated before the modifying Xact commits
  - Data distribution is made transparent to users
- Asynchronous replication: copies of a modified relation are only periodically updated; different copies may get out of sync in the meantime
  - Users must be aware of data distribution

Slide 32: Synchronous Replication
- Read-any write-all: writes are slower and reads are faster, relative to voting
  - The most common approach to synchronous replication
- Voting: an Xact must write a majority of the copies to modify an object, and must read enough copies to be sure of seeing at least one most recent copy
  - E.g., with 10 copies: 7 are written for an update, so 4 copies must be read
  - Each copy has a version number
  - Usually not attractive because reads are common
- The choice of technique determines which locks to set

Slide 33: Cost of Synchronous Replication
- Before an update Xact can commit, it must obtain locks on all modified copies
  - It sends lock requests to remote sites and, while waiting for the responses, holds on to its other locks!
  - If sites or links fail, the Xact cannot commit until they are back up
  - Even if there is no failure, committing must follow an expensive commit protocol with many messages
- So the alternative, asynchronous replication, is becoming widely used

Slide 34: Asynchronous Replication
- Allows a modifying Xact to commit before all copies have been changed (and readers nonetheless look at just one copy)
- Users must be aware of which copy they are reading, and that copies may be out of sync for short periods of time
- Two approaches: Primary Site and Peer-to-Peer replication
  - The difference lies in how many copies are "updatable" (i.e., "master copies")

Slide 35: Peer-to-Peer Replication
- More than one copy of an object can be a master in this approach
- Changes to a master copy must be propagated to the other copies somehow
- If two master copies are changed in a conflicting manner, the conflict must be resolved (e.g., Site 1 changes Joe's age to 35; Site 2 changes it to 36)
- Best used when conflicts do not arise:
  - E.g., each master site owns a disjoint fragment

Slide 36: Primary Site Replication
- Exactly one copy of a relation is designated the primary or master copy; replicas at other sites cannot be directly updated
  - The primary copy is published
  - Other sites subscribe to (fragments of) this relation; these are secondary copies
- Main issue: how are changes to the primary copy propagated to the secondary copies?
  - Done in two steps: first, capture the changes made by committed Xacts; then apply those changes

Slide 37: Implementing the Capture Step
- Log-based capture: the log (kept for recovery) is used to generate a Change Data Table (CDT)
  - If this is done when the log tail is written to disk, we must somehow remove changes due to subsequently aborted Xacts
- Procedural capture: a procedure that is automatically invoked does the capture; typically it just takes a snapshot
- Log-based capture is better (cheaper, faster) but relies on proprietary log details

Slide 38: Implementing the Apply Step
- The Apply process at the secondary site periodically obtains (a snapshot of, or the changes to) the CDT from the primary site and updates the copy
  - The period can be timer-based or user/application-defined
- The replica can be a view over the modified relation!
  - If so, replication consists of incrementally updating the materialized view as the relation changes
- Log-based capture plus continuous apply minimizes the delay in propagating changes
- Procedural capture plus application-driven apply is the most flexible way to process changes

Slide 39: Distributed Locking
- How do we manage locks for objects across many sites?
  - Centralized: one site does all the locking
    - Vulnerable to failure of that single site
  - Primary copy: all locking for an object is done at the primary-copy site for that object
    - Reading requires access to the locking site as well as to the site where the object (copy/fragment) is stored
  - Fully distributed: locking for a copy is done at the site where the copy/fragment is stored
    - Writing an object requires locks at all (or many) sites

Slide 40: Distributed Deadlock Detection
- Each site maintains a local waits-for graph
- A global deadlock might exist even if the local graphs contain no cycles
  - [Figure: the waits-for edges between T1 and T2 at Site A and Site B are each acyclic, but the combined global graph has a cycle]
- Three solutions:
  - Centralized: send all local graphs to one site
  - Hierarchical: organize sites into a hierarchy and send local graphs to the parent in the hierarchy
  - Timeout: abort an Xact if it waits too long
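
A minimal sketch of the centralized solution: ship the local waits-for edges to one site, union them, and test for a cycle (the edge representation is invented):

```python
def has_cycle(edges):
    """Detect a cycle in a waits-for graph given as (waiter, holder) edges."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)

    visiting, done = set(), set()
    def dfs(txn):
        if txn in visiting:
            return True            # back edge found: deadlock
        if txn in done:
            return False
        visiting.add(txn)
        if any(dfs(n) for n in graph.get(txn, ())):
            return True
        visiting.discard(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in list(graph))

site_a = [("T1", "T2")]      # at Site A, T1 waits for T2: no local cycle
site_b = [("T2", "T1")]      # at Site B, T2 waits for T1: no local cycle
print(has_cycle(site_a), has_cycle(site_b))   # False False
print(has_cycle(site_a + site_b))             # True: the global graph has a cycle
```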

Slide 41: Distributed Transactions
- Problem: atomicity and durability in DDBMSs
- Example: A += 100 at Paris, B -= 100 at London
  - Each piece is called a subtransaction
  - The query originates at a third site, called the coordinator
- Who keeps the log? Each site keeps its own
- How does each subtransaction know when to commit?

Slide 42: 2-Phase Commit (2PC) Protocol
- The coordinator sends a prepare message to each subtransaction
- Each subtransaction responds with yes or no
  - Yes means it has done everything but write the end record
- The coordinator sends commit or abort to all subtransactions
- The subtransactions send ack messages to the coordinator
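
A minimal sketch of the coordinator's side of the protocol (the participant interface with prepare/commit/abort methods is an assumed abstraction, not a real API):

```python
def two_phase_commit(coordinator_log: list, participants: list) -> str:
    # Phase 1: send prepare to every subtransaction and collect the votes
    votes = [p.prepare() for p in participants]       # each participant answers "yes" or "no"

    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    coordinator_log.append(decision)                  # record the decision before telling anyone

    # Phase 2: broadcast the decision; each subtransaction acts on it and acks
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    coordinator_log.append("end")                     # written once all acks have arrived
    return decision
```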

Slide 43: 2PC Issues
- What if
  - only one subtransaction says no?
  - all subtransactions say yes, and then one cannot commit?
  - all subtransactions say yes, and then one crashes?

Slide 44: Learning Objectives
- Definitions: parallel and distributed DBMSs, scalability, the 3 parallel architectures, pipeline and partition parallelism, the 3 types of data partitioning and their uses
- How to parallelize operators and determine their scalability: scan, select, sort, aggregate, hash join
- Definitions of distributed data independence and ACIDity, network architectures, availability, fragmentation, replication
- How is replication managed?
- How are distributed queries processed?
- How is distributed data updated?
- How are distributed locks and deadlocks managed?
- How are distributed transactions managed?

