
1

2  Parallel databases  Architecture  Query evaluation  Query optimization  Distributed databases  Architectures  Data storage  Catalog management  Query processing  Transactions

3  A parallel database system is designed to improve performance through parallelism  Loading data, building indexes, evaluating queries  Data may be stored in a distributed way, but solely for performance reasons  A distributed database system is physically stored across several sites  Each site is managed by an independent DBMS  Distribution is affected by local ownership, and availability as well as performance

4

5 HHow long does it take to scan a 1 terabyte table at 10MB/s? 11,099,511,627,776 bytes = 1,024 4 or 2 40 bytes 110MB = 10,485,760 bytes 11,099,511,627,776 / 10,485,760 = 104,858 1104,858 / (60 * 20 * 24) = 1.2 days! UUsing 1,000 processors in parallel the time can be reduced to 1.5 minutes

6  A coarse-grain parallel machine consists of a small number of processors  Most current high-end computers  A fine-grain parallel machine uses thousands of smaller processors  Also referred to as a massively parallel machine

7  Both throughput and response time can be improved by parallelism  Throughput – the number of tasks completed in a given time  Processing many small tasks in parallel increases throughput  Response time – the time it takes to complete a single task  Subtasks of large transactions can be performed in parallel, reducing response time

8  Speed-up  More resources means less time for a given amount of data  Scale-up  If resources increase in proportion to increase in data size, time is constant  (Graphs: throughput and response time plotted against degree of parallelism, each compared with the ideal linear curve)

9  Where possible a parallel database should carry out evaluation steps in parallel  There are many opportunities for parallelism in a relational database  There are three main parallel DBMS architectures  Shared nothing  Shared memory  Shared disk

10  Multiple CPUs attached to an interconnection network  Accessing a common region of main memory  Similar to a conventional system  Good for moderate parallelism  Communication overhead is low  OS services control the CPUs  Interference increases with size  As CPUs are added memory contention becomes a bottleneck  Adding more CPUs eventually slows the system down  (Diagram: CPUs reach global shared memory and the disks through an interconnection network)

11  Each CPU has private memory and direct access to data  Through the interconnection network  Good for moderate parallelism  Suffers from interference in the interconnection network  Which acts as a bottleneck  Not a good solution for a large scale parallel system  (Diagram: each CPU has its own memory; all CPUs reach the shared disks through the interconnection network)

12  Each CPU has local memory and disk space  No two CPUs access the same storage area  All CPU communication is through the network  Increases complexity  Linear speed-up ▪ Operation time decreases proportional to increase in CPUs  Linear scale-up ▪ Performance maintained if CPU increase is proportional to data  (Diagram: each CPU has its own memory and disk; CPUs communicate only through the interconnection network)

13  A relational query execution plan is a tree, or graph, of relational algebra operators  Operators in a query tree can be executed in parallel  If one operator consumes the output of another, there is pipelined parallelism  Otherwise the operators can be evaluated independently  An operator blocks if it does not produce any output until it has consumed all its inputs  Pipelined parallelism is limited by blocking operators

14  Individual operators can be evaluated in a parallel way by partitioning input data  In data-partitioned parallel evaluation the input data is partitioned, and worked on in parallel  The results are then combined  Tables are horizontally partitioned  Different rows are assigned to different processors

15  Partition using a round-robin algorithm  Partition using hashing  Partition using ranges of field values

16  Partition using a round-robin algorithm  Assign record i to processor i mod n ▪ Similar to RAID systems  Suitable for evaluating queries that access the entire table  Less efficient for queries that access ranges of values and queries on equality

17  Partition using hashing  A hash function based on selected attributes is applied to each record to determine its processor  The data remains evenly distributed as the table grows, or shrinks, over time  Good for equality selections  Only the one relevant disk is accessed, leaving the others free  Also useful for sequential scans where the partitioning attributes are a candidate key

18  Partition using ranges of field values  Ranges are chosen from the sort key values, and each range should contain the same number of records ▪ Each disk contains one range  If a range is too large it can lead to data skew  Skew can lead to the processors with large partitions becoming bottlenecks  Good for equality selections, and range selections
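A minimal Python sketch of the three partitioning schemes from the preceding slides; the attribute names and partition boundaries are illustrative only:

import hashlib
import bisect

def round_robin_partition(i, n):
    # assign record i to processor i mod n
    return i % n

def hash_partition(record, attrs, n):
    # hash the chosen attributes to pick a processor; stays balanced as the table grows or shrinks
    key = "|".join(str(record[a]) for a in attrs)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

def range_partition(record, attr, boundaries):
    # boundaries are sorted split points on the sort key; each range maps to one processor
    return bisect.bisect_left(boundaries, record[attr])

record = {"empID": 222, "lName": "Whimsey"}
print(round_robin_partition(2, 4), hash_partition(record, ["empID"], 4), range_partition(record, "lName", ["G", "N", "T"]))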

19  Both hash and range partitioning may result in data skew  Where some partitions are larger or smaller  Skew can dramatically reduce the speed-up obtained from parallelism  In range partitioning skew can be reduced by using histograms  The histograms record how many records fall into each range of attribute values, and are used to derive even partitions

20  Parallel data streams are used to provide data for relational operators  The streams can come from different disks, or  Output of other operators  Streams are merged or split  Merged to provide the inputs for a relational operator  Split as needed to parallelize processing  These operations can buffer data, and should be able to halt operators that provide their input data  A parallel evaluation consists of a network of relational, merge and split operators

21  Inter-query parallelism  Different queries or transactions execute in parallel  Throughput is increased but response time is not  Easy to support in a shared-memory system  Intra-query parallelism  Executing a single query in parallel to speed up large queries  Which in turn can entail either intra-operation or inter-operation parallelism, or both

22  Scanning and loading  Pages can be read in parallel while scanning a relation  The results can be merged  If hash or range partitioning is used selections can be directed to the relevant processors  Sorting  Joins

23  The simplest sort method is for each processor to sort its portion of the table  Then merge the sorted records  The merging phase may limit the amount of parallelism  A better method is to first redistribute the records over the processors using range partitioning  Using the sort attributes  Each processor sorts its set of records  The sets of sorted records are then retrieved in order  To make the partitions even, the data in the processors can be sampled
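A rough Python sketch of the range-partition-then-sort approach described above; in a real system each partition would live on its own processor, but here the three steps run sequentially and the boundaries are illustrative:

import bisect

def parallel_sort(records, key, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:                              # redistribute records by range on the sort key
        partitions[bisect.bisect_left(boundaries, r[key])].append(r)
    for p in partitions:                           # each processor sorts its own set of records
        p.sort(key=lambda rec: rec[key])
    return [rec for p in partitions for rec in p]  # retrieve the sorted sets in boundary order

names = [{"lName": n} for n in ["Spade", "Whimsey", "Holmes", "Blake"]]
print(parallel_sort(names, "lName", ["G", "N"]))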

24  Join algorithms can be parallelized  Parallelization is most effective for hash or sort- merge joins ▪ Parallel hash join is widely used  The process for parallel hash join is  First partition the two tables across the processors using the same hash function  Join the records locally, using any join algorithm  Merge the results of the local joins, the union of these results is the join of the two tables
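A small Python sketch of the parallel hash join process above: both tables are partitioned with the same hash function, each partition is joined locally (here with a simple in-memory hash join), and the local results are unioned. The relation and attribute names are made up for the example:

from collections import defaultdict

def parallel_hash_join(r_records, s_records, r_key, s_key, n):
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for r in r_records:                          # partition R and S with the same hash function
        r_parts[hash(r[r_key]) % n].append(r)
    for s in s_records:
        s_parts[hash(s[s_key]) % n].append(s)
    result = []
    for p in range(n):                           # each local join would run on its own processor
        build = defaultdict(list)
        for r in r_parts[p]:
            build[r[r_key]].append(r)
        for s in s_parts[p]:
            for r in build[s[s_key]]:
                result.append({**r, **s})        # the union of the local joins is the full join
    return result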

25  If tables are very large, parallel hash join may have a high cost at each processor  If each partition is large, multiple passes will be required for the local joins  An alternative approach is to use all processors for each partition  Partition the tables using h1 ▪ Each partition of the smaller relation should fit into the combined memory of the processors  Process each partition using all processors ▪ Use h2 to determine which processor to send records to

26  Partitioning is not suitable for joins on inequalities  Such as R ⋈ R.a < S.b S  Since all records in R could join with a record in S  Fragment and replicate joins can be used  In asymmetric fragment and replicate join ▪ One of the relations is partitioned ▪ The other relation is replicated across all partitions

27  Each relation can be both fragmented and replicated  Into m fragments of R and n of S  However m * n processors are required  This works with any join condition  When partitioning is not possible  (Diagram: an m × n grid of processors P 0,0 … P m-1,n-1; fragment R i is sent to every processor in row i and fragment S j to every processor in column j)

28  Selection – the table may already be partitioned on the selection attribute  If not, it can be scanned in parallel  Duplicate elimination – use parallel sorting  Projection – can be performed by scanning  Aggregation – partition by the grouping attribute  If records do have to be transferred between processors it may be possible to just send partial results  The final result can then be calculated from the partial results ▪ e.g. sum
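A brief Python sketch of the partial-result idea for aggregation: each processor computes partial sums over its own partition, and only those small partial results are transferred and combined. The names used here are illustrative:

def parallel_sum(partitions, group_attr, value_attr):
    partials = []
    for part in partitions:                       # one partial aggregate per processor
        local = {}
        for r in part:
            local[r[group_attr]] = local.get(r[group_attr], 0) + r[value_attr]
        partials.append(local)
    final = {}
    for local in partials:                        # only the small partial results cross the network
        for group, subtotal in local.items():
            final[group] = final.get(group, 0) + subtotal
    return final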

29  Using parallel processors reduces the time to perform an operation  Possibly to as little as 1/n * original cost ▪ Where n is the number of processors  However there are also additional costs  Start-up costs for initiating the operation  Skew which may reduce the speed-up  Contention for resources resulting in delays  Cost of assembling the final result

30  As well as parallelizing individual operators, different operators can be processed in parallel  Different processors perform different operations  Result of one operator can be pipelined into another  Note that sorting and the hash-join partitioning block pipelines  Multiple independent operations can be executed concurrently  Using bushy, rather than left-deep, join trees

31  The best serial plan may not be the best parallel plan  Also note that parallelization introduces further complexity into query optimization  Consider a table partitioned into two nodes, with a local secondary index  Node 1 contains names between A and M  Node 2 contains names between N and Z  Consider the selection: name < “Noober“  Node 1 should scan its partition, but  Node 2 should use the name index

32  In a large-scale parallel system the chances of failure increase  Such systems should be designed to operate even if a processor or disk fails  Data can be replicated across multiple processors  Failed processors or disks are tracked  And requests are re-routed to the backup

33  Architecture  Shared-memory is easy, but costly and does not scale well  Shared-nothing is cheap and scales well, but is harder to implement  Both intra-operation, and inter-operation parallelism are possible  Most relational algebra operations can be performed in parallel  How the data is partitioned across processors is very important

34

35  A distributed database is motivated by a number of factors  Increased availability ▪ If a site containing a table goes down, the table may still be available if a copy is maintained at another site  Distributed access to data ▪ An organization may have branches in several cities ▪ Access patterns are typically affected by locality  Analysis of distributed data  Distributed systems must support integrated access

36  Data is stored at several sites  Each site is managed by an independent DBMS  The system should make the fact that data is distributed transparent to the user  Distributed Data Independence  Users should not need to know where the data is located  Queries that access several sites should be optimized  Distributed Transaction Atomicity  Users should be able to write transactions that access several sites, in the same way as local transactions

37  Users may have to be aware of where data is located  Distributed data independence and distributed transaction atomicity may not be supported  These properties may be hard to support efficiently ▪ Sites may be connected by a slow long-distance network  Consider a global system  Administrative overheads for viewing data as a single unified collection may be prohibitively expensive

38  Distributed and shared-nothing parallel systems appear similar  In practice these are often very different since distributed DBs are typically  Geographically separated  Separately administered  Have slower interconnections  May have both local and global transactions

39  Homogeneous  Data is distributed but every site runs the same DBMS software  Heterogeneous, or multidatabase  Different sites run different DBMSs, and the sites are connected to enable access to data  Require standards for gateway protocols  A gateway protocol is an API that allows external applications access to the database ▪ e.g. ODBC and JDBC  Gateways add a layer of processing, and may not be able to entirely mask differences between servers

40  Client-Server  Collaborating Server  Middleware

41  One or more client processes and one or more server processes  A client process sends a query to any one server process  Clients are responsible for UI  Servers manage data and execute transactions  A popular architecture  Relatively simple to implement  Servers do not have to deal with user-interactions  Users can run a GUI on clients  Communication between client and server should be as set-oriented as possible  e.g. stored procedures vs. cursors

42  Client-server systems do not allow a single query to access multiple servers as this would require  Breaking the query into sub-queries to be executed at different sites and merging the answers to the sub-queries  To do this the client would have to be overly complex  In a collaborating server system the distinction between clients and servers is eliminated  A collection of DB servers, each able to run transactions against local data  When a query is received that requires data from other servers the server generates appropriate sub-queries

43  Designed to allow a single query to access multiple servers, but  Without requiring all servers to be capable of managing multi-site query execution  Often used to integrate legacy systems  Requires one database server (the middleware) capable of managing multi-server queries  Other servers only handle local queries and transactions  The special server coordinates queries and transactions  The middleware server typically doesn’t maintain any data

44

45  In a distributed system tables are stored across several sites  Accessing a table stored elsewhere incurs message- passing costs  A single table may be replicated or fragmented across several sites  Fragments are stored at the sites where they are most often accessed  Several replicas of a table may be stored at different sites  Fragmentation and replication can be combined

46  Fragmentation consists of breaking a table into smaller tables, or fragments  The fragments are stored instead of the original table  Possibly at different sites  Fragmentation can either be vertical or horizontal  (Example table: a horizontal fragment takes a subset of the rows, a vertical fragment a subset of the columns)
TID  empID  fName     lName    age  city
1    111    Sam       Spade    43   Chicago
2    222    Peter     Whimsey  51   Surrey
3    333    Sherlock  Holmes   35   Surrey
4    444    Anita     Blake    29   Boston

47  Records that belong to a horizontal fragment are usually identified by a selection query  e.g. all the records that relate to a particular city, achieving locality, reducing communication costs  A horizontally fragmented table can be recreated by computing the union of the fragments ▪ Fragments are usually required to be disjoint  Records belonging to a vertical fragment are identified by a projection query  The collection of vertical fragments must be a lossless-join decomposition  A unique tuple ID is often assigned to records
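A short Python sketch of how the original table is recreated from its fragments, assuming the tuple-ID scheme mentioned above; the fragment contents are taken from the example table:

def recreate_horizontal(fragments):
    # a horizontally fragmented table is the union of its (disjoint) fragments
    return [r for frag in fragments for r in frag]

def recreate_vertical(frag_a, frag_b):
    # vertical fragments share a tuple ID, making the decomposition lossless-join
    by_tid = {r["TID"]: r for r in frag_b}
    return [{**r, **by_tid[r["TID"]]} for r in frag_a]

names = [{"TID": 1, "fName": "Sam", "lName": "Spade"}]
rest  = [{"TID": 1, "age": 43, "city": "Chicago"}]
print(recreate_vertical(names, rest))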

48  Replication entails storing several copies of a table or of table fragments for  Increased availability of data, which protects against ▪ Failure of individual sites, and ▪ Failure of communication links  Faster query evaluation ▪ Queries can execute faster by using a local copy of a table  There are two kinds of replication, synchronous, and asynchronous  These differ in how replicas are kept current when the table is modified

49  Distributing data across sites adds complexity  It is important to track where replicated or fragmented tables are stored  Each replica or fragment must be uniquely named  Naming should be performed locally  A global relation name consists of {birth site, local name } ▪ The birth site is the site where the table was created  A site catalog records fragments and replicas at a site, and tracks replicas of tables created at the site  To locate a table, look up its birth site catalog  The birth site never changes, even if the table is moved

50

51  Estimating the cost of an evaluation plan must include communication costs  Evaluate the number of page reads or writes, and  The number of pages that must be sent from one site to another  Pages may need to be shipped between a number of sites  Sites where the data is located, and where the result is computed, and  The site that initiated the query

52  Simple, one table, queries are affected by fragmentation and replication  If a table is horizontally fragmented a query has to be evaluated at multiple sites  And the union of the result computed  Selections that only require data at one site can be executed just at that site  If a table is vertically fragmented the fragments have to be joined on the common attribute  If a table is replicated, the shipping costs have to be considered to determine which site to use

53  Joins of tables at different sites can be very expensive  There are a number of strategies for computing joins  Fetch as needed  Ship to one site  Semijoins and Bloomjoins

54  Designate one table as the outer relation, and compute the join at that site  Fetch records of the inner relation as needed; the cost depends on  The size of the relations  Whether the inner relation is cached at the outer relation's site ▪ If not, communication costs are incurred once for each time the inner relation is read  The size of the result relation  If the size of the result (R ⋈ S) is greater than R + S it is cheaper to ship both relations to the query site

55  In this strategy, relations are shipped to a site and the join carried out at that site  The site can be one of the sites involved in the join  The result has to be shipped from where it was computed to the site where the query was posed  Alternatively both input relations can be shipped to the site where the query was originally posed  The join is then computed at that site

56  Consider a join between two relations, R and S, at different sites, London and Vancouver  Assume that S (the inner relation) is to be shipped to London where the join will be computed  Note that some S records may not join to R records  Shipping costs can be reduced by only shipping those S records that will actually join to R records  There are two techniques that can reduce the number of S records to be shipped  Semi-joins, and  Bloom-joins

57  At the first site (London) compute the projection of R on the join column, a  Ship this projection, πa(R), to site 2 (Vancouver)  At Vancouver compute the join of πa(R) and S  The result of this join is the reduction of S with respect to R  Ship the reduction of S to London  At London compute the join of the reduction of S, and R  The effectiveness of this technique depends on how much smaller the reduction of S is compared to S
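A minimal Python sketch of the semi-join reduction; the site split is simulated with comments and the join attribute name is an assumption:

def semijoin_reduction(r_records, s_records, join_attr):
    projection = {r[join_attr] for r in r_records}                     # computed at London, shipped to Vancouver
    reduction = [s for s in s_records if s[join_attr] in projection]   # computed at Vancouver
    return reduction                                                   # shipped back to London for the final join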

58  Bloom-joins are similar to semi-joins, except that a bit vector is sent to the second site  The vector is of size k and each record in R is hashed to it ▪ A bit is set to 1 if a record hashes to it ▪ The hash function is on the join attribute  The reduction of S is then computed at the second site  By hashing records of S to the bit vector  Only those records that hash to a bit with the value of 1 are included in the reduction  The cost to send the bit vector is less than the cost to send the projection of R on the join attribute  But some unwanted records of S may be in the reduction
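A corresponding Python sketch of the Bloom-join reduction; the vector size k is illustrative and Python's built-in hash stands in for a real hash function on the join attribute:

def bloom_reduction(r_records, s_records, join_attr, k=1024):
    bits = [0] * k
    for r in r_records:                                  # build the k-bit vector at the first site
        bits[hash(r[join_attr]) % k] = 1
    # reduce S at the second site: false positives are possible, lost matches are not
    return [s for s in s_records if bits[hash(s[join_attr]) % k]]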

59  The basic cost based approach is to consider a set of plans and pick the cheapest  Communication costs must be considered  Local autonomy must be respected  Some operations can be carried out in parallel  The query site generates a global plan with suggested local plans  Local sites are allowed to change their suggested plans if they can improve them

60

61  If data is distributed it should be transparent to users  Users should be able to ask queries without having to worry where tables are stored  Transactions should be atomic actions, regardless of data fragmentation or replication  If so, all copies of a replicated relation must be modified before the transaction commits  Referred to as synchronous replication  Another approach, asynchronous replication, allows copies of a relation to differ  More efficient, but compromises data independence

62  There are two techniques for ensuring that a transaction sees the same values  Regardless of which copy of an object it accesses  In voting, a transaction must write a majority of copies to modify an object, and  Must read enough copies to ensure that it sees at least one most recent copy  e.g. 10 copies of an object, at least 6 copies must be written, and at least 5 read  Note that the copies include a version number so that it is possible to tell which copy is the latest
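A tiny Python check of the voting rule: writes must reach a majority of copies, and the read and write quorums must overlap so a reader always sees at least one most recent copy. The function name and arguments are made up for the illustration:

def quorum_ok(n_copies, read_quorum, write_quorum):
    majority_write = write_quorum > n_copies / 2                # a write reaches a majority of copies
    overlapping_read = read_quorum + write_quorum > n_copies    # every read overlaps the latest write
    return majority_write and overlapping_read

print(quorum_ok(10, read_quorum=5, write_quorum=6))   # True: the slide's example
print(quorum_ok(10, read_quorum=4, write_quorum=6))   # False: a stale read would be possible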

63  Voting is not a generally efficient technique  Reading an object requires that multiple copies of the object must be read  Typically, objects are read more than they are written  The read-any write-all policy allows any single copy to be read, but  All copies must be written when an object is written  Writes are slower, relative to voting, but  Reads are fast, particularly if a local copy is available  Read-any write-all is usually used for synchronous replication

64  Synchronous replication is expensive  Before an update transaction is committed it must obtain X locks on all copies of the data  This may entail sending lock requests to remote sites and waiting for the locks to be confirmed  While holding its other locks  If sites, or the communication links, fail, the transaction cannot commit until they are back up  Committing the transaction requires sending multiple messages as part of a commit protocol  An alternative is to use asynchronous replication

65  A transaction is allowed to commit before all the copies have been changed  Readers still only look at a single copy  Users must be aware of which copy they are reading, and that copies may be out of sync  There are two approaches to asynchronous replication  Peer-to-peer, and  Primary site

66  More than one copy can be designated as updatable  Changes to the master(s) must be propagated to other copies  If two masters are changed a conflict resolution strategy must be used  Peer-to-peer replication is best used when conflicts do not arise  Where each master site owns a disjoint fragment ▪ Usually a horizontal fragment  Update rights are only held by one master at a time ▪ A backup site may gain update rights if the main site fails

67  One copy of a table is designated as the primary or master copy  Users register or publish the primary copies  Other sites subscribe to the table (or fragments of it), by creating secondary copies  Secondary copies cannot be directly updated  Changes to the primary copy must be propagated to the secondary copies  First, capture change made by committed transactions  Apply the changes to secondary copies

68  Log-based capture creates an update record from the recovery log when it is written to stable storage  Log changes that affect replicated tables are written to a change data table (CDT)  Note that aborted transactions must, at some point, be removed from the CDT  Another approach is to use procedural capture  A trigger invokes a procedure which takes a snapshot of the primary copy  Log-based capture is cheaper and has less delay, but relies on proprietary log details

69  The apply step takes the capture step changes and propagates them to secondary copies  This can be continuously pushed from the master whenever a CDT is generated, or  Periodically requested (or pulled) by the copies ▪ A timer or application controls the frequency of the requests  Log-based capture with continuous apply minimizes delay  A cheaper substitute for synchronous replication  Procedural capture and application driven apply gives the most flexibility

70  Complex decision support queries that require data from multiple sites are popular  To improve query efficiency, all the data can be copied to one site, which is then queried  These data collections are called data warehouses  Warehouses use asynchronous replication  The source data is typically controlled by different DBMSs  Source data often has to be cleaned when creating the replicas  Procedural capture and application-driven apply are best suited to this environment

71  Transactions may be submitted at one site but can access data at other sites  The transaction manager breaks the transaction into sub-transactions that execute at different sites  The sub-transactions are submitted to the other sites  The transaction manager at the initial site must coordinate the activity of the sub-transactions  Distributed concurrency control  Locks and deadlocks must be managed across sites  Distributed recovery  Transaction atomicity must be ensured across sites

72  In centralized locking, a single site is in charge of handling lock and unlock requests  This is vulnerable to single site failure and bottlenecks  In primary copy locking, all locking is done at the primary copy site for an object  Reading a copy of an object usually requires communication with two sites  In fully distributed locking, lock requests are handled by the lock manager at the local site  X locks must be set at all sites when copies are modified  S locks are only set at the local site  There are other protocols for locking replicated data

73  If deadlock detection is being used (rather than prevention) the scheme must be modified  Centralized - send all local waits-for graphs to a central site  Hierarchical - organize sites into a hierarchy and send local graphs to parent  Timeout - abort the transaction if it waits too long  Communication delays can cause phantom deadlocks  (Diagram: local waits-for graphs at site A and site B each show a single edge between T1 and T2; only the combined global graph reveals the cycle)
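A small Python sketch of centralized detection: the local waits-for graphs are unioned at one site and the global graph is checked for a cycle. The edge lists below reproduce the two-site example from the diagram:

def global_deadlock(local_graphs):
    edges = {}
    for graph in local_graphs:                 # union the local waits-for graphs
        for waiter, holder in graph:
            edges.setdefault(waiter, set()).add(holder)

    def reaches(start, node, seen):            # is there a path from node back to start?
        for nxt in edges.get(node, ()):
            if nxt == start or (nxt not in seen and reaches(start, nxt, seen | {nxt})):
                return True
        return False

    return any(reaches(t, t, {t}) for t in edges)

print(global_deadlock([[("T1", "T2")], [("T2", "T1")]]))   # True: the cycle is only visible globally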

74  Recovery in a distributed system is more complex  New kinds of failure can occur  Communication failures, and  Failures at remote sites where sub-transactions are executing  To ensure atomicity, either all or no sub- transactions must commit  This property must be guaranteed regardless of site or communication failure  This is achieved using a commit protocol

75  During normal execution each site maintains a log  Transactions are logged where they execute  The transaction manager at the originating site is called the coordinator  Transaction managers at sub-transaction sites are referred to as subordinates  The most widely used commit protocol is two-phase commit  The 2PC protocol for normal execution starts when the user commits a transaction

76  Coordinator sends prepare messages  Subordinates decide whether to abort or commit  Force-write an abort or prepare log record  Send no or yes messages to coordinator  If the coordinator receives unanimous yes, it force- writes commit record and sends commit messages  Otherwise, force-writes abort and sends abort messages  Subordinates force-write abort or commit log records and send acknowledge messages to the coordinator  When all acknowledge messages have been received the coordinator writes an end log record
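A schematic Python sketch of the coordinator's side of these steps; the prepare, decide, wait_for_ack and force_write calls are hypothetical stand-ins for messages and log writes, not a real API:

def two_phase_commit(subordinates, log):
    votes = [sub.prepare() for sub in subordinates]        # phase 1: send prepare, collect yes/no votes
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.force_write(decision)                              # force-write the decision before sending it
    for sub in subordinates:                               # phase 2: send commit or abort messages
        sub.decide(decision)
    for sub in subordinates:
        sub.wait_for_ack()                                 # acks let the coordinator forget the transaction
    log.write("end")                                       # end record once all acknowledgements arrive
    return decision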

77  2PC requires two rounds of messages  Voting phase  Termination phase  Any site’s transaction manager can unilaterally abort a transaction  Log records describing decisions are always forced to stable storage before the message is sent  Log records include the record type, transaction ID, and coordinator ID  The coordinator’s commit or abort log record includes the IDs of all subordinates

78  If there is a commit or abort log record for transaction T, but no end record, T must be redone (if committed) or undone (if aborted)  If the site is the coordinator, keep sending commit or abort messages until all acknowledge messages are received  If there is a prepare log record for T, but no commit or abort record, the site is a subordinate  The coordinator is repeatedly contacted to determine T’s status, until a commit or abort message is received  If there is no prepare log record for T, the transaction is unilaterally aborted  And an abort message is sent if the site is contacted by a subordinate

79  If a coordinator fails, the subordinates are unable to determine whether to commit or abort  The transaction is blocked until the coordinator recovers  What happens if a remote site does not respond during the commit protocol?  If the site is the coordinator the transaction should be aborted  If the site is a subordinate that has not voted yes, it should abort the transaction  If the site is a subordinate that has voted yes, it is blocked until the coordinator responds

80  The acknowledge messages are used to tell the coordinator that it can forget a transaction  Until all acknowledge messages are received it must keep T in the transaction table  The coordinator may fail after prepare messages, but before commit or abort  It therefore has no information about the transaction’s status before the crash ▪ So it subsequently aborts the transaction  If another site enquires about T, the recovery process responds with an abort message  If a sub-transaction doesn’t perform updates its commit or abort status is irrelevant

81  When a coordinator aborts T, it can undo T and remove it from the transaction table  If there is no information about T, it is presumed to be aborted  Similarly, subordinates do not need to send ack messages on abort  As the coordinator does not have to wait for acks to abort a transaction  Abort log records do not have to be force-written  As the default decision is to abort a transaction

82  If a sub-transaction does not perform updates it responds to prepare with a reader message  And writes no log records  If the coordinator receives a reader message it is treated as a yes  But no further messages are sent to that subordinate  If all sub-transactions are readers the second phase of the protocol is not required  The transaction can be removed from the transaction table

83

84  In cloud computing a vendor supplies computing resources as a service  A large number of computers are connected through a communication network  Such as the internet …  The client runs applications and stores data using these resources  And can access the resources with little effort

85  Web applications have to be highly scalable  Applications may have hundreds of millions of users  Requiring data to be partitioned across thousands of processors  There are a number of systems for data storage on the cloud  Such as Bigtable (from Google)  They do not necessarily guarantee the ACID properties ▪ They drop ACID …

86  Many web data storage systems are not built around an SQL data model  Such as NoSql DBs or BigTable  Some support semi-structured data  Many web applications manage without extensive query language support  Data storage systems often allow multiple versions of data items to be stored  Versions can be identified by timestamp

87  Data is often partitioned using hash or range partitioning  Such partitions are referred to as tablets  This is performed dynamically as required  It is necessary to know which site contains a particular tablet  A tablet controller site tracks the partitioning function ▪ And can map a request to the appropriate site  The mapping information can be replicated to a set of router sites ▪ So that the controller does not act as a bottleneck

88  A cloud DB introduces a number of challenges to making a DB ACID compliant  Locking  Ensuring transactions are atomic  Frequent communication between sites  In addition there are a number of issues that relate to both DBs and data storage  Replication is controlled by the cloud vendor  Security and legal issues

