Distributed Databases, John Ortiz (Lecture 24)

1 Distributed Databases John Ortiz

2 Lecture 24: Distributed Databases
- A Distributed Database (DDB) is a collection of interrelated databases interconnected by a computer network
- A Distributed Database Management System (DDBMS) is software that manages a distributed database
- World Wide Web technology does not yet constitute a DDB by our definition

3 Advantages of a DDB
- Supports various levels of transparency
  - Distribution (network) transparency: degree to which the user is unaware of the networked nature of the DB
  - Replication transparency: degree to which the user is unaware of copies of the DB
  - Fragmentation transparency: degree to which the user is unaware that the DB is broken into pieces

4 Advantages of a DDB
- Increased Reliability and Availability
  - Reliability: probability that a system is running at a particular point in time
  - Availability: probability that a system is continuously available during a time interval

5 Advantages of a DDB
- Improved Performance
  - Supports data localization: data is kept near where it is most often used, to reduce the effects of network delay
- Easier Expansion
  - Adding more data, increasing DB size, and adding resources are all easier
- Reduced Operation Costs (when compared with a mainframe system)
  - It is cheaper to add workstations than to buy a new mainframe computer

6 Advantages of a DDB
- No Single Point of Failure
  - When one computer fails, others can take its place

7 Disadvantages of a DDB
- Significant increase in complexity
  - Normalization, query optimization, security, transaction processing, concurrency control, crash recovery, etc. ALL become much more difficult to handle
- Increased storage requirements
  - Since multiple copies of various portions of the DB exist, more storage space is required

8 Data Fragmentation
- Fragmentation is the division of the database into pieces stored at different sites
- Horizontal fragmentation: a subset of the tuples in a particular relation
  - The result of a query that SELECTs some tuples but not others is a horizontal “fragment”
  - In a DDB, the output of such a query may be stored as a separate DB at a separate site
  - Requires a UNION to recombine the information
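As a minimal sketch of horizontal fragmentation, the Python below splits a small relation with a selection predicate and rebuilds it with a union. The relation contents, the attribute names (`ssn`, `name`, `dept`), and the helper functions are illustrative assumptions, not part of the lecture.

```python
# Hypothetical EMPLOYEE relation; attribute names are assumptions for illustration.
EMPLOYEE = [
    {"ssn": 1, "name": "Ann", "dept": 5},
    {"ssn": 2, "name": "Bob", "dept": 4},
    {"ssn": 3, "name": "Cy", "dept": 5},
]


def horizontal_fragment(relation, predicate):
    """SELECT the tuples that satisfy the predicate (all attributes are kept)."""
    return [row for row in relation if predicate(row)]


def union(*fragments):
    """Recombine fragments with a set-style UNION, dropping duplicate tuples."""
    seen, result = set(), []
    for fragment in fragments:
        for row in fragment:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                result.append(row)
    return result


# Each fragment could be stored at a different site.
site1 = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] == 5)
site2 = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] != 5)
rebuilt = union(site1, site2)
```

Because the two predicates together cover every tuple, the union of the fragments contains the same tuples as the original relation.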

9 Data Fragmentation
- Vertical fragmentation: a subset of the attributes of a particular relation
  - The result of a query that PROJECTs certain, specific attributes
  - Requires an outer join (or an outer union) to recombine the information
- Hybrid fragmentation: can you guess?
  - Includes both horizontal and vertical fragmentation
- Complete fragmentation simply means all tuples/attributes are in the result
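A minimal sketch of vertical fragmentation follows, assuming that every fragment keeps the key attribute (here a hypothetical `ssn`) so the pieces can be matched up again. The data and helper names are illustrative only.

```python
# Hypothetical relation; "ssn" is assumed to be the key kept in every fragment.
EMPLOYEE = [
    {"ssn": 1, "name": "Ann", "salary": 50000},
    {"ssn": 2, "name": "Bob", "salary": 42000},
]


def vertical_fragment(relation, attributes):
    """PROJECT the named attributes from every tuple."""
    return [{a: row[a] for a in attributes} for row in relation]


def full_outer_join(left, right, key):
    """Recombine two vertical fragments on the shared key attribute.

    A tuple present in only one fragment is still kept; the attributes
    it lacks are simply absent from the resulting dict.
    """
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


names = vertical_fragment(EMPLOYEE, ["ssn", "name"])
pay = vertical_fragment(EMPLOYEE, ["ssn", "salary"])
rebuilt = full_outer_join(names, pay, "ssn")
```

Without the shared key in both fragments there would be no way to tell which name belongs to which salary.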

10 Data Fragmentation
- A fragmentation schema is a definition of the set of fragments that includes all attributes and tuples sufficient to reconstruct the DB
- An allocation schema describes which fragments are at what sites

11 Data Replication
- Replication is the creation of copies of the DB
- A DDB may be fully replicated (a copy of the entire DB is made at each site)
  - Why would you want to make a full copy of a DDB?
- A DDB may have no replication (each fragment is stored at one and only one site)
- Naturally, a DDB may be partially replicated
- A replication schema is a description of what pieces are copied at which sites
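One way to picture a replication schema is as a mapping from each fragment to the set of sites holding a copy. The sketch below uses hypothetical fragment and site names and classifies a schema as fully replicated, partially replicated, or not replicated.

```python
# Hypothetical site names.
SITES = {"site1", "site2", "site3"}

# Replication schema (assumed representation): fragment -> sites holding a copy.
allocation = {
    "EMP_D5": {"site1"},
    "EMP_OTHER": {"site2", "site3"},
    "DEPT": {"site1", "site2", "site3"},
}


def replication_level(allocation, sites):
    """Classify a schema as 'full', 'none', or 'partial' replication."""
    if all(holders == sites for holders in allocation.values()):
        return "full"      # every fragment is stored at every site
    if all(len(holders) == 1 for holders in allocation.values()):
        return "none"      # each fragment is stored at exactly one site
    return "partial"


level = replication_level(allocation, SITES)
```

Here `level` is `"partial"`, because `DEPT` is copied everywhere while `EMP_D5` lives at a single site.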

12 Data Replication
- Replication creates new consistency and redundancy problems
- Every piece of data that is replicated is redundant, and therefore subject to inconsistency
- The copies may be updated separately, which causes inconsistency
- How much inconsistency is acceptable?

13 Synchronization
- Synchronization is the process of updating the individual replicas
- Since pieces are stored in different places, the DDB must periodically be made consistent
- Synchronization can be expensive in terms of network resources and time
- It is not simply copying one replica to another: the most recent updates on both copies being synchronized must be accounted for
- Pp. 775-778 in the text have an example of a DDB
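The point that synchronization is more than a one-way copy can be sketched as follows. This assumes one simple policy, "latest timestamp wins", with each entry stored as a `(value, timestamp)` pair and with comparable timestamps across sites. The lecture does not prescribe a policy, deletions are not handled, and the addresses are made up.

```python
def synchronize(replica_a, replica_b):
    """Merge two replicas, keeping the newer version of every entry.

    Each replica maps a key to a (value, timestamp) pair. The
    'latest timestamp wins' rule is an assumption of this sketch.
    """
    merged = dict(replica_a)
    for key, (value, stamp) in replica_b.items():
        if key not in merged or stamp > merged[key][1]:
            merged[key] = (value, stamp)
    return merged


# Hypothetical address-book replicas at two sites.
replica_a = {"ann": ("ann@base1.example", 3), "bob": ("bob@base1.example", 1)}
replica_b = {"bob": ("bob@base2.example", 4), "cy": ("cy@base2.example", 2)}

merged = synchronize(replica_a, replica_b)
```

Copying `replica_a` over `replica_b` would lose the newer address for `bob` and the new entry for `cy`; the merge keeps updates from both sides.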

14 US Air Force Email
- We have noted in the past that there are many types of databases, such as spreadsheets, address books, and even documents (such as MS Word)
- Consider the AF, with approximately 500,000 people who all have email addresses and need to communicate
- They have constructed a global email address book and make use of replication
- The AF is divided into levels: global, command, base

15 US Air Force Email
- Initially the bases were each set up with email and interconnected via the network
- However, you had to know the email address of anyone at a different base
- Eventually, each command (a group of related bases) set up an address book covering all of its bases
- Each base maintains a complete replica of the entire command's address book
  - Why not just a piece?

16 US Air Force Email
- The DB is synchronized each night
- So, when someone moves, their email address is removed from the local copy
- All the other bases will still have that “old” email address until the next day, at which point the DDB is consistent again
- I believe that now the entire AF address book is available at each base
  - Not sure how often it is synchronized, perhaps weekly

17 US Air Force Email
- Search for an email address is quick since a local copy is kept
- This reduces network traffic considerably compared with everyone having to search a centralized DB for email addresses

18 Query Processing in DDB
- When we looked at query processing before, the largest delay was with the disk
- Now, that same concept is extended to include network delay, which can be much longer
- Suppose the EMPLOYEE DB (10,000 records, 100 bytes each) is at site 1, and the DEPARTMENT DB (100 records, 35 bytes each) is at site 2
- YOU are at site 3
- Assume the result is 400,000 bytes

19 Query Processing in DDB
    SELECT E_Name
    FROM EMPLOYEE
    WHERE DeptNum = 5
- There are 3 strategies:
  1) Transfer both DBs to site 3 and perform the query there (1,003,500 bytes transferred)
  2) Transfer EMPLOYEE to site 2, perform the query, transfer the result to site 3 (1,400,000 bytes transferred)
  3) Transfer DEPARTMENT to site 1, perform the query, transfer the result to site 3 (403,500 bytes transferred)
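The byte counts for the three strategies follow directly from the relation sizes given on the previous slide. The short Python check below reproduces the arithmetic; the variable names and dictionary labels are illustrative.

```python
# Sizes taken from the slides.
EMPLOYEE_BYTES = 10_000 * 100    # 10,000 records of 100 bytes each
DEPARTMENT_BYTES = 100 * 35      # 100 records of 35 bytes each
RESULT_BYTES = 400_000           # assumed size of the query result

# Total bytes that cross the network under each strategy.
strategies = {
    "1: ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "2: ship EMPLOYEE to site 2, then the result to site 3": EMPLOYEE_BYTES + RESULT_BYTES,
    "3: ship DEPARTMENT to site 1, then the result to site 3": DEPARTMENT_BYTES + RESULT_BYTES,
}

cheapest = min(strategies, key=strategies.get)

for name, cost in strategies.items():
    print(f"{name}: {cost:,} bytes")
```

Strategy 3 is cheapest because only the small DEPARTMENT relation and the result cross the network, while the large EMPLOYEE relation never moves.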

20 Query Processing using Semijoin
- Rather than sending the entire set of records to be joined, we could send just the joining attribute(s)
- The join is then performed, and the join attributes together with the projected attributes are transferred to the requesting site
- The semijoin is symbolized as: R ⋉ S
- NOTE: R ⋉ S ≠ S ⋉ R
- Substantially reduces the amount of data transferred
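A minimal sketch of the semijoin idea, using hypothetical relations and attribute names: only the set of join-attribute values from one relation has to be shipped, and the other site keeps just the tuples that match. The example also shows that the operation is not commutative.

```python
# Hypothetical relations; attribute names are assumptions for illustration.
EMPLOYEE = [
    {"name": "Ann", "dept": 5},
    {"name": "Bob", "dept": 4},
    {"name": "Cy", "dept": 7},       # no matching department
]
DEPARTMENT = [
    {"dept": 4, "dname": "Admin"},
    {"dept": 5, "dname": "Research"},
    {"dept": 6, "dname": "Sales"},   # no employees
]


def join_values(relation, attr):
    """The join-attribute values of a relation: all that must be transmitted."""
    return {row[attr] for row in relation}


def semijoin(r, s, attr):
    """R semijoin S: the tuples of R whose join value also appears in S."""
    keys = join_values(s, attr)
    return [row for row in r if row[attr] in keys]


emp_semi_dept = semijoin(EMPLOYEE, DEPARTMENT, "dept")   # tuples of EMPLOYEE
dept_semi_emp = semijoin(DEPARTMENT, EMPLOYEE, "dept")   # tuples of DEPARTMENT
```

The first result contains only EMPLOYEE tuples (Ann and Bob) and the second only DEPARTMENT tuples (departments 4 and 5), so swapping the operands changes the answer.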

21 Concurrency Control and Recovery
- Dealing with multiple copies
- Failure of individual sites
- Failure of network
- Distributed commit is more complicated
- Deadlock is more difficult to detect and prevent
- A number of techniques have been proposed to deal with these problems

22 Distinguished Copy
- The locks for a data item are associated with the distinguished copy
- There are several distinguished copy variations:
  - Primary site (with backup): one site is chosen to coordinate all locking activities (centralized locking)
  - Primary copy: copies of various fragments at different sites are chosen as the distinguished copies, which distributes the locking load
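The difference between the two variations comes down to where a lock request is sent. The sketch below is a simplified illustration with hypothetical class and function names. It models exclusive locks only and ignores backups, failures, and message passing.

```python
class LockSite:
    """A site that grants exclusive locks for the items it coordinates."""

    def __init__(self, name):
        self.name = name
        self.locks = {}  # item -> transaction currently holding the lock

    def acquire(self, item, txn):
        """Grant the lock if it is free or already held by this transaction."""
        holder = self.locks.setdefault(item, txn)
        return holder == txn

    def release(self, item, txn):
        """Release the lock, but only if this transaction holds it."""
        if self.locks.get(item) == txn:
            del self.locks[item]


def primary_site_router(primary):
    """Primary site: every lock request goes to one coordinating site."""
    return lambda item: primary


def primary_copy_router(distinguished):
    """Primary copy: each item's request goes to the site of its distinguished copy."""
    return lambda item: distinguished[item]


s1, s2 = LockSite("site1"), LockSite("site2")
route = primary_copy_router({"EMP_D5": s1, "DEPT": s2})

first = route("DEPT").acquire("DEPT", "T1")    # granted by site2
second = route("DEPT").acquire("DEPT", "T2")   # refused, T1 holds the lock
other = route("EMP_D5").acquire("EMP_D5", "T2")  # granted by site1
```

With the primary copy router, requests for `DEPT` and `EMP_D5` are handled by different sites, so no single site carries the whole locking load.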

23 Distributed Recovery
- Very complex
- Suppose that X sends a request to Y; there may be a number of reasons the request was not granted
  - Message was never delivered
  - Site Y is down
  - Site Y sent a response but the response was not delivered

24 Summary
- Re-read the first 23 slides!
- Advantages/Disadvantages of a DDB
- The 3 Transparencies: network, replication, fragmentation
- Fragmentation
- Replication and Synchronization
- Query Processing in a DDB
- Semijoin
- Concurrency Control and Recovery

25 Primary Site Technique
