Distributed Databases

Distributed Databases
Sahim Mohammed By Dr. Haddouti

Outline Introduction to Distributed database
distributed systems principles Problems Conclusion references 9/18/2018 Distributed Database

Introduction A single logical database that is spread physically across computers in multiple locations that are connected by a data communication link. 9/18/2018 Distributed Database

WafaBank Example CPU CPU Terminals Terminals Casablanca Agency
Tangier Agency 9/18/2018 Distributed Database

Alternative Definition
Collection of connected sites each site is a DB in its own right (1) has its own DBMS and its own users operations can be performed locally as if the DB was not distributed the sites collaborate (transparently from the user’s point of view) the union of all DBs = the DB of the whole organisation (institution) (oppose to (1)) physical or logical distribution 9/18/2018 Distributed Database

Organization Advantages
Efficiency of processing Data is stored close to the point where it is most commonly used. Increased accessibility Possibility to access Casablanca account from Tangier and vice versa, via communication link Etc. 9/18/2018 Distributed Database

distributed System principles
Transparency Local Autonomy Data sharing Capacity and incremental growth Reliability and availability Efficiency and flexibility others 9/18/2018 Distributed Database

Transparency(1/3) Def.: Extension of data independence to distributed systems by hiding the distribution, fragmentation and replication of data from the users. Forms Data independence Network transparency Replication transparency Fragmentation transparency Transparency can be provided at different levels of the system. 9/18/2018 Distributed Database

Transparency (2/3) Data independence
logical data independence: user applications are not affected by changes in the logical structure of the DB physical data independence: hiding the details of the storage structure the only form that is important in centralized DBMSs as well Network (distribution) transparency location transparency: an operation on data is independent of both the location and the system where it is executed naming transparency: unique name is provided for each object in the DB. 9/18/2018 Distributed Database

Transparency (3 /3) Replication transparency
Data is replicated for reliability and performance considerations. The user should not be aware of the existence of copies Fragmentation transparency DB relations are divided into smaller fragments for performance and availability reasons The global queries should be translated to fragment queries A question of query processing. 9/18/2018 Distributed Database

Local autonomy all operations at a certain site are fully controlled by that site not achievable (why?) therefore, autonomy should be achieved to the maximum extent possible local data is locally owned and managed local data belongs to the local server even if it is accessible from other servers security, integrity, ..., are in the responsibility of the local server 9/18/2018 Distributed Database

Data sharing sharing data is required over business units.
so it must be convenient to consolidate data across local databases on demand 9/18/2018 Distributed Database

Capacity and incremental growth
No single machine with adequate capacity DS can grow more than non Distributed system Nessecity to expand the system Easy to add new site than to replace exesting cetralized system. Involve less disruption of service to users 9/18/2018 Distributed Database

Reliability and availability
Unlike centralized system, DS is not an all-or-nothing proposition. Continue to function in the face failure of individual site/client Each replica of a given object can be used s free backup 9/18/2018 Distributed Database

Efficiency and flexibility
cost to ship large quantities of data across a communications network or to handle a large volume of transactions from remote sources can be high dependencies on data communications can be risky, so keeping local copies or fragments of data can be a reliable way to support the need for a rapid access to data across the organizations 9/18/2018 Distributed Database

Distributed system structure
Visualized as graph Collection of nodes and edges Node -> site CPU & local database & terminals Edge -> communication link In normal operation every site is connected to every other site Network can degenerate into a partitioned site A tota nwk can split into a collection of subnwks 9/18/2018 Distributed Database

Points to consider about distributed Systems
Reliable communications Data fragmentation Transaction management Homogeneous vs heterogeneous systems 9/18/2018 Distributed Database

Reliable communications
Message delivery must be Guaranteed In the order sent Messages must not be garbled Delivered more than once  Communication reliability is the responsibility of the communication control software 9/18/2018 Distributed Database

Data fragmentation data fragmentation types definition restrictions
if a relation can be divided into “fragments” for storing purposes motivation: performance - data is stored where it is mostly used types horizontal or vertical definition fragment = any subrelation derivable via restriction or projection restrictions disjoint decompositions non-loss decompositions 9/18/2018 Distributed Database

The role of fragmentation (1/2)
Fragmentation: decomposition of relations it is not aimed at data distribution! Rather: the relation is not a proper unit for distribution, a more appropriate unit can be obtained by partitioning applications view only a subset of relations - locality can be optimized unnecessary replication and high volume of remote memory accesses can be avoided transactions can be executed concurrently: intraquery transaction 9/18/2018 Distributed Database

The role of fragmentation (2/2)
but: conflicting requirements may prevent decomposition into mutually exclusive fragments that can lead to performance degradation semantic data control: checking for dependencies may be difficult 9/18/2018 Distributed Database

Fragmentation - Alternatives
ACCOUNT Horizontal(distr. + parallel) Vertical(parallel) 9/18/2018 Distributed Database

Fragmentation - Queries
DEFINE FRAGMENT CASA1 AS SELECT ACCOUNT#, BRANCH, TYPE FROM ACCOUNT WHERE BRANCH = ‘CASA’ DEFINE FRAGMENT CASA2 AS SELECT ACCOUNT#, BRANCH, CUSTOMER, TYPE, BALANCE 9/18/2018 Distributed Database

Correctness rules of fragmentation
Rules to avoid semantic changes at fragmentation: Completeness (lossless decomposition): no attributes or tuples may be eliminated at fragmentation Reconstruction: there must be a relational operator () that can reconstruct the original relation, i.e. R= Ri,  RiFR  may be different for different alternatives of fragmentation (horizontal or vertical) Disjointness: horizontal fragments must be disjoint, i.e. if data item d is in Rj, then it is not in any other Ri (ij) 9/18/2018 Distributed Database

Transaction Managemant(1)
Unit of consistent and atomic execution against the database. Agent Process executing on behalf of some particular transaction at some particular site Two agents within the same transaction Called cohorts of each other All cohorts must be treated as a unit For recovery & concurency pupose 9/18/2018 Distributed Database

Transaction Managemant (2)
Two-phase commit protocol Atomic commitment protocol Must be used to ensure cohort either all commit or all roll back System should allow an agent B to initiate a 3rd agent C back at the site of the original agent A 9/18/2018 Distributed Database

Homogenous vs Heterogeneous Systems
DBMSs at different sites are not always of the same family. If the same Homogenous system Otherwise heterogenous 9/18/2018 Distributed Database

Problems Communication links are rather slow High access delay time
Messages are expensive in terms of CPU instructions executed. Different communication systems have differing characteristics Design decisions differ from one system to another Ex. may not be feasible to propagate updates to all copies in real time  Aim: Minimize network utilization 9/18/2018 Distributed Database

Problems To use the ANSI/sparc terminology, all the DDB Pbs are internal-level pbs, not conceptual-level or external-level pbs. Distribution has no effect on the users view of data/specific language used, or logical DB design Effect on Concurrency Recovery Physical DB design 9/18/2018 Distributed Database

Derived problems Query processing Catalog management
Update propagation Recovery control Concurrency control 9/18/2018 Distributed Database

Query processing in a distributed environment
query execution is distributed query optimisation is distributed global optimisation local optimisation example query on relation R issued at site X part of R, say Ry, stored at Y part of R, say Rz, stored at Z where is the query going to be executed? 9/18/2018 Distributed Database

Query processing example - initial conditions
Site A: Suppliers ( S_id, City ) 10,000 tuples Contracts ( S_id, P_id ) 1,000,000 tuples Site B: Parts (P_id, Colour ) 100,000 tuples SELECT S.S_id FROM Suppliers S, Contracts C, Parts P WHERE S.S_id = C.S_id AND P.P_id = C.P_id AND City = ‘London’ AND Colour = ‘red’ ; 9/18/2018 Distributed Database

Query processing example - evaluation
possible evaluation procedures (1) move relation Parts to site A and evaluate the query at A (2) move relations Suppliers and Contracts to B and evaluate at B (3) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, and for each of these tuples check at site B to see whether the corresponding part is red (4) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, transfer them B and terminate the processing there (5) restrict Parts to tuples containing red parts, move the result to A and process there (6) think of other possibilities … there is an extra dimension added by the site where the query was issued 9/18/2018 Distributed Database

Query processing example - total time
total_time = delay_time + data_transfer_time = no_messages * data_volume(in bits) / 50000 assumptions: 1. disregard computation time on each server (site) 2. estimated cardinality of some intermediate results red parts …... 10 contracts with suppliers from London …... 50,000 3. communication assumptions date rate …... 50k bits / second access delay … second 4. size of each tuple ……. 200 bits 9/18/2018 Distributed Database

Query processing example - total time
9/18/2018 Distributed Database

Query processing Summary
Each of the five strategies represent a plausible approach to the problem The variation communication time is enormous Data rate & access delay are both important factors in choosing an appropriate strategy Communication time is likely to negligible compared w/ comm. Time for the poor strategies 9/18/2018 Distributed Database

Which Query processing to choose?
SQL compiler chooses an access plan for a given query by generating a set of possible strategies for that query. Estimate the work involved in each strategy & assign cost to it Select the plan having the lowest cost I/O cost + CPU cost + (communication cost) 9/18/2018 Distributed Database

Catalog management Known as the system dictionary/directory where control information concerning objects of interest to the system are stored Purposes Enable the sys to transform high level requests for data into appropriate low-level operation on stored objects To satisfy the architecture of data independence In DDB, the catalog must specify the site at which the object is stored 9/18/2018 Distributed Database

Catalog management what ‘other’ data does the catalog include?
fragmentation, replication ... where should the catalogue be stored centralised objective: no central site! fully replicated loss of autonomy - update propagation! partitioned non local operations - very expensive! combination of first and third 9/18/2018 Distributed Database

Catalog management each site maintains a catalog entry for
every object born at that site (and the site where it had migrated, if applicable) every object stored at that site object identification - at most 2 sites need to be accessed 9/18/2018 Distributed Database

Catalog management R* - object naming
<creator site ID>.<local site ID> e.g. Where:  Creator ID: object creator ID  Creator Site ID: the ID of site from which the object was created  Local name: the object name  Birth site ID: the ID of the site at which the the object is initially stored 9/18/2018 Distributed Database

Update propagation problems because of replication primary copy scheme
data might become less available due to immediate update request primary copy scheme one copy is designated primary copy (unique) primary copies exist at different sites (distributed) an update is logically complete if the primary copy has been updated the site holding the primary copy would have to propagate the updates this has to be done before COMMIT (preserve - ACID) commercial systems: update propagation is guaranteed for some future time violation of local autonomy 9/18/2018 Distributed Database

Recovery control two-phase commit protocol issues
there can be no central site so each site should be able to act as a coordinator usually the site where the transaction was initiated other sites are told by the coordinator what to do loss of autonomy there is no protocol (theoretically) that guarantees that a transaction is / is not performed by all agents with respect to any kind of failure increased number of messages more complex protocols 9/18/2018 Distributed Database

Concurrency control Concurrency control algorithm is needed
To achieve serialization for correct execution of a set of concurrent transactions. Two strategies are used Locking Susceptible to deadlock overhead - increased number of messages Timestamping Not susceptible to deadlock Both can be used in Distributed systems, but tradeoffs are different. 9/18/2018 Distributed Database

Concurrency control primary copy strategy global deadlock
locking only the primary copy the primary copy’s site will propagate the update Overcome the drawback of single locking site Reintroduce the possibility of global deadlock global deadlock two interlocked (waiting for each other) sites cannot be detected using the wait-for graph - therefore, communication overhead 9/18/2018 Distributed Database

Global deadlock Example
site X holds lock on tx Transaction 1x Transaction 2x Transaction 1y Transaction 2y holds lock on ty site Y 9/18/2018 Distributed Database

Conclusion Distributed systems research is very young field
Most proposed sol to the DDS problems can only be regarded as speculative at this time Implemented only in a research environment, and not be subject to any extensive practical tests, not there is the best approach in many cases 9/18/2018 Distributed Database

References Date Book C.J.: An Introduction to Database Systems, Volume 1, 6th Edition, Addition-Wesley, 1995 Date Book C.J.: An Introduction to Database Systems, Volume 2, Addition-Wesley, 1985 www-peda.ac-martinique.fr/ecogest/ressources/cours/sgbdr/sgbd.htm 9/18/2018 Distributed Database

Distributed Databases

Similar presentations

Presentation on theme: "Distributed Databases"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Distributed Databases

Similar presentations

Presentation on theme: "Distributed Databases"— Presentation transcript:

Similar presentations

About project

Feedback