Presentation is loading. Please wait.

Presentation is loading. Please wait.

Distributed databases 1. 2 Outline introduction principles / objectives problems.

Similar presentations

Presentation on theme: "Distributed databases 1. 2 Outline introduction principles / objectives problems."— Presentation transcript:

1 Distributed databases 1

2 2 Outline introduction principles / objectives problems

3 Distributed databases 3 Introduction communication network server application server DBMS in its own right

4 Distributed databases 4 Introduction distributed database = collection of connected sites each site is a DB in its own right (1) has its own DBMS and its own users operations can be performed locally as if the DB was not distributed the sites collaborate (transparently from the users point of view) the union of all DBs = the DB of the whole organisation (institution) (oppose to (1)) physical or logical distribution strict homogeneity (assumption)

5 Distributed databases 5 Motivation advantages matches the structure of the organisation example efficiency of processing stored closely to where it is being used increased accessibility remote DBs can be accessed disadvantage complexity

6 Distributed databases 6 Implementations (systems) commercial INGRES/STAR ( Ask Group Inc. Ingres Division ) distributed database option of ORACLE7 ( Oracle Corporation ) distributed data facility of DB2 ( IBM ) they all provide some sort of features for distributed databases

7 Distributed databases 7 Fundamental principle a distributed DB system should look to the user exactly as a non-distributed DB system

8 Distributed databases 8 Principles / objectives local autonomyno reliance on central site continuous operationlocation independence fragmentation independencereplication independence distributed query processingdistributed transaction management hardware independenceOS independence network independenceDBMS independence

9 Distributed databases 9 Principles / objectives not independent from each other not exhaustive sometimes contradicting different degree of importance (for the user)

10 Distributed databases 10 Local autonomy all operations at a certain site are fully controlled by that site not achievable (why?) therefore, autonomy should be achieved to the maximum extent possible local data is locally owned and managed local data belongs to the local server even if it is accessible from other servers security, integrity,..., are in the responsibility of the local server

11 Distributed databases 11 No reliance on a central site reasons bottle-neck vulnerability conclusion all sites must be equal

12 Distributed databases 12 Continuous operation greater reliability the probability that the system is running at any moment of time greater availability the probability that the system is running for a specified period of time

13 Distributed databases 13 Location independence / transparency users should not have to know where data is physically stored why do you think this is needed? think of application programs what does this objective look like?

14 Distributed databases 14 Data fragmentation data fragmentation if a relation can be divided into fragments for storing purposes motivation: performance - data is stored where it is mostly used types horizontal or vertical definition fragment = any subrelation derivable via restriction or projection restrictions disjoint decompositions non-loss decompositions

15 Distributed databases 15 FRAGMENT Emp INTO Lo_Emp AT SITE London WHERE Dept_id = Sales Le_Emp AT SITE Leeds WHERE Dept_id = Dev ; Data fragmentation - example

16 Distributed databases 16 Fragmentation independence / transparency users should perceive data as if it were not fragmented why? it is the optimisers responsibility to determine which fragments need to be physically accessed similar to views retrieving updating (JOIN and UNION views)

17 Distributed databases 17 FRAGMENT Emp INTO Lo_Emp AT SITE London WHERE Dept_id = Sales Le_Emp AT SITE Leeds WHERE Dept_id = Dev ; --looks (and works almost) like a view SELECT * FROM Emp WHERE Salary > 40 AND Dept_id = Dev; --is transformed into SELECT * FROM Lo_emp WHERE Salary > 40 AND Dept_id = Dev; UNION SELECT * FROM Le_emp WHERE Salary > 40 AND Dept_id = Dev ; Fragmentation independence - example

18 Distributed databases 18 Data replication copies of the same fragment can exist at different sites reasons better availability better performance disadvantage update propagation

19 Distributed databases 19 Replication independence / transparency users should not have to be aware of data replication it is the optimisers responsibility to choose which replica to use commercial systems not full support for replication independence (update problems) - primary copy

20 Distributed databases 20 Distributed query processing the system must have set level operators one record at a time - too many messages (traffic) relational - indicated optimisation particularly relevant! find best way to move data across the network

21 Distributed databases 21 Distributed transaction management problems recovery concurrency transaction = set of agents agent - runs on a certain machine recovery two-phase commit protocol concurrency locking

22 Distributed databases 22 Problems occur due to network utilisation network messages are costly aim minimise network utilisation problems: query processing catalog management update propagation recovery control concurrency control

23 Distributed databases 23 Query processing in a distributed environment query execution is distributed query optimisation is distributed global optimisation local optimisation example query on relation R issued at site X part of R, say R y, stored at Y part of R, say R z, stored at Z where is the query going to be executed?

24 Distributed databases 24 Query processing example - initial conditions Site A: Suppliers ( S_id, City ) 10,000 tuples Contracts ( S_id, P_id )1,000,000 tuples Site B: Parts (P_id, Colour )100,000 tuples SELECT S.S_id FROM Suppliers S, Contracts C, Parts P WHERES.S_id = C.S_id AND P.P_id = C.P_id AND City = London AND Colour = red ;

25 Distributed databases 25 Query processing example - evaluation possible evaluation procedures (1) move relation Parts to site A and evaluate the query at A (2) move relations Suppliers and Contracts to B and evaluate at B (3) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, and for each of these tuples check at site B to see whether the corresponding part is red (4) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, transfer them B and terminate the processing there (5) restrict Parts to tuples containing red parts, move the result to A and process there (6) think of other possibilities … there is an extra dimension added by the site where the query was issued

26 Distributed databases 26 Query processing example - total time total_time = delay_time + data_transfer_time = no_messages * data_volume(in bits) / assumptions: 1. disregard computation time on each server (site) 2. estimated cardinality of some intermediate results red parts …...10 contracts with suppliers from London …...50, communication assumptions date rate …...50k bits / second access delay … second 4. size of each tuple ……. 200 bits

27 Distributed databases 27 Query processing example - total time

28 Distributed databases 28 Catalog management what other data does the catalog include? fragmentation, replication... where should the catalogue be stored centralised objective: no central site! fully replicated loss of autonomy - update propagation! partitioned non local operations - very expensive! combination of first and third

29 Distributed databases 29 Catalog management R* - e.g. each site maintains a catalog entry for every object born at that site (and the site where it had migrated, if applicable) every object stored at that site object identification - at most 2 sites need to be accessed

30 Distributed databases 30 Update propagation problems because of replication data might become less available due to immediate update request primary copy scheme one copy is designated primary copy (unique) primary copies exist at different sites (distributed) an update is logically complete if the primary copy has been updated the site holding the primary copy would have to propagate the updates this has to be done before COMMIT (preserve - ACID) commercial systems: update propagation is guaranteed for some future time violation of local autonomy

31 Distributed databases 31 Recovery control two-phase commit protocol issues there can be no central site so each site should be able to act as a coordinator usually the site where the transaction was initiated other sites are told by the coordinator what to do loss of autonomy there is no protocol (theoretically) that guarantees that a transaction is / is not performed by all agents with respect to any kind of failure increased number of messages more complex protocols

32 Distributed databases 32 Concurrency control locking overhead - increased number of messages primary copy strategy locking only the primary copy the primary copys site will propagate the update loss of autonomy (severely) global deadlock two interlocked (waiting for each other) sites cannot be detected using the wait-for graph - therefore, communication overhead

33 Distributed databases 33 Global deadlock Transaction 1 x Transaction 1 y Transaction 2 y Transaction 2 x holds lock on t x holds lock on t y site X site Y

34 Distributed databases 34 Gateways DBMS #1 (Ingres) DBMS #2 (Oracle) GATE WAY

35 Distributed databases 35 Client / server systems are a special case of distributed database systems remember from last term read for extra information

Download ppt "Distributed databases 1. 2 Outline introduction principles / objectives problems."

Similar presentations

Ads by Google