Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.


2
• Data might be distributed to equalize the workload, so that individual sites are not overloaded to the point where throughput is impaired.
• Data might be placed so as to minimize communication costs and/or response time.
• Data might be kept at a particular site in order to maintain control and guarantee security.
• Certain data items might be replicated at multiple sites to increase their availability in the event of system crashes.

3 Multiple Local Schemas, Global Schema, Restricted Global Schema

4 Definition (Multiple Local Schemas): The distributed database looks to the application program like a collection of individual databases, each with its own schema.
Heterogeneous System: the individual DBMSs have been supplied by different vendors.
Homogeneous System: the individual DBMSs have been supplied by the same vendor.
[Figure labels: Application Program, DBMS]

5 Characteristics
o The application program must explicitly set up a connection to each site that contains data items to be accessed. After a connection has been established, the program can access the database using SQL statements constructed using that site's schema.
o A single SQL statement that refers to tables at different sites, such as a global join, is not supported. In such a case, the application must read the tuples from each table into buffers at the application site and explicitly test the join condition for each pair of tuples.
o Data at different sites might be stored in different formats. In such a case, the application must provide the conversion routines.
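The application-side join described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: two in-memory SQLite databases stand in for the DBMSs at two sites, and the table contents and the function name `application_side_join` are assumptions made for the example.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the DBMSs at two sites
# (an assumption for illustration; real sites would be reached over a network).
site_b = sqlite3.connect(":memory:")
site_c = sqlite3.connect(":memory:")

# Each site has its own local schema and data (sample values are made up).
site_b.execute("CREATE TABLE STUDENT (Id TEXT PRIMARY KEY, Major TEXT)")
site_b.executemany("INSERT INTO STUDENT VALUES (?, ?)",
                   [("111", "CS"), ("222", "EE"), ("333", "ME")])

site_c.execute("CREATE TABLE TRANSCRIPT (StudId TEXT, CrsCode TEXT)")
site_c.executemany("INSERT INTO TRANSCRIPT VALUES (?, ?)",
                   [("111", "CS305"), ("111", "CS315"), ("222", "EE101")])


def application_side_join(conn_student, conn_transcript):
    """Read both tables into application buffers, then test the join
    condition Id = StudId for every pair of tuples."""
    students = conn_student.execute("SELECT Id, Major FROM STUDENT").fetchall()
    transcripts = conn_transcript.execute(
        "SELECT StudId, CrsCode FROM TRANSCRIPT").fetchall()
    result = []
    for student_id, major in students:
        for stud_id, crs_code in transcripts:
            if student_id == stud_id:
                result.append((student_id, major, crs_code))
    return result


joined = application_side_join(site_b, site_c)
print(joined)
```

The nested loop makes the cost of this approach visible: every tuple of both tables crosses to the application site before a single join condition is tested.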

6 Definition: In this approach, the application designer sees a single schema that integrates all local schemas.
Integration is done by middleware. Middleware is software that supports the interaction of clients and servers, often in heterogeneous systems.
The Global Schema might include tables that do not appear in any local schema but can be computed from tables in the local schemas using appropriate SQL statements.
[Figure labels: Application Program, Integration Middleware, DBMS]

7 Characteristics
o Connections to individual sites are made automatically by the middleware. Hence the location of tables is hidden from the application program. This feature is called Location Transparency.
o Application programs execute SQL statements against the global schema. In the case of a join, if the two tables are stored at different sites, the middleware must translate the join into a sequence of steps executed by the individual DBMSs.
o Data at different sites might be stored in different formats. The middleware provides the conversion routines to integrate the systems.
An important aspect of this approach is Query Optimization: the middleware chooses a sequence of steps that constitutes a least-cost plan for evaluating the SQL statement submitted by the application.

8 Restricted Global Schemas are supported by the vendors of some homogeneous systems. The databases cooperate directly, eliminating the need for middleware. The application can execute an SQL statement that refers to tables at different sites, for example a global join. The system includes a Global Query Optimizer to design efficient query plans.

9 Common Models of Data Distribution:
1- Partitioning
   1-1 Horizontal Partitioning
   1-2 Vertical Partitioning
   1-3 Mixed Partitioning
2- Replication
Certain data items may be stored at particular sites for security reasons. Other items may be divided and stored. Some items may be replicated for easier access.

10 The simplest approach to distributing data is to store each table at a different site. However, the table is not always the best unit of distribution. More often, a transaction accesses only a subset of the rows of a table, or a view of the table, rather than the table as a whole. When a table is decomposed so that each portion is stored at a different site, where the corresponding transactions are executed, the portions are referred to as partitions. One advantage of partitioning a table is that the time to process a single query over a large table can be reduced by distributing the execution over the sites at which the partitions are stored.

11 Horizontal Partitioning
A single table T is partitioned into several tables T_1, T_2, …, T_n, where each partition contains a subset of the rows of T and each row of T is in exactly one partition. Generally, each partition satisfies
T_i = σ_{C_i}(T)
where C_i is a selection condition, each tuple in T satisfies C_i for exactly one value of i, and
T = ∪_i T_i

12 Example
INVENTORY (StockNum, Amount, Price, Location)
If horizontal partitioning is performed on the relation by Location, all tuples satisfying Location = 'Chicago' are stored as a partition named INVENTORY_CH (StockNum, Amount, Price). The attribute Location is dropped because it is redundant within the partition.
Because each tuple is stored in a partition, horizontal partitioning is lossless.
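A small sketch of this example, assuming made-up INVENTORY rows and two hypothetical helper functions (`horizontal_partition`, `reconstruct`), shows the selection that builds each partition and the union that recovers the original table.

```python
# Sample INVENTORY rows: (StockNum, Amount, Price, Location). Values are made up.
INVENTORY = [
    (1, 10, 2.50, "Chicago"),
    (2, 5, 7.00, "New York"),
    (3, 8, 1.25, "Chicago"),
]


def horizontal_partition(rows, location):
    """Select the rows with the given Location and drop the Location column,
    which is constant (and so redundant) within the partition."""
    return [(stock, amount, price)
            for (stock, amount, price, loc) in rows if loc == location]


def reconstruct(partitions):
    """Union of all partitions, re-attaching each partition's Location value."""
    return [(stock, amount, price, loc)
            for loc, part in partitions.items()
            for (stock, amount, price) in part]


partitions = {loc: horizontal_partition(INVENTORY, loc)
              for loc in ("Chicago", "New York")}
INVENTORY_CH = partitions["Chicago"]

print(INVENTORY_CH)
print(sorted(reconstruct(partitions)) == sorted(INVENTORY))
```

Because every row satisfies exactly one selection condition, the union contains each original row once, which is the lossless property stated above.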

13 Vertical Partitioning
A single table T is partitioned into several tables T_1, T_2, …, T_n, where each partition contains a subset of the columns of T, each column is included in at least one partition, and each partition includes the columns of a candidate key (the same key for all partitions). Generally, each partition satisfies
T_i = π_{att-list_i}(T)
Because each column is stored in at least one partition, vertical partitioning is lossless:
T = T_1 ⋈ T_2 ⋈ … ⋈ T_n

14 Example
EMPLOYEE (SSnum, Name, Salary, Title, Location)
If vertical partitioning is performed, the following two partitions are constructed:
EMP1 (SSnum, Name, Salary)
EMP2 (SSnum, Name, Title, Location)
EMP1 could be stored at the headquarters site (where the payroll is computed) and EMP2 could be stored elsewhere.
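The following sketch illustrates this example with made-up EMPLOYEE rows. The helpers `project` and `join_on_key` are hypothetical names introduced here. Since SSnum is the key carried by both partitions, matching on SSnum alone is enough to rebuild each original row.

```python
SCHEMA = ("SSnum", "Name", "Salary", "Title", "Location")

# Sample EMPLOYEE rows (values are made up for illustration).
EMPLOYEE = [
    ("123456789", "Ann", 52000, "manager", "Chicago"),
    ("987654321", "Bob", 18000, "clerk", "New York"),
]


def project(rows, columns):
    """Projection: keep only the named columns of each row."""
    positions = [SCHEMA.index(col) for col in columns]
    return [tuple(row[i] for i in positions) for row in rows]


EMP1 = project(EMPLOYEE, ("SSnum", "Name", "Salary"))
EMP2 = project(EMPLOYEE, ("SSnum", "Name", "Title", "Location"))


def join_on_key(emp1, emp2):
    """Rebuild EMPLOYEE rows by matching the two partitions on SSnum."""
    emp2_by_key = {row[0]: row for row in emp2}
    rebuilt = []
    for ssnum, name, salary in emp1:
        _, _, title, location = emp2_by_key[ssnum]
        rebuilt.append((ssnum, name, salary, title, location))
    return rebuilt


print(join_on_key(EMP1, EMP2) == EMPLOYEE)
```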

15 Example
EMPLOYEE (SSnum, Name, Salary, Title, Location)
Step 1: Vertical Partitioning
EMP1 (SSnum, Name, Salary)
EMP2 (SSnum, Name, Title, Location)
Step 2: Horizontal Partitioning
EMP2 is partitioned by Location into EMP2_CH (SSnum, Name, Title) and EMP2_NY (SSnum, Name, Title). The attribute Location is dropped because it is redundant within each partition.
Combinations of horizontal and vertical partitioning are possible, but care must be taken that the original table can be reconstructed from its partitions.

16 When relations are partitioned, update operations sometimes require tuples to be moved from one partition to another, and hence from one database to another. For example, if an employee transfers from the Chicago warehouse to the New York warehouse, the employee's tuple must move from EMP2_CH (SSnum, Name, Title) to EMP2_NY (SSnum, Name, Title).
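A minimal sketch of such a move, assuming each partition is modeled as a dictionary keyed by SSnum (the sample tuples and the function name `transfer` are made up for illustration):

```python
# Hypothetical contents of the two partitions, keyed by SSnum -> (Name, Title).
EMP2_CH = {"123456789": ("Ann", "manager")}
EMP2_NY = {"987654321": ("Bob", "clerk")}


def transfer(ssnum, source, target):
    """Move one employee tuple from the source partition to the target one.

    In a real distributed database the delete at one site and the insert at
    the other would have to be executed as a single atomic transaction;
    this sketch ignores that and only shows the data movement.
    """
    if ssnum not in source:
        raise KeyError(f"{ssnum} is not in the source partition")
    target[ssnum] = source.pop(ssnum)


# Ann transfers from the Chicago warehouse to the New York warehouse.
transfer("123456789", EMP2_CH, EMP2_NY)
print(EMP2_CH, EMP2_NY)
```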

17 Replication is one of the most useful and commonly used mechanisms in distributed databases. Replicating data at several sites increases availability, since the data can still be accessed if some of the sites fail. Replication also has the potential to improve performance, since queries can be executed more efficiently when the data is read from a local or nearby copy. Updates are slower, since all replicas of the data must be updated.
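The trade-off can be illustrated with a toy model in which a read needs only one available copy while a write touches every copy. The class `ReplicatedItem` and its methods are assumptions made for this sketch. It is not a real replication protocol, and it ignores what happens when a site is down during a write.

```python
class ReplicatedItem:
    """A data item with one copy per site (toy model, for illustration only)."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}
        self.failed = set()  # sites that are currently down

    def read(self, local_site):
        """Return (value, site), preferring the local copy and falling back
        to any other copy whose site is still up."""
        candidates = [local_site] + [s for s in self.copies if s != local_site]
        for site in candidates:
            if site in self.copies and site not in self.failed:
                return self.copies[site], site
        raise RuntimeError("no replica is available")

    def write(self, value):
        """Update every replica and return how many copies were touched."""
        for site in self.copies:
            self.copies[site] = value
        return len(self.copies)


item = ReplicatedItem(["A", "B", "C"], 10)
item.failed.add("A")            # site A crashes
print(item.read("A"))           # the read is served from another site
print(item.write(11))           # the write has to touch all three copies
```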

18 A multidatabase system is composed of a set of independent DBMSs. For an application to query information stored at multiple sites, it must decompose the query into a sequence of SQL statements, each of which is processed by a particular DBMS. A system that supports a Global Schema contains a Global Query Optimizer, which analyzes a query using the global schema and translates it into an appropriate sequence of steps to be executed at individual sites. Because the cost of I/O is so much greater than that of computation, the measure of efficiency of a query execution plan is the number of I/O operations required. Communication costs are measured by the number of bytes that have to be transmitted.

19 Planning with joins
Suppose that an application at site A wants to join tables at sites B and C, with the result returned to site A. Two ways the Global Query Optimizer could execute the join are:
• Transmit both tables to site A and execute the join there.
• Transmit the smaller of the tables to the other site (e.g., the table at site B to site C), execute the join there, and then transmit the result to site A.

20 Example
Consider two tables:
STUDENT (Id, Major)
TRANSCRIPT (StudId, CrsCode)
The tables are stored at sites B and C respectively. Suppose an application at site A wants to compute a join with the join condition Id = StudId. For this example, the assumptions are:
1- The lengths of the attributes are:
• Id and StudId: 9 bytes
• Major: 3 bytes
• CrsCode: 6 bytes
2- STUDENT has 15000 tuples.
3- Approximately 5000 students are registered for at least one course, and on average each of them is registered for four courses. Thus TRANSCRIPT has about 20000 tuples, and 10000 students are not registered for any course.

21 Solution
STUDENT has 15000 tuples, each of length 12 bytes (9+3). TRANSCRIPT has 20000 tuples, each of length 15 bytes (9+6). The join will have 20000 tuples, each of length 18 bytes (9+6+3). Based on the above assumptions, there are three alternative plans:
• If both tables are sent to site A and the join is performed there, 15000*12 + 20000*15 = 480000 bytes need to be transferred.
• If the STUDENT table is sent to site C, the join is computed there, and the result is sent to site A, 15000*12 + 20000*18 = 540000 bytes need to be transferred.
• If the TRANSCRIPT table is sent to site B, the join is computed there, and the result is sent to site A, 20000*15 + 20000*18 = 660000 bytes need to be transferred.
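The byte counts for the three plans can be checked with a short calculation. The sketch assumes 15000 STUDENT tuples and 20000 TRANSCRIPT tuples, the multipliers that appear in the products above. The plan labels are made up for readability.

```python
# Tuple counts and tuple widths in bytes, taken from the example.
STUDENT_TUPLES, STUDENT_WIDTH = 15000, 9 + 3          # Id + Major
TRANSCRIPT_TUPLES, TRANSCRIPT_WIDTH = 20000, 9 + 6    # StudId + CrsCode
JOIN_TUPLES, JOIN_WIDTH = 20000, 9 + 3 + 6            # Id + Major + CrsCode

student_bytes = STUDENT_TUPLES * STUDENT_WIDTH
transcript_bytes = TRANSCRIPT_TUPLES * TRANSCRIPT_WIDTH
join_bytes = JOIN_TUPLES * JOIN_WIDTH

# Communication cost of each plan = total bytes shipped between sites.
plans = {
    "join at A (ship both tables to A)": student_bytes + transcript_bytes,
    "join at C (ship STUDENT to C, result to A)": student_bytes + join_bytes,
    "join at B (ship TRANSCRIPT to B, result to A)": transcript_bytes + join_bytes,
}

cheapest = min(plans, key=plans.get)
for name, cost in plans.items():
    print(f"{name}: {cost} bytes")
print("cheapest:", cheapest)
```

Under these assumptions, shipping both tables to site A is the cheapest plan, because the join result is larger than either input table.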

22 Queries that involve joins and selections
Suppose that there is only one warehouse in the Internet grocer application and the relation EMPLOYEE (SSnum, Name, Salary, Title, Location) is vertically partitioned as
EMP1 (SSnum, Name, Salary): stored at site B (Headquarters)
EMP2 (SSnum, Title, Location): stored at site C (Warehouse)
Suppose a query at a third site, A, requests the names of all employees with the title "manager" whose salary is more than $20000. The query can be expressed in two equivalent ways:
1- π_Name (σ_{Title='manager' AND Salary>20000} (EMP1 ⋈ EMP2))
2- π_Name ((σ_{Salary>20000} (EMP1)) ⋈ (σ_{Title='manager'} (EMP2))) √

23 So:
1- At site B, select all tuples from EMP1 for which Salary is more than $20000 and call the result R_1.
2- At site C, select all tuples from EMP2 for which Title is manager and call the result R_2.
3- Perform the join of R_1 and R_2 at some site and project the result on the Name attribute. Call it R_3. If this site is not A, send R_3 to site A.
The lengths of the attributes are: SSnum: 9 bytes; Salary: 6 bytes; Title: 7 bytes; Location: 10 bytes; Name: 15 bytes.
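The equivalence of selecting after the join and selecting before it can be checked on a few made-up rows. The function names below are hypothetical, and the join helper assumes SSnum is the first column of both partitions.

```python
# Sample partitions (values are made up for illustration).
EMP1 = [("1", "Ann", 52000), ("2", "Bob", 18000), ("3", "Cy", 30000)]   # SSnum, Name, Salary
EMP2 = [("1", "manager", "HQ"), ("2", "manager", "HQ"), ("3", "clerk", "HQ")]  # SSnum, Title, Location


def join(left, right):
    """Join two lists of tuples on SSnum, the first column of both."""
    right_by_key = {row[0]: row for row in right}
    return [l + right_by_key[l[0]][1:] for l in left if l[0] in right_by_key]


def select_after_join():
    """Join the full partitions, then select, then project on Name."""
    joined = join(EMP1, EMP2)  # (SSnum, Name, Salary, Title, Location)
    return [row[1] for row in joined
            if row[3] == "manager" and row[2] > 20000]


def select_before_join():
    """Select at each site first (R1, R2), then join and project (R3)."""
    r1 = [row for row in EMP1 if row[2] > 20000]        # done at site B
    r2 = [row for row in EMP2 if row[1] == "manager"]   # done at site C
    r3 = [row[1] for row in join(r1, r2)]
    return r1, r2, r3


r1, r2, r3 = select_before_join()
print(select_after_join(), r3)
print(len(EMP1), len(r1), len(EMP2), len(r2))
```

Both orderings return the same names, but the second one joins only the already-filtered intermediate results, so fewer tuples have to be shipped between sites.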

24 Plans
1- Send R_2 to site B and do the join there. Then send the names to A.
2- Send R_1 to site C and do the join there. Then send the names to A.
3- Send R_1 and R_2 to site A and do the join there.
• The length of each tuple in EMP1 is 30 bytes and in EMP2 is 26 bytes.
• EMP1, and hence EMP2, contains one tuple per employee.
• About 5000 employees have a salary of more than $20000. So R_1 has 5000 tuples (each of 30 bytes), for a total of 150000 bytes.
• There are about 50 managers. Therefore R_2 has about 50 tuples (each of length 26 bytes), for a total of 1300 bytes.
• About 90% of the managers have a salary of more than $20000. Therefore R_3 has about 45 tuples, each of length 15 bytes, for a total of 675 bytes.

25 Evaluating the cost of each plan:
1- If the join is done at site B, 1300 bytes must be sent from site C to site B and then 675 bytes from site B to site A, for a total of 1975 bytes.
2- If the join is done at site C, 150000 bytes must be sent from site B to site C and then 675 bytes from site C to site A, for a total of 150675 bytes.
3- If the join is done at site A, 150000 bytes must be sent from site B to site A and 1300 bytes from site C to site A, for a total of 151300 bytes.
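These totals follow directly from the sizes stated for R1 (5000 tuples of 30 bytes), R2 (50 tuples of 26 bytes), and R3 (45 tuples of 15 bytes). A short calculation, with made-up plan labels, reproduces them:

```python
# Sizes of the intermediate results in bytes, from the figures in the example.
r1_bytes = 5000 * 30   # EMP1 tuples with Salary > 20000, selected at site B
r2_bytes = 50 * 26     # EMP2 tuples with Title = manager, selected at site C
r3_bytes = 45 * 15     # names of managers earning more than 20000

# Bytes shipped by each plan.
plans = {
    "plan 1: join at B": r2_bytes + r3_bytes,   # R2: C -> B, then R3: B -> A
    "plan 2: join at C": r1_bytes + r3_bytes,   # R1: B -> C, then R3: C -> A
    "plan 3: join at A": r1_bytes + r2_bytes,   # R1: B -> A and R2: C -> A
}

cheapest = min(plans, key=plans.get)
for name, cost in plans.items():
    print(f"{name}: {cost} bytes")
print("cheapest:", cheapest)
```

Plan 1 is far cheaper because it ships only the small result R2 and the final names, and never moves the large result R1.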