Distributed Query Processing. Based on “The state of the art in distributed query processing”, Donald Kossmann (ACM Computing Surveys, 2000)


Motivation
  Cost and scalability: network of off-the-shelf machines
  Integration of different software vendors (each with its own DBMS)
  Integration of legacy systems
  Applications that are inherently distributed, such as workflow or collaborative design
  State-of-the-art distributed information technologies (e-businesses)

Part 1: Basics
  Query Processing Basics
    – centralized query processing
    – distributed query processing

Problem Statement
  Input: a query such as “Biological objects in study A referenced in the literature in journal Y”
  Output: answer
  Objectives
    – response time, throughput, first answers, little I/O, ...
  Centralized vs. distributed query processing
    – same basic problem
    – but more and different parameters (such as data sites or available machine power) and objectives

Steps in Query Processing
  Input: declarative query
    – SQL, XQuery, ...
  Step 1: Translate query into algebra
    – tree of operators (query plan generation)
  Step 2: Optimize query
    – tree of operators (logical), also select partitions of table
    – tree of operators (physical), also site annotations
    – (compilation)
  Step 3: Execution
    – interpretation; query result generation

Algebra
    – relational algebra for SQL very well understood
    – algebra for XQuery mostly understood
  Example query: SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
  Figure (operator tree, top to bottom): project A.d; select A.a = B.b, A.c = 35; cross product (X) of A and B

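Query Optimization
    – logical, e.g., push down cheap predicates
    – enumerate alternative plans, apply cost model
    – use search heuristics to find cheapest plan
  Figure (before optimization): project A.d; select A.a = B.b, A.c = 35; cross product (X) of A and B
  Figure (after optimization): project A.d; hashjoin of an index lookup on A.c and B projected to B.b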

Basic Query Optimization
  Classical dynamic programming algorithm
    – performs join order optimization
    – input: join query on n relations
    – output: best join order

The Dynamic Programming Algorithm
  for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
  }
  for i = 2 to n do
    for all S ⊆ {R1, R2, …, Rn} such that |S| = i do {
      optPlan(S) = ∅
      for all O ⊂ S do {
        optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S - O))
        prunePlans(optPlan(S))
      }
    }
  return optPlan({R1, R2, …, Rn})
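The enumeration above can be sketched in Python. This is a minimal sketch under stated assumptions, not the survey's implementation: the cost model (sum of intermediate result sizes), the `cards` and `sel` inputs, and the function name `dp_join_order` are all hypothetical, and pruning is reduced to keeping the single cheapest plan per subset.

```python
from itertools import combinations

def dp_join_order(cards, sel):
    """cards: {relation: cardinality}; sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, cardinality, plan) for the cheapest plan found."""
    rels = sorted(cards)
    # Access plans: one plan per single relation, with zero join cost (assumption).
    best = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            # Try every split of S into a non-empty part O and the rest S - O.
            for k in range(1, size):
                for left in combinations(subset, k):
                    o = frozenset(left)
                    rest = s - o
                    lcost, lcard, lplan = best[o]
                    rcost, rcard, rplan = best[rest]
                    # Apply every selectivity that connects the two sides.
                    factor = 1.0
                    for a in o:
                        for b in rest:
                            factor *= sel.get(frozenset((a, b)), 1.0)
                    card = lcard * rcard * factor
                    cost = lcost + rcost + card  # toy cost: intermediate sizes
                    # Pruning: keep only the cheapest plan for each subset.
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, card, (lplan, rplan))
    return best[frozenset(rels)]

cost, card, plan = dp_join_order(
    {"A": 1000, "B": 100, "C": 10},
    {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1},
)
print(cost, card, plan)
```

With these made-up statistics the search joins B and C first, because that intermediate result is the smallest.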

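Query Execution
    – library of operators (hash join, merge join, ...)
    – exploit indexes and clustering in database
    – pipelining (iterator model)
  Figure: the optimized plan (project A.d; hashjoin; index on A.c; B projected to B.b) run on sample tuples
    – tuples from A: (John, 35, CS), (Mary, 35, EE)
    – tuples from B: (Edinburgh, CS, 5.0), (Edinburgh, AS, 6.0), projected to (CS), (AS)
    – join output: (John, 35, CS)
    – final result: John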
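The iterator model can be illustrated with Python generators, where each operator pulls tuples from its child on demand. This is only a sketch: the operator names, the positional column layout (A as (d, c, a), B with b in the second position), and the use of the sample tuples from the slide's figure are assumptions made for illustration.

```python
def scan(rows):
    # Leaf operator: yields stored tuples one at a time.
    for row in rows:
        yield row

def select(child, pred):
    # Passes on only the tuples that satisfy the predicate.
    for row in child:
        if pred(row):
            yield row

def project(child, cols):
    # Keeps only the listed column positions.
    for row in child:
        yield tuple(row[c] for c in cols)

def hash_join(build, probe, build_key, probe_key):
    table = {}
    for row in build:                      # build phase consumes one input fully
        table.setdefault(row[build_key], []).append(row)
    for row in probe:                      # probe phase is pipelined
        for match in table.get(row[probe_key], []):
            yield row + match

A = [("John", 35, "CS"), ("Mary", 35, "EE")]
B = [("Edinburgh", "CS", 5.0), ("Edinburgh", "AS", 6.0)]

plan = project(
    hash_join(
        build=project(scan(B), [1]),                   # B projected to B.b
        probe=select(scan(A), lambda r: r[1] == 35),   # stands in for the index on A.c
        build_key=0,
        probe_key=2,
    ),
    [0],                                               # project A.d
)
result = list(plan)
print(result)
```

Nothing is computed until `list(plan)` starts pulling tuples from the root, which is the point of the iterator model.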

Summary: Centralized Queries
  Basic SQL (SPJG, nesting) well understood
  Very good extensibility
    – spatial joins, time series, UDF, XQuery, etc.
  Current problems
    – better statistics: cost model for optimization
    – physical database design expensive and complex
  Some trends
    – interactiveness during execution
    – approximate answers, top-k
    – self-tuning capabilities (adaptive, robust, etc.)

Distributed Query Processing: Basics
  Idea: extension of centralized query processing (System R* et al. in the 1980s)
  What is different?
    – extend physical algebra: send and receive operators
    – other metrics: optimize for response time
    – resource vectors, network interconnect matrix
    – caching and replication
    – less predictability in cost model (adaptive algorithms)
    – heterogeneity in data formats and data models

Issues in Distributed Databases
  Plan enumeration
    – the time and space complexity of the traditional dynamic programming algorithm is very large
    – Iterative Dynamic Programming (heuristic for large queries)
  Cost models
    – classic cost model
    – response time model
    – economic models

Distributed Query Plan
  Figure: the example plan (project A.d; hashjoin; index on A.c; B projected to B.b) with added send and receive operators
  Forms of parallelism?

Cost: Resource Utilization
  Total cost = sum of the costs of all operators
  Figure: example plan with cost = 40

Another Metric: Response Time
  Figure: plan with total cost = 40; operators are annotated with first-tuple and last-tuple times (for example first tuple = 25, last tuple = 33, and first tuple = 0, last tuple = 10)
  Figure labels: pipelined parallelism, independent parallelism
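The difference between the two metrics can be shown with a small calculation. This is a sketch under strong assumptions: the per-operator costs are hypothetical (chosen only so that they sum to 40, not read off the figure), sibling subtrees are assumed to run fully in parallel on different sites, and pipelining between parent and child is ignored.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    cost: float
    children: list = field(default_factory=list)

def total_cost(op):
    # Resource utilization: every operator's work counts, wherever it runs.
    return op.cost + sum(total_cost(c) for c in op.children)

def response_time(op):
    # Independent parallelism: siblings overlap, so only the slowest one counts.
    return op.cost + max((response_time(c) for c in op.children), default=0.0)

plan = Op("join", 8, [Op("scan A", 12), Op("scan B", 20)])
print(total_cost(plan), response_time(plan))
```

A plan that minimizes total cost is therefore not necessarily the one with the best response time, which is why the optimizer needs to know which metric it is optimizing.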

Query Execution Techniques for Distributed Databases
  Row blocking
  Multi-cast optimization
  Multi-threaded execution
  Joins with horizontal partitioning
  Semi joins
  Top n queries

Query Execution Techniques for Distributed Databases
  Row blocking
    – SEND and RECEIVE operators in the query plan model communication
    – implemented on top of TCP/IP, UDP, etc.
    – ship tuples block-wise (in batches); smooths out burstiness
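Row blocking can be sketched as a pair of generators: the sender groups tuples into fixed-size blocks, and the receiver unpacks them again. The function names and the `block_size` parameter are hypothetical, and the network transfer itself is left out.

```python
def send_blocks(rows, block_size):
    # Collect tuples into blocks so that one message carries many tuples.
    block = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:              # ship the final, possibly partial block
        yield block

def receive(blocks):
    # Hand the tuples of each received block to the consuming operator.
    for block in blocks:
        yield from block

blocks = list(send_blocks(range(7), 3))
print(blocks)
print(list(receive(blocks)))
```

Larger blocks mean fewer messages but a longer wait for the first tuple, so the block size is a tuning parameter.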

Query Execution Techniques for Distributed Databases
  Multi-cast optimization
    – the location of the sending and receiving sites may affect communication costs; forwarding versus multi-casting
  Multi-threaded execution
    – several threads for operators at the same site (intra-query parallelism)
    – may be useful to enable concurrent reads from diverse machines (while query processing continues)
    – must consider whether resources warrant concurrent operator execution (say, two sorts that each need all memory)

Query Execution Techniques for Distributed Databases
  Joins with data (horizontal) partitioning
    – hash-based partitioning to conduct joins on independent partitions
  Semi joins
    – reduce communication costs; send only the “join keys” instead of complete tuples to the other site to extract the relevant join partners
  Double-pipelined hash joins
    – non-blocking join operators deliver first results quickly, fully exploit pipelined parallelism, and reduce overall response time
  Top n queries
    – isolate the top n tuples quickly and perform other expensive operations (like sort, join, etc.) only on those few (use “stop” operators)
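The semi-join idea can be sketched as three steps: ship the distinct join keys of one relation to the other site, ship back only the matching tuples, then join locally. The function name and the returned counters are hypothetical and only serve to show how much data would cross the network.

```python
def semi_join(a_rows, b_rows, a_key, b_key):
    # Step 1: the site holding A sends only its distinct join keys.
    keys = {row[a_key] for row in a_rows}
    # Step 2: the site holding B returns only tuples with a join partner.
    reduced_b = [row for row in b_rows if row[b_key] in keys]
    # Step 3: the final join runs at A's site on the reduced input.
    index = {}
    for row in reduced_b:
        index.setdefault(row[b_key], []).append(row)
    result = [a + b for a in a_rows for b in index.get(a[a_key], [])]
    return result, len(keys), len(reduced_b)

A = [(1, "x"), (2, "y"), (2, "z")]
B = [(2, "p"), (3, "q"), (4, "r")]
joined, keys_sent, tuples_returned = semi_join(A, B, 0, 0)
print(joined, keys_sent, tuples_returned)
```

Here two keys travel one way and one tuple travels back, instead of shipping all three tuples of B. The technique pays off when the join is selective and tuples are wide relative to the keys.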

Adaptive Algorithms
  Deal with unpredictable events at run time
    – delays in arrival of data, burstiness of network
    – autonomy of nodes, changes in policies
  Example: double-pipelined hash joins
    – build a hash table for both input streams
    – read inputs in separate threads
    – good for bursty arrival of data
  Re-optimization at run time (LEO, etc.)
    – monitor execution of query
    – adjust estimates of cost model
    – re-optimize if delta is too large
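A double-pipelined (symmetric) hash join can be sketched as follows. As a simplification, the two input threads are replaced by a single interleaved arrival stream of `(side, row)` pairs; that encoding and the function name are assumptions for illustration.

```python
def symmetric_hash_join(arrivals, left_key, right_key):
    """arrivals: iterable of ("L" or "R", row) pairs in arrival order."""
    left_table, right_table = {}, {}
    for side, row in arrivals:
        if side == "L":
            key = row[left_key]
            left_table.setdefault(key, []).append(row)   # remember for later right rows
            for other in right_table.get(key, []):       # probe what has arrived so far
                yield row + other
        else:
            key = row[right_key]
            right_table.setdefault(key, []).append(row)
            for other in left_table.get(key, []):
                yield other + row

arrivals = [
    ("L", (1, "a")),
    ("R", (2, "x")),
    ("R", (1, "y")),
    ("L", (2, "b")),
]
out = list(symmetric_hash_join(arrivals, 0, 0))
print(out)
```

Each tuple is inserted into its own side's table and immediately probed against the other side's table, so a result is emitted as soon as both partners have arrived, regardless of which input is delayed. The price is that both hash tables must be kept in memory.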

Special Techniques for Client-Server Architectures
  Shipping techniques
    – query shipping
    – data shipping
    – hybrid shipping
  Query optimization
    – site selection
    – where to optimize
    – two-phase optimization

Special Techniques for Federated Database Systems Wrapper architecture Query optimization –Query capabilities –Cost estimation Calibration Approach Wrapper Cost Model Parameter Binding

Heterogeneity
  Use wrappers to “hide” heterogeneity
  Wrappers take care of data format, packaging
  Wrappers map from local to global schema
  Wrappers carry out caching
    – connections, cursors, data, ...
  Wrappers map queries into the local dialect
  Wrappers participate in query planning!
    – define the subset of queries that can be handled
    – give cost information, statistics
    – “capability-based rewriting”
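How a wrapper might take part in planning can be shown with a toy example: the wrapper declares which operations its source accepts and reports a cost, and the middleware keeps the remaining operations for itself. Everything here (the `Wrapper` class, its fields, the flat cost figure, and `split_plan`) is hypothetical, and a real rewriter would also have to respect operator order and dependencies.

```python
from dataclasses import dataclass

@dataclass
class Wrapper:
    name: str
    supported_ops: frozenset   # operations the source can evaluate itself
    cost_per_op: float         # hypothetical flat cost reported to the optimizer

    def can_handle(self, op):
        return op in self.supported_ops

    def estimate_cost(self, ops):
        return self.cost_per_op * len(ops)

def split_plan(wrapper, ops):
    # Operations the source accepts are pushed down; the rest stay in the middleware.
    pushed = [op for op in ops if wrapper.can_handle(op)]
    local = [op for op in ops if not wrapper.can_handle(op)]
    return pushed, local

db1 = Wrapper("db1", frozenset({"select", "project"}), 2.0)
pushed, local = split_plan(db1, ["select", "join", "project"])
print(pushed, local, db1.estimate_cost(pushed))
```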

Summary
  Theory well understood
    – extend traditional (centralized) query processing
    – add many more details
    – heterogeneity needs manual work and wrappers
  Problems in practice
    – cost model, statistics
    – architectures are not fit for adaptivity, heterogeneity
    – optimizers do not scale to 10,000s of sites
    – autonomy of sites; systems not built for asynchronous communication

Middleware
  Two kinds of middleware
    – data warehouses
    – virtual integration
  Data warehouses
    – good: query response times
    – good: materializes results of data cleaning
    – bad: high resource requirements in middleware
    – bad: staleness of data
  Virtual integration
    – the opposite
    – caching possible to improve response times

Virtual Integration
  Figure: a query enters the middleware (query decomposition, result composition), which sends a sub query through a wrapper to each of DB1 and DB2

IBM Data Joiner
  Figure: an SQL query enters Data Joiner, which sends a sub query through a wrapper to each of the SQL sources DB1 and DB2

Adding XML
  Figure: a query enters the SQL middleware, which sends a sub query through a wrapper to each of DB1 and DB2; an XML Publishing component is added

XML Data Integration
  Figure: an XML query enters the XML middleware, which sends an XML query through a wrapper to each of DB1 and DB2

XML Data Integration
  Example: BEA Liquid Data
  Advantage
    – availability of XML wrappers for all major databases
  Problems
    – XML-to-SQL mapping is very difficult
    – XML is not always the right language (e.g., decision-support-style queries)

Web Services
  Idea: encapsulate the data source
    – provide a WSDL interface to access data
    – works very well if the query pattern is known
  Problem: exploit the capabilities of the source
    – WSDL limits the capabilities of the data source
    – good optimization requires a “white box”
    – example: access by id, access by name, full scan; should all combinations be listed in WSDL?
  Solution: WSDL for query planning

Summary
  Middleware looks like a homogeneous centralized database
    – location transparency
    – data model transparency
  Middleware provides a global schema
    – data sources map local schemas to the global schema
  Various kinds of middleware (SQL, XML)
  “Stacks” of middleware possible
  Data cleaning requires special attention