PMIT-6102 Advanced Database Systems, by Jesmin Akhter, Assistant Professor, IIT, Jahangirnagar University.

Lecture 07 Overview of Query Processing Slide 2

Outline
- Overview of Query Processing
  - Objective of Query Processing
  - Characterization of Query Processors
  - Layers of Query Processing
Slide 3

Query Processing (figure): a high-level user query is passed to the query processor, which produces low-level data manipulation commands (relational algebra). Slide 4

Selecting Alternatives
    SELECT ENAME
    FROM   EMP, ASG
    WHERE  EMP.ENO = ASG.ENO
    AND    RESP = "Manager"
The query combines a projection (ENAME), a selection (RESP = "Manager"), and a join (EMP.ENO = ASG.ENO).
Strategy 1: Π_ENAME(σ_{RESP = "Manager" ∧ EMP.ENO = ASG.ENO}(EMP × ASG))
Strategy 2: Π_ENAME(EMP ⋈_ENO (σ_{RESP = "Manager"}(ASG)))
Strategy 2 avoids the Cartesian product, so it is "better".
Example 7.2: This example illustrates the importance of site selection and communication for a chosen relational algebra query against a fragmented database.
Slide 5
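The two strategies can be compared on a toy instance. The following is a minimal Python sketch, not part of the lecture: the sample tuples and the function names `strategy_1` and `strategy_2` are assumptions made for illustration, and only the attribute names come from the slide. It shows that both strategies return the same names while Strategy 1 materializes a much larger intermediate result.

```python
from itertools import product

# Hypothetical sample data; only the attribute names come from the slide.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
ASG = [
    {"ENO": "E1", "RESP": "Manager"},
    {"ENO": "E2", "RESP": "Analyst"},
    {"ENO": "E3", "RESP": "Manager"},
]

def strategy_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    pairs = list(product(emp, asg))  # |EMP| * |ASG| intermediate tuples
    kept = [(e, a) for e, a in pairs
            if a["RESP"] == "Manager" and e["ENO"] == a["ENO"]]
    return {e["ENAME"] for e, _ in kept}, len(pairs)

def strategy_2(emp, asg):
    """Selection on ASG first, then join on ENO, then projection on ENAME."""
    managers = [a for a in asg if a["RESP"] == "Manager"]
    manager_enos = {a["ENO"] for a in managers}
    # Because only ENAME is projected, matching on ENO is enough here.
    joined = [e for e in emp if e["ENO"] in manager_enos]
    return {e["ENAME"] for e in joined}, len(managers)

names_1, intermediate_1 = strategy_1(EMP, ASG)
names_2, intermediate_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, intermediate_1, intermediate_2)  # True 9 2
```

On real relations the gap grows with the product of the two cardinalities, which is why pushing the selection below the join matters.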

What is the Problem?
We assume that relations EMP and ASG are horizontally fragmented:
- EMP1 = σ_{ENO ≤ "E3"}(EMP)
- EMP2 = σ_{ENO > "E3"}(EMP)
- ASG1 = σ_{ENO ≤ "E3"}(ASG)
- ASG2 = σ_{ENO > "E3"}(ASG)
Fragments ASG1, ASG2, EMP1, and EMP2 are stored at sites 1, 2, 3, and 4, respectively, and the result is expected at site 5.
(a) Strategy A
- Site 1: ASG1' = σ_{RESP = "Manager"}(ASG1); ASG1' is sent to site 3.
- Site 2: ASG2' = σ_{RESP = "Manager"}(ASG2); ASG2' is sent to site 4.
- Site 3: EMP1' = EMP1 ⋈_ENO ASG1'; EMP1' is sent to site 5.
- Site 4: EMP2' = EMP2 ⋈_ENO ASG2'; EMP2' is sent to site 5.
- Site 5: result = EMP1' ∪ EMP2'
(b) Strategy B
- Sites 1 to 4 send ASG1, ASG2, EMP1, and EMP2 to site 5.
- Site 5: result = (EMP1 ∪ EMP2) ⋈_ENO σ_{RESP = "Manager"}(ASG1 ∪ ASG2)
Slide 6

Cost of Alternatives
To evaluate the resource consumption of the two strategies, we use a simple cost model. Assume:
- size(EMP) = 400, size(ASG) = 1000
- tuple access cost = 1 unit; tuple transfer cost = 10 units
- as the (10+10) terms below imply, each ASG fragment contributes 10 "Manager" tuples, 20 in total
Strategy A
- produce ASG': (10+10) × tuple access cost = 20
- transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost = 200
- produce EMP': (10+10) × tuple access cost × 2 = 40
- transfer EMP' to result site: (10+10) × tuple transfer cost = 200
- Total cost: 460
Strategy B
- transfer EMP to site 5: 400 × tuple transfer cost = 4,000
- transfer ASG to site 5: 1000 × tuple transfer cost = 10,000
- produce ASG': 1000 × tuple access cost = 1,000
- join EMP and ASG': 400 × 20 × tuple access cost = 8,000
- Total cost: 23,000
Strategy A is better by a factor of 50, which is quite significant.
Slide 7
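The arithmetic of the cost model can be checked with a short script. This is a sketch under the slide's stated assumptions; the constant `MANAGERS_PER_SITE = 10` is read off the slide's (10+10) terms rather than stated explicitly, and all identifier names are chosen here for illustration.

```python
TUPLE_ACCESS_COST = 1     # units per tuple accessed (from the slide)
TUPLE_TRANSFER_COST = 10  # units per tuple transferred (from the slide)
SIZE_EMP = 400
SIZE_ASG = 1000
MANAGERS_PER_SITE = 10    # assumption inferred from the (10+10) terms

def cost_strategy_a():
    """Select and join locally, ship only the small intermediate results."""
    managers = MANAGERS_PER_SITE + MANAGERS_PER_SITE
    produce_asg = managers * TUPLE_ACCESS_COST
    transfer_asg = managers * TUPLE_TRANSFER_COST
    produce_emp = managers * TUPLE_ACCESS_COST * 2
    transfer_emp = managers * TUPLE_TRANSFER_COST
    return produce_asg + transfer_asg + produce_emp + transfer_emp

def cost_strategy_b():
    """Ship both full relations to the result site and process there."""
    managers = MANAGERS_PER_SITE + MANAGERS_PER_SITE
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER_COST
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER_COST
    produce_asg = SIZE_ASG * TUPLE_ACCESS_COST
    join = SIZE_EMP * managers * TUPLE_ACCESS_COST
    return transfer_emp + transfer_asg + produce_asg + join

cost_a = cost_strategy_a()
cost_b = cost_strategy_b()
print(cost_a, cost_b, cost_b / cost_a)  # 460 23000 50.0
```

Changing `TUPLE_TRANSFER_COST` shows how strongly the ranking depends on the ratio of communication cost to local access cost.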

Objectives of Query Processing
The objective of query processing in a distributed context is
- to transform a high-level query on a distributed database into an efficient execution strategy expressed in a low-level language on local databases.
- Different layers are involved in this query transformation.
An important aspect of query processing is query optimization.
  - Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained.
  - A good measure of resource consumption is the total cost incurred in processing the query.
  - Another good measure is the response time of the query.
Slide 8

Characterization of Query Processors
The first four characteristics hold for both centralized and distributed query processors, while the next four are particular to distributed query processors in tightly-integrated distributed DBMSs.
- Languages
- Types of Optimization
- Optimization Timing
- Statistics
- Decision Sites
- Exploitation of the Network Topology
- Exploitation of Replicated Fragments
- Use of Semijoins
Slide 9

Characterization of Query Processors: Types of Optimization
Exhaustive search
- Query optimization aims at choosing the "best" point in the solution space of all possible execution strategies.
- Search the solution space to predict the cost of each strategy.
- Select the strategy with minimum cost.
- Although this method is effective in selecting the best strategy, it may incur a significant processing cost for the optimization itself.
- The problem is that the solution space can be large; that is, there may be many equivalent strategies, even with a small number of relations.
Slide 10
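Exhaustive search can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the relation PROJ, the cardinalities, the join selectivities, and the cost function (sum of estimated intermediate result sizes over left-deep join orders) do not come from the slides.

```python
from itertools import permutations
from math import factorial

# Hypothetical cardinalities and pairwise join selectivities.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 50}
SELECTIVITY = {
    frozenset(("EMP", "ASG")): 1 / 400,
    frozenset(("ASG", "PROJ")): 1 / 50,
    frozenset(("EMP", "PROJ")): 1.0,  # no join predicate: Cartesian product
}

def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = CARD[order[0]]
    total = 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            size *= SELECTIVITY[frozenset((prev, rel))]
        joined.append(rel)
        total += size
    return total

def exhaustive_search(relations):
    """Predict the cost of every join order and keep a cheapest one."""
    return min(permutations(relations), key=plan_cost)

best = exhaustive_search(CARD)
print(best, plan_cost(best))
print(factorial(10))  # 3628800 left-deep orders for a 10-relation query
```

With three relations there are only six orders to cost, but the count grows factorially, which is the optimization overhead the slide warns about.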

Characterization of Query Processors: Types of Optimization
Heuristics
- A popular way of reducing the cost of exhaustive search
- Restrict the solution space so that only a few strategies are considered
- Regroup common sub-expressions
- Perform selection and projection first
- Replace a join by a series of semijoins
- Reorder operations to reduce intermediate relation size
- Optimize individual operations to minimize data communication
Slide 11
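The semijoin heuristic relies on the identity R ⋈ S = (R ⋉ S) ⋈ S. A minimal Python sketch follows; the sample tuples and the helper names `semijoin` and `join` are assumptions for illustration, not part of the lecture.

```python
def semijoin(r, s, attr):
    """R semijoin S: the tuples of r whose attr value also appears in s."""
    keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in keys]

def join(r, s, attr):
    """Equijoin of r and s on attr, as merged dictionaries."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

# Hypothetical data; only the attribute names come from the slides.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
    {"ENO": "E4", "ENAME": "B. Casey"},
]
ASG_MANAGERS = [
    {"ENO": "E1", "RESP": "Manager"},
    {"ENO": "E3", "RESP": "Manager"},
]

# Reduce EMP before joining: only matching tuples would need to be shipped.
reduced_emp = semijoin(EMP, ASG_MANAGERS, "ENO")
same_result = join(reduced_emp, ASG_MANAGERS, "ENO") == join(EMP, ASG_MANAGERS, "ENO")
print(len(EMP), len(reduced_emp), same_result)  # 4 2 True
```

In a distributed setting the benefit comes from shipping only the join-attribute values and the reduced relation instead of the whole operand.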

Characterization of Query Processors: Types of Optimization
Randomized strategies
- Find a very good solution, not necessarily the best one, while avoiding the high cost of optimization in terms of memory and time consumption.
Slide 12

Characterization of Query Processors: Optimization Timing
Optimization can be done statically before executing the query or dynamically as the query is executed.
- Static
  - Static query optimization is done at query compilation time.
  - Thus the cost of optimization may be amortized over multiple query executions.
  - This timing is appropriate for use with the exhaustive search method.
  - Since the sizes of the intermediate relations of a strategy are not known until run time, they must be estimated using database statistics.
Slide 13

Characterization of Query Processors: Optimization Timing
Dynamic
- Optimization is done at run time.
- Database statistics are not needed to estimate the size of intermediate results.
- The main advantage over static query optimization is that the actual sizes of intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice.
- The main shortcoming is that query optimization, an expensive task, must be repeated for each execution of the query. Therefore, this approach is best for ad-hoc queries.
Slide 14

Characterization of Query Processors: Optimization Timing
Hybrid
- Aims to provide the advantages of static query optimization while avoiding the problems caused by inaccurate estimates.
- The approach is basically static, but dynamic query optimization may take place at run time when a large difference between the predicted and actual sizes of intermediate relations is detected.
- If the error in estimated sizes > threshold, reoptimize at run time.
Slide 15
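The threshold test of the hybrid approach can be sketched as follows. The slides do not say how the error is measured; this sketch assumes a relative error and a hypothetical default threshold of 0.5, and the function name `needs_reoptimization` is invented for illustration.

```python
def needs_reoptimization(estimated_size, actual_size, threshold=0.5):
    """Return True when the relative estimation error exceeds the threshold.

    The relative-error measure and the default threshold are assumptions.
    """
    if estimated_size <= 0:
        # No meaningful estimate: reoptimize as soon as tuples actually appear.
        return actual_size > 0
    relative_error = abs(actual_size - estimated_size) / estimated_size
    return relative_error > threshold

print(needs_reoptimization(1000, 1200))  # False (20% error)
print(needs_reoptimization(1000, 5000))  # True (400% error)
```

A query processor following this scheme would keep the statically chosen plan in the first case and trigger run-time reoptimization in the second.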

Characterization of Query Processors: Statistics
- The effectiveness of query optimization relies on statistics on the database.
- Dynamic query optimization requires statistics in order to choose which operators should be done first.
- Static query optimization is even more demanding, since the size of intermediate relations must also be estimated based on statistical information.
- Statistics for query optimization typically bear on fragments, and include fragment cardinality and size as well as the size and number of distinct values of each attribute.
- To minimize the probability of error, more detailed statistics such as histograms of attribute values are sometimes used.
- The accuracy of statistics is maintained by periodic updating.
- With static optimization, significant changes in the statistics used to optimize a query might result in query reoptimization.
Slide 16
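As an illustration of how such statistics feed size estimates, the sketch below applies the common uniform-distribution estimate card(σ_{A = v}(R)) ≈ card(R) / distinct(A). The class `FragmentStats`, the function name, and the numbers are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class FragmentStats:
    """Per-fragment statistics: cardinality and distinct values per attribute."""
    cardinality: int
    distinct_values: dict  # attribute name -> number of distinct values

def estimate_equality_selection(stats, attribute):
    """Estimated result size of sigma_{attribute = constant}, assuming
    attribute values are uniformly distributed."""
    return stats.cardinality / stats.distinct_values[attribute]

# Hypothetical statistics for one ASG fragment.
asg1 = FragmentStats(cardinality=500, distinct_values={"RESP": 50, "ENO": 200})
print(estimate_equality_selection(asg1, "RESP"))  # 10.0
```

When values are skewed, the uniform assumption can be far off, which is where the histograms mentioned on the slide come in.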

Characterization of Query Processors: Decision Sites
- Centralized decision approach
  - A single site generates the strategy, that is, determines the "best" schedule.
  - Simpler
  - Needs knowledge about the entire distributed database
- Distributed decision approach
  - Cooperation among various sites to determine the schedule (elaboration of the best strategy)
  - Needs only local information
- Hybrid decision approach
  - One site makes the major decisions, that is, determines the global schedule.
  - Other sites make local decisions, that is, optimize the local sub-queries.
Slide 17

Characterization of Query Processors: Network Topology
- Distributed query optimization can be divided into two separate problems:
  - selection of the global execution strategy, based on inter-site communication, and
  - selection of each local execution strategy, based on a centralized query processing algorithm.
Wide area networks (WAN), point-to-point
- Communication cost will dominate; ignore all other cost factors.
- Global schedule to minimize communication cost
- Local schedules according to centralized query optimization
Slide 18

Characterization of Query Processors: Network Topology
Local area networks (LAN)
- Communication costs are comparable to I/O costs.
- Increase parallel execution at the expense of communication cost.
- The broadcasting capability of some local area networks can be exploited successfully to optimize the processing of join operators.
- Special algorithms exist for star networks.
Slide 19

Layers of Query Processing
Four main layers are involved in distributed query processing.
- Each layer solves a well-defined subproblem.
- The input is a query on global data, posed on global (distributed) relations.
- The first three layers map the input query into an optimized distributed query execution plan.
  - They perform the functions of query decomposition, data localization, and global query optimization.
Slide 20

Layers of Query Processing
Query decomposition and data localization correspond to query rewriting. The first three layers are performed by a central control site and use schema information stored in the global directory. The fourth layer performs distributed query execution by executing the plan, and returns the answer to the query. It is done by the local sites and the control site.
Slide 21

Layers of Query Processing (figure)
Input: calculus query on distributed relations
Control site
- Query Decomposition (uses the global schema); output: algebraic query on distributed relations
- Data Localization (uses the fragment schema); output: fragment query
- Global Optimization (uses statistics on fragments); output: optimized fragment query with communication operations
Local sites
- Local Optimization (uses the local schemas); output: optimized local queries
Slide 22

Query Decomposition
Query decomposition can be viewed as four successive steps.
- First, the calculus query is rewritten in a normalized form that is suitable for subsequent manipulation.
- Second, the normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible.
- Third, the correct query is simplified. One way to simplify a query is to eliminate redundant predicates.
- Fourth, the calculus query is restructured as an algebraic query.
Slide 23

Data Localization
The input to the second layer is an algebraic query on global relations. The main role of the second layer is to localize the query's data using data distribution information in the fragment schema. This layer determines which fragments are involved in the query and transforms the distributed query into a query on fragments. Fragmentation is defined by fragmentation predicates that can be expressed through relational operators.
Slide 24

Data Localization
A global relation can be reconstructed by applying the fragmentation rules and deriving a program of relational algebra operators, called a localization program, that acts on fragments. Generating a fragment query is done in two steps.
- First, the query is mapped into a fragment query by substituting each relation by its reconstruction program (also called materialization program).
- Second, the fragment query is simplified and restructured to produce another "good" query.
Slide 25
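The two steps can be sketched for the horizontally fragmented EMP relation of Slide 6. This is an illustrative Python sketch only: the tuples, the dictionary layout, and the function names `reconstruct` and `relevant_fragments` are assumptions, while the fragmentation predicates (ENO ≤ "E3" and ENO > "E3") come from the earlier example.

```python
# Fragmentation predicates from Slide 6; the tuples are hypothetical.
# String comparison is adequate here because the ENO values have one digit.
FRAGMENTS = {
    "EMP1": {
        "predicate": lambda eno: eno <= "E3",
        "tuples": [{"ENO": "E1", "ENAME": "J. Doe"},
                   {"ENO": "E3", "ENAME": "A. Lee"}],
    },
    "EMP2": {
        "predicate": lambda eno: eno > "E3",
        "tuples": [{"ENO": "E5", "ENAME": "B. Casey"}],
    },
}

def reconstruct(fragments):
    """Reconstruction program for horizontal fragmentation: union of fragments."""
    return [t for frag in fragments.values() for t in frag["tuples"]]

def relevant_fragments(fragments, eno):
    """Simplification step: keep only fragments whose fragmentation predicate
    is compatible with the query predicate ENO = eno."""
    return [name for name, frag in fragments.items() if frag["predicate"](eno)]

print(len(reconstruct(FRAGMENTS)))          # 3
print(relevant_fragments(FRAGMENTS, "E5"))  # ['EMP2']
```

For a query such as ENO = "E5", the simplified fragment query touches only EMP2, so EMP1's site does not need to be contacted at all.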

Global Query Optimization
The input to the third layer is an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query that is close to optimal. An execution strategy for a distributed query can be described with relational algebra operators and communication primitives (send/receive operators) for transferring data between sites. Query optimization consists of finding the "best" ordering of operators in the query, including communication operators, that minimizes a cost function.
Slide 26

Distributed Query Execution
The last layer is performed by all the sites having fragments involved in the query. Each subquery executing at one site, called a local query, is then optimized using the local schema of the site and executed. At this stage, the algorithms used to perform the relational operators may be chosen.
Slide 27

Thank You Slide 28
