Access Path Selection in a RDBMS Shahram Ghandeharizadeh Computer Science Department University of Southern California.

System R
- Grand-daddy of RDBMS
  - Started in 1975 at IBM San Jose Research Lab.
  - Won the ACM Software System Award in 1988.
  - Introduced fundamental database concepts such as SQL, locking, logging, and cost-based query optimization.

Four Phases of SQL Processing
- Parsing
  - Checks for correct SQL syntax.
  - Computes the list of items to be retrieved, the table(s) referenced, and the boolean combination of simple predicates.
- Optimization
  - Looks up the tables in the database catalog for their existence, statistics, and available access paths.
  - Computes the execution plan with minimum cost.
  - Output: execution plan in the Access Specification Language (ASL).
- Code generation
  - The code generator is a table-driven program that translates ASL trees into machine language code.
  - The parse tree is replaced by executable machine code and its associated data structures. This code can be stored in the database for later execution.
- Execution
  - Executes the machine code by invoking the System R internal storage system (RSS) via the storage system interface (RSI) to scan each of the physically stored relations referenced by the query.

Research Storage System (RSS)
- Maintains physical storage of relations and access paths on these relations.
- Implements locking and logging.
- RSS represents a relation as:
  - A collection of tuples stored in 4KB pages.
  - Columns of a tuple are physically contiguous.
  - No tuple spans a page.
  - Pages are organized into logical units called segments.
  - Segments may contain one or more relations.
  - Each tuple is tagged with the identification of the relation to which it belongs.
  - No relation spans a segment (at most one segment per relation).

RSS (Cont.)
- Access tuples using a scan: OPEN, NEXT, and CLOSE. A scan returns one tuple at a time.
- Supports two types of scans:
  1. Segment scan: finds all tuples of a relation. All non-empty pages of the segment are referenced, each only once.
  2. Index scan: uses a B+-tree index.

Optimizer
- Formulates a cost prediction for each access plan, using the following cost formula:
    COST = Page fetches + W * (RSI calls)
- W is an adjustable weighting factor between I/O and CPU.
- The number of RSI calls is an approximation for CPU utilization.
- Assumptions:
  - The WHERE tree is considered to be in conjunctive normal form.
  - Every conjunct is called a boolean factor.
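A minimal sketch of how the cost formula can rank candidate plans. The function name, the two plans, their numbers, and the value W = 0.5 are illustrative assumptions, not values from System R.

```python
def plan_cost(page_fetches: float, rsi_calls: float, w: float) -> float:
    """COST = page fetches + W * (RSI calls)."""
    return page_fetches + w * rsi_calls


# Two hypothetical plans for the same query (all numbers are made up).
plans = {
    "segment scan": {"page_fetches": 100, "rsi_calls": 1000},
    "index scan": {"page_fetches": 30, "rsi_calls": 1000},
}
W = 0.5  # assumed weighting between I/O and CPU

costs = {name: plan_cost(p["page_fetches"], p["rsi_calls"], W)
         for name, p in plans.items()}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

A larger W makes the CPU term (RSI calls) dominate; a smaller W makes the choice depend mostly on page fetches.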

Optimizer (Motivation)
- Given a query, there are many ways to execute it. The optimizer must identify the best execution plan.
- Example:
    SELECT name, title, sal
    FROM Emp, Job
    WHERE Emp.Job = Job.Job and Title = 'CLERK'

Optimizer (Motivation)
- Example:
    SELECT name, title, sal
    FROM Emp, Job
    WHERE Emp.Job = Job.Job and Title = 'CLERK'
- Decide the order in which to perform the different operators:
  - Process "Title = 'CLERK'" followed by the join, or
  - Process the join "Emp.Job = Job.Job" followed by "Title = 'CLERK'".
- Decide which access path to use: segment scan, clustered index, non-clustered index.
- Decide the join algorithm: nested-loops versus merge-scan.
- This paper tries to answer all the above questions!

How?
- Enumerate the different execution plans.
- Estimate the cost of performing each plan.
- Pick the cheapest plan.
- What is the definition of cost?
    COST = Page fetches + W * (RSI calls)

Conjunctive Normal Form
- A formula is in conjunctive normal form (CNF) if it is a conjunction of clauses, where each clause is a disjunction of literals:
  - A AND B
  - ~A AND (B OR C)
  - (A OR B) AND (D OR ~E)
- Is ~(B OR C) in CNF? No. Fix it by carrying the negation inside: ~B AND ~C

Conjunctive Normal Form
- A formula is in conjunctive normal form (CNF) if it is a conjunction of clauses, where each clause is a disjunction of literals:
  - A AND B
  - ~A AND (B OR C)
  - (A OR B) AND (D OR ~E)
- How about (A AND B) OR C? It is not in CNF. Transform it to (A OR C) AND (B OR C)
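The two rewrites used on the CNF slides (carrying the negation inside, and distributing OR over AND) can be checked by brute force over all truth assignments. This is only a sanity check; the helper name `equivalent` is an assumption introduced for illustration.

```python
from itertools import product


def equivalent(f, g, n_vars):
    """True if f and g agree on every truth assignment of n_vars variables."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n_vars))


# Distribute OR over AND: (A AND B) OR C  ==  (A OR C) AND (B OR C)
distributes = equivalent(
    lambda a, b, c: (a and b) or c,
    lambda a, b, c: (a or c) and (b or c),
    3,
)

# Carry the negation inside: ~(B OR C)  ==  ~B AND ~C
de_morgan = equivalent(
    lambda b, c: not (b or c),
    lambda b, c: (not b) and (not c),
    2,
)
print(distributes, de_morgan)
```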

CNF
- Why?
  - Every tuple returned to the user must satisfy every boolean factor.
  - If a tuple fails a boolean factor, discard it from further consideration.
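A small sketch of why CNF is convenient: a tuple can be rejected as soon as one boolean factor fails. The predicate and the four sample records below are hypothetical, chosen only to illustrate the idea.

```python
# Each boolean factor is one conjunct of the CNF WHERE clause.
# Hypothetical predicate: (major = 'CS' OR major = 'EE') AND gpa > 3.0
boolean_factors = [
    lambda t: t["major"] == "CS" or t["major"] == "EE",
    lambda t: t["gpa"] > 3.0,
]


def qualifies(tuple_, factors):
    """A tuple qualifies only if it satisfies every boolean factor.

    all() stops at the first factor that fails, so the tuple is
    discarded without evaluating the remaining factors.
    """
    return all(factor(tuple_) for factor in factors)


students = [
    {"name": "Bob", "gpa": 3.7, "major": "CS"},
    {"name": "Mary", "gpa": 3.0, "major": "ECE"},
    {"name": "Chang", "gpa": 2.5, "major": "CS"},
    {"name": "Vera", "gpa": 3.9, "major": "EE"},
]
result = [t["name"] for t in students if qualifies(t, boolean_factors)]
print(result)
```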

Database Catalog
- System R maintains statistics for each relation T:
  - NCARD(T), number of records in T.
  - TCARD(T), number of pages in the segment that hold tuples of T.
  - P(T), fraction of data pages in the segment that hold tuples of relation T:
    P(T) = TCARD(T) / (# of non-empty pages in the segment)
- For each index I on relation T:
  - ICARD(I), number of distinct keys in index I.
  - NINDX(I), number of pages in index I.

Maintenance of Statistics

Selectivity Factor (F)
- Corresponds to the expected fraction of tuples that will satisfy the predicate.
- Column = value
  - F = 1 / ICARD(column index) if there is an index on the column, assuming an even distribution of tuples among the index key values.
  - F = 1/10 otherwise.

Clustered Index
Assume a student table: Student(name, age, gpa, major)
t(Student) = 16, P(Student) = 4 (16 records stored in 4 pages)
name    age  gpa  major
Bob     21   3.7  CS
Mary    24   3    ECE
Tom     20   3.2  EE
Kathy   18   3.8  LS
Kane    19   3.8  ME
Lam     22   2.8  ME
Chang   18   2.5  CS
Vera    17   3.9  EE
Louis   32   4    LS
Martha  29   3.8  CS
James   24   3.1  ME
Pat     19   2.8  EE
Chris   22   3.9  CS
Chad    28   2.3  LS
Leila   20   3.5  LS
Shideh  16   4    CS

Number of Records per GPA
[Figure: histogram of the number of records (y-axis, 0 to 4) for each actual GPA value (x-axis: 2.3, 2.5, 2.8, 3, 3.1, 3.2, 3.5, 3.7, 3.8, 3.9, 4).]

ESTIMATING NUMBER OF RESULTING RECORDS
- For exact match selection predicates, assume a uniform distribution of records across the number of unique values. E.g., the selection predicate is gpa = 3.3.
- For range selection predicates, assume a uniform distribution of records across the range of available values defined by min and max. In this case, one must think about the interval. E.g., gpa > 3.5.
[Figures: two charts plotted over the GPA values 2.3, 2.5, 2.8, 3, 3.1, 3.2, 3.5, 3.7, 3.8, 3.9, 4.]
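A worked example of the two uniformity assumptions, using the 16 gpa values of the Student table from the earlier slide. The variable names are illustrative; the sketch compares each estimate with the actual count.

```python
# gpa values of the 16 Student records, in the order listed on the slide.
gpas = [3.7, 3.0, 3.2, 3.8, 3.8, 2.8, 2.5, 3.9,
        4.0, 3.8, 3.1, 2.8, 3.9, 2.3, 3.5, 4.0]

n_records = len(gpas)              # 16 records
n_distinct = len(set(gpas))        # 11 unique values
low, high = min(gpas), max(gpas)   # 2.3 and 4.0

# Exact match, gpa = 3.3: records spread evenly over the unique values.
est_eq = n_records / n_distinct
actual_eq = sum(1 for g in gpas if g == 3.3)

# Range, gpa > 3.5: records spread evenly over the interval [low, high].
f_range = (high - 3.5) / (high - low)
est_gt = n_records * f_range
actual_gt = sum(1 for g in gpas if g > 3.5)

print(f"gpa = 3.3: estimated {est_eq:.2f}, actual {actual_eq}")
print(f"gpa > 3.5: estimated {est_gt:.2f}, actual {actual_gt}")
```

On this data the estimates are about 1.45 records for gpa = 3.3 (actually 0) and about 4.71 records for gpa > 3.5 (actually 8), which shows how a skewed distribution can mislead the uniform assumption.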

Selectivity Factor (F)
- Column > value
  - F = (high key value - value) / (high key value - low key value), as long as the column is an arithmetic type and value is known at access path selection time.
  - F = 1/3 otherwise (column is not arithmetic).

Selectivity Factor (F)
- Column < value
  - F = (value - low key value) / (high key value - low key value), as long as the column is an arithmetic type and value is known at access path selection time.
  - F = 1/3 otherwise (column is not arithmetic).

Selectivity Factor (F)
- Value1 < Column < Value2
  - F = (Value2 - Value1) / (high key value - low key value), as long as the column is arithmetic.
  - F = 1/4 otherwise.
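The three range formulas can be sketched as small functions. The function names and the `arithmetic` flag are assumptions for illustration; passing `value=None` stands for a value that is not known at access path selection time.

```python
from typing import Optional


def f_greater(value: Optional[float], low: float, high: float,
              arithmetic: bool = True) -> float:
    """Selectivity of: column > value."""
    if not arithmetic or value is None:
        return 1 / 3  # default when interpolation is not possible
    return (high - value) / (high - low)


def f_less(value: Optional[float], low: float, high: float,
           arithmetic: bool = True) -> float:
    """Selectivity of: column < value."""
    if not arithmetic or value is None:
        return 1 / 3
    return (value - low) / (high - low)


def f_between(value1: float, value2: float, low: float, high: float,
              arithmetic: bool = True) -> float:
    """Selectivity of: value1 < column < value2."""
    if not arithmetic:
        return 1 / 4
    return (value2 - value1) / (high - low)


# Student.gpa ranges from 2.3 (low key) to 4.0 (high key).
print(f_greater(3.5, 2.3, 4.0))
print(f_less(3.0, 2.3, 4.0))
print(f_between(3.0, 3.5, 2.3, 4.0))
```

The sketch does not clamp results to the interval [0, 1]; a value outside the key range would need that extra step.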

Selectivity Factor (F)
- Column in (list of values)
- Join predicate, Column 1 = Column 2
- Disjunctive predicate

Selectivity Factor (F)
- Conjunctive predicate
- Negation
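The slides list these predicate types without their formulas. The sketch below uses the formulas as they are usually quoted from the System R access path selection paper; treat them as assumptions to verify against the paper. All function names are illustrative.

```python
from typing import Optional

DEFAULT_EQ = 1 / 10  # default selectivity when no index statistics exist


def f_in_list(n_items: int, f_equal: float) -> float:
    """column IN (list): n_items * F(column = value), capped at 1/2."""
    return min(n_items * f_equal, 0.5)


def f_join(icard1: Optional[int], icard2: Optional[int]) -> float:
    """column1 = column2: 1 / max(ICARD) over the available indexes."""
    cards = [c for c in (icard1, icard2) if c is not None]
    return 1 / max(cards) if cards else DEFAULT_EQ


def f_or(f1: float, f2: float) -> float:
    """Disjunction: F1 + F2 - F1 * F2."""
    return f1 + f2 - f1 * f2


def f_and(f1: float, f2: float) -> float:
    """Conjunction: F1 * F2 (assumes the predicates are independent)."""
    return f1 * f2


def f_not(f: float) -> float:
    """Negation: 1 - F."""
    return 1 - f


print(f_in_list(3, DEFAULT_EQ), f_join(10, 50), f_or(0.5, 0.5),
      f_and(0.5, 0.5), f_not(0.25))
```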

Interesting order
- A query block's GROUP BY or ORDER BY clause may correspond to the order of records in an access path. This tuple order is an interesting order.
- Example queries on Student(name, age, gpa, major) with a B+-tree on the gpa attribute:
    SELECT name FROM Student WHERE gpa < 3.0 ORDER BY gpa
    SELECT gpa, count(*) FROM Student WHERE gpa < 3.0 GROUP BY gpa

B+-Tree
- A B+-tree on the gpa attribute
[Figure: clustered B+-tree on Student.gpa with root key 3.6. Each leaf entry is (gpa, (page, slot)) and points to one of the 16 Student records in the data pages.]
Leaf entries below the root key: (2.3, (1,1)) (2.5, (1,2)) (2.8, (1,3)) (2.8, (1,4)) (3, (2,1)) (3.1, (2,2)) (3.2, (2,3)) (3.5, (2,4))
Leaf entries above the root key: (3.7, (3,1)) (3.8, (3,2)) (3.8, (3,3)) (3.8, (3,4)) (3.9, (4,1)) (3.9, (4,2)) (4, (4,3)) (4, (4,4))

Single Relation Access Paths
- Single relation access paths are simple selects with ORDER BY and GROUP BY clauses.
    SELECT name FROM Student WHERE age < 20
- Without an index, must perform a segment scan. What is the cost?
  - TCARD / P + W * RSISCAN
  - TCARD(T), number of pages in the segment that hold tuples of T.
  - P(T), fraction of data pages in the segment that hold tuples of relation T:
    P(T) = TCARD(T) / (# of non-empty pages in the segment)
  - Why TCARD / P? Tuples of Student might be intermixed with tuples of another relation (e.g., professors) in the same segment, and a segment scan touches every non-empty page of the segment. Example: for the student table with TCARD = 100 pages and P(T) = 0.75, the scan reads 100 / 0.75 ≈ 133 pages. Note that P(T) = 1 when the student table is not intermixed with another table.
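A minimal sketch of the segment scan cost using the example on this slide (TCARD = 100, P = 0.75). The number of RSI calls and the value of W are made-up inputs, and the function name is illustrative.

```python
def segment_scan_cost(tcard: float, p: float, rsi_calls: float, w: float) -> float:
    """TCARD / P page fetches (every non-empty page of the segment) plus W * RSI calls."""
    return tcard / p + w * rsi_calls


# Student table intermixed with another relation in the same segment.
pages_touched = 100 / 0.75
cost = segment_scan_cost(tcard=100, p=0.75, rsi_calls=200, w=0.5)

# Same table alone in its segment: P = 1, so only its own 100 pages are read.
cost_alone = segment_scan_cost(tcard=100, p=1.0, rsi_calls=200, w=0.5)
print(round(pages_touched, 1), round(cost, 1), cost_alone)
```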

Single Relation Access Paths
- Cost of scanning leaf pages and data pages
[Figure: the clustered B+-tree on the gpa attribute of Student, with root key 3.6 and leaf entries (gpa, (page, slot)) pointing to the data pages.]

Single Relation Access Paths
- Cost of scanning leaf pages and data pages containing the qualifying records

Non-Clustered B+-Tree
- A random I/O for every qualifying record
[Figure: non-clustered B+-tree on Student.gpa with root key 3.6. Each leaf entry is (gpa, (page, slot)).]
Data page 1: Bob, 21, 3.7, CS | Mary, 24, 3, ECE | Tom, 20, 3.2, EE | Kathy, 18, 3.8, LS
Data page 2: Kane, 19, 3.8, ME | Lam, 22, 2.8, ME | Chang, 18, 2.5, CS | Vera, 17, 3.9, EE
Data page 3: Louis, 32, 4, LS | Martha, 29, 3.8, CS | James, 24, 3.1, ME | Pat, 19, 2.8, EE
Data page 4: Chris, 22, 3.9, CS | Chad, 28, 2.3, LS | Leila, 20, 3.5, LS | Shideh, 16, 4, CS
Leaf entries below the root key: (2.3, (4,2)) (2.5, (2,3)) (2.8, (2,2)) (2.8, (3,4)) (3, (1,2)) (3.1, (3,3)) (3.2, (1,3)) (3.5, (4,3))
Leaf entries above the root key: (3.7, (1,1)) (3.8, (1,4)) (3.8, (2,1)) (3.8, (3,2)) (3.9, (2,4)) (3.9, (4,1)) (4, (3,1)) (4, (4,4))
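To make the difference concrete, the sketch below counts data page fetches for the predicate gpa > 3.5 using the leaf entries of the clustered and non-clustered trees from the slides. It assumes a buffer that holds a single data page and that entries with equal keys are visited in (page, slot) order; both are assumptions for illustration, as is the function name.

```python
# Leaf entries with gpa > 3.5, in key order, as (gpa, (page, slot)).
clustered = [(3.7, (3, 1)), (3.8, (3, 2)), (3.8, (3, 3)), (3.8, (3, 4)),
             (3.9, (4, 1)), (3.9, (4, 2)), (4.0, (4, 3)), (4.0, (4, 4))]
non_clustered = [(3.7, (1, 1)), (3.8, (1, 4)), (3.8, (2, 1)), (3.8, (3, 2)),
                 (3.9, (2, 4)), (3.9, (4, 1)), (4.0, (3, 1)), (4.0, (4, 4))]


def data_page_fetches(entries):
    """Count data page fetches, assuming the buffer holds only one page."""
    fetches, current = 0, None
    for _key, (page, _slot) in entries:
        if page != current:  # the needed page is not the buffered one
            fetches += 1
            current = page
    return fetches


print(data_page_fetches(clustered), data_page_fetches(non_clustered))
```

With the clustered index the 8 qualifying records sit on 2 consecutive pages; with the non-clustered index the same 8 records cost 7 page fetches under these assumptions, close to one I/O per qualifying record.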

R EQUALITY JOIN S: R.A = S.A
- Two algorithms for performing the join operator: nested loops and merge-scan.
- Tuple nested loops:
    for each tuple r in R do
      for each tuple s in S do
        if r.A = s.A then output (r, s) in the result relation
      end-for
    end-for
- Estimated cost of tuple nested loops:
    TCARD(R)/P(R) + [NCARD(R) × TCARD(S)/P(S)]
  The outer relation R is scanned once (TCARD(R)/P(R) pages), and the inner relation S is scanned once (TCARD(S)/P(S) pages) for each of the NCARD(R) tuples of R.
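The pseudocode above translates almost directly into Python. In this sketch tuples are represented as dictionaries, and the sample relations and statistics are made up; the cost function simply restates the slide's formula.

```python
def nested_loops_join(R, S, key="A"):
    """Tuple nested loops: compare every tuple of R with every tuple of S."""
    result = []
    for r in R:          # outer relation
        for s in S:      # inner relation, rescanned for each outer tuple
            if r[key] == s[key]:
                result.append((r, s))
    return result


def nested_loops_cost(tcard_r, p_r, ncard_r, tcard_s, p_s):
    """TCARD(R)/P(R) + NCARD(R) * TCARD(S)/P(S), counted in page fetches."""
    return tcard_r / p_r + ncard_r * (tcard_s / p_s)


R = [{"A": 1, "r": "r1"}, {"A": 2, "r": "r2"}, {"A": 3, "r": "r3"}]
S = [{"A": 2, "s": "s1"}, {"A": 3, "s": "s2"}, {"A": 3, "s": "s3"}]
pairs = nested_loops_join(R, S)
print(len(pairs), nested_loops_cost(10, 1.0, 100, 5, 1.0))
```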

EQUALITY JOIN (Cont.)
- Merge-scan:
  1. Interesting order on R.A (sorted)
  2. Interesting order on S.A (sorted)
  3. Scan R and S in parallel, merging tuples with matching A values
- Estimated cost of merge scan: NINDX(I_R) + NINDX(I_S), where I_R and I_S are the indexes on R.A and S.A.
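A minimal in-memory sketch of the merge step, assuming both inputs are lists of dictionaries already sorted on the join column. Duplicate join values on both sides are handled by pairing each R tuple with the whole group of matching S tuples. The function name and sample data are illustrative.

```python
def merge_scan_join(R, S, key="A"):
    """Merge two inputs that are already sorted on key."""
    result = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][key] < S[j][key]:
            i += 1
        elif R[i][key] > S[j][key]:
            j += 1
        else:
            value = R[i][key]
            # Find the end of the group of S tuples with this value.
            j_end = j
            while j_end < len(S) and S[j_end][key] == value:
                j_end += 1
            # Pair every R tuple holding this value with the S group.
            while i < len(R) and R[i][key] == value:
                for s in S[j:j_end]:
                    result.append((R[i], s))
                i += 1
            j = j_end
    return result


R = [{"A": 1, "r": "r1"}, {"A": 2, "r": "r2"}, {"A": 2, "r": "r3"}, {"A": 4, "r": "r4"}]
S = [{"A": 2, "s": "s1"}, {"A": 2, "s": "s2"}, {"A": 3, "s": "s3"}, {"A": 4, "s": "s4"}]
pairs = merge_scan_join(R, S)
print([(r["r"], s["s"]) for r, s in pairs])
```

Each input is advanced past a value only once, which is why a sorted (interesting) order on both join columns makes this algorithm attractive.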

N-Way Join
- N-way joins are performed as a sequence of 2-way joins.
- Utilize pipelining whenever appropriate.
- The ordering of the joins is important. Consider all orderings such that:
  - Join predicates relate the two participating tables; do not consider cartesian products. For example, if the join clause is (R.A = S.A and R.B = T.B), it would be a mistake to first compute (S cartesian product T) and then apply R.A = ST.A and R.B = ST.B.
  - Delay computation of cartesian products as much as possible.
  - Consider interesting orders in order to use merge-scan whenever possible.
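The rule about avoiding cartesian products can be illustrated by a brute-force enumeration of join orders for the example (R.A = S.A and R.B = T.B). This is a sketch only: it enumerates all permutations rather than reproducing the optimizer's own search, and the function name is an assumption.

```python
from itertools import permutations


def valid_join_orders(tables, join_predicates):
    """Orders in which each newly added table shares a join predicate
    with at least one table already joined (no cartesian products)."""
    edges = {frozenset(pair) for pair in join_predicates}
    orders = []
    for order in permutations(tables):
        joined = {order[0]}
        ok = True
        for t in order[1:]:
            if not any(frozenset((t, u)) in edges for u in joined):
                ok = False  # adding t here would need a cartesian product
                break
            joined.add(t)
        if ok:
            orders.append(order)
    return orders


# R joins S on A, and R joins T on B; S and T share no join predicate.
orders = valid_join_orders(["R", "S", "T"], [("R", "S"), ("R", "T")])
print(orders)
```

Of the 6 possible orders, the 2 that start by combining S and T are rejected, leaving 4 candidates.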

Search Space
- Rather large search space for expressions joining several tables.
- Heuristics prune the search space.

Nested Queries
- Correlation subquery: a subquery with a reference to a value obtained from a candidate tuple of a higher level query block.

Non-Correlation sub-queries
- Evaluate the inner query once and use its results to process the outer query.
