Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November.

Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 25, 2003 Some slide content courtesy Susan Davidson, Raghu Ramakrishnan & Johannes Gehrke

2 Administrivia  It’s nearly the end!  Homework 7 due next Tuesday 12/2  Projects due next Thurs. 12/4: please sign up to give me a demo that day  Final exam handed out 12/4  Final exam and project report due 12/18  Projects will be graded both on quality of the project and the quality of the report – writing is always important!

Recap of Query Optimization  Plan: Tree of R.A. ops, with choice of alg for each op.  Each operator typically implemented using a `pull’ interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them.  Two main issues:  For a given query, what plans are considered?  Algorithm to search plan space for cheapest (estimated) plan.  How is the cost of a plan estimated?  Ideally: Want to find best plan. Practically: Avoid worst plans!  Our focus is on the approach from “System R”.

Highlights of System R Optimizer  Impact:  Most widely used currently; works well for < 10 joins.  Cost estimation: Approximate art at best.  Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.  Considers combination of CPU and I/O costs.  Plan Space: Too large, must be pruned!  Break the query into blocks  Only the space of left-deep plans will be considered.  Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation.  Cartesian products avoided.

First Need to Divide into Query Blocks  An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.  Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. (This is an over- simplification, but serves for now.) SELECT S.sname FROM Sailors S WHERE S.age IN ( SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating ) Nested blockOuter block  For each block, the plans considered are: – All available access methods, for each reln in FROM clause. – Possible join trees

6 Recall: The Core Idea in System R  For computing the most effective way of joining tables, use dynamic programming, which builds subresults and then uses these to construct successively bigger results  Find cheapest ways of accessing tables  Find cheapest ways of joining every pair of tables  Find cheapest way of joining every pair of tables with a 3 rd table (reusing the cheapest way of getting that pair)  Find cheapest way of joining every triple of tables with a 4 th table (reusing the cheapest way of joining that triple)  …

7 Still Not Quite Enough…  Problem 1: Query blocks also have selection, projection, grouping!  Problem 2: We still have to consider too many alternative plans!  Problem 3: Sorting causes funny things to happen in the dynamic programming approach – it doesn’t easily account for amortization of a sort across multiple queries

8 Heuristic 1: Selections, Projections, Groupings  What do we know is generally the case about selection & projection operations?  ORDER BY, GROUP BY, aggregates etc. handled as a final step

Heuristic: Left-Deep Join Trees  Fundamental decision in System R: only left-deep join trees are considered.  As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.  Left-deep trees allow us to generate all fully pipelined plans.  Intermediate results not written to temporary files.  Not all left-deep trees are fully pipelined (e.g., SM join). B A C D B A C D C D B A

10 “Interesting Orders”  Dynamic programming doesn’t account for amortization of a sort across multiple joins  We need to fix this!  Solution:  Figure out all of the possible orderings that might be useful in the plan (for joining or grouping)  Create a separate “layer” in the DP table for these  At every point in the DP algorithm:  Find the cheapest join that maintains the order  Find the cheapest join that doesn’t maintain the order, using both the ordered and unordered alternatives

Final Details of Query Block Optimization  First, joins and cartesian products are enumerated:  An N-1 way plan is not combined with an additional relation if there is no join condition between them, unless all predicates in WHERE have been used up.  i.e., avoid Cartesian products if possible  Selections and projections are “pushed down”  Final ORDER BY is applied  In spite of pruning the plan space and using heuristics, this approach is still exponential in the # of tables.

Nested Queries  Nested block is optimized independently, with the outer tuple considered as providing a selection condition.  Outer block is optimized with the cost of `calling’ nested block computation taken into account.  Implicit ordering of these blocks means that some good strategies are not considered. The non-nested version of the query is typically optimized better. SELECT S.sname FROM Sailors S WHERE EXISTS ( SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer value Equivalent non-nested query: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103

Query Optimization Recapped  Query optimization is an important task in a relational DBMS  Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries)  Additionally, may need to do “hand optimization”  Two parts to optimizing a query:  Consider a set of alternative plans  Heuristics for simpler operators  Must prune search space; typically, left-deep plans only  Must estimate cost of each plan that is considered  Must estimate size of result and cost for each plan node  Key issues: statistics, indexes, operator implementations  PITFALL: often the estimates of intermediate results AREN’T good!!!

14 The Bigger Picture: Tuning  We saw that indexes and optimization decisions were critical to performance  Homeworks 6 and 7 tried to demonstrate some of that  Also important: buffer pool sizes, layout of data on disk, isolation levels (discussed shortly)  Many DBAs and consultants have made a living off understanding query workloads, data, and estimated intermediate result sizes  They “tune” DBs as a specialty  … Though this career MIGHT be diminishing in significance…

15 Autonomic & Auto-Tuning DBMSs  Hot research topic: self-tuning and adaptive DBMSs  SQL Server and DB2 have “Index Wizards” that take a query workload and try to find an optimal set of indices for it  Basically, they try lots of combinations of indices to find one that works well  “Adaptive query processing” systems also try to figure out where the optimizer’s estimates “went wrong” and compensate for it  Change the query in the middle, or  Make a note so we pick a better plan next time!

16 Switching Gears…  We’ve spent a lot of time talking about querying data  Yet updates are a really major part of many DBMS applications  Particularly important: ensuring ACID properties  Atomicity: each operation looks atomic to the user  Consistency: each operation in isolation keeps the database in a consistent state (this is the responsibility of the user)  Isolation: should be able to understand what’s going on by considering each separate transaction independently  Durability: updates stay in the DBMS!!!

17 What is a transaction?  A transaction is a sequence of read and write operations on data items that logically functions as one unit of work:  should either be done entirely or not at all  if it succeeds, the effects of write operations persist (commit); if it fails, no effects of write operations persist (abort)  these guarantees are made despite concurrent activity in the system, and despite failures that may occur

18 How things can go wrong  Suppose we have a table of bank accounts which contains the balance of the account. A deposit of $50 to a particular account # 1234 would be written as:  Reads and writes the account’s balance  What if two owners of the account make deposits simultaneously? update Accounts set balance = balance + $50 where account#= ‘1234’;

19 Concurrent deposits  This SQL update code is represented as a sequence of read and write operations on “data items” (which for now should be thought of as individual accounts):  Here, X is the data item representing the account with account# 1234. Deposit 1 Deposit 2 read(X.bal) X.bal := X.bal + $50 X.bal:= X.bal + $10 write(X.bal)

20 A “bad” concurrent execution  But only one “action” (e.g. a read or a write) can happen at a time, and there are a variety of ways in which the two deposits could be simultaneously executed: Deposit 1 Deposit 2 read(X.bal) X.bal := X.bal + $50 X.bal:= X.bal + $10 write(X.bal) time BAD!

21 A “good” execution  Previous execution would have been fine if the accounts were different (i.e. one were X and one were Y).  The following execution is a serial execution, and executes one transaction after the other: Deposit 1 Deposit 2 read(X.bal) X.bal := X.bal + $50 write(X.bal) read(X.bal) X.bal:= X.bal + $10 write(X.bal) time GOOD!

22 Good executions  An execution is “good” if is it serial (i.e. the transactions are executed one after the other) or serializable (i.e. equivalent to some serial execution)  This execution is equivalent to executing Deposit 1 then Deposit 3, or vice versa. Deposit 1 Deposit 3 read(X.bal) read(Y.bal) X.bal := X.bal + $50 Y.bal:= Y.bal + $10 write(X.bal) write(Y.bal)

23 Atomicity  Problems can also occur if a crash occurs in the middle of executing a transaction:  Need to guarantee that the write to X does not persist (ABORT)  Default assumption if a transaction doesn’t commit Transfer read(X.bal) read(Y.bal) X.bal= X.bal-$100 Y.bal= Y.bal+$100 CRASH

24 Transactions in SQL  A transaction begins when any SQL statement that queries the db begins.  To end a transaction, the user issues a COMMIT or ROLLBACK statement. Transfer UPDATE Accounts SET balance = balance - $100 WHERE account#= ‘1234’; UPDATE Accounts SET balance = balance + $100 WHERE account#= ‘5678’; COMMIT;

25 Read-only transactions  When a transaction only reads information, we have more freedom to let the transaction execute in parallel with other transactions.  We signal this to the system by stating SET TRANSACTION READ ONLY; SELECT * FROM Accounts WHERE account#=‘1234’;...

26 Read-write transactions  If we state “read-only”, then the transaction cannot perform any updates.  Instead, we must specify that the transaction may update (the default): SET TRANSACTION READ ONLY; UPDATE Accounts SET balance = balance - $100 WHERE account#= ‘1234’;... SET TRANSACTION READ WRITE; update Accounts set balance = balance - $100 where account#= ‘1234’;... ILLEGAL!

27 Dirty reads  Dirty data is data written by an uncommitted transaction; a dirty read is a read of dirty data.  Sometimes dirty reads are acceptable, other times they are not: e.g., if we wished to ensure balances never went negative in the transfer example, we should test that there is enough money first!

28 “Bad” dirty read EXEC SQL select balance into :bal from Accounts where account#=‘1234’; if (bal > 100) { EXEC SQL update Accounts set balance = balance - $100 where account#= ‘1234’; EXEC SQL update Accounts set balance = balance + $100 where account#= ‘5678’;} EXEC SQL COMMIT; If the initial read (italics) were dirty, the balance could become negative!

29 Acceptable dirty read  However, if we are just checking availability of an airline seat, a dirty read might be fine! Reservation transaction: EXEC SQL select occupied into :occ from Flights where Num= ‘123’ and date=11-03-99 and seat=‘23f’; if (!occ) {EXEC SQL update Flights set occupied=true where Num= ‘123’ and date=11-03-99 and seat=‘23f’;} else {notify user that seat is unavailable}

30 Other phenomena  Unrepeatable read: a transaction reads the same data item twice and gets different values.  Phantom problem: a transaction retrieves a collection of tuples twice and sees different results

31 Phantom Problem Example  T1: “find the oldest climber who is either MED or EXP”  T2: “insert a new EXP climber aged 96, then insert a new MED climber aged 60” Suppose that T1 locks all data pages with some EXP climber and finds that the oldest is 85. Then T2 executes, inserting the new EXP climber on a page not locked by T1. T1 then completes, locking all pages with some MED climber and finding the oldest MED climber is 60 (whereas the previous oldest MED climber had been 40).

Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November.

Similar presentations

Presentation on theme: "Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November.

Similar presentations

Presentation on theme: "Optimization, Auto-Tuning, and Introduction to Transactions Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November."— Presentation transcript:

Similar presentations

About project

Feedback