1 CS 345: Topics in Data Warehousing Thursday, November 4, 2004

2 Review of Tuesday's Class
Pre-computed aggregates
– Materialized views
– Aggregate navigation
– Dimension and fact aggregates
Selection of aggregates
– Manual selection
– Greedy algorithm
– Limitations of the greedy approach

3 Outline of Today's Class
Index Selection
Selecting Views and Indexes Together
Storage Systems
– Mirroring, Striping, and Parity
– RAID Levels

4 Index Selection Problem
Similar problem to selecting aggregate tables
– Select column sets to include / exclude
Additional degrees of freedom
– What type of index (B-tree, hash, bitmap, join index)
– Ordering of columns in the index search key
– Clustered vs. non-clustered
Additional restrictions
– Columns chosen from a single table (except for the special case of a join index)
Interaction between indexes can be important
– Less of an issue with aggregate tables
– Examples: index intersection, index-based merge join without sorting

5 Heuristics for Manual Selection
Always include single-column indexes on:
– dimension primary keys
– fact foreign keys
Mixture of wide and thin indexes
– Build multi-column indexes on fact & dimension tables
  Covering indexes allow index-only plans
  Coverage vs. speed-up trade-off:
    More columns → useful for a greater variety of queries
    Fewer columns → smaller index → greater speed-up
– Build single-column indexes on important dimension columns
  Particularly on attributes with high filtering power (Product Name, Brand, etc.)
Bitmap indexes for low- and medium-cardinality columns
B-tree indexes for high-cardinality columns
Fact tables are often clustered on Date
– Most queries reference the Date dimension
– Little or no reorganization is necessary as data is appended

6 Automatic Index Selection
AutoAdmin project
– Research project at Microsoft
– Developed tools for index & materialized view selection
– Similar tools are now available from all major vendors
Papers we'll cover
– "An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server" by Chaudhuri and Narasayya, 1997
– "Automated Selection of Materialized Views and Indexes for SQL Databases" by Agrawal, Chaudhuri, and Narasayya, 2000

7 Guiding Principles
Workload-driven approach
– Which indexes are good depends on which queries are asked
Incorporate the query optimizer
– Indexes are only useful if the optimizer chooses to use them
– The optimizer's cost estimation model is well-developed and accurate
Limit the search space heuristically
– Indexes that are good in combination are also good by themselves
– The leading term of a good multi-column index is a good single-column index
– Indexes that are good for an entire workload are the best possible choice for some query in the workload
– Heuristics speed up the selection process considerably, at the cost of missing some good index combinations

8 Index Selection Architecture
[Architecture diagram: the Index Recommender takes a Workload as input and produces the Final Indexes. Its stages are Identify Candidate Indexes, Enumerate Configurations, and Generate Multi-Column Indexes. It relies on Simulated Index Creation and Cost Estimation services provided by the Database Management System (Query Optimizer).]

9 "What-If" Index Analysis
The query optimizer estimates the cost of a query plan based on statistics
– Sizes of relations and indexes
– Number of distinct values
– Frequency of occurrence of each value
Generating statistics for an index is cheaper than actually building the index
– Statistics can be estimated from a sample of the data
Simulated / "what-if" index analysis (see the sketch below)
– Ask the optimizer to optimize a query
– Record the cost estimate for the best query plan
– Update statistics to trick the optimizer into thinking that an extra index exists
– Ask the optimizer to optimize the query again
– Record the new cost estimate for the best query plan
– Compare before/after estimates to quantify the impact of the index
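
A minimal sketch of the what-if loop above. The `optimizer` object and its methods (`estimate_cost`, `add_hypothetical_index`, `drop_hypothetical_index`) are hypothetical names for illustration, not a real DBMS API; the actual tool drives this through the query optimizer's own interfaces.

```python
def what_if_benefit(optimizer, query, index):
    """Estimate how much a not-yet-built index would speed up one query.

    `optimizer` is an assumed wrapper around the DBMS query optimizer that
    exposes cost estimation and simulated ("hypothetical") index statistics.
    """
    baseline = optimizer.estimate_cost(query)        # cost of the best plan today

    optimizer.add_hypothetical_index(index)          # inject fake statistics only
    with_index = optimizer.estimate_cost(query)      # re-optimize as if the index existed
    optimizer.drop_hypothetical_index(index)         # no physical index is ever built

    return baseline - with_index                     # > 0 means the index would help
```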

10 Estimating Workload Cost
Configuration = set of indexes
Atomic configuration = set of indexes that are all used together to answer some query
Many possible configurations, far fewer atomic configurations
– Most query plans use only a small number of indexes
– Example: 50 possible indexes, choose the best 10; no query uses more than 3 indexes
  # of configurations ≈ 10 billion
  # of atomic configurations = 20,876
Only need to consider atomic configurations when estimating costs (see the sketch below)
– Cost(Q,I) = cost of query Q with index set I
– Let A ⊆ I be an atomic configuration contained in I
– Cost(Q,I) = min[ Cost(Q,A) ]
– Minimum taken over all atomic configurations A contained in I
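
A sketch of the cost rule above, assuming a cache `atomic_cost` of what-if estimates keyed by (query, atomic configuration); the cache format and helper names are assumptions for illustration.

```python
def query_cost(query, config, atomic_cost):
    """Cost of `query` under index set `config`, following the slide:
    Cost(Q, I) = min over atomic configurations A contained in I of Cost(Q, A).

    `atomic_cost` is an assumed cache mapping (query, frozenset_of_indexes) to a
    what-if optimizer estimate; it is expected to include the empty configuration
    (the query's cost with no extra indexes), so the min is always defined.
    """
    config = frozenset(config)
    return min(cost
               for (q, atomic), cost in atomic_cost.items()
               if q == query and atomic <= config)

def workload_cost(workload, config, atomic_cost):
    """Total estimated cost of the workload under one candidate configuration."""
    return sum(query_cost(q, config, atomic_cost) for q in workload)
```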

11 Identifying Atomic Configurations
Query syntax can be used
– Leading term of index = column mentioned in a WHERE, GROUP BY, or ORDER BY clause
– Trailing term of index = column mentioned anywhere in the query
Heuristics for reducing the number of atomic configurations
– The number of atomic configurations can be large for complex queries
– Too many atomic configurations → index selection is very slow
– Trade off index selection time vs. quality of recommendations
– Single-join heuristic: only consider atomic configurations that involve ≤ 2 tables and ≤ 2 indexes per table
– Adaptively identify index interactions
  Compare "cost of query Q with indexes I" vs. "cost of query Q with best subset of I"
  If the two costs are equal or close, then I is not an atomic configuration

12 Identify Candidate Indexes
For each query in the workload, determine the best atomic configuration
– Enumerate relevant atomic configurations for each query based on query syntax
– Simulate each configuration by modifying statistics
– Calculate the estimated execution cost using the query optimizer
Candidate index set = union of the best atomic configurations over all queries in the workload (see the sketch below)
Some indexes from the optimal index set may be omitted
– Suppose index I is the second-best index for 10 queries but the best for no query
– Index I is likely to be part of the optimal configuration
– However, index I will not be in the candidate set
– This choice of candidate set is a time-saving heuristic
– Considering all reasonable indexes would be too expensive
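
A sketch of candidate identification as described above. `atomic_configs_for(query)` and `config_cost(query, config)` are assumed helpers (syntax-based enumeration and what-if cost estimation), not part of any real API.

```python
def candidate_indexes(workload, atomic_configs_for, config_cost):
    """Candidate set = union of the best atomic configuration of each query."""
    candidates = set()
    for query in workload:
        best = min(atomic_configs_for(query),
                   key=lambda config: config_cost(query, config))
        candidates |= set(best)          # keep every index in the best configuration
    return candidates
```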

13 Enumerate Configurations
Among all candidate indexes, which k indexes should we build?
One approach: greedy algorithm
– Similar to the one discussed last class
– Add indexes one at a time
– Always choose the index that decreases workload cost by the greatest amount
The greedy approach fails to capture index interactions
– An index may be useless by itself but useful in conjunction with a second index
– Such combinations will be missed by greedy selection
Greedy(m,k) algorithm (see the sketch below)
– Exhaustively consider all configurations of ≤ m indexes
– Select the best such configuration
– Greedily add (k-m) additional indexes
The choice of m trades off search time vs. result quality
– Greedy(0,k) = pure greedy approach (fast)
– Greedy(k,k) = exhaustive search (accurate)
– Other values of m are in between (m=2 seems good in practice)
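
A sketch of Greedy(m,k) as described on the slide, not the AutoAdmin implementation. `workload_cost(config)` is an assumed helper that returns the estimated total workload cost for a set of indexes (for example, the one sketched after slide 10).

```python
from itertools import combinations

def greedy_m_k(candidates, m, k, workload_cost):
    """Pick up to k indexes: exhaustive seed of size <= m, then greedy additions."""
    # Seed phase: exhaustively try every configuration of at most m candidate indexes.
    seeds = [set(c) for size in range(m + 1) for c in combinations(candidates, size)]
    config = min(seeds, key=workload_cost)

    # Greedy phase: add remaining indexes one at a time, best improvement first.
    while len(config) < k:
        remaining = [idx for idx in candidates if idx not in config]
        if not remaining:
            break
        best = min(remaining, key=lambda idx: workload_cost(config | {idx}))
        if workload_cost(config | {best}) >= workload_cost(config):
            break                        # no remaining index reduces the cost
        config.add(best)
    return config
```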

14 Generate Multi-Column Indexes
Another heuristic to reduce optimization time: initially consider only narrow indexes, and iteratively widen them (see the sketch below)
First iteration:
– When building atomic configurations, consider only single-column indexes
Second iteration:
– Include the best indexes chosen in iteration 1
– Also consider two-column "expansions" of the single-column indexes chosen in iteration 1
Third iteration:
– Include the best indexes from iteration 2
– Also consider three-column "expansions" of the two-column indexes chosen in iteration 2
Generalizes to as many iterations as desired
– Cache the results of optimizer evaluations
– Only costs for new atomic configurations need to be computed in each iteration
Experimental results indicate that little loss in quality occurs compared to the non-iterative approach
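
A sketch of one widening step. Representing an index as (table, ordered tuple of columns) and the `table_columns` map are assumptions made for illustration.

```python
def widen_indexes(best_indexes, table_columns):
    """Expand each selected index by one column from the same table, so the
    k-column winners of this iteration become (k+1)-column candidates for the next."""
    expansions = set()
    for table, key in best_indexes:                 # key = ordered tuple of columns
        for col in table_columns[table]:
            if col not in key:
                expansions.add((table, key + (col,)))
    return expansions
```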

15 Selecting Indexes and Views
Indexes and aggregate tables each serve to speed up queries
There are interactions between them
– Indexes can be built on aggregate tables
– Constructing an aggregate table can decrease the usefulness of a related index (or vice versa)
Selecting them together can deliver better results than selecting them independently
How can the two be combined?

16 Candidate Identification for Views
Materialized views considered by AutoAdmin
– Join of several tables
– With or without aggregation
– Optionally including filters
– (More general than the aggregate tables we've discussed)
Restricting the space of views considered (see the sketch below)
– First identify "interesting table-subsets"
– Idea: materialized views over large tables are the most useful
– A table-subset is a set of tables
– Table-subsets that are referenced by < C% of the workload (weighted by cost) are not interesting
– TS-Cost(T) = Sum[ Cost(Q) * (size of tables in T) / (size of tables in Q) ]
  Sum taken over all queries Q that reference every table in table-subset T
– Table-subsets with TS-Cost < C% of total workload cost are not interesting
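
A sketch of the TS-Cost filter above. The query representation (records with `.cost`, `.tables`, and candidate `.subsets`) and the `table_size` map are assumptions for illustration.

```python
def interesting_table_subsets(workload, table_size, threshold_pct):
    """Keep table-subsets whose TS-Cost is at least `threshold_pct`% of total workload cost."""
    total_cost = sum(q.cost for q in workload)
    ts_cost = {}
    for q in workload:
        q_size = sum(table_size[t] for t in q.tables)
        for subset in q.subsets:                      # frozensets of tables referenced by q
            weight = sum(table_size[t] for t in subset) / q_size
            ts_cost[subset] = ts_cost.get(subset, 0.0) + q.cost * weight
    cutoff = (threshold_pct / 100.0) * total_cost
    return {s for s, c in ts_cost.items() if c >= cutoff}
```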

17 Candidate Identification
For each query in the workload, determine the best atomic configuration
Atomic configuration made up of:
– Indexes
– Materialized views over interesting table-subsets
– Indexes on materialized views over interesting table-subsets
Candidate set = union of the best atomic configurations across all queries

18 View Merging
View merging is like multi-column index generation
Combine two views to create a more generic view
– Move up the data cube lattice
Merge(V1,V2) (see the sketch below)
– Group by the union of V1's and V2's grouping columns
– Filter by the intersection of V1's and V2's filters
– Filters that are in one of V1, V2 but not the other become grouping columns
Example:
– V1: SELECT Income, SUM(Quantity)
      FROM Sales, Customer
      WHERE Sales.Customer_key = Customer.Customer_key
        AND Customer.State = 'CA'
      GROUP BY Income
– V2: SELECT Age, SUM(Quantity)
      FROM Sales, Customer
      WHERE Sales.Customer_key = Customer.Customer_key
      GROUP BY Age
– Merged view:
      SELECT Income, Age, State, SUM(Quantity)
      FROM Sales, Customer
      WHERE Sales.Customer_key = Customer.Customer_key
      GROUP BY Income, Age, State
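
A sketch of the merge rule above on a simplified view model; the `View` structure (grouping columns plus equality filters) is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    """Simplified materialized-view description used only in this sketch."""
    group_by: frozenset        # grouping columns
    filters: frozenset         # equality filters, e.g. {("State", "CA")}

def merge(v1: View, v2: View) -> View:
    """Group by the union of grouping columns, keep only filters common to both,
    and turn filters present in just one view into grouping columns."""
    shared_filters = v1.filters & v2.filters
    dropped_filters = (v1.filters | v2.filters) - shared_filters
    return View(
        group_by=v1.group_by | v2.group_by | frozenset(col for col, _ in dropped_filters),
        filters=shared_filters,
    )

# Mirrors the slide: the State='CA' filter appears only in V1,
# so State becomes a grouping column of the merged view.
v1 = View(frozenset({"Income"}), frozenset({("State", "CA")}))
v2 = View(frozenset({"Age"}), frozenset())
print(merge(v1, v2))   # group_by = {Income, Age, State}, filters = empty
```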

19 Storage
Data analysis queries touch lots of data
Data warehouses are often very large
Reading the data from disk is usually the bottleneck
What can be done to improve performance?
– Add more disks and benefit from parallelism

20 RAID
Redundant Arrays of Inexpensive Disks
– Using lots of disk drives improves performance and reliability
– Alternative: specialized, high-performance hardware
– RAID delivers better price/performance than high-end disks
Performance
– Read data from n disks at once → reads are n times faster
Reliability
– Store multiple copies of the data
– If one disk fails, no data is lost and the system continues to run
Three main concepts
– Mirroring
– Striping
– Parity

21 Mirroring
Use two disks that are identical copies of each other
– Primary goal: fault-tolerance
  If one disk fails, use the other one
– Writes must be done to both disks at once
– Improved random read performance
  Can do two random reads at one time
– Sequential read performance mostly unaffected
[Diagram: two mirrored disks, each holding an identical copy of the same data]

22 Striping
Spread data across n disks (see the sketch below)
– First disk gets blocks 1, n+1, 2n+1, etc.
– Second disk gets blocks 2, n+2, 2n+2, etc.
Improved random read performance
– Can do as many as n reads at the same time
– But each read must go to a specific disk
– Thus multiple reads can conflict if unlucky
Sequential reads are very fast
– Especially for long reads (many blocks from each disk)
– Read in parallel from all disks
Each write goes to a single disk
[Diagram: the sentence "This Is How You Would Stripe A Sentence Across Three Disk Drives!" striped word-by-word across three disks]
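
A sketch of the round-robin block placement described above, using the slide's diagram sentence as sample data.

```python
def stripe_layout(num_blocks, num_disks):
    """Map logical block numbers to (disk, slot) pairs, round-robin as on the slide:
    disk 1 gets blocks 1, n+1, 2n+1, ...; disk 2 gets blocks 2, n+2, 2n+2, ...
    (Blocks and disks are numbered from 1 here to match the slide.)"""
    layout = {}
    for block in range(1, num_blocks + 1):
        disk = (block - 1) % num_disks + 1          # which disk holds this block
        slot = (block - 1) // num_disks             # position of the block on that disk
        layout[block] = (disk, slot)
    return layout

# Striping the slide's sentence word-by-word across three "disks".
words = "This Is How You Would Stripe A Sentence Across Three Disk Drives!".split()
placement = stripe_layout(len(words), 3)
for i, word in enumerate(words, start=1):
    disk, slot = placement[i]
    print(f"disk {disk}, slot {slot}: {word}")
```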

23 Parity
Mirroring delivers fault-tolerance through redundancy, but storage utilization is rather poor
– Only 50% of the disk capacity is useful
– The other 50% is overhead for fault tolerance
Parity checks deliver fault-tolerance with less redundancy
– Use n+1 disks
– Store data on n of the disks
– The last disk contains parity data: the XOR of the other n disks
  Compare the ith bit on each disk
  Even number of 1s → ith parity bit is 0
  Odd number of 1s → ith parity bit is 1
– Any one disk fails → no data is lost

24 Parity Example
Three data servers + 1 parity server
– Server 1 stores "110011"
– Server 2 stores "011011"
– Server 3 stores "110101"
– Server P stores "011101"
  Number of 1s per bit position = 2, 3, 1, 1, 2, 3 → Even, Odd, Odd, Odd, Even, Odd
Suppose Server 2 fails (see the sketch below)
– Remaining: "110011", "??????", "110101", "011101"
– Take the XOR of the remaining servers to reconstruct
  Number of 1s per bit position = 2, 3, 1, 2, 1, 3 → Even, Odd, Odd, Even, Odd, Odd
  Reconstructed data: "011011"
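
A sketch reproducing the example above with bitwise XOR; the bit strings are the slide's own.

```python
from functools import reduce

def parity(blocks):
    """XOR parity of equal-length bit strings, as in RAID 3/4/5."""
    ints = [int(b, 2) for b in blocks]
    return format(reduce(lambda x, y: x ^ y, ints), f"0{len(blocks[0])}b")

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the failed disk's data by XOR-ing the survivors with the parity block."""
    return parity(surviving_blocks + [parity_block])

data = ["110011", "011011", "110101"]        # Servers 1-3 from the slide
p = parity(data)
print(p)                                     # '011101' -- matches Server P
print(reconstruct(["110011", "110101"], p))  # '011011' -- Server 2 recovered
```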

25 RAID Levels
RAID 0
– Striping (without parity)
– Pros: good performance; no redundancy (no wasted capacity)
– Cons: poor fault-tolerance (worse than no RAID!)
RAID 1
– Mirroring
– Pros: good fault-tolerance; very fast recovery
– Cons: wastes storage capacity; performance not as good as other RAID levels

26 RAID Levels
RAID 2: not used in practice
RAID 3 and 4
– Striping with a dedicated parity disk
– Stripe size = byte for RAID 3, block for RAID 4
– Pros: good performance; good fault-tolerance with little redundancy; reasonably fast recovery
– Cons: the parity disk is a bottleneck for writes
[Diagram: the words of "This Is How You Would Store Data Using RAID Level Four." striped across three data disks, with a fourth dedicated disk holding the parity blocks (Parity 1)–(Parity 4)]

27 RAID Levels
RAID 5
– Striping with distributed parity
– Servers "take turns" being the parity server
– Pros and cons similar to RAID 3 and 4
  Avoids the write bottleneck associated with RAID 3 and 4
  Performance degrades following a disk failure
[Diagram: the words of "This Is How You Would Store Data Using RAID Level Five." striped across four disks, with the parity blocks (Parity 1)–(Parity 4) rotated across the disks rather than kept on one dedicated disk]

28 Multi-Level RAID
The RAID ideas can be combined hierarchically
The most common combinations are:
– RAID 1+0 – stripe of mirrors
– RAID 0+1 – mirror of stripes
[Diagram: the same data laid out as RAID 1+0 (a stripe across mirrored pairs) and as RAID 0+1 (two mirrored copies of a stripe)]

29 RAID 1+0 vs. RAID 0+1
The difference is what happens when a disk fails
– RAID 1+0
  One mirrored pair becomes unmirrored
  Failure of the other disk in that pair leads to data loss
– RAID 0+1
  One mirror (one copy of the stripe) becomes invalid
  Failure of any disk in the other stripe leads to data loss
[Diagram: the same RAID 1+0 and RAID 0+1 layouts as on the previous slide]

