CS 345: Topics in Data Warehousing Thursday, November 4, 2004.


Review of Tuesday’s Class
Pre-computed aggregates
–Materialized views
–Aggregate navigation
–Dimension and fact aggregates
Selection of aggregates
–Manual selection
–Greedy algorithm
–Limitations of greedy approach

Outline of Today’s Class
Index Selection
Selecting Views and Indexes Together
Storage Systems
–Mirroring, Striping, and Parity
–RAID Levels

Index Selection Problem
Similar problem to selecting aggregate tables
–Select column sets to include / exclude
Additional degrees of freedom
–What type of index (B-tree, hash, bitmap, join index)
–Ordering of columns in index search key
–Clustered vs. non-clustered
Additional restrictions
–Columns chosen from a single table (except for the special case of a join index)
Interaction between indexes can be important
–Less of an issue with aggregate tables
–Examples: index intersection; index-based merge join without sorting

Heuristics for Manual Selection
Always include single-column indexes on:
–dimension primary keys
–fact foreign keys
Mixture of wide and thin indexes
–Build multi-column indexes on fact & dimension tables
  –Covering indexes allow index-only plans
  –Coverage vs. speed-up trade-off: more columns → useful for a greater variety of queries; fewer columns → smaller index → greater speed-up
–Build single-column indexes on important dimension columns
  –Particularly on attributes with high filtering power (Product Name, Brand, etc.)
Bitmap indexes for low- and medium-cardinality columns
B-tree indexes for high-cardinality columns
Fact tables often clustered on Date
–Most queries reference Date dimension
–Little or no reorganization necessary as data appended

Automatic Index Selection
AutoAdmin project
–Research project at Microsoft
–Developed tools for index & materialized view selection
–Similar tools now available from all major vendors
Papers we’ll cover
–“An Efficient, Cost-Driven Index Selection Tool for Microsoft SQL Server” by Chaudhuri and Narasayya, 1997
–“Automated Selection of Materialized Views and Indexes for SQL Databases” by Agrawal, Chaudhuri, and Narasayya, 2000

Guiding Principles
Workload-driven approach
–Which indexes are good depends on which queries are asked
Incorporate the query optimizer
–Indexes are only useful if the optimizer chooses to use them
–Optimizer’s cost estimation model is well-developed, accurate
Limit search space heuristically
–Indexes that are good in combination are also good by themselves
–Leading term of good multi-column index is a good single-column index
–Indexes that are good for an entire workload are the best possible choice for some query in the workload
–Heuristics speed up the selection process considerably, at the cost of missing some good index combinations

Index Selection Architecture
[Figure: the Index Recommender takes the Workload as input and produces the Final Indexes. Its stages are Identify Candidate Indexes, Enumerate Configurations, and Generate Multi-Column Indexes. The Index Recommender calls the Database Management System (Query Optimizer) for Simulated Index Creation and Cost Estimation.]

“What-If” Index Analysis
Query optimizer estimates cost of query plan based on statistics
–Sizes of relations and indexes
–Number of distinct values
–Frequency of occurrence of each value
Generating statistics for an index is cheaper than actually building the index
–Statistics can be estimated from a sample of the data
Simulated / “what-if” index analysis
–Ask the optimizer to optimize a query
–Record cost estimate for best query plan
–Update statistics to trick optimizer into thinking that an extra index exists
–Ask the optimizer to optimize the query again
–Record new cost estimate for best query plan
–Compare before/after estimates to quantify impact of index
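The before/after comparison can be sketched with a toy cost model. Everything here is an assumption for illustration: `estimate_cost` is a stand-in for a real optimizer, the table size and selectivity are invented, and a "hypothetical index" is just a `(table, column)` pair that is never built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    table: str
    filter_column: str
    selectivity: float  # fraction of rows that match the filter


# Toy catalog statistics (hypothetical numbers).
TABLE_ROWS = {"sales": 1_000_000}


def estimate_cost(query, indexes):
    """Toy stand-in for the optimizer: rows touched by the cheapest plan."""
    full_scan = TABLE_ROWS[query.table]
    if (query.table, query.filter_column) in indexes:
        index_seek = full_scan * query.selectivity
        return min(full_scan, index_seek)
    return full_scan


def what_if(query, existing, hypothetical):
    """Return (cost before, cost after, saving) for one simulated index."""
    before = estimate_cost(query, existing)
    after = estimate_cost(query, existing | {hypothetical})
    return before, after, before - after


q = Query("sales", "date_key", 0.01)
before, after, saving = what_if(q, frozenset(), ("sales", "date_key"))
print(before, after, saving)
```

Only metadata changes between the two calls, which is the point of the technique: the index is costed without being created.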

Estimating Workload Cost
Configuration = set of indexes
Atomic configuration = set of indexes that are all used together to answer some query
Many possible configurations, fewer atomic configurations
–Most query plans use only a small number of indexes
–Example: 50 possible indexes, choose best 10; no query uses more than 3 indexes
  –# of configurations = C(50,10) ≈ 10 billion
  –# of atomic configurations ≤ C(50,1) + C(50,2) + C(50,3) = 20,875
Only need to consider atomic configurations when estimating costs
–Cost(Q,I) = cost of query Q with index set I
–Let A ⊆ I be an atomic configuration contained in I
–Cost(Q,I) = min[ Cost(Q,A) ]
–Minimum taken over all atomic configurations A contained in I
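The counts in the example and the min-over-atomic-configurations rule can be checked with a short sketch. The dictionary `atomic_cost` and the index names `i1`, `i2`, `i3` are hypothetical; in a real tool each entry would come from one optimizer call.

```python
from itertools import combinations
from math import comb

# Counting for the slide's example: 50 candidate indexes, pick 10,
# and no query plan uses more than 3 indexes.
n_configurations = comb(50, 10)
n_atomic = sum(comb(50, r) for r in range(1, 4))
print(n_configurations, n_atomic)


def query_cost(atomic_cost, config, max_size=3):
    """Cost(Q, I): minimum cached cost over atomic configurations A within I.

    atomic_cost maps frozenset-of-indexes -> estimated cost of Q; the empty
    frozenset holds the cost of Q with no indexes at all.
    """
    best = atomic_cost[frozenset()]
    for r in range(1, max_size + 1):
        for subset in combinations(sorted(config), r):
            cost = atomic_cost.get(frozenset(subset))
            if cost is not None:
                best = min(best, cost)
    return best


atomic_cost = {
    frozenset(): 100.0,
    frozenset({"i1"}): 40.0,
    frozenset({"i2"}): 60.0,
    frozenset({"i1", "i2"}): 25.0,
}
print(query_cost(atomic_cost, {"i1", "i3"}))
print(query_cost(atomic_cost, {"i1", "i2", "i3"}))
```

Costs for large configurations are derived from the small cached table, so the optimizer is never called on a ten-index configuration directly.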

Identifying Atomic Configurations
Query syntax can be used
–Leading term of index = column mentioned in WHERE, GROUP BY, or ORDER BY clause
–Trailing term of index = column mentioned anywhere in query
Heuristics for reducing number of atomic configurations
–Number of atomic configs. can be large for complex queries
–Too many atomic configurations → index selection is very slow
–Trade off index selection time vs. quality of recommendations
–Single-join heuristic: only consider atomic configurations which involve ≤ 2 tables and ≤ 2 indexes per table
–Adaptively identify index interactions
  –Compare “cost of query Q with indexes I” vs. “cost of query Q with best subset of I”
  –If the two costs are equal or close, then I is not an atomic configuration

Identify Candidate Indexes
For each query in the workload, determine the best atomic configuration
–Enumerate relevant atomic configurations for each query based on query syntax
–Simulate each configuration by modifying statistics
–Calculate estimated execution cost using query optimizer
Candidate index set = union of best atomic configuration for each query in workload
Some indexes from optimal index set may be omitted
–Suppose index I is second-best index for 10 queries but best for no query
–Index I is likely to be part of the optimal configuration
–However, index I will not be in the candidate set
–This choice of candidate set is a time-saving heuristic
–Considering all reasonable indexes would be too expensive
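A minimal sketch of the candidate-set rule, assuming the per-query cost estimates have already been collected. The query labels, index names, and costs are invented to reproduce the "second-best everywhere" case described above.

```python
def candidate_indexes(costs_per_query):
    """Union of the cheapest atomic configuration of every query."""
    candidates = set()
    for costs in costs_per_query.values():
        best_config = min(costs, key=costs.get)
        candidates |= best_config
    return candidates


# Hypothetical estimates: idx_store is a close second for every query
# but the winner for none of them.
costs_per_query = {
    "Q1": {frozenset({"idx_date"}): 10, frozenset({"idx_store"}): 12},
    "Q2": {frozenset({"idx_product"}): 8, frozenset({"idx_store"}): 9},
    "Q3": {frozenset({"idx_date", "idx_product"}): 5, frozenset({"idx_store"}): 6},
}
candidates = candidate_indexes(costs_per_query)
print(sorted(candidates))
```

Here `idx_store` never enters the candidate set even though a single index on it would serve all three queries reasonably well, which is the omission the heuristic accepts in exchange for speed.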

Enumerate Configurations
Among all candidate indexes, which k indexes should we build?
One approach: Greedy algorithm
–Similar to the one discussed last class
–Add indexes one at a time
–Always choose the index that will decrease workload cost by the greatest amount
Greedy approach fails to capture index interactions
–An index may be useless by itself but useful in conjunction with a second index
–Such combinations will be missed by greedy selection
Greedy(m,k) algorithm
–Exhaustively consider all configurations of ≤ m indexes
–Select the best such configuration
–Greedily add (k-m) additional indexes
Choice of m trades off search time vs. result quality
–Greedy(0,k) = pure greedy approach (fast)
–Greedy(k,k) = exhaustive search (accurate)
–Other values of m are in between [m=2 seems good in practice]
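Greedy(m,k) can be sketched as follows, assuming m ≤ k and a caller-supplied `workload_cost` function. The toy cost function and index names `A`, `B`, `C` are invented: `A` and `B` only pay off together, which is exactly the interaction pure greedy misses. Stopping early when no index reduces the cost is a choice made for this sketch.

```python
from itertools import combinations


def greedy_m_k(candidates, workload_cost, m, k):
    """Exhaustive search over configurations of at most m indexes, then greedy up to k."""
    candidates = sorted(candidates)
    seeds = [frozenset(c) for r in range(m + 1) for c in combinations(candidates, r)]
    config = min(seeds, key=workload_cost)
    while len(config) < k:
        remaining = [i for i in candidates if i not in config]
        if not remaining:
            break
        best = min(remaining, key=lambda i: workload_cost(config | {i}))
        if workload_cost(config | {best}) >= workload_cost(config):
            break  # no remaining index helps
        config = config | {best}
    return config


def toy_cost(config):
    """Hypothetical workload cost with one strong index interaction."""
    cost = 100
    if "C" in config:
        cost -= 10
    if {"A", "B"} <= config:
        cost -= 50
    return cost


print(sorted(greedy_m_k({"A", "B", "C"}, toy_cost, m=0, k=2)))
print(sorted(greedy_m_k({"A", "B", "C"}, toy_cost, m=2, k=2)))
```

With m=0 the search settles for `C` alone, because neither `A` nor `B` looks useful in isolation; with m=2 the exhaustive seed phase finds the `A`,`B` pair.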

Generate Multi-Column Indexes
Another heuristic to reduce optimization time
Initially consider only narrow indexes, and iteratively widen them
First iteration:
–When building atomic configurations, consider only single-column indexes
Second iteration:
–Include the best indexes chosen in Iteration 1
–Also consider two-column “expansions” of the single-column indexes chosen in Iteration 1
Third iteration:
–Include the best indexes from Iteration 2
–Also consider three-column “expansions” of the two-column indexes chosen in Iteration 2
Generalizes to as many iterations as desired
–Cache results of optimizer evaluations
–Only costs for new atomic configs. need be computed in each iteration
Experimental results indicate that little loss in quality occurs
–As compared to the non-iterative solution
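One widening step might look like the sketch below. It assumes an index is represented as a tuple of column names from a single table and that expansions append one trailing column; the column names are hypothetical.

```python
def widen(chosen, table_columns):
    """Return every one-column expansion of the indexes chosen so far."""
    widened = set()
    for index in chosen:
        for column in table_columns:
            if column not in index:
                widened.add(index + (column,))
    return widened


columns = ["date_key", "store_key", "product_key"]
iteration_1 = {("date_key",)}  # best single-column index (assumed)
iteration_2_pool = iteration_1 | widen(iteration_1, columns)
print(sorted(iteration_2_pool))
```

The pool for the next iteration keeps the previous winners and adds only their expansions, so the number of new configurations to cost stays small.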

Selecting Indexes and Views
Indexes and aggregate tables each serve to speed up queries
There are interactions between them
–Indexes can be built on aggregate tables
–Constructing an aggregate table can decrease the usefulness of a related index (or vice versa)
Selecting them together can deliver better results than selecting them independently
How to combine the two?

Candidate Identification for Views
Materialized views considered by AutoAdmin
–Join of several tables
–With or without aggregation
–Optionally including filters
–(More general than the aggregate tables we’ve discussed)
Restricting the space of views considered
–First identify “interesting table-subsets”
–Idea: Materialized views over large tables are most useful
–A table-subset is a set of tables
–Table-subsets that are referenced in < C% of queries (weighted by cost) are not interesting
–TS-Cost(T) = Sum [ Cost(Q) * (size of tables in T) / (size of tables in Q) ], summed over all queries Q that reference every table in table-subset T
–Table-subsets with TS-Cost < C% of total cost are not interesting
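The TS-Cost formula can be sketched directly. The table sizes, query costs, and the 10% threshold below are invented for illustration, and "size" is taken to be a row count.

```python
def ts_cost(table_subset, queries, table_size):
    """Sum Cost(Q) * size(T) / size(tables of Q) over queries referencing all of T."""
    subset_size = sum(table_size[t] for t in table_subset)
    total = 0.0
    for q in queries:
        if table_subset <= q["tables"]:
            query_size = sum(table_size[t] for t in q["tables"])
            total += q["cost"] * subset_size / query_size
    return total


table_size = {"sales": 800, "customer": 200, "product": 200}
queries = [
    {"tables": {"sales", "customer"}, "cost": 100},
    {"tables": {"sales", "product"}, "cost": 50},
    {"tables": {"customer"}, "cost": 10},
]
threshold = 0.10 * sum(q["cost"] for q in queries)  # C = 10% (assumed)

for subset in [{"sales"}, {"sales", "customer"}, {"customer"}, {"product"}]:
    cost = ts_cost(subset, queries, table_size)
    print(sorted(subset), cost, cost >= threshold)
```

Weighting by relative table size is what makes `{"product"}` drop out here: it appears in an expensive query but accounts for little of the data that query reads.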

Candidate Identification
For each query in the workload, determine best atomic configuration
Atomic configuration made up of:
–Indexes
–Materialized views over interesting table-subsets
–Indexes on materialized views over interesting table-subsets
Candidate set = union of best atomic configurations across all queries

View Merging
View merging is like multi-column index generation
Combine two views to create a more generic view
–Move up the data cube lattice
Merge(V1,V2)
–Group by union of V1, V2 grouping columns
–Filter by intersection of V1, V2 filters
–Filters that are in one of V1, V2 but not the other become grouping columns
Example:
–V1: SELECT Income, SUM(Quantity) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key AND Customer.State = 'CA' GROUP BY Income
–V2: SELECT Age, SUM(Quantity) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key GROUP BY Age
–Merged view: SELECT Income, Age, State, SUM(Quantity) FROM Sales, Customer WHERE Sales.Customer_key = Customer.Customer_key GROUP BY Income, Age, State
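The three Merge(V1,V2) rules can be sketched on a simplified view representation. The `View` class is an assumption for illustration: it records only grouping columns and equality filters as `(column, value)` pairs, and takes the join condition as identical in both views.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    group_by: frozenset
    filters: frozenset  # (column, value) equality predicates


def merge(v1, v2):
    """Merge two views over the same join into one more general view."""
    shared = v1.filters & v2.filters   # filters both views apply
    dropped = v1.filters ^ v2.filters  # filters only one view applies
    group_by = v1.group_by | v2.group_by | {column for column, _ in dropped}
    return View(frozenset(group_by), shared)


v1 = View(frozenset({"Income"}), frozenset({("State", "CA")}))
v2 = View(frozenset({"Age"}), frozenset())
merged = merge(v1, v2)
print(sorted(merged.group_by), sorted(merged.filters))
```

Turning the dropped `State = 'CA'` filter into a grouping column is what lets the merged view still answer V1: the filter can be reapplied to the merged view's rows.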

Storage
Data analysis queries touch lots of data
Data warehouses are often very large
Reading the data from disk is usually the bottleneck
What can be done to improve performance?
Add more disks and benefit from parallelism

RAID
Redundant Arrays of Inexpensive Disks
–Using lots of disk drives improves performance and reliability
–Alternative: Specialized, high-performance hardware
–RAID delivers better price/performance than high-end disks
Performance
–Read data from n disks at once → reads are up to n times faster
Reliability
–Store multiple copies of data
–If one disk fails, no data is lost and the system continues to run
Three main concepts
–Mirroring
–Striping
–Parity

Mirroring
Use two disks that are identical copies of each other
–Primary goal: fault-tolerance (if one disk fails, use the other one)
–Writes must be done to both disks at once
–Improved random read performance (can do two random reads at one time)
–Sequential read performance mostly unaffected
[Figure: two disks, each holding the same blocks “This Is What Mirroring Looks Like”]

Striping
Spread data across n disks
–First disk gets blocks 1, n+1, 2n+1, etc.
–Second disk gets blocks 2, n+2, 2n+2, etc.
Improved random read performance
–Can do as many as n reads at the same time
–But each read must go to a specific disk
–Thus multiple reads can conflict if unlucky
Sequential reads are very fast
–Especially for long reads (many blocks from each disk)
–Read in parallel from all disks
Each write goes to a single disk
[Figure: a sentence striped across three disk drives, one word per block]
Disk 1   Disk 2     Disk 3
This     Is         How
You      Would      Stripe
A        Sentence   Across
Three    Disk       Drives!
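The round-robin block placement can be sketched in a few lines, using the twelve-word sentence from the slide's figure as the blocks. The function names `locate` and `stripe` are invented for this sketch.

```python
def locate(block, n_disks):
    """Map a 1-based block number to (1-based disk, 0-based position on that disk)."""
    return (block - 1) % n_disks + 1, (block - 1) // n_disks


def stripe(blocks, n_disks):
    """Deal blocks out to the disks round-robin."""
    return [list(blocks[d::n_disks]) for d in range(n_disks)]


words = "This Is How You Would Stripe A Sentence Across Three Disk Drives!".split()
disks = stripe(words, 3)
for number, contents in enumerate(disks, start=1):
    print(f"Disk {number}: {contents}")
print(locate(4, 3))  # block n+1 lands back on the first disk
```

Two random reads conflict exactly when `locate` sends both blocks to the same disk, which is the "unlucky" case on the slide.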

Parity
Mirroring delivers fault-tolerance through redundancy
Storage utilization is rather poor
–Only 50% of disk capacity is useful
–The other 50% is overhead for fault tolerance
Parity checks deliver fault-tolerance with less redundancy
–Use n+1 disks
–Store data on n of the disks
–Last disk contains parity data: XOR of the other n disks
  –Compare ith bit on each disk
  –Even number of 1s → ith parity bit is 0
  –Odd number of 1s → ith parity bit is 1
–Any one disk fails → no data is lost

Parity Example
Three servers + 1 parity server
–Server 1 stores “110011”
–Server 2 stores “011011”
–Server 3 stores “110101”
–Server P stores “011101”
  –Number of 1s in each bit position = 2, 3, 1, 1, 2, 3
  –Even, odd, odd, odd, even, odd
Suppose Server 2 fails
–“110011”, “??????”, “110101”, “011101”
–Take XOR of remaining servers to reconstruct
  –Number of 1s in each bit position = 2, 3, 1, 2, 1, 3
  –Even, odd, odd, even, odd, odd → “011011”
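The example can be reproduced with integer XOR. This is a minimal sketch that treats each server's contents as a single 6-bit value; the helper name `xor_all` is invented.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    """Bitwise XOR of all values: bit i is 1 when an odd number of inputs have bit i set."""
    return reduce(xor, values)


server1, server2, server3 = 0b110011, 0b011011, 0b110101

parity = xor_all([server1, server2, server3])
print(format(parity, "06b"))

# Server 2 fails: XOR the survivors with the parity to rebuild its contents.
rebuilt = xor_all([server1, server3, parity])
print(format(rebuilt, "06b"))
```

The same function serves for computing parity and for recovery, because XOR-ing a value in twice cancels it out.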

RAID Levels
RAID 0
–Striping (without parity)
–Pros: Good performance; no redundancy (no wasted capacity)
–Cons: Poor fault-tolerance (worse than no RAID!)
RAID 1
–Mirroring
–Pros: Good fault-tolerance; very fast recovery
–Cons: Wastes storage capacity; performance not as good as other RAID levels

RAID Levels
RAID 2: Not used.
RAID 3 and 4:
–Striping with dedicated parity disk
–Stripe size = byte for RAID 3, block for RAID 4
–Pros: Good performance; good fault-tolerance with little redundancy; reasonably fast recovery
–Cons: Parity disk is a bottleneck for writes
[Figure: a sentence stored using RAID level 4, three data disks plus one parity disk]
Disk 1   Disk 2   Disk 3   Parity disk
This     Is       How      (Parity 1)
You      Would    Store    (Parity 2)
Data     Using    RAID     (Parity 3)
Level    Four.             (Parity 4)

RAID Levels
RAID 5
–Striping with distributed parity
–Disks “take turns” being the parity disk
–Pros and Cons similar to RAID 3 and 4
  –Avoids write bottleneck associated with RAID 3 and 4
  –Performance degrades following disk failure
[Figure: a sentence stored using RAID level 5 on four disks, with the parity block rotating from stripe to stripe]
Disk 1       Disk 2       Disk 3       Disk 4
This         Is           How          (Parity 1)
You          Would        (Parity 2)   Store
Data         (Parity 3)   Using        RAID
(Parity 4)   Level        Five.
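The rotating parity placement in the slide's figure can be sketched as below. The rotation rule (parity starts on the last disk and moves one disk to the left per stripe) is read off the figure; real RAID 5 implementations offer several layouts, so treat it as one possible convention.

```python
def raid5_layout(blocks, n_disks):
    """Return one row per stripe, with a parity marker rotating across the disks."""
    per_stripe = n_disks - 1
    rows = []
    for stripe, start in enumerate(range(0, len(blocks), per_stripe)):
        data = list(blocks[start:start + per_stripe])
        data += [""] * (per_stripe - len(data))  # pad a short final stripe
        parity_disk = (n_disks - 1 - stripe) % n_disks
        rows.append(data[:parity_disk] + [f"(Parity {stripe + 1})"] + data[parity_disk:])
    return rows


words = "This Is How You Would Store Data Using RAID Level Five.".split()
layout = raid5_layout(words, 4)
for row in layout:
    print(row)
```

Because each stripe's parity lives on a different disk, parity updates for writes to different stripes are spread across all disks instead of queuing on one.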

Multi-Level RAID
The RAID ideas can be hierarchically combined
The most common combinations are:
–RAID 1+0 – stripes of mirrors
–RAID 0+1 – mirror of stripes
[Figure: two layouts, labeled RAID 1+0 and RAID 0+1. In each, a sentence (“This Is How RAID 1+0 Works” / “This Is How RAID 0+1 Works”) is striped across three disks (This/RAID, Is/1+0 or 0+1, How/Works), and every block is stored on two disks.]

RAID 1+0 vs. RAID 0+1
Difference is what happens when a disk fails
–RAID 1+0
  –One mirrored pair loses its redundancy (that part of the stripe becomes unmirrored)
  –Only failure of the other disk in that pair leads to data loss
–RAID 0+1
  –One whole stripe set (one side of the mirror) becomes invalid
  –Failure of any disk in the other stripe set leads to data loss
[Figure: the same RAID 1+0 and RAID 0+1 layouts as on the previous slide]
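The difference can be quantified by enumerating two-disk failures. The six-disk arrangement below (three-way striping, two copies of every block) is an assumption matching the figure; disk numbers 0 to 5 are arbitrary labels.

```python
from itertools import combinations

# RAID 1+0: three mirrored pairs, striped together.
MIRRORED_PAIRS = [(0, 3), (1, 4), (2, 5)]
# RAID 0+1: two three-disk stripe sets that mirror each other.
STRIPE_SETS = [(0, 1, 2), (3, 4, 5)]


def raid10_survives(failed):
    """Data survives if every mirrored pair still has at least one working disk."""
    return all(not set(pair) <= failed for pair in MIRRORED_PAIRS)


def raid01_survives(failed):
    """Data survives if at least one stripe set has no failed disk."""
    return any(not set(stripe_set) & failed for stripe_set in STRIPE_SETS)


two_disk_failures = [set(c) for c in combinations(range(6), 2)]
fatal_10 = sum(not raid10_survives(f) for f in two_disk_failures)
fatal_01 = sum(not raid01_survives(f) for f in two_disk_failures)
print(len(two_disk_failures), fatal_10, fatal_01)
```

Under these assumptions, of the 15 possible two-disk failures, 3 lose data in RAID 1+0 and 9 lose data in RAID 0+1, which is why RAID 1+0 is usually the preferred of the two.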