CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.

Presentation transcript:

CS 345: Topics in Data Warehousing Tuesday, October 19, 2004

Review of Thursday’s Class
Indexes
  – B-Tree and Hash Indexes
  – Clustered vs. Non-Clustered
  – Covering Indexes
Using Indexes in Query Plans
Bitmap Indexes
  – Index intersection plans
  – Bitmap compression

Outline of Today’s Class
Bitmap compression with BBC codes
  – Gaps and Tails
  – Variable byte-length encoding of lengths
  – Special handling of lone bits
Speeding up star joins
  – Cartesian product of dimensions
  – Semi-join reduction
  – Early aggregation

Bitmap Compression
Compression via run-length encoding
  – Just record number of zeros between adjacent ones
  – A bitmap with 7, 4, 12, 0, and 5 zeros before its successive ones is stored as “7,4,12,0,5”
But: can’t just write the lengths in binary back to back (11110011000101)
  – It could be 7,4,12,0,5: (111)(100)(1100)(0)(101)
  – Or it could be 3,25,8,2,1: (11)(11001)(1000)(10)(1)
  – Need structured encoding
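The ambiguity described above can be checked directly. The following is a minimal sketch; the helper names `run_lengths` and `naive_concat` are hypothetical and only illustrate the point that delimiter-free binary lengths cannot be decoded uniquely.

```python
def run_lengths(bitmap: str) -> list[int]:
    """Count the zeros preceding each one in a bitmap given as a '0'/'1' string."""
    gaps, zeros = [], 0
    for bit in bitmap:
        if bit == "0":
            zeros += 1
        else:
            gaps.append(zeros)
            zeros = 0
    return gaps


def naive_concat(gaps: list[int]) -> str:
    """Write each gap in plain binary with no delimiters (the ambiguous encoding)."""
    return "".join(format(g, "b") for g in gaps)


# Bitmap built from the run lengths 7, 4, 12, 0, 5.
bitmap = "0" * 7 + "1" + "0" * 4 + "1" + "0" * 12 + "1" + "1" + "0" * 5 + "1"
gaps = run_lengths(bitmap)

# Two different gap sequences produce the same bit string.
a = naive_concat([7, 4, 12, 0, 5])
b = naive_concat([3, 25, 8, 2, 1])
print(gaps, a, b, a == b)
```

Because both sequences collapse to the same string, a decoder needs extra structure (such as length prefixes or byte alignment) to recover the gaps.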

BBC Codes
Byte-aligned Bitmap Codes
  – Proposed by Antoshenkov (1994)
  – Used in Oracle
  – We’ll discuss a simplified variation
Divide bitmap into bytes
  – Gap bytes are all zeros
  – Tail bytes contain some ones
  – A chunk consists of some gap bytes followed by some tail bytes
Encode chunks
  – Header byte
  – Gap length bytes (sometimes)
  – Verbatim tail bytes (sometimes)

BBC Codes
Number of gap bytes
  – 0-6: Gap length stored in header byte
  – 7-127: One gap-length byte follows header byte
  – 128-32767: Two gap-length bytes follow header byte
“Special” tail
  – Tail of a chunk is special if:
      – Tail consists of only 1 byte
      – The tail byte has only 1 non-zero bit
  – Non-special tails are stored verbatim (uncompressed)
      – Number of tail bytes is stored in header byte
  – Special tails are encoded by indicating which bit is set

BBC Codes
Header byte
  – Bits 1-3: length of (short) gap
      – Gaps of length 0-6 don’t require gap length bytes
      – 111 = gap length > 6
  – Bit 4: Is the tail special?
  – Bits 5-8:
      – Number of verbatim bytes (if bit 4 = 0)
      – Index of non-zero bit in tail byte (if bit 4 = 1)
Gap length bytes
  – Either one or two bytes
  – Only present if bits 1-3 of header are 111
  – Gap lengths of 7-127 encoded in a single byte
  – Gap lengths of 128-32767 encoded in 2 bytes
      – 1st bit of 1st byte set to 1 to indicate 2-byte case
Verbatim bytes
  – 0-15 uncompressed tail bytes
  – Number is indicated in header

BBC Codes
Example: an 18-byte bitmap consisting of two chunks
Chunk 1
  – Bytes 1-3
  – Two gap bytes, one tail byte
  – Encoding: (010)(1)(0100)
  – No gap length bytes since gap length < 7
  – No verbatim bytes since tail is special
Chunk 2
  – Bytes 4-18
  – 13 gap bytes, two tail bytes
  – Encoding: (111)(0)(0010)
  – One gap length byte gives gap length = 13
  – Two verbatim bytes for tail
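The header layout and the two example chunks can be sketched in code. This is an illustration of the simplified variation described in these slides, not Oracle's implementation. The function name `encode_chunk` is hypothetical, and so are two conventions the slides do not spell out: the index of the lone bit is counted from the most significant bit starting at 0, and the two-byte gap form holds up to 32767 (15 bits after the flag bit). The tail bytes used for chunk 2 are made up.

```python
def encode_chunk(gap_bytes: int, tail: bytes) -> bytes:
    """Encode one chunk: `gap_bytes` all-zero bytes followed by the bytes in `tail`."""
    if not 0 <= gap_bytes <= 32767:
        raise ValueError("gap too long for two gap-length bytes")
    if len(tail) > 15:
        raise ValueError("at most 15 verbatim tail bytes fit in the header count")

    # A tail is "special" when it is a single byte with exactly one bit set.
    special = len(tail) == 1 and bin(tail[0]).count("1") == 1

    # Bits 1-3: short gap length, or 111 when gap-length bytes follow.
    short_gap = gap_bytes if gap_bytes <= 6 else 0b111

    if special:
        # Assumed convention: bit index counted from the most significant bit, from 0.
        low4 = 8 - tail[0].bit_length()
    else:
        low4 = len(tail)  # number of verbatim bytes

    header = (short_gap << 5) | (int(special) << 4) | low4
    out = bytearray([header])

    if gap_bytes > 6:
        if gap_bytes <= 127:
            out.append(gap_bytes)  # one gap-length byte, first bit 0
        else:
            # Two-byte case: first bit of the first byte is set to 1.
            out += (0x8000 | gap_bytes).to_bytes(2, "big")

    if not special:
        out += tail  # verbatim tail bytes
    return bytes(out)


# Chunk 1: two gap bytes, special tail with the bit at assumed index 4.
chunk1 = encode_chunk(2, bytes([0b00001000]))
# Chunk 2: 13 gap bytes, two (made-up) verbatim tail bytes.
chunk2 = encode_chunk(13, bytes([0xA0, 0x03]))
print(format(chunk1[0], "08b"), format(chunk2[0], "08b"), chunk2[1])
```

Under these assumptions the 18 input bytes shrink to 5 encoded bytes: one for chunk 1 and four for chunk 2.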

Expanding Query Plan Choices
“Conventional” query planner has limited options for executing star query
  – Join order: In which order should the dimensions be joined to the fact?
  – Join type: Hash join vs. Merge join vs. NLJ
  – Index selection: Can indexes support the joins?
  – Grouping strategy: Hashing vs. Sorting for grouping
We’ll consider extensions to basic join plans
  – Dimension Cartesian product
  – Semi-join reduction
  – Early aggregation

Faster Star Queries
Consider this scenario
  – Fact table has 100 million rows
  – 3 dimension tables, each with 100 rows
  – Filters select 10 rows from each dimension
One possible query plan
  1. Join fact to dimension A
      – Produce intermediate result with 10 million rows
  2. Join result to dimension B
      – Produce intermediate result with 1 million rows
  3. Join result to dimension C
      – Produce intermediate result with 100,000 rows
  4. Perform grouping & aggregation
Each join is expensive
  – Intermediate results are quite large

Dimension Cartesian Product
Consider this alternate plan:
  – “Join” dimensions A and B
      – Result is Cartesian product of all combinations
      – Result has 100 rows (10 A rows * 10 B rows)
  – “Join” result to dimension C
      – Another Cartesian product
      – 1000 rows (10 A rows * 10 B rows * 10 C rows)
  – Join result to fact table
      – Produce intermediate result with 100,000 rows
  – Perform grouping and aggregation
Computing Cartesian product is cheap
  – Few rows in dimension tables
Only one expensive join rather than three
Approach breaks down with:
  – Too many dimensions
  – Too many rows in each dimension satisfy filters
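The row counts quoted for the two plans follow from simple arithmetic. The sketch below reproduces them, assuming fact rows are spread uniformly across dimension keys (an assumption the slides imply but do not state); the variable names are illustrative.

```python
FACT_ROWS = 100_000_000
DIM_ROWS = 100   # rows per dimension table
SELECTED = 10    # rows per dimension that pass the filters

# Dimension-at-a-time plan: each join keeps SELECTED/DIM_ROWS of the running result.
size = FACT_ROWS
conventional = []
for _ in range(3):
    size = size * SELECTED // DIM_ROWS
    conventional.append(size)

# Cartesian product plan: combine the filtered dimensions first.
cartesian = [SELECTED * SELECTED, SELECTED * SELECTED * SELECTED]
final_join = FACT_ROWS * SELECTED**3 // DIM_ROWS**3

print(conventional)  # intermediate results derived from the fact table
print(cartesian)     # intermediate results derived from dimensions only
print(final_join)    # rows after the single join with the fact table
```

Both plans end with the same 100,000 rows, but the Cartesian product plan never materializes a multi-million-row intermediate result.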

Dimension Cartesian Product
Fact indexes can make Cartesian product approach even better
  – Suppose fact index exists with (A_key, B_key, C_key) as leading terms
  – Compute Cartesian product of A, B, C
  – Then use index to retrieve only the 0.1% of fact rows satisfying all filters
  – Joining fact to a single dimension table would require retrieving 10% of fact rows
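A toy version of the index-driven plan is sketched below. The composite index is modeled as a dictionary keyed by `(a_key, b_key, c_key)`; the data, key ranges, and filter sets are all invented for illustration, and a real engine would probe a B-tree rather than a hash map.

```python
import random
from collections import defaultdict
from itertools import product

random.seed(0)

# Toy fact table: (a_key, b_key, c_key, sales), keys drawn from 0..9.
fact = [
    (random.randrange(10), random.randrange(10), random.randrange(10), random.randrange(100))
    for _ in range(5000)
]

# Hypothetical composite fact index with (A_key, B_key, C_key) as leading terms.
index = defaultdict(list)
for rid, (a, b, c, _) in enumerate(fact):
    index[(a, b, c)].append(rid)

# Dimension keys that survive the filters on A, B, and C (made-up values).
sel_a, sel_b, sel_c = {1}, {2, 3}, {4}

# Cartesian product of the qualifying keys drives the index probes.
rids = [rid for key in product(sel_a, sel_b, sel_c) for rid in index.get(key, [])]

# Reference answer: full scan of the fact table with all three filters.
scan = [rid for rid, (a, b, c, _) in enumerate(fact)
        if a in sel_a and b in sel_b and c in sel_c]

print(len(rids), sorted(rids) == scan)
```

Only the fact rows matching a key combination are touched, which is the effect the multi-column index is meant to provide.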

Cartesian Product Pros & Cons
Benefits of dimension Cartesian product
  – Fewer joins involve fact table or its derivatives
  – Leverage filtering power of multi-column fact indexes with composite keys
Drawbacks of dimension Cartesian product
  – Cartesian product result can be very large
  – More stringent requirements on fact indexes
      – Fact index must include all dimensions from Cartesian product to be useful
      – Dimension-at-a-time join plans can use thin fact index for initial join

Semi-Join Reduction
Query plans involving semi-joins are common in distributed databases
Semi-join of A with B (A ⋉ B)
  – All rows in A that join with at least 1 row from B
  – Discard non-joining rows from A
  – Attributes from B are not included in the result
Semi-join of B with A (B ⋉ A)
  – All rows in B that join with at least 1 row from A
  – A ⋉ B != B ⋉ A
Identity: A ⋈ B = A ⋈ (B ⋉ A)
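The definitions above can be made concrete with a small sketch. Rows are modeled as dictionaries, and `semi_join` and `join` are hypothetical helper names; the tables A and B are made-up examples joined on `c1 = c2`.

```python
def semi_join(left, right, left_col, right_col):
    """Rows of `left` that join with at least one row of `right`; no `right` attributes kept."""
    keys = {r[right_col] for r in right}
    return [l for l in left if l[left_col] in keys]


def join(left, right, left_col, right_col):
    """Plain equi-join: one output row per matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if l[left_col] == r[right_col]]


A = [{"c1": 1, "x": "a1"}, {"c1": 2, "x": "a2"}, {"c1": 3, "x": "a3"}]
B = [{"c2": 2, "y": "b1"}, {"c2": 2, "y": "b2"}, {"c2": 4, "y": "b3"}]

a_semi_b = semi_join(A, B, "c1", "c2")   # rows of A with a partner in B
b_semi_a = semi_join(B, A, "c2", "c1")   # rows of B with a partner in A

# Joining A with the reduced B gives the same result as joining A with all of B.
full = join(A, B, "c1", "c2")
reduced = join(A, b_semi_a, "c1", "c2")
print(a_semi_b, b_semi_a, full == reduced)
```

The semi-join is not symmetric (one A row versus two B rows here), while the final join result is unchanged by reducing B first.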

Semi-Join Reduction
To compute join of A and B on A.C1 = B.C2:
  – Server 1 sends C1 values from A to Server 2
  – Server 2 computes semi-join of B with A
  – Server 2 sends joining B tuples to Server 1
  – Server 1 computes join of A and B
Better than simply sending entire B when:
  – Not too many B rows join with qualifying A rows
[Figure: Server 1 holds A and Server 2 holds B; A.C1 values are sent from Server 1 to Server 2]

Semi-Join Reduction for Data Warehouses
Goal is to save disk I/O rather than network I/O
  – Dimension table is “Server 1”
  – Fact table is “Server 2”
  – Fact table has single-column index on each foreign key column
Query plan goes as follows:
  – For each dimension table:
      – Determine keys of all rows that satisfy filters on that dimension
      – Use single-column fact index to look up RIDs of all fact rows with those dimension keys
  – Merge RID lists corresponding to each dimension
  – Retrieve qualifying fact rows
  – Join fact rows back to full dimension tables to learn grouping attributes
  – Perform grouping and aggregation
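The plan steps above can be traced on a toy star schema. In this sketch the single-column fact indexes are modeled as dictionaries from key to a set of RIDs (list positions); the tables, filters, and helper names `build_index` and `rids_for` are all hypothetical.

```python
from collections import defaultdict

# Toy fact table: (store_key, date_key, dollar_sales); the RID is the list position.
fact = [(1, 10, 5), (1, 11, 7), (2, 10, 3), (3, 10, 4), (2, 11, 6)]
store = {1: "North", 2: "North", 3: "South"}   # store_key -> district
date = {10: 2003, 11: 2004}                    # date_key -> year


def build_index(rows, col):
    """Single-column index: key value -> set of RIDs."""
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        index[row[col]].add(rid)
    return index


def rids_for(index, keys):
    """Union of the RID sets for every qualifying dimension key."""
    rids = set()
    for key in keys:
        rids |= index.get(key, set())
    return rids


store_index = build_index(fact, 0)
date_index = build_index(fact, 1)

# Apply filters to each dimension to get the qualifying keys.
store_keys = {k for k, district in store.items() if district == "North"}
date_keys = {k for k, year in date.items() if year == 2003}

# Look up RIDs per dimension, then merge (intersect) the RID lists.
qualifying = rids_for(store_index, store_keys) & rids_for(date_index, date_keys)

# Retrieve fact rows, join back to the dimension, then group and aggregate.
totals = defaultdict(int)
for rid in sorted(qualifying):
    store_key, _, dollars = fact[rid]
    totals[store[store_key]] += dollars

print(sorted(qualifying), dict(totals))
```

No fact row is fetched until every filter has been applied through the indexes, which is where the I/O saving comes from.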

Semi-Join Reduction
Semi-join query plan reduces size of intermediate results that must be joined
  – Intermediate results can be sorted, hashed more efficiently
[Figure: Dim (“Server 1”) sends Dim Keys to Fact (“Server 2”)]

Semi-Join Reduction
[Figure: steps 1-3 of the semi-join plan]
  1. Apply filters to the dimension table to eliminate non-qualifying rows; generate list of dimension keys
  2. Semi-join fact index and dimension keys
  3. Intersect fact RID lists

Semi-Join Reduction
[Figure: steps 4-5 of the semi-join plan]
  4. Look up fact rows in the fact table based on RIDs
  5. Join fact rows back to the dimension tables to bring in grouping attributes

Semi-Join Reduction Pros & Cons
Benefits of semi-join reduction
  – Makes use of thin (1-column) fact indexes
  – Only relevant fact rows need be retrieved
      – Apply all filters before retrieving any fact rows
Drawbacks of semi-join reduction
  – Incur overhead of index intersection
  – Looking up fact rows from RIDs can be expensive
      – Random I/O
      – Only good when number of qualifying fact rows is small
  – Potential to access same dimension twice
      – Initially when generating dimension key list
      – Later when joining back to retrieve grouping columns

Early Aggregation
Query plans we’ve considered do joins first, then grouping and aggregation
Sometimes “group by” can be handled in two phases
  – Perform partial aggregation early as a data reduction technique
  – Finish up the aggregation after completing all joins
Example:
  – SELECT Store.District, SUM(DollarSales)
    FROM Sales, Store, Date
    WHERE Sales.Store_key = Store.Store_key
      AND Sales.Date_key = Date.Date_key
      AND Date.Year = 2003
    GROUP BY Store.District
  – Lots of Sales rows, but fewer distinct (Store, Date) combinations
Early aggregation plan:
  1. Group Sales by (Store, Date) & compute SUM(DollarSales)
  2. Join result with Date dimension, filtered based on Year
  3. Join result with Store dimension
  4. Group by District & compute SUM(DollarSales)
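A small sketch can confirm that the two-phase plan gives the same answer as joining first. The tables below are invented toy data mirroring the Sales, Store, and Date example; SUM is used because partial sums can be safely re-summed.

```python
from collections import defaultdict

# Toy tables: sales rows are (store_key, date_key, dollar_sales).
sales = [(1, 10, 5), (1, 10, 2), (2, 11, 3), (3, 10, 4), (3, 12, 9), (2, 12, 1), (3, 10, 6)]
store = {1: "North", 2: "North", 3: "South"}   # store_key -> district
date = {10: 2003, 11: 2003, 12: 2004}          # date_key -> year

# Conventional plan: join and filter first, then group by district.
conventional = defaultdict(int)
for store_key, date_key, dollars in sales:
    if date[date_key] == 2003:
        conventional[store[store_key]] += dollars

# Early aggregation, phase 1: partial SUM per (store, date) before any join.
partial = defaultdict(int)
for store_key, date_key, dollars in sales:
    partial[(store_key, date_key)] += dollars

# Phase 2: join the much smaller partial result to Date and Store, then finish the SUM.
early = defaultdict(int)
for (store_key, date_key), subtotal in partial.items():
    if date[date_key] == 2003:
        early[store[store_key]] += subtotal

print(len(sales), len(partial), dict(conventional), dict(early))
```

Here seven sales rows collapse to five partial groups before the joins; on a real fact table the reduction would be far larger.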

Compare with Conventional Plan
Conventional plan
  – Join Sales and Date, filtering based on Year
      – Result has about 36 million rows
  – Join result with Store
      – Result has about 36 million rows
  – Group by District & compute aggregate
Early aggregation plan
  – Group Sales by (Date, Store) & compute aggregate
      – Result has 100,000 rows
  – Join result with Date
      – Result has 36,500 rows
  – Join result with Store
      – Result has 36,500 rows
  – Group by District & compute aggregate
Assumptions
  – Sales fact has 100 million rows
  – Store dimension has 100 rows
  – Date dimension has 1000 rows (365 in 2003)
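The row counts in this comparison can be rederived from the stated assumptions. The sketch below additionally assumes that sales are spread uniformly over dates and that every (store, date) combination occurs at least once; neither assumption is stated in the slides.

```python
SALES_ROWS = 100_000_000
STORES = 100
DATES = 1000
DATES_IN_2003 = 365

# Conventional plan: the Year filter keeps 365 of 1000 dates' worth of fact rows.
conventional_after_date_join = SALES_ROWS * DATES_IN_2003 // DATES

# Early aggregation: at most one group per (store, date) combination.
early_groups = STORES * DATES

# After joining the groups with the filtered Date dimension.
early_after_date_join = STORES * DATES_IN_2003

print(conventional_after_date_join, early_groups, early_after_date_join)
```

The joins in the early aggregation plan therefore process intermediate results that are roughly a thousand times smaller than in the conventional plan.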

Early Aggregation Pros & Cons
Benefits of early aggregation
  – Initial aggregation can be fast with appropriate covering index
      – Leverage fact index on (Date, Store, DollarSales)
  – Result of early aggregation significantly smaller than fact table
      – Fewer rows
      – Fewer columns
  – Joins to dimension tables are cheaper
      – Because intermediate result is much smaller than fact table
Drawbacks of early aggregation
  – Can’t take advantage of data reduction due to filters
      – Prefer joins with highly selective filters (Date.Day = 'October 20, 2004') before early aggregation
  – Two aggregation steps instead of one
      – Adds overhead

Summary
Three query planning techniques for star schema queries
Cartesian product of dimension tables
  – Useful when several dimensions are small, or filtered to a small number of rows
  – Cuts down on the number of fact table joins
Semi-join reduction
  – Useful when AND of filters is quite selective, but individual filters are not
  – Only relevant rows from fact table are accessed
  – Doesn’t require a wide covering index
Early aggregation
  – Aggregation, like filtering, reduces size of tables
  – Useful when dimensions needed in query have low distinct cardinality
Which technique is best depends on individual query parameters
  – Sometimes a “traditional” plan is best after all
  – Decision made based on query optimizer’s cost model