Indexing Techniques. CS 543 – Data Warehousing (Sp 2007-2008) - Asim LUMS

Presentation transcript:

Indexing Techniques CS 543 – Data Warehousing

Indexing
Goal: Increase the efficiency of data access by reducing the number of I/Os required to find the desired record(s).
Library analogy: Indexed access is like using the card catalog in a library rather than searching every shelf until the desired book is found (that is, it avoids a full table scan).

DW Indexing Issues
Indexes and loading
Indexing for large tables
Index-only reads
Selecting columns for indexing
A staged approach

B-Tree Index

Bitmapped Index

Bitmapped Index (continued)

Indexing the Fact Table
If the DBMS does not create an index for the primary key, create one using B-tree indexing.
In the concatenated primary key, place the keys of frequently accessed dimension tables first (in the high-order positions).
Create indexes for combinations of dimension table primary keys, based on query performance.
Do not overlook indexing metric columns.
Bitmapped indexing does not apply to fact tables; there are hardly any low-selectivity columns.

Indexing the Dimension Tables
Create a unique B-tree index on the single-column primary key.
Index any column that is used frequently to constrain queries.
Create an index for each combination of columns that are frequently used together in queries.
Index every column likely to be used in a join operation.

Hash Indexing
In contrast to B-tree indexing, hash-based indexes do not (typically) keep index values in sorted order.
An index entry is located by hashing the index value.
Index entries are kept in hash-organized tables rather than B-tree structures.
Each index entry contains the ROWID values of every row corresponding to the index value.
ROWIDs are kept in sorted order to facilitate maximum I/O performance.
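The structure described above can be sketched in a few lines of Python. This is only an illustration, not any DBMS's actual implementation: the class name `HashIndex`, its methods, and the sample values and ROWIDs are all hypothetical, and a Python dict stands in for the hash-organized table.

```python
from bisect import insort
from collections import defaultdict


class HashIndex:
    """Toy hash-organized index: index value -> sorted list of ROWIDs."""

    def __init__(self):
        # A Python dict is itself a hash table, so a lookup hashes the
        # index value instead of walking a sorted tree.
        self._entries = defaultdict(list)

    def add(self, value, rowid):
        # Keep each ROWID list sorted so base-table blocks are visited in order.
        insort(self._entries[value], rowid)

    def lookup(self, value):
        # .get() avoids creating an empty entry for values that are absent.
        return list(self._entries.get(value, []))


# Hypothetical (index value, ROWID) pairs, inserted out of ROWID order.
index = HashIndex()
for value, rowid in [("CA", 42), ("MA", 19), ("CA", 7), ("CA", 30)]:
    index.add(value, rowid)

ca_rowids = index.lookup("CA")   # ROWIDs come back sorted
missing = index.lookup("NY")     # no entry for this value
```

Note that the index values themselves are unordered, so a range predicate cannot be answered from this structure without visiting every entry.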

Primary Indexing
Primary index for a table in Teradata is a specification of its partitioning column(s).
Primary index may be defined as unique (UPI) or non-unique (NUPI).
  Automatic enforcement of uniqueness when UPI is specified.
Primary index provides an implicit access path to any row just by knowing its value.
Only one primary index per table.

Primary Indexing
Primary index selection criteria:
Common join and retrieval key.
Distributes rows evenly across database partitions.
Less than ten thousand rows per PI value when non-unique.

Primary Indexing
Trick question: What should be the primary index of the transaction table for a large financial services firm?
    create table tx
      (tx_id decimal (15,0) NOT NULL,
       account_id decimal (10,0) NOT NULL,
       tx_amt decimal (15,2) NOT NULL,
       tx_dt date NOT NULL,
       tx_cd char (2) NOT NULL
       .... )
    primary index (???);
Answer: It depends.

Primary Indexing
Almost all joins and retrievals will come in through the account_id foreign key.
  Want account_id as NUPI.
But the data may be “lumpy” when distributed on account_id, or some accounts may have very large numbers of transactions (e.g., an institutional account could easily have 10,000+ transactions).
  In that case, want tx_id as UPI for good data distribution.
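The trade-off can be made concrete with a small simulation. This is a sketch under loud assumptions: the VAMP count, the account and transaction counts, and the use of a simple modulo in place of Teradata's real row-hash function are all hypothetical stand-ins chosen only to show the effect of one very large account.

```python
from collections import Counter

NUM_VAMPS = 8  # hypothetical number of VAMPs


def vamp_for(key):
    # Stand-in for the real row-hash function: a plain modulo.
    return key % NUM_VAMPS


def build_transactions():
    """Return (tx_id, account_id) rows: one institutional account with
    10,000 transactions plus 1,000 ordinary accounts with 10 each."""
    rows = []
    tx_id = 0
    for _ in range(10_000):
        rows.append((tx_id, 0))          # account 0 is the institutional one
        tx_id += 1
    for account_id in range(1, 1001):
        for _ in range(10):
            rows.append((tx_id, account_id))
            tx_id += 1
    return rows


def rows_per_vamp(rows, key_position):
    # Count how many rows land on each VAMP for the chosen key column.
    counts = Counter(vamp_for(row[key_position]) for row in rows)
    return [counts[v] for v in range(NUM_VAMPS)]


transactions = build_transactions()
by_tx_id = rows_per_vamp(transactions, 0)       # tx_id as UPI
by_account = rows_per_vamp(transactions, 1)     # account_id as NUPI
```

With these made-up numbers, distributing on tx_id gives every VAMP the same row count, while distributing on account_id leaves the VAMP that owns the institutional account holding nine times as many rows as any other.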

Primary Indexing
Joins and access via primary index are very efficient due to Teradata’s sophisticated row hashing algorithms that allow going directly to the data block containing the desired row.
Single I/O operation for accessing a data row via UPI.
Single I/O operation for accessing a data row via NUPI whenever all rows with the same PI value fit into a single block.
Single VAMP operation for indexed retrieval.
No spool space required.

Primary Indexing
Primary index is free!
No storage cost.
No index build required.
This is a direct result of the underlying hash-based file system implementation.
OLTP databases use a page-based file system and therefore do not deliver this performance advantage.

Secondary Indexing
Secondary index structures are implemented using the same underlying structure as base tables (often referred to as subtables).
Secondary index may be defined as unique (USI) or non-unique (NUSI).
  Automatic enforcement of uniqueness when USI is specified.
Up to thirty-two secondary indexes per table in Teradata.
Unlike a primary index, secondary indexes are not “free” in terms of storage.

Secondary Index: NUSI
A non-unique secondary index (NUSI) is partitioned so that each index entry is co-located on the same VAMP (Virtual Access Module Processor) as its corresponding row in the base table.
Each row access via a NUSI is a single-VAMP operation (for that row) because the NUSI entry and data row are co-located.
NUSI access is always performed in parallel across all VAMPs whenever it is appropriate to do so.

Secondary Indexing: NUSI
Compressed ROWID index structure:
Hash on index value to get block location (ROWID for subtable).
Store index value just once, followed by all ROWIDs in base table corresponding to the index value.
Sorted by ROWID to facilitate maximum efficiency when accessing base table, performing updates and deletes, etc.
Additional blocks allocated when NUSI is non-selective and compressed ROWID structure for the index value exceeds 64K.

Secondary Indexing: NUSI (continued)

When to Build a NUSI?
Building a NUSI helps when the selectivity of the indexed column is very high.
Cost-based optimizer will determine when to access via NUSI:
  Number of rows selected by NUSI must be less than number of blocks in the table to justify access via NUSI (assumes even distribution of rows with NUSI value within table).
  Must also consider cost for reading the NUSI subtable and building ROWID spool file.
Note that the extreme efficiency of table scanning in Teradata reduces the need for secondary indexing as compared to other databases.

Secondary Indexing: USI
A unique secondary index (USI) is partitioned by the unique column upon which the index is built.
Row access via a USI is a two-VAMP operation.
  First I/O is initiated on the VAMP with the USI entry.
  Second I/O is initiated on the VAMP with the data row entry.

Secondary Indexing: USI (continued)

When to Build a USI?
To allow data access without all-VAMP operations.
  Increased efficiency for (very) high selectivity retrievals.
To obtain co-location of the index with frequently joined tables.

When to Build a USI?
Example:
    create table order_header
      (order_id decimal(12, 0) NOT NULL,
       customer_id decimal(9, 0) NOT NULL,
       order_dt date NOT NULL
       ... )
    primary index( customer_id );

    create unique index oh_order_idx (order_id) on order_header;

    create table order_detail
      (order_id decimal(12, 0) NOT NULL,
       product_id integer NOT NULL,
       extended_price_amt decimal(15,2) NOT NULL,
       item_cnt integer NOT NULL
       ... )
    primary index( order_id );

When to Build a USI?
Example: How many customers ordered green socks in the last month? Assume that green socks is quite selective.
    select count(distinct order_header.customer_id)
    from order_header, order_detail, product
    where order_header.order_id = order_detail.order_id
      and order_header.order_dt > add_months(date, -1)
      and order_detail.product_id = product.product_id
      and product.product_subcategory_cd = 'SOCKS'
      and product.color_cd = 'GREEN' ;
The order_id USI on the order_header table obviates the need for all-VAMP duplication of the spool result from the order_detail-to-product join when joining to the order_header table.

A Simple Query
Example: What is the average age (in years) of customers who live in California or Massachusetts, completed a graduate degree, are consultants, and have a hobby of volleyball or chess?
    select avg( (days(date) - days(customer.birth_dt)) / )
    from customer
    where customer.state_cd in ('CA', 'MA')
      and customer.education_cd = 'G'
      and customer.occupation_cd = 'CONSULTANT'
      and customer.hobby_cd in ('VOLLEYBALL', 'CHESS') ;

Sample Table Structure
Assume:
  20M customers.
  128 byte rows.
  64K data block size.
Results in 512 rows per block and a total of 39,063 blocks in the customer table.
Note: We are ignoring block overhead for purposes of simplicity in calculations.

Data Demographics
Assume:
  8% of customers live in California.
  4% of customers live in Massachusetts.
  4% of customers have completed a graduate degree.
  6% of customers are consultants.
  2% of customers have a primary hobby of chess.
  3% of customers have a primary hobby of volleyball.

Full Table Scan Performance
Must read every block in the table.
Apply where clause predicates to determine which customers to include in average.
Adjust numerator and denominator of average as appropriate.
Total I/O count = 39,063
Note: Data demographics have no (minimal) impact on query performance when using a full table scan operation.

Single Index Structure
B-tree or hash organization of column values:
  Index entries store row IDs (RIDs), lists of RIDs, or pointers to lists of RIDs.
  Originally designed for columns with many unique values (OLTP legacy).
Assuming an eight-byte RID, we get roughly 8K RIDs per 64K block (65,536 / 8 = 8,192; the I/O counts on the following slides are computed with a slightly lower figure of 8,096).

Single Index Access
1. Optimizer chooses index with best selectivity based on values specified in query.
2. Access next (first) index entry corresponding to specified column value(s).
3. Use RID from index entry to locate row with specified column value.
4. Validate remaining predicates to qualify row.
5. Adjust average as appropriate.
6. Go to 2 until no more matching index values.
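The access loop above can be sketched as follows. This is a minimal in-memory illustration, not a DBMS implementation: the function name, the `age_years` field (standing in for the age computed from birth_dt), the four sample rows, and the dict used as an index are all hypothetical, and the optimizer's index choice is assumed to have been made already.

```python
def average_via_index(rows, index, indexed_value, remaining_predicates, measure):
    """Average `measure` over rows found through one index, then filtered."""
    total = 0.0
    qualified = 0
    # Walk every RID stored under the chosen index value (the repeat loop).
    for rid in index.get(indexed_value, []):
        row = rows[rid]                                   # locate row by RID
        if all(pred(row) for pred in remaining_predicates):  # validate the rest
            total += row[measure]                         # adjust numerator
            qualified += 1                                # adjust denominator
    return total / qualified if qualified else None


# Hypothetical base table; the list position plays the role of the RID.
customers = [
    {"state_cd": "CA", "education_cd": "G", "age_years": 40},
    {"state_cd": "NY", "education_cd": "G", "age_years": 55},
    {"state_cd": "MA", "education_cd": "G", "age_years": 30},
    {"state_cd": "CA", "education_cd": "B", "age_years": 25},
]
# Hypothetical index on education_cd: value -> list of RIDs.
education_index = {"G": [0, 1, 2], "B": [3]}

avg_age = average_via_index(
    customers,
    education_index,
    "G",
    [lambda row: row["state_cd"] in ("CA", "MA")],
    "age_years",
)
```

Every RID under the chosen value costs a base-table access even when the row is later rejected by the remaining predicates, which is why the selectivity of the chosen index dominates the cost.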

Single Index Access
What are my indexing choices?
  state_cd (8% + 4% = 12% selectivity)
  education_cd (4% selectivity)
  occupation_cd (6% selectivity)
  hobby_cd (2% + 3% = 5% selectivity)
Choose education_cd because it has best selectivity.

Single Index Performance
Access via index on education_cd:
  800,000 RIDs (4% of 20M)
  99 blocks of RIDs to read
But... 4% selectivity with 512 rows per block in the base table means that 800,000 selected RIDs will cause access to every block in the base table!
Total I/O count = 39,063 + 99 = 39,162
Worse than full table scan!
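The arithmetic behind these figures can be checked with a short script. It is a sketch under the slides' stated assumptions (20M rows, 128-byte rows, 64K blocks, no block overhead, selected rows spread evenly); the variable names are mine, and the RIDs-per-block value is the 8,096 figure that the slide totals are based on rather than 65,536 / 8.

```python
import math

ROW_COUNT = 20_000_000
ROW_WIDTH = 128            # bytes
BLOCK_SIZE = 64 * 1024     # bytes
RIDS_PER_BLOCK = 8096      # figure used on the slides (65536 // 8 would be 8192)
SELECTIVITY_PCT = 4        # education_cd = 'G'

rows_per_block = BLOCK_SIZE // ROW_WIDTH
total_blocks = math.ceil(ROW_COUNT / rows_per_block)

# Index side: how many RIDs qualify, and how many index blocks hold them.
selected_rids = ROW_COUNT * SELECTIVITY_PCT // 100
rid_blocks = math.ceil(selected_rids / RIDS_PER_BLOCK)

# Base-table side: probability that a block holds at least one selected row,
# assuming selected rows are spread evenly and independently across blocks.
p_block_hit = 1 - (1 - SELECTIVITY_PCT / 100) ** rows_per_block
base_blocks = round(total_blocks * p_block_hit)

indexed_total = rid_blocks + base_blocks
full_scan_total = total_blocks
```

With 512 rows per block, the chance that a block contains none of the selected 4% is vanishingly small, so the indexed path reads every base-table block and pays for the index blocks on top.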

Single Index Performance
Accessing via an index helps only when the selectivity of the indexed column is very high.
Rule of thumb: Number of rows selected by an index should not be more than the number of blocks in the table to justify indexed access (assumes rows with selected value(s) have an “even” distribution within table).
Must also consider cost for reading the index and sorting RIDs (if not already sorted) prior to accessing base table rows (to avoid hitting same block multiple times).

Single Index Performance
What is the break-even index selectivity (S) versus full table scan performance?
  Selectivity     = S
  Row Count       = 20M rows
  Block Size      = 64K
  Row Width       = 128 bytes
  RID Width       = 8 bytes
  RIDs per Block  = floor(Block Size / RID Width) = 8K
  Rows per Block  = floor(Block Size / Row Width) = 512
  Total Blocks    = ceiling((Row Count) / (Rows per Block)) = 39,063

Single Index Performance
  RID I/Os                = (S * Row Count) / (RIDs per Block)
  Indexed Base Table I/Os = (Total Blocks) * (1 - ((1 - S) ** Rows per Block))
  Full Table Scan I/Os    = Total Blocks
Break-even formula:
  RID I/Os + Indexed Base Table I/Os = Full Table Scan I/Os
Break-even S is less than 2%.
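The break-even formula has no convenient closed form, but it can be solved numerically. The sketch below uses plain bisection on the slide's cost model; the function name and its parameters are my own, the defaults are the slide's assumptions, and RIDs per block is taken as floor(Block Size / RID Width).

```python
import math


def break_even_selectivity(row_count=20_000_000, row_width=128,
                           block_size=64 * 1024, rid_width=8):
    """Selectivity S at which indexed access costs the same as a full scan."""
    rids_per_block = block_size // rid_width
    rows_per_block = block_size // row_width
    total_blocks = math.ceil(row_count / rows_per_block)

    def indexed_minus_scan(s):
        rid_ios = s * row_count / rids_per_block
        base_ios = total_blocks * (1 - (1 - s) ** rows_per_block)
        return rid_ios + base_ios - total_blocks

    # The difference is negative at S = 0 (index wins) and positive at S = 1
    # (scan wins) and rises monotonically, so bisection finds the crossover.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if indexed_minus_scan(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


s_128 = break_even_selectivity()                 # 128-byte rows
s_256 = break_even_selectivity(row_width=256)    # wider rows, fewer per block
```

Under this model the crossover for 128-byte rows falls between 1.3% and 1.4%, consistent with the "less than 2%" stated above, and widening the rows to 256 bytes moves it to between 2.7% and 2.8%.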

Single Index Performance
Larger row widths and/or smaller block sizes will generally make indexes more desirable, because there is a higher probability that a given block will not contain a selected row when fewer rows fit in a block.
Example: With a row width of 256 bytes (instead of 128 bytes), the break-even selectivity for indexing becomes approximately 2.7%.
Of course, larger row widths and/or smaller block sizes mean that we are actually getting fewer rows per I/O, and thus the amount of work we will do to satisfy a query will generally be higher.

Single Index Performance
Traditional index structures work well in OLTP because selectivity is extremely high (1 customer out of 20M or a few accounts out of 50M).
Selectivity is a tiny fraction of a percent (1 row in 20M is 0.000005%) and thus is significantly better than the one or two percent required for break even.
Bottom line: Traditional indexing is good for OLTP-style queries, but is not so great for traditional DSS queries.

Combining Multiple Indexes
Observation: Indexed access on a single column is rarely useful in a traditional data warehouse environment.
Idea: Combine multiple indexes to get the selectivity required for efficient indexed access.

Combining Multiple Indexes
While none of the index choices (state, education, occupation, hobby) is selective enough on its own to be useful, when combined they provide sufficient selectivity to make indexed access efficient.
Example: Start with 20M customers...
  Column index                               Incremental selectivity   Selected customers
  Selectivity by state                       12%                       2,400,000
  Followed with selectivity by education      4%                          96,000
  Followed with selectivity by occupation     6%                           5,760
  Followed with selectivity by hobby          5%                             288
Combined selectivity is 0.00144% (288 out of 20M).
Note: These selectivity figures assume that column values are independent (...and they usually are not).

Combining Multiple Indexes
Must consider I/O cost of accessing four different indexes plus cost of accessing selected blocks from base table:
  State       = 297
  Education   =  99
  Occupation  = 149
  Hobby       = 124
  Base table  = 288
  =====
  Total       = 957
More efficient than a full table scan!
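These totals can be reproduced with integer arithmetic. The script below is a sketch under the slides' assumptions: independent column values, one base-table I/O per surviving row (the worst case, each row in a different block), and 8,096 RIDs per index block, which is the figure the slide totals are based on. The variable and function names are mine.

```python
ROW_COUNT = 20_000_000
RIDS_PER_BLOCK = 8096   # figure used on the slides (65536 // 8 would be 8192)

# Selectivity of each single-column index, in percent, from the slides.
selectivity_pct = {"state": 12, "education": 4, "occupation": 6, "hobby": 5}


def ceil_div(a, b):
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-a // b)


# Each index is read in full for its matching value(s).
index_ios = {
    column: ceil_div(ROW_COUNT * pct // 100, RIDS_PER_BLOCK)
    for column, pct in selectivity_pct.items()
}

# Applying the predicates one after another shrinks the qualifying set.
surviving_rows = ROW_COUNT
for pct in selectivity_pct.values():
    surviving_rows = surviving_rows * pct // 100

base_table_ios = surviving_rows              # worst case: one block per row
index_only_ios = sum(index_ios.values())     # cost if base table is not needed
combined_total = index_only_ios + base_table_ios
```

The index reads dominate the total, which is why a query that can be answered from the indexes alone saves comparatively little beyond the 288 base-table reads.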

Combining Multiple Indexes
Notice that there is a significant performance benefit if we can satisfy a query directly out of indexes without accessing base table.
Example: How many customers live in California or Massachusetts, completed a graduate degree, are consultants, and have a hobby of volleyball or chess?
    select count(*)
    from customer
    where customer.state_cd in ('CA','MA')
      and customer.education_cd = 'G'
      and customer.occupation_cd = 'CONSULTANT'
      and customer.hobby_cd in ('VOLLEYBALL','CHESS') ;

Combining Multiple Indexes
Now we only need to consider the I/O cost of accessing the four indexes (sans the cost of accessing base table):
  State       = 297
  Education   =  99
  Occupation  = 149
  Hobby       = 124
  =====
  Total       = 669
Much more efficient than a full table scan!

Combining Multiple Indexes
Traditional combining of multiple indexes:
  Requires RID list ANDing (and sometimes ORing) to combine the multiple indexes.
  May incur overhead of RID list sorting in order to facilitate ANDing operation (depends on RDBMS indexing implementation).
This technique is useful when no one index is selective enough to produce an efficient access path, but multiple indexes taken together can provide the needed selectivity.
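RID list ANDing and ORing on sorted lists amounts to a merge-style intersection and union. The sketch below shows one common way to do it, not any particular RDBMS's implementation; the function names and the sample RID values are hypothetical, and each input list is assumed to be sorted with no duplicates.

```python
def and_rid_lists(a, b):
    """Intersect two sorted RID lists in one pass (RID list ANDing)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def or_rid_lists(a, b):
    """Merge two sorted RID lists, dropping duplicates (RID list ORing)."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result


# Hypothetical RID lists: hobby IN ('VOLLEYBALL','CHESS') is an OR of two
# index entries, which is then ANDed with the education = 'G' entry.
volleyball_rids = [1, 4, 9]
chess_rids = [2, 4, 10]
graduate_rids = [2, 3, 9]

hobby_rids = or_rid_lists(volleyball_rids, chess_rids)
qualifying_rids = and_rid_lists(hobby_rids, graduate_rids)
```

Both operations run in a single pass only because the inputs are already sorted, which is where the sorting overhead mentioned above comes from when an index does not store its RIDs in order.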

Bottom Line
Optimizer sophistication is critical in effectively exploiting indexes.
Selectivity of indexes is critical in determining their usefulness.
Indexed access paths are not nearly as useful in data warehousing as they are in OLTP workloads.