CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing.

Slides:



Advertisements
Similar presentations
Overview of Storage and Indexing
Advertisements

1 Overview of Indexing Chapter 8 – Part II. 1. Introduction to indexing 2. First glimpse at indices and workloads.
Overview of Storage and Indexing
1 Physical Design: Indexing Yanlei Diao UMass Amherst Feb 13, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
DB performance tuning using indexes Section 8.5 and Chapters 20 (Raghu)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
File Organizations and Indexing Lecture 5 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
1 Overview of Storage and Indexing Chapter 8 (part 1)
1 Physical Database Design and Tuning Module 5, Lecture 3.
1 File Organizations and Indexing Module 4, Lecture 2 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander.
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
1 Overview of Indexing Chapter 8 – Part II. 1. Introduction to indexing 2. First glimpse at indices and workloads.
1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Storage and Indexing February 26 th, 2003 Lecture 19.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Database Systems An Overview of Storage and Indexing
1 IT420: Database Management and Organization Storage and Indexing 14 April 2006 Adina Crăiniceanu
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 “How index-learning turns no student pale Yet holds.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Queries, Database Design, Constraint Enforcement Specify Schema + specify constraints.
Physical Database Design I, Ch. Eick 1 Physical Database Design I About 25% of Chapter 20 Simple queries:= no joins, no complex aggregate functions Focus.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Storage and Indexing1 Overview of Storage and Indexing.
1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander Pope ( )
DATABASE MANAGEMENT SYSTEMS TERM B. Tech II/IT II Semester UNIT-VIII PPT SLIDES Text Books: (1) DBMS by Raghu Ramakrishnan (2) DBMS by Sudarshan.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
1 Overview of Storage and Indexing Chapter 8. 2 Data on External Storage  Disks: Can retrieve random page at fixed cost  But reading several consecutive.
Overview of Storage and Indexing Content based on Chapter 4 Database Management Systems, (Third Edition), by Raghu Ramakrishnan and Johannes Gehrke. McGraw.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
1 Database Systems ( 資料庫系統 ) October 4, 2004 Lecture #4 By Hao-hua Chu ( 朱浩華 )
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “If you don’t find it in the index, look very.
1 Database Systems ( 資料庫系統 ) October 29/31, 2007 Lecture #6.
Physical Database Design I, Ch. Eick 1 Physical Database Design I Chapter 16 Simple queries:= no joins, no complex aggregate functions Focus of this Lecture:
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
File Organizations and Indexing
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
1 Clustered vs. Unclustered Index Index entries Data entries direct search for (Index File) (Data file) Data Records data entries Data entries Data Records.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
1 Overview of Storage and Indexing Chapter 8. 2 Review: Architecture of a DBMS  A typical DBMS has a layered architecture.  The figure does not show.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “If you don’t find it in the index, look very.
1 CS122A: Introduction to Data Management Lecture #15: Physical DB Design Instructor: Chen Li.
CS222: Principles of Data Management Lecture #4 Catalogs, Buffer Manager, File Organizations Instructor: Chen Li.
CS522 Advanced database Systems
Record Storage, File Organization, and Indexes
CS522 Advanced database Systems
Introduction to Database Systems File Organization and Indexing
Overview of Storage and Indexing
CS222: Principles of Data Management Notes #09 Indexing Performance
File Organizations and Indexing
File Organizations and Indexing
Chapter 8 – Part II. A glimpse at indices and workloads
Overview of Storage and Indexing
CS222P: Principles of Data Management Notes #09 Indexing Performance
Database Systems October 20, 2008 Lecture #6.
Overview of Storage and Indexing
File Storage and Indexing
Database Systems (資料庫系統)
Storage and Indexing.
General External Merge Sort
Overview of Storage and Indexing
Overview of Storage and Indexing
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #08 Comparisons of Indexes and Indexing Performance Instructor: Chen Li.
Overview of Storage and Indexing
CS222P: Principles of Data Management UCI, Fall 2018 Notes #04 Schema versioning and File organizations Instructor: Chen Li.
Presentation transcript:

CS522 Advanced database Systems Huiping Guo Department of Computer Science California State University, Los Angeles 3. Overview of data storage and indexing (2)

3. Overview(2) CS522_S Outline  Comparison of file organizations  Choice of indices  Index only plans

3. Overview(2) CS522_S File organizations and operation cost  Which file organization is good? m Each file organization makes certain operations efficient but other operations expensive  Compare the costs of some simple operations for some basic file organizations m Example table Employees: (name, age, sal) m Assume files and indices are organized according to the composite search key m All selection operations are specified on age and sal  Goal m Emphasize the importance of an appropriate file organization

Operations  Scan m Fetch all records in the file to memory (buffer pool)  Search with equality selection m Fetch all records that satisfy an equality selections m Locate qualifying records  Search with range selection m Fetch all records that satisfy a range selection  Insert a record m Indentify the page in the file into which the new record must be inserted m Fetch that page from disk m Modify it to include the new record m Write back the modified page  Delete a record 3. Overview(2) CS522_S16 3-4

3. Overview(2) CS522_S16 Cost Model for Our Analysis  B: The number of data pages  R: Number of records per page  D: (Average) time to read or write disk page  C: (Average) time to process a record)  Measuring number of page I/O’s ignores gains of pre-fetching a sequence of pages; thus, even I/O cost is only approximated.  Average-case analysis; based on several simplistic assumptions. 3-5

3. Overview(2) CS522_S16 Comparing File Organizations  Heap files (random order; insert at eof)  Sorted files, sorted on  Clustered B+ tree file, Alternative (1), search key  Heap file with unclustered B + tree index on search key  Heap file with unclustered hash index on search key 3-6

3. Overview(2) CS522_S16 Heap files  Scan m Cost: B(D+RC)  Search with equality selection m Assumption: exactly one match. m On average, half of the file is scanned m Cost: 0.5B(D+RC)  Search with range selection m The entire must be scanned m Cost: B(D+RC) 3-7

3. Overview(2) CS522_S Heap files  Insert m Fetch the last page, add the record and write the page back m Cost: 2D+C  Delete m Find the record, remove the record from the page and write the modified page back. m Cost: (search cost) + D + C

3. Overview(2) CS522_S Sorted files  Scan m Cost: B(D+RC)  Search with equality selection m Assumption: equality search matches the sort order m Locate the page: log 2 B (binary search) m Find the record: log 2 R m Cost: Dlog 2 B+C log 2 R  Search with range selection m Assume that the range selection mathches the composite key m Cost: cost of search + cost of retrieving matching records

3. Overview(2) CS522_S Sorted files  Insert m Assumption: the new record belongs in the middle of the file m Need to fetch the latter half of the file and write them back after adding the new record m Cost: cost of searching the position of the new record+2*(0.5B(D+RC))  Delete m Cost: same as that for an insert

3. Overview(2) CS522_S Clustered B+ tree, alternative 1  Page occupancy m The pages are usually at bout 67% occupancy m The number of data pages is about 1.5B  Scan m Cost: 1.5B(D+RC)  Search with equality selection m Assumption: each index node has F children m Cost of finding the page containing the desired record: Dlog F 1.5B m Cost of find the desired record: Clog 2 R m Cost: Dlog F 1.5B + Clog 2 R + the cost of reading all the qualifying records

3. Overview(2) CS522_S Clustered B+ tree, alternative 1  Search with range selection m Cost: cost of locating the first qualifying record + cost of retrieving other qualifying records  Insert: m Assumption: the leaf page has sufficient space for the new record m Find the correct leaf page, add the record and write it back m Cost: Dlog F 1.5B + Clog 2 R + D  Delete: m Same as that for an insert

3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2  Assumption m a data entry is 0.1 the size of a data record m 67% occupancy of leaf pages m Then the number of leaf pages: 0.1(1.5B) = 0.15B m The number of data entries in a page: 10(0.67R) = 6.7R  Scan m If use index Each data entry may point to a different data page Cost of read all data entries: 0.15B(D+6.7RC) Cost of read all records: BR(D+C) m Better ignoring the index and Scanning the data file directly

3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2  Search with equality selection m Cost of locating the right leaf page: Dlog F 0.15B m Cost of locating the right data entry: Clog 2 6.7R m Cost: Dlog F 0.15B+Clog 2 6.7R+D  Search with range selection m When should we use index? If 10% percent of data records satisfy the selection condition, we are better off ignoring the index

3. Overview(2) CS522_S Heap file with unclustered tree index: alternative 2  Insert m Insert the record to data file: 2D+C m Insert the corresponding data entry to the index file: DlogF0.15B+Clog 2 6.7R+D m Cost: 2D+C+DlogF0.15B+Clog 2 6.7R+D  Delete m Same as that for an insert

3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2  Assumptions m No overflow pages m A data entry is 0.1 the size the a data record m Pages are kept at 80% occupancy Leave space for future insertions Minimize overflows as the file expands m Then, the number of pages for data entries: 1.25(0.1B)=0.125B m The number of data entries in a page: 10(0.8R)=8R  Scan m Don’t use hash index

3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2  Search with equality selection m Suppose the cost of hash calculation is H m Assumptions The bucket only consists of one page. D We find the data entry after scanning half the entries: 0.5(8RC) = 4RC m Fetch the data record for the data file: D m Cost: H+2D+4RC  Search with range selection m The hash index doesn’t help

3. Overview(2) CS522_S Heap file with unclustered hash index: alternative 2  Insert m Insert the record in the data file: 2D+C m Insert the data entry to the index file: H+2D+C m Cost: H+4D+2C  Delete m Find the data entry in the index and locate the data record in the data file: H+2D+4RC m Write out the modified index page and data page: 2D m Cost: H+4D+4RC

3. Overview(2) CS522_S Comparison of I/O costs  A heap file has good storage efficiency and supports fast scanning and insertion of records, however, it’s slow for searches and deletions  A sorted file also offers good storage efficiency, but insertion and deletion of records is slow. Searches are faster than in heap files  A clustered file offers all the advantages of a sorted file and supports inserts and deletes efficiently. Searches are even faster than in sorted files

3. Overview(2) CS522_S Comparison of I/O costs  Unclustered tree and hash indexes offer fast searches, insertion and deletion, but scans and range searches with many matches are slow. Hash indexes are a litter faster on equality searches, but they don’t support range queries  In summary, no one file organization is uniformly superior to all situations

Indices and performance tuning  How to choose indices to improve performance in a database system?  Workload m Mix of queries and update operation 3. Overview(2) CS522_S

3. Overview(2) CS522_S16 Understanding the Workload  For each query in the workload m Which relations does it access? m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be?  For each update in the workload m Which attributes are involved in selection/joint condition m How selective are these conditions likely to be? m The type of update (insert/delete/update), and the attributes that are affected 3-22

3. Overview(2) CS522_S16 Choice of Indices  What indices should we create? m Which relations should have indices? m What field(s) should be the search key? m Should we build several indices?  For each index, what kind of index should it be? m Clustered/Unclustered? m Hash/tree? 3-23

3. Overview(2) CS522_S16 Index Selection Guidelines  Attributes in WHERE clause are candidates for index keys. m Exact match condition suggests hash index. m Range query suggests tree index. Clustering is especially useful for range queries; can also help on equality queries if there are many duplicates.  Multi-attribute search keys should be considered when a WHERE clause contains several conditions. m Order of attributes is important for range queries. m Such indexes can sometimes enable index-only strategies for important queries. For index-only strategies, clustering is not important!  Try to choose indexes that benefit as many queries as possible. Since only one index can be clustered per relation, choose it based on important queries that would benefit the most from clustering. 3-24

Example  B+tree index on E.age  Is the index worthwhile m Depends on selectivity of the condition  Is the index useful m Depends on whether the index is clustered 3. Overview(2) CS522_S SELECT E.dno FROM Emp E WHERE E.age>40

Example  Consider the GROUP BY query.  If many tuples have E.age > 10, using E.age index and sorting the retrieved tuples may be costly.  Clustered E.dno index may be better! 3. Overview(2) CS522_S SELECT E.dno, COUNT(*) FROM Emp E WHERE E.age>10 GROUP BY E.dno

3. Overview(2) CS522_S Example SELECT E.dno, COUNT(*) FROM Emp E WHERE E.hobby=Stamps GROUP BY E.dno  Clustering on E.hobby helps  Clustering is also important for an index on a search key that does not include a candidate key

3. Overview(2) CS522_S Index with composite search keys  Composite Search Keys m Search on a combination of fields. m Equality query Each field in the search key is bound to a constant m Range query Not all fields in the search key are bound to constants Ex: age=20, any value is acceptable for sal fields

3. Overview(2) CS522_S Examples of composite key indexes sue1375 bob cal joe nameagesal 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80, Data records sorted by name Data entries in index sorted by Data entries sorted by

3. Overview(2) CS522_S Examples of composite key indexes (Cont.) To retrieve Emp records with m age=30 AND sal=4000 Index on m 20<age<30 AND 3000<sal<5000 Clustered tree index on or m Age=30 AND 3000<sal<5000 Clustered index on or Clustered index on ?

3. Overview(2) CS522_S16 Index-Only Plans  A number of queries can be answered without retrieving any tuples from one or more of the relations involved if a suitable index is available. SELECT E.dno, COUNT (*) FROM Emp E GROUP BY E.dno SELECT E.dno, MIN (E.sal) FROM Emp E GROUP BY E.dno Tree index! 3-31

3. Overview(2) CS522_S16 Index-Only Plans (Contd.)  Index-only plans are possible if the key is or we have a tree index with key m Which is better? m What if we consider the second query? SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUP BY E.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUP BY E.dno 3-32