Introduction to Indexes Rui Zhang The University of Melbourne Aug 2006.

Slides:



Advertisements
Similar presentations
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 How index-learning turns no student pale Yet holds.
Advertisements

Storing Data: Disks and Files
Disk Storage, Basic File Structures, and Hashing
Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
File Organizations and Indexing Chapter 8
Overview of Storage and Indexing
File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for.
Tree-Structured Indexes
Chapter 7 Indexing Structures for Files Copyright © 2004 Ramez Elmasri and Shamkant Navathe.
Number bonds to 10,
CMU SCS : Multimedia Databases and Data Mining Lecture #4: Multi-key and Spatial Access Methods - I C. Faloutsos.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
Introduction to Database Systems1 Records and Files Storage Technology: Topic 3.
1 Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes November 14, 2007.
1 Storing Data: Disks and Files Yanlei Diao UMass Amherst Feb 15, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Chapter 8 File organization and Indices.
1 File Organizations and Indexing Module 4, Lecture 2 “How index-learning turns no student pale Yet holds the eel of science by the tail.” -- Alexander.
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
1 Lecture 18: Indexes Monday, November 10, Midterm Problem 1a: select student.sname, avg(takes.grade) from student, takes where student.sid =
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Introduction to Database Systems1 Indexing Techniques Storage Technology: Topic 4.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Layers of a DBMS Query optimization Execution engine Files and access methods Buffer management Disk space management Query Processor Query execution plan.
Storage and Indexing February 26 th, 2003 Lecture 19.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8.
Indexing - revisited CS 186, Fall 2012 R & G Chapter 8.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
1 Physical Data Organization and Indexing Lecture 14.
1 Chapter 17 Disk Storage, Basic File Structures, and Hashing Chapter 18 Index Structures for Files.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 “How index-learning turns no student pale Yet holds.
Database Management Systems,Shri Prasad Sawant. 1 Storing Data: Disks and Files Unit 1 Mr.Prasad Sawant.
Chapter Ten. Storage Categories Storage medium is required to store information/data Primary memory can be accessed by the CPU directly Fast, expensive.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Overview of Storage and Indexing Content based on Chapter 4 Database Management Systems, (Third Edition), by Raghu Ramakrishnan and Johannes Gehrke. McGraw.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8 “How index-learning turns no student pale Yet.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Spring 2003 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
File Organizations and Indexing
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
CS4432: Database Systems II
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.
Data Indexing Herbert A. Evans.
Indexing Structures for Files and Physical Database Design
CS522 Advanced database Systems
Record Storage, File Organization, and Indexes
CS 540 Database Management Systems
Indexing Goals: Store large files Support multiple search keys
Indexing and hashing.
CS522 Advanced database Systems
File organization and Indexing
Chapter 11: Indexing and Hashing
File Organizations and Indexing
File Organizations and Indexing
Lecture 21: Indexes Monday, November 13, 2000.
Lecture 19: Data Storage and Indexes
CSE 544: Lecture 11 Storing Data, Indexes
Storage and Indexing.
Indexing 4/11/2019.
Indexing February 28th, 2003 Lecture 20.
Lecture 20: Indexes Monday, February 27, 2006.
Presentation transcript:

Introduction to Indexes Rui Zhang The University of Melbourne Aug 2006

2 Copyright Many contents are from Database Management Systems, 3rd edition. Raghu Ramakrishnan & Johannes Gehrke McGraw-Hill, Some slides are from Prof. Beng Chin Ooi’s lecture notes for CS5226 in NUS Some are from the internet

3 Outline Cost Model and File Organization The memory hierarchy Magnetic disk Cost model Heap files The concept of index Indexes : General Concepts and Properties Queries, keys and search keys Simple index files Properties of indexes Indexed sequential access method (ISAM) B + -tree Structure Search Insertion Deletion Other basic indexes Hashing Bitmap

4 The Memory Hierarchy CPU L1 Cache CPU Die L2 Cache Main Memory Hard Disk Registers Intel PIII CPU: 450MHz Memory: 512MB 40 ns 90 ns 14 ms = ns Cost is affected significantly by disk accesses !

5 The Hard (Magnetic) disk The time for a disk block access, or disk page access or disk I/O access time = seek time + rotational delay + transfer time IBM Deskstar 14GPX: 14.4GB Seek time: 9.1 msec Rotational delay: 4.17 msec Transfer rate: 13MB/sec, that is, 0.3msec/4KB

6 The Cost Model Cost measure: number of page accesses Objective A simple way to estimate the cost (in terms of execution time) of database operations Reason Page access cost is usually the dominant cost of database operations An accurate model is too complex for analyzing algorithms Note This cost model is for disk based databases; NOT applicable to main memory databases Blocked access: sequential access

7 Basic File Organization : Heap Files File : a logical collection of data, physically stored as a set of pages. Heap File (Unordered File) Linked list of pages The DBMS maintains the header page: Operations Insertion Deletion Search Advantage: Simple Disadvantage: Inefficient Header Page Data Page Data Page Data Page Data Page Data Page Data Page Pages with free space Full pages

8 The Concept of Index Searching a Book… Index A data structure that helps us find data quickly Note Can be a separate structure (we may call it the index file) or in the records themselves. Usually sorted on some attribute Index by Author Index by Title Database by Rao Database by Rui Programming by Alistair CatalogsBooks Where are books by Rui ? Where is “Database” ?

9 Query, Key and Search Key Queries Exact match (point query) Q1: Find me the book with the name “Database” Range query Q2: Find me the books published between year Searching methods Sequential scan - too expensive Through index – if records are sorted on some attribute, we may do a binary search If sorted on “book name”, then we can do binary search for Q1 If sorted on “year published”, then we can do binary search for Q2 Key vs. Search key Key: the indexed attribute Search key: the attribute queried on

10 Simple Index File (Clustered, Dense) Sorted records Dense index

11 Simple Index File (Clustered, Sparse) Sorted records Sparse index

12 Simple Index File (Clustered, Multi-level) Sorted records Sparse 2nd level

13 Simple Index File (Unclustered, Dense) Unsorted records Dense index sparse high level

14 Simple Index File (Unclustered, Sparse ?) does not make sense! Unsorted records Sparse index

15 Properties of Indexes Alternatives for entries in an index An actual data record A pair (key, ptr) A pair (key, ptr-list) Clustered vs. Unclustered indexes Dense vs. Sparse indexes Dense index on clustered or unclustered attributes Sparse index only on clustered attribute Primary vs. Secondary indexes Primary index on clustered attribute Secondary index on unclustered attribute

16 Indexes on Composite Keys Q3: age=20 & sal=10 Index on two or more attributes: entries are sorted first on the first attribute, then on the second attribute, the third … Q4: age=20 & sal>10 Q5: sal=10 & age>20 Note Different indexes are useful for different queries sue1375 bob cal joe nameagesal 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80, Data records sorted by name Examples of composite key indexes using lexicographic order.

17 Indexed sequential access method (ISAM) Tree structured index Support queries Point queries Range queries Problems Static: inefficient for insertions and deletions Data pages Overflow page

18 The B + -Tree: A Dynamic Index Structure Grows and shrinks dynamically Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. Height-balanced Insert/delete at log f N cost (f = fanout, N = No. leaf pages) Pointers to sibling pages Non-leaf pages (internal pages) Leaf pages If directory page, same as non-leaf pages; pointers point to data page addresses If data page P 0 K 1 P 1 K 2 P 2 K 3 … K m P m K 0 D 0 K 1 D 1 K 2 D 2 … K m D m

19 Searching in a B + -Tree Search begins at root, and key comparisons direct it to a leaf (as in ISAM) Search for 5, 15, all data entries >= What about all entries <= 24 ? Root

20 Insertion in a B + -Tree Find correct leaf L Put data entry onto L If L has enough space, done! Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L. This can happen recursively To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “ grow ” tree; root split increases height. Tree growth: gets wider or one level taller at top.

21 Inserting 7 & 8 into the B + -Tree Observe how minimum occupancy is guaranteed in both leaf and index pg splits Root (Note that 5 is copied up and continues to appear in the leaf.)

22 Inserting 8 into the B + -Tree (continued) Note the difference between copy up and push up (17 is pushed up and only appears once in the index. Contrast this with a leaf split.)

23 The B + -Tree After Inserting 8 Note that root was split, leading to increase in height We can avoid splitting by re-distributing entries. However, this is usually not done in practice. Why? 23 Root

24 Deletion in a B + -Tree Start at root, find leaf L where the entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, Try to redistribute, borrowing from sibling (adjacent node with same parent as L). If redistribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height.

25 Deleting 19 & 20 from the B + -Tree Deleting 20 is done with re-distribution. Note how the middle key is copied up 23 Root Root

26 Deleting 24 from the B + -Tree Merge happens Observe ‘ toss ’ of index entry (on right), and ‘ pull down ’ of index entry (below). Note the decrease in height Root

27 Summary of the B + -Tree A dynamic structure – height balanced – robustness Scalability Typical order: 100. Typical fill-factor: 67%. average fanout = 133 Typical capacities (root at Level 1, and has 133 entries) Level 5: 1334 = 312,900,700 records Level 4: 1333 = 2,352,637 records Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 Mbytes Efficient point and range queries – performance Concurrency Essential properties of a DBMS

28 Other Basic Indexes: Hash Tables Static hashing: static table size, with overflow chains Extendible hashing: dynamic table size, with no overflows Efficient point query, but inefficient range query... M-1 0  (1 +  )

29 Other Basic Indexes: Bitmap Bitmap with bit position to indicate the presence of a value Advantages Efficient bit-wise operations Efficient for aggregation queries: counting bits with 1 ’ s More compact than trees-- amenable to the use of compression techniques Limitations Only good for domain of small cardinality MF cusidnamegenderrating 112JoeM3 115RamM5 119SueF5 120WooM genderrating

30 B + -Tree For All and Forever? Can the B+-tree being a single-dimensional index be used for emerging applications such as: Spatial databases High-dimensional databases Temporal databases Main memory databases String databases Genomic/sequence databases... Coming next … Indexing Multidimensional Data