Data Organization - B-trees. A simple index Brighton A-217 700 Downtown A-101 500 Downtown A-110 600 Mianus A-215 700 Perry A-102 400...... A-101 A-102.

Slides:



Advertisements
Similar presentations
Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for.
Advertisements

CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
B+-Trees and Hashing Techniques for Storage and Index Structures
Hashing and Indexing John Ortiz.
Data Organization - B-trees. 11.2Database System Concepts A simple index Brighton A Downtown A Downtown A Mianus A Perry.
1 Tree-Structured Indexes Module 4, Lecture 4. 2 Introduction As for any index, 3 alternatives for data entries k* : 1. Data record with key value k 2.
1 Lecture 8: Data structures for databases II Jose M. Peña
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
COMP 451/651 Indexes Chapter 1.
Chapter 9 of DBMS First we look at a simple (strawman) approach (ISAM). We will see why it is unsatisfactory. This will motivate the B+Tree Read 9.1 to.
Introduction to Database Systems1 Indexing Techniques Storage Technology: Topic 4.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Range Searches  `` Find all students with gpa > 3.0 ’’  If data is in sorted file, do.
CS 4432lecture #71 CS4432: Database Systems II Lecture #7 Professor Elke A. Rundensteiner.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
Primary Indexes Dense Indexes
ISAM: Indexed-Sequential-Access-Method
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
1 CS 728 Advanced Database Systems Chapter 17 Database File Indexing Techniques, B- Trees, and B + -Trees.
CS4432: Database Systems II
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,
Storage and Indexing February 26 th, 2003 Lecture 19.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Index Structures for Files Indexes speed up the retrieval of records under certain search conditions Indexes called secondary access paths do not affect.
Adapted from Mike Franklin
Comp 335 File Structures B - Trees. Introduction Simple indexes provided a way to directly access a record in an entry sequenced file thereby decreasing.
1 Tree Indexing (1) Linear index is poor for insertion/deletion. Tree index can efficiently support all desired operations: –Insert/delete –Multiple search.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Index tuning-- B+tree. overview Overview of tree-structured index Indexed sequential access method (ISAM) B+tree.
Temple University – CIS Dept. CIS331– Principles of Database Systems V. Megalooikonomou Indexing and Hashing I (based on notes by Silberchatz, Korth, and.
Index Tuning Conventional index. Overview.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
1 CSCE 520 Test 2 Info Indexing Modified from slides of Hector Garcia-Molina and Jeff Ullman.
CS4432: Database Systems II
1 Ullman et al. : Database System Principles Notes 4: Indexing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 10.
Data Organization - B-trees
Indexing Goals: Store large files Support multiple search keys
Indexing and hashing.
Tree-Structured Indexes: Introduction
Tree-Structured Indexes
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Lecture 20: Indexing Structures
C. Faloutsos Indexing and Hashing – part I
CS222P: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
(Slides by Hector Garcia-Molina,
B+-Trees and Static Hashing
Lecture 21: Indexes Monday, November 13, 2000.
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
Adapted from Mike Franklin
Indexing 1.
CS222/CS122C: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
Storage and Indexing.
Indexing 4/11/2019.
General External Merge Sort
Indexing February 28th, 2003 Lecture 20.
Lecture 20: Indexes Monday, February 27, 2006.
Tree-Structured Indexes
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #05 Index Overview and ISAM Tree Index Instructor: Chen Li.
Presentation transcript:

Data Organization - B-trees

A simple index Brighton A Downtown A Downtown A Mianus A Perry A A-101 A-102 A-110 A-215 A Index of depositors on acct_no Index records: To answer a query for “acct_no=A-110” we: 1. Do a binary search on index file, searching for A “Chase” pointer of index record Index file

Index Choices 1. Primary: index search key = physical order search key vs Secondary: all other indexes Q: how many primary indices per relation? 2. Dense: index entry for every search key value vs Sparse: some search key values not in the index 3. Single level vs Multilevel (index on the indices)

Measuring ‘ goodness ’ On what basis do we compare different indices? 1. Access type: what type of queries can be answered: –selection queries (ssn = 123)? –range queries ( 100 <= ssn <= 200)? 2. Access time: what is the cost of evaluating queries –Measured in # of block accesses 3. Maintenance overhead: cost of insertion / deletion? (also BA ’ s) 4. Space overhead : in # of blocks needed to store the index

Indexing Primary (or clustering) index on SSN

Indexing Secondary (or non-clustering) index: duplicates may exist Address-index Can have many secondary indices but only one primary index

Indexing secondary index: typically, with ‘postings lists’ Postings lists

Indexing Primary/sparse index on ssn (primary key) >=123 >=456

Indexing Secondary / dense index Secondary on a candidate key: No duplicates, no need for posting lists

Summary DenseSparse Primaryrareusual secondaryusual All combinations are possible at most one sparse index as many as desired dense indices usually: one primary index (probably sparse) and a few secondary indices (dense)

ISAM >=123 >=456 block 2 nd level sparse index on the values of the 1 st level What if index is too large to search in memory?

Another example [12, Mark, 3.5] [16, John, 3] [18, Azer, 3.6] [21, Wayne, 3.4] [22, Margrit 3.8] [24, Stan, 3.7] [26, Leonid, 4] [27, Leo, 3.9] [31, Peter, 3.5] [33, Steve, 3] [34, George, 3.99] [35, Rich, 3.4] [42, Scott, 2.9] [44, Murat, 3.8] [45, Ibrahim, 3.6] [46, Gene, 3.4] [49, Stan, 3.5] [54, Mike, 3.89] [56, Teo, 3.6] [58, Jose, 3.4]

ISAM - observations What about insertions/deletions? >=123 >= ; peterson; fifth ave.

ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows Problems?

ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows overflow chains may become very long - what to do?

ISAM - observations What about insertions/deletions? 124; peterson; fifth ave. overflows overflow chains may become very long - thus: shut-down & reorganize start with ~80% utilization

ISAM - observations if index is too large, store it on disk and keep index on the index (in memory) usually two levels of indices, one first- level entry per disk block typically, blocks: 80% full initially (what are potential problems / inefficiencies?)

Tree Index file may still be quite large. But we can apply the idea repeatedly! * Leaf pages contain data entries. P 0 K 1 P 1 K 2 P 2 K m P m index entry Non-leaf Pages Overflow page Primary pages Leaf

B+ Tree: The Most Widely Used Index Insert/delete at log F N cost; keep tree height- balanced. (F = fanout, N = # leaf pages) Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries. The parameter d is called the order of the tree. Supports equality and range-searches efficiently. Index Entries Data Entries ("Sequence set") (Direct search)

Example B+ Tree Search begins at root, and key comparisons direct it to a leaf (as in ISAM). Search for 5*, 15*, all data entries >= 24*... * Based on the search for 15*, we know it is not in the tree! Root * 3*5* 7*14*16* 19*20* 22*24*27* 29*33*34* 38* 39* 13

B+ Trees in Practice Typical order: 100. Typical fill-factor: 67%. –average fanout = 133 Typical capacities: –Height 4: = 312,900,700 records –Height 3: = 2,352,637 records Can often hold top levels in buffer pool: –Level 1 = 1 page = 8 Kbytes –Level 2 = 133 pages = 1 Mbyte –Level 3 = 17,689 pages = 133 MBytes

Inserting a Data Entry into a B+ Tree Find correct leaf L. Put data entry onto L. –If L has enough space, done! –Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L. This can happen recursively –To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. –Tree growth: gets wider or one level taller at top.

Example B+ Tree - Inserting 8* Root * 3*5* 7*14*16* 19*20* 22*24*27* 29*33*34* 38* 39* 13

Example B+ Tree - Inserting 8* v Notice that root was split, leading to increase in height. v In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

Inserting 8* into Example B+ Tree Observe how minimum occupancy is guaranteed in both leaf and index pg splits. Note difference between copy- up and push-up; be sure you understand the reasons for this. 2* 3* 5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is continues to appear in the leaf.) s copied up and appears once in the index. Contrast Entry to be inserted in parent node. (Note that 17 is pushed up and only this with a leaf split.) … …

Deleting a Data Entry from a B+ Tree Start at root, find leaf L where entry belongs. Remove the entry. –If L is at least half-full, done! –If L has only d-1 entries, Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height.

Example Tree (including 8*) Delete 19* and 20*... Deleting 19* is easy. 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

Example Tree (including 8*) Delete 19* and 20*... Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how middle key is copied up. 2*3* Root *16* 33*34* 38* 39* 135 7*5*8* 22*24* 27 27*29*

... And Then Deleting 24* Must merge. Observe `toss’ of index entry (on right), and `pull down’ of index entry (below) *27* 29*33*34* 38* 39* 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root

Example of Non-leaf Re-distribution Tree is shown below during deletion of 24*. (What could be a possible initial tree?) In contrast to previous example, can re-distribute entry from left child of root to right child. Root *16* 17*18* 20*33*34* 38* 39* 22*27*29*21* 7*5*8* 3*2*

After Re-distribution Intuitively, entries are re-distributed by `pushing through’ the splitting entry in the parent node. It suffices to re-distribute index entry with key 20 through the parent (move 22 down and move 20 to the parent) 14*16* 33*34* 38* 39* 22*27*29* 17*18* 20*21* 7*5*8* 2*3* Root

Prefix Key Compression Important to increase fan-out. (Why?) Key values in index entries only `direct traffic’; can often compress them. –E.g., If we have adjacent index entries with search key values Dannon Yogurt, David Smith and Devarakonda Murthy, we can abbreviate David Smith to Dav. (The other keys can be compressed too...) Is this correct? Not quite! What if there is a data entry Davey Jones? (Can only compress David Smith to Davi) In general, while compressing, must leave each index entry greater than every key value (in any subtree) to its left. Insert/delete must be suitably modified.

Bulk Loading of a B+ Tree If we have a large collection of records, and we want to create a B+ tree on some field, doing so by repeatedly inserting records is very slow. –Also leads to minimal leaf utilization --- why? Bulk Loading can be done much more efficiently. Initialization: Sort all data entries, insert pointer to first (leaf) page in a new (root) page. 3* 4* 6*9*10*11*12*13* 20* 22* 23*31* 35* 36*38*41*44* Sorted pages of data entries; not yet in B+ tree Root

Bulk Loading (Contd.) Index entries for leaf pages always entered into right-most index page just above leaf level. When this fills up, it splits. (Split may go up right-most path to the root.) Much faster than repeated inserts, especially when one considers locking! 3* 4* 6*9*10* 11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root Data entry pages not yet in B+ tree * 4* 6*9*10* 11*12*13* 20*22* 23*31* 35* 36*38*41*44* 6 Root not yet in B+ tree Data entry pages

Summary of Bulk Loading Option 1: multiple inserts. –Slow. –Does not give sequential storage of leaves. Option 2: Bulk Loading –Has advantages for concurrency control. –Fewer I/Os during build. –Leaves will be stored sequentially (and linked, of course). –Can control “fill factor” on pages.

A Note on `Order’ Order (d) concept replaced by physical space criterion in practice (`at least half-full’). –Index pages can typically hold many more entries than leaf pages. –Variable sized records and search keys mean different nodes will contain different numbers of entries. –Even with fixed length fields, multiple records with the same search key value (duplicates) can lead to variable-sized data entries (if we use Alternative (3)). Many real systems are even sloppier than this --- only reclaim space when a page is completely empty.

Alternatives for Data Entry k* in Index Three alternatives:  Actual data record (with key value k)  Choice is orthogonal to the indexing technique. –Examples of indexing techniques: B+ trees, hash-based structures, R trees, … –Typically, index contains auxiliary information that directs searches to the desired data entries

Alternatives for Data Entries (Contd.) Alternative 1: Actual data record (with key value k) –If this is used, index structure is a file organization for data records (like Heap files or sorted files). –At most one index on a given collection of data records can use Alternative 1. –This alternative saves pointer lookups but can be expensive to maintain with insertions and deletions.

Alternatives for Data Entries (Contd.) Alternative 2 and Alternative 3 –Easier to maintain than Alt 1. –If more than one index is required on a given file, at most one index can use Alternative 1; rest must use Alternatives 2 or 3. –Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length. –Even worse, for large rid lists the data entry would have to span multiple blocks!