Indexing CS 400/600 – Data Structures

Slide 2: Memory and Disk
- Typical memory access: 30–60 ns
- Typical disk access: 3–9 ms
- Difference: a factor of 100,000 to 1,000,000!
- Over the past 10 years:
  - CPU speeds increased 30× (100 MHz to ~3 GHz)
  - Disk speeds increased ~2×
  - Disk sizes increased ~80×
- Disk speed relative to the amount of data stored has gotten much lower

Slide 3: Disk Organization
- As programmers, we see a logical file. It appears as a contiguous sequence of bytes.
- In reality, the corresponding physical file might be spread all over the disk.
- The file manager (part of the O/S) takes requests for logical files, accesses the corresponding physical files, and feeds the information to your program.

Slide 4: Disk Architecture
- All the corresponding tracks on each platter are collectively called a cylinder, and are accessed at the same time.

Slide 5: Constant Rotational Speed
- The smallest unit of I/O is a sector
- The disk spins at a constant speed, unlike a CD-ROM

Slide 6: Sector Interleaving
- After reading a sector, it takes time to process the data. Meanwhile, the disk is still rotating.
- The next sector is therefore not stored immediately after the current sector. It is placed further along the track, so that it is approaching the head when the processing completes.
- (Figure: a track laid out with interleaving factor = 3)
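
The layout idea can be sketched in a few lines. This is only an illustration: the function name `interleaved_layout` and the tiny 8-sector track are assumptions chosen for readability, not something from the slides.

```python
def interleaved_layout(num_sectors: int, factor: int) -> list[int]:
    """Return a list where layout[slot] is the logical sector stored in that physical slot."""
    layout = [-1] * num_sectors              # -1 marks a free physical slot
    slot = 0
    for logical in range(num_sectors):
        while layout[slot] != -1:            # slot already taken: slide to the next free one
            slot = (slot + 1) % num_sectors
        layout[slot] = logical
        slot = (slot + factor) % num_sectors  # next logical sector sits `factor` slots later
    return layout


print(interleaved_layout(8, 3))  # [0, 3, 6, 1, 4, 7, 2, 5]
```

Reading logical sectors 0 through 7 in order from this layout passes the start of the track three times, which is why a full-track read costs about three rotations when the interleaving factor is 3.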

Slide 7: Performing a Disk Read
- Seek time: time to move the head to the right cylinder
- Rotational latency: time for the correct sector to rotate under the head (disks spin at 3600, 5400, or 7200 RPM)
- Transfer time: time to read a sector of data
- By far, the highest cost is the seek time

Slide 8: Clusters
- Most operating systems don't allow you to allocate a single sector for a file
- The smallest unit of file allocation is a cluster
- A cluster is a group of consecutive sectors

Slide 9: Seek Times Revisited
- Disk manufacturers usually specify two kinds of seek times:
  - Track-to-track cost: moving from one track to the next track
  - Average seek time: the average for a random access
- With interleaving, it may take several rotations to read a full track of data!

Slide 10: Access Time Example
- 16.8 GB disk with 10 platters
- 13,085 tracks per platter
- 256 sectors per track, with 512 bytes per sector
- Track-to-track seek: 2.2 ms
- Average seek: 9.5 ms
- Interleaving factor: 3
- O/S specs: cluster size = 8 sectors (4 KB), so there are 32 clusters per track

Slide 11: Access Time Example (2)
- 5400 RPM = 11.1 ms per rotation
  - It takes 3 rotations to read a full track (32 clusters)
- Suppose we want to read a 1 MB file
  - 1 MB ÷ 4 KB (per cluster) = 256 clusters, which is 8 full tracks
- The performance depends on how the physical file is stored on the disk…
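
The derived quantities in this example follow directly from the drive parameters. A minimal sketch of that arithmetic, with constant names chosen here for illustration:

```python
RPM = 5400
SECTORS_PER_TRACK = 256
BYTES_PER_SECTOR = 512
SECTORS_PER_CLUSTER = 8
FILE_BYTES = 1024 * 1024                      # the 1 MB file from the example

rotation_ms = 60_000 / RPM                    # milliseconds per full rotation
cluster_bytes = SECTORS_PER_CLUSTER * BYTES_PER_SECTOR
clusters_per_track = SECTORS_PER_TRACK // SECTORS_PER_CLUSTER
file_clusters = FILE_BYTES // cluster_bytes
file_tracks = file_clusters // clusters_per_track

print(round(rotation_ms, 1), cluster_bytes, clusters_per_track, file_clusters, file_tracks)
```

The last value, 8 tracks, is where the "initial track plus 7 subsequent tracks" split in the contiguous case comes from.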

Slide 12: Contiguous Track Access
- Suppose the file is stored entirely on contiguous tracks…
- Initial track:
  - Seek to initial track: 9.5 ms
  - Rotate to start sector (on average ½ a rotation): 11.1/2 ms
  - Three rotations to read the track: 3 × 11.1 ms
  - Total: 48.4 ms
- Each of the subsequent 7 tracks needs only a track-to-track seek:
  - 2.2 ms + 11.1/2 ms + 3 × 11.1 ms = 41.1 ms
- Total: 48.4 ms + 7 × 41.1 ms ≈ 336 ms
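
The contiguous-case estimate can be checked with a short sketch. The function and constant names are illustrative assumptions; the numbers are the ones given in the example. Without rounding the intermediate values the total comes to 335.7 ms, or about 336 ms.

```python
AVG_SEEK_MS = 9.5
TRACK_TO_TRACK_MS = 2.2
ROTATION_MS = 11.1
INTERLEAVE = 3
TRACKS = 8                                    # 256 clusters / 32 clusters per track


def contiguous_read_ms(tracks: int = TRACKS) -> float:
    # Per track: half a rotation of latency, then INTERLEAVE rotations to read it.
    per_track = ROTATION_MS / 2 + INTERLEAVE * ROTATION_MS
    first = AVG_SEEK_MS + per_track           # full average seek for the first track
    rest = (tracks - 1) * (TRACK_TO_TRACK_MS + per_track)
    return first + rest


print(round(contiguous_read_ms(), 1))
```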

Slide 13: Random Track Access
- Suppose the clusters are spread randomly over the disk
  - Reading one cluster (8 sectors out of 256 per track) would require 8/256 of a rotation if the sectors were consecutive
  - With an interleaving factor of 3, the section of the track that the 8 sectors occupy is 3 times as long, so we need (3 × 8)/256 of a rotation to read a cluster, which is about 1 ms
- Total time: 256 × (9.5 + 11.1/2 + 11.1 × (3 × 8)/256) ms ≈ 4119 ms
- About 4 seconds, vs. about 1/3 of a second for the contiguous file
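
A sketch of the random-access estimate, assuming the same drive parameters (names are illustrative). Here the (3 × 8)/256 rotation fraction is multiplied by the 11.1 ms rotation time before it is added to the seek and latency terms, which gives a total of about 4.1 s. Adding the bare fraction without converting it to milliseconds would give about 3.9 s instead.

```python
AVG_SEEK_MS = 9.5
ROTATION_MS = 11.1
INTERLEAVE = 3
SECTORS_PER_CLUSTER = 8
SECTORS_PER_TRACK = 256


def random_read_ms(clusters: int = 256) -> float:
    # Fraction of a rotation swept while reading one interleaved cluster.
    read_fraction = INTERLEAVE * SECTORS_PER_CLUSTER / SECTORS_PER_TRACK
    # Every cluster pays an average seek, half a rotation of latency, and the read itself.
    per_cluster = AVG_SEEK_MS + ROTATION_MS / 2 + read_fraction * ROTATION_MS
    return clusters * per_cluster


print(round(random_read_ms()))
```

Either way, the scattered file is roughly an order of magnitude slower to read than the contiguous one, and almost all of that cost is seek time and rotational latency.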

Slide 14: Locality and Buffers
- The O/S often maintains a buffer pool that acts as a cache for the disk
- When a sector is read, subsequent sectors are cached in memory in buffers
- Reading subsequent buffers is fast
- The principle of locality means that we will often access the buffered data
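
A buffer pool can be sketched as a small cache keyed by block number. The slides do not name a replacement policy, so least-recently-used eviction is an assumption here, as are the class name `BufferPool` and the `read_block` callable standing in for a real disk read. The sketch also leaves out read-ahead of subsequent sectors.

```python
from collections import OrderedDict


class BufferPool:
    """Tiny LRU cache of disk blocks (illustrative only)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block          # hypothetical callable: block number -> data
        self.buffers = OrderedDict()          # block number -> data, oldest first
        self.disk_reads = 0

    def get(self, block_no):
        if block_no in self.buffers:          # hit: no disk access needed
            self.buffers.move_to_end(block_no)
            return self.buffers[block_no]
        self.disk_reads += 1                  # miss: go to the disk
        data = self.read_block(block_no)
        self.buffers[block_no] = data
        if len(self.buffers) > self.capacity:
            self.buffers.popitem(last=False)  # evict the least recently used block
        return data


pool = BufferPool(2, lambda n: f"block-{n}")
for n in [1, 1, 2, 1, 3, 2]:
    pool.get(n)
print(pool.disk_reads)  # 4 disk reads for 6 requests
```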

Slide 15: Rule of Thumb
- When processing data stored on disk: minimize the number of disk accesses
- If you must read from the disk, try to read from consecutive sectors.

Slide 16: Database Organization
- Primary key – a unique key for each record, like an employee ID
  - Often inconvenient for searching
- Secondary key(s) – non-unique indices, more convenient for search
- Common types of queries:
  - Simple select
  - Range query
  - Sort – visit all records in key-sorted order

Slide 17: Memory Indexes
- When data are small enough, indexes can be stored in memory

Slide 18: Linear Indexing
- If the indexes won't fit in memory, tree indexing is difficult, because random access is required.
- Linear files with key/pointer pairs
  - Sorted by key, so we can use binary search
  - Each pointer refers to the disk location, or to the primary key
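
A linear index is a sorted list of key/pointer pairs searched by binary search. A minimal sketch using Python's standard `bisect` module; the sample keys and byte offsets are made up for illustration.

```python
from bisect import bisect_left

# Parallel arrays: sorted keys and, for each key, the record's location on disk.
index_keys = [10, 20, 30, 40]
index_ptrs = [0, 512, 1024, 1536]             # hypothetical byte offsets


def lookup(keys, ptrs, key):
    """Binary-search the sorted keys; return the matching pointer or None."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return ptrs[i]
    return None


print(lookup(index_keys, index_ptrs, 30))     # 1024
print(lookup(index_keys, index_ptrs, 25))     # None
```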

Slide 19: Linear Indexing (2)
- If the index is large, binary search again requires random access
- A two-level index can be used to show which disk clusters contain each chunk of the primary linear index
- Problems:
  1) Updates are expensive.
  2) Duplicate secondary key values waste space, because each value is stored in the index.
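
One way to picture the two-level index: the top level holds the first key of each chunk of the primary index, so a lookup searches the small top level in memory and then touches only one chunk. The sketch below assumes in-memory lists stand in for disk clusters; the function names and the chunk size are illustrative.

```python
from bisect import bisect_left, bisect_right


def build_two_level(sorted_pairs, chunk_size):
    """Split sorted (key, pointer) pairs into chunks; top level = first key of each chunk."""
    chunks = [sorted_pairs[i:i + chunk_size]
              for i in range(0, len(sorted_pairs), chunk_size)]
    top = [chunk[0][0] for chunk in chunks]
    return top, chunks


def two_level_lookup(top, chunks, key):
    c = bisect_right(top, key) - 1            # last chunk whose first key is <= key
    if c < 0:
        return None                           # key is smaller than every indexed key
    chunk = chunks[c]                         # the only "cluster" that must be read
    keys = [k for k, _ in chunk]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return chunk[i][1]
    return None


pairs = [(k, k * 100) for k in range(1, 11)]
top, chunks = build_two_level(pairs, chunk_size=4)
print(top)                                    # [1, 5, 9]
print(two_level_lookup(top, chunks, 7))       # 700
```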

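Slide 20: Other linear indices
- 2-D linear index: one row per secondary key, listing the primary keys of its records
    Jones     AA10  AB12  AB39  FF37
    Smith     AX33  AX35  ZX45
    Zukowski  ZQ99
- Inverted list: each secondary key stores a pointer to the start of its list of primary keys (figure: Jones → 0, Smith → 1, Zukowski → 3)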
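
An inverted list maps each secondary key to the primary keys of the records that carry it. In the sketch below a Python dict stands in for the on-disk structure, and the grouping of primary keys under Jones, Smith, and Zukowski is an assumption read column by column from the slide's figure. The helper name `add_record` and the added keys are hypothetical.

```python
from bisect import insort

inverted = {
    "Jones": ["AA10", "AB12", "AB39", "FF37"],
    "Smith": ["AX33", "AX35", "ZX45"],
    "Zukowski": ["ZQ99"],
}


def add_record(index, secondary_key, primary_key):
    """Add a primary key under a secondary key, keeping each list sorted."""
    insort(index.setdefault(secondary_key, []), primary_key)


add_record(inverted, "Smith", "BB01")         # existing secondary key: only one list changes
add_record(inverted, "Adams", "CC22")         # new secondary key: a new entry is created
print(inverted["Smith"])                      # ['AX33', 'AX35', 'BB01', 'ZX45']
```

Each secondary key value is stored once no matter how many records share it, which is the space saving over a plain linear index.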

Slide 21: Problems with linear indexing
- Updates are expensive – a single update can require changing the position of every key in the index
- Inverted lists help, but insertion of new secondary key values is still expensive
  - OK when there are few secondary keys and lots of duplication
- Can we break the linear index into blocks so that updates will only require changing a single disk block?

Slide 22: ISAM
- Records are stored in primary key order on disk
- Memory table stores the lowest key of each cylinder
- Cylinder index stores the lowest key of each block
- IBM, 1960s
- (Figure: cylinder keys held in memory; each cylinder on disk holds a cylinder index, records, and an overflow area; a system overflow area follows the cylinders)
- Note the level of O/S control required.

Slide 23: Tree Indexing
- In order to store an index in a tree:
  - Subtrees should be in the same disk block
  - The tree should be height balanced (like an AVL tree) to prevent many disk accesses
  - Balancing should not require many random disk accesses
- 2-3 Trees → B-Trees → B+ Trees

Slide 24: 2-3 Trees
- A 2-3 tree is a search tree, similar to a BST, with the following properties:
  - Each node contains either one or two keys
  - Every internal node has either two children and one key, or three children and two keys
  - All leaves in the tree are at the same level

Slide 25: 2-3 tree search
- Similar to BST search, but a node sometimes holds two keys, which gives three subtrees to choose from
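
A minimal sketch of 2-3 tree search. The `Node23` class and the sample tree are assumptions made for illustration; they are not taken from the slide's figure.

```python
class Node23:
    def __init__(self, keys, children=None):
        self.keys = keys                      # one or two keys, in sorted order
        self.children = children or []        # empty for a leaf, else len(keys) + 1 subtrees


def search_23(node, key):
    """Return True if key is present in the 2-3 tree rooted at node."""
    while True:
        if key in node.keys:
            return True
        if not node.children:                 # reached a leaf without finding the key
            return False
        # Pick the subtree: count how many of this node's keys are smaller than the target.
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        node = node.children[i]


root = Node23([18, 33], [Node23([12]), Node23([23, 30]), Node23([48])])
print(search_23(root, 30), search_23(root, 20))   # True False
```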

Slide 26: 2-3 Tree Insert
- Like a BST, we find a leaf node for insertion
- Case 1: The leaf holds only one key, so the new key is simply added to it

Slide 27: 2-3 Insert (2)
- Case 2: The leaf node is full – node split
  - Create a new node for the largest of the 3 keys
    - The original node L gets the smallest key; the new node L′ gets the largest
  - Pass the middle key and a pointer to L′ to the parent
    - If there is room, the middle key is added to the parent in key order, and the pointer to L′ is placed just to the right of the pointer to L
    - Otherwise, split and promote again
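
The split step on its own is small enough to sketch. The function name `split_full_leaf` and the sample keys are hypothetical; handing the middle key to the parent is left to the caller.

```python
def split_full_leaf(leaf_keys, new_key):
    """Split a full 2-3 leaf (two keys) that is receiving a third key.

    Returns (keys kept by L, middle key to promote, keys for the new node L').
    """
    smallest, middle, largest = sorted(leaf_keys + [new_key])
    return [smallest], middle, [largest]


left, promoted, right = split_full_leaf([50, 52], 55)
print(left, promoted, right)                  # [50] 52 [55]
```

If the parent already holds two keys, the same three-way split is applied to the parent, which is how an insert cascades toward the root.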

Slide 28: 2-3 Insert (3)
- Insert 55

Slide 29: 2-3 Cascading Insert
- Insert 19

Slide 30: 2-3 Cascading Insert (2)

Slide 31: 2-3 Tree Delete
- 2-3 tree delete is similar to a BST delete, but is very complex
  - Deleting from a leaf node is easy
  - When deleting from an internal node, we can replace the key with a similar key from a subtree, much like a BST delete
  - If the subtree is sparse, there may be no node with a similar key that can be removed while still leaving at least one key per node
    - In this case, you must merge nodes together

Slide 32: B-Tree
- R. Bayer and E. McCreight, 1972
- By 1979, the standard file organization for applications requiring insertion, deletion, and key-range searches
- The 2-3 tree is a B-tree of order 3
  - That is, the B-tree is a generalization of the 2-3 tree

Slide 33: B-Tree Properties
- Always height-balanced, with all leaves at the same level
- Update and search affect only a few disk blocks
- Related keys are stored in the same disk block
  - Locality of reference is good
- Guarantees that every node is full to a minimum percentage

Slide 34: B-Tree Order
- For a B-tree of order m:
  - The root is either a leaf or has at least 2 children
  - Each internal node has between ⌈m/2⌉ and m children
  - All leaves are on the same level of the tree
- m is chosen so that a node fills a disk block
  - Often 100 or more children!
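
A rough sketch of how the order might be derived from the block size. It assumes a node stores m child pointers and m − 1 keys and nothing else (no header, no record data); the function names and the 8-byte key and pointer sizes are illustrative assumptions.

```python
import math


def child_bounds(m):
    """Minimum and maximum number of children of a non-root internal node of order m."""
    return math.ceil(m / 2), m


def order_for_block(block_bytes, key_bytes, pointer_bytes):
    """Largest m such that m pointers plus (m - 1) keys fit in one block."""
    # m * pointer_bytes + (m - 1) * key_bytes <= block_bytes
    return (block_bytes + key_bytes) // (key_bytes + pointer_bytes)


print(child_bounds(3))                        # (2, 3): the 2-3 tree
print(order_for_block(4096, 8, 8))            # 256 children per 4 KB block
```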

Slide 35: B-Tree Search
- Perform binary search on the keys in the current node. If the key is not found, follow the correct pointer and repeat.
- (Figure: a B-tree of order four)
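
The search loop can be sketched as follows. The `BTreeNode` class and the sample order-four tree are assumptions for illustration and are not copied from the slide's figure; in a real index each step down the tree would be one disk block read.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                      # sorted keys stored in this node
        self.children = children or []        # empty for a leaf, else len(keys) + 1 subtrees


def btree_search(node, key):
    """Return (node, position) where key is stored, or None if it is absent."""
    while True:
        i = bisect_left(node.keys, key)       # binary search within the node
        if i < len(node.keys) and node.keys[i] == key:
            return node, i
        if not node.children:                 # leaf reached: the key is not in the tree
            return None
        node = node.children[i]               # follow the pointer between the neighbouring keys


root = BTreeNode([24], [BTreeNode([15, 20]), BTreeNode([33, 45, 48])])
found = btree_search(root, 45)
print(found[0].keys, found[1])                # [33, 45, 48] 1
```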

Slide 36: B-tree insert
- Similar to 2-3 tree insert
- Find a leaf node
  - If there is room, insert
  - If not, split and promote the middle key to the parent
  - If the parent is full, repeat
- Guarantees that leaf nodes will always be at least ½ full.

Slide 37: B+ Trees
- Stores records or pointers only at the leaf nodes
  - Leaf nodes may store more or fewer than m records, depending on the relationship between record size and pointer size
  - Leaf nodes must remain at least half full
- Internal nodes have only key values
- Leaf nodes are linked in a doubly linked list
  - Makes range queries very easy!
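
The linked leaves are what make range queries simple: find the leaf holding the low end of the range, then walk the list. A minimal sketch that models only the leaf level; the `Leaf` class, the helper names, and the sample keys are illustrative, and the starting leaf is passed in directly rather than found by a tree search.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys                      # sorted keys (records omitted for brevity)
        self.prev = None
        self.next = None


def link_leaves(leaves):
    """Chain the leaves into a doubly linked list, in key order."""
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left


def range_query(start_leaf, low, high):
    """Collect all keys in [low, high], starting at the leaf where low would be found."""
    result = []
    leaf = start_leaf
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result                 # past the end of the range: stop walking
            if key >= low:
                result.append(key)
        leaf = leaf.next                      # sequential step to the neighbouring leaf
    return result


leaves = [Leaf([10, 12, 15]), Leaf([18, 19, 20]), Leaf([23, 30, 33]), Leaf([45, 47])]
link_leaves(leaves)
print(range_query(leaves[1], 19, 30))         # [19, 20, 23, 30]
```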

Slide 38: B+ Tree Search
- A search for 33 proceeds all the way down to the proper leaf node…

Slide 39: B+ Tree Delete
- If deletion causes underflow, borrow from a neighboring node, if possible
- Note the changed index value

Slide 40: B+ Tree Delete
- If neither sibling can donate, then the node gives its contents to a sibling and is deleted
- There must be room, because the node is less than half full and the sibling is at most half full

Slide 41: B+ Tree Delete
- This node now has only one child, so we borrow from the left subtree of the root

Slide 42: B* Tree
- Nodes in a B+ tree are always at least ½ full
- The B* tree is a variant with different merge/split rules, whose nodes are always at least 2/3 full.

Slide 43: B-Tree analysis
- Search, insert, and delete are Θ(log n), where n is the number of records in the tree.
- However, the base of the log is the branching factor, which is generally very high (often 100 or more), so the trees are very shallow and these operations are very efficient.
- To minimize disk access, the upper levels of the tree are often kept in memory, and blocks are stored in memory buffers
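
The effect of the log base is easy to see numerically. The sketch below counts levels under the best-case assumption that every node is completely full; the function name is illustrative, and integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def levels_needed(n_records, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers n_records."""
    levels, capacity = 1, fanout
    while capacity < n_records:
        capacity *= fanout
        levels += 1
    return levels


print(levels_needed(1_000_000, 2))            # 20 levels with a branching factor of 2
print(levels_needed(1_000_000, 100))          # 3 levels with a branching factor of 100
```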

Slide 44: Size Analysis
- Consider a B+ tree of order 100 with leaf nodes containing up to 100 records
  - One-level tree: up to 100 records
    - The root is a leaf
  - Two-level tree: 100 to 10,000 records
    - Two leaves with 50 records each, up to 100 leaves with 100 records each
  - Three-level tree: 5,000 to 1 million records
    - Two second-level nodes with 50 children containing 50 records each, up to 100 second-level nodes with 100 full children each
  - Four-level tree: 250,000 to 100 million records
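
These bounds follow a simple pattern that can be sketched in code. The function name is illustrative. For a one-level tree the minimum is taken as 1 here, on the assumption that a root that is also a leaf is exempt from the half-full rule; for deeper trees the root needs only two children while every other node is at least half full.

```python
import math


def record_range(levels, order=100, leaf_capacity=100):
    """Return (minimum, maximum) number of records in a B+ tree with this many levels."""
    max_records = leaf_capacity * order ** (levels - 1)
    if levels == 1:
        return 1, max_records                 # assumption: a root leaf may be nearly empty
    min_children = math.ceil(order / 2)       # non-root internal nodes are at least half full
    min_leaf = math.ceil(leaf_capacity / 2)   # leaves are at least half full
    # The root has at least 2 children; each further internal level multiplies by min_children.
    min_records = 2 * min_children ** (levels - 2) * min_leaf
    return min_records, max_records


for levels in range(1, 5):
    print(levels, record_range(levels))
```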