Lecture 19: Data Storage and Indexes

Slides:

Advertisements

Similar presentations

B+-Trees and Hashing Techniques for Storage and Index Structures

Advertisements

Introduction to Database Systems1 Records and Files Storage Technology: Topic 3.

1 Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes November 14, 2007.

Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.

COMP 451/651 Indexes Chapter 1.

Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.

1 Lecture 18: Indexes Monday, November 10, Midterm Problem 1a: select student.sname, avg(takes.grade) from student, takes where student.sid =

1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Recap of Feb 27: Disk-Block Access and Buffer Management Major concepts in Disk-Block Access covered: –Disk-arm Scheduling –Non-volatile write buffers.

1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)

1 Lecture 19: B-trees and Hash Tables Wednesday, November 12, 2003.

1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.

DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.

Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,

Storage and Indexing February 26 th, 2003 Lecture 19.

1 Indexing. 2 Motivation Sells(bar,beer,price )Bars(bar,addr ) Joe’sBud2.50Joe’sMaple St. Joe’sMiller2.75Sue’sRiver Rd. Sue’sBud2.50 Sue’sCoors3.00 Query:

Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.

CS 440 Database Management Systems Lecture 6: Data storage & access methods 1.

Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.

Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.

CS411 Database Systems Kazuhiro Minami 10: Indexing-1.

CS4432: Database Systems II

CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.

Tree-Structured Indexes. Introduction As for any index, 3 alternatives for data entries k*: – Data record with key value k –  Choice is orthogonal to.

Storage and File Organization

Module 11: File Structure

CS522 Advanced database Systems

Record Storage, File Organization, and Indexes

CS 540 Database Management Systems

Indexing Goals: Store large files Support multiple search keys

CS522 Advanced database Systems

Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.

Database Management Systems (CS 564)

Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.

CS222P: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.

Disk Storage, Basic File Structures, and Buffer Management

Database Implementation Issues

Disk storage Index structures for files

Lecture 12 Lecture 12: Indexing.

B+-Trees and Static Hashing

Lecture 21: Indexes Monday, November 13, 2000.

Lecture 15: Midterm Review Data Storage

CSE 544: Lectures 13 and 14 Storing Data, Indexes

Tree-Structured Indexes

Lecture 21: B-Trees Monday, Nov. 19, 2001.

Lecture 6: Data Storage and Indexes

Storage and Indexing May 17th, 2002.

Adapted from Mike Franklin

Database Design and Programming

DATABASE IMPLEMENTATION ISSUES

CSE 544: Lecture 11 Storing Data, Indexes

CS222/CS122C: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.

Storage and Indexing.

General External Merge Sort

Introduction to Database Systems CSE 444 Lectures 19: Data Storage and Indexes May 16, 2008.

File Organization.

Indexing February 28th, 2003 Lecture 20.

Lecture 11: B+ Trees and Query Execution

Database Implementation Issues

Lecture 20: Indexes Monday, February 27, 2006.

CS4433 Database Systems Indexing.

CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #05 Index Overview and ISAM Tree Index Instructor: Chen Li.

Lecture 15: Data Storage Tuesday, February 20, 2001.

Database Implementation Issues

Lecture 20: Representing Data Elements

CS222P: Principles of Data Management UCI, Fall Notes #06 B+ trees

Index Structures Chapter 13 of GUW September 16, 2019

Presentation transcript:

Lecture 19: Data Storage and Indexes Wednesday, November 15, 2006

Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)

Files and Tables A disk = a sequence of blocks A file = a subsequence of blocks, usually contiguous Need to store tables/records/indexes in files/block

Representing Data Elements Relational database elements: A tuple is represented as a record The table is a sequence of records CREATE TABLE Product ( pid INT PRIMARY KEY, name CHAR(20), description VARCHAR(200), maker CHAR(10) REFERENCES Company(name) )

Issues Represent attributes inside the records Represent the records inside the blocs

Record Formats: Fixed Length pid name descr maker L1 L2 L3 L4 Base address (B) Address = B+L1+L2 Information about field types same for all records in a file; stored in system catalogs. Finding i’th field requires scan of record. Note the importance of schema information! 9

Record Header To schema length pid name descr maker L1 L2 L3 L4 header timestamp Need the header because: The schema may change for a while new+old may coexist Records from different relations may coexist 9

Variable Length Records Other header information header pid name descr maker L1 L2 L3 L4 length Place the fixed fields first: F1 Then the variable length fields: F2, F3, F4 Null values take 2 bytes only Sometimes they take 0 bytes (when at the end) 9

Storing Records in Blocks Blocks have fixed size (typically 4k – 8k) BLOCK R4 R3 R2 R1

Spanning Records Across Blocks When records are very large Or even medium size: saves space in blocks block header block header R1 R2 R3 R2

BLOB Binary large objects Supported by modern database systems E.g. images, sounds, etc. Storage: attempt to cluster blocks together CLOB = character large objec Supports only restricted operations

File Types Unsorted (heap) Sorted (e.g. by pid)

Modifications: Insertion File is unsorted: add it to the end (easy ) File is sorted: Is there space in the right block ? Yes: we are lucky, store it there Is there space in a neighboring block ? Look 1-2 blocks to the left/right, shift records If anything else fails, create overflow block

Overflow Blocks Blockn-1 Blockn Blockn+1 Overflow After a while the file starts being dominated by overflow blocks: time to reorganize

Modifications: Deletions Free space in block, shift records Maybe be able to eliminate an overflow block Can never really eliminate the record, because others may point to it Place a tombstone instead (a NULL record) How can we point to a record in an RDBMS ?

Modifications: Updates If new record is shorter than previous, easy  If it is longer, need to shift records, create overflow blocks

Pointers Logical pointer to a record consists of: Logical block number Where do we need them in RDBMS ? Pointers Logical pointer to a record consists of: Logical block number An offset in the block’s header Note: review what a pointer in C is

Indexes An index on a file speeds up selections on the search key fields for the index. Any subset of the fields of a relation can be the search key for an index on the relation. Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation). An index contains a collection of data entries, and supports efficient retrieval of all data entries with a given key value k. 7

Index Classification Primary/secondary Clustered/unclustered Primary = may reorder data according to index Secondary = cannot reorder data Clustered/unclustered Clustered = records close in the index are close in the data Unclustered = records close in the index may be far in the data Dense/sparse Dense = every key in the data appears in the index Sparse = the index contains only some keys B+ tree / Hash table / …

Primary Index File is sorted on the index attribute Dense index: sequence of (key,pointer) pairs 10 20 10 20 30 40 30 40 50 60 70 80 50 60 70 80

Primary Index Sparse index 10 20 30 40 50 60 70 80 10 30 50 90 70 110 130 150 50 60 70 80

Secondary Indexes To index other attributes than primary key Always dense (why ?) 20 30 10 20 30 20 20 30 10 20 10 30

Clustered/Unclustered Primary indexes = usually clustered Secondary indexes = usually unclustered

Clustered vs. Unclustered Index Data entries Data entries (Index File) (Data file) Data Records Data Records CLUSTERED UNCLUSTERED 12

Secondary Indexes Applications: index other attributes than primary key index unsorted files (heap files) index clustered data

B+ Trees Search trees Idea in B Trees: Idea in B+ Trees: make 1 node = 1 block Idea in B+ Trees: Make leaves into a linked list (range queries are easier)

B+ Trees Basics Parameter d = the degree Each node has >= d and <= 2d keys (except root) Each leaf has >=d and <= 2d keys: 30 120 240 Keys k < 30 Keys 30<=k<120 Keys 120<=k<240 Keys 240<=k 40 50 60 Next leaf 40 50 60

B+ Tree Example d = 2 Find the key 40 80 40  80 20 60 100 120 140 20 < 40  60 10 15 18 20 30 40 50 60 65 80 85 90 30 < 40  40 10 15 18 20 30 40 50 60 65 80 85 90

B+ Tree Design How large d ? Example: 2d x 4 + (2d+1) x 8 <= 4096 Key size = 4 bytes Pointer size = 8 bytes Block size = 4096 byes 2d x 4 + (2d+1) x 8 <= 4096 d = 170

Searching a B+ Tree Exact key values: Range queries: Start at the root Proceed down, to the leaf Range queries: As above Then sequential traversal Select name From people Where age = 25 Select name From people Where 20 <= age and age <= 30

B+ Trees in Practice Typical order: 100. Typical fill-factor: 67%. average fanout = 133 Typical capacities: Height 4: 1334 = 312,900,700 records Height 3: 1333 = 2,352,637 records Can often hold top levels in buffer pool: Level 1 = 1 page = 8 Kbytes Level 2 = 133 pages = 1 Mbyte Level 3 = 17,689 pages = 133 MBytes

Insertion in a B+ Tree Insert (K, P) Find leaf where K belongs, insert If no overflow (2d keys or less), halt If overflow (2d+1 keys), split node, insert in parent: If leaf, keep K3 too in right node When root splits, new root has 1 key only parent parent K3 K1 K2 K3 K4 K5 P0 P1 P2 P3 P4 p5 K1 K2 P0 P1 P2 K4 K5 P3 P4 p5

Insertion in a B+ Tree Insert K=19 80 20 60 100 120 140 10 15 18 20 30 50 60 65 80 85 90 10 15 18 20 30 40 50 60 65 80 85 90

Insertion in a B+ Tree After insertion 80 20 60 100 120 140 10 15 18 19 20 30 40 50 60 65 80 85 90 10 15 18 19 20 30 40 50 60 65 80 85 90

Insertion in a B+ Tree Now insert 25 80 20 60 100 120 140 10 15 18 19 30 40 50 60 65 80 85 90 10 15 18 19 20 30 40 50 60 65 80 85 90

Insertion in a B+ Tree After insertion 80 20 60 100 120 140 10 15 18 19 20 25 30 40 50 60 65 80 85 90 10 15 18 19 20 25 30 40 50 60 65 80 85 90

Insertion in a B+ Tree But now have to split ! 80 20 60 100 120 140 10 15 18 19 20 25 30 40 50 60 65 80 85 90 10 15 18 19 20 25 30 40 50 60 65 80 85 90

Insertion in a B+ Tree After the split 80 20 30 60 100 120 140 10 15 18 19 20 25 30 40 50 60 65 80 85 90 10 15 18 19 20 25 30 40 50 60 65 80 85 90

Deletion from a B+ Tree Delete 30 80 20 30 60 100 120 140 10 15 18 19 25 30 40 50 60 65 80 85 90 10 15 18 19 20 25 30 40 50 60 65 80 85 90

Deletion from a B+ Tree After deleting 30 May change to 40, or not 80 20 30 60 100 120 140 10 15 18 19 20 25 40 50 60 65 80 85 90 10 15 18 19 20 25 40 50 60 65 80 85 90

Deletion from a B+ Tree Now delete 25 80 20 30 60 100 120 140 10 15 18 19 20 25 40 50 60 65 80 85 90 10 15 18 19 20 25 40 50 60 65 80 85 90

Deletion from a B+ Tree After deleting 25 Need to rebalance Rotate 80 20 30 60 100 120 140 10 15 18 19 20 40 50 60 65 80 85 90 10 15 18 19 20 40 50 60 65 80 85 90

Deletion from a B+ Tree Now delete 40 80 19 30 60 100 120 140 10 15 18 50 60 65 80 85 90 10 15 18 19 20 40 50 60 65 80 85 90

Deletion from a B+ Tree After deleting 40 Rotation not possible Need to merge nodes 80 19 30 60 100 120 140 10 15 18 19 20 50 60 65 80 85 90 10 15 18 19 20 50 60 65 80 85 90

Deletion from a B+ Tree Final tree 80 19 60 100 120 140 10 15 18 19 20 50 60 65 80 85 90 10 15 18 19 20 50 60 65 80 85 90

Summary on B+ Trees Default index structure on most DBMS Very effective at answering ‘point’ queries: productName = ‘gizmo’ Effective for range queries: 50 < price AND price < 100 Less effective for multirange: 50 < price < 100 AND 2 < quant < 20