CS 440 Database Management Systems Lecture 6: Data storage & access methods

Database System Implementation
[Figure: design pipeline. User Requirements → Conceptual Design (Entity Relationship (ER) Model) → Schema (Relational Model) → Physical Storage (Files and Indexes)]

The advantage of RDBMS
It separates the logical level (schema) from the physical level (implementation).
Physical data independence
– Users do not worry about how their data is stored and processed on the physical devices.
– It is all SQL!
– Their queries work over (almost) all RDBMS deployments.

DBMS Architecture
[Figure: DBMS architecture. Users, web forms, applications and the DBA submit queries and transactions. Components: Process Manager, Query Parser, Query Rewriter, Query Optimizer, Query Executor, Files & Access Methods, Buffer Manager, Storage Manager, Storage, Transaction Manager, Lock Manager, Logging & Recovery. Main memory holds the Buffers and Lock Tables.]

Challenges in physical level
Processor: – MIPS
Main memory: around 10 Gb/sec.
Secondary storage: higher capacity and durability
Disk random access
– Seek time + rotational latency + transfer time
– Seek time: 4 ms - 15 ms!
– Rotational latency: 2 ms - 7 ms!
– Transfer time: at most 1000 Mb/sec
– Read, write in blocks.

Gloomy future: Moore's law
Processor speed and the maximum capacity of storage increase exponentially over time, and the cost of storage falls.
But storage (main and secondary) access time improves much more slowly.

Random access versus sequential access
Disk random access: seek time + rotational latency + transfer time.
Disk sequential access: reading blocks next to each other
– No seek time or rotational latency
– Much faster than random access
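
The gap between the two access patterns can be sketched with a small cost model. This is only an illustration: the function names are made up for this example, the 4 ms seek and 2 ms rotational latency are the low ends of the ranges quoted earlier, the 50 microsecond per-block transfer time is an assumed round figure, and the sequential case is assumed to pay one initial positioning cost before streaming.

```python
def random_read_us(n_blocks, seek_us, rotation_us, transfer_us):
    """Random access: every block pays seek + rotational latency + transfer."""
    return n_blocks * (seek_us + rotation_us + transfer_us)


def sequential_read_us(n_blocks, seek_us, rotation_us, transfer_us):
    """Sequential access: position the head once, then only transfer time."""
    return seek_us + rotation_us + n_blocks * transfer_us


# Assumed figures, in microseconds: 4 ms seek, 2 ms rotation, 50 us transfer.
random_cost = random_read_us(1000, 4000, 2000, 50)
sequential_cost = sequential_read_us(1000, 4000, 2000, 50)
print(random_cost, sequential_cost)
```

Under these assumed numbers, reading 1000 blocks at random is roughly two orders of magnitude slower than reading them sequentially.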

Units of data on physical device
– Fields: data items
– Records
– Blocks
– Files

Fields
Fixed size
– Integer, Boolean, …
Variable length
– Varchar, …
– Null terminated
– Size at the beginning of the string
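
The two variable-length encodings can be sketched as follows. The helper names, the UTF-8 encoding and the 2-byte little-endian length prefix are assumptions made for this illustration, not something the slides specify.

```python
import struct


def encode_null_terminated(text):
    """Store the bytes followed by a 0 byte that marks the end."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("a null-terminated field cannot contain a 0 byte")
    return data + b"\x00"


def decode_null_terminated(buf, offset=0):
    """Return (text, offset of the next field)."""
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("utf-8"), end + 1


def encode_length_prefixed(text):
    """Store a 2-byte size (assumed width) followed by the bytes."""
    data = text.encode("utf-8")
    return struct.pack("<H", len(data)) + data


def decode_length_prefixed(buf, offset=0):
    """Return (text, offset of the next field)."""
    (size,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    return buf[start:start + size].decode("utf-8"), start + size
```

With the size stored up front, a reader can skip a field without scanning it, which the null-terminated form does not allow.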

Records: sets of fields
Schema
– Number of fields, types of fields, order, …
Fixed format and length
– Record holds only the data items
Variable format and length
– Record holds fields and their size, type, … information
Range of formats in between

Record header
– Pointer to the record schema (record type)
– Record size
– Timestamp
– Other info …
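
A record header of this kind could be laid out as below. This is a minimal sketch: the field widths (2-byte schema id, 2-byte record size, 4-byte timestamp) and the function names are assumptions chosen for the example.

```python
import struct

# Assumed header layout: schema id, total record size, timestamp.
HEADER = struct.Struct("<HHI")


def pack_record(schema_id, timestamp, payload):
    """Prefix the payload bytes with a header that records the total size."""
    size = HEADER.size + len(payload)
    return HEADER.pack(schema_id, size, timestamp) + payload


def unpack_record(buf, offset=0):
    """Return (schema_id, timestamp, payload, offset of the next record)."""
    schema_id, size, timestamp = HEADER.unpack_from(buf, offset)
    payload = buf[offset + HEADER.size:offset + size]
    return schema_id, timestamp, payload, offset + size
```

Because the size is in the header, records of different lengths can be stored back to back and still be separated when read.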

Blocks
Collection of records
Reduces the number of I/O accesses
Different from OS blocks
– Why should the RDBMS manage its own blocks? It knows the access pattern better than the OS.
Separating records in a block
– Fixed size records: no worry!
– Markers between records
– Keep record size information in records or block header.

Spanned versus un-spanned
Un-spanned
– Each record belongs to only one block
Spanned
– Records may be stored across multiple blocks
– Saves space
– The only way to deal with large records and fields: blob, image, …

Block header
Data about block
– File, relation, DB IDs
– Block ID and type
– Record directory
– Pointer to free space
– Timestamp
– Other info …
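
The record directory and the pointer to free space can be sketched with a toy block. The class and its methods are hypothetical, and for brevity the directory is kept as a Python list next to the byte array rather than serialized into the block itself.

```python
class Block:
    """Toy un-spanned block with a record directory and a free-space pointer."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.directory = []   # one (offset, length) entry per record
        self.free_start = 0   # pointer to the first free byte

    def free_bytes(self):
        return len(self.data) - self.free_start

    def insert(self, record):
        """Append a record; return its slot number, or None if it does not fit."""
        if len(record) > self.free_bytes():
            return None
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.directory.append((offset, len(record)))
        self.free_start += len(record)
        return len(self.directory) - 1

    def read(self, slot):
        offset, length = self.directory[slot]
        return bytes(self.data[offset:offset + length])
```

The directory lets variable-length records be located without markers between them, and a record that does not fit is rejected rather than spanned.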

Heap versus sorted files
Heap files
– Records are in no particular order.
– New blocks are inserted at the end of the file.
Sorted files
– Order blocks (and records) based on some key.
– Physically contiguous or using links to the next blocks.

Average cost of data operations
Insertion
– Heap files are more efficient.
– Overflow areas for sorted files.
Search for a record or a range of records
– Sorted files are more efficient.
Deletion
– Heap files are more efficient
– Although we find the record faster in the sorted file.
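
The trade-off can be illustrated with in-memory lists standing in for files. This is a sketch only: the function names are invented for the example, and list operations are used as a stand-in for block I/O.

```python
import bisect


def heap_insert(records, key):
    """Heap file: append at the end, no ordering work."""
    records.append(key)


def heap_range(records, low, high):
    """Heap file: every record must be inspected."""
    return [k for k in records if low < k < high]


def sorted_insert(records, key):
    """Sorted file: find the position and shift later records."""
    bisect.insort(records, key)


def sorted_range(records, low, high):
    """Sorted file: binary search for both ends, then one contiguous slice."""
    start = bisect.bisect_right(records, low)
    end = bisect.bisect_left(records, high)
    return records[start:end]
```

Both range functions use exclusive bounds, matching a predicate such as price > 2 AND price < 10.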

Row versus column stores
We have talked about row store
– All fields of a record are stored together.

Row versus column stores
We can store the fields in columns.
– We can store SSNs implicitly.

Row versus column store
Column store
– Compact storage
– Faster reads on data analysis and mining operations
Row store
– Faster writes
– Faster reads for record access (OLTP)
Further reading
– Mike Stonebraker, et al., "C-Store: A Column-oriented DBMS", VLDB '05.
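
The two layouts can be sketched with plain Python containers. The table contents and all names here are made up for illustration.

```python
COLUMNS = ("ssn", "name", "age")

# Row store: all fields of a record sit together.
row_store = [("111", "Ann", 31), ("222", "Bob", 45), ("333", "Cy", 27)]

# Column store: one list per attribute, aligned by position.
column_store = {
    name: [row[i] for row in row_store] for i, name in enumerate(COLUMNS)
}


def average_age_row(rows):
    """Analytic read on a row store touches every full record."""
    return sum(row[2] for row in rows) / len(rows)


def average_age_column(columns):
    """Analytic read on a column store touches only the one column."""
    return sum(columns["age"]) / len(columns["age"])


def fetch_record_column(columns, position):
    """Record access on a column store must visit every column."""
    return tuple(columns[name][position] for name in COLUMNS)
```

The aggregate reads one list in the column layout, while fetching a whole record is a single lookup in the row layout and one lookup per attribute in the column layout.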

Access paths
The methods that the RDBMS uses to retrieve the data.
Attribute value(s) → Tuple(s)

Types of search queries
Point query over Beers(name, manf)
  SELECT * FROM Beers WHERE name = 'Bud';
Range query over Sells(bar, beer, price)
  SELECT * FROM Sells WHERE price > 2 AND price < 10;

Types of access paths
Full table scan
– Heap files
– Inefficient for both point and range queries.
Sequential access
– Sorted files
– Efficient for both point and range queries.
– Inefficient to maintain
Middle ground?

Indexing
An old idea

Index
A data structure that speeds up selecting tuples in a relation based on some search keys.
Search key
– A subset of the attributes in a relation
– May not be the same as the (primary) key
Entries in an index
– (k, r)
– k is the search key.
– r is the pointer to a record (record id).

Index
Data file stores the table data.
Index file stores the index data structure.
Index file is smaller than the data file.
Ideally, the index should fit in main memory.
[Figure: data file and index file]

Index categorizations
Clustered vs. unclustered
– Clustered: records are stored according to the index order.
– Unclustered: records are stored in another order, or in no order.
Dense vs. sparse
– Dense: each record is pointed to by an entry in the index.
– Sparse: each block has an entry in the index.
– Size versus time tradeoff.
Primary vs. secondary
– Primary: the primary key is the search key.
– Secondary: other attributes.
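
The dense versus sparse distinction can be sketched over a small sorted (clustered) data file. The keys, the two-records-per-block layout and the function names are assumptions for this illustration.

```python
import bisect

# Assumed data file: sorted by search key, two records per block.
blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Dense index: one (key, (block, slot)) entry per record.
dense = [(key, (b, s)) for b, block in enumerate(blocks)
         for s, key in enumerate(block)]

# Sparse index: one (first key, block) entry per block.
sparse = [(block[0], b) for b, block in enumerate(blocks)]


def dense_lookup(key):
    """The index alone tells whether the key exists and where it is."""
    keys = [k for k, _ in dense]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return dense[i][1]
    return None


def sparse_lookup(key):
    """Find the last block whose first key is <= key, then scan that block."""
    firsts = [k for k, _ in sparse]
    i = bisect.bisect_right(firsts, key) - 1
    if i < 0:
        return None
    b = sparse[i][1]
    for s, k in enumerate(blocks[b]):
        if k == key:
            return (b, s)
    return None
```

The sparse index here has half as many entries as the dense one, at the price of reading a data block even when the key turns out to be absent.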

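Index categorizations: clustered and dense
[Figure: index entries pointing into the data file]

Index categorizations: clustered and sparse
[Figure: index entries pointing into the data file]

Duplicate search keys: clustered and dense
[Figure: index entries pointing into the data file]

Duplicate search keys: clustered and sparse
– Any problem?
[Figure: index entries pointing into the data file]

Duplicate search keys: clustered and sparse
– Point to the lowest new search key in every block
[Figure: index entries pointing into the data file]

Unclustered index
Dense / sparse?
[Figure: index entries pointing into the data file]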

Well known index structures
B+ trees:
– Very popular
Hash tables:
– Not frequently used

B+ trees
The index of a very large data file gets too large.
How about building an index for the index file?
A multi-level index, or a tree

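B+ trees
Degree (order) of the tree: d
Each node (except root) stores [d, 2d] keys.
[Figure: a non-leaf node whose pointers cover the key ranges [A, 10), [10, 32), [32, 94), [94, B); non-leaf nodes point to leaf nodes, and leaf nodes point to records]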

Example: d = 2
[Figure: example B+ tree]

B+ tree tuning
How to choose the value of d?
– Each node should fit in a block.
Example
– Key value: 8 bytes
– Record pointer: 16 bytes
– Block size: 4096 bytes
– 2d * 8 + (2d + 1) * 16 <= 4096
– d <= 85
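
The arithmetic behind d <= 85 can be checked with a short function. The function name is made up for this example; it solves 2d * key_size + (2d + 1) * pointer_size <= block_size for the largest integer d.

```python
def max_degree(block_size, key_size, pointer_size):
    """Largest d with 2d keys and 2d + 1 pointers fitting in one block."""
    # 2d * key + (2d + 1) * ptr <= block
    # 2d * (key + ptr) <= block - ptr
    return (block_size - pointer_size) // (2 * (key_size + pointer_size))


def node_bytes(d, key_size, pointer_size):
    """Bytes used by a full node of degree d."""
    return 2 * d * key_size + (2 * d + 1) * pointer_size


d = max_degree(4096, 8, 16)
print(d, node_bytes(d, 8, 16))
```

For the slide's numbers the inequality becomes 48d + 16 <= 4096, and a full node with d = 85 uses the block exactly.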

Retrieving tuples using B+ tree
Point queries
– Start from the root and follow the links to the leaf.
Range queries
– Find the lowest point in the range.
– Then, follow the links between the nodes.
The top levels are kept in the buffer pool.
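
Both kinds of retrieval can be sketched on a small hand-built tree. The class names, the example keys and the convention that child i covers keys in [keys[i-1], keys[i]) are assumptions for this illustration; the range query is taken to be inclusive on both ends.

```python
import bisect


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None   # link to the next leaf


class Inner:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children


def find_leaf(node, key):
    """Start from the root and follow the links down to a leaf."""
    while isinstance(node, Inner):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def point_query(root, key):
    leaf = find_leaf(root, key)
    i = bisect.bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_query(root, low, high):
    """Find the leaf for the low end, then walk the leaf links."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return result
            if k >= low:
                result.append(v)
        leaf = leaf.next
    return result


# Hand-built example tree (assumed data).
l0 = Leaf([10, 15], ["r10", "r15"])
l1 = Leaf([20, 30], ["r20", "r30"])
l2 = Leaf([40, 50, 60], ["r40", "r50", "r60"])
l0.next, l1.next = l1, l2
root = Inner([20, 40], [l0, l1, l2])
```

The range query descends the tree only once; after that it reads leaves in key order through the sibling links.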

B+ tree and index categories
B+ tree index could be
– Dense / sparse?
– Clustered / unclustered?

Inserting a new key
Pick the proper leaf node and insert the key.
If the node contains more than 2d keys, split the node and insert the extra node in the parent.
– If leaf level, add K3 to the right node
[Figure: a node with keys K1 K2 K3 K4 K5 and pointers R0 R1 R2 R3 R4 R5 splits into a left node (K1 K2; R0 R1 R2) and a right node (K4 K5; R3 R4 R5), and (K3, ) goes to the parent]
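
The leaf-level split can be sketched as follows. This covers only the leaf step under stated assumptions: the function name is made up, pointers are omitted, and updating the parent (or splitting it in turn) is left out.

```python
import bisect


def insert_into_leaf(keys, key, d):
    """Insert key into a sorted leaf that may hold at most 2d keys.

    Returns (left, right, separator). When no split is needed, right and
    separator are None. On a split the middle key goes to the parent as
    the separator and, at leaf level, also stays in the right node.
    """
    keys = list(keys)
    bisect.insort(keys, key)
    if len(keys) <= 2 * d:
        return keys, None, None
    mid = len(keys) // 2          # 2d + 1 keys, so mid == d
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


print(insert_into_leaf([10, 20, 30, 40], 25, 2))
```

With d = 2, an overfull leaf of five keys K1..K5 becomes a left node K1 K2 and a right node K3 K4 K5, with K3 handed to the parent.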

Insertion
Insert K = 18
[Figure: the tree before and after the key 18 is placed in its leaf]

Insertion
Insert K=
– Need to split the node.
– Split and update the parent node.
– What if we need to split the root?

Deletion
Delete K =
– Note: K = 21 may still remain in the internal levels.

Deletion
Delete K =
– We need to update the number of keys on the node.
– Borrow from siblings: rotate.

Deletion
What if we cannot borrow from siblings?
– Example: delete K =
– Merge with a sibling.
– What to do with the dangling key and pointer? Simply remove them.
[Figure: final tree]

Index creation
CREATE TABLE Person(ID int, Name varchar(50), Pos int, Age int);
CREATE INDEX Person_ID ON Person(ID);       -- default is normally B-tree
CLUSTER Person USING Person_ID;             -- cluster Person on the Person_ID index
CREATE INDEX Pos_Age ON Person(Pos, Age);   -- multi-attribute index
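
The effect of these indexes on access paths can be observed with Python's built-in sqlite3 module. This is a sketch under several assumptions: the Person table is given an ID column so the Person_ID index has something to index, the rows are generated test data, and the CLUSTER statement is left out because it is PostgreSQL syntax that SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person(ID int, Name varchar(50), Pos int, Age int)")
conn.execute("CREATE INDEX Person_ID ON Person(ID)")
conn.execute("CREATE INDEX Pos_Age ON Person(Pos, Age)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?, ?)",
    [(i, f"p{i}", i % 5, 20 + i % 40) for i in range(1000)],
)


def plan(sql):
    """Return the textual access path SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


point_plan = plan("SELECT * FROM Person WHERE ID = 42")
multi_plan = plan("SELECT * FROM Person WHERE Pos = 3 AND Age > 30")
scan_plan = plan("SELECT * FROM Person WHERE Name = 'p7'")
print(point_plan, multi_plan, scan_plan, sep="\n")
```

The query on Name has no usable index, so its plan falls back to a full table scan, while the other two are answered through Person_ID and Pos_Age.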

Index selection
Let's index every attribute on every table to speed up all queries!
Indexes generally slow down data manipulation
– INSERT, DELETE, UPDATE.

Index selection
Given a query workload and a schema, find the set of indexes that optimize the execution.
The query workload:
– Queries and their frequencies.
– Queries are both data retrieval (SELECT) and data manipulation (INSERT, UPDATE, DELETE).
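
The problem statement can be made concrete with a toy brute-force search. Everything here is an assumption for illustration: the cost constants, the workload, and the idea that every write pays a fixed maintenance cost per index. Real advisors use the optimizer's cost estimates instead.

```python
from itertools import combinations


def workload_cost(indexes, workload, scan_cost=100, index_cost=3,
                  maintenance_cost=2):
    """Total cost of a workload of (kind, attribute, frequency) entries."""
    total = 0
    for kind, attr, freq in workload:
        if kind == "select":
            total += freq * (index_cost if attr in indexes else scan_cost)
        else:
            # Every write also has to update every index.
            total += freq * (1 + maintenance_cost * len(indexes))
    return total


def best_indexes(candidates, workload):
    """Try every subset of candidate indexes and keep the cheapest one."""
    best = None
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            cost = workload_cost(set(subset), workload)
            if best is None or cost < best[0]:
                best = (cost, set(subset))
    return best


workload = [("select", "name", 50), ("select", "age", 1), ("insert", None, 100)]
print(best_indexes(["name", "age"], workload))
```

In this toy workload, indexing only the frequently queried attribute beats both indexing nothing and indexing everything, because the rarely used index costs more in write maintenance than it saves.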

Index selection
Part of physical database design
– File structure, indexing, tuning queries, …
Physical database design may affect logical design!
– Change the schema to run the queries faster

Index selection
Generally a hard problem.
RDBMS vendors provide wizards:
– Started with the AutoAdmin project for SQL Server
– SQL Server / Oracle Index Tuning Wizard
– DB2 Index Advisor
They try many configurations and pick the one that minimizes the time and overheads.