COMP 430 Intro. to Database Systems

Slides:



Advertisements
Similar presentations
Tuning: overview Rewrite SQL (Leccotech)Leccotech Create Index Redefine Main memory structures (SGA in Oracle) Change the Block Size Materialized Views,
Advertisements

CpSc 3220 File and Database Processing Lecture 17 Indexed Files.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals.
Hashing and Indexing John Ortiz.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
INDEXING AND HASHING.
Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]
Data Indexing Herbert A. Evans. Purposes of Data Indexing What is Data Indexing? Why is it important?
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part A Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
1 Hash-Based Indexes Chapter Introduction  Hash-based indexes are best for equality selections. Cannot support range searches.  Static and dynamic.
1 Hash-Based Indexes Chapter Introduction : Hash-based Indexes  Best for equality selections.  Cannot support range searches.  Static and dynamic.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
1 Indexing Structures for Files. 2 Basic Concepts  Indexing mechanisms used to speed up access to desired data without having to scan entire.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
1 Physical Data Organization and Indexing Lecture 14.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Indexing and Hashing By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
Storage Structures. Memory Hierarchies Primary Storage –Registers –Cache memory –RAM Secondary Storage –Magnetic disks –Magnetic tape –CDROM (read-only.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
File Organizations and Indexing
Spring 2004 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
1 Chapter 12: Indexing and Hashing Indexing Indexing Basic Concepts Basic Concepts Ordered Indices Ordered Indices B+-Tree Index Files B+-Tree Index Files.
CS4432: Database Systems II
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.
1 Ullman et al. : Database System Principles Notes 4: Indexing.
Chapter 11 Indexing And Hashing (1) Yonsei University 1 st Semester, 2016 Sanghyun Park.
Data Indexing Herbert A. Evans.
Indexing Structures for Files and Physical Database Design
Record Storage, File Organization, and Indexes
CS 540 Database Management Systems
Indexing Goals: Store large files Support multiple search keys
Indexing and hashing.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Storage and Indexes Chapter 8 & 9
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Extra: B+ Trees CS1: Java Programming Colorado State University
Hash-Based Indexes Chapter 11
Database Management Systems (CS 564)
Database Applications (15-415) DBMS Internals- Part III Lecture 15, March 11, 2018 Mohammad Hammoud.
File organization and Indexing
Chapter 11: Indexing and Hashing
Lecture 12 Lecture 12: Indexing.
Hash Tables.
Yan Huang - CSCI5330 Database Implementation – Access Methods
(Slides by Hector Garcia-Molina,
File Organizations and Indexing
Hash-Based Indexes Chapter 10
Indexing and Hashing Basic Concepts Ordered Indices
Hash-Based Indexes Chapter 11
Database Management System
Indexing and Hashing B.Ramamurthy Chapter 11 2/5/2019 B.Ramamurthy.
Database Systems (資料庫系統)
Chapter 11 Indexing And Hashing (1)
Database Design and Programming
2018, Spring Pusan National University Ki-Joune Li
Indexing 4/11/2019.
Hash-Based Indexes Chapter 11
Chapter 11 Instructor: Xin Zhang
CS4433 Database Systems Indexing.
Chapter 11: Indexing and Hashing
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

COMP 430 Intro. to Database Systems Indexing

How does DB find records quickly? Various forms of indexing An index is automatically created for primary key. SQL gives us some control, so we should understand the options. Concerned with user-visible effects, not underlying implementation. CREATE INDEX index_city_salary ON Employees (city, salary); CREATE UNIQUE CLUSTERED INDEX idx ON MyTable (attr1 DESC, attr2 ASC); Options vary.

Phone book model Data is stored with search key. Clustered index. Organized by one search key: Last name, first name. Searching by any other key is slow.

Library card catalog model Index stores pointers to data. Non-clustered index. Organized by search key(s). Author last name, first name. Title Subject

Evaluating an indexing scheme Access type flexibility Specific key value – e.g., “John”, “Smith” Key value range – e.g., salary between $50K and $60K Advantage: Access time Disadvantages: Update time Space overhead Creating an index can slow down the system!

Ordered indices Sorted, as in previous human-oriented examples

Types of indices we’ll see Dense vs. sparse Primary vs. secondary Unique vs. not-unique Single- vs. multi-level These ideas can be combined in various ways.

Primary index – the clustering index 10 Index Data 30 80 100 140 … 20 40 50 60 10 Index Data 20 30 40 50 … 60 Sparse Dense What were primary search keys in phone book & library card catalog? Typically unique index also, since primary search key typically same as primary key.

Primary index – trade-off dense vs. sparse 10 Index Data 40 70 100 130 … 20 30 50 60 Dense – faster access Sparse – less update time, less space overhead Good trade-off: Sparse, but link to first key of each file block.

Secondary index – a non-clustering index 10 30 Index Data 20 50 100 … 140 What were secondary search keys in phone book & library card catalog? Needs to be dense, otherwise we can’t find all search keys efficiently.

Secondary index typically not unique 10 30 Index Data 20 80 100 … Buckets

Same problem & solution as for page tables in virtual memory. Index size Dense index – search key + pointer per record Sparse index – search key + pointer per file block (typically) Many records, but want index to fit in memory Solution: Multi-level index Same problem & solution as for page tables in virtual memory.

Multi-level primary index 10 (Semi-)dense Index Data 40 70 100 130 … 20 30 50 60 80 90 160 190 220 250 Sparse Index

Multi-level secondary index Sparse Index Dense Index Data 10 10 70 50 20 40 90 30 30 130 40 100 … 50 10 60 90 70 60 80 20 90 80 … 50 …

Multi-level index summary Access time now very locality-dependent Update time increased – must update multiple indices Total space overhead increased

Updating database & indices – inserting 10 Index Data 40 70 100 130 … 20 30 50 60 Data 10 20 25 30 40 50 60 Moving all records is too expensive. Have same issue when updating indices. 70

Updating database & indices – deleting 10 Index Data 40 70 100 130 … 20 30 50 60 Data 10 20 25 30 40 60 Moving all records is too expensive. Have same issue when updating indices. 70

Updating file leads to fragmentation Performance degrades with time. Need to periodically reorganize data & indices. Solution: B+-trees

B+-tree indices

Or, is the default in many DBMSs. B+-trees Balanced search trees optimized for disk-based access Reorganizes a little on every update Shallow & wide (high fan-out) Typically, node size = disk block Slight differences from more commonly-known B-trees All data in leaf nodes Leaf nodes sequentially linked Easily get all data in order. Or, is the default in many DBMSs. CREATE INDEX … ON … USING BTREE;

B+-tree example … Data records … 100 30 120 150 180 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200 … Data records …

B+-tree widely used in relational DBMSs B+-tree performance Very similar to multi-level index structure Slightly higher per-operation access & update time No degradation over time No periodic reorganization B+-tree widely used in relational DBMSs

What about non-unique search keys? Allow duplicates in tree. Maintain  order instead of <. Slightly complicates tree operations. Make unique by adding record-ID. Extra storage. But, record-ID useful for other purposes, too. List of duplicate records for each key. Trivial to get all duplicates. Inefficient when lists get long.

Indexing on VARCHAR keys Key size variable, so number of keys that fit into a node also varies. Techniques to maximize fan-out: Key values at internal nodes can be prefixes of full key. E.g., “Johnson” and “Jones” can be separated by “Jon”. Key values at leaf nodes can be compressed by sharing common prefixes. E.g., “Johnson” and “Jones” can be stored as “Jo” + “hnson”/”nes”

Hash indices

Basic idea of a hash function value1 h() distributes values uniformly among the buckets. value2 value3 h() distributes typical subsets of values uniformly among the buckets. value4

Using hash function for indexing Motivation: Constant-time hash instead of multi-level/tree-based. value1 value1 value4 value3 Use hash function to find bucket. Search/insert/delete from bucket. Bucket items possibly sorted. value3 value4 CREATE INDEX … ON … USING HASH;

Buckets can overflow Overflow some buckets – skewed usage Overflow many buckets – not enough space reserved Solutions: Chain additional buckets – degrades to linear search Stop and reorganize Dynamic hashing (extensible or linear hashing) – techniques that allow the number of buckets to grow without rehashing existing data

Advantages & disadvantages of hash indexing One hash function vs. multiple levels of indexing or B+-tree Storage about the same. Don’t need multiple levels. But, need buckets about half empty for good performance. Can’t easily access a data range. Simple hashing degrades poorly when data is skewed relative to the hash function.

Multiple indices & Multiple keys

Using indices for multiple attributes What if queries like the following are common? SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5;

Strategy 1 – Index one attribute CREATE INDEX idx_job_code on Employee (job_code); SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5; Internal strategy: Use index to find Employees with job_code = 2. Linear search of those to check performance_rating = 5.

Strategy 2 – Index both attributes CREATE INDEX idx_job_code on Employee (job_code); CREATE INDEX idx_perf_rating on Employee (performance_rating); SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5; Internal strategy – chooses between Use job_code index, then linear search on performance_rating. Use performance_rating index, then linear search on job_code. Use both indices, then intersect resulting sets of pointers.

Strategy 3 – Index attribute set CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5; This strategy typically uses ordered index, not hashing. Attribute sets ordered lexicographically: (jc1, pr1) < (jc2, pr2) iff either jc1 < jc2 jc1 = jc2 and pr1 < pr2 Note that this prioritizes job_code over performance_rating!

Strategy 3 – Index attribute set CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); SELECT … FROM Employee WHERE job_code = 2 AND performance_rating = 5; SELECT … FROM Employee WHERE job_code = 2 AND performance_rating < 5; Efficient CREATE INDEX idx_job_perf on Employee (job_code, performance_rating); SELECT … FROM Employee WHERE job_code < 2 AND performance_rating = 5; SELECT … FROM Employee WHERE job_code = 2 OR performance_rating = 5; Inefficient

Strategy 4 – Grid indexing 𝑛 attributes viewed as being in 𝑛-dimensional grid Numerous implementations Grid file, K-D tree, R-tree, … Mainly used for spatial data CREATE SPATIAL INDEX … ON …;

Strategy 5 – Bitmap indices Coming next…

Bitmap indices

Basic idea of bitmap indices Assume records numbered 0, 1, …, and easy to find record #𝑖. Bitmaps Record # id gender income_level 51351 M 1 73864 F 2 13428 3 53718 4 83923 M F 1 2 3 4 5 F’s bitmap: 01101 CREATE BITMAP INDEX … ON …;

Queries use standard bitmap operations SELECT … FROM … WHERE gender = ‘F’ AND income_level = 1; F’s bitmap 01101 1’s bitmap 10100 & = 00100 SELECT … FROM … WHERE gender = ‘F’ OR income_level = 1; F’s bitmap 01101 1’s bitmap 10100 | = 11101 SELECT … FROM … WHERE income_level <> 1; 1’s bitmap 10100 ~ = 01011

Space overhead Normal: #records  (#values + 1) bits, if nullable Typically used when # of values for attribute is low. Encoded: #records  log(#values) But need to use more bitmaps Compressed: ~50% of normal Can do bitmap operations on compressed form.

A couple details When deleting records, either … Delete bit from every bitmap for that table, or Have an existence bitmap indicating whether that row exists. Can use idea in B+-tree leaves.

Summary of index types Single-level Index generally too large to fit in memory Multi-level Degrades due to fragmentation B+-tree General-purpose choice Spatial Good for spatial coordinates Hash Good for equality tests; potentially degrades due to skew & overflow Bitmap Good for multiple attributes each with few values