BTrees & Sorting 11/3

Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…

Indexing
“If you don’t find it in the index, look very carefully through the entire catalog” -- Sears, Roebuck, and Co., Consumers Guide, 1897

Index Motivation
A file contains some records, say products.
We want faster access to those records.
– E.g., give me all products made by Sony.
Intuition: build a second file that organizes the records “by product” to make this faster.
– NB: we don’t always have to build a second file.

Indexes
An index on a file speeds up selections on the search key fields for the index.
– Search key properties: any subset of the fields can be a search key, and a search key is not the same as the key of a relation.
Product(name, maker, price)
On which attributes would you build indexes?

More precisely
An index contains a collection of data entries, and supports efficient retrieval of all data entries k* with a given key value k.
Product(name, maker, price)
Sample queries?
Indexing is one of the most important facilities provided by a database for performance.

What’s stored in an index?
Three main options: a key k maps to a data entry k*, which is
1. the actual record with key k,
2. (k, rid), the key plus the record id, or
3. (k, rid list), the key plus a list of record ids.
To avoid duplication of records, we usually only have one index that uses choice 1.

Operations on an Index
Search: given a key, find all records
– More sophisticated variants as well. Why?
Insert / remove entries
– Bulk load. Why?
Real difference between structures: the costs of the operations determine which index you pick and why.

Data File with Several Index Files
Name  Age  Sal
Bob   12   10
Cal   11   80
Joe   12   20
Sue
Equality Query: Age = 12 and Sal = 90?
Range Query: Age = 5 and Sal > 5?
Composite keys in Dictionary Order
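A minimal sketch of a composite search key kept in dictionary order, assuming the rows Bob, Cal and Joe as read from the slide's table (Sue is left out because her values did not survive in the transcript). The names `index`, `keys`, `equality` and `age_equals_sal_above` are made up for this illustration. A sorted Python list stands in for the index file, since tuples already compare in dictionary order.

```python
from bisect import bisect_left, bisect_right

# Rows as read from the slide's table: (name, age, sal).
rows = [("Bob", 12, 10), ("Cal", 11, 80), ("Joe", 12, 20)]

# Data entries for the composite key <Age, Sal>, sorted in dictionary order.
index = sorted((age, sal, name) for name, age, sal in rows)
keys = [(age, sal) for age, sal, _ in index]


def equality(age, sal):
    """Names of all records with exactly this (age, sal) pair."""
    lo = bisect_left(keys, (age, sal))
    hi = bisect_right(keys, (age, sal))
    return [name for _, _, name in index[lo:hi]]


def age_equals_sal_above(age, min_sal):
    """Names of all records with Age = age and Sal > min_sal."""
    lo = bisect_right(keys, (age, min_sal))
    hi = bisect_right(keys, (age, float("inf")))
    return [name for _, _, name in index[lo:hi]]


print(equality(12, 90))
print(age_equals_sal_above(12, 5))
```

Because Age is the first component, both queries touch one contiguous slice of the sorted entries. A query on Sal alone would not, which is one reason the order of fields in a composite key matters.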

High-Level Overview of Index Structures

Outline
Btrees
– Very good for range queries, sorted data
– Some old databases only implemented Btrees
Hash Tables
– There are variants of this basic structure to deal with IO
The data structures we present here are “IO aware”

B+ Trees
Search trees
– B does not mean binary!
Idea in B Trees:
– Make 1 node = 1 physical page
– Balanced, height adjusted tree (not the B either)
Idea in B+ Trees:
– Make leaves into a linked list (range queries)

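B+ Trees Basics
Parameter d = the degree
Internal nodes: each node has >= d and <= 2d keys (except the root)
[Figure: an internal node whose four child pointers lead to keys k < 30, keys 30 <= k < 120, keys 120 <= k < 240, and keys 240 <= k]
Leaf nodes: each leaf has >= d and <= 2d keys, plus a pointer to the next leaf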

B+ Tree Example d = 2 1. No data in internal nodes. 2. Links between leaf pages.

Searching a B+ Tree
Exact key values:
– Start at the root
– Proceed down, to the leaf
– Example: Select name From people Where age = 25
Range queries:
– As above
– Then sequential traversal
– Example: Select name From people Where 20 <= age and age <= 30
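The two search procedures can be sketched as follows. This is a toy in-memory model, not how a real DBMS lays out pages: the classes `Leaf` and `Internal` and the functions `find_leaf`, `lookup` and `range_scan` are hypothetical names, and the convention that child i of an internal node holds keys with keys[i-1] <= k < keys[i] is an assumption taken from the "B+ Trees Basics" slide.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Leaf:
    keys: List[int]
    next: Optional["Leaf"] = None  # link to the next leaf (for range queries)


@dataclass
class Internal:
    keys: List[int]
    children: List[Union["Internal", Leaf]]


def find_leaf(node, k):
    """Start at the root and proceed down to the leaf that could hold k."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, k)]
    return node


def lookup(root, k):
    """Exact-match search: one root-to-leaf traversal."""
    return k in find_leaf(root, k).keys


def range_scan(root, lo, hi):
    """Range search: traverse to the leaf for lo, then follow leaf links."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out


# A tiny two-level tree with made-up keys.
l1, l2, l3, l4 = Leaf([10, 20]), Leaf([30, 40, 50]), Leaf([120, 130]), Leaf([240, 300])
l1.next, l2.next, l3.next = l2, l3, l4
root = Internal([30, 120, 240], [l1, l2, l3, l4])

print(lookup(root, 40), range_scan(root, 20, 125))
```

The range scan touches internal pages only once, on the way down; everything after that is a sequential walk along the leaf level.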

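B+ Tree Example
K = 30?
[Figure: search path for K = 30 from the root to a leaf; annotations: “30 <” (comparison value lost in extraction), “in [20,60)”, “To the data!”]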

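B+ Tree Example
K in [30,85]
[Figure: search path for the range [30,85]; annotations: “30 <” (comparison value lost in extraction), “in [20,60)”, “To the data!”, “Use those leaf pointers!”]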

B+ Tree Design
How large is d?
Example:
– Key size = 4 bytes
– Pointer size = 8 bytes
– Block size = 4096 bytes
A full node holds 2d keys and 2d+1 pointers, so 2d × 4 + (2d+1) × 8 <= 4096, which gives d = 170.
Observable Universe contains ≈ [value lost in extraction] atoms. What is the height of a B+ tree that indexes it?
NB: Oracle allows 64k pages. A TiB is 2^40 bytes. What is the height to index with 64k pages?
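The arithmetic on this slide can be checked with a short script. This is a sketch under stated assumptions: `max_degree` and `height` are made-up helper names, every node is assumed completely full (fanout 2d+1), and 2^40 keys is used as a stand-in workload for the slide's closing question.

```python
def max_degree(key_bytes, ptr_bytes, block_bytes):
    """Largest d with 2d*key_bytes + (2d+1)*ptr_bytes <= block_bytes."""
    return (block_bytes - ptr_bytes) // (2 * (key_bytes + ptr_bytes))


def height(n_keys, fanout):
    """Smallest h such that fanout**h >= n_keys (integer arithmetic only)."""
    h, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        h += 1
    return h


d_4k = max_degree(4, 8, 4096)        # the slide's example: d = 170
d_64k = max_degree(4, 8, 64 * 1024)  # same key and pointer sizes, 64k pages

for d in (d_4k, d_64k):
    print(d, height(2**40, 2 * d + 1))
```

Even with the larger page, the height only drops by one level here: fanout enters through a logarithm, so huge fanouts buy few levels.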

B+ Trees in Practice
Typical order: 100. Typical fill-factor: 67%.
– Average fanout = 133
Typical capacities:
– Height 4: 133^4 ≈ 312,900,700 records
– Height 3: 133^3 = 2,352,637 records
Top levels of tree sit in the buffer pool:
– Level 1 = 1 page = 8 Kbytes
– Level 2 = 133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
Typically, pay for one IO!
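The capacity figures follow from powers of the fanout. A small sketch, assuming every node has exactly the average fanout of 133 (the helper names are made up):

```python
def records_indexed(fanout, height):
    """Number of records reachable through a tree of the given height."""
    return fanout ** height


def pages_on_level(fanout, level):
    """Number of pages on a level, with the root as level 1."""
    return fanout ** (level - 1)


for h in (3, 4):
    print(h, records_indexed(133, h))
for level in (1, 2, 3):
    print(level, pages_on_level(133, level))
```

The exact value of 133^4 is slightly above the rounded figure quoted for height 4.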

Insert!

Insertion in a B+ Tree
Insert (K, P)
Find leaf where K belongs, insert
If no overflow (2d keys or less), halt
If overflow (2d+1 keys), split node, insert in parent:
[Figure: the node K1 K2 K3 K4 K5 with pointers P0 P1 P2 P3 P4 P5 splits into K1 K2 (pointers P0 P1 P2) and K4 K5 (pointers P3 P4 P5); (K3, pointer to the new right node) goes to the parent]
If leaf, keep K3 too in right node
When root splits, new root has 1 key only
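The split rule can be sketched on plain key lists. This is an illustration only: pointers, parent updates and recursive splits are left out, and `insert_into_leaf` and `split_internal` are hypothetical names rather than part of any real system.

```python
from bisect import insort


def insert_into_leaf(keys, k, d):
    """Insert k into a sorted leaf of degree d.

    Returns (left, right, separator). right and separator are None when the
    leaf does not overflow. On overflow the middle key stays in the right
    leaf and is also copied up as the separator.
    """
    keys = list(keys)
    insort(keys, k)
    if len(keys) <= 2 * d:           # no overflow: halt
        return keys, None, None
    left, right = keys[:d], keys[d:]  # 2d+1 keys: d on the left, d+1 on the right
    return left, right, right[0]


def split_internal(keys, d):
    """Split an overflowing internal node (2d+1 keys).

    The middle key moves up to the parent and is kept in neither half.
    """
    return keys[:d], keys[d], keys[d + 1:]


print(insert_into_leaf([10, 20, 30, 40], 25, d=2))
print(split_internal([1, 2, 3, 4, 5], d=2))
```

The only difference between the two cases is whether the middle key is copied (leaf) or moved (internal node). Either way both halves end up with at least d keys, which is why the tree stays balanced.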

Insertion in a B+ Tree Insert K=19

Insertion in a B+ Tree After insertion

Insertion in a B+ Tree Now insert 25

Insertion in a B+ Tree After insertion

Insertion in a B+ Tree But now have to split!

Insertion in a B+ Tree After the split

Key concepts (exam)
How to search in a B+ tree
– Which pages are touched
Understanding the impact of various design decisions.
Will not ask for the details of the insert algorithm, but you do need to know that the tree remains balanced.

Clustered Indexes

Index Classification
An index is clustered if its data entries are ordered in the same way as the records in the underlying data file.

Clustered vs. Unclustered Index
[Figure: index entries direct the search to the data entries (index file), which point to the data records (data file); shown side by side for CLUSTERED and UNCLUSTERED]
Clustered (or not) dramatically impacts cost

A Simple Cost Model

Operations on an Index
Search: given a key, find all records
– More sophisticated variants as well.
Real difference between structures: the costs of the operations determine which index you pick and why.

Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
– N: the number of records
– R: number of records per page
Measure number of page I/Os
Goal: good enough to show the overall trends…

Clustered v. Unclustered
Fanout of tree is F. Range query for M entries (100 per page).
IOs to search for a single item?
Clustered:
– Traversal of the tree: log_F(1.5N)
– Range search query: log_F(1.5N) + ceil(M/100)
Unclustered:
– Traversal of the tree: log_F(1.5N)
– Range search query: log_F(1.5N) + M
Which of these IOs are random/sequential?

Plugging in Some Numbers
Clustered: log_F(1.5N) + ceil(M/100), ~ 1 random IO (10 ms) followed by sequential reads
Unclustered: log_F(1.5N) + M, ~ M random IOs (M × 10 ms)
If M is 1 then there is no difference!
If M is 100,000 records: ~17 minutes of random IO (100,000 × 10 ms = 1,000 s) vs. ~10 ms
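The cost model is simple enough to write down directly. A sketch, assuming N = 1,000,000, F = 133 and 10 ms per random IO as illustrative inputs; the function names are made up, and the traversal term is rounded up to whole pages.

```python
import math


def traversal_ios(n, fanout):
    """Page IOs to walk from the root to a leaf: ceil(log_F(1.5 N))."""
    return math.ceil(math.log(1.5 * n, fanout))


def clustered_range_ios(n, m, fanout, per_page=100):
    """Traversal plus one IO per page of matching entries."""
    return traversal_ios(n, fanout) + math.ceil(m / per_page)


def unclustered_range_ios(n, m, fanout):
    """Traversal plus one IO per matching entry."""
    return traversal_ios(n, fanout) + m


def random_io_seconds(ios, ms_per_io=10):
    """Time spent if every one of these IOs is a random IO."""
    return ios * ms_per_io / 1000


N, F = 1_000_000, 133
for m in (1, 100_000):
    print(m, clustered_range_ios(N, m, F), unclustered_range_ios(N, m, F))
print(random_io_seconds(100_000))
```

The clustered count looks large for big M too, but those page reads are sequential after the first seek, so they do not each pay the random-IO price.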

Takeaway
B+ trees are a workhorse index.
You can write down a cost model.
– Databases actually do this!
Clustered v. unclustered is a big deal.

Sorting.

Why Sort?
Data requested in sorted order
– e.g., find students in increasing GPA order
Sorting is the first step in bulk loading a B+ tree index.
A classic problem in computer science!

More reasons to sort…
Sorting is useful for eliminating duplicate copies in a collection of records (Why?)
The sort-merge join algorithm involves sorting.
Problem: sort 1 TB of data with 1 GB of RAM.
– Why not use virtual memory?

Do people care? Sort benchmark bears his name

Simplified External Sorts.

Two Ideas behind external sort
I/O optimized: long sequential disk access
Observation: merging sorted files is easy
Sort small sets in memory, then merge.

3 Buffer Pages
[Figure: two unsorted pages, 18,8,5,30 and 44,10,33,55, to be sorted using 3 buffer pages of main memory]

Phase I: Buffer with 3 Pages
Read each page into main memory and sort it! (Quicksort)
– 44,10,33,55 becomes 10,33,44,55
– 18,8,5,30 becomes 5,8,18,30
Phase I, per page: 2 IOs (1 read, 1 write)
End: all pages sorted.

Phase II: Merge
1. Read pages: bring the sorted pages 10,33,44,55 and 5,8,18,30 into two of the buffer pages.
2. Merge: repeatedly move the smaller front element to the third (output) page, which grows as 5; then 5,8; then 5,8,10; then 5,8,10,18. The 3rd page is filled.
3. Write back the output page 5,8,10,18.
Keep on merging!
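The merge step with one output buffer can be sketched like this. It is an in-memory illustration in which a Python list plays the role of the disk; `merge_runs` and `page_size` are hypothetical names, and a page is assumed to hold 4 values as in the slides.

```python
def merge_runs(run_a, run_b, page_size=4):
    """Merge two sorted runs, emitting one full output page at a time."""
    written, out = [], []
    i = j = 0
    while i < len(run_a) or j < len(run_b):
        # Take from run_a when run_b is exhausted or run_a has the smaller head.
        take_a = j >= len(run_b) or (i < len(run_a) and run_a[i] <= run_b[j])
        if take_a:
            out.append(run_a[i])
            i += 1
        else:
            out.append(run_b[j])
            j += 1
        if len(out) == page_size:  # output page is filled: "write back"
            written.append(out)
            out = []
    if out:                        # flush a final, partially filled page
        written.append(out)
    return written


print(merge_runs([10, 33, 44, 55], [5, 8, 18, 30]))
```

Only the two current input pages and the one output page need to be in memory at any time, which is why three buffer pages are enough for a two-way merge.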

3 Buffer Pages
[Figure: the merged pages 5,8,10,18 and 30,33,44,55]
Now, runs of length 2.
If our file has 16 pages, what is next?

Phase II: Merge
[Figure: pages 10,33,44,55 and 5,8,18,30 with output page 5,8,10,18; steps: 1. Read Pages, 2. Merge, 3. Write Back]

Two-Way External Merge Sort
Each pass we read + write each page in file.
N pages in the file => the number of passes = ceil(log_2 N) + 1
So total cost is: 2N (ceil(log_2 N) + 1)
Idea: Divide and conquer: sort subfiles and merge
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9
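The passes can be simulated in a few lines, using the 7-page input file from the slide's figure as I read it. This is an in-memory sketch in which lists stand in for pages and runs on disk; the function names are made up, and `heapq.merge` from the standard library does the pairwise merging.

```python
import math
from heapq import merge


def two_way_external_sort(pages):
    """Pass 0 sorts each page; every later pass merges runs two at a time.

    Expects at least one page. Returns (sorted values, number of passes).
    """
    runs = [sorted(page) for page in pages]  # pass 0: 1-page runs
    passes = 1
    while len(runs) > 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes


def total_io(n_pages, passes):
    """Each pass reads and writes every page once."""
    return 2 * n_pages * passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_external_sort(pages)
print(result, passes, total_io(len(pages), passes))
print(math.ceil(math.log2(len(pages))) + 1)  # pass count predicted by the formula
```

The number of runs halves on every merge pass, which is where the logarithm in the pass count comes from.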

More Buffer Pages
[Figure: the unsorted pages 18,8,5,30 and 44,10,33,55 and main memory]
What if we have B+1 buffer pages?
1st pass: runs of length B+1
Merge phase: B-way merge
Sort IOs: 2N(1 + ceil(log_B(N/(B+1))))
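A sketch of the pass count with B+1 buffer pages, done in integer arithmetic to avoid rounding trouble with logarithms. The function names and the example of 108 pages with 5 buffer pages are assumptions chosen for illustration, not figures from the slides.

```python
def external_sort_passes(n_pages, buffer_pages):
    """Passes needed with buffer_pages = B+1 buffers (requires at least 3).

    Pass 0 creates runs of B+1 pages; each later pass does a B-way merge.
    """
    fan_in = buffer_pages - 1             # B
    runs = -(-n_pages // buffer_pages)    # ceil(N / (B+1)) runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // fan_in)         # ceil(runs / B) after a B-way merge
        passes += 1
    return passes


def external_sort_ios(n_pages, buffer_pages):
    """Every pass reads and writes each page once."""
    return 2 * n_pages * external_sort_passes(n_pages, buffer_pages)


print(external_sort_passes(108, 5), external_sort_ios(108, 5))
```

With a realistic number of buffer pages the run count collapses within one or two merge passes, which is the intuition behind only ever needing a handful of passes.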

Number of Passes of External Sort Engineer’s rule of thumb: You sort in 3 passes

Other Optimizations
Can get twice as long runs
– Tournament sort (used in Postgres)
Can improve IO performance by using bigger buffers to “prefetch” or “double buffer”
– Prefetch: hide latency
– Bigger batch sizes: amortize expensive sequential reads and writes.