Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin.

Presentation transcript:

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)

Two observations

1. Correlations abound
Attributes tend to encode related info (these are soft functional dependencies)
– Geographic: {zip code, city, state, long/latitude}, e.g. 02116, Boston, MA, 71° 05' W
– {manufacturer, model, year}, e.g. Honda, 2007, Civic Hybrid
– {shipdate, receiptdate}

2. Secondary indexes are often useless for range and aggregation queries

How can we improve the access pattern of a secondary index?
SELECT * FROM lineitem WHERE orderdate=‘ ’
– Clustered access pattern: sorted by orderdate (clustered index on orderdate), one seek
– Unclustered access pattern: sorted by order_id (secondary index on orderdate), many seeks

Our contribution: Exploiting correlations to improve secondary index performance

lineitem access pattern
SELECT * FROM lineitem WHERE orderdate =
– Clustered by primary key (uncorrelated)
– Clustered by shipdate (correlated)

Correlation determines index performance
[Figure: access patterns under different sort orders, from very correlated to poorly correlated]

Our system:
1. Cost model with correlations
2. Correlation maps
3. Multi-attribute keys
4. Evaluation

1. Cost model with correlations
SELECT * FROM lineitem WHERE receiptdate IN (i, j)
Clustered attribute: shipdate. Unclustered attribute: receiptdate.
c_per_u: average number of clustered attribute values per unclustered attribute value
[Figure annotations: 2 lookups (i, j); 3 c_per_u; 10ms; 3 levels; 1ms; 3 pages per shipdate; 20s]
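A rough sketch of how the figures on this slide might combine into a lookup cost. The formula and the reading of the annotations (10 ms as one seek, 3 levels as the height of the clustered B+Tree, 1 ms as one sequential page read) are assumptions made for illustration, not the authors' exact model; the function and parameter names are hypothetical.

```python
def correlated_lookup_cost_ms(n_lookups, c_per_u, seek_ms, btree_levels,
                              page_read_ms, pages_per_c):
    """Estimate the cost of a secondary lookup routed through a clustered index.

    Assumed model: every unclustered value maps to c_per_u clustered values,
    and every clustered value costs one B+Tree descent (one seek per level)
    plus a sequential read of the pages holding that clustered value.
    """
    per_clustered_value = seek_ms * btree_levels + page_read_ms * pages_per_c
    return n_lookups * c_per_u * per_clustered_value


# Figures read off the slide, under the assumed interpretation above.
cost = correlated_lookup_cost_ms(
    n_lookups=2, c_per_u=3, seek_ms=10, btree_levels=3,
    page_read_ms=1, pages_per_c=3)
print(cost)
```

Under this model the cost grows linearly with c_per_u, which is why a strongly correlated clustering (small c_per_u) makes the secondary lookup cheap.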

2. Correlation Maps
CREATE TABLE Salaries( State string PRIMARY_KEY, City string, Salary integer);
SELECT * FROM Salaries WHERE city='Boston';
Clustered Attribute: State
Unclustered Attribute: City
[Figure: a Correlation Map on City pointing into the clustered B+Tree on State]

CMs: Usage
– Populated using initial scan of the table
– Insertions/deletions: keep a co-occurrence count for each (u, c) pair
– Physically stored as a B+Tree in the DB
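The maintenance scheme above can be sketched as an in-memory structure. This is a minimal illustration, assuming a plain dictionary in place of the B+Tree storage the slide mentions; the class name, method names, and sample rows are hypothetical.

```python
from collections import defaultdict


class CorrelationMap:
    """Maps each unclustered value u to the clustered values c it co-occurs with."""

    def __init__(self):
        # counts[u][c] = number of tuples currently holding the pair (u, c)
        self.counts = defaultdict(dict)

    def insert(self, u, c):
        self.counts[u][c] = self.counts[u].get(c, 0) + 1

    def delete(self, u, c):
        pairs = self.counts.get(u)
        if not pairs or c not in pairs:
            raise KeyError((u, c))
        pairs[c] -= 1
        if pairs[c] == 0:
            del pairs[c]            # last tuple with this pair is gone
            if not pairs:
                del self.counts[u]  # u no longer occurs in the table

    def lookup(self, u):
        """Return the clustered values to probe for unclustered value u."""
        return sorted(self.counts.get(u, {}))


# Populate with an initial scan over made-up (State, City, Salary) rows.
rows = [("MA", "Boston", 50), ("MA", "Boston", 60), ("NH", "Boston", 40),
        ("MA", "Springfield", 45), ("OH", "Springfield", 55)]
cm = CorrelationMap()
for state, city, _salary in rows:
    cm.insert(city, state)

boston_before = cm.lookup("Boston")
cm.delete("Boston", "NH")
boston_after = cm.lookup("Boston")
```

The counts are what make deletions safe: a clustered value is dropped from an entry only when the last tuple carrying that (u, c) pair is removed.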

CMs: Compression
– CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table)
– Achieves compression by mapping values → values, not values → tuples
– Possible to build many CMs; dedicated CM per query
– Improve performance by reducing buffer pool pressure

3. Multi-attribute keys
Combined attributes may predict the clustered key better than either attribute alone: (longitude, latitude) → zip_code
Challenges:
–Finding these is non-trivial
–Combining attributes leads to many-valued keys, which in turn lead to large CMs
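A small sketch of why a combined key can predict the clustered attribute better than either attribute alone, measured with the c_per_u statistic from the cost model. The function name, the column names, and the four-row grid are hypothetical.

```python
from collections import defaultdict


def c_per_u(rows, u_attrs, c_attr):
    """Average number of distinct clustered values per distinct unclustered key."""
    clustered_by_key = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in u_attrs)
        clustered_by_key[key].add(row[c_attr])
    total = sum(len(values) for values in clustered_by_key.values())
    return total / len(clustered_by_key)


# Made-up grid: every (lon, lat) cell has its own zip code.
rows = [
    {"lon": 1, "lat": 1, "zip": "A"},
    {"lon": 1, "lat": 2, "zip": "B"},
    {"lon": 2, "lat": 1, "zip": "C"},
    {"lon": 2, "lat": 2, "zip": "D"},
]

lon_only = c_per_u(rows, ["lon"], "zip")         # each lon spans two zips
lat_only = c_per_u(rows, ["lat"], "zip")         # each lat spans two zips
both = c_per_u(rows, ["lon", "lat"], "zip")      # each pair pins one zip
```

The combined key also has more distinct values (four instead of two here), which is the size problem the slide raises.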

CM Advisor
The CM Advisor considers all possible attribute combinations for clustered and unclustered keys, given a training set of queries
Buckets: collapse a range of key values into one
Bucketing clustered keys
–Leads to longer sequential disk reads
–Boston:MA versus Boston:MA,MI
Bucketing unclustered keys
–Merging two unclustered buckets may increase disk seeks
–Boston:MA versus Boise,Boston:ID,MA
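The two kinds of bucketing can be sketched on a dictionary-based CM. This is an illustrative sketch only: the function names and the bucketing rules (first letter of the city, a shared MA/MI range) are assumptions chosen to reproduce the examples on the slide.

```python
def bucket_unclustered(cm, bucket_of):
    """Merge CM entries whose unclustered keys fall into the same bucket."""
    bucketed = {}
    for u, clustered in cm.items():
        bucketed.setdefault(bucket_of(u), set()).update(clustered)
    return bucketed


def bucket_clustered(cm, bucket_of):
    """Replace each clustered value by the identifier of its bucket."""
    return {u: {bucket_of(c) for c in clustered} for u, clustered in cm.items()}


cm = {"Boise": {"ID"}, "Boston": {"MA"}, "Springfield": {"MA", "OH"}}

# Hypothetical rule: group cities by first letter.
# A Boston lookup now also has to visit ID (an extra seek).
by_initial = bucket_unclustered(cm, lambda city: city[0])


# Hypothetical rule: MA and MI share one clustered bucket.
# A Boston lookup now reads the whole MA-MI range sequentially.
def state_bucket(state):
    return "MA-MI" if state in ("MA", "MI") else state


by_range = bucket_clustered(cm, state_bucket)
```

Both transformations shrink the map; the difference lies in what extra I/O they cause at lookup time, as the slide notes.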

4. Experimental evaluation
Original query:
SELECT … WHERE City IN (Boston, Springf)
Query rewritten using the CM:
SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH)
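A minimal sketch of this kind of rewriting, assuming the CM is available as a dictionary from city to set of states. The function name, the table and column names, and the sample map are hypothetical, and the string building is for illustration rather than production use.

```python
def rewrite_with_cm(table, u_col, c_col, u_values, cm):
    """Add an IN predicate on the clustered column, derived from the CM."""
    # Union of the clustered values recorded for every requested value.
    clustered = sorted(set().union(*(cm.get(u, set()) for u in u_values)))

    def quote(values):
        # SQL string literals, with embedded single quotes doubled.
        return ", ".join("'" + v.replace("'", "''") + "'" for v in values)

    # The original predicate is kept: the CM only narrows where to look.
    sql = f"SELECT * FROM {table} WHERE {u_col} IN ({quote(u_values)})"
    if clustered:
        sql += f" AND {c_col} IN ({quote(clustered)})"
    return sql


cm = {"Boston": {"MA", "NH"}, "Springfield": {"MA", "OH"}}
rewritten = rewrite_with_cm("Salaries", "City", "State",
                            ["Boston", "Springfield"], cm)
print(rewritten)
```

Keeping the predicate on the unclustered column matters because the clustered ranges named by the CM also contain tuples for other cities, which still have to be filtered out.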

Benefit of correlation
SELECT * FROM lineitem WHERE shipdate IN ( , …)

eBay category data
– Hierarchies of products in categories: antiques→architectural→hardware→locks & keys
– 24,000 categories, up to 6 levels deep
– Clustered by catID
– Correlation: catID → price
– Generated unique ItemIDs for 43 million rows (3.5GB)

Maintenance costs: CM vs B+Tree
– Index updates fit in memory
– Each B+Tree: 1.5GB
– Each CM: 300K

Mixed workload performance (5 indexes each)
– Selects slow down inserts even more due to buffer pool pressure!
– Total B+Tree size: 7.7GB
– Total CM size: 1.4MB

SDSS Skyserver data
– Celestial objects and their optical properties
– PhotoObj: right ascension (ra), declination (dec)
– Clustered by fieldID
– Correlation: (ra, dec) → fieldID
– Initial data: 200k tuples
– Copied ra and dec windows 10x to produce 20M tuples, 3GB

Multi-attribute index performance
SELECT COUNT(*) FROM PhotoObj WHERE < ra < AND 1.41 < dec < 1.55 AND 23 < g+rho < 25
Correlation: (ra, dec) → fieldID
[Figure: comparison of CM(ra), CM(dec), CM(ra,dec) and BTree(ra,dec)]

Related ideas
BHUNT/CORDS
–Similar measure of correlation for query opt.
–Doesn’t discuss indexing, no cost model
ADC Clustering
–Proposes reclustering, but no cost model/designer
Microsoft SQL Server: datetime clustering
–Limited to datetime types
Index compression (Prefix B+Tree, delta encoding, …)
–Compression rates in the range of 2x

Summary
– Correlations between attributes arise naturally in a variety of applications
– Correlations determine the cost of secondary index lookups
– We presented a correlation-aware cost model and advisor to decide when to build CMs
– Multi-attribute CMs capture more correlations; bucketing keeps them tiny
– Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically x smaller than secondary B+Trees

Model accuracy
SELECT Avg(Price) FROM Ebay WHERE Category=X

Isolated CM performance vs. secondary B+Tree
– Slightly slower on an isolated query; the CM must filter out non-matching tuples
– B+Tree: 860MB
– CM: 900KB

Bucketing
– Acceptable performance
– Smaller size
– Random-sample synopsis from table
– Try unclustered bucket sizes: 2², 2³, …
– Output candidates grouped by size, ordered by c_per_u
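The enumeration described on this slide might look roughly like the following. It is a sketch under several assumptions: unclustered values are integers bucketed by integer division, "size" is taken to be the number of stored (bucket, clustered value) pairs, and the function name, parameters, and sample data are hypothetical.

```python
import random
from collections import defaultdict


def candidate_designs(rows, max_exp=5, sample_size=1000, seed=0):
    """rows: list of (u, c) pairs with integer u.

    Returns (bucket_width, entries, c_per_u) tuples, smallest maps first.
    """
    rng = random.Random(seed)
    # Work on a random-sample synopsis rather than the full table.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    candidates = []
    for exp in range(2, max_exp + 1):          # widths 2^2, 2^3, ...
        width = 2 ** exp
        cm = defaultdict(set)
        for u, c in sample:
            cm[u // width].add(c)
        entries = sum(len(cs) for cs in cm.values())
        candidates.append((width, entries, entries / len(cm)))
    # Group by size; within a size, prefer fewer clustered values per bucket.
    return sorted(candidates, key=lambda cand: (cand[1], cand[2]))


# Made-up data: 64 unclustered values, 8 per clustered value.
rows = [(u, u // 8) for u in range(64)]
designs = candidate_designs(rows)
```

On this data, widths above 8 no longer shrink the map but do raise c_per_u, so the width of 8 sorts first.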