Published by Deirdre Patterson. Modified over 8 years ago.
1
Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies
George Huo (Google, Inc.)
With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)
2
Two observations
3
1. Correlations abound
Attributes tend to encode related information (these are soft functional dependencies)
Geographic: {zip code, city, state, longitude/latitude}, e.g. 02116, Boston, MA, 71° 05' W
{manufacturer, model, year}, e.g. Honda, Civic Hybrid, 2007
{shipdate, receiptdate}
4
2. Secondary indexes are often useless for range and aggregation queries
5
How can we improve the access pattern of a secondary index?
SELECT * FROM lineitem WHERE orderdate = '2009-08-26'
Clustered access pattern: table sorted by orderdate (clustered index on orderdate); one seek
Unclustered access pattern: table sorted by order_id (secondary index on orderdate); many seeks
6
Our contribution: Exploiting correlations to improve secondary index performance
7
lineitem access pattern
SELECT * FROM lineitem WHERE orderdate = '2007-01-03'
Clustered by primary key (uncorrelated)
Clustered by shipdate (correlated)
8
Correlation determines index performance
[Figure: index performance for different sort orders; labels: Very Correlated, Poorly Correlated]
9
Our system:
1. Cost model with correlations
2. Correlation maps
3. Multi-attribute keys
4. Evaluation
10
1. Cost model with correlations
SELECT * FROM lineitem WHERE receiptdate IN (i, j)
shipdate (clustered), receiptdate (unclustered)
c_per_u: average number of clustered attribute values per unclustered attribute value
[Figure labels: 2 lookups, 3 c_per_u, 10ms, 3 levels, 1ms, 3 pages per shipdate, 20s]
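The c_per_u statistic defined on this slide can be computed directly from (unclustered, clustered) value pairs. The following is a minimal sketch, not the authors' implementation; the function name and the sample dates are hypothetical, and it only illustrates the definition, not the full cost formula.

```python
from collections import defaultdict

def c_per_u(pairs):
    """Average number of distinct clustered values per distinct unclustered value.

    pairs: iterable of (unclustered_value, clustered_value) tuples.
    """
    clustered_by_u = defaultdict(set)
    for u, c in pairs:
        clustered_by_u[u].add(c)
    if not clustered_by_u:
        return 0.0
    return sum(len(cs) for cs in clustered_by_u.values()) / len(clustered_by_u)

# Hypothetical (receiptdate, shipdate) pairs.
# Correlated: each receiptdate maps to a few nearby shipdates.
correlated = [("01-05", "01-03"), ("01-05", "01-04"),
              ("01-06", "01-04"), ("01-06", "01-05")]
# Less correlated: each receiptdate maps to shipdates spread over the table.
uncorrelated = [("01-05", "01-01"), ("01-05", "02-10"), ("01-05", "03-20"),
                ("01-06", "01-15"), ("01-06", "02-25"), ("01-06", "03-30")]

print(c_per_u(correlated), c_per_u(uncorrelated))
```

A lower c_per_u means each lookup on the unclustered attribute touches fewer distinct regions of the clustered table.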
11
2. Correlation Maps
CREATE TABLE Salaries( State string PRIMARY_KEY, City string, Salary integer);
SELECT * FROM Salaries WHERE city = 'Boston';
Clustered attribute: State
Unclustered attribute: City
[Figure: Correlation Map, Clustered B+Tree]
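A lookup through a correlation map can be sketched as follows: the map translates the unclustered value (City) into the clustered values (State) it co-occurs with, the clustered index is scanned for those values, and non-matching tuples are filtered out. This is an illustrative sketch only; build_cm, lookup, and the dictionary standing in for the clustered B+Tree are assumptions, not the system's actual interfaces.

```python
from collections import defaultdict

# Hypothetical rows of Salaries: (state, city, salary).
rows = [
    ("MA", "Boston", 50), ("MA", "Springfield", 40),
    ("OH", "Springfield", 45), ("NH", "Boston", 30), ("ID", "Boise", 20),
]

def build_cm(rows):
    """Map each unclustered value (city) to the set of clustered values (states)."""
    cm = defaultdict(set)
    for state, city, _salary in rows:
        cm[city].add(state)
    return cm

def lookup(table_by_state, cm, city):
    """Scan only the clustered ranges the CM points to, then filter by city."""
    results = []
    for state in sorted(cm.get(city, ())):
        for row in table_by_state.get(state, []):  # stands in for one clustered range scan
            if row[1] == city:  # the CM is lossy, so non-matching tuples are filtered
                results.append(row)
    return results

# A dict keyed by state stands in for the clustered B+Tree.
table_by_state = defaultdict(list)
for row in rows:
    table_by_state[row[0]].append(row)

cm = build_cm(rows)
print(lookup(table_by_state, cm, "Boston"))
```

Note that the map stores value-to-value pairs rather than pointers to individual tuples, which is why the final filter step is needed.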
12
CMs: Usage
Populated using an initial scan of the table
Insertions/deletions: keep a co-occurrence count for each (u, c) pair
Physically stored as a B+Tree in the DB
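The co-occurrence counting described on this slide might look like the sketch below. The class and method names are hypothetical, and an in-memory dictionary replaces the B+Tree storage; the point is only that a (u, c) pair leaves the map when its count drops to zero.

```python
from collections import defaultdict

class CorrelationMap:
    """Toy correlation map keeping a co-occurrence count per (u, c) pair."""

    def __init__(self):
        self.counts = defaultdict(dict)  # u -> {c: number of tuples with (u, c)}

    def insert(self, u, c):
        self.counts[u][c] = self.counts[u].get(c, 0) + 1

    def delete(self, u, c):
        per_u = self.counts.get(u)
        if not per_u or c not in per_u:
            raise KeyError((u, c))
        per_u[c] -= 1
        if per_u[c] == 0:
            # Last tuple with this pair is gone, so drop the mapping.
            del per_u[c]
            if not per_u:
                del self.counts[u]

    def clustered_values(self, u):
        return set(self.counts.get(u, {}))

cm = CorrelationMap()
cm.insert("Boston", "MA")
cm.insert("Boston", "MA")
cm.insert("Boston", "NH")
cm.delete("Boston", "MA")  # one Boston/MA tuple remains, so MA stays mapped
print(cm.clustered_values("Boston"))
```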
13
CMs: Compression
CMs are typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table)
Achieves compression by mapping values → values, not values → tuples
Possible to build many CMs; dedicated CM per query
Improves performance by reducing buffer pool pressure
14
3. Multi-attribute keys
Combined attributes may predict the clustered key better than either attribute alone
(longitude, latitude) → zip_code
Challenges:
–Finding these is non-trivial
–Combining attributes leads to many-valued keys, which lead to large CMs
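The trade-off on this slide can be illustrated with a small sketch that measures how well a single attribute versus a combined key predicts the clustered value. The function name, column names, and the four sample rows are hypothetical.

```python
from collections import defaultdict

def c_per_u(rows, key_cols, clustered_col):
    """Average number of distinct clustered values per distinct (composite) key."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[k] for k in key_cols)
        seen[key].add(row[clustered_col])
    return sum(len(v) for v in seen.values()) / len(seen)

# Hypothetical grid: each (lon, lat) cell has its own zip code.
rows = [
    {"lon": 1, "lat": 1, "zip": "A"},
    {"lon": 1, "lat": 2, "zip": "B"},
    {"lon": 2, "lat": 1, "zip": "C"},
    {"lon": 2, "lat": 2, "zip": "D"},
]

print(c_per_u(rows, ["lon"], "zip"))         # each lon value spans two zips
print(c_per_u(rows, ["lat"], "zip"))         # each lat value spans two zips
print(c_per_u(rows, ["lon", "lat"], "zip"))  # the pair pins down one zip
```

In this toy data the combined key is a better predictor, but it has four distinct keys instead of two, which is the size growth the slide lists as a challenge.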
15
CM Advisor
The CM Advisor considers all possible attribute combinations for clustered and unclustered keys given a training set of queries
Buckets: collapse a range of key values into one
Bucketing clustered keys
–Leads to longer sequential disk reads
–Boston:MA versus Boston:MA,MI
Bucketing unclustered keys
–Merging two unclustered buckets may increase disk seeks
–Boston:MA versus Boise,Boston:ID,MA
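Bucketing of unclustered keys, as in the Boise,Boston:ID,MA example above, can be sketched as a merge of map entries. The function bucket_cm and the first-letter bucketing rule are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def bucket_cm(cm, bucket_of):
    """Collapse unclustered keys into buckets, unioning their clustered values."""
    bucketed = defaultdict(set)
    for u, clustered_values in cm.items():
        bucketed[bucket_of(u)] |= clustered_values
    return dict(bucketed)

# Hypothetical unbucketed map: city -> states.
cm = {"Boise": {"ID"}, "Boston": {"MA"}, "Springfield": {"MA", "OH"}}

# Bucket cities by first letter (an arbitrary rule for this example).
by_first_letter = bucket_cm(cm, lambda city: city[0])
print(by_first_letter)
```

The bucketed map has fewer entries, but a lookup for Boston now also visits ID, which is the extra seek cost the slide warns about.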
16
4. Experimental evaluation
SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH)
SELECT … WHERE City IN (Boston, Springf)
17
Benefit of correlation
SELECT * FROM lineitem WHERE shipdate IN ('2009-01-03', …)
18
eBay category data
Hierarchies of products in categories
antiques → architectural → hardware → locks & keys
24,000 categories up to 6 levels deep
Clustered by catID
Correlation: catID → price
Generated unique ItemIDs for 43 million rows (3.5GB)
19
Maintenance costs: CM vs B+Tree
Index updates fit in memory
Each B+Tree: 1.5GB
Each CM: 300K
20
Mixed workload performance (5 indexes each)
Selects slow down inserts even more due to buffer pool pressure!
Total B+Tree size: 7.7GB
Total CM size: 1.4MB
21
SDSS Skyserver data
Celestial objects and their optical properties
PhotoObj: right ascension (ra), declination (dec)
Clustered by fieldID
Correlation: (ra, dec) → fieldID
Initial data: 200k tuples
Copied ra and dec windows 10x to produce 20M tuples, 3GB
22
Multi-attribute index performance
SELECT COUNT(*) FROM PhotoObj WHERE ra > 193.1 AND ra < 194.5 AND dec > 1.41 AND dec < 1.55 AND g + rho > 23 AND g + rho < 25
Correlation: (ra, dec) → fieldID
[Figure series: CM(ra), CM(dec), CM(ra,dec), BTree(ra,dec)]
23
Related ideas
BHUNT/CORDS
–Similar measure of correlation for query optimization
–Doesn't discuss indexing, no cost model
ADC Clustering
–Proposes reclustering, but no cost model/designer
Microsoft SQL Server: datetime clustering
–Limited to datetime types
Index compression (Prefix B+Tree, delta encoding, …)
–Compression rates in the range of 2x
24
Summary
Correlations between attributes arise naturally in a variety of applications
Correlations determine the cost of secondary index lookups
We presented a correlation-aware cost model and advisor to decide when to build CMs
Multi-attribute CMs capture more correlations; bucketing keeps them tiny
Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically 10-1000x smaller than secondary B+Trees
26
Model accuracy
SELECT Avg(Price) FROM Ebay WHERE Category=X
27
Isolated CM performance vs. secondary B+Tree
Slightly slower on isolated query; CM must filter unmatching tuples
B+Tree: 860MB
CM: 900KB
28
Bucketing
Acceptable performance
Smaller size
Random-sample synopsis from table
Try unclustered bucket sizes: 2², 2³, …
Output candidates grouped by size, ordered by c_per_u
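The enumeration of bucket sizes listed on this slide could be sketched as below. This is a guess at the shape of the procedure, not the advisor's actual algorithm: it assumes integer unclustered keys bucketed by integer division, uses the number of buckets as the size measure, and the function and field names are hypothetical.

```python
from collections import defaultdict

def candidates(sample, exponents=range(2, 5)):
    """Try unclustered bucket widths 2**e and report map size and c_per_u for each.

    sample: (unclustered_int, clustered_value) pairs from a random sample.
    """
    out = []
    for e in exponents:
        width = 2 ** e
        cm = defaultdict(set)
        for u, c in sample:
            cm[u // width].add(c)  # keys in [k*width, (k+1)*width) share bucket k
        if not cm:
            continue
        pairs = sum(len(cs) for cs in cm.values())
        out.append({"width": width, "buckets": len(cm), "c_per_u": pairs / len(cm)})
    # Smallest maps first; ties broken by lower c_per_u.
    return sorted(out, key=lambda d: (d["buckets"], d["c_per_u"]))

# Hypothetical sample: 16 unclustered keys, 4 consecutive keys per clustered value.
sample = [(u, u // 4) for u in range(16)]
result = candidates(sample)
for cand in result:
    print(cand)
```

On this sample, wider buckets shrink the map but raise c_per_u, so the output makes the size versus lookup-cost trade-off visible for each candidate.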