Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies
George Huo, Google, Inc.
With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)
Two observations
1. Correlations abound
Attributes tend to encode related information (these are soft functional dependencies)
Geographic: {zip code, city, state, longitude/latitude}, e.g. 02116, Boston, MA, 71° 05' W
{manufacturer, model, year}, e.g. Honda, Civic Hybrid, 2007
{shipdate, receiptdate}
2. Secondary indexes are often useless for range and aggregation queries
How can we improve the access pattern of a secondary index?
SELECT * FROM lineitem WHERE orderdate=‘ ’
Clustered access pattern: sorted by orderdate (clustered index on orderdate), one seek
Unclustered access pattern: sorted by order_id (secondary index on orderdate), many seeks
Our contribution: Exploiting correlations to improve secondary index performance
lineitem access pattern
SELECT * FROM lineitem WHERE orderdate =
Clustered by primary key (uncorrelated)
Clustered by shipdate (correlated)
Correlation determines index performance
Different sort orders: very correlated versus poorly correlated
Our system:
1. Cost model with correlations
2. Correlation maps
3. Multi-attribute keys
4. Evaluation
1. Cost model with correlations
SELECT * FROM lineitem WHERE receiptdate IN (i, j)
shipdate (clustered), receiptdate (unclustered)
c_per_u: average number of clustered attribute values per unclustered attribute value
Values shown on the slide: 2 lookups, 3 c_per_u, 10ms, 3 levels, 1ms, 3 pages per shipdate, 20s
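A minimal sketch of how the quantities on this slide might combine into a lookup cost. The formula, the function name, and the parameter names are assumptions made for illustration; in particular, reading 10ms as a per-level seek cost and 1ms as a per-page sequential read cost is a guess at what the figure showed.

```python
def correlated_lookup_cost_ms(n_lookups, c_per_u, seek_ms, btree_levels,
                              page_read_ms, pages_per_c):
    """Estimate the cost of a secondary lookup routed through the clustered index.

    Assumed model: every unclustered value maps to c_per_u clustered values,
    and each clustered value costs one B+Tree descent (seek_ms per level)
    plus a sequential read of pages_per_c pages.
    """
    cost_per_clustered_value = seek_ms * btree_levels + page_read_ms * pages_per_c
    return n_lookups * c_per_u * cost_per_clustered_value


# Plugging in the slide's numbers under the assumed reading of each value.
cost = correlated_lookup_cost_ms(
    n_lookups=2, c_per_u=3, seek_ms=10, btree_levels=3,
    page_read_ms=1, pages_per_c=3,
)
print(cost)  # 198
```

Under this model the cost grows linearly in c_per_u, which is why a strongly correlated clustered attribute (small c_per_u) keeps secondary lookups cheap.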
2. Correlation Maps
CREATE TABLE Salaries( State string PRIMARY_KEY, City string, Salary integer);
SELECT * FROM Salaries WHERE city='Boston';
Clustered attribute: State
Unclustered attribute: City
Figure: Correlation Map alongside the clustered B+Tree
CMs: Usage
Populated using an initial scan of the table
Insertions/deletions: keep a co-occurrence count for each (u, c) pair
Physically stored as a B+Tree in the DB
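The usage rules above can be sketched as a small in-memory structure. This is an illustrative sketch, not the system's implementation: the class name, the method names, and the sample rows are hypothetical, and a plain dictionary stands in for the B+Tree that physically stores the CM.

```python
class CorrelationMap:
    """Maps each unclustered value u to the clustered values c it co-occurs with."""

    def __init__(self):
        # counts[u][c] = number of tuples holding unclustered value u
        # together with clustered value c (the co-occurrence count).
        self.counts = {}

    def insert(self, u, c):
        per_u = self.counts.setdefault(u, {})
        per_u[c] = per_u.get(c, 0) + 1

    def delete(self, u, c):
        per_u = self.counts[u]
        per_u[c] -= 1
        if per_u[c] == 0:
            # Last tuple with this (u, c) pair is gone, so drop the mapping.
            del per_u[c]
            if not per_u:
                del self.counts[u]

    def lookup(self, u):
        """Return the clustered values that may contain tuples matching u."""
        return sorted(self.counts.get(u, {}))


# Hypothetical rows of Salaries(State, City, Salary), clustered by State.
rows = [
    ("MA", "Boston", 90),
    ("MA", "Springfield", 70),
    ("NH", "Boston", 60),
    ("OH", "Springfield", 65),
]

cm = CorrelationMap()
for state, city, _salary in rows:  # initial scan of the table
    cm.insert(city, state)


def select_by_city(city):
    """Answer WHERE city = ... by visiting only the states the CM names."""
    states = set(cm.lookup(city))
    # A real system would scan the clustered index ranges for these states;
    # filtering the list stands in for that here. The city check is still
    # needed because a state also holds tuples for other cities.
    return [row for row in rows if row[0] in states and row[1] == city]
```

For example, `cm.lookup("Boston")` returns `["MA", "NH"]`, so the query touches only those two clustered ranges. The counts are what make deletions safe: a mapping is removed only when its count reaches zero.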
CMs: Compression
CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table)
Achieves compression by mapping values → values, not values → tuples
Possible to build many CMs; dedicated CM per query
Improve performance by reducing buffer pool pressure
3. Multi-attribute keys
Combined attributes may predict the clustered key better than either attribute alone
(longitude, latitude) → zip_code
Challenges:
–Finding these is non-trivial
–Combining attributes leads to many-valued keys, which leads to large CMs
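To illustrate why a combined key can predict the clustered attribute better, the sketch below computes c_per_u (average number of distinct clustered values per unclustered key) for single and composite keys. The function name, column names, and the four-row grid are hypothetical and chosen only to make the effect visible.

```python
def c_per_u(rows, u_cols, c_col):
    """Average number of distinct clustered values per unclustered key.

    u_cols lists the attributes forming the (possibly composite) unclustered
    key; c_col names the clustered attribute.
    """
    cm = {}
    for row in rows:
        key = tuple(row[col] for col in u_cols)
        cm.setdefault(key, set()).add(row[c_col])
    return sum(len(cs) for cs in cm.values()) / len(cm)


# Hypothetical 2x2 grid: each (lon, lat) cell has its own zip code.
rows = [
    {"lon": 0, "lat": 0, "zip": "A"},
    {"lon": 0, "lat": 1, "zip": "B"},
    {"lon": 1, "lat": 0, "zip": "C"},
    {"lon": 1, "lat": 1, "zip": "D"},
]

print(c_per_u(rows, ["lon"], "zip"))         # 2.0
print(c_per_u(rows, ["lat"], "zip"))         # 2.0
print(c_per_u(rows, ["lon", "lat"], "zip"))  # 1.0
```

The composite key halves c_per_u here, but it also has four distinct keys instead of two, which is the size problem named in the second challenge.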
CM Advisor
The CM Advisor considers all possible attribute combinations for clustered and unclustered keys, given a training set of queries
Buckets: collapse a range of key values into one
Bucketing clustered keys
–Leads to longer sequential disk reads
–Boston:MA versus Boston:MA,MI
Bucketing unclustered keys
–Merging two unclustered buckets may increase disk seeks
–Boston:MA versus Boise,Boston:ID,MA
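The two kinds of bucketing can be sketched on a CM held as a dictionary from unclustered value to a set of clustered values. The helper names and the fixed-width bucketing rule (group consecutive values in sort order) are assumptions for illustration; the slide only says that a bucket collapses a range of key values.

```python
def range_buckets(values, width):
    """Assign each value to a bucket of `width` consecutive values in sort order."""
    ordered = sorted(set(values))
    bucket_of = {}
    for start in range(0, len(ordered), width):
        bucket = tuple(ordered[start:start + width])
        for value in bucket:
            bucket_of[value] = bucket
    return bucket_of


def bucket_clustered(cm, width):
    """Replace each clustered value with its bucket (a wider range to read)."""
    all_clustered = set().union(*cm.values())
    bucket_of = range_buckets(all_clustered, width)
    return {u: {bucket_of[c] for c in cs} for u, cs in cm.items()}


def bucket_unclustered(cm, width):
    """Merge the entries of unclustered values that share a bucket."""
    bucket_of = range_buckets(cm.keys(), width)
    merged = {}
    for u, cs in cm.items():
        merged.setdefault(bucket_of[u], set()).update(cs)
    return merged


# Boston:MA becomes Boston:MA,MI once MA and MI share a clustered bucket.
print(bucket_clustered({"Boston": {"MA"}, "Detroit": {"MI"}}, 2))

# Boston:MA becomes Boise,Boston:ID,MA once the two cities share a bucket.
print(bucket_unclustered({"Boise": {"ID"}, "Boston": {"MA"}}, 2))
```

In the first case a Boston lookup reads one longer contiguous range; in the second it must also visit ID, a separate clustered value, which is the extra seek the slide warns about.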
4. Experimental evaluation
SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH)
SELECT … WHERE City IN (Boston, Springf)
Benefit of correlation SELECT * FROM lineitem WHERE shipdate IN ( , …)
eBay category data
Hierarchies of products in categories
antiques→architectural→hardware→locks & keys
24,000 categories, up to 6 levels deep
Clustered by catID
Correlation: catID → price
Generated unique ItemIDs for 43 million rows (3.5GB)
Maintenance costs: CM vs B+Tree
Index updates fit in memory
Each B+Tree: 1.5GB
Each CM: 300KB
Mixed workload performance (5 indexes each)
Selects slow down inserts even more due to buffer pool pressure!
Total B+Tree size: 7.7GB
Total CM size: 1.4MB
SDSS Skyserver data
Celestial objects and their optical properties
PhotoObj: right ascension (ra), declination (dec)
Clustered by fieldID
Correlation: (ra, dec) → fieldID
Initial data: 200k tuples
Copied ra and dec windows 10x to produce 20M tuples, 3GB
Multi-attribute index performance
SELECT COUNT(*) FROM PhotoObj WHERE < ra < AND 1.41 < dec < 1.55 AND 23 < g+rho < 25
Indexes compared: CM(ra), CM(dec), CM(ra,dec), BTree(ra,dec)
Correlation: (ra, dec) → fieldID
Related ideas
BHUNT/CORDS
–Similar measure of correlation for query optimization
–Doesn’t discuss indexing, no cost model
ADC Clustering
–Proposes reclustering, but no cost model/designer
Microsoft SQL Server: datetime clustering
–Limited to datetime types
Index compression (Prefix B+Tree, delta encoding, …)
–Compression rates in the range of 2x
Summary
Correlations between attributes arise naturally in a variety of applications
Correlations determine the cost of secondary index lookups
We presented a correlation-aware cost model and advisor to decide when to build CMs
Multi-attribute CMs capture more correlations; bucketing keeps them tiny
Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically x smaller than secondary B+Trees
Model accuracy SELECT Avg(Price) FROM Ebay WHERE Category=X
Isolated CM performance vs. secondary B+Tree
Slightly slower on an isolated query; the CM must filter out non-matching tuples
B+Tree: 860MB
CM: 900KB
Bucketing
Acceptable performance
Smaller size
Random-sample synopsis from table
Try unclustered bucket sizes: 2², 2³, …
Output candidates grouped by size, ordered by c_per_u
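A rough sketch of the search described on this slide, under several assumptions: unclustered keys are integers bucketed by integer division, CM size is approximated by the number of (bucket, clustered value) entries, and candidates are ordered by size first and c_per_u second. The function name, its parameters, and the test data are hypothetical.

```python
import random


def candidate_bucketings(pairs, max_exp=6, sample_size=1000, seed=0):
    """Evaluate unclustered bucket widths 2^2 .. 2^max_exp on a random sample.

    pairs is a list of (unclustered_value, clustered_value) tuples with
    integer unclustered values.
    """
    rng = random.Random(seed)
    # Random-sample synopsis: use the whole input when it is small enough.
    sample = pairs if len(pairs) <= sample_size else rng.sample(pairs, sample_size)

    candidates = []
    for exp in range(2, max_exp + 1):
        width = 2 ** exp
        cm = {}
        for u, c in sample:
            cm.setdefault(u // width, set()).add(c)
        entries = sum(len(cs) for cs in cm.values())  # proxy for CM size
        candidates.append(
            {"width": width, "entries": entries, "c_per_u": entries / len(cm)}
        )
    # Smaller CMs first; among equal sizes, lower c_per_u first.
    return sorted(candidates, key=lambda cand: (cand["entries"], cand["c_per_u"]))


# Hypothetical data: each run of 16 consecutive keys shares one clustered value.
pairs = [(u, u // 16) for u in range(64)]
for cand in candidate_bucketings(pairs):
    print(cand)
```

On this data, width 16 comes out first: it is as small as the wider buckets while keeping c_per_u at 1, which matches the slide's goal of a smaller CM with acceptable performance.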