Presentation is loading. Please wait.

Presentation is loading. Please wait.

Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin.

Similar presentations


Presentation on theme: "Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin."— Presentation transcript:

1 Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin (Brown), Samuel Madden (MIT CSAIL), Stanley B. Zdonik (Brown)

2 Two observations

3 1. Correlations abound Attributes tend to encode related info (these are soft functional dependencies) 02116BostonMA71° 05'WHonda2007Civic HybridReceiptdateShipdate {zip code, city, state, long/latitude} {manufacturer, model, year} {shipdate, receiptdate} Geographic

4 2. Secondary indexes are often useless for range and aggregation queries

5 Clustered access pattern Unclustered access pattern How can we improve the access pattern of a secondary index? SELECT * FROM lineitem WHERE orderdate=‘2009-08-26’ One seek Sorted by orderdate (clustered index on orderdate) Sorted by order_id (secondary index on orderdate) Many seeks

6 Our contribution: Exploiting correlations to improve secondary index performance

7 lineitem access pattern Clustered by primary key (uncorrelated) SELECT * FROM lineitem WHERE orderdate = 2007-01-03 Clustered by shipdate (correlated)

8 Correlation determines index performance Very Correlated Poorly Correlated Different sort orders

9 Our system: 1. Cost model with correlations 2. Correlation maps 3. Multi-attribute keys 4. Evaluation

10 i j shipdate (clustered) receiptdate (unclustered) 1. Cost model with correlations SELECT * FROM lineitem WHERE receiptdate IN (i, j) c_per_u: average number of clustered attribute values per unclustered attribute value 2 lookups 3 c_per_u 10ms 3 levels 1ms 3 pages per shipdate 20s

11 Correlation Map Clustered B+Tree 2. Correlation Maps CREATE TABLE Salaries( State string PRIMARY_KEY, City string, Salary integer); SELECT * FROM Salaries WHERE city=`Boston’; Clustered Attribute: State Unclustered Attribute: City

12 CMs: Usage Populated using initial scan of the table Insertions/deletions: keep a co-occurrence count for each (u, c) pair Physically stored as a B+Tree in the DB

13 CMs: Compression CMs typically 10x-1000x smaller than a secondary B+Tree (1KB for a 5GB table) Achieves compression by mapping values → values, not values → tuples Possible to build many CMs; dedicated CM per query Improve performance by reducing buffer pool pressure

14 3. Multi-attribute keys Combined attributes may predict the clustered key better than either attr alone (longitude, latitude) → zip_code Challenges: –Finding these is non-trivial –Combining attributes leads to many-valued keys leads to large CMs

15 CM Advisor The CM Advisor considers all possible attribute combinations for clustered and unclustered keys given a training set of queries Buckets: collapse a range of key values into one Bucketing clustered keys –Leads to longer sequential disk reads –Boston:MA versus Boston:MA,MI Bucketing unclustered keys –Merging two unclustered buckets may increase disk seeks –Boston:MA versus Boise,Boston:ID,MA ClusteredUnclustered ClusteredUnclustered

16 4. Experimental evaluation SELECT … WHERE City IN (Boston, Springf) AND State IN (MA,NH,OH) SELECT … WHERE City IN (Boston, Springf)

17 Benefit of correlation SELECT * FROM lineitem WHERE shipdate IN (2009-01-03, …)

18 eBay category data Hierarchies of products in categories antiques→architectural→hardware→locks & keys 24,000 categories up to 6 levels deep Clustered by catID Correlation: catID → price Generated unique ItemIDs for 43 million rows (3.5GB)

19 Maintenance costs: CM vs B+Tree Index updates fit in memory Each B+Tree: 1.5GB Each CM: 300K

20 Mixed workload performance (5 indexes each) Selects slow down inserts even more due to buffer pool pressure! Total B+Tree size: 7.7GB Total CM size: 1.4MB

21 SDSS Skyserver data Celestial objects and their optical properties PhotoObj: right ascension (ra), declination (dec) Clustered by fieldID Correlation: (ra, dec) → fieldID Initial data: 200k tuples Copied ra and dec windows 10x to produce 20M tuples, 3GB

22 Multi-attribute index performance SELECT COUNT(*) FROM PhotoObj WHERE 193.1 < ra < 194.5 AND 1.41 < dec < 1.55 AND 23 < g+rho < 25 CM(ra) CM(dec) CM(ra,dec) BTree(ra,dec) CM(ra) CM(dec) CM(ra,dec) BTree(ra,dec) Correlation: (ra, dec) → fieldID

23 Related ideas BHUNT/CORDS –Similar measure of correlation for query opt. –Doesn’t discuss indexing, no cost model ADC Clustering –Proposes reclustering, but no cost model/designer Microsoft SQL Server: datetime clustering –Limited to datetime types Index compression (Prefix B+Tree, delta encoding, …) –Compression rates in the range of 2x

24 Summary Correlations between attributes arise naturally in a variety of applications Correlations determine the cost of secondary index lookups We presented a correlation-aware cost model and advisor to decide when to build CMs Multi-attribute CMs capture more correlations; bucketing keeps them tiny Experiments show that correlated lookups with CMs are 2-38x faster, and CMs are typically 10-1000x smaller than secondary B+Trees

25

26 Model accuracy SELECT Avg(Price) FROM Ebay WHERE Category=X

27 Isolated CM performance vs. secondary B+Tree Slightly slower on isolated query; CM must filter unmatching tuples B+Tree: 860MB CM: 900KB

28 Bucketing Acceptable performance Smaller size Random-sample synopsis from table Try unclustered bucket sizes: 2², 2³, … Output candidates grouped by size, ordered by c_per_u


Download ppt "Correlation Maps: A Compressed Access Method for Exploiting Soft Functional Dependencies George Huo Google, Inc. With Hideaki Kimura (Brown), Alex Rasin."

Similar presentations


Ads by Google