Harikrishnan Karunakaran Sulabha Balan CSE 6339. Introduction • Icicles • Icicle Maintenance • Icicle-Based Estimators • Quality & Performance • Conclusion


1 Harikrishnan Karunakaran Sulabha Balan CSE 6339

2
• Introduction
• Icicles
• Icicle Maintenance
• Icicle-Based Estimators
• Quality & Performance
• Conclusion

3
• Analysis of data in data warehouses is useful for decision support
• Users of decision support systems want interactive systems: OLAP (Online Analytical Processing)
• Approximate Query Answering (AQUA) systems were developed to reduce response times to desirable levels
• Users of such systems are tolerant of approximate results

4 Various Approaches
• Sampling-based
• Histogram-based
• Clustering
• Probabilistic
• Wavelet-based

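5 Sales relation and its 50% sample S_sales

Sales
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (50% sample)
Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K

Query against the sample, with the sum multiplied by the scale factor 2:
SELECT SUM(sales) * 2 AS cnt FROM s_sales WHERE state = 'TX'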
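As a worked check of the query on this slide, the sketch below recomputes the scaled sum in Python from the table values. The variable and function names are illustrative only, and sales figures are in thousands.

```python
# Sales relation from the slide: (branch, state, sales in thousands).
sales = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The 50% sample shown on the slide (branches 2, 4, 6, 8, 10).
sample_branches = {2, 4, 6, 8, 10}
s_sales = [row for row in sales if row[0] in sample_branches]

# Scale factor = 1 / sampling fraction, which is 2 for a 50% sample.
scale_factor = len(sales) // len(s_sales)


def sum_sales(rows, state):
    """Sum the sales column over the rows that match the state predicate."""
    return sum(amount for _, st, amount in rows if st == state)


exact_tx = sum_sales(sales, "TX")                      # 42 + 75 + 55 + 38
estimate_tx = scale_factor * sum_sales(s_sales, "TX")  # 2 * (42 + 38)
relative_error = abs(estimate_tx - exact_tx) / exact_tx

print(exact_tx, estimate_tx, round(relative_error, 3))
```

The sample happens to miss the two largest Texas branches, so the scaled estimate (160K) falls well short of the exact total (210K).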

6 Sample relation for aggregation query workload regarding Texas branches

Sales
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales
Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K
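To make the effect of tuning the sample concrete, the following sketch compares the average Texas sales computed from a uniform 50% sample (branches 2, 4, 6, 8, 10) with the average computed from the Texas-focused sample on this slide. The names are illustrative, and sales figures are in thousands.

```python
# Sales relation: (branch, state, sales in thousands).
sales = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# Two samples of the same size: one uniform, one tuned to Texas queries.
uniform_sample = [row for row in sales if row[0] in {2, 4, 6, 8, 10}]
tuned_sample = [row for row in sales if row[0] in {2, 4, 5, 7, 8}]


def avg_sales(rows, state):
    """Average the sales column over the rows that match the state predicate."""
    matching = [amount for _, st, amount in rows if st == state]
    return sum(matching) / len(matching)


exact_avg = avg_sales(sales, "TX")             # (42 + 75 + 55 + 38) / 4
uniform_avg = avg_sales(uniform_sample, "TX")  # (42 + 38) / 2
tuned_avg = avg_sales(tuned_sample, "TX")      # all four TX tuples are present

print(exact_avg, uniform_avg, tuned_avg)
```

The tuned sample answers the Texas query exactly because it holds every Texas tuple; the price is that only one California tuple remains in it.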

7
• All tuples in a uniform random sample are treated as equally important for answering queries
• The sample needs to be tuned to contain the tuples that are most relevant to the queries in a workload
• This calls for a dynamic algorithm that changes the sample to suit the queries being executed in the workload

8
• Join synopsis: the join of a uniform random sample of a fact table with a set of accompanying dimension tables

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey
  AND LI_Suppkey = S_Suppkey AND C_Nationkey = N_Nationkey
  AND N_Regionkey = R_Regionkey AND R_Name = 'North America'
  AND O_Orderdate >= … AND O_Orderdate <= …;

9
• Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses
• A uniform random sample of the relation wastes memory
• OLAP queries exhibit locality in their data access

10
• Icicles are a class of samples that capture the data locality of aggregate queries over foreign-key joins
• They identify the focus of a query workload and sample accordingly
• An icicle is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)
• An icicle is therefore a non-uniform sample of the original relation R
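The definition above can be illustrated with a minimal Python sketch. It assumes that tuples are represented by integer ids and that the workload is given as lists of the tuple ids each query touched; both the data and the helper name `build_extension` are hypothetical.

```python
import random
from collections import Counter


def build_extension(relation, workload_results):
    """Return the multiset L: R plus every tuple set used to answer a query."""
    extension = list(relation)
    for result in workload_results:
        extension.extend(result)
    return extension


relation = list(range(1, 11))              # tuple ids 1..10, a stand-in for R
workload_results = [[2, 5, 7, 8], [5, 7]]  # hypothetical query result sets
extension = build_extension(relation, workload_results)

# Frequency of each tuple in L: 1 for membership in R, plus 1 per query use.
frequency = Counter(extension)

# A uniform random sample of L, drawn here without replacement.
rng = random.Random(0)
icicle = rng.sample(extension, k=5)

# Chance that a single uniform draw from L picks tuple t. It is proportional
# to the tuple's frequency, so the sample is non-uniform with respect to R.
draw_probability = {t: frequency[t] / len(extension) for t in relation}

print(len(extension), frequency[5], draw_probability[5], draw_probability[1])
```

In this toy workload tuple 5 appears three times in L, so a single draw is three times as likely to pick it as a tuple that no query has touched.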

11

12

13 The algorithm is efficient because:
• A uniform random sample of L ensures that a tuple's chance of selection into the icicle is proportional to its frequency
• Incremental maintenance of the icicle requires only the segment of R that satisfies the new query from the workload
• It builds on the reservoir sampling algorithm
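A minimal Python sketch of how reservoir sampling supports this kind of incremental maintenance is given below. It uses the standard reservoir scheme (often called Algorithm R) and assumes the icicle is maintained by feeding it first the relation R and then, for each new query, only the tuples that satisfied that query. The class name `IcicleReservoir` is hypothetical, and details of the original algorithm that the slide does not state are not reproduced here.

```python
import random


class IcicleReservoir:
    """Fixed-size uniform sample over a growing stream of tuples."""

    def __init__(self, size, seed=None):
        self.size = size
        self.sample = []
        self.seen = 0  # number of tuples of L consumed so far
        self._rng = random.Random(seed)

    def offer(self, tup):
        """Consume one tuple; it enters the sample with probability size/seen."""
        self.seen += 1
        if len(self.sample) < self.size:
            self.sample.append(tup)
            return
        slot = self._rng.randrange(self.seen)
        if slot < self.size:
            # Evict the tuple currently held in the chosen slot.
            self.sample[slot] = tup

    def absorb(self, tuples):
        """Feed a batch of tuples, such as the result set of a new query."""
        for tup in tuples:
            self.offer(tup)


icicle = IcicleReservoir(size=4, seed=42)
icicle.absorb(range(1, 11))   # initial pass over R (tuple ids 1..10)
icicle.absorb([2, 5, 7, 8])   # only the tuples that satisfied a new query

print(icicle.seen, sorted(icicle.sample))
```

After the second call the reservoir is a uniform sample of the 14-element multiset seen so far, and no tuple outside the new query's result had to be read again.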

14 SELECT average(*) FROM widget-tuners WHERE date.month = 'April'

15
• Although uniform sampling is used, the result is a biased sample
• A frequency relation is maintained over all tuples in the relation
• Different estimation mechanisms are used for Average, Count and Sum

16
• Average: the average taken over the set of distinct sample tuples that satisfy the query predicate is a good estimate of the true average
• Count: the sum of the expected contributions of all tuples in the sample that satisfy the given query
• Sum: the product of the average and count estimates
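The three estimators can be sketched in Python as follows. The slide does not give the formula for a tuple's expected contribution, so the sketch assumes it is |L| / (n · f(t)), where n is the icicle size, |L| is the size of the extended multiset and f(t) is the tuple's frequency; this weight makes the count unbiased when the icicle is a uniform sample of L, but it is an assumption rather than a quotation of the original method. The function names and the toy data are hypothetical.

```python
def estimate_average(icicle, predicate, value):
    """Average over the distinct sample tuples that satisfy the predicate."""
    distinct = {tid: value(row) for tid, row in icicle if predicate(row)}
    if not distinct:
        return None
    return sum(distinct.values()) / len(distinct)


def estimate_count(icicle, predicate, frequency, extension_size):
    """Sum of per-tuple expected contributions (assumed form: |L| / (n * f))."""
    n = len(icicle)
    return sum(
        extension_size / (n * frequency[tid])
        for tid, row in icicle
        if predicate(row)
    )


def estimate_sum(icicle, predicate, value, frequency, extension_size):
    """Product of the average and count estimates."""
    avg = estimate_average(icicle, predicate, value)
    if avg is None:
        return 0.0
    return avg * estimate_count(icicle, predicate, frequency, extension_size)


# Toy icicle of (tuple id, (state, sales)) pairs; tuple 5 was sampled twice.
icicle = [(2, ("TX", 42)), (5, ("TX", 75)), (5, ("TX", 75)), (4, ("CA", 42))]
frequency = {2: 2, 5: 3, 4: 1}   # hypothetical frequencies
extension_size = 16              # hypothetical |L|


def is_texas(row):
    return row[0] == "TX"


def sales_of(row):
    return row[1]


avg_tx = estimate_average(icicle, is_texas, sales_of)
count_tx = estimate_count(icicle, is_texas, frequency, extension_size)
sum_tx = estimate_sum(icicle, is_texas, sales_of, frequency, extension_size)
print(avg_tx, count_tx, sum_tx)
```

Dividing by the frequency is what corrects for the bias: a tuple that the workload has touched often is more likely to be in the icicle, so each of its occurrences counts for less.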

17
• A frequency attribute is added to the relation
• The starting frequency is set to 1 for all tuples
• A tuple's frequency is incremented each time the tuple is used to answer a query
• Frequencies of the relevant tuples are updated only when the icicle is updated with a new query
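The bookkeeping described above can be sketched as a small Python class. The class name `FrequencyRelation` and the integer tuple ids are hypothetical; the sketch only mirrors the three rules on the slide (start at 1, increment on use, update when a new query is absorbed).

```python
class FrequencyRelation:
    """Per-tuple usage counters kept alongside the relation."""

    def __init__(self, tuple_ids):
        # The starting frequency is 1 for every tuple of the relation.
        self.frequency = {tid: 1 for tid in tuple_ids}

    def record_query(self, result_ids):
        """Increment the frequency of each tuple used to answer a new query."""
        for tid in result_ids:
            self.frequency[tid] += 1

    def extension_size(self):
        """Size of the multiset L, which equals the sum of all frequencies."""
        return sum(self.frequency.values())


freq = FrequencyRelation(range(1, 11))  # tuple ids 1..10
freq.record_query([2, 5, 7, 8])         # first query in the workload
freq.record_query([5, 7])               # second query in the workload

print(freq.frequency[5], freq.frequency[1], freq.extension_size())
```

Because only the tuples in a query's result are touched, the cost of each update is bounded by the size of that result rather than by the size of the relation.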

18
• When queries exhibit data locality, the icicle contains more tuples from the frequently accessed subsets of the relation
• The accuracy of an approximate answer improves as the number of sample tuples used to compute it increases
• Queries that are 'focused' with respect to the workload will therefore obtain more accurate approximate answers from the icicle

19

20 Q workload: template for generating workloads

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey
  AND LI_Suppkey = S_Suppkey AND C_Nationkey = N_Nationkey
  AND N_Regionkey = R_Regionkey AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <= …

Template for obtaining approximate answers

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <= …

21

22 The error plots compare:
• A static uniform random sample on the join synopsis
• The icicle as it evolves with the workload
• Icicle-Complete, which is formed after the entire workload has been executed once

23

24 Mixed Workload

25
• The relative error of query answers from icicles decreases rapidly when queries focus on a set of core tuples
• The Icicle plot converges to the Icicle-Complete plot
• Quick convergence of the Icicle plot towards Icicle-Complete means that the icicle adapts fast

26
• The improvement from using icicles is not significant
• It can be concluded that icicles are, at worst, as good as static samples

27
• Icicles provide a class of samples that adapt to the characteristics of the workload
• They are never worse than static sampling
• They focus on relatively small subsets of the relation

28
• There are no significant gains in the case of a uniform workload
• There is a trade-off between accuracy and cost
• The benefit is restricted to scenarios where the queries in the workload are increasingly focused

29
• V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference.
• S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.

30 Thank You Questions?

