Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.

Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes

Goal of the papers we have studied so far is to approximately answer aggregation queries A ccurately Efficiently Main Idea of This Paper : Tailor the choice of samples to be robust for W similar not necessary identical to give workloads.

Two ways of Data mining for analyzing Large DB. 1)OLAP :- Online analytical processing is an approach to quickly answer multi-dimensional analytical queries using OLAP cubes. OLAP cube :- is a data structure that allows fast analysis of data. Example Data of Product 1)By city 2)By type 3)By product type Drawbacks: 1)Expensive 2)Resource intensive -wiki

2) Pre-computed samples -This method gives approximate answers very efficiently. -But may have larger errors, cause finding the samples with large variance is almost impossible. 3) Using Workloads What is a Workload ? Set of Transact-SQL statements that execute against a DB or databases that have to be tuned. - msdn Tuned :- Optimize performance of DB

How Sample set is generated from Workload? - In practical world Queries fall under a particular pattern. Eg : Info of Texas state. -The queries are run on the entire data base (R). -A column is added to the records. This column holds the data that states if the record was selected by the queries. (tagging) -If the record was selected by many queries its probability of being in the sample is higher. -This way Workload is used to generate Sample set.

R Icicle R(Q1) R(Q2) R(Q3) ICICLES : A new class of samples that tune themselves to a dynamic Workload. Outliers : Identify the tuples with Outlier values and store them in a separate relation. 1)Run query on T1 2)URSAMP on T2 3)Estimate the true result 4)Using methods in paper combine results. Outlier table (T1) No Outlier values Table (T2)

Drawbacks of Previous studies :- 1)They have intuitive appeal they lack rigorous problem formulation. 2)Do not deal with uncertainty in Workloads. 3)Ignore data variance in the data distribution of the aggregate column.

Architecture for Approximate Query Processing Offline component for selecting samples from (R) a)Rewrite the queries to use the sample to answer the query approximately. b)Report the answers with estimate errors.

Offline component selects samples from R. 1)Each record has Scale Factor Column (or in different relation) 2)Value for the aggregate column of each record is scaled up by multiplying by scale factor and then aggregated.

Error Metrics : y = correct ans y’ = approximate ans Relative error :E(Q) = (|y-y’|)/|y| Squared relative error : SE(Q) = (|y-y’|) 2 / |y| 2 Group By Query includes g groups yi = correct ans for ith group SE(Q) = (1/g) Σ i (yi – yi’) 2 / yi 2 A group by query with g groups g select queries with 1/g weight each.

pw = Probability Distribution pw(Q) = probability of query Q is given Mean Square error is MSE(pw) = ΣQ pw(Q) * SE(Q) Root mean squared error :

Fundamental Regions R = Relation W = Workspace n = no. of records in R R is divided into min no. of regions R1,R2,…… Rr Rj = each query in the W selects either all records in Rj or none. Upper Bound on no. of regions is min (2 |w|, n) Q1 = select no. between 10 and 50 Q2= select values between 40 and 70

Fixed Workload Problem Statement : (Fixed Samp) Input : R, W, k Output: A sample of k records such that MSE(W) is minimized. Soultion : (3 steps) 1)Identify fundamental regions 2)Picking exactly one record from each imp fundamental region 3)Assign appropriate values to additional columns in the sample records.

Step 1 : Number of fundamental regions (r) is induced by the Workload W. Case 1 : (r k) ____________________________ Case 1: (r<=k) Step 2 : Pick one sample from each fundamental region (10,40,60,80) Step 3: Column RegionCount and AggSum are updated Region Count = { 3,2,2,2,) AggSum = {60,90,130,170) We can ans COUNT, SUM and AVG queries.

Case 2 :- (r>k) Step 2: 1) Sort all regions by importance Importance = f j *n j 2 fj = sum of weights of all the queries in W that select the region j. nj = no. of records in the region j. (fj) = measures the weights of the queries that are affected by Rj (nj 2 ) = measures the effect on the error by not including Rj. 2) Pick the top k Step 3 Assign the values to additional coulmns (Regioncount, AggSum) Not same as pervious (k rec, 2k unknows ({RC1, … RCk} & {AS1,,ASk} ) MSE(W) = quadratic eq. on Differentiation we get Linear eq. Disadvantage : If queries not identical unpredictable errors.

Lifting Workload Q an Q’ => similar when records selected by them significantly overlap R = Relation Q = Query RQ = rec selected by the query p {q} (R’) => denotes the probability of occurrence of any query that selects the set of records R’.

δ (½ ≤ δ ≤1) and γ (0 ≤ γ ≤ ½) Define the degree to which W influences Q’s distribution. δ Probability that an incoming query will select this record γ For any record inside R Q For any record outside R Q n 1, n 2, n 3, and n 4 are the counts of records in the regions. n 2 or n 4 large (large overlap), P {Q} (R’) is high n 1 or n 3 large (small overlap), P {Q} (R’) is low

δ = inside RQ &selected by R’ γ = outside RQ &selected by R’ δ => 1 γ => 0 RQ and R’ identical δ => 1 γ => ½ R’ super set of RQ δ => ½ γ => 0 R’ subset of RQ δ => ½ γ => ½ R’ is unrestricted

STRATIFIED SAMPLING PROBLEM : SAMP Input : R, pw, k (pw probability distribution fun specified by W) Output : a sample of k records such that MSE(pw) is minimized

Solution STRAT for single-table selection queries with Aggregation 3 Steps : 1.Stratification (a) How many strata to partition R into ? (b) How many records in each strata? 2. Allocation  Determine the number of samples required across each strata 3.Sampling

Count Aggregation 1)Stratification : No of statum = No. of fundamental regions (Lemma1) 2) Allocation : We want to minimize the error over queries in p w. k 1, … k r are unknown variables such that Σk j = k. From Equation on an earlier slide, MSE(p W ) can be expressed as a weighted sum of the MSE of each query in the workload: Lemma 2: MSE(p W ) = Σ i w i MSE(p {Q} )

Lemma 3 : For a COUNT query Q in W, let ApproxMSE(p {Q} ) = Then Expected squared error in estimating the count (R Q union R j ) Expected squared error in estimation the count of (R Q union R/R Q ) Expected relative squared error in estimating count of R Q

Since we have an (approximate) formula for MSE(p {Q} ), we can express MSE(p w ) as a function of the k j ’s variables. Corollary 1 : MSE(p w ) = Σ j (α j / k j ), where each α j is a function of n 1,…,n r, δ, and γ. α j captures the “importance” of a region; it is positively correlated with n j as well as the frequency of queries in the workload that access R j. Now we can minimize MSE(p w ). Lemma 4: Σ j (α j / k j ) is minimized subject to Σ j k j = k if k j = k * ( sqrt(α j ) / Σ i sqrt(α i ) ) This provides a closed-form and computationally inexpensive solution to the allocation problem since α j depends only on δ, γ and the number of tuples in each fundamental region.

Solution For Sum Aggregate 1) Stratification -Can not use same stratification as in Count. -Use Bucketing Technique Fundamental regions are with large variance are divided into finer regions with significantly lower internal variance. Each finer region treated as a strata.

2) Allocation -Like COUNT, we express an optimization problem with h*r unknowns k 1,…, k h*r. -Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p {Q} ). -Let y j (Y j ) be the average (sum) of the aggregate column values of all records in region R j. -Since the variance = small, so approximate each value in the region to y j.

Thus to express MSE(p {Q} ) as a function of the k j ’s for a SUM query Q in W: As with COUNT, MSE(p W ) for SUM is functionally of the form Σ j (α j / k j ), and α j depends on the same parameters n 1, …n h*r, δ, and γ (Corollary 1).

Pragmatic Issues -Identifying Fundamental Regions -Handling Large Number of Fundamental Regions -Obtaining Integer Solutions -Obtaining an Unbiased Estimator

Extensions for more general Workloads Group Queries : - -Q partitions R into g groups -Lifting model : - Replace Q with g separate selection queries -Tagging step :- append (c = column id used in Group by. V = value of Group by in record t.) Join Query :- Star queries contains -One source relation and a set of dimension relations connected via foreign key joins. -Group by and selection on source and dimension relation -Aggregation over columns of the source relation. -Approach 1 - identify samples only over source relation. ( source relation is large) -Approach 2 - Identify source relation and precompute its join with all dimension relations.

Experimental Results : FIXED – solution for FIXEDSAMP, fixed workload, identical queries STRAT – solution for SAMP, workloads with single-table selection queries with aggregation PREVIOUS WORK USAMP – uniform random sampling WSAMP – weighted sampling OTLIDX – outlier indexing combined with weighted sampling CONG – Congressional sampling

Conclusion: The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.

Similar presentations

Presentation on theme: "Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.

Similar presentations

Presentation on theme: "Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes."— Presentation transcript:

Similar presentations

About project

Feedback