Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.

Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002

Approximate Query Evaluation  Why?  Handling load – streams coming too fast  Data stream is archived in a off-site data warehouse, expensive access of archived data  Avoid unbounded storage and computation  Ad hoc queries need approximate history  Try to look at the data items only once and in a fixed order

Synopses  Queries may access or aggregate past data  Need bounded-memory history-approximation  Synopsis? – Succinct summary of old stream tuples – Like indexes/materialized-views, but base data is unavailable  Examples – Sliding Windows – Samples – Sketches – Histograms – Wavelet representation

Sketching Techniques  Self-Join Size Estimation  Stream of values from dom(A) = {1,2,…,n}  Let f(i) = frequency of value i  Consider SJ(A) = Σ i  dom(A) f(i) 2, or Gini’s index of homogeneity.  Useful in parallel DB applications, error estimation in query result size estimation and access plan costs.  Equivalent query: Q = COUNT (R |><| A R)

Evaluating SJ(A) = Σ f i 2  To update S, keep a counter f(i) for each value i in the domain D   (|dom(A)|) bits of storage  This is true for any deterministic algorithm  Question – estimating S in sub-linear space (O(log |dom(A)|))  A randomized technique that offers strong probabilistic guarantees on the quality of SJ(A)

Self-Join Size Estimation  AMS Technique (randomized sketches)  Given (f(1),f(2),…,f(n))  Generate a family of 4-wise independent binary random variables {Z i : i = 1,…, |dom(A)|}  Z i = random{-1,1}  P[Z i = +1] = P[Z i = -1] = ½ (E[Z i ] = 0)  By using tools like orthogonal arrays, such families can be constructed online using O(log|dom(A)|) space.

Self-Join Size Estimation  Define X = Σ i  dom(A) f(i)Z i (X incrementally computable)  Theorem Exp[X 2 ] = Σ f(i) 2 – Cross-terms f(i)Z i f(j)Z j have 0 expectation – Square-terms f(i)Z i f(i)Z i = f(i) 2  Hence, X 2 is an unbiased estimator for SJ(A)  Space = log (N + Σ f(i))  Independent samples X k reduce variance

Estimation Quality  How can independent samples X k improve the quality of estimation?  Keep s 1 x s 2 iid instantiations for X k  Use independent random seeds to get a family of Z i ’s for each instance  s 1 reduces variance, s 2 boosts confidence Avg(X 1j 2 ) Avg(X 2j 2 ) Avg(X 3j 2 ) Avg(X 4j 2 ) Avg(X 5j 2 ) Median (sketch) s1s1 s2s2 Atomic Sketch

Sample Run of AMS 36257V = Z 1 = 1 1111 1 Z 2 = X 1 = 5, X 1 2 = 25 X 2 = 14, X 2 2 = 196 Est = 110.5Σv i 2 = 123 46257V = Z 1 = 1 1111 1 Z 2 = X 1 = 6, X 1 2 = 36,X 2 = 12, X 2 2 = 144, Est = 90Σ v i 2 = 130,

Comments on AMS  The self-join size can be computed on-line  Sufficiently small variance (controlled by s 1 and s 2 )  Can estimate SJ(A) in one pass  Probabilistic guarantee: a relative error of at most  with probability at least 1-   Can this method be extended for other queries?

Complex Aggregate Queries  A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries. AGG   SELECT AGG FROM R 1,R 2,…,R r where   AGG: COUNT/SUM/AVERAGE   : conjunction of equality joins (R i.A j = R k.A l )  The error of these estimates is at most ε with probability 1-δ.

 Attribute Constraint of  .  Each attribute of a relation appears at most once in the join condition .  For any attribute R i.A j occuring m > 1 times in , add m-1 new “attributes” to R i  Replace m-1 occurrences in  with the new attributes   = ((R 1.A 1 =R 2.A 1 ) & (R 1.A 1 =R 3.A 1 )) becomes ((R 1.A 1 =R 2.A 1 ) & (R 1.A 2 =R 3.A 1 ))

The COUNT Query  Q COUNT : # of tuples in the cross-product of R 1,…,R r that satisfy the equality constraints in .  SELECT COUNT(*) from R 1,…,R r where   Approach: Express the above query using mathematical operators.  Rename the 2n join attributes in  to A 1, A 2,…, A 2n such that each constraint in  is A j = A n+j for 1  j  n.

Projection of Domains  D = dom(A 1 ) x dom(A 2 ) x…x dom(A 2n )  S k is the set of join attributes appearing in R k  D k = dom(A k1 ) x dom(A k2 ) x…x dom(A k|Sk| ); A k1,…,A k|Sk| are attributes in S k  D k is a projection of D on attributes in S k

Value Assignment I  An assignment I assigns values to (a subset of) join attributes  If I  D, each join attribute is assigned a value I[j] by I.  If I  D k, then I only assigns a value I[j] to attributes j  S k.  Use letter j to refer to attribute A j

Value Assignment I  I[S k ] is the projection of I on attributes in S k.  f k (I) is the number of tuples in R k that match I.

Example  SELECT COUNT (*) FROM R1, R2, R3 where R1.A1 = R2.A1 and R2.A2 = R3.A1  Renaming: SELECT COUNT (*) FROM R1, R2, R3 where A1 = A3 and A2 = A4 – A1 = R1.A1A2 = R2.A2 – A3 = R2.A1 A4 = R3.A1  Note: Joins in the form A j = A n+j  S1 = {1}, S2 = {2,3}, S3 = {4}

Example (cont.)  D = dom(A1) x dom(A2) x dom(A3) x dom (A4)  I 1 =  D; I 1 [3] = 2  D1 = dom(A1); D2 = dom(A2) x dom(A3); D3 = dom(A4)  I 1 [S 1 ] =, I 1 [S 2 ] =, I 1 [S 3 ] =  f 2 ( ) = 5 means 5 tuples from R 2 have A2=3, A3=2

Exact Q COUNT  Q COUNT =Σ I  D,  j:I[j]=I[n+j] Π r k=1 f k (I[S k ]) – Product of the # of tuples in each relation that match a value assignment I – Summed over all I  D that satisfy the equi-join constraints ε

Approximate Q COUNT  Similar to self-join size!  Construct a random variable X that is an unbiased estimator for Q COUNT  Variance can be bounded from above  Use averaging and median-selection tricks to boost the accuracy and confidence of X  Achieve small relative error with high probability

Constructing X  For each pair of join attributes j, n+j in , build random variables {Z j,l : l=1,…,|dom(A j )|},where Z j,l  {-1, +1}  Random variables belonging to families for different attribute pairs are independent  Space:Σ n j=1 O(log|dom(A j )|)

Constructing X (2)  For each relation R k, define atomic sketch X k = Σ I  Dk (f k (I)Π j  Sk Z j,I[j] )  X = unbiased estimator of Q COUNT = Π r k=1 X k  E(X) = Q COUNT  Each X k can be efficiently computed as tuples of R k are streaming in. – Initialize X k to 0 – For each tuple t in the R k stream, addΠ j  Sk Z j,t[j] to X k

Join Graph and Acyclicity  A join graph is an undirected graph – a node for each relation, and, – an edge for each join attribute pair j, n+j  If the join graph is acyclic, the variance of X is bounded from above.  Many join queries are acyclic, like chain joins and star joins.

Probabilistic Guarantees  If Q COUNT is acyclic, multi-join COUNT query over relations R 1,…, R r, then a sketch of size O(log|dom(A j )|) is possible to approximate Q COUNT so that: – the relative error of the estimate is at most ε with probability 1-δ.

Basic Notions of Approximation  For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by absolute relative error: – |Estimated value – Actual value| / Actual value  Open question: for queries involving more than simple aggregation, how should we define approximation?  Consider S |><| B T: (S: {A,B}, T: {B,C}) ABC 1020.5Doctor 810.3Lawyer 310.2Teacher ABC 810.3Lawyer 310.2Teacher Actual Result Approximate Result

Basic Notions of Approximation (2)  Can we accept this kind of approximation? ABC 1020.5Doctor 810.3Lawyer 310.2Teacher Actual ResultApproximate Result ABC 1121.6Doctor 810.3Student 310.2Teacher

Basic Notions of Approximation (3)  Can we provide useful (semantically correct) but stale results? ABC 1020.5Doctor 810.3Lawyer 310.2Teacher Actual Result (at time t) Approximate Result (correct result at time t -  ) ABC 1020.5Doctor 810.3Lawyer

Related Issues  Metric for set-valued queries  Composition of approximate operators  How is it understood/controlled by user?  Integrate into query language  Query planning and interaction with resource allocation  Accuracy-efficiency-storage tradeoff and global metric

Conclusions  To evaluate aggregate queries, O(|dom(A)|) size is needed to obtain exact answers.  True for any deterministic algorithms  We introduce randomized algorithms that uses only O(Σ n j=1 log|dom(A j )|) space.  Only 1 pass is needed for estimation.

Conclusions (2)  Express the target query by mathematical operators  Use an unbiased estimator for the expression  Independent samples reduce variance  A probabilistic guarantee: the relative error of the estimate is at most ε with probability 1-δ.  Question: Can we use this technique to answer queries involving more general selection predicates (e.g., inequality joins)?

References N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments, STOC ’96. A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD ’02.

Thank You!

Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.

Similar presentations

Presentation on theme: "Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.

Similar presentations

Presentation on theme: "Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002."— Presentation transcript:

Similar presentations

About project

Feedback