Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Internet Management Research Dept. Bell Labs, Lucent

Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Internet Management Research Dept. Bell Labs, Lucent [ Joint work with Amit Kumar (IIT–Delhi) ]

Outline
Preliminaries & Motivation
– Approximate query processing
– Haar wavelet decomposition, conventional wavelet synopses
– The problem
Earlier Approach: Probabilistic Wavelet Synopses [Garofalakis & Gibbons, SIGMOD’02]
– Randomized selection and rounding
Our Approach: Efficient Deterministic Thresholding Schemes
– Deterministic wavelet synopses optimized for maximum relative/absolute error
Extensions to Multi-dimensional Haar Wavelets
– Efficient polynomial-time approximation schemes
Conclusions & Future Directions

Approximate Query Processing
Exact answers NOT always required
– DSS applications usually exploratory: early feedback to help identify “interesting” regions
– Aggregate queries: precision to “last decimal” not needed, e.g., “What percentage of the US sales are in NJ?”
Construct effective data synopses
– In a Decision Support System (DSS), an SQL query over GB/TB of data yields an exact answer but long response times; a “transformed” query over compact KB/MB data synopses yields an approximate answer FAST!

Haar Wavelet Decomposition
Wavelets: mathematical tool for hierarchical decomposition of functions/signals
Haar wavelets: simplest wavelet basis, easy to understand and implement
– Recursive pairwise averaging and differencing at different resolutions

Resolution  Averages                      Detail Coefficients
3           D = [2, 2, 0, 2, 3, 5, 4, 4]
2           [2, 1, 4, 4]                  [0, -1, -1, 0]
1           [1.5, 4]                      [0.5, 0]
0           [2.75]                        [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
– Construction extends naturally to multiple dimensions
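The averaging/differencing recursion above is easy to make concrete. A minimal Python sketch (the function name is ours, not from the talk) that reproduces the decomposition of D = [2, 2, 0, 2, 3, 5, 4, 4]:

```python
def haar_decompose(data):
    """One level at a time: replace the signal with pairwise averages and
    record the pairwise half-differences as detail coefficients."""
    coeffs = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = details + coeffs   # coarser-level details are prepended
        data = averages
    return data + coeffs            # [overall average, coarse ... fine details]
```

Running it on the slide's example, haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]) returns [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], matching the table above.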

Haar Wavelet Coefficients
Hierarchical decomposition structure (a.k.a. Error Tree)
– Conceptual tool to “visualize” coefficient supports & data reconstruction
Reconstruct data values d(i)
– d(i) = sum of (+/-1) * (coefficient on path)
Range sum calculation d(l:h)
– d(l:h) = simple linear combination of coefficients on paths to l, h
– Only O(logN) terms
Examples over the original data D = [2, 2, 0, 2, 3, 5, 4, 4]:
– d(4) = 3 = 2.75 - (-1.25) + 0 + (-1)
– d(0:3) = 6 = 4*2.75 + 4*(-1.25)
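The signed-path reconstruction rule can be sketched as follows, indexing the error tree with c0 = overall average, c1 covering the whole array, and the children of c_j at positions 2j and 2j+1. This is our illustration, not code from the talk:

```python
def reconstruct_value(coeffs, i, n):
    """Reconstruct d(i) as the overall average plus the signed sum of the
    detail coefficients on the root-to-leaf path: + when the leaf lies in
    a coefficient's left half, - when it lies in the right half."""
    value = coeffs[0]           # c0: overall average
    j, lo, hi = 1, 0, n - 1     # c1 covers the whole array [0, n-1]
    while j < n:
        mid = (lo + hi) // 2
        if i <= mid:
            value += coeffs[j]  # left half: contributes with sign +
            hi, j = mid, 2 * j
        else:
            value -= coeffs[j]  # right half: contributes with sign -
            lo, j = mid + 1, 2 * j + 1
    return value
```

For the running example, reconstruct_value([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 4, 8) gives 3.0, i.e., the d(4) = 2.75 - (-1.25) + 0 + (-1) path above; the while loop touches O(logN) coefficients, which is why range sums also need only O(logN) terms.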

Wavelet Data Synopses
Compute Haar wavelet decomposition of D
Coefficient thresholding: only B << |D| coefficients can be kept
– B is determined by the available synopsis space
Approximate query engine can do all its processing over such compact coefficient synopses (joins, aggregates, selections, etc.)
– Matias, Vitter, Wang [SIGMOD’98]; Vitter, Wang [SIGMOD’99]; Chakrabarti, Garofalakis, Rastogi, Shim [VLDB’00]
Conventional thresholding: take the B largest coefficients in absolute normalized value
– Normalized Haar basis: divide coefficients at resolution j by sqrt(2^j)
– All other coefficients are ignored (assumed to be zero)
– Provably optimal in terms of the overall Sum-Squared (L2) Error
Unfortunately, this methodology gives no meaningful approximation-quality guarantees for
– Individual reconstructed data values
– Individual range-sum query results
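Conventional L2 thresholding can be sketched as below. We weight each coefficient by the square root of its support size, which (up to a constant factor) ranks coefficients the same way as normalizing by resolution level; the function name and conventions are ours, applied to the un-normalized averaging/differencing coefficients from the earlier slide:

```python
import math

def threshold_l2(coeffs, B):
    """Keep the B coefficients that are largest in absolute normalized
    value; all others are implicitly zero. Optimal for sum-squared error."""
    n = len(coeffs)

    def normalized(j):
        # L2 weight of coefficient c_j = sqrt(size of its support region);
        # c0 (the overall average) is weighted like c1 (support = n).
        level = j.bit_length() - 1 if j >= 1 else 0
        return abs(coeffs[j]) * math.sqrt(n >> level)

    kept = set(sorted(range(n), key=normalized, reverse=True)[:B])
    return [c if j in kept else 0.0 for j, c in enumerate(coeffs)]
```

On the running example with B = 4, this keeps c0, c1, c5, c6 and drops the 0.5 coefficient, whose normalized value is smaller than the finer-level -1 coefficients.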

Problems with Conventional Synopses
An example data vector and wavelet synopsis (|D|=16, B=8 largest coefficients retained): some reconstructed values are always accurate, while others show over 2,000% relative error; e.g., estimate = 195 against actual values d(0:2)=285 and d(3:5)=93!
Large variation in answer quality
– Within the same data set, when the synopsis is large, when data values are about the same, when actual answers are about the same
– Heavily-biased approximate answers!
Root causes
– Thresholding for the aggregate L2 error metric
– Independent, greedy thresholding (leads to large regions without any coefficient!)
– Heavy bias from dropping coefficients without compensating for the loss

Solution: Optimize for Maximum-Error Metrics
Key metric for effective approximate answers: relative error with sanity bound
– Minimize the maximum relative error in the data reconstruction: minimize max_i |d(i) - d_hat(i)| / max(|d(i)|, s)
– Sanity bound “s” avoids domination by small data values
This provides tight error guarantees for all reconstructed data values
Another option: minimize the maximum absolute error, max_i |d(i) - d_hat(i)|
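The metric itself is a one-liner; a sketch with our naming (d_hat values are the reconstructions from the retained coefficients):

```python
def max_relative_error(data, approx, s):
    """Maximum relative error with sanity bound s:
    max_i |d(i) - d_hat(i)| / max(|d(i)|, s).
    The bound s keeps tiny data values from blowing up the metric."""
    return max(abs(d - a) / max(abs(d), s)
               for d, a in zip(data, approx))
```

For example, with data [100, 1] approximated as [90, 3], a sanity bound of s = 10 caps the denominator for the small value and gives 0.2, whereas s = 1 lets the error on the small value dominate and gives 2.0.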

Earlier Approach: Probabilistic Wavelet Synopses [GG, SIGMOD’02]
Randomized selection: determine y_i = the probability of retaining coefficient c_i
– y_i = “fractional space” allotted to coefficient c_i (the y_i sum to at most B)
– Flip biased coins to select coefficients for the synopsis
Probabilistically control maximum relative error by minimizing the maximum Normalized Standard Error (NSE)
– M[j,b] = optimal value of the maximum NSE for the subtree rooted at coefficient c_j for a space allotment of b
Quantize choices for y to {1/q, 2/q, ..., 1}
– q = input integer parameter; running time and memory grow with q

But, still…
Potential concerns for probabilistic wavelet synopses
– Pitfalls of randomized techniques: possibility of a “bad” sequence of coin flips resulting in a poor synopsis
– Dependence on a quantization parameter/knob q: the effect on optimality of the final solution is not entirely clear
“Indirect” solution: schemes in [GG, SIGMOD’02] try to probabilistically control maximum relative error through appropriate probabilistic metrics
– E.g., minimizing the maximum NSE
Natural question
– Can we design an efficient deterministic thresholding scheme for minimizing non-L2 error metrics, such as maximum relative error?
– Completely avoid the pitfalls of randomization
– Guarantee an error-optimal synopsis for a given space budget B

Do the GG Ideas Apply?
Unfortunately, the DP formulations in [GG, SIGMOD’02] rely on
– The ability to assign fractional storage y_i to each coefficient c_i
– Optimization metrics (maximum NSE) with monotonic/additive structure over the error tree
Principle of optimality: with M[j,b] = optimal NSE for subtree T(j) with space b, we can compute M[j,*] from M[2j,*] and M[2j+1,*]
When directly optimizing for maximum relative (or absolute) error with integral {0,1} storage, the principle of optimality fails!
– Assume that M[j,b] = optimal value of the maximum error with at most b coefficients selected in T(j)
– The optimal solution at j may not comprise optimal solutions for its children
– Remember that d_hat(i) = sum of (+/-) * SelectedCoefficient, where coefficient values can be positive or negative
Schemes in [GG, SIGMOD’02] do not apply; BUT, it can be done!!

Our Approach: Deterministic Wavelet Thresholding for Maximum Error
Key idea: a Dynamic-Programming formulation that conditions the optimal solution on the error that “enters” the subtree (through the selection of ancestor nodes)
Let S = subset of proper ancestors of j included in the synopsis
Our DP table: M[j, b, S] = optimal maximum relative (or absolute) error in T(j) with a space budget of b coefficients (chosen in T(j)), assuming subset S of j’s proper ancestors has already been selected for the synopsis
– Clearly, |S| <= min{B-b, logN+1}
– Want to compute M[0, B, ∅]
Basic observation: the depth of the error tree is only logN+1, so we can explore and tabulate all S-subsets for a given node at a space/time cost of only O(N)!

Base Case for DP Recurrence: Leaf (Data) Nodes
Base case in the bottom-up DP computation: leaf (i.e., data) node
– Assume for simplicity that data values are numbered N, …, 2N-1
M[j, b, S] is not defined for b > 0
– Never allocate space to leaves
For b = 0, for each coefficient subset S with |S| <= min{B, logN+1}:
– M[j, 0, S] = |d_j - d_hat(S)| / max(|d_j|, s), where d_hat(S) is the value reconstructed using only the ancestor coefficients in S
– Similarly for absolute error
Again, the time/space complexity per leaf node is only O(N)

DP Recurrence: Internal (Coefficient) Nodes
Two basic cases when examining node/coefficient j for inclusion in the synopsis: (1) Drop j; (2) Keep j
Case (1): Drop coefficient j
– Optimally distribute the space b between j’s two child subtrees; the minimum possible maximum relative error in T(j) is
  M_drop[j, b, S] = min over 0 <= b' <= b of max{ M[2j, b', S], M[2j+1, b-b', S] }
– Note that the RHS of the recurrence is well-defined: ancestors of j are obviously ancestors of 2j and 2j+1

DP Recurrence: Internal (Coefficient) Nodes (cont.)
Case (2): Keep coefficient j
– Take 1 unit of space for coefficient j, and optimally distribute the remaining space; the minimum possible maximum relative error in T(j) is
  M_keep[j, b, S] = min over 0 <= b' <= b-1 of max{ M[2j, b', S ∪ {j}], M[2j+1, b-1-b', S ∪ {j}] }
– The selected subsets in the RHS change, since we choose to retain j
– Again, the recurrence RHS is well-defined
Finally, define M[j, b, S] = min{ M_drop[j, b, S], M_keep[j, b, S] }
Overall complexity: low-polynomial time and space in N and B (details in the paper)
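To make the DP's objective concrete, here is a brute-force reference (all names ours) that enumerates every subset of B retained coefficients and keeps the one minimizing maximum absolute error, exactly the quantity the recurrences above optimize for the absolute-error variant. It is exponential and only useful for sanity-checking a DP implementation on tiny inputs:

```python
from itertools import combinations

def haar(data):
    """Un-normalized Haar decomposition (pairwise averaging/differencing)."""
    coeffs = []
    while len(data) > 1:
        coeffs = [(data[i] - data[i + 1]) / 2
                  for i in range(0, len(data), 2)] + coeffs
        data = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return data + coeffs

def inverse_haar(coeffs):
    """Invert the decomposition level by level (coarse to fine)."""
    data, k = [coeffs[0]], 1
    while k < len(coeffs):
        data = [v for a, d in zip(data, coeffs[k:2 * k]) for v in (a + d, a - d)]
        k *= 2
    return data

def best_max_abs_synopsis(data, B):
    """Try every subset of B retained coefficients; return the synopsis with
    the smallest maximum absolute reconstruction error, and that error."""
    coeffs = haar(data)
    best_err, best = float("inf"), None
    for kept in combinations(range(len(coeffs)), B):
        synopsis = [c if j in kept else 0.0 for j, c in enumerate(coeffs)]
        err = max(abs(d - a) for d, a in zip(data, inverse_haar(synopsis)))
        if err < best_err:
            best_err, best = err, synopsis
    return best, best_err
```

On D = [2, 2, 0, 2, 3, 5, 4, 4] with B = 4 the optimal maximum absolute error is 0.5 (achieved by dropping only the 0.5 coefficient), while B = 8 reconstructs the data exactly.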

Outline
Preliminaries & Motivation
– Approximate query processing
– Haar wavelet decomposition, conventional wavelet synopses
– The problem
Earlier Approach: Probabilistic Wavelet Synopses [Garofalakis & Gibbons, SIGMOD’02]
– Randomized selection and rounding
Our Approach: Efficient Deterministic Thresholding Schemes
– Deterministic wavelet synopses optimized for maximum relative/absolute error
Extensions to Multi-dimensional Haar Wavelets
– Efficient polynomial-time approximation schemes
Conclusions & Future Directions

Multi-dimensional Haar Wavelets
Haar decomposition in d dimensions = d-dimensional array of wavelet coefficients
– Coefficient support region = d-dimensional rectangle of cells in the original data array
– The sign of a coefficient’s contribution can vary along the quadrants of its support
Support regions & signs for the 16 nonstandard 2-dimensional Haar coefficients of a 4x4 data array A

Multi-dimensional Haar Error Trees
Conceptual tool for data reconstruction, with a more complex structure than in the 1-dimensional case
– Internal node = set of (up to) 2^d - 1 coefficients (identical support regions, different quadrant signs)
– Each internal node can have (up to) 2^d children (corresponding to the quadrants of the node’s support)
Maintains linearity of reconstruction for data values/range sums
Error-tree structure for the 2-dimensional 4x4 example (data values omitted)

Can we Directly Apply our DP?
Problem: even though the depth is still O(logN), each node now comprises up to 2^d - 1 coefficients, all of which contribute to every child
– Data-value reconstruction involves up to (2^d - 1)*logN + 1 coefficients
– The number of potential ancestor subsets S explodes with dimensionality: up to 2^((2^d - 1)*logN) ancestor subsets per node (already N^3 for dimensionality d=2)
– Space/time requirements of our DP formulation quickly become infeasible (even for d=3,4)
Our solution: epsilon-approximation schemes for multi-d thresholding

Approximate Maximum-Error Thresholding in Multiple Dimensions
Time/space-efficient approximation schemes for deterministic multi-dimensional wavelet thresholding for maximum-error metrics
We propose two different approximation schemes
– Both are based on approximate dynamic programs
– Explore a much smaller number of options while offering epsilon-approximation guarantees for the final solution
Scheme #1: sparse DP formulation that rounds off the possible values for subtree-entering errors to powers of (1+epsilon)
– Additive epsilon-error guarantees for maximum relative/absolute error
Scheme #2: use scaling & rounding of coefficient values to convert a pseudo-polynomial solution into an efficient approximation scheme
– (1+epsilon)-approximation algorithm for maximum absolute error
Details in the paper…

Conclusions & Future Work
Introduced the first efficient schemes for deterministic wavelet thresholding for maximum-error metrics
– Based on a novel DP formulation
– Avoid the pitfalls of earlier probabilistic solutions
– Meaningful error guarantees on individual query answers
Extensions to multi-dimensional Haar wavelets
– Complexity of the exact solution becomes prohibitive
– Efficient polynomial-time approximation schemes based on approximate DPs
Future research directions
– Streaming computation/incremental maintenance of max-error wavelet synopses
– Extend the methodology and max-error guarantees to more complex queries (joins??)
– Hardness of multi-dimensional max-error thresholding?
– Suitability of Haar wavelets, e.g., for relative error? Other bases??

Thank you! Questions?