PODS 2000, 5/17/00, AT&T Labs. Selectivity Estimation for Boolean Queries. Zhiyuan Chen (Speaker), Flip Korn, Nick Koudas, S. Muthukrishnan.

Presentation transcript:

Motivation (1)
- Boolean queries on substring predicates are ubiquitous:
  - Information retrieval.
  - Bibliographic search.
  - Web searching: 40 million per day at AltaVista.
  - RDB queries.
- Example:
  Document 1. Peanut butter lover’s club...
  Document 2. …peanut stock...
  Document 3. …butter, the natural choice…
  …
  How many documents contain the substring peanut but not butter?

Motivation (2)
- Use estimates for:
  - Query optimization.
    - Best filtering order.
  - Interactive query refinement.
    - It is hard to write a query that has answers.
    - The ranking approach does not always work and is expensive.
    - Estimate -> refine query -> ... -> exact answer.
- Computing exact answers is expensive!
  - Either super-linear space or linear time.

Outline
- Problem definition
- Related work
- Our approach
- Experiments
- Conclusions

Problem Definition
- A substring predicate σ(s) is true iff string s contains σ as a substring.
- Boolean queries: substring predicates combined with AND, OR, NOT.
- For a string set S and a Boolean query q, the selectivity P(q) is the fraction of strings in S that satisfy q.
- Example:
  Document 1. Peanut butter lover’s club...
  Document 2. …peanut stock...
  …
  How many documents contain the substring peanut but not butter?
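
As a rough illustration of the definition above, the following Python sketch computes P(q) exactly by scanning every string, which is the linear-time baseline that estimation tries to avoid. The names `contains` and `selectivity` and the case-insensitive matching are assumptions made for this example, not part of the slides; the three strings are the visible fragments of the slide's sample documents.

```python
from typing import Callable, Iterable

Predicate = Callable[[str], bool]


def contains(sub: str) -> Predicate:
    """Substring predicate: true iff the string contains `sub`.

    Case-insensitive matching is an assumption of this sketch.
    """
    needle = sub.lower()
    return lambda s: needle in s.lower()


def selectivity(strings: Iterable[str], query: Predicate) -> float:
    """Fraction of strings that satisfy the Boolean query (exact, linear scan)."""
    items = list(strings)
    if not items:
        return 0.0
    return sum(1 for s in items if query(s)) / len(items)


# Visible fragments of the sample documents on the slide.
docs = [
    "Peanut butter lover's club",
    "peanut stock",
    "butter, the natural choice",
]

peanut = contains("peanut")
butter = contains("butter")


def peanut_not_butter(s: str) -> bool:
    # Boolean query: peanut AND NOT butter
    return peanut(s) and not butter(s)


print(selectivity(docs, peanut_not_butter))  # 1 of the 3 sample strings qualifies
```

Boolean queries are ordinary Python functions here, so AND, OR and NOT map directly onto `and`, `or` and `not`.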

Related Work (1)
- Histograms: not suitable for strings.
  - The selectivity of adjacent substrings often differs a lot!
- End-biased histograms (IP95):
  - The structure of substring dependence is not used.
  - If σ is pruned, then for any α, Count(σα) = Count(ασ) = Default ????

Related Work (2)
- Existing work on substring queries:
  - Conjunction-only queries: KVI96, WVI97, JNS99, JKNS99.
  - Preprocessing: build a compact data structure, the Pruned Suffix Tree.
  - Correlation between predicates is explicitly stored.
    - Otherwise: independence assumption on substring predicates.
  - Pruned case:
    - Parse into subqueries and estimate each subquery.
    - Combine the estimates with a probabilistic formula:
      - Independence assumption.
      - Maximal overlap (conditioning on overlaps).

Use Previous Approach?
- NO! It needs exponential (2^(2^m)) space to store the correlation between substring predicates (m is the number of suffixes).
- Not store the correlation? NO! Correlation is important!
  - Independence assumption: P(peanut ∧ butter) = P(peanut) * P(butter) = 2/100 * 2/100 = 0.04%.
  - That is 25 times smaller than the true count!
- Example:
  Document 1. Peanut butter lover’s club...
  Document 2. …peanut stock…
  Document 3. …butter, the natural choice…
  …
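
The arithmetic behind the "25 times" figure can be checked with a few lines of Python. This is only a sketch: the collection size of 100 is inferred from the 2/100 fractions on the slide, and the true count of 1 assumes that Document 1 is the only string containing both substrings.

```python
from fractions import Fraction

n_docs = 100                      # inferred from the 2/100 fractions on the slide
p_peanut = Fraction(2, n_docs)    # 2 strings contain "peanut"
p_butter = Fraction(2, n_docs)    # 2 strings contain "butter"
p_true = Fraction(1, n_docs)      # assumption: only Document 1 contains both

# Independence assumption: multiply the individual selectivities.
p_independent = p_peanut * p_butter

print(float(p_independent))       # 0.0004, i.e. 0.04%
print(p_true / p_independent)     # 25
```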

Set-Oriented Approach - Store Correlation Implicitly
- Base set: the set of IDs of strings that contain the substring.
  - peanut: {1,2}
  - butter: {1,3}
- BaseSet(peanut ∧ butter) = BaseSet(peanut) ∩ BaseSet(butter) = {1}
- Base sets can be huge in the worst case!!! O(number of strings).
- Example:
  Document 1. Peanut butter lover’s club...
  Document 2. …peanut stock...
  Document 3. …butter, the natural choice…
  …
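
A minimal sketch of the base-set idea, assuming the three sample documents and a hypothetical helper `build_base_sets` (the name and the case-insensitive matching are not from the slides). Boolean operators on predicates become set operations on the ID sets, so no correlation has to be stored explicitly.

```python
from typing import Dict, List, Set


def build_base_sets(docs: Dict[int, str], substrings: List[str]) -> Dict[str, Set[int]]:
    """Map each substring to the set of IDs of the strings that contain it."""
    base: Dict[str, Set[int]] = {sub: set() for sub in substrings}
    for doc_id, text in docs.items():
        lowered = text.lower()  # case-insensitive matching is an assumption
        for sub in substrings:
            if sub in lowered:
                base[sub].add(doc_id)
    return base


docs = {
    1: "Peanut butter lover's club",
    2: "peanut stock",
    3: "butter, the natural choice",
}
base = build_base_sets(docs, ["peanut", "butter"])

both = base["peanut"] & base["butter"]         # peanut AND butter
either = base["peanut"] | base["butter"]       # peanut OR butter
peanut_only = base["peanut"] - base["butter"]  # peanut AND NOT butter

print(base, both, either, peanut_only)
```

Each base set can hold up to one ID per string, which is the worst-case size the slide warns about.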

Set-Hashing Approach
- A Monte Carlo technique (Cohen94, Broder98).
- Given two sets A and B:
  - Generate a fixed-length signature for each set.
  - Estimate |A ∩ B| by manipulating the signatures.
- Use set inclusion-exclusion for unions and complements.
- [Figure: Venn diagram of the sets A and B]

Signature Generation
- Universe = {1,2,3,4,5}, A = {1,2,3}, B = {5,2,3}.
- Generate signatures of length 3:
  - Randomly permute the universe 3 times.
  - For each permutation, pick the first element that is in the set (for A, the first element in A = {1,2,3}).
- [Figure: the three permutations and the resulting signatures of A and B]
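
The signature construction can be sketched in Python as follows. The function names and the fixed seed are assumptions of this example; the permutations drawn here are random and will generally differ from the ones on the slide, so no particular signature values are claimed.

```python
import random
from typing import List, Sequence, Set


def random_permutations(universe: Sequence[int], k: int, seed: int) -> List[List[int]]:
    """Draw k independent random permutations of the universe."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permutation_signature(subset: Set[int], perms: List[List[int]]) -> List[int]:
    """For each permutation, keep the first element (in permuted order) that lies in `subset`.

    Assumes `subset` is a non-empty subset of the permuted universe.
    """
    return [next(x for x in perm if x in subset) for perm in perms]


universe = [1, 2, 3, 4, 5]
A = {1, 2, 3}
B = {5, 2, 3}

# The same permutations must be used for every set, otherwise signatures are not comparable.
perms = random_permutations(universe, k=3, seed=0)
sig_a = permutation_signature(A, perms)
sig_b = permutation_signature(B, perms)
print(perms, sig_a, sig_b)
```

Two signature components agree exactly when the first element of A ∪ B in that permutation lies in A ∩ B, which is why the fraction of matches carries information about the overlap.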

Reconstruction
- Definition: r = (# of pair-wise matches of S_A and S_B) / |S_A|.
- Theorem: |A ∩ B| = r / (1 + r) * (|A| + |B|).
- Example: A’s signature S_A = 2 3 2 and B’s signature S_B agree in 2 of 3 positions, so r = 2/3 and |A ∩ B| = (2/3) / (1 + 2/3) * (3 + 3) = 2.4.
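
The formula follows from two facts: the match ratio r estimates |A ∩ B| / |A ∪ B|, and |A ∪ B| = |A| + |B| - |A ∩ B|. Solving for |A ∩ B| gives r / (1 + r) * (|A| + |B|). A minimal Python sketch of the computation is below; the function names are illustrative, and the values chosen for S_B are hypothetical (only the fact that 2 of 3 components match is taken from the slide).

```python
from typing import Sequence


def match_ratio(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of signature positions in which the two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def estimate_intersection(r: float, size_a: int, size_b: int) -> float:
    """Estimate |A n B| from the match ratio r and the two set sizes."""
    return r / (1 + r) * (size_a + size_b)


sig_a = [2, 3, 2]
sig_b = [2, 3, 5]  # hypothetical values; 2 of 3 positions match, as on the slide

r = match_ratio(sig_a, sig_b)
estimate = estimate_intersection(r, size_a=3, size_b=3)
print(r, estimate)  # r is 2/3; the estimate is about 2.4, while the true |A n B| is 2
```

With only three signature components the estimate is coarse; longer signatures reduce the variance at the cost of more space.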

Implementation Issues
- Approximate permutations:
  - Use a set of independent hash functions.
  - Pick the minimal hash images as signature components: Sig(A) = min{h(x) | x in A}.
- Signature of unions, e.g. Sig(peanut ∨ butter):
  - Pair-wise min of Sig(p) and Sig(b).
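
A small sketch of both points, under stated assumptions: the hash family h(x) = (a*x + b) mod p with random a and b is a common stand-in for random permutations and is chosen here for illustration only; the slides do not say which hash functions were used. The helper names are likewise hypothetical.

```python
import random
from typing import Callable, Iterable, List

PRIME = 2_147_483_647  # 2^31 - 1; modulus for the illustrative hash family


def make_hashes(k: int, seed: int) -> List[Callable[[int], int]]:
    """Create k hash functions h(x) = (a*x + b) mod PRIME with random a, b (an assumption)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def signature(ids: Iterable[int], hashes: List[Callable[[int], int]]) -> List[int]:
    """Sig(A): for each hash function, the minimal hash image over the set."""
    items = list(ids)
    return [min(h(x) for x in items) for h in hashes]


def union_signature(sig_p: List[int], sig_b: List[int]) -> List[int]:
    """Signature of a union: pair-wise min of the two signatures."""
    return [min(p, b) for p, b in zip(sig_p, sig_b)]


hashes = make_hashes(k=8, seed=1)
base_peanut = {1, 2}
base_butter = {1, 3}

sig_p = signature(base_peanut, hashes)
sig_b = signature(base_butter, hashes)
sig_union = union_signature(sig_p, sig_b)

# The pair-wise min equals the signature computed directly on the union.
print(sig_union == signature(base_peanut | base_butter, hashes))
```

Because the minimum over a union is the minimum of the per-set minima, union signatures never require access to the base sets themselves.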

Algorithm Outline - No Pruning
- With negations:
  |(peanut ∨ butter) ∧ ¬sandwich|
  = |(p ∧ ¬s) ∨ (b ∧ ¬s)|    (convert to DNF)
  = |p ∧ ¬s| + |b ∧ ¬s| - |p ∧ b ∧ ¬s|    (eliminate disjunction by set inclusion-exclusion)
  = |p| - |p ∧ s| + |b| - |b ∧ s| - |p ∧ b| + |p ∧ b ∧ s|    (eliminate negations)
  - Comment: works fine with short queries.
- Without negations:
  - Convert to CNF, e.g. (peanut ∨ butter) ∧ sandwich.
  - Estimate using Sig(peanut ∨ butter) and Sig(sandwich).
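
The expansion for queries with negations can be sketched as two nested inclusion-exclusion sums that only ever ask for counts of positive conjunctions. In the approach described on the slides those counts would be estimated from signatures; to keep this sketch checkable it uses exact base sets instead. The function names, the term encoding, and the base set for sandwich are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

# A DNF term is (positive predicates, negated predicates).
Term = Tuple[FrozenSet[str], FrozenSet[str]]
ConjCount = Callable[[FrozenSet[str]], int]


def count_conjunction(pos: FrozenSet[str], neg: FrozenSet[str], conj_count: ConjCount) -> int:
    """|pos AND NOT n1 AND NOT n2 ...| as a signed sum of positive conjunction counts."""
    total = 0
    negated = sorted(neg)
    for k in range(len(negated) + 1):
        for chosen in combinations(negated, k):
            total += (-1) ** k * conj_count(pos | frozenset(chosen))
    return total


def count_dnf(terms: List[Term], conj_count: ConjCount) -> int:
    """|t1 OR t2 OR ...| by inclusion-exclusion over the DNF terms."""
    total = 0
    for k in range(1, len(terms) + 1):
        for group in combinations(terms, k):
            pos = frozenset().union(*(t[0] for t in group))
            neg = frozenset().union(*(t[1] for t in group))
            total += (-1) ** (k + 1) * count_conjunction(pos, neg, conj_count)
    return total


# Base sets for p (peanut) and b (butter) follow the slides; s (sandwich) is hypothetical.
base = {"p": {1, 2}, "b": {1, 3}, "s": {1, 4}}


def exact_conj_count(preds: FrozenSet[str]) -> int:
    # Exact stand-in for a signature-based estimate; assumes `preds` is non-empty.
    return len(set.intersection(*(base[name] for name in preds)))


# (p OR b) AND NOT s, written in DNF as (p AND NOT s) OR (b AND NOT s).
terms = [
    (frozenset({"p"}), frozenset({"s"})),
    (frozenset({"b"}), frozenset({"s"})),
]
print(count_dnf(terms, exact_conj_count))  # matches len((base["p"] | base["b"]) - base["s"])
```

The number of conjunction counts grows exponentially with the number of terms and negations, which is consistent with the slide's remark that the method works fine for short queries.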

Pruned Suffix Tree Case
- Parse the query into subqueries that only have predicates in the suffix tree.
  - Maximal overlap parsing, e.g. abc is parsed into ab and bc.
- Use signatures to estimate each subquery.
- Combine the estimates using a probabilistic formula (conditioning on overlap).
- E.g.
  P(¬(abc ∧ 12) ∧ 23)
  = P(23) - P(abc ∧ 12 ∧ 23)
  = P(23) - P(ab ∧ 12 ∧ 23) * P(bc ∧ 12 ∧ 23 | ab ∧ 12 ∧ 23)
  ≈ P(23) - P(ab ∧ 12 ∧ 23) * P(bc ∧ 12 ∧ 23 | b ∧ 12 ∧ 23)
  = P(23) - P(ab ∧ 12 ∧ 23) * P(bc ∧ 12 ∧ 23) / P(b ∧ 12 ∧ 23)
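
The final step of the example combines three subquery estimates: it subtracts P(ab AND 12 AND 23) * P(bc AND 12 AND 23) / P(b AND 12 AND 23) from P(23). A minimal Python sketch of that combination is below. The function name and all four probabilities are hypothetical values picked for illustration; in the slides' setting each would come from a signature-based estimate.

```python
def maximal_overlap_estimate(p_left: float, p_right: float, p_overlap: float) -> float:
    """Combine two overlapping pieces by conditioning on their overlap.

    Returns p_left * p_right / p_overlap, i.e. P(left) * P(right | overlap),
    which is valid as written because the right piece implies the overlap.
    """
    if p_overlap == 0.0:
        # Nothing contains the overlap, so nothing contains the longer string either.
        return 0.0
    return p_left * p_right / p_overlap


# Hypothetical subquery selectivities (not from the slides).
p_23 = 0.10        # P(23)
p_ab_ctx = 0.04    # P(ab AND 12 AND 23)
p_bc_ctx = 0.03    # P(bc AND 12 AND 23)
p_b_ctx = 0.06     # P(b AND 12 AND 23)

p_abc_ctx = maximal_overlap_estimate(p_ab_ctx, p_bc_ctx, p_b_ctx)
estimate = p_23 - p_abc_ctx
print(p_abc_ctx, estimate)  # roughly 0.02 and 0.08 for these made-up inputs
```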

Complexity
- Theorem:
  - Preprocessing (building the tree and signatures): O(signature length * database size) time and space.
  - Online estimate: O(2^O(L)), where L is the query length.
- Online time depends only on the query length.
- L is small in real life.
  - Below 1 millisecond in experiments.

Experiments - Setup
- Data set: real AT&T data (service description).
  - 130 K strings, 2.5 MB.
- Queries:
  - Templates:
    - T1: (A  B)  (C  D)
    - T2: (A  B)  (C  D)  (E  F)  (G  H)
  - With a certain probability of negation.
  - Positive and negative queries.
- Compared with the independence assumption.
- Run on an Intel PC, 350 MHz, 128 MB RAM.
  - 1 minute preprocessing, < 1 millisecond estimate time.

PST - Positive Queries
[Figure: average absolute relative error for templates T1 and T2; probability of negation = 5%]

PST - Negative Queries
[Figure: average root-mean-square error (count) for templates T1 and T2; probability of negation = 5%]

Conclusions
- Contributions:
  - A novel problem.
  - A novel approach: store correlation implicitly and generate it as needed by set hashing.
  - Far superior to the independence assumption.
    - 1.0% space, < 1 ms.
    - 4 times more accurate for positive queries, many orders of magnitude for negative queries.
- Ongoing and future work:
  - Twig estimation for XML documents.
  - Regular expressions, position constraints.