1 Evaluating Top-K Selection Queries Surajit Chaudhuri Microsoft Research Luis Gravano Columbia University.

Presentation transcript:

1 Evaluating Top-K Selection Queries Surajit Chaudhuri Microsoft Research Luis Gravano Columbia University

2 Motivating Example
Find 4-bedroom houses priced at $350,000
- Exact matches are often too restrictive
- A ranking of the houses closest to the specification is more desirable

3 Motivating Example (cont.)
Find 4-bedroom houses priced at $350,000
- House 1: 5 bedrooms; $400,000; Score=0.9
- House 2: 4 bedrooms; $485,000; Score=0.8
- House 3: 6 bedrooms; $785,000; Score=0.3

4 Top-K Queries over Precise Relational Data
- Support approximate matches with minimal changes to the relational engine
- Initial focus: Selection queries with “equality” conditions

5 Outline
- Definition of top-k queries
- Execution alternatives
- Mapping of top-k queries to selection queries
- Experiments

6 Top-K Selection Queries
- Specify an n-dimensional target point
- Define scoring function
- Specify k
Answer: k objects with the best score for the target point (i.e., the “top k” objects)

7 Specifying Top-K Queries using SQL
Select *
From R
Order [k] By Scoring_Function

8 Scoring Functions Measure Degree of Match
- Assume attributes defined over metric space
- Score on any one attribute is well defined
- How to aggregate scores across attributes?

9 Scoring Functions
- Normalize attribute scores to be in [0,1] range
- Combine scores using popular aggregate functions
  - Min
  - Euclidean
  - Sum, Max, …

10 Some Example Scoring Functions
Let $q = (q_1, \ldots, q_n)$ be the target point and $t = (t_1, \ldots, t_n)$ a tuple:
- $\mathrm{Min}(q, t) = \min\{1 - |q_1 - t_1|, \ldots, 1 - |q_n - t_n|\}$
- $\mathrm{Euclidean}(q, t) = 1 - \sqrt{(q_1 - t_1)^2/n + \cdots + (q_n - t_n)^2/n}$
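
The two scoring functions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `min_score` and `euclidean_score` and the sample points are assumptions, and attribute values are assumed to be already normalized to [0,1].

```python
import math
from typing import Sequence


def min_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Min(q, t): the attribute with the worst match decides the score."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def euclidean_score(q: Sequence[float], t: Sequence[float]) -> float:
    """Euclidean(q, t): one minus the root of the mean squared difference."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)


# Hypothetical normalized target point and tuple.
q = (0.5, 0.5)
t = (0.4, 0.7)
print(min_score(q, t), euclidean_score(q, t))
```

Both functions return 1.0 when the tuple coincides with the target and decrease as the tuple moves away from it.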

11 Executing Top-K Queries
- Known techniques require at least one sequential scan (or a functional index)
  - Evaluate Scoring_Function for each tuple
  - Sort tuples [Carey & Kossmann ‘97; ‘98]
- Question: How to avoid sequential scans?
Exploit implicit selectivity of top-k queries
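
For reference, the scan-based baseline described on this slide might look like the sketch below: every tuple is scored once and only the k best are kept. The names `top_k_scan` and `houses` are hypothetical, and the Min scoring function over normalized attributes is assumed.

```python
import heapq


def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def top_k_scan(tuples, q, k, score=min_score):
    """Sequential scan: score every tuple, keep the k highest-scoring ones."""
    return heapq.nlargest(k, tuples, key=lambda t: score(q, t))


# Hypothetical normalized (bedrooms, price) tuples.
houses = [(0.50, 0.55), (0.40, 0.90), (0.90, 0.10), (0.45, 0.50)]
print(top_k_scan(houses, q=(0.5, 0.5), k=2))
```

The cost of this baseline grows with the size of the relation regardless of k, which is the motivation for mapping the query to a range selection instead.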

12 Mapping a Top-K Query to a Selection Query
- Determine a search score S such that:
  - Expected # of tuples with score > S is k
  - No false dismissals
- Turn the condition that score > S into range selection condition(s)
- Evaluate selection query using existing query processor and access paths

13 Mapping a Top-K Query to a Selection Query
4-bedrooms; $350,000; k=10
- Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)
- Analyze scoring function to determine selection range:
  Bedrooms: [3, 5] and Price: [$250K, $450K]

14 Mapping a Search Score to a Selection Range
For search score $S$, target point $q = (q_1, q_2)$, and scoring function Min:
Selection range:
- $t_1 \in [q_1 - (1.0 - S),\ q_1 + (1.0 - S)]$
- $t_2 \in [q_2 - (1.0 - S),\ q_2 + (1.0 - S)]$
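
As a small illustration of this mapping for the Min scoring function, the sketch below computes the per-attribute ranges and renders them as a range predicate. The helper names (`min_selection_range`, `to_where_clause`), the column names `t1` and `t2`, and the `BETWEEN` rendering are assumptions made for the example, not part of the slides.

```python
def min_selection_range(q, s):
    """Per-attribute ranges holding every tuple whose Min score is at least s."""
    slack = 1.0 - s
    return [(qi - slack, qi + slack) for qi in q]


def to_where_clause(columns, ranges):
    """Render the ranges as a conjunction of range conditions (illustrative only)."""
    return " AND ".join(
        f"{col} BETWEEN {lo:.2f} AND {hi:.2f}"
        for col, (lo, hi) in zip(columns, ranges)
    )


ranges = min_selection_range(q=(0.5, 0.5), s=0.8)
print(to_where_clause(["t1", "t2"], ranges))
```

For Min the box is exact: a tuple lies inside it if and only if every attribute is within 1.0 - S of the target, which is the same condition as its Min score reaching S.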

15 Determining a Search Score
- Monotonicity: Consider tuple t that is no further from target than t’ on any attribute: Score of t should be at least that of t’
- Therefore, Score cannot be high “far away” from target
  - Sphere for Euclidean
  - Box for Min
  …centered at target point
“Tightness” of enclosing range varies with scoring functions

16 The Min Scoring Function

17 The Euclidean Scoring Function

18 Comments on Mapping
- Search score determines efficiency, not correctness
- Issues in efficiency:
  - Avoid retrieving too many tuples
  - Avoid retrieving fewer than k top tuples (restarts)
How to determine good search scores?
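
The trade-off on this slide can be made concrete with a small end-to-end sketch for the Min scoring function. Everything here is an assumption for illustration: `range_select` stands in for the engine's range selection, and the restart policy (fall back to S = 0, i.e. the whole normalized space) is just one simple choice, since the slides do not specify how a restart lowers the search score.

```python
import heapq


def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def range_select(tuples, q, s):
    """Stand-in for a range selection: tuples inside the Min box for score s."""
    slack = 1.0 - s
    return [t for t in tuples if all(abs(qi - ti) <= slack for qi, ti in zip(q, t))]


def top_k_via_selection(tuples, q, k, s):
    """Return (top-k tuples, total tuples retrieved, number of restarts)."""
    retrieved = 0
    restarts = 0
    while True:
        candidates = range_select(tuples, q, s)
        retrieved += len(candidates)
        if len(candidates) >= k or s <= 0.0:
            best = heapq.nlargest(k, candidates, key=lambda t: min_score(q, t))
            return best, retrieved, restarts
        # Assumed restart policy: retry with the loosest possible search score.
        s = 0.0
        restarts += 1


data = [(0.5, 0.55), (0.45, 0.5), (0.9, 0.1), (0.4, 0.9), (0.2, 0.2)]
print(top_k_via_selection(data, q=(0.5, 0.5), k=2, s=0.9))   # search score loose enough
print(top_k_via_selection(data, q=(0.5, 0.5), k=2, s=0.99))  # too tight, forces a restart
```

Both calls return the same two tuples. Only the number of tuples retrieved and the number of restarts differ, which is the sense in which the search score affects efficiency but not correctness.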

19 Determining Search Scores
- Find k points in data
- Compute their score
- Set search score to lowest score
Challenges:
- Determining the initial k points to optimize execution
- Taking original query into account

20 Using Histograms Q

21 Picking K Representative “Tuples”
- Collapse histogram bucket to a single representative point
  - Furthest from Q in bucket (“NoRestarts”)
  - Closest to Q in bucket (“Restarts”)
- Assign bucket frequency to the single representative point
- Include closest representative points until we have k tuples
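
One possible reading of these steps is sketched below. It assumes each histogram bucket is an axis-aligned box given as `(lo, hi, frequency)`, that "closest representative points" means the representatives with the highest score, and that the search score is the score of the last representative needed to cover k tuples. The bucket layout and all names are hypothetical.

```python
def min_score(q, t):
    """Min scoring function over attributes normalized to [0,1]."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))


def closest_point(q, lo, hi):
    """Point of the box [lo, hi] nearest to q: clamp q into the box."""
    return tuple(min(max(qi, l), h) for qi, l, h in zip(q, lo, hi))


def furthest_point(q, lo, hi):
    """Corner of the box [lo, hi] furthest from q on every attribute."""
    return tuple(l if abs(qi - l) >= abs(qi - h) else h for qi, l, h in zip(q, lo, hi))


def search_score(buckets, q, k, strategy="NoRestarts", score=min_score):
    """Collapse buckets to representatives and return the score that covers k tuples."""
    pick = furthest_point if strategy == "NoRestarts" else closest_point
    reps = sorted(
        ((score(q, pick(q, lo, hi)), freq) for lo, hi, freq in buckets),
        reverse=True,
    )
    covered = 0
    s = 0.0
    for s, freq in reps:
        covered += freq
        if covered >= k:
            break
    return s


# Hypothetical 2-dimensional histogram: (lower corner, upper corner, frequency).
buckets = [
    ((0.4, 0.4), (0.6, 0.6), 5),
    ((0.6, 0.4), (1.0, 0.6), 20),
    ((0.0, 0.0), (0.4, 0.4), 50),
]
q = (0.5, 0.5)
print(search_score(buckets, q, k=10, strategy="NoRestarts"))
print(search_score(buckets, q, k=10, strategy="Restarts"))
```

With a monotone scoring function the furthest corner gives the lowest score any tuple in the bucket can have, so the NoRestarts score is the more conservative (lower) of the two.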

22 Using Histograms: “NoRestarts” Q

23 Using Histograms: “Restarts” Q

24 Other Strategies for Determining Search Scores
- Calculate search score for:
  - n = NoRestarts (“pessimistic” extreme)
  - r = Restarts (“optimistic” extreme)
- Use intermediate scores:
  - Inter 1 = (2n + r)/3
  - Inter 2 = (n + 2r)/3
[Figure: score axis from 0 to 1 with the Restarts and NoRestarts scores marked]
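
The two intermediate scores are simple weighted averages of the extremes. A minimal sketch, with a hypothetical function name and example values:

```python
def intermediate_scores(n, r):
    """Inter 1 and Inter 2 from the NoRestarts score n and the Restarts score r."""
    inter1 = (2 * n + r) / 3  # one third of the way from n towards r
    inter2 = (n + 2 * r) / 3  # two thirds of the way from n towards r
    return inter1, inter2


# Hypothetical search scores for the two extremes.
n, r = 0.5, 0.9
print(intermediate_scores(n, r))
```

Inter 1 stays closer to the pessimistic NoRestarts score and Inter 2 closer to the optimistic Restarts score.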

25 Evaluating the Generated Selection Query
- Sequential scan
- Intersection of a set of indexes, followed by data access
- Special case: index-only access

26 Indexes and Statistics
- Indexes
  n-dim (concatenated-key) B-trees
- Statistics
  - MaxDiff as base 1-dim histogram
  - Multidimensional histograms: AVI, Phased, MHist

27 Experimental Evaluation
- Is mapping to selection queries an effective technique?
- Sensitivity of relevant parameters:
  - Scoring functions
  - Data skew and dimensionality
  - Statistics

28 Data Generation
Characterized by Z =
- Generate N tuples by Zipfian distribution $z_1$
- Group tuples by $attr_1$
- For a partition with $attr_1 = a$ with $N_1$ tuples:
  - Generate $N_1$ values $w_1, \ldots, w_{N_1}$ using Zipfian distribution $z_2$
  - Create pairs $(a, w_1), \ldots, (a, w_{N_1})$
- Repeat steps to fill in all attribute values
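
A rough sketch of this generation procedure follows. Several details are assumptions, since the slide does not give them: values are drawn at random from a small integer domain with weight 1/rank^z, later attributes are generated per group of identical prefixes, and the names `zipf_values`, `generate`, `skews`, and `domain_size` are made up for the example.

```python
import random
from collections import Counter


def zipf_values(count, skew, domain_size, rng):
    """Draw `count` values from {0, ..., domain_size - 1}; rank i has weight 1 / i**skew."""
    weights = [1.0 / (rank ** skew) for rank in range(1, domain_size + 1)]
    return rng.choices(range(domain_size), weights=weights, k=count)


def generate(n_tuples, skews, domain_size=10, seed=0):
    """Generate tuples attribute by attribute, one Zipfian skew per attribute."""
    rng = random.Random(seed)
    # First attribute: N values drawn with skew z_1.
    groups = Counter((v,) for v in zipf_values(n_tuples, skews[0], domain_size, rng))
    # Each further attribute: for every group of size N_1, draw N_1 values with the next skew.
    for skew in skews[1:]:
        extended = Counter()
        for prefix, group_size in groups.items():
            for w in zipf_values(group_size, skew, domain_size, rng):
                extended[prefix + (w,)] += 1
        groups = extended
    return [t for t, count in groups.items() for _ in range(count)]


data = generate(n_tuples=1000, skews=(1.0, 2.0))
print(len(data), data[:5])
```

A skew of 0 gives a uniform distribution, and larger skews concentrate the tuples on fewer values.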

29 Metrics for Comparison
- Fraction of data tuples accessed may be compared to:
  - Ideal: k
  - Worst case: size of data set
- % of restarts

30 Exploring Limits
- Intrinsic limitations of range-query approach:
  - Enclose actual top-k tuples in tight n-rectangle
  - Retrieve all tuples in n-rectangle
  Less than 1% of database tuples in n-rectangle (k=10; 100,000 tuples)
- Effect of retrieving tuples with score > S using an n-rectangle

31 Effect of Scoring Functions
- Min has little/no gap between target region and enclosing n-rectangle
  As k increases, fraction of retrieved tuples grows slowest for Min
- Euclidean performs worse
  Less tight n-rectangle

32 Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

33 Effect of Mapping Strategies and Histograms
- Multidimensional histograms aid computation of tight search scores
- NoRestarts dominates at high data skew

34 Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)

35 Restarts v. Data Skew (PHASED histogram of 5KB; n=3)

36 Related Work (1)
- [Fagin ‘96; ‘98]
  - Multimedia attributes with query “subsystem”
  - Multiple index scans
  - Independence assumption
- [Chaudhuri & Gravano ‘96]
  - Multimedia attributes with query “subsystem”
  - Map top-k queries to “selection” queries
  - Independence assumption
  - Limited scoring functions

37 Related Work (2)
- [Carey & Kossmann ‘97; ‘98]
  Optimized sorting phase using k
- Nearest-neighbor literature
- [Donjerkovic & Ramakrishnan ‘99]
  - Probabilistic optimization framework
  - No multidimensional scoring functions
  - Independence assumptions

38 Summary
- Defined mapping of top-k queries to traditional selection queries
  Exploit existing database statistics and query processors
- Studied effect of scoring functions, data skew, statistics on mapping
Full experimental analysis forthcoming!

39 Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)

40 Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)

41 Restarts v. n (PHASED histogram of 5KB; Z21)

42 Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)

43 Restarts v. k (PHASED histogram of 5KB; Z21; n=3)

44 Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

45 Tuples Retrieved v. Histogram Size (Census Database; PHASED)

46 Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

47 The Sum Scoring Function

48 The Max Scoring Function