1 Evaluating Top-K Selection Queries Surajit Chaudhuri Microsoft Research Luis Gravano Columbia University.

2 Motivating Example
Find 4-bedroom houses priced at \$350,000
- Exact matches are often too restrictive
- A ranking of the houses closest to the specification is more desirable

3 Motivating Example (cont.)
Find 4-bedroom houses priced at \$350,000
- House 1: 5 bedrooms; \$400,000; Score=0.9
- House 2: 4 bedrooms; \$485,000; Score=0.8
- House 3: 6 bedrooms; \$785,000; Score=0.3

4 Top-K Queries over Precise Relational Data
- Support approximate matches with minimal changes to the relational engine
- Initial focus: Selection queries with “equality” conditions

5 Outline
- Definition of top-k queries
- Execution alternatives
- Mapping of top-k queries to selection queries
- Experiments

6 Top-K Selection Queries
- Specify an n-dimensional target point
- Define a scoring function
- Specify k
Answer: the k objects with the best score for the target point (i.e., the “top k” objects)

7 Specifying Top-K Queries using SQL
Select *
From R
Order [k] By Scoring_Function

8 Scoring Functions Measure Degree of Match
- Assume attributes are defined over a metric space
- The score on any one attribute is well defined
- How to aggregate scores across attributes?

9 Scoring Functions
- Normalize attribute scores to be in the [0,1] range
- Combine scores using popular aggregate functions
  - Min
  - Euclidean
  - Sum, Max, …

10 Some Example Scoring Functions
Let $q = (q_1, \ldots, q_n)$ be the target point and $t = (t_1, \ldots, t_n)$ a tuple:
- $\mathrm{Min}(q, t) = \min\{1 - |q_1 - t_1|, \ldots, 1 - |q_n - t_n|\}$
- $\mathrm{Euclidean}(q, t) = 1 - \sqrt{\frac{(q_1 - t_1)^2}{n} + \cdots + \frac{(q_n - t_n)^2}{n}}$
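
As a minimal sketch, the two scoring functions on this slide can be written directly in Python. The function names and the sample points are illustrative, and the code assumes every attribute value has already been normalized to the [0,1] range, as the previous slide requires.

```python
import math

def min_score(q, t):
    """Min score: the worst per-attribute match, 1 - |q_i - t_i|."""
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def euclidean_score(q, t):
    """Euclidean score: 1 - sqrt(sum((q_i - t_i)^2) / n)."""
    n = len(q)
    return 1.0 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)

# Hypothetical normalized target point and tuple.
q = (0.5, 0.5)
t = (0.75, 0.5)
print(min_score(q, t))        # 0.75
print(euclidean_score(q, t))  # 1 - 0.25 / sqrt(2)
```

Both functions return 1.0 for an exact match and decrease as the tuple moves away from the target on any attribute.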

11 Executing Top-K Queries
- Known techniques require at least one sequential scan (or a functional index)
  - Evaluate Scoring_Function for each tuple
  - Sort tuples [Carey & Kossmann ’97; ’98]
- Question: How to avoid sequential scans?
Exploit the implicit selectivity of top-k queries
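
For reference, the scan-based baseline described on this slide can be sketched as follows. This is an illustration only: `top_k_scan` and the sample points are hypothetical names, the relation is an in-memory list, and the Min scoring function stands in for Scoring_Function.

```python
import heapq

def min_score(q, t):
    # Min scoring function from the slides; values assumed normalized to [0, 1].
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_scan(tuples, q, k, score=min_score):
    # One pass over every tuple; heapq.nlargest keeps only the k best seen so far.
    return heapq.nlargest(k, tuples, key=lambda t: score(q, t))

# Hypothetical normalized data set.
points = [(0.5, 0.5), (0.9, 0.1), (0.6, 0.4), (0.0, 1.0)]
print(top_k_scan(points, q=(0.5, 0.5), k=2))  # [(0.5, 0.5), (0.6, 0.4)]
```

Every tuple is scored regardless of k, which is the cost the mapping to selection queries tries to avoid.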

12 Mapping a Top-K Query to a Selection Query
- Determine a search score S such that:
  - Expected # of tuples with score > S is k
  - No false dismissals
- Turn the condition that score > S into a range selection condition(s)
- Evaluate the selection query using the existing query processor and access paths

13 Mapping a Top-K Query to a Selection Query
Example: 4 bedrooms; \$350,000; k=10
- Retrieve all tuples with score > 0.5 (at least k=10 tuples expected)
- Analyze the scoring function to determine the selection range: Bedrooms: [3, 5] and Price: [\$250K, \$450K]

14 Mapping a Search Score to a Selection Range
For search score $S$, target point $q = (q_1, q_2)$, and scoring function Min, the selection range is:
- $t_1 \in [q_1 - (1.0 - S),\ q_1 + (1.0 - S)]$
- $t_2 \in [q_2 - (1.0 - S),\ q_2 + (1.0 - S)]$
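
The range for Min can be computed mechanically, as in the sketch below. The Euclidean variant is not given on the slide; it is derived here from the Euclidean formula, since score > S implies $(q_i - t_i)^2 < n(1 - S)^2$ for every attribute, so it should be read as an assumption of this sketch. Function names are illustrative and values are assumed normalized to [0,1].

```python
import math

def min_selection_range(q, s):
    # Min: score >= s exactly when every attribute is within (1 - s) of the target.
    radius = 1.0 - s
    return [(qi - radius, qi + radius) for qi in q]

def euclidean_selection_range(q, s):
    # Derived bounding box of the Euclidean "sphere" (not stated on the slide):
    # every attribute is within sqrt(n) * (1 - s) of the target.
    radius = math.sqrt(len(q)) * (1.0 - s)
    return [(qi - radius, qi + radius) for qi in q]

print(min_selection_range((0.5, 0.5), 0.75))        # [(0.25, 0.75), (0.25, 0.75)]
print(euclidean_selection_range((0.5,) * 4, 0.75))  # [(0.0, 1.0)] * 4
```

For the same search score, the Euclidean box is wider by a factor of $\sqrt{n}$, so it may retrieve tuples whose score is actually below S.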

15 Determining a Search Score
- Monotonicity: Consider a tuple t that is no further from the target than t’ on any attribute: the score of t should be at least that of t’
- Therefore, the score cannot be high “far away” from the target
  - Sphere for Euclidean
  - Box for Min
  …centered at the target point
“Tightness” of the enclosing range varies with the scoring function

16 The Min Scoring Function

17 The Euclidean Scoring Function

18 Comments on Mapping
- The search score determines efficiency, not correctness
- Issues in efficiency:
  - Avoid retrieving too many tuples
  - Avoid retrieving fewer than k top tuples (restarts)
How to determine good search scores?
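
The sketch below illustrates why the search score affects only efficiency. It uses the Min scoring function, an in-memory list in place of the relation, and a list filter in place of the range selection. The slides do not say how a restart chooses its new search score; lowering it by a fixed `step` is purely an assumption of this example, as are all names.

```python
import heapq

def min_score(q, t):
    # Min scoring function; values assumed normalized to [0, 1].
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def top_k_via_selection(tuples, q, k, s, step=0.1):
    """Return (top-k tuples, number of restarts) for search score s."""
    restarts = 0
    while True:
        radius = 1.0 - s
        # Stand-in for the range selection query: for Min, the box
        # q_i +/- (1 - s) contains exactly the tuples with score >= s.
        candidates = [t for t in tuples
                      if all(abs(qi - ti) <= radius for qi, ti in zip(q, t))]
        # With s <= 0 the box covers the whole normalized space.
        if len(candidates) >= k or s <= 0.0:
            best = heapq.nlargest(k, candidates, key=lambda t: min_score(q, t))
            return best, restarts
        # Too few tuples: restart with a lower (less selective) search score.
        s = max(s - step, 0.0)
        restarts += 1

points = [(0.5, 0.5), (0.9, 0.1), (0.6, 0.4), (0.0, 1.0)]
print(top_k_via_selection(points, (0.5, 0.5), k=2, s=0.95))
# ([(0.5, 0.5), (0.6, 0.4)], 1)
```

A search score that is too high costs a restart, while one that is too low retrieves more candidates than needed; in both cases the final answer is the same.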

19 Determining Search Scores
- Find k points in the data
- Compute their scores
- Set the search score to the lowest score
Challenges:
- Determining the initial k points to optimize execution
- Taking the original query into account

20 Using Histograms [Figure: target point Q and histogram buckets with frequencies 4, 20, 11, 10]

21 Picking K Representative “Tuples”
- Collapse each histogram bucket to a single representative point
  - Furthest from Q in bucket (“NoRestarts”)
  - Closest to Q in bucket (“Restarts”)
- Assign the bucket frequency to the single representative point
- Include the closest representative points until we have k tuples
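
The steps on this slide can be sketched as below. The bucket layout `(lo, hi, frequency)`, with `lo` and `hi` the corners of an n-dimensional rectangle, is a hypothetical representation chosen for the example, as are the function names and sample buckets. The sketch ranks representatives by score, which for a monotone scoring function corresponds to taking the closest representatives first.

```python
def min_score(q, t):
    # Min scoring function; values assumed normalized to [0, 1].
    return min(1.0 - abs(qi - ti) for qi, ti in zip(q, t))

def closest_point(q, lo, hi):
    # Clamp the target into the bucket, one dimension at a time.
    return tuple(min(max(qi, l), h) for qi, l, h in zip(q, lo, hi))

def furthest_point(q, lo, hi):
    # In each dimension pick the bucket boundary further from the target.
    return tuple(l if abs(qi - l) >= abs(qi - h) else h
                 for qi, l, h in zip(q, lo, hi))

def search_score(buckets, q, k, score=min_score, strategy="NoRestarts"):
    pick = furthest_point if strategy == "NoRestarts" else closest_point
    # One representative per bucket, carrying the bucket's frequency.
    reps = sorted(((score(q, pick(q, lo, hi)), freq) for lo, hi, freq in buckets),
                  reverse=True)
    covered = 0
    for s, freq in reps:
        covered += freq
        if covered >= k:
            return s          # lowest score among the included representatives
    return reps[-1][0]        # histogram holds fewer than k tuples

# Two hypothetical buckets: (lower corner, upper corner, frequency).
buckets = [((0.25, 0.25), (0.75, 0.75), 4), ((0.75, 0.25), (1.0, 0.75), 20)]
q = (0.5, 0.5)
print(search_score(buckets, q, k=10, strategy="NoRestarts"))  # 0.5
print(search_score(buckets, q, k=10, strategy="Restarts"))    # 0.75
```

NoRestarts yields the lower, more conservative search score, and Restarts the higher, more selective one.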

22 Using Histograms: “NoRestarts” [Figure: target point Q and histogram buckets with frequencies 4, 20, 11, 10]

23 Using Histograms: “Restarts” [Figure: target point Q and histogram buckets with frequencies 4, 20, 11, 10]

24 Other Strategies for Determining Search Scores
- Calculate the search score for:
  - n = NoRestarts (“pessimistic” extreme)
  - r = Restarts (“optimistic” extreme)
- Use intermediate scores:
  - $\mathit{Inter}_1 = (2n + r)/3$
  - $\mathit{Inter}_2 = (n + 2r)/3$
[Figure: scale from 0 to 1 marking Restarts and NoRestarts]
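
As a small worked example of the two formulas, the sketch below computes both intermediate scores from a NoRestarts score n and a Restarts score r. The function name and the sample values are illustrative.

```python
def intermediate_scores(n, r):
    # Weighted averages between the pessimistic score n (NoRestarts)
    # and the optimistic score r (Restarts).
    inter1 = (2 * n + r) / 3  # closer to NoRestarts
    inter2 = (n + 2 * r) / 3  # closer to Restarts
    return inter1, inter2

inter1, inter2 = intermediate_scores(n=0.5, r=0.75)
print(round(inter1, 4), round(inter2, 4))  # 0.5833 0.6667
```

The two values split the interval between n and r into thirds, giving a graded trade-off between retrieving extra tuples and risking a restart.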

25 Evaluating the Generated Selection Query
- Sequential scan
- Intersection of a set of indexes, followed by data access
- Special case: index-only access

26 Indexes and Statistics
- Indexes
  - n-dim (concatenated-key) B-trees
- Statistics
  - MaxDiff as base 1-dim histogram
  - Multidimensional histograms: AVI, Phased, MHist

27 Experimental Evaluation
- Is mapping to selection queries an effective technique?
- Sensitivity of relevant parameters:
  - Scoring functions
  - Data skew and dimensionality
  - Statistics

28 Data Generation
Characterized by Z = [formula not recoverable]
- Generate N tuples by Zipfian distribution $z_1$
- Group tuples by $attr_1$
- For a partition with $attr_1 = a$ with $N_1$ tuples:
  - Generate $N_1$ values $w_1, \ldots, w_{N_1}$ using Zipfian distribution $z_2$
  - Create pairs $(a, w_1), \ldots, (a, w_{N_1})$
- Repeat steps to fill in all attribute values
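
A rough sketch of this generation procedure follows. Several details are assumptions, because the slide does not give them: the attribute domain is a fixed set of `domain_size` integer values, the value at rank i gets probability proportional to $1/i^{z}$, and for the third and later attributes the tuples are regrouped by all attribute values generated so far. All names and the seed are illustrative.

```python
import random
from collections import Counter

def zipf_values(count, skew, domain_size, rng):
    # Draw `count` values; rank i (1-based) has weight 1 / i**skew.
    weights = [1.0 / (rank ** skew) for rank in range(1, domain_size + 1)]
    return rng.choices(range(domain_size), weights=weights, k=count)

def generate(n_tuples, skews, domain_size=10, seed=0):
    rng = random.Random(seed)
    # First attribute: N values drawn with skew z_1.
    tuples = [(v,) for v in zipf_values(n_tuples, skews[0], domain_size, rng)]
    for skew in skews[1:]:
        # Group by the values generated so far (assumed grouping rule).
        groups = Counter(tuples)
        tuples = []
        for prefix, count in groups.items():
            # Within each partition, draw the next attribute with this skew.
            tuples.extend(prefix + (w,)
                          for w in zipf_values(count, skew, domain_size, rng))
    return tuples

data = generate(1000, skews=(2.0, 1.0, 1.0))
print(len(data), len(data[0]))  # 1000 3
```

Each step preserves the number of tuples in every partition, so the earlier attributes keep their Zipfian frequencies while later attributes are skewed within each partition.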

29 Metrics for Comparison
- Fraction of data tuples accessed, which may be compared to:
  - Ideal: k
  - Worst case: size of data set
- % of restarts

30 Exploring Limits
- Intrinsic limitations of the range-query approach:
  - Enclose the actual top-k tuples in a tight n-rectangle
  - Retrieve all tuples in the n-rectangle
  Less than 1% of database tuples in the n-rectangle (k=10; 100,000 tuples)
- Effect of retrieving tuples with score > S using an n-rectangle

31 Effect of Scoring Functions
- Min has little/no gap between the target region and the enclosing n-rectangle
  As k increases, the fraction of retrieved tuples grows slowest for Min
- Euclidean performs worse
  Less tight n-rectangle

32 Tuples with Score > S v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

33 Effect of Mapping Strategies and Histograms
- Multidimensional histograms aid computation of tight search scores
- NoRestarts dominates at high data skew

34 Tuples Retrieved v. Data Skew (PHASED histogram of 5KB; n=3)

35 Restarts v. Data Skew (PHASED histogram of 5KB; n=3)

36 Related Work (1)
- [Fagin ’96; ’98]
  - Multimedia attributes with query “subsystem”
  - Multiple index scans
  - Independence assumption
- [Chaudhuri & Gravano ’96]
  - Multimedia attributes with query “subsystem”
  - Map top-k queries to “selection” queries
  - Independence assumption
  - Limited scoring functions

37 Related Work (2)
- [Carey & Kossmann ’97; ’98]
  Optimized sorting phase using k
- Nearest-neighbor literature
- [Donjerkovic & Ramakrishnan ’99]
  - Probabilistic optimization framework
  - No multidimensional scoring functions
  - Independence assumptions

38 Summary
- Defined a mapping of top-k queries to traditional selection queries
  Exploit existing database statistics and query processors
- Studied the effect of scoring functions, data skew, and statistics on the mapping
Full experimental analysis forthcoming!

39 Tuples Retrieved v. Histogram Size (Euclidean; n=3; Z21)

40 Tuples Retrieved v. n (PHASED histogram of 5KB; Z21)

41 Restarts v. n (PHASED histogram of 5KB; Z21)

42 Tuples Retrieved v. k (PHASED histogram of 5KB; Z21; n=3)

43 Restarts v. k (PHASED histogram of 5KB; Z21; n=3)

44 Restarts v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

45 Tuples Retrieved v. Histogram Size (Census Database; PHASED)

46 Tuples Retrieved v. Data Skew (Euclidean; PHASED histogram of 5KB; n=3)

47 The Sum Scoring Function

48 The Max Scoring Function
