Download presentation

Presentation is loading. Please wait.

Published byKarla Farthing Modified over 2 years ago

1
Using the Crowd for Top-K or Group-By Queries Susan Davidson U. of Pennsylvania Sanjeev Khanna U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U. of Washington 13/20/2013ICDT 2013

2
Example: Using Wisdom of Crowd in DB Queries Database of football players pictures – Multiple photos at different ages of each player Q. Group the photos of individual players Q. Find their most recent photos 23/20/2013ICDT 2013

3
How a DBMS Thinks Q. Group the photos of individual players Group-By Queries! Use Name attribute Q. Find their most recent photos Max/Top-k Queries! Use Date attribute What if name/date is missing Image processing? Photo forensics? 33/20/2013ICDT 2013

4
Ask the Crowd! 43/20/2013ICDT 2013

5
How the Crowd Thinks - 1 Q. Group the photos of individual players 53/20/2013ICDT 2013

6
How the Crowd Thinks - 2 Q. Find their most recent photos 63/20/2013ICDT 2013

7
Crowd Sourcing 73/20/2013ICDT 2013 Using human intelligence to do tasks which are harder to automate A recent topic of interest in database and other research communities Many crowdsourcing platforms

8
This talk: Using Wisdom of Crowd in Top-K and Group-By Queries SELECT TOP 1 R.picture FROM SoccerPlayerPhotoTable AS R GROUP BY R.player ORDER BY R.date DESC Top-k/max Group By / Clustering 8 A possible query Fixed but unknown attributes: R.player, R.date 3/20/2013ICDT 2013

9
Outline of this talk Our Model Max and Top-K Queries Group-By Queries (Clustering) Combination is easy.. 93/20/2013ICDT 2013

10
Outline of this talk Our Model Max and Top-K Queries Group-By Queries (Clustering) 103/20/2013ICDT 2013

11
Elements: Type and Value #Elements = n – (n = 16) – Two attributes: Type and Value Type – e.g. Name = Maradona – used in Group-By – #Types = J (J clusters, J = 4) Value – e.g. Date when photo was taken – used in Top-K Unknown, but ground-truth exists for Types and Values 113/20/2013ICDT 2013

12
Type and Value Comparisons DB Queries vs. Comparisons (questions to the crowd) Type Comparisons: – Type(x) = Type(y)? Value Comparisons: – Value(x) > Value(y)? – Assumes same type Same Person? Is the first photo older? Answer is Boolean We cannot ask – What is Type(x)/Value(x)? But, answers are not always correct Crowd makes mistakes Next, error model 123/20/2013ICDT 2013

13
Constant Error Model Type comparisons: Type(x) = Type(y)? Value comparisons: Value(x) > Value(y)? Constant error model: – Standard model – Probability of wrong answer ½ -, >0 Is the first photo older? Wrong: w.p. ½ - Correct: w.p. ½ + Same person? Wrong: w.p. ½ - Correct: w.p. ½ + 133/20/2013ICDT 2013

14
Variable Error Model (this paper) For value comparisons only : Value(x) > Value(y)? Error probability 1/ f( ) - – = Distance of x, y in sorted order Is the first photo older? – Easier Is the first photo older? - Harder x1x1 e.g. Error probability when f( ) = e 14 f is 1.strictly monotone x 1 > x 2 f(x 1 ) > f(x 2 ) 2.f(x) 2 e.g. f( ) = e f( ) = + 1 f( ) = log + 2 x2x2 x3x3 x4x4 1/e 1/e 2 = 2 3/20/2013ICDT 2013 ½ -

15
Framework at a Glance 15 Internal Computation (Top-K/ Group-By) Crowd Value(a) > Value(b) Yes/No Value(b) > Value(c) Yes/No Crowd-sourced DBMS Comparison Error Crowds answer may be wrong with some prob. Cost Model Asking questions costs money Additive Count #comparisons We still want the correct answer w.h.p. 3/20/2013ICDT 2013

16
Our Goal Minimize the total #comparisons while outputting the correct answer (exact top-k or clusters) w.p. 1 – δ (given constant δ > 0) 163/20/2013ICDT 2013

17
Outline of this talk Our Model Max and Top-K Queries Group-By Queries (Clustering) 173/20/2013ICDT 2013

18
Problem Statement: Max/Top-k Input: n elements, k – Same type – Only value comparisons are used Output: All top-k elements (w.p. 1 – δ) – (k = 2) We focus on Max: Algorithm for top-k builds on the algorithm for max 18 x3x3 x1x1 x2x2 x4x4 3/20/2013ICDT 2013

19
Max: Our Results 19 Exact Comparisons Constant Error [Feige et. al. 94] Error prob < ½ Variable Error [This paper] Error prob < 1/f( ) Upper Bound n - 1O(n) n + o(n) for any strictly monotone function f Lower Boundn - 1(n) (1+c) n, c > 0 when High success prob. required with high error Asymptotically smaller than all c.n, c>0 Required confidence: 1 – δ (δ = constant), = distance in sorted order 3/20/2013ICDT 2013 Further, f( ) = exp. n + O(1) f( ) = linear n + O(log log n) f( ) = log. n + O(n 1/1+ δ ) Further, f( ) = exp. n + O(1) f( ) = linear n + O(log log n) f( ) = log. n + O(n 1/1+ δ )

20
Review: Tournament Tree for Max n elements as leaves 20 Exact comparisons: Binary tree structure is not necessary Noisy comparisons: Repeat comparison + majority vote Constant error model: θ(n) algorithm (Feige et. al. 94) Our goal: Total no. of comparisons = n + o(n) cannot repeat even twice in most of the internal nodes Comparison at internal nodes Max #comparisons = c. 1 #comparisons = c. 3 #comparisons = c. 5 3/20/2013ICDT 2013

21
Main Steps for Max Upper levels Use Feiges algorithm Y nodes Ө(Y) cost n nodes Binary tournament tree Lower Levels Upper Levels 21 Y = o(n) for any f Y = O(1) for f = e Lower levels Just 1 comparison at each internal node No majority vote Y nodes Max does not lose in the lower levels w.h.p. Intuition: Max does not meet 2 nd Max in the lower levels w.h.p. Total no. of comparisons = n + Ө(Y) = n + o(n) 3/20/2013ICDT 2013 Goal: n + o(n) Key idea: Start with a random permutation of elements at the leaves

22
Extension to Top-K 22 Corollary If k = o( n), n + o(n) comparisons suffice to find top-k w.h.p. x 1, …, x k do not meet in the lower levels 3/20/2013ICDT 2013

23
Outline of this talk Our Model Max and Top-K Queries Group-By Queries (Clustering) – Simple Clustering (using types only) – Clustering with Correlated Values and Types 233/20/2013ICDT 2013

24
Simple Clustering Group n elements into J clusters – Only type comparisons – Constant error model O(nJ) comparisons are sufficient – Log factor for comparison error – J can be unknown – Not surprising We show (nJ) lower bound – Even if no comparison error – Even for randomized algorithms – Even when J is known 24 ~ 3/20/2013ICDT 2013

25
O(n log J) value and type comparisons suffice to find the clusters w.h.p. ~ Clustering with Correlated Types and Values Full correlation: Elements of same type form contiguous blocks in the sorted order on values For no correlation, we gave a lower bound of (nJ) Type 1 Type 2 Type 3 High values 25 Low values 3-br 2-br 1-br Assume no error Apts. In a building Type = #bedrooms Value = rent 3/20/2013ICDT 2013

26
Algorithm for Full Correlation Linear cost in total in each inner iteration #Inner iterations = O(log J) Cost: C(n) C(n/2) + O(n log J) C(n) = O(n log J) Type 1Type 2Type 3 26 s = 13 - Find median and partition - If 2 types in a block, Repeat - Same type in a block, keep only one element Repeat Until each block has one type 4J blocks at most J blocks contain 2 types, s/2 elements - Until #elements s/2 J = 3 3/20/2013ICDT 2013

27
Type 2 Extension to Partial Correlation Partial correlation: Elements of same type form almost contiguous blocks Type 1Type 3 27 Type 1 O(n log (α J) + αJ) value and type comparisons suffice w.h.p. ~ α changes in the same type Hotels in a city Type = Star-rating Value = Avg. price 3/20/2013ICDT 2013

28
Related Work Max/Top-K: [Guo et. al. 12]: Which element has the maximum likelihood of being max, which future queries are most effective [Venetis et. al. 12]: Efficient heuristics that tune latency, cost, quality to find max [Feige et. al. 94]: Tight bounds for Max/Top-K/Sorting in constant error model Other error model and objectives exist in theory literature Clustering [Gomes et. al. 11]: Machine learning approach to define clusters [Parameswaran et. al. 12, Wang et. al. 12]: Filtering data, entity resolution Other Work: Crowd sourced DBs: CrowdDB, Deco, Qurk, MoDaS… Crowd sourcing for data cleansing, data integration, entity resolution, data analytics, … 283/20/2013ICDT 2013

29
Conclusion We studied Max/Top-k and Group-By queries in crowd-sourced setting – Top-K: Proposed a variable error model, allows fewer comparisons – Group by: Obvious simple algorithm is the best possible, correlation reduces #comparisons Future Work: – Other objectives: latency, #rounds, or to reduce error when a budget on #comparisons is given – Other cost functions: e.g. more #similar queries in a round, less cost 293/20/2013ICDT 2013

30
Thank You Questions? 303/20/2013ICDT 2013

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google