Using the Crowd for Top-K or Group-By Queries


1 Using the Crowd for Top-K or Group-By Queries
Susan Davidson (U. of Pennsylvania), Sanjeev Khanna (U. of Pennsylvania), Tova Milo (Tel Aviv U.), Sudeepa Roy (U. of Washington). ICDT 2013, 3/20/2013.

2 Example: Using Wisdom of Crowd in DB Queries
A database of football players' pictures, with multiple photos of each player at different ages. Some photos are tagged, some untagged. Q. Group the photos of individual players. Q. Find their most recent photos. There are millions of soccer fans in the world.

3 How a DBMS Thinks
Q. Group the photos of individual players: a Group-By query, using the “Name” attribute. Q. Find their most recent photos: a Max/Top-k query, using the “Date” attribute. But do we have all the information these queries need, i.e. the values of the name and date attributes? Some photos are tagged, some untagged, and some photos are missing their dates. What if the name or date is missing? Image processing? Photo forensics?

4 Ask the “Crowd”!
There are millions of soccer fans in the world. The more recent photos of the more famous players are probably easy to identify, and fans of these players may collect even their childhood photos.

5 How the Crowd Thinks - 1
Q. Group the photos of individual players.

6 How the Crowd Thinks - 2
Q. Find their most recent photos.

7 Crowd Sourcing
Using human intelligence to do tasks that are hard to automate. A recent topic of interest in the database and other research communities, with many crowdsourcing platforms available. This talk: use the wisdom of the crowd for database queries. Some attributes in a table are fixed but unknown, yet the group-by and top-k functions over them are easier for the crowd to compute.

8 This talk: Using Wisdom of Crowd in Top-K and Group-By Queries
A possible query: SELECT TOP R.picture FROM SoccerPlayerPhotoTable AS R GROUP BY R.player ORDER BY R.date DESC
In this query, TOP with ORDER BY is the top-k/max part and GROUP BY is the group-by/clustering part. The fixed but unknown attributes are R.player and R.date.

9 Outline of this talk
Our Model (error, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering). Brief outline of technical results. Combining top-k with clustering is easy, and we also talk about it.

10 Outline of this talk
Our Model (error, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering). Three problems in the paper.

11 Elements: Type and Value
#Elements = n (n = 16 in the example). Each element has two attributes, a Type and a Value; these are the only relevant attributes. Type, e.g. Name = “Maradona”, is used in Group-By. Types partition the elements into J clusters (#Types = J, J = 4 in the example), and the type is implicit. Value, e.g. the date when the photo was taken, is used in Top-K. Types and Values are unknown, but a “ground truth” exists for them. Note: check that all values are distinct.

12 Type and Value Comparisons
DB queries vs. comparisons (questions to the crowd). Type comparison: Type(x) = Type(y)? (“Same person?”) Value comparison: Value(x) > Value(y)? (“Is the first photo older?”) Value comparisons assume the same type. A user may not know the name of a player or the exact date a photo was taken, but can still decide whether two photos show the same player or which photo looks younger. We assume values are distinct. The answer is Boolean; we cannot ask “What is Type(x)/Value(x)?” But answers are not always correct: the crowd makes mistakes. Next, the error model.

13 Constant Error Model
Type comparisons: Type(x) = Type(y)? (“Same person?”) Value comparisons: Value(x) > Value(y)? (“Is the first photo older?”) Constant error model (the standard model): the crowd gives a wrong answer to a type or value comparison with probability ≤ ½ - ε, for some constant ε > 0, and a correct answer with probability ≥ ½ + ε. The crowd makes mistakes for various reasons: ignorance, insincerity, lack of time. The ½ + ε bound means the crowd always answers a little better than random. [The value of ε can be estimated by sampling; it decides how many repeated comparisons are needed.]
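The repetition idea in the bracketed note can be sketched as follows. This is a minimal illustration, not the authors' procedure: the names `noisy_greater`, `repetitions_needed` and `majority_greater` are hypothetical, `eps` stands for the slack in the error bound, the simulated crowd errs with the worst-case probability ½ - eps, and the repetition count comes from the standard Hoeffding bound exp(-2·eps²·k) on a wrong majority.

```python
import math
import random


def noisy_greater(x, y, eps, rng):
    """Simulated crowd answer to 'Value(x) > Value(y)?'.

    Assumption: the answer is wrong with probability exactly 1/2 - eps,
    the worst case allowed by the constant error model.
    """
    truth = x > y
    return truth if rng.random() < 0.5 + eps else not truth


def repetitions_needed(eps, delta):
    """Smallest odd k for which Hoeffding's bound exp(-2*eps^2*k) <= delta."""
    k = math.ceil(math.log(1 / delta) / (2 * eps ** 2))
    return k if k % 2 == 1 else k + 1


def majority_greater(x, y, eps, delta, rng):
    """Ask the same comparison k times and return the majority answer."""
    k = repetitions_needed(eps, delta)
    yes_votes = sum(noisy_greater(x, y, eps, rng) for _ in range(k))
    return yes_votes > k / 2
```

For example, `repetitions_needed(0.25, 0.05)` asks for 25 repetitions of one comparison, which shows why repeating every comparison is expensive when the goal is to keep the total number of comparisons low.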

14 Variable Error Model (this paper)
Some “Is the first photo older?” questions are harder and some are easier. For value comparisons only: Value(x) > Value(y)? Error probability ≤ 1/f(Δ) - ε, where Δ = distance of x and y in sorted order. f is strictly monotone: x1 > x2 ⇒ f(x1) > f(x2), with f(x) ≥ 2, e.g. f(Δ) = e^Δ, f(Δ) = Δ + 1, f(Δ) = log Δ + 2. These functions are chosen so that f(x) ≥ 2, so even consecutive elements have error probability < ½. The variable error model is more realistic: the error probability depends on the distance between values, and a higher distance ⇒ a smaller error. Assume the xi also denote their values. f is the error function and can be estimated, e.g. by sampling. In the simplified expression for the error probability, with elements x1 x2 x3 x4 in sorted order and f(Δ) = e^Δ, the error probability is ≤ 1/e at distance 1 and ≤ 1/e² at distance 2. If needed, the model can also be applied to a value-dependent definition. Variable error saves cost. Next we define what we mean by cost and what we mean by a good solution.
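The three example error functions can be tabulated with a short sketch. It computes only the simplified bound 1/f(distance) and ignores any additive slack term; the dictionary `ERROR_FUNCTIONS`, the function `error_bound`, and the choice of the natural logarithm for the log example are assumptions made here for illustration.

```python
import math

# Example error functions from the slide; each satisfies f(1) >= 2.
# Assumption: "log" is taken to be the natural logarithm.
ERROR_FUNCTIONS = {
    "exponential": lambda d: math.exp(d),
    "linear": lambda d: d + 1,
    "logarithmic": lambda d: math.log(d) + 2,
}


def error_bound(name, distance):
    """Simplified bound 1/f(distance) on the comparison error probability.

    `distance` is the distance (>= 1) of the two elements in sorted order.
    """
    return 1.0 / ERROR_FUNCTIONS[name](distance)


if __name__ == "__main__":
    for name in ERROR_FUNCTIONS:
        row = ", ".join(f"{error_bound(name, d):.3f}" for d in (1, 2, 4, 8))
        print(f"{name:12s} {row}")
```

In every case the bound shrinks as the distance grows, which matches the statement that a higher distance gives a smaller error.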

15 Framework at a Glance
The crowd-sourced DBMS runs the internal computation (Top-K/Group-By) and delegates some of the operations to the crowd: it asks comparisons such as Value(a) > Value(b)? or Value(b) > Value(c)? and receives Yes/No answers. Points to note: the crowd won't work for free in most cases, so the number of questions we can ask is limited, and there is always a chance of error. Comparison error: the crowd's answer may be wrong with some probability. Cost model: asking questions costs money; each type/value comparison costs money, the cost is additive, and we count #comparisons. We still want the correct answer w.h.p.

16 Our Goal
Minimize the total #comparisons (type and value comparisons in total) while outputting the correct answer (exact top-k or exact clusters) w.p. ≥ 1 – δ, for a given constant δ > 0. Each comparison costs a few cents. Both queries can be combined as well. That's all about our model.

17 Outline of this talk
Our Model (error, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering).

18 Problem Statement: Max/Top-k
Input: n elements of the same type, and k. Types are ignored; only value comparisons are used. Output: all top-k elements (w.p. ≥ 1 – δ); k = 2 in the example. We focus on Max: the algorithm for top-k builds on the algorithm for max.

19 Max: Our Results
Required confidence: 1 – δ (δ = constant); Δ = distance in sorted order.
Settings: exact comparisons; constant error [Feige et al. '94], error prob < ½; variable error [this paper], error prob < 1/f(Δ).
Upper bound: n - 1 (exact comparisons); O(n) (constant error); n + o(n) for any strictly monotone function f (variable error).
Lower bound: Ω(n); ≥ (1+c) n, c > 0, when high success probability is required with high error.
Known f gives a better analysis: f(Δ) = exponential ⇒ n + O(1); f(Δ) = linear ⇒ n + O(log log n); f(Δ) = log ⇒ n + O(n^(1/(1+δ))).
Notes: o(n) is asymptotically smaller than c·n for every c > 0. The details of ε and δ are hidden in the expressions, assuming they are constants. These results complement Feige's results.

20 Review: Tournament Tree for Max
[Figure: tournament tree with the n elements as leaves and a comparison at each internal node; #comparisons per node is c·1, c·3, c·5 from the lowest level up.] Exact comparisons: the binary tree structure is not necessary. Noisy comparisons: repeat each comparison and take a majority vote. Upper levels have a higher chance of losing, so repeat more there. Constant error model: Θ(n) algorithm (Feige et al. '94). Our goal: total no. of comparisons = n + o(n), so we cannot repeat even twice in most of the internal nodes.

21 Main Steps for Max
Goal: n + o(n) comparisons on a binary tournament tree. Lower levels (n nodes): just 1 comparison at each internal node, no majority vote. Upper levels (Y nodes): use Feige's algorithm; Y nodes ⇒ Θ(Y) cost, with Y = o(n) for any f and Y = O(1) for f = e^Δ. Key idea: start with a random permutation of the elements at the leaves. Then Max does not lose in the lower levels w.h.p. Intuition: Max does not meet the 2nd Max in the lower levels w.h.p. Total no. of comparisons = n + Θ(Y) = n + o(n). In the analysis of the tournament tree, X must be small, and Max never loses in the lower or upper levels.
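The two-phase tournament described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: the upper levels use plain repetition with a majority vote as a stand-in for Feige's algorithm, the error probability is taken to be e^(-distance in sorted order), values are assumed distinct, and the names `make_oracle`, `play`, `tournament_max`, `upper_size` and `upper_repeats` are hypothetical.

```python
import math
import random


def make_oracle(values, rng, error_prob=lambda d: math.exp(-d)):
    """Build a noisy 'is a > b?' oracle plus a comparison counter.

    Assumption: the answer is wrong with probability error_prob(d), where d
    is the distance of a and b in sorted order (values must be distinct).
    """
    rank = {v: i for i, v in enumerate(sorted(values))}
    count = {"comparisons": 0}

    def greater(a, b):
        count["comparisons"] += 1
        truth = a > b
        wrong = rng.random() < error_prob(abs(rank[a] - rank[b]))
        return (not truth) if wrong else truth

    return greater, count


def play(a, b, greater, repeats):
    """Return the winner of a vs b by majority over `repeats` comparisons."""
    wins_a = sum(greater(a, b) for _ in range(repeats))
    return a if wins_a > repeats / 2 else b


def tournament_max(values, greater, rng, upper_size=8, upper_repeats=15):
    """Binary tournament over a random permutation of the leaves.

    Lower levels (more than `upper_size` players left): one comparison each.
    Upper levels: `upper_repeats` comparisons with a majority vote.
    """
    players = list(values)
    rng.shuffle(players)
    while len(players) > 1:
        repeats = 1 if len(players) > upper_size else upper_repeats
        winners = [play(players[i], players[i + 1], greater, repeats)
                   for i in range(0, len(players) - 1, 2)]
        if len(players) % 2 == 1:
            winners.append(players[-1])  # odd player out gets a bye
        players = winners
    return players[0]
```

With n = 1024 leaves and the default parameters, the lower levels use 1016 comparisons and the three upper levels add 7 × 15 = 105, so the total stays close to n.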

22 Extension to Top-K
Corollary: if k = o(n), n + o(n) comparisons suffice to find the top-k w.h.p. The random permutation is the key: x1, …, xk do not meet in the lower levels. The cost is n + g(k) in the variable error model.

23 Outline of this talk
Our Model (error, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering): Simple Clustering (using types only); Clustering with Correlated Values and Types.

24 Simple Clustering
Group n elements into J clusters using only type comparisons; values are ignored. Constant error model: Õ(nJ) comparisons are sufficient (O(nJ) with no error), even when J is unknown. The upper bound is not surprising; the extra logarithmic factor comes from repetition + majority vote to handle comparison error. We show an Ω(nJ) lower bound, which holds even with no comparison error, even for randomized algorithms (via Yao's minimax principle), and even when J is known.
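One natural way to reach the O(nJ) bound without comparison error is to keep one representative per cluster found so far and compare each new element against those representatives. The sketch below illustrates that idea under the assumption of an error-free type oracle; the slide does not spell out the algorithm, and `cluster_by_type` and `same_type` are hypothetical names.

```python
def cluster_by_type(elements, same_type):
    """Group elements using only 'Type(x) = Type(y)?' questions.

    Assumption: `same_type` always answers correctly. Each element is
    compared with one representative per existing cluster, so at most
    n * J comparisons are made, and J need not be known in advance.
    """
    clusters = []
    comparisons = 0
    for x in elements:
        for cluster in clusters:
            comparisons += 1
            if same_type(x, cluster[0]):  # cluster[0] is the representative
                cluster.append(x)
                break
        else:
            clusters.append([x])  # no match: x starts a new cluster
    return clusters, comparisons


if __name__ == "__main__":
    photos = ["m1", "p1", "m2", "z1", "p2", "m3"]
    # Hidden type for the demo: the first letter of the photo id.
    groups, cost = cluster_by_type(photos, lambda a, b: a[0] == b[0])
    print(groups, cost)
```

To tolerate comparison errors, each question would be repeated and decided by majority vote, which is where the extra logarithmic factor comes from.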

25 Clustering with Correlated Types and Values
Example: apartments in a building, with Type = #bedrooms (1-br, 2-br, 3-br) and Value = rent. Full correlation: elements of the same type form contiguous blocks in the sorted order on values, from low values to high values (more rent). The rent of different 2-br apartments may still differ with #bathrooms, the floor, the view, etc. For no correlation, we gave a lower bound of Ω(nJ). With full correlation, Õ(n log J) value and type comparisons suffice to find the clusters w.h.p. Assume no error.

26 Algorithm for Full Correlation
Sketch of the algorithm, assuming no error. Example: Type 1, Type 2, Type 3 (J = 3), s = 13.
- Find the median and partition.
- If a block contains ≥ 2 types, repeat on that block.
- If a block contains a single type, keep only one element of it.
- Repeat until #elements ≤ s/2: with 4J blocks, at most J blocks contain ≥ 2 types ⇒ ≤ s/2 elements remain.
- Repeat until each block has one type.
Each inner iteration has linear cost in total, and #inner iterations = O(log J). Cost: C(n) ≤ C(n/2) + O(n log J), so C(n) = O(n log J).
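The step from the recurrence to the closed form can be written out by unrolling. This is a routine derivation added for clarity; the constant c (hidden in the O(n log J) term) is introduced here, and J is treated as fixed across recursion levels.

```latex
\begin{align*}
C(n) &\le C(n/2) + c\,n\log J \\
     &\le C(n/4) + c\,\tfrac{n}{2}\log J + c\,n\log J \\
     &\le \dots \le C(1) + c\,n\log J \sum_{i \ge 0} 2^{-i} \\
     &= C(1) + 2c\,n\log J \;=\; O(n \log J).
\end{align*}
```

The geometric series is what keeps the total within a constant factor of the cost of the first round.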

27 Extension to Partial Correlation
Example: hotels in a city, with Type = star rating and Value = average price. Partial correlation: elements of the same type form almost contiguous blocks, with ≤ α changes in the same type. Almost contiguous: the price may depend on other factors such as location. Õ(n log(αJ) + αJ) value and type comparisons suffice w.h.p.; for small α the dependence on J drops from J to log J.

28 Related Work
Max/Top-K: [Guo et al. '12]: which element has the maximum likelihood of being the max, and which future queries are most effective. [Venetis et al. '12]: efficient heuristics that tune latency, cost and quality to find the max. [Feige et al. '94]: tight bounds for Max/Top-K/Sorting in the constant error model. Other error models and objectives exist in the theory literature.
Clustering: [Gomes et al. '11]: machine learning approach to define clusters. [Parameswaran et al. '12, Wang et al. '12]: filtering data, entity resolution.
Other work: crowd-sourced DBs (CrowdDB, Deco, Qurk, MoDaS, …); crowd sourcing for data cleansing, data integration, entity resolution, data analytics, … Crowd sourcing is of current interest to the DB community, and max/top-k to the theory community.

29 Conclusion
We studied Max/Top-k and Group-By queries in a crowd-sourced setting. Top-K: we proposed a variable error model, which allows fewer comparisons. Group-By: the obvious simple algorithm is the best possible; correlation reduces #comparisons. Future work: other objectives, such as latency, #rounds, or reducing error when a budget on #comparisons is given; other cost functions, e.g. more similar queries in a round costing less (we assumed a linear cost function).

30 Thank You
Questions?

