Using the Crowd for Top-K or Group-By Queries

Presentation transcript:

Using the Crowd for Top-K or Group-By Queries. Susan Davidson (U. of Pennsylvania), Sanjeev Khanna (U. of Pennsylvania), Tova Milo (Tel Aviv U.), Sudeepa Roy (U. of Washington). ICDT 2013, 3/20/2013.

Example: Using the Wisdom of the Crowd in DB Queries. Consider a database of pictures of football players, with multiple photos of each player at different ages; some photos are tagged and some are untagged. Q. Group the photos of individual players. Q. Find their most recent photos. There are millions of soccer fans in the world.

How a DBMS Thinks. Q. Group the photos of individual players: a Group-By query, using the "Name" attribute. Q. Find their most recent photos: a Max/Top-k query, using the "Date" attribute. But wait: do we have all the information needed for these queries, that is, the values of the name and date attributes? Some photos are tagged and some are untagged, and the date is missing from some photos. What if the name or date is missing? Image processing? Photo forensics?

Ask the "Crowd"! There are millions of soccer fans in the world. The more recent photos of the more famous players are possibly easy to identify, and some fans of these players collect even their childhood photos.

How the Crowd Thinks - 1. Q. Group the photos of individual players.

How the Crowd Thinks - 2. Q. Find their most recent photos.

Crowd Sourcing. Crowdsourcing uses human intelligence to do tasks that are harder to automate. It is a recent topic of interest in the database and other research communities, and there are many crowdsourcing platforms. This talk: use the wisdom of the crowd for database queries. Some attributes in a table are fixed but unknown, yet the group-by and top-k functions over them are easier for the crowd to compute.

This talk: Using the Wisdom of the Crowd in Top-K and Group-By Queries. A possible query: SELECT TOP 1 R.picture FROM SoccerPlayerPhotoTable AS R GROUP BY R.player ORDER BY R.date DESC. The slide labels the ordering part of the query "Top-k/max" and the grouping part "Group By / Clustering". The fixed but unknown attributes are R.player and R.date.

Outline of this talk. Our Model (error model, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering). Combining top-k with clustering is easy, and we also talk about it. Brief outline of technical results.

Outline of this talk. Our Model (error model, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering). Three problems in the paper.

Elements: Type and Value. #Elements = n (n = 16 in the example). Each element has two attributes, Type and Value; these are the only relevant attributes. Type, e.g. Name = "Maradona", is used in Group-By: types partition the elements into J clusters (#Types = J, with J = 4 in the example), and the type is implicit. Value, e.g. the date when the photo was taken, is used in Top-K. Types and Values are unknown, but a "ground truth" exists for them.

Type and Value Comparisons. DB queries vs. comparisons (questions to the crowd). Type comparison: Type(x) = Type(y)? ("Same person?") Value comparison: Value(x) > Value(y)? ("Is the first photo older?") A value comparison assumes the two elements have the same type, and we assume values are distinct. The answer is Boolean. We cannot ask "What is Type(x)/Value(x)?": a user may not know the name of a player or the exact date when a photo was taken, but can still decide whether two photos show the same player or in which photo the player looks younger. But answers are not always correct: the crowd makes mistakes. Next, the error model.

Constant Error Model. Type comparisons: Type(x) = Type(y)? (e.g. "Same person?") Value comparisons: Value(x) > Value(y)? (e.g. "Is the first photo older?") This is the standard error model: the crowd gives a wrong answer to a type or value comparison with probability ≤ ½ − ε and a correct answer with probability ≥ ½ + ε, for some constant ε > 0. The crowd makes mistakes for various reasons: ignorance, insincerity, lack of time. The ½ + ε bound means an answer is always a little better than random. The value of ε can be estimated by sampling, which determines how many repeated comparisons are needed.
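
The constant error model and the role of repeated comparisons can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper names (crowd_answer, majority_answer, repeats_needed) are hypothetical, the simulated crowd errs with probability exactly ½ − ε (the worst case the model allows), and the repetition count comes from the standard Hoeffding bound exp(−2rε²) on a wrong majority, which is not taken from the slides.

```python
import math
import random

def crowd_answer(truth, eps, rng):
    """Simulated crowd answer: wrong with probability exactly 1/2 - eps (assumed worst case)."""
    return truth if rng.random() >= 0.5 - eps else not truth

def majority_answer(truth, eps, rng, repeats):
    """Ask the same Boolean question `repeats` times (odd) and return the majority answer."""
    yes = sum(crowd_answer(truth, eps, rng) for _ in range(repeats))
    return 2 * yes > repeats

def repeats_needed(eps, fail_prob):
    """Smallest odd r with exp(-2 * r * eps^2) <= fail_prob (Hoeffding bound)."""
    r = math.ceil(math.log(1.0 / fail_prob) / (2.0 * eps * eps))
    return r if r % 2 == 1 else r + 1

rng = random.Random(0)
trials = 1000
eps = 0.1
r = repeats_needed(eps, 0.05)
single_rate = sum(crowd_answer(True, eps, rng) for _ in range(trials)) / trials
voted_rate = sum(majority_answer(True, eps, rng, r) for _ in range(trials)) / trials
print(f"repeats: {r}, single correct: {single_rate:.2f}, majority correct: {voted_rate:.2f}")
```

A smaller ε drives the repetition count up quadratically, which is why estimating ε by sampling matters for cost.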

Variable Error Model (this paper). For value comparisons only: Value(x) > Value(y)? The question "Is the first photo older?" is harder for some pairs and easier for others. Error probability ≤ 1/f(Δ) − ε, where Δ is the distance between x and y in sorted order. f is strictly monotone (x1 > x2 ⇒ f(x1) > f(x2)) and f(Δ) ≥ 2, so even consecutive elements have error probability ≤ ½ − ε < ½. Examples: f(Δ) = e^Δ, f(Δ) = Δ + 1, f(Δ) = log Δ + 2. The variable error model is more realistic: the error probability depends on the distance between the values, and a larger distance means a smaller error. f is the error function and can be estimated, e.g. by sampling. This is a simplified expression for the error probability; the model can also be applied to a value-dependent definition if needed. Assume the xi also denote their values. Example with elements x1, x2, x3, x4 and f(Δ) = e^Δ: the figure marks the error bounds ≤ 1/e and ≤ 1/e^2 and a distance Δ = 2. Variable error saves cost. Next we define what we mean by cost and what we mean by a good solution.
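
To make the three example error functions concrete, the following sketch tabulates the bound 1/f(Δ) − ε for a few distances. The function and dictionary names are hypothetical, ε is set to 0 for readability, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math

def error_bound(delta, f, eps=0.0):
    """Upper bound 1/f(delta) - eps on the error probability at sorted distance delta."""
    return 1.0 / f(delta) - eps

# The three example error functions from the slide (log base is an assumption).
error_functions = {
    "exp": lambda d: math.exp(d),
    "linear": lambda d: d + 1,
    "log": lambda d: math.log(d) + 2,
}

for name, f in error_functions.items():
    bounds = [round(error_bound(d, f), 3) for d in (1, 2, 4, 8)]
    print(name, bounds)
```

In each case the bound is at most ½ at Δ = 1 and shrinks as Δ grows, fastest for the exponential choice.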

Framework at a Glance. A crowd-sourced DBMS delegates some of its operations to the crowd: its internal computation (Top-K/Group-By) sends questions such as Value(a) > Value(b)? or Value(b) > Value(c)? to the crowd and receives Yes/No answers. Points to note: the crowd won't work for free in most cases, so the number of questions we can ask is limited. Cost model: each type or value comparison costs money, and the cost is additive, so we count #comparisons. Comparison error: the crowd's answer may be wrong with some probability. We still want the correct answer w.h.p.

Our Goal. Minimize the total number of (type/value) comparisons while outputting the correct answer (the exact top-k or the exact clusters) w.p. ≥ 1 − δ, for a given constant δ > 0. Each comparison costs a few cents. The two kinds of query can be combined as well. That's all about our model.

Outline of this talk. Our Model (error model, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering).

Problem Statement: Max/Top-k. Input: n elements and k. All elements are treated as having the same type (types are ignored), so only value comparisons are used. Output: all top-k elements (w.p. ≥ 1 − δ); the example uses k = 2. We mainly talk about Max: the algorithm for top-k builds on the algorithm for max.

Max: Our Results. Required confidence: 1 − δ (δ a constant); Δ = distance in sorted order.
Exact comparisons: upper bound n − 1.
Constant error [Feige et al. '94], error prob. < ½: upper bound O(n).
Variable error [this paper], error prob. < 1/f(Δ): upper bound n + o(n) for any strictly monotone function f.
Lower bound: Ω(n); ≥ (1 + c)n for some c > 0 when a high success probability is required with high error.
Here o(n) is asymptotically smaller than c·n for every c > 0. The expressions hide the dependence on ε and δ, assuming they are constants. These results complement those of Feige et al. A known f gives a better analysis: f(Δ) exponential ⇒ n + O(1); f(Δ) linear ⇒ n + O(log log n); f(Δ) logarithmic ⇒ n + O(n^(1/1+δ)).

Review: Tournament Tree for Max. The n elements are the leaves, and a comparison is made at each internal node. [Figure: tournament tree with node values including 10, 9, 5, and 2, carrying the labels #comparisons = c·1, c·3, and c·5.] With exact comparisons, the binary tree structure is not necessary. With noisy comparisons: repeat each comparison and take a majority vote. At upper levels the max has a higher chance of losing, so repeat more. Constant error model: Θ(n) algorithm (Feige et al. '94). Our goal: total number of comparisons = n + o(n), so we cannot repeat a comparison even twice at most of the internal nodes.
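
The idea of a tournament with repeated comparisons and majority votes can be sketched as follows. This is a simplified illustration, not the algorithm of Feige et al.: the function names are hypothetical, the crowd is simulated with error probability exactly ½ − ε, and the schedule of c·(2L − 1) repetitions at level L is an assumption chosen to mirror the c·1, c·3, c·5 labels on the slide.

```python
import random

def crowd_greater(x, y, eps, rng):
    """Simulated answer to "Value(x) > Value(y)?"; wrong with probability 1/2 - eps."""
    truth = x > y
    return truth if rng.random() >= 0.5 - eps else not truth

def noisy_tournament_max(values, eps, rng, c=5):
    """Binary tournament; each match at level L (1 = just above the leaves)
    is repeated c * (2L - 1) times and decided by majority vote.
    Returns (winner, total number of comparisons)."""
    current = list(values)
    level, cost = 1, 0
    while len(current) > 1:
        repeats = c * (2 * level - 1)      # odd whenever c is odd, so no ties
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            yes = sum(crowd_greater(a, b, eps, rng) for _ in range(repeats))
            cost += repeats
            winners.append(a if 2 * yes > repeats else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])    # odd element out advances unchallenged
        current = winners
        level += 1
    return current[0], cost

winner, cost = noisy_tournament_max(range(16), 0.4, random.Random(0))
print("winner:", winner, "comparisons:", cost)
```

Even with c = 5 the cost is many times n, which is the gap the n + o(n) goal is meant to close.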

Main Steps for Max. Goal: n + o(n) comparisons on a binary tournament tree. Lower levels (n nodes): just 1 comparison at each internal node, with no majority vote. Upper levels (Y nodes): use Feige's algorithm, so Y nodes cost Θ(Y); Y = o(n) for any f, and Y = O(1) for f(Δ) = e^Δ. Key idea: start with a random permutation of the elements at the leaves. Then the max does not lose in the lower levels w.h.p.; intuition: the max does not meet the 2nd max in the lower levels w.h.p. We analyze the tournament tree: X must be small, and the max never loses in the lower or upper levels. Total number of comparisons = n + Θ(Y) = n + o(n).
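
The two-level structure described above can be sketched in code. Several things here are assumptions made for illustration only: the function names, the use of ranks as values (so the distance in sorted order is |x − y|), an error probability of exactly e^(−Δ), and a plain majority vote in the upper levels standing in for Feige's algorithm, which the slide names but does not spell out.

```python
import math
import random

def variable_error_greater(x, y, rng, error_prob):
    """Values are ranks, so |x - y| is the sorted distance; wrong w.p. error_prob(distance)."""
    truth = x > y
    wrong = rng.random() < error_prob(abs(x - y))
    return (not truth) if wrong else truth

def two_phase_max(values, upper_size, rng,
                  error_prob=lambda d: math.exp(-d), repeats=15):
    """Single comparisons while more than `upper_size` candidates remain,
    then `repeats`-fold majority votes. Returns (winner, #comparisons)."""
    current = list(values)
    rng.shuffle(current)                   # random permutation at the leaves
    cost = 0
    while len(current) > 1:
        k = 1 if len(current) > upper_size else repeats
        winners = []
        for i in range(0, len(current) - 1, 2):
            a, b = current[i], current[i + 1]
            yes = sum(variable_error_greater(a, b, rng, error_prob) for _ in range(k))
            cost += k
            winners.append(a if 2 * yes > k else b)
        if len(current) % 2 == 1:
            winners.append(current[-1])    # odd element out advances unchallenged
        current = winners
    return current[0], cost

winner, cost = two_phase_max(range(1024), 8, random.Random(0))
print("winner:", winner, "comparisons:", cost)
```

With n = 1024 and 8 candidates entering the upper levels, the lower levels cost 1016 comparisons and the upper levels 7 × 15 = 105, so the total stays close to n.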

Extension to Top-K. Corollary: if k = o(n), then n + o(n) comparisons suffice to find the top-k w.h.p. Random permutation is the key: x1, …, xk do not meet in the lower levels. The cost is n + g(k) in the variable error model.

Outline of this talk. Our Model (error model, guarantee in the result). Max and Top-K Queries. Group-By Queries (Clustering): Simple Clustering (using types only); Clustering with Correlated Values and Types.

Simple Clustering. Group n elements into J clusters using only type comparisons (values are ignored), under the constant error model. O(nJ) comparisons are sufficient with no error, and J can be unknown; handling comparison error by repetition and majority vote adds a logarithmic factor. This upper bound is not surprising. We show an Ω(nJ) lower bound, which holds even with no comparison error, even for randomized algorithms (via Yao's minimax principle), and even when J is known.
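
One simple way to reach the O(nJ) upper bound without comparison error is to compare each element against one representative per cluster found so far. The sketch below assumes error-free answers; the function name and the sample data (beyond the name "Maradona", which appears in the talk) are illustrative only.

```python
def cluster_by_type(elements, same_type):
    """Group elements using only same_type(x, y) questions.
    Each element is compared with one representative of every existing cluster,
    so at most J comparisons are spent per element. Returns (clusters, #comparisons)."""
    clusters = []
    comparisons = 0
    for x in elements:
        for cluster in clusters:
            comparisons += 1
            if same_type(x, cluster[0]):
                cluster.append(x)
                break
        else:
            clusters.append([x])           # no representative matched: new type
    return clusters, comparisons

# Hypothetical photos as (player, photo id); the crowd only sees "same person?".
photos = [("Maradona", 1), ("Pele", 2), ("Maradona", 3), ("Zidane", 4), ("Pele", 5)]
clusters, cost = cluster_by_type(photos, lambda a, b: a[0] == b[0])
print(len(clusters), "clusters using", cost, "comparisons")
```

J never has to be supplied, which matches the remark that J can be unknown.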

Clustering with Correlated Types and Values. Example: apartments in a building, with Type = #bedrooms (1-br, 2-br, 3-br) and Value = rent. The rent of different 2-br apartments may differ with the number of bathrooms, the floor, the view, etc. Full correlation: elements of the same type form contiguous blocks in the sorted order on values, from low values to high values (more rent). For no correlation, we gave a lower bound of Ω(nJ). With full correlation, and assuming no error, O(n log J) value and type comparisons suffice to find the clusters w.h.p.

Algorithm for Full Correlation. Sketch of the algorithm, assuming no error; the example has J = 3 types and s = 13 elements. Inner loop: find the median and partition into blocks, repeating until #elements ≤ s/2. If all elements in a block have the same type, keep only one element; if a block has ≥ 2 types, repeat. The outer loop repeats until each block has one type. With 4J blocks, at most J blocks contain ≥ 2 types, which leaves ≤ s/2 elements. Each inner iteration has linear cost in total, and #inner iterations = O(log J). Cost: C(n) ≤ C(n/2) + O(n log J), so C(n) = O(n log J).
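
The step from the recurrence to the closed form can be checked by unrolling it. In the derivation below, the constant c and the base case C(1) = O(1) are introduced only for illustration; the per-round cost O(n log J) is the one stated on the slide.

```latex
\begin{align*}
C(n) &\le C(n/2) + c\, n \log J \\
     &\le c \log J \left( n + \frac{n}{2} + \frac{n}{4} + \cdots \right) + C(1) \\
     &\le 2c\, n \log J + O(1) = O(n \log J).
\end{align*}
```

The geometric series is what keeps the total within a constant factor of the first round's cost.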

Extension to Partial Correlation. Example: hotels in a city, with Type = star rating and Value = average price. Partial correlation: elements of the same type form almost contiguous blocks, with ≤ α changes in the same type. The blocks are only almost contiguous because the price may depend on other factors such as location. O(n log(αJ) + αJ) value and type comparisons suffice w.h.p., which improves the factor J to log J for small α.

Related Work. Max/Top-K: [Guo et al. '12]: which element has the maximum likelihood of being the max, and which future queries are most effective. [Venetis et al. '12]: efficient heuristics that tune latency, cost, and quality to find the max. [Feige et al. '94]: tight bounds for Max/Top-K/Sorting in the constant error model. Other error models and objectives exist in the theory literature. Clustering: [Gomes et al. '11]: machine learning approach to define clusters. [Parameswaran et al. '12, Wang et al. '12]: filtering data, entity resolution. Other work: crowd-sourced DBs (CrowdDB, Deco, Qurk, MoDaS, …); crowdsourcing for data cleansing, data integration, entity resolution, data analytics, … Crowdsourcing is of current interest to the DB community, and max/top-k to the theory community.

Conclusion. We studied Max/Top-k and Group-By queries in the crowd-sourced setting. Top-K: we proposed a variable error model, which allows fewer comparisons. Group-By: the obvious simple algorithm is the best possible, and correlation reduces #comparisons. Future work: other objectives, such as latency, #rounds, or reducing error when a budget on #comparisons is given; other cost functions, e.g. lower cost for more similar queries in a round (we assumed a linear cost function).

Thank You. Questions?