Slide 1: Deco + Crowdsourcing Summary
Slide 2: Comparison of Systems
Slide 3: Comparison of Systems (CrowdDB, Deco, Qurk)
Slide 4: Algorithms: Lots more to do!
Task-specific factors:
- Difficulty and other task attributes
- Interdependencies, e.g., batching
- Interface design and testing
- Training
ML factors:
- Prior information from ML algorithms
- Integration with active learning (AL) techniques
Human factors:
- Fatigue or experience
- Biases
- Incentives
Marketplace factors:
- State of the marketplace
- Type of marketplace
Slide 5: Context: Other Work
- Incentivizing truthfulness, honesty
- EM-based schemes; application specific
- Planning, brainstorming, editing, assisting
- Spamming
Slide 6: Survey of Industry Users
From: Crowdsourced Data Management: Industry and Academic Perspectives
Slide 7: Insights from Survey of Industry
- Crowdsourcing is common
- Crowdsourcing is large-scale: 100s of employees, 100s of 1000s of tasks per week, millions of dollars per year
- Most companies host their own platforms. Why? Many diverse uses of crowds, e.g., monitoring news, in-house crowds
Slide 8: Insights from Survey of Industry
Most common applications: guess?
Slide 9: Insights from Survey of Industry
- Most common applications: classification and entity resolution
- Top-3 benefits of crowds: flexible scaling, low cost, enabling previously difficult tasks
- One participant: easier to justify money for crowds than for another employee
- Quality management is primitive: mainly majority vote (see the sketch below); more than 25% use some form of EM; little cost minimization
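Since majority vote is the dominant quality-control scheme reported, here is a minimal sketch of it; the function and the sample answers are made up for illustration and are not drawn from the survey or from any particular system.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the fraction of workers agreeing."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Three workers label the primary language of Peru.
label, agreement = majority_vote(["Spanish", "Spanish", "Quechua"])
print(label, agreement)   # Spanish 0.666...
```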
Slide 10: Insights from Survey of Industry
- Incentivization schemes are primitive: typically per-task or hourly payment
- Industry users rarely use toolkits/workflows from academia
- Workflows rarely have more than one crowd step
Slide 11: Use Cases
Slide 12: Benefits
Slide 13: Quality Assurance
Slide 14: Integration with Workflows
Slide 15: Survey of Marketplaces
Slide 16: Complexity of Tasks
Slide 17: Workflow Management
Slide 18: Redundancy
Slide 19: Issues with Traditional Plans
Query: SELECT n, l, c FROM country WHERE l = 'Spanish' ATLEAST 8

First, traditional query processing is inadequate for Deco. Consider the same query; to keep things simple, assume majority-of-3 (or more) resolution for l and c, and duplicate elimination for n. Start with a basic query plan with no crowdsourcing: since we need to stop after 8 tuples, an AtLeast [8] operator sits on top of the Filter [l='Spanish'] and two joins; below them are Resolve[m3] over Scan D1(n,l) and Scan D2(n,c), Resolve[d.e.] over Scan A(n), and Fetch operators (Fetch [n], Fetch [n→l], Fetch [n,l→c]). Since we may need to fetch, the scan operator now changes: it first scans A and then issues fetches; the join binds its left argument and probes the right; the right side scans, tries to resolve, and issues more fetches if required.

Three issues arise:
- Latency: the scan returns Peru and the join probes its right argument; say there are no Peru tuples in D1. Execution stops while waiting for the crowd, which may take too long; we want to issue fetches in parallel.
- Multiple parents: the Fetch operator feeds two tables, and multiple parents are not allowed in traditional plans.
- Changes in output: the output may change as new tuples arrive. If (Peru, Quechua) has already been passed up the plan and two additional (Peru, Spanish) tuples then arrive from the Fetch operator, the majority may change, but values already returned up the plan cannot be updated.
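To make the latency problem concrete, here is a toy Python sketch (an illustration only, not the Deco implementation) of a pull-based plan in which a scan issues crowd fetches synchronously, so every miss stalls the entire plan.

```python
import time

def crowd_fetch(country):
    """Stand-in for posting a task and waiting for a worker's answer."""
    time.sleep(2)                      # real crowd latency is minutes or hours
    return {"Peru": "Spanish"}.get(country, "unknown")

def scan_with_fetch(countries, stored_d1):
    """Scan of D1 that falls back to a blocking crowd fetch on a miss."""
    for name in countries:
        language = stored_d1.get(name)
        if language is None:           # miss in D1: ask the crowd and wait
            language = crowd_fetch(name)
        yield (name, language)

def filter_spanish(tuples):
    for name, language in tuples:
        if language == "Spanish":
            yield (name, language)

# Fetches are issued one at a time as the upper operators pull tuples,
# so crowd latencies add up instead of overlapping.
for row in filter_spanish(scan_with_fetch(["Peru", "Chile"], {"Chile": "Spanish"})):
    print(row)
```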
Slide 20: A New Query Processing Architecture
A hybrid push-pull model:
- In the past: pull = traditional query processing; push = stream systems, view maintenance. We need both!
- Pull: (top) operators request new tuples
- Push: (bottom) operators push changes
Two phases:
- First: materialize the current result
- Second: fetch and update
Operators are asynchronous and pass messages.

So we need a new architecture. Operators at the top request more tuples with pull messages; operators at the bottom push changes up, using ideas similar to view maintenance. The materialization phase materializes the current result and initiates new fetches. We also want operators to run asynchronously, passing messages to each other, so that latency is hidden. We won't go into the details here.
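Here is a minimal toy sketch of the idea (hypothetical; not the actual Deco architecture or code): two operators run asynchronously, with pull requests flowing down and push messages flowing up through message queues.

```python
import queue
import threading

class FetchOp:
    """Bottom operator: answers PULL requests by pushing new tuples upward."""
    def __init__(self, parent_inbox):
        self.inbox = queue.Queue()
        self.parent_inbox = parent_inbox

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg == "PULL":
                # Pretend an answer eventually arrives from the crowd.
                self.parent_inbox.put(("PUSH", ("Peru", "Spanish")))
            elif msg == "STOP":
                return

class AtLeastOp:
    """Top operator: keeps pulling until it has n result tuples."""
    def __init__(self, n):
        self.n = n
        self.inbox = queue.Queue()
        self.results = []

    def run(self, child_inbox):
        while len(self.results) < self.n:
            child_inbox.put("PULL")
            kind, tup = self.inbox.get()   # PUSH messages arrive asynchronously,
            if kind == "PUSH":             # e.g. new or revised values
                self.results.append(tup)
        child_inbox.put("STOP")

top = AtLeastOp(n=1)
bottom = FetchOp(parent_inbox=top.inbox)
worker = threading.Thread(target=bottom.run)
worker.start()
top.run(bottom.inbox)
worker.join()
print(top.results)   # [('Peru', 'Spanish')]
```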
Slide 21: Alternate Plans: Filter Locations
Query: SELECT n, l, c FROM country WHERE l = 'Spanish' ATLEAST 8

Now we can consider alternate plans. There are two optimization challenges not found in traditional query processing. The first is where to place the filter: the predicate can sit above both joins or between them. In traditional databases it is almost always beneficial to push predicates down to reduce intermediate results, but here:
- Filter up (above the joins) can cost more but gives better parallelism, since we can fetch l and c together.
- Filter down (between the joins) costs less, since we only fetch c once l is confirmed to be Spanish, but the fetches no longer happen in parallel, so latency is higher.
So, unlike in traditional databases, the choice is not simple. (A back-of-the-envelope comparison follows below.)
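The trade-off can be quantified with made-up numbers; the selectivity, task price, and country count below are purely illustrative assumptions, not figures from the talk.

```python
# Hypothetical numbers chosen only to illustrate the cost/latency trade-off.
n_countries      = 20     # candidate countries produced by Fetch [n]
spanish_fraction = 0.4    # assumed selectivity of l = 'Spanish'
cost_per_fetch   = 0.05   # assumed dollars per crowd question

# Filter above the joins: fetch language AND currency for every country,
# and the two kinds of fetches can be issued in parallel (1 crowd round).
cost_filter_up, rounds_filter_up = n_countries * 2 * cost_per_fetch, 1

# Filter below the upper join: fetch language first, then currency only for
# confirmed Spanish speakers, so the fetches happen in sequence (2 rounds).
cost_filter_down = (n_countries + n_countries * spanish_fraction) * cost_per_fetch
rounds_filter_down = 2

print(f"filter up:   ${cost_filter_up:.2f}, {rounds_filter_up} crowd round(s)")
print(f"filter down: ${cost_filter_down:.2f}, {rounds_filter_down} crowd round(s)")
```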
Slide 22: Alternate Plans: Fetch Rules
Query: SELECT n, l, c FROM country WHERE l = 'Spanish' ATLEAST 8

The second challenge arises from the many possible fetch rules: keep the plan the same and change the fetch rule. For a query that asks for Spanish-speaking countries, we can either start with arbitrary countries and check whether they are Spanish-speaking (Fetch [n→l]), or start with Spanish-speaking countries and verify that they actually speak Spanish (Fetch [l→n]). There are no fetch rules in traditional query processing; this is a twist introduced by declarative crowdsourcing. As we will see later, the fetch rule affects query processing significantly, so the schema designer should provide many fetch rules and the system should be able to choose among them. (A toy sketch of the two alternatives follows below.)
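The two alternatives can be sketched as question templates; this is a hypothetical Python illustration (real Deco fetch rules are declarative schema annotations, not code), showing only that the same query can be answered starting from either side of the predicate.

```python
def fake_crowd(question):
    """Stand-in worker: canned answers for the illustration."""
    canned = {
        "What language is spoken in Peru?": "Spanish",
        "Name a country where Spanish is spoken.": "Mexico",
    }
    return canned.get(question, "unknown")

def rule_n_to_l(country):
    # Fetch rule [n -> l]: resolve the language of a known country.
    return fake_crowd(f"What language is spoken in {country}?")

def rule_l_to_n(language):
    # Fetch rule [l -> n]: obtain a new country for the given language.
    return fake_crowd(f"Name a country where {language} is spoken.")

print(rule_n_to_l("Peru"))     # Spanish
print(rule_l_to_n("Spanish"))  # Mexico
```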
Slide 23: Query Optimization Questions
- When to resolve?
- Which attributes should I fetch? Which fetch rule should I use?
- How much to fetch? In what order?
- How do I cost a plan? What statistics do I need?
- How do I change plans on-the-fly?

In this talk I have barely scratched the surface of the query processing alternatives. Beyond what I have described so far, deciding how much to fetch and in what order raises very interesting issues; there is much more to do. We are only now starting to think about how to select a plan, what statistics to maintain, and so on.