Published by Maliyah Offield. Modified over 2 years ago.

1
When is A=B? Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch

2
Acknowledgments

3
Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

4
Reality: We all are insane! When do you start believing that your paper is not worth publishing?

5
Speculations on IT Trends Big Data: Automating Experience – Logic -> Statistics – Open World Semantics Hybrid Systems: Get best of humans & machines – to err is human Systems – DNA, Quantum: trade energy for precision – Distributed systems: design for failure – Intel’s SCC: non-cache-coherent processors

6
Speculations on IT Trends Big Data: Automating Experience – Logic -> Statistics – Open World Semantics Hybrid Human & Machine Systems – to err is human Systems – DNA HW: trade energy consumption for precision – Distributed systems: design for failure Computers are becoming insane!

7
Implications We need to model insanity – (too crazy for this talk) – (will use Mechanical Turk to simulate craziness) We need to revisit algos & complexity theory – focus of this talk

8
Traditional Complexity Theory Cost is a function of input Example: sorting in O(N log N) [Diagram: Algo/Problem; input; cost]

9
“Modern” Complexity Theory Cost is a function of input, quality, error rate Example: sorting is O(???) [Diagram: Algo/Problem; input, quality, error; cost]

10
Alternative Complexity Theory Quality is a function of input, budget, error rate Example: sorting is O(???) [Diagram: Algo/Problem; input, budget, error; quality]
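To make "quality is a function of budget and error rate" concrete, here is a minimal sketch for the smallest possible problem: a single yes/no comparison answered by majority vote. It is not taken from the talk. The function name `majority_quality`, the assumption that votes are independent, and the use of a fixed per-vote error rate are all illustrative assumptions.

```python
from math import comb


def majority_quality(budget, error_rate):
    """Probability that a strict majority of `budget` votes is correct.

    Assumes (hypothetically) independent voters that each err with
    probability `error_rate`. Use an odd budget to avoid ties.
    """
    p_correct = 1.0 - error_rate
    smallest_majority = budget // 2 + 1
    return sum(
        comb(budget, k) * p_correct**k * error_rate ** (budget - k)
        for k in range(smallest_majority, budget + 1)
    )


if __name__ == "__main__":
    # Quality as a function of budget at a fixed 20% error rate.
    for budget in (1, 3, 5, 7):
        print(budget, round(majority_quality(budget, 0.2), 5))
```

Under these assumptions the budget buys quality with diminishing returns, which is the kind of tradeoff the slide asks complexity theory to capture.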

11
Agenda Case Study: Entity Resolution, Joins – when is A=B? Case Study: Sorting – when is A<B?

12
Problem Statement You are the director of the Louvre – you have gazillions of unknown paintings – you have a bunch of students that guess: p(A) = p(B)? You would like to group the paintings by painter – minimize cost (work of students) – minimize errors (#paintings in wrong room) Assumption: There is a ground truth! – (Many problems have no ground truth; e.g., grouping the best paintings.)

13
Naïve Algorithm Step 1: select two random paintings Step 2: ask students to compare them Step 3: goto Step 1 until done How can we do better???
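The three steps of the naïve algorithm can be sketched as a short simulation. This is an illustration under stated assumptions, not the speaker's code: the students are replaced by a hypothetical noisy oracle (`noisy_same_painter`), "until done" is interpreted as a fixed number of questions (`budget`), and answers are accumulated as signed vote counts per pair.

```python
import random


def noisy_same_painter(painter, a, b, error_rate, rng):
    """Hypothetical student: answers p(a) = p(b)? and errs with `error_rate`."""
    truth = painter[a] == painter[b]
    return truth if rng.random() >= error_rate else not truth


def naive_algorithm(paintings, ask, budget, rng):
    """Ask `budget` questions about random pairs; return signed vote counts.

    A positive count means more "same painter" votes, a negative count
    means more "different painter" votes.
    """
    votes = {}
    for _ in range(budget):
        a, b = rng.sample(paintings, 2)        # Step 1: two random paintings
        key = frozenset((a, b))
        answer = ask(a, b)                     # Step 2: ask the students
        votes[key] = votes.get(key, 0) + (1 if answer else -1)
    return votes                               # Step 3: loop until budget is spent


if __name__ == "__main__":
    painter = {"p1": "Monet", "p2": "Monet", "p3": "Degas", "p4": "Degas"}
    rng = random.Random(42)
    ask = lambda a, b: noisy_same_painter(painter, a, b, 0.2, rng)
    for pair, count in naive_algorithm(list(painter), ask, 30, rng).items():
        print(sorted(pair), count)
```

Every question costs one unit of student work regardless of what is already known, which is what the slide's closing question targets.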

14
Votes Graph [nodes: A, B, C, D] Is A = B?

15
Votes Graph [nodes: A, B, C, D] Is A = B? YES!

16
Votes Graph [nodes: A, B, C, D]

17
[nodes: A, B, C, D] Is B = C? Is A = D?

18
Votes Graph [nodes: A, B, C, D] Is B = C? YES! Is A = D? NO!

19
Votes Graph [nodes: A, B, C, D] Is B = C? ???

20
Votes Graph [nodes: A, B, C, D; edge weights: 50, 30, -100] Is B = C? YES!

21
Decision Functions Input: votes graph (with weights), two nodes Output: Yes, No, Do-not-know Desired Properties: – Consistency: do not invent anything – Convergence: do not always punt – Reflexivity, Symmetry, Transitivity, Anti-transitivity

22
Min-Max Function Compute pScore, nScore – take all positive, negative paths – score of path: minimum of weights of edges (AND) – pScore = maximum of score of all positive paths (OR) – nScore = maximum of score of all negative paths (OR) Make decision based on quorum (e.g., q=3) – Yes: pScore – nScore > q – No: nScore – pScore > q – Do-not-know: otherwise
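The Min-Max rule above can be sketched as follows. The slide does not define what makes a path positive or negative, so this sketch assumes that a positive path contains only positive edges and a negative path contains exactly one negative edge (one "different" vote breaks equality, two say nothing). The edge representation, the function names, and the brute-force enumeration of simple paths are also illustrative choices that only suit small graphs.

```python
def min_max_scores(edges, a, b):
    """Return (pScore, nScore) for the pair (a, b).

    `edges` maps (u, v) tuples to signed, non-zero weights: positive for
    net "same" votes, negative for net "different" votes.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, []).append((v, weight))
        adjacency.setdefault(v, []).append((u, weight))

    best = {0: 0, 1: 0}  # number of negative edges on path -> best score

    def walk(node, visited, score, negatives):
        if node == b:
            best[negatives] = max(best[negatives], score)
            return
        for neighbour, weight in adjacency.get(node, []):
            if neighbour in visited:
                continue
            count = negatives + (1 if weight < 0 else 0)
            if count > 1:  # assumed: paths with 2+ negative edges are ignored
                continue
            # Path score is the minimum edge weight (AND).
            walk(neighbour, visited | {neighbour}, min(score, abs(weight)), count)

    walk(a, {a}, float("inf"), 0)
    return best[0], best[1]  # maximum over all paths (OR)


def decide(edges, a, b, q=3):
    """Apply the quorum rule from the slide with strict inequalities."""
    p_score, n_score = min_max_scores(edges, a, b)
    if p_score - n_score > q:
        return "Yes"
    if n_score - p_score > q:
        return "No"
    return "Do-not-know"


if __name__ == "__main__":
    # Hypothetical votes graph, not the one drawn on the slides.
    edges = {("A", "B"): 5, ("B", "C"): 4, ("C", "D"): -6}
    for pair in (("A", "C"), ("A", "D"), ("A", "E")):
        print(pair, min_max_scores(edges, *pair), decide(edges, *pair))
```

In this toy graph nobody compared A with C or A with D directly, yet the function answers both from the surrounding votes.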

23
Min/Max with Conflicts [nodes: A, B, C, D; edge weights: 50, 30, -100] Is B = C? YES pScore = 30 nScore = 1 Is A = D? NO pScore = 0 nScore = 30

24
Naïve Algorithm V2.0 Step 1: select two random paintings, p1, p2 Step 2: if (MinMax(p1, p2) == Do-not-know) ask students to compare them else return MinMax(p1, p2) Step 3: goto Step 1 until done
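The control flow of V2.0 can be sketched with the decision function passed in as a parameter. To keep the example self-contained, a deliberately simple stand-in (`direct_vote_decision`, which looks only at the direct edge) takes the place of MinMax. That stand-in, the perfect `ask` oracle in the usage example, and the fixed number of `iterations` are assumptions for illustration.

```python
import random


def direct_vote_decision(votes, a, b, q=3):
    """Stand-in for MinMax: decides from the direct edge only."""
    weight = votes.get(frozenset((a, b)), 0)
    if weight > q:
        return "Yes"
    if weight < -q:
        return "No"
    return "Do-not-know"


def naive_v2(paintings, ask, decide, iterations, rng):
    """Only pay for a crowd answer when the decision function punts."""
    votes, decisions, asked = {}, {}, 0
    for _ in range(iterations):
        a, b = rng.sample(paintings, 2)              # Step 1
        key = frozenset((a, b))
        answer = decide(votes, a, b)                 # Step 2
        if answer == "Do-not-know":
            votes[key] = votes.get(key, 0) + (1 if ask(a, b) else -1)
            asked += 1
        else:
            decisions[key] = answer
    return votes, decisions, asked                   # Step 3: loop ends


if __name__ == "__main__":
    painter = {"p1": "Monet", "p2": "Monet", "p3": "Degas"}
    rng = random.Random(7)
    ask = lambda a, b: painter[a] == painter[b]      # error-free students
    votes, decisions, asked = naive_v2(
        list(painter), ask, direct_vote_decision, 60, rng
    )
    print("questions asked:", asked, "of 60 iterations")
    print(decisions)
```

Swapping in a graph-based decision function would let decided pairs also settle pairs that were never asked about directly, which is where the savings come from.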

25
Min/Max and Transitivity? [nodes: B, C, A, D, E; edge weights: 5, 5, -2, 5, 3] A = D? YES pScore = 5 nScore = 2 D = E? YES pScore = 3 nScore = 0 A = E? Do-not-know pScore = 3 nScore = 2

26
When is A=E? [nodes: B, C, A, D, E; edge weights: 5, 5, -2, 5, 3] Compute “A=E”: Need at least 5 votes for success. Compute “D=E”: In best case, only 2 more votes needed.

27
When is A=E? [nodes: B, C, A, D, E; edge weights: 5, 5, -2, 5, 3] Crowdsource A=E: Need at least 5 votes for success. Crowdsource D=E: In best case, only 2 votes needed. Many more surprises like that!!!

32
Related Work & Alternatives R. Fagin, E. Wimmer: A formula for incorporating weights into scoring rules. 2000. M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. Huge body of work on ER in DB, II communities. Other decision function: MinCuts!

33
Summary Getting A=B right more important than algorithm – Naïve algo with Min/Max >> Correlation Clustering Result of A=B depends on C, D, … – sounds trivial, but has nasty implications – need a decision function: new cost/precision tradeoffs – Some trad. algos (e.g., CC) do not work Complexity: Still unknown! – interesting future work

34
Agenda Case Study: Entity Resolution, Joins – when is A=B? Case Study: Sorting – when is A<B?

35
Revisit Sorting Algos How do traditional sorting algorithms behave – Quicksort – Bubblesort Look at new sorting algorithms based on graph – PageRank – Min/Max – Schulze method Focus on Quicksort vs. Bubblesort here – Just give a glimpse of what can happen

36
Quicksort: Effect of built-in transitivity Sort the following sequence: Neutral, Painful, Good, Excellent, Bad Use “Good” as pivot element for partitioning Fumble the “Painful < Good” comparison Result: Excellent, Painful, Good, Neutral, Bad One bad comparison propagates to three misclassifications – quality of result can become arbitrarily bad – difficult to extend QSort algo with safety net.
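The Quicksort example can be replayed in a few lines. This sketch assumes the intended ordering is Excellent > Good > Neutral > Bad > Painful, sorted best-first, which is inferred from the result shown on the slide. The names `RANK`, `make_less_than`, and `quicksort_desc` are hypothetical, and sub-partitions simply use their first element as pivot.

```python
from itertools import combinations

# Assumed ground truth: higher rank means better.
RANK = {"Excellent": 4, "Good": 3, "Neutral": 2, "Bad": 1, "Painful": 0}


def make_less_than(fumbled_pairs):
    """Comparison that returns the wrong answer for the fumbled pairs."""
    def less_than(a, b):
        truth = RANK[a] < RANK[b]
        return (not truth) if frozenset((a, b)) in fumbled_pairs else truth
    return less_than


def quicksort_desc(items, less_than):
    """Best-first quicksort; pivot is "Good" if present, else the first item."""
    if len(items) <= 1:
        return list(items)
    pivot = "Good" if "Good" in items else items[0]
    rest = [x for x in items if x != pivot]
    better = [x for x in rest if not less_than(x, pivot)]
    worse = [x for x in rest if less_than(x, pivot)]
    return quicksort_desc(better, less_than) + [pivot] + quicksort_desc(worse, less_than)


def misordered_pairs(result):
    """Count pairs that appear in the wrong relative order."""
    return sum(1 for a, b in combinations(result, 2) if RANK[a] < RANK[b])


if __name__ == "__main__":
    sequence = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
    clean = quicksort_desc(sequence, make_less_than(set()))
    fumbled = quicksort_desc(sequence, make_less_than({frozenset(("Painful", "Good"))}))
    print(clean, misordered_pairs(clean))
    # ['Excellent', 'Good', 'Neutral', 'Bad', 'Painful'] 0
    print(fumbled, misordered_pairs(fumbled))
    # ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad'] 3
```

Painful is never compared with Neutral or Bad after the partitioning step, so the single wrong answer against the pivot is silently extended to both of them by transitivity.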

37
Results (20% error, uniform) [Plot axes: Cost (number of iterations of algorithm); Quality (%)]

38
Summary Some algos implicitly exploit transitivity – difficult to control cost/quality tradeoff – might result in a poor result for specific application QuickSort >> Bubblesort no longer true – depends on error and quality expectation – there are better and worse ways to exploit transitivity depending on budget and error behavior – confirms observations of “A=B” study

39
Related Work on Sorting Ludwig Busse et al.: The information content in sorting algorithms. 2012. M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. Qurk (MIT) & Deco (Stanford) projects. 2011-2013. …

40
Conclusion & Future Work Computers are becoming insane – because they automate more of the insane world – because we are hitting the limits of trad. computing – consequence: quality becomes a major metric Adding “quality” has dramatic implications – need to revisit algorithms to become fault-tolerant – need to revisit complexity: totally open – need to revisit debugging and testing: totally open
