When is A=B? Donald Kossmann Systems Group, ETH Zurich


1 When is A=B? Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch

2 Acknowledgments

3 Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

4 Reality: We all are insane! When do you start believing that your paper is not worth publishing?

5 Speculations on IT Trends Big Data: Automating Experience – Logic -> Statistics – Open World Semantics Hybrid Systems: Get best of humans & machines – to err is human Systems – DNA, Quantum: trade energy for precision – Distributed systems: design for failure – Intel’s SCC: non-cache-coherent processors

6 Speculations on IT Trends Big Data: Automating Experience – Logic -> Statistics – Open World Semantics Hybrid Human & Machine Systems – to err is human Systems – DNA HW: trade energy consumption for precision – Distributed systems: design for failure Computers are becoming insane!

7 Implications We need to model insanity – (too crazy for this talk) – (will use Mechanical Turk to simulate craziness) We need to revisit algos & complexity theory – focus of this talk

8 Traditional Complexity Theory Cost is a function of input. Example: sorting in O(N * log N). Diagram labels: Algo/Problem, cost, input.

9 “Modern” Complexity Theory Cost is a function of input, quality, error rate. Example: sorting is O(???). Diagram labels: Algo/Problem, cost, input, quality, error.

10 Alternative Complexity Theory Quality is a function of input, budget, error rate. Example: sorting is O(???). Diagram labels: Algo/Problem, quality, input, budget, error.

11 Agenda Case Study: Entity Resolution, Joins – when is A=B? Case Study: Sorting – when is A

12 Problem Statement You are the director of the Louvre – you have gazillions of unknown paintings – you have a bunch of students that guess: p(A) = p(B)? You would like to group the paintings by painter – minimize cost (work of students) – minimize errors (#paintings in wrong room) Assumption: There is a ground truth! – (Many problems have no ground truth; e.g., grouping the best paintings.)

13 Naïve Algorithm Step 1: select two random paintings Step 2: ask students to compare them Step 3: goto Step 1 until done How can we do better???

14 Votes Graph Nodes: A, B, C, D. Is A = B?

15 Votes Graph Nodes: A, B, C, D. Is A = B? YES!

16 Votes Graph Nodes: A, B, C, D.

17 Nodes: A, B, C, D. Is B = C? Is A = D?

18 Votes Graph Nodes: A, B, C, D. Is B = C? YES! Is A = D? NO!

19 Votes Graph Nodes: A, B, C, D. Is B = C? ???

20 Votes Graph Nodes: A, B, C, D. Is B = C? YES! Edge weights: 50, 30, -100.

21 Decision Functions Input: votes graph (with weights), two nodes. Output: Yes, No, Do-not-know. Desired properties: – Consistency: do not invent anything – Convergence: do not always punt – Reflexivity, Symmetry, Transitivity, Anti-transitivity

22 Min-Max Function Compute pScore, nScore: – take all positive, negative paths – score of path: minimum of weights of edges (AND) – pScore = maximum of score of all positive paths (OR) – nScore = maximum of score of all negative paths (OR) Make decision based on quorum (e.g., q=3): – Yes: pScore - nScore > q – No: nScore - pScore > q – Do-not-know: otherwise
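The Min-Max function above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation. The slide does not define positive and negative paths, so the sketch assumes a positive path has no negative edge and a negative path has exactly one, with edge weights compared by absolute value. The names `simple_paths`, `min_max_scores`, `decide`, and the example graph are hypothetical. Paths are enumerated by brute force, which is only reasonable for small graphs.

```python
from typing import Dict, FrozenSet, Iterator, List, Tuple

# Votes graph: undirected edge {u, v} -> signed weight
# (positive = net YES votes, negative = net NO votes).
Graph = Dict[FrozenSet[str], int]


def simple_paths(graph: Graph, src: str, dst: str) -> Iterator[List[int]]:
    """Yield the signed edge weights of every simple path from src to dst."""
    adjacency: Dict[str, List[Tuple[str, int]]] = {}
    for edge, weight in graph.items():
        u, v = tuple(edge)
        adjacency.setdefault(u, []).append((v, weight))
        adjacency.setdefault(v, []).append((u, weight))

    def walk(node, visited, weights):
        if node == dst:
            yield list(weights)
            return
        for nxt, w in adjacency.get(node, []):
            if nxt not in visited:
                yield from walk(nxt, visited | {nxt}, weights + [w])

    yield from walk(src, {src}, [])


def min_max_scores(graph: Graph, a: str, b: str) -> Tuple[int, int]:
    """Return (pScore, nScore) for the question 'is a = b?'."""
    p_score = n_score = 0
    for weights in simple_paths(graph, a, b):
        if not weights:          # a == b: no edges to score
            continue
        negatives = sum(1 for w in weights if w < 0)
        score = min(abs(w) for w in weights)   # path score: weakest edge (AND)
        if negatives == 0:                     # assumed: positive path
            p_score = max(p_score, score)      # best positive path (OR)
        elif negatives == 1:                   # assumed: negative path
            n_score = max(n_score, score)      # best negative path (OR)
    return p_score, n_score


def decide(graph: Graph, a: str, b: str, q: int = 3) -> str:
    """Quorum decision exactly as written on the slide (strict inequality)."""
    p_score, n_score = min_max_scores(graph, a, b)
    if p_score - n_score > q:
        return "Yes"
    if n_score - p_score > q:
        return "No"
    return "Do-not-know"


# Hypothetical votes graph, not taken from the slides.
votes: Graph = {
    frozenset(("X", "Y")): 8,
    frozenset(("Y", "Z")): 6,
    frozenset(("X", "W")): -9,
}
print(decide(votes, "X", "Z"))   # one positive path X-Y-Z with score 6
print(decide(votes, "Z", "W"))   # one negative path Z-Y-X-W with score 6
```

Paths with two or more negative edges are ignored in this sketch, since two "not equal" votes in a row say nothing about whether the endpoints match.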

23 Min/Max with Conflicts Nodes: A, B, C, D. Edge weights: 50, 30, -100. Is B = C? YES (pScore = 30, nScore = 1). Is A = D? NO (pScore = 0, nScore = 30).

24 Naïve Algorithm V2.0 Step 1: select two random paintings, p1, p2 Step 2: if (MinMax(p1, p2) == Do-not-know) ask students to compare them else return MinMax(p1, p2) Step 3: goto Step 1 until done
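The loop in Naïve Algorithm V2.0 might look as follows in Python. This is a sketch under stated assumptions. The slide leaves "until done" open, so a fixed number of rounds (`budget`) stands in for it. The decision function is passed in as a parameter, and the stand-in used here, `direct_edge_decide`, looks only at the direct edge between the two paintings rather than at paths, so it is simpler than Min-Max. The paintings, the `painter` ground truth, and the error-free `ask_student` are all hypothetical.

```python
import random
from typing import Callable, Dict, FrozenSet, List, Tuple

Graph = Dict[FrozenSet[str], int]   # edge {p1, p2} -> net votes (YES minus NO)


def direct_edge_decide(votes: Graph, a: str, b: str, q: int = 2) -> str:
    """Stand-in decision function (assumption): quorum on the direct edge only."""
    w = votes.get(frozenset((a, b)), 0)
    if w > q:
        return "Yes"
    if -w > q:
        return "No"
    return "Do-not-know"


def naive_v2(
    paintings: List[str],
    decide: Callable[[Graph, str, str], str],
    ask_student: Callable[[str, str], bool],
    budget: int,
    rng: random.Random,
) -> Tuple[Graph, int]:
    """Run `budget` rounds; return the votes graph and the number of questions asked."""
    votes: Graph = {}
    asked = 0
    for _ in range(budget):
        p1, p2 = rng.sample(paintings, 2)               # Step 1: random pair
        if decide(votes, p1, p2) == "Do-not-know":      # Step 2: ask only if undecided
            edge = frozenset((p1, p2))
            votes[edge] = votes.get(edge, 0) + (1 if ask_student(p1, p2) else -1)
            asked += 1
    return votes, asked


# Hypothetical ground truth and an error-free student, for illustration only.
painter = {"p1": "Monet", "p2": "Monet", "p3": "Degas", "p4": "Degas"}


def ask(a: str, b: str) -> bool:
    return painter[a] == painter[b]


votes, asked = naive_v2(list(painter), direct_edge_decide, ask,
                        budget=200, rng=random.Random(0))
print(asked, "questions asked in 200 rounds")
```

Plugging in a path-based decision function instead of `direct_edge_decide` lets already-collected votes answer some pairs without asking anyone.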

25 Min/Max and Transitivity? Nodes: A, B, C, D, E. Edge weights: 5, 5, -2, 5, 3. A = D? YES (pScore = 5, nScore = 2). D = E? YES (pScore = 3, nScore = 0). A = E? Do-not-know (pScore = 3, nScore = 2).

26 When is A=E? Nodes: A, B, C, D, E. Edge weights: 5, 5, -2, 5, 3. Compute “A=E”: Need at least 5 votes for success. Compute “D=E”: In best case, only 2 more votes needed.

27 When is A=E? Nodes: A, B, C, D, E. Edge weights: 5, 5, -2, 5, 3. Crowdsource A=E: Need at least 5 votes for success. Crowdsource D=E: In best case, only 2 votes needed. Many more surprises like that!!!
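The vote counts quoted for A=E can be checked with a little arithmetic. The sketch below assumes an edge layout that reproduces the scores on the slides (A-B 5, B-C 5, C-D 5, A-D -2, D-E 3); the transcript does not show the actual drawing. It also assumes the quorum test is inclusive (margin >= q with q = 3). Under that layout, the inclusive test gives "5 votes" and "2 votes", while a strict test would not. All function names are hypothetical.

```python
from typing import Optional


def decide(p_score: int, n_score: int, q: int = 3) -> str:
    """Quorum test, inclusive variant (assumption): the margin must reach q."""
    if p_score - n_score >= q:
        return "Yes"
    if n_score - p_score >= q:
        return "No"
    return "Do-not-know"


# Scores for "A = E?" over the existing indirect paths, as given on the slide.
P_INDIRECT, N_INDIRECT = 3, 2


def votes_on_direct_edge(q: int = 3, limit: int = 100) -> Optional[int]:
    """Fewest YES votes on a new direct A-E edge that turn 'A = E?' into Yes."""
    for w in range(1, limit + 1):
        # A direct positive edge is itself a positive path with score w.
        if decide(max(P_INDIRECT, w), N_INDIRECT, q) == "Yes":
            return w
    return None


def extra_votes_on_bottleneck(q: int = 3, limit: int = 100) -> Optional[int]:
    """Fewest extra YES votes on D-E (currently 3) that turn 'A = E?' into Yes."""
    for k in range(1, limit + 1):
        d_e = 3 + k
        p_score = min(5, d_e)   # positive path A-B-C-D-E, assumed layout
        n_score = min(2, d_e)   # negative path A-D (-2) followed by D-E
        if decide(p_score, n_score, q) == "Yes":
            return k
    return None


print(votes_on_direct_edge())        # 5
print(extra_votes_on_bottleneck())   # 2
```

Strengthening the weakest edge of an existing path is cheaper here than asking the question directly, which is the surprise the slide points at.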


32 Related Work & Alternatives R. Fagin, E. Wimmers: A formula for incorporating weights into scoring rules. 2000. M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011. Huge body of work on ER in DB, II communities. Other decision function: MinCuts!

33 Summary Getting A=B right more important than algorithm – Naïve algo with Min/Max >> Correlation Clustering Result of A=B depends on C, D, … – sounds trivial, but has nasty implications – need a decision function: new cost/precision tradeoffs – Some trad. algos (e.g., CC) do not work Complexity: Still unknown! – interesting future work

34 Agenda Case Study: Entity Resolution, Joins – when is A=B? Case Study: Sorting – when is A

35 Revisit Sorting Algos How do traditional sorting algorithms behave – Quicksort – Bubblesort Look at new sorting algorithms based on graph – PageRank – Min/Max – Schulze method Focus on Quicksort vs. Bubblesort here – Just give a glimpse of what can happen

36 Quicksort: Effect of built-in transitivity Sort the following sequence Neutral, Painful, Good, Excellent, Bad Use “Good” as pivot element for partitioning Fumble “Painful < Good” comparison Excellent, Painful, Good, Neutral, Bad One bad comparison propagates to three misclassifications – quality of result can become arbitrarily bad – difficult to extend QSort algo with safety net.
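The Quicksort example above can be replayed in a short Python sketch. It assumes a best-first ordering (Excellent down to Painful), "Good" as the pivot of the first partition as on the slide, and the first element as the pivot of every later partition (the slide does not say). `RANK`, `fumbled_less`, and `pick_pivot` are hypothetical helpers. The comparator answers correctly except for the single fumbled pair.

```python
from typing import Callable, List

# Assumed ground-truth order, worst to best.
RANK = {"Painful": 0, "Bad": 1, "Neutral": 2, "Good": 3, "Excellent": 4}


def fumbled_less(a: str, b: str) -> bool:
    """Correct comparison, except 'Painful < Good' is answered wrongly."""
    if (a, b) == ("Painful", "Good"):
        return False
    return RANK[a] < RANK[b]


def pick_pivot(items: List[str]) -> str:
    """'Good' if present (as on the slide), otherwise the first element (assumption)."""
    return "Good" if "Good" in items else items[0]


def quicksort_desc(items: List[str], less: Callable[[str, str], bool]) -> List[str]:
    """Quicksort that returns the items best-first."""
    if len(items) <= 1:
        return list(items)
    pivot = pick_pivot(items)
    i = items.index(pivot)
    rest = items[:i] + items[i + 1:]
    smaller, larger = [], []
    for x in rest:
        # Each element is compared with the pivot once and never re-checked.
        (smaller if less(x, pivot) else larger).append(x)
    return quicksort_desc(larger, less) + [pivot] + quicksort_desc(smaller, less)


def inversions(result: List[str]) -> int:
    """Number of pairs that appear in the wrong relative order (best-first)."""
    return sum(
        1
        for i in range(len(result))
        for j in range(i + 1, len(result))
        if RANK[result[i]] < RANK[result[j]]
    )


result = quicksort_desc(["Neutral", "Painful", "Good", "Excellent", "Bad"], fumbled_less)
print(result)              # ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad']
print(inversions(result))  # 3
```

Because "Painful" lands on the wrong side of the pivot and is never compared with "Neutral" or "Bad", the single wrong answer yields three wrongly ordered pairs.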

37 Results (20% error, uniform) Plot axes: cost (number of iterations of algorithm), quality (%).

38 Summary Some algos implicitly exploit transitivity – difficult to control cost/quality tradeoff – might result in a poor result for specific application QuickSort >> Bubblesort no longer true – depends on error and quality expectation – there are better and worse ways to exploit transitivity depending on budget and error behavior – confirms observations of “A=B” study
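One way to explore the Quicksort versus Bubblesort tradeoff yourself is a small simulation with a comparator that lies with a fixed probability. This is a hypothetical setup, not the experiment behind the talk. It assumes cost is the number of comparisons and quality is the fraction of item pairs that end up in the correct relative order; both definitions are assumptions. No results are claimed here; run it to see how the numbers move with the error rate and the number of Bubblesort passes.

```python
import random
from typing import Callable, List


class NoisyComparator:
    """Answers 'a < b?' and flips the true answer with probability `error`."""

    def __init__(self, error: float, rng: random.Random) -> None:
        self.error = error
        self.rng = rng
        self.calls = 0   # cost: number of comparisons asked

    def __call__(self, a: int, b: int) -> bool:
        self.calls += 1
        truth = a < b
        return (not truth) if self.rng.random() < self.error else truth


def quicksort(items: List[int], less: Callable[[int, int], bool]) -> List[int]:
    """Ascending Quicksort; each element meets the pivot exactly once."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller, larger = [], []
    for x in rest:
        (smaller if less(x, pivot) else larger).append(x)
    return quicksort(smaller, less) + [pivot] + quicksort(larger, less)


def bubble_passes(items: List[int], less: Callable[[int, int], bool], passes: int) -> List[int]:
    """Run a fixed number of Bubblesort passes; more passes cost more comparisons."""
    data = list(items)
    for _ in range(passes):
        for i in range(len(data) - 1):
            if less(data[i + 1], data[i]):
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


def quality(result: List[int]) -> float:
    """Fraction of pairs in correct ascending order (assumed quality metric)."""
    n = len(result)
    good = sum(1 for i in range(n) for j in range(i + 1, n) if result[i] < result[j])
    return good / (n * (n - 1) // 2)


if __name__ == "__main__":
    rng = random.Random(42)
    data = list(range(50))
    rng.shuffle(data)
    qs_cmp = NoisyComparator(0.2, rng)
    print("quicksort", quality(quicksort(data, qs_cmp)), qs_cmp.calls)
    for passes in (10, 50, 200):
        bs_cmp = NoisyComparator(0.2, rng)
        print("bubble", passes, quality(bubble_passes(data, bs_cmp, passes)), bs_cmp.calls)
```

Bubblesort revisits adjacent pairs on every pass, so extra budget buys another chance to correct an earlier mistake, whereas this Quicksort asks about each pair at most once.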

39 Related Work on Sorting Ludwig Busse et al.: The information content in sorting algorithms. 2012. M. Schulze: A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. 2011. Qurk (MIT) & Deco (Stanford) projects. 2011-2013. …

40 Conclusion & Future Work Computers are becoming insane – because they automate more of the insane world – because we are hitting the limits of trad. computing – consequence: quality becomes a major metric Adding “quality” has dramatic implications – need to revisit algorithms to become fault-tolerant – need to revisit complexity: totally open – need to revisit debugging and testing: totally open

