# When is A=B? Donald Kossmann Systems Group, ETH Zurich


When is A=B? Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch

Acknowledgments

Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

Reality: We all are insane! When do you start believing that your paper is not worth publishing?

Speculations on IT Trends
- Big Data: Automating Experience
  - Logic -> Statistics
  - Open World Semantics
- Hybrid Systems: Get best of humans & machines
  - to err is human
- Systems
  - DNA, Quantum: trade energy for precision
  - Distributed systems: design for failure
  - Intel's SCC: non-cache-coherent processors

Speculations on IT Trends
- Big Data: Automating Experience
  - Logic -> Statistics
  - Open World Semantics
- Hybrid Human & Machine Systems
  - to err is human
- Systems
  - DNA HW: trade energy consumption for precision
  - Distributed systems: design for failure
- Computers are becoming insane!

Implications
- We need to model insanity
  - (too crazy for this talk)
  - (will use Mechanical Turk to simulate craziness)
- We need to revisit algorithms & complexity theory
  - focus of this talk

Traditional Complexity Theory
- Cost is a function of input
- Example: sorting is O(N * log N)
- (Diagram labels: Algo/Problem, cost, input)

“Modern” Complexity Theory
- Cost is a function of input, quality, error rate
- Example: sorting is O(???)
- (Diagram labels: Algo/Problem, cost, input, quality, error)

Alternative Complexity Theory
- Quality is a function of input, budget, error rate
- Example: sorting is O(???)
- (Diagram labels: Algo/Problem, quality, input, budget, error)

Agenda
- Case Study: Entity Resolution, Joins
  - when is A=B?
- Case Study: Sorting
  - when is A<B?

Problem Statement
- You are the director of the Louvre
  - you have gazillions of unknown paintings
  - you have a bunch of students who guess: p(A) = p(B)? (same painter?)
- You would like to group the paintings by painter
  - minimize cost (work of students)
  - minimize errors (#paintings in wrong room)
- Assumption: There is a ground truth!
  - (Many problems have no ground truth; e.g., grouping the best paintings.)

Naïve Algorithm
- Step 1: select two random paintings
- Step 2: ask students to compare them
- Step 3: goto Step 1 until done
- How can we do better???
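The three steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the students are replaced by a simulated oracle that errs with probability `error_rate`, "until done" is replaced by a fixed `budget` of questions, and all names (`ask_students`, `naive_votes`, the painter labels) are hypothetical rather than taken from the talk.

```python
import random

def ask_students(painter, a, b, error_rate, rng):
    """Simulated crowd answer to 'same painter?'; wrong with probability error_rate."""
    truth = painter[a] == painter[b]
    return (not truth) if rng.random() < error_rate else truth

def naive_votes(painter, budget, error_rate=0.2, seed=0):
    """Collect `budget` votes on random pairs; returns {pair: signed vote count}."""
    rng = random.Random(seed)
    names = sorted(painter)
    votes = {}
    for _ in range(budget):          # Step 3: repeat (assumption: until the budget is spent)
        a, b = rng.sample(names, 2)  # Step 1: select two random paintings
        same = ask_students(painter, a, b, error_rate, rng)  # Step 2: ask for a comparison
        key = frozenset((a, b))
        votes[key] = votes.get(key, 0) + (1 if same else -1)
    return votes

# Hypothetical ground truth: painting -> painter.
painter = {"A": "Monet", "B": "Monet", "C": "Monet", "D": "Degas"}
votes = naive_votes(painter, budget=50)
```

The returned dictionary is one possible encoding of a weighted votes graph: a positive weight means more "yes" than "no" votes for that pair, a negative weight the opposite.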

Votes Graph (figure: nodes A, B, C, D) Is A = B?

Votes Graph (figure: nodes A, B, C, D) Is A = B? YES!

Votes Graph (figure: nodes A, B, C, D)

Votes Graph (figure: nodes A, B, C, D) Is B = C? Is A = D?

Votes Graph (figure: nodes A, B, C, D) Is B = C? YES! Is A = D? NO!

Votes Graph (figure: nodes A, B, C, D) Is B = C? ???

Votes Graph (figure: nodes A, B, C, D; edge weights shown: 50, 30, -100) Is B = C? YES!

Decision Functions
- Input: votes graph (with weights), two nodes
- Output: Yes, No, Do-not-know
- Desired Properties:
  - Consistency: do not invent anything
  - Convergence: do not always punt
  - Reflexivity, Symmetry, Transitivity, Anti-transitivity

Min-Max Function
- Compute pScore, nScore
  - take all positive, negative paths
  - score of path: minimum of weights of edges (AND)
  - pScore = maximum of score of all positive paths (OR)
  - nScore = maximum of score of all negative paths (OR)
- Make decision based on quorum (e.g., q=3)
  - Yes: pScore - nScore > q
  - No: nScore - pScore > q
  - Do-not-know: otherwise
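A minimal sketch of the Min-Max function follows. Several details are assumptions, because the slide does not spell them out: a positive path is taken to contain only positive edges, a negative path exactly one negative edge, and a path's score is the minimum absolute edge weight. The five-node edge layout is hypothetical, chosen so that the scores match those quoted on the "Min/Max and Transitivity?" slide, and `q=2` is chosen so that the strict `>` rule gives the same Yes/Do-not-know answers. Paths are enumerated by brute force, which is only viable for tiny graphs.

```python
def min_max_scores(edges, a, b):
    """edges: {frozenset({u, v}): signed vote weight}. Returns (pScore, nScore)."""
    adj = {}
    for pair, w in edges.items():
        u, v = tuple(pair)
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    p_score = n_score = 0

    def walk(node, visited, negatives, bottleneck):
        nonlocal p_score, n_score
        if node == b:
            if negatives == 0:                   # positive path
                p_score = max(p_score, bottleneck)
            else:                                # path with exactly one negative edge
                n_score = max(n_score, bottleneck)
            return
        for nxt, w in adj.get(node, []):
            neg = negatives + (w < 0)
            if nxt in visited or neg > 1:        # simple paths, at most one negative edge
                continue
            walk(nxt, visited | {nxt}, neg, min(bottleneck, abs(w)))

    walk(a, {a}, 0, float("inf"))
    return p_score, n_score

def decide(edges, a, b, q):
    p, n = min_max_scores(edges, a, b)
    if p - n > q:
        return "yes"
    if n - p > q:
        return "no"
    return "do-not-know"

def e(u, v):
    return frozenset((u, v))

# Hypothetical layout: A-B-D is all positive, A-C-D contains the -2 edge, D-E has weight 3.
edges = {e("A", "B"): 5, e("B", "D"): 5, e("A", "C"): 5, e("C", "D"): -2, e("D", "E"): 3}
for x, y in [("A", "D"), ("D", "E"), ("A", "E")]:
    print(x, y, min_max_scores(edges, x, y), decide(edges, x, y, q=2))
```

In this layout the answer for A=E is limited by the weak D-E edge, so raising that edge's weight changes the verdict for a pair that was never asked about directly.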

Min/Max with Conflicts (figure: votes graph over nodes A, B, C, D; edge weights shown: 50, 30, -100)
- Is B = C? YES (pScore = 30, nScore = 1)
- Is A = D? NO (pScore = 0, nScore = 30)

Naïve Algorithm V2.0
- Step 1: select two random paintings, p1, p2
- Step 2: if (MinMax(p1, p2) == Do-not-know) ask students to compare them, else return MinMax(p1, p2)
- Step 3: goto Step 1 until done
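One step of V2.0 might look like the sketch below. To keep the block self-contained, the decision function is passed in as a parameter, and a deliberately simple stand-in (`direct_edge_decide`, which looks only at the direct edge between the two paintings) takes the place of Min-Max. Both function names, the quorum `q=2`, and the choice to re-evaluate the verdict after recording a vote are assumptions for illustration.

```python
def direct_edge_decide(votes, a, b, q=2):
    """Stand-in decision function: uses only the direct edge, not paths."""
    w = votes.get(frozenset((a, b)), 0)
    if w > q:
        return "yes"
    if -w > q:
        return "no"
    return "do-not-know"

def naive_v2_step(votes, p1, p2, decide, ask):
    """Returns (verdict, asked). The crowd is consulted only when decide() punts."""
    verdict = decide(votes, p1, p2)
    if verdict != "do-not-know":
        return verdict, False            # answer inferred from existing votes, no cost
    key = frozenset((p1, p2))
    votes[key] = votes.get(key, 0) + (1 if ask(p1, p2) else -1)
    return decide(votes, p1, p2), True   # assumption: re-evaluate after the new vote

votes = {}
always_yes = lambda a, b: True           # hypothetical, error-free crowd
history = [naive_v2_step(votes, "A", "B", direct_edge_decide, always_yes) for _ in range(4)]
print(history)
```

Once enough votes have accumulated for a pair, later steps return the cached verdict without spending any more crowd work.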

Min/Max and Transitivity? (figure: votes graph over nodes A, B, C, D, E; edge weights shown: 5, 5, -2, 5, 3)
- A = D? YES (pScore = 5, nScore = 2)
- D = E? YES (pScore = 3, nScore = 0)
- A = E? Do-not-know (pScore = 3, nScore = 2)

When is A=E? (figure: votes graph over nodes A, B, C, D, E; edge weights shown: 5, 5, -2, 5, 3)
- Compute “A=E”: Need at least 5 votes for success.
- Compute “D=E”: In best case, only 2 more votes needed.

When is A=E? (figure: votes graph over nodes A, B, C, D, E; edge weights shown: 5, 5, -2, 5, 3)
- Crowdsource A=E: Need at least 5 votes for success.
- Crowdsource D=E: In best case, only 2 votes needed.
- Many more surprises like that!!!

Related Work & Alternatives
- R. Fagin, E. Wimmers: A formula for incorporating weights into scoring rules. 2000.
- M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011.
- Huge body of work on ER in DB, II communities.
- Other decision function: MinCuts!

Summary
- Getting A=B right is more important than the algorithm
  - Naïve algo with Min/Max >> Correlation Clustering
- Result of A=B depends on C, D, …
  - sounds trivial, but has nasty implications
  - need a decision function: new cost/precision tradeoffs
  - Some trad. algos (e.g., CC) do not work
- Complexity: Still unknown!
  - interesting future work

Agenda
- Case Study: Entity Resolution, Joins
  - when is A=B?
- Case Study: Sorting
  - when is A<B?

Revisit Sorting Algos
- How do traditional sorting algorithms behave?
  - Quicksort
  - Bubblesort
- Look at new sorting algorithms based on graph
  - PageRank
  - Min/Max
  - Schulze method
- Focus on Quicksort vs. Bubblesort here
  - Just give a glimpse of what can happen

Quicksort: Effect of built-in transitivity
- Sort the following sequence: Neutral, Painful, Good, Excellent, Bad
- Use “Good” as pivot element for partitioning
- Fumble the “Painful < Good” comparison
- Result: Excellent, Painful, Good, Neutral, Bad
- One bad comparison propagates to three misclassifications
  - quality of result can become arbitrarily bad
  - difficult to extend QSort algo with safety net.
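The example above can be replayed in a few lines. The sketch assumes a best-first order in which "Painful" ranks below "Bad" (the slide implies this but does not state it), uses the middle element as pivot so that "Good" is chosen first, and injects exactly one wrong answer for the Painful/Good pair. The names `RANK`, `make_better`, and `misordered_pairs` are illustrative.

```python
# Assumed ground-truth ranking (higher is better).
RANK = {"Excellent": 5, "Good": 4, "Neutral": 3, "Bad": 2, "Painful": 1}

def make_better(fumbled=frozenset()):
    """Return better(x, y); pairs listed in `fumbled` get the wrong answer."""
    def better(x, y):
        truth = RANK[x] > RANK[y]
        return (not truth) if frozenset((x, y)) in fumbled else truth
    return better

def quicksort(items, better):
    """Best-first quicksort using the middle element as pivot."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    pivot = items[mid]
    front, back = [], []
    for x in items[:mid] + items[mid + 1:]:
        (front if better(x, pivot) else back).append(x)
    # Everything in `front` lands ahead of everything in `back` without
    # ever being compared to it: this is the built-in transitivity.
    return quicksort(front, better) + [pivot] + quicksort(back, better)

def misordered_pairs(result):
    """Count pairs that appear in the wrong relative order."""
    n = len(result)
    return sum(RANK[result[i]] < RANK[result[j]]
               for i in range(n) for j in range(i + 1, n))

seq = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
fumble = {frozenset(("Painful", "Good"))}
result = quicksort(seq, make_better(fumble))
print(result, misordered_pairs(result))
# → ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad'] 3
```

Because "Painful" is never compared against "Neutral" or "Bad" after the first partition, the single wrong answer is never given a chance to be corrected.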

Results (20% error, uniform) (figure; axes: cost (number of iterations of algorithm), quality (%))
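An experiment of this kind can be approximated with a small harness. The sketch below is not the setup used in the talk: it assumes a comparator that flips its answer uniformly at random with probability `error_rate`, treats one bubble-sort pass as one unit of cost, and defines quality as the fraction of element pairs that end up in the correct relative order. All three choices, and the function names, are assumptions.

```python
import random

def noisy_less(a, b, error_rate, rng):
    """a < b, but answered incorrectly with probability error_rate."""
    truth = a < b
    return (not truth) if rng.random() < error_rate else truth

def bubble_passes(items, passes, error_rate, rng):
    """Run a fixed number of bubble-sort passes with a noisy comparator."""
    out = list(items)
    for _ in range(passes):
        for i in range(len(out) - 1):
            if noisy_less(out[i + 1], out[i], error_rate, rng):
                out[i], out[i + 1] = out[i + 1], out[i]
    return out

def quality(seq):
    """Fraction of pairs in correct relative order (assumed quality metric)."""
    n = len(seq)
    ordered = sum(seq[i] < seq[j] for i in range(n) for j in range(i + 1, n))
    return ordered / (n * (n - 1) // 2)

rng = random.Random(42)
data = list(range(20))
rng.shuffle(data)
for passes in (1, 5, 20):
    print(passes, round(quality(bubble_passes(data, passes, 0.2, rng)), 2))
```

Extra passes give Bubblesort repeated chances to revisit neighboring pairs, which is one way a budget can be traded for quality.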

Summary
- Some algos implicitly exploit transitivity
  - difficult to control cost/quality tradeoff
  - might result in a poor result for specific application
- QuickSort >> Bubblesort no longer true
  - depends on error and quality expectation
  - there are better and worse ways to exploit transitivity depending on budget and error behavior
  - confirms observations of “A=B” study

Related Work on Sorting
- Ludwig Busse et al.: The information content in sorting algorithms. 2012.
- M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011.
- Qurk (MIT) & Deco (Stanford) projects. 2011-2013.
- …

Conclusion & Future Work
- Computers are becoming insane
  - because they automate more of the insane world
  - because we are hitting the limits of trad. computing
  - consequence: quality becomes a major metric
- Adding “quality” has dramatic implications
  - need to revisit algorithms to become fault-tolerant
  - need to revisit complexity: totally open
  - need to revisit debugging and testing: totally open