
1 Yi Wu (CMU) Joint work with Vitaly Feldman (IBM), Venkat Guruswami (CMU), and Prasad Raghavendra (MSR)

2 Overview: Introduction, Main Result, Proof Idea, Conclusion

3 Introduction

4 Motivating Example: The Spam Problem. [Table: example emails described by the features “10 Million”, “Lottery”, “Cheap”, “Pharmacy”, “Junk” (YES/NO), each labeled SPAM or NOT SPAM.] Learning: use the data seen so far to generate rules for future prediction.

5 The General Learning Framework There is an unknown probability distribution D over {0,1}^n (each coordinate: 0 = no, 1 = yes); examples from D are labeled by an unknown function f: {0,1}^n → {+,−}. [Figure: a sample of labeled + and − points.] After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) ≠ f(x)].

6 What does learnable mean? Performance: the learning algorithm outputs a high-accuracy hypothesis with high probability. Efficiency: the algorithm has polynomial running time. This is called the PAC learning model.

7 Concept Class If the target function f can be arbitrary, we have no way of learning it without seeing all the examples. We may assume that f comes from some simple concept (function) class, such as conjunctions, halfspaces, decision trees, decision lists, low-degree polynomials, neural networks, etc.

8 Learning a Concept Class C There is an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown function f: {0,1}^n → {+,−} in the concept class C. [Figure: a sample of labeled + and − points.] After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) ≠ f(x)].

9 Conjunctions (Monomials) The Spam Problem: [Table: the same example emails as before.] A candidate conjunction: “10 Million = YES” AND “Lottery = YES” AND “Pharmacy = YES”.

10 Halfspaces (Linear Threshold Functions) The Spam Problem: [Table: the same example emails as before.] A candidate halfspace: sign(“10 Million = YES” + 2·“Lottery = YES” + “Pharmacy = YES” − 3.5).

11 Relationship Every conjunction is a halfspace: (x_1 AND x_2 … AND x_n) = sgn(x_1 + x_2 + … + x_n − n + 0.5).
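
A quick sanity check of the identity above (a minimal illustrative sketch, not from the talk): on 0/1 inputs, the AND of all coordinates agrees with the sign of x_1 + … + x_n − n + 0.5.

```python
# Verify on all 0/1 inputs of a small dimension that
# AND(x_1..x_n) == [x_1 + ... + x_n - n + 0.5 > 0].
from itertools import product

n = 3
for x in product([0, 1], repeat=n):
    assert all(x) == (sum(x) - n + 0.5 > 0)
print("conjunction = halfspace identity checked for n =", n)
```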

12 How do we learn the concept class Conjunctions? There is an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown conjunction f: {0,1}^n → {+,−}. Algorithm: 1. Draw some examples. 2. Use linear programming to find a halfspace consistent with all the examples. Well-known theory (VC dimension) ⇒ for any D, a random sample of O(n/ε) examples yields a (1 − ε)-accurate hypothesis with high probability.
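
Step 2 above is a plain linear feasibility problem. Here is a minimal illustrative sketch (my own code and names, assuming NumPy/SciPy; the slide only says “use linear programming”): we ask for a weight vector and bias that classify every example correctly with margin 1, which is possible whenever some consistent halfspace exists.

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: (m, n) 0/1 example matrix, y: array of +1/-1 labels.
    Returns (w, b) with sign(w.x + b) matching y on every example,
    or None if no consistent halfspace exists."""
    m, n = X.shape
    # Constraints y_i * (w . x_i + b) >= 1, rewritten as A_ub @ z <= b_ub
    # for z = (w, b); the objective is irrelevant (pure feasibility check).
    A_ub = -(y[:, None] * np.hstack([X, np.ones((m, 1))]))
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    return (res.x[:n], res.x[n]) if res.success else None
```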

13 Learning Conjunctions from perfectly labeled random examples is easy! …but not very realistic… Real-world data probably doesn’t come with a guarantee that examples are labeled perfectly according to a conjunction. Linear programming is brittle: noisy examples can easily result in no consistent hypothesis. [Figure: a sample with a few mislabeled points.] This motivates the study of noisy variants of learning conjunctions.

14 Learning Conjunctions under noise There is an unknown distribution D over {0,1}^n on examples, and there is a conjunction with accuracy 1 − ε. Goal: find a hypothesis that has good accuracy (as good as 1 − ε? Or just better than 50%?). This is also called the “agnostic” noise model.

15 Another interpretation of the noise model There is an unknown distribution D over {0,1}^n; examples from D are perfectly labeled by an unknown conjunction f: {0,1}^n → {+,−}. After the examples are received, an ε fraction of them is corrupted. [Figure: a sample of labeled points with a few flipped labels.] If only an ε fraction of the data is corrupted, can we still find a good hypothesis?

16 Previous Work (Positive) No noise [Val84, BHW87, Lit88, Riv87]: conjunctions are learnable. Random noise [Kea93]: conjunctions are learnable under random noise.

17 Previous Work (Negative) [Fel06, FGKP09]: For any ε > 0, it is NP-hard to tell whether there exists a conjunction consistent with a 1 − ε fraction of the data, or no conjunction is ½ + ε consistent with the data. In other words, it is NP-hard to learn a 51%-accurate conjunction even if there exists a conjunction consistent with 99% of the examples.

18 Weakness of Previous Result We might still be able to learn conjunctions by outputting a hypothesis from a larger class of functions. E.g., [Lit88] uses the Winnow algorithm, which outputs a halfspace; the linear-programming approach also outputs a halfspace.

19 Main Result

20 For any ε > 0, it is NP-hard to tell whether there exists a conjunction consistent with a 1 − ε fraction of the data, or no halfspace is ½ + ε consistent with the data. In other words, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.

21 Why halfspaces? In practice, halfspaces are at the heart of many learning algorithms: Perceptron, Winnow, SVM (without a kernel), any linear classifier… Learning-theoretic / computational consequence: we cannot learn conjunctions under even a small amount of noise using any of the above algorithms!

22 If we are learning-algorithm designers: to obtain an efficient halfspace-based learning algorithm for conjunctions, we need either to restrict the distribution of the examples or to limit the noise.

23 Proof Idea

24 First Simplification Learning conjunctions = learning disjunctions. Why? Notice that ¬(x_1 AND x_2 … AND x_n) = (¬x_1 OR ¬x_2 … OR ¬x_n). If we have a good algorithm for learning disjunctions, we can apply it to the example-label pairs (¬x, ¬f(x)).
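
A minimal sketch of this transformation (illustrative code and names of my own): complement every bit of every example and flip every label, then hand the result to a disjunction learner.

```python
def complement_dataset(X, y):
    """X: list of 0/1 tuples, y: list of +1/-1 labels.
    Returns the complemented dataset (~x, ~label) described above."""
    X_neg = [tuple(1 - b for b in x) for x in X]
    y_neg = [-label for label in y]
    return X_neg, y_neg

# A conjunction-labeled dataset becomes a disjunction-labeled one, and vice versa.
```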

25 We will prove a simpler theorem. It is NP-hard to tell whether there exists a disjunction consistent with a 61/88 fraction of the data, or no halfspace with threshold 0 is 60/88 consistent with the data. That is, it is NP-hard to learn a 60/88-accurate halfspace with threshold 0 even if there exists a disjunction consistent with 61/88 of the examples.

26 Halfspace with threshold 0 f(x) = sgn(w_1 x_1 + w_2 x_2 + … + w_n x_n). Assuming sgn(0) = “−”, the disjunction x_1 OR … OR x_n is the halfspace sgn(x_1 + x_2 + … + x_n).
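
For concreteness, a tiny illustrative sketch (my own code, not from the talk) of a threshold-0 halfspace with the convention sgn(0) = “−”; the all-ones weight vector realizes the disjunction of the coordinates on 0/1 inputs.

```python
def halfspace_predict(w, x):
    """Threshold-0 halfspace with the convention sgn(0) = '-'."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return +1 if s > 0 else -1

# With w = (1, 1, ..., 1) this is exactly the disjunction x_1 OR ... OR x_n.
```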

27 Q: How can we prove a problem is hard? A: Reduction from a known hard problem.

28 Reduction from the Max Cut problem Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Figure: example graph on vertices 1, 2, 3, 4.]

29 Reduction from the Max Cut problem Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Figure: the same graph with a partition cutting 2 edges.] Cut = 2

30 Reduction from the Max Cut problem Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Figure: the same graph with a partition cutting 3 edges.] Cut = 3
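
For tiny graphs like the one on these slides, the optimum cut fraction can be checked by brute force. A short illustrative helper (my own code; exponential in the number of vertices, so only for toy examples):

```python
from itertools import product

def opt_cut_fraction(n, edges):
    """Opt(G) = (# edges cut by the best partition) / (# edges),
    computed by trying all 2^n partitions of the n vertices."""
    best = 0
    for side in product([0, 1], repeat=n):
        cut = sum(1 for (i, j) in edges if side[i] != side[j])
        best = max(best, cut)
    return best / len(edges)

# Example: a 4-cycle on vertices 0..3 is bipartite, so its optimum is 1.0.
print(opt_cut_fraction(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # 1.0
```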

31 Starting Point of the Reduction The following is a theorem from [Has01]. Theorem: Given a graph G(V, E), let Opt(G) = (# edges in the maximum cut) / (# edges). It is NP-hard to tell apart the following two cases: 1) Opt(G) > 17/22; 2) Opt(G) < 16/22.

32 The reduction A polynomial-time reduction maps a graph G to a distribution on labeled examples (bit vectors such as (0,1,0,1,1,0) : +, (1,1,1,0,1,0) : −, …), so that finding a good cut in G corresponds to finding a good hypothesis on the examples.

33 Desired Property of the Reduction If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut ⇒ Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis ⇒ Good Cut).

34 With such a reduction We know it is NP-hard to tell apart the following two cases: 1) Opt(G) > 17/22; 2) Opt(G) < 16/22. Therefore, it is NP-hard to tell whether there exists a disjunction consistent with a 61/88 fraction of the data, or no halfspace with threshold 0 is 60/88 consistent with the data.

35 The reduction Given a graph G on n vertices, we generate points in n dimensions. P_i: the example that is 0 in every position except the i-th coordinate, which is 1. P_ij: the example that is 0 in every position except the i-th and j-th coordinates. For example, when n = 4: P_1 = (1,0,0,0), P_2 = (0,1,0,0), P_3 = (0,0,1,0), P_12 = (1,1,0,0), P_23 = (0,1,1,0).

36 The reduction For each edge (i, j) in G, generate 4 examples: (P_i, −), (P_j, −), and (P_ij, +) taken twice.

37 The Reduction [Figure: the example graph on vertices 1, 2, 3, 4.] For the edge (1, 2), add: (1,0,0,0) (−), (0,1,0,0) (−), (1,1,0,0) (+) [twice].
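
A small illustrative sketch of this example-generating step (my own code and names; the positive example is included twice to match the 4-examples-per-edge counting used on the following slides):

```python
def reduction_examples(n, edges):
    """Build the labeled examples of the reduction for an n-vertex graph
    (vertices are 0-indexed here). Labels are +1 / -1."""
    def point(coords):
        x = [0] * n
        for c in coords:
            x[c] = 1
        return tuple(x)

    data = []
    for (i, j) in edges:
        data.append((point([i]), -1))      # P_i, labeled -
        data.append((point([j]), -1))      # P_j, labeled -
        data.append((point([i, j]), +1))   # P_ij, labeled +
        data.append((point([i, j]), +1))   # P_ij again, so each edge yields 4 examples
    return data
```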

38 The Reduction [Figure: the full set of examples generated from the example graph, one group of (P_i, −), (P_j, −), (P_ij, +) per edge (i, j).]

39 Desired Property of the Reduction If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut ⇒ Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis ⇒ Good Cut).

40 Proof of Good Cut ⇒ Good Hypothesis If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples. Proof: Opt(G) > 17/22 means there is a partition of G into (S, S̄) such that a 17/22 fraction of the edges is in the cut. The disjunction of the variables in S then agrees with a 61/88 fraction of the examples. Why?

41 Good Cut ⇒ Good Hypothesis The partition {1, 3} is a good cut ⇒ the disjunction x_1 OR x_3 is a good hypothesis. [Figure: the example graph with the partition {1, 3}, together with the examples generated from it.]

42 The Reduction For the edge (1, 2): (1,0,0,0) (−), (0,1,0,0) (−), (1,1,0,0) (+) [twice]. If only x_1 is in the disjunction, 3 out of 4 examples are classified correctly.

43 The Reduction For the edge (1, 2): (1,0,0,0) (−), (0,1,0,0) (−), (1,1,0,0) (+) [twice]. If only x_2 is in the disjunction, 3 out of 4 examples are classified correctly.

44 The Reduction For the edge (1, 2): (1,0,0,0) (−), (0,1,0,0) (−), (1,1,0,0) (+) [twice]. If x_1 and x_2 are both in the disjunction, 2 out of 4 examples are classified correctly.

45 The Reduction For the edge (1, 2): (1,0,0,0) (−), (0,1,0,0) (−), (1,1,0,0) (+) [twice]. If neither x_1 nor x_2 is in the disjunction, 2 out of 4 examples are classified correctly.

46 The Large Picture If we choose the disjunction x_1 OR x_3: for edges in the cut, 3 out of 4 examples are satisfied; for edges not in the cut, 2 out of 4 examples are satisfied. [Figure: the example graph and the examples generated from it.]

47 Therefore, we prove: if a partition (S, S̄) cuts a 17/22 fraction of the edges, then the corresponding disjunction is consistent with a (1/2) + (1/4)(17/22) = 61/88 fraction of the examples, since each cut edge contributes 3 out of 4 correct examples and each uncut edge contributes 2 out of 4.
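
The accuracy bookkeeping above is easy to check numerically. A small illustrative helper (my own code; it can be paired with the example-generating sketch earlier) that evaluates a disjunction over a vertex set S on the reduction's examples:

```python
def disjunction_accuracy(data, S):
    """data: list of (x, label) pairs with 0/1 tuples x and labels +1/-1.
    Returns the fraction of examples on which OR_{i in S} x_i is correct."""
    correct = 0
    for x, label in data:
        pred = +1 if any(x[i] for i in S) else -1
        correct += (pred == label)
    return correct / len(data)

# On the per-edge examples built as above, the accuracy of the disjunction over
# one side of a partition equals 1/2 + (cut fraction of that partition)/4.
```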

48 Desired Property of the Reduction If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut ⇒ Good Hypothesis). If there is a halfspace with threshold 0 that has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges (Good Hypothesis ⇒ Good Cut).

49 Proof of Good Hypothesis ⇒ Good Cut If there is a halfspace with threshold 0 that has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges. Suppose some halfspace sgn(w_1 x_1 + w_2 x_2 + … + w_n x_n) has accuracy 60/88. Assign vertex i to a side of the partition according to sgn(w_i). The resulting cut contains at least a 16/22 fraction of the edges. Why?
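
The rounding step is just reading off a partition from the sign pattern of the weights. A short illustrative sketch (my own code and names) of that step:

```python
def cut_from_halfspace(w, edges):
    """Place vertex i on one side iff w_i > 0, and return the fraction of
    edges whose endpoints end up on different sides."""
    side = [1 if wi > 0 else 0 for wi in w]
    cut = sum(1 for (i, j) in edges if side[i] != side[j])
    return cut / len(edges)
```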

50 Good Hypothesis ⇒ Good Cut For the edge (1, 2): (1,0,0,…,0) (−) is correct iff w_1 ≤ 0; (0,1,0,…,0) (−) is correct iff w_2 ≤ 0; (1,1,0,…,0) (+) [twice] is correct iff w_1 + w_2 > 0. So 3 out of 4 are satisfied only when 1. w_1 > 0 and w_2 ≤ 0, or 2. w_2 > 0 and w_1 ≤ 0; and at most 3 out of 4 can be satisfied.

51 To finish the proof: 60/88 = (1/4)(16/22) + 1/2. Therefore, for at least a 16/22 fraction of the edges (i, j), w_i and w_j must have different signs ⇒ the cut contains at least a 16/22 fraction of the edges.

52 What we prove in the talk: It is NP-hard to tell whether there exists a disjunction consistent with a 61/88 fraction of the data, or no halfspace with threshold 0 is 60/88 consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a disjunction consistent with 61/88 of the examples.

53 Main result in the paper: For any ε > 0, it is NP-hard to tell whether there exists a conjunction consistent with a 1 − ε fraction of the data, or no halfspace is ½ + ε consistent with the data. In other words, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.

54 To get the stronger hardness We start from a problem called Label Cover, for which it is NP-hard to tell apart i) Opt > 0.99 and ii) Opt < 0.01.

55 Sketch of the full proof Label Cover → “Smooth” Label Cover → gadget: dictatorship testing (using Berry–Esseen and the critical index) → hardness of learning conjunctions by halfspaces.

56 Conclusion Even weak learning of noisy conjunctions by halfspaces is NP-hard. To obtain an efficient halfspace-based learning algorithm for conjunctions, we need either to restrict the distribution of the examples or limit the noise.

57 Future Work Prove: for any ε > 0, given a set of training examples, even if there is a conjunction consistent with a 1 − ε fraction of the data, it is NP-hard to find a degree-d polynomial threshold function that is ½ + ε consistent with the data. Why low-degree PTFs? They correspond to SVMs with a polynomial kernel, and they can be used to learn conjunctions/halfspaces agnostically under the uniform distribution.

