Yi Wu (CMU). Joint work with Vitaly Feldman (IBM), Venkat Guruswami (CMU), and Prasad Raghavendra (MSR).
Overview: Introduction, Main Result, Proof Idea, Conclusion.
Introduction
Motivating Example: The Spam Problem
[Table: example emails with features "10 Million", "Lottery", "Cheap", "Pharmacy", "Junk" and the label "Is Spam" (SPAM / NOT SPAM).]
Learning: use the data seen so far to generate rules for future prediction.
The General Learning Framework: There is an unknown probability distribution D over {0,1}^n (coordinates encode features: 0 = no, 1 = yes), and examples drawn from D are labeled by an unknown function f: {0,1}^n -> {+,-}. After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) != f(x)].
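As a concrete illustration (my own sketch, not from the talk), here is a minimal Python estimate of a hypothesis's error; the sampler standing in for the unknown distribution D is a hypothetical placeholder.

```python
import random

def empirical_error(h, f, sample_x, n_draws=10000):
    """Estimate Pr_{x~D}[h(x) != f(x)] by drawing examples.

    `sample_x` is a hypothetical stand-in for the unknown distribution D:
    a zero-argument function returning a random x in {0,1}^n.
    """
    draws = (sample_x() for _ in range(n_draws))
    return sum(h(x) != f(x) for x in draws) / n_draws

# Example: D uniform over {0,1}^4, f a conjunction, h a slightly wrong guess.
sample_x = lambda: tuple(random.randint(0, 1) for _ in range(4))
f = lambda x: x[0] == 1 and x[1] == 1        # x1 AND x2
h = lambda x: x[0] == 1                      # just x1
print(empirical_error(h, f, sample_x))       # about 0.25
```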
What does learnable mean? Performance: the learning algorithm outputs a high-accuracy hypothesis with high probability. Efficiency: the algorithm runs in polynomial time. This is called the PAC learning model.
Concept Class: If the target function f can be arbitrary, we have no way of learning it without seeing all the examples. So we assume f comes from some simple concept (function) class, such as conjunctions, halfspaces, decision trees, decision lists, low-degree polynomials, neural networks, etc.
Learning a Concept Class C: An unknown distribution D over {0,1}^n; examples from D are labeled by an unknown function f: {0,1}^n -> {+,-} in the concept class C. After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) != f(x)].
Conjunctions (Monomials): The Spam Problem
[Table: the same spam examples as before.]
A candidate rule: "10 Million = yes" AND "Lottery = yes" AND "Pharmacy = yes".
Halfspaces (Linear Threshold Functions): The Spam Problem
[Table: the same spam examples as before.]
A candidate rule: sign("10 Million = yes" + 2 * "Lottery = yes" + "Pharmacy = yes" - 3.5).
Relationship between halfspaces and conjunctions: (x_1 AND x_2 AND ... AND x_n) = sgn(x_1 + x_2 + ... + x_n - n + 0.5), so every conjunction is a halfspace.
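To make the identity concrete, here is a small sanity check (my own sketch, not from the talk) that exhaustively verifies the conjunction/halfspace equivalence for small n:

```python
from itertools import product

def conjunction(x):
    # x1 AND x2 AND ... AND xn
    return all(x)

def halfspace_form(x):
    # sgn(x1 + ... + xn - n + 0.5): positive exactly when all coordinates are 1
    return sum(x) - len(x) + 0.5 > 0

# Exhaustive check over all of {0,1}^n for small n.
for n in range(1, 8):
    assert all(conjunction(x) == halfspace_form(x)
               for x in product((0, 1), repeat=n))
print("conjunction == halfspace on all inputs checked")
```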
How to learn the concept class of conjunctions? Setting: an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown conjunction f: {0,1}^n -> {0,1}. Algorithm: 1. Draw some examples. 2. Use linear programming to find a halfspace consistent with all of them. Well-known theory (VC dimension) shows that for any D, a random sample of O(n/ε) examples yields a (1-ε)-accurate hypothesis with high probability.
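The consistency step can be phrased as a feasibility LP. Below is a minimal sketch (assuming scipy is available; the margin of 1 is just a scaling convenience, since any consistent halfspace can be rescaled to satisfy it):

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Find (w, b) with y_i * (w . x_i + b) >= 1 for every example, if possible.

    X: (m, n) numpy array of 0/1 examples; y: length-m array of +1/-1 labels.
    Feasibility LP: minimize 0 subject to -y_i * (w . x_i + b) <= -1.
    Returns (w, b), or None if no consistent halfspace exists.
    """
    m, n = X.shape
    A = np.hstack([X, np.ones((m, 1))])      # last column multiplies the bias b
    res = linprog(c=np.zeros(n + 1),
                  A_ub=-y[:, None] * A, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return (res.x[:n], res.x[n]) if res.success else None
```

On perfectly labeled conjunction data this always succeeds, because the target conjunction itself is a feasible halfspace.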
Learning conjunctions from perfectly labeled random examples is easy... but not very realistic: real-world data probably doesn't come with a guarantee that examples are labeled perfectly according to a conjunction. Linear programming is brittle: noisy examples can easily result in no consistent hypothesis. This motivates the study of noisy variants of learning conjunctions.
Learning conjunctions under noise: An unknown distribution D over {0,1}^n of labeled examples, with the guarantee that some conjunction has accuracy 1-ε. Goal: find a hypothesis with good accuracy (as good as 1-ε? or just better than 50%?). This is also called the "agnostic" noise model.
Another interpretation of the noise model: An unknown distribution D over {0,1}^n; examples from D are perfectly labeled by an unknown conjunction f: {0,1}^n -> {+,-}, and then an ε fraction of the examples is corrupted. If only an ε fraction of the data is corrupted, can we still find a good hypothesis?
Previous Work (Positive). No noise [Val84, BHW87, Lit88, Riv87]: conjunctions are learnable. Random noise [Kea93]: conjunctions are learnable under random noise.
Previous Work (Negative) [Fel06, FGKP09]: For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no conjunction is (1/2 + ε)-consistent with the data. In other words, it is NP-hard to learn a 51%-accurate conjunction even if there exists a conjunction consistent with 99% of the examples.
Weakness of the previous result: We might still be able to learn conjunctions by outputting a hypothesis from a larger class of functions. E.g., [Lit88] uses the Winnow algorithm, which outputs a halfspace, and linear programming also outputs a halfspace.
Main Result
For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no halfspace is (1/2 + ε)-consistent with the data. That is, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
Why halfspaces? In practice, halfspaces are at the heart of many learning algorithms: Perceptron, Winnow, SVM (without a kernel), any linear classifier... So, computationally, we cannot learn a conjunction under even a little bit of noise using any of the above algorithms!
If we are learning-algorithm designers: to obtain an efficient halfspace-based learning algorithm for conjunctions, we need to either restrict the distribution of the examples or limit the noise.
Proof Idea
First simplification: learning conjunctions = learning disjunctions. Why? By De Morgan's law, NOT(x_1 AND x_2 AND ... AND x_n) = (NOT x_1 OR NOT x_2 OR ... OR NOT x_n). So if we have a good algorithm for learning disjunctions, we can apply it to the example-label pairs (NOT x, NOT f(x)).
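A sketch of this reduction as a wrapper (names are my own; `learn_disjunction` is any assumed disjunction learner):

```python
def learn_conjunction_via_disjunction(examples, learn_disjunction):
    """Learn a conjunction by De Morgan: negate inputs and labels, learn a
    disjunction, then undo the negation.

    `examples`: list of (x, label) with x a 0/1 tuple, label True/False.
    `learn_disjunction`: assumed learner returning a hypothesis on 0/1 tuples.
    """
    negate = lambda x: tuple(1 - xi for xi in x)
    flipped = [(negate(x), not label) for x, label in examples]
    h = learn_disjunction(flipped)
    # h approximates NOT f on negated inputs, so un-negate both:
    return lambda x: not h(negate(x))
```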
We will prove a simpler theorem: it is NP-hard to tell whether (i) there exists a disjunction consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a disjunction consistent with 61/88 of the examples.
Halfspace with threshold 0: f(x) = sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n). Assuming sgn(0) = "-", the disjunction x_1 OR x_2 OR ... OR x_n is exactly sgn(x_1 + x_2 + ... + x_n).
Q: How can we prove a problem is hard? A: Reduction from a known hard problem.
Reduction from the Max Cut problem. Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Figure: the same example graph partitioned two ways, one with Cut = 2 and a better one with Cut = 3.]
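For concreteness, a tiny sketch of evaluating a cut (the 4-vertex edge set is my reconstruction of the talk's running example):

```python
def cut_value(edges, S):
    """Number of edges with exactly one endpoint in S."""
    return sum((i in S) != (j in S) for i, j in edges)

edges = [(1, 2), (2, 3), (3, 4), (1, 3)]   # assumed example graph
print(cut_value(edges, {1, 2}))            # 2
print(cut_value(edges, {1, 3}))            # 3, the maximum here
```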
Starting point of the reduction, a theorem from [Has01]: Given a graph G(V,E), let Opt(G) = (#edges in a maximum cut) / (#edges). It is NP-hard to tell apart the following two cases: 1) Opt(G) > 17/22; 2) Opt(G) < 16/22.
The reduction: a polynomial-time map from a graph G to a distribution on labeled examples, e.g. (0,1,0,1,1,0): +, (1,1,1,0,1,0): -, ..., such that finding a good cut corresponds to finding a good hypothesis.
Desired properties of the reduction: If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
With such a reduction, since it is NP-hard to tell apart 1) Opt(G) > 17/22 and 2) Opt(G) < 16/22, it follows that it is NP-hard to tell whether there exists a disjunction consistent with a 61/88 fraction of the data, or no halfspace with threshold 0 is 60/88-consistent with the data.
The reduction: given a graph G on n vertices, generate points in n dimensions. P_i: the example with 0 in all positions except a 1 in the i-th coordinate. P_ij: the example with 0 in all positions except 1s in the i-th and j-th coordinates. For example, when n = 4: P_1 = (1,0,0,0), P_2 = (0,1,0,0), P_3 = (0,0,1,0), P_12 = (1,1,0,0), P_23 = (0,1,1,0).
For each edge (i,j) in G, generate 4 examples: (P_i, -), (P_j, -), and (P_ij, +) counted twice. (The doubling of (P_ij, +) is what makes the "3 out of 4" counts below come out.)
For the edge (1,2), add: ((1,0,0,0), -), ((0,1,0,0), -), and ((1,1,0,0), +) twice.
The full set of examples for the 4-vertex example graph with edges (1,2), (2,3), (3,4), (1,3):
Edge (1,2): (1,0,0,0) -, (0,1,0,0) -, (1,1,0,0) + (twice)
Edge (2,3): (0,1,0,0) -, (0,0,1,0) -, (0,1,1,0) + (twice)
Edge (3,4): (0,0,1,0) -, (0,0,0,1) -, (0,0,1,1) + (twice)
Edge (1,3): (1,0,0,0) -, (0,0,1,0) -, (1,0,1,0) + (twice)
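A generator for these examples (a sketch under the reading that (P_ij, +) is counted twice, as the "3 out of 4" case analysis below requires):

```python
def reduction_examples(edges, n):
    """For each edge (i, j): (P_i, -1), (P_j, -1), and (P_ij, +1) twice.

    Vertices are 1-indexed; P_i / P_ij are 0/1 indicator vectors in {0,1}^n.
    """
    def point(*ones):
        return tuple(1 if k + 1 in ones else 0 for k in range(n))
    examples = []
    for i, j in edges:
        examples += [(point(i), -1), (point(j), -1),
                     (point(i, j), +1), (point(i, j), +1)]
    return examples

# 4 edges x 4 examples = 16 labeled points for the example graph.
print(len(reduction_examples([(1, 2), (2, 3), (3, 4), (1, 3)], 4)))  # 16
```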
Desired properties of the reduction (recap): If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
Proof of Good Cut => Good Hypothesis: If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples. Proof: Opt(G) > 17/22 means there is a partition of G into (S, V \ S) such that a 17/22 fraction of the edges is in the cut. The disjunction OR_{i in S} x_i is then correct with probability 61/88. Why?
Good Cut => Good Hypothesis: for the example graph, the partition S = {1,3} is a good cut (it cuts edges (1,2), (2,3), (3,4)), and the disjunction x_1 OR x_3 is a good hypothesis: it is correct on 11 of the 16 examples listed above.
Case analysis for the edge (1,2), with examples (P_1, -), (P_2, -), and (P_12, +) twice:
Only x_1 is in the disjunction: 3 out of 4 are correct (only (P_1, -) is misclassified).
Only x_2 is in the disjunction: 3 out of 4 are correct (only (P_2, -) is misclassified).
Both x_1 and x_2 are in the disjunction: 2 out of 4 are correct (both (P_1, -) and (P_2, -) are misclassified).
Neither x_1 nor x_2 is in the disjunction: 2 out of 4 are correct (both copies of (P_12, +) are misclassified).
The big picture: if we choose the disjunction x_1 OR x_3 (i.e., S = {1,3}), then for each edge in the cut, 3 out of its 4 examples are classified correctly, and for each edge not in the cut (here, edge (1,3)), 2 out of 4 are classified correctly.
Therefore, we have proved: if a partition (S, V \ S) cuts a 17/22 fraction of the edges, then the corresponding disjunction is consistent with a (3/4)(17/22) + (1/2)(5/22) = 1/2 + (1/4)(17/22) = 61/88 fraction of the examples.
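The accounting can be checked mechanically. This sketch reuses `reduction_examples` from the earlier sketch and evaluates the disjunction OR_{i in S} x_i as a threshold-0 halfspace with sgn(0) = "-":

```python
def sgn(v):
    return +1 if v > 0 else -1               # sgn(0) = "-", as defined earlier

def disjunction_accuracy(edges, n, S):
    """Fraction of reduction examples that OR_{i in S} x_i classifies correctly."""
    examples = reduction_examples(edges, n)  # from the earlier sketch
    h = lambda x: sgn(sum(x[i - 1] for i in S))
    return sum(h(x) == label for x, label in examples) / len(examples)

# S = {1, 3} cuts 3 of the 4 example edges, so the predicted accuracy is
# 1/2 + (1/4)(3/4) = 11/16 = 0.6875.
print(disjunction_accuracy([(1, 2), (2, 3), (3, 4), (1, 3)], 4, {1, 3}))
```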
Desired properties of the reduction (recap): If Opt(G) > 17/22, then there is a disjunction agreeing with a 61/88 fraction of the examples (Good Cut => Good Hypothesis, just proved). If some halfspace with threshold 0 has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges (Good Hypothesis => Good Cut).
Proof of Good Hypothesis => Good Cut: Suppose some halfspace sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n) has accuracy 60/88. Assign vertex i to a side of the partition according to sgn(w_i). The resulting cut contains at least a 16/22 fraction of the edges. Why?
Good Hypothesis => Good Cut, edge by edge: for edge (1,2), classifying all examples correctly would require (1,0,0,...,0) labeled -: w_1 ≤ 0; (0,1,0,...,0) labeled -: w_2 ≤ 0; (1,1,0,...,0) labeled +: w_1 + w_2 > 0. These cannot all hold, so at most 3 out of 4 examples are ever satisfied, and 3 out of 4 are satisfied only when 1. w_1 > 0, w_2 ≤ 0 or 2. w_2 > 0, w_1 ≤ 0, i.e., only when the edge is cut by the sign partition; otherwise at most 2 out of 4 are satisfied.
To finish the proof: each cut edge contributes at most 3/4 and each uncut edge at most 2/4, so accuracy ≤ 1/2 + (1/4)(cut fraction). Since 60/88 = 1/2 + (1/4)(16/22), accuracy 60/88 forces at least a 16/22 fraction of the edges (i,j) to have w_i, w_j of different signs, i.e., cut ≥ 16/22.
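The reverse direction, as a sketch: read the partition off the weights and apply the counting bound.

```python
def partition_from_halfspace(w):
    """Good Hypothesis => Good Cut: vertex i goes to S iff w_i > 0."""
    return {i + 1 for i, wi in enumerate(w) if wi > 0}

# Each cut edge contributes at most 3/4 and each uncut edge at most 2/4,
# so accuracy <= 1/2 + cut_fraction / 4, which rearranges to:
def cut_fraction_lower_bound(accuracy):
    return 4 * (accuracy - 0.5)

print(partition_from_halfspace((1.0, -2.0, 0.5, 0.0)))  # {1, 3}
print(cut_fraction_lower_bound(60 / 88))                # 16/22, about 0.727
```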
What we proved in this talk: it is NP-hard to tell whether (i) there exists a disjunction (equivalently, a conjunction) consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a conjunction consistent with 61/88 of the examples.
Main result in the paper: For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no halfspace is (1/2 + ε)-consistent with the data. That is, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
To get the better hardness factors, we start from a problem called Label Cover, for which it is NP-hard to tell apart i) Opt > 0.99 and ii) Opt < 0.01.
Sketch of the full proof: Label Cover -> "Smooth" Label Cover -> gadget: dictatorship testing -> hardness of learning conjunctions; the analysis uses the Berry-Esseen theorem and the critical index.
Conclusion: Even weak learning of noisy conjunctions by halfspaces is NP-hard. To obtain an efficient halfspace-based learning algorithm for conjunctions, we need to either restrict the distribution of the examples or limit the noise.
Future work: prove that for any ε > 0, given a set of training examples, even if there is a conjunction consistent with a 1-ε fraction of the data, it is NP-hard to find a degree-d polynomial threshold function that is (1/2 + ε)-consistent with the data. Why low-degree PTFs? They correspond to SVMs with a polynomial kernel, and they can be used to learn conjunctions/halfspaces agnostically under the uniform distribution.