Yi Wu (CMU). Joint work with Vitaly Feldman (IBM), Venkat Guruswami (CMU), and Prasad Raghavendra (MSR).
Overview: Introduction, Main Result, Proof Idea, Conclusion.
Introduction
Motivating Example: The Spam Problem
[Table: example emails with features "10 Million", "Lottery", "Cheap", "Pharmacy", "Junk" and the label "Is Spam" (SPAM / NOT SPAM).]
Learning: use the data seen so far to generate rules for future prediction.
The General Learning Framework: There is an unknown probability distribution D over {0,1}^n (coordinates encode features: 0 = no, 1 = yes), and examples drawn from D are labeled by an unknown function f: {0,1}^n -> {+,-}. After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) != f(x)].
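As a concrete illustration (my own sketch, not from the talk), here is a minimal Python estimate of a hypothesis's error; the sampler standing in for the unknown distribution D is a hypothetical placeholder.

```python
import random

def empirical_error(h, f, sample_x, n_draws=10000):
    """Estimate Pr_{x~D}[h(x) != f(x)] by drawing examples.

    `sample_x` is a hypothetical stand-in for the unknown distribution D:
    a zero-argument function returning a random x in {0,1}^n.
    """
    draws = (sample_x() for _ in range(n_draws))
    return sum(h(x) != f(x) for x in draws) / n_draws

# Example: D uniform over {0,1}^4, f a conjunction, h a slightly wrong guess.
sample_x = lambda: tuple(random.randint(0, 1) for _ in range(4))
f = lambda x: x[0] == 1 and x[1] == 1        # x1 AND x2
h = lambda x: x[0] == 1                      # just x1
print(empirical_error(h, f, sample_x))       # about 0.25
```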
What does learnable mean? Performance: the learning algorithm outputs a high-accuracy hypothesis with high probability. Efficiency: the algorithm runs in polynomial time. This is called the PAC learning model.
Concept Class: If the target function f can be arbitrary, we have no way of learning it without seeing all the examples. So we assume f comes from some simple concept (function) class, such as conjunctions, halfspaces, decision trees, decision lists, low-degree polynomials, neural networks, etc.
Learning a Concept Class C: An unknown distribution D over {0,1}^n; examples from D are labeled by an unknown function f: {0,1}^n -> {+,-} in the concept class C. After receiving examples, the algorithm does its computation and outputs a hypothesis h. The error of the hypothesis is Pr_{x~D}[h(x) != f(x)].
Conjunctions (Monomials): The Spam Problem
[Table: the same spam examples as before.]
A candidate rule: "10 Million = yes" AND "Lottery = yes" AND "Pharmacy = yes".
Halfspaces (Linear Threshold Functions): The Spam Problem
[Table: the same spam examples as before.]
A candidate rule: sign("10 Million = yes" + 2 * "Lottery = yes" + "Pharmacy = yes" - 3.5).
Relationship between halfspaces and conjunctions: (x_1 AND x_2 AND ... AND x_n) = sgn(x_1 + x_2 + ... + x_n - n + 0.5), so every conjunction is a halfspace.
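To make the identity concrete, here is a small sanity check (my own sketch, not from the talk) that exhaustively verifies the conjunction/halfspace equivalence for small n:

```python
from itertools import product

def conjunction(x):
    # x1 AND x2 AND ... AND xn
    return all(x)

def halfspace_form(x):
    # sgn(x1 + ... + xn - n + 0.5): positive exactly when all coordinates are 1
    return sum(x) - len(x) + 0.5 > 0

# Exhaustive check over all of {0,1}^n for small n.
for n in range(1, 8):
    assert all(conjunction(x) == halfspace_form(x)
               for x in product((0, 1), repeat=n))
print("conjunction == halfspace on all inputs checked")
```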
How to learn the concept class of conjunctions? Setting: an unknown distribution D over {0,1}^n; examples from D are labeled by an unknown conjunction f: {0,1}^n -> {0,1}. Algorithm: 1. Draw some examples. 2. Use linear programming to find a halfspace consistent with all of them. Well-known theory (VC dimension) shows that for any D, a random sample of O(n/ε) examples yields a (1-ε)-accurate hypothesis with high probability.
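The consistency step can be phrased as a feasibility LP. Below is a minimal sketch (assuming scipy is available; the margin of 1 is just a scaling convenience, since any consistent halfspace can be rescaled to satisfy it):

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Find (w, b) with y_i * (w . x_i + b) >= 1 for every example, if possible.

    X: (m, n) numpy array of 0/1 examples; y: length-m array of +1/-1 labels.
    Feasibility LP: minimize 0 subject to -y_i * (w . x_i + b) <= -1.
    Returns (w, b), or None if no consistent halfspace exists.
    """
    m, n = X.shape
    A = np.hstack([X, np.ones((m, 1))])      # last column multiplies the bias b
    res = linprog(c=np.zeros(n + 1),
                  A_ub=-y[:, None] * A, b_ub=-np.ones(m),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return (res.x[:n], res.x[n]) if res.success else None
```

On perfectly labeled conjunction data this always succeeds, because the target conjunction itself is a feasible halfspace.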
Learning conjunctions from perfectly labeled random examples is easy... but not very realistic: real-world data probably doesn't come with a guarantee that examples are labeled perfectly according to a conjunction. Linear programming is brittle: noisy examples can easily result in no consistent hypothesis. This motivates the study of noisy variants of learning conjunctions.
Learning conjunctions under noise: An unknown distribution D over {0,1}^n of labeled examples, with the guarantee that some conjunction has accuracy 1-ε. Goal: find a hypothesis with good accuracy (as good as 1-ε? or just better than 50%?). This is also called the "agnostic" noise model.
Another interpretation of the noise model: An unknown distribution D over {0,1}^n; examples from D are perfectly labeled by an unknown conjunction f: {0,1}^n -> {+,-}, and then an ε fraction of the examples is corrupted. If only an ε fraction of the data is corrupted, can we still find a good hypothesis?
Previous Work (Positive). No noise [Val84, BHW87, Lit88, Riv87]: conjunctions are learnable. Random noise [Kea93]: conjunctions are learnable under random noise.
Previous Work (Negative) [Fel06, FGKP09]: For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no conjunction is (1/2 + ε)-consistent with the data. In other words, it is NP-hard to learn a 51%-accurate conjunction even if there exists a conjunction consistent with 99% of the examples.
Weakness of the previous result: We might still be able to learn conjunctions by outputting a hypothesis from a larger class of functions. E.g., [Lit88] uses the Winnow algorithm, which outputs a halfspace, and linear programming also outputs a halfspace.
Main Result
For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no halfspace is (1/2 + ε)-consistent with the data. That is, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
Why halfspaces? In practice, halfspaces are at the heart of many learning algorithms: Perceptron, Winnow, SVM (without a kernel), any linear classifier... So, computationally, we cannot learn a conjunction under even a little bit of noise using any of the above algorithms!
If we are learning-algorithm designers: to obtain an efficient halfspace-based learning algorithm for conjunctions, we need to either restrict the distribution of the examples or limit the noise.
Proof Idea
First simplification: learning conjunctions = learning disjunctions. Why? By De Morgan's law, NOT(x_1 AND x_2 AND ... AND x_n) = (NOT x_1 OR NOT x_2 OR ... OR NOT x_n). So if we have a good algorithm for learning disjunctions, we can apply it to the example-label pairs (NOT x, NOT f(x)).
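A sketch of this reduction as a wrapper (names are my own; `learn_disjunction` is any assumed disjunction learner):

```python
def learn_conjunction_via_disjunction(examples, learn_disjunction):
    """Learn a conjunction by De Morgan: negate inputs and labels, learn a
    disjunction, then undo the negation.

    `examples`: list of (x, label) with x a 0/1 tuple, label True/False.
    `learn_disjunction`: assumed learner returning a hypothesis on 0/1 tuples.
    """
    negate = lambda x: tuple(1 - xi for xi in x)
    flipped = [(negate(x), not label) for x, label in examples]
    h = learn_disjunction(flipped)
    # h approximates NOT f on negated inputs, so un-negate both:
    return lambda x: not h(negate(x))
```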
We will prove a simpler theorem: it is NP-hard to tell whether (i) there exists a disjunction consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a disjunction consistent with 61/88 of the examples.
Halfspace with threshold 0: f(x) = sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n). Assuming sgn(0) = "-", the disjunction x_1 OR x_2 OR ... OR x_n is exactly sgn(x_1 + x_2 + ... + x_n).
Q: How can we prove a problem is hard? A: Reduction from a known hard problem.
Reduction from the Max Cut problem. Max Cut: given a graph G, find a partition of the vertices that maximizes the number of crossing edges. [Figure: the same example graph partitioned two ways, one with Cut = 2 and a better one with Cut = 3.]
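For concreteness, a tiny sketch of evaluating a cut (the 4-vertex edge set is my reconstruction of the talk's running example):

```python
def cut_value(edges, S):
    """Number of edges with exactly one endpoint in S."""
    return sum((i in S) != (j in S) for i, j in edges)

edges = [(1, 2), (2, 3), (3, 4), (1, 3)]   # assumed example graph
print(cut_value(edges, {1, 2}))            # 2
print(cut_value(edges, {1, 3}))            # 3, the maximum here
```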
Starting point of the reduction, a theorem from [Has01]: Given a graph G(V,E), let Opt(G) = (#edges in a maximum cut) / (#edges). It is NP-hard to tell apart the following two cases: 1) Opt(G) > 17/22; 2) Opt(G) < 16/22.
The reduction: a polynomial-time map from a graph G to a distribution on labeled examples, e.g. (0,1,0,1,1,0): +, (1,1,1,0,1,0): -, ..., such that finding a good cut corresponds to finding a good hypothesis.
Desired properties of the reduction: If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
With such a reduction, since it is NP-hard to tell apart 1) Opt(G) > 17/22 and 2) Opt(G) < 16/22, it follows that it is NP-hard to tell whether there exists a disjunction consistent with a 61/88 fraction of the data, or no halfspace with threshold 0 is 60/88-consistent with the data.
The reduction: given a graph G on n vertices, generate points in n dimensions. P_i: the example with 0 in all positions except a 1 in the i-th coordinate. P_ij: the example with 0 in all positions except 1s in the i-th and j-th coordinates. For example, when n = 4: P_1 = (1,0,0,0), P_2 = (0,1,0,0), P_3 = (0,0,1,0), P_12 = (1,1,0,0), P_23 = (0,1,1,0).
For each edge (i,j) in G, generate 4 examples: (P_i, -), (P_j, -), and (P_ij, +) counted twice. (The doubling of (P_ij, +) is what makes the "3 out of 4" counts below come out.)
For the edge (1,2), add: ((1,0,0,0), -), ((0,1,0,0), -), and ((1,1,0,0), +) twice.
The full set of examples for the 4-vertex example graph with edges (1,2), (2,3), (3,4), (1,3):
Edge (1,2): (1,0,0,0) -, (0,1,0,0) -, (1,1,0,0) + (twice)
Edge (2,3): (0,1,0,0) -, (0,0,1,0) -, (0,1,1,0) + (twice)
Edge (3,4): (0,0,1,0) -, (0,0,0,1) -, (0,0,1,1) + (twice)
Edge (1,3): (1,0,0,0) -, (0,0,1,0) -, (1,0,1,0) + (twice)
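A generator for these examples (a sketch under the reading that (P_ij, +) is counted twice, as the "3 out of 4" case analysis below requires):

```python
def reduction_examples(edges, n):
    """For each edge (i, j): (P_i, -1), (P_j, -1), and (P_ij, +1) twice.

    Vertices are 1-indexed; P_i / P_ij are 0/1 indicator vectors in {0,1}^n.
    """
    def point(*ones):
        return tuple(1 if k + 1 in ones else 0 for k in range(n))
    examples = []
    for i, j in edges:
        examples += [(point(i), -1), (point(j), -1),
                     (point(i, j), +1), (point(i, j), +1)]
    return examples

# 4 edges x 4 examples = 16 labeled points for the example graph.
print(len(reduction_examples([(1, 2), (2, 3), (3, 4), (1, 3)], 4)))  # 16
```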
Desired properties of the reduction (recap): If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples (Good Cut => Good Hypothesis). If Opt(G) < 16/22, then no halfspace with threshold 0 is consistent with a 60/88 fraction of the examples (Good Hypothesis => Good Cut).
Proof of Good Cut => Good Hypothesis: If Opt(G) > 17/22, then there is a disjunction that agrees with a 61/88 fraction of the examples. Proof: Opt(G) > 17/22 means there is a partition of G into (S, V \ S) such that a 17/22 fraction of the edges is in the cut. The disjunction OR_{i in S} x_i is then correct with probability 61/88. Why?
Good Cut => Good Hypothesis: for the example graph, the partition S = {1,3} is a good cut (it cuts edges (1,2), (2,3), (3,4)), and the disjunction x_1 OR x_3 is a good hypothesis: it is correct on 11 of the 16 examples listed above.
Case analysis for the edge (1,2), with examples (P_1, -), (P_2, -), and (P_12, +) twice:
Only x_1 is in the disjunction: 3 out of 4 are correct (only (P_1, -) is misclassified).
Only x_2 is in the disjunction: 3 out of 4 are correct (only (P_2, -) is misclassified).
Both x_1 and x_2 are in the disjunction: 2 out of 4 are correct (both (P_1, -) and (P_2, -) are misclassified).
Neither x_1 nor x_2 is in the disjunction: 2 out of 4 are correct (both copies of (P_12, +) are misclassified).
The big picture: if we choose the disjunction x_1 OR x_3 (i.e., S = {1,3}), then for each edge in the cut, 3 out of its 4 examples are classified correctly, and for each edge not in the cut (here, edge (1,3)), 2 out of 4 are classified correctly.
Therefore, we have proved: if a partition (S, V \ S) cuts a 17/22 fraction of the edges, then the corresponding disjunction is consistent with a (3/4)(17/22) + (1/2)(5/22) = 1/2 + (1/4)(17/22) = 61/88 fraction of the examples.
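The accounting can be checked mechanically. This sketch reuses `reduction_examples` from the earlier sketch and evaluates the disjunction OR_{i in S} x_i as a threshold-0 halfspace with sgn(0) = "-":

```python
def sgn(v):
    return +1 if v > 0 else -1               # sgn(0) = "-", as defined earlier

def disjunction_accuracy(edges, n, S):
    """Fraction of reduction examples that OR_{i in S} x_i classifies correctly."""
    examples = reduction_examples(edges, n)  # from the earlier sketch
    h = lambda x: sgn(sum(x[i - 1] for i in S))
    return sum(h(x) == label for x, label in examples) / len(examples)

# S = {1, 3} cuts 3 of the 4 example edges, so the predicted accuracy is
# 1/2 + (1/4)(3/4) = 11/16 = 0.6875.
print(disjunction_accuracy([(1, 2), (2, 3), (3, 4), (1, 3)], 4, {1, 3}))
```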
Desired properties of the reduction (recap): If Opt(G) > 17/22, then there is a disjunction agreeing with a 61/88 fraction of the examples (Good Cut => Good Hypothesis, just proved). If some halfspace with threshold 0 has accuracy 60/88, then there is a cut containing a 16/22 fraction of the edges (Good Hypothesis => Good Cut).
Proof of Good Hypothesis => Good Cut: Suppose some halfspace sgn(w_1 x_1 + w_2 x_2 + ... + w_n x_n) has accuracy 60/88. Assign vertex i to a side of the partition according to sgn(w_i). The resulting cut contains at least a 16/22 fraction of the edges. Why?
Good Hypothesis => Good Cut, edge by edge: for edge (1,2), classifying all examples correctly would require (1,0,0,...,0) labeled -: w_1 ≤ 0; (0,1,0,...,0) labeled -: w_2 ≤ 0; (1,1,0,...,0) labeled +: w_1 + w_2 > 0. These cannot all hold, so at most 3 out of 4 examples are ever satisfied, and 3 out of 4 are satisfied only when 1. w_1 > 0, w_2 ≤ 0 or 2. w_2 > 0, w_1 ≤ 0, i.e., only when the edge is cut by the sign partition; otherwise at most 2 out of 4 are satisfied.
To finish the proof: each cut edge contributes at most 3/4 and each uncut edge at most 2/4, so accuracy ≤ 1/2 + (1/4)(cut fraction). Since 60/88 = 1/2 + (1/4)(16/22), accuracy 60/88 forces at least a 16/22 fraction of the edges (i,j) to have w_i, w_j of different signs, i.e., cut ≥ 16/22.
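The reverse direction, as a sketch: read the partition off the weights and apply the counting bound.

```python
def partition_from_halfspace(w):
    """Good Hypothesis => Good Cut: vertex i goes to S iff w_i > 0."""
    return {i + 1 for i, wi in enumerate(w) if wi > 0}

# Each cut edge contributes at most 3/4 and each uncut edge at most 2/4,
# so accuracy <= 1/2 + cut_fraction / 4, which rearranges to:
def cut_fraction_lower_bound(accuracy):
    return 4 * (accuracy - 0.5)

print(partition_from_halfspace((1.0, -2.0, 0.5, 0.0)))  # {1, 3}
print(cut_fraction_lower_bound(60 / 88))                # 16/22, about 0.727
```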
What we proved in this talk: it is NP-hard to tell whether (i) there exists a disjunction (equivalently, a conjunction) consistent with a 61/88 fraction of the data, or (ii) no halfspace with threshold 0 is 60/88-consistent with the data. That is, it is NP-hard to find a 60/88-accurate halfspace with threshold 0 even if there exists a conjunction consistent with 61/88 of the examples.
Main result in the paper: For any ε > 0, it is NP-hard to tell whether (i) there exists a conjunction consistent with a 1-ε fraction of the data, or (ii) no halfspace is (1/2 + ε)-consistent with the data. That is, it is NP-hard to learn a 51%-accurate halfspace even if there exists a conjunction consistent with 99% of the examples.
To get the better hardness factors, we start from a problem called Label Cover, for which it is NP-hard to tell apart i) Opt > 0.99 and ii) Opt < 0.01.
Sketch of the full proof: Label Cover -> "Smooth" Label Cover -> gadget: dictatorship testing -> hardness of learning conjunctions; the analysis uses the Berry-Esseen theorem and the critical index.
Conclusion: Even weak learning of noisy conjunctions by halfspaces is NP-hard. To obtain an efficient halfspace-based learning algorithm for conjunctions, we need to either restrict the distribution of the examples or limit the noise.
Future work: prove that for any ε > 0, given a set of training examples, even if there is a conjunction consistent with a 1-ε fraction of the data, it is NP-hard to find a degree-d polynomial threshold function that is (1/2 + ε)-consistent with the data. Why low-degree PTFs? They correspond to SVMs with a polynomial kernel, and they can be used to learn conjunctions/halfspaces agnostically under the uniform distribution.