2 The Computational Complexity of Searching for Predictive Hypotheses. Shai Ben-David, Computer Science Dept., Technion.

3 Introduction. The complexity of learning is measured mainly along two axes: information and computation. Information complexity enjoys a rich theory that yields rather crisp sample-size and convergence-rate guarantees. The focus of this talk is the computational complexity of learning: while it plays a critical role in any application, its theoretical understanding is far less satisfactory.

4 Outline of this Talk: 1. Some background. 2. A survey of recent pessimistic hardness results. 3. New efficient learning algorithms for some basic learning architectures.

5 The Label Prediction Problem. Formal definition: given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution; for a next point x, predict its label. Example: data files of drivers, where the drivers in a sample are labeled according to whether they filed an insurance claim; will the current customer file a claim?

6 The Agnostic Learning Paradigm. Choose a hypothesis class H of subsets of X. For an input sample S, find some h in H that fits S well. For a new point x, predict a label according to its membership in h.
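
As a minimal sketch of this paradigm (not part of the talk), take H to be one-dimensional threshold functions; the helper names below are illustrative:

```python
# Minimal sketch of the agnostic paradigm; H = 1-D thresholds h_t(x) = 1 iff x >= t.
# (Illustrative only; any small hypothesis class could play the role of H.)

def fit_threshold(sample):
    """Return the threshold t in H whose agreement with the sample S is largest."""
    best_t, best_agree = 0.0, -1
    for t, _ in sample:                      # candidate thresholds taken from the sample
        agree = sum(1 for x, y in sample if (x >= t) == (y == 1))
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t

def predict(t, x):
    """Label a new point by its membership in the chosen hypothesis h_t."""
    return 1 if x >= t else 0

S = [(0.2, 0), (0.4, 0), (0.9, 1), (1.1, 0), (1.5, 1)]   # toy labeled sample
t = fit_threshold(S)
print(t, predict(t, 1.3))
```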

7 The Mathematical Justification. If H is not too rich (has small VC-dimension), then for every h in H the agreement ratio of h on the sample S is a good estimate of its probability of success on a new point x.

8 The Mathematical Justification, Formally. If S is sampled i.i.d. by some distribution D over X × {0, 1}, then with probability > 1 − δ, the agreement ratio of every h in H on S is close to its probability of success on a new example.
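
One standard way to write this guarantee (a reconstruction in the usual VC-bound form; the exact constants and formulation on the original slide may differ) is:

\[
\forall h \in H:\qquad
\underbrace{\Pr_{(x,y)\sim D}\bigl[h(x)=y\bigr]}_{\text{probability of success}}
\;\ge\;
\underbrace{\frac{\bigl|\{(x,y)\in S:\, h(x)=y\}\bigr|}{|S|}}_{\text{agreement ratio}}
\;-\;
c\,\sqrt{\frac{\mathrm{VCdim}(H)+\ln(1/\delta)}{|S|}}
\]

where c is a universal constant.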

9 The Model Selection Issue. (Diagram: the class H, the best regressor for P, and the output of the learning algorithm, with the gaps between them labeled approximation error, estimation error, and computational error.)

10 The Computational Problem. Input: a finite set S of {0, 1}-labeled points in R^n. Output: some 'hypothesis' h in H that maximizes the number of correctly classified points of S.
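
For half-spaces, the quantity being maximized is the agreement count sketched below (the representation of a hypothesis as a pair (w, b) is an assumption of the sketch):

```python
import numpy as np

def agreement(w, b, S):
    """Number of labeled points (x, y), y in {0, 1}, that the half-space
    {x : w.x + b >= 0} classifies correctly."""
    X = np.array([x for x, _ in S], dtype=float)
    y = np.array([label for _, label in S])
    predictions = (X @ np.asarray(w, dtype=float) + b >= 0).astype(int)
    return int((predictions == y).sum())

# Example: three points in R^2, checked against the hyperplane x1 + x2 >= 0
S = [((1.0, 2.0), 1), ((-1.0, 0.5), 0), ((0.5, -2.0), 0)]
print(agreement([1.0, 1.0], 0.0, S))   # -> 3
```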

11 We shall focus on the class of half-spaces (linear separators). Finding the best hyperplane for a separable sample S is feasible (Perceptron algorithms). Finding the best hyperplane for arbitrary samples S is NP-hard. Finding a hyperplane approximating the optimal one for arbitrary S: ?

12 Hardness-of-Approximation Results. For each of the following classes, approximating the best agreement rate for h in H (on a given input sample S) up to some constant ratio is NP-hard: monomials, monotone monomials, half-spaces, balls, axis-aligned rectangles, and threshold NNs with constant first-layer width [BD-Eiron-Long; Bartlett-BD].

13 The SVM Solution. Rather than bothering with non-separable data, make the data separable by embedding it into some high-dimensional R^n.
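
A toy sketch of the embedding idea (the quadratic feature map and the particular separator below are illustrative choices, not the talk's construction): XOR-labeled points are not linearly separable in R^2, but become separable after the embedding.

```python
import numpy as np

def embed(x):
    """Map (x1, x2) into a higher-dimensional feature space via a simple quadratic map."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

X = [(0, 0), (0, 1), (1, 0), (1, 1)]     # XOR-style data: no separating line in R^2
y = [0, 1, 1, 0]

# In the embedded space, the hyperplane w.z + b = 0 below separates the data.
w, b = np.array([1.0, 1.0, 0.0, 0.0, -2.0]), -0.5
print([int(embed(x) @ w + b >= 0) for x in X])   # -> [0, 1, 1, 0], matching y
```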

14 A Problem with the SVM Method. In "most" cases the data cannot be made separable unless the mapping is into dimension Ω(|X|); this happens even for classes of small VC-dimension. For "most" classes, no mapping for which concept-classified data becomes separable has large margins. In all of these cases generalization is lost!

15 Data-Dependent Success. Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H. We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

16 A New Success Criterion. A learning algorithm A is γ-margin successful if, for every input S ⊆ R^n × {0, 1}, |{(x, y) ∈ S : A(S)(x) = y}| ≥ |{(x, y) ∈ S : h(x) = y and d(h, x) ≥ γ}| for every half-space h.
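
As a sketch, the right-hand side of the criterion, the number of points that a given half-space h = (w, b) classifies correctly with margin at least γ, can be computed as follows (Euclidean distance to the hyperplane is assumed):

```python
import numpy as np

def margin_correct_count(w, b, S, gamma):
    """Count points (x, y) in S, y in {0, 1}, that the half-space
    {x : w.x + b >= 0} labels correctly and whose distance to the
    separating hyperplane is at least gamma."""
    w = np.asarray(w, dtype=float)
    count = 0
    for x, y in S:
        signed = (w @ np.asarray(x, dtype=float) + b) / np.linalg.norm(w)
        correctly_labeled = (signed >= 0) == (y == 1)
        if correctly_labeled and abs(signed) >= gamma:
            count += 1
    return count
```

A γ-margin successful learner must output a hypothesis whose plain (margin-free) agreement count on S is at least this quantity for every half-space h.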

17 Some Intuition. If there exists some optimal h which separates with generous margins, then a γ-margin algorithm must produce an optimal separator. On the other hand, if every good separator can be degraded by small perturbations, then a γ-margin algorithm can settle for a hypothesis that is far from optimal.

18 A New Positive Result. For every positive γ, there is an efficient γ-margin algorithm; that is, an algorithm that classifies correctly as many input points as any half-space can classify correctly with margin γ.

19 The positive result: for every positive γ there is a γ-margin algorithm whose running time is polynomial in |S| and n. A Complementing Hardness Result: unless P = NP, no algorithm can do this in time polynomial in 1/γ (as well as in |S| and n).

20 A γ-margin Perceptron Algorithm. On input S, consider all k-size sub-samples. For each such sub-sample, find its largest-margin separating hyperplane. Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S. (The choice of k is a function of the desired margin γ; k ~ 1/γ².)
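
A sketch of the whole procedure in Python (scikit-learn's large-C linear SVC stands in here for the "largest-margin separating hyperplane" sub-routine; that choice and the function names are assumptions of the sketch, not the talk's exact implementation):

```python
from itertools import combinations

import numpy as np
from sklearn.svm import SVC   # large-C linear SVC approximates a maximum-margin separator

def gamma_margin_perceptron(X, y, k):
    """Enumerate all k-point sub-samples, fit a (near) maximum-margin linear
    separator to each, and keep the one classifying the most points of the
    full sample correctly.  Exhaustive over ~|S|^k sub-samples, so only
    sensible for the small k dictated by the desired margin."""
    best_clf, best_agree = None, -1
    for idx in combinations(range(len(X)), k):
        X_sub, y_sub = X[list(idx)], y[list(idx)]
        if len(set(y_sub)) < 2:               # the SVC needs both labels present
            continue
        clf = SVC(kernel="linear", C=1e6).fit(X_sub, y_sub)
        agree = int((clf.predict(X) == y).sum())
        if agree > best_agree:
            best_clf, best_agree = clf, agree
    return best_clf, best_agree

# Toy usage on a roughly linearly separable sample
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf, agree = gamma_margin_perceptron(X, y, k=3)
print(agree, "of", len(X), "points classified correctly")
```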

21 Other γ-margin Algorithms. Each of the following algorithms can replace the "find the largest-margin separating hyperplane" step: the usual Perceptron Algorithm; "find a point of equal distance from x_1, …, x_k"; Phil Long's ROMMA algorithm. These are all very fast online algorithms.
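
For reference, the first of these sub-routines, the classic Perceptron run on a sub-sample, might look like the sketch below (labels in {0, 1}; the homogeneous form and the iteration cap are additions for the sketch):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Classic Perceptron in homogeneous form: cycle through the points and add or
    subtract any misclassified point to the weight vector until none remain."""
    X = np.asarray(X, dtype=float)
    signs = 2 * np.asarray(y) - 1             # map labels {0, 1} -> {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, si in zip(X, signs):
            if si * (w @ xi) <= 0:            # misclassified (or on the boundary)
                w += si * xi
                updated = True
        if not updated:                       # converged; a separating w was found
            break
    return w
```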

22 Directions for Further Research. Can similar efficient algorithms be derived for more complex NN architectures? How well do the new algorithms perform on real data sets? Can the 'local approximation' results be extended to more geometric functions?

