
1 The Apparent Tradeoff between Computational Complexity and Generalization of Learning: A Biased Survey of our Current Knowledge
Shai Ben-David, Technion, Haifa, Israel

2 Introduction
The complexity of learning is measured mainly along two axes: information and computation.
Information complexity is concerned with the generalization performance of learning: how many training examples are needed, and what is the convergence rate of a learner's estimate to the true population parameters?
Computational complexity is concerned with the computation applied to the training data in order to deduce from it the learner's hypothesis.
It seems that when an algorithm improves with respect to one of these measures, it deteriorates with respect to the other.

3 Outline of this Talk
1. Some background.
2. A survey of recent pessimistic computational hardness results.
3. A discussion of three different directions for solutions:
   a. The Support Vector Machines approach.
   b. The Boosting approach (an agnostic learning variant).
   c. Algorithms that are efficient for 'well behaved' inputs.

4 The Label Prediction Problem
Formal definition: Given some domain set X, a sample S of labeled members of X is generated by some (unknown) distribution. For a new point x, predict its label.
Example: Data is extracted from grant applications; should the current application be funded? Applications in the sample are labeled by the success/failure of the resulting projects.

5 Two Basic Competing Models
PAC framework: Sample labels are consistent with some h in H. The learner's hypothesis is required to meet an absolute upper bound on its error.
Agnostic framework: No prior restriction on the sample labels. The required upper bound on the hypothesis error is only relative (to the best hypothesis in the class).

6 Basic Agnostic Learning Paradigm
- Choose a hypothesis class H of subsets of X.
- For an input sample S, find some h in H that fits S well.
- For a new point x, predict a label according to its membership in h.
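A minimal sketch of this paradigm on a toy class: H = 1-D threshold predictors h_{t,s}(x) = 1[s·(x - t) >= 0]. The class, data, and names below are illustrative assumptions, not taken from the talk.

```python
# Agnostic ERM over a toy hypothesis class of 1-D thresholds.
import numpy as np

def erm_threshold(xs, ys):
    """Return (agreement, t, s) for the threshold hypothesis in H that
    agrees with the largest fraction of the labeled sample (xs, ys)."""
    best = (-1.0, None, None)
    for t in np.concatenate(([-np.inf], np.sort(xs))):
        for s in (+1, -1):                       # threshold or its mirror image
            preds = (s * (xs - t) >= 0).astype(int)
            agree = float(np.mean(preds == ys))
            if agree > best[0]:
                best = (agree, t, s)
    return best

# Fit on a (noisy) sample; label a new point by membership in the chosen h.
xs = np.array([0.10, 0.35, 0.40, 0.70, 0.80, 0.90])
ys = np.array([0,    1,    0,    0,    1,    1   ])   # no h in H fits perfectly
agree, t, s = erm_threshold(xs, ys)
predict = lambda x: int(s * (x - t) >= 0)
print(agree, predict(0.95))
```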

7 The Mathematical Justification
Assume both the training sample and the test point are generated by the same distribution over X × {0,1}. Then, if H is not too rich (e.g., has small VC-dimension), for every h in H the agreement ratio of h on the sample S is a good estimate of its probability of success on a new point x.
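One standard way to make "good estimate" quantitative is a uniform-convergence (VC-dimension) bound of the following form; the exact constants vary across sources, so this particular statement is an assumption rather than a quote from the slide.

```latex
% With probability at least 1-\delta over an i.i.d. sample S of size m from D,
% simultaneously for every h \in H, where d = \mathrm{VCdim}(H):
\Pr_{(x,y)\sim D}\bigl[h(x)\neq y\bigr]
  \;\le\;
\widehat{\mathrm{Er}}_S(h)
  \;+\;
\sqrt{\frac{8\left(d\,\ln\tfrac{2em}{d}+\ln\tfrac{4}{\delta}\right)}{m}}.
```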

8 The Computational Problem
Input: A finite set S of {0,1}-labeled points in R^n.
Output: Some 'hypothesis' function h in H that maximizes the number of correctly classified points of S.

9 Half-spaces
We shall focus on the class of linear half-spaces.
- Find the best hyperplane for a separable sample S: feasible (Perceptron algorithms).
- Find the best hyperplane for arbitrary samples S: NP-hard.
- Find a hyperplane approximating the optimal one for arbitrary S: ?
(A Perceptron sketch for the separable case appears below.)
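A minimal sketch of the Perceptron algorithm for the separable case; the loop bound and data layout are assumptions for illustration.

```python
# Classic Perceptron: converges only if some hyperplane separates the sample.
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: (m, n) points, y: labels in {-1, +1}. Returns (w, b) or None."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (X[i] @ w + b) <= 0:       # misclassified point
                w += y[i] * X[i]                  # standard additive update
                b += y[i]
                mistakes += 1
        if mistakes == 0:                         # separates the whole sample
            return w, b
    return None                                   # gave up (possibly non-separable)
```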

10 Hardness-of-Approximation Results
For each of the following classes, approximating the best agreement rate achievable by h in H (on a given input sample S) up to some constant ratio is NP-hard:
- Monomials
- Monotone Monomials
- Half-spaces
- Balls
- Axis-aligned Rectangles
- Threshold NNs with constant first-layer width
(Results due to Ben-David-Eiron-Long and Bartlett-Ben-David.)

11 Gaps in Our Knowledge
- The additive constants in the hardness-of-approximation results are 1%-2%. They do not rule out efficient algorithms achieving, say, 90% of the optimal success rate.
- However, currently there are no efficient algorithms performing significantly above 50% of the optimal success rate.

12 We shall discuss three solution paradigms
- Kernel-based methods (including Support Vector Machines).
- Boosting (adapted to the agnostic setting).
- Data Dependent Success Approximation algorithms.

13 The Types of Errors to be Considered
(Diagram: the gap between the output of the learning algorithm and the best regressor for D decomposes into the approximation error of the class H, the estimation error, and the computational error.)
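In symbols, the decomposition the diagram depicts can be written as follows; the notation f*, h*, h_ERM, h_alg is mine, not the slide's.

```latex
% f^* = best regressor for D,  h^* = best hypothesis in H,
% h_{ERM} = empirical error minimizer in H,  h_{alg} = the algorithm's output.
\mathrm{Er}_D(h_{\mathrm{alg}}) - \mathrm{Er}_D(f^{*})
 \;=\;
 \underbrace{\mathrm{Er}_D(h^{*}) - \mathrm{Er}_D(f^{*})}_{\text{approximation error}}
 \;+\;
 \underbrace{\mathrm{Er}_D(h_{\mathrm{ERM}}) - \mathrm{Er}_D(h^{*})}_{\text{estimation error}}
 \;+\;
 \underbrace{\mathrm{Er}_D(h_{\mathrm{alg}}) - \mathrm{Er}_D(h_{\mathrm{ERM}})}_{\text{computational error}}.
```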

14 The Boosting Solution: Basic Idea
"Extend the concept class as much as possible without hurting generalizability."

15 The Boosting Idea
Given a hypothesis class H and a labeled sample S, rather than searching for a good hypothesis in H, search in a larger class Co(H) (the convex hull of H, i.e., weighted votes over hypotheses from H).
Important gains:
1) A fine approximation can be found in Co(H) in time polynomial in the time of finding a coarse approximation in H.
2) The generalization bounds do not deteriorate when moving from H to Co(H).

16 Boosting Solution: Weak Learners
An algorithm is a γ-weak learner for a class H if, on every H-labeled weighted sample S, it outputs some h in H so that Er_S(h) < 1/2 - γ.

17 Boosting Solution: the Basic Result
Theorem [Schapire '89, Freund '90]: There is an algorithm that, having access to an efficient γ-weak learner, for a P-random H-sample S and parameters ε and δ, finds some h in Co(H) so that Er_P(h) < ε with probability at least 1 - δ, in time polynomial in 1/ε and 1/δ (and |S|).
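For concreteness, here is a sketch in the AdaBoost style, with 1-D decision stumps standing in for the γ-weak learner. This is one well-known instantiation of the boosting idea, not necessarily the algorithm the theorem refers to; the stump class and parameter choices are illustrative assumptions.

```python
# AdaBoost-style boosting over 1-D decision stumps; the final predictor is a
# weighted vote, i.e., an element of Co(H).
import numpy as np

def stump_weak_learner(xs, ys, w):
    """Return (weighted error, t, s) of the best stump x -> sign(s*(x-t))."""
    best = (np.inf, None, None)
    for t in xs:
        for s in (+1, -1):
            preds = np.where(s * (xs - t) >= 0, 1, -1)
            err = float(np.sum(w[preds != ys]))
            if err < best[0]:
                best = (err, t, s)
    return best

def boost(xs, ys, rounds=50):
    """ys in {-1, +1}. Returns a list of (alpha, t, s) weighted votes."""
    m = len(xs)
    w = np.full(m, 1.0 / m)
    ensemble = []
    for _ in range(rounds):
        err, t, s = stump_weak_learner(xs, ys, w)
        err = max(err, 1e-12)                      # avoid division by zero
        if err >= 0.5:                             # no weak edge left
            break
        alpha = 0.5 * np.log((1 - err) / err)
        preds = np.where(s * (xs - t) >= 0, 1, -1)
        w *= np.exp(-alpha * ys * preds)           # reweight the sample
        w /= w.sum()
        ensemble.append((alpha, t, s))
    return ensemble

def predict(ensemble, x):
    vote = sum(a * (1 if s * (x - t) >= 0 else -1) for a, t, s in ensemble)
    return 1 if vote >= 0 else -1
```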

18 The Boosting Solution in Practice
The boosting approach was embraced by practitioners of Machine Learning and applied, quite successfully, to a wide variety of real-life problems.

19 Theoretical Problems with the Boosting Solution
- The boosting results assume that the input sample labeling is consistent with some function in H (the PAC framework assumption). In practice this is never the case.
- The boosting algorithm's success is based on having access to an efficient weak learner; no such learner exists.

20 Boosting Theory: Attempt to Recover
Can one settle for weaker, realistic assumptions?
Agnostic weak learners: an algorithm is a β-weak agnostic learner for H if, for every labeled sample S, it finds h in H s.t. Er_S(h) < Er_S(Opt(H)) + β.

21 Revised Boosting Solution
Theorem [Ben-David, Long, Mansour]: There is an algorithm that, having access to a β-weak agnostic learner, computes an h in Co(H) s.t. Er_P(h) < c · (Er_P(Opt(H)))^{c'} (where c and c' are constants depending on β).

22 Problems with the Boosting Solution
- Only for a restricted family of classes are there known efficient agnostic weak learners.
- The generalization bound we currently have contains an annoying exponentiation of the optimal error. Can this be improved?

23 The SVM Solution
"Extend the hypothesis class to guarantee computational feasibility."
Rather than bothering with non-separable data, make the data separable by embedding it into some high-dimensional R^n.

24 The SVM Solution

25 The SVM Paradigm
- Choose an embedding of the domain X into some high-dimensional Euclidean space, so that the data sample becomes (almost) linearly separable.
- Find a large-margin data-separating hyperplane in this image space, and use it for prediction.
Important gain: when the data is separable, finding such a hyperplane is computationally feasible.
(A small embedding example appears below.)
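A tiny illustration of the embedding step, using an explicit feature map rather than a kernel; the map and the data are assumptions for illustration, and a real SVM would also maximize the margin in the image space.

```python
# Data that is not linearly separable in R^2 (inner disk vs. outer ring)
# becomes linearly separable after an explicit map into R^3.
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),    # label -1: inner disk
                        rng.uniform(2.0, 3.0, 100)])   # label +1: outer ring
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = np.r_[-np.ones(100), np.ones(100)]

def embed(X):
    """phi(x1, x2) = (x1, x2, x1^2 + x2^2): in the image space the two rings
    are separated by a hyperplane of the form x3 = const."""
    return np.c_[X, (X ** 2).sum(axis=1)]

Z = embed(X)
preds = np.where(Z[:, 2] > 2.0, 1, -1)                 # a linear separator in R^3
print("agreement after embedding:", np.mean(preds == y))   # 1.0
```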

26 The SVM Solution in Practice
The SVM approach is embraced by practitioners of Machine Learning and applied, very successfully, to a wide variety of real-life problems.

27 A Potential Problem: Generalization
- VC-dimension bounds: the VC-dimension of the class of half-spaces in R^n is n+1. Can we guarantee low dimension of the embedding's range?
- Margin bounds: regardless of the Euclidean dimension, generalization can be bounded as a function of the margins of the hypothesis hyperplane. Can one guarantee the existence of a large-margin separation?
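A common form of the margin bound referred to here, stated up to constants; this particular form is an assumption on my part, not the slide's statement.

```latex
% With probability at least 1-\delta over a sample of size m whose points have
% norm at most R, every hyperplane h that classifies with margin \gamma satisfies
\mathrm{Er}_P(h)
  \;\le\;
\widehat{\mathrm{Er}}^{\,\gamma}_S(h)
  \;+\;
O\!\left(\sqrt{\frac{(R/\gamma)^{2}\,\log^{2} m \;+\; \log(1/\delta)}{m}}\right),
% independently of the Euclidean dimension of the image space.
```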

28 An Inherent Limitation of SVMs
- In "most" cases the data cannot be made separable unless the mapping is into dimension Ω(|X|). This happens even for classes of small VC-dimension.
- For "most" classes, no mapping for which concept-classified data becomes separable has large margins.
In both cases the generalization bounds are lost!

29 A Third Proposal for Solution: Data-Dependent Success Approximations
- Note that the definition of success for agnostic learning is data-dependent: the success rate of the learner on S is compared to that of the best h in H.
- We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

30 Data-Dependent Success Approximations
- While Boosting, as well as kernel-based methods, extend the class from which the algorithm picks its hypothesis, there is a natural alternative for circumventing the hardness-of-approximation results: shrinking the comparison class.
- Our DDSA algorithms do this by imposing margins on the comparison-class hypotheses.

31 Data-Dependent Success Definition for Half-spaces
A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0,1},
|{(x,y) ∈ S : A(S)(x) = y}| > |{(x,y) ∈ S : h(x) = y and d(h,x) > μ}|
for every half-space h.

32 Some Intuition
- If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator.
- On the other hand, if every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

33 A Positive Result and a Complementing Hardness Result
- The positive result: for every positive μ, there is a μ-margin algorithm whose running time is polynomial in |S| and n.
- A complementing hardness result: unless P = NP, no algorithm can do this in time polynomial in 1/μ (and in |S| and n).

34 Some Obvious Open Questions
- Is there a parameter that can be used to ensure good generalization for kernel-based (SVM-like) methods?
- Are there efficient agnostic weak learners for potent hypothesis classes?
- Is there an inherent trade-off between the generalization ability and the computational complexity of algorithms?

35 THE END

36 "Old" Work
- Hardness results: Blum and Rivest showed that it is NP-hard to optimize the weights of a 3-node NN. Similar hardness-of-optimization results for other classes followed. But learning can settle for less than optimization.
- Efficient algorithms: known Perceptron algorithms are efficient for linearly separable input data (or the image of such data under 'tamed' noise). But natural data sets are usually not separable.

37 A μ-margin Perceptron Algorithm
- On input S, consider all k-size sub-samples.
- For each such sub-sample, find its largest-margin separating hyperplane.
- Among all of the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.
(The choice of k is a function of the desired margin μ.)
(A rough sketch of this procedure appears below.)
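A rough sketch of this procedure under some assumptions: the max-margin step is approximated with a hard-margin-like linear SVM (an implementation convenience, not the talk's method), and the data layout and names are hypothetical. For fixed k the running time is on the order of |S| choose k SVM fits, i.e., polynomial in |S|, matching the slide's claim.

```python
# Enumerate k-point sub-samples, fit an (approximately) max-margin separator
# on each, and keep the hyperplane with the best agreement on all of S.
from itertools import combinations

import numpy as np
from sklearn.svm import SVC

def margin_perceptron(X, y, k):
    """X: (m, n) points, y: labels in {-1, +1}, k: sub-sample size
    (chosen as a function of the desired margin)."""
    best_agree, best_clf = -1.0, None
    for idx in combinations(range(len(X)), k):
        Xs, ys = X[list(idx)], y[list(idx)]
        if len(set(ys)) < 2:                  # SVC needs both labels present
            continue
        clf = SVC(kernel="linear", C=1e6)     # large C ~ hard (max) margin
        clf.fit(Xs, ys)
        agree = float(np.mean(clf.predict(X) == y))
        if agree > best_agree:
            best_agree, best_clf = agree, clf
    return best_agree, best_clf
```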

38 Other μ-margin Algorithms
Each of the following can replace the "find the largest-margin separating hyperplane" step:
- The usual Perceptron Algorithm.
- "Find a point of equal distance from x_1, ..., x_k."
- Phil Long's ROMMA algorithm.
These are all very fast online algorithms.

