
1 T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory Michael Pfeiffer pfeiffer@igi.tugraz.at 25.11.2004

2 Motivation. Supervised learning: learn functional relationships from a finite set of labelled training examples. Generalization: how well does the learned function perform on unseen test examples? This is the central question in supervised learning.

3 What you will hear. New idea: stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much. The conditions for generalization are placed on the learning map rather than on the hypothesis space, in contrast to VC-analysis.

4 Agenda Introduction Problem Definition Classical Results Stability Criteria Conclusion

5 Some Definitions 1/2. Training data: S = {z_1 = (x_1, y_1), ..., z_n = (x_n, y_n)} ⊂ Z = X × Y, drawn from an unknown distribution μ(x, y). Hypothesis space: H. Hypothesis f_S ∈ H, f_S: X → Y. Learning algorithm L: S ↦ f_S; in regression f_S is real-valued, in classification f_S is binary. L is assumed to be a symmetric learning algorithm (the ordering of the training examples is irrelevant).

6 Some Definitions 2/2. Loss function: V(f, z), e.g. the square loss V(f, z) = (f(x) − y)². Assume that V is bounded. Empirical error (training error): I_S[f] = (1/n) Σ_{i=1}^{n} V(f, z_i). Expected error (true error): I[f] = E_{z∼μ}[V(f, z)].
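
A minimal numpy sketch of these two quantities for the square loss, assuming a hypothetical fixed linear hypothesis f(x) = w·x; the expected error is only approximated by a large held-out Monte-Carlo sample, since μ is unknown in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w=np.array([1.5, -0.5])):
    """A hypothetical fixed hypothesis: a linear function of the input."""
    return x @ w

def square_loss(fx, y):
    return (fx - y) ** 2

def sample(n):
    """Draw n examples from an (otherwise unknown) distribution mu(x, y)."""
    x = rng.normal(size=(n, 2))
    y = x @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n)
    return x, y

# Empirical error I_S[f]: average loss over the n training examples.
x_train, y_train = sample(n=50)
I_S = np.mean(square_loss(f(x_train), y_train))

# Expected error I[f]: here only estimated by Monte Carlo on a large fresh sample.
x_big, y_big = sample(n=200_000)
I_true = np.mean(square_loss(f(x_big), y_big))

print(f"empirical error I_S[f] = {I_S:.3f}, estimated expected error I[f] = {I_true:.3f}")
```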

7 Generalization and Consistency (both as convergence in probability). Generalization: performance on the training examples must be a good indicator of performance on future examples. Consistency: the expected error of f_S converges to the most accurate one achievable in H.
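
Written out in the notation introduced above; this is a hedged reconstruction of the slide's formulas, following the standard formulation:

```latex
% Generalization: the empirical error tracks the expected error.
\lim_{n \to \infty} \Pr\bigl( \, \lvert I_S[f_S] - I[f_S] \rvert > \varepsilon \, \bigr) = 0
\quad \text{for every } \varepsilon > 0 .

% Consistency: the expected error approaches the best achievable in H.
\lim_{n \to \infty} \Pr\bigl( \, I[f_S] - \inf_{f \in H} I[f] > \varepsilon \, \bigr) = 0
\quad \text{for every } \varepsilon > 0 .
```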

8 Agenda Introduction Problem Definition Classical Results Stability Criteria Conclusion

9 Empirical Risk Minimization (ERM). The focus of classical learning-theory research (exact and almost-exact ERM). Minimize the training error over H, i.e. take the best hypothesis on the training data: f_S = argmin_{f ∈ H} I_S[f]. For ERM: generalization ⇔ consistency.
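
A toy sketch of exact ERM over a hypothetical finite hypothesis class of one-dimensional threshold classifiers: the algorithm simply returns the hypothesis with the smallest training error. All names and the data-generating setup are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data: x in [0, 1], labels given by a threshold at 0.6 with 10% label noise.
x = rng.uniform(size=30)
y = np.where(x > 0.6, 1, -1) * np.where(rng.uniform(size=30) < 0.9, 1, -1)

# Finite hypothesis class H: threshold functions f_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_error(t):
    """0-1 training error I_S[f_t] of the threshold hypothesis f_t on (x, y)."""
    return np.mean(np.sign(x - t) != y)

# ERM: pick the hypothesis in H that minimizes the training error.
t_erm = min(thresholds, key=empirical_error)
print(f"ERM threshold: {t_erm:.2f}, training error: {empirical_error(t_erm):.2f}")
```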

10 Which algorithms are ERM? All of these belong to the class of ERM algorithms: least squares regression, decision trees, ANN backpropagation (?), ... Are all learning algorithms ERM? No! Non-ERM examples: support vector machines, k-nearest neighbour, bagging, boosting, regularization, ...

11 Vapnik asked What property must the hypothesis space H have to ensure good generalization of ERM?

12 Classical Results for ERM [1]. Theorem: a necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class, i.e. the empirical means of the loss functions induced by H and V converge uniformly in probability to their true expected values. [1] e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence, and learnability, Journal of the ACM 44, 1997
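
The uGC property, written in the notation above; a hedged sketch of the standard formulation, uniform over all distributions μ:

```latex
% H is a uniform Glivenko-Cantelli class (w.r.t. the loss V) if, for every eps > 0,
\lim_{n \to \infty} \; \sup_{\mu} \;
\Pr\Bigl( \, \sup_{f \in H} \bigl\lvert I_S[f] - I[f] \bigr\rvert > \varepsilon \, \Bigr) = 0 .
```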

13 VC-Dimension. For binary functions f: X → {0, 1}, VC-dim(H) is the size of the largest finite set in X that can be shattered by H; e.g. linear separation in 2D yields VC-dim = 3. Theorem: let H be a class of binary-valued hypotheses; then H is a uGC class if and only if VC-dim(H) is finite [1]. [1] Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence, and learnability, Journal of the ACM 44, 1997
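
A small numpy sketch of the 2D example: it brute-forces all 2³ labelings of three non-collinear points and checks, with a plain perceptron (which converges exactly when a labeling is linearly separable), that every labeling can be realized. The helper name perceptron_separates is mine.

```python
import itertools
import numpy as np

def perceptron_separates(X, y, max_epochs=1000):
    """Try to find w, b with sign(X @ w + b) == y via the perceptron rule.
    The perceptron converges iff the labeling is linearly separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:      # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:
            return True                     # all points correctly separated
    return False                            # heuristic: no separator found in budget

# Three non-collinear points in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Check every one of the 2^3 = 8 labelings.
shattered = all(perceptron_separates(X, np.array(y))
                for y in itertools.product([-1, 1], repeat=3))
print("3 points shattered by linear separators:", shattered)   # expect True
```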

14 Achievements of Classical Learning Theory: a complete characterization of the necessary and sufficient conditions for generalization and consistency of ERM. Remaining questions: What about non-ERM algorithms? Can we establish criteria not only for the hypothesis space?

15 Agenda Introduction Problem Definition Classical Results Stability Criteria Conclusion

16 Poggio et al. asked: What property must the learning map L have for good generalization of general algorithms? Can a new theory subsume the classical results for ERM?

17 Stability. Small perturbations of the training set should not change the hypothesis much; in particular, deleting one training example, S^i = S \ {z_i}. How can this be defined mathematically? (Diagram: the learning map takes the original training set S and the perturbed training set S^i to hypotheses in the hypothesis space.)

18 Uniform Stability [1]. A learning algorithm L is uniformly stable if, after deleting one training example, the change in loss is small at every point z ∈ Z (see the formula below). Uniform stability implies generalization, but the requirement is too strong: most algorithms (e.g. ERM) are not uniformly stable. [1] Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001
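
A hedged reconstruction of the missing formula, following Bousquet and Elisseeff's definition of uniform stability with rate β_n:

```latex
% L is beta_n-uniformly stable if, for all training sets S, all i, and all z in Z,
\sup_{z \in Z} \; \bigl\lvert V(f_S, z) - V(f_{S^i}, z) \bigr\rvert \;\le\; \beta_n ,
\qquad \text{with } \beta_n \to 0 \text{ as } n \to \infty .
```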

19 CV_loo Stability [1]. Cross-validation leave-one-out stability considers only the errors at the removed training points: remove z_i and look at the change of the loss at x_i. It is strictly weaker than uniform stability. [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
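
A minimal numpy sketch of the leave-one-out perturbation S^i and the quantity CV_loo stability looks at, |V(f_S, z_i) − V(f_{S^i}, z_i)|, for a regularized least-squares (ridge) learner; the learner and all helper names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, y, lam=1.0):
    """Regularized least squares: w = argmin ||Xw - y||^2 + lam ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def square_loss(w, x, y):
    return (x @ w - y) ** 2

# Training set S of size n.
n, d = 40, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=n)

w_S = ridge_fit(X, y)

# For each i: train on S^i = S \ {z_i} and compare the loss at the removed point z_i.
changes = []
for i in range(n):
    X_i = np.delete(X, i, axis=0)          # perturbed training set S^i
    y_i = np.delete(y, i)
    w_Si = ridge_fit(X_i, y_i)
    changes.append(abs(square_loss(w_S, X[i], y[i]) - square_loss(w_Si, X[i], y[i])))

print(f"max leave-one-out loss change over i: {max(changes):.4f}")
```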

20 Equivalence for ERM [1]. Theorem: for good loss functions, the following statements are equivalent for ERM: (i) L is distribution-independent CV_loo stable; (ii) ERM generalizes and is universally consistent; (iii) H is a uGC class. Question: does CV_loo stability ensure generalization for all learning algorithms? [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

21 CV_loo Counterexample [1]. Let X be uniform on [0, 1] and Y ∈ {−1, +1}, with target f*(x) = 1. The learning algorithm L is constructed so that there is no change at the removed training point, hence it is CV_loo stable, yet the algorithm does not generalize at all. [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

22 Additional Stability Criteria. Error (E_loo) stability and empirical error (EE_loo) stability (see the definitions sketched below) are weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM), but they are not sufficient for generalization on their own.
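
A hedged reconstruction of the two missing definitions, following the companion memo, both understood as convergence in probability as n → ∞:

```latex
% Error (E_loo) stability: removing one point barely changes the expected error.
\bigl\lvert I[f_S] - I[f_{S^i}] \bigr\rvert \;\xrightarrow{\;P\;}\; 0 .

% Empirical error (EE_loo) stability: it barely changes the empirical error either.
\bigl\lvert I_S[f_S] - I_{S^i}[f_{S^i}] \bigr\rvert \;\xrightarrow{\;P\;}\; 0 .
```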

23 CVEEE_loo Stability. A learning map L is CVEEE_loo stable if it is CV_loo stable, E_loo stable, and EE_loo stable. Question: does this imply generalization for all L?

24 CVEEE_loo implies Generalization [1]. Theorem: if L is CVEEE_loo stable and the loss function is bounded, then f_S generalizes. Remarks: none of the conditions (CV_loo, E_loo, EE_loo) is sufficient by itself; E_loo and EE_loo stability are not sufficient; for ERM, CV_loo stability alone is necessary and sufficient for generalization and consistency. [1] Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

25 Consistency. CVEEE_loo stability in general does NOT guarantee consistency. Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance.

26 CVEEE_loo stable algorithms: support vector machines and regularization; k-nearest neighbour (with k increasing with n); bagging (with the number of regressors increasing with n); more results to come (e.g. AdaBoost). For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN). For all of these algorithms, generalization is guaranteed by the theorems shown.
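
As a rough illustration of the k-NN entry (not the paper's proof), a numpy sketch that empirically probes the leave-one-out change of the k-NN prediction at the removed point, comparing a fixed k with a k that grows with n; the setup and all names are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_predict(x0, X, y, k):
    """Plain k-nearest-neighbour regression: average of the k closest targets."""
    idx = np.argsort(np.abs(X - x0))[:k]
    return y[idx].mean()

def mean_loo_change(n, k):
    """Average |f_S(x_i) - f_{S^i}(x_i)| over all leave-one-out removals."""
    X = rng.uniform(size=n)
    y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=n)
    changes = []
    for i in range(n):
        full = knn_predict(X[i], X, y, k)
        loo = knn_predict(X[i], np.delete(X, i), np.delete(y, i), k)
        changes.append(abs(full - loo))
    return np.mean(changes)

for n in (100, 400, 1600):
    k_grow = int(np.sqrt(n))     # k growing with n, e.g. k ~ sqrt(n)
    print(f"n={n:5d}  k=1: {mean_loo_change(n, 1):.3f}   "
          f"k={k_grow}: {mean_loo_change(n, k_grow):.3f}")
```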

27 Agenda Introduction Problem Definition Classical Results Stability Criteria Conclusion

28 Implications. Classical "VC-style" conditions correspond to Occam's Razor: prefer simple hypotheses. CV_loo stability corresponds to incremental change, as in online algorithms. In inverse problems, stability corresponds to well-posedness, and condition numbers characterize stability. Stability-based learning may have more direct connections with the brain's learning mechanisms, since the condition is on the learning machinery.

29 Language Learning. Goal: learn grammars from sentences. Hypothesis space: the class of all learnable grammars. What is easier to characterize and gives more insight into real language learning: the language learning algorithm, or the class of all learnable grammars? A focus on algorithms shifts the focus to stability.

30 Conclusion. Stability implies generalization, with intuitive (CV_loo) and technical (E_loo, EE_loo) criteria. The theory subsumes the classical ERM results and provides generalization criteria also for non-ERM algorithms. The restrictions are on the learning map rather than on the hypothesis space. This suggests a new approach for designing learning algorithms.

31 Open Questions: easier or other necessary and sufficient conditions for generalization; conditions for general consistency; tight bounds on sample complexity; applications of the theory to new algorithms; stability proofs for existing algorithms.

32 Thank you!

33 Sources
T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature Vol. 428, pp. 419-422, 2004
S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
T. Mitchell: Machine Learning, McGraw-Hill, 1997
C. Tomasi: Past performance and future results, Nature Vol. 428, p. 378, 2004
N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997

