Bayesian regularization of learning Sergey Shumsky NeurOK Software LLC.


2 Bayesian regularization of learning Sergey Shumsky NeurOK Software LLC

3 Scientific methods Induction (F. Bacon): from data to models (machine learning). Deduction (R. Descartes): from models to data (mathematical modeling).

4 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

5 Outline  Learning as ill-posed problem  General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

6 Problem statement Learning is an inverse, ill-posed problem: model ← data. Learning paradoxes: infinite predictions from finite data? How to optimize future predictions? How to separate the regular from the accidental in data? Regularization of learning: optimal model complexity.

7 Well-posed problem Solution is unique; solution is stable. Hadamard (1900s), Tikhonoff (1960s).

8 Learning from examples Problem: find the hypothesis h that generates the observed data D within model H. Well-posed if not sensitive to: noise in the data (Hadamard); the learning procedure (Tikhonoff).

9 Learning is an ill-posed problem Example: function approximation. Sensitive to noise in the data; sensitive to the learning procedure.

10 Learning is an ill-posed problem The solution is non-unique.

11 Outline  Learning as ill-posed problem General problem: data generalization  General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

12 Problem regularization Main idea: restrict the set of solutions, sacrificing precision for stability. How to choose?

13 Statistical learning practice Data → learning set + validation set. Cross-validation. A systematic approach to ensembles → Bayes.

14 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization  Bayesian regularization. Theory  Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

15 Statistical learning theory Learning as inverse probability. Probability theory (Bernoulli, 1713): H: h → D. Learning theory (Bayes, ~1750): H: h ← D.

16 Bayesian learning Evidence Prior Posterior
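The formula on this slide was an image; its labels (evidence, prior, posterior) indicate Bayes' rule. In the notation of the surrounding slides, with data D, hypothesis h and model H:

P(h | D, H) = P(D | h, H) P(h | H) / P(D | H),   where the evidence is P(D | H) = Σ_h P(D | h, H) P(h | H),

P(h | H) is the prior and P(h | D, H) is the posterior.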

17 Coin tossing game

18 Monte Carlo simulations
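The coin-tossing and Monte Carlo slides carried only figures. Below is a minimal illustrative sketch (not the author's original code) of such an experiment: tosses of a biased coin are simulated and the Bayesian posterior over the unknown bias is tracked on a grid of hypotheses; the bias value, the grid, and the flat prior are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

true_bias = 0.7                       # assumed coin bias for the simulation
n_tosses = 200
grid = np.linspace(0.01, 0.99, 99)    # candidate hypotheses h = P(heads)
log_post = np.zeros_like(grid)        # flat prior: log P(h) = const

tosses = rng.random(n_tosses) < true_bias
for heads in tosses:
    # Bayesian update: multiply the posterior by the likelihood of this toss
    log_post += np.log(grid if heads else 1.0 - grid)

post = np.exp(log_post - log_post.max())
post /= post.sum()

print("posterior mean of the bias:", np.sum(grid * post))
print("most probable bias        :", grid[np.argmax(post)])
```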

19 Bayesian regularization Most probable hypothesis → learning error + regularization. Example: function approximation.
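Writing out the decomposition the arrow refers to (the slide's own formula was an image): the most probable hypothesis is

h* = argmax_h P(h | D) = argmin_h [ -log P(D | h) - log P(h) ],

a learning-error term plus a regularization term supplied by the prior; measured in bits, these are exactly the two description lengths of the next slide.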

20 Minimal Description Length (Rissanen, 1978) Most probable hypothesis = shortest code length for: data + hypothesis. Example: optimal prefix code 01 11 10 110 111.

21 Data complexity (Kolmogoroff, 1965) Complexity: K(D|H) = min_h L(h, D|H). Code length: L(h, D) = L(D|h) (coded data) + L(h) (decoding program).

22 Complex = unpredictable (Solomonoff, 1978) Prediction error ~ L(h, D) / L(D). Random data is incompressible; compression = predictability. Example: block coding.

23 Universal prior (Solomonoff, 1960; Bayes, ~1750) All 2^L programs of length L are equiprobable. Data complexity.
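Spelling out the universal prior (the formula on the slide was an image): if all 2^L programs of length L are equiprobable, a hypothesis with shortest program length L(h) receives prior weight P(h) ∝ 2^{-L(h)}, so by Bayes' rule -log₂ P(h | D) coincides, up to a constant, with the total description length L(h, D).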

24 Statistical ensemble Shorter description length. Proof: Corollary: ensemble predictions are superior to the most probable prediction.

25 Ensemble prediction

26 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization  Bayesian regularization. Theory Hypothesis comparison  Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

27 Model comparison Evidence Posterior
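As on slide 16, the formulas here were images; the labels suggest Bayes' rule one level up, applied to whole models. A standard form is

P(H | D) ∝ P(D | H) P(H),   with evidence P(D | H) = Σ_h P(D | h, H) P(h | H),

so the evidence sums the likelihood over every hypothesis the model contains, and models that spread their probability over too many hypotheses are penalized automatically.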

28 Statistics: Bayes vs. Fisher Fisher: max Likelihood Bayes: max Evidence

29 Historical outlook 1920s-1960s: parametric statistics, asymptotic N → ∞ (Fisher, 1912). 1960s-1980s: non-parametric statistics (Chentsoff, 1962), regularization of ill-posed problems (Tikhonoff, 1963), non-asymptotic learning (Vapnik, 1968), algorithmic complexity (Kolmogoroff, 1965), statistical physics of disordered systems (Gardner, 1988).

30 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization  Bayesian regularization. Theory Hypothesis comparison Model comparison  Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

31 Statistical physics Probability of a hypothesis: microstate. Optimal model: macrostate.

32 Free energy F = -log Z: log of a sum → F = E - TS: sum of logs.
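A standard way to write the identity the slide alludes to (notation mine, consistent with the statistical-physics slides): for energies E(h),

F = -log Z = -log Σ_h e^{-E(h)} = min_P [ Σ_h P(h) E(h) + Σ_h P(h) log P(h) ] = min_P [ ⟨E⟩_P - S(P) ],

with the minimum attained by the Gibbs distribution P(h) ∝ e^{-E(h)}; at temperature T the bracket becomes E - TS.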

33 EM algorithm. Main idea Introduce an independent distribution P and iterate: E-step, M-step.

34 EM algorithm E-step: estimate the posterior for the given model. M-step: update the model for the given posterior.
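Written out in the free-energy notation of the previous slide (the slide's own formulas were images): with F(P, θ) = ⟨-log P(D, h | θ)⟩_P - S(P),

E-step: P(h) ← P(h | D, θ), the posterior under the current model parameters θ;
M-step: θ ← argmax_θ ⟨log P(D, h | θ)⟩_P, the model update for the fixed posterior.

Each step can only decrease F, which is why the iterations converge.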

35 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

36 Bayesian regularization: examples Hypothesis testing, function approximation, data clustering (illustrated by plots of y vs. h, h(x) vs. x, and P(x|H) vs. x).

37 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm  Bayesian regularization. Practice  Hypothesis testing Function approximation Data clustering

38 Hypothesis testing Problem: noisy observations yα; is the theoretical value h0 true? Model H: Gaussian noise, Gaussian prior.
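A plausible formalization of this setup (the slide's formulas were images, so the exact notation is an assumption): observations yα = h + να with Gaussian noise να ~ N(0, σ²) and a Gaussian prior h ~ N(h0, 1/α) centred on the theoretical value; asking whether h0 is true then amounts to comparing the model with finite prior precision α against the limit in which h is pinned to h0.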

39 Optimal model: phase transition Confidence: finite vs. infinite.

40 Threshold effect Student coefficient: hypothesis h0 is true vs. corrections to h0 (plots of P(h)).

41 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm  Bayesian regularization. Practice Hypothesis testing  Function approximation Data clustering

42 Function approximation Problem: noisy data yα(xα); find an approximation h(x). Model: noise model, prior.

43 Optimal model Free energy minimization

44 Saddle point approximation Function of best hypothesis

45 EM learning E-step: optimal hypothesis. M-step: optimal regularization.
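A minimal sketch of such an iteration for linear function approximation with Gaussian noise and a Gaussian prior on the weights. The M-step below uses the standard evidence-framework re-estimation of the noise and prior precisions; it illustrates the idea rather than reproducing the exact updates on the slides.

```python
import numpy as np

def em_ridge(X, y, n_iter=50):
    """Alternate optimal weights (E-step) and optimal regularization (M-step)."""
    n, d = X.shape
    alpha, beta = 1.0, 1.0                    # prior precision, noise precision (initial guesses)
    for _ in range(n_iter):
        # E-step: most probable weights for the current regularization
        A = beta * X.T @ X + alpha * np.eye(d)
        w = beta * np.linalg.solve(A, X.T @ y)
        # M-step: re-estimate the regularization from the current fit
        gamma = d - alpha * np.trace(np.linalg.inv(A))    # effective number of parameters
        alpha = gamma / (w @ w)
        beta = (n - gamma) / np.sum((y - X @ w) ** 2)
    return w, alpha, beta

# Toy usage: noisy linear data with two irrelevant inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 3.0]) + 0.5 * rng.normal(size=100)
w, alpha, beta = em_ridge(X, y)
print("weights:", np.round(w, 2), " alpha:", round(alpha, 2), " beta:", round(beta, 2))
```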

46 Laplace Prior Pruned weights Equisensitive weights
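For reference, the Laplace prior is P(wᵢ) ∝ exp(-α|wᵢ|), which adds a penalty α Σᵢ |wᵢ| to the learning error. Its gradient has constant magnitude, so small weights are driven exactly to zero (pruned), while at the optimum every surviving weight satisfies |∂E/∂wᵢ| = α, which is the 'equisensitive' property named on the slide.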

47 Laplace regularization E-step: weights estimation. M-step: regularization.

48 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm  Bayesian regularization. Practice Hypothesis testing Function approximation  Data clustering

49 Clustering Problem: noisy data xα; find prototypes (mixture density approximation). How many clusters? Model: noise model.

50 Optimal model Free energy minimization. Iterate E-step and M-step.

51 EM algorithm E-step; M-step.
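A compact sketch of the EM iteration for mixture-density clustering, using spherical Gaussian clusters of fixed width to keep the code short; the number of clusters, the width, and the data are assumptions made for illustration, not the slides' experiments.

```python
import numpy as np

def em_clustering(x, n_clusters, sigma=1.0, n_iter=100, seed=0):
    """EM for a mixture of spherical Gaussians with fixed width sigma."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), n_clusters, replace=False)]    # initial prototypes
    pi = np.full(n_clusters, 1.0 / n_clusters)               # mixing weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each point
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_r = np.log(pi) - d2 / (2.0 * sigma ** 2)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update prototypes and mixing weights
        nk = r.sum(axis=0)
        mu = (r.T @ x) / nk[:, None]
        pi = nk / len(x)
    return mu, pi

# Toy usage: three Gaussian blobs in the plane.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
mu, pi = em_clustering(x, n_clusters=3)
print("prototypes:\n", np.round(mu, 2))
```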

52 How many clusters? Number of clusters M(·); optimal number of clusters.

53 Simulations: Uniform data Optimal model M

54 Simulations: Gaussian data Optimal model M

55 Simulations: Gaussian mixture Optimal model M

56 Outline Learning as ill-posed problem General problem: data generalization General remedy: model regularization Bayesian regularization. Theory Hypothesis comparison Model comparison Free Energy & EM algorithm Bayesian regularization. Practice Hypothesis testing Function approximation Data clustering

57 Summary Learning: an ill-posed problem; remedy: regularization. Bayesian learning: built-in regularization (model assumptions); optimal model = minimal description length = minimal free energy. Practical issues: learning algorithms with built-in optimal regularization, derived from first principles (as opposed to cross-validation).

