
1 Introduction to Statistics and Machine Learning (Helge Voss, GSI Power Week, Dec 5-9 2011) How do we understand and interpret our measurements? How do we get the data for our measurements?

2 Classifier Training and Loss Function  Train the classifier by adjusting its parameters so as to minimize a loss function evaluated on the training data.
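
As a concrete illustration (not taken from the slides), a minimal sketch of one common choice, a sum-of-squares loss over labelled training events; all names are illustrative:

```python
import numpy as np

def sum_of_squares_loss(y_pred, y_true):
    """Sum-of-squares loss over the training sample:
    L = sum_i (y(x_i) - t_i)^2, with targets t_i = 1 (signal), 0 (background)."""
    return np.sum((y_pred - y_true) ** 2)

# toy usage: classifier outputs for four training events vs. their true type
y_pred = np.array([0.9, 0.2, 0.7, 0.1])
y_true = np.array([1.0, 0.0, 1.0, 0.0])
print(sum_of_squares_loss(y_pred, y_true))   # 0.01 + 0.04 + 0.09 + 0.01 = 0.15
```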

3 Linear Discriminant  Non-parametric methods like k-Nearest-Neighbour suffer from lack of training data (the "curse of dimensionality") and slow response time, since the whole training data must be evaluated for each classification. Alternative: use a parametric model y(x) fitted to the training data, e.g. any linear function of the input variables, giving rise to linear decision boundaries. How do we determine the weights w that do "best"? A sketch of such a linear discriminant is given below.
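
A minimal sketch of a linear discriminant (weights and names purely illustrative):

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Linear discriminant y(x) = w0 + w·x; the surfaces y(x) = const are
    hyperplanes, i.e. linear decision boundaries in the input variables."""
    return w0 + np.dot(x, w)

# toy usage with illustrative weights for two input variables
w, w0 = np.array([0.8, -0.3]), 0.1
print(linear_discriminant(np.array([1.2, 0.5]), w, w0))   # 0.1 + 0.96 - 0.15 = 0.91
```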

4 Fisher’s Linear Discriminant 4 determine the “weights” w that do “best” y  Maximise “separation” between the S and B  minimise overlap of the distribution y S and y B  maximise the distance between the two mean values of the classes  minimise the variance within each class ySyS yByB  maximise note: these quantities can be calculated from the training data the Fisher coefficients Helge VossIntroduction to Statistics and Machine Learning - GSI Power Week - Dec 5-9 2011

5 Linear Discriminant and non-linear correlations  Assume non-linearly correlated data: the linear discriminant obviously doesn't do a very good job here. Of course, such variables can be de-correlated, and on the de-correlated data the linear discriminant works perfectly. A common decorrelation recipe is sketched below.
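
One common decorrelation recipe, sketched with numpy under the assumption of an inverse-square-root-of-covariance transformation (not necessarily the exact transformation used on the slide):

```python
import numpy as np

def decorrelate(X):
    """Linearly decorrelate the input variables by rotating/scaling with the
    inverse square root of their covariance matrix.
    Note: this removes only linear correlations; strongly non-linear correlations
    may first require an explicit variable transformation."""
    C = np.cov(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(C)
    C_inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    return (X - X.mean(axis=0)) @ C_inv_sqrt
```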

6 Linear Discriminant with Quadratic Input  A simple path to a "quadratic" decision boundary: feed the Fisher discriminant the products var0*var0, var1*var1 and var0*var1 instead of only var0 and var1. A Fisher discriminant using var0 and var1 gives linear decision boundaries in (var0, var1); with the quadratic inputs it gives quadratic decision boundaries in (var0, var1). Performance comparison: Fisher, Fisher with decorrelated variables, Fisher with quadratic input.

7 Linear Discriminant with Quadratic Input (continued)  Of course, if one "finds out" or knows the correlations, they are best treated explicitly: either by explicit decorrelation or e.g. by Function Discriminant Analysis (FDA), which fits any user-defined function of the input variables, requiring that signal events return 1 and background events 0. Parameter fitting can use genetic algorithms, MINUIT, Monte Carlo, and combinations thereof. FDA easily reproduces the Fisher result, but can add non-linearities and remains a very transparent discriminator. A sketch of the quadratic input augmentation follows below.
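
A minimal sketch of the quadratic input augmentation (illustrative code, not the TMVA implementation):

```python
import numpy as np

def quadratic_features(X):
    """Augment (var0, var1) with var0*var0, var1*var1 and var0*var1, so that a
    discriminant that is linear in the new inputs corresponds to a quadratic
    decision boundary in the original (var0, var1) plane."""
    v0, v1 = X[:, 0], X[:, 1]
    return np.column_stack([v0, v1, v0 * v0, v1 * v1, v0 * v1])

# a Fisher discriminant trained on quadratic_features(X) instead of X gives
# the "Fisher with quadratic input" performance curve
```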

8 Neural Networks  Naturally, if we want to go to "arbitrary" non-linear decision boundaries, y(x) needs to be constructed in a non-linear fashion. Imagine choosing y(x) as a linear combination of non-linear functions h_i(x) of linear combinations of the input data. Think of the h_i(x) as a set of "basis" functions: if h(x) is sufficiently general (i.e. non-linear), a linear combination of "enough" basis functions can describe any possible discriminating function y(x); the Weierstrass approximation theorem proves just that. With this, the Neural Network is ready; now we "only" need to find the appropriate weights w.

9 Neural Networks: Multilayer Perceptron (MLP)  Before talking about the weights, let's "interpret" the formula as a neural network. The D discriminating input variables plus one offset node form the input layer. Nodes in the hidden layer apply "activation" functions (e.g. sigmoid, tanh, ...) to linear combinations of the input variables, giving a non-linear response to the input. The output is a linear combination of the activation-function outputs at the hidden nodes. It is straightforward to extend this to several hidden layers. Input to each layer comes from the preceding nodes only: a feed-forward network (no backward loops). A sketch of the forward pass is shown below.
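
A minimal sketch of the forward pass of such a one-hidden-layer MLP (all weights and names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_hidden, w_out):
    """Forward pass of a one-hidden-layer MLP: the hidden nodes apply a sigmoid
    activation to linear combinations of the inputs (plus offset); the output
    is a linear combination of the hidden-node activations (plus offset)."""
    x1 = np.append(x, 1.0)            # D input variables + 1 offset node
    h = sigmoid(W_hidden @ x1)        # M hidden-node activations
    h1 = np.append(h, 1.0)
    return float(w_out @ h1)          # network output y(x)

# toy usage: D = 2 inputs, M = 3 hidden nodes, weights purely illustrative
rng = np.random.default_rng(1)
W_hidden = rng.normal(size=(3, 3))    # 3 hidden nodes x (2 inputs + offset)
w_out = rng.normal(size=4)            # 3 hidden activations + offset
print(mlp_forward(np.array([0.5, -1.2]), W_hidden, w_out))
```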

10 Neural Networks: Multilayer Perceptron (MLP)  The same formula interpreted as a neural network: the nodes correspond to neurons and the links (weights) to synapses. A neural network tries to simulate the reaction of a brain to a certain stimulus (the input data).

11 Neural Network Training  Idea: using the training events, adjust the weights such that y(x) goes to 0 for background events and to 1 for signal events. How do we adjust? Minimize a loss function, e.g. the usual sum of squares between predicted and true event type, or the misclassification error. As a function of the weights, y(x) is a very "wiggly" function with many local minima, so one global overall fit is not efficient/reliable. Instead: back-propagation (learn from experience, gradually adjust the response) and online learning (learn event by event, continuously, not only once in a while).
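
The sum-of-squares loss was sketched after slide 2; the misclassification error mentioned here could look like this minimal, illustrative sketch:

```python
import numpy as np

def misclassification_error(y_out, t, cut=0.5):
    """Fraction of training events whose predicted type (y(x) > cut means signal)
    differs from the true type t (1 = signal, 0 = background)."""
    return float(np.mean((y_out > cut).astype(int) != t))
```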

12 Neural Network Training: back-propagation and online learning  Start with random weights and adjust them in each step by steepest descent of the loss function L. With online learning the training events must be mixed randomly; otherwise the first events steer the weights in a (wrong) direction that is hard to get out of again. For weights connected to output nodes the gradient is simple; for weights not connected to output nodes the formula is a bit more complicated, but all these gradients are easily calculated from the training event. Training is repeated n times over the whole training data sample; how often? A sketch of one online-learning step is given below.
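
A minimal sketch of one such step, consistent with the forward pass above and assuming a sum-of-squares loss for a single event (illustrative, not TMVA's MLP code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_update(x, t, W_hidden, w_out, eta=0.05):
    """One online-learning step for a one-hidden-layer MLP and the loss
    L = (y(x) - t)^2: forward pass for a single training event, then a
    steepest-descent update of all weights using back-propagated gradients."""
    x1 = np.append(x, 1.0)
    h = sigmoid(W_hidden @ x1)
    h1 = np.append(h, 1.0)
    y = float(w_out @ h1)
    delta = 2.0 * (y - t)                 # dL/dy
    grad_out = delta * h1                 # weights connected to the output node
    grad_hid = np.outer(delta * w_out[:-1] * h * (1.0 - h), x1)  # back-propagated
    return W_hidden - eta * grad_hid, w_out - eta * grad_out

# note: the training events should be shuffled before each pass over the sample
```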

13 Watching the Training Progress  For the MLP, the network architecture (nodes and connecting weights) can be plotted after each training epoch to monitor how the training evolves.

14 Overtraining  Training runs n times over all training data; how often? It seems intuitive that a smoother decision boundary will give better results on another, statistically independent data set than one that follows every fluctuation: e.g. stop the training before you learn the statistical fluctuations in the data. Verify on an independent "test" sample by comparing the classification error versus training cycles for the training sample and the test sample. Possible overtraining is a concern for every "tunable parameter" of a classifier: smoothing parameter, number of nodes, etc. A sketch of such an early-stopping loop follows below.
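
One way such an early-stopping check could be sketched; the callables train_one_cycle and test_error are placeholders for the actual classifier code, not names from any library:

```python
import numpy as np

def train_with_early_stopping(train_one_cycle, test_error, n_max=500, patience=20):
    """Repeat training over the sample, but stop when the classification error on a
    statistically independent test sample stops improving, even if the error on the
    training sample keeps decreasing (a sign of overtraining)."""
    best_err, best_cycle = np.inf, 0
    for cycle in range(n_max):
        train_one_cycle()          # one pass over the (shuffled) training sample
        err = test_error()         # classification error on the independent test sample
        if err < best_err:
            best_err, best_cycle = err, cycle
        elif cycle - best_cycle >= patience:
            break                  # no improvement for `patience` cycles: stop
    return best_cycle, best_err
```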

15 Cross Validation  Classifiers have tuning parameters "alpha" that choose and control their performance: number of training cycles, number of nodes and layers, regularisation parameter (neural net), smoothing parameter h (kernel density estimator), and so on. The more flexible the classifier (more parameters), the more prone it is to overtraining; the more training data, the better the training results. Should the data set then be divided into "training", "test" and "validation" samples? Cross validation: divide the data sample into, say, 5 sub-sets and train 5 classifiers y_i(x, alpha), i = 1,...,5, where classifier y_i(x, alpha) is trained without the i-th sub-sample. Calculate the test error CV(alpha) as the average error of each classifier on the sub-sample it was not trained on, choose the tuning parameter alpha for which CV(alpha) is minimal, and train the final classifier using all data. Too bad it is still NOT implemented in TMVA. A sketch of the procedure is given below.
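
A minimal sketch of k-fold cross validation; train and test_error stand for the user's own training and evaluation routines, and the names are illustrative:

```python
import numpy as np

def cross_validation_error(train, test_error, X, t, k=5, seed=0):
    """k-fold cross validation: train k classifiers y_i(x, alpha), each without the
    i-th sub-sample, evaluate each on the sub-sample it has not seen, and average."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        classifier = train(X[train_idx], t[train_idx])           # trained without fold i
        errors.append(test_error(classifier, X[folds[i]], t[folds[i]]))
    return float(np.mean(errors))                                # CV(alpha): minimise over alpha
```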

16 What is the best Network Architecture?  Theoretically a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (Weierstrass theorem). "Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood ... nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model." (Glen Cowan; A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195.) Typically in high-energy physics the non-linearities are reasonably simple, so one layer with a larger number of nodes is probably enough; still, it is worth trying more layers (with fewer nodes in each layer).

17 Support Vector Machines  Neural networks are complicated by having to find the proper optimum weights for best separation power of a "wiggly", piecewise-defined separation hyperplane. k-NN (a multidimensional likelihood) suffers from the disadvantage that calculating the MVA output of each test event involves evaluating ALL training events. Boosted decision trees are, in theory, always weaker than a perfect neural network. Support vector machines try to get the best of all worlds...

18 Support Vector Machine  We have seen that variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), can allow non-linear problems to be separated with linear decision boundaries. There are methods to create linear decision boundaries using only measures of distances (inner/scalar products), which leads to a quadratic optimisation problem. The decision boundary in the end is defined only by the training events that are closest to the boundary. This is the Support Vector Machine.

19 Support Vector Machines  Separable data: find the hyperplane that best separates signal from background (the optimal hyperplane), a linear decision boundary. Best separation means maximum distance (margin) between the closest events (the support vectors) and the hyperplane. The solution of largest margin depends only on inner products of the support vectors (distances), leading to a quadratic minimisation problem. Non-separable data: add a misclassification cost term C sum_i xi_i to the minimisation function. A small sketch follows below.
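
As an illustration only (TMVA's own SVM is a separate implementation), a sketch using scikit-learn's SVC on toy data:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is available

# toy data: two overlapping Gaussian classes in (x1, x2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal( 1.0, 0.8, size=(200, 2)),
               rng.normal(-1.0, 0.8, size=(200, 2))])
t = np.array([1] * 200 + [0] * 200)

# linear SVM: maximise the margin; C is the misclassification cost parameter
# applied to the slack terms when the data are not separable
svm = SVC(kernel="linear", C=1.0).fit(X, t)
print(svm.support_vectors_.shape)   # only these events define the decision boundary
```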

20 Support Vector Machines  Non-linear cases: transform the variables into a higher-dimensional feature space Phi(x1, x2) where, again, a linear boundary (hyperplane) can separate the data. As before: maximum-margin hyperplane from the inner products of the support vectors, a quadratic minimisation problem, and a misclassification cost parameter C sum_i xi_i for non-separable data.

21 Support Vector Machines  Non-linear cases: transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data. The explicit transformation doesn't need to be specified: only the scalar (inner) product is needed, x·x is replaced by Phi(x)·Phi(x). Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid. Choose a kernel and fit the hyperplane using the linear techniques developed above. The kernel size parameter typically needs careful tuning (overtraining!). A sketch using a Gaussian kernel is shown below.
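
Continuing the previous illustrative sketch, a Gaussian-kernel version; the kernel function is written out only to show what the scalar product is replaced by:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is available

def gaussian_kernel(x, xp, gamma=0.5):
    """Gaussian kernel k(x, x') = exp(-gamma |x - x'|^2): acts as the scalar product
    Phi(x)·Phi(x') in a feature space that never needs to be written down explicitly."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

# same fit as in the previous sketch, but with the Gaussian (RBF) kernel;
# gamma sets the kernel size and typically needs careful tuning (overtraining!)
svm_rbf = SVC(kernel="rbf", gamma=0.5, C=1.0)   # .fit(X, t) with the toy data above
```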

22 Support Vector Machines

23 Support Vector Machines: the kernel size parameter  Example: Gaussian kernels. If the kernel size (the sigma of the Gaussian) is chosen too large, there is not enough "flexibility" in the underlying transformation; with the kernel size chosen properly for the given problem, the separation works well. Colour code in the plots: red means large signal probability.
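
One way the kernel size could be tuned in practice, sketched with scikit-learn's cross-validation on the toy data from the earlier sketch (illustrative only, not part of the slides):

```python
from sklearn.model_selection import cross_val_score   # assumes scikit-learn
from sklearn.svm import SVC

# scan the Gaussian kernel size: a large kernel width (small gamma) gives too little
# flexibility, a very small kernel width (large gamma) risks overtraining.
# X, t are the toy data from the sketch after slide 19.
for gamma in [0.01, 0.1, 1.0, 10.0]:
    score = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X, t, cv=5).mean()
    print(gamma, score)
```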

