1 Christoph Rosemann and Helge Voss, DESY FLC
Naïve Bayesian Classifier
Data Preprocessing
Linear Discriminator
Artificial Neural Networks
MVA methods, Terascale Statistics School, Mainz, 6 April

2 Naïve Bayesian Classifier aka “Projective Likelihood”
Kernel methods and nearest-neighbour methods estimate the joint pdf in the full D-dimensional space.
If correlations are weak or non-existent, the problem can be factorized: $p(\mathbf{x}) \approx \prod_{k=1}^{D} p_k(x_k)$
This introduces another problem: how to describe the individual pdfs?
Histogramming (event counting): + automatic, - less than optimal
Parametric fitting: - difficult to automate
Non-parametric fitting (e.g. splines): + automatable, - possible artefacts, information loss
(Plot: example pdfs generated from a Gaussian.)
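A minimal sketch of such a projective likelihood with histogrammed pdfs (illustrative code, not the TMVA implementation; variable names, binning and the toy samples are assumptions):

```python
import numpy as np

def make_pdfs(sample, bins=40):
    """Build one normalised histogram pdf per input variable (event counting)."""
    pdfs = []
    for k in range(sample.shape[1]):
        counts, edges = np.histogram(sample[:, k], bins=bins, density=True)
        pdfs.append((counts, edges))
    return pdfs

def likelihood(x, pdfs):
    """Factorised likelihood L(x) = prod_k p_k(x_k), read off the histogram bins."""
    L = 1.0
    for xk, (counts, edges) in zip(x, pdfs):
        i = int(np.clip(np.searchsorted(edges, xk) - 1, 0, len(counts) - 1))
        L *= max(counts[i], 1e-12)          # guard against empty bins
    return L

def likelihood_ratio(x, pdfs_sig, pdfs_bkg):
    """Classifier output y_S(x) = L_S / (L_S + L_B)."""
    LS, LB = likelihood(x, pdfs_sig), likelihood(x, pdfs_bkg)
    return LS / (LS + LB)

# toy usage with two uncorrelated Gaussian variables
rng = np.random.default_rng(0)
sig = rng.normal([1.0, 0.5], 1.0, size=(10000, 2))
bkg = rng.normal([0.0, 0.0], 1.0, size=(10000, 2))
print(likelihood_ratio(np.array([1.2, 0.7]), make_pdfs(sig), make_pdfs(bkg)))
```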

3 Likelihood functions and ratios
The individual $p_k(x_k)$ is usually called the likelihood function.
It is class dependent: signal or background(s).
For each variable and each class a pdf description is needed.
The classifier function is the likelihood ratio (per class): $y_S(\mathbf{x}) = \frac{\mathcal{L}_S(\mathbf{x})}{\mathcal{L}_S(\mathbf{x}) + \mathcal{L}_B(\mathbf{x})}$, with $\mathcal{L}_C(\mathbf{x}) = \prod_{k} p_{C,k}(x_k)$.

4 Likelihood Example
Example: electron identification with two classes (electron, pion) and two variables (ratio of track momentum to calorimeter energy, cluster shape).
1. Take a candidate and evaluate the pdf of each variable in each class.
2. Determine the likelihood ratio, e.g. for the electron hypothesis: $y_e(\mathbf{x}) = \frac{\mathcal{L}_e(\mathbf{x})}{\mathcal{L}_e(\mathbf{x}) + \mathcal{L}_\pi(\mathbf{x})}$

5 Projective Likelihood
One of the most popular MVA methods in HEP.
A well performing, fast, robust and simple classifier.
If the prior probabilities are known, an estimate of the class probability can be made.
Big improvement with respect to a simple cut approach: background-likeness in a single variable can be overcome.
Problematic if significant (>10%) correlations exist:
- the classification performance usually decreases
- the factorization approach no longer holds
- the probability estimate is lost
(Plot: can you tell the correlation?)

6 Linear Correlations
Reminder: linearly correlated input variables (plot: the standard TMVA example).
Solution: apply a linear transformation to the input variables so that they fulfil $\langle x'_i x'_j \rangle - \langle x'_i \rangle \langle x'_j \rangle = 0$ for $i \neq j$.
Two main methods are in use.

7 SQRT Decorrelation
Determine the square root C' of the correlation matrix C, i.e. $C = C'C'$.
Compute C' by diagonalising C with a symmetric matrix S: $D = S^{T} C S$ is diagonal and $C' = S \sqrt{D}\, S^{T}$.
Transformation prescription for the input variables: $\mathbf{x} \mapsto (C')^{-1}\,\mathbf{x}$
(Plot: standard TMVA example before decorrelation.)
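A sketch of this prescription using the sample covariance and its eigendecomposition (illustrative only; TMVA's own implementation differs in details):

```python
import numpy as np

def sqrt_decorrelate(sample):
    """Transform x -> C'^{-1} x so that the transformed covariance is the unit matrix."""
    C = np.cov(sample, rowvar=False)                 # covariance of the input variables
    eigval, S = np.linalg.eigh(C)                    # diagonalise: S^T C S = D
    C_prime = S @ np.diag(np.sqrt(eigval)) @ S.T     # square root: C = C' C'
    return sample @ np.linalg.inv(C_prime).T, C_prime

# toy usage: two linearly correlated Gaussians
rng = np.random.default_rng(1)
x = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
x_dec, _ = sqrt_decorrelate(x)
print(np.cov(x_dec, rowvar=False).round(2))          # ~ identity matrix
```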

8 SQRT Decorrelation
Same prescription as on the previous slide: determine C' with $C = C'C'$ by diagonalising C, then transform $\mathbf{x} \mapsto (C')^{-1}\,\mathbf{x}$.
(Plot: standard TMVA example after decorrelation.)

9 Principal Component Analysis
Eigenvalue problem of the correlation matrix:
- the largest eigenvalue corresponds to the largest variance
- the corresponding eigenvector points along the axis with the largest variance
PCA is typically used to:
- reduce the dimensionality of a problem
- find the most dominant features
Transformation rule: use the eigenvectors as the new basis (k components) and express the variables in terms of this basis (separately for each class).
The matrix of eigenvectors V obeys the relation $V^{T} C V = D$ (diagonal), so PCA eliminates linear correlations.
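A minimal PCA sketch along these lines, based on the eigen decomposition of the sample covariance (names and the toy data are illustrative):

```python
import numpy as np

def pca(sample, k=None):
    """Project the sample onto the k leading eigenvectors (principal components)."""
    xc = sample - sample.mean(axis=0)          # centre the variables
    C = np.cov(xc, rowvar=False)
    eigval, V = np.linalg.eigh(C)              # C V = V D, columns of V are eigenvectors
    order = np.argsort(eigval)[::-1]           # sort by decreasing eigenvalue (variance)
    eigval, V = eigval[order], V[:, order]
    k = sample.shape[1] if k is None else k
    return xc @ V[:, :k], eigval

# toy usage: keep only the dominant component of correlated data
rng = np.random.default_rng(2)
x = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=5000)
x_pca, variances = pca(x, k=1)
print(variances.round(2))                      # largest eigenvalue = largest variance
```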

10 How is the Transformation applied?
The correlations are usually different in signal and background: which transformation is applied?
Two cases (in general):
- Explicit difference in the pdfs (e.g. projective likelihood): each class can use its own decorrelation.
- No differentiation in the pdfs: use either the signal or the background decorrelation, or a mixture of both.

11 Classifier Decorrelation example
Example: linearly correlated Gaussians, so the decorrelation works to 100%.
A 1-D likelihood on the decorrelated sample then gives the best possible performance.
Compare also the effect on the MVA output variable: correlated variables vs. the same variables with decorrelation (note the different scale on the y-axis).

12 Decorrelation Caveats
What if the correlations are different for signal and background?
(Plots: original signal and background correlations, and the same after SQRT decorrelation.)

13 Decorrelation Caveats
What happens if the correlations are non-linear?
(Plots: original correlations and after SQRT decorrelation.)
Use decorrelation with care!

14 Gaussianisation
Decorrelation works best if the variables are Gauss-distributed.
Therefore perform another transformation first, a "Gaussianisation", as a two-step procedure:
1. Rarity transformation: map each variable to a uniform distribution via its cumulative distribution $F_k(x_k)$.
2. Make it Gaussian via the inverse error function: $x_k \mapsto \sqrt{2}\,\operatorname{erf}^{-1}\!\big(2\,F_k(x_k) - 1\big)$
Then decorrelate the Gauss-shaped variable distributions.
Optional: do several iterations of Gaussianisation and decorrelation.
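A sketch of the two steps for a single variable, using an empirical rarity (rank) transform and the inverse error function; the binning-free rank approach and the toy data are illustrative choices:

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def gaussianise(values):
    """Step 1: rarity transform to a uniform distribution via the empirical CDF.
       Step 2: map to a Gaussian shape with the inverse error function."""
    u = rankdata(values) / (len(values) + 1.0)     # empirical CDF values in (0, 1)
    return np.sqrt(2.0) * erfinv(2.0 * u - 1.0)    # standard-normal shaped variable

# toy usage: an exponentially distributed variable becomes Gauss-shaped
rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=10000)
g = gaussianise(x)
print(g.mean().round(2), g.std().round(2))         # ~ 0 and ~ 1
```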

15 Example of Gaussianisation
(Plots: original distributions.)

16 Example of Gaussianisation
(Plots: signal gaussianised.)

17 Example of Gaussianisation
(Plots: background gaussianised.)
Note: no simultaneous Gaussianisation of signal and background is possible.

18 Modeling decision boundaries
Nearest neighbour, kernel estimators and the likelihood method estimate the underlying pdfs.
Now move on to determining the decision boundary directly instead.
This requires a model:
- a specific parametrization
- a specific determination (training) process
Most MVA methods are distinguished by this model.
In general the parametrization can be expressed as a weighted sum of basis functions, $y(\mathbf{x}) = \sum_i w_i\, h_i(\mathbf{x})$.
Next, two specific examples: the linear discriminant and neural networks.

19 Linear/Fisher Discriminant
Linear model in the input variables: $y(\mathbf{x}) = w_0 + \sum_{k} w_k x_k$
This results in linear decision boundaries.
Note: the lines in the sketch are not the functions themselves; they are contours of constant y(x).
How are the weights determined? By maximal class separation:
- maximize the distance between the class mean values
- minimize the variance within each class
(Sketch: classes H0 and H1 in the (x1, x2) plane with the projected distributions yS(x) and yB(x).)

20 Linear Discriminant
Maximise the "between-class variance", minimise the "within-class variance".
Decompose the covariance matrix C into a within-class part W and a between-class part B.
Determining the maximum of the separation yields the Fisher coefficients, $F_k \propto \sum_{l} W^{-1}_{kl}\,(\bar{x}_{S,l} - \bar{x}_{B,l})$.
All quantities are known from the training data.
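A sketch of the coefficient computation from the class means and the within-class matrix (normalisation conventions differ between implementations; the toy data are assumptions):

```python
import numpy as np

def fisher_coefficients(sig, bkg):
    """F = W^{-1} (mean_S - mean_B); W is the sum of the class covariance matrices."""
    W = np.cov(sig, rowvar=False) + np.cov(bkg, rowvar=False)    # within-class matrix
    F = np.linalg.solve(W, sig.mean(axis=0) - bkg.mean(axis=0))  # Fisher coefficients F_k
    F0 = -0.5 * F @ (sig.mean(axis=0) + bkg.mean(axis=0))        # offset, centres y(x) between the classes
    return F, F0

# toy usage: y(x) = F0 + F . x gives a linear decision boundary
rng = np.random.default_rng(4)
sig = rng.multivariate_normal([1.0, 1.0], [[1, 0.3], [0.3, 1]], size=5000)
bkg = rng.multivariate_normal([0.0, 0.0], [[1, 0.3], [0.3, 1]], size=5000)
F, F0 = fisher_coefficients(sig, bkg)
y = F0 + sig @ F
print((y > 0).mean().round(2))    # fraction of signal events on the signal side
```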

21 Nonlinear correlations in LD
Assume the following non-linearly correlated data: the linear discriminant does not give a good result.

22 Nonlinear correlations in LD
Assume the following non-linearly correlated data: the linear discriminant does not give a good result.
Use a non-linear decorrelation instead, here a transformation to polar coordinates.
The decorrelated data can then be separated well.
Note: linear discrimination cannot separate data with identical means.
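A sketch of such a polar-coordinate decorrelation; the ring-shaped signal and blob-shaped background are illustrative assumptions:

```python
import numpy as np

def to_polar(sample):
    """Replace (x1, x2) by (r, phi); for ring-like correlations r alone separates the classes."""
    x1, x2 = sample[:, 0], sample[:, 1]
    r = np.hypot(x1, x2)
    phi = np.arctan2(x2, x1)
    return np.column_stack([r, phi])

# toy usage: signal on a ring, background as a blob, both with mean ~ (0, 0)
rng = np.random.default_rng(5)
angles = rng.uniform(0, 2 * np.pi, 5000)
sig = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(0, 0.3, (5000, 2))
bkg = rng.normal(0, 1.0, (5000, 2))
print(to_polar(sig)[:, 0].mean().round(1), to_polar(bkg)[:, 0].mean().round(1))  # r separates well
```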

23 Artificial Neural Networks
ANNs (short: neural networks) are directed, weighted graphs.
Evolution in steps: retrace some of the ideas.
Statistical concept: to extend the decision boundaries to arbitrary functions, y(x) and in turn the h_i(x) need to be non-linear.
(Extension of the) Weierstrass theorem: every continuous function defined on an interval [a,b] can be uniformly approximated as closely as desired by a polynomial function.
Neural networks choose a particular set of such functions.
Biggest breakthrough: adaptability / (machine) learning.

24 Nodes and Connections Neural networks consist of nodes and connections
The activation of a node depends on its inputs; the output depends on the activation. Usual choices for the output (transfer) function:
- sigmoid
- tanh
- binary
Build a network of nodes in layers:
- the input layer has no preceding nodes
- the output layer has no following nodes
- in between are the hidden layers
(Sketch: input layer, hidden layer, output layer with nodes i and j; sigmoid and binary transfer functions.)
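A sketch of a single node with the usual transfer functions; the weights and inputs are illustrative:

```python
import numpy as np

def node_output(inputs, weights, threshold=0.0, transfer="sigmoid"):
    """Activation a = sum_i w_i x_i - threshold; output = transfer(a)."""
    a = np.dot(weights, inputs) - threshold
    if transfer == "sigmoid":
        return 1.0 / (1.0 + np.exp(-a))
    if transfer == "tanh":
        return np.tanh(a)
    if transfer == "binary":
        return 1.0 if a >= 0 else 0.0
    raise ValueError(transfer)

print(node_output(np.array([0.2, 0.7]), np.array([1.5, -0.5]), transfer="tanh"))
```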

25 Example: AND
Theorem: any Boolean function can be represented by (a certain class of) neural network(s).
Consider constructing the logical function AND (similar for OR).
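A sketch of the AND (and OR) construction with one binary node; the weights and thresholds are one possible choice, not necessarily those on the slide:

```python
def binary_node(x1, x2, w1, w2, threshold):
    """Fires (returns 1) if the weighted input sum reaches the threshold."""
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0

def AND(x1, x2):
    return binary_node(x1, x2, w1=1.0, w2=1.0, threshold=1.5)   # only (1, 1) reaches 1.5

def OR(x1, x2):
    return binary_node(x1, x2, w1=1.0, w2=1.0, threshold=0.5)   # any single 1 is enough

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
```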

26 Example: XOR
Try to build XOR with the same network: it cannot be done.
Another layer in between is needed.
This increases the power of description (keep this in mind!).

27 ANN Adaptability: Hebb’s Rule
In words: if a node receives an input from another node, and if both are highly active (have the same sign), the connection weight between the nodes should be strengthened.
Mathematically: $\Delta w_{ij} = \eta\, o_i\, o_j$
(For realistic learning this has to be modified, e.g. because otherwise the weights can grow without limit.)
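A sketch of the plain Hebbian update written above, without the modifications needed for realistic learning; the learning rate is an illustrative choice:

```python
def hebb_update(w_ij, o_i, o_j, eta=0.1):
    """Strengthen the connection when both node outputs are active with the same sign."""
    return w_ij + eta * o_i * o_j

w = 0.0
for o_i, o_j in [(1.0, 1.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]:
    w = hebb_update(w, o_i, o_j)
    print(round(w, 2))   # grows for correlated activity, shrinks for anti-correlated
```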

28 The ancestor: Perceptron
Two layers: input and output.
One output node.
Adaptive, modifiable weights.
Binary transfer function.
Perceptron convergence theorem: the learning algorithm converges in finite time; the perceptron can learn everything it can represent, in a finite amount of time.
Now take a closer look at this representability.

29 Linear Separability
With a binary output node this is equivalent to the decision boundary $\sum_i w_i x_i = \theta$; for a constant threshold value $\theta$ this is a straight line in the plane.
Consider the XOR problem: separate A0/A1 from B0/B1. This cannot be done with a straight line.
The type of boundary depends on the number of inputs:
n = 2: straight lines
n = 3: planes
n > 3: hyperplanes

30 Hidden Layers
Solution shown before: add a hidden layer!
Mathematically: from flat hyperplanes to convex polygons (connect them e.g. with AND-type functions).
Build a two-layered perceptron.
(Detail from the sketch: assume $w_{3,6} = w_{4,6} = w_{5,6} = 1/3$ and $\theta_6 = 0.9$.)
Still limited to convex and connected areas.

31 Three layered Perceptron
Add another hidden layer.
A subtractive polygon overlay becomes possible (e.g. with XOR).
No further increase by adding more layers.

32 Multi Layer Perceptron
The standard network for HEP.
Network topology: feed-forward network with four layers (input, output, two hidden).
Node properties: the transfer function is usually sigmoidal; bias nodes* are usually present.
Learning algorithm: backpropagation, usually with online learning (apply the experience after every event).
*Bias node: use a static threshold of 0 and add a bias node as a substitute threshold for all nodes (except in the output layer); the learning algorithm is allowed to modify its weight, thus creating a variable threshold.
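A sketch of the forward pass of such a feed-forward MLP with two hidden layers, sigmoid transfer functions and bias nodes handled as a constant extra input; layer sizes and weights are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights):
    """Propagate one input vector through the layers; each weight matrix includes a bias column."""
    out = x
    for W in weights:
        out = sigmoid(W @ np.append(out, 1.0))   # append constant 1 for the bias node
    return out

# illustrative topology: 4 inputs, two hidden layers with 6 and 4 nodes, 1 output
rng = np.random.default_rng(6)
weights = [rng.normal(0, 0.5, (6, 5)),    # 4 inputs + bias
           rng.normal(0, 0.5, (4, 7)),    # 6 hidden + bias
           rng.normal(0, 0.5, (1, 5))]    # 4 hidden + bias
print(forward(rng.normal(size=4), weights))
```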

33 Backpropagation
Backpropagation is the standard method for supervised learning. It is an iterative process:
Forward pass: compute the network answer for a given input vector by successively computing activation and output for each node.
Compute the error with an error function: measure the current network performance by comparing to the right answer. (This is also used to determine the generalization power of the net.)
Backward pass: move backward from the output to the input layer, modifying the weights according to the learning rule. (Different choices are possible for when to apply the weight changes.)
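A sketch of one online backpropagation step, reduced to a single hidden layer with sigmoid nodes and a squared-error loss; the topology and learning rate are illustrative simplifications:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_event(x, target, W1, W2, eta=0.05):
    """Forward pass, error, backward pass; weights are updated after this one event (online learning)."""
    h = sigmoid(W1 @ x)                 # hidden layer output
    y = sigmoid(W2 @ h)                 # network answer
    # backward pass: chain rule through the sigmoid derivative s' = s (1 - s)
    delta_out = (y - target) * y * (1.0 - y)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(7)
W1, W2 = rng.normal(0, 0.5, (3, 2)), rng.normal(0, 0.5, (1, 3))
for _ in range(3):
    print(train_event(np.array([0.5, -1.0]), np.array([1.0]), W1, W2).round(4))
```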

34 Error and Loss function
Fundamental concept for training: a metric for the difference between the right answer and the method's result, the error or loss function E.
Properties: a minimum exists, the training series converges, and E is continuous and differentiable in the weights.
Mean absolute error, "classification error": $E = \frac{1}{N}\sum_{n} \big|y(\mathbf{x}_n) - \hat{y}_n\big|$
Mean squared error, "regression error": $E = \frac{1}{N}\sum_{n} \big(y(\mathbf{x}_n) - \hat{y}_n\big)^2$
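The two loss functions written out as a small sketch (illustrative values):

```python
import numpy as np

def mean_absolute_error(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

def mean_squared_error(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

y_pred = np.array([0.9, 0.2, 0.7])
y_true = np.array([1.0, 0.0, 1.0])
print(mean_absolute_error(y_pred, y_true), mean_squared_error(y_pred, y_true))
```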

35 Error space
Define W as the vector space of all weights and search for the minimal error.
It is impossible to scan the full space; E(w) defines a surface in W.
The negative gradient points towards the next valley and is determined via the chain rule: $\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}}$
The output function plays a crucial role (its derivative enters the chain rule).
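A sketch of the gradient-descent idea on a toy two-weight error surface; the real surface lives in the full weight space W and its gradient comes from the chain rule:

```python
import numpy as np

def E(w):
    """Toy error surface with a single valley."""
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def grad_E(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])      # start somewhere on the surface
eta = 0.1                     # learning rate (illustrative)
for step in range(50):
    w -= eta * grad_E(w)      # the negative gradient points towards the next valley
print(w.round(3), E(w).round(5))   # close to the minimum at (1, -0.5)
```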

36 Bias versus Variance
(Sketch: loss function for training and test sample versus model complexity, from low bias / high variance to high bias / low variance; decision boundaries in the (x1, x2) plane illustrating overtraining.)
A common topic; here (for ANNs) it is the choice of the network topology.
The training error always decreases.
The common problem is to choose the right complexity:
- avoid learning special features of the training sample
- find the best generalization

37 Short Summary
If you have a set of uncorrelated variables, use the projective likelihood.
If you want to apply pre-processing, be careful:
- only Gaussian-distributed, linearly correlated data can be (properly) decorrelated
- Gaussianisation is hard to achieve
- in case of correlated data, also consider other methods
The Fisher discriminant is simple and powerful, but limited.
Neural networks are very powerful, but have many options/pitfalls:
- choose the right network complexity, nothing in addition
- validate the results
Mantra: use your brain, inspect and understand the data.

