
Giansalvo EXIN Cirrincione unit #3

PROBABILITY DENSITY ESTIMATION (labelled or unlabelled data) Parametric methods: a specific functional form for the density model is assumed. This contains a number of parameters which are then optimized by fitting the model to the training set. Drawback: the chosen form may not be correct for the data.

PROBABILITY DENSITY ESTIMATION Non-parametric methods do not assume a particular functional form, but allow the form of the density to be determined entirely by the data. Drawback: the number of parameters grows with the size of the training set (TS).

PROBABILITY DENSITY ESTIMATION Semi-parametric methods allow a very general class of functional forms in which the number of adaptive parameters can be increased in a systematic way to build ever more flexible models, but where the total number of parameters in the model can be varied independently of the size of the data set.

Parametric model: normal or Gaussian distribution

Parametric model: normal or Gaussian distribution. Mahalanobis distance: Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ), where μ is the mean vector and Σ the covariance matrix.
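
As an illustrative sketch (not part of the original slides), the Gaussian density can be evaluated numerically through the Mahalanobis distance; the NumPy function name and the example values below are hypothetical.

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        """Multivariate normal density, evaluated via the Mahalanobis distance."""
        d = mu.shape[0]
        diff = x - mu
        # squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
        maha2 = diff @ np.linalg.solve(Sigma, diff)
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
        return norm * np.exp(-0.5 * maha2)

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    print(gaussian_pdf(np.array([1.0, -1.0]), mu, Sigma))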

Parametric model: normal or Gaussian distribution. Contour of constant probability density: on the contour the density is smaller than its peak value by a factor exp(-1/2).

Parametric model: normal or Gaussian distribution. When the covariance matrix is diagonal, the components of x are statistically independent.

Parametric model: normal or Gaussian distribution

Parametric model: normal or Gaussian distribution. Some properties: any moment can be expressed as a function of μ and Σ; under general assumptions, the mean of M random variables tends to be distributed normally in the limit as M tends to infinity (central limit theorem), for example the sum of a set of variables drawn independently from the same distribution; under any non-singular linear transformation of the coordinate system, the pdf is again normal, but with different parameters; the marginal and conditional densities are normal.

Parametric model: normal or Gaussian distribution. Discriminant functions for independent normal class-conditional pdf's: in general the decision boundary is quadratic.
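
As a hedged illustration of the idea (not from the slides), a log-discriminant g_k(x) = ln p(x | C_k) + ln P(C_k) with Gaussian class-conditional pdf's can be coded as follows; the function name and the example classes are hypothetical.

    import numpy as np

    def discriminant(x, mu, Sigma, prior):
        """g_k(x) = ln p(x|C_k) + ln P(C_k) for a normal class-conditional pdf
        (constant terms dropped). Quadratic in x in general; linear when all
        classes share the same covariance matrix."""
        diff = x - mu
        maha2 = diff @ np.linalg.solve(Sigma, diff)
        return -0.5 * maha2 - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)

    # two classes with different covariances -> quadratic decision boundary
    x = np.array([0.3, 1.2])
    g1 = discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
    g2 = discriminant(x, np.array([2.0, 1.0]), 2 * np.eye(2), 0.5)
    print("assign x to class", 1 if g1 > g2 else 2)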

Parametric model: normal or Gaussian distribution. Independent normal class-conditional pdf's with equal covariance matrices (Σ_k = Σ for all classes): the decision boundary becomes linear.

Parametric model: normal or Gaussian distribution P(C1) = P(C2)

Parametric model: normal or Gaussian distribution P(C1) = P(C2) = P(C3)

Parametric model: normal or Gaussian distribution. Template matching: when Σ = σ² I, the discriminant reduces to comparing the Euclidean distances of x to the class means, which act as templates.

Maximum likelihood (ML) finds the optimum values for the parameters θ by maximizing a likelihood function derived from the training data, which is assumed to be drawn independently from the required distribution.

The likelihood of θ for the given TS is its joint probability density: L(θ) = p(x^1, …, x^N | θ) = Π_n p(x^n | θ). ML finds the optimum values for the parameters by maximizing this likelihood function derived from the training data.

It is convenient to minimize the error function E = −ln L instead. Homework: show that, for a Gaussian pdf, the ML estimates of the mean and covariance are the corresponding sample averages.
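
For reference, an assumption-laden sketch (not the author's code) showing that the ML solution for a Gaussian indeed reduces to sample averages:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 2.0]], size=500)

    # ML estimates are simple sample averages
    mu_ml = X.mean(axis=0)              # (1/N) sum_n x^n
    diff = X - mu_ml
    Sigma_ml = diff.T @ diff / len(X)   # (1/N) sum_n (x^n - mu)(x^n - mu)^T  (biased estimator)
    print(mu_ml)
    print(Sigma_ml)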

Uncertainty in the values of the parameters

The predictive density is obtained by weighting p(x | θ) with the posterior distribution p(θ | TS), which acts as the weighting factor: p(x | TS) = ∫ p(x | θ) p(θ | TS) dθ. The TS is drawn independently from the underlying distribution.

A prior which gives rise to a posterior having the same functional form is said to be a conjugate prior (reproducing densities, e.g. Gaussian). For large numbers of observations, the Bayesian representation of the density approaches the maximum likelihood solution.

Example: assume the variance σ² is known and find the mean μ given the TS. With a normal distribution as prior for μ, the posterior is again normal (homework), with a mean that combines the prior mean and the sample mean.

Example (continued): for the normal distribution, the posterior for the mean becomes sharper as the number of observations increases.
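
A minimal sketch of this example (values are hypothetical; the closed-form posterior for a conjugate normal prior is standard textbook material, not taken from the slides):

    import numpy as np

    def posterior_mean_params(x, mu0, sigma0_sq, sigma_sq):
        """Posterior over the mean of a 1-D Gaussian with known variance sigma_sq,
        given a conjugate normal prior N(mu0, sigma0_sq). Returns (mu_N, var_N)."""
        N = len(x)
        xbar = np.mean(x)
        mu_N = (N * sigma0_sq * xbar + sigma_sq * mu0) / (N * sigma0_sq + sigma_sq)
        var_N = (sigma0_sq * sigma_sq) / (N * sigma0_sq + sigma_sq)
        return mu_N, var_N

    rng = np.random.default_rng(1)
    data = rng.normal(loc=0.8, scale=1.0, size=50)   # sigma assumed known (= 1)
    print(posterior_mean_params(data, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0))

As N grows, the posterior mean approaches the sample mean and the posterior variance shrinks, consistent with the Bayesian estimate approaching the ML solution.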

Iterative techniques: no storage of a complete TS; on-line learning in real-time adaptive systems; tracking of slowly varying systems. Starting from the ML estimate of the mean of a normal distribution, the estimate can be written as a sequential update: μ̂_N = μ̂_{N−1} + (1/N)(x^N − μ̂_{N−1}).

The Robbins-Monro algorithm. Consider a pair of correlated random variables g and θ, with regression function f(θ) ≡ E[g | θ]; the goal is to find the root θ* of f(θ) = 0. Assume g has finite variance: E[(g − f)² | θ] < ∞.

The Robbins-Monro algorithm: θ^(N+1) = θ^(N) + a_N g(θ^(N)), where the coefficients a_N are positive and satisfy lim a_N = 0 (successive corrections decrease in magnitude, for convergence), Σ a_N = ∞ (corrections are sufficiently large that the root is found) and Σ a_N² < ∞ (the accumulated noise has finite variance, so the noise doesn't spoil convergence).

The Robbins-Monro algorithm. The ML parameter estimate θ̂ can be formulated as a sequential update method using the Robbins-Monro formula.

homework

Consider the case where the pdf is taken to be a normal distribution, with known standard deviation σ and unknown mean μ. Show that, by choosing a_N = σ² / (N + 1), the one-dimensional iterative version of the ML estimate of the mean is recovered by using the Robbins-Monro formula for sequential ML. Obtain the corresponding formula for the iterative estimate of σ² and repeat the same analysis.
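
As a sketch of the sequential ML estimate of the mean (illustrative only, not from the slides), a step of 1/N satisfies the Robbins-Monro conditions:

    import numpy as np

    rng = np.random.default_rng(2)
    mu_hat = 0.0
    for N, x in enumerate(rng.normal(loc=3.0, scale=1.0, size=1000), start=1):
        # sequential update of the mean; the coefficient 1/N is positive, decreasing,
        # its sum diverges and the sum of its squares converges
        mu_hat += (x - mu_hat) / N
    print(mu_hat)   # close to the true mean 3.0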

SUPERVISED LEARNING Histograms: we can choose both the number of bins M and their starting position on the axis. The number of bins (viz. the bin width) acts as a smoothing parameter. Curse of dimensionality: in d dimensions the number of bins is M^d.
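
A minimal one-dimensional histogram estimator (a sketch, not from the slides; the bin count M is the smoothing parameter mentioned above):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=1000)

    M = 20                                    # number of bins (smoothing parameter)
    edges = np.linspace(x.min(), x.max(), M + 1)
    counts, _ = np.histogram(x, bins=edges)
    width = edges[1] - edges[0]
    p_hat = counts / (len(x) * width)         # normalised so the estimate integrates to 1
    print(p_hat.sum() * width)                # ~ 1.0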

Density estimation in general. The probability that a new vector x, drawn from the unknown pdf p(x), will fall inside some region R of x-space is P = ∫_R p(x') dx'. If we have N points drawn independently from p(x), the probability that K of them will fall within R is given by the binomial law Pr(K) = (N choose K) P^K (1 − P)^(N−K). The distribution is sharply peaked as N tends to infinity, so K ≈ N P. Assume p(x) is continuous and varies only slightly over the region R of volume V, so that P ≈ p(x) V and hence p(x) ≈ K / (N V).

Density estimation in general. Assumption #1: R relatively large, so that P will be large and the binomial distribution will be sharply peaked. Assumption #2: R small, which justifies the assumption of p(x) being nearly constant inside the integration region. K-nearest-neighbours: K is FIXED and V is DETERMINED FROM THE DATA.

Density estimation in general. Assumption #1: R relatively large, so that P will be large and the binomial distribution will be sharply peaked. Assumption #2: R small, which justifies the assumption of p(x) being nearly constant inside the integration region. Kernel-based methods: V is FIXED and K is DETERMINED FROM THE DATA.

We can find an expression for K by defining a kernel function H(u), also known as a Parzen window: H(u) = 1 if |u_i| < 1/2 for every component i, and 0 otherwise, so that R is a hypercube of side h centred on x. The resulting density estimate is a superposition of N cubes of side h, with each cube centred on one of the data points; H acts as an interpolation function (ZOH).

Kernel-based methods: replacing the hypercube with a smooth kernel function, e.g. a Gaussian, gives a smoother estimate.

Kernel-based methods: comparison, on 30 samples, of the estimates obtained with the ZOH (hypercube) kernel and with the Gaussian kernel.

Kernel-based methods. Over different selections of data points x^n, the expectation of the estimated density is a convolution of the true pdf with the kernel function and so represents a smoothed version of the pdf. All of the data points must be stored! For a finite data set, there is no non-negative estimator which is unbiased for all continuous pdf's (Rosenblatt, 1956).
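
A hedged sketch of a kernel (Parzen) estimator with an isotropic Gaussian kernel of width h; the function name and the data are illustrative:

    import numpy as np

    def parzen_gaussian(x, data, h):
        """Kernel density estimate p(x) = (1/N) sum_n N(x | x^n, h^2 I)."""
        N, d = data.shape
        sq_dist = np.sum((data - x) ** 2, axis=1)
        norm = (2 * np.pi * h ** 2) ** (-d / 2)
        return np.mean(norm * np.exp(-0.5 * sq_dist / h ** 2))

    rng = np.random.default_rng(4)
    data = rng.normal(size=(30, 1))           # 30 one-dimensional samples
    print(parzen_gaussian(np.array([0.0]), data, h=0.4))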

K-nearest neighbours. One of the potential problems with the kernel-based approach arises from the use of a fixed width parameter h for all of the data points. If h is too large, there may be regions of x-space in which the estimate is over-smoothed; reducing h may lead to problems in regions of lower density, where the model density becomes noisy. The optimum choice of h may therefore be a function of position. Consider a small hypersphere centred at a point x and allow its radius to grow until it contains precisely K data points. The estimate of the density is then given by K / (N V), where V is the volume of the hypersphere.

K-nearest neighbours. The estimate is not a true probability density, since its integral over all x-space diverges. All of the data points must be stored! (Branch-and-bound methods can be used to speed up the search for the nearest neighbours.)
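
A sketch of the K-nearest-neighbour density estimate p(x) ≈ K / (N V) (illustrative code; the brute-force search used here is what branch-and-bound would accelerate):

    import numpy as np
    from math import gamma, pi

    def knn_density(x, data, K):
        """K / (N V), where V is the volume of the smallest hypersphere centred
        on x that contains K data points."""
        N, d = data.shape
        r = np.sort(np.linalg.norm(data - x, axis=1))[K - 1]   # radius to the K-th neighbour
        V = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d          # hypersphere volume
        return K / (N * V)

    rng = np.random.default_rng(5)
    data = rng.normal(size=(500, 2))
    print(knn_density(np.zeros(2), data, K=10))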

K-nearest neighbour classification rule. The data set contains N_k points in class C_k and N points in total. Draw a hypersphere around x which encompasses K points irrespective of their class; if it contains K_k points of class C_k, then p(x | C_k) ≈ K_k / (N_k V), p(x) ≈ K / (N V) and P(C_k) ≈ N_k / N, so that the posterior is P(C_k | x) ≈ K_k / K.

K-nearest neighbour classification rule Find a hypersphere around x which contains K points and then assign x to the class having the majority inside the hypersphere. K = 1 : nearest-neighbour rule

K-nearest neighbour classification rule Samples that are close in feature space likely belong to the same class. K = 1 : nearest-neighbour rule

K-nearest neighbour classification rule: example of the decision regions produced by the 1-NNR (nearest-neighbour rule).
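
A minimal K-NN classifier along these lines (a sketch; the function and variable names are hypothetical):

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, K=1):
        """Assign x to the class with the majority among its K nearest neighbours
        (K = 1 gives the nearest-neighbour rule)."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:K]]
        return Counter(nearest).most_common(1)[0][0]

    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
    y = np.array([0, 0, 1, 1])
    print(knn_classify(np.array([4.5, 5.2]), X, y, K=3))   # -> 1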

Measure of the distance between two density functions: the Kullback-Leibler distance or asymmetric divergence, L = −∫ p(x) ln( p̃(x) / p(x) ) dx, where p̃ is the model density. L ≥ 0, with equality iff the two pdf's are equal.
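
A discrete approximation of the Kullback-Leibler distance (a sketch; the slides deal with continuous densities, here the integral is replaced by a sum over probability vectors):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """L = sum_i p_i ln(p_i / q_i) >= 0, with equality iff p and q coincide.
        Note the asymmetry: kl_divergence(p, q) != kl_divergence(q, p) in general."""
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return np.sum(p * np.log((p + eps) / (q + eps)))

    p = np.array([0.1, 0.4, 0.5])
    q = np.array([0.2, 0.3, 0.5])
    print(kl_divergence(p, q), kl_divergence(p, p))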

homework

MIXTURE MODEL: techniques not restricted to specific functional forms, where the size of the model only grows with the complexity of the problem being solved, and not simply with the size of the data set; the price is that they are computationally intensive. Training methods based on ML: nonlinear optimization; re-estimation (EM algorithm); stochastic sequential estimation.

MIXTURE DISTRIBUTION: p(x) = Σ_j P(j) p(x | j), j = 1, …, M. The mixing parameters P(j) sum to one and can be regarded as the prior probability of the data point having been generated from component j of the mixture.

To generate a data point from the pdf, one of the components j is first selected at random with probability P(j), and then a data point is generated from the corresponding component density p(x | j). The mixture can approximate any CONTINUOUS density to arbitrary accuracy, provided the model has a sufficiently large number of components and the parameters of the model are chosen correctly. The data are incomplete (no component label).

Posterior probability (Bayes' theorem): P(j | x) = p(x | j) P(j) / p(x), with Σ_j P(j | x) = 1.

Spherical Gaussian components: p(x | j) = (2π σ_j²)^(−d/2) exp( −‖x − μ_j‖² / (2 σ_j²) ).
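
A sketch of the posterior ("responsibility") computation for spherical Gaussian components via Bayes' theorem (names and numbers are illustrative, not from the slides):

    import numpy as np

    def spherical_gaussian(x, mu, sigma):
        d = len(mu)
        return (2 * np.pi * sigma ** 2) ** (-d / 2) * \
            np.exp(-np.sum((x - mu) ** 2) / (2 * sigma ** 2))

    def responsibilities(x, priors, mus, sigmas):
        """P(j | x) = P(j) p(x | j) / sum_k P(k) p(x | k)."""
        joint = np.array([P * spherical_gaussian(x, mu, s)
                          for P, mu, s in zip(priors, mus, sigmas)])
        return joint / joint.sum()

    print(responsibilities(np.array([1.0, 0.0]),
                           priors=[0.5, 0.5],
                           mus=[np.array([0.0, 0.0]), np.array([3.0, 0.0])],
                           sigmas=[1.0, 1.0]))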

MAXIMUM LIKELIHOOD. Adjustable parameters: P(j), μ_j and σ_j for j = 1, …, M. Problems: singular solutions (the likelihood goes to infinity when one of the Gaussian components collapses onto one of the data points) and local minima.

MAXIMUM LIKELIHOOD. Problems: singular solutions (likelihood goes to infinity), local minima. Possible solutions: constrain the components to have equal variance, or impose a minimum (underflow) threshold for the variance.

To satisfy the constraints on the mixing parameters, P(j) can be expressed through the softmax or normalized exponential function, P(j) = exp(γ_j) / Σ_k exp(γ_k), and the auxiliary variables γ_j treated as the adjustable parameters.

Expressions for the parameters at a minimum of E: μ_j = Σ_n P(j | x^n) x^n / Σ_n P(j | x^n), i.e. the mean of the data vectors weighted by the posterior probabilities that the corresponding data points were generated from that component.

Expressions for the parameters at a minimum of E: σ_j² = (1/d) Σ_n P(j | x^n) ‖x^n − μ_j‖² / Σ_n P(j | x^n), i.e. the variance of the data w.r.t. the mean of that component, again weighted with the posterior probabilities.

Expressions for the parameters at a minimum of E: P(j) = (1/N) Σ_n P(j | x^n), i.e. the posterior probabilities for that component, averaged over the data set.

Expressions for the parameters at a minimum of E Highly non-linear coupled equations

Expectation-maximization (EM) algorithm: starting from "old" parameter values, compute "new" values from the update equations; the error function decreases at each iteration until a local minimum is found.

Proof. Given a set of non-negative numbers λ_j that sum to one, Jensen's inequality states that ln( Σ_j λ_j x_j ) ≥ Σ_j λ_j ln x_j.

Q is an upper bound on E^new − E^old: minimizing Q with respect to the new parameters leads to a decrease in the value of E^new, unless E^new is already at a local minimum.

Gaussian mixture model: minimizing Q with respect to the new parameters yields the EM update equations given above (end of proof).

Example of the EM algorithm: 1000 data points drawn from a uniform distribution, modelled with a mixture of seven components; the contours of constant probability density are shown after 20 cycles.
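
A compact EM sketch for a mixture of spherical Gaussians in the spirit of this example (assumptions: NumPy, random initialisation from the data, a variance floor against singular solutions; the setup of 1000 uniform points, 7 components and 20 cycles only loosely mirrors the slide):

    import numpy as np

    def em_spherical_gmm(X, M, n_iter=20, seed=0):
        """EM for a mixture of M spherical Gaussians (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mus = X[rng.choice(N, M, replace=False)]     # initialise means from the data
        sigmas = np.full(M, X.std())
        priors = np.full(M, 1.0 / M)
        for _ in range(n_iter):
            # E-step: posteriors P(j | x^n) from the old parameters (Bayes' theorem)
            sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)        # shape (N, M)
            comp = priors * (2 * np.pi * sigmas ** 2) ** (-d / 2) \
                   * np.exp(-0.5 * sq / sigmas ** 2)
            resp = comp / comp.sum(axis=1, keepdims=True)
            # M-step: new parameters as posterior-weighted averages
            Nj = resp.sum(axis=0)
            mus = (resp.T @ X) / Nj[:, None]
            sq_new = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
            sigmas = np.sqrt((resp * sq_new).sum(axis=0) / (d * Nj))
            sigmas = np.maximum(sigmas, 1e-3)        # variance floor against singular solutions
            priors = Nj / N
        return priors, mus, sigmas

    X = np.random.default_rng(1).uniform(size=(1000, 2))   # 1000 points, 2-D
    priors, mus, sigmas = em_spherical_gmm(X, M=7)
    print(priors)                                          # mixing parameters P(j)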

Why expectation-maximization? Consider a hypothetical complete data set: for each x^n, introduce z^n, an integer in the range (1, M), specifying which component of the mixture generated x^n. The distribution of the z^n is unknown.

Why expectation-maximization ? First we guess some values for the parameters of the mixture model (the old parameter values) and then we use these, together with Bayes’ theorem, to find the probability distribution of the {z n }. We then compute the expectation of E comp w.r.t. this distribution. This is the E-step of the EM algorithm. The new parameter values are then found by minimizing this expected error w.r.t. the parameters. This is the maximization or M-step of the EM algorithm (min E = ML).

Why expectation-maximization? P^old(z^n | x^n) is the probability for z^n, given the value of x^n and the old parameter values. Thus, the expectation of E_comp over the complete set of {z^n} values is obtained by summing over all possible assignments, weighted by this probability distribution for the {z^n} (homework: carry out the sum); the resulting quantity Q̃ is equal to Q apart from terms that do not depend on the new parameters.

Stochastic estimation of parameters: applied directly, the re-estimation formulas require the storage of all previous data points.

Stochastic estimation of parameters no singular solutions in on-line problems