Density Estimation in R Ha Le and Nikolaos Sarafianos COSC 7362 – Advanced Machine Learning Professor: Dr. Christoph F. Eick

Contents: Introduction, Dataset, Parametric Methods, Non-Parametric Methods, Evaluation, Conclusions, Questions

Introduction. Parametric methods: a particular form of the density function (e.g., Gaussian) is assumed to be known, and only its parameters (e.g., mean, covariance) need to be estimated. Non-parametric methods: no assumption is made about the form of the distribution; however, many examples are required.

Datasets. We created a synthetic dataset containing samples that follow a randomly generated mixture of two Gaussians.
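The generation code is not shown on the slide; the following is a minimal R sketch of one way to create such a dataset, with illustrative mixture parameters rather than the authors' values.

```r
# Minimal sketch: draw n samples from a two-component Gaussian mixture.
# The means, standard deviations, and mixing weight below are illustrative.
set.seed(42)
n      <- 1000
weight <- 0.4                                  # mixing proportion of component 1
comp   <- rbinom(n, size = 1, prob = weight)   # component indicator per sample
x      <- ifelse(comp == 1,
                 rnorm(n, mean = -2, sd = 0.8),
                 rnorm(n, mean =  3, sd = 1.5))

# True mixture density, useful later for evaluating the estimators
true_density <- function(t) {
  weight * dnorm(t, -2, 0.8) + (1 - weight) * dnorm(t, 3, 1.5)
}
```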

Parametric Methods: Maximum Likelihood estimation, Bayesian estimation

Maximum Likelihood Estimation. Statement of the problem: a density function p with parameters θ is assumed, and the sample is drawn as x^t ~ p(x|θ). The likelihood of θ given the sample X is l(θ|X) = p(X|θ) = ∏_t p(x^t|θ). We look for the θ that "maximizes the likelihood of the sample". The log-likelihood is L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ), and the maximum likelihood estimator (MLE) is θ* = argmax_θ L(θ|X).
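As a concrete illustration of the argmax above, this sketch maximizes a Gaussian log-likelihood numerically with optim(); the vector x is assumed to be the synthetic sample from the dataset slide.

```r
# Sketch: maximize L(theta|X) = sum_t log p(x^t|theta) for a Gaussian model
neg_log_lik <- function(theta, x) {
  mu    <- theta[1]
  sigma <- exp(theta[2])            # log-parameterization keeps sigma positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(par = c(mean(x), log(sd(x))), fn = neg_log_lik, x = x)
c(mu_hat = fit$par[1], sd_hat = exp(fit$par[2]))   # the MLE theta*
```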

Maximum Likelihood Estimation (2)
Advantages:
1. MLEs become unbiased, minimum-variance estimators as the sample size increases.
2. They have approximately normal distributions and approximate sampling variances that can be calculated and used to generate confidence bounds.
3. Likelihood functions can be used to test hypotheses about models and parameters.
Disadvantages:
1. With small samples (fewer than about 5 to 10 observations), MLEs can be heavily biased, and the large-sample optimality properties do not apply.
2. Calculating MLEs often requires specialized software for solving complex non-linear equations.

Maximum Likelihood Estimation (3). Using the stats4 library: fitting a normal distribution to the Old Faithful eruption data (mu = 3.487, sd = …).
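A sketch of the kind of stats4 call behind this slide, fitting a normal model to the eruption durations in R's built-in faithful dataset; the exact call used by the authors is not shown in the transcript.

```r
library(stats4)

eruptions <- faithful$eruptions   # Old Faithful eruption durations (datasets)

# Negative log-likelihood of a normal model
nll <- function(mu, sigma) {
  -sum(dnorm(eruptions, mean = mu, sd = sigma, log = TRUE))
}

fit <- mle(nll,
           start  = list(mu = mean(eruptions), sigma = sd(eruptions)),
           method = "L-BFGS-B",
           lower  = c(-Inf, 1e-6))   # keep sigma positive
coef(fit)      # estimated mu and sigma
summary(fit)   # estimates with standard errors
```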

Maximum Likelihood Estimation (4). Other libraries: bbmle provides mle2(), which offers essentially the same functionality but includes the option of not inverting the Hessian matrix.

Bayesian Estimation. The Bayesian approach to parameter estimation works as follows:
1. Formulate our knowledge about the situation:
   a. Define a distribution model that expresses qualitative aspects of our knowledge about the situation. This model will have some unknown parameters, which are treated as random variables.
   b. Specify a prior probability distribution that expresses our subjective beliefs and uncertainty about the unknown parameters before seeing the data.
2. Gather data.
3. Obtain posterior knowledge that updates our beliefs: compute the posterior probability distribution, which estimates the unknown parameters using the rules of probability and the observed data, presenting us with updated beliefs.
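To make the three steps concrete, here is a small worked example (not from the slides) where the posterior is available in closed form: a normal likelihood with known variance and a conjugate normal prior on the mean.

```r
# Step 1: model x ~ N(mu, sigma^2) with sigma known; prior mu ~ N(mu0, tau0^2)
sigma <- 1
mu0   <- 0
tau0  <- 2

# Step 2: gather data (simulated here for illustration)
set.seed(7)
obs <- rnorm(50, mean = 1.5, sd = sigma)

# Step 3: conjugate update gives the posterior in closed form
n         <- length(obs)
post_var  <- 1 / (1 / tau0^2 + n / sigma^2)
post_mean <- post_var * (mu0 / tau0^2 + sum(obs) / sigma^2)
c(post_mean = post_mean, post_sd = sqrt(post_var))
```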

Bayesian Estimation (2). Available R packages: MCMCpack is a software package designed to allow users to perform Bayesian inference via Markov chain Monte Carlo (MCMC). The bayesm package supports Bayesian analysis of many models of interest to marketers and contains a number of interesting datasets, including scanner panel data, key-account-level data, store-level data, and various types of survey data.
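The next slide shows a summary() of an MCMCpack regression fit; the sketch below shows the kind of call that produces such output. The data, formula, and default priors here are placeholders, since the actual model is not shown in the transcript.

```r
library(MCMCpack)

# Placeholder data; the presentation's actual regression inputs are not shown
set.seed(1)
xr <- runif(100)
yr <- 2 + 3 * xr + rnorm(100, sd = 0.5)

posterior <- MCMCregress(yr ~ xr, burnin = 1000, mcmc = 10000)
summary(posterior)   # empirical means, SDs, naive/time-series SEs, quantiles
plot(posterior)      # trace and marginal density plots
```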

Bayesian Estimation (3). Credible intervals and point estimates for the parameters (MCMC summary output):
Iterations = 1001:11000, Thinning interval = 1, Number of chains = 1, Sample size per chain = 10000.
For each variable ((Intercept), x, sigma) the summary reports the empirical mean, SD, naive SE, and time-series SE, plus the 2.5%, 25%, 50%, 75%, and 97.5% quantiles.

Non-Parametric Methods: Histograms, the naive estimator, the kernel estimator, the nearest neighbor method, maximum penalized likelihood estimators

Histograms. Given an origin x_0 and a bin width h, the bins of the histogram are the intervals [x_0 + mh, x_0 + (m+1)h) for integer m, and the histogram estimate is f̂(x) = (1/(nh)) · (number of x_i in the same bin as x). Drawbacks: sensitive to the choice of bin width h; the number of bins grows exponentially with the dimension of the data; discontinuity.

Histograms. Packages: graphics::hist
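A brief sketch of a histogram density estimate with graphics::hist on the synthetic sample; the bin width is an illustrative choice, not the one used on the slides.

```r
# freq = FALSE rescales the bars so the histogram integrates to one
h      <- 0.5                                   # illustrative bin width
breaks <- seq(min(x) - h, max(x) + h, by = h)
hist(x, breaks = breaks, freq = FALSE,
     main = "Histogram density estimate", xlab = "x")
```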

Histograms. The optimal bin width is …

The Naïve Estimator. The naïve estimator is f̂(x) = (1/(2hn)) · #{x_i : x − h < x_i ≤ x + h}, i.e., the proportion of the sample falling in the interval around x, divided by 2h. Drawbacks: sensitive to the choice of the width h; discontinuity.

The Naïve Estimator. Packages: stats::density
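The naïve estimator coincides with a kernel estimate that uses a rectangular (box) kernel, so stats::density can compute it directly; the bandwidth below is an illustrative value.

```r
# Note: density()'s bw argument is the standard deviation of the smoothing
# kernel, not the half-width h of the naive-estimator formula.
d_naive <- density(x, kernel = "rectangular", bw = 0.3)
plot(d_naive, main = "Naive (rectangular kernel) estimator")
```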

The Naïve Estimator. The optimal bin width is …

The Kernel Estimator. The kernel estimator is f̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h), where K is a kernel function satisfying K(x) ≥ 0 and ∫ K(x) dx = 1, and h is the window width. If K is continuous and differentiable, then so is f̂. Drawbacks: sensitive to the window width; tends to add noise in the tails of long-tailed distributions.

The Kernel Estimator (2). Using the density() function and a synthetic dataset.

The Kernel Estimator (3). Using a triangular kernel and a different smoothing parameter.

The Kernel Estimator (4). Using a Gaussian kernel and a different smoothing parameter.
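A sketch comparing kernels and smoothing parameters with stats::density on the synthetic sample; the bandwidths are illustrative, not the values chosen on the slides.

```r
d_tri   <- density(x, kernel = "triangular", bw = 0.2)
d_gauss <- density(x, kernel = "gaussian",   bw = 0.5)

plot(d_gauss, main = "Kernel density estimates", xlab = "x")
lines(d_tri, lty = 2)
legend("topright",
       legend = c("Gaussian, bw = 0.5", "Triangular, bw = 0.2"),
       lty = c(1, 2))
```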

The Kernel Estimator (5). Errors for the triangular and the Gaussian kernels.

The k-th Nearest Neighbor Method. The k-th nearest neighbor density estimator is f̂(x) = k / (2n d_k(x)), where d_k(x) is the distance from x to its k-th nearest sample point. The heavy tails and the discontinuities in the derivative are clearly visible.

The k-th Nearest Neighbor Method. Packages: FNN::knn
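FNN::knn itself is a classifier; for the density estimate, what is needed is the distance to the k-th nearest sample point, which FNN::knnx.dist provides. A sketch, assuming the formula f̂(x) = k/(2n d_k(x)) from the previous slide:

```r
library(FNN)

knn_density <- function(train, query, k) {
  # distance from each query point to its k-th nearest training point
  d_k <- knnx.dist(matrix(train, ncol = 1), matrix(query, ncol = 1), k = k)[, k]
  k / (2 * length(train) * d_k)
}

grid  <- seq(min(x), max(x), length.out = 200)
f_hat <- knn_density(x, grid, k = 20)           # k = 20 is an illustrative choice
plot(grid, f_hat, type = "l", main = "k-NN density estimate")
```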

The k-th Nearest Neighbor Method. The optimal k is …

Maximum Penalized Likelihood Estimators. The idea is to choose f̂ to maximize a penalized log-likelihood of the form ∑_i log f(x_i) - α R(f), where R(f) is a roughness penalty (for example, ∫ f''(t)² dt) and α controls the trade-off between fidelity to the data and smoothness.

Penalized Approaches. R packages: gss uses a penalized likelihood technique for nonparametric density estimation.

Penalized Approaches (2). Estimate probability densities using smoothing spline ANOVA models (the ssden() function).
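A sketch of the gss workflow: fit the density with ssden() and evaluate it on a grid with dssden(). The formula and data frame names are assumptions, since the slide shows only the function name.

```r
library(gss)

fit  <- ssden(~ x, data = data.frame(x = x))    # penalized likelihood fit
grid <- seq(min(x), max(x), length.out = 200)
plot(grid, dssden(fit, grid), type = "l",
     main = "Smoothing spline density estimate")
```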

Penalized Approaches (3)

Additional R Packages for Non-Parametric Methods. General weight function estimators: the wle package. Bounded domains and directional data: the BelVen package.

Evaluation
Method | Error
Histogram (bw = …) | …E-05
Naïve estimator (bw = …) | …E-06
Triangular kernel (bw = …) | …E-06
Gaussian kernel (bw = …) | …E-06
k-th nearest neighbor (k = …) | …E-05
Penalized approach (a = …) | …E-06
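The slides do not show how these errors were computed. One plausible metric, since the generating mixture is known, is the mean squared deviation from the true density on an evaluation grid, sketched below; true_density is the mixture density from the dataset sketch above.

```r
grid  <- seq(min(x), max(x), length.out = 512)
est   <- density(x, kernel = "gaussian")                # one of the estimators
f_hat <- approx(est$x, est$y, xout = grid)$y            # estimate on the grid
mean((f_hat - true_density(grid))^2, na.rm = TRUE)      # mean squared error
```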

Conclusions. The initial mean and sd affect the MLE performance. Since our data are balanced, different kernels do not substantially affect the error of the kernel estimate. The k-NN estimator is slow and inaccurate, especially on large datasets. The penalized approach, which estimates the probability density using smoothing splines, is also slow but more accurate than the kernel estimator.

References
[1] Silverman, Bernard W. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, Vol. 26. CRC Press, 1986.
[2] Deng, Henry, and Hadley Wickham. "Density estimation in R." Electronic publication (2011).
[3] /eshky.pdf

Questions