
Slide 1: Machine learning, pattern recognition and statistical data modelling
Lecture 2: Data exploration and dimensionality reduction
Coryn Bailer-Jones

Slide 2: Last week...
● supervised vs. unsupervised learning
● generalization and regularization
● regression vs. classification
● linear regression (fit via least squares)
  – assumes a global linear fit; stable (low variance) but biased
● k nearest neighbours
  – assumes a local constant fit; less stable (high variance) but less biased
● more complex models permit lower errors on the training data
  – but we want models to generalize
  – we need to control complexity / nonlinearity (regularization) ⇒ assume some degree of smoothness. But how much?

Slide 3: 2-class classification: k-nn and linear regression
Figure © Hastie, Tibshirani, Friedman (2001).
With enough training data, wouldn't k-nn be best?

Slide 4: The curse of dimensionality
● for p = 10, to capture 1% of the data we must cover 63% of the range of each input variable (95% for p = 100)
● as p increases
  – the distance to the neighbours increases
  – most neighbours are near the boundary
● to maintain density (i.e. to properly sample the variance), the number of templates must increase as N^p

For data uniformly distributed in the unit hypercube, define a neighbour volume with edge length e (e < 1):
  neighbour volume = e^p, where p = no. of dimensions.
  If r is the fraction of the unit data volume to be captured, then e = r^(1/p).
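As a quick numerical check of the figures quoted above (a minimal sketch in R; only the relation e = r^(1/p) from the slide is used):

```r
# Edge length e of a hypercube neighbourhood that captures a fraction r of data
# uniformly distributed in the p-dimensional unit hypercube: e = r^(1/p)
edge_length <- function(r, p) r^(1 / p)

edge_length(0.01, 10)    # ~0.63: capturing 1% of the data needs 63% of each axis for p = 10
edge_length(0.01, 100)   # ~0.95: about 95% of each axis for p = 100
```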

Slide 5: Overcoming the curse
● Avoid it by dimensionality reduction
  – throw away less relevant inputs
  – combine inputs
  – use domain knowledge to select/define features
● Make assumptions about the data
  – structured regression
    ● this is essential: an infinite number of functions pass through a finite number of data points
  – complexity control
    ● e.g. smoothness in a local region

Slide 6: Data exploration
● density modelling
  – smoothing
● visualization
  – identify structure, esp. nonlinear
● dimensionality reduction
  – overcome 'the curse'
  – stabler, simpler, more easily understood models
  – identify relevant variables (or combinations thereof)

Slide 7: Density estimation (non-parametric)

Slide 8: Density estimation: histograms
(Bishop 1995)

Slide 9: Kernel density estimation
K() is a fixed kernel function with bandwidth h.
K = no. of neighbours, N = total no. of points, V = volume occupied by the K neighbours.
Simple (Parzen) kernel:
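The formulas shown on the slide are not reproduced in the transcript; in the notation defined above, the standard kernel density estimator (a reconstruction following Bishop 1995) is:

```latex
\hat{p}(\mathbf{x}) \simeq \frac{K}{N V},
\qquad
\hat{p}(\mathbf{x}) = \frac{1}{N h^{p}} \sum_{n=1}^{N}
K\!\left( \frac{\mathbf{x} - \mathbf{x}_n}{h} \right)
```

where the simple (Parzen) kernel is the unit hypercube window: K(u) = 1 if |u_j| ≤ 1/2 for every component j, and 0 otherwise.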

Slide 10: Gaussian kernel
(Bishop 1995; here N is the size of the entire data set)
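The Gaussian-kernel estimator itself does not appear in the transcript; the standard form in Bishop (1995), which the slide presumably shows, is:

```latex
\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N}
\frac{1}{(2\pi h^{2})^{p/2}}
\exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^{2}}{2 h^{2}} \right)
```

i.e. one spherical Gaussian of width h centred on every data point, averaged over the N points of the entire data set.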

Slide 11: k-NN density estimation
K = no. of neighbours, N = total no. of points, V = volume occupied by the K neighbours.
To overcome the fixed kernel size, vary the search volume V until it contains K neighbours.
(Bishop 1995)

Slide 12: Histograms and 1D kernel density estimation
From MASS4 section 5.6. See the R scripts on the course web page.
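Those course scripts are not included in the transcript; a minimal stand-alone sketch of the same comparison on simulated data (the data set and the bandwidths are illustrative assumptions, not the MASS4 example):

```r
# Histogram vs. 1D kernel density estimates with two different bandwidths
set.seed(1)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 0.5))

hist(x, breaks = 30, freq = FALSE,
     main = "Histogram and kernel density estimates", xlab = "x")
lines(density(x, bw = 0.2), col = "red")    # small bandwidth: noisy, undersmoothed
lines(density(x, bw = 1.0), col = "blue")   # large bandwidth: oversmoothed
```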

Slide 13: 2D kernel density estimation
From MASS4 section 5.6. See the R scripts on the course web page.
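Again the course scripts are not reproduced here; a minimal sketch using MASS::kde2d on simulated data (the data, grid size and default bandwidths are illustrative assumptions):

```r
library(MASS)   # provides kde2d()

set.seed(1)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500, sd = 0.7)

# 2D kernel density estimate evaluated on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

image(dens, xlab = "x", ylab = "y")   # density as a colour map
contour(dens, add = TRUE)             # overlay density contours
```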

Slide 14: Classification via (parametric) density modelling

Slide 15: Maximum likelihood estimate of parameters
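The equations on this slide are missing from the transcript; for a multivariate Gaussian the maximum likelihood estimates are the standard ones (a reconstruction, assumed to be what the slide shows):

```latex
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N}
(\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^{\mathsf{T}}
```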

Slide 16: Example: modelling the PDF with two Gaussians
class 1: mean = (0.0, 0.0), standard deviations = (0.5, 0.5)
class 2: mean = (1.0, 1.0), standard deviations = (0.7, 0.3)
See the R scripts on the course web page.
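A minimal sketch of such an example in R, assuming the numbers above are per-class means and per-axis standard deviations with diagonal covariances (the course's own script may differ):

```r
# Simulate two 2D Gaussian classes and classify a point by maximum likelihood
set.seed(1)
n  <- 200
c1 <- cbind(rnorm(n, 0.0, 0.5), rnorm(n, 0.0, 0.5))   # class 1: mean (0,0), sd (0.5, 0.5)
c2 <- cbind(rnorm(n, 1.0, 0.7), rnorm(n, 1.0, 0.3))   # class 2: mean (1,1), sd (0.7, 0.3)

# Maximum likelihood estimates of the class parameters
# (sd() uses 1/(N-1); for large n this is close to the ML value)
mu1 <- colMeans(c1); s1 <- apply(c1, 2, sd)
mu2 <- colMeans(c2); s2 <- apply(c2, 2, sd)

# Class-conditional density: product of independent 1D Gaussians
dens <- function(x, mu, s) dnorm(x[1], mu[1], s[1]) * dnorm(x[2], mu[2], s[2])

# Assign a new point to the class with the larger likelihood (equal priors)
x.new <- c(0.6, 0.6)
if (dens(x.new, mu1, s1) > dens(x.new, mu2, s2)) "class 1" else "class 2"
```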

Slide 17: Capturing variance: Principal Components Analysis (PCA)

Slide 18: Principal Components Analysis
For a given data vector, minimizing the reconstruction error is equivalent to maximizing the variance of the data projected onto the principal components.

Slide 19: Principal Components Analysis: the equations
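The equations themselves are not in the transcript; in standard form, consistent with the summary on slide 32 (a reconstruction, not necessarily the slide's exact notation), they are:

```latex
% Sample covariance matrix of the mean-subtracted data
\mathbf{C} = \frac{1}{N} \sum_{n=1}^{N}
(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^{\mathsf{T}}

% The principal components are its eigenvectors, ordered by eigenvalue (variance)
\mathbf{C}\,\mathbf{u}_j = \lambda_j \mathbf{u}_j ,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p

% Admixture coefficients and the reduced reconstruction using R < p components
a_j = \mathbf{u}_j^{\mathsf{T}} (\mathbf{x} - \bar{\mathbf{x}}) ,
\qquad
\hat{\mathbf{x}} = \bar{\mathbf{x}} + \sum_{j=1}^{R} a_j \mathbf{u}_j
```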

Slide 20: PCA example: MHD stellar spectra
N = 5144 optical spectra, 380–520 nm in p = 820 bins, area normalized.
The spectra show variance with spectral type (SpT). (Bailer-Jones et al. 1998)

Slide 21: MHD stellar spectra: average spectrum

Slide 22: MHD stellar spectra: first 20 eigenvectors

Slide 23: MHD stellar spectra: admixture coefficients vs. spectral type (SpT)

Slide 24: MHD stellar spectra

Slide 25: PCA reduced reconstruction
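A minimal sketch of a reduced reconstruction in R using prcomp (simulated stand-in data; the MHD spectra are not available here, and R = 3 is an arbitrary choice):

```r
# Reduced PCA reconstruction: keep only the first R principal components
set.seed(1)
X <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)   # stand-in data: 200 objects, p = 10

pca  <- prcomp(X, center = TRUE, scale. = FALSE)
R    <- 3                                             # number of components retained
Xhat <- pca$x[, 1:R] %*% t(pca$rotation[, 1:R])       # back-project from R components
Xhat <- sweep(Xhat, 2, pca$center, "+")               # add the mean back

# Fraction of the total variance captured by the first R components
sum(pca$sdev[1:R]^2) / sum(pca$sdev^2)
```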

Slide 26: Reconstruction quality for the MHD spectra
The shape of the curve also depends on the signal-to-noise level.

Slide 27: Reconstruction of an M star
Key to the figure:
- no. of PCs used
- normalized reconstruction error
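The slide's definition of the normalized reconstruction error is not reproduced in the transcript; a common convention (an assumption here, not necessarily the slide's) is the residual norm relative to the norm of the original spectrum:

```latex
E = \frac{\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert}{\lVert \mathbf{x} \rVert}
```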

Slide 28: PCA: explanation is not discrimination
PCA has no class information, so it cannot provide optimal discrimination.

Slide 29: PCA summary
● Linear projection of the data which captures and orders variance
  – the PCs are linear combinations of the data which are uncorrelated and of highest variance
  – equivalent to a rotation of the coordinate system
● Data compression via a reduced reconstruction
● New data can be projected onto the PCs
● The reduced reconstruction acts as a filter
  – removes rare features (low variance measured across the whole data set)
  – poorly reconstructs non-typical objects

Slide 30: PCA as a filter
Figure panels: original spectrum; reconstructed spectrum (R = 25, E = 5.4%); residual.

Slide 31: PCA
● What happens if there are fewer vectors than dimensions, i.e. N < p?
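With N mean-subtracted data vectors the covariance matrix has rank at most N - 1, so at most N - 1 principal components carry non-zero variance. A quick check in R on simulated data (a sketch, not part of the original lecture):

```r
# With N < p the centred data have rank at most N - 1, so at most N - 1
# principal components have non-zero variance
set.seed(1)
N <- 5; p <- 20
X <- matrix(rnorm(N * p), nrow = N, ncol = p)

pca <- prcomp(X, center = TRUE)
round(pca$sdev, 6)   # only the first N - 1 = 4 standard deviations are non-zero
```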

Slide 32: Summary
● curse of dimensionality
● density estimation
  – non-parametric: histograms, kernel method, k-nn
    ● trade-off between the number of neighbours and the volume size
  – parametric: Gaussian; fitting via maximum likelihood
● Principal Components Analysis
  – Principal Components
    ● are the eigenvectors of the covariance matrix
    ● are orthonormal
    ● form an ordered set describing the directions of maximum variance
  – reduced reconstruction: data compression
  – a linear transformation (coordinate rotation)