Test #1 Thursday September 20th

Test #1, Thursday, September 20th: no discussion questions; 5 questions, 4 of which require some math. It is a good idea to review the derivations.

Multivariate Data: In most machine-learning problems, the attribute space is multi-dimensional. This set of lectures covers parametric methods designed specifically for multivariate data. The vector notation for a supervised-learning dataset, X = {xᵗ, rᵗ} for t = 1, …, N, emphasizes the number of examples in the set and suppresses the dimension of the attribute space.

Multivariate Data Matrix: Matrix notation for a “d-variate” dataset, with X written as an N × d matrix whose rows are the examples and whose columns are the attributes, explicitly shows both the dimension of the attribute space and the size of the training set. (Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press.)

1D Gaussian Distribution: p(x) = N(x; μ, σ²) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). The maximum-likelihood estimates of μ and σ² are m = (1/N) Σₜ xᵗ and s² = (1/N) Σₜ (xᵗ − m)². The goal is to generalize this distribution to multivariate data.
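A minimal NumPy sketch of these maximum-likelihood estimates; the sample and its true parameters are made up for illustration:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # illustrative sample

m = x.mean()                 # m = (1/N) * sum_t x^t
s2 = ((x - m) ** 2).mean()   # s^2 = (1/N) * sum_t (x^t - m)^2
print(m, s2)                 # close to mu = 2.0 and sigma^2 = 2.25
```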

Multivariate Gaussian Distribution: The mean μ is a vector whose components are the means of the individual attributes. The variance becomes a matrix Σ called the “covariance”: its diagonal elements are the variances σⱼ² of the individual attributes, and its off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another. Σ is symmetric, σᵢₖ = σₖᵢ.

The covariance Σ = E[(x − μ)(x − μ)ᵀ] is d×d, built from the d×1 column x − μ and its 1×d transpose, so all of its elements are quadratic in the attribute values. Dividing the off-diagonal elements by the product of the corresponding standard deviations gives the “correlation coefficients” ρᵢⱼ = σᵢⱼ/(σᵢσⱼ), a measure of the correlation between fluctuations in attributes i and j.
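A short sketch of this covariance-to-correlation conversion; the covariance matrix below is a made-up example:

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite)
Sigma = np.array([[ 2.0, 0.6, -0.3],
                  [ 0.6, 1.0,  0.2],
                  [-0.3, 0.2,  1.5]])

std = np.sqrt(np.diag(Sigma))     # per-attribute standard deviations
R = Sigma / np.outer(std, std)    # rho_ij = sigma_ij / (sigma_i * sigma_j)
print(R)                          # ones on the diagonal, correlations off it
```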

Parameter Estimation: The sample estimators are mⱼ = (1/N) Σₜ xⱼᵗ for the means and sⱼₖ = (1/N) Σₜ (xⱼᵗ − mⱼ)(xₖᵗ − mₖ) for the covariances. Subscripts refer to particular attributes of the input vector xᵗ; sums are over the whole dataset.
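A brief NumPy sketch of these estimators on an illustrative N × d data matrix (the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))        # illustrative N x d data matrix (N = 500, d = 3)

m = X.mean(axis=0)                   # sample mean vector, m_j = (1/N) * sum_t x_j^t
S = (X - m).T @ (X - m) / len(X)     # sample covariance matrix (divides by N, as in the MLE)
print(m)
print(S)
```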

Multivariate Normal Distribution: p(x) is a scalar function of d variables. [Figure: the density for d = 1 and for d = 2.]

Multivariate Normal Distribution: p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)). Although x − μ is d×1, (x − μ)ᵀ is 1×d, and Σ⁻¹ is d×d, the parts of p(x) combine into scalars. The quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ) is the squared Mahalanobis distance, analogous to (x − m)²/s² in one dimension. We need the inverse and determinant of Σ to calculate p(x), and the “nice” properties of Σ that guarantee an inverse need not hold for its estimator S.
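A small sketch of the density written exactly this way, with the inverse handled by a linear solve and the determinant computed explicitly; the numbers are illustrative:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, using the inverse and determinant of Sigma."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)              # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```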

Bivariate Normal: The bivariate case has 5 parameters: 2 means, 2 variances, and the correlation ρ. To derive a more useful form of the bivariate normal distribution, calculate the determinant and inverse of Σ and define zᵢ = (xᵢ − μᵢ)/σᵢ, i = 1, 2. After a lot of tedious algebra, p(x₁, x₂) = 1/(2πσ₁σ₂√(1 − ρ²)) · exp(−(z₁² − 2ρz₁z₂ + z₂²)/(2(1 − ρ²))).

Bivariate Normal, z-normalization: z₁² − 2ρz₁z₂ + z₂² = constant, with |ρ| < 1, defines an ellipse. For ρ > 0 the major axis has positive slope; for ρ < 0, negative slope. If ρ = 0, Σ is diagonal (the variables are uncorrelated) and the major axis is aligned with a coordinate axis; if in addition σ₁ = σ₂, the ellipse becomes a circle.

[Figure: bivariate normal contour plots.]
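A sketch that generates contour plots of this kind with SciPy and matplotlib, assuming unit variances and a few illustrative values of ρ:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.dstack((x1, x2))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, rho in zip(axes, (0.0, 0.5, -0.8)):          # illustrative correlations
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    p = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).pdf(grid)
    ax.contour(x1, x2, p)
    ax.set_title(f"rho = {rho}")
plt.show()
```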

Independent Attributes: If the xᵢ are independent, the off-diagonal elements of Σ are 0 and p(x) reduces to a product of one-dimensional probabilities, p(x) = ∏ⱼ pⱼ(xⱼ) = ∏ⱼ (1/(σⱼ√(2π))) exp(−(xⱼ − μⱼ)²/(2σⱼ²)). Using a property of exponents, the product becomes (1/((2π)^(d/2) ∏ⱼ σⱼ)) exp(−½ Σⱼ (xⱼ − μⱼ)²/σⱼ²). What property of exponents?
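A quick numerical check of this factorization, comparing the joint density with a diagonal Σ against the product of one-dimensional normals (the particular numbers are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0, 0.5])
var = np.array([0.5, 2.0, 1.5])                   # diagonal of Sigma (independent attributes)
x = np.array([0.8, -1.0, 1.2])

joint = multivariate_normal(mean=mu, cov=np.diag(var)).pdf(x)
product = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
print(joint, product)                             # equal up to floating-point error
```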

Multivariate Regression: Essentially the same as scalar regression; the label is still a real number. Since the domain of g(xᵗ | θ) has more dimensions, θ is usually a larger set of parameters. As long as g(xᵗ | θ) is linear in the parameters, the greater number is easily handled by linear least squares (see Parametric Methods slides 46 & 47).
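A minimal linear least-squares sketch for the multivariate case, using NumPy's lstsq on synthetic data (the generating coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                                            # N x d attributes
r = 1.0 + X @ np.array([0.5, -2.0, 3.0]) + 0.1 * rng.normal(size=200)    # real-valued labels

A = np.column_stack([np.ones(len(X)), X])          # prepend a column of 1s for w0
w, *_ = np.linalg.lstsq(A, r, rcond=None)          # linear least-squares estimate
print(w)                                           # roughly [1.0, 0.5, -2.0, 3.0]
```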

The general d = 2 quadratic model has 5 terms. Kernel methods: define new variables called “features”, z₁ = x₁, z₂ = x₂, z₃ = x₁², z₄ = x₂², z₅ = x₁x₂. The model is a quadratic function of the attributes and a linear function of the features, but in both views it is a linear function of the parameters. The process for optimizing the model in feature space is the same as in attribute space.
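A sketch of this idea: the model below is quadratic in the attributes but linear in the parameters, so the same least-squares machinery applies (the target function is invented for illustration):

```python
import numpy as np

def quad_features(X):
    """Map 2-D attributes (x1, x2) to the features z1..z5 defined above."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 2))
r = 2.0 - X[:, 0] + 3.0 * X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=300)

Z = quad_features(X)
A = np.column_stack([np.ones(len(Z)), Z])     # still linear in the parameters
w, *_ = np.linalg.lstsq(A, r, rcond=None)
print(w)                                      # recovers the constant, x1, and x1*x2 coefficients
```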

In general, the transformation from attribute space to feature space increases dimensionality (here 2D → 5D). The most frequent objective of moving from attribute space to feature space is to obtain linearly separable classes. Example: the XOR problem.

XOR: one or the other, but not both. [Figure: truth table and graphical representation.] The classes are not linearly separable in attribute space.

XOR in feature space: f₁ = exp(−‖x − [1,1]ᵀ‖²), with f₂ defined analogously (centered at [0,0]ᵀ, as the table values indicate).

x        f₁        f₂
(1,1)    1         0.1353
(0,1)    0.3678    0.3678
(0,0)    0.1353    1
(1,0)    0.3678    0.3678

In (f₁, f₂) space, linear separation is achieved without an increase in dimensions, and with maximum margins.
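A small NumPy check that reproduces the table, assuming the two RBF features are centered at [1,1]ᵀ and [0,0]ᵀ:

```python
import numpy as np

points = np.array([(1, 1), (0, 1), (0, 0), (1, 0)], dtype=float)
c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # assumed feature centers

f1 = np.exp(-np.sum((points - c1) ** 2, axis=1))      # f1 = exp(-||x - [1,1]||^2)
f2 = np.exp(-np.sum((points - c2) ** 2, axis=1))      # f2 = exp(-||x - [0,0]||^2)
for x, a, b in zip(points, f1, f2):
    print(x, round(a, 4), round(b, 4))                # 1, 0.3679, 0.1353, matching the table
```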

Classification with a multivariate normal: if the Gaussian class likelihoods are p(x | Cᵢ) ~ N(μᵢ, Σᵢ), the discriminant functions are gᵢ(x) = −½ log|Σᵢ| − ½ (x − μᵢ)ᵀ Σᵢ⁻¹ (x − μᵢ) + log P(Cᵢ). The model requires a mean vector and a covariance matrix for each class.

Estimate Class Parameters: As in the scalar case, use the indicator rᵢᵗ to pick out class-i examples in sums over the whole dataset: the estimated prior P(Cᵢ) = Σₜ rᵢᵗ / N (a scalar), mᵢ = Σₜ rᵢᵗ xᵗ / Σₜ rᵢᵗ (a d×1 vector), and Sᵢ = Σₜ rᵢᵗ (xᵗ − mᵢ)(xᵗ − mᵢ)ᵀ / Σₜ rᵢᵗ (a d×d matrix).
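A sketch of these estimates and of the discriminant above, assuming the class indicators rᵢᵗ are stored as a one-hot N × K matrix (the names and layout are illustrative):

```python
import numpy as np

def fit_class_params(X, r):
    """Estimate prior, mean vector, and covariance matrix for each class.
    X: N x d attribute matrix; r: N x K one-hot class indicators r_i^t."""
    priors, means, covs = [], [], []
    for i in range(r.shape[1]):
        Xi = X[r[:, i] == 1]
        mi = Xi.mean(axis=0)
        priors.append(len(Xi) / len(X))
        means.append(mi)
        covs.append((Xi - mi).T @ (Xi - mi) / len(Xi))
    return np.array(priors), np.array(means), np.array(covs)

def quadratic_discriminant(x, prior, mean, cov):
    """g_i(x) = -1/2 log|S_i| - 1/2 (x - m_i)^T S_i^{-1} (x - m_i) + log P(C_i)."""
    diff = x - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.solve(cov, diff)
            + np.log(prior))
```

A point x is then assigned to the class with the largest gᵢ(x).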

2 attributes and 2 classes. [Figure: the class likelihoods and the posterior for C₁; the discriminant is where P(C₁ | x) = 0.5.] Setting g₁(x) = g₂(x) gives a curve in the (x₁, x₂) plane.

Analogous to the single-attribute case with equal variances, setting g₁(x) = g₂(x) determines the boundary between the classes. Example: 2 classes whose likelihoods have means ±2 and equal variance; the priors are also equal. Between −2 and 2 there is a transition between essentially certain classification of one class or the other as a function of x. With equal priors and variances, the Bayes discriminant point is halfway between the means.

Different Σᵢ for each class: With d attributes and K classes, we have Kd means and K·d(d+1)/2 distinct elements of the covariance matrices. Often the dataset is too small to determine all of these parameters. A solution: keep the class means but pool the data to calculate a common covariance matrix for all classes, leaving Kd + d(d+1)/2 parameters.
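One common way to pool, sketched below, is to average the class covariance matrices weighted by their estimated priors (reusing the one-hot indicator layout from the earlier sketch):

```python
import numpy as np

def pooled_covariance(X, r):
    """Shared covariance estimate: S = sum_i P(C_i) * S_i."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for i in range(r.shape[1]):
        Xi = X[r[:, i] == 1]
        mi = Xi.mean(axis=0)
        Si = (Xi - mi).T @ (Xi - mi) / len(Xi)
        S += (len(Xi) / len(X)) * Si
    return S
```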

Common Covariance Matrix S: the discriminant becomes linear, gᵢ(x) = wᵢᵀ x + wᵢ₀ with wᵢ = S⁻¹ mᵢ and wᵢ₀ = −½ mᵢᵀ S⁻¹ mᵢ + log P(Cᵢ). What happened to the quadratic term?

[Figure: 2 attributes and 2 classes with a common covariance matrix; the discriminant is linear.] What can we say from this graph about the value of ρ in the common covariance matrix?

Further parameter reduction, diagonal S: independent attributes, different class means, and a common variance for each attribute, giving Kd + d parameters. Then p(x | Cᵢ) = ∏ⱼ p(xⱼ | Cᵢ); S is diagonal and the discriminant reduces to gᵢ(x) = −½ Σⱼ ((xⱼ − mᵢⱼ)/sⱼ)² + log P(Cᵢ). What is mᵢⱼ?

[Figure: 2 attributes and 2 classes with a common diagonal covariance matrix; the variances of the two attributes may differ. The likelihood contours are axis-aligned and the discriminant is linear.]

Different means, equal variances (Σᵢ = s²I): the nearest-mean classifier, with Kd + 1 parameters. Classify based on the Euclidean distance to the nearest class mean; mᵢⱼ is the mean of the jth attribute in the ith class.
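A minimal sketch of the nearest-mean rule; the class means here are placeholders:

```python
import numpy as np

def nearest_mean_predict(x, means):
    """means: K x d matrix of class means m_i; returns the index of the nearest mean."""
    d2 = np.sum((means - x) ** 2, axis=1)       # squared Euclidean distance to each m_i
    return int(np.argmin(d2))

means = np.array([[0.0, 0.0], [3.0, 3.0]])      # illustrative class means
print(nearest_mean_predict(np.array([2.5, 2.0]), means))   # -> 1
```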

[Figure: 2 attributes and 2 classes with equal variances; the linear discriminant is the perpendicular bisector of the line connecting the means.]

Summary of options for the covariance matrix:

Assumption                     Covariance matrix     # of parameters
Shared, hyperspheric           Σᵢ = Σ = s²I          1
Shared, axis-aligned           Σᵢ = Σ, diagonal      d
Shared, hyperellipsoidal       Σᵢ = Σ                d(d+1)/2
Different, hyperellipsoidal    Σᵢ                    K·d(d+1)/2

Discrete attributes: xᵗ and rᵗ are Boolean vectors. Each attribute follows a Bernoulli distribution within a class, pᵢⱼ = p(xⱼ = 1 | Cᵢ), estimated by Σₜ xⱼᵗ rᵢᵗ / Σₜ rᵢᵗ, i.e. the number of positive xⱼ in class i divided by the size of class i.
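A hedged sketch of these Bernoulli estimates, together with the standard naive-Bayes-style discriminant they lead to (the one-hot layout for rᵗ is assumed as before):

```python
import numpy as np

def fit_bernoulli(X, r, eps=1e-9):
    """p_hat[i, j] = (# examples of class i with x_j = 1) / (size of class i)."""
    counts = r.T @ X                            # K x d counts of positive x_j per class
    sizes = r.sum(axis=0, keepdims=True).T      # K x 1 class sizes
    return np.clip(counts / sizes, eps, 1 - eps)

def bernoulli_discriminant(x, p_hat, priors):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i), for all i."""
    return x @ np.log(p_hat).T + (1 - x) @ np.log(1 - p_hat).T + np.log(priors)
```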

2 attributes and 2 classes: What is the value of ρ in each example?

2 attributes and 2 classes: Why is the discriminant curved in (b) – (d)?

2 attributes and 2 classes: How are the covariance matrices approximated in each case?

2 attributes, 2 classes, same mean, different covariance. [Figure: one contour shown for each class likelihood; the discriminant is dark blue.] Describe the covariance matrices in each case: (a) both are of the form s²I, but one class has a larger variance than the other; (b) Σ₁ = σ₁²I, Σ₂ is diagonal with the variance of x greater than the variance of y; (c) Σ₁ = σ₁²I, Σ₂ has correlation ρ > 0; (d) both Σ₁ and Σ₂ have correlation, with ρ₁ > 0 and ρ₂ < 0.

The function of 2 attributes f(x₁, x₂) = w₀ + sin(w₁x₁) + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂² has been proposed as a model of dataset X = {xᵗ, rᵗ}. Can the parameters be obtained by linear regression?

The function of 2 attributes f(x₁, x₂) = sin(w₀) + w₁x₁ + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂² has been proposed as a model of dataset X = {xᵗ, rᵗ}. Can the parameters be obtained by linear regression?