Test #1 Thursday September 20th


1 Test #1 Thursday September 20th
- No discussion questions
- 5 questions; 4 require some math
- Good idea to review the derivations

2 Multivariate Data
- In most machine-learning problems, attribute space is multi-dimensional.
- This set of lectures covers parametric methods specifically designed to deal with multivariate data.
- The dataset notation X = {xᵗ, rᵗ}, t = 1, …, N, used in supervised machine learning emphasizes the number of examples N and suppresses the dimension of the attribute space through vector notation.

3 Multivariate Data Matrix
Matrix notation for a “d-variate” dataset explicitly shows both the dimension of the attribute space and the size of the training set: X is the N×d matrix whose element in row t, column j is xⱼᵗ, the value of attribute j for example t.
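For concreteness, a minimal NumPy sketch (not from the slides; the values are made up) of storing a d-variate dataset as an N×d matrix:

```python
import numpy as np

# Hypothetical dataset: N = 5 examples, d = 3 attributes.
# Row t is the attribute vector x^t; column j is attribute j across examples.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.3, 3.3, 6.0],
])
N, d = X.shape  # (5, 3)
```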

4 1D Gaussian Distribution
p(x) = N(μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
MLE for μ and σ²: m = (1/N) Σₜ xᵗ and s² = (1/N) Σₜ (xᵗ − m)²
Goal: generalize this distribution to multivariate data.
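A minimal sketch of these MLE formulas in NumPy, assuming a synthetic 1D sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic sample for illustration

m = x.mean()                 # MLE of mu: the sample mean
s2 = ((x - m) ** 2).mean()   # MLE of sigma^2: the biased (1/N) variance
# Equivalent: np.var(x) uses the 1/N convention by default (ddof=0).
```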

5 Multivariate Gaussian Distribution
- The mean μ is a d×1 vector; its components are the means of the individual attributes.
- The variance generalizes to a d×d matrix Σ called the covariance matrix. Its diagonal elements are the σ² of the individual attributes; the off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.
- Σ is symmetric: σᵢₖ = σₖᵢ.

6 Σ = E[(x − μ)(x − μ)ᵀ], where (x − μ) is d×1, (x − μ)ᵀ is 1×d, and Σ is d×d
All of the elements of Σ are quadratic in attribute values.
Dividing the off-diagonal elements by the product of the standard deviations gives the “correlation coefficients” ρᵢⱼ = σᵢⱼ/(σᵢσⱼ), a measure of the correlation between fluctuations in attributes i and j.

7 Parameter Estimation
Sample mean: mⱼ = (1/N) Σₜ xⱼᵗ
Sample covariance: sᵢⱼ = (1/N) Σₜ (xᵢᵗ − mᵢ)(xⱼᵗ − mⱼ)
Sample correlation: rᵢⱼ = sᵢⱼ/(sᵢsⱼ)
Subscripts refer to particular attributes of the input vector xᵗ; sums are over the whole dataset.
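A sketch of these estimators in NumPy, using a synthetic correlated sample; note that np.cov with bias=True reproduces the 1/N convention used here:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated 2D sample for illustration.
X = rng.multivariate_normal([0.0, 1.0], [[2.0, 0.8], [0.8, 1.0]], size=500)
N, d = X.shape

m = X.mean(axis=0)            # d-vector of attribute means
Xc = X - m                    # centered data
S = Xc.T @ Xc / N             # d x d MLE covariance (1/N convention)
# np.cov(X, rowvar=False, bias=True) gives the same 1/N estimate.

std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)    # matrix of correlation coefficients rho_ij
```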

8 Multivariate Normal Distribution
p(x) is a scalar function of d variables. [Figures: the density for d = 1, a bell curve centered at μ with width set by σ, and for d = 2, a surface over the (x₁, x₂) plane.]

9 Multivariate Normal Distribution
p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
The parts of p(x) are scalars: (x − μ)ᵀ is 1×d, Σ⁻¹ is d×d, and (x − μ) is d×1, so the exponent is a scalar.
Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ), analogous to (x − μ)²/σ² in the 1D exponent (x − μ)²/(2σ²).
We need the inverse and determinant of Σ to calculate p(x). The “nice” properties of Σ (symmetric, positive definite) that ensure an inverse exists may not hold for its estimator S.
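A sketch of the Mahalanobis distance and the density in NumPy; np.linalg.solve is used instead of forming Σ⁻¹ explicitly, which is numerically preferable:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)  # avoids forming Sigma^{-1}

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density N(mu, Sigma) evaluated at x."""
    d = len(mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
print(mvn_pdf(np.array([1.0, -0.5]), mu, Sigma))
```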

10 Bivariate Normal
5 parameters: 2 means, 2 variances, and the correlation ρ.
To derive a more useful form of the bivariate normal distribution: calculate the determinant and inverse of Σ, and define zᵢ = (xᵢ − μᵢ)/σᵢ for i = 1, 2. After a lot of tedious algebra,
p(x₁, x₂) = 1/(2πσ₁σ₂√(1 − ρ²)) exp(−(z₁² − 2ρz₁z₂ + z₂²)/(2(1 − ρ²)))
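A sketch of this 5-parameter form in NumPy; the example values are arbitrary:

```python
import numpy as np

def bivariate_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density in terms of the 5 parameters."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    q = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (2 * (1 - rho**2))
    return np.exp(-q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

print(bivariate_pdf(0.5, -0.2, 0.0, 0.0, 1.0, 2.0, 0.6))
```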

11 Bivariate Normal: z normalization
- z₁² − 2ρz₁z₂ + z₂² = constant, with |ρ| < 1, defines an ellipse.
- For ρ > 0 the major axis has positive slope; for ρ < 0, negative slope.
- If ρ = 0, Σ is diagonal (the variables are uncorrelated) and the major axis is aligned with a coordinate axis.
- If, in addition, σ₁ = σ₂, the ellipse becomes a circle.

12 Bivariate normal contour plots
[Figure: contour plots of the bivariate normal density for various values of σ₁, σ₂, and ρ.]


14 Independent Attributes
If the xᵢ are independent, the off-diagonal elements of Σ are 0 and p(x) reduces to a product of probabilities for each component:
p(x) = ∏ᵢ pᵢ(xᵢ) = (2π)^(−d/2) (∏ᵢ σᵢ)⁻¹ exp(−½ Σᵢ ((xᵢ − μᵢ)/σᵢ)²)
Using a property of exponents, the product becomes a single exponential of a sum. (What property of exponents? eᵃeᵇ = eᵃ⁺ᵇ, so ∏ᵢ exp(aᵢ) = exp(Σᵢ aᵢ).)
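A quick numeric check of this reduction, with made-up values; the product of 1D densities matches a single exponential of a sum:

```python
import numpy as np

def gauss1d(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.array([0.3, -1.2, 2.0])
mu = np.array([0.0, -1.0, 1.5])
sigma = np.array([1.0, 0.5, 2.0])

product = np.prod(gauss1d(x, mu, sigma))                 # product of 1D densities
joint = (np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2))  # one exponential of a sum
         / ((2 * np.pi) ** (len(x) / 2) * np.prod(sigma)))
assert np.isclose(product, joint)
```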

15 Multivariate Regression
- Essentially the same as scalar regression; the label is still a real number.
- Since the domain of g(xᵗ|θ) has more dimensions, θ is usually a larger set of parameters.
- As long as g(xᵗ|θ) is linear in the parameters, the greater number is easily handled by linear least squares (see Parametric Methods slides 46 & 47), as the sketch below illustrates.
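A sketch of multivariate linear regression by least squares on synthetic data; the true weights and noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                 # N x d attribute matrix
r = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

A = np.hstack([np.ones((200, 1)), X])         # prepend a column of 1s for w0
w, *_ = np.linalg.lstsq(A, r, rcond=None)     # solves min ||A w - r||^2
print(w)  # approximately [1.0, 2.0, -1.0, 0.5]
```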

16 The general d = 2 quadratic model has 5 terms
Kernel methods: define new variables called “features”: z₁ = x₁, z₂ = x₂, z₃ = x₁², z₄ = x₂², z₅ = x₁x₂.
The model g(x) = w₀ + w₁z₁ + w₂z₂ + w₃z₃ + w₄z₄ + w₅z₅ is a quadratic function of the attributes and a linear function of the features, but both views are linear functions of the parameters.
The process for optimizing the model in feature space is the same as in attribute space.
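A sketch of the feature mapping followed by the same least-squares step as above, on synthetic data (the coefficients are made up):

```python
import numpy as np

def quad_features(X):
    """Map 2D attributes to the 5 quadratic features z1..z5, plus a bias column."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
r = (1 + 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0]**2 - 0.3 * X[:, 0] * X[:, 1]
     + rng.normal(scale=0.05, size=300))

Z = quad_features(X)                       # the model is linear in these features
w, *_ = np.linalg.lstsq(Z, r, rcond=None)  # same least-squares machinery as before
```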

17 In general, the transformation attribute space → feature space increases dimensionality (here, 2D → 5D).
The most frequent objective of an attribute space → feature space transformation is to obtain linearly separable classes.
Example: the XOR problem.

18 XOR: one or the other but not both
Truth table:
x₁ x₂ | r
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0
[Figure: graphical representation of the four points in the (x₁, x₂) plane.] The classes are not linearly separable in attribute space.

19 XOR in feature space
f₁ = exp(−||x − [1,1]ᵀ||²), f₂ = exp(−||x − [0,0]ᵀ||²)
x      (f₁, f₂)
(1,1)  (1.000, 0.135)
(0,1)  (0.368, 0.368)
(0,0)  (0.135, 1.000)
(1,0)  (0.368, 0.368)
Linear separation is achieved without an increase in dimensions (2 attributes → 2 features); both class-1 points map to the same feature point, allowing maximum margins.
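A sketch reproducing the table, assuming f₂ = exp(−||x − [0,0]ᵀ||²); only f₁ is defined explicitly on the slide, so the second center is an assumption:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
r = np.array([0, 1, 1, 0])  # XOR labels

f1 = np.exp(-np.sum((X - np.array([1.0, 1.0])) ** 2, axis=1))  # from the slide
f2 = np.exp(-np.sum((X - np.array([0.0, 0.0])) ** 2, axis=1))  # assumed center (0,0)

print(np.column_stack([f1, f2]).round(3))
# In (f1, f2) space a line such as f1 + f2 = 0.9 separates the classes:
print((f1 + f2 < 0.9).astype(int))  # reproduces the XOR labels [0, 1, 1, 0]
```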

20 Classification with multivariate normal
If p(x|Cᵢ) ~ N(μᵢ, Σᵢ) (a Gaussian class likelihood), the discriminant functions are
gᵢ(x) = log p(x|Cᵢ) + log P(Cᵢ) = −(d/2) log 2π − ½ log|Σᵢ| − ½ (x − μᵢ)ᵀ Σᵢ⁻¹ (x − μᵢ) + log P(Cᵢ)
The model requires a mean vector and a covariance matrix for each class.
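A sketch of this discriminant in NumPy; the helper name is illustrative:

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """g_i(x) = log p(x|C_i) + log P(C_i) for a Gaussian class likelihood."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)
            + np.log(prior))

# Classify by picking the class with the largest discriminant, e.g.:
# predicted = max(range(K), key=lambda i: discriminant(x, mu[i], Sigma[i], prior[i]))
```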

21 Estimate Class Parameters
As in the scalar case, use the indicator rᵢᵗ to pick out class i examples in sums over the whole dataset:
P̂(Cᵢ) = (Σₜ rᵢᵗ)/N (scalar)
mᵢ = (Σₜ rᵢᵗ xᵗ)/(Σₜ rᵢᵗ) (d×1 vector)
Sᵢ = (Σₜ rᵢᵗ (xᵗ − mᵢ)(xᵗ − mᵢ)ᵀ)/(Σₜ rᵢᵗ) (d×d matrix)
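A sketch of these three estimators, assuming one-hot labels R with rᵢᵗ in column i:

```python
import numpy as np

def estimate_class_params(X, R):
    """MLE class priors, means, and covariances from data X (N x d) and
    one-hot labels R (N x K), following the r_i^t indicator sums."""
    N, d = X.shape
    K = R.shape[1]
    counts = R.sum(axis=0)                # sum_t r_i^t for each class
    priors = counts / N                   # scalar per class
    means = (R.T @ X) / counts[:, None]   # d-vector per class
    covs = np.empty((K, d, d))
    for i in range(K):
        Xc = X - means[i]
        covs[i] = (R[:, i, None] * Xc).T @ Xc / counts[i]  # d x d per class
    return priors, means, covs
```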

22 Discriminant P(C₁|x) = 0.5: 2 attributes and 2 classes
[Figures: the class likelihoods and the posterior for C₁.] Setting g₁(x) = g₂(x) gives a curve in the (x₁, x₂) plane.

23 Analogous to the single attribute case
- Equal variances: setting g₁(x) = g₂(x) determines the boundary between classes.
- Example: 2 classes whose likelihoods have means ±2 and equal variance; the priors are also equal.
- Between −2 and 2 there is a transition between essentially certain classification of each class as a function of x.
- With equal priors and variances, the Bayes discriminant point is halfway between the means, as the check below shows.
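A one-line check of the halfway claim, assuming 1D Gaussians with equal variance σ² and equal priors (the log-prior terms cancel):

```latex
g_1(x) = g_2(x)
\;\Rightarrow\; \frac{(x-\mu_1)^2}{2\sigma^2} = \frac{(x-\mu_2)^2}{2\sigma^2}
\;\Rightarrow\; (x-\mu_1)^2 = (x-\mu_2)^2
\;\Rightarrow\; x = \frac{\mu_1+\mu_2}{2}
```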

24 Different Σᵢ for each class
With d attributes and K classes, we have Kd means and K·d(d+1)/2 distinct elements of covariance matrices. Often the dataset is too small to determine all of these parameters.
A solution: keep the class means but pool the data to calculate a common covariance matrix for all classes, S = Σᵢ P̂(Cᵢ) Sᵢ, leaving Kd + d(d+1)/2 parameters.
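A sketch of the pooling step, reusing the per-class estimates from the earlier sketch:

```python
import numpy as np

def pooled_covariance(priors, covs):
    """Common covariance S = sum_i P(C_i) * S_i: each class covariance
    is weighted by its estimated prior."""
    return np.sum(priors[:, None, None] * covs, axis=0)

# Usage: priors, means, covs = estimate_class_params(X, R)
#        S = pooled_covariance(priors, covs)
```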

25 Common Covariance Matrix S
With a shared S the discriminant becomes a linear discriminant:
gᵢ(x) = −½ (x − mᵢ)ᵀ S⁻¹ (x − mᵢ) + log P̂(Cᵢ) = wᵢᵀx + wᵢ₀, with wᵢ = S⁻¹mᵢ and wᵢ₀ = −½ mᵢᵀS⁻¹mᵢ + log P̂(Cᵢ)
What happened to the quadratic term? The term −½ xᵀS⁻¹x is the same for every class, so it cancels when the discriminants are compared.
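A sketch computing wᵢ and wᵢ₀ from the pooled covariance:

```python
import numpy as np

def linear_discriminant_params(mean_i, S, prior_i):
    """w_i = S^{-1} m_i and w_i0 = -0.5 m_i^T S^{-1} m_i + log P(C_i),
    so that g_i(x) = w_i . x + w_i0."""
    w = np.linalg.solve(S, mean_i)          # S^{-1} m_i without inverting S
    w0 = -0.5 * mean_i @ w + np.log(prior_i)
    return w, w0
```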

26 2 attributes and 2 classes with common covariance
[Figure: the linear discriminant between the two classes.] What can we say from this graph about the value of ρ in the common covariance matrix?

27 Further parameter reduction: diagonal S
Independent attributes, different class means, common variances: Kd + d parameters.
p(x|Cᵢ) = ∏ⱼ p(xⱼ|Cᵢ); S is diagonal, and the discriminant reduces to
gᵢ(x) = −½ Σⱼ ((xⱼ − mᵢⱼ)/sⱼ)² + log P̂(Cᵢ)
What is mᵢⱼ?
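A sketch of the diagonal-covariance discriminant, with s the shared vector of per-attribute standard deviations:

```python
import numpy as np

def diag_discriminant(x, mean_i, s, prior_i):
    """g_i(x) = -0.5 * sum_j ((x_j - m_ij)/s_j)^2 + log P(C_i)."""
    return -0.5 * np.sum(((x - mean_i) / s) ** 2) + np.log(prior_i)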

28 2 attributes and 2 classes with common diagonal covariance matrix
The variances may be different; the result is an axis-aligned, linear discriminant. [Figure: contours and discriminant.]

29 Different means; equal variances
Nearest mean classifier, Kd + 1 parameters: classify based on Euclidean distance to the nearest mean,
gᵢ(x) = −||x − mᵢ||²/(2s²) + log P̂(Cᵢ)
mᵢⱼ = mean of the jth attribute in the ith class.
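A sketch of the nearest mean classifier (equal priors assumed, so the log-prior terms can be dropped):

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Assign x to the class whose mean vector is closest in Euclidean distance."""
    dists = np.linalg.norm(means - x, axis=1)  # ||x - m_i|| for each class i
    return int(np.argmin(dists))
```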

30 2 attributes and 2 classes with equal variance
The linear discriminant is the perpendicular bisector of the line connecting the means. [Figure.]

31 Summary of options for covariance matrix

Assumption                   Covariance matrix   # of parameters
Shared, hyperspheric         Σᵢ = Σ = s²I        1
Shared, axis-aligned         Σᵢ = Σ (diagonal)   d
Shared, hyperellipsoidal     Σᵢ = Σ              d(d+1)/2
Different, hyperellipsoidal  Σᵢ                  K·d(d+1)/2

32 Discrete attributes: xᵗ and rᵗ are Boolean vectors
Each attribute follows a Bernoulli distribution with pᵢⱼ = p(xⱼ = 1|Cᵢ), estimated by
p̂ᵢⱼ = (Σₜ rᵢᵗ xⱼᵗ)/(Σₜ rᵢᵗ) = (# of class-i examples with positive xⱼ)/(size of class i)
Assuming independent attributes, the discriminant is
gᵢ(x) = Σⱼ [xⱼ log p̂ᵢⱼ + (1 − xⱼ) log(1 − p̂ᵢⱼ)] + log P̂(Cᵢ)
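A sketch of the Bernoulli estimates and discriminant, again assuming one-hot labels R:

```python
import numpy as np

def estimate_bernoulli(X, R):
    """p_ij = (# class-i examples with x_j = 1) / (size of class i),
    for a Boolean attribute matrix X (N x d) and one-hot labels R (N x K)."""
    return (R.T @ X) / R.sum(axis=0)[:, None]

def bernoulli_discriminant(x, p_i, prior_i):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i),
    assuming independent attributes."""
    return np.sum(x * np.log(p_i) + (1 - x) * np.log(1 - p_i)) + np.log(prior_i)
```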

33 2 attributes and 2 classes
[Figure: example likelihood contours and discriminants.] What is the value of ρ in each example?

34 2 attributes and 2 classes
[Figure: examples (a) through (d).] Why is the discriminant curved in (b) through (d)?

35 2 attributes and 2 classes
[Figure.] How are the covariance matrices approximated in each case?

36 2 attributes, 2 classes: same mean, different covariance
One contour is shown for each class likelihood; the discriminant is dark blue. Describe the covariance matrices in each case:
a) Σ = σ²I for both classes; one class has a larger variance than the other
b) Σ₁ = σ₁²I; Σ₂ is diagonal, with variance of x > variance of y
c) Σ₁ = σ₁²I; Σ₂ has correlation ρ > 0
d) Both Σ₁ and Σ₂ have correlation: ρ₁ > 0 and ρ₂ < 0

37 The function of 2 attributes
f(x₁, x₂) = w₀ + sin(w₁x₁) + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂² has been proposed as a model of dataset X = {xᵗ, rᵗ}. Can the parameters be obtained by linear regression?
A second version: f(x₁, x₂) = sin(w₀) + w₁x₁ + w₂x₂ + w₃x₁x₂ + w₄x₁² + w₅x₂² has been proposed as a model of dataset X = {xᵗ, rᵗ}. Can the parameters be obtained by linear regression?


