Major Concepts
- Gaussian, Multinomial, and Bernoulli Distributions
- Joint vs. Conditional Distributions
- Marginalization
- Maximum Likelihood
- Risk Minimization
- Gradient Descent
- Feature Extraction, Kernel Methods
Some Favorite Distributions
- Bernoulli
- Multinomial
- Gaussian
Maximum Likelihood
- Identify the parameter values that yield the maximum likelihood of generating the observed data:
  - Take the partial derivative of the likelihood function
  - Set it to zero
  - Solve
- NB: maximum likelihood parameters are the same as maximum log likelihood parameters
Maximum Log Likelihood
- Why do we like the log function? It turns products (difficult to differentiate) into sums (easy to differentiate):
  - log(xy) = log(x) + log(y)
  - log(x^c) = c log(x)
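A quick worked example (not on the slide): the maximum likelihood estimate for a Bernoulli parameter, following exactly the derivative / set-to-zero / solve recipe above.

```latex
% n coin flips x_1..x_n with k heads, Bernoulli parameter \mu
L(\mu) = \prod_{i=1}^{n} \mu^{x_i} (1-\mu)^{1-x_i} = \mu^{k} (1-\mu)^{n-k}
\qquad
\log L(\mu) = k \log \mu + (n-k) \log(1-\mu)
\qquad
\frac{\partial \log L}{\partial \mu}
  = \frac{k}{\mu} - \frac{n-k}{1-\mu} = 0
\;\Rightarrow\;
\hat{\mu} = \frac{k}{n}
```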
Risk Minimization
- Pick a loss function:
  - Squared loss
  - Linear loss
  - Perceptron (classification) loss
- Identify the parameters that minimize the loss function:
  - Take the partial derivative of the loss function
  - Set it to zero
  - Solve
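For squared loss with a linear model, this recipe has a closed form. A minimal sketch (with made-up data, not from the slides): setting the gradient of the squared loss to zero gives the normal equations.

```python
import numpy as np

# Minimizing the squared loss ||y - Xw||^2 by setting its gradient
# to zero yields the normal equations X^T X w = X^T y.
def least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + one feature
y = np.array([1.0, 2.9, 5.1])                       # made-up targets
print(least_squares(X, y))                          # roughly [0.95, 2.05]
```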
Frequentists vs. Bayesians
- Point estimates vs. posteriors
- Risk minimization vs. maximum likelihood
- L2-regularization:
  - Frequentists: add a constraint on the size of the weight vector
  - Bayesians: introduce a zero-mean prior on the weight vector
  - The result is the same!
L2-Regularization
- Frequentists: introduce a cost on the size of the weights
- Bayesians: introduce a prior on the weights
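A sketch of why the two views coincide (a standard result; the noise variance \sigma^2 and prior variance \tau^2 are assumed notation, not from the slides): the MAP objective under a zero-mean Gaussian prior is the squared loss plus an L2 penalty with \lambda = \sigma^2 / \tau^2.

```latex
% Frequentist: penalized risk minimization (ridge regression)
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}}
  \sum_i \bigl(y_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2
  + \lambda \lVert \mathbf{w} \rVert^2

% Bayesian: MAP with Gaussian likelihood and a zero-mean Gaussian
% prior p(w) = N(w | 0, tau^2 I) gives the same objective:
-\log p(\mathbf{w} \mid \mathbf{y}) \;\propto\;
  \sum_i \bigl(y_i - \mathbf{w}^\top \mathbf{x}_i\bigr)^2
  + \frac{\sigma^2}{\tau^2} \lVert \mathbf{w} \rVert^2
```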
Types of Classifiers
- Generative Models: highest resource requirements; need to approximate the joint probability
- Discriminative Models: moderate resource requirements; typically fewer parameters to approximate than generative models
- Discriminant Functions: can be trained probabilistically, but the output does not include confidence information
Linear Regression
- Extension to higher dimensions
- Polynomial fitting
- Arbitrary function fitting:
  - Wavelets
  - Radial basis functions
- Classifier output
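A minimal sketch of the basis-expansion idea (polynomial features here; wavelets or RBFs would only change `build_features`, a hypothetical helper name):

```python
import numpy as np

# Linear regression stays linear in the weights even when the
# features are nonlinear functions of the input, e.g. powers of x.
def build_features(x, degree=3):
    return np.vander(x, degree + 1)  # columns x^3, x^2, x, 1

x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x)                        # made-up target function
Phi = build_features(x)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit
print(w)                                     # polynomial coefficients
```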
Logistic Regression
- Fit Gaussians to the data for each class; the decision boundary is where the PDFs cross
- No closed-form solution when the gradient is set to zero, so we use gradient descent
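A minimal gradient descent sketch for logistic regression (the fixed step size and iteration count are arbitrary choices):

```python
import numpy as np

# The gradient of the negative log likelihood for logistic regression
# is X^T (sigmoid(Xw) - y); with no closed-form root, we descend on it.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (X.T @ (sigmoid(X @ w) - y))
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])  # made-up labels
print(fit_logistic(X, y))
```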
Graphical Models
- A general way to describe the dependence relationships between variables
- The Junction Tree Algorithm allows us to efficiently calculate marginals over any variable
Junction Tree Algorithm
- Moralization: "marry the parents," then make the graph undirected
- Triangulation: add chords so that no chordless cycle of length four or more remains
- Junction tree construction: identify separators such that the running intersection property holds
- Introduction of evidence
- Pass messages around the junction tree to generate marginals
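A sketch of the moralization step only, using networkx (the later steps are not shown; the graph is a made-up example):

```python
import itertools
import networkx as nx

# "Marry the parents": connect every pair of parents of each node,
# then drop the edge directions.
def moralize(dag):
    moral = dag.to_undirected()
    for node in dag.nodes:
        for a, b in itertools.combinations(dag.predecessors(node), 2):
            moral.add_edge(a, b)
    return moral

dag = nx.DiGraph([("A", "C"), ("B", "C")])  # A and B are parents of C
print(moralize(dag).edges)                  # includes the new A-B edge
```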
Hidden Markov Models
- Sequential modeling
- A generative model of the relationship between observations and state (class) sequences
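A sketch of the generative story (all parameters are made-up illustrative values): sample a state sequence from the transition matrix, and an observation from the emission distribution at each state.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])  # state transition matrix
B = np.array([[0.7, 0.3], [0.1, 0.9]])  # emission probabilities per state
pi = np.array([0.5, 0.5])               # initial state distribution

state = rng.choice(2, p=pi)
states, obs = [], []
for _ in range(10):
    states.append(state)
    obs.append(rng.choice(2, p=B[state]))  # emit from the current state
    state = rng.choice(2, p=A[state])      # then transition
print(states, obs)
```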
Perceptron
- A step function is used for squashing
- Classifier-as-neuron metaphor
Perceptron Loss
- Classification error vs. sigmoid error
- Loss is only calculated on mistakes; perceptrons use strictly classification error
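A minimal sketch of the classic perceptron rule, which makes the "loss only on mistakes" point concrete (labels assumed in {-1, +1}):

```python
import numpy as np

# The step function squashes the activation to a sign; the weights
# are updated only when that sign disagrees with the label.
def train_perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:  # a mistake (or on the boundary)
                w += y_i * x_i        # no update, and no loss, otherwise
    return w
```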
Neural Networks
- Interconnected layers of Perceptron or Logistic Regression "neurons"
Neural Networks
- There are many possible configurations of neural networks:
  - Vary the number of layers
  - Vary the size of each layer
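A minimal forward-pass sketch (sigmoid units; the 2-3-1 layer sizes and random weights are arbitrary choices):

```python
import numpy as np

# Each layer is a linear map followed by a sigmoid squashing
# function, i.e. a layer of logistic-regression "neurons".
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    for W in weights:
        x = sigmoid(W @ x)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]  # a 2-3-1 net
print(forward(np.array([0.5, -0.2]), weights))
```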
Support Vector Machines
- Maximum margin classification
- (Figure: small margin vs. large margin)
Support Vector Machines
- Optimization function
- Decision function
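The slide's equations did not survive extraction; the standard textbook (hard-margin) forms are:

```latex
% Optimization function: maximize the margin, i.e. minimize ||w||,
% subject to every point being classified with margin at least 1.
\min_{\mathbf{w},\, b} \; \tfrac{1}{2} \lVert \mathbf{w} \rVert^2
\quad \text{s.t.} \quad
y_i \bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr) \ge 1 \;\; \forall i

% Decision function:
f(\mathbf{x}) = \operatorname{sign}\bigl(\mathbf{w}^\top \mathbf{x} + b\bigr)
```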
Questions?
Now would be a good time to ask questions about supervised techniques.
Clustering
- Identify discrete groups of similar data points
- Data points are unlabeled
Recall: K-Means Algorithm
1. Select K, the desired number of clusters
2. Initialize K cluster centroids
3. For each point in the data set, assign it to the cluster with the closest centroid
4. Update each centroid based on the points assigned to its cluster
5. If any data point has changed clusters, repeat from step 3
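A direct sketch of the steps above (random initialization from the data; `iters` is just a safety cap):

```python
import numpy as np

def kmeans(X, k, iters=100):
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(iters):
        # Assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                       # no point changed clusters
        assign = new_assign
        # Update each centroid from the points assigned to it
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign
```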
Soft k-means
- We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points
- Convergence is based on a stopping threshold rather than changed assignments
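One soft update step as a sketch (the stiffness parameter `beta` is an assumed free parameter; the weights here are a softmax over negative squared distances):

```python
import numpy as np

def soft_kmeans_step(X, centroids, beta=2.0):
    d2 = ((X[:, None] - centroids[None]) ** 2).sum(axis=2)
    r = np.exp(-beta * d2)
    r /= r.sum(axis=1, keepdims=True)           # soft assignments per point
    return (r.T @ X) / r.sum(axis=0)[:, None]   # weighted-mean centroids
```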
Gaussian Mixture Models
- Rather than identifying clusters by the "nearest" centroid, fit a set of k Gaussians to the data:
  p(x) = \pi_1 f_1(x) + \pi_2 f_2(x) + \cdots + \pi_k f_k(x)
Gaussian Mixture Models
- Formally, a mixture model is the weighted sum of a number of PDFs, where the weights are determined by a distribution over the components
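The slide's formal equation was lost in extraction; the standard form is:

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1

% For a Gaussian mixture, each component density is
p_k(x) = \mathcal{N}\bigl(x \mid \mu_k, \Sigma_k\bigr)
```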
Graphical Models with Unobserved Variables
- What if you have variables in a graphical model that are never observed? These are latent variables
- Training latent variable models is an unsupervised learning application
- (Figure: example model over the variables "uncomfortable," "amused," "sweating," and "laughing")
Latent Variable HMMs
- We can cluster sequences using an HMM with unobserved state variables
- We will train the latent variable models using Expectation Maximization
Expectation Maximization
- Both GMMs and Gaussian models with latent variables are trained using Expectation Maximization
- Step 1 (Expectation, E-step): evaluate the "responsibilities" of each cluster with the current parameters
- Step 2 (Maximization, M-step): re-estimate the parameters using the existing "responsibilities"
- Related to k-means
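A compact EM sketch for a one-dimensional, two-component GMM (the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, iters=50):
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([1.0, 1.0])
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        r = pi * gauss(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        n_k = r.sum(axis=0)
        pi = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return pi, mu, var
```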
Questions
One more time for questions, this time on unsupervised learning…
Next Time
- Gaussian Mixture Models (GMMs)
- Expectation Maximization