Supervised Learning Recap


Supervised Learning Recap Machine Learning

Last Time Support Vector Machines Kernel Methods

Today Review of Supervised Learning Unsupervised Learning (Soft) K-means clustering Expectation Maximization Spectral Clustering Principal Components Analysis Latent Semantic Analysis

Supervised Learning Linear Regression Logistic Regression Graphical Models Hidden Markov Models Neural Networks Support Vector Machines Kernel Methods

Major concepts Gaussian, Multinomial, Bernoulli Distributions Joint vs. Conditional Distributions Marginalization Maximum Likelihood Risk Minimization Gradient Descent Feature Extraction, Kernel Methods

Some favorite distributions Bernoulli Multinomial Gaussian
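The slide's density formulas do not survive in this transcript; as a reference reconstruction (parameter names are the usual textbook ones, not necessarily the slide's), the standard forms are:

    \text{Bernoulli: } \; p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \quad x \in \{0, 1\}
    \text{Multinomial: } \; p(m_1, \ldots, m_K \mid \mu, N) = \frac{N!}{m_1! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}
    \text{Gaussian: } \; p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)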

Maximum Likelihood Identify the parameter values that yield the maximum likelihood of generating the observed data. Take the partial derivative of the likelihood function, set it to zero, and solve. NB: the maximum likelihood parameters are the same as the maximum log likelihood parameters.
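A worked one-parameter example (my addition, following the recipe above) for a Bernoulli sample x_1, ..., x_N with parameter \mu:

    \log \mathcal{L}(\mu) = \sum_{n=1}^{N} \left[ x_n \log \mu + (1 - x_n) \log(1 - \mu) \right]
    \frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{\sum_n x_n}{\mu} - \frac{N - \sum_n x_n}{1 - \mu} = 0
    \quad \Longrightarrow \quad \hat{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n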

Maximum Log Likelihood Why do we like the log function? It turns products (difficult to differentiate) into sums (easy to differentiate): log(xy) = log(x) + log(y), log(x^c) = c log(x)

Risk Minimization Pick a loss function: squared loss, linear loss, perceptron (classification) loss. Identify the parameters that minimize the loss function: take the partial derivative of the loss function, set it to zero, and solve.
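A minimal sketch of this recipe for the squared loss with a linear model, solved iteratively rather than in closed form; the learning rate and iteration count are illustrative choices, and X, y are hypothetical NumPy arrays:

    import numpy as np

    def gradient_descent_squared_loss(X, y, lr=0.01, n_iters=1000):
        """Minimize the mean squared loss L(w) = ||Xw - y||^2 / n by gradient descent."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for _ in range(n_iters):
            residual = X @ w - y                       # prediction errors
            grad = 2.0 * (X.T @ residual) / n_samples  # partial derivatives of the loss
            w -= lr * grad                             # step downhill
        return w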

Frequentists v. Bayesians Point estimates vs. posteriors. Risk minimization vs. maximum likelihood. L2-regularization: frequentists add a constraint on the size of the weight vector; Bayesians introduce a zero-mean prior on the weight vector. The result is the same!

L2-Regularization Frequentists: introduce a cost on the size of the weights. Bayesians: introduce a prior on the weights.
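A short reconstruction (not the slide's own notation) of why the two views give the same answer, assuming Gaussian observation noise with variance \sigma^2 and a zero-mean Gaussian prior with variance \tau^2:

    \text{Frequentist: } \; \hat{w} = \arg\min_w \; \sum_n (y_n - w^\top x_n)^2 + \lambda \|w\|_2^2
    \text{Bayesian (MAP): } \; \hat{w} = \arg\min_w \; \frac{1}{2\sigma^2} \sum_n (y_n - w^\top x_n)^2 + \frac{1}{2\tau^2} \|w\|_2^2
    \text{The two objectives share the same minimizer when } \lambda = \sigma^2 / \tau^2.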

Types of Classifiers Generative Models: highest resource requirements; need to approximate the joint probability. Discriminative Models: moderate resource requirements; typically fewer parameters to approximate than generative models. Discriminant Functions: can be trained probabilistically, but the output does not include confidence information.

Linear Regression Fit a line to a set of points

Linear Regression Extension to higher dimensions Polynomial fitting Arbitrary function fitting Wavelets Radial basis functions Classifier output
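A minimal sketch of one of these extensions, polynomial fitting via a basis expansion followed by ordinary least squares; the sample data below is hypothetical:

    import numpy as np

    def fit_polynomial(x, y, degree=3):
        """Least-squares fit of a polynomial by expanding x into the basis [1, x, ..., x^degree]."""
        Phi = np.vander(x, N=degree + 1, increasing=True)    # design matrix of basis functions
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # solve the least-squares problem
        return w

    # Hypothetical usage: noisy samples from a cubic
    x = np.linspace(-1.0, 1.0, 50)
    y = 2 * x**3 - x + 0.1 * np.random.randn(50)
    w = fit_polynomial(x, y, degree=3)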

Logistic Regression Fit Gaussians to the data for each class; the decision boundary is where the PDFs cross. Setting the gradient to zero has no closed-form solution, so we use gradient descent.
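A minimal gradient-descent sketch for binary logistic regression (labels assumed to be in {0, 1}); the learning rate and iteration count are illustrative, and X, y are hypothetical NumPy arrays:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, lr=0.1, n_iters=2000):
        """Train w by descending the gradient of the negative log likelihood."""
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for _ in range(n_iters):
            p = sigmoid(X @ w)                  # predicted P(y = 1 | x)
            grad = X.T @ (p - y) / n_samples    # gradient of the negative log likelihood
            w -= lr * grad
        return w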

Graphical Models A general way to describe the dependence relationships between variables. The Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.

Junction Tree Algorithm Moralization: "marry the parents" and make the graph undirected. Triangulation: add chords so that no chordless cycle of length greater than three remains. Junction Tree Construction: identify separators such that the running intersection property holds. Introduction of Evidence: pass messages around the junction tree to generate marginals.

Hidden Markov Models Sequential modeling. A generative model of the relationship between observation sequences and state (class) sequences.
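The slides contain no code; as a hedged illustration of this generative view, here is the standard forward algorithm for scoring an observation sequence under a discrete HMM (state and symbol indices are assumed to be integers):

    import numpy as np

    def forward(pi, A, B, obs):
        """P(observation sequence) for a discrete HMM, summing over all state paths.

        pi:  (K,)   initial state distribution
        A:   (K, K) transition matrix, A[i, j] = P(s_t = j | s_{t-1} = i)
        B:   (K, V) emission matrix,   B[j, v] = P(x_t = v | s_t = j)
        obs: sequence of observation symbol indices
        """
        alpha = pi * B[:, obs[0]]            # initialize with the first observation
        for x in obs[1:]:
            alpha = (alpha @ A) * B[:, x]    # propagate through transitions, then emit
        return alpha.sum()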

Perceptron A step function is used for squashing. The classifier follows the neuron metaphor.
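A minimal sketch of the perceptron learning rule, assuming labels in {-1, +1} and hypothetical NumPy arrays X, y:

    import numpy as np

    def train_perceptron(X, y, lr=1.0, n_epochs=10):
        """Classic perceptron: update the weights only on misclassified points."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i + b) <= 0:   # a mistake: point is on the wrong side
                    w += lr * y_i * x_i
                    b += lr * y_i
        return w, b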

Perceptron Loss Classification error vs. sigmoid error. Loss is only calculated on mistakes; perceptrons use strictly classification error.

Neural Networks Interconnected Layers of Perceptrons or Logistic Regression “neurons”

Neural Networks There are many possible configurations of neural networks: vary the number of layers and the size of each layer.

Support Vector Machines Maximum margin classification. (Figures: a small-margin boundary contrasted with a large-margin boundary.)

Support Vector Machines Optimization Function Decision Function
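The slide's formulas are not preserved in this transcript; the standard maximum-margin formulation they most likely correspond to is:

    \text{Optimization (hard margin): } \; \min_{w, b} \; \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \;\; \forall i
    \text{Decision function (kernelized dual): } \; f(x) = \operatorname{sign}\!\left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right)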

Visualization of Support Vectors

Questions? Now would be a good time to ask questions about Supervised Techniques.

Clustering Identify discrete groups of similar data points Data points are unlabeled

Recall K-Means Algorithm Select K – the desired number of clusters Initialize K cluster centroids For each point in the data set, assign it to the cluster with the closest centroid Update the centroid based on the points assigned to each cluster If any data point has changed clusters, repeat
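A minimal NumPy sketch of the steps above (it assumes every cluster keeps at least one assigned point):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Hard-assignment k-means following the steps above."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # initialize K centroids
        for _ in range(n_iters):
            # Assign each point to the cluster with the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update each centroid from the points assigned to it
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):               # nothing changed: stop
                break
            centroids = new_centroids
        return centroids, labels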

k-means output

Soft K-means In k-means, we force every data point to exist in exactly one cluster. This constraint can be relaxed. Minimizes the entropy of cluster assignment

Soft k-means example

Soft k-means We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points. Convergence is based on a stopping threshold rather than changed assignments.
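One common formulation (my reconstruction; the stiffness parameter \beta is an assumption, not the slide's notation) of the soft assignments and the weighted-mean update:

    r_{nk} = \frac{\exp(-\beta \, \|x_n - \mu_k\|^2)}{\sum_j \exp(-\beta \, \|x_n - \mu_j\|^2)}
    \qquad
    \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}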

Gaussian Mixture Models Rather than identifying clusters by "nearest" centroids, fit a set of k Gaussians to the data: p(x) = \pi_1 f_1(x) + \pi_2 f_2(x) + \ldots + \pi_k f_k(x)

GMM example

Gaussian Mixture Models Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution \pi with \pi_i \ge 0 and \sum_i \pi_i = 1.

Graphical Models with unobserved variables What if you have variables in a graphical model that are never observed? These are latent variables. Training latent variable models is an unsupervised learning application. (Figure: an example model over the variables "uncomfortable", "amused", "sweating", and "laughing".)

Latent Variable HMMs We can cluster sequences using an HMM with unobserved state variables. We will train these latent variable models using Expectation Maximization.

Expectation Maximization Both the training of GMMs and the training of Gaussian models with latent variables are accomplished using Expectation Maximization. Step 1, Expectation (E-step): evaluate the "responsibilities" of each cluster with the current parameters. Step 2, Maximization (M-step): re-estimate the parameters using the existing "responsibilities". Related to k-means.
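A minimal sketch of EM for a one-dimensional Gaussian mixture (the initialization and iteration count are illustrative choices, and x is a hypothetical 1-D NumPy array):

    import numpy as np
    from scipy.stats import norm

    def em_gmm_1d(x, k, n_iters=100, seed=0):
        """Alternate E- and M-steps for a 1-D Gaussian mixture model."""
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=k, replace=False)   # initial means
        sigma = np.full(k, x.std())                 # initial standard deviations
        pi = np.full(k, 1.0 / k)                    # initial mixing weights
        for _ in range(n_iters):
            # E-step: responsibilities r[n, j] = P(component j | x_n) under current parameters
            dens = np.stack([pi[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(k)], axis=1)
            r = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters using the responsibilities
            Nk = r.sum(axis=0)
            mu = (r * x[:, None]).sum(axis=0) / Nk
            sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
            pi = Nk / len(x)
        return pi, mu, sigma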

Questions One more time for questions on supervised learning…

Next Time Gaussian Mixture Models (GMMs) Expectation Maximization