PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 1: INTRODUCTION

Example: Handwritten Digit Recognition

Polynomial Curve Fitting
y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = Σ_{j=0}^{M} wj x^j

Sum-of-Squares Error Function
E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }^2

0th Order Polynomial

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting
Root-Mean-Square (RMS) Error: E_RMS = sqrt( 2 E(w*) / N )
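
The over-fitting experiment is easy to reproduce. A minimal Python sketch (not the slides' code), assuming Bishop's toy setup of targets sampled from sin(2πx) plus Gaussian noise, with np.polyfit doing the least-squares fit:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10
    x_train = np.linspace(0, 1, N)
    t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, N)
    x_test = np.linspace(0, 1, 100)
    t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 100)

    def rms_error(w, x, t):
        # E_RMS = sqrt(2 E(w*) / N), i.e. the root of the mean squared residual
        return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

    for M in (0, 1, 3, 9):
        w = np.polyfit(x_train, t_train, deg=M)  # least-squares fit of order M
        print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))

For M = 9 the training error drops to essentially zero (the curve passes through all 10 points) while the test error blows up, which is the over-fitting pattern in the figures.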

Polynomial Coefficients (the fitted coefficients w* grow enormous as M increases)

Data Set Size: N = 15, 9th Order Polynomial

Data Set Size: N = 100, 9th Order Polynomial

Regularization
Penalize large coefficient values: E~(w) = (1/2) Σ_n { y(x_n, w) − t_n }^2 + (λ/2) ||w||^2

Regularization:

Regularization: E_RMS vs. ln λ
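
A hedged sketch of the regularized (ridge) fit that the λ plots come from; ridge_fit is a hypothetical helper, with the design matrix built from the polynomial basis 1, x, ..., x^M and lam standing in for λ:

    import numpy as np

    def ridge_fit(x, t, M, lam):
        Phi = np.vander(x, M + 1, increasing=True)   # N x (M+1) polynomial basis
        A = Phi.T @ Phi + lam * np.eye(M + 1)        # (Phi' Phi + lambda I)
        return np.linalg.solve(A, Phi.T @ t)         # w* = A^{-1} Phi' t

    # e.g., with the x_train, t_train from the earlier sketch:
    # w = ridge_fit(x_train, t_train, M=9, lam=np.exp(-18))

Sweeping lam from np.exp(-40) up to 1 and plotting the train/test RMS error reproduces the E_RMS vs. ln λ curve: too little regularization over-fits, too much under-fits.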

Polynomial Coefficients

Methodology
Training: the regression machine updates the weights w and the degree M so as to minimize the error.
Regression: the trained machine maps new inputs x to predictions y(x).

Questions
● How to find the curve that best fits the points, i.e., how to solve the minimization problem?
● Answer 1: use gnuplot or any other pre-built software (homework).
● Answer 2: recall linear algebra (more details in an upcoming class).
● Why solve this particular minimization problem, e.g., rather than doing linear interpolation? A principled answer follows...

Linear Algebra
● To find the curve that best fits the points: apply basic concepts from linear algebra.
● What is linear algebra? The study of vector spaces.
● What are vector spaces? Sets of elements that can be
  ● summed
  ● multiplied by a scalar
  with both operations well defined.

Linear Algebra = Study of Vectors
● Examples of vectors:
  ● [0, 1, 0, 2]
  ● signals
● Signals are vectors!

Dimension of Vector Space
● Basis = set of elements that spans the space
● Dimension of a line = 1; basis = {(1,0)}
● Dimension of a plane = 2; basis = {(1,0), (0,1)}
● Dimension of space = 3; basis = {(1,0,0), (0,1,0), (0,0,1)}
● Dimension of a space = number of elements in its basis

Functions are Vectors I
● Size: for vectors, x = [x1, x2] with |x|^2 = x1^2 + x2^2; for functions, y = f(t) with |y|^2 = ∫ f(t)^2 dt.
● Inner product: for vectors y = [y1, y2], x · y = x1 y1 + x2 y2; for functions z = g(t), <f, g> = ∫ f(t) g(t) dt.

Signals are Vectors II
● Basis: for vectors x = [x1, x2], the basis is {[1,0], [0,1]}; for functions y = f(t), a basis is {1, sin wt, cos wt, sin 2wt, cos 2wt, ...} (the Fourier basis).
● Components: for vectors, the components x1, x2; for functions, the coefficients of the basis functions.

A Very Special Basis
● Same correspondence, but now take a basis in which each element is itself a vector, formed by a set of M samples of a function (a subspace of R^M).

Approximating a Vector I
● e = g − c x
● g' x = |g| |x| cos θ
● |x|^2 = x' x
● |g| cos θ = c |x|
(figure: projections of g onto x, with errors e1, e2 for candidate coefficients c1, c2)

Approximating a Vector II
● e = g − c x;  g' x = |g| |x| cos θ;  |x|^2 = x' x;  |g| cos θ = c |x|
● Major result: c = g' x (x' x)^(−1)

Approximating a Vector III
● c = g' x / |x|^2
● Major result: c = (x' x)^(−1) x' g — the coefficient to compute (in basis-matrix coordinates), given the target point g and the basis matrix x.

Approximating a Vector IV
● Matrix case: e = g − x c, where c is the vector of computed coefficients, g is the vector of target points, and x is the basis matrix: each column is a basis element, i.e., a polynomial evaluated at the desired points. The columns of x span a vector space U contained in the space V where g lives.
● Requiring the error e to be orthogonal to the basis gives x' g = (x' x) c.
● Major result: c = (x' x)^(−1) x' g

Least Square Error Solution!
● As with most machine learning problems, least squares amounts to a basis conversion: the given points g (an element of R^N) are mapped to the coefficients c to compute (a point in R^M).
● Major result: c = (x' x)^(−1) x' g, with g the vector of target points and x the basis matrix whose columns are polynomials evaluated at the desired points.

Least Square Error Solution!
● The machine learning algorithm maps the space of points in R^N to the space of functions in R^M: each target vector is mapped to the closest element of the subspace spanned by the basis.
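
The "major result" in code — a minimal Python sketch of least squares as projection onto the column space of the basis matrix, using the slides' variable names (g = targets, x = basis matrix, c = coefficients):

    import numpy as np

    pts = np.linspace(0, 1, 20)                # sample points
    g = np.sin(2 * np.pi * pts)                # target vector (element of R^N)
    x = np.vander(pts, 4, increasing=True)     # basis matrix: columns are polynomials
    c = np.linalg.solve(x.T @ x, x.T @ g)      # normal equations: c = (x'x)^(-1) x' g
    approx = x @ c                             # closest element of the subspace
    err = g - approx                           # error e = g - x c
    print(np.allclose(x.T @ err, 0))           # True: e is orthogonal to every basis column

The orthogonality check is the whole geometric story: the best least-squares approximation leaves a residual perpendicular to the subspace spanned by the basis.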

Probability Theory
Apples and Oranges: two boxes (one red, one blue) contain apples and oranges; first pick a box, then pick a fruit from it.

Probability Theory
Joint probability: p(X = x_i, Y = y_j) = n_ij / N
Marginal probability: p(X = x_i) = c_i / N
Conditional probability: p(Y = y_j | X = x_i) = n_ij / c_i
(n_ij counts outcomes with X = x_i and Y = y_j, c_i = Σ_j n_ij, and N is the total number of trials)

Probability Theory Sum Rule Product Rule

The Rules of Probability
Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(Y | X) p(X)

Bayes’ Theorem
p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y)
posterior ∝ likelihood × prior
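
Bayes' theorem on the apples-and-oranges example, as a Python sketch. The numbers follow PRML Section 1.2 (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange; p(red) = 0.4, p(blue) = 0.6); treat them as an assumed setup:

    p_box = {"red": 0.4, "blue": 0.6}                  # prior p(B)
    p_orange_given = {"red": 6 / 8, "blue": 1 / 4}     # likelihood p(F=orange | B)

    # sum rule + product rule give the marginal p(F=orange)
    p_orange = sum(p_orange_given[b] * p_box[b] for b in p_box)
    # Bayes: posterior proportional to likelihood times prior
    posterior = {b: p_orange_given[b] * p_box[b] / p_orange for b in p_box}
    print(posterior)   # {'red': 0.666..., 'blue': 0.333...}

Observing an orange raises the probability that the red box was chosen from 0.4 to 2/3, exactly the posterior-vs-prior update the theorem describes.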

Probability Densities
p(x ∈ (a, b)) = ∫_a^b p(x) dx, with p(x) ≥ 0 and ∫ p(x) dx = 1

Transformed Densities
Under a change of variables x = g(y), densities transform as p_y(y) = p_x(g(y)) |g'(y)|

Expectations
E[f] = Σ_x p(x) f(x) (discrete) or ∫ p(x) f(x) dx (continuous)
Conditional Expectation (discrete): E_x[f | y] = Σ_x p(x | y) f(x)
Approximate Expectation (discrete and continuous): E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)

Variances and Covariances
var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2
cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[x y] − E[x] E[y]

The Gaussian Distribution
N(x | μ, σ^2) = (1 / sqrt(2π σ^2)) exp( −(x − μ)^2 / (2σ^2) )

Gaussian Mean and Variance
E[x] = μ,  E[x^2] = μ^2 + σ^2,  var[x] = σ^2

The Multivariate Gaussian
N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(1/2) (x − μ)' Σ^(−1) (x − μ) )
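
A minimal sketch evaluating the density above directly from its formula (gauss_density is a hypothetical helper, not a library call):

    import numpy as np

    def gauss_density(x, mu, Sigma):
        D = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)' Sigma^{-1} (x-mu)
        norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
        return norm * np.exp(-0.5 * quad)

    # standard 2-D Gaussian at the origin: 1/(2*pi) ≈ 0.159
    print(gauss_density(np.zeros(2), np.zeros(2), np.eye(2)))

Using np.linalg.solve instead of explicitly inverting Σ is the usual numerically safer choice for the quadratic form.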

Gaussian Parameter Estimation
Likelihood function: p(x | μ, σ^2) = Π_{n=1}^{N} N(x_n | μ, σ^2) for i.i.d. observations x = (x_1, ..., x_N)

Maximum (Log) Likelihood
ln p(x | μ, σ^2) = −(1 / 2σ^2) Σ_n (x_n − μ)^2 − (N/2) ln σ^2 − (N/2) ln 2π
μ_ML = (1/N) Σ_n x_n,  σ^2_ML = (1/N) Σ_n (x_n − μ_ML)^2
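
The two ML estimators in code — a short sketch on synthetic data (the true μ = 1, σ = 2 are assumptions for the demo):

    import numpy as np

    x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)
    mu_ml = x.mean()                          # mu_ML = (1/N) sum x_n
    sigma2_ml = ((x - mu_ml) ** 2).mean()     # sigma2_ML: biased, E[.] = (N-1)/N * sigma^2
    print(mu_ml, sigma2_ml)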

Properties of μ_ML and σ^2_ML
E[μ_ML] = μ (unbiased), but E[σ^2_ML] = ((N − 1)/N) σ^2: the ML variance estimate is biased low. The unbiased estimate is σ~^2 = (N/(N − 1)) σ^2_ML.

Curve Fitting Re-visited
Model the targets as Gaussian around the polynomial: p(t | x, w, β) = N(t | y(x, w), β^(−1))

Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error E(w); the precision then follows as 1/β_ML = (1/N) Σ_n { y(x_n, w_ML) − t_n }^2.

Predictive Distribution
p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML^(−1))

MAP: A Step towards Bayes
With a Gaussian prior p(w | α) = N(w | 0, α^(−1) I), determine w_MAP by minimizing the regularized sum-of-squares error (β/2) Σ_n { y(x_n, w) − t_n }^2 + (α/2) w' w, i.e., ridge regression with λ = α/β.

Bayesian Curve Fitting
Integrate over w rather than picking a point estimate: p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw

Bayesian Predictive Distribution
p(t | x, X, t) = N(t | m(x), s^2(x)), where both the mean m(x) and the variance s^2(x) depend on the query point x
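
A hedged sketch of the closed-form predictive distribution for the polynomial model (PRML eqs. 1.70–1.72): S^(−1) = α I + β Σ_n φ(x_n) φ(x_n)', m(x) = β φ(x)' S Σ_n φ(x_n) t_n, s^2(x) = 1/β + φ(x)' S φ(x). The hyperparameter values α = 5e−3, β = 11.1 follow the book's example and are assumed fixed:

    import numpy as np

    def bayes_predict(x_new, x_train, t_train, M=9, alpha=5e-3, beta=11.1):
        phi = lambda x: np.vander(np.atleast_1d(x), M + 1, increasing=True)
        Phi = phi(x_train)                                  # N x (M+1) design matrix
        S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
        S = np.linalg.inv(S_inv)
        p = phi(x_new)                                      # 1 x (M+1)
        mean = beta * p @ S @ Phi.T @ t_train               # m(x)
        var = 1 / beta + p @ S @ p.T                        # s^2(x)
        return mean.item(), var.item()

Unlike the ML predictive distribution, s^2(x) here includes the posterior uncertainty in w, so the predictive error bars widen away from the training points.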

Model Selection
Cross-Validation: hold out part of the data to compare models. In S-fold cross-validation, train on S − 1 folds and validate on the held-out fold, rotating through all S folds.
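
A sketch of choosing the polynomial order M by S-fold cross-validation (cv_error is a hypothetical helper; the data names refer to the earlier toy setup):

    import numpy as np

    def cv_error(x, t, M, S=4):
        idx = np.array_split(np.random.default_rng(0).permutation(len(x)), S)
        errs = []
        for k in range(S):
            val = idx[k]                                            # held-out fold
            trn = np.concatenate([idx[j] for j in range(S) if j != k])
            w = np.polyfit(x[trn], t[trn], deg=M)
            errs.append(np.sqrt(np.mean((np.polyval(w, x[val]) - t[val]) ** 2)))
        return np.mean(errs)                                        # average validation RMS

    # e.g., best_M = min(range(6), key=lambda M: cv_error(x_train, t_train, M))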

Curse of Dimensionality

Figures: polynomial curve fitting with M = 3; Gaussian densities in higher dimensions.

Decision Theory
Inference step: determine either p(x, t) or p(t | x).
Decision step: for given x, determine the optimal t.

Minimum Misclassification Rate
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1); it is minimized by assigning each x to the class with the larger posterior p(C_k | x).

Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’. Loss matrix L_kj (rows: truth, columns: decision), e.g.
                 decide cancer   decide normal
true cancer            0              1000
true normal            1                 0

Minimum Expected Loss
E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
The regions R_j are chosen to minimize E[L], i.e., each x is assigned to the class j minimizing Σ_k L_kj p(C_k | x).
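
The decision rule in code — a sketch using the cancer/normal loss matrix above (the 1% posterior is an assumed example value):

    import numpy as np

    L = np.array([[0, 1000],       # true cancer: decide cancer / decide normal
                  [1,    0]])      # true normal: decide cancer / decide normal
    posterior = np.array([0.01, 0.99])   # p(cancer|x), p(normal|x) for some x

    expected_loss = posterior @ L        # expected loss of each decision
    decision = np.argmin(expected_loss)
    print(expected_loss, decision)       # [0.99, 10.0], decision 0 = 'cancer'

With these numbers the huge cost of missing a cancer pushes the optimal decision to 'cancer' even when its posterior is only 1%.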

Reject Option
Avoid making decisions on difficult cases: reject x when the largest posterior p(C_k | x) falls below a threshold θ.

Why Separate Inference and Decision?
● Minimizing risk (the loss matrix may change over time)
● Reject option
● Unbalanced class priors
● Combining models

Decision Theory for Regression
Inference step: determine p(t | x).
Decision step: for given x, make an optimal prediction, y(x), for t.
Loss function: E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

The Squared Loss Function
With L(t, y(x)) = { y(x) − t }^2, the expected loss is minimized by the conditional mean, y(x) = E[t | x].

Generative vs Discriminative
Generative approach: model p(x | C_k) and p(C_k) (or the joint p(x, C_k)), then use Bayes’ theorem to obtain p(C_k | x).
Discriminative approach: model p(C_k | x) directly.

Entropy
Important quantity in
● coding theory
● statistical physics
● machine learning

Entropy
Coding theory: x discrete with 8 possible states; how many bits are needed to transmit the state of x?
All states equally likely: H[x] = −8 × (1/8) log2(1/8) = 3 bits.
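
A sketch computing H[x] = −Σ p(x) log2 p(x) as an average code length in bits (the non-uniform distribution is PRML's 8-state example, assumed here):

    import numpy as np

    def entropy_bits(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                       # convention: 0 log 0 = 0
        return -(p * np.log2(p)).sum()

    print(entropy_bits([1/8] * 8))         # 3.0 bits: uniform over 8 states
    print(entropy_bits([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]))  # 2.0 bits

The skewed distribution needs only 2 bits on average: a variable-length code can spend short codewords on the likely states.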

Entropy

In how many ways can N identical objects be allocated to M bins? W = N! / Π_i n_i!, and H = (1/N) ln W → −Σ_i p_i ln p_i as N → ∞.
Entropy is maximized when all p_i = 1/M (the uniform distribution).

Entropy

Differential Entropy
Put bins of width Δ along the real line; in the limit Δ → 0, H[x] = −∫ p(x) ln p(x) dx.
Differential entropy is maximized (for fixed variance σ^2) when p(x) is Gaussian, in which case H[x] = (1/2)(1 + ln(2π σ^2)).

Conditional Entropy
H[y | x] = −∫∫ p(y, x) ln p(y | x) dy dx, and H[x, y] = H[y | x] + H[x].

The Kullback-Leibler Divergence
KL(p ‖ q) = −∫ p(x) ln { q(x) / p(x) } dx ≥ 0, with equality if and only if p(x) = q(x).

Mutual Information
I[x, y] = KL( p(x, y) ‖ p(x) p(y) ) = H[x] − H[x | y] = H[y] − H[y | x] ≥ 0, with equality iff x and y are independent.
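
Closing the chapter with a sketch tying the last two slides together: KL divergence for discrete distributions, applied to a joint versus the product of its marginals, which is exactly the mutual information (the joint table is an assumed example):

    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                        # convention: 0 ln 0 = 0
        return (p[mask] * np.log(p[mask] / q[mask])).sum()

    pxy = np.array([[0.4, 0.1],             # a joint distribution p(x, y)
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    indep = np.outer(px, py)                # p(x) p(y)
    print(kl(pxy.ravel(), indep.ravel()))   # mutual information in nats, > 0

The result (about 0.19 nats here) is strictly positive because x and y are dependent; it would be exactly 0 if pxy factorized as p(x) p(y).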