
1 PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 1: INTRODUCTION

2 Example: Handwritten Digit Recognition

3 Polynomial Curve Fitting

4 Sum-of-Squares Error Function
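
The error function on this slide appears as an image in the original; the standard form from PRML, with y(x, w) the degree-M polynomial and t_n the target values, is:

    E(w) = (1/2) Σ_{n=1..N} { y(x_n, w) - t_n }^2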

5 0th Order Polynomial

6 1st Order Polynomial

7 3rd Order Polynomial

8 9th Order Polynomial

9 Over-fitting. Root-Mean-Square (RMS) Error:
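
The RMS error shown on this slide (as an image in the original) is, in the PRML notation:

    E_RMS = sqrt( 2 E(w*) / N )

Dividing by N lets data sets of different sizes be compared on an equal footing, and the square root puts the error on the same scale as the targets t.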

10 Polynomial Coefficients

11 Data Set Size: 9th Order Polynomial

12 Data Set Size: 9th Order Polynomial

13 Regularization: penalize large coefficient values
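
The regularized error function (shown as an image in the original) takes the standard PRML form, where λ controls the strength of the penalty:

    E~(w) = (1/2) Σ_{n=1..N} { y(x_n, w) - t_n }^2 + (λ/2) ||w||^2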

14 Regularization:

15

16 Regularization: E_RMS vs. ln λ

17 Polynomial Coefficients

18 Methodology. A regression machine maps an input (e.g., 1) to an output (e.g., 0.5). Training updates the weights w and the degree M so as to minimize the error.

19 Questions
● How do we find the curve that best fits the points, i.e., how do we solve the minimization problem?
● Answer 1: use gnuplot or any other pre-built software (homework; see the sketch below).
● Answer 2: recall linear algebra (we will see this in more detail in an upcoming class).
● Why solve this particular minimization problem, e.g., rather than doing linear interpolation? A principled answer follows...
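
As a concrete illustration of Answer 1, here is a minimal sketch using NumPy's built-in routine in place of gnuplot; the sample data and the degree M = 3 are arbitrary choices for this example, not part of the original slides:

    import numpy as np

    # Made-up training data: noisy samples of sin(2*pi*x).
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # Fit a degree-M polynomial by least squares using a pre-built routine.
    M = 3
    w = np.polyfit(x, t, deg=M)     # coefficients, highest degree first
    y = np.polyval(w, x)            # fitted values at the sample points

    print("coefficients:", w)
    print("RMS error:", np.sqrt(np.mean((y - t) ** 2)))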

20 Linear Algebra
● To find the curve that best fits the points: apply basic concepts from linear algebra.
● What is linear algebra? The study of vector spaces.
● What are vector spaces? Sets of elements that can be
  ● summed
  ● multiplied by a scalar
  with both operations well defined.

21 Linear Algebra = Study of Vectors
● Examples of vectors:
  ● [0, 1, 0, 2]
  ● signals
● Signals are vectors!

22 Dimension of Vector Space
● Basis = a set of elements that spans the space.
● Dimension of a line = 1; basis = {(1,0)}
● Dimension of a plane = 2; basis = {(1,0), (0,1)}
● Dimension of space = 3; basis = {(1,0,0), (0,1,0), (0,0,1)}
● Dimension of the space = number of elements in a basis.

23 Functions are Vectors I
● Size:
  ● vectors: x = [x1, x2], |x|^2 = x1^2 + x2^2
  ● functions: y = f(t), |y|^2 = ∫ f(t)^2 dt
● Inner product:
  ● vectors: y = [y1, y2], x · y = x1 y1 + x2 y2
  ● functions: z = g(t), <y, z> = ∫ f(t) g(t) dt

24 Signals are Vectors II
● Vectors: x = [x1, x2]; functions: y = f(t)
● Basis:
  ● vectors: {[1,0], [0,1]}
  ● functions: {1, sin wt, cos wt, sin 2wt, cos 2wt, ...}
● Components:
  ● vectors: x1, x2
  ● functions: the coefficients of the basis functions
● Demo: http://www.falstad.com/fourier/

25 A Very Special Basis
● As before: vectors x = [x1, x2] vs. functions y = f(t), each with a basis and components (demo: http://www.falstad.com/fourier/).
● Now consider a basis in which each element is itself a vector, formed by a set of M samples of a function (a subspace of R^M).

26 Approximating a Vector I
● Error: e = g - c x
● g' x = |g| |x| cos θ
● |x|^2 = x' x
● We want c such that |g| cos θ ≈ c |x|
(Figure: the target g, the approximation c x along x, and the error e, for different choices of c such as c1 and c2.)

27 Approximating a Vector II
● e = g - c x
● g' x = |g| |x| cos θ
● |x|^2 = x' x
● |g| cos θ = c |x|
● Major result: c = g' x (x' x)^(-1)
(Figure: g, its projection c x onto x, and the error e.)

28 Approximating a Vector III
● e = g - c x
● g' x = |g| |x| cos θ
● |x|^2 = x' x
● |g| cos θ = c |x|
● c = g' x / |x|^2
● Major result: c = g' x ((x' x)^(-1))'
  ● c: the coefficient to compute
  ● g: the given target point (in basis-matrix coordinates)
  ● x: the given basis matrix
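
A tiny numeric check of the formula c = g' x / |x|^2; the vectors g and x below are made-up example values:

    import numpy as np

    g = np.array([2.0, 1.0])   # given target vector (example values)
    x = np.array([1.0, 0.0])   # basis vector we project onto

    c = (g @ x) / (x @ x)      # c = g'x / |x|^2
    e = g - c * x              # approximation error

    print(c)        # 2.0
    print(e @ x)    # 0.0 -- the error is orthogonal to x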

29 Approximating a Vector IV
● e = g - x c
● g' x = |g| |x| cos θ
● |x|^2 = x' x
● g' x ≈ c' x' x
● Major result: c = g' x ((x' x)^(-1))'
  ● c = vector of computed coefficients
  ● g = vector of given target points (in a vector space V)
  ● x = basis matrix (its elements lie in a subspace U of V); each column is a basis element, i.e., a polynomial evaluated at the desired points

30 Least Square Error Solution!
● As with most machine learning problems, least squares amounts to a basis conversion (a projection onto the basis)!
● Major result: c = g' x ((x' x)^(-1))'
  ● c = vector of computed coefficients (a point in R^M)
  ● g = vector of given target points (an element of R^N)
  ● x = basis matrix: each column is a basis element, i.e., a polynomial evaluated at the desired points
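
A minimal sketch of this result applied to polynomial curve fitting, with made-up data; the basis matrix has one column per power of x evaluated at the sample points, and the coefficients come from the normal equations (the column-vector form of the slide's result):

    import numpy as np

    # Made-up training data: noisy samples of sin(2*pi*x).
    rng = np.random.default_rng(1)
    xs = np.linspace(0, 1, 10)
    g = np.sin(2 * np.pi * xs) + rng.normal(scale=0.2, size=xs.shape)

    # Basis matrix: column j holds xs**j at every sample point.
    M = 3
    X = np.vander(xs, N=M + 1, increasing=True)

    # Normal equations: solve (X'X) c = X'g, i.e. c = (X'X)^(-1) X' g.
    c = np.linalg.solve(X.T @ X, X.T @ g)

    print("coefficients:", c)
    print("fitted values:", X @ c)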

31 Least Square Error Solution! (Figure: the space of given points in R^N and the space of functions in R^M; the machine learning algorithm maps each set of points to the closest representable element.)

32 Probability Theory: Apples and Oranges

33 Probability Theory: Marginal Probability, Conditional Probability, Joint Probability

34 Probability Theory: Sum Rule, Product Rule

35 The Rules of Probability: Sum Rule, Product Rule
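
The two rules appear as equations (images) on the original slides; in the standard PRML notation they read:

    Sum rule:      p(X) = Σ_Y p(X, Y)
    Product rule:  p(X, Y) = p(Y | X) p(X)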

36 Bayes’ Theorem: posterior ∝ likelihood × prior
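
Written out (the slide shows it as an image), Bayes’ theorem is:

    p(Y | X) = p(X | Y) p(Y) / p(X),   with   p(X) = Σ_Y p(X | Y) p(Y)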

37 Probability Densities

38 Transformed Densities

39 Expectations: Conditional Expectation (discrete); Approximate Expectation (discrete and continuous)
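
The expectations on this slide (images in the original) take the standard PRML forms:

    E[f] = Σ_x p(x) f(x)                 (discrete)
    E[f] = ∫ p(x) f(x) dx                (continuous)
    E_x[f | y] = Σ_x p(x | y) f(x)       (conditional expectation, discrete)
    E[f] ≈ (1/N) Σ_{n=1..N} f(x_n)       (approximate expectation from samples)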

40 Variances and Covariances
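
For reference, the standard definitions shown on this slide are:

    var[f] = E[ (f(x) - E[f(x)])^2 ] = E[f(x)^2] - E[f(x)]^2
    cov[x, y] = E_{x,y}[ (x - E[x]) (y - E[y]) ] = E_{x,y}[x y] - E[x] E[y]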

41 The Gaussian Distribution
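
The univariate Gaussian density shown on this slide is:

    N(x | μ, σ^2) = (1 / sqrt(2 π σ^2)) exp( - (x - μ)^2 / (2 σ^2) )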

42 Gaussian Mean and Variance

43 The Multivariate Gaussian

44 Gaussian Parameter Estimation: Likelihood function

45 Maximum (Log) Likelihood
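
The log-likelihood and its maximizers (shown as images on the original slides) are, for N i.i.d. observations x_1, ..., x_N:

    ln p(x | μ, σ^2) = - (1 / (2 σ^2)) Σ_n (x_n - μ)^2 - (N/2) ln σ^2 - (N/2) ln(2π)
    μ_ML = (1/N) Σ_n x_n
    σ^2_ML = (1/N) Σ_n (x_n - μ_ML)^2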

46 Properties of μ_ML and σ^2_ML

47 Curve Fitting Re-visited

48 Maximum Likelihood: determine w_ML by minimizing the sum-of-squares error E(w).

49 Predictive Distribution

50 MAP: A Step towards Bayes. Determine w_MAP by minimizing the regularized sum-of-squares error E~(w).

51 Bayesian Curve Fitting

52 Bayesian Predictive Distribution

53 Model Selection: Cross-Validation
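
A minimal sketch of S-fold cross-validation for choosing the polynomial degree M; the data, the candidate degrees, and the use of numpy.polyfit are illustrative assumptions, not part of the original slides:

    import numpy as np

    rng = np.random.default_rng(2)

    def cv_error(x, t, degree, n_folds=4):
        # Average validation RMS error of a degree-`degree` polynomial over n_folds folds.
        folds = np.array_split(rng.permutation(len(x)), n_folds)
        errors = []
        for val_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(x)), val_idx)
            w = np.polyfit(x[train_idx], t[train_idx], deg=degree)
            pred = np.polyval(w, x[val_idx])
            errors.append(np.sqrt(np.mean((pred - t[val_idx]) ** 2)))
        return float(np.mean(errors))

    # Illustrative data: noisy samples of sin(2*pi*x).
    x = np.linspace(0, 1, 20)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

    # Pick the degree with the lowest cross-validated error.
    best_M = min(range(1, 8), key=lambda M: cv_error(x, t, M))
    print("selected degree:", best_M)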

54 Curse of Dimensionality

55 Polynomial curve fitting, M = 3; Gaussian densities in higher dimensions

56 Decision Theory. Inference step: determine either the joint density p(x, t) or the posterior p(t | x). Decision step: for a given x, determine the optimal t.

57 Minimum Misclassification Rate

58 Minimum Expected Loss. Example: classify medical images as ‘cancer’ or ‘normal’. (Table: the loss matrix, indexed by decision and truth.)

59 Minimum Expected Loss: the decision regions R_j are chosen to minimize the expected loss E[L].
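
The expected loss being minimized (an image on the original slide) is, with L_kj the loss incurred by deciding class j when the truth is class k:

    E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx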

60 Reject Option

61 Why Separate Inference and Decision?
● Minimizing risk (the loss matrix may change over time)
● Reject option
● Unbalanced class priors
● Combining models

62 Decision Theory for Regression. Inference step: determine p(x, t). Decision step: for a given x, make the optimal prediction y(x) for t. Loss function: L(t, y(x)).

63 The Squared Loss Function
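
For the squared loss, the expected loss and its minimizer (the slide’s equations, images in the original) are:

    E[L] = ∫∫ { y(x) - t }^2 p(x, t) dx dt
    minimized by  y(x) = E_t[ t | x ]   (the conditional mean of t given x)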

64 Generative vs Discriminative. Generative approach: model the joint distribution p(x, C_k) (or the class-conditional densities and priors) and use Bayes’ theorem to obtain the posterior. Discriminative approach: model the posterior p(C_k | x) directly.

65 Entropy: an important quantity in coding theory, statistical physics, and machine learning.

66 Entropy. Coding theory: x is discrete with 8 possible states; how many bits are needed to transmit the state of x? All states are equally likely.
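
Working the example: with 8 equally likely states, each has probability 1/8, so

    H[x] = - Σ_x p(x) log2 p(x) = - 8 × (1/8) × log2(1/8) = 3 bits

i.e., 3 bits suffice to transmit the state of x.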

67 Entropy

68 In how many ways can N identical objects be allocated among M bins? Entropy is maximized when all the bins are equally likely (the uniform distribution).

69 Entropy

70 Differential Entropy. Put bins of width Δ along the real line. Differential entropy is maximized (for fixed variance σ^2) when p(x) is Gaussian, in which case H[x] = (1/2)(1 + ln(2πσ^2)).

71 Conditional Entropy

72 The Kullback-Leibler Divergence
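
The divergence defined on this slide is:

    KL(p ‖ q) = - ∫ p(x) ln( q(x) / p(x) ) dx

It satisfies KL(p ‖ q) ≥ 0 with equality if and only if p = q, and it is not symmetric in p and q.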

73 Mutual Information
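
The mutual information defined on this slide is:

    I[x, y] = KL( p(x, y) ‖ p(x) p(y) ) = H[x] - H[x | y] = H[y] - H[y | x]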

