Machine Learning CMPT 726 Simon Fraser University


1 Machine Learning CMPT 726 Simon Fraser University

2 Outline
Comments on general approach.
Probability theory: joint, conditional and marginal probabilities.
Random variables; functions of random variables.
Bernoulli distribution (coin tosses).
Maximum likelihood estimation.
Bayesian learning with conjugate prior.
The Gaussian distribution.
More probability theory: entropy, KL divergence.

3 Our Approach
The course generally follows a statistics perspective, though machine learning is very interdisciplinary.
Emphasis on predictive models: guess the value(s) of the target variable(s) ("pattern recognition").
Generally a Bayesian approach, as in the text. Compared to standard Bayesian statistics:
more complex models (neural nets, Bayes nets)
more discrete variables
more emphasis on algorithms and efficiency

4 Things Not Covered
Within statistics:
hypothesis testing
frequentist theory, learning theory
Other types of data (not random samples):
relational data
scientific data (automated scientific discovery)
Action + learning = reinforcement learning. Could be optional: what do you think?

5 Probability Theory Apples and Oranges

6 Probability Theory
Marginal probability. Conditional probability. Joint probability.
Key point to remember: from the joint probability you can derive everything else (marginals via the sum rule, conditionals via the product rule).

7 Probability Theory
Sum rule. Product rule. Proof of the product rule as an exercise.
Sum rule: in logical terms, the event X = xi is the OR, over all yj, of the joint events (X = xi AND Y = yj); since these are mutually exclusive, p(X = xi) = Σj p(X = xi, Y = yj).
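The two rules can be checked numerically on a small made-up joint distribution (the table below is illustrative, not from the lecture):

```python
# Illustrative joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

# Sum rule: p(X = x) = sum over y of p(X = x, Y = y).
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# Product rule: p(x, y) = p(y | x) p(x), with p(y | x) = p(x, y) / p(x).
def p_y_given_x(y, x):
    return joint[(x, y)] / p_x[x]

# Every cell of the joint table satisfies the product rule.
for (x, y), p in joint.items():
    assert abs(p_y_given_x(y, x) * p_x[x] - p) < 1e-12

print(p_x)  # marginal distribution of X
```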

8 The Rules of Probability
Sum Rule Product Rule

9 Bayes’ Theorem
posterior ∝ likelihood × prior. Exercise: prove this.

10 Bayes’ Theorem: Model Version
Let M be the model, E the evidence. P(M|E) ∝ P(M) × P(E|M).
Intuition:
prior = how plausible is the model (theory) a priori, before seeing any evidence?
likelihood = how well does the model explain the data?
Exercise: prove this.
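A minimal numeric sketch of the model version (the models, priors, and evidence below are invented for illustration): two hypotheses about a coin, updated after observing three heads in a row.

```python
# Two candidate models of a coin, with illustrative prior probabilities.
prior = {"fair": 0.9, "biased": 0.1}

# Likelihood of the evidence E = "three heads in a row" under each model,
# assuming the biased coin lands heads with probability 0.9.
likelihood = {"fair": 0.5 ** 3, "biased": 0.9 ** 3}

# Bayes' theorem: P(M | E) = P(E | M) P(M) / P(E).
p_evidence = sum(likelihood[m] * prior[m] for m in prior)
posterior = {m: likelihood[m] * prior[m] / p_evidence for m in prior}

print(posterior)  # still favours "fair", but less strongly than the prior
```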

11 Probability Densities

12 Transformed Densities
Important concept: function of a random variable. x = g(y), a non-linear invertible transformation. Densities do not transform like ordinary functions: they pick up a Jacobian factor, p_y(y) = p_x(g(y)) |g′(y)|.

13 Expectations
Expectation: E[f] = Σx p(x) f(x) (discrete) or ∫ p(x) f(x) dx (continuous).
Conditional expectation (discrete): E[f | y] = Σx p(x | y) f(x).
Approximate expectation (discrete and continuous): E[f] ≈ (1/N) Σn f(xn), for samples xn drawn from p(x).
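The approximate expectation is just a sample average; a small sketch (standard library only) using f(x) = x² under a standard normal, whose true expectation is 1:

```python
import random

random.seed(0)

# Approximate E[f(x)] by (1/N) * sum_n f(x_n), with x_n drawn from p(x).
# Here p(x) is the standard normal and f(x) = x**2, so E[f] = var[x] = 1.
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
approx = sum(x * x for x in samples) / N

print(approx)  # close to 1 for large N
```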

14 Variances and Covariances
Variance: var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]².
Covariance: cov[x, y] = E[xy] − E[x]E[y].
Exercise: prove the variance formula.
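The variance identity in the exercise, var[x] = E[x²] − E[x]², can be checked numerically on a small discrete distribution (values invented for illustration):

```python
# A small discrete distribution: values and their probabilities.
xs = [1.0, 2.0, 3.0, 4.0]
ps = [0.1, 0.2, 0.3, 0.4]

mean = sum(p * x for p, x in zip(ps, xs))            # E[x]
e_x2 = sum(p * x * x for p, x in zip(ps, xs))        # E[x^2]
var_direct = sum(p * (x - mean) ** 2 for p, x in zip(ps, xs))

# The definition and the shortcut formula agree.
assert abs(var_direct - (e_x2 - mean ** 2)) < 1e-12
print(mean, var_direct)
```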

15 The Gaussian Distribution

16 Gaussian Mean and Variance

17 The Multivariate Gaussian

18 Reading exponential probability formulas
Over an infinite space we cannot use a uniform distribution: the sum Σx p(x) would grow to infinity. Instead, use an exponentially decaying form, e.g. p(n) = (1/2)^n.
Suppose there is a relevant feature f(x) and we want to express that "the greater f(x) is, the less probable x is". Use p(x) ∝ exp(−f(x)).

19 Example: exponential form, sample size
Fair coin: the larger the sample size n, the less likely any particular sequence is: p(n) = 2^(−n), so ln p(n) = −n ln 2.
(Plot: ln p(n) against sample size n is a straight line with slope −ln 2.)
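In place of a plot, the straight-line behaviour of ln p(n) can be checked directly (a small sketch, standard library only):

```python
import math

# For a fair coin, any particular sequence of n flips has p(n) = 2**(-n),
# so ln p(n) = -n * ln 2: a straight line in n.
ns = list(range(1, 11))
log_p = [math.log(2.0 ** -n) for n in ns]

# Consecutive differences are constant: the slope is -ln 2.
slopes = [log_p[i + 1] - log_p[i] for i in range(len(log_p) - 1)]
print(slopes[0])  # about -0.693 = -ln 2
```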

20 Exponential Form: Gaussian mean
The further x is from the mean μ, the less likely it is: ln p(x) decreases with (x − μ)².
(Plot: ln p(x) against (x − μ)² is a straight line with negative slope.)

21 Smaller variance decreases probability
For a point x away from the mean, the smaller the variance σ², the less likely x is: ln p(x) contains the term −(x − μ)²/(2σ²).
(Plot: ln p(x) for fixed x ≠ μ, decreasing as σ² shrinks.)

22 Minimal energy = max probability
The greater the energy E(x) of the joint state, the less probable the state is: p(x) ∝ exp(−E(x)).
(Plot: ln p(x) against E(x) is a straight line with negative slope.)

23 Gaussian Parameter Estimation
Likelihood function for independent, identically distributed (i.i.d.) data points: p(x | μ, σ²) = Πn N(xn | μ, σ²).

24 Maximum (Log) Likelihood

25 Properties of μML and σ²ML
The sample mean μML is an unbiased estimator.
The ML sample variance σ²ML is not: E[σ²ML] = ((N − 1)/N) σ².
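The bias of the ML variance estimator shows up in a small simulation (a sketch; the sample size and trial count are arbitrary): averaged over many datasets of size N = 5 drawn from N(0, 1), the ML variance comes out near (N − 1)/N = 0.8 rather than the true value 1.

```python
import random

random.seed(1)

N, trials = 5, 200_000
avg_mean, avg_ml_var = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = sum(xs) / N                          # sample mean
    v = sum((x - m) ** 2 for x in xs) / N    # ML variance (divide by N)
    avg_mean += m / trials
    avg_ml_var += v / trials

print(avg_mean)    # near 0: the sample mean is unbiased
print(avg_ml_var)  # near (N - 1) / N = 0.8, not 1: the ML variance is biased
```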

26 Curve Fitting Re-visited

27 Maximum Likelihood
Determine the weights w by minimizing the sum-of-squares error: under Gaussian noise, maximizing the likelihood is equivalent to least squares.
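For the simplest case, fitting a straight line y = w0 + w1·x, the least-squares solution is available in closed form (a minimal sketch on invented noise-free data):

```python
# Noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Closed-form solution of the normal equations for a line.
w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
w0 = (sy - w1 * sx) / n

print(w0, w1)  # recovers the generating weights 1.0 and 2.0
```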

28 Predictive Distribution

29 Frequentism vs. Bayesianism
Frequentists: probabilities are measured as frequencies of repeatable events, e.g. coin flips, snowfalls in January.
Bayesians: in addition, allow probabilities to be attached to parameter values (e.g., P(μ = 0)).
Frequentist model selection: give performance guarantees (e.g., 95% of the time the method is right).
Bayesian model selection: choose a prior distribution over parameters, maximize the resulting cost function (the posterior).

30 MAP: A Step towards Bayes
Key point: a Bayesian prior over the weights leads to regularization.
Determine w by minimizing the regularized sum-of-squares error.
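For the simplest possible model, y = w·x with a single weight, the MAP/regularized solution has a one-line closed form, which makes the shrinkage effect of the prior visible (a sketch on invented data; lam plays the role of the regularization hyperparameter):

```python
# Noise-free data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def map_weight(lam):
    """Minimize sum_n (y_n - w*x_n)^2 + lam * w^2; closed form for one weight."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

print(map_weight(0.0))   # 2.0: with no regularization, MAP equals ML
print(map_weight(10.0))  # shrunk towards 0 by the prior
```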

31 Bayesian Curve Fitting
Probably skip

32 Bayesian Predictive Distribution

33 Model Selection Cross-Validation

34 Curse of Dimensionality
Rule of Thumb: 10 datapoints per parameter.

35 Curse of Dimensionality
Polynomial curve fitting, M = 3.
Gaussian densities in higher dimensions.

36 Decision Theory
Inference step: determine either the joint p(x, t) or the posterior p(t | x).
Decision step: for given x, determine the optimal t.

37 Minimum Misclassification Rate

38 Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’.
(Loss matrix: rows index the truth, columns the decision; deciding ‘normal’ when the truth is ‘cancer’ incurs a much larger loss than the reverse.)
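A sketch of the decision rule with an illustrative loss matrix (the numbers are invented, not the lecture's): missing a cancer is penalized 1000× more than a false alarm, so the optimal decision flips to ‘cancer’ at a very small posterior probability.

```python
# Loss matrix: loss[(truth, decision)].
loss = {
    ("cancer", "cancer"): 0.0, ("cancer", "normal"): 1000.0,
    ("normal", "cancer"): 1.0, ("normal", "normal"): 0.0,
}

def best_decision(p_cancer):
    """Return the decision minimizing expected loss, given P(cancer | x)."""
    p = {"cancer": p_cancer, "normal": 1.0 - p_cancer}
    expected = {
        d: sum(p[t] * loss[(t, d)] for t in ("cancer", "normal"))
        for d in ("cancer", "normal")
    }
    return min(expected, key=expected.get)

print(best_decision(0.01))    # 'cancer': even at 1% the expected loss favours it
print(best_decision(0.0001))  # 'normal': below the threshold of about 1/1001
```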

39 Minimum Expected Loss
Decision regions Rj are chosen to minimize the expected loss E[L] = Σk Σj ∫Rj Lkj p(x, Ck) dx.

40 Why Separate Inference and Decision?
Minimizing risk (the loss matrix may change over time).
Unbalanced class priors.
Combining models.

41 Decision Theory for Regression
Inference step: determine p(t | x).
Decision step: for given x, make an optimal prediction y(x) for t.
Loss function: E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt.
Probably skip.

42 The Squared Loss Function

43 Generative vs Discriminative
Generative approach: model the class-conditional densities p(x | Ck) and priors p(Ck), then use Bayes’ theorem to obtain p(Ck | x).
Discriminative approach: model p(Ck | x) directly.

44 Entropy
Important quantity in coding theory, statistical physics, and machine learning.

45 Entropy

46 Entropy
Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x? If all states are equally likely, H = −Σ (1/8) log2(1/8) = 3 bits.
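Both cases can be computed directly; the non-uniform distribution below is a standard textbook example whose entropy works out to exactly 2 bits:

```python
import math

# Uniform over 8 states: H = -sum p log2 p = log2(8) = 3 bits.
p = [1 / 8] * 8
h_uniform = -sum(pi * math.log2(pi) for pi in p)

# A non-uniform distribution over the same 8 states has lower entropy.
q = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
h_nonuniform = -sum(qi * math.log2(qi) for qi in q)

print(h_uniform)     # 3.0
print(h_nonuniform)  # 2.0
```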

47 Entropy General compression principle, widely used, intuitive.

48 The Maximum Entropy Principle
Commonly used principle for model selection: maximize entropy.
Example: in how many ways can N identical objects be allocated to M bins? The entropy is maximized when the objects are spread evenly, i.e. each bin holds N/M objects (the uniform distribution).

49 Differential Entropy and the Gaussian
Put bins of width Δ along the real line. Differential entropy is maximized (for fixed variance σ²) when p(x) is Gaussian, in which case H[x] = ½(1 + ln(2πσ²)).

50 Conditional Entropy

51 The Kullback-Leibler Divergence
Used in many ML applications as a measure of predictive quality.
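A minimal sketch for discrete distributions (the two distributions are invented for illustration), showing the key properties: KL(p‖p) = 0, KL(p‖q) > 0 when the distributions differ, and KL is not symmetric.

```python
import math

def kl(p, q):
    """KL(p || q) = sum_x p(x) ln(p(x) / q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, p))  # 0.0: zero exactly when the distributions match
print(kl(p, q))  # positive
print(kl(q, p))  # differs from kl(p, q): KL is not symmetric
```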

52 Mutual Information
