1 Machine Learning, CMPT 726, Simon Fraser University. Chapter 1: Introduction
2 Outline
- Comments on general approach
- Probability theory
- Joint, conditional and marginal probabilities
- Random variables
- Functions of random variables
- Bernoulli distribution (coin tosses)
- Maximum likelihood estimation
- Bayesian learning with conjugate prior
- The Gaussian distribution
- More probability theory
- Entropy
- KL divergence
3 Our Approach
- The course generally follows statistics, but the field is very interdisciplinary.
- Emphasis on predictive models: guess the value(s) of target variable(s), hence "pattern recognition".
- Generally a Bayesian approach, as in the text.
- Compared to standard Bayesian statistics: more complex models (neural nets, Bayes nets), more discrete variables, and more emphasis on algorithms and efficiency.
4 Things Not Covered
- Within statistics: hypothesis testing; frequentist theory, learning theory.
- Other types of data (not random samples): relational data; scientific data (automated scientific discovery).
- Action + learning = reinforcement learning. Could be optional: what do you think?
9 Bayes' Theorem
$$p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}$$
posterior $\propto$ likelihood $\times$ prior
Exercise: prove this.
10 Bayes' Theorem: Model Version
Let M be the model, E the evidence. Then $P(M|E) \propto P(M) \times P(E|M)$.
Intuition:
- prior = how plausible is the event (model, theory) a priori, before seeing any evidence?
- likelihood = how well does the model explain the data?
Exercise: prove this.
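To make the proportionality concrete, here is a minimal sketch in Python; the two-model coin setup and all numbers are invented for illustration, not from the slides:

```python
# Bayes' theorem as discrete updating: posterior ∝ prior × likelihood.
# Hypothetical example: is the coin fair (P(heads)=0.5) or biased (P(heads)=0.9)?
import numpy as np

priors = np.array([0.5, 0.5])             # P(M) for: fair coin, biased coin
likelihoods = np.array([0.5**8, 0.9**8])  # P(E|M), evidence E = 8 heads in a row

unnormalized = priors * likelihoods       # P(M) * P(E|M)
posteriors = unnormalized / unnormalized.sum()  # divide by P(E) to normalize

print(posteriors)  # ≈ [0.009, 0.991]: the biased-coin model now dominates
```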
18 Reading exponential prob formulas
In an infinite space we cannot give every point equal probability: the sum $\sum_x p(x)$ would grow without bound. Instead, use an exponential, e.g. $p(n) = (1/2)^n$.
Suppose there is a relevant feature f(x) and we want to express that "the greater f(x) is, the less probable x is": use $p(x) = \exp(-f(x))$ (up to normalization).
19 Example: exponential form, sample size
Fair coin: the larger the sample size n, the less likely any particular sample is: $p(n) = 2^{-n}$.
[Plot: $\ln p(n)$ against sample size n, a straight line with slope $-\ln 2$.]
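The slide note suggests trying this plot in MATLAB; here is an equivalent sketch in Python, assuming matplotlib is available:

```python
# Plot ln p(n) for the fair-coin distribution p(n) = 2^(-n):
# the log-probability is linear in n, with slope -ln 2.
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(1, 31)
log_p = -n * np.log(2)  # ln p(n) = ln 2^(-n) = -n ln 2

plt.plot(n, log_p)
plt.xlabel("sample size n")
plt.ylabel("ln p(n)")
plt.title("Longer samples are exponentially less likely")
plt.show()
```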
20 Exponential Form: Gaussian mean
The further x is from the mean μ, the less likely it is: $\ln p(x) = \text{const} - \frac{(x-\mu)^2}{2\sigma^2}$.
[Plot: $\ln p(x)$ against x, an inverted parabola centred at μ.]
21 Smaller variance decreases probability
The smaller the variance σ², the less likely x is (away from the mean): the quadratic term $-\frac{(x-\mu)^2}{2\sigma^2}$ becomes more negative as σ² shrinks.
[Plot: $\ln p(x)$ for several values of σ²; smaller σ² gives a steeper fall-off.]
22 Minimal energy = max probability
The greater the energy E(x) of the joint state, the less probable the state is: $p(x) \propto \exp(-E(x))$, so $\ln p(x) = -E(x) + \text{const}$.
[Plot: $\ln p(x)$ against E(x), a straight line with slope −1.]
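A small sketch of this energy-to-probability conversion, $p(x) \propto \exp(-E(x))$, over a finite state space; the states and energies below are made up for illustration:

```python
# Turn an arbitrary energy function E(x) into a normalized distribution:
# lower energy => higher probability.
import numpy as np

energies = np.array([0.0, 1.0, 2.0, 5.0])  # E(x) for four hypothetical joint states

p = np.exp(-energies)
p /= p.sum()       # normalize so the probabilities sum to 1

print(p)           # the lowest-energy state gets the highest probability
print(np.log(p))   # ln p(x) = -E(x) + const: linear in the energy
```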
23 Gaussian Parameter Estimation
Likelihood function for independent, identically distributed data points:
$$p(\mathbf{x}|\mu,\sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu,\sigma^2)$$
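A sketch of the resulting maximum likelihood estimates, assuming i.i.d. Gaussian data: the log-likelihood is maximized by the sample mean and the (biased, divide-by-N) sample variance. The synthetic data here are for illustration only.

```python
# Maximum likelihood for a Gaussian: closed-form estimates from i.i.d. data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic i.i.d. data

mu_ml = x.mean()                     # ML estimate of the mean
var_ml = ((x - mu_ml) ** 2).mean()   # ML estimate of the variance (divides by N)

print(mu_ml, var_ml)  # should be close to the true 2.0 and 1.5**2 = 2.25
```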
29 Frequentism vs. Bayesianism
- Frequentists: probabilities are measured as the frequencies of repeatable events, e.g., coin flips, snowfalls in January.
- Bayesians: in addition, allow probabilities to be attached to parameter values, e.g., P(μ = 0).
- Frequentist model selection: give performance guarantees (e.g., 95% of the time the method is right).
- Bayesian model selection: choose a prior distribution over parameters, then maximize the resulting cost function (the posterior).
30 MAP: A Step towards Bayes
Key point: a Bayesian prior on the parameters (with its hyperparameters) leads to regularization.
Determine $\mathbf{w}_{\text{MAP}}$ by minimizing the regularized sum-of-squares error,
$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$$
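For a model that is linear in the parameters, this minimization has the closed form $\mathbf{w} = (\lambda I + X^\top X)^{-1} X^\top \mathbf{t}$, i.e. ridge regression. A sketch with synthetic data (the design matrix, weights, and λ below are invented for illustration):

```python
# MAP as regularized least squares: solve (lam*I + X^T X) w = X^T t.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                # design matrix: 50 points, 3 features
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=50)  # noisy targets

lam = 0.1  # regularization strength, set by the prior's (hyper)parameters
w_map = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ t)

print(w_map)  # close to w_true; a larger lam shrinks the weights toward zero
```

Larger λ corresponds to a tighter prior around zero, pulling the estimated weights toward zero; λ = 0 recovers ordinary (maximum likelihood) least squares.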