Oliver Schulte Machine Learning 726


Linear Regression
Oliver Schulte, Machine Learning 726

Parameter Learning Scenarios

The general problem: predict the value of a continuous variable from one or more continuous features.

Parent Node →          Discrete                               Continuous
Child: Discrete        Maximum Likelihood, Decision Trees     logit distribution (logistic regression)
Child: Continuous      conditional Gaussian (not discussed)   linear Gaussian (linear regression)

Single Node Case

Examples of continuous random variables:
- NHL: sum of games played after 7 years
- House prices
- Term marks in a course

How do we define a probability distribution? We can no longer use a table of values.

Discretization

Simplest idea: assign continuous data to discrete groups (bins), e.g. A, B, C, D, E, F.
- Equal width: set bins to fixed-size intervals of the variable's values, e.g. 100-90, 90-80, 80-70, 70-60, ...
- Equal frequency: set bins so that each bin has the same number of students, e.g. choose cut-offs so that 10% of students get an A, 10% a B, 10% a C.
- Curving: set bins to match a prior distribution, e.g. the grade distribution in other 3rd-year CS courses.
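The three strategies above differ only in how the bin edges are chosen. A minimal sketch of the first two, using made-up term marks (the data and cut-offs are illustrative, not from the course):

```python
import numpy as np

# Hypothetical term marks out of 100 (illustrative data, not from the course).
marks = np.array([42, 55, 58, 61, 64, 67, 70, 73, 76, 79, 82, 85, 88, 91, 95])

# Equal width: fixed-size 10-mark intervals over the value range.
width_edges = np.arange(40, 101, 10)             # 40-50, 50-60, ..., 90-100
width_counts, _ = np.histogram(marks, bins=width_edges)

# Equal frequency: cut-offs chosen so each bin holds about the same
# number of students (here, quartiles of the marks).
freq_edges = np.quantile(marks, [0.0, 0.25, 0.5, 0.75, 1.0])
freq_counts, _ = np.histogram(marks, bins=freq_edges)
```

Equal-width counts vary from bin to bin, while the quartile-based bins each hold roughly a quarter of the class.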

Density Function

Define a function p(X=x). It behaves like a probability for X = x, but not quite: the better intuition is that the density function defines probabilities for intervals via integration. The normalization condition Σx P(X=x) = 1 becomes ∫ p(x) dx = 1.

Exercise: find the p.d.f. of the uniform distribution over a closed interval [a, b].
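A numeric sanity check for the exercise: assuming the answer is the constant 1/(b - a) inside the interval and 0 outside, the density should integrate to 1 (the interval [2, 5] below is an arbitrary choice):

```python
import numpy as np

def uniform_pdf(x, a, b):
    """Uniform density on [a, b]: 1/(b - a) inside the interval, 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

# Probabilities of intervals come from integration;
# the integral over the whole interval should be 1.
a, b = 2.0, 5.0
dx = 1e-3
xs = np.arange(a, b, dx) + dx / 2      # midpoint rule over [a, b]
total = np.sum(uniform_pdf(xs, a, b)) * dx
```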

Probability Densities: x can be any real value.

Histograms approximate densities. See also the 310-grades sheets in the course Google Drive, the histogram–Gaussian figure, and https://www.benlcollins.com/spreadsheets/histograms-normal-distribution/

Densities as limit histograms

Let B be a bin in a normalized histogram of N data points:
- height(B) = count(B) / (N × width(B))
- area(B) = height(B) × width(B) = count(B) / N
- P(y in B) ≈ area(B)

As width(B) → 0, the height of the bin containing y converges to the density p(y). So we can think of p(x) as proportional to the count of data points around x.
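The bin arithmetic above can be checked numerically: with heights normalized as count(B) / (N × width(B)), the total area of the histogram is 1, just like a density. (The Gaussian samples and bin count below are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)   # N data points

# Normalized histogram: height(B) = count(B) / (N * width(B)),
# so area(B) = height(B) * width(B) = count(B) / N ≈ P(y in B).
counts, edges = np.histogram(samples, bins=50)
widths = np.diff(edges)
heights = counts / (samples.size * widths)

total_area = np.sum(heights * widths)    # sums to 1, like a density
```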

Mean

Also known as the average, expectation, or mean of P. Notation: E[X], μ. For a density function, define E[X] = ∫ x p(x) dx.

Variance

The variance measures the spread of continuous values:
1. Find the mean μ of the distribution.
2. For each point, find its distance to the mean and square it. (Squaring keeps distances positive and penalizes large deviations.)
3. Take the expected value of the squared distance: Var[X] = E[(X − μ)²] = ∫ (x − μ)² p(x) dx.
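Both integrals can be checked against a density with known answers. For the uniform density on [a, b], the mean is (a + b)/2 and the variance is (b − a)²/12; with [0, 6] (an arbitrary choice), both come out to 3:

```python
import numpy as np

a, b = 0.0, 6.0
dx = 1e-4
xs = np.arange(a, b, dx) + dx / 2        # midpoint rule over [a, b]
p = np.full_like(xs, 1.0 / (b - a))      # uniform density on [a, b]

mean = np.sum(xs * p) * dx               # E[X]   = integral of x p(x) dx
var = np.sum((xs - mean) ** 2 * p) * dx  # Var[X] = integral of (x - mean)^2 p(x) dx
```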

The Gaussian Density Function
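The standard form of the Gaussian density, with mean μ and variance σ²:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```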

The Gaussian Distribution

Meet the exponential family

A common way to define a probability density p(x) is as an exponential function of x.
- Simple mathematical motivation: multiplying numbers between 0 and 1 yields a number between 0 and 1, e.g. (1/2)^n, (1/e)^x.
- Deeper mathematical motivation: exponential p.d.f.s have good statistical properties for learning, e.g. conjugate priors, maximum likelihood, sufficient statistics. In fact, the exponential family is the only class of distributions with these properties.
- To see this for the conjugate prior: the likelihood is typically a product (i.i.d. data), so the posterior is proportional to prior × product. If the prior is also a product (exponential), then the posterior is a product like the prior. If the prior is something else, then (something else) × product is usually not of the same form.

Reading exponential probability formulas

Suppose there is a relevant feature f(x) and I want to express that "the greater f(x) is, the less probable x is": as f(x) increases, p(x) decreases. Use p(x) = α exp(−f(x)).

Example: exponential form of sample size

Fair coin: the longer the sample, the less likely any particular sequence is: p(n) = 2^−n. Plotting ln[p(n)] = −n ln 2 against the sample size n gives a straight line; the slope goes down because of the minus sign.

Location Parameter

The further x is from the center μ, the less likely it is: ln[p(x)] decreases with the squared distance (x − μ)².

Spread/Precision Parameter

The greater the spread σ², the more likely x is to fall away from the mean. Equivalently, the greater the precision β = 1/σ², the less likely x is to fall away from the mean.

Normalization

Let p*(x) be an unnormalized density function. To make a probability density function, we need to find a normalization constant α such that ∫ α p*(x) dx = 1. Therefore α = 1 / ∫ p*(x) dx. For the Gaussian (Laplace 1782), ∫ exp(−(x−μ)²/(2σ²)) dx = √(2πσ²). So all we have to do is solve the integral!
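A quick numeric confirmation of the Gaussian normalization constant (the values of μ and σ are arbitrary):

```python
import math
import numpy as np

mu, sigma = 1.0, 2.0

# Unnormalized Gaussian p*(x) = exp(-(x - mu)^2 / (2 sigma^2)).
dx = 1e-3
xs = np.arange(mu - 10 * sigma, mu + 10 * sigma, dx)
p_star = np.exp(-((xs - mu) ** 2) / (2 * sigma ** 2))

integral = np.sum(p_star) * dx           # approximately sqrt(2 pi sigma^2)
alpha = 1.0 / integral                   # the normalization constant
```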

Central Limit Theorem

The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows (Laplace, 1810). Example: the sum of N uniform [0, 1] random variables.
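A small simulation of the example (sample size, seed, and N are arbitrary choices): a uniform [0, 1] variable has mean 1/2 and variance 1/12, so the sum of N of them has mean N/2 and variance N/12, and its histogram is already close to a Gaussian for modest N.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 12                                   # number of uniform [0, 1] variables per sum
sums = rng.random((200_000, N)).sum(axis=1)

# Uniform [0, 1] has mean 1/2 and variance 1/12,
# so the sum has mean N/2 = 6 and variance N/12 = 1.
emp_mean = sums.mean()
emp_var = sums.var()
```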

Log-linear Models

Many probability distributions used in machine learning follow this exponential pattern for assigning a probability or density to an object x:
- Define a set of aggregate/basis functions T1(x), ..., Tm(x), known as sufficient statistics.
- The log-probability is proportional to a weighted sum of the sufficient statistics: ln p(x) ∝ Σi wi Ti(x).
- This works for anything: documents, videos, term marks, ... (e.g., discrete Bayes net estimation).
- Division of labour: define the statistics vs. learn the weights.
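A minimal sketch of the pattern over a small discrete domain; the weights and sufficient statistics below are made-up choices for illustration, not from the slides:

```python
import math

def sufficient_stats(x):
    """Basis functions T1(x) = x, T2(x) = x^2 (hypothetical choices)."""
    return [float(x), float(x * x)]

weights = [0.5, -0.1]                    # the learned part: one weight per statistic
domain = range(-5, 6)                    # a small discrete domain, so Z is a sum

def score(x):
    # log p(x) up to a constant: weighted sum of sufficient statistics
    return sum(w * t for w, t in zip(weights, sufficient_stats(x)))

Z = sum(math.exp(score(x)) for x in domain)          # normalization constant
probs = {x: math.exp(score(x)) / Z for x in domain}  # a proper distribution
```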

Normal Distribution Parameter Learning

Maximum Likelihood Learning

Same fundamental idea as with discrete variables: maximize the data likelihood given the parameters. For the Gaussian, write down the likelihood function L, take derivatives of L with respect to μ and σ², and set them to 0. Result: the maximum likelihood estimates are the sample mean and the sample variance, μ̂ = (1/N) Σn xn and σ̂² = (1/N) Σn (xn − μ̂)², i.e., the mean and variance computed directly from the data. For details see the text and the assignment.
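The closed-form result can be verified numerically: the MLE for a Gaussian is exactly the sample mean and the (1/N) sample variance. The data below is simulated with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=50_000)   # simulated Gaussian data

# Maximum likelihood estimates: sample mean and (biased, 1/N) sample variance.
mu_mle = data.sum() / data.size
var_mle = ((data - mu_mle) ** 2).sum() / data.size
```

With enough data the estimates land close to the generating parameters (μ = 5, σ² = 4).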

Conclusion

- Density functions represent probability distributions for continuous data; they are like histograms with infinitely many bins.
- The most important density is the Gaussian distribution, with two parameters: the mean μ represents the typical or expected value (the probability of a point depends on its distance to the mean), and the variance σ² represents the spread or flatness of the curve.
- As with discrete parameters, MLE estimates are parameter values computed directly from the data.