Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith

Using Data (diagram): Data → Model → Action. The Data → Model step is estimation; regression; learning; training. The Model → Action step is classification; decision. This enterprise goes by many names: pattern classification, machine learning, statistical inference, ...

Probabilistic Models Let X and Y be random variables. (continuous, discrete, structured,...) Goal: predict Y from X. A model defines P(Y = y | X = x). 1. Where do models come from? 2. If we have a model, how do we use it?

Using a Model We want to classify a message, x, as spam or mail: y ∈ {spam, mail}. (Diagram: x goes into the model, which outputs P(spam | x) and P(mail | x).)

Bayes’ Rule: P(y | x) = P(x | y) P(y) / P(x). The left-hand side, P(y | x), is what we said the model must define. P(x | y) is the likelihood: one distribution over complex observations per y. P(y) is the prior. The denominator, P(x), normalizes the product into a distribution.
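A minimal numeric sketch of applying the rule to the spam/mail example; all numbers below are illustrative placeholders, not values from the lecture:

```python
# Hedged sketch of Bayes' rule for spam vs. mail; every number here is made up.
likelihood = {"spam": 2e-6, "mail": 5e-7}   # P(x | y): one distribution over messages per class
prior = {"spam": 0.455, "mail": 0.545}      # P(y): class prior

# P(x) is what normalizes the product into a distribution over y.
p_x = sum(likelihood[y] * prior[y] for y in prior)
posterior = {y: likelihood[y] * prior[y] / p_x for y in prior}
print(posterior)   # P(y | x); the two values sum to 1
```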

Naive Bayes Models Suppose X = (X_1, X_2, X_3, ..., X_m). Let P(x | y) = P(x_1 | y) · P(x_2 | y) · ... · P(x_m | y), i.e., assume the observed variables are conditionally independent given Y.
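A small sketch of that factorization, computed in log space to avoid numerical underflow; the table p_xj_given_y and its contents are hypothetical placeholders:

```python
import math

def naive_bayes_log_likelihood(x, y, p_xj_given_y):
    """Compute log P(x | y) = sum_j log P(x_j | y) under the Naive Bayes assumption.

    x is a tuple (x_1, ..., x_m); p_xj_given_y[j][y][value] holds P(X_j = value | Y = y).
    """
    return sum(math.log(p_xj_given_y[j][y][x_j]) for j, x_j in enumerate(x))
```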

Naive Bayes: Graphical Model (figure: Y is the parent of X_1, X_2, X_3, ..., X_m; each X_j depends only on Y).

Part II Where do the model parameters come from?

Using Data (diagram, repeated): Data → Model → Action; the Data → Model step is estimation; regression; learning; training.

Warning This is a HUGE topic. We will barely scratch the surface.

Forms of Models Recall that a model defines P(x | y) and P(y). These can have a simple multinomial form, like P(mail) = 0.545, P(spam) = 0.455. Or they can take on some other form, like a binomial, Gaussian, etc.

Example: Gaussian Suppose y is {male, female}, and one observed variable is H, height. P(H | male) ~ N(μ_m, σ_m²), P(H | female) ~ N(μ_f, σ_f²). How to estimate μ_m, σ_m², μ_f, σ_f²?

Maximum Likelihood Pick the model that makes the data as likely as possible: choose the model that maximizes P(data | model).

Maximum Likelihood (Gaussian) Estimating the parameters μ_m, σ_m², μ_f, σ_f² can be seen as: fitting the data; estimating an underlying statistic (a point estimate).
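A minimal sketch of the Gaussian MLE (the sample mean and the divide-by-n variance), with made-up height data:

```python
import math

def mle_gaussian(samples):
    """MLE for a Gaussian: sample mean and biased (divide-by-n) sample variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((s - mu) ** 2 for s in samples) / n
    return mu, var

def gaussian_pdf(h, mu, var):
    return math.exp(-(h - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy height data in meters; the numbers are illustrative only.
mu_m, var_m = mle_gaussian([1.78, 1.83, 1.75, 1.80, 1.72])
mu_f, var_f = mle_gaussian([1.60, 1.68, 1.65, 1.70, 1.63])
print(gaussian_pdf(1.76, mu_m, var_m), gaussian_pdf(1.76, mu_f, var_f))
```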

Using the model

Example: Regression Suppose y is actual runtime, and x is input length. Regression tries to predict some continuous variables from others.

Regression Linear: assume linear relationship, fit a line. We can turn this into a model!

Linear Model Given x, predict y: y = β_1 x + β_0 + ε, where ε ~ N(0, σ²); β_1 x + β_0 is the true regression line and ε is the random deviation.

Principle of Least Squares Minimize the sum of squared vertical deviations of the points from the fitted line. There is a unique, closed-form solution!
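A minimal sketch of that closed-form solution for one predictor, on made-up runtime data:

```python
def least_squares_line(xs, ys):
    """Fit y = b1*x + b0 by minimizing the sum of squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b1, b0

# Toy data: input length (x) vs. runtime in seconds (y); illustrative only.
b1, b0 = least_squares_line([10, 20, 30, 40, 50], [1.1, 1.9, 3.2, 3.8, 5.1])
print(b1, b0)
```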

Other kinds of regression: transform one or both variables (e.g., take a log); polynomial regression, where least squares still reduces to a linear system (see the sketch below); multivariate regression; logistic regression.
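As one example, polynomial regression stays a linear least-squares problem because the model is linear in its coefficients. A sketch, with an illustrative degree and data (numpy assumed available):

```python
import numpy as np

def poly_fit(xs, ys, degree):
    """Least-squares polynomial fit: the design matrix has columns x^degree, ..., x, 1,
    so the fit is an ordinary linear least-squares problem in the coefficients."""
    X = np.vander(np.asarray(xs, dtype=float), degree + 1)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(ys, dtype=float), rcond=None)
    return coeffs  # highest-degree coefficient first

print(poly_fit([0, 1, 2, 3, 4], [1.0, 1.8, 4.9, 9.7, 17.2], degree=2))
```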

Example: text categorization. Bag-of-words model: x is a histogram of counts for all words; y is a topic.

MLE for Multinomials “Count and Normalize”
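"Count and normalize" means the MLE of each word's probability is its count divided by the total number of tokens. A tiny sketch on a made-up document:

```python
from collections import Counter

def mle_multinomial(tokens):
    """MLE for a multinomial over words: count each word, then normalize by the total."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(mle_multinomial("the spam filter saw the spam".split()))
# {'the': 1/3, 'spam': 1/3, 'filter': 1/6, 'saw': 1/6}
```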

The Truth about MLE You will never see all the words. For many models, MLE isn’t safe. To understand why, consider a typical evaluation scenario.

Evaluation Train your model on some data. How good is the model? Test on different data that the system never saw before.  Why?

Tradeoff: a model that overfits the training data doesn’t generalize; a model kept simple enough to have low variance may also have low accuracy.

Text categorization again Suppose some word w never appeared in any document in training, ever. What is the above probability for a new document containing w at test time? Under the MLE it is zero, so the whole document gets probability zero.

Solutions: Regularization: prefer less extreme parameters. Smoothing: “flatten out” the distribution (see the sketch below). Bayesian Estimation: construct a prior over model parameters, then train to maximize P(data | model) × P(model).
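One common way to smooth a multinomial (not necessarily the specific recipe the lecture has in mind) is add-alpha / Laplace smoothing, which gives unseen words nonzero probability. A sketch:

```python
from collections import Counter

def smoothed_multinomial(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothing: pretend every vocabulary word was seen alpha extra times.
    alpha=1 is add-one (Laplace) smoothing; larger alpha flattens the distribution more."""
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

vocab = {"the", "spam", "filter", "saw", "viagra"}
print(smoothed_multinomial("the spam filter saw the spam".split(), vocab))
# "viagra" was never observed but still gets probability 1/11 with alpha=1
```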

One More Point Building models is not the only way to be empirical: neural networks, SVMs, and instance-based learning are alternatives. MLE and smoothed/Bayesian estimation are not the only ways to estimate: you can minimize error instead, for example (“discriminative” estimation).

Assignment 3: Spam detection. We provide a few thousand examples. Perform EDA and pick features. Estimate probabilities. Build a Naive Bayes classifier.
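A hypothetical end-to-end sketch of such a classifier, using bag-of-words features and add-one smoothing; the class labels, data format, and feature choice here are assumptions, not the assignment's actual specification:

```python
import math
from collections import Counter

class NaiveBayesSpamClassifier:
    """Bag-of-words Naive Bayes with add-one smoothing (a sketch, not the official solution)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def train(self, documents, labels):
        """documents: list of token lists; labels: 'spam' or 'mail' for each document."""
        self.vocab = {w for doc in documents for w in doc}
        self.class_counts = Counter(labels)                      # for the prior P(y)
        self.word_counts = {y: Counter() for y in self.class_counts}
        for doc, y in zip(documents, labels):
            self.word_counts[y].update(doc)
        self.total_tokens = {y: sum(c.values()) for y, c in self.word_counts.items()}
        self.n_docs = len(labels)

    def predict(self, doc):
        """Return the class maximizing log P(y) + sum over words of log P(w | y)."""
        scores = {}
        for y in self.class_counts:
            score = math.log(self.class_counts[y] / self.n_docs)
            denom = self.total_tokens[y] + self.alpha * len(self.vocab)
            for w in doc:
                if w in self.vocab:          # skip words never seen in training
                    score += math.log((self.word_counts[y][w] + self.alpha) / denom)
            scores[y] = score
        return max(scores, key=scores.get)

# Tiny illustrative usage with made-up messages.
clf = NaiveBayesSpamClassifier()
clf.train([["cheap", "pills", "now"], ["meeting", "at", "noon"]], ["spam", "mail"])
print(clf.predict(["cheap", "meeting", "pills"]))
```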