Basics of Statistical Estimation

Basics of Statistical Estimation Alan Ritter rittera@cs.cmu.edu

Parameter Estimation How to estimate parameters from data? Maximum Likelihood Principle: Choose the parameters that maximize the probability of the observed data!

Maximum Likelihood Estimation Recipe: use the log-likelihood, differentiate with respect to the parameters, equate to zero, and solve.

An Example Let's start with the simplest possible case: a single observed variable, flipping a bent coin. We observe a sequence of heads and tails, HTTTTTHTHT. Goal: estimate the probability that the next flip comes up heads.
Speaker notes: Before getting into modeling words and documents, here is some background on estimating parameters in Bayesian networks. We start with the simplest possible case, flipping a coin. The general setup: we perform a sequence of coin-flip experiments and record whether each result is heads or tails. For example, we might flip a coin 10 times and get the sequence above. Our goal is to estimate the probability that the next flip comes up heads.

Assumptions There is a fixed parameter θ_H, the probability that a flip comes up heads, and each flip is independent: one flip does not affect the outcome of the others. Together these assumptions are called IID: Independent and Identically Distributed.
Speaker notes: There are a couple of important assumptions here. First, there is a fixed parameter θ_H (the probability a flip comes up heads) that does not change; for example, we don't switch coins in the middle or bend the coin more during the experiment. Second, each flip is independent: the outcome of one flip does not affect the outcome of the others.

Example Let's assume we observe the sequence HTTTTTHTHT. What is the best value of θ_H, the probability of heads? Intuition: it should be 0.3 (3 heads out of 10 flips). Question: how do we justify this?
Speaker notes: Going back to our example sequence of coin flips, what should our best guess be about the probability the coin lands heads?

Maximum Likelihood Principle The value of θ_H which maximizes the probability of the observed data is best! Based on our assumptions, the probability of "HTTTTTHTHT" (3 heads, 7 tails) is L(θ_H) = θ_H^3 (1 − θ_H)^7. This is the likelihood function.

Maximum Likelihood Principle [Plot: the probability of "HTTTTTHTHT" as a function of θ_H; the curve peaks at θ_H = 0.3.]
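A minimal sketch of the curve on this slide (assuming NumPy is available; the grid of θ values is illustrative): evaluate the likelihood θ_H^3 (1 − θ_H)^7 on a grid and find its peak.

```python
import numpy as np

heads, tails = 3, 7                      # counts from "HTTTTTHTHT"
thetas = np.linspace(0.01, 0.99, 99)     # grid of candidate values for theta_H
likelihood = thetas**heads * (1 - thetas)**tails

# The grid point with the highest likelihood sits at the peak of the plotted curve
print(thetas[np.argmax(likelihood)])     # ~0.3
```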

Maximum Likelihood value of θ_H Log identities: log(xy) = log(x) + log(y) and log(x^n) = n log(x), so the log-likelihood is log L(θ_H) = 3 log θ_H + 7 log(1 − θ_H).

Maximum Likelihood value of θ_H Differentiating and setting to zero: d/dθ_H [3 log θ_H + 7 log(1 − θ_H)] = 3/θ_H − 7/(1 − θ_H) = 0, which gives θ_H = 3/10 = 0.3, matching the intuition.
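A short sketch (assuming SciPy is available; variable and function names are illustrative) that computes the closed-form MLE and checks it against a numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

flips = "HTTTTTHTHT"
heads, tails = flips.count("H"), flips.count("T")

# Closed-form MLE: the fraction of flips that came up heads
theta_mle = heads / (heads + tails)

# Numerical check: minimize the negative log-likelihood over (0, 1)
def neg_log_likelihood(theta):
    return -(heads * np.log(theta) + tails * np.log(1.0 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(theta_mle, result.x)   # 0.3 and ~0.3
```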

The problem with Maximum Likelihood What if the coin doesn't look very bent? Shouldn't θ_H be somewhere around 0.5? What if we saw 3,000 heads and 7,000 tails? Should that really give the same estimate as 3 out of 10? With maximum likelihood there is no way to quantify our uncertainty and no way to incorporate our prior knowledge. Q: how do we deal with this problem?

Bayesian Parameter Estimation Let's just treat θ_H like any other variable and put a prior on it! Encode our prior knowledge about plausible values of θ_H using a probability distribution, and use probability to reason about our uncertainty about the parameter θ_H. We now work with two distributions: a prior over θ_H and the likelihood of the data given θ_H.

Posterior Over θ_H By Bayes' rule, p(θ_H | D) = p(D | θ_H) p(θ_H) / p(D), where D is the observed sequence of flips.

How can we encode prior knowledge? Example: the coin doesn't look very bent, so assign higher probability to values of θ_H near 0.5. Solution: the Beta distribution, Beta(θ_H; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ_H^(α−1) (1 − θ_H)^(β−1), which looks similar to the likelihood and is the conjugate prior to the Bernoulli distribution.
Speaker notes: We now have new parameters α and β, referred to as "hyperparameters"; as you can imagine, we could put additional prior distributions on them as well. The Gamma function can be understood as a continuous generalization of the factorial function.

Beta Distribution [Plots of four examples: Beta(1,1) is uniform, Beta(5,5) is peaked around 0.5, Beta(100,100) is sharply concentrated around 0.5, and Beta(0.5,0.5) is U-shaped, favoring values near 0 and 1.]
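To make the four priors on this slide concrete, here is a small sketch (assuming scipy.stats is available) that evaluates each Beta density at a couple of points:

```python
from scipy.stats import beta

# The four priors pictured on the slide
for a, b in [(1, 1), (5, 5), (100, 100), (0.5, 0.5)]:
    d = beta(a, b)
    print(f"Beta({a},{b}): pdf(0.3)={d.pdf(0.3):.3f}  pdf(0.5)={d.pdf(0.5):.3f}")
```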

Marginal Probability over a single Toss Under a Beta(α, β) prior, P(X = heads) = α / (α + β). The Beta prior acts as if we had already seen α imaginary heads and β imaginary tails.

More than one toss If the prior is Beta, the posterior is Beta too: with h observed heads and t observed tails, p(θ_H | D) = Beta(θ_H; α + h, β + t). The Beta is conjugate to the Bernoulli likelihood.
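A minimal sketch of the conjugate update (the prior pseudo-counts are illustrative): a Beta(α, β) prior combined with h heads and t tails gives a Beta(α + h, β + t) posterior.

```python
from scipy.stats import beta

alpha, beta_prior = 5.0, 5.0     # "the coin doesn't look very bent" prior (illustrative)
h, t = 3, 7                      # observed heads and tails from the running example

posterior = beta(alpha + h, beta_prior + t)   # Beta(8, 12)
print(posterior.mean())                       # 8 / 20 = 0.4, pulled toward 0.5 by the prior
```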

Prediction What is the probability the next coin flip is heads? We could plug in a single point estimate of θ_H, but θ_H is really unknown, so we should consider all possible values.
Speaker notes: Now that we know how to find a distribution over θ_H, how do we predict the probability that the next coin flip will be heads? One option is to pick the value of θ_H which maximizes the likelihood (as discussed previously); this is referred to as a "point estimate". But θ_H is really unknown, so if we are going to be fully Bayesian about this, the thing to do is to consider all possible values of θ_H in our estimate.

Prediction Integrate the posterior over θ_H to predict the probability of heads on the next toss.

Prediction P(next flip = heads | D) = ∫ θ_H p(θ_H | D) dθ_H, the posterior mean of θ_H.

Marginalizing out θ_H

Marginalizing out Parameters (cont) Beta integral: ∫₀¹ θ^(a−1) (1 − θ)^(b−1) dθ = Γ(a) Γ(b) / Γ(a + b). We can use this to answer the question: what is the probability the next flip comes up heads?

Prediction An immediate result: using the previous integral we can compute the probability of heads on the next toss, P(next flip = heads | D) = (α + h) / (α + β + h + t).
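A small sketch of the prediction rule above (the function name is hypothetical): the probability of heads on the next toss is the posterior mean (α + h) / (α + β + h + t).

```python
def predictive_prob_heads(alpha, beta, heads, tails):
    """Posterior predictive P(next flip = heads | data) under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)

# Running example: 3 heads, 7 tails
print(predictive_prob_heads(1, 1, 3, 7))      # 4/12 ~= 0.33, close to the MLE
print(predictive_prob_heads(100, 100, 3, 7))  # ~0.49, a strong prior keeps the estimate near 0.5
```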

Summary: Maximum Likelihood vs. Bayesian Estimation Maximum likelihood: find the single "best" θ_H. Bayesian approach: don't use a point estimate; keep track of our beliefs about θ_H and treat it like a random variable. In this class we will mostly focus on maximum likelihood.

Modeling Text Text is not a sequence of coin tosses; instead we have a sequence of words. But we can think of it as a sequence of die rolls from a very large die with one word on each side. The multinomial is the n-dimensional generalization of the Bernoulli, and the Dirichlet is the n-dimensional generalization of the Beta distribution.

Multinomial Rather than one parameter, we have a vector θ = (θ_1, ..., θ_K) with Σ_k θ_k = 1. Likelihood function: L(θ) ∝ Π_k θ_k^(N_k), where N_k is the number of times outcome k was observed.

Dirichlet Generalizes the Beta distribution from 2 to K dimensions: p(θ; α) ∝ Π_k θ_k^(α_k − 1). Conjugate to the multinomial.
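A brief sketch of the same idea for words (the toy counts and the symmetric Dirichlet prior are illustrative): the Dirichlet posterior mean adds pseudo-counts to the observed word counts, so unseen words keep nonzero probability.

```python
import numpy as np

counts = np.array([10, 2, 0, 5])              # toy word counts for a 4-word vocabulary
alpha = np.ones_like(counts, dtype=float)     # symmetric Dirichlet(1, ..., 1) prior

theta_mle = counts / counts.sum()                          # maximum likelihood estimate
theta_bayes = (counts + alpha) / (counts + alpha).sum()    # Dirichlet posterior mean

print(theta_mle)     # the unseen word gets probability 0
print(theta_bayes)   # every word gets nonzero probability
```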

Example: Text Classification Problem: spam email classification. We have a collection of emails (e.g. 10,000) labeled as spam or non-spam. Goal: given a new email, predict whether it is spam or not. How can we tell the difference? Look at the words in the emails: Viagra, ATTENTION, free.

Naïve Bayes Text Classifier We want P(Z | w_1, ..., w_n), the probability of a class given a document's words. By making independence assumptions we can better estimate these probabilities from data.

Naïve Bayes Text Classifier Simplest possible classifier. Assumption: the probability of each word is conditionally independent given the class. A simple application of Bayes' rule: P(Z | w_1, ..., w_n) ∝ P(Z) Π_i P(w_i | Z).

Bent Coin Bayesian Network θ_H is the probability of heads; each coin flip is conditionally independent of the others given θ_H.

Bent Coin Bayesian Network (Plate Notation)

Naïve Bayes Model For Text Classification The data is a set of "documents". The Z variables are categories: observed during learning, hidden at test time. Learning from training data: estimate the parameters (θ, β) using the fully observed data. Prediction on test data: compute P(Z | w_1, ..., w_n) using Bayes' rule.

Naïve Bayes Model For Text Classification Q: How to estimate θ? Q: How to estimate β?
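One way to answer both questions, as a hedged sketch (assuming θ denotes the class prior and β the per-class word distributions, as in the plate diagram; the toy data and names are made up): estimate θ from label counts, estimate β from add-one (Dirichlet) smoothed word counts, and classify with Bayes' rule.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """documents: list of token lists; labels: parallel list of class labels."""
    vocab = {w for doc in documents for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, z in zip(documents, labels):
        word_counts[z].update(doc)

    # theta: P(Z = z), estimated from label counts
    theta = {z: n / len(labels) for z, n in class_counts.items()}
    # beta: P(w | Z = z), estimated with add-one (Dirichlet) smoothing
    beta = {
        z: {w: (word_counts[z][w] + 1) / (sum(word_counts[z].values()) + len(vocab))
            for w in vocab}
        for z in class_counts
    }
    return theta, beta, vocab

def predict(doc, theta, beta, vocab):
    """Return the class maximizing log P(z) + sum_i log P(w_i | z)."""
    scores = {
        z: math.log(theta[z]) + sum(math.log(beta[z][w]) for w in doc if w in vocab)
        for z in theta
    }
    return max(scores, key=scores.get)

docs = [["free", "viagra", "offer"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
theta, beta, vocab = train_naive_bayes(docs, labels)
print(predict(["free", "offer"], theta, beta, vocab))   # "spam"
```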