Presentation on theme: "Parameter Estimation using likelihood functions Tutorial #1"— Presentation transcript:
1 Parameter Estimation using likelihood functions, Tutorial #1. This class has been cut and slightly edited from Nir Friedman's full course of 12 lectures, which is available at …. Changes made by Dan Geiger and Ydo Wexler.
2 Example: Binomial Experiment. When tossed, a thumbtack can land in one of two positions: Head or Tail. We denote by θ the (unknown) probability P(H). Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
3 Statistical Parameter Fitting. Consider instances x[1], x[2], …, x[M] such that:
- the set of values that x can take is known,
- each is sampled from the same distribution,
- each is sampled independently of the rest.
Such samples are i.i.d. (independent, identically distributed). The task is to find a vector of parameters Θ that generated the given data; this parameter vector can then be used to predict future data.
4 The Likelihood Function. How good is a particular θ? It depends on how likely it is to generate the observed data: L_D(θ) = P(D | θ) = ∏_m P(x[m] | θ). The likelihood for the sequence H, T, T, H, H is L(θ) = θ·(1−θ)·(1−θ)·θ·θ = θ³(1−θ)². [Figure: plot of L(θ) for θ between 0 and 1.]
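A minimal sketch (not part of the original slides) of evaluating this likelihood numerically; the grid of θ values and the helper name `likelihood` are my own choices, assuming i.i.d. tosses with P(H) = θ:

```python
# Evaluate L(theta) = P(data | theta) for the toss sequence H,T,T,H,H.

def likelihood(theta, tosses):
    """Product of per-toss probabilities under parameter theta."""
    p = 1.0
    for t in tosses:
        p *= theta if t == 'H' else (1.0 - theta)
    return p

data = ['H', 'T', 'T', 'H', 'H']
for theta in [0.2, 0.4, 0.6, 0.8]:
    print(f"L({theta:.1f}) = {likelihood(theta, data):.4f}")
# The curve peaks around theta = 0.6, and reordering the tosses
# (e.g. T,H,H,H,T) gives exactly the same values.
```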
5 Sufficient Statistics. To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails): L(θ) = θ^{N_H} (1−θ)^{N_T}. N_H and N_T are sufficient statistics for the binomial distribution. For example: H,T,T,H,H is just like T,H,H,H,T.
6 Sufficient Statistics. A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(D) is a sufficient statistic if for any two datasets D and D′: s(D) = s(D′) implies L_D(θ) = L_{D′}(θ). [Diagram: datasets mapped to statistics.]
7 Maximum Likelihood Estimation. MLE Principle: choose parameters that maximize the likelihood function. This is one of the most commonly used estimators in statistics and is intuitively appealing. One usually maximizes the log-likelihood function, defined as ℓ_D(θ) = log_e L_D(θ).
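For the binomial case the maximization can be done in closed form; the slides state the result on the next slide, and the standard derivation is spelled out here for completeness:

```latex
\ell_D(\theta) = N_H \log\theta + N_T \log(1-\theta),
\qquad
\frac{d\ell_D}{d\theta} = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta} = \frac{N_H}{N_H + N_T}.
```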
8 Example: MLE in Binomial Data. Applying the MLE principle we get θ̂ = N_H / (N_H + N_T), which coincides with what one would expect. Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6. [Figure: L(θ) peaking at θ = 0.6.]
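As a sanity check (my own sketch, not from the slides), one can confirm numerically that θ^{N_H}(1−θ)^{N_T} for (N_H, N_T) = (3, 2) is maximized at N_H/(N_H + N_T) = 0.6:

```python
# Grid search over theta to confirm the closed-form MLE N_H / (N_H + N_T).
N_H, N_T = 3, 2

def L(theta):
    return theta ** N_H * (1.0 - theta) ** N_T

grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=L)
print(theta_hat)                 # ~0.6 from the grid search
print(N_H / (N_H + N_T))         # 0.6, the closed-form MLE
```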
9 From Binomial to Multinomial. Suppose X can take the values 1, 2, …, K (for example, a die has 6 sides). We want to learn the parameters θ_1, θ_2, …, θ_K. Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed. Likelihood function: L_D(θ) = ∏_k θ_k^{N_k}. MLE: θ̂_k = N_k / ∑_ℓ N_ℓ. (A die, where 1…6 are the possible values, is the case K = 6.)
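A small sketch of the multinomial MLE, normalizing observed counts; the die counts below are made up for illustration:

```python
# MLE for a multinomial: theta_k = N_k / sum of all counts.
counts = {1: 8, 2: 11, 3: 9, 4: 10, 5: 7, 6: 15}   # hypothetical 60 rolls

total = sum(counts.values())
theta_mle = {k: n / total for k, n in counts.items()}
print(theta_mle)   # e.g. face 6 gets 15/60 = 0.25
```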
10 Example: Multinomial. Let x = x_1 x_2 … x_n be a protein sequence. We want to learn the parameters q_1, q_2, …, q_20 corresponding to the frequencies of the 20 amino acids. Sufficient statistics: N_1, N_2, …, N_20, the number of times each amino acid is observed in the sequence. Likelihood function: L(q) = ∏_k q_k^{N_k}. MLE: q̂_k = N_k / n.
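The same idea applied to a protein sequence; the toy sequence below is made up, and q̂_k is simply the relative frequency of each amino acid:

```python
# MLE of amino acid frequencies q_k from a single (toy) protein sequence.
from collections import Counter

sequence = "MKVLAAGLLALLASARA"   # hypothetical sequence, one-letter codes
counts = Counter(sequence)
n = len(sequence)
q_mle = {aa: c / n for aa, c in counts.items()}
print(q_mle)   # amino acids never observed get q_hat = 0 under plain MLE
```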
11 Is MLE all we need? Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss? Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet? Solution: the Bayesian approach, which incorporates your subjective prior knowledge. E.g., you may know a priori that some amino acids have the same frequencies and some have low frequencies. How would one use this information?
13 Example: Dishonest Casino. A casino uses two kinds of dice: 99% are fair, and 1% are loaded (a 6 comes up 50% of the time). We pick a die at random and roll it 3 times. We get 3 consecutive sixes. What is the probability that the die is loaded?
14 Dishonest Casino (cont.) The solution is based on Bayes' rule and the fact that while P(loaded | 3 sixes) is not known, the other three terms in Bayes' rule are known, namely: P(3 sixes | loaded) = (0.5)^3, P(loaded) = 0.01, and P(3 sixes) = P(3 sixes | loaded) P(loaded) + P(3 sixes | not loaded) (1 − P(loaded)).
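Plugging the numbers into Bayes' rule (a sketch; the slide supplies the terms above, and I assume a fair die shows a 6 with probability 1/6):

```python
# P(loaded | 3 sixes) via Bayes' rule.
p_loaded = 0.01
p_sixes_given_loaded = 0.5 ** 3        # loaded die: 6 with prob 0.5
p_sixes_given_fair = (1 / 6) ** 3      # fair die: 6 with prob 1/6

p_sixes = (p_sixes_given_loaded * p_loaded
           + p_sixes_given_fair * (1 - p_loaded))
p_loaded_given_sixes = p_sixes_given_loaded * p_loaded / p_sixes
print(round(p_loaded_given_sixes, 3))  # about 0.21, so the die is probably still fair
```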
16 Biological Example: Proteins. Extracellular proteins have a slightly different amino acid composition than intracellular proteins. From a large enough protein database (SWISS-PROT), we can get the following:
- p(a_i | int), the frequency of amino acid a_i in intracellular proteins
- p(a_i | ext), the frequency of amino acid a_i in extracellular proteins
- p(int), the probability that any new sequence is intracellular
- p(ext), the probability that any new sequence is extracellular
17 Biological Example: Proteins (cont.) What is the probability that a given new protein sequence x = x_1 x_2 … x_n is extracellular? Assuming that every sequence is either extracellular or intracellular (but not both), we can write p(int) = 1 − p(ext). Thus, p(x) = p(x | ext) p(ext) + p(x | int) p(int).
18 Biological Example: Proteins (cont.) Using conditional probability (and treating the positions of the sequence as independent) we get p(x | ext) = ∏_i p(x_i | ext) and p(x | int) = ∏_i p(x_i | int). By Bayes' theorem, p(ext | x) = p(x | ext) p(ext) / p(x). The probabilities p(int) and p(ext) are called the prior probabilities; the probability p(ext | x) is called the posterior probability.
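A hedged sketch of this posterior computation; the two frequency tables and the toy sequence below are made-up placeholders for what would actually come from SWISS-PROT:

```python
# Posterior p(ext | x) = p(x | ext) p(ext) / p(x), assuming independent positions.
import math

# Hypothetical per-amino-acid frequencies (real values would come from SWISS-PROT).
p_ext_aa = {'A': 0.10, 'C': 0.05, 'G': 0.08, 'K': 0.04}
p_int_aa = {'A': 0.07, 'C': 0.02, 'G': 0.07, 'K': 0.08}
p_ext, p_int = 0.4, 0.6

x = "ACGKA"   # toy sequence restricted to the four letters above

# Work in log space to avoid underflow on long sequences.
log_x_ext = sum(math.log(p_ext_aa[a]) for a in x)
log_x_int = sum(math.log(p_int_aa[a]) for a in x)

num = math.exp(log_x_ext) * p_ext
den = num + math.exp(log_x_int) * p_int
print(num / den)   # posterior probability that x is extracellular
```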
19 Bayesian Parameter Estimation (step by step).
1. View the unknown parameter θ as a random variable.
2. Use probability to quantify the uncertainty about the unknown parameter.
3. Updating the parameter follows from the rules of probability, via Bayes' rule: posterior of θ ∝ likelihood × prior of θ, i.e., P(θ | D) ∝ P(D | θ) P(θ).
4. The updated parameter estimate is set to the expected value of the posterior.
20 Example: Binomial Data Revisited. Prior: say, a uniform distribution for θ in [0, 1], P(θ) = 1 (say, due to seeing one head h = 1 and one tail t = 1 before seeing the data). Then for data (N_H, N_T) = (4, 1): the MLE for θ is 4/5 = 0.8, while Bayesian estimation gives E[θ | D] = (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.71 (the normalizing constant is a Dirichlet/Beta integral). [Figure: posterior density of θ over [0, 1].]
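A short sketch of that Bayesian estimate, with the uniform prior encoded as one imaginary head and one imaginary tail, compared against the MLE:

```python
# Posterior-mean estimate for the binomial with pseudocounts h = t = 1.
N_H, N_T = 4, 1
h, t = 1, 1   # imaginary counts encoding the uniform prior

theta_mle = N_H / (N_H + N_T)                   # 0.8
theta_bayes = (N_H + h) / (N_H + N_T + h + t)   # 5/7 ~ 0.714
print(theta_mle, theta_bayes)
```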
21 Bayesian Estimation vs. MLE: the multinomial case. The hyper-parameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience: the Bayesian estimate is θ̂_k = (N_k + α_k) / ∑_ℓ (N_ℓ + α_ℓ), compared with the MLE θ̂_k = N_k / ∑_ℓ N_ℓ. Imaginary sample size = α_1 + … + α_K. The larger the imaginary sample size, the more confident we are in our prior and the more it influences the outcome.
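The multinomial analogue as code (a sketch; the die counts and the pseudocounts α_k are arbitrary illustrative values):

```python
# Bayesian estimate theta_k = (N_k + alpha_k) / sum_l (N_l + alpha_l)
# versus the MLE N_k / sum_l N_l, for a die.
counts = {1: 8, 2: 11, 3: 9, 4: 10, 5: 7, 6: 15}
alpha = {k: 2 for k in counts}        # imaginary sample size = 12

n = sum(counts.values())
m = sum(alpha.values())
theta_mle = {k: counts[k] / n for k in counts}
theta_bayes = {k: (counts[k] + alpha[k]) / (n + m) for k in counts}
print(theta_mle[6], theta_bayes[6])   # the Bayesian estimate shrinks toward 1/6
```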
22 Protein Example Continued (estimation of θ_ext). Let's assume that from the biological literature we know that 40% of proteins are extracellular and 60% are intracellular, so θ_ext = p(ext) = 0.4. Recall that we had p(ext | x) = p(x | ext) p(ext) / p(x). Now, given a protein sequence x, we have all the information we need to calculate p(ext | x).
23 Protein Example (cont.) (updating θ_ext). To update θ_ext after seeing the sequence x, we need to decide what weight to give our prior knowledge. If our prior knowledge is worth seeing M sequences, then the update is θ_ext^new = (M·θ_ext + p(ext | x)) / (M + 1): the prior contributes M imaginary sequences, of which a fraction θ_ext are extracellular, and the new sequence contributes its posterior probability of being extracellular.
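A sketch of that update under the "prior is worth M sequences" reading; the value of M and the posterior for the new sequence are hypothetical numbers of my own:

```python
# Update theta_ext after one sequence, treating the prior as M imaginary sequences.
theta_ext = 0.4        # prior estimate, worth M sequences
M = 100                # weight given to the prior knowledge (hypothetical)
p_ext_given_x = 0.9    # hypothetical posterior for the new sequence x

theta_ext_new = (M * theta_ext + p_ext_given_x) / (M + 1)
print(round(theta_ext_new, 4))   # moves slightly toward the new evidence
```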