Course Introduction What these courses are about What I expect What you can expect

What these courses are about: an overview of ways in which computers are used to solve problems in biology; supervised learning of illustrative or frequently-used algorithms and programs; supervised learning of programming techniques and algorithms selected from these uses.

I Expect: students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry); basic familiarity with statistics; and some programming experience, with a willingness to work to improve.

You can expect: homework assignments (80% of grade); a final (20% of grade); grades totally determined by a points system.

Textbook Required textbook: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. Recommended additional textbook: Introduction to Computational Biology by Waterman

Chapter 1 Introduction

Purpose A great acceleration in the accumulation of biological knowledge began in our era. Part of the challenge is to organize, classify, and parse the immense richness of sequence data. This is not just a task of string parsing, for behind each string of bases or amino acids lies the whole complexity of molecular biology.

Information Flow A major task in computational molecular biology is to "decipher" the information contained in biological sequences. Since the nucleotide sequence of a genome contains all the information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers.

Review of basic biochemistry Central Dogma: DNA makes RNA makes protein Sequence determines structure determines function

Structure DNA composed of four nucleotides or "bases": A,C,G,T RNA composed of four also: A,C,G,U (T transcribed as U) proteins are composed of amino acids

Purpose This class is about methods which are in principle capable of capturing some of the complexity of biology, by integrating diverse sources of biological information into clean, general, and tractable probabilistic models for sequence analysis.

However, the most reliable way to determine a biological molecule's structure or function is by direct experimentation. It is far easier to obtain the DNA sequence of the gene corresponding to an RNA or protein than it is to determine its function or structure experimentally.

The Human Genome Project gives us the raw sequences of an estimated 20,000-25,000 human genes, only a small fraction of which have been studied experimentally. The development of computational methods has therefore become increasingly important, drawing on computer science, statistics, and other fields.

Basic Information New sequences are adapted from pre-existing sequences. We compare a new sequence against an old sequence of known structure or function. Two related sequences are called homologous, and we transfer information between them by homology. This is somewhat like determining the similarity between two text strings; in fact, we will be trying to find a plausible alignment between the sequences.

Definition A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids –DNA composed of four nucleotides or "bases": A,C,G,T –RNA composed of four also: A,C,G,U (T transcribed as U) –proteins are composed of amino acids (20)

Character representation of sequences DNA or RNA –use 1-letter codes (e.g., A,C,G,T) protein –use 1-letter codes can convert to/from 3-letter codes (e.g., A = Ala = Alanine C = Cys = Cysteine)

Alignment Find the best alignment between two strings under some scoring system, e.g., "+1" for a match and "-1" for a mismatch. Most importantly, we want a scoring system that gives the biologically most likely alignment the highest score. Note that biological molecules have evolutionary histories, 3D folded structures, and other features; this is more the realm of statistics than of computer science. A probabilistic modeling approach can be used and extended here.
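
A minimal Python sketch of this match/mismatch scoring; the score_alignment name and the example sequences are made up for illustration, and a real scoring scheme would also handle gaps.

```python
def score_alignment(seq1, seq2, match=1, mismatch=-1):
    """Score a gapless alignment of two equal-length strings:
    +1 for each matching position, -1 for each mismatch."""
    assert len(seq1) == len(seq2), "gapless alignment requires equal lengths"
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))

# 7 matches and 1 mismatch -> score 6
print(score_alignment("ACGTACGT", "ACGTACGA"))
```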

Probabilities & Probabilistic Models A model is a system that simulates the object under consideration. A probabilistic model is one that produces different outcomes with different probabilities; that is, it simulates a whole class of objects and assigns each object an associated probability. Here the objects will be sequences, and a model might describe a family of related sequences.

Example: Rolling a six-sided die A probability model of rolling a six-sided die involves six parameters p1, p2, p3, p4, p5, and p6. The probability of rolling i is pi, with pi ≥ 0 and Σ pi = 1. Rolling the die three times independently, P([1,6,3]) = p1 p6 p3.

Example: Biological Sequence Biological sequences are strings over a finite alphabet of residues (4 nucleotides or 20 amino acids). Assume a residue a occurs at random with probability qa, independently of all other residues in the sequence. If the sequence is denoted x1 … xn, the probability of the whole sequence is qx1 qx2 … qxn.
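
A small sketch of this independent-residue model; the nucleotide frequencies below are made-up values, not estimates from real data.

```python
# Hypothetical residue frequencies q_a for the four nucleotides (sum to 1).
q = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def sequence_probability(seq, q):
    """P(x1...xn) = q_x1 * q_x2 * ... * q_xn under the independence model."""
    p = 1.0
    for residue in seq:
        p *= q[residue]
    return p

print(sequence_probability("ACGT", q))  # 0.3 * 0.2 * 0.2 * 0.3 = 0.0036
```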

Maximum Likelihood Estimation The parameters of a probability model are estimated from a training set (sample). The probability qa for amino acid a can be estimated as the observed frequency of a in a database of known protein sequences (SWISS-PROT), assuming the training sequences are not systematically biased towards a peculiar residue composition.
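
A minimal sketch of estimating these frequencies by counting; the toy training set stands in for a real database such as SWISS-PROT.

```python
from collections import Counter

def mle_frequencies(sequences):
    """Maximum likelihood estimate of residue probabilities:
    q_a = (count of residue a) / (total number of residues)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Toy training set (made up) in place of a real sequence database
training = ["ACGTT", "ACGAA", "TTGCA"]
print(mle_frequencies(training))
```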

MLE (continued) This way of estimating models is called maximum likelihood estimation (MLE) The MLE maximizes the total probability of all sequences given the model (the likelihood) Given a model with parameters θ and a set of data D, the maximum likelihood estimate for θ is that value which maximizes P(D|θ)

Estimation If estimating parameters from a limited amount of data, there is a danger of overfitting. Overfitting: the model becomes very well adapted to the training data, but it will not generalize well to new data. For example, observing three flips of a coin, [tail, tail, tail], would lead to the maximum likelihood estimate that the probability of heads is 0 and that of tails is 1.

Conditional, Joint, and Marginal We have two dice, D1 and D2. The conditional probability of rolling i given die D1 is written P(i|D1). We pick a die with probability P(Dj), j = 1, 2. The probability of picking die Dj and rolling an i is the product of the two probabilities, P(i, Dj) = P(Dj) P(i|Dj), the joint probability. In general, P(X, Y) = P(X|Y) P(Y), and P(X) = ΣY P(X, Y) = ΣY P(X|Y) P(Y) is the marginal probability.
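
A small numeric illustration of these definitions; the die-picking probabilities and the loaded die's distribution are made-up values for illustration.

```python
# Two dice: D1 is fair, D2 is loaded so that a six comes up half the time.
P_die = {"D1": 0.5, "D2": 0.5}                       # P(D_j): which die we pick
P_roll = {
    "D1": {i: 1 / 6 for i in range(1, 7)},           # P(i | D1)
    "D2": {**{i: 0.1 for i in range(1, 6)}, 6: 0.5},  # P(i | D2)
}

def joint(i, die):
    """Joint probability P(i, D_j) = P(D_j) * P(i | D_j)."""
    return P_die[die] * P_roll[die][i]

def marginal(i):
    """Marginal probability P(i) = sum over dice of P(i, D_j)."""
    return sum(joint(i, die) for die in P_die)

print(joint(6, "D2"))   # 0.5 * 0.5 = 0.25
print(marginal(6))      # 0.5 * (1/6) + 0.5 * 0.5 ≈ 0.333
```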

Bayes Theorem Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X). The denominator is the marginal probability P(X); the numerator is the joint probability P(X, Y) = P(X|Y) P(Y).

Example 1 Consider an occasionally dishonest casino that uses two kinds of dice. Of the dice, 99% are fair but 1% are loaded so that a six comes up 50% of the time. Suppose we pick a die at random and roll it three times, getting three consecutive sixes. What is P(loaded | 3 sixes)?

Example 1 (Continued) By Bayes' theorem, P(loaded | 3 sixes) = P(3 sixes | loaded) P(loaded) / [P(3 sixes | loaded) P(loaded) + P(3 sixes | fair) P(fair)] = (0.5³ × 0.01) / (0.5³ × 0.01 + (1/6)³ × 0.99) ≈ 0.21.

It is still more likely that we picked a fair die, despite having seen three successive sixes.
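
The Bayes'-rule arithmetic behind this conclusion, written out as a short sketch using the numbers from the example.

```python
# Occasionally dishonest casino: 99% of dice are fair, 1% are loaded so that
# a six comes up 50% of the time.  We observe three sixes in three rolls.
p_loaded, p_fair = 0.01, 0.99
lik_loaded = 0.5 ** 3          # P(3 sixes | loaded)
lik_fair = (1 / 6) ** 3        # P(3 sixes | fair)

posterior_loaded = (lik_loaded * p_loaded) / (lik_loaded * p_loaded + lik_fair * p_fair)
print(round(posterior_loaded, 3))  # ≈ 0.214: the fair die is still more likely
```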

Example 2 Assume that, on average, extracellular proteins have a slightly different amino acid composition than intracellular proteins. For example, cysteine is more common in extracellular than in intracellular proteins. Question: is a new protein sequence x = x1 … xn intracellular or extracellular?

Example 2 (continued) We first split our training examples from SWISS-PROT into extracellular and intracellular proteins. We then estimate a set of residue frequencies qa^int for intracellular proteins and a corresponding set of extracellular frequencies qa^ext. The prior probability that any new sequence is extracellular is p_ext, and the corresponding probability that it is intracellular is p_int; note that p_int = 1 - p_ext.

Example 2 (continued) By Bayes' theorem, the posterior probability that the new sequence x is extracellular is P(ext | x) = p_ext Π qxi^ext / (p_ext Π qxi^ext + p_int Π qxi^int), where the products run over the residues x1 … xn.
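
A minimal sketch of this classifier; the residue frequencies, the prior p_ext, and the toy four-letter amino acid alphabet are all made up for illustration (a real model would cover all 20 amino acids, with frequencies estimated from training data as above).

```python
from math import prod  # Python 3.8+

# Toy alphabet (C = cysteine, A = alanine, L = leucine, K = lysine).
# Made-up frequencies, with cysteine more common in the extracellular model.
q_ext = {"C": 0.40, "A": 0.20, "L": 0.20, "K": 0.20}
q_int = {"C": 0.10, "A": 0.30, "L": 0.30, "K": 0.30}
p_ext = 0.4                 # made-up prior P(extracellular)
p_int = 1 - p_ext

def posterior_extracellular(x):
    """P(ext | x) by Bayes' theorem under the independent-residue models."""
    lik_ext = prod(q_ext[a] for a in x)
    lik_int = prod(q_int[a] for a in x)
    return p_ext * lik_ext / (p_ext * lik_ext + p_int * lik_int)

print(round(posterior_extracellular("CCALK"), 3))  # ≈ 0.76: cysteine-rich looks extracellular
```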

Bayesian Model θ is the parameter of interest. Before collecting data, the information regarding θ is called the prior, P(θ). After collecting the data, the information regarding θ is called the posterior, P(θ|D). If we do not have enough data to reliably estimate the parameters, we can use prior knowledge to constrain the estimates.

Bayesian and Frequentist D ~ N(θ, 1). To frequentists, θ is fixed (but unknown); to Bayesians, θ is random. If θ is random, what should its distribution be? Frequentists argue that the determination of the prior distribution of θ is very subjective.

Prior and Posterior Suppose that θ has a probability distribution P(θ) (the prior), and that, given θ, the data D are generated according to P(D|θ). P(D, θ) is the joint distribution of D and θ. P(D|θ) is the conditional distribution of D given θ. P(θ|D) is the conditional distribution of θ given D (the posterior).

Prior and Posterior P(D|θ) P(θ) = P(D, θ) = P(θ|D) P(D). Bayes' theorem: P(θ|D) = P(D|θ) P(θ) / P(D).

Posterior Distribution Given the density p(D|θ) for D and a prior density p(θ), the posterior density for θ is given as p(θ|D) = c p(θ) p(D|θ), where c⁻¹ = ∫ p(θ) p(D|θ) dθ (the marginal of D).
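
A quick numerical illustration of this normalization, approximating the integral on a grid; the coin-flip data and the flat prior below are made-up choices for the sketch.

```python
import numpy as np

# theta: a coin's heads probability; D: 7 heads in 10 flips; flat prior on [0, 1].
theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)                        # p(theta), uninformative
likelihood = theta**7 * (1 - theta)**3             # p(D | theta), binomial kernel
unnormalized = prior * likelihood
c_inv = unnormalized.sum() * (theta[1] - theta[0]) # ≈ ∫ p(theta) p(D|theta) dtheta
posterior = unnormalized / c_inv                   # density that integrates to ~1

print(theta[np.argmax(posterior)])                 # mode ≈ 0.7, the MLE under a flat prior
```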

Example D ~ N(θ, σ²), with σ² known. Prior: P(θ) = N(μ0, σ0²). Then the posterior density for θ is normal, with mean (σ0² D + σ² μ0) / (σ0² + σ²) and variance σ² σ0² / (σ² + σ0²).
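
A quick numerical check of these update formulas for a single observation; the observation and prior values below are arbitrary.

```python
def normal_posterior(D, sigma2, mu0, sigma0_2):
    """Posterior N(mu1, sigma1^2) for theta given one observation
    D ~ N(theta, sigma^2) and prior theta ~ N(mu0, sigma0^2)."""
    sigma1_2 = (sigma2 * sigma0_2) / (sigma2 + sigma0_2)
    mu1 = (sigma0_2 * D + sigma2 * mu0) / (sigma2 + sigma0_2)
    return mu1, sigma1_2

# Arbitrary numbers: observation 3.0, known variance 1.0, prior N(0.0, 4.0)
print(normal_posterior(3.0, 1.0, 0.0, 4.0))  # (2.4, 0.8)
```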

Conjugate Prior D ~ N(θ, σ²) is a normal distribution. The prior distribution, P(θ) = N(μ0, σ0²), is also a normal distribution. The posterior distribution, P(θ|D), is also a normal distribution. The normal distribution is conjugate to the normal.

Specification of the Prior Conjugate priors: –The beta distribution is conjugate to the binomial –The normal distribution is conjugate to the normal –The gamma distribution is conjugate to the Poisson
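
A minimal sketch of the beta-binomial case from this list: with a Beta(alpha, beta) prior on a coin's heads probability and k heads in n flips, the posterior is Beta(alpha + k, beta + n - k). The numbers below are arbitrary.

```python
# Prior pseudo-counts and observed data (made-up values)
alpha, beta = 2.0, 2.0        # Beta(alpha, beta) prior
k, n = 7, 10                  # 7 heads in 10 flips

# Conjugate update: posterior is again a beta distribution
alpha_post, beta_post = alpha + k, beta + (n - k)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, round(posterior_mean, 3))  # 9.0 5.0 0.643
```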

Specification of the Prior Noninformative (uninformative) priors: P(θ) ∝ constant. These are used when we do not have a strong prior belief, or in public-policy situations where prior beliefs strongly differ.

Specification of the Prior Sometimes, we will wish to use an informative P(θ). We know a priori that the amino acids phenylalanine (Phe, F), tyrosine (Tyr, Y), and tryptophan (Trp, W) are structurally similar and often evolutionarily interchangeable. We would want to use a P(θ) that tends to favor parameter sets that assign them very similar probabilities.

Parameter Estimation Choose the parameter value θ that maximizes P(θ|D). This is called maximum a posteriori (MAP) estimation. MAP estimation maximizes the likelihood times the prior. If the prior is flat (uninformative), the MAP estimate equals the MLE. Another approach to parameter estimation is to choose the mean of the posterior.

Maximum A Posteriori (MAP) Estimation Example: estimating probabilities for a die. We roll 1, 3, 4, 2, 4, 6, 2, 1, 2, 2. MLE: p5 = 0, an example of overfitting. Adding 1 to each observed count (a pseudocount) gives the MAP estimate p5 = (0 + 1) / (10 + 6) = 1/16.
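
The same calculation as a short sketch; it reproduces the slide's p5 values using the pseudocount of 1 per face stated above.

```python
from collections import Counter

rolls = [1, 3, 4, 2, 4, 6, 2, 1, 2, 2]
counts = Counter(rolls)

# MLE: observed frequency -> p5 = 0/10 = 0 (overfits the small sample)
mle = {face: counts[face] / len(rolls) for face in range(1, 7)}

# Pseudocount of 1 per face: p_i = (n_i + 1) / (N + 6) -> p5 = 1/16
pseudo = 1
map_est = {face: (counts[face] + pseudo) / (len(rolls) + 6 * pseudo) for face in range(1, 7)}

print(mle[5], map_est[5])   # 0.0 0.0625
```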

When estimating large parameter sets from small amounts of data, we believe that Bayesian methods provide a consistent formalism for bringing in additional information from previous experience with the same type of data.