Identifying essential genes in M. tuberculosis by random transposon mutagenesis Karl W. Broman Department of Biostatistics Johns Hopkins University

Slides:



Advertisements
Similar presentations
Bayes rule, priors and maximum a posteriori
Advertisements

A Tutorial on Learning with Bayesian Networks
Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
Bayesian inference “Very much lies in the posterior distribution” Bayesian definition of sufficiency: A statistic T (x 1, …, x n ) is sufficient for 
METHODS FOR HAPLOTYPE RECONSTRUCTION
Bayesian Estimation in MARK
Gibbs Sampling Qianji Zheng Oct. 5th, 2010.
Ab initio gene prediction Genome 559, Winter 2011.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 10: The Bayesian way to fit models Geoffrey Hinton.
From: Probabilistic Methods for Bioinformatics - With an Introduction to Bayesian Networks By: Rich Neapolitan.
Bayesian inference Gil McVean, Department of Statistics Monday 17 th November 2008.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Confidence Intervals © Scott Evans, Ph.D..
Hidden Markov Model 11/28/07. Bayes Rule The posterior distribution Select k with the largest posterior distribution. Minimizes the average misclassification.
Hypothesis testing Some general concepts: Null hypothesisH 0 A statement we “wish” to refute Alternative hypotesisH 1 The whole or part of the complement.
Visual Recognition Tutorial
Course overview Tuesday lecture –Those not presenting turn in short review of a paper using the method being discussed Thursday computer lab –Turn in short.
. PGM: Tirgul 10 Parameter Learning and Priors. 2 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often.
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Bayesian Inference Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical.
4-1 Statistical Inference The field of statistical inference consists of those methods used to make decisions or draw conclusions about a population.
Using ranking and DCE data to value health states on the QALY scale using conventional and Bayesian methods Theresa Cain.
Copyright © Cengage Learning. All rights reserved. 6 Point Estimation.
Inferences About Process Quality
G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 7 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.
Maximum likelihood (ML)
Statistical Decision Theory
Model Inference and Averaging
.. . Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 6a Presentation taken from.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 8-1 Confidence Interval Estimation.
Mid-Term Review Final Review Statistical for Business (1)(2)
Estimating parameters in a statistical model Likelihood and Maximum likelihood estimation Bayesian point estimates Maximum a posteriori point.
Section 8.1 Estimating  When  is Known In this section, we develop techniques for estimating the population mean μ using sample data. We assume that.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
Maximum Likelihood - "Frequentist" inference x 1,x 2,....,x n ~ iid N( ,  2 ) Joint pdf for the whole random sample Maximum likelihood estimates.
통계적 추론 (Statistical Inference) 삼성생명과학연구소 통계지원팀 김선우 1.
Not in FPP Bayesian Statistics. The Frequentist paradigm Defines probability as a long-run frequency independent, identical trials Looks at parameters.
Consistency An estimator is a consistent estimator of θ, if , i.e., if
Ledolter & Hogg: Applied Statistics Section 6.2: Other Inferences in One-Factor Experiments (ANOVA, continued) 1.
Latent Class Regression Model Graphical Diagnostics Using an MCMC Estimation Procedure Elizabeth S. Garrett Scott L. Zeger Johns Hopkins University
1 CONTEXT DEPENDENT CLASSIFICATION  Remember: Bayes rule  Here: The class to which a feature vector belongs depends on:  Its own value  The values.
CLASS: B.Sc.II PAPER-I ELEMENTRY INFERENCE. TESTING OF HYPOTHESIS.
Ch15: Decision Theory & Bayesian Inference 15.1: INTRO: We are back to some theoretical statistics: 1.Decision Theory –Make decisions in the presence of.
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
Assessing Estimability of Latent Class Models Using a Bayesian Estimation Approach Elizabeth S. Garrett Scott L. Zeger Johns Hopkins University Departments.
Probabilistic models Jouni Tuomisto THL. Outline Deterministic models with probabilistic parameters Hierarchical Bayesian models Bayesian belief nets.
Point Estimation of Parameters and Sampling Distributions Outlines:  Sampling Distributions and the central limit theorem  Point estimation  Methods.
Statistical Estimation Vasileios Hatzivassiloglou University of Texas at Dallas.
G. Cowan Lectures on Statistical Data Analysis Lecture 8 page 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem 2Random variables and.
1 ES Chapter 18 & 20: Inferences Involving One Population Student’s t, df = 5 Student’s t, df = 15 Student’s t, df = 25.
1 Introduction to Statistics − Day 3 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Brief catalogue of probability densities.
The Uniform Prior and the Laplace Correction Supplemental Material not on exam.
Statistics Sampling Distributions and Point Estimation of Parameters Contents, figures, and exercises come from the textbook: Applied Statistics and Probability.
Introduction to Sampling Methods Qi Zhao Oct.27,2004.
Bayesian statistics named after the Reverend Mr Bayes based on the concept that you can estimate the statistical properties of a system after measuting.
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
Statistical Methods. 2 Concepts and Notations Sample unit – the basic landscape unit at which we wish to establish the presence/absence of the species.
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Conditional Expectation
SIR method continued. SIR: sample-importance resampling Find maximum likelihood (best likelihood × prior), Y Randomly sample pairs of r and N 1973 For.
Independent Samples: Comparing Means Lecture 39 Section 11.4 Fri, Apr 1, 2005.
MCMC Output & Metropolis-Hastings Algorithm Part I
ICS 280 Learning in Graphical Models
Model Inference and Averaging
Ch3: Model Building through Regression
Bayesian inference Presented by Amir Hadadi
Econometric Models The most basic econometric model consists of a relationship between two variables which is disturbed by a random error. We need to use.
Multidimensional Integration Part I
CONTEXT DEPENDENT CLASSIFICATION
LECTURE 07: BAYESIAN ESTIMATION
Presentation transcript:

Identifying essential genes in M. tuberculosis by random transposon mutagenesis Karl W. Broman Department of Biostatistics Johns Hopkins University

2 Mycobacterium tuberculosis The organism that causes tuberculosis –Cost for treatment: ~ $15,000 –Other bacterial pneumonias: ~ $ Mbp circular genome, completely sequenced known or inferred genes

3 Aim Identify the essential genes –Knock-out  non-viable mutant Random transposon mutagenesis –Rather than knock out each gene systematically, we knock out them out at random.

4 The Himar1 transposon 5’ - TCGAAGCCTGCGACTAACGTTTAAAGTTTG - 3' 3’ - AGCTTCGGACGCTGATTGCAAATTTCAAAC - 5' Note: 30 or more stop codons in each reading frame

5 Sequence of the gene MT598

6 Random transposon mutagenesis

7 Locations of tranposon insertion determined by sequencing across junctions. Viable insertion within a gene  gene is not essential Essential genes: we will never see a viable insertion Complication: Insertions in the very distal portion of an essential gene may not be sufficiently disruptive. Thus, we omit from consideration insertion sites within the last 20% and last 100 bp of a gene.

8 The data Number, locations of genes Number of insertion sites in each gene Viable mutants with exactly one transposon Location of the transposon insertion in each mutant

9 TA sites in M. tuberculosis 74,403 sites 65,659 sites within a gene 57,934 sites within proximal portion of a gene 4204/4250 genes with at least one TA site

insertion mutants 1025 within proximal portion of a gene 21 double hits 770 unique genes hit Questions: Proportion of essential genes in Mtb? Which genes are likely essential?

11 Statistics, Part 1 Find a probability model for the process giving rise to the data. Parameters in the model correspond to characteristics of the underlying process that we wish to determine

12 The model Transposon inserts completely at random (each TA site equally likely to be hit) Genes are either completely essential or completely non-essential. LetN = no. genest i = no. TA sites in gene i n = no. mutantsm i = no. mutants of gene i

13 A picture of the model

14 Part of the data GeneNo. TA sitesNo. mutants ::: ::: Total57,9341,025

15 A related problem How many species of insects are there in the Amazon? –Get a random sample of insects. –Classify according to species. –How many total species exist? The current problem is a lot easier: –Bound on the total number of classes. –Know the relative proportions (up to a set of 0/1 factors).

16 Statistics, Part 2 Find an estimate of  = (  1,  2, …,  N ). We’re particularly interested in Frequentist approach –View parameters {  i } as fixed, unknown values –Find some estimate that has good properties –Think about repeated realizations of the experiment. Bayesian approach –View the parameters as random. –Specify their joint prior distribution. –Do a probability calculation.

17 The likelihood Note: Depends on which m i > 0, but not directly on the particular values of m i.

18 Frequentist method Maximum likelihood estimates (MLEs): Estimate the  i by the values for which L(  | m) achieves its maximum. In this case, the MLEs are Further, = No. genes with at least one hit. This is a really stupid estimate!

19 Bayes: The prior  + ~ uniform on {0, 1, …, N}  |  + ~ uniform on sequences of 0s and 1s with  + 0s Note: –We are assuming that Pr(  i = 1) = 1/2. –This is quite different from taking the  i to be like coin tosses. –We are assuming that  i is independent of t i and the length of the gene. –We could make use of information about the essential or non-essential status of particular genes (e.g., known viable knock-outs).

20 Uniform vs. Binomial

21 Markov chain Monte Carlo Goal: Estimate Pr(  | m). Begin with some initial assignment,  (0), ensuring that  i (0) = 1 whenever m i > 0. For iteration s, consider each gene one at a time and –Calculate Pr(  i = 1 |  -i (s), m) –Assign  i (s) = 1 at random with this probability Repeat many times

22 MCMC in action

23 A further complication Many genes overlap Of 4250 genes, 1005 pairs overlap (mostly by exactly 4 bp). The overlapping regions contain 547 insertion sites. Omit TA sites in overlapping regions unless in the proximal portion of both genes. The algebra gets a bit more complicated.

24 Percent essential genes

25 Percent essential genes

26 Probability a gene is essential

27 Yet another complication Operon: A group of adjacent genes that are transcribed together as a single unit. Insertion at a TA site could disrupt all downstream genes. If a gene is essential, insertion in any upstream gene would be non-viable. Re-define the meaning of “essential gene”. If operons were known, one could get an improved estimate of the proportion of essential genes. If one ignores the presence of operons, estimates are still unbiased.

28 Summary Bayesian method, using MCMC, to estimate the proportion of essential genes in a genome with data from random transposon mutagenesis. Critical assumptions: –Randomness of transposon insertion –Essentiality is an all-or-none quality –No relationship between essentiality and no. insertion sites. For M. tuberculosis, with data on 1400 mutants: – % of genes are essential –20 genes that have > 64 TA sites and for which no mutant has been observed have > 75% chance of being essential.

29 Acknowledgements Natalie Blades (now at The Jackson Lab) Gyanu Lamichhane, Hopkins William Bishai, Hopkins