Speaker: Bin-Shenq Ho Dec. 19, 2011

Inadequacies of Minimum Spanning Trees in Molecular Epidemiology. Stephen J. Salipante and Barry G. Hall. Journal of Clinical Microbiology, Oct. 2011, p. 3568–3575. Speaker: Bin-Shenq Ho, Dec. 19, 2011.

Sep. 22, 2010

Underlying Reasoning. How representative is a single, arbitrarily selected MST when there may be many equally optimal solutions? What role can statistical metrics play in assessing the credibility of MST estimations? Equally parsimonious paths exist if two or more edges have the same length, so the result is better considered a minimum spanning network (MSN). To infer population structure, statistical methods are employed to gauge the credibility of inferences. Common techniques for assessing the statistical robustness of population structure estimations include bootstrapping and the use of Bayesian posterior probabilities. Aims: to implement a bootstrapping metric to evaluate the credibility of alternative MST solutions, and to implement a systematic approach to MST estimation through that bootstrapping metric.

Materials and Methods: MST gold (http://www.bellinghamresearchinstitute.com, http://web.me.com/barryghall/). Search settings: maximum amount of time, maximum number of unique MSTs, minimum rate of new discovery.

Materials and Methods: Distance matrix calculation. Equidistant method: sequence, spoligotype, SNP. Difference method: VNTR.

Spoligotype (spacer oligonucleotide type). The genome size of the M. tuberculosis H37Rv strain is around 4 million base pairs with 3959 genes. Spacer oligonucleotide typing is a hybridization assay that detects variability in the direct repeat (DR) region in the DNA of M. tuberculosis. The DR region consists of multiple copies of a conserved 36-base-pair sequence (the direct repeats) separated by multiple unique spacer sequences (the standard spoligotyping assay uses 43). Different M. tuberculosis strains have various complements of the 43 spacers, and these different complements form the basis of the assay (Kamerbeek 1997). The standard spoligotyping assay is performed by using a membrane. In this format, each of the 43 spacers produces either a dark band (indicating the presence of the spacer) or no band (indicating the spacer’s absence). As Figure 3.1 shows, for each M. tuberculosis isolate, the spoligotyping assay produces a series of bands, much like a bar code. (http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)

VNTR (variable number of tandem repeats). Variable number of tandem repeat (VNTR) typing is based on analysis of DNA segments containing “tandem repeated” sequences in which the number of copies of the repeated sequence varies among strains. The method relies on PCR amplification and calculation of the number of repeats on the basis of the size of the amplified product. MIRUs are a class of tandem repeated sequences. There are a total of 41 MIRU loci, of which 12 have been selected for genotyping. The names of the 12 loci that will be analyzed are 02, 04, 10, 16, 20, 23, 24, 26, 27, 31, 39, and 40 (Mazars 2001). MIRU results are reported as 12-character designations, each character corresponding to the number of repeats at one of the 12 MIRU loci, listed in a standard order (Table 3.1). In rare instances, the number of repeats is greater than 9. To avoid the use of double digits, the following designations are used in reporting results: 10 repeats = “a”; 11 repeats = “b”; 12 repeats = “c”; etc. Occasionally, the repeat number is 0. If the region is deleted and no amplification product is obtained, this is indicated by a dash (-). A few strains give an anomalous result for MIRU locus number 04 (i.e., the second digit in the MIRU type). These anomalous results at 04 are designated “x,” “y,” or “z,” depending on the number of repeats. (http://www.cdc.gov/tb/programs/genotyping/Chap3/3_CDCLab_2Description.htm)
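The 12-character MIRU designation described above can be decoded into per-locus repeat counts. The following is only a sketch: the function name `decode_miru` and the example string are made up for illustration, and the anomalous codes “x”, “y”, and “z” are returned as unknown because the text does not say which repeat numbers they stand for.

```python
from typing import List, Optional


def decode_miru(designation: str) -> List[Optional[int]]:
    """Turn a 12-character MIRU designation into repeat counts per locus.

    Digits map to themselves, letters from "a" onward map to 10, 11, 12, ...
    A dash (no amplification product) and the anomalous codes x/y/z are
    returned as None, since no repeat count can be read from them here.
    """
    if len(designation) != 12:
        raise ValueError("a MIRU designation has exactly 12 characters")
    counts: List[Optional[int]] = []
    for ch in designation:
        if ch in "0123456789":
            counts.append(int(ch))
        elif ch == "-" or ch in "xyz":
            counts.append(None)
        elif "a" <= ch <= "w":
            counts.append(10 + ord(ch) - ord("a"))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return counts


# Hypothetical designation, not taken from any real isolate.
example = decode_miru("2x4325a5332-")
```

Keeping unknown values as `None` rather than guessing a number means a later distance calculation has to decide explicitly how to treat missing loci.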

MLST (multilocus sequence type). The procedure characterizes isolates of bacterial species using the DNA sequences of internal fragments of multiple housekeeping genes. For each housekeeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the loci define the allelic profile or sequence type (ST). Nucleotide differences between strains can be checked at a variable number of genes depending on the degree of discrimination desired. Approximately 450-500 bp internal fragments of each gene are used, as these can be accurately sequenced on both strands using an automated DNA sequencer. (http://en.wikipedia.org/wiki/Multilocus_sequence_typing)

Materials and Methods: MST estimation and MSN creation. MSTs are estimated with Kruskal’s algorithm, with the node input order randomized. The combination of all edges defined within the unique MSTs constitutes the MSN.
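A minimal sketch of this procedure, assuming a symmetric distance matrix given as a list of lists. The names `kruskal` and `sample_msts` are illustrative and this is not the code of the software used in the paper. Edges are listed in the order implied by a shuffled node order and then stably sorted by length, so ties between equal-length edges are broken differently on each run.

```python
import random
from itertools import combinations


def kruskal(dist, order):
    """Return one MST as a frozenset of edges (each edge a frozenset of two nodes)."""
    n = len(dist)
    # List edges following the given node order; the stable sort keeps
    # that order among edges of equal length.
    edges = [(order[i], order[j]) for i, j in combinations(range(n), 2)]
    edges.sort(key=lambda e: dist[e[0]][e[1]])

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components, so it cannot form a cycle
            parent[ru] = rv
            tree.add(frozenset((u, v)))
    return frozenset(tree)


def sample_msts(dist, n_runs=200, seed=0):
    """Collect unique MSTs over randomized node orders and their union (the MSN)."""
    rng = random.Random(seed)
    found = set()
    for _ in range(n_runs):
        order = list(range(len(dist)))
        rng.shuffle(order)
        found.add(kruskal(dist, order))
    msn = set().union(*found)
    return found, msn


# Three individuals that are all equally distant from one another.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
trees, network = sample_msts(triangle)
```

With all three distances tied, each run keeps two of the three edges, and the union over runs recovers the full triangle as the MSN.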

Materials and Methods: Estimating the number of possible MSTs through mark-recapture (Schnabel method). N = [(M+1)(C+1)] ÷ (R+1) - 1, which is the same as N+1 = [(M+1)(C+1)] ÷ (R+1), or (M+1) ÷ (N+1) = (R+1) ÷ (C+1). M: mark; C: current; R: recapture.
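As a worked example of the formula, the sketch below evaluates N for given M, C, and R. The reading of the symbols is an assumption, since the slide only labels them "Mark", "Current", and "Recapture": M is taken as the number of unique MSTs already seen (marked), C as the number of MSTs found in the current round, and R as the number of those that had been seen before. The function name `estimate_total_msts` is illustrative.

```python
def estimate_total_msts(marked: int, current: int, recaptured: int) -> float:
    """Evaluate N = (M + 1)(C + 1) / (R + 1) - 1 from the slide."""
    if recaptured > min(marked, current):
        raise ValueError("recaptures cannot exceed the marked or current counts")
    return (marked + 1) * (current + 1) / (recaptured + 1) - 1


# 9 MSTs marked, 9 found in the current round, 4 of them recaptured:
# N = (10 * 10) / 5 - 1 = 19
n_hat = estimate_total_msts(9, 9, 4)
```

The "+1" terms keep the estimate finite even when no MST is recaptured (R = 0).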

Materials and Methods: Bootstrapping. Purpose: to establish the confidence level of a model. For each MST, 100 individual pseudoreplicates are generated. The bootstrap value is expressed as the fraction of pseudoreplicates yielding the same inference as the original data. Given enough information, there should be sufficiently redundant data that independent pseudoreplicates will yield analyses identical to that of the complete data set. Reasons for choosing bootstrapping: computational simplicity; widespread use in phylogenetics; anecdotal prior use with MSTs.

Bootstrap. References: Efron and Gong (1983); Diaconis and Efron (1983); Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791. The basic idea of the bootstrap involves inferring the variability in an unknown distribution from which your data were drawn by resampling from the data. The resampling process is done many times, each time producing a fictional sample of n points by sampling with replacement from the original n data points. The essential idea of the bootstrap is that this set of estimates has a distribution that approximates the distribution of the actual estimate t. A justifiable procedure is to bootstrap across the sites, that is, to sample sites from the data table with replacement. A more serious difficulty is lack of independence of the evolutionary processes in different sites.
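Bootstrapping across sites can be sketched as follows, assuming the data table is a list of rows (one per individual) with one column per site. The names `bootstrap_support` and `nearest_to_first` are illustrative. To keep the example short, the "inference" here is simply which individual is closest to the first one; in an MST analysis it would be the presence of an edge or a whole tree.

```python
import random


def nearest_to_first(table):
    """Toy inference: index of the row with the fewest differences from row 0."""
    first = table[0]
    dists = [sum(x != y for x, y in zip(first, row)) for row in table[1:]]
    return 1 + dists.index(min(dists))


def bootstrap_support(table, inference, n_reps=100, seed=0):
    """Fraction of pseudoreplicates that give the same inference as the full data."""
    rng = random.Random(seed)
    n_sites = len(table[0])
    original = inference(table)
    hits = 0
    for _ in range(n_reps):
        # Sample site (column) indices with replacement, same count as the original.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        pseudo = [[row[c] for c in cols] for row in table]
        if inference(pseudo) == original:
            hits += 1
    return hits / n_reps


# Made-up table: rows 0 and 1 are identical, row 2 differs at every site.
data = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
support = bootstrap_support(data, nearest_to_first)
```

Because every resampled column still separates row 2 from rows 0 and 1, every pseudoreplicate agrees with the original inference in this toy case.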

Results: Estimating alternative MSTs. Multiple, equally parsimonious solutions are possible. Kruskal’s MST algorithm is sensitive to node input order. The Schnabel method is appropriate for estimating the number of alternative MSTs, especially after discarding the early cycles of node order randomization.

The number of possible MSTs is proportional only to the number of minimal pairwise distances with equal lengths. There is a relationship between the number of possible MSTs and the method used to compute the pairwise distance matrix.

Note on distance matrix computation. Equidistant method: sites are scored merely as “same” or “different”, such that any difference carries the same weight. Difference method: distances between sites are calculated on the basis of the difference between the values of the two sites.
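The two scoring schemes can be compared on a small example. This is a sketch under assumptions: the difference method is taken here to be the sum of absolute differences in repeat number across sites, which the note does not spell out, and the three VNTR profiles are made up.

```python
def equidistant_distance(a, b):
    """Count the sites that differ; every difference carries the same weight."""
    return sum(1 for x, y in zip(a, b) if x != y)


def difference_distance(a, b):
    """Sum the absolute differences between the values at each site (assumed form)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical repeat counts at three VNTR loci for three isolates.
p, q, r = [2, 3, 5], [2, 4, 9], [2, 5, 6]

equidistant = [equidistant_distance(p, q),
               equidistant_distance(p, r),
               equidistant_distance(q, r)]   # [2, 2, 2]: all pairs tie
difference = [difference_distance(p, q),
              difference_distance(p, r),
              difference_distance(q, r)]     # [5, 3, 4]: no ties
```

On these profiles the equidistant method gives three tied distances, while the difference method separates all three pairs, so fewer equal-length edges are available to produce alternative MSTs.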

There were significantly fewer alternative MSTs possible when the same data were processed using the difference method. There is a relationship between the type of data used and the number of possible alternative MSTs.

Results: Estimating alternative MSTs. When there are limited numbers of informative sites and alleles are treated as equidistant from one another, there are many pairwise distances of the same length, and large numbers of MSTs are possible. Basing analyses on the arithmetic number of pairwise differences among individuals both limits the number of possible MSTs and more faithfully represents the genetic distances between individuals.

Results: Creating the MSN. Approximation by majority rule: dashed lines mark edges present in ≥ 50% of MSTs; solid lines mark edges present in 100% of MSTs. Note that fraction ≠ credibility.
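The majority-rule classification can be sketched as a count of how many MSTs contain each edge. The function name `classify_edges` and the three example trees are illustrative. Each tree is assumed to be a collection of edges, and each edge a frozenset of two node labels.

```python
from collections import Counter


def classify_edges(msts):
    """Split edges into solid (in 100% of MSTs) and dashed (in >= 50% but not all)."""
    total = len(msts)
    counts = Counter(edge for tree in msts for edge in tree)
    solid = {e for e, c in counts.items() if c == total}
    dashed = {e for e, c in counts.items() if 2 * c >= total and c < total}
    return solid, dashed


ab, bc, ac = frozenset("AB"), frozenset("BC"), frozenset("AC")
# Three hypothetical alternative MSTs over nodes A, B, C.
alternatives = [{ab, bc}, {ab, ac}, {ab, bc}]
solid, dashed = classify_edges(alternatives)
# solid  -> {A-B}  (3 of 3 trees)
# dashed -> {B-C}  (2 of 3 trees); A-C (1 of 3) is dropped
```

The counts only describe how often an edge appears among the sampled MSTs; they say nothing about bootstrap support for that edge.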

Results: Estimating credibility of MSTs. Within any set of alternative MSTs examined, the individual trees demonstrated a considerable range of average bootstrap values. Although all MSTs in the MSN are equally parsimonious, some tree configurations are more statistically robust.

Results: Estimating credibility of MSTs. By restricting analysis to a single, arbitrary MST, there is considerable risk of picking a tree with inferior credibility. By surveying and evaluating trees within the MSN, it is possible to identify those with more credible configurations.

Results Systematic approach to MST estimation

Discussion. Failing to consider alternative solutions (MSTs) can easily mislead or confound our understanding of population structure. Molecular epidemiology has yet to adopt measures to evaluate the credibility of the estimation. Presenting a single MST neither explores the range of alternative hypotheses nor evaluates the quality of MSTs based on their relative credibilities.

Discussion: Proposed approach to MST analysis. 1. The distance matrix that maximizes the differences between individuals is calculated. For VNTR data, a distance matrix calculated by the difference method should be used, and for MLST data, distances should be computed from the underlying DNA sequence data. 2. Instead of returning a single, arbitrarily selected MST, the MSN (representing or approximating the entire population of alternative MSTs) is reported. The total number of possible MSTs is estimated using a mark-recapture calculation.

Discussion: Proposed approach to MST analysis (continued). 3. A bootstrapping metric is employed to estimate the credibility of individual MSTs within the population of alternative solutions comprising the MSN. As many MSTs as time permits are subjected to bootstrap analysis so that the most reliable MST topology can be estimated and statistical support for particular relationships may be ascertained. 4. The most credible hypothesis or hypotheses within the larger population of MSTs are reported.

Thanks for Your Attention !