Perfect phylogenetic networks, and inferring language evolution
Tandy Warnow, The University of Texas at Austin
(Joint work with Don Ringe, Steve Evans, and Luay Nakhleh)

Species phylogeny
(Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.)

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Controversies for Indo-European history
Subgrouping: other than the 10 major subgroups, what is likely to be true? In particular, what about
–Indo-Hittite,
–Italo-Celtic,
–Greco-Armenian,
–Anatolian + Tocharian,
–Satem Core?

Historical Linguistic Data
A character is a function that maps a set of languages, L, to a set of states.
Three kinds of characters:
–Phonological (sound changes)
–Lexical (meanings based on a wordlist)
–Morphological (especially inflectional)

Homoplasy-free evolution
When a character changes state, it changes to a new state not found elsewhere in the tree. In other words, there is no homoplasy (character reversal or parallel evolution). This pattern was first inferred in the 19th century for weird innovations in phonological and morphological characters.

Lexical characters can also evolve without homoplasy
For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is neither undetected borrowing nor parallel semantic shift. In practice, however, lexical characters are more likely to evolve homoplastically than complex phonological or morphological characters.
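The connectivity condition on this slide can be checked mechanically. The sketch below is an illustration only, not the authors' code. It assumes an unrooted tree stored as an adjacency dictionary and a character stored as a mapping from languages to states; the names `spanning_nodes` and `is_compatible` are hypothetical. A character is treated as compatible when the minimal subtrees connecting the languages of each state share no node, which is equivalent to being able to label the internal nodes so that every state forms a connected subset.

```python
def spanning_nodes(adj, targets):
    """Nodes of the smallest subtree of `adj` that connects all `targets`."""
    live = {v: set(nbrs) for v, nbrs in adj.items()}
    # Repeatedly strip nodes of degree <= 1 that are not target languages.
    stack = [v for v in live if len(live[v]) <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in live:
            continue
        for u in live.pop(v):
            live[u].discard(v)
            if len(live[u]) <= 1 and u not in targets:
                stack.append(u)
    return set(live)


def is_compatible(adj, states):
    """True if every state of the character can occupy a connected subset."""
    groups = {}
    for language, state in states.items():
        groups.setdefault(state, set()).add(language)
    used = set()
    for languages in groups.values():
        nodes = spanning_nodes(adj, languages)
        if nodes & used:  # two states would need the same node
            return False
        used |= nodes
    return True


# Toy unrooted tree ((A,B),(C,D)) with internal nodes x and y (made-up data).
tree = {
    "A": {"x"}, "B": {"x"}, "C": {"y"}, "D": {"y"},
    "x": {"A", "B", "y"}, "y": {"C", "D", "x"},
}
print(is_compatible(tree, {"A": 1, "B": 1, "C": 2, "D": 2}))  # True
print(is_compatible(tree, {"A": 1, "B": 2, "C": 1, "D": 2}))  # False
```

The second character fails because both cognate classes would need the path through x and y, so at least one of them must have arisen twice or been borrowed.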

Differences between different characters
Lexical: most easily borrowed (most borrowings are detectable), and homoplasy is relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary).
Phonological: can still be borrowed, but much less likely than lexical. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic.
Morphological: least easily borrowed, least likely to be homoplastic.

Linguistic character evolution
Characters are lexical, phonological, and morphological. Homoplasy is much less frequent than in biomolecular data: most changes result in a new state, and hence there is an unbounded number of possible states. Borrowing between languages occurs, but can often be detected.

Our methods/models
Ringe & Warnow, “Almost Perfect Phylogeny”: most characters evolve without homoplasy under a no-common-mechanism assumption (various publications since 1995).
Ringe, Warnow, & Nakhleh, “Perfect Phylogenetic Network”: extends the APP model to allow for borrowing, but assumes homoplasy-free evolution for all characters (Language, 2005).
Warnow, Evans, Ringe & Nakhleh, “Extended Markov model”: parameterizes PPN and allows for homoplasy provided that homoplastic states can be identified from the data (to appear, Cambridge University Press).
Ongoing work: incorporating unidentified homoplasy.

First analysis: Almost Perfect Phylogeny
The original dataset contained 375 characters (336 lexical, 17 morphological, and 22 phonological). We screened the dataset to eliminate characters likely to evolve homoplastically or by borrowing. On this reduced dataset (259 lexical, 13 morphological, 22 phonological), we attempted to maximize the number of compatible characters while requiring that certain of the morphological and phonological characters be compatible. (This computational problem is NP-hard.)
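To make the optimization concrete, here is a minimal brute-force sketch. It is not the procedure used in the analysis, and it makes a simplifying assumption the slide does not: all characters are binary, so that a set of characters is jointly compatible exactly when every pair passes the four-gamete test. Linguistic characters are generally multi-state, where pairwise checks are not sufficient. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def pair_compatible(c1, c2):
    """Four-gamete test: two binary characters conflict only if all four
    state combinations (0,0), (0,1), (1,0), (1,1) occur across languages."""
    return len(set(zip(c1, c2))) < 4


def max_compatible(chars, required=()):
    """Largest pairwise-compatible subset of `chars` (as indices) that
    contains every index in `required`. Exponential time by design."""
    required = set(required)
    for size in range(len(chars), 0, -1):
        for subset in combinations(range(len(chars)), size):
            if not required <= set(subset):
                continue
            if all(pair_compatible(chars[i], chars[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []


# Each tuple lists one binary character's state in four languages (toy data).
chars = [(0, 0, 1, 1), (0, 1, 0, 1), (0, 0, 0, 1)]
print(max_compatible(chars))                # [0, 2]
print(max_compatible(chars, required=[1]))  # [1, 2]
```

Forcing character 1 into the solution changes which other characters survive, which mirrors the slide's requirement that certain morphological and phonological characters be compatible.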

Indo-European Tree (95% of the characters compatible)

Initial analysis
Initial analysis of the IE dataset revealed that no perfect phylogeny for that dataset existed, even after careful screening. Possible explanations:
–Homoplasy
–Polymorphism (e.g. rock/stone)
–Mistakes in character coding
–Borrowing (horizontal transmission)

Modelling borrowing: Networks and Trees within Networks

Perfect Phylogenetic Networks
–An underlying tree + additional contact edges
–No cycles that involve tree edges
–Each character is compatible on at least one of the trees “inside” the network
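The last condition on this slide can be sketched in code. This is an illustration under stated assumptions, not the published algorithm: the underlying tree is given as a child-to-parent map, each contact edge is an unordered pair of nodes that can be unused or can let either endpoint replace its parent edge with the contact edge, and contact edges never involve the root. All names and the toy data are hypothetical.

```python
from itertools import product


def spanning_nodes(adj, targets):
    """Nodes of the smallest subtree of `adj` that connects all `targets`."""
    live = {v: set(nbrs) for v, nbrs in adj.items()}
    stack = [v for v in live if len(live[v]) <= 1 and v not in targets]
    while stack:
        v = stack.pop()
        if v not in live:
            continue
        for u in live.pop(v):
            live[u].discard(v)
            if len(live[u]) <= 1 and u not in targets:
                stack.append(u)
    return set(live)


def is_compatible(adj, states):
    """True if each state's languages span pairwise node-disjoint subtrees."""
    groups, used = {}, set()
    for language, state in states.items():
        groups.setdefault(state, set()).add(language)
    for languages in groups.values():
        nodes = spanning_nodes(adj, languages)
        if nodes & used:
            return False
        used |= nodes
    return True


def reaches_root(parent, root):
    """Reject parent maps in which rerouting has created a cycle."""
    for v in parent:
        seen = set()
        while v != root:
            if v in seen:
                return False
            seen.add(v)
            v = parent[v]
    return True


def trees_in_network(parent, root, contacts):
    """Yield each tree inside the network as an adjacency dictionary."""
    for choice in product((None, 0, 1), repeat=len(contacts)):
        new_parent, moved, valid = dict(parent), set(), True
        for (u, v), c in zip(contacts, choice):
            if c is None:  # this contact edge is not used
                continue
            borrower, donor = (u, v) if c == 0 else (v, u)
            if borrower in moved:  # a node has only one parent edge to swap
                valid = False
                break
            moved.add(borrower)
            new_parent[borrower] = donor
        if valid and reaches_root(new_parent, root):
            adj = {}
            for child, par in new_parent.items():
                adj.setdefault(child, set()).add(par)
                adj.setdefault(par, set()).add(child)
            yield adj


def is_ppn(parent, root, contacts, characters):
    """True if every character is compatible on at least one inside tree."""
    trees = list(trees_in_network(parent, root, contacts))
    return all(any(is_compatible(t, ch) for t in trees) for ch in characters)


# Toy rooted tree ((A,B),(C,D)) and two characters (made-up data).
parent = {"x": "r", "y": "r", "A": "x", "B": "x", "C": "y", "D": "y"}
chars = [{"A": 1, "B": 1, "C": 2, "D": 2}, {"A": 1, "B": 2, "C": 2, "D": 1}]
print(is_ppn(parent, "r", [], chars))            # False
print(is_ppn(parent, "r", [("B", "C")], chars))  # True
```

The second character conflicts with the tree alone, but becomes compatible once the contact edge between B and C lets it evolve down a different inside tree.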

PPN Reconstruction Method
Minimum Increment to a PPN (MIPPN):
(1) Estimate the underlying “genetic” tree.
(2) Add a minimum number of contact edges to make all characters compatible.
(NP-hard to solve exactly even when the genetic tree is known, so we do exhaustive search on each candidate tree.)
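The exhaustive search in step (2) can be sketched as trying contact-edge sets in order of increasing size and stopping at the first one that works. This is a generic illustration, not the authors' implementation. The function `min_contact_edges` and the predicate argument are hypothetical names; in a real analysis the predicate would test whether the tree plus the chosen edges forms a PPN, whereas the toy predicate here simply looks up which incompatible characters each edge is assumed to explain.

```python
from itertools import combinations


def min_contact_edges(candidates, makes_all_compatible):
    """Smallest subset of `candidates` accepted by the predicate, or None.

    Subsets are tried in order of increasing size, so the first hit is
    minimum. The number of subsets grows exponentially with the candidates.
    """
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            if makes_all_compatible(subset):
                return list(subset)
    return None


# Toy stand-in for the PPN test (made-up data): each candidate edge is
# assumed to explain a fixed set of otherwise incompatible characters.
explains = {"e1": {"c1"}, "e2": {"c2"}, "e3": {"c1", "c2"}}
incompatible = {"c1", "c2"}


def toy_predicate(edges):
    covered = set()
    for e in edges:
        covered |= explains[e]
    return incompatible <= covered


print(min_contact_edges(["e1", "e2", "e3"], toy_predicate))  # ['e3']
```

Running this once per candidate genetic tree and keeping the tree that needs the fewest edges matches the outline on the slide.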

The Indo-European (IE) Dataset
24 languages.
294 characters: 22 phonological, 13 morphological, and 259 lexical.
We examined five different “genetic” trees, one of which had a minimum number of incompatible characters (14 lexical characters).

Quality of Solutions
Three mathematical criteria:
1. # characters incompatible with the “genetic” tree T
2. # additional contact edges needed to obtain a PPN from T
3. # borrowing events needed to make all characters compatible on the PPN
Also: feasibility with respect to the archaeological and historical record.

Our best PPN (Language, 2005)

Are we done?
We observed that of the three contact edges, only two are well supported. If we eliminate the weakly supported edge, then we must explain the incompatibility of some characters (through either homoplasy or polymorphism). Challenge: how to model polymorphism, homoplasy, borrowing, and genetic transmission?

Other work
–Stochastic model of language evolution incorporating homoplasy, showing identifiability and efficient reconstructability (to appear, Cambridge University Press)
–Comparison of various methods on the IE dataset (to appear, Transactions of the Philological Society)
–Modelling polymorphism (SIAM J. Computing, and ongoing)
–Simulation study (ongoing)

Extended Markov model
There are two types of states: those that can arise more than once, and those that can arise only once. For each state of each character we know which type it is. (This information is not inferred by the estimation procedure.)
There are two types of substitutions: homoplastic and non-homoplastic.
Parameters: each character has its own 2x2 substitution matrix and a relative probability of being borrowed. Each “contact edge” has a relative probability of transmitting character states.
Each character evolves down a tree contained within the network. The characters evolve independently under this no-common-mechanism model.

Initial results
The model tree is identifiable under very mild conditions (where the substitution probabilities are bounded away from 0 and 1).
Statistically consistent and efficient methods exist for reconstructing trees (as well as some networks).
Maximum Likelihood and Bayesian analyses are also feasible, since likelihood calculations can be done in linear time.
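For intuition about why a likelihood can be computed in time linear in the size of the tree, the sketch below uses the standard pruning recursion (one bottom-up pass) for a single two-state character on a rooted tree. It is a generic illustration and does not implement the extended Markov model from the talk. The names, the per-edge 2x2 matrices in `P`, and the uniform root prior are all assumptions.

```python
def partial(node, children, leaf_state, P):
    """[L0, L1]: probability of the observed leaf states below `node`,
    given that `node` itself is in state 0 or state 1."""
    if node not in children:  # leaf: indicator of the observed state
        return [float(leaf_state[node] == s) for s in (0, 1)]
    out = [1.0, 1.0]
    for child in children[node]:
        below = partial(child, children, leaf_state, P)
        for s in (0, 1):
            # P[child][s][t] is the probability of moving from state s to
            # state t along the edge that leads into `child`.
            out[s] *= sum(P[child][s][t] * below[t] for t in (0, 1))
    return out


def likelihood(root, children, leaf_state, P, prior=(0.5, 0.5)):
    """Likelihood of one character; every edge is visited exactly once."""
    top = partial(root, children, leaf_state, P)
    return sum(prior[s] * top[s] for s in (0, 1))


# Toy example: a root with two leaf languages (made-up numbers).
children = {"r": ["A", "B"]}
edge = [[0.9, 0.1], [0.1, 0.9]]
P = {"A": edge, "B": edge}
print(likelihood("r", children, {"A": 0, "B": 0}, P))
```

Because each edge may carry its own matrix, the same recursion also works when the parameters differ per character and per edge, as in a no-common-mechanism setting.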

Ongoing model development
Not all homoplastic states are identifiable! Therefore, our ongoing work seeks to develop improved models of language evolution that permit unidentified homoplasy. Such models are not likely to be identifiable, making inference of evolution more difficult.
Polymorphism (i.e., two or more states of a character present in a language) remains insufficiently characterized, and therefore cannot yet be used rigorously in a phylogenetic analysis. Our earlier work provided an initial model for when evolution is tree-like, but we need to extend the model to handle borrowing.

The no-common-mechanism model
In this model, there is a separate random variable for every combination of site and edge: the underlying tree is fixed, but otherwise there are no constraints on variation between sites. Including this assumption in the usual molecular evolution models makes the tree and dates unidentifiable.
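One way to write the contrast down, as a sketch under assumptions not stated on the slide: take a two-state symmetric model on a tree with edge set $E$ and $k$ sites, and let $p_{e,i}$ be the probability that site $i$ changes state on edge $e$. The edge lengths $\lambda_e$ and site rates $r_i$ are hypothetical notation for a typical common-mechanism model.

```latex
\begin{align*}
\text{common mechanism:} \quad
  & p_{e,i} = \tfrac{1}{2}\left(1 - e^{-2 r_i \lambda_e}\right)
  && \text{($|E| + k$ parameters: one $\lambda_e$ per edge, one $r_i$ per site)} \\
\text{no common mechanism:} \quad
  & p_{e,i} \in \left[0, \tfrac{1}{2}\right] \text{ free for every pair } (e, i)
  && \text{($|E| \cdot k$ parameters)}
\end{align*}
```

In the first line every site shares the same edge lengths, so the sites jointly carry information about the tree and its dates. In the second line nothing ties the sites together on any edge.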

For more information
Please see the Computational Phylogenetics for Historical Linguistics web site for papers, data, and additional material.

Acknowledgements
The Program for Evolutionary Dynamics at Harvard, NSF, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, and the Institute for Cellular and Molecular Biology at UT-Austin.
Collaborators: Don Ringe, Steve Evans, and Luay Nakhleh.