CS 581 / BIOE 540: Algorithmic Computational Genomics
Tandy Warnow Departments of Bioengineering and Computer Science

Course Details
•  Office hours: Tuesdays 12:30-1:30 (Siebel 3235)
•  Course webpage:
•  Textbook: Computational Phylogenetics, available for download at
•  TA: Pranjal Vachaspati (to be confirmed)

Today
•  Describe some important problems in computational biology, for which students in this course could develop improved methods
•  Explain how the course will be run
•  Answer questions

This Course
Topics: computational and statistical problems in sequence analysis (e.g., multiple sequence alignment, phylogeny estimation, metagenomics, etc.).
Focus: understanding the mathematical foundations, and designing algorithms with outstanding accuracy and speed on large, complex datasets.
This is not a course about how to use the tools.

Prerequisites
No background in biology is needed. However, the course has the following prerequisites:
•  CS 374: computational complexity, algorithm design techniques, and proving theorems about algorithms
•  CS 361: probability and statistics
•  By recursion, CS 225: programming

If you haven’t satisfied the pre-reqs:
•  You need permission to stay in the course.
•  The first homework is due (by ) on Saturday at 1 PM. See the homework webpage.
•  Then make an appointment to see me to review the homework.

This course
•  Phylogeny estimation based on stochastic models of sequence evolution and genome evolution
•  Multiple sequence alignment
•  Applications to metagenomics, protein structure prediction, and other biological problems

Species Tree Orangutan Gorilla Chimpanzee Human
From the Tree of the Life Website, University of Arizona

Evolution informs about everything in biology
•  Big genome sequencing projects just produce data -- so what?
•  Evolutionary history relates all organisms and genes, and helps us understand and predict
   –  interactions between genes (genetic networks)
   –  drug design
   –  predicting functions of genes
   –  influenza vaccine development
   –  origins and spread of disease
   –  origins and migrations of humans

Constructing the Tree of Life: Hard Computational Problems
•  NP-hard problems
•  Large datasets
   –  100,000+ sequences
   –  thousands of genes
•  “Big data” complexity:
   –  model misspecification
   –  fragmentary sequences
   –  errors in input data
   –  streaming data

Phylogenomic pipeline
•  Select taxon set and markers
•  Gather and screen sequence data, possibly identify orthologs
•  Compute multiple sequence alignments for each locus
•  Compute species tree or network:
   –  Compute gene trees on the alignments and combine the estimated gene trees, OR
   –  Estimate a tree from a concatenation of the multiple sequence alignments
•  Get statistical support on each branch (e.g., bootstrapping)
•  Estimate dates on the nodes of the phylogeny
•  Use species tree with branch support and dates to understand biology

Research Strategies
Improved algorithms through:
•  Divide-and-conquer
•  “Bin-and-conquer”
•  Iteration
•  Bayesian statistics
•  Hidden Markov Models
•  Graph theory
•  Combinatorial optimization
•  Statistical modelling
•  Massive Simulations
•  High Performance Computing

Avian Phylogenomics Project
•  Erich Jarvis, HHMI
•  MTP Gilbert, Copenhagen
•  G Zhang, BGI
•  T. Warnow, UT-Austin
•  S. Mirarab, UT-Austin
•  Md. S. Bayzid, UT-Austin
•  Plus many many other people…
Approx. 50 species, whole genomes
14,000 loci
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project
•  G. Ka-Shu Wong, U Alberta
•  J. Leebens-Mack, U Georgia
•  N. Wickett, Northwestern
•  N. Matasci, iPlant
•  T. Warnow, UIUC
•  S. Mirarab, UT-Austin
•  N. Nguyen, UT-Austin
•  Plus many many other people…
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
Upcoming Challenges (~1200 species, ~400 loci)

DNA Sequence Evolution
[Figure: DNA sequences evolving down a tree, with a time axis running from -3 mil yrs through -2 mil yrs and -1 mil yrs to today. Sequences shown: AAGGCCT, AAGACTT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA]

Phylogeny Problem
[Figure: input sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT for the taxa U, V, W, X, Y, and the tree to be estimated on those taxa]

Performance criteria
•  Running time
•  Space
•  Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution
•  “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation
•  Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Quantifying Error
•  FN: false negative (missing edge)
•  FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with the FN and FP edges marked; 50% error rate]
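
To make the FN and FP definitions concrete, here is a minimal Python sketch. It assumes each tree is given as the set of its internal edges, with every edge written as the set of taxa on one side of it (a bipartition). The function names and the five-taxon example are hypothetical and are not taken from the lecture.

```python
def canonical(side, taxa):
    """Return a canonical form of a bipartition: the side that does not
    contain the alphabetically smallest taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def error_rates(true_edges, est_edges, taxa):
    """Return (FN rate, FP rate) for an estimated tree against a true tree.

    FN rate = fraction of true-tree edges missing from the estimated tree.
    FP rate = fraction of estimated-tree edges absent from the true tree.
    """
    true_set = {canonical(e, taxa) for e in true_edges}
    est_set = {canonical(e, taxa) for e in est_edges}
    fn = len(true_set - est_set)
    fp = len(est_set - true_set)
    fn_rate = fn / len(true_set) if true_set else 0.0
    fp_rate = fp / len(est_set) if est_set else 0.0
    return fn_rate, fp_rate


# Hypothetical example on five taxa (two internal edges per binary tree).
taxa = ["A", "B", "C", "D", "E"]
true_tree = [{"A", "B"}, {"D", "E"}]  # ((A,B),C,(D,E))
est_tree = [{"A", "B"}, {"C", "E"}]   # ((A,B),D,(C,E))

print(error_rates(true_tree, est_tree, taxa))
```

In this example one of the two true edges is missing and one of the two estimated edges is wrong, so both rates are 50%. For two binary trees on the same taxa the FN and FP rates always coincide, because both trees have the same number of internal edges.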

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Phylogenetic reconstruction methods
1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
   [Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, etc.
3. Bayesian methods

Solving maximum likelihood (and other hard optimization problems) is… unlikely
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x
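
The small entries above follow the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). A minimal Python sketch (the function name is ours, not from the lecture) that reproduces the rows for 4 to 10 taxa and the order of magnitude for 20 taxa:

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the exact value can be printed for any n; the point of the slide is that exhaustive search over this many trees is hopeless well before n reaches 100.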

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]
Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining!
[Figure: NJ error rate (y-axis, 0 to 0.8) plotted against the number of taxa (x-axis, 400 to 1600)]

Major Challenges
•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

The Real Problem!
[Figure: unaligned input sequences of different lengths (AGGGCATGA, AGAT, TAGACTT, TGCACAA, TGCGCTT) for the taxa U, V, W, X, Y, with candidate trees on those taxa]

[Figure: the sequence …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion]
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
–  Reflects historical substitution, insertion, and deletion events
–  Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

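Phase 2: Construct tree
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: the tree constructed from this alignment, with leaves S1, S2, S4, S3]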
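
As one illustration of how an alignment feeds the tree-construction phase, distance-based methods start from a matrix of pairwise distances between the aligned rows. The sketch below is an assumption-laden example, not the lecture's method: it uses the simple normalized Hamming distance (p-distance) over columns where neither row has a gap, and the gap placement in rows S3 and S4 is our reading of the slide, since the extracted text lost some gap characters.

```python
from itertools import combinations

# Alignment rows from the slide. The gap placement in S3 and S4 is assumed:
# the rows are padded so that all four have length 21.
alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}


def p_distance(row_a, row_b):
    """Fraction of mismatches over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")  # no shared columns, distance undefined
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aln):
    """Symmetric dictionary of pairwise p-distances, keyed by (name, name)."""
    names = sorted(aln)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        dist[(a, b)] = dist[(b, a)] = p_distance(aln[a], aln[b])
    return dist


dist = distance_matrix(alignment)
for a, b in combinations(sorted(alignment), 2):
    print(a, b, round(dist[(a, b)], 3))
```

A matrix like this could then be handed to a distance-based method such as Neighbor Joining. In practice the distances are usually corrected under a model of sequence evolution rather than used raw.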

Two-phase estimation
Alignment methods
•  Clustal
•  POY (and POY*)
•  Probcons (and Probtree)
•  Probalign
•  MAFFT
•  Muscle
•  Di-align
•  T-Coffee
•  Prank (PNAS 2005, Science 2008)
•  Opal (ISMB and Bioinf. 2007)
•  FSA (PLoS Comp. Bio. 2009)
•  Infernal (Bioinf. 2009)
•  Etc.
Phylogeny methods
•  Bayesian MCMC
•  Maximum parsimony
•  Maximum likelihood
•  Neighbor joining
•  FastME
•  UPGMA
•  Quartet puzzling
•  Etc.

Simulation Studies
Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree and alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S2, S4, S3]
Estimated tree and alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
[Figure: tree with leaves S1, S4, S2, S3]
Compare

1000 taxon models, ordered by difficulty (Liu et al., 2009)

Multiple Sequence Alignment (MSA): another grand challenge [1]
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
•  Novel techniques needed for scalability and accuracy
•  NP-hard problems and large datasets
•  Current methods do not provide good accuracy
•  Few methods can analyze even moderately large datasets
•  Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges
•  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)
•  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

Phylogenomics (Phylogenetic estimation from whole genomes)

Species Tree Estimation requires multiple genes!
Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Two basic approaches for species tree estimation
•  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods
•  Compute trees on individual genes and combine gene trees

Using multiple genes
gene 1
S1    TCTAATGGAA
S2    GCTAAGGGAA
S3    TCTAAGGGAA
S4    TCTAACGGAA
S7    TCTAATGGAC
S8    TATAACGGAA
gene 2
S4    GGTAACCCTC
S5    GCTAAACCTC
S6    GGTGACCATC
S7    GCTAAACCTC
gene 3
S1    TATTGATACA
S3    TCTTGATACC
S4    TAGTGATGCA
S7    TAGTGATGCA
S8    CATTCATACC

Concatenation
      gene 1      gene 2      gene 3
S1    TCTAATGGAA  ??????????  TATTGATACA
S2    GCTAAGGGAA  ??????????  ??????????
S3    TCTAAGGGAA  ??????????  TCTTGATACC
S4    TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5    ??????????  GCTAAACCTC  ??????????
S6    ??????????  GGTGACCATC  ??????????
S7    TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8    TATAACGGAA  ??????????  CATTCATACC
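
The concatenation step shown on this slide can be sketched in a few lines of Python. This is a minimal illustration under our own naming (the function `concatenate` and the dictionary layout are hypothetical): each gene alignment maps a taxon name to its aligned row, and any taxon missing from a gene is padded with "?" characters of that gene's alignment length.

```python
def concatenate(gene_alignments, taxa, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    taxa: ordered list of all taxon names.
    Taxa absent from a gene are filled with the `missing` symbol.
    """
    rows = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))  # all rows share one length
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, missing * length))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# The three gene alignments from the slides.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}

taxa = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
supermatrix = concatenate([gene1, gene2, gene3], taxa)
for taxon in taxa:
    print(taxon, supermatrix[taxon])
```

The resulting supermatrix is then analyzed as if it were a single alignment, which is what distinguishes concatenation from the approach of estimating and combining separate gene trees.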

Red gene tree ≠ species tree (green gene tree okay)

Gene Tree Incongruence
Gene trees can differ from the species tree due to:
•  Duplication and loss
•  Horizontal gene transfer
•  Incomplete lineage sorting (ILS)

Incomplete Lineage Sorting (ILS)
Confounds phylogenetic analysis for many groups:
•  Hominids
•  Birds
•  Yeast
•  Animals
•  Toads
•  Fish
•  Fungi
There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Lineage Sorting
•  Population-level process, also called the “Multi-species coalescent” (Kingman, 1982)
•  Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.

The Coalescent Present Past Courtesy James Degnan

Gene tree in a species tree
Courtesy James Degnan

Key observation: Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees
Courtesy James Degnan
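
The smallest case makes this observation concrete. For a rooted three-taxon species tree ((A,B),C) whose internal branch has length t in coalescent units, a standard result is that the gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two other topologies has probability (1/3)e^(-t). The Python sketch below (function and taxon names are ours) tabulates that distribution, assuming one sampled lineage per species.

```python
import math


def gene_tree_distribution(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (t >= 0).
    """
    p_other = math.exp(-t) / 3.0  # each non-matching topology
    return {
        "((A,B),C)": 1.0 - 2.0 * p_other,  # matches the species tree
        "((A,C),B)": p_other,
        "((B,C),A)": p_other,
    }


for t in (0.0, 0.1, 1.0, 5.0):
    probs = gene_tree_distribution(t)
    print(t, {topology: round(p, 4) for topology, p in probs.items()})
```

For any t > 0 the matching topology is strictly the most probable one, so the species tree topology can be read off the gene tree distribution. As t shrinks toward 0 (short time between speciation events) the three topologies approach probability 1/3 each, which is the incomplete lineage sorting regime.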

Two competing approaches
[Figure: for gene 1, gene 2, . . . , gene k across the species, either combine the genes (Concatenation) or analyze them separately and combine the results with a Summary Method]
Note: supertree methods take overlapping trees and produce a tree, and the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Species tree estimation: difficult, even for small datasets!
Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Major Challenges: large datasets, fragmentary sequences
•  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution.
•  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements).
•  Species Tree Estimation: gene tree incongruence makes accurate estimation of the species tree challenging.
•  Phylogenetic Network Estimation: Horizontal gene transfer and hybridization require non-tree models of evolution.
Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Avian Phylogenomics Project
•  Erich Jarvis, HHMI
•  MTP Gilbert, Copenhagen
•  G Zhang, BGI
•  T. Warnow, UT-Austin
•  S. Mirarab, UT-Austin
•  Md. S. Bayzid, UT-Austin
•  Plus many many other people…
Approx. 50 species, whole genomes
14,000 loci
Challenges:
•  Species tree estimation under the multi-species coalescent model, from 14,000 poorly estimated gene trees, all with different topologies (we used “statistical binning”)
•  Maximum likelihood estimation on a million-site genome-scale alignment (250 CPU years)
Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project
•  G. Ka-Shu Wong, U Alberta
•  J. Leebens-Mack, U Georgia
•  N. Wickett, Northwestern
•  N. Matasci, iPlant
•  T. Warnow, UIUC
•  S. Mirarab, UT-Austin
•  N. Nguyen, UT-Austin
•  Plus many many other people…
Plant Tree of Life based on transcriptomes of ~1200 species
More than 13,000 gene families (most not single copy)
First paper: PNAS 2014 (~100 species and ~800 loci)
Gene Tree Incongruence
Upcoming Challenges (~1200 species, ~400 loci):
•  Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL: Mirarab et al. 2014, Mirarab & Warnow 2015)
•  Multiple sequence alignment of >100,000 sequences (with lots of fragments!); we will use UPP (Nguyen et al., 2015)

Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis
•  NGS data produce fragmentary sequence data
•  Metagenomic analyses include unknown species
•  Taxon identification: given short sequences, identify the species for each fragment
•  Applications: Human Microbiome
•  Issues: accuracy and speed
Mihai Pop, Univ Maryland

Metagenomic taxon identification
Objective: classify short reads in a metagenomic sample

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)
Anatolian Vedic Iranian Greek Italic Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

“Perfect Phylogenetic Network” for IE Nakhleh et al., Language 2005
Anatolian Vedic Iranian Greek Italic Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

Grading
•  Homework: 25% (one hw dropped)
•  Midterm: 40% (March 30)
•  Final Project: 25% (due May 6)
•  Course Participation: 10%
No final exam.

Homework Assignments Homework assignments are listed at and are due at 1 PM (in person or via ) – late homeworks have reduced credit and will not be accepted after 48 hours past the deadline. You are encouraged to work with others on your homework, but you must write up solutions by yourself and indicate who you worked with on each homework.

Course Schedule
A detailed course schedule is at http://tandy.cs.illinois.edu/cs detailed-syllabus.html
This schedule includes material you are expected to have looked at before coming to class:
•  assigned reading (from textbook and/or scientific literature)
•  PPT and/or PDF of my lecture

Final Project and Class Presentation
•  Either research project (can be with another student) or survey paper (done by yourself).
•  Many interesting and publishable problems to address: see
•  Your class presentation should be related to your final project.

Academic Integrity
Please see course website at and also
For this course:
•  Examine the policy about collaboration
•  Learn and understand what plagiarism is (and then don’t do it).
This applies to homework, all writing assignments, and the final project.

Course Research Projects
•  Evaluating existing methods on simulated and real (biological or linguistic) datasets
•  Designing a new method, and establishing its performance (using theory and data)
•  Analyzing a biological dataset using several different methods, to address biology

Examples of published course projects
•  Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7.
•  T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11.
•  J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. "A comparative study of SVDquartets and other coalescent-based species tree estimation methods". RECOMB-Comparative Genomics and BMC Genomics, 2015, 16(Suppl 10): S2.
•  P. Vachaspati and T. Warnow (2016). "FastRFS: Fast and Accurate Robinson-Foulds Supertrees using Constrained Exact Optimization". Bioinformatics 2016; doi: /bioinformatics/btw600. (Special issue for papers from RECOMB-CG)

Research Projects you could join
•  Phylogenomics projects (Avian and the 1KP)
   –  Species tree and network estimation from conflicting genes
   –  Large-scale multiple sequence alignment
   –  Large-scale maximum likelihood tree estimation
   –  Improving gene tree estimation using whole genomes
•  Metagenomics (with Mihai Pop, University of Maryland, and Bill Gropp)
   –  Identifying genes and taxa from short sequences
   –  Metagenomic assembly
   –  Applications to clinical diagnostics
•  Protein Sequence Analysis (with Jian Peng)
   –  What function and structure does this protein have?
   –  How did structure and function evolve?
•  Historical Linguistics (with Donald Ringe, UPenn)
   –  How did Indo-European evolve?
   –  Designing and implementing statistical estimation methods for language phylogenies
