CS 466 and BIOE 498: Introduction to Bioinformatics

CS 466 and BIOE 498: Introduction to Bioinformatics Tandy Warnow Founder Professor of Engineering

Brief Description. Algorithmic approaches in bioinformatics: biological problems that can be solved computationally (e.g., genome assembly, sequence alignment, and phylogeny estimation); algorithmic techniques with wide applicability in solving these problems (e.g., dynamic programming and probabilistic methods); and practical issues in translating the basic algorithmic ideas into accurate and efficient tools that biologists may use.

But also… Applications of these techniques to cutting-edge research problems in biology that require computational approaches, such as systems biology, protein structure and function prediction, and the Tree of Life.

Saurabh Sinha: gene regulation, big data to knowledge. Two broad areas: How is information about us encoded in our DNA? How do we bring the latest and greatest in machine learning and graph mining to the biologist’s desktop computer? Research questions: Gene regulation: How are genes turned on and off in precisely orchestrated ways? Regulatory evolution: Can we model evolution of regulatory sequences? Genomics of behavior: How does DNA encode animal behavior? Cancer pharmacogenomics: Can a person’s DNA predict the best drug treatment? Big Data To Knowledge (BD2K): Build a “Knowledge Engine for Genomics”. http://www.sinhalab.net/

Jian Peng: Machine learning for molecular and systems biology. Biological data, machine learning, knowledge. Understanding of the sequence-structure-function relationship; machine learning algorithms for biological network integration; prediction of gene/protein function from heterogeneous datasets; role of protein mutations in human diseases; graphical models, statistical modeling, and structured prediction.

Computational Phylogenomics: NP-hard problems; large datasets; complex statistical estimation problems; metagenomics; protein structure and function prediction; medical forensics; systems biology; population genetics.

This course: Material. Core bioinformatics problems: genome assembly, phylogeny estimation, multiple sequence alignment, and database search. Probabilistic models of sequence evolution. Algorithm design techniques: recursion, divide-and-conquer, dynamic programming. Designing heuristic search strategies. Use of profile Hidden Markov Models. Computational complexity: NP-hardness, approximation algorithms.

DNA Sequence Evolution (figure: a sequence evolving down a tree; time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today). Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA.
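To make the picture of sequences changing along the branches of a tree concrete, here is a minimal sketch of substitution-only evolution. The function name `evolve`, the parameter `p_sub`, and the rule that a mutated site is replaced by one of the other three bases chosen uniformly are all assumptions made for illustration; the slide does not specify a model, and indels are ignored here.

```python
import random

BASES = "ACGT"

def evolve(seq, p_sub, rng):
    """Copy seq, replacing each site by a different base with probability p_sub."""
    out = []
    for base in seq:
        if rng.random() < p_sub:
            # Assumption: the three other bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Evolve the root sequence from the slide along two hypothetical branches.
rng = random.Random(0)
root = "AAGACTT"
left_child = evolve(root, 0.2, rng)
right_child = evolve(root, 0.2, rng)
print(root, "->", left_child, right_child)
```

Applying `evolve` repeatedly from the root toward the leaves yields a set of present-day sequences, which is the only data the phylogeny estimation problem gets to see.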

Phylogeny Problem. Input sequences: U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT (figure: tree on leaves U, V, W, X, Y).
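When all input sequences have the same length, as here, a natural first step is to compare them site by site. The sketch below computes pairwise Hamming distances, assuming the labels U to Y pair with the five sequences in the order listed on the slide; the helper name `hamming` is illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Assumption: labels are paired with sequences in slide order.
seqs = {
    "U": "AGGGCAT",
    "V": "TAGCCCA",
    "W": "TAGACTT",
    "X": "TGCACAA",
    "Y": "TGCGCTT",
}

distances = {(p, q): hamming(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
for pair, d in distances.items():
    print(pair, d)
```

A distance table like this is the input to distance-based tree methods; it is only well defined once the sequences have equal length, which is why alignment comes first when they do not.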

However… U = AGGGCATGA, V = AGAT, W = TAGACTT, X = TGCACAA, Y = TGCGCTT (sequences of unequal length). Figure: tree with leaves in the order X, U, Y, V, W.

Indels (insertions and deletions). Mutation: …ACGGTGCAGTTACCA… becomes …ACCAGTCACCA…

…ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion:
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment reflects historical substitution, insertion, and deletion events, and is defined using the transitive closure of pairwise alignments computed on the edges of the true tree. Homology: nucleotides are lined up because they come from a common ancestor. Indel: shown as a dash.
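The events encoded by a pairwise alignment can be read off column by column. The sketch below tallies matches, substitutions, and gaps for the two aligned rows shown on the slide; the function name `classify_columns` and the category labels are illustrative choices.

```python
from collections import Counter

def classify_columns(row_a, row_b):
    """Count column types in a pairwise alignment ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = Counter()
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # all-gap column carries no information
        if y == "-":
            counts["gap_in_second"] += 1   # site deleted from the second row
        elif x == "-":
            counts["gap_in_first"] += 1    # site inserted in the second row
        elif x == y:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts

before = "ACGGTGCAGTTACC-A"
after = "AC----CAGTCACCTA"
counts = classify_columns(before, after)
print(dict(counts))
```

For these rows the tally is 10 matching columns, 1 substitution, 4 columns deleted from the second row, and 1 inserted column, which matches the deletion, substitution, and insertion labeled on the slide.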

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment. Gaps are inserted into the unaligned sequences S1 to S4 so that all rows have the same length:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
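For two sequences, an optimal alignment can be found by dynamic programming, one of the techniques this course covers. The sketch below is a textbook Needleman-Wunsch global alignment with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions, and this is not the method used by any of the multiple sequence alignment tools named in these slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

# Align S2 and S3 from the slide.
best, row2, row3 = needleman_wunsch("TAGCTATCACGACCGC", "TAGCTGACCGC")
print(best)
print(row2)
print(row3)
```

The table has (n+1)(m+1) cells, so time and memory grow with the product of the two lengths. Extending the same recurrence to k sequences multiplies the table dimensions, which is one reason multiple sequence alignment relies on heuristics.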

Phase 2: Construct tree. A tree on S1, S2, S3, and S4 is estimated from the alignment computed in Phase 1 (figure: tree with leaves S1, S2, S4, S3).

Phylogenomic pipeline: (1) Select taxon set and markers. (2) Gather and screen sequence data, possibly identify orthologs. (3) Compute multiple sequence alignments for each locus, and construct gene trees. (4) Compute species tree or network: combine the estimated gene trees, OR estimate a tree from a concatenation of the multiple sequence alignments. (5) Get statistical support on each branch (e.g., bootstrapping). (6) Estimate dates on the nodes of the phylogeny. (7) Use species tree with branch support and dates to understand biology.
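The bootstrapping step in the pipeline above resamples alignment columns with replacement, re-estimates a tree on each replicate, and reports how often each branch reappears. The sketch below shows only the resampling part; the function name `bootstrap_columns` is illustrative, and the alignment is the four-sequence example from the earlier slide.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return a replicate built by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    if any(len(row) != length for row in alignment.values()):
        raise ValueError("all rows must have the same length")
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(row[c] for c in picks) for name, row in alignment.items()}

alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}

replicate = bootstrap_columns(alignment, random.Random(0))
for name, row in replicate.items():
    print(name, row)
```

Each replicate keeps the same taxa and the same number of columns as the original alignment, so any tree estimation method can be run on it unchanged.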

Two-phase estimation. Phylogeny methods: Bayesian MCMC, maximum parsimony, maximum likelihood, neighbor joining, FastME, UPGMA, quartet puzzling, etc. Alignment methods: Clustal, POY (and POY*), Probcons (and Probtree), Probalign, MAFFT, Muscle, Di-align, T-Coffee, Prank (PNAS 2005, Science 2008), Opal (ISMB and Bioinf. 2007), FSA (PLoS Comp. Bio. 2009), Infernal (Bioinf. 2009), etc. RAxML: heuristic for large-scale ML optimization.
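Of the phylogeny methods listed, UPGMA is the shortest to write down: repeatedly merge the two clusters with the smallest average pairwise distance. The sketch below is a deliberately simple, unoptimized version that returns the tree as nested tuples without branch lengths; the function name, the input format (a dictionary keyed by `frozenset` pairs), and the toy distances are all assumptions for illustration.

```python
from itertools import combinations

def upgma(names, d):
    """Build a rooted tree (nested tuples) by average-linkage clustering.

    d maps frozenset({x, y}) to the distance between leaves x and y.
    """
    # Each cluster is (subtree, list of leaves below it).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        total = sum(d[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties go to the first pair found).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Toy clock-like distances (hypothetical, not taken from the slides).
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
d = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], d)
print(tree)  # ('D', ('C', ('A', 'B')))
```

UPGMA recovers the correct tree when distances are clock-like (ultrametric), as in this toy input, and can be misled when rates vary across lineages.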

1000-taxon models, ordered by difficulty (Liu et al., 2009). Notes: (1) 2 classes of MC: easy, moderate-to-difficult; (2) true alignment; (3) 2 classes: ClustalW, everything else. Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way.

Re-aligning on a tree: decompose the dataset into subsets (A, B, C, D) using the tree; align the subsets; merge the sub-alignments (ABCD); estimate an ML tree on the merged alignment. (Note: comment on subset size.)

SATé and PASTA Algorithms: obtain an initial alignment and estimated ML tree; use the tree to compute a new alignment; estimate an ML tree on the new alignment; repeat until a termination condition holds, and return the alignment/tree pair with the best ML score.
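The iteration described on this slide can be sketched as a generic loop. This is only a control-flow skeleton under stated assumptions: the four callables (`initial_align`, `estimate_tree`, `realign`, `ml_score`) are hypothetical placeholders for external tools, a fixed iteration count stands in for the termination condition, and none of this reflects the actual SATé or PASTA code.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_tree,
                               realign, ml_score, max_iterations=5):
    """Alternate between re-aligning on the current tree and re-estimating the tree.

    Returns (best_score, best_alignment, best_tree).
    """
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best = (ml_score(alignment, tree), alignment, tree)
    for _ in range(max_iterations):  # assumption: stop after a fixed number of rounds
        alignment = realign(sequences, tree)   # use tree to compute new alignment
        tree = estimate_tree(alignment)        # estimate tree on new alignment
        current = ml_score(alignment, tree)
        if current > best[0]:
            best = (current, alignment, tree)
    return best

# Toy stand-ins so the loop can be run: "alignments" and "trees" are integers,
# and the score peaks at the third round.
result = iterate_alignment_and_tree(
    sequences=None,
    initial_align=lambda seqs: 0,
    estimate_tree=lambda alignment: alignment,
    realign=lambda seqs, tree: tree + 1,
    ml_score=lambda alignment, tree: -(alignment - 3) ** 2,
)
print(result)  # (0, 3, 3)
```

Keeping the best-scoring pair rather than the last one matters because the score is not guaranteed to improve in every round.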

SATé-1 (Science 2009) performance. 1000-taxon models, ordered by difficulty; the rate of evolution generally increases from left to right. SATé-1: 24-hour analysis on desktop machines (similar improvements for biological datasets). SATé-1 can analyze up to about 8,000 sequences. Notes: for moderate-to-difficult datasets, SATé gets better trees and alignments than all other methods, close to what you might get if you had access to the true alignment. This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes it feasible to estimate deep evolutionary histories accurately using a wider range of markers. Transition: can we do better? What about smaller simulated datasets? And what about biological datasets?

Pre-requisites. CS 466 requires CS 225; BIOE 498 requires CS 173. All students should be able to program in some high-level language (of their choice).

This course: grading. Homework: 35% (includes assigned reading). Midterm (March 31): 35%. Class participation (includes class presentations): 10%. Final project: 20% (due last day of course).

Homework. Generally due Tuesdays by 1 PM (emailed in PDF format or delivered to my SC office). Homework assignments will include programming assignments (language of your choice), algorithm designs, proofs, calculations, and written critiques of published papers (in PDF format).

Final Project. Can be done by yourself or with one other person. Options, from easiest to hardest: critique of a collection of papers on a particular problem; analysis of biological data using different techniques, and a paper describing the differences; comparison of two or more methods for the same problem; a new algorithm that you design. Timeline: project proposals (PDF format, 1 page): April 3; in-class presentation of final project plans: April 7-17; projects (PDF format, 6-10 pages): May 3.

The Tree of Life: Multiple Challenges (Tandy Warnow). Large datasets: 100,000+ sequences, 10,000+ genes, “BigData” complexity. Large-scale statistical phylogeny estimation; ultra-large multiple-sequence alignment; estimating species trees from incongruent gene trees; supertree estimation; genome rearrangement phylogeny; reticulate evolution; visualization of large trees and alignments; data mining techniques to explore multiple optima.

Other. http://tandy.cs.illinois.edu/CS466.html. Textbooks: Jones and Pevzner, Introduction to Bioinformatics Algorithms, MIT Press; Computational Phylogenetics, at http://tandy.cs.illinois.edu/textbook.pdf. Office hours: Wednesday 9-10 AM in Siebel 3235. Homework policy: homework submitted late but within 24 hours is worth 80%; no homework can be submitted more than 24 hours late.