CS 466 and BIOE 498: Introduction to Bioinformatics

CS 466 and BIOE 498: Introduction to Bioinformatics
Tandy Warnow Founder Professor of Engineering

Brief Description Algorithmic approaches in bioinformatics:
biological problems that can be solved computationally (e.g., genome assembly, sequence alignment, and phylogeny estimation) algorithmic techniques with wide applicability in solving these problems (e.g., dynamic programming and probabilistic methods) practical issues in translating the basic algorithmic ideas into accurate and efficient tools that biologists may use.

But also… Applications of these techniques to cutting edge research problems in biology that require computational approaches, such as: Systems Biology Protein Structure and Function Prediction The Tree of Life

Saurabh Sinha: gene regulation, big data to knowledge
Two broad areas: How is information about us encoded in our DNA ? How do we bring the latest and greatest in machine learning and graph mining to the biologist’s desktop computer? Research questions: Gene regulation: How are genes turned on and off in precisely orchestrated ways? Regulatory evolution: Can we model evolution of regulatory sequences? Genomics of behavior: How does DNA encode animal behavior ? Cancer pharmacogenomics: Can a person’s DNA predict the best drug treatment? Big Data To Knowledge (BD2K): Build a “Knowledge Engine for Genomics”.

Jian Peng: Machine learning for molecular and systems biology
Biological data Machine learning Knowledge Understanding of the sequence-structure-function relationship Machine learning algorithms for biological network integration Prediction of gene/protein function from heterogeneous datasets Role of protein mutations in human diseases Graphical models, statistical modeling and structured prediction

Computational Phylogenomics
NP-hard problems Large datasets Complex statistical estimation problems Metagenomics Protein structure and function prediction Medical forensics Systems biology Population genetics

This course: Material Core bioinformatics problems:
Genome assembly, Phylogeny estimation, Multiple sequence alignment, and Database search Probabilistic models of sequence evolution Algorithm design techniques Recursion Divide-and-conquer Dynamic programming Designing heuristic search strategies Use of profile Hidden Markov Models Computational complexity NP-hardness Approximation algorithms

DNA Sequence Evolution
-3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

Phylogeny Problem AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U V W X Y X

However… U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT X U Y V W

Indels (insertions and deletions)
Mutation …ACGGTGCAGTTACCA… …ACCAGTCACCA… 12

…ACGGTGCAGTTACCA… …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA…
Deletion Substitution …ACGGTGCAGTTACCA… Insertion …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA… The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree Homology = nucleotides lined up since they come from a common ancestor. Indel = dash. 13

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 S2 S4 S3

Phylogenomic pipeline
Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus, and construct gene trees Compute species tree or network: Combine the estimated gene trees, OR Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Two-phase estimation Phylogeny methods Bayesian MCMC Maximum parsimony
Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. RAxML: heuristic for large-scale ML optimization 19

1000-taxon models, ordered by difficulty (Liu et al., 2009)
1. 2 classes of MC: easy, moderate-to-difficult 2. true alignment 3. 2 classes: ClustalW, everything else Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way. 1000-taxon models, ordered by difficulty (Liu et al., 2009) 20

Estimate ML tree on merged alignment
Re-aligning on a tree A B D C A B Decompose dataset C D Align subsets A B Comment on subset size C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

SATé and PASTA Algorithms
Tree Obtain initial alignment and estimated ML tree Estimate ML tree on new alignment Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

SATé-1 (Science 2009) performance
1000-taxon models, ordered by difficulty – rate of evolution generally increases from left to right For moderate-to-difficult datasets, SATe gets better trees and alignments than all other estimated methods. Close to what you might get if you had access to true alignment. Opens up a new realm of possibility: Datasets currently considered “unalignable” can in fact be aligned reasonably well. This opens up the feasibility of accurate estimations of deep evolutionary histories using a wider range of markers. TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets? SATé-1 24 hour analysis, on desktop machines (Similar improvements for biological datasets) SATé-1 can analyze up to about 8,000 sequences.

This course: Material Core bioinformatics problems:
Genome assembly, Phylogeny estimation, Multiple sequence alignment, and Database search Probabilistic models of sequence evolution Algorithm design techniques Recursion Divide-and-conquer Dynamic programming Designing heuristic search strategies Use of profile Hidden Markov Models Computational complexity NP-hardness Approximation algorithms

Pre-requisites CS 466: requires CS 225 BIOE 498: requires CS 173
All students should be able to program in some high-level language (of their choice).

This course: grading Homework: 35% (includes assigned reading)
Midterm (March 31): 35% Class participation (includes class presentations): 10% Final project: 20% (due last day of course)

Homework Generally due Tuesdays by 1 PM ( ed in PDF format or delivered to my SC office Homework assignments will include Programming assignments (language of your choice) Algorithm designs Proofs Calculations Written critiques of published papers (in PDF format)

Final Project Can be done by yourself or with one other person
Options: Easiest: Critique of a collection of papers on a particular problem Analysis of biological data using different techniques, and paper describing the differences Comparison of two or more methods for the same problem Hardest: A new algorithm that you design Timeline: Project proposals (PDF format, 1 page): April 3 In class presentation of final project plans: April 7-17 Projects (PDF format, 6-10 pages): May 3

The Tree of Life: Multiple Challenges
Tandy Warnow The Tree of Life: Multiple Challenges Large datasets: 100,000+ sequences 10,000+ genes “BigData” complexity Large-scale statistical phylogeny estimation Ultra-large multiple-sequence alignment Estimating species trees from incongruent gene trees Supertree estimation Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima

Other http://tandy.cs.illinois.edu/CS466.html Textbooks:
Jones and Pevzner, Introduction to Bioinformatics Algorithms, MIT Press Computational Phylogenetics, at Office hours: Wednesday 9-10 AM in Siebel 3235 Homework policy: homework submitted late but within 24 hours are 80%, no homework can be submitted more than 24 hours late

CS 466 and BIOE 498: Introduction to Bioinformatics

Similar presentations

Presentation on theme: "CS 466 and BIOE 498: Introduction to Bioinformatics"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 466 and BIOE 498: Introduction to Bioinformatics

Similar presentations

Presentation on theme: "CS 466 and BIOE 498: Introduction to Bioinformatics"— Presentation transcript:

Similar presentations

About project

Feedback