CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University of Alberta and Ron Shamir of Tel Aviv University

Previously…

Iterated local search: Recursive-Iterative-DCM3 [Figure labels: local optimum, output of Recursive-DCM3, local search]

13921 Proteobacteria rRNA

How to run Rec-I-DCM3 then? Unanswered question: what about better TNT heuristics, and can Rec-I-DCM3 improve upon them? Rec-I-DCM3 improves upon default TNT, but we do not know what happens with better TNT heuristics. Therefore, for a large-scale analysis, first figure out the best settings of the software (e.g. TNT or PAUP*) on the dataset, and then use it in conjunction with Rec-I-DCM3 with various subset sizes.

Maximum likelihood

Four problems
–Given data, tree, edge lengths, and ancestral states, find the likelihood of the tree: polynomial time
–Given data, tree, and edge lengths, find the likelihood of the tree: polynomial time dynamic programming
–Given data and tree, find the likelihood (i.e. optimize the edge lengths): unknown complexity
–Given data, find the tree with the best likelihood: unknown complexity
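The second problem (data, tree, and edge lengths all given) is the one solved by dynamic programming over the tree, commonly known as Felsenstein's pruning algorithm. The following is a minimal sketch for a single alignment column, not the method of any particular program. It assumes the Jukes-Cantor substitution model and a hypothetical tree encoding in which a leaf is a one-letter string and an internal node is a tuple `(left, left_length, right, right_length)`; both choices are illustrative and do not come from the slides.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor probabilities of (same base, a given different base) after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def partial_likelihoods(node):
    """For each base x, the probability of the observed leaves below `node` given state x at `node`."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    below_left = partial_likelihoods(left)
    below_right = partial_likelihoods(right)
    same_l, diff_l = jc69(t_left)
    same_r, diff_r = jc69(t_right)
    out = {}
    for x in BASES:
        from_left = sum((same_l if x == y else diff_l) * below_left[y] for y in BASES)
        from_right = sum((same_r if x == y else diff_r) * below_right[y] for y in BASES)
        out[x] = from_left * from_right
    return out

def site_likelihood(tree):
    """Likelihood of one column, with a uniform prior over the root state."""
    return sum(0.25 * v for v in partial_likelihoods(tree).values())

# Hypothetical three-leaf tree with made-up branch lengths.
example_tree = (("A", 0.1, "A", 0.1), 0.2, "C", 0.3)
print(site_likelihood(example_tree))
```

Under the usual assumption that sites evolve independently, the likelihood of a whole alignment is the product of the per-column values, so the total work is polynomial in the number of leaves and sites.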

Sequential RAxML: compute a randomized parsimony starting tree with dnapars from PHYLIP; apply exhaustive subtree rearrangements; iterate while the tree improves.

Subtree Rearrangements [Figure: tree with subtrees ST1 to ST6] Need to optimize all branches?

Idea: Lazy Subtree Rearrangements [Figure: tree with subtrees ST1 to ST6]

Comparison across all datasets. Columns: Dataset size | Improvement as % | Steps improvement | Max p | Avg p. Entries: 2025 (ARB): -0.002%; Bininda-Emonds: 0.004%; (RG): 1.251%; (RG): 2.338%; (ARB): 0.03%

Parallel Rec-I-DCM3 [Figure labels: local optimum, output of Recursive-DCM3, local search] (1) Solve subproblems in parallel. (2) Merge subtrees in the proper subtree order. Use parallel RAxML developed by Du and Stamatakis.

P-Rec-I-DCM3 vs Rec-I-DCM3. Columns: Dataset | Parallel LH | Sequential LH | Improvement in steps | Improvement (as a %). Datasets: 500 rbcL (Zilla); 2560 rbcL (Kallersjo); s Actinobacteria (RDP); 6281 ssu rRNA Eukaryotes (ERNA); s Firmicutes Bacteria (RDP); 7769 rRNA 3-dom+2org (Gutell)

Parallel performance limits: performance appears sub-optimal because of significant load imbalance caused by different subproblem sizes. Optimal speedup = (total subproblem time)/(minimum possible parallel time, i.e. the time of the longest subproblem)
Dataset 3
–19 subproblems, of which 3 require at least 5K seconds (max is 5569 seconds)
–Optimal speedup: 37353/5569=6.71
Dataset 6
–43 subproblems; the longest one determines the denominator below
–Optimal speedup: 63620/12164=5.23
[Table: results by dataset and number of processors, with columns Base, Global, Overall]
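The two speedup bounds quoted above can be reproduced directly. This is only an arithmetic check; the function name `optimal_speedup` is made up for illustration, and the denominator is taken to be the running time of the longest subproblem, which no amount of parallelism can reduce.

```python
def optimal_speedup(total_time, longest_time):
    """Upper bound on speedup when the longest subproblem cannot be split further."""
    return total_time / longest_time

# Totals and denominators as given on the slide, in seconds.
print(round(optimal_speedup(37353, 5569), 2))   # Dataset 3
print(round(optimal_speedup(63620, 12164), 2))  # Dataset 6
```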

Summary of last time: Rec-I-DCM3 in detail; Rec-I-DCM3(TNT); maximum likelihood (ML) problem; RAxML for solving ML; Rec-I-DCM3(RAxML); parallel Rec-I-DCM3(RAxML)

Sequence Alignment: widely used in bioinformatics. Proteins and genes are of different lengths due to errors in sequencing and genetic variation across species. Alignment involves identifying evolutionary events: insertions, deletions, and substitutions. The goal is to “align” sequences such that the number of mutations is minimized.

Sequencing Successes: T7 bacteriophage, completed in ,937 bp, 59 coded proteins; Escherichia coli, completed in ,639,221 bp, 4293 ORFs; Saccharomyces cerevisiae, completed in ,069,252 bp, 5800 genes

Sequencing Successes: Caenorhabditis elegans, completed in ,078,296 bp, 19,099 genes; Drosophila melanogaster, completed in ,117,226 bp, 13,601 genes; Homo sapiens, completed in ,201,762,515 bp, 31,780 genes

Genomes to Date: 8 vertebrates (human, mouse, rat, fugu, zebrafish); 3 plants (Arabidopsis, rice, poplar); 2 insects (fruit fly, mosquito); 2 nematodes (C. elegans, C. briggsae); 1 sea squirt; 4 parasites (plasmodium, guillardia); 4 fungi (S. cerevisiae, S. pombe); 200+ bacteria and archaebacteria; viruses

So what do we do with all this sequence data?

Comparative bioinformatics

DNA Sequence Evolution [Figure: the ancestral sequence AAGACTT evolving along a tree (time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today) into ACCTT (Cat), ACACTTC (Lion), TAGCCCTTA (Monkey), TAGGCCTT (Human), GGCTT (Mouse); shown aligned as TAGGCCTT (Human), TAGCCCTTA (Monkey), A_C_CTT (Cat), A_CACTTC (Lion), _G_GCTT (Mouse)]

Sequence alignments: they tell us about the function or activity of a new gene/protein; the structure or shape of a new protein; the location or preferred location of a protein; the stability of a gene or protein; the origin of a gene or protein; the origin or phylogeny of an organelle; the origin or phylogeny of an organism; and more…

Pairwise alignment How to align two sequences?

Pairwise alignment

Dynamic programming: define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n). Time and space complexity is O(mn).
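The definition of V(i,j) leads to the standard recurrence for global alignment. The form below assumes a linear gap penalty g and a substitution score σ(a,b); both symbols are introduced here for illustration and are not named on the slide.

```latex
\begin{aligned}
V(0,0) &= 0, \qquad V(i,0) = i\,g, \qquad V(0,j) = j\,g,\\
V(i,j) &= \max
\begin{cases}
V(i-1,\,j-1) + \sigma(S_i, T_j) & \text{(align } S_i \text{ with } T_j\text{)}\\
V(i-1,\,j) + g & \text{(align } S_i \text{ with a gap)}\\
V(i,\,j-1) + g & \text{(align } T_j \text{ with a gap)}
\end{cases}
\end{aligned}
```

Each of the (m+1)(n+1) entries is computed in constant time from three neighbours, which gives the O(mn) time and space bound; the optimal score is V(m,n).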

Tabular computation of scores

Traceback to get alignment
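Filling the table and then walking back from V(m,n) can be sketched as follows. This is a minimal global-alignment example in the style of Needleman-Wunsch; the scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and `_` is used as the gap character to match the slides.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for a global alignment of s and t."""
    m, n = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub, V[i - 1][j] + gap, V[i][j - 1] + gap)

    # Traceback: from (m, n), repeatedly step to a predecessor that explains V[i][j].
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                a.append(s[i - 1])
                b.append(t[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("_")
            i -= 1
        else:
            a.append("_")
            b.append(t[j - 1])
            j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = needleman_wunsch("ACCTT", "ACACTTC")
print(score)
print(top)
print(bottom)
```

When several predecessors give the same score, this sketch prefers the diagonal move, so it reports one optimal alignment out of possibly many.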

Local alignment Finding optimally aligned local regions

Local alignment
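Local alignment changes the recurrence in two ways: a cell may never drop below zero, and the traceback starts at the best cell in the table rather than at the corner. The sketch below follows the usual Smith-Waterman formulation; the scoring values (match +2, mismatch -1, gap -2) and the function name are illustrative assumptions.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment start afresh at any cell.
            V[i][j] = max(0, V[i - 1][j - 1] + sub, V[i - 1][j] + gap, V[i][j - 1] + gap)
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    a, b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("_")
            i -= 1
        else:
            a.append("_")
            b.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))

print(smith_waterman("TTTACGTTT", "GGACGGG"))
```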

Database searching: suppose we have a set of 1,000,000 sequences. You have a query sequence q and want to find the m closest ones in the database; that means 1,000,000 pairwise alignments! How to speed up pairwise alignments?

FASTA: FASTA was the first software for quick searching of a database. It introduced the idea of searching for k-mers, which can be done quickly by preprocessing the database.

FASTA: combine high scoring hits into diagonal runs
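The k-mer idea can be illustrated with a small index. The sketch below is not FASTA's actual implementation; it only shows the two ingredients named above, under the assumption that a "hit" is an exact k-mer match and that hits are grouped by diagonal (target position minus query position). All function names are made up for the example.

```python
from collections import Counter, defaultdict

def build_kmer_index(seq, k):
    """Preprocessing step: map every k-mer of seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index

def diagonal_hits(query, target_index, k):
    """Count exact k-mer hits per diagonal (target position - query position)."""
    diagonals = Counter()
    for qpos in range(len(query) - k + 1):
        for tpos in target_index.get(query[qpos:qpos + k], ()):
            diagonals[tpos - qpos] += 1
    return diagonals

target = "TTACGTACGTT"
index = build_kmer_index(target, 3)          # done once per database sequence
hits = diagonal_hits("ACGTACG", index, 3)    # done per query
print(hits.most_common(1))
```

A diagonal with many hits suggests an ungapped run worth examining with a full dynamic programming alignment, so most of the database never needs the O(mn) computation.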

BLAST Key idea: search for k-mers (short matching substrings) quickly by preprocessing the database.

BLAST This key idea can also be used for speeding up pairwise alignments when doing multiple sequence alignments

Biologically realistic scoring matrices: PAM and BLOSUM are the most popular. PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins. BLOSUM is more recent and is computed from blocks of sequences with sufficient similarity.

PAM: we need to compute the probability transition matrix M, which defines the probability of amino acid i converting to j. Examine a set of closely related sequences which are easy to align---for PAM, 1572 mutations between 71 families. Compute probabilities of change and background probabilities by simple counting.

PAM: in this model the unit of evolution is the amount of evolution that will change 1 in 100 amino acids on average. The scoring matrix S_ab is based on the ratio of M_ab to p_b (in practice the logarithm of this ratio is used as the score).
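To make the ratio concrete, the sketch below turns a transition matrix M and background frequencies p into scores and extrapolates to longer evolutionary distances by matrix powers. It assumes the usual log-odds form 10·log10(M_ab / p_b); the two-letter alphabet and all numbers are toy values chosen so that exactly 1 in 100 letters changes per unit, not real amino acid data.

```python
import math

def mat_mul(X, Y):
    """Plain matrix product for small square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_power(M, n):
    """M multiplied by itself n times (n >= 1): n units of evolution."""
    result = M
    for _ in range(n - 1):
        result = mat_mul(result, M)
    return result

def log_odds(M, p):
    """S[a][b] = 10 * log10(M[a][b] / p[b]), assuming all entries are positive."""
    return [[10.0 * math.log10(M[a][b] / p[b]) for b in range(len(p))] for a in range(len(p))]

# Toy two-letter model (hypothetical): background frequencies and 1-unit transition matrix.
p = [2.0 / 3.0, 1.0 / 3.0]
M1 = [[0.9925, 0.0075],
      [0.0150, 0.9850]]

expected_change = sum(p[a] * (1.0 - M1[a][a]) for a in range(2))
print(expected_change)               # about 0.01, i.e. 1 change per 100 letters
print(log_odds(M1, p))               # scores for closely related sequences
print(log_odds(mat_power(M1, 250), p))  # scores for distant sequences
```

Positive scores mark pairs seen more often than chance would predict and negative scores mark pairs seen less often; raising M to a higher power softens both.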

PAM M_ij matrix (x10000)

Multiple sequence alignment: “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk. Computationally very hard---NP-hard.

Formally…

Multiple sequence alignment
Unaligned sequences: GGCTT, TAGGCCTT, TAGCCCTTA, ACACTTC, ACCTT
Aligned sequences:
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us to identify functionality.

Sum of pairs score

What is the sum of pairs score of this alignment?
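A sum-of-pairs score adds up a pairwise cost over every pair of rows in every column. The sketch below assumes a simple unit-cost scheme (mismatch 1, residue against gap 1, identical symbols including gap against gap 0); the slide's own cost scheme is not recoverable from the text, so the value printed for the example alignment is specific to this assumption.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, gap_cost=1, gap="_"):
    """Cost of placing symbols a and b in the same column."""
    if a == b:
        return 0  # identical residues, or gap against gap
    if a == gap or b == gap:
        return gap_cost
    return mismatch

def sum_of_pairs(alignment, mismatch=1, gap_cost=1, gap="_"):
    """Sum pair_cost over all pairs of rows in every column of the alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_cost(a, b, mismatch, gap_cost, gap)
    return total

example = ["_G__GCTT_",
           "TAGGCCTT_",
           "TAGCCCTTA",
           "A__CACTTC",
           "A__C_CTT_"]
print(sum_of_pairs(example))
```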

Tree alignment score

Tree Alignment: TAGGCCTT (Human), TAGCCCTTA (Monkey), ACCTT (Cat), ACACTTC (Lion), GGCTT (Mouse)

Tree Alignment. Leaves: TAGGCCTT_ (Human), TAGCCCTTA (Monkey), A__C_CTT_ (Cat), A__CACTTC (Lion), _G__GCTT_ (Mouse). Internal nodes: TAGGCCTT_, A__CACTT_, TGGGGCTT_, AGGGACTT_. Tree alignment score = 14

Tree Alignment---depends on tree. Switch monkey and cat. Leaves: TAGGCCTT_ (Human), TAGCCCTTA (Monkey), A__C_CTT_ (Cat), A__CACTTC (Lion), _G__GCTT_ (Mouse). Internal nodes: TA_CCCTT_, TA_CCCTTA, TA_CCCTT_, TA_CCCTTA. Tree alignment score = 15

Profiles: before we see how to construct multiple alignments, how do we align two alignments? Idea: summarize an alignment using its profile and align the two profiles.

Profile alignment
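A profile can be represented as one frequency table per column, and two profile columns can then be scored by the expected pairwise score of a symbol drawn from each. The sketch below assumes this expected-score definition and toy scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0); the names are illustrative. Aligning two profiles is then the same dynamic program as pairwise alignment with `column_score` in place of the substitution score.

```python
from collections import Counter

def profile(alignment):
    """One dict per column mapping each symbol (including the gap) to its frequency."""
    nrows = len(alignment)
    return [{sym: count / nrows for sym, count in Counter(column).items()}
            for column in zip(*alignment)]

def column_score(p, q, match=1, mismatch=-1, gap_score=-2, gap="_"):
    """Expected score of pairing a symbol drawn from column p with one drawn from column q."""
    total = 0.0
    for a, fa in p.items():
        for b, fb in q.items():
            if a == gap and b == gap:
                s = 0
            elif a == gap or b == gap:
                s = gap_score
            else:
                s = match if a == b else mismatch
            total += fa * fb * s
    return total

left = profile(["TAGGCCTT_", "TAGCCCTTA"])
right = profile(["A__CACTTC", "A__C_CTT_"])
print(left[3])                           # column 4 of the first sub-alignment
print(column_score(left[3], right[3]))   # how well the two fourth columns match
```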

Iterative alignment (heuristic for sum-of-pairs): pick a random sequence from input set S. Do (n-1) pairwise alignments and align to the closest one t in S. Remove t from S and compute the profile of the alignment. While sequences remain in S:
–Do |S| pairwise alignments and align to the closest one t
–Remove t from S

Iterative alignment: once the alignment is computed, randomly divide it into two parts. Compute the profile of each sub-alignment and realign the profiles. If the sum-of-pairs score of the new alignment is better than the previous one then keep it; otherwise continue with a different division until a specified iteration limit is reached.

Progressive alignment. Idea: perform profile alignments in the order dictated by a tree. Given a guide tree, do a post-order traversal and align sequences in that order. This is a widely used heuristic and can be used for solving tree alignment.
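The order of profile alignments implied by a guide tree can be listed with a post-order traversal. The sketch below only computes that order; the actual profile alignment at each step is left out. The guide tree is encoded as nested pairs of leaf names, and the topology shown is a hypothetical one chosen for the example, not the tree from the slides.

```python
def merge_order(tree):
    """Return the list of (left_leaves, right_leaves) merges in post-order."""
    merges = []

    def visit(node):
        if isinstance(node, str):      # leaf: a single sequence name
            return [node]
        left, right = node
        left_leaves = visit(left)      # children first ...
        right_leaves = visit(right)
        merges.append((left_leaves, right_leaves))  # ... then align their profiles
        return left_leaves + right_leaves

    visit(tree)
    return merges

guide_tree = ((("Human", "Monkey"), ("Cat", "Lion")), "Mouse")
for left_group, right_group in merge_order(guide_tree):
    print(left_group, "+", right_group)
```

Each merge aligns the profile of the left group with the profile of the right group, so a tree with n leaves requires n-1 profile alignments.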

Simultaneous alignment and phylogeny reconstruction: given unaligned sequences, produce both an alignment and a phylogeny. Known as the generalized tree alignment problem---MAX-SNP hard. Iterative improvement heuristic:
–Take a starting tree
–Modify it using, say, NNI, SPR, or TBR
–Compute the tree alignment score
–If better then select the tree, otherwise continue until a local minimum is reached

Median alignment. Idea: iterate over the phylogeny and align every triplet of sequences---takes O(m^3) time (in general, for n sequences it takes O(2^n m^n) time). The same profiles can be used as in progressive alignment. Produces better tree alignment scores (as observed in experiments). Iteration continues for a specified limit.

Popular alignment programs
ClustalW: most popular, progressive alignment
MUSCLE: fast and accurate, progressive and iterative combination
T-COFFEE: slow but accurate, consistency-based alignment (align sequences in the multiple alignment to be close to the optimal pairwise alignment)
PROBCONS: slow but highly accurate, probabilistic consistency-based progressive scheme
DIALIGN: very good for local alignments

MUSCLE

Profile sum-of-pairs score Log expectation score used by MUSCLE
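For reference, the two profile scoring functions can be written as follows. This is the form usually quoted from the MUSCLE publication, reproduced here from memory as a sketch rather than taken from the slide: x and y are the two profile columns being compared, f_i^x is the observed frequency of amino acid i in column x, f_G^x is the fraction of gaps in column x, p_i is the background probability of i, and p_ij is the joint probability of i and j being aligned.

```latex
\begin{aligned}
\mathrm{PSP}^{xy} &= \sum_{i}\sum_{j} f_i^{x}\, f_j^{y}\, \log\frac{p_{ij}}{p_i\, p_j},\\[4pt]
\mathrm{LE}^{xy} &= \bigl(1 - f_G^{x}\bigr)\bigl(1 - f_G^{y}\bigr)\,
\log \sum_{i}\sum_{j} f_i^{x}\, f_j^{y}\, \frac{p_{ij}}{p_i\, p_j}.
\end{aligned}
```

The difference is the position of the logarithm: PSP averages log-odds scores, whereas the log-expectation score takes the logarithm of the averaged odds and down-weights columns that contain many gaps.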

Evaluation of multiple sequence alignments: compare to benchmark “true” alignments; use simulation; measure conservation of an alignment; measure accuracy of phylogenetic trees; check how well it aligns motifs; and more…
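Comparison against a reference alignment is often summarized as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below implements that pair-based measure under the assumption that both alignments contain the same sequences in the same row order; the function names are made up, and benchmark suites may restrict the count to trusted core regions, which is not modelled here.

```python
def aligned_pairs(alignment, gap="_"):
    """Set of ((row, residue_index), (row, residue_index)) pairs sharing a column."""
    pairs = set()
    next_index = [0] * len(alignment)   # running residue counter for each row
    for column in zip(*alignment):
        residues = []
        for row, sym in enumerate(column):
            if sym != gap:
                residues.append((row, next_index[row]))
                next_index[row] += 1
        for x in range(len(residues)):
            for y in range(x + 1, len(residues)):
                pairs.add((residues[x], residues[y]))
    return pairs

def sp_accuracy(test, reference, gap="_"):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference, gap)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test, gap) & ref_pairs) / len(ref_pairs)

reference = ["AC_G", "A_TG"]
print(sp_accuracy(["AC_G_", "A_T_G"], reference))  # only the first column agrees
```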

BAliBASE: the most popular benchmark of alignments. Alignments are based upon structure. BAliBASE currently consists of 142 reference alignments, containing over 1000 sequences. Of the 200,000 residues in the database, 58% are defined within the core blocks. The remaining 42% are in ambiguous regions that cannot be reliably aligned. The alignments are divided into four hierarchical reference sets, reference 1 providing the basis for construction of the following sets. Each of the main sets may be further sub-divided into smaller groups, according to sequence length and percent similarity.

BAliBASE The sequences included in the database are selected from alignments in either the FSSP or HOMSTRAD structural databases, or from manually constructed structural alignments taken from the literature. When sufficient structures are not available, additional sequences are included from the HSSP database (Schneider et al., 1997). The VAST Web server (Madej, 1995) is used to confirm that the sequences in each alignment are structural neighbours and can be structurally superimposed. Functional sites are identified using the PDBsum database (Laskowski et al., 1997) and the alignments are manually verified and adjusted, in order to ensure that conserved residues are aligned as well as the secondary structure elements.

BAliBASE Reference 1 contains alignments of (less than 6) equidistant sequences, i.e. the percent identity between two sequences is within a specified range. All the sequences are of similar length, with no large insertions or extensions. Reference 2 aligns up to three "orphan" sequences (less than 25% identical) from reference 1 with a family of at least 15 closely related sequences. Reference 3 consists of up to 4 sub-groups, with less than 25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1. Reference 4 is divided into two sub-categories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues), and insertions (up to 100 residues).

Comparison of alignments on BAliBASE

Next time… Comparison of alignments under simulation; heuristics for simultaneous alignment and phylogeny reconstruction; comparison of alignments for motif detection---functional sites in proteins; performance of alignments for phylogeny reconstruction