The Basic Local Alignment Search Tool (BLAST) Rapid database search tool (1990) Idea: (1) Search for high-scoring segment pairs.



The Basic Local Alignment Search Tool (BLAST)
A Y W T Y I V A L T - Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y
Most local alignments contain highly conserved sections without gaps.
-> Search for high-scoring segment pairs (HSPs), i.e. gap-free local alignments.
Advantages: (a) speed; (b) a statistical theory for HSPs exists.

The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs (2) Use word pairs as seeds

Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y (1) Search for word pairs of length 3 with score > T and use them as seeds.

Pair-wise sequence alignment
A naïve search for word pairs would have a complexity of O(l1 * l2), where l1 and l2 are the lengths of the two sequences.
Solution: preprocess the query sequence. Compile a list of all words that have a score > T when aligned to a word in the query. Complexity: O(l1).
Organize the words in an efficient data structure (tree) for fast look-up.
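The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not BLAST's implementation: the toy match/mismatch score stands in for a real substitution matrix, a Python dict stands in for the tree, and the names `neighborhood` and `find_seeds` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def score(a, b, match=2, mismatch=-1):
    """Toy substitution score (assumption; BLAST uses a matrix such as BLOSUM62)."""
    return match if a == b else mismatch

def neighborhood(query, w=3, T=2, alphabet=AMINO_ACIDS):
    """Map every word that scores > T against some query word to the
    query positions where it does so."""
    table = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            s = sum(score(a, b) for a, b in zip(q_word, cand))
            if s > T:
                table.setdefault("".join(cand), []).append(i)
    return table

def find_seeds(query, subject, w=3, T=2):
    """Scan the subject once and look each of its words up in the table.
    Returns (query_position, subject_position) pairs."""
    table = neighborhood(query, w, T)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds
```

With these toy scores and T = 2, a word of length 3 qualifies if it differs from a query word in at most one position, so `find_seeds("TWLMHCAQY", "ICIMHCTHY")` reports the three overlapping words around MHC.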

The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs (2) Use word pairs as seeds (3) Extend seed alignments until score drops below threshold value

Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y
Pair-wise sequence alignment T W L M H C A Q Y I C I X M X H X C X T X H X Y
Extend each seed in both directions until the score drops by X below the best score reached so far.
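A minimal sketch of the gap-free extension step, assuming a toy match/mismatch score instead of a substitution matrix. The function name `extend_seed` and its parameters are hypothetical; only the stop rule (score drops by X) comes from the slide.

```python
def extend_seed(query, subject, i, j, w=3, X=3, match=2, mismatch=-1):
    """Extend a word hit at query[i:i+w] / subject[j:j+w] without gaps.
    Returns (query_start, subject_start, length, score) of the best segment pair."""
    def s(a, b):
        return match if a == b else mismatch

    def one_direction(qi, sj, step):
        best = run = 0          # best and running score gained beyond the seed
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += s(query[qi], subject[sj])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= X:   # score dropped by X: stop extending
                break
            qi += step
            sj += step
        return best, best_len

    seed_score = sum(s(query[i + k], subject[j + k]) for k in range(w))
    r_gain, r_len = one_direction(i + w, j + w, +1)
    l_gain, l_len = one_direction(i - 1, j - 1, -1)
    return (i - l_len, j - l_len, w + l_len + r_len, seed_score + l_gain + r_gain)
```

The extension keeps the best-scoring prefix in each direction, so a trailing run of mismatches is trimmed rather than included in the reported segment pair.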

Pair-wise sequence alignment Algorithm not guaranteed to find best segment pair (Heuristic) But works well in practice!

The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy

Pair-wise sequence alignment W L M H C A Q Y A R V I M X H X C X T H W A X R X v X Search for two word pairs on the same diagonal; use a lower threshold T.
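The two-hit idea can be sketched as below. This is an illustration under assumptions: hits are given as (query_position, subject_position) pairs, the window `A` is a parameter of this sketch, and the name `two_hit_seeds` is hypothetical.

```python
def two_hit_seeds(hits, w=3, A=40):
    """Return pairs of non-overlapping word hits that lie on the same
    diagonal (i - j) and start at most A positions apart."""
    last = {}      # diagonal -> query start of the most recent accepted hit
    pairs = []
    for i, j in sorted(hits, key=lambda h: h[1]):   # scan along the subject
        d = i - j
        if d in last:
            gap = i - last[d]
            if gap < w:
                continue                 # overlaps the previous hit: ignore it
            if gap <= A:
                pairs.append(((last[d], last[d] - d), (i, j)))
        last[d] = i
    return pairs
```

Only the second hit of such a pair would then trigger an extension, which is why a lower threshold T becomes affordable.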

The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy Gapped BLAST Position-Specific Iterative BLAST (PSI BLAST)


Multiple sequence alignment
1aboA 1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA 1 .NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie 1 .drvrkksga awqGQIVGWYctnlt peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie 28 YAVESeahpgsvQIYPVAALERIN......

Multiple sequence alignment First question: how to score multiple alignments? Possible scoring scheme: Sum-of-pairs score

Multiple sequence alignment
Multiple alignment implies pairwise alignments:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie 28 YAVESeahpgsvQIYPVAALERIN......

Multiple sequence alignment
Multiple alignment implies pairwise alignments:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP

After removing the columns that contain only gaps:
1aboA 36 WCEAQtkngqGWVPSNYITPVN
1ycsB 39 WWWARlndkeGYVPRNLLGLYP

Multiple sequence alignment
Multiple alignment implies pairwise alignments:
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

Multiple sequence alignment
Multiple alignment implies pairwise alignments: use the sum of the scores of these pairwise alignments.
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie 28 YAVESeahpgsvQIYPVAALERIN......
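The sum-of-pairs score can be sketched as follows, assuming a toy scoring scheme (match, mismatch, and a per-position gap cost) in place of a substitution matrix with proper gap penalties. The function name and the default values are hypothetical.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum, over all pairs of rows, of the score of the induced pairwise alignment.
    `alignment` is a list of equal-length strings with '-' as the gap character."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")

    def pair(a, b):
        if a == "-" and b == "-":
            return 0            # gap-only column in the induced pairwise alignment
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    return sum(pair(a, b)
               for r1, r2 in combinations(alignment, 2)
               for a, b in zip(r1, r2))
```

Columns where both rows of a pair carry a gap contribute nothing, which mirrors the removal of gap-only columns from the induced pairwise alignment.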

Multiple sequence alignment Goal: find the multiple alignment with maximum score!

Multiple sequence alignment
The Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Multidimensional search space instead of a two-dimensional matrix!
Complexity: for three sequences of lengths l1, l2 and l3: O(l1 * l2 * l3). For n sequences (average length l): O(l^n). Exponential complexity!
Optimal solution not feasible -> heuristics necessary.

Multiple sequence alignment (A) Carrillo and Lipman (MSA): find the sub-space of the dynamic-programming matrix in which the optimal path can be found.

Multiple sequence alignment (B) Stoye, Dress (DCA): divide the search space into small sub-spaces, calculate the optimal alignment for each sub-space, concatenate the sub-alignments.

Multiple sequence alignment Most popular way of constructing multiple alignments: progressive alignment. Carry out a series of pair-wise alignments.

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Align most similar sequences.

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Align sequence to alignment.

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Align alignment to alignment.

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Rule: "once a gap - always a gap"

Multiple sequence alignment Order of pair-wise profile alignments determined by phylogenetic tree based on pair-wise similarity values (guide tree)
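One way to derive such a merge order is sketched below. This is an assumption-laden illustration: it clusters greedily by average pairwise similarity (in the spirit of UPGMA), whereas actual programs may build the guide tree differently. The name `merge_order` and the input format are hypothetical.

```python
from itertools import combinations

def merge_order(names, similarity):
    """Greedy average-linkage clustering on pairwise similarity values.
    `similarity` maps frozenset({a, b}) to the similarity of sequences a and b.
    Returns the list of merges in the order they would be aligned."""
    clusters = [(n,) for n in names]
    order = []

    def avg(c1, c2):
        total = sum(similarity[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # pick the two clusters that are most similar on average
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order
```

Each merge corresponds to one pair-wise step of the progressive procedure: sequence to sequence, sequence to alignment, or alignment to alignment.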

Multiple sequence alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP

Multiple sequence alignment Problem: a simple guide tree determines the multiple alignment; the multiple alignment determines the phylogenetic analysis.

Multiple sequence alignment Implementations: Clustal W, PileUp, MultAlin

Local multiple alignment [Figure: input sequences sharing a motif M; one occurrence is a variant M´]

Local multiple alignment Find motifs contained in all sequences in data set Problem: motifs often present in only sub-families

Neither local nor global methods applicable

Alignment possible if order conserved

The DIALIGN approach
Combination of local and global methods.
Find local pair-wise similarities between input sequences (fragments).
Compose alignments from fragments.
Ignore non-related parts of the sequences.

The DIALIGN approach
atctaatagttaaactcccccgtgcttagagatccaaac
cagtgcgtgtattactaacggttcaatcgcgcacatccgc

atctaatagttaaaccccctcgtgcttag agatccaaac
cagtgcgtgtattactaac ggttcaatcgcgcacatccgc--

atcTAATAGTTAaaccccctcgtGCTTag AGATCCaaac
cagtgcgtgTATTACTAAc GGTTcaatcgcgcACATCCgc--

The DIALIGN approach
Score of an alignment: define the score of a fragment f:
l(f) = length of f
s(f) = sum of matches (similarity values)
P(f) = probability of finding a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences
Score: w(f) = -ln P(f)
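A rough sketch of w(f) = -ln P(f) for DNA fragments. The approximation of P(f) used here is an assumption for illustration, not DIALIGN's exact formula: the binomial probability of at least s matches in l positions (match probability p = 0.25) is multiplied by the number of possible start-position pairs and capped at 1. The name `fragment_weight` is hypothetical.

```python
from math import comb, log

def fragment_weight(l, s, len1, len2, p=0.25):
    """Approximate w(f) = -ln P(f) for a fragment of length l with s matches
    in two sequences of lengths len1 and len2."""
    # probability of at least s matches at one fixed pair of start positions
    tail = sum(comb(l, k) * p**k * (1 - p)**(l - k) for k in range(s, l + 1))
    # crude correction for the number of start-position pairs (assumption)
    prob = min(1.0, len1 * len2 * tail)
    return -log(prob) if prob < 1.0 else 0.0
```

Under this approximation a short or poorly matching fragment gets weight 0, while long fragments with many matches receive high weights, so no separate gap penalty is needed to favour them.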

The DIALIGN approach Score of an alignment: Define score of alignment as sum of scores w(f) of its fragments No gap penalty is used! Optimization problem for pair-wise alignment: Find chain of fragments with maximal total score

The DIALIGN approach atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc-- Fragment-chaining algorithm finds optimal chain of fragments.
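The chaining step can be sketched with a simple quadratic dynamic program. This is a minimal illustration, not DIALIGN's algorithm: fragments are assumed to be given as (start1, start2, length, weight) tuples, and the name `best_chain` is hypothetical.

```python
def best_chain(fragments):
    """Return (total_weight, chain) for the highest-scoring chain of fragments.
    Fragment g may follow fragment f only if f ends before g starts in BOTH sequences."""
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))
    if not frags:
        return 0.0, []
    best = [f[3] for f in frags]       # best chain weight ending in fragment k
    prev = [None] * len(frags)         # predecessor of k in that chain
    for k, (i2, j2, _, w2) in enumerate(frags):
        for m in range(k):
            i1, j1, l1, _ = frags[m]
            if i1 + l1 <= i2 and j1 + l1 <= j2 and best[m] + w2 > best[k]:
                best[k] = best[m] + w2
                prev[k] = m
    k = max(range(len(frags)), key=best.__getitem__)
    total, chain = best[k], []
    while k is not None:               # trace the chain back to its first fragment
        chain.append(frags[k])
        k = prev[k]
    return total, chain[::-1]
```

Sequence regions not covered by any fragment in the chain simply stay unaligned, which is how unrelated parts of the sequences are ignored.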

The DIALIGN approach
Multiple fragment alignment
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

The DIALIGN approach
Multiple fragment alignment
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaac ggttcaatcgcg
caaa--gagtatcacc cctgaattgaataa

atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac gg-ttcaatcgcg
caaa--gagtatcacc cctgaattgaataa
Consistency: it is possible to introduce gaps such that all segment pairs are aligned.

atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc GG-TTCAATcgcg
caaa--GAGTATCAcc CCTGaaTTGAATaa

Program evaluation Use biologically verified alignments (known 3D structure of proteins) Compare alignments produced by computer programs to “biologically correct” alignments.

Program evaluation (1) First evaluation of multiple alignment programs (McClure, Vasi, Fitch,1994) 4 protein families used: Globin, kinase, protease, ribonuclease H, all globally related -> global programs performed best

Program evaluation (2) The BAliBASE (Thompson et al., 1999) ~ 100 protein families with known 3D structure, some with large insertions/deletions.

Program evaluation
1aboA 1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA 1 .NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie 1 .drvrkksga awqGQIVGWYctnlt peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie 28 YAVESeahpgsvQIYPVAALERIN
Key: alpha helix = red, beta strand = green, core blocks = underscored

Program evaluation Results: four programs performed best, but no method was best in all test examples. ClustalW, SAGA and PRRP were best for global alignment; DIALIGN was best for sequences with large insertions or deletions.

Program evaluation (3) Lassmann and Sonnhammer (2002) used BAliBASE plus artificial sequences for local alignment. Results: T-COFFEE best for closely related sequences, DIALIGN best for distantly related sequences.

Program evaluation

Alignment of large genomic sequences Important tool for identifying functional sites (e.g. genes or regulatory elements)

Alignment of large genomic sequences Phylogenetic Footprinting: Functional sites more conserved during evolution => Sequence similarity indicates biological function

Alignment of large genomic sequences DIALIGN performs well in identifying local homologies, but is slow

Quadratic program running time

Solution: anchored alignments
Find anchor points to reduce the search space.
Use a fast heuristic method to find anchor points: CHAOS, developed together with Mike Brudno.
Brudno et al. (2003), BMC Bioinformatics 4:66

First step to gene prediction: Exon discovery by genomic alignment

Evaluation of different alignment programs: Compare local sequence similarity identified by alignment programs to known exons Morgenstern et al. (2002), Bioinformatics 18:

DIALIGN alignment of human and murine genomic sequences

DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

Evaluation of DIALIGN, PipMaker, WABA, BLASTN and TBLASTX on a set of 42 human and murine genomic sequences.
Compare similarities to annotated exons.
Apply cut-off parameter to resulting alignments.
Measure sensitivity and specificity.
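A minimal sketch of such a measurement at the nucleotide level. The definitions used here are an assumption, following common practice in gene-prediction benchmarks: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP). Intervals are half-open (start, end) pairs, and both function names are hypothetical.

```python
def positions(intervals):
    """Set of sequence positions covered by half-open (start, end) intervals."""
    return {p for start, end in intervals for p in range(start, end)}

def sensitivity_specificity(predicted, annotated):
    """Compare aligned (predicted) regions with annotated exons position by position."""
    pred, true = positions(predicted), positions(annotated)
    tp = len(pred & true)                        # positions in both sets
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity
```

Raising the cut-off applied to the alignments typically shrinks the predicted set, trading sensitivity for specificity.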

Performance of long-range alignment programs for exon discovery (human - mouse comparison)

Performance of long-range alignment programs for exon discovery (Arabidopsis thaliana - tomato comparison)

AGenDA: Alignment-based Gene Detection Algorithm
Bridge small gaps between DIALIGN fragments -> cluster of fragments.
Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons.
Recursive algorithm finds a biologically consistent chain of potential exons.

Identification of candidate exons Fragments in DIALIGN alignment

Identification of candidate exons Build cluster of fragments

Identification of candidate exons Identify conserved splice sites

Identification of candidate exons Candidate exons bounded by conserved splice sites

Construct gene models using candidate exons
Score of a candidate exon (E) is based on the DIALIGN scores for its fragments, the score of the splice junctions and a penalty for shortening / extending.
Find a biologically consistent chain of candidate exons (starting with a start codon, ending with a stop codon, no internal stop codons …) with maximal total score.

Find optimal consistent chain of candidate exons
atggtaggtagtgaatgtga G1 G2
Recursive algorithm calculates the optimal chain of candidate exons in N log N time.

DIALIGN fragments

Candidate exons

Complete model

Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

AGenDA GenScan 64 % 12 % 17 %

Results:
Quality of AGenDA-based gene models is comparable to results from GenScan.
Exons identified that have not been identified by GenScan.
No statistical models derived from known genes (no training data necessary!).
Method generally applicable.

AGenDA: Alignment-based Gene Detection Algorithm WWW server: Rinner, Taher, Goel, Sczyrba, Brudno, Batzoglou, Morgenstern, submitted