Biology 224 Instructor: Tom Peavy October 18 & 20, 2010 <Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner> Multiple Sequence Alignment



Multiple sequence alignment: definition
- A collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
- Homologous residues are aligned in columns across the length of the sequences
- Residues are homologous in an evolutionary sense
- Residues are homologous in a structural sense

Multiple sequence alignment: properties
- There is not necessarily one “correct” alignment of a protein family
- Protein sequences evolve...
- ...and the corresponding three-dimensional structures of proteins also evolve
- It may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment
- For two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures

Multiple sequence alignment: features
- Some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved
- There may be conserved motifs such as a transmembrane domain
- There may be conserved secondary structure features
- There may be regions with consistent patterns of insertions or deletions (indels)

Multiple sequence alignment: methods
There are two main ways to make a multiple sequence alignment:
(1) Progressive alignment (Feng & Doolittle), e.g. ClustalW
(2) Iterative approaches

Use Clustal W to do a progressive MSA ac.uk/clustalw/

Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments (Needleman and Wunsch)
[2] Create a guide tree
[3] Progressively align the sequences

Progressive MSA stage 1 of 3: generate global pairwise alignments (five closely related lipocalins)
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
Best score: sequences (2:3), 99
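The global pairwise step can be sketched with a score-only Needleman-Wunsch recurrence. This is a minimal illustration, not ClustalW's implementation: the function name and the match, mismatch and gap values are assumptions chosen for readability, and the scores ClustalW prints are on its own scale, so the numbers here will not match the log above.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions (simple match/mismatch
    scores and a linear gap penalty), not ClustalW defaults.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GTWYA", "GLWYA"))  # four matches, one mismatch
```

Only two rows of the dynamic programming matrix are kept because the sketch returns the score alone; recovering the alignment itself would require storing the full matrix or traceback pointers.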

Number of pairwise alignments needed
For N sequences: (N-1)(N)/2
For 5 sequences: (4)(5)/2 = 10
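The count can be checked by enumerating the pairs directly. A small sketch follows; the function name `pairwise_count` is only illustrative.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Enumerate the pairs for five sequences numbered 1..5,
# in the same order as the pairwise alignment log: (1,2), (1,3), ...
pairs = list(combinations(range(1, 6), 2))

if __name__ == "__main__":
    print(pairwise_count(5), len(pairs))
```

The quadratic growth matters in practice: 50 sequences already require 1225 pairwise alignments.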

Feng-Doolittle stage 2: guide tree
- Convert similarity scores to distance scores
- A tree shows the distance between objects
- Distance methods are used (e.g. neighbor joining)
- ClustalW provides a syntax to describe the tree
- A guide tree is not a phylogenetic tree
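A rough sketch of this stage, using the ten pairwise scores from the five-lipocalin example. Two assumptions are made for simplicity: distance is taken as 100 minus the score, and clusters are joined by average linkage (a UPGMA-style procedure) rather than by neighbor joining, so the tree shape may differ from the one ClustalW produces.

```python
from itertools import combinations

# Pairwise scores for the five lipocalins (sequence numbers 1..5)
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92, (2, 3): 99,
    (2, 4): 86, (2, 5): 85, (3, 4): 85, (3, 5): 84, (4, 5): 96,
}


def distance(score):
    """Assumed conversion from a similarity score to a distance."""
    return 100 - score


def average_linkage_tree(scores):
    """Join the closest clusters repeatedly; return (tree, merge order)."""
    dist = {frozenset(pair): distance(s) for pair, s in scores.items()}
    leaves = sorted({i for pair in scores for i in pair})
    clusters = {(i,): i for i in leaves}  # member tuple -> subtree

    def d(a, b):
        # Average distance over all cross-cluster leaf pairs
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda ab: d(*ab))
        merges.append((a, b))
        tree = (clusters.pop(a), clusters.pop(b))
        clusters[tuple(sorted(a + b))] = tree
    return tree, merges


if __name__ == "__main__":
    tree, merges = average_linkage_tree(scores)
    print(tree)
    print(merges)
```

With these scores the first join is sequences 2 and 3 (score 99), followed by 4 and 5 (score 96). The resulting nested tuple only fixes the order in which sequences are aligned, which is why a guide tree should not be read as a phylogeny.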

Progressive MSA stage 2 of 3: generate guide tree (five closely related lipocalins)
Tree leaves: 1 (human RBP), 2 (murine RBP), 3 (rat RBP), 4 (porcine RBP), 5 (bovine RBP)
Tree description: ((Human RBP: ,(Mouse RBP: , Rat RBP: ) : ) : , Pig RBP: , Bovine RBP: );

Feng-Doolittle stage 3: progressive alignment
- Make a MSA based on the order in the guide tree
- Start with the two most closely related sequences
- Then add the next closest sequence
- Continue until all sequences are added to the MSA
- Rule: “once a gap, always a gap”
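The gap rule can be illustrated with a toy helper. This is a sketch under simplifying assumptions, not a progressive aligner: it only shows that when a later sequence forces a new gap, the gap is inserted as a whole column into the existing alignment, while gaps placed earlier are never removed. The function name and the example strings are hypothetical.

```python
def add_gap_column(alignment, position):
    """Insert a gap character at `position` in every aligned sequence.

    Existing gaps are left untouched ("once a gap, always a gap");
    the alignment only ever grows by whole columns.
    """
    return [seq[:position] + "-" + seq[position:] for seq in alignment]


if __name__ == "__main__":
    # Two already-aligned toy sequences; the first carries an earlier gap.
    aligned = ["GTW-YA", "GLWFYA"]
    # A newly added sequence needs an extra column after position 2.
    print(add_gap_column(aligned, 2))
```

Because every row receives the same new column, the residues aligned in earlier steps stay in the same columns relative to each other.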

Clustal W alignment of 5 closely related lipocalins

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN                    ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi| |ref|NP_ |                  MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi| |sp|Q00724|RETB_MOUS        MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi| |ref|NP_ |                  EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi| |sp|Q00724|RETB_MOUS        EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi| |ref|NP_ |                  PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi| |sp|Q00724|RETB_MOUS        PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
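The row of asterisks under each block marks columns in which every sequence has the same residue. A minimal sketch of that check follows, using the last 13 columns of the first block as input. It reproduces only the "*" marks; the ":" and "." marks that Clustal W uses for groups of similar residues are deliberately left out, and the function name is illustrative.

```python
def conserved_columns(alignment):
    """Return a string with '*' under fully conserved columns, ' ' elsewhere."""
    marks = []
    for column in zip(*alignment):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    # Last 13 columns of the first alignment block, one row per sequence
    block = [
        "FSGTWYAMAKKDP",
        "FAGTWYAMAKKDP",
        "FSGTWYAMAKKDP",
        "FSGLWYAIAKKDP",
        "FSGLWYAIAKKDP",
    ]
    for row in block:
        print(row)
    print(conserved_columns(block))
```

The output line agrees with the Clustal W consensus for these columns wherever Clustal W prints an asterisk; the positions it marks with ":" appear as spaces here.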

Why “once a gap, always a gap”?
- There are many possible ways to make a MSA
- Where gaps are added is a critical question
- Gaps are often added to the first two (closest) sequences
- To change the initial gap choices later on would be to give more weight to distantly related sequences
- To maintain the initial gap choices is to trust that those gaps are most believable

Multiple sequence alignment to profile HMMs
- Hidden Markov models (HMMs) use “states” that describe the probability of finding a particular amino acid residue at each position, arranged as a column of a multiple sequence alignment
- HMMs are probabilistic models
- Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments

An HMM is constructed from a MSA
Example: five lipocalins
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

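Sequences: GTWYA GLWYA GRWYE GTWYE GEWFS
Probabilities by alignment column:
Position 1: p(G) 1.0
Position 2: p(T) 0.4, p(L) 0.2, p(R) 0.2, p(E) 0.2
Position 3: p(W) 1.0
Position 4: p(Y) 0.8, p(F) 0.2
Position 5: p(A) 0.4, p(E) 0.4, p(S) 0.2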
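These per-column probabilities are simply residue frequencies, which can be reproduced in a few lines. The sketch below assumes plain counting with no pseudocounts or sequence weighting, which real profile HMM software typically adds; the function name is illustrative.

```python
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(seqs):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


if __name__ == "__main__":
    for position, probs in enumerate(column_probabilities(sequences), start=1):
        print(position, probs)
```

A residue never seen in a column gets probability zero under this scheme, which is the main reason practical tools smooth the counts.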

Sequences: GTWYA GLWYA GRWYE GTWYE GEWFS
Position 1: G:1.0
Position 2: T:0.4 L:0.2 R:0.2 E:0.2
Position 3: W:1.0
Position 4: Y:0.8 F:0.2
Position 5: E:0.4 A:0.4 S:0.2
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
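The arithmetic for GEWYE can be verified with a short script. This is a sketch of the position-by-position calculation only: it uses the column probabilities from the five-lipocalin example, ignores the insert and delete states of a full profile HMM, and does not divide by background residue frequencies. The names `profile`, `sequence_probability` and `log_score` are illustrative.

```python
import math

# Per-position residue probabilities from the five-sequence example
profile = [
    {"G": 1.0},
    {"T": 0.4, "L": 0.2, "R": 0.2, "E": 0.2},
    {"W": 1.0},
    {"Y": 0.8, "F": 0.2},
    {"A": 0.4, "E": 0.4, "S": 0.2},
]


def sequence_probability(seq, profile):
    """Product of the per-position probabilities of each residue in seq."""
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column.get(residue, 0.0)
    return p


def log_score(seq, profile):
    """Sum of natural logs of the per-position probabilities."""
    total = 0.0
    for residue, column in zip(seq, profile):
        p = column.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")  # unseen residue: probability zero
        total += math.log(p)
    return total


if __name__ == "__main__":
    print(sequence_probability("GEWYE", profile))
    print(log_score("GEWYE", profile))
```

Summing logarithms gives the same ranking as multiplying probabilities while avoiding very small numbers for long sequences.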

Databases of multiple sequence alignments
- BLOCKS (HMM)
- CDD (HMM)
- DOMO (Gapped MSA)
- INTERPRO
- iProClass
- MetaFAM
- Pfam (profile HMM library)
- PRINTS
- PRODOM (PSI-BLAST)
- PROSITE
- SMART

CDD uses RPS-BLAST: reverse position-specific BLAST
- Query = your favorite protein
- Database = set of many PSSMs
- CDD is related to PSI-BLAST, but distinct
- CDD searches against profiles generated from pre-selected alignments
- Purpose: to find conserved domains in the query sequence
- You can access CDD via DART at NCBI

Multiple sequence alignment algorithms
Categories: progressive or iterative; local or global
Programs shown: PIMA, DIALIGN, SAGA, CLUSTAL, PileUp, other

Multiple sequence alignment programs
- AMAS
- CINEMA
- ClustalW
- ClustalX
- DIALIGN
- HMMT
- Match-Box
- MultAlin
- MSA
- Musca
- PileUp
- SAGA
- T-COFFEE

Clustal X

GCG PileUp

Boxshade Alignment (“Pretty Shading”) Boxshade server=

Assessment of alternative multiple sequence alignment algorithms
[1] As percent identity among proteins drops, performance (accuracy) also declines. This is especially severe for proteins with < 25% identity.
    Proteins < 25% identity: 65% of residues align well
    Proteins < 40% identity: 80% of residues align well
[2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local ones.