Bioinformatics Multiple sequence alignments Scoring multiple sequence alignments Progressive methods ClustalW Other methods Hidden Markov Models Lecture.

Presentation transcript:

Bioinformatics Multiple sequence alignments Scoring multiple sequence alignments Progressive methods ClustalW Other methods Hidden Markov Models Lecture 9

Multiple sequence alignments (MSA)
Comparing a pair of sequences is not sufficient for many research purposes, mainly evolutionary reconstruction and the study of functional similarities. MSA is much more demanding computationally. For two protein sequences, each 300 aa in length and excluding gaps, the number of comparisons to be made using the dynamic programming approach is $300^2 = 9 \times 10^4$. For 3 sequences of the same length this number is $300^3 = 2.7 \times 10^7$. For 10 sequences it becomes staggering. Fortunately, methods that dramatically reduce the number of comparisons were invented in the late 1980s and mid-1990s.
An MSA is usually built in three consecutive steps:
1. Alignments are found between each pair of sequences;
2. A trial MSA is then produced by predicting a phylogenetic tree for the sequences (for instance by the neighbor-joining method);
3. The sequences are then multiply aligned in the order of their relationship on the tree.
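
The growth in the number of comparisons can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted on the slide: it assumes, as the slide does, that the cost is the sequence length raised to the number of sequences. The function name `dp_comparisons` is made up for this illustration.

```python
def dp_comparisons(length: int, n_sequences: int) -> int:
    """Cells in a full dynamic programming lattice for n sequences of equal length.

    Assumption (taken from the slide): the count is length ** n_sequences,
    ignoring gaps and boundary rows.
    """
    return length ** n_sequences


if __name__ == "__main__":
    for n in (2, 3, 10):
        # Scientific notation makes the exponential growth easy to see.
        print(f"{n:>2} sequences: {dp_comparisons(300, n):.1e} comparisons")
```

Running it shows why exact dynamic programming is practical for two or three sequences but not for ten.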

Scoring multiple sequence alignments
Sequence                                      Column A   Column B   Column C
1                                             N          N          N
2                                             N          N          N
3                                             N          N          N
4                                             N          N          C
5                                             N          C          C
No. of N-N matched pairs (each scores 6):     10         6          3
No. of N-C matched pairs (each scores -3):    0          4          6
BLOSUM62 score:                               60         24         9
Column C also contains one C-C pair, which scores 9 in BLOSUM62.
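
The column scores on this slide follow the sum-of-pairs idea: every pair of residues in a column is scored and the scores are added. A minimal sketch follows. The N-N and N-C values are the ones quoted on the slide; the C-C value of 9 is the standard BLOSUM62 entry and is needed because column C contains one C-C pair. The names `BLOSUM62_SUBSET` and `sum_of_pairs` are illustrative only.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed for this example, keyed by sorted residue pair.
BLOSUM62_SUBSET = {
    ("N", "N"): 6,   # quoted on the slide
    ("C", "N"): -3,  # quoted on the slide
    ("C", "C"): 9,   # standard BLOSUM62 value, not shown on the slide
}


def pair_score(x: str, y: str) -> int:
    """Look up the substitution score regardless of residue order."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def sum_of_pairs(column: str) -> int:
    """Add the scores of all unordered residue pairs in one alignment column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


columns = {"A": "NNNNN", "B": "NNNNC", "C": "NNNCC"}
scores = {name: sum_of_pairs(col) for name, col in columns.items()}

if __name__ == "__main__":
    for name, value in scores.items():
        print(f"Column {name}: {value}")
```

With five sequences each column has 10 pairs, so the cost of scoring grows quadratically with the number of sequences.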

Progressive methods of MSA
The most closely related sequences are aligned first by dynamic programming, and the MSA is then built up starting from these most related sequences.
The tree is based on pairwise comparisons of the sequences using one of the phylogenetic methods.
Unfortunately, uncertainty grows at the lower levels of the tree, as deletions and insertions are not easy to recognise.
The challenge is to use an appropriate combination of sequence weighting, scoring matrix and gap penalties; an unsuitable combination prevents an optimal MSA.
Example: N Y L S and N K Y L S align as N K/- Y L S; N F S and N F L S align as N F L/- S; the two alignments are then combined as N K/- Y/F L/- S.

ClustalW
This is an advanced version of the popular and powerful Clustal program, where W stands for weighting. ClustalW provides more realistic alignments that should reflect evolutionary changes, with a more appropriate distribution of gaps between conserved domains.
ClustalW performs a global multiple sequence alignment by a different method than MSA, although the initial global multiple sequence alignment is calculated similarly.
The steps involved are:
1. Pairwise alignment of all sequences;
2. Use of the alignment scores to build a phylogenetic tree;
3. Progressive alignment guided by the phylogenetic relationships indicated by the tree.
The most closely related (similar) sequences are aligned first, and then additional sequences are added.
The initial alignments used to produce the guide tree may be obtained by a fast k-tuple approach (similar to FASTA) or a slower dynamic programming method.
To build the tree, genetic distances between sequences are calculated as the number of mismatched positions in an alignment divided by the total number of matched positions.
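
The distance used for the guide tree can be sketched as follows. This is an illustration of the ratio described on the slide, not ClustalW's actual code. It assumes that "matched positions" means positions where two residues face each other, so positions opposite a gap are skipped. The function name `pairwise_distance` is hypothetical.

```python
def pairwise_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Mismatched positions divided by compared (ungapped) positions.

    Assumption: positions where either sequence has a gap are not scored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # skip positions opposite a gap
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatches / compared


if __name__ == "__main__":
    # Four residue pairs are compared; only Y/F differs.
    print(pairwise_distance("NKYLS", "N-FLS"))
```

Computing this value for every pair of sequences gives the distance matrix from which a method such as neighbor-joining builds the guide tree.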

ClustalW: sequence weights
A. Calculation of sequence weights
Each weighting factor is computed from the branch lengths of the guide tree and then normalized by the largest factor:
Sequence   Weighting factor   Normalized
A          0.35               0.35/0.5 = 0.7
B          0.25               0.25/0.5 = 0.5
C          0.5                0.5/0.5 = 1
B. Use of sequence weights
Column in alignment 1: Sequence A (weight a) ...K... and Sequence B (weight b) ...I...
Column in alignment 2: Sequence C (weight c) ...L...
Score for matching these two columns in an MSA = {[a x c x score(K,L)] + [b x c x score(I,L)] + ...}/n, where n is the number of pairwise comparisons.
The same procedure applies to other columns in all pairwise alignments.
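
A small sketch of how the sequence weights enter the column score. It assumes the score is the weighted sum over all residue pairs between the two columns, divided by the number of pairs compared. The weights 0.7, 0.5 and 1 are the normalized values from the slide; the substitution scores for K-L (-2) and I-L (2) are the BLOSUM62 entries. The names `weighted_column_score` and `TOY_SCORES` are made up for this example.

```python
# BLOSUM62 entries for the two residue pairs in this example.
TOY_SCORES = {frozenset("KL"): -2, frozenset("IL"): 2}


def toy_score(x: str, y: str) -> int:
    """Order-independent lookup in the small score table above."""
    return TOY_SCORES[frozenset((x, y))]


def weighted_column_score(col1, col2, score=toy_score) -> float:
    """Average weighted score between two alignment columns.

    col1, col2: lists of (residue, sequence_weight) tuples.
    Assumption: the divisor n is the number of residue pairs compared.
    """
    total = sum(w1 * w2 * score(r1, r2) for r1, w1 in col1 for r2, w2 in col2)
    n_pairs = len(col1) * len(col2)
    return total / n_pairs


# Column from alignment 1 (sequences A and B) against column from alignment 2 (sequence C).
column_1 = [("K", 0.7), ("I", 0.5)]
column_2 = [("L", 1.0)]
result = weighted_column_score(column_1, column_2)

if __name__ == "__main__":
    print(round(result, 3))
```

Down-weighting closely related sequences in this way keeps a cluster of near-identical sequences from dominating the column score.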

An output from ClustalW
* sequences have significant similarity
CLUSTAL W (1.82) multiple sequence alignment

gi| |gb|AAH |       MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAAGICRS- 49
gi|825623|emb|CAA | MGTKGKVIKCKAAIAWEAGKPLCIEEVEVAPPKAHEVRIQIIATSLCHT- 49
gi| |gb|AAS |       --MQNFVFRNPTKLIFGKGQ---LEQLKTEIPQFGKKVLLVYGGGSIKRN 45
                    . *:: : : : : :*:::. *: : : :.. :.

gi| |gb|AAH |       ---DEHVVSGNLV-TPLPVILGHEAAGIVESVGEGVTTVKPG--DKVIPL 93
gi|825623|emb|CAA | ---DASVIDSKFEGLAFPVIVGHEAAGIVESIGPGVTNVKPG--DKVIPL 94
gi| |gb|AAS |       GIYDNVISILKDINAEVFELTGVEPNPRVSTVKKGIQICKDNGVEFILAV 95
                    . * : :.. : * *. *.:: *: *.. : ::.:

gi| |gb|AAH |       FTPQCGKCRICKNPESNYCLKN-DLGNPRG T 123
gi|825623|emb|CAA | YAPLCRKCKFCLSPLTNLCGKISNLKSPASDQ QL 128
gi| |gb|AAS |       GGGSVIDCTKAIAAGSKYDGDVWDIVTKKAFASEALPFGTVLTLAATGSE 145
                    .*.. ::. ::.. :.:..: : :::..

gi| |gb|AAH |       LQDGTRRFTCSGKPIHHFVGVSTFSQYTVVDENAVAKIDAASPLEKVCLI 173
gi|825623|emb|CAA | MEDKTSRFTCKGKPVYHFFGTSTFSQYTVVSDINLAKIDDDANLERVCLL 178
gi| |gb|AAS |       MNAGSVITNWETNEKYGWGSPVTFPQFSILDPVHTASVPRDQTIYGMVDI 195
                    :: :.. : : :. **.*::::. *.: : : :

Sequences shown:
alcohol dehydrogenase, iron-containing [Bacillus cereus
Class I alcohol dehydrogenase, gamma subunit [Homo sapiens]
Different form of alcohol dehydrogenase [Homo sapiens]

Localised alignments in sequences
The MSA programs discussed so far are based on global alignments, which include all available parts of the sequences. However, many sequences have blocks of similarity separated by regions of low similarity.
Three approaches were used to develop methods more oriented toward this structural feature:
1. Profile analysis;
2. Block analysis;
3. Pattern searching.
Profiles are found by performing a global MSA of a group of sequences and then choosing the more highly conserved regions. A score matrix for such an MSA, called a profile, is then made. Once produced, the profile is used to search a target sequence for possible matches, using the scores in the table to evaluate the likelihood at each position.

Profile analysis: pattern identification
[Table: profile matrix with columns CONS, A, B, C, D, ..., V, W, Y, Z, Gap and Len, one row per position of the motif]
The profile represents the specific motif pattern found at the chosen location for a set of hsp70 proteins. It is used to search a target sequence for matches to the profile.
The values are log-odds scores: the probability of finding the amino acid in the target sequence at that position in the profile, divided by the probability of aligning the two aa by random chance.
There are 23 columns, representing 20 aa + 1 unknown aa (Z) + gap opening and extension penalties.
Gaps are costly unless the profile itself includes gaps, as in row 3.

Profile analysis: pattern identification
The log-odds scores for the profile ($\mathrm{Profile}_{ij}$) are given by:
$\mathrm{Profile}_{ij} = \log\left[\sum_{\text{all } a} \frac{W_{ai}\, p_{aij}}{p_{\mathrm{random}\,j}}\right]$
where $W_{ai}$ is the weight of an ancestral amino acid $a$ at row $i$ in the profile, $p_{aij}$ is the frequency of amino acid $j$ in the PAM amino acid distribution that best matches at row $i$, and $p_{\mathrm{random}\,j}$ is the background frequency of amino acid $j$.
Steps:
1. A profile for a protein family is prepared using a few sequences.
2. The profile is used to search a protein DB for the family members.
3. The search is evaluated with a receiver operating characteristic (ROC) plot. It can recover as many as 95.6 ± 0.6% of the known family members with as few as 6 initial sequences.
The success rate may be increased slightly by using >100 sequences for the profile search.
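
The log-odds formula can be tried out on made-up numbers. The sketch below only illustrates the arithmetic of the formula. The weights, PAM-derived frequencies and background frequencies are invented for the example, and base-10 logarithms are an assumption, since the slide does not state the base. The function name `profile_log_odds` is hypothetical.

```python
import math


def profile_log_odds(weights, pam_freqs, background, residue):
    """Log-odds score for one residue j at one profile row i.

    weights:    {a: W_ai}        weight of each ancestral amino acid a at row i
    pam_freqs:  {a: {j: p_aij}}  frequency of residue j given ancestor a
    background: {j: p_random_j}  background frequency of residue j
    Assumption: base-10 logarithm.
    """
    mixed = sum(w * pam_freqs[a][residue] for a, w in weights.items())
    return math.log10(mixed / background[residue])


# Invented toy values for a single profile row with two candidate ancestors.
row_weights = {"L": 0.75, "I": 0.25}
row_pam = {"L": {"L": 0.4, "V": 0.1}, "I": {"L": 0.2, "V": 0.3}}
bg = {"L": 0.1, "V": 0.05}

if __name__ == "__main__":
    for aa in ("L", "V"):
        print(aa, round(profile_log_odds(row_weights, row_pam, bg, aa), 3))
```

A positive score means the residue is more likely at that profile position than expected by chance; summing such scores along a target sequence gives the match score used in the database search.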

Block Analysis
This method is very similar to the profile search. The major difference is that insertions and deletions are not considered. As a result, the patterns found contain regions of high similarity separated by loosely similar or dissimilar sequences.
Ungapped patterns are extracted from the aligned regions and used to produce blocks. Profile matrices are then built in the same way as in the previous method.
Seq1   GVDVLVATPG   RLLDLEHQNA..VKLDQV   EILVLDEADR
Seq2   GPDALVSTPG   RYLTLEHRNV..LKPDIV   TIRVLDEADR
Seq3   AVEVIVSTPG   RLWDLHHQNA..VQLSQD   ELLDLDEADK
...
Seqn   GCDKLNATPG   RLMDLKHQGA..VKLLFV   SILVMDEADR

Hidden Markov Models
Sequence alignment:
N . F L S
N K Y L T
Q . W - T
The second column is an insertion (INS); the gap in the fourth column is a deletion (DEL).

Hidden Markov Model for sequence alignment
[Figure: states BEG, M1 to M4 (match/mismatch), I0 to I4 (insert), D1 to D4 (delete) and END, connected by arrows labelled with transition probabilities. Each insert state has a loop transition to accommodate multi-residue insertions.]

Hidden Markov Models: calculation of transition probabilities
Sequence alignment:
N . F L S
N K Y L T
Q . W - T
A pathway for the sequence N K Y L T is: BEG → M1 → I1 → M2 → M3 → M4 → END
Each transition has an associated probability, and the probabilities of the transitions leaving each state sum to 1. The transition probability is 0.33 for all states except M4 and D4, where it is 0.5. Assuming for simplicity that each match state has a uniform distribution across the 20 aa, then p = 0.05 and
$P_{NKYLT} = 0.33 \times 0.05 \times 0.33 \times 0.05 \times 0.33 \times 0.05 \times 0.33 \times 0.05 \times 0.33 \times 0.05 \times 0.5 = 6.1 \times 10^{-10}$
The secret of using an HMM successfully is to adjust the transition values and the distribution of each state by training the model. HMMER is a good example of a program that uses such training. The training process leaves a memory and improves the ability to make better MSAs.
A generated pathway is called a Markov chain because the next state depends on the previous one. As the actual sequence/pathway information is hidden, the model is described as a Hidden Markov Model.
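
The path probability on this slide can be reproduced with a short script. It is a sketch under the slide's simplifying assumptions: every transition has probability 0.33 except those leaving M4 or D4 (0.5), and every emitted residue has probability 0.05, including the residue emitted by the insert state. The helper names are invented for this example; a trained model such as those built by HMMER would use learned values instead.

```python
import math

EMISSION_PROB = 0.05  # uniform over the 20 amino acids (slide's assumption)


def transition_prob(source_state: str) -> float:
    """Transition probability out of a state, using the slide's simplified values."""
    return 0.5 if source_state in ("M4", "D4") else 0.33


def path_probability(path, sequence) -> float:
    """Probability of emitting `sequence` along one fixed state path."""
    # Only match (M) and insert (I) states emit a residue.
    emitting = [s for s in path if s[0] in "MI"]
    if len(emitting) != len(sequence):
        raise ValueError("path must emit exactly one residue per sequence position")
    prob = 1.0
    for source in path[:-1]:  # one transition leaves every state except END
        prob *= transition_prob(source)
    prob *= EMISSION_PROB ** len(sequence)
    return prob


path = ["BEG", "M1", "I1", "M2", "M3", "M4", "END"]
p_nkylt = path_probability(path, "NKYLT")

if __name__ == "__main__":
    print(f"P(NKYLT) = {p_nkylt:.2e}")
    # Log-probabilities are often used in practice to avoid numerical underflow.
    print(f"log10 P  = {math.log10(p_nkylt):.2f}")
```

Probabilities of this size shrink quickly with sequence length, which is one reason such scores are usually handled as logarithms.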

GeneDoc: a multiple sequence alignment editor