Scoring Matrices.

Slides:



Advertisements
Similar presentations
Substitution matrices
Advertisements

Global Sequence Alignment by Dynamic Programming.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Sequence Similarity Searching Class 4 March 2010.
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Sequence analysis course
Introduction to Bioinformatics Algorithms Sequence Alignment.
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Alignment methods April 12, 2005 Return Homework (Ave. = 7.5)
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Introduction to Bioinformatics Algorithms Sequence Alignment.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Scoring matrices Identity PAM BLOSUM.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Bioinformatics in Biosophy
Pairwise & Multiple sequence alignments
An Introduction to Bioinformatics
Protein Sequence Alignment and Database Searching.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Tutorial 4 Substitution matrices and PSI-BLAST 1.
Pairwise Sequence Analysis-III
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Sequence comparisons April 9, 2002 Review homework Learning objectives-Review amino acids. Understand difference between identity, similarity and homology.
Step 3: Tools Database Searching
Alignment methods April 17, 2007 Quiz 1—Question on databases Learning objectives- Understand difference between identity, similarity and homology. Understand.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Protein Sequence Alignment Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
Sequence Based Analysis Tutorial
Alignment IV BLOSUM Matrices
Presentation transcript:

Scoring Matrices

Limitations to Needleman-Wunsch The problem with Needleman-Wunsch is the amount of processor memory resources it requires. Because of this, it is not favored for practical use, despite the guarantee of an optimal alignment.

What is the problem? There are about 10 88 possible alignments for two sequences with 300 nucleotides long( There are only about 10 80 elementary particles in the universe. It is not possible to solve the alignment problem with brute force. Therefore, we need some smart methods (or algorithms to overcome this problem

Limitations to Needleman-Wunsch The other difficulty is that the concept of global alignment is not used in pairwise sequence comparison searches.

Global Alignment vs. Local Alignment Needleman-Wunsch Method Local Dot Plots Smith-Waterman FastA BLAST

LGPSTKDFGKISESREFDN LNQLERSFGKINMRLE-DA Global alignment: The global alignment optimizes the alignment over the full length of the sequences. LGPSTKDFGKISESREFDN LNQLERSFGKINMRLE-DA Local Alignment: ---------------FGKI----------- In local alignment ,stretches with the highest density of matches are given the highest priority. The alignment tends to stop at the ends of regions of identity or strong similarity.

Purpose of Smith Waterman Algorithm Smith-Waterman dynamic programming algorithm, finds the most similar subsequences of two sequences, that has been generally recognized as the most sensitive sequence. The search sequences in protein and DNA databases searches for similarity to the query sequence by using Smith-Waterman algorithm as the core sequence comparison method.

Smith-Waterman searches A more sensitive brute force approach to searching much slower than BLAST or FASTA uses dynamic programming SSEARCH is a GCG program for Smith-Waterman searches

Differences Needleman- Wunsch Smith - Waterman Global alignments Requires alignments score for a pair of residues to be >=0 No gap penalty required Local alignments Residue alignment score may be positive or negative Requires a gap penalty to work effectively Score can increase, decrease or stay level between two cells of a pathway.

Scoring Matrix/Substitution Matrix To score quality of an alignment Contains scores for pairs of residues (amino acids or nucleic acids) in a sequence alignment For protein/protein comparisons: a 20 x 20 matrix of similarity scores where identical amino acids and those of similar character (e.g. Ile, Leu) give higher scores compared to those of different character (e.g. Ile, Asp). Symmetric, so often only half is shown. Contains scores for matches between residues, according to observed substitution rates across large evolutionary distances Scoring Matrices are designed to detect signal above background, to detect similarities beyond what would be observed by chance alone. All algorithms to compare protein sequences rely on some scheme to score the equivalencing of each of the 210 possible pairs of amino acids. (i.e. 190 pairs of different amino acids + 20 pairs of identical amino acids). 20x20=400-20=380/2=190 The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover

Substitution Matrices Not all amino acids are equal Some are more easily substituted than others Some mutations occur more often Some substitutions are kept more often Mutations tend to favor some substitutions Some amino acids have similar codons They are more likely to be changed from DNA mutation Selection tends to favor some substitutions Some amino acids have similar properties or structure They are more likely to be kept The two forces together yield substitution matrices (From 18.417 computational biology) Example of CODONS: TTT & TTC code for Phe TTA & TTG code for Leu

Substitution Matrix A substitution matrix describes the likelihood that two residue types would mutate to each other in evolutionary time. This is used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.

Substitution Matrix An amino acid substitution matrix is a symmetrical 20*20 matrix, where each element contains the score for substituting a residue of type i with a residue of type j in a protein, where i and j are one of the 20 amino-acid residue types. Same residues should obviously have high scores, but if we have different residues in a position, how should that be scored?

Scoring Matrices Scoring matrices tell how similar amino acids are. There are two main sets of scoring matrices: PAM and BLOSUM. PAM is based on evolutionary distances BLOSUM is based on structure/function similarities

Substitution Matrix Scoring The same residues in a position give the score value 1, and different residues give 0. The same residues give a score 1, similar residues (for example: Tyr/Phe, or Ile/Leu) give 0.5, and all others 0. One may calculate, using well established sequence alignments, the frequencies (probabilities) that a particular residue in a position is exchanged for another.

Similarity Searching It is easy to score if an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar. Should they get a 0 (non-identical) or a 1 (identical) or something in between? Leucine Isoleucine

Scoring Similarity 1) Can only score aligned sequences 2) DNA is usually scored as identical or not 3) Modified scoring for gaps - single vs. multiple base gaps (gap extension) 4) AAs have varying degrees of similarity a. # of mutations to convert one to another b. chemical similarity c. observed mutation frequencies 5) PAM matrix calculated from observed mutations in protein families

Dayhoff Matrix This was done originally be Margaret Dayhoff. Her matrices are called the PAM (Point Accepted Mutation) matrices, which describe the exchange frequencies after having accepted a given number of point mutations over the sequence. Typical values are PAM 120 (120 mutations per 100 residues in a protein) and PAM 250. There are many other substitution matrices: BLOSUM, Gonnet, etc.

Dayhoff Matrix Derived from how often different amino acids replace other amino acids in evolution. Created from a dataset of closely similar protein sequences (less than 15% amino acid difference). These could be unambiguously aligned. A mutation probability matrix was derived where the entries reflect the probabilities of a mutational event. This matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 point mutation per 100 residues Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers. The system arose out of a general model for the evolution of proteins. 1978!!!, 1572 changes in 71 groups of closely related proteins. Atlas of Protein Sequences. Dataset of 71 aligned sequences? Newer PAM matrices do not differ greatly from the original ones Dayhoff and co workers examined alignments of closely similar sequences where the the likelihood of a particular mutation (e. A-D) being the result of a set of successive mutations (eg. A-x-y-D) was low. Since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1. A complete picture of the mutation process including those amino acids which did not change was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database. for example after 2 PAM (Percentage of Acceptable point Mutations per 10^8 years). An evolutionary distance of 1 PAM means there has been 1 point mutation per 100 residues (percent accepted mutation?) 1 PAM corresponds to an average change in 1% of all amino acids positions. Take a list of aligned proteins every time you see a substitution between two amino acids, increment the similarity score betweent them must normalize it by how often amino acids occur in general. Rare amino acids will give rare substitutions PAM model of molecular evolution After 100 PAMs of evolution, not every residue will have changed: some will have mutated several times, perhaps returning to their original state, and others not at all. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates. The probabilities represent the average mutational change that will take place when 1 residue out of 100 undergo mutation = 1 PAM (Point Accepted Mutation). 2 sequences 1 PAM apart have 99% identical residues

Importance of Scoring Matrices Scoring matrices appear in all analyses involving sequence comparisons. The choice of matrix can strongly influence the outcome of the analysis. Scoring matrices implicitly represent a particular theory of relationships. Understanding theories underlying a given scoring matrix can aid in making proper choice.

Scoring Matrix Conventions Scoring matrices are conventionally numbered with numeric indices corresponding to the rows and columns of the matrix. For example, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column.

Scoring Matrices To use this for sequence alignment, we simply associate a numeric value to each letter in the alphabet of the sequence. For example, if the matrix is: {A,C,T,G} then A = 1, C = 2, etc. Thus, one would find the score for a match between A and C at M12.

The Filled-in F matrix for global alignment of x=AAGT and Y=AGCGT(using BLOSUM50 substitution matrix) Y/X D A G T -8 -16 -24 -32 5 -3 -11 -19 C 2 4 -40 -27 10

Global alignment using BLOSUM50 substitution matrix Y/X D A G T -8 -16 -24 -32 5 -3 -11 -19 C 2 4 -40 -27 10 alignment: AAG _T AGCGT

Amino Acid Scoring Matrices There are two major scoring matrices for amino acid sequence comparisons PAM-derived from sequences known to be closely related (Eg. Chimpanzee and human). Ranges from PAM1 to PAM500 BLOSUM-derived from sequences not closely related (Eg. E. coli and human). Ranges from BLOSUM 10-BLOSUM 100

PAM250 Matrix 1) Notice 1 lettercode for the amino acids on both axes are the 20 aa note blocks of similar amino acids 2) Symmetric, only one half shown 3) Diagonal: * For example: high score for matching Tryptophans and “low” score for matching Alanines. * Cysteine * Leu abundant 4) Off-diagonal Groups of similar amino acids K -> F -5 A score above zero assigned to two amino acids indicates that these two .. Each other more often than expected by chance alone. Ie they are functionall.. Exchangable A negative score indicates that the two amino acids are rarely .. Interchangeable. Eg. A basic amino acids for an acidic one or one with an … side chain for one with aliphatic side chain. www.nbif.org/education/comp-bio/distance.htm

The Point-Accepted-Mutation (PAM) model This model implies that amino acids (AA) mutate independently of each other with a probability which depends only on the AA. Since there are 20 AA, the transition probabilities are described by a 20X20-mutation matrix, denoted by M. A standard M defines a 1-PAM change. Point Accepted Mutation (PAM) Distance: A 1-PAM unit changes 1% of the amino acids on average: where fi is the frequency of AAi, and Mii is the frequency of no change in amino acid i.

The Point-Accepted-Mutation (PAM) model Started by Margaret Dayhoff, 1978 A series of matrices describing the extent to which two amino acids have been interchanged in evolution PAM-1 was obtained by aligning very similar sequences. Other PAMs were obtained by extrapolation

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix A 2-PAM unit is equivalent to two 1-PAM unit evolution (or M2). A k-PAM unit is equivalent to k 1-PAM unit evolution (or Mk). Example 1: CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV |||||||||||||| ||||||||||||||||||||||||||||||||||| CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV length = 50 1 mismatch PAM distance = 2

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix Observed % Sequence Difference Evolutionary Distance In PAMs 1 5 10 20 40 50 60 70 80 1 5 11 23 56 80 112 159 246

Assumptions in the PAM model 1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model). 2. Sequences that are being compared have average amino acid composition.

Steps to building the first PAM Aligned sequences that were at least 85% identical. Reconstructed phylogenetic trees and inferred ancestral sequences. 71 trees containing 1,572 aa exchanges were used. Tallied aa replacements "accepted" by natural selection, in all pairwise comparisons (each Aij is the number of times amino acid j was replaced by amino acid i in all comparisons).

Steps to building PAM 4. Computed amino acid “mutability”, mj (the propensity of a given amino acid, j, to be replaced by any other amino acid) 5. Combined data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance, according to the following formula: Replacements

Steps to building PAM 6. Take the log odds ratio to obtain each score: Sij = log (Mij/fi) Where fi is the normalized frequency of aai in the sequences used. 7. Note: must multiply the Mij/fi by a factor of 10 prior to avoid fractions.

Sources of error in PAM model 1. Many sequences depart from average aa composition. 2. Rare replacements were observed too infrequently to determine probabilities accurately (for 36 aa pairs (out of 400 aa pairs) no replacements were observed!). 3. Errors in 1 PAM are magnified when extrapolated to 250 PAM. (Mijk = k PAM) 4. The idea that each amino acid is acting independently is an imperfect representation of evolution. Actually, distantly related sequences usually have islands (blocks) of conserved residues implying that replacement is not equally probable over entire sequence.

The bottom line on PAM Frequency of alignment Frequency of occurrence The probability that two amino acids, i and j are aligned by evolutionary descent divided by the probability that they are aligned by chance

BLOSUM Matrix (BLOcks SUbstitution Matrices) Blocks Sum-created from BLOCKS database A series of matrices describing the extent to which two amino acids are interchangeable in conserved structures of proteins The number in the series represents the threshold percent similarity between sequences, for consideration for calculation (For example, BLOSUM62 means 62% of the aa’s were similar)

BLOSUM Matrices BLOSUM is built from distantly related sequences within conserved blocks whereas PAM is built from closely related sequences BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (the BLOCKS database is a secondary database that depends on the PROSITE Family database)

BLOSUM Matrices (cont.1) Version 8.0 of the Blocks Database consists of 2884 blocks based on 770 protein families documented in PROSITE. PROSITE supplies documentation for each family. Hypothetical entry in red box in BLOCK record: AABCDA...BBCDA DABCDA.A.BBCBB BBBCDABA.BCCAA AAACDAC.DCBCDB CCBADAB.DBBDCC AAACAA...BBCCC

Building BLOSUM Matrices 1. To build the BLOSUM 62 matrix one must eliminate sequences that are identical in more than 62% of their amino acid sequences. This is done by either removing sequences from the Block or by finding a cluster of similar sequences and replacing it with a single representative sequence. 2. Next, the probability for a pair of amino acids to be in the same column is calculated. In the previous page this would be the probability of replacement of A with A, A with B, A with C, and B with C. This gives the value qij 3. Next, one calculates the probability that a certain amino acid frequency exists, fi.

Building BLOSUM Matrices (cont.) 4. Finally, we calculate the log odds ratio si,j= log2 (qij/fi). This value is entered into the matrix. Which BLOSUM to use? BLOSUM Identity 80 80% 62 62% (usually default value) 35 35% If you are comparing sequences that are very similar, use BLOSUM 80. Sequences that are more divergent (dissimilar) than 20% are given very low scores in this matrix.

Which Scoring Matrix to use? PAM-1 BLOSUM-100 Small evolutionary distance High identity within short sequences PAM-250 BLOSUM-20 Large evolutionary distance Low identity within long sequences

The PAM 250 Scoring Matrix

GCG Wisconsin Package GAP GAP is the implementation of the Needleman-Wunsch algorithm in the GCG program package. The NW algorithm will present you with a single globally optimal alignment, not all possible optimal alignments - different alignments may exist that give the same score. GAP presents you with one member of the family of best alignments that align the full length of one sequence to the full length of a second sequence. There may be many members of this family, but no other member has a higher score.

GCG Wisconsin Package GAP The primary use of a global alignment algorithm is when you really want the whole of two sequences to be aligned, without truncation. GAP could completely bypass a region of high local homology, if a better (or even just as good) path can be found in a different way. This is problematic if one short sequence is aligned against a longer one with internal repeats. If there is weak or unknown similarity between two sequences, a local alignment algorithm (BESTFIT) is the better choice. Use GAP only when you believe the similarity is over the whole length.

Global Alignment vs. Local Alignment Global alignment is used when the overall gene sequence is similar to another sequence-often used in multiple sequence alignment. Clustal W algorithm Local alignment is used when only a small portion of one gene is similar to a small portion of another gene. BLAST FASTA Smith-Waterman algorithm