In-Class Assignment #1: Research CD2 Follow instructions on distributed assignment sheet
Biology 4900 Biocomputing
Pairwise Sequence Alignment Chapter 3 Pairwise Sequence Alignment
Pairwise Alignment Potential relationships between proteins or nucleic acids can be explored by comparing 2 or more sequences of amino acids or nucleotides. Difficult to do visually. Computer algorithms help us by: Accelerating the comparison process Allowing for “gaps” or indels in sequences (i.e., insertions, deletions) Identifying substituted amino acids that are structurally or functionally similar (D and E). One way to do this is with BLAST (Basic Local Alignment Search Tool) Allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible. BLAST lets user select from a variety of scoring matrices to evaluate sequence relatedness. Pevsner, Bioinformatics and Functional Genomics, 2009
Sequence Analyses: RNA Codons (3 RNA bases in sequence) determine each amino acid that will build the expressed protein. Many amino acids are encoded by more than 1 codon (often differing only in the 3rd base), so a change of a single base may not be significant.
Comparing protein sequences Comparing protein sequences is usually more informative than comparing nucleotide sequences. Changing the base at the 3rd position in a codon does not always change the AA (Ex: both UUU and UUC encode phenylalanine). Different AAs may share similar chemical properties (Ex: hydrophobic residues A, V, L, I). Relationships between related but mismatched AAs in sequence analysis can be accounted for using scoring systems (matrices). Protein sequence comparisons can ID sequence homologies from proteins sharing a common ancestor as far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).
Amino acids by similar biophysical properties http://kimwootae.com.ne.kr/apbiology/chap2.htm
Amino acids by similar biophysical properties These have useful fluorescent properties http://kimwootae.com.ne.kr/apbiology/chap2.htm
Sequence Identity and Similarity
Identity: How closely two sequences match one another. Unlike homology, identity can be measured quantitatively.
Similarity: Pairs of residues that are structurally or functionally related (conservative substitutions).

  >lcl|28245 3CLN:A|PDBID|CHAIN|SEQUENCE Length=148
  Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
  Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

  Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
              A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
  Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

  Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
              GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
  Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

  Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
              VDEMIREA+IDGDG +NYEEFV+MM +K
  Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

88% of the aligned positions have identical amino acids (Identities). This increases to 97% (Positives) when you also count amino acids that are different but have similar properties. Pevsner, Bioinformatics and Functional Genomics, 2009
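The "Identities" figure in the BLAST report can be checked by hand. The following is a minimal sketch, assuming the two aligned sequences are simply compared position by position; the function name `percent_identity` is illustrative and not part of BLAST.

```python
def percent_identity(query: str, subject: str) -> tuple[int, int, float]:
    """Count identical aligned positions in two equal-length aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A gap character never counts as an identity.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query), 100.0 * identical / len(query)


# Query and Sbjct rows copied from the alignment on this slide.
QUERY = (
    "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
    "GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE"
    "VDEMIREADIDGDGHINYEEFVRMMVSK"
)
SBJCT = (
    "ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
    "GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE"
    "VDEMIREANIDGDGQVNYEEFVQMMTAK"
)

identical, length, pct = percent_identity(QUERY, SBJCT)
print(f"Identities = {identical}/{length} ({pct:.0f}%)")
```

Counting "Positives" would additionally require a scoring matrix (a pair counts as positive when its matrix score is above zero), which this sketch leaves out.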
Sequence Homology Homology: Two sequences are homologous if they share a common ancestor. No “degrees of homology”: only homologous or not Almost always share similar 3D structure Ex. myoglobin and beta globin Sequences can change significantly over time, but 3D structure changes more slowly Beta-globin sub-unit of adult hemoglobin (2H35.pdb, in blue), superimposed over myoglobin (3RGK.pdb, in red). These sequences probably separated 600 million years ago. Pevsner, Bioinformatics and Functional Genomics, 2009
Percent Identity and Homology For an alignment of 70 amino acids, 40% sequence identity is a reasonable threshold for homology. Above 20% (more than 70 amino acids) may indicate homology. Below 20% probably indicates chance alignment. Pevsner, Bioinformatics and Functional Genomics, 2009
Orthologs and Paralogs Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation. Ex. Humans and rats diverged around 80 million years ago, which is when their myoglobin genes diverged. Orthologs frequently have similar biological functions: human and rat myoglobin (oxygen transport), human and rat CaM. Paralogs: Homologous sequences that arose by a mechanism such as gene duplication, within the same organism/species. Ex. Myoglobin and beta globin are paralogs. Paralogs have distinct but related functions. Pevsner, Bioinformatics and Functional Genomics, 2009
Conservative Substitutions in Matrices Scoring may also vary based on conserved substitutions of amino acids: i.e., amino acids with similar properties will not lose as many points as AAs with very different properties. Basic AAs: K, R, H Acidic AAs: D, E Hydroxylated AAs: S, T Hydrophobic AAs: G, A, V, L, I, M, F, P, W, Y These relationships would be considered when calculating “Positives” in BLAST alignment. Pevsner, Bioinformatics and Functional Genomics, 2009
Dayhoff Model: Building a Scoring Matrix In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix. The model was based on rules by which evolutionary changes occur in proteins. She catalogued 1000s of proteins and considered which specific amino acid substitutions occurred when 2 homologous proteins were aligned. The model assumes substitution patterns in closely related proteins can be extrapolated to more distantly related proteins. An accepted point mutation (PAM) is an AA replacement accepted by natural selection. Scores are based on observed mutations, not necessarily on related AA properties. Probable mutations are rewarded, while unlikely mutations are penalized. Scores for comparison of 2 residues (i, j) are based on the following equation:
  $s_{i,j} = 10 \times \log_{10}(q_{i,j} / p_i)$
Here, $q_{i,j}$ is the probability of an observed substitution (from the mutation probability matrix), while $p_i$ is the likelihood of observing the replacement AA (i) as a result of chance (from the normalized frequency of AA table). Pevsner, Bioinformatics and Functional Genomics, 2009
PAM250 Mutation Probability Matrix
Columns: original AA. Rows: replacement AA.

          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  Ala A  13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
  Arg R   3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
  Asn N   4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
  Asp D   5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
  Cys C   2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
  Gln Q   3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
  Glu E   5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
  Gly G  12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
  His H   2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
  Ile I   3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
  Leu L   6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
  Lys K   6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
  Met M   1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
  Phe F   2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
  Pro P   7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
  Ser S   9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
  Thr T   8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
  Trp W   0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
  Tyr Y   1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
  Val V   7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17

Think of these values as percentages (each column sums to approximately 100, allowing for rounding). For example, there is an 18% (0.18) probability of R being replaced by K. This probability matrix needs to be converted into a scoring matrix. http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Normalized Frequencies of Amino Acids How often a given amino acid appears in a protein (determined by empirical analyses). http://www.icp.ucl.ac.be/~opperd/private/pam250.html
Purpose of PAM Matrices Derive a scoring system to determine relatedness of 2 sequences. PAM mutation probability matrix must be converted to a scoring matrix (log odds matrix).
PAM250 Log-Odds Matrix

  Cys C  12
  Ser S   0   2
  Thr T  -2   1   3
  Pro P  -3   1   0   6
  Ala A  -2   1   1   1   2
  Gly G  -3   1   0  -1   1   5
  Asn N  -4   1   0  -1   0   0   2
  Asp D  -5   0   0  -1   0   1   2   4
  Glu E  -5   0   0  -1   0   0   1   3   4
  Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
  His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
  Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
  Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
  Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
  Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
  Leu L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
  Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
  Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
  Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
  Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
          C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

This is the PAM250 scoring matrix, calculated as follows:
  $s_{i,j} = 10 \times \log_{10}(q_{i,j} / p_i)$
http://www.icp.ucl.ac.be/~opperd/private/pam250.html
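As a worked check of the conversion from probabilities to scores, here is a minimal sketch assuming the usual Dayhoff log-odds form, score = 10 × log10(q / p), rounded to the nearest integer. The q values come from the PAM250 mutation probability matrix (Trp to Trp 55%, Cys to Cys 52%). The normalized frequencies used for p (Trp 0.010, Cys 0.033) are assumptions taken from Dayhoff's published frequency table, which is not reproduced in these slides.

```python
import math


def log_odds_score(q_ij: float, p_i: float) -> int:
    """Log-odds score: 10 * log10(observed probability / chance frequency)."""
    return round(10 * math.log10(q_ij / p_i))


# q: diagonal entries of the PAM250 mutation probability matrix (as fractions).
# p: assumed normalized amino acid frequencies (Trp 0.010, Cys 0.033).
trp_score = log_odds_score(0.55, 0.010)
cys_score = log_odds_score(0.52, 0.033)
print("Trp/Trp:", trp_score)
print("Cys/Cys:", cys_score)
```

Both results agree with the diagonal entries for Trp and Cys in the log-odds matrix. A positive score means the substitution is seen more often than chance predicts; a negative score means it is seen less often.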
Pairwise Alignment and Homology Think of the PAM value as the total number of accepted mutations per 100 residues. This includes multiple mutations over time at a single position. Currently, we accept that once the percent distance reaches ~85%, homology is indeterminate. PAM250 works best for more distantly related protein sequences.
  Seq1  AGDFWYGGDGEYLLV
  Seq2  AGQFWYGGEGEKLLV
  Seq3  AGEFWYGGEGEKLLV
Seq1 and Seq2 are separated by 3 PAM units, while Seq1 and Seq3 are separated by 4 PAM units. http://www.icp.ucl.ac.be/~opperd/private/pam.html
Practical Lessons from the Dayhoff Model Less mutable amino acids likely play more important structural and functional roles Mutable amino acids fulfill functions that can be filled by other amino acids with similar properties Common substitutions tend to require only a single nucleotide change in codon Amino acids that can be created from more than 1 codon are more likely to be created as a substitute (See p. 63, textbook) Changes to sequence that do not alter structure and function of protein likely to be more tolerated in nature Pevsner, Bioinformatics and Functional Genomics, 2009
BLOSUM62 Scoring Matrix BLOck SUbstitution Matrix, by Henikoff and Henikoff (1992). Default scoring matrix for pairwise alignment of sequences using BLAST (local alignments). Based on empirical observations of distantly related proteins organized into blocks (ungapped aligned regions). In BLOSUM62, sequences within a block that share at least 62% identity are clustered and counted as a single sequence. Pevsner, Bioinformatics and Functional Genomics, 2009
General Trends in Scoring Matrices
  Less divergent (human vs. chimp)     BLOSUM90   PAM30
                                       BLOSUM62   PAM120
  More divergent (human vs. bacteria)  BLOSUM45   PAM250
Choose a matrix that is consistent with the level of sequence identity you are investigating, i.e., if you are looking at/for more closely related sequences, use BLOSUM90. If you are not sure, use BLOSUM62.
Sequence Alignments: General Concepts Global Alignment: Tries to match the entire length of both sequences. Local Alignment: Tries to find the best-matching section (subsequence) shared by the two sequences. Both are solved by dynamic programming: precise but slow.
Global Alignment
Input: two sequences over the same alphabet (either nucleotide or amino acid sequences)
Output: the alignment of the sequences
Example: GADEGYFGPVILAADGEVA and GGAEGDYFGPAIAEGEVA
A possible alignment might look like this:
  -GADEG-YFGPVILAADGEVA
  GGA-EGDYFGPAI--AEGEVA
  i  d  i    m dd m
(i = ins, d = del, m = mut)
Global Alignment – A Simple Scoring Scheme
Each position is scored independently: Match: +1, Mismatch: -1, Insertions or deletions (gaps): -2. The alignment score is the sum of the position scores.
  -GADEG-YFGPVILAADGEVA
  GGA-EGDYFGPAI--AEGEVA
Global Alignment Score: (14 × (+1)) + (5 × (-2)) + (2 × (-1)) = 2
  -----GADEG-YFGPVILAADGEVA---
  DLGNVGA-EGDYFGPAI--AEGEVARPL
Global Alignment Score: (14 × (+1)) + (12 × (-2)) + (2 × (-1)) = -12
  -----GADEG-YFGPVILAADGEVA---
  dlgnvGA-EGDYFGPAI--AEGEVArpl
Local Alignment Score: (14 × (+1)) + (4 × (-2)) + (2 × (-1)) = 4
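The three scores on this slide can be reproduced with a few lines of code. This is a minimal sketch of the simple scheme (match +1, mismatch -1, gap -2); the function name `score_alignment` is illustrative, and the local score is obtained here by simply slicing off the unaligned ends.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum independent per-column scores of an already aligned pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


short_global = score_alignment("-GADEG-YFGPVILAADGEVA",
                               "GGA-EGDYFGPAI--AEGEVA")

top = "-----GADEG-YFGPVILAADGEVA---"
bottom = "DLGNVGA-EGDYFGPAI--AEGEVARPL"
long_global = score_alignment(top, bottom)

# Local alignment: the overhanging ends (columns 0-4 and 25-27) are not scored.
local = score_alignment(top[5:25], bottom[5:25])

print(short_global, long_global, local)
```

The comparison shows why local alignment is preferred when one sequence has extra residues at its ends: the global score is dominated by end-gap penalties.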
Matrices and Gap Costs The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b). Your total raw score for the alignment is reduced when you introduce gaps into the query sequence. Calculate the score in BLOSUM-62 for a gap with 7 residues… http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
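The exercise can be sketched in code. The slide does not give values for a and b, so the defaults below (existence a = 11, extension b = 1) are an assumption based on the gap costs NCBI protein BLAST normally pairs with BLOSUM62; substitute whatever values your search actually uses.

```python
def affine_gap_score(k: int, existence: int = 11, extension: int = 1) -> int:
    """Score of a gap of k residues under affine gap costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(existence + extension * k)


# Assumed costs: a = 11 (existence), b = 1 (per residue).
print(affine_gap_score(7))   # gap of 7 residues
print(affine_gap_score(1))   # gap of 1 residue, i.e. -(a + b)
```

With these assumed costs, one long gap is much cheaper than several short ones: a single 7-residue gap costs 18, while seven separate 1-residue gaps would cost 84.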
Global Sequence Alignments Global Alignment: Entire sequence of each protein or DNA. Needleman and Wunsch (1970) Reduces problem to series of smaller alignments on a residue-by-residue basis. How this approach works Setting up a matrix Score the matrix ID the optimal alignment
Global Sequence Alignments: Setting up a Matrix Create a 2D matrix of the 2 sequences to align. [Figure: four Seq 1 vs. Seq 2 matrix panels: Perfect Alignment; Mismatch Alignment (lower score); Deletion, Seq 2; Insertion, Seq 2]
Global Sequence Alignments: Setting up a Matrix In a simple identity matrix, matches are scored as (+1), everything else is (0). Here you can see how the BLOSUM62 Scoring Matrix is applied to replace the simple matrix. [Figure: two Seq 1 vs. Seq 2 panels: Simple Identity Matrix; BLOSUM62 Scoring Matrix]
Global Seq. Alignments: Identity to Scoring Matrix We need to find a way to convert the identity matrix into a meaningful scoring system (match, mismatch, gap in 1 or 2). [Figure: two Seq 1 (i) vs. Seq 2 (j) panels: Simple Identity Matrix; Needleman-Wunsch-Sellers Scoring Matrix]
Global Seq. Alignments: Identity to Scoring Matrix Gap penalty values, matches, coordinate system. Match = +1, Else = -2. [Figure: Needleman-Wunsch-Sellers Scoring Matrix with axes Seq 1 (i) and Seq 2 (j), gap penalties along both edges, and matches marked]
Global Seq. Alignments: Scoring Matrix Calculations [Figure: Needleman-Wunsch-Sellers Scoring Matrix, Seq 1 (i) vs. Seq 2 (j), with +1 entered at position 1,1]
Calculate
  $M_{i,j} = \max[\, M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,]$
where the first term is the match/mismatch in the diagonal, the second is a gap in sequence #1, and the third is a gap in sequence #2. Note that in the example, $M_{i-1,j-1}$ will be red, $M_{i,j-1}$ will be blue and $M_{i-1,j}$ will be green. Using this information, the score at position 1,1 (i, j) in the matrix can be calculated. Since the first residue in both sequences is an F, $S_{1,1} = +1$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{i,j} = \max[M_{i-1,j-1} + 1,\; M_{i,j-1} - 2,\; M_{i-1,j} - 2] = \max[+1, -4, -4] = +1$. The MAX function means we retain the highest (MAX) score of all possible scores.
Global Seq. Alignments: Scoring Matrix Calculations [Figure: Needleman-Wunsch-Sellers Scoring Matrix, Seq 1 (i) vs. Seq 2 (j), with +1 at position 1,1 and -1 at position 1,2]
Calculate
  $M_{i,j+1} = \max[\, M_{i-1,j} + S_{i,j+1},\; M_{i,j} + w,\; M_{i-1,j+1} + w \,]$
where the first term is the match/mismatch in the diagonal, the second is a gap in sequence #1, and the third is a gap in sequence #2. The score at position 1,2 (i, j+1) in the matrix can be calculated. Since the residues are mismatched, $S_{i,j+1} = -2$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{i,j+1} = \max[M_{i-1,j} - 2,\; M_{i,j} - 2,\; M_{i-1,j+1} - 2] = \max[-4, -1, -6] = -1$.
Global Seq. Alignments: Scoring Matrix Calculations [Figure: Needleman-Wunsch-Sellers Scoring Matrix, Seq 1 (i) vs. Seq 2 (j), with +1 at position 1,1 and -1 at position 2,1]
Calculate
  $M_{i+1,j} = \max[\, M_{i,j-1} + S_{i+1,j},\; M_{i+1,j-1} + w,\; M_{i,j} + w \,]$
where the first term is the match/mismatch in the diagonal, the second is a gap in sequence #1, and the third is a gap in sequence #2. The score at position 2,1 (i+1, j) in the matrix can be calculated. Since the residues are mismatched, $S_{i+1,j} = -2$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{i+1,j} = \max[M_{i,j-1} - 2,\; M_{i+1,j-1} - 2,\; M_{i,j} - 2] = \max[-4, -6, -1] = -1$.
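The three hand calculations above can be checked by filling the whole matrix. The following is a minimal sketch, assuming the scoring used on these slides (match +1, everything else -2) and the two example sequences from the completed alignment slide (Seq 1 = FKHMEDPLE, Seq 2 = FMDTPLNE). The name `fill_matrix` is illustrative.

```python
def fill_matrix(seq1: str, seq2: str,
                match: int = 1, mismatch: int = -2, gap: int = -2):
    """Fill a global alignment score matrix; rows follow seq1 (i), columns seq2 (j)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    # First row and column hold cumulative gap penalties.
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          M[i][j - 1] + gap,     # gap in sequence 1
                          M[i - 1][j] + gap)     # gap in sequence 2
    return M


M = fill_matrix("FKHMEDPLE", "FMDTPLNE")
print(M[1][1], M[1][2], M[2][1])   # the three cells worked out by hand
print(M[-1][-1])                   # overall score of the optimal alignment
```

Each cell depends only on its upper, left, and upper-left neighbors, which is why the matrix can be filled row by row.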
Scored Matrix [Figure: completed scoring matrix, Seq 1 (i) vs. Seq 2 (j), with the overall score of the optimal alignment marked] Red arrows indicate the pathways to the calculated Max values.
Optimal Alignment: Trace-back Procedure [Figure: scored matrix, Seq 1 (i) vs. Seq 2 (j), with the trace-back starting point marked "Start here"] Trace-back arrows can only follow pathways identified when calculating Max values.
Completed Global Pairwise Alignment
  Seq 1 (i)  F K H M E D - P L - E
  Seq 2 (j)  F - - M - D T P L N E
Note that the final pairwise alignment score (-4) is equal to the value calculated from the total numbers of matches, mismatches, insertions and deletions.
Global Alignment Score: (6 × (+1)) + (5 × (-2)) = -4
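The full procedure (set up the matrix, score it, trace back) can be sketched as follows. This is a minimal illustration of Needleman-Wunsch with a linear gap penalty, assuming match +1 and everything else -2 as on the slides; it is not an optimized implementation. When several paths tie, the trace-back here prefers the diagonal, so the alignment it returns may differ from the one drawn on the slide while having the same score.

```python
def needleman_wunsch(seq1: str, seq2: str,
                     match: int = 1, mismatch: int = -2, gap: int = -2):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)

    # Trace back from the bottom-right cell, following moves that explain each score.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1      # gap in sequence 2
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1      # gap in sequence 1
    return M[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, aligned1, aligned2 = needleman_wunsch("FKHMEDPLE", "FMDTPLNE")
print(score)
print(aligned1)
print(aligned2)
```

The score agrees with the -4 obtained on the slide by counting matches and gaps.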
Local Sequence Alignment Local Alignment: Longest matching regions (subsets) between 2 sequences. Smith and Waterman Algorithm (1981) Scoring is similar to global alignment Set up a matrix Score the matrix No negative values allowed: If negative values are the only choices, then answer defaults to zero (0). Mismatches and gaps at ends score 0. ID the optimal alignment More sensitive but much slower than heuristic methods (FASTA, BLAST)
Smith and Waterman Local Sequence Alignment Can use any scoring matrix you want (ex. Substitute BLOSUM62) No negative values allowed: Default is 0 Alignment can start anywhere in sequence: not restricted to ends and no penalties at ends Trace-back starts with the highest number, works backwards the same as with global alignment Seq 2 G A A G A G T T T A A G
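The differences from global alignment (scores floored at 0, trace-back from the highest cell until a 0 is reached) can be sketched as follows. This is a minimal illustration assuming a simple match/mismatch scheme (+1 / -2) and a linear gap penalty of -2; the test sequences are made up for the example, and a real search would substitute a matrix such as BLOSUM62.

```python
def smith_waterman(seq1: str, seq2: str,
                   match: int = 1, mismatch: int = -2, gap: int = -2):
    """Local alignment; returns (best_score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # edges stay 0: no end penalties
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # No negative values allowed: default to 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i][j - 1] + gap, H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# Hypothetical sequences sharing the core GATTACA with unrelated flanks.
result = smith_waterman("XXXGATTACAYYY", "ZZGATTACAWW")
print(result)
```

Only the shared core is reported; the unrelated flanking residues are ignored rather than penalized.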
Heuristic (word or k-tuple based) algorithms Uses the initial query to make reasonable guesses about sequence alignments, then evaluates those considered "most likely". The alignment is then extended until one of the sequences ends or the score falls below some threshold. In BLAST, the search depends on word size.
  KENFDKARFSGTWYAMAKKDPEG  50  RBP (query)
  MKGLDIQKVAGTWYSLAMAASD.  44  lactoglobulin (hit)
  extend <- Hit! -> extend
FASTA (Pearson and Lippman 1988) Combines the Smith and Waterman algorithm with a word (k-tup) search: a faster, heuristic approach. The query sequence is divided into small words (usually k=2 for proteins). Words are used to initially compare and match sequences. If words are located on the same diagonal, the surrounding region is then selected for analysis.
  Seq 1  FYGKLHMEGD
  Seq 2  FWGKLHMEGSNE
  Seq 1 search words (k-tup = 2): FY YG GK KL LH HM ME EG GD
http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
FASTA (Pearson and Lippman 1988)
(a) Identify common k-words between sequences A and B.
(b) Score diagonals with k-word matches and identify the 10 best diagonals (dense regions of k-word overlap). Rescore initial regions with a substitution score matrix.
(c) Join initial regions using gaps; penalize for gaps.
(d) Perform dynamic programming to find final alignments.
http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
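The word search on this slide can be sketched in a few lines. This is a minimal illustration of the first FASTA step only (finding shared k-tup words and the diagonals they fall on), not the FASTA program itself; the function names are illustrative. A diagonal is identified here by the offset i - j between the word positions in the two sequences.

```python
from collections import Counter, defaultdict


def ktup_words(seq: str, k: int = 2) -> list[str]:
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diagonal_hits(seq1: str, seq2: str, k: int = 2) -> Counter:
    """Count shared k-tup words per diagonal (offset = position in seq1 - position in seq2)."""
    index = defaultdict(list)                 # word -> positions in seq2
    for j, word in enumerate(ktup_words(seq2, k)):
        index[word].append(j)
    hits = Counter()
    for i, word in enumerate(ktup_words(seq1, k)):
        for j in index.get(word, []):
            hits[i - j] += 1
    return hits


words = ktup_words("FYGKLHMEGD")
hits = diagonal_hits("FYGKLHMEGD", "FWGKLHMEGSNE")
print(words)
print(dict(hits))
```

For the two example sequences, all shared words (GK, KL, LH, HM, ME, EG) fall on the same diagonal, which is the kind of dense region FASTA would select for rescoring.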
Statistical Significance of Pairwise Alignments Is an alignment statistically significant, or are the similarities due to chance? How do we define significant? Statistics. Start with the Null Hypothesis (H0) that the 2 sequences are not related. Suggest an alternative hypothesis (H1) that the 2 sequences are related. Select an arbitrary threshold defining statistical significance (α = 0.05): the null hypothesis is rejected when a match this good would occur by chance less than 5% of the time, i.e., α is the accepted probability of rejecting H0 when it is actually true.
Statistical Mean and Standard Deviation Mean (average) is the sum of a set of numbers (x1 + x2 + … + xn), divided by the total instances in the set (n):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Standard Deviation (s) is the square root of the sum of the squared differences between each value (xi) and the sample mean (x-bar), divided by the total instances in the set (n):
  $s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Statistical Measures of Algorithms Objective of alignment algorithms is to maximize sensitivity and specificity of alignments. Sensitivity: Measure of how well algorithm correctly predicts sequences that are related. Specificity: Measure of how well algorithm correctly predicts sequences that are unrelated.
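Both measures are usually computed from counts of correct and incorrect predictions. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the counts in the example are hypothetical and only illustrate the arithmetic.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly related sequences that the algorithm reports as related."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of truly unrelated sequences that the algorithm reports as unrelated."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search: 100 true homologs and 1000 unrelated sequences in the database.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=950, false_pos=50)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

The two measures trade off against each other: loosening the score threshold finds more true homologs (higher sensitivity) but also admits more chance matches (lower specificity).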
Statistical Comparison of 2 Sequences Compare your alignment score against a large number of "random" sequences: many different proteins, randomly generated sequences, or scrambled variations of 1 of your 2 sequences. Calculate a Z score from the difference between the score of your aligned sequences (x) and the mean of the random sequences (μ), divided by the standard deviation of the random sequences (σ):
  $Z = (x - \mu) / \sigma$
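The Z score calculation can be sketched as follows. The helper `shuffled` shows one way to produce scrambled variations of a sequence; in practice each scrambled sequence would be aligned against the other sequence to obtain one random score. The random scores in the example are hypothetical numbers chosen to make the arithmetic easy to follow, and the standard deviation is computed by dividing by n.

```python
import random
import statistics


def shuffled(seq: str, rng: random.Random) -> str:
    """Return a scrambled copy of seq with the same residue composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def z_score(x: float, random_scores: list[float]) -> float:
    """Z = (x - mean of random scores) / standard deviation of random scores."""
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)   # divides by n
    return (x - mu) / sigma


# Hypothetical alignment scores obtained from 8 scrambled sequences.
random_scores = [2, 4, 4, 4, 5, 5, 7, 9]       # mean 5, standard deviation 2
z = z_score(10, random_scores)
print(z)

print(shuffled("FKHMEDPLE", random.Random(0)))  # one scrambled variation
```

Scrambling keeps the amino acid composition and length fixed, so any drop in score reflects the loss of residue order rather than a change in composition.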
Convert Z Score to Probability of Chance Alignment The Z score represents the distance between the sequence alignment score and the population mean, in units of SD, estimated from random sequences. The Z score can be converted to a probability. Example: For Z = 2.0, about 97.7% of all values in a normal distribution fall below 2.0 standard deviations above the mean, therefore your sequence score could occur by chance only about 2.3% of the time, which is significant at α = 0.05.
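The conversion can be sketched with the standard library alone. This assumes the random scores follow a normal distribution and uses a one-tailed test (only scores higher than expected count as evidence of relatedness); under those assumptions the probability for Z = 2.0 comes out at roughly 2.3%. The function name is illustrative.

```python
import math


def upper_tail_probability(z: float) -> float:
    """P(score >= mean + z * SD) for a normal distribution (one-tailed)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p = upper_tail_probability(2.0)
print(f"Z = 2.0 -> p = {p:.4f}")
print("significant at alpha = 0.05:", p < 0.05)
```

The normal approximation is a simplification: it is convenient for teaching, but it can understate how often high scores arise by chance when the score distribution has a long upper tail.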