Slide 1 (8/27/2015): Pairwise sequence alignment

Slide 2: Copyright notice. Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8), copyright © 2003 by John Wiley & Sons, Inc. Many slides in this presentation come from the slides of Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!

Slide 3: Pairwise sequence alignment is the most fundamental operation of bioinformatics
– It is used to decide if two proteins (or genes) are related structurally or functionally
– It is used to identify domains or motifs that are shared between proteins
– It is the basis of BLAST searching
– It is used in the analysis of genomes

Slide 5: Pairwise alignment: protein sequences can be more informative than DNA
– Protein is more informative (20 vs. 4 characters); many amino acids share related biophysical properties
– Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
– DNA sequences can be translated into protein, and then used in pairwise alignments

Slide 6: Pairwise alignment: protein sequences can be more informative than DNA
DNA can be translated into six potential proteins, one per reading frame:
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
– Forward-strand frames begin 5’ CAT CAA, 5’ ATC AAC, and 5’ TCA ACT
– Reverse-strand frames begin 5’ GTG GGT, 5’ TGG GTA, and 5’ GGG TAG
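
The six-frame translation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code; the helper names (`reverse_complement`, `translate`, `six_frames`) are made up for this example and do not come from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the slide uses the standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward frames followed by three reverse-strand frames."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(DNA):
    print(frame)
```

The first two residues of each frame correspond to the codon pairs listed on the slide (CAT CAA, ATC AAC, and so on).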

Slide 7: Definitions
Pairwise alignment: the process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Slide 8: Definitions (Page 44)
Homology: similarity attributed to descent from a common ancestor.
Example (figure): retinol-binding protein (RBP) (NP_006735) and β-lactoglobulin (P02754).

Slide 9: Definitions (Page 44)
Homology: similarity attributed to descent from a common ancestor.
Identity: the extent to which two (nucleotide or amino acid) sequences are invariant.
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Slide 10: Definitions: two types of homology
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: homologous sequences within a single species that arose by gene duplication.

Slide 12: Definitions
– Similarity: the extent to which nucleotide or protein sequences are related.
– Identity: the extent to which two sequences are invariant.
– Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
– Percent similarity of two protein sequences = percent identity + percent conservation.

Slide 14: Conservative substitutions
– Basic amino acids: K, R, H
– Acidic amino acids: D, E
– Hydroxylated amino acids: S, T
– Hydrophobic amino acids: W, F, Y, L, I, V, M, A
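
As a rough illustration of the definitions above, percent identity and percent similarity can be computed from two aligned rows using the four conservative groups on this slide. This is only a sketch: the function names are hypothetical, and dividing by the full alignment length (gap columns included) is an assumption, since the slides do not say which denominator to use.

```python
# Conservative substitution groups as listed on the slide.
CONSERVATIVE_GROUPS = [set("KRH"), set("DE"), set("ST"), set("WFYLIVMA")]


def is_conservative(a, b):
    """True if a and b differ but fall in the same physico-chemical group."""
    return a != b and any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def percent_identity_similarity(row1, row2, gap="-"):
    """Return (percent identity, percent similarity) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    identical = conserved = 0
    for a, b in zip(row1, row2):
        if gap in (a, b):
            continue  # a gap column is neither identical nor conserved
        if a == b:
            identical += 1
        elif is_conservative(a, b):
            conserved += 1
    n = len(row1)  # assumption: denominator is the alignment length
    return 100 * identical / n, 100 * (identical + conserved) / n


# Toy alignment: K/R and S/T are conservative, D/D and A/A identical, W/G neither.
print(percent_identity_similarity("KDSWA", "RDTGA"))
```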

Slide 15: Pairwise alignment of retinol-binding protein (RBP) and β-lactoglobulin
In the original figure, a vertical bar between the two rows marks an identical residue.

RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44

RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93

RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135

RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178

Slide 16: Pairwise alignment of retinol-binding protein and β-lactoglobulin (the alignment of slide 15)
Between the two rows, one dot (.) marks somewhat similar residues and two dots (:) mark very similar residues.

Slide 17: Gaps
Common mutations: substitutions, insertions, deletions.
Insertions and deletions lead to gaps in the alignment: positions at which a letter is paired with a null are called gaps.
– Gap scores are typically negative.
– Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
– In BLAST, it is rarely necessary to change gap values from the default.
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Slide 18: Pairwise alignment of retinol-binding protein and β-lactoglobulin (the alignment of slide 15)
Dots inside a sequence row are internal gaps; dots at the start or end of a row are terminal gaps.

Slide 19: General approach to pairwise alignment
– Choose two sequences
– Select an algorithm that generates a score
– Allow gaps (insertions, deletions)
– Score reflects degree of similarity
– Alignments can be global or local
– Estimate probability that the alignment occurred by chance

Slide 20: Key issues in pairwise alignment
– The scoring system
– Types of alignments (local vs. global)
– The alignment algorithm
– Measuring alignment significance

Slide 21: An alignment scoring system is required to evaluate how good an alignment is
– Positive and negative values assigned
– Gap creation and extension penalties
– Positive score for identities
– Some partial positive score for conservative substitutions
– Use of a substitution matrix

Slide 22: Scoring systems
– Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: PAM, BLOSUM.
– Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.
– Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

Slide 23: Gap scores
Gap scores are negative (they are “costs”).
– Linear: the gap score is proportional to the length of the gap: f(g) = -g · d
– Affine: the gap score is a constant plus a linear term: f(g) = -d - (g-1)e, for g > 0, where d is the gap-open penalty and e is the gap-extension penalty (normally e < d). With affine scores, many small gaps cost more than one large gap.
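
The two gap-score formulas can be compared directly. The sketch below uses illustrative penalty values (d = 10, e = 1) that are assumptions for the example, not values taken from the slides.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g * d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g - 1) * e for g > 0 (0 for no gap)."""
    if g == 0:
        return 0
    return -d - (g - 1) * e


# Assumed penalties for illustration: gap open d = 10, gap extend e = 1.
one_long_gap = affine_gap(4, d=10, e=1)        # one gap of length 4
four_short_gaps = 4 * affine_gap(1, d=10, e=1)  # four separate gaps of length 1
print(one_long_gap, four_short_gaps)
```

With these values the single long gap costs far less than four separate short gaps, which is the behaviour the slide describes for affine scoring.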

Slide 24: Calculation of an alignment score [figure]

Slide 25: Match score tables
Match scores are computed using a substitution matrix.

Slide 26: Log-odds match scores
Match scores are usually log-odds scores:
$s(a,b) = \log \dfrac{\Pr(a,b \mid M)}{\Pr(a,b \mid R)}$
– $\Pr(a,b \mid M) = p_{ab}$ is the probability of the residues a and b appearing, assuming the correct positions are aligned and the sequences descended from a common ancestor.
– $\Pr(a,b \mid R)$ is the probability of a and b if we pick two random sequence positions to align. Assuming independence (the null model) gives $\Pr(a,b \mid R) = q_a q_b$.

Slide 27: Adding log-odds substitution scores gives the log-odds of the alignment
For ungapped alignments, the log-odds alignment score of two sequence segments x and y is equal to the log-odds ratio of the segments. Let $x = x_1 x_2 \ldots x_n$ and $y = y_1 y_2 \ldots y_n$.
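
The equation on this slide did not survive extraction. Using the definitions of $s(a,b)$, $p_{ab}$ and $q_a$ given above, and assuming that aligned positions are independent of one another, the statement can be written out as follows (a reconstruction from those definitions, not the slide's own typesetting):

```latex
\[
S(x,y) \;=\; \sum_{i=1}^{n} s(x_i, y_i)
       \;=\; \sum_{i=1}^{n} \log \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}}
       \;=\; \log \frac{\prod_{i=1}^{n} p_{x_i y_i}}{\prod_{i=1}^{n} q_{x_i}\, q_{y_i}}
       \;=\; \log \frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)}
\]
```

The sum of logarithms becomes the logarithm of a product, which is why additive scoring corresponds to multiplying per-position probabilities.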

Slide 28: Substitution matrix
– A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids.
– Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
– Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
– The two major types of substitution matrices are PAM and BLOSUM.

Slide 29: Dayhoff model: accepted point mutations
Accepted point mutation (PAM): a replacement of one amino acid in a protein by another residue that has been accepted by natural selection.
A PAM occurs when:
– A gene undergoes a DNA mutation such that it encodes a different amino acid
– The entire species adopts that change as the predominant form

Slide 30: Dayhoff’s 34 protein superfamilies

Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin α      12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00

Slide 31: Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins?
[Figure: numbers of accepted point mutations, multiplied by 10, in 1572 cases]

Slide 32: The relative mutability of amino acids
Dayhoff et al. described the “relative mutability” of each amino acid as the probability that the amino acid will change over a small evolutionary time period. The total number of changes is counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also counted. A ratio is determined:
Relative mutability ∝ [changes] / [occurrences]
Example:
sequence 1: ala his val ala
sequence 2: ala arg ser val
For ala, relative mutability = [1] / [3] = 0.33
For val, relative mutability = [2] / [2] = 1.0
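
The worked example on this slide can be reproduced with a short counting routine. This is a simplified sketch for a single pair of aligned sequences; Dayhoff's procedure counts changes over all branches of many protein trees, which is not modelled here, and the function name is hypothetical.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Changes / occurrences for each residue in one pair of aligned sequences."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # A mismatched column counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}


seq1 = ["ala", "his", "val", "ala"]
seq2 = ["ala", "arg", "ser", "val"]
mutability = relative_mutability(seq1, seq2)
print(round(mutability["ala"], 2), mutability["val"])
```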

Slide 33: The relative mutability of amino acids

Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18

Slide 34: Normalized frequencies of amino acids

Gly 8.9%    Arg 4.1%
Ala 8.7%    Asn 4.0%
Leu 8.5%    Phe 4.0%
Lys 8.1%    Gln 3.8%
Ser 7.0%    Ile 3.7%
Val 6.5%    His 3.4%
Thr 5.8%    Cys 3.3%
Pro 5.1%    Tyr 3.0%
Glu 5.0%    Met 1.5%
Asp 4.7%    Trp 1.0%

These values sum to 1.0 (100%). In the original slide, blue marks amino acids encoded by 6 codons and red marks those encoded by 1 codon.

Slide 35: PAM matrices: accepted point mutations
– PAM matrices are based on global alignments of closely related proteins.
– All the PAM data come from closely related proteins (>85% amino acid identity).
– Other PAM matrices are extrapolated from PAM1.

Slide 36: Dayhoff’s PAM1
PAM1: calculated from comparisons of sequences with no more than 1% divergence.
– Defined as an evolutionary interval.
– 1 PAM = PAM1 = 1% average change of all amino acid positions.
Value in PAM1, for one amino acid:
– Number of “accepted point mutations” × relative mutability × the fraction of changes from that amino acid to one particular other amino acid, out of all changes from that amino acid to any other amino acid.
– Normalization over all 20 changes.

Slide 37: Dayhoff’s PAM1 mutation probability matrix
Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side).

Slide 38: PAM
After 100 PAMs of evolution, not every residue will have changed:
– some residues may have mutated several times
– some residues may have returned to their original state
– some residues may not have changed at all
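
Extrapolating from PAM1 amounts to multiplying the one-step mutation probability matrix by itself. The sketch below uses a toy 4-letter alphabet with uniform replacement probabilities, which is an assumption made purely for illustration; it is not Dayhoff's 20 × 20 matrix, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def toy_pam1(size=4, change=0.01):
    """Toy one-step matrix: each letter changes with probability 1%,
    spread evenly over the other letters (illustrative assumption)."""
    off = change / (size - 1)
    return [[1 - change if i == j else off for j in range(size)]
            for i in range(size)]


pam100 = mat_pow(toy_pam1(), 100)
# Probability that a position shows its original letter after 100 steps,
# compared with the probability that it was never hit at all.
print(pam100[0][0], 0.99 ** 100)
```

The diagonal entry is larger than 0.99 ** 100 because some positions were hit and later returned to their original state, which is the point made in the list above.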

Slide 39: Dayhoff’s PAM0 mutation probability matrix: the rules for extremely slowly evolving proteins
Top: original amino acid. Side: replacement amino acid.

Slide 40: Dayhoff’s PAM2000 mutation probability matrix: the rules for very distantly related proteins
Top: original amino acid. Side: replacement amino acid. Figure annotation: G 8.9%.

Slide 41: PAM250 matrix
– Commonly used
– Describes the frequency of amino acid replacement between distantly related proteins
– Corresponds to an evolutionary distance where proteins share about 20% amino acid identity

Slide 42: PAM250 mutation probability matrix
Top: original amino acid. Side: replacement amino acid.

Slide 43: PAM250 log odds scoring matrix

Slide 44: How do we go from a mutation probability matrix to a log odds matrix?
The cells in a log odds matrix consist of an “odds ratio”:
(the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
$S(a,b) = 10 \log_{10}(M_{ab} / p_b)$
As an example, for tryptophan aligned with tryptophan:
$S(\mathrm{Trp}, \mathrm{Trp}) = 10 \log_{10}(0.55 / 0.010) = 17.4$

Slide 45: What do the numbers mean in a log odds matrix?
$S(\mathrm{Trp}, \mathrm{Trp}) = 10 \log_{10}(0.55 / 0.010) = 17.4$
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues. Write $x = M_{ab} / p_b$ for the odds ratio and set $S(a,b) = 17$. Then:
$17 = 10 \log_{10} x$
$1.7 = \log_{10} x$
$x = 10^{1.7} \approx 50$

Slide 46: What do the numbers mean in a log odds matrix?
– A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
– A score of 0 is neutral.
– A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
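
The conversion between a log odds score and its odds ratio can be checked numerically. A minimal sketch, using the formula and the tryptophan values quoted on the preceding slides; the function names are made up for this example.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(M_ab / p_b)."""
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score):
    """Invert the score: how many times more likely than chance."""
    return 10 ** (score / 10)


print(round(log_odds_score(0.55, 0.010), 1))  # tryptophan example from the slides
for s in (17, 2, 0, -10):
    print(s, round(odds_ratio(s), 2))
```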

Slide 47: Comparing two proteins with a PAM1 matrix gives completely different results than PAM250!
Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match:

hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC

A PAM250 matrix is very tolerant of mismatches:

24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%

hsrbp,  26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN

hsrbp,  86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI

Slide 49: PAM: “accepted point mutation”
– Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)
– Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related.
– PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968).

Slide 50: Substitutions between an ancestral sequence and two descendant sequences (Li (1997) p.70)
Ancestral sequence: ACCCTAC
Sequence 1:         ACCGATC
Sequence 2:         AATAATC

Position   Sequence 1       Type                         Sequence 2
1          A                no change                    A
2          C                single substitution          C --> A
3          C                multiple substitutions       C --> A --> T
4          C --> G          coincidental substitutions   C --> A
5          T --> A          parallel substitutions       T --> A
6          A --> C --> T    convergent substitutions     A --> T
7          C                back substitution            C --> T --> C

Slide 51: Two randomly diverging protein sequences change in a negatively exponential fashion
[Figure: percent identity versus evolutionary distance in PAMs; the “twilight zone” is marked]

Slide 52: [Figure: percent identity versus differences per 100 residues; the “twilight zone” is marked]
– At PAM1, two proteins are 99% identical
– At PAM10.7, there are 10 differences per 100 residues
– At PAM80, there are 50 differences per 100 residues
– At PAM250, there are 80 differences per 100 residues

Slide 53: PAM matrices reflect different degrees of divergence
[Figure: PAM250 is highlighted]

Slide 54: PAM250
– PAM250 is in the “twilight zone”.
– At this level of divergence, it is difficult to assess whether the two proteins are homologous.
– Other techniques may be used:
  – Multiple sequence alignment
  – Structural alignment

Slide 55: Comments
– Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences.
– Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.

Slide 56: Henikoff and Henikoff: BLOSUM
BLOSUM: Blocks Substitution Matrices
– Henikoff and Henikoff constructed these matrices using multiple alignments of evolutionarily divergent proteins.
– The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments.
– These conserved sequences are assumed to be of functional importance within related proteins.

Slide 57: BLOSUM matrices
– The BLOCKS database contains thousands of groups of multiple sequence alignments.
– The BLOSUM62 matrix is calculated from the proteins in the BLOCKS database after sequences that share 62% identity or more are clustered and counted as a single sequence, so the observed substitutions are between proteins that share less than 62% identity.
– For the BLOSUM100 matrix, only sequences showing 100% identity are clustered.
– One would use a higher numbered BLOSUM matrix for aligning two closely related sequences and a lower number for more divergent sequences.
– BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Slide 58: BLOSUM matrices
[Figure: percent amino acid identity scale marked at 100, 62, and 30, with the “collapse” region for BLOSUM62]

Slide 59: BLOSUM matrices
[Figure: the same percent amino acid identity scale (100, 62, 30) with the “collapse” regions for BLOSUM80, BLOSUM62, and BLOSUM30]

Slide 60: BLOSUM62 scoring matrix

Slide 61: Differences between PAM and BLOSUM
– PAM matrices are based on an explicit evolutionary model (i.e., replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of evolution.
– The PAM matrices are based on mutations observed throughout a global alignment; this includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps.
– The method used to count the replacements is different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.
– Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100; BLOSUM62 is used for closer sequences than BLOSUM50.
Source: http://en.wikipedia.org/wiki/Substitution_matrix

Slide 62: [Figure: two comparisons, rat versus mouse RBP and rat versus bacterial lipocalin]

Slide 63: Measuring alignment significance
The statistical significance of an alignment score is used to try to determine whether an alignment is the result of homology or just random chance.

Slides 64 to 66:

                                   Homologous sequences    Non-homologous sequences
Sequences reported as related      True positives          False positives
Sequences reported as unrelated    False negatives         True negatives

– Sensitivity: ability to find true positives
– Specificity: ability to minimize false positives
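
In terms of the four cells of this table, sensitivity and specificity are usually computed as the ratios below. These are the standard textbook formulas rather than anything stated on the slide, and the counts in the example are invented for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of truly homologous sequences that are reported as related."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologous sequences that are reported as unrelated."""
    return tn / (tn + fp)


# Invented counts: 100 homologs in the database, 1000 non-homologs.
print(sensitivity(tp=90, fn=10), specificity(tn=950, fp=50))
```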

Slide 67: Randomization test: scramble a sequence
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
– First compare two proteins and obtain a score.
– Next scramble the bottom sequence 100 times, and obtain 100 “randomized” scores (mean +/- S.D.). Composition and length are maintained.
– If the comparison is “real”, we expect the authentic score to be several standard deviations above the mean of the “randomized” scores.

Slide 68: A randomization test shows that RBP is significantly related to β-lactoglobulin
[Figure: histogram of quality score versus number of instances]
– 100 random shuffles: mean score = 8.7, std. dev. = 4.2
– Real comparison: score = 37
But this test assumes a normal distribution of scores!
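
The scramble test can be sketched as follows. The scoring function is passed in as a parameter because the slides do not specify which alignment program produced the scores; the `count_matches` function used in the demonstration is a deliberately crude stand-in, and all names here are hypothetical.

```python
import random
import statistics


def z_score(real, mean, sd):
    """Number of standard deviations the real score lies above the shuffled mean."""
    return (real - mean) / sd


def shuffle_test(seq1, seq2, score_fn, n_shuffles=100, seed=0):
    """Score seq1 against seq2, then against shuffled copies of seq2.
    Shuffling keeps the composition and length of seq2 unchanged."""
    rng = random.Random(seed)
    real = score_fn(seq1, seq2)
    residues = list(seq2)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(score_fn(seq1, "".join(residues)))
    return real, statistics.mean(shuffled_scores), statistics.stdev(shuffled_scores)


def count_matches(a, b):
    """Crude stand-in score: identical residues in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


# Numbers quoted on the slide: real score 37, shuffled mean 8.7, S.D. 4.2.
print(round(z_score(37, 8.7, 4.2), 2))
```

As the slide warns, reading this value as a significance level assumes the shuffled scores are normally distributed.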

Slide 69: For the alignment of RBP to β-lactoglobulin: [figure]

Slide 70: The PRSS program performs a scramble test for you (http://fasta.bioch.virginia.edu/fasta/prss.htm)
[Figure: score distribution, running from bad scores to good scores]
But these scores are not normally distributed!

Slide 71: Two kinds of sequence alignment: global and local
– We will first consider the global alignment algorithm of Needleman and Wunsch (1970).
– We will then explore the local alignment algorithm of Smith and Waterman (1981).
– Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail later.

Slide 72: Global alignment with the algorithm of Needleman and Wunsch (1970)
– Two sequences can be compared in a matrix along x- and y-axes.
– If they are identical, a path along a diagonal can be drawn.
– Find the optimal subpaths, and add them up to achieve the best score. This involves
  – adding gaps when needed
  – allowing for conservative substitutions
  – choosing a scoring system (simple or complicated)
– N-W is guaranteed to find optimal alignment(s).

Slide 73: The Needleman and Wunsch algorithm: dynamic programming
S. Needleman and C. Wunsch were the first to apply a dynamic programming approach to the problem of sequence alignment. The key to understanding the dynamic programming approach to sequence alignment lies in observing how the alignment problem is broken down into sub-problems.

Slide 74: Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix
[2] score the matrix
[3] identify the optimal alignment(s)

Slide 75: Global alignment with the algorithm of Needleman and Wunsch (1970)
Build a matrix F, indexed by i and j (the positions in the two sequences), where $F(i,j)$ is the score of the best alignment of $x_{1..i}$ and $y_{1..j}$.

Slide 76: Example: forward approach
Align sequences x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

Slide 77: DP in equation form
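
The equation itself was an image and is missing from the transcript. With F, s and d as defined on the previous slide, and taking d as a positive penalty that is subtracted, the standard Needleman-Wunsch recurrence for a linear gap penalty reads as follows (a reconstruction, not the slide's own typesetting):

```latex
\[
F(0,0) = 0, \qquad F(i,0) = -i\,d, \qquad F(0,j) = -j\,d,
\]
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(align } x_i \text{ with } y_j\text{)} \\
F(i-1,\,j) - d & \text{(align } x_i \text{ with a gap)} \\
F(i,\,j-1) - d & \text{(align } y_j \text{ with a gap)}
\end{cases}
\]
```

Each cell depends only on its upper-left, upper and left neighbours, which is what allows the matrix to be filled in a single pass.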

Slides 78 to 81: A simple example
Find the optimal alignment of AAG and AGC. Use a linear gap penalty of d = 5, so each gap position scores -5.

Substitution matrix s:
       A     C     G     T
A      2    -7    -5    -7
C     -7     2    -7    -5
G     -5    -7     2    -7
T     -7    -5    -7     2

DP matrix F (AAG across the top, AGC down the side). The first row and first column are initialized with multiples of the gap score, and the remaining cells are then filled in one by one:
             A     A     G
       0    -5   -10   -15
A     -5     2    -3    -8
G    -10    -3    -3    -1
C    -15    -8    -8    -6

Slides 82 to 84: Trace back
Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence.
– A horizontal move puts a gap in the left sequence.
– A vertical move puts a gap in the top sequence.
– A diagonal move uses one character from each sequence.
For the simple example (AAG versus AGC, each gap position scoring -5), the trace back starts at the final score of -6 and yields two optimal alignments:
AAG-      AAG-
-AGC      A-GC
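
The whole procedure (set up the matrix, fill it, trace back) can be sketched in Python for the simple example. This is an illustrative implementation under stated assumptions: the gap penalty is passed as a positive number d and subtracted, the substitution scores follow the pattern of the example matrix (2 for a match, -5 for a transition, -7 for a transversion), and the function names are hypothetical.

```python
def dna_scores(match=2, transition=-5, transversion=-7):
    """Substitution scores in the pattern of the example matrix."""
    purines = set("AG")
    table = {}
    for a in "ACGT":
        table[a] = {}
        for b in "ACGT":
            if a == b:
                table[a][b] = match
            elif (a in purines) == (b in purines):
                table[a][b] = transition
            else:
                table[a][b] = transversion
    return table


def needleman_wunsch(x, y, s, d):
    """Return (best score, set of optimal alignments) for linear gap penalty d > 0.
    x runs across the top of the matrix, y down the side."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s[x[j - 1]][y[i - 1]],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    def trace(i, j):
        """Collect every alignment whose path reaches cell (i, j) optimally."""
        if i == 0 and j == 0:
            return {("", "")}
        found = set()
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s[x[j - 1]][y[i - 1]]:
            found |= {(a + x[j - 1], b + y[i - 1]) for a, b in trace(i - 1, j - 1)}
        if j > 0 and F[i][j] == F[i][j - 1] - d:   # horizontal move: gap in y
            found |= {(a + x[j - 1], b + "-") for a, b in trace(i, j - 1)}
        if i > 0 and F[i][j] == F[i - 1][j] - d:   # vertical move: gap in x
            found |= {(a + "-", b + y[i - 1]) for a, b in trace(i - 1, j)}
        return found

    return F[m][n], trace(m, n)


score, alignments = needleman_wunsch("AAG", "AGC", dna_scores(), d=5)
print(score)
for top, side in sorted(alignments):
    print(top, side)
```

The recursive trace back is fine for short sequences like these; a production implementation would store pointers instead of re-deriving them.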

Slide 85: Needleman-Wunsch: dynamic programming
– N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.
– An optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

Slide 86: Commercial tools for sequence alignment: Accelrys Genetics Computer Group (GCG)
– Formerly known as the GCG Wisconsin Package
– GCG contains over 140 programs and utilities covering the cross-disciplinary needs of today’s research environment.
http://www.accelrys.com/products/gcg/

Slide 87: Global alignment versus local alignment
– Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other.
– Local alignment finds optimally matching regions within two sequences (“subsequences”).
– Local alignment is almost always used for database searches such as BLAST. It is useful for finding domains (or limited regions of homology) within sequences.
– Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Slide 88: Smith-Waterman algorithm
Similar to Needleman-Wunsch, but:
– Start a new alignment when encountering a negative score
– The alignment can end anywhere in the matrix
– Trace back starts from the entry with the highest score in the matrix, and terminates at an entry with score 0

Slide 89: How the Smith-Waterman algorithm works
– Set up a matrix between two proteins (size m+1, n+1)
– No values in the scoring matrix can be negative! S ≥ 0
– The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i, j-1) - gap penalty
[3] s(i-1, j) - gap penalty
[4] zero

Slide 90: Local alignment: example
Find the optimal local alignment of AAG and GAAGGC. Use a linear gap penalty of d = 5, so each gap position scores -5.

Substitution matrix:
       A     C     G     T
A      2    -7    -5    -7
C     -7     2    -7    -5
G     -5    -7     2    -7
T     -7    -5    -7     2

Local alignment matrix (AAG across the top, GAAGGC down the side):
           A    A    G
      0    0    0    0
G     0    0    0    2
A     0    2    2    0
A     0    2    4    0
G     0    0    0    6
G     0    0    0    2
C     0    0    0    0
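
The local alignment example can be reproduced with a short Smith-Waterman sketch. As before, this is an illustration under assumptions: the gap penalty is a positive d that is subtracted, the substitution scores follow the example matrix (2 for a match, -5 for a transition, -7 for a transversion), only one best local alignment is traced back, and the function names are hypothetical.

```python
def dna_scores(match=2, transition=-5, transversion=-7):
    """Substitution scores in the pattern of the example matrix."""
    purines = set("AG")
    return {a: {b: match if a == b
                else transition if (a in purines) == (b in purines)
                else transversion
                for b in "ACGT"}
            for a in "ACGT"}


def smith_waterman(x, y, s, d):
    """Return (best score, aligned x, aligned y, matrix) for linear gap penalty d > 0.
    x runs across the top of the matrix, y down the side."""
    n, m = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s[x[j - 1]][y[i - 1]],
                          H[i - 1][j] - d,
                          H[i][j - 1] - d)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with score 0.
    i, j = best_pos
    top, side = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s[x[j - 1]][y[i - 1]]:
            top.append(x[j - 1]); side.append(y[i - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] - d:
            top.append(x[j - 1]); side.append("-"); j -= 1
        else:
            top.append("-"); side.append(y[i - 1]); i -= 1
    return best, "".join(reversed(top)), "".join(reversed(side)), H


score, top, side, H = smith_waterman("AAG", "GAAGGC", dna_scores(), d=5)
print(score, top, side)
for row in H:
    print(row)
```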

Slide 91: [Figure: example scored with match = 1, mismatch = -1/3, gap = -4/3]

Slide 92: Extended Smith & Waterman
To get multiple local alignments:
– delete regions around the best path
– repeat backtracking

Slide 93: Heuristic versions of Smith-Waterman: FASTA and BLAST
– Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment.
– But Smith-Waterman is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the lengths of a query and an entire database).
– Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space.
– FASTA and BLAST provide rapid alternatives to S-W.

Slide 94: FASTA idea
Idea: a good alignment probably matches some identical “words” (ktups).
Example, with matching words of size 4:
Database record:        ACTTGTAGATACAAAATGTG
Aligned query sequence: A-TTGTCG-TACAA-ATCTGT

Slide 95: FASTA stage I
A “lookup table” of database words is created. It consists of short stretches of amino acids; each stretch is a k-tuple, and its length is the word size (ktup).
Upon query, for each DB record:
– Find matching words
– Search for long diagonal runs of matching words
– Find the ten highest scoring segments that align to the query
[Figure: dot plot of position in query versus position in DB record; * = matching word]
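
The word lookup and diagonal search of stage I can be sketched as follows, using the DNA record and query from the FASTA idea slide (with the gap characters removed from the query). This is a simplified illustration, not the FASTA program's actual implementation; it only counts word hits per diagonal and does not score or join segments, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_lookup(record, k):
    """Map every word of length k in the record to its start positions."""
    table = defaultdict(list)
    for pos in range(len(record) - k + 1):
        table[record[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, record, k):
    """Count matching words per diagonal (record position minus query position)."""
    table = build_lookup(record, k)
    hits = Counter()
    for q in range(len(query) - k + 1):
        for r in table.get(query[q:q + k], []):
            hits[r - q] += 1
    return hits


record = "ACTTGTAGATACAAAATGTG"
query = "ATTGTCGTACAAATCTGT"   # the slide's query with '-' characters removed
for diagonal, count in sorted(diagonal_hits(query, record, k=4).items()):
    print(diagonal, count)
```

Several hits on the same diagonal indicate a run of consecutive matching words, which is the kind of region stage I keeps for re-scoring.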

Slide 96: FASTA stages II and III
II. These ten aligned regions are re-scored with a PAM or BLOSUM matrix.
III. High-scoring segments are joined.

Slide 97: FASTA final stage
Apply an exact algorithm to surviving records, computing the final alignment score.
– Usually, Needleman-Wunsch or Smith-Waterman is then performed.

Slide 98: Advantage
The FASTA program can search the NBRF protein sequence library (2.5 million residues) in less than 20 min on an IBM-PC microcomputer.
Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63-98.

Slide 99: Limits
– Local similarity might be missed because only 10 regions are saved at the init1 stage.
– Non-identical conserved stretches may be overlooked.
