In-Class Assignment #1: Research CD2

In-Class Assignment #1: Research CD2 Follow instructions on distributed assignment sheet

Biology 4900 Biocomputing

Pairwise Sequence Alignment Chapter 3 Pairwise Sequence Alignment

Pairwise Alignment Potential relationships between proteins or nucleic acids can be explored by comparing 2 or more sequences of amino acids or nucleotides. Difficult to do visually. Computer algorithms help us by: Accelerating the comparison process Allowing for “gaps” or indels in sequences (i.e., insertions, deletions) Identifying substituted amino acids that are structurally or functionally similar (D and E). One way to do this is with BLAST (Basic Local Alignment Search Tool) Allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible. BLAST lets user select from a variety of scoring matrices to evaluate sequence relatedness. Pevsner, Bioinformatics and Functional Genomics, 2009

Sequence Analyses: RNA Codons (3 RNA bases in sequence) determine each amino acid that will build the protein expressed Many amino acids are encoded by more than 1 codon (change in 3rd base).  Change of single base may not be significant.

Comparing protein sequences Comparing protein sequences is usually more informative than comparing nucleotide sequences. Changing the base at the 3rd position in a codon does not always change the AA (Ex: Both UUU and UUC encode phenylalanine). Different AAs may share similar chemical properties (Ex: hydrophobic residues A, V, L, I). Relationships between related but mismatched AAs in sequence analysis can be accounted for using scoring systems (matrices). Protein sequence comparisons can ID sequence homologies from proteins sharing a common ancestor as far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).

Amino acids by similar biophysical properties http://kimwootae.com.ne.kr/apbiology/chap2.htm

Amino acids by similar biophysical properties These have useful fluorescent properties http://kimwootae.com.ne.kr/apbiology/chap2.htm

Sequence Identity and Similarity Identity: How closely two sequences match one another. Unlike homology, identity can be measured quantitatively. Similarity: Pairs of residues that are structurally or functionally related (conservative substitutions).

>lcl|28245 3CLN:A|PDBID|CHAIN|SEQUENCE Length=148
Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

88% of the aligned positions hold identical amino acids (Identities). This rises to 97% (Positives) when positions holding different amino acids with similar properties are also counted. Pevsner, Bioinformatics and Functional Genomics, 2009
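The Identities and Positives counts in the BLAST report above can be recovered from the middle (match) line alone, where a letter marks an identical pair, a "+" marks a conservative substitution, and a space marks neither. The sketch below is only an illustration of that bookkeeping; the function name is hypothetical and it does not reproduce how BLAST itself scores the alignment.

```python
def identity_and_positives(midline: str) -> tuple[int, int, int]:
    """Count identities and positives from a BLAST-style match line.

    Letters are identical pairs, '+' are conservative substitutions,
    spaces are non-conservative mismatches.
    """
    identities = sum(ch.isalpha() for ch in midline)
    positives = identities + midline.count("+")
    return identities, positives, len(midline)


# The three match lines from the report, joined without line breaks.
midline = (
    "A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"
    "GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E"
    "VDEMIREA+IDGDG +NYEEFV+MM +K"
)

ident, pos, length = identity_and_positives(midline)
print(f"Identities = {ident}/{length} ({ident / length:.0%})")
print(f"Positives  = {pos}/{length} ({pos / length:.0%})")
```

The counts come out as 130/148 and 143/148, matching the percentages in the report.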

Sequence Homology Homology: Two sequences are homologous if they share a common ancestor. No “degrees of homology”: only homologous or not Almost always share similar 3D structure Ex. myoglobin and beta globin Sequences can change significantly over time, but 3D structure changes more slowly Beta-globin sub-unit of adult hemoglobin (2H35.pdb, in blue), superimposed over myoglobin (3RGK.pdb, in red). These sequences probably separated 600 million years ago. Pevsner, Bioinformatics and Functional Genomics, 2009

Percent Identity and Homology For an alignment of 70 amino acids, 40% sequence identity is a reasonable threshold for homology. Above 20% (more than 70 amino acids) may indicate homology. Below 20% probably indicates chance alignment. Pevsner, Bioinformatics and Functional Genomics, 2009

Orthologs and Paralogs Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation. Ex. Humans and rats diverged around 80 million years ago, which is when their myoglobin genes diverged. Orthologs frequently have similar biological functions. Human and rat myoglobin (oxygen transport) Human and rat CaM Paralogs: Homologous sequences that arose by a mechanism such as gene duplication. Within same organism/species Ex. Myoglobin and beta globin are paralogs. Have distinct but related functions. Pevsner, Bioinformatics and Functional Genomics, 2009

Conservative Substitutions in Matrices Scoring may also vary based on conserved substitutions of amino acids: i.e., amino acids with similar properties will not lose as many points as AAs with very different properties. Basic AAs: K, R, H Acidic AAs: D, E Hydroxylated AAs: S, T Hydrophobic AAs: G, A, V, L, I, M, F, P, W, Y These relationships would be considered when calculating “Positives” in BLAST alignment. Pevsner, Bioinformatics and Functional Genomics, 2009

Dayhoff Model: Building a Scoring Matrix In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix. The model was based on rules by which evolutionary changes occur in proteins. Catalogued thousands of proteins and considered which specific amino acid substitutions occurred when 2 homologous proteins were aligned. Assumes substitution patterns in closely-related proteins can be extrapolated to more distantly-related proteins. An accepted point mutation (PAM) is an AA replacement accepted by natural selection. Based on observed mutations, not necessarily on related AA properties. Probable mutations are rewarded, while unlikely mutations are penalized. Scores for comparison of 2 residues (i, j) are based on the following log-odds equation: $s_{i,j} = 10 \times \log_{10}(q_{i,j} / p_i)$. Here, $q_{i,j}$ is the probability of an observed substitution (from the mutation probability matrix), while $p_i$ is the likelihood of observing the replacement AA (i) as a result of chance (from the normalized frequency of AA table). Pevsner, Bioinformatics and Functional Genomics, 2009

PAM250 Mutation Probability Matrix Original AA Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val A R N D C Q E G H I L K M F P S T W Y V Ala A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11 11 11 2 4 9 Arg R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4 4 3 7 2 2 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4 5 4 2 3 3 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4 5 5 1 2 3 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2 3 2 1 4 2 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4 3 3 1 2 3 Glu E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4 5 5 1 2 3 Gly G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8 11 9 2 3 7 His H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3 3 2 2 3 2 Ile I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2 3 4 1 3 9 Leu L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5 4 6 6 7 13 Lys K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6 8 8 4 3 5 Met M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1 1 1 1 1 2 Phe F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1 2 2 4 20 3 Pro P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20 6 5 1 2 4 Ser S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9 10 9 4 4 6 Thr T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6 8 11 2 3 6 Trp W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 55 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1 2 2 3 31 2 Val V 7 4 4 4 4 4 4 4 5 4 15 10 4 10 5 5 5 72 4 17 Replacement AA Think of these values as percentages (columns sum to 100). For example, there is an 18% (0.18) probability of R being replaced by K. This probability matrix needs to be converted into a scoring matrix. http://www.icp.ucl.ac.be/~opperd/private/pam250.html

Normalized Frequencies of Amino Acids **How often a given amino acid appears in a protein (determined by empirical analyses) http://www.icp.ucl.ac.be/~opperd/private/pam250.html

Purpose of PAM Matrices Derive a scoring system to determine relatedness of 2 sequences. PAM mutation probability matrix must be converted to a scoring matrix (log odds matrix).
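As a rough illustration of that conversion, the sketch below turns one mutation probability into a log-odds score as 10 × log10(probability / background frequency), rounded to the nearest integer. The function name is hypothetical, and the two background frequencies (Trp 0.010, Cys 0.033) are assumed from Dayhoff's commonly tabulated normalized frequencies, since the frequency table itself is not reproduced in this transcript.

```python
import math


def log_odds_score(q_ij: float, p_i: float) -> int:
    """Log-odds score: 10 * log10(observed probability / chance frequency)."""
    return round(10 * math.log10(q_ij / p_i))


# Probabilities read from the PAM250 mutation probability matrix (as fractions).
# Background frequencies are ASSUMED values, not taken from this transcript.
trp_trp = log_odds_score(q_ij=0.55, p_i=0.010)  # Trp stays Trp 55% of the time
cys_cys = log_odds_score(q_ij=0.52, p_i=0.033)  # Cys stays Cys 52% of the time

print("Trp/Trp score:", trp_trp)
print("Cys/Cys score:", cys_cys)
```

Under those assumed frequencies the two diagonal scores agree with the Trp/Trp and Cys/Cys entries of the PAM250 log-odds matrix (17 and 12).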

PAM250 Log-Odds Matrix

Cys C  12
Ser S   0   2
Thr T  -2   1   3
Pro P  -3   1   0   6
Ala A  -2   1   1   1   2
Gly G  -3   1   0  -1   1   5
Asn N  -4   1   0  -1   0   0   2
Asp D  -5   0   0  -1   0   1   2   4
Glu E  -5   0   0  -1   0   0   1   3   4
Gln Q  -5  -1  -1   0   0  -1   1   2   2   4
His H  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
Lys K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
Val V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
        C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
      Cys Ser Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr Trp

This is the PAM250 scoring (log-odds) matrix, calculated from the mutation probability matrix and the normalized amino acid frequencies. http://www.icp.ucl.ac.be/~opperd/private/pam250.html

Pairwise Alignment and Homology Think of the PAM value as the total number of mutations. This includes multiple mutations over time at a single position. Currently, we accept that once the percent distance reaches ~85%, homology is indeterminate. PAM250 works best for more distantly related protein sequences.

Seq1 AGDFWYGGDGEYLLV
Seq2 AGQFWYGGEGEKLLV
Seq3 AGEFWYGGEGEKLLV

Seq1 and Seq2 are separated by 3 PAM units, while Seq1 and Seq3 are separated by 4 PAM units. Both pairs differ at only 3 positions, so the fourth unit presumably reflects a second mutation at position 3 (D to Q, then Q to E). http://www.icp.ucl.ac.be/~opperd/private/pam.html

Practical Lessons from the Dayhoff Model Less mutable amino acids likely play more important structural and functional roles Mutable amino acids fulfill functions that can be filled by other amino acids with similar properties Common substitutions tend to require only a single nucleotide change in codon Amino acids that can be created from more than 1 codon are more likely to be created as a substitute (See p. 63, textbook) Changes to sequence that do not alter structure and function of protein likely to be more tolerated in nature Pevsner, Bioinformatics and Functional Genomics, 2009

BLOSUM62 Scoring Matrix BLOck SUbstitution Matrix, by Henikoff and Henikoff (1992). Default scoring matrix for pairwise alignment of protein sequences using BLAST (local alignments). Based on empirical observations of distantly-related proteins organized into blocks (ungapped conserved regions). In BLOSUM62, sequences within a block that share at least 62% identity are clustered and counted as a single sequence before substitution frequencies are tallied. Pevsner, Bioinformatics and Functional Genomics, 2009

General Trends in Scoring Matrices

Less divergent (e.g., human vs. chimp): BLOSUM90, PAM30
Intermediate: BLOSUM62, PAM120
More divergent (e.g., human vs. bacteria): BLOSUM45, PAM250

Choose a matrix that is consistent with the level of sequence identity you are investigating; i.e., if you are looking at/for more closely related sequences, use BLOSUM90. If you are not sure, use BLOSUM62.

Sequence Alignments: General Concepts Global Alignment: Tries to match the entire length of the sequences. Local Alignment: Tries to find the best-scoring matching region within the sequences. Both are examples of dynamic programming: precise but slow.

Global Alignment Input: two sequences over the same alphabet (either nucleotide or amino acid sequences). Output: the alignment of the sequences. Example: GADEGYFGPVILAADGEVA and GGAEGDYFGPAIAEGEVA. A possible alignment might look like this:

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA

Relative to the first sequence, the second has insertions (ins) at columns 1 and 7, deletions (del) at columns 4, 14 and 15, and mutations (mut) at columns 12 and 17.

Global Alignment – A Simple Scoring Scheme Each position is scored independently: Match: +1, Mismatch: -1, Insertions or deletions (gaps): -2. The alignment score is the sum of the position scores.

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA
Global Alignment Score: (14 × (+1)) + (5 × (-2)) + (2 × (-1)) = 2

-----GADEG-YFGPVILAADGEVA---
DLGNVGA-EGDYFGPAI--AEGEVARPL
Global Alignment Score: (14 × (+1)) + (12 × (-2)) + (2 × (-1)) = -12

-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Local Alignment Score: (14 × (+1)) + (4 × (-2)) + (2 × (-1)) = 4 (the lower-case end residues are left unaligned and are not penalized)
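The column-by-column scoring described on this slide can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name; it scores an alignment that has already been made, using the slide's +1 / -1 / -2 scheme, and treats the local case simply as the same alignment with the unaligned end columns sliced off.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum independent per-column scores of two aligned rows ('-' is a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


short_global = score_alignment("-GADEG-YFGPVILAADGEVA",
                               "GGA-EGDYFGPAI--AEGEVA")

top = "-----GADEG-YFGPVILAADGEVA---"
bottom = "DLGNVGA-EGDYFGPAI--AEGEVARPL"
long_global = score_alignment(top, bottom)

# Local score: drop the 5 leading and 3 trailing end-gap columns.
local = score_alignment(top[5:-3], bottom[5:-3])

print(short_global, long_global, local)
```

The three values reproduce the slide's arithmetic: 2, -12 and 4.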

Matrices and Gap Costs The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b). Your total raw score for the alignment is reduced when you introduce gaps into the query sequence. Calculate the score in BLOSUM-62 for a gap with 7 residues… http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
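One way to work the exercise is to code the affine formula -(a + bk) directly. The slide does not give the values of a and b; the sketch below assumes existence cost a = 11 and extension cost b = 1, the pair usually paired with BLOSUM62 in protein BLAST, so treat those numbers as an assumption to be checked against the BLAST settings actually in use. The function name is hypothetical.

```python
def affine_gap_score(k: int, a: int, b: int) -> int:
    """Score of a gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


# ASSUMED costs: existence a = 11, extension b = 1 (not stated on the slide).
print(affine_gap_score(1, a=11, b=1))  # gap of length 1
print(affine_gap_score(7, a=11, b=1))  # gap of length 7
```

With those assumed costs a 7-residue gap lowers the raw score by 18, only 6 more than a single-residue gap, which is the point of the affine scheme: opening a gap is expensive, extending it is cheap.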

Global Sequence Alignments Global Alignment: Entire sequence of each protein or DNA. Needleman and Wunsch (1970) Reduces problem to series of smaller alignments on a residue-by-residue basis. How this approach works Setting up a matrix Score the matrix ID the optimal alignment

Global Sequence Alignments: Setting up a Matrix Create a 2D matrix of the 2 sequences to align, with Seq 1 on one axis and Seq 2 on the other. [Figure panels: Perfect Alignment; Mismatch Alignment (lower score); Deletion, Seq 2; Insertion, Seq 2]

Global Sequence Alignments: Setting up a Matrix In a simple identity matrix, matches are scored as (+1) and everything else as (0). The BLOSUM62 scoring matrix can be applied in place of the simple identity matrix. [Figure panels: Simple Identity Matrix; BLOSUM62 Scoring Matrix]

Global Seq. Alignments: Identity to Scoring Matrix We need to find a way to convert the identity matrix into a meaningful scoring system (match, mismatch, gap in sequence 1 or 2). [Figure panels: Simple Identity Matrix; Needleman-Wunsch-Sellers Scoring Matrix, with Seq 1 indexed by i and Seq 2 by j]

Global Seq. Alignments: Identity to Scoring Matrix The scoring matrix needs gap penalty values, match scores and a coordinate system: Seq 1 is indexed by i and Seq 2 by j, with Match = +1 and Else = -2. [Figure: Needleman-Wunsch-Sellers Scoring Matrix with gap penalties along the Seq 1 and Seq 2 edges and the matches marked]

Global Seq. Alignments: Scoring Matrix Calculations [Figure: Needleman-Wunsch-Sellers Scoring Matrix, Seq 1 (i) vs. Seq 2 (j), with +1 entered at position 1,1] Calculate $M_{i,j} = \max[M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w]$, where the three terms are the match/mismatch in the diagonal, a gap in sequence #1, and a gap in sequence #2. Note that in the example, $M_{i-1,j-1}$ will be red, $M_{i,j-1}$ will be blue and $M_{i-1,j}$ will be green. Using this information, the score at position 1,1 (i, j) in the matrix can be calculated. Since the first residue in both sequences is an F, $S_{1,1} = +1$, and by the assumptions stated earlier, $w = -2$. Thus, with $M_{0,0} = 0$ and $M_{0,1} = M_{1,0} = -2$ on the gap-penalty edges, $M_{1,1} = \max[M_{0,0} + 1,\; M_{1,0} - 2,\; M_{0,1} - 2] = \max[+1, -4, -4] = +1$. The MAX function means we retain the highest (MAX) score of all possible scores.

Global Seq. Alignments: Scoring Matrix Calculations [Figure: scoring matrix with +1 at position 1,1 and -1 at position 1,2] Calculate $M_{i,j+1} = \max[M_{i-1,j} + S_{i,j+1},\; M_{i,j} + w,\; M_{i-1,j+1} + w]$ (match/mismatch in the diagonal, gap in sequence #1, gap in sequence #2). The score at position 1,2 (i, j+1) in the matrix can be calculated. Since the residues are mismatched, $S_{1,2} = -2$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{1,2} = \max[M_{0,1} - 2,\; M_{1,1} - 2,\; M_{0,2} - 2] = \max[-4, -1, -6] = -1$.

Global Seq. Alignments: Scoring Matrix Calculations [Figure: scoring matrix with +1 at position 1,1 and -1 at position 2,1] Calculate $M_{i+1,j} = \max[M_{i,j-1} + S_{i+1,j},\; M_{i+1,j-1} + w,\; M_{i,j} + w]$ (match/mismatch in the diagonal, gap in sequence #1, gap in sequence #2). The score at position 2,1 (i+1, j) in the matrix can be calculated. Since the residues are mismatched, $S_{2,1} = -2$, and by the assumptions stated earlier, $w = -2$. Thus, $M_{2,1} = \max[M_{1,0} - 2,\; M_{2,0} - 2,\; M_{1,1} - 2] = \max[-4, -6, -1] = -1$.

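Scored Matrix [Figure: completed scoring matrix, Seq 1 (i) vs. Seq 2 (j)] Red arrows indicate the pathways to the calculated Max values. The cell marked in the figure holds the overall score of the optimal alignment.

Optimal Alignment: Trace-back Procedure [Figure: scoring matrix, Seq 1 (i) vs. Seq 2 (j), with the trace-back starting cell marked "Start here"] Trace-back arrows can only follow pathways identified when calculating Max values.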

Completed Global Pairwise Alignment

Seq 1 (i)  F K H M E D - P L - E
Seq 2 (j)  F - - M - D T P L N E

Note that the final pairwise alignment score (-4) is equal to the value calculated from the total numbers of matches, mismatches, insertions and deletions. Global Alignment Score: (6 × (+1)) + (5 × (-2)) = -4
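The matrix fill and trace-back walked through in the preceding slides can be sketched in code. This is a minimal illustration, not the course's own implementation: the function names are hypothetical, the two input sequences (FKHMEDPLE and FMDTPLNE) are read off the completed alignment with the gaps removed, and the scores follow the slides' Match = +1, Else = -2 (mismatch and gap both -2). Where several paths tie, the trace-back here prefers the diagonal, so it may print a different but equally scoring alignment.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Fill the Needleman-Wunsch matrix M (rows: seq1 index i, cols: seq2 index j)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * gap                      # leading gaps along seq1 edge
    for j in range(1, cols):
        M[0][j] = j * gap                      # leading gaps along seq2 edge
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          M[i][j - 1] + gap,    # gap in sequence 1
                          M[i - 1][j] + gap)    # gap in sequence 2
    return M


def nw_traceback(M, seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Walk back from the last cell along moves that produced each maximum."""
    a1, a2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                a1.append(seq1[i - 1]); a2.append(seq2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + gap:
            a1.append(seq1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2))


seq1, seq2 = "FKHMEDPLE", "FMDTPLNE"
M = nw_matrix(seq1, seq2)
row1, row2 = nw_traceback(M, seq1, seq2)
print(row1)
print(row2)
print("score:", M[-1][-1])
```

The first computed cells (M[1][1] = +1, M[1][2] = -1, M[2][1] = -1) match the worked calculations, and the final cell holds the overall score of -4.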

Local Sequence Alignment Local Alignment: Best-scoring matching regions (subsequences) between 2 sequences. Smith and Waterman Algorithm (1981) Scoring is similar to global alignment: set up a matrix, score the matrix, ID the optimal alignment. No negative values allowed: if negative values are the only choices, the cell defaults to zero (0). Mismatches and gaps at the ends score 0. More sensitive but much slower than heuristic methods (FASTA, BLAST).

Smith and Waterman Local Sequence Alignment Can use any scoring matrix you want (ex. substitute BLOSUM62). No negative values allowed: default is 0. Alignment can start anywhere in the sequence: not restricted to ends, and no penalties at ends. Trace-back starts with the highest number and works backwards the same as with global alignment. [Figure: example scoring matrix with nucleotide sequences on the axes; the residues G A A G A G T T T A A G appear on the figure axes]
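The "no negative values" rule is the only change needed to turn the global recurrence into a local one, which a short sketch can show. The function name and the nucleotide test strings below are hypothetical, and the +1 / -1 / -2 scores are borrowed from the simple scoring scheme used earlier rather than from a substitution matrix; only the best local score is returned, not the trace-back.

```python
def smith_waterman_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: same recurrence as global, floored at 0."""
    cols = len(seq2) + 1
    prev = [0] * cols          # first row is all zeros: no end penalties
    best = 0
    for i in range(1, len(seq1) + 1):
        curr = [0] * cols      # first column is zero as well
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            curr[j] = max(0,                   # never go negative
                          prev[j - 1] + s,     # diagonal: match/mismatch
                          curr[j - 1] + gap,   # gap in sequence 1
                          prev[j] + gap)       # gap in sequence 2
            best = max(best, curr[j])          # trace-back would start here
        prev = curr
    return best


print(smith_waterman_score("AAAGGGTTT", "CCCGGGCCC"))  # shared core GGG
print(smith_waterman_score("ACGT", "ACGT"))            # identical sequences
print(smith_waterman_score("AAAA", "CCCC"))            # nothing in common
```

Because every cell is floored at zero, unrelated flanking residues cost nothing, and the highest cell anywhere in the matrix (not the last cell) gives the score of the best local region.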

Heuristic (word or k-tuple based) algorithms Use the initial query to make reasonable guesses about sequence alignments, then evaluate those considered “most likely”. The alignment is then extended until one of the sequences ends or the score falls below some threshold. In BLAST, the search depends on word size.

KENFDKARFSGTWYAMAKKDPEG  50  RBP (query)
MKGLDIQKVAGTWYSLAMAASD.  44  lactoglobulin (hit)

[Figure: the shared word is marked "Hit!" and extended in both directions]

FASTA (Pearson and Lippman 1988) Combines the Smith and Waterman algorithm with a word (k-tup) search for a faster, heuristic approach. The query sequence is divided into small words (usually k=2 for proteins). Words are used to initially compare and match sequences. If words are located on the same diagonal, the surrounding region is then selected for analysis.

Seq 1: FYGKLHMEGD
Seq 2: FWGKLHMEGSNE
Seq 1 search words (k-tup = 2): FY YG GK KL LH HM ME EG GD

http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html

FASTA (Pearson and Lippman 1988) (a) Identify common k-words between sequences A and B (b) Score diagonals with k-word matches, identify 10 best diagonals (dense regions of k-word overlap) Rescore initial regions with a substitution score matrix (c) Join initial regions using gaps, penalize for gaps (d) Perform dynamic programming to find final alignments http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
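Steps (a) and (b) can be illustrated on the two example sequences from the previous slide. The sketch below is a simplified stand-in, not FASTA's actual code: the function names are hypothetical, it only counts shared k-words per diagonal (the offset between the word's positions in the two sequences), and it leaves out rescoring, joining and the final dynamic programming step.

```python
from collections import Counter


def k_words(seq: str, k: int = 2) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diagonal_hits(seq_a: str, seq_b: str, k: int = 2) -> Counter:
    """Count shared k-words per diagonal (position in seq_a minus position in seq_b)."""
    positions = {}
    for j, word in enumerate(k_words(seq_b, k)):
        positions.setdefault(word, []).append(j)   # lookup table for seq_b
    hits = Counter()
    for i, word in enumerate(k_words(seq_a, k)):
        for j in positions.get(word, []):
            hits[i - j] += 1                       # same offset = same diagonal
    return hits


seq1, seq2 = "FYGKLHMEGD", "FWGKLHMEGSNE"
print(k_words(seq1))
print(dict(diagonal_hits(seq1, seq2)))
```

For these two sequences all six shared words (GK, KL, LH, HM, ME, EG) fall on diagonal 0, which is the kind of dense diagonal FASTA would select for rescoring.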

Statistical Significance of Pairwise Alignments Does an alignment reflect a statistically significant similarity, or are the similarities due to chance? How do we define significant? Statistics. Start with a null hypothesis (H0) that the 2 sequences are not related. Propose an alternative hypothesis (H1) that the 2 sequences are related. Select a significance level (α = 0.05, an arbitrary but conventional choice): this is the probability of rejecting the null hypothesis when it is actually true, i.e., H0 is rejected only if a match at least this good would occur by chance less than 5% of the time.

Statistical Mean and Standard Deviation Mean (average, x-bar) is the sum of a set of numbers (x1 + x2 + … + xn), divided by the total instances in the set (n). Standard Deviation (s) is the square root of the sum of the squared differences between each value (xi) and the sample mean (x-bar), divided by n − 1 for a sample (or by n for a whole population).

Statistical Measures of Algorithms Objective of alignment algorithms is to maximize sensitivity and specificity of alignments. Sensitivity: Measure of how well algorithm correctly predicts sequences that are related. Specificity: Measure of how well algorithm correctly predicts sequences that are unrelated.
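These two measures are usually computed from counts of true and false predictions. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function names and the example counts are hypothetical and only serve to show the arithmetic.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly related sequences that the algorithm reports as related."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of truly unrelated sequences that the algorithm reports as unrelated."""
    return true_neg / (true_neg + false_pos)


# Hypothetical search result: 100 related and 1000 unrelated database sequences.
print(sensitivity(true_pos=90, false_neg=10))    # 90 of 100 related found
print(specificity(true_neg=950, false_pos=50))   # 950 of 1000 unrelated rejected
```

Loosening a score threshold typically raises sensitivity at the expense of specificity, which is why both are reported together.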

Statistical Comparison of 2 Sequences Compare a large number of “random” sequences Many different proteins Randomly generated sequences Scrambled variations of 1 of your 2 sequences Calculate a Z score from the difference between the score of your aligned sequences (x) and the mean of the random sequences (μ), divided by the standard deviation of the random sequences (σ).
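The Z score described here is (x − μ) / σ, which can be sketched with Python's statistics module. The function name and the list of random-sequence scores are hypothetical; in practice the list would hold alignment scores obtained from many scrambled or unrelated sequences. The population standard deviation is used because the slide treats the random scores as the reference population, which is an interpretation rather than something the slide states.

```python
import statistics


def z_score(x: float, random_scores: list[float]) -> float:
    """Distance of score x from the mean of random scores, in standard deviations."""
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)   # population standard deviation
    if sigma == 0:
        raise ValueError("random scores have zero spread")
    return (x - mu) / sigma


# Hypothetical alignment scores from eight scrambled sequences (mean 5, SD 2).
random_scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, random_scores))
```

A real alignment scoring 10 against this background sits 2.5 standard deviations above the random mean.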

Convert Z Score to Probability of Chance Alignment The Z score represents the distance between the sequence alignment score and the population mean estimated from random sequences, in units of standard deviation. The Z score can be converted to a probability. Example: if the random scores are assumed to be normally distributed, 97.72% of all values fall below Z = 2.0, so your sequence score could occur by chance only about 2.28% of the time, which is below α = 0.05.
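The conversion can be sketched with the complementary error function from Python's math module. This assumes the random alignment scores follow a normal distribution, which is the simplification this slide relies on; the function name is hypothetical.

```python
import math


def upper_tail_probability(z: float) -> float:
    """P(Z >= z) for a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for z in (1.0, 2.0, 3.0):
    p = upper_tail_probability(z)
    print(f"Z = {z}: chance probability = {p:.4f}")
```

Under the normal assumption a Z score of 2.0 corresponds to a one-sided chance probability of roughly 0.0228.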