Chapter 3 Computational Molecular Biology Michael Smith

1 Chapter 3 Computational Molecular Biology Michael Smith msmith@cs.fit.edu

2 Sequence Comparison
Sequence comparison is the most important operation in computational biology.
It consists of finding which parts of the sequences are alike and which parts differ.

3 Similarity and Alignment
Similarity: a measure of how similar the sequences are.
Alignment: a way of placing sequences one above the other in order to make clear the correspondence between similar characters or substrings.

4 Sequence Comparison
We want the best alignment between two or more sequences.
Global comparison: alignment involving the entire sequences.
Local comparison: alignment involving substrings.
Semi-global comparison: aligning prefixes and suffixes of the sequences.
All can be solved by dynamic programming.

5 Global Comparison
Consider the following DNA sequences:
  GACGGATTAG
  GATCGGAATAG
Are they similar? After alignment, the similarities are more obvious:
  GA-CGGATTAG
  GATCGGAATAG

6 Alignment and Score
Alignment, more precise definition: insertion of spaces at arbitrary locations along the sequences so that they end up with the same size. No column can be entirely composed of spaces.
Score: a measure of similarity. Each column receives +1 for a match, -1 for a mismatch, or -2 for a space. Sum the column values to get the score.
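
The column scoring rule above can be sketched in a few lines of Python. The function name `score_alignment` and its parameter names are illustrative choices, not part of the slides; the default values are the +1, -1, -2 scheme given on this slide, and the example input is the aligned pair from slide 5.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, space=-2):
    """Score two aligned rows (same length, '-' marks a space)."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("no column may consist entirely of spaces")
        if a == "-" or b == "-":
            total += space       # a character aligned with a space
        elif a == b:
            total += match       # identical characters
        else:
            total += mismatch    # different characters
    return total


# Aligned pair from slide 5: 9 matches, 1 mismatch, 1 space.
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 - 1 - 2 = 6
```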

7 Dynamic Programming
Solving an instance of a problem by taking advantage of already computed solutions for smaller instances of the same problem.
Main algorithmic approach used in sequence alignment.
Figures 3.1, 3.2
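
As a rough illustration of the idea, the sketch below fills a table in which entry (i, j) holds the best score for aligning the first i characters of s with the first j characters of t, each entry being computed from three smaller, already solved instances. It is not a transcription of Figures 3.1 or 3.2, which are not reproduced in this transcript; the name `similarity_matrix` is hypothetical and the scoring values are the +1, -1, -2 scheme from slide 6.

```python
def similarity_matrix(s, t, match=1, mismatch=-1, space=-2):
    """Return the (len(s)+1) x (len(t)+1) table of global alignment scores."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty string costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,    # s[i] aligned with a space
                a[i - 1][j - 1] + p,    # s[i] aligned with t[j]
                a[i][j - 1] + space,    # space aligned with t[j]
            )
    return a


table = similarity_matrix("GACGGATTAG", "GATCGGAATAG")
print(table[-1][-1])  # score of the best global alignment
```

The bottom-right entry is the similarity of the two full sequences; the rest of the table is what the traceback on the next slides walks through.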

8 Optimal Alignments
From Figure 3.1, start at (m,n) and follow the arrows back to (0,0).
Each arrow gives one column of the alignment.
If the arrow is horizontal, it corresponds to a column with a space in s matched with t[j].
If the arrow is vertical, it corresponds to s[i] matched with a space in t.
If the arrow is diagonal, s[i] is matched with t[j].

9 Optimal Alignments
Many optimal alignments are possible; which one is produced depends on which arrow is given priority.
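
The traceback described on slides 8 and 9 might look like the following sketch. Rather than storing arrows, it rechecks which neighbouring entry could have produced the current one, which is equivalent. The function name `align`, the `priority` parameter, and the move labels "up", "diag", "left" are assumptions made for this example (vertical, diagonal, and horizontal arrows respectively, with s along the rows).

```python
def align(s, t, priority=("up", "diag", "left"), match=1, mismatch=-1, space=-2):
    """Return (aligned_s, aligned_t, score) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)

    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:               # walk from (m, n) back to (0, 0)
        for move in priority:
            if move == "up" and i > 0 and a[i][j] == a[i - 1][j] + space:
                top.append(s[i - 1]); bottom.append("-"); i -= 1
                break
            if move == "diag" and i > 0 and j > 0:
                p = match if s[i - 1] == t[j - 1] else mismatch
                if a[i][j] == a[i - 1][j - 1] + p:
                    top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
                    break
            if move == "left" and j > 0 and a[i][j] == a[i][j - 1] + space:
                top.append("-"); bottom.append(t[j - 1]); j -= 1
                break
        else:
            raise ValueError("priority must contain 'up', 'diag' and 'left'")
    # Columns were collected from the end, so reverse them.
    return "".join(reversed(top)), "".join(reversed(bottom)), a[m][n]


for order in (("up", "diag", "left"), ("left", "diag", "up")):
    row_s, row_t, score = align("GACGGATTAG", "GATCGGAATAG", priority=order)
    print(order, score)
    print(row_s)
    print(row_t)
```

Changing `priority` can change which optimal alignment is returned, but never its score.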

10 Local Comparison
A local alignment between s and t is an alignment between a substring of s and a substring of t.
Goal: find the highest scoring local alignment between two sequences.
Variation of the basic algorithm (Figure 3.2).
Each entry (i,j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j] (page 55).
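
A minimal sketch of this variation follows, assuming the usual formulation in which every entry also has the option of the empty alignment (score 0) and the answer is the largest entry in the table. The function name `local_alignment` is hypothetical and the scores are the +1, -1, -2 scheme from slide 6.

```python
def local_alignment(s, t, match=1, mismatch=-1, space=-2):
    """Return (score, aligned_s_part, aligned_t_part) for one best local alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0, a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Trace back from the best entry until an entry with score 0 is reached.
    top, bottom = [], []
    i, j = bi, bj
    while a[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + p:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif a[i][j] == a[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("AAAGGGTTT", "CCCGGGCCC"))  # only the shared GGG aligns well
```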

11 Semi-Global Comparison
Score alignments ignoring some of the end spaces in the sequences.
End spaces are those that appear before the first or after the last character in a sequence.
For example:
  CAGCA-CTTGGATTCTCGG
  ---CAGCGTGG--------
If we aligned the sequences in the usual way, then:
  CAGCACTTGGATTCTCGG
  CAGC-----G-T----GG
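
One way to sketch semi-global scoring is shown below, under the assumption that all end spaces in both sequences are free: the first row and column are initialised to 0 (leading spaces cost nothing) and the result is the maximum over the last row and last column (trailing spaces cost nothing). The slide only says "some" end spaces are ignored, so other variants charge for a subset of them. The function name `semiglobal_similarity` is hypothetical.

```python
def semiglobal_similarity(s, t, match=1, mismatch=-1, space=-2):
    """Best alignment score when end spaces in either sequence are not charged."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at 0: spaces before the first character are free.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space, a[i - 1][j - 1] + p, a[i][j - 1] + space)
    # Spaces after the last character are free: stop anywhere on the last
    # row or the last column.
    return max(max(a[m]), max(row[n] for row in a))


print(semiglobal_similarity("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
print(semiglobal_similarity("AAACCCGGG", "CCC"))  # t sits inside s
```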

12 Extensions to Basic Algorithm
The basic algorithm has O(mn) time complexity and uses O(mn) space.
It is possible to improve the space complexity from quadratic to linear at the expense of roughly doubling the processing time.
This can be accomplished by using a divide and conquer strategy: divide the problem into smaller subproblems and later combine their solutions to obtain a solution for the whole problem.
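
The sketch below shows only the simplest part of the space saving: if just the score is needed, each row of the table depends only on the previous row, so two rows suffice. It does not implement the divide and conquer step that recovers the alignment itself in linear space; that step uses passes like this one as a building block. The function name `similarity_linear_space` is an assumption for this example.

```python
def similarity_linear_space(s, t, match=1, mismatch=-1, space=-2):
    """Global similarity score using O(min(len(s), len(t))) memory."""
    if len(t) > len(s):
        s, t = t, s                      # scoring is symmetric; keep rows short
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * len(t)
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j] + space, prev[j - 1] + p, cur[j - 1] + space)
        prev = cur                       # the older row is no longer needed
    return prev[-1]


print(similarity_linear_space("GACGGATTAG", "GATCGGAATAG"))  # same score as the full table
```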

13 Gap Penalty Functions
A gap is a run of consecutive spaces.
When mutations occur, a single block of spaces is more likely than a series of isolated spaces.
The previously discussed scoring method, which charges every space independently, is not appropriate in this case.

14 Gap Penalty Functions
For example:
  A------ATTCCTTCCTTCC
  AAAGAGAATTCCTTCCTTCC
Scoring is done at the block level, not the column level:
  A ------ ATTCCTTCCTTCC
  A AAGAGA ATTCCTTCCTTCC
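
To make block-level scoring concrete, here is a sketch that charges each gap of k spaces a penalty of the form h + g*k, so one long gap is cheaper than several short ones with the same total length. The slides do not give a specific penalty function; this affine form and the values `gap_open=2` (h) and `gap_extend=1` (g) are assumptions chosen for illustration, as is the function name.

```python
def score_with_gap_blocks(row_s, row_t, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score an alignment where a gap of k spaces costs gap_open + gap_extend * k."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_row = None                        # which row the current gap is in, if any
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("no column may consist entirely of spaces")
        if a == "-" or b == "-":
            row = "s" if a == "-" else "t"
            if row != gap_row:
                total -= gap_open         # a new gap starts here
            total -= gap_extend           # every space in the gap adds to its cost
            gap_row = row
        else:
            total += match if a == b else mismatch
            gap_row = None                # a character pair ends any gap
    return total


# One gap of six spaces: 14 matches - (2 + 6 * 1) = 6
print(score_with_gap_blocks("A------ATTCCTTCCTTCC", "AAAGAGAATTCCTTCCTTCC"))
# Two isolated spaces: 3 matches - 2 * (2 + 1) = -3
print(score_with_gap_blocks("A-C-G", "ATCTG"))
```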

15 Multiple Sequences
Multiple sequence alignment is a generalization of the two-sequence case.
A multiple alignment of s1, s2, ..., sk is obtained by inserting spaces in the sequences in such a way as to make them all the same size.
No column is made entirely of spaces.
Figure 3.10

16 Scoring Multiple Sequences
Need a function that takes amino acid sequences as input and returns a score.
The function must have two properties:
It must be independent of the order of its arguments. For example, if a column has I,V,- the same score should be produced if the order is -,V,I.
It should reward the presence of many equal residues and penalize unequal residues and spaces.

17 Sum-of-Pairs (SP)
The sum-of-pairs (SP) function satisfies both properties.
It is the sum of the pairwise scores of all pairs of symbols in a column:
  SP-score(I,-,I,V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
where p(a,b) is the pairwise score of a and b.
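
The SP-score can be sketched as a sum over all unordered pairs in a column. The slides leave p(a,b) unspecified for amino acids, so this example substitutes the simple +1, -1, -2 scheme from slide 6 and assumes p(-,-) = 0; a real protein scoring would plug in a substitution matrix instead. The function names are illustrative.

```python
from itertools import combinations


def pair_score(a, b):
    """Placeholder p(a, b): +1 match, -1 mismatch, -2 space, 0 for two spaces."""
    if a == "-" and b == "-":
        return 0                 # assumption: a pair of spaces contributes nothing
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_column(column):
    """Sum of pairwise scores over all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """SP-score of a whole multiple alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("I-IV"))                        # -2 + 1 - 1 - 2 - 2 - 1 = -7
print(sp_column("IV-") == sp_column("-VI"))     # order of arguments does not matter
```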

18 Algorithm Paradigm
Dynamic programming is used again.
The basic algorithm can be used, but there will be problems.
In the two-sequence case, complexity is O(n^2).
In the k-sequence case, complexity is O(n^k).
This can take a really long time if k is large.

19 Algorithm Paradigm
Must reduce the number of cells to compute.
Apply a heuristic to reduce the number of computed cells.

20 Star Alignments
Building a multiple alignment based on pairwise alignments between a fixed sequence and all the others.
The fixed sequence is the center of the star.

21 Star Alignments
Example:
  a = ATTGCCATT
  b = ATGGCCATT
  c = ATCCAATTTT
  d = ATCTTCTT
  e = ACTGACC
Select a as the center of the star.

22 Star Alignments
Align a with b, a with c, a with d, and a with e.

23 Star Alignments
a with b:
  ATTGCCATT
  ATGGCCATT
a with c:
  ATTGCCATT--
  ATC-CAATTTT
a with d:
  ATTGCCATT
  ATCTTC-TT
a with e:
  ATTGCCATT
  ACTGACC--

24 Star Alignments
Combine results:
  ATTGCCATT--
  ATGGCCATT--
  ATC-CAATTTT
  ATCTTC-TT--
  ACTGACC----
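
The combining step can be sketched as follows. For each slot between characters of the center sequence, the merged alignment keeps the largest number of spaces that any pairwise alignment inserted there, and every other row is padded to match. The function name `merge_star` and this particular way of organising the merge are assumptions for illustration; the input pairs are the pairwise alignments from slide 23.

```python
def merge_star(center, pairwise):
    """Merge pairwise alignments (center_row, other_row) into one multiple alignment."""
    n = len(center)
    max_gap = [0] * (n + 1)     # spaces needed before center[k]; slot n is the end
    parsed = []
    for c_row, o_row in pairwise:
        if len(c_row) != len(o_row) or c_row.replace("-", "") != center:
            raise ValueError("each pair must align the same center sequence")
        inserted = [""] * (n + 1)   # other-row symbols facing spaces in the center
        facing = [""] * n           # other-row symbol facing center[k]
        k = 0
        for c, o in zip(c_row, o_row):
            if c == "-":
                inserted[k] += o
            else:
                facing[k] = o
                k += 1
        parsed.append((inserted, facing))
        for slot in range(n + 1):
            max_gap[slot] = max(max_gap[slot], len(inserted[slot]))

    # Center row: its characters plus the widest gap required at each slot.
    rows = ["".join("-" * max_gap[k] + center[k] for k in range(n)) + "-" * max_gap[n]]
    for inserted, facing in parsed:
        row = "".join(inserted[k].ljust(max_gap[k], "-") + facing[k] for k in range(n))
        rows.append(row + inserted[n].ljust(max_gap[n], "-"))
    return rows


pairs = [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT", "ATCTTC-TT"),
    ("ATTGCCATT", "ACTGACC--"),
]
for row in merge_star("ATTGCCATT", pairs):
    print(row)
```

On this input the output rows are the five rows of the combined alignment shown on slide 24.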

25 Database Search
Databases exist for searching and comparing protein and DNA sequences.
The methods described so far work, but may take too long and be impractical for searching large databases.
Novel and faster methods have been developed.

26 PAM Matrix
When scoring protein sequences, the +1, -1, -2 scheme may not be sufficient.
Amino acids have properties that influence the likelihood that they will be substituted in an evolutionary scenario.

27 PAM Matrix
Point Accepted Mutations.
A 1-PAM matrix is suitable for comparing sequences that are 1 unit of evolution apart.
A 250-PAM matrix is suitable for comparing sequences that are 250 units of evolution apart.

28 PAM Matrix
Markovian in nature.
Need the probability of occurrence of each amino acid.
Probability transition matrix.
Score matrix.
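
The relationship between the transition matrix and the score matrix can be illustrated with a toy example. The sketch assumes the common log-odds construction: the k-PAM transition matrix is the 1-PAM matrix raised to the k-th power (the Markov assumption), and the score for a pair (a, b) is 10 * log10(M^k[a][b] / p_b), where p_b is the probability of occurrence of b. The two-letter alphabet and all numbers below are made up for illustration; they are not real PAM data.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, k):
    """Raise a square matrix to the k-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def pam_score_matrix(m1, freqs, k):
    """Log-odds scores for k units of evolution (assumed formula, see lead-in)."""
    mk = mat_pow(m1, k)
    size = len(m1)
    return [[10 * math.log10(mk[a][b] / freqs[b]) for b in range(size)]
            for a in range(size)]


# Hypothetical 1-step transition matrix over a two-letter alphabet; rows sum to 1.
toy_m1 = [[0.99, 0.01],
          [0.02, 0.98]]
toy_freqs = [2 / 3, 1 / 3]      # occurrence probabilities consistent with toy_m1

for k in (1, 250):
    print(k, pam_score_matrix(toy_m1, toy_freqs, k))
```

At small k, identical letters score positive and substitutions score strongly negative; as k grows the scores shrink toward zero because the sequences are expected to have diverged.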

29 BLAST
Among the most frequently used programs for searching sequence databases.
Acronym for Basic Local Alignment Search Tool.
Returns a list of high scoring segment pairs between the query sequence and sequences in the database.
http://www.ncbi.nlm.nih.gov

30 FAST
Another family of programs for sequence database search.
http://www.rcsb.org/pdb/index.html
BLAST and FAST use PAM matrices.

