Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.


1 Sequence similarity

2 Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar substring of A, B Longest similar substring of A, B..Z For each, How big? How similar?

3 Define alignment Align these two sequences optimally GACGGATT GATCGGTT Define precisely what an alignment is

4 Dot plot Best path from UL to LR?
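A dot plot can be sketched in a few lines. This is a minimal illustration, assuming a plain text grid with one row per letter of the first sequence and one column per letter of the second; the name `dot_plot` is hypothetical and not from the slides.

```python
def dot_plot(s, t):
    """Return one string per letter of s; '*' marks positions where s[i] == t[j]."""
    return ["".join("*" if a == b else "." for b in t) for a in s]


if __name__ == "__main__":
    # Column header is the second sequence; each row is labelled with a letter of the first.
    s, t = "GACGGATT", "GATCGGTT"
    print("  " + t)
    for letter, row in zip(s, dot_plot(s, t)):
        print(letter + " " + row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree without gaps; a good path from upper left to lower right follows such diagonals as much as possible.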

5 Edit (Levenshtein) distance An alignment of sequences s,t can be created by a series of edit operations –Insert space in s opposite letter in t –Insert space in t opposite letter in s
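As a rough sketch, the standard Levenshtein distance can be computed row by row. This version assumes unit cost for an insertion, a deletion, and a substitution (the slide lists only the two space-insertion operations, so the substitution cost is an assumption), and the function name `edit_distance` is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distance from the empty prefix of s to t[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n            # distance from s[:i] to the empty prefix of t
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                           # letter of s opposite a space
                cur[j - 1] + 1,                        # letter of t opposite a space
                prev[j - 1] + (s[i - 1] != t[j - 1]),  # letter opposite letter
            )
        prev = cur
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("GACGGATT", "GATCGGTT"))
```

Only two rows are kept at a time, so the sketch uses space proportional to the length of `t`.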

6 Definition of alignment Insert spaces so that the letters line up, or letters align with spaces GA-CGGATT GATCGG-TT Don’t allow spaces to line up Allow spaces even at beginning and end GCAT- -CATG

7 Define similarity Given an alignment, compute a similarity score Three possibilities for each column i letter-letter match ii letter-letter mismatch iii letter-space mismatch (Can you transform ii into iii?)

8 Optimal alignment Create score function For example: +1 bonus for match -1 penalty for letter-space mismatch
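Scoring a given alignment column by column might look like the sketch below. The +1 match bonus and -1 letter-space penalty come from the slide; the -1 letter-letter mismatch penalty is an assumption, since the transcript does not state one. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_s, row_t, match=1, mismatch=-1, space=-1):
    """Sum per-column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("two spaces may not line up")
        if a == "-" or b == "-":
            total += space        # letter-space column
        elif a == b:
            total += match        # letter-letter match
        else:
            total += mismatch     # letter-letter mismatch (assumed penalty)
    return total


if __name__ == "__main__":
    print(score_alignment("GA-CGGATT", "GATCGG-TT"))  # 7 matches, 2 spaces
```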

9 Dynamic programming solution Given sequences s,t of length m,n Strategy: build up optimal alignment of prefixes Base case? Recurrence relation?

10 Recurrence Given opt alignment of prefixes of s,t shorter than i,j, find opt of s[1..i], t[1..j] Three possibilities: –extend s by a letter, t by a space –extend s by a letter, t by a letter –extend s by a space, t by a letter Choose the one with the best score
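The recurrence can be sketched as a table fill over all prefix pairs. This is a minimal version under assumed scores (match +1, mismatch -1, space -1; the mismatch value is not given in the slides), with the hypothetical name `alignment_table`. Entry `a[i][j]` holds the best score for aligning the first `i` letters of `s` with the first `j` letters of `t`.

```python
def alignment_table(s, t, match=1, mismatch=-1, space=-1):
    """Fill the global-alignment score table for all prefix pairs of s and t."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Base case: a prefix aligned against the empty string is all spaces.
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(
                a[i - 1][j] + space,      # extend s by a letter, t by a space
                a[i - 1][j - 1] + pair,   # extend s by a letter, t by a letter
                a[i][j - 1] + space,      # extend s by a space, t by a letter
            )
    return a


if __name__ == "__main__":
    for row in alignment_table("AGC", "AAAC"):
        print(row)
```

The bottom-right entry is the score of an optimal alignment of the full sequences.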

11 Tiny instance -- AGC, AAAC [figure: partially filled dynamic programming table; cell values garbled in the transcript]

12 Some dp details What is a good order to fill the array? How do you recover the opt alignment? What do you do about ties? What is the space complexity of this algorithm? What is the time complexity of this algorithm?
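One possible set of answers, as a sketch: fill the array row by row, then walk back from the bottom-right corner to recover an alignment. The scores (match +1, mismatch -1, space -1) and the tie-breaking order (diagonal first, then up, then left) are assumptions chosen for illustration, and `align` is a hypothetical name. Filling the full table takes time and space proportional to m times n.

```python
def align(s, t, match=1, mismatch=-1, space=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * space
    for j in range(1, n + 1):
        a[0][j] = j * space
    for i in range(1, m + 1):             # row-by-row fill order
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + space,
                          a[i - 1][j - 1] + pair,
                          a[i][j - 1] + space)

    # Traceback: at each cell, find a predecessor that explains its value.
    rs, rt = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + pair:
            rs.append(s[i - 1]); rt.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + space:
            rs.append(s[i - 1]); rt.append("-"); i -= 1
        else:
            rs.append("-"); rt.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(rs)), "".join(reversed(rt))


if __name__ == "__main__":
    score, top, bottom = align("GACGGATT", "GATCGGTT")
    print(score)
    print(top)
    print(bottom)
```

A different tie-breaking order can yield a different alignment with the same optimal score.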

13 The gap penalty Model above assumes two gaps of size 1 are equivalent to one gap of size 2 Is this realistic? Why or why not?

14 General gap penalties Alignments can no longer be scored as the sum of their parts They still are the sum of blocks with one matched letter or one gap each Blocks are: matched letters, s-gap, t-gap A|A|C|---|A|GAT|A|A|C A|C|T|CGG|T|---|A|A|T

15 DP for general gaps Requires three arrays, one for each block type Time complexity is cubic This is expensive at best, prohibitive for large problems

16 Affine gap penalty Charge h for each gap, plus g * (len(gap)) This still has quadratic complexity!
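A sketch of why the affine case stays quadratic: three arrays track whether the alignment of each prefix pair ends in a letter-letter column, a gap in `t`, or a gap in `s`, so opening a gap (cost h + g) can be told apart from extending one (cost g). The score values and the names `affine_score`, `M`, `X`, `Y` are assumptions for illustration; only the score is returned, not the alignment.

```python
def affine_score(s, t, match=1, mismatch=-1, h=2, g=1):
    """Best global alignment score when a gap of length k costs h + g * k."""
    neg = float("-inf")
    m, n = len(s), len(t)
    M = [[neg] * (n + 1) for _ in range(m + 1)]  # ends with letter opposite letter
    X = [[neg] * (n + 1) for _ in range(m + 1)]  # ends with letter of s opposite space
    Y = [[neg] * (n + 1) for _ in range(m + 1)]  # ends with space opposite letter of t
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(h + g * i)
    for j in range(1, n + 1):
        Y[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap pays h + g; extending the same gap pays only g.
            X[i][j] = max(M[i - 1][j] - (h + g), X[i - 1][j] - g, Y[i - 1][j] - (h + g))
            Y[i][j] = max(M[i][j - 1] - (h + g), Y[i][j - 1] - g, X[i][j - 1] - (h + g))
    return max(M[m][n], X[m][n], Y[m][n])


if __name__ == "__main__":
    print(affine_score("ACGT", "AT"))
```

Each cell is computed from a constant number of neighbours, so the total work is proportional to m times n.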

17 Point accepted mutations Some mutations are more likely than others In proteins, some amino acids are more similar than others (size, charge, hydrophobicity) A point accepted mutation matrix is a table with the probability of each transition in a fixed time

18 PAM matrices The entire matrix sums to 1 A ‘unit of evolution’ is the time in which 1 in 100 amino acids is expected to change

19 Scoring matrix Consider aligned letters a,b Pr(b is a mutation of a) = M_ab Pr(b is a random occurrence) = p_b Score(a,b) = 10 log(M_ab / p_b)
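The score is a log-odds ratio, which a short sketch makes concrete. The slide does not state the base of the logarithm; base 10 is assumed here, and the probabilities in the example are made-up illustrative values, not entries from a real PAM matrix.

```python
import math


def log_odds_score(m_ab, p_b):
    """10 * log10 of (probability b arose from a) over (probability b occurs by chance)."""
    return 10 * math.log10(m_ab / p_b)


if __name__ == "__main__":
    # Hypothetical numbers: the pairing is more likely under mutation than by chance.
    print(log_odds_score(0.1, 0.01))
    # Hypothetical numbers: the pairing is less likely under mutation than by chance.
    print(log_odds_score(0.01, 0.05))
```

A positive score means the pair is more likely to reflect mutation than chance, a negative score the opposite, and zero means the two are equally likely. Because scores are logarithms, summing them over columns corresponds to multiplying the underlying odds.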

20 Blast Basic Local Alignment Search Tool Def: ‘segment’ is a subsequence (without gaps) Def: ‘segment pair’ is two segments of equal length Rem: the score of a segment pair is the sum of its aligned letters

21 What Blast does Input: –a PAM matrix –a database of sequences B –a query sequence A –a threshold S Output: –all segment pairs (A,B) with score > S

22 How Blast works Compile short, high-scoring strings (words) Search for hits -- each hit gives a seed Extend seeds

23 Blast on proteins Words are w-mers which score at least T against A Use hashing or a DFA to search for hits Extend seed until heuristically determined limit is reached
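The word-compilation step for proteins could be sketched as follows. This is a brute-force illustration, not how Blast itself is implemented: it enumerates every w-mer over the alphabet and keeps those scoring at least T against some w-mer of the query. The function name `neighborhood_words`, the toy two-letter alphabet, and the toy scoring function are all assumptions; a real run would use a 20-letter alphabet and a PAM-style scoring matrix.

```python
from itertools import product


def neighborhood_words(query, alphabet, score, w=2, threshold=1):
    """Map each high-scoring word to the query positions it scores well against."""
    words = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(qword, cand))
            if total >= threshold:
                words.setdefault("".join(cand), []).append(i)
    return words


def toy_score(a, b):
    """Toy stand-in for a scoring matrix: +2 for identical letters, -1 otherwise."""
    return 2 if a == b else -1


if __name__ == "__main__":
    print(neighborhood_words("AB", "AB", toy_score, w=2, threshold=1))
```

The resulting dictionary plays the role of the hash table used to search the database for hits.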

24 Blast on nucleic acids Words are w-mers in query A Letters compressed, four to byte Filter database B for very common words to avoid false positives Extend seeds as in proteins
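The seed-and-extend idea for nucleic acids can be sketched as below. This is a simplified illustration under assumed parameters, not the actual Blast implementation: words are exact w-mers of the query, hits are found with a dictionary, and each seed is extended without gaps in both directions until the running score falls more than `drop` below the best seen. The names `find_seeds`, `extend_seed`, and `drop`, and the +1/-1 scores, are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact w-mer shared by both."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], ())]


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps; return (score, query_start, subject_start, length)."""
    def grow(qpos, spos, step):
        # Walk outward one letter at a time, remembering the best gain so far.
        gain = best_gain = best_len = length = 0
        while 0 <= qpos < len(query) and 0 <= spos < len(subject):
            gain += match if query[qpos] == subject[spos] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > drop:
                break                     # heuristic limit reached
            qpos += step
            spos += step
        return best_gain, best_len

    right_gain, right_len = grow(i + w, j + w, +1)
    left_gain, left_len = grow(i - 1, j - 1, -1)
    score = w * match + left_gain + right_gain
    return score, i - left_len, j - left_len, w + left_len + right_len


if __name__ == "__main__":
    q, s = "GGACGTCC", "TTACGTAA"
    for qi, sj in find_seeds(q, s):
        print((qi, sj), extend_seed(q, s, qi, sj))
```

In this toy run both seeds extend to the same segment pair, which is why practical implementations merge or deduplicate overlapping hits on the same diagonal.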

25 What does Blast give you? Efficiency A rigorous statistical theory which gives the probability of a segment pair occurring by chance

