
1
Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk

2
Content
- Dynamic programming: improve to linear space
- Heuristics for database searches: BLAST, FASTA
- Statistical significance: what the scores mean
- Score parameters (PAM, BLOSUM)

3
Dynamic programming: improve to linear space

4
Dynamic programming
Needleman-Wunsch and Smith-Waterman run in O(nm) time and use O(nm) space. For proteins this is okay, but DNA strings are huge! Goal: improve this to O(n+m) space while still using O(nm) time.

5
Basic idea
Full DP requires O(nm) space.

6
Basic idea for linear space
Cells don't directly depend on cells more than one column to the left, so keep only two columns and forget the rest. But then there are no back-pointers: how do we find the alignment?
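The two-column idea can be sketched as follows (a minimal Python sketch; the scoring parameters are illustrative, not from the slides):

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score in O(min-side) space: keep only two columns."""
    prev = [j * gap for j in range(len(y) + 1)]      # column for x[:0]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,    # diagonal: align x[i-1] with y[j-1]
                          prev[j] + gap,      # gap in y
                          curr[j - 1] + gap)  # gap in x
        prev = curr                           # forget everything older
    return prev[-1]

print(nw_score("ACGT", "AGT"))  # best score: 3 matches, 1 gap = 1
```

Note that this yields only the score; recovering the alignment itself is exactly the problem the next slides address.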

7
Divide and conquer
If we happen to know a cell on the optimal alignment, we can split the problem there and solve each half separately.

8
Divide and conquer
We could then repeat this trick on each half!

9
Divide and conquer
But how do we find such a point?

10
Modified DP
Determine where the optimal alignment crosses a certain column (say, the middle one): at every cell, remember at which row the best path to that cell crossed that column.
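A sketch of this modified DP: alongside the two score columns it carries, for every cell, the row at which the best path crossed the middle column (function and parameter names are mine, not from the slides):

```python
def middle_crossing(x, y, match=1, mismatch=-1, gap=-2):
    """Linear-space NW score plus the row where the optimal alignment
    crosses the middle column of x (a sketch)."""
    mid = len(x) // 2
    prev = [j * gap for j in range(len(y) + 1)]
    # crossing row for each cell; known immediately if mid is column 0
    pcross = list(range(len(y) + 1)) if mid == 0 else [None] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        ccross = [pcross[0]] + [None] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [(prev[j - 1] + s, pcross[j - 1]),   # diagonal
                       (prev[j] + gap, pcross[j]),         # gap in y
                       (curr[j - 1] + gap, ccross[j - 1])] # gap in x
            curr[j], ccross[j] = max(options, key=lambda t: t[0])
        if i == mid:                 # in column mid, the crossing row is j itself
            ccross = list(range(len(y) + 1))
        prev, pcross = curr, ccross
    return prev[-1], pcross[-1]

print(middle_crossing("ACGT", "ACGT"))  # score 4, crosses column 2 at row 2
```

Knowing this crossing row is exactly the split point the divide-and-conquer scheme needs.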

11
Space analysis
We only ever keep two columns at a time: clearly O(n+m) space. But what about the time?

12
What are we doing?
Using linear-space DP at every step of a divide-and-conquer scheme.

13
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The first pass costs n² cell computations.

14
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The second level consists of two subproblems of ¼n² each, so: n² + ½n².

15
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The next level adds four subproblems of 1/16 n² each, so: n² + ½n² + ¼n².

16
Time analysis
We do more work now, but how much? Look at the case of two identical strings. Et cetera: n² + ½n² + ¼n² + … < 2n².

17
Time analysis
We do more work now, but how much? Look at the case of two identical strings: n² + ½n² + ¼n² + … < 2n². Along the same lines, the algorithm in general is still O(nm), and in fact only about twice as much work as the full DP.
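The geometric bound can be checked numerically (n = 1000 is an arbitrary choice):

```python
# Each divide-and-conquer level halves the total DP area, so the work
# is a geometric series: n² + n²/2 + n²/4 + ... < 2n².
n = 1000
work, area = 0, n * n
while area >= 1:
    work += area
    area //= 2            # next level covers half the remaining area
assert work < 2 * n * n   # the series never reaches 2n²
print(work / (n * n))     # approaches 2
```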

18
Questions?

19
Heuristics for database search: BLAST, FASTA

20
Searching a database
Given: a database of strings and a query string. Task: select the best-matching string(s) from the database.

21
Algorithms we already know
Until now, the algorithms were exact and correct (for the model!), but for this purpose they are too slow. Say 100M strings of 1000 characters each, and 10M DP cells per second: that leads to roughly 3 hours of search time.
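A back-of-the-envelope check of the slide's numbers (assuming the total work is about one DP cell per database character, i.e. ~10^11 cells):

```python
strings = 100_000_000     # 100M database strings
chars = 1000              # characters per string
rate = 10_000_000         # DP cells evaluated per second
cells = strings * chars   # ~1e11 cells in total
seconds = cells / rate
print(seconds / 3600)     # ~2.8 hours
```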

22
Getting faster
Full DP is too expensive. The space issue is fixed, with a guaranteed identical result; now for the time issue.

23
BLAST and FASTA
Heuristics that avoid doing the full DP:
- Not guaranteed to give the same result.
- But tend to work well.
Idea: first spend some time analyzing the strings; then don't calculate all DP cells, only some, based on that analysis.

24
Basic idea
In a good alignment, it is likely that several short parts match exactly. Example strings:
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

25
k-tuples
Decompose strings into k-tuples with their offsets. E.g. with k=3, "A C C A B B" becomes:
0: A C C
1: C C A
2: C A B
3: A B B
Do this for the database and for the query.
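The decomposition itself is a one-liner (Python sketch):

```python
def ktuples(s, k=3):
    """All k-tuples of s, each with its offset."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]

print(ktuples("ACCABB"))  # [(0, 'ACC'), (1, 'CCA'), (2, 'CAB'), (3, 'ABB')]
```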

26
Example strings
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

27
3-tuple join
Query:    0: ACC  1: CCA  2: CAB  3: ABB  4: BBD  5: BDB  6: DBC  7: BCD  8: CDC  9: DCB  10: CBB  11: BBC  12: BCB  13: CBA
Database: 0: ABB  1: BBA  2: BAD  3: ADA  4: DAC  5: ACC  6: CCB  7: CBB  8: BBC  9: BCC  10: CCD  11: CDC  12: DCD  13: CDA
Matching tuples give diagonals (query offset minus database offset): ABB: 3, ACC: -5, CBB: 3, BBC: 3, CDC: -3.
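The join can be sketched in Python; a hit at query offset i and database offset j lies on diagonal i - j (function name is mine):

```python
from collections import defaultdict

def diagonal_hits(query, db_string, k=3):
    """Join the k-tuple lists of two strings; count hits per diagonal."""
    index = defaultdict(list)                 # k-tuple -> offsets in db_string
    for j in range(len(db_string) - k + 1):
        index[db_string[j:j + k]].append(j)
    diagonals = defaultdict(int)              # diagonal -> number of hits
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return dict(diagonals)

q = "ACCABBDBCDCBBCBA"
d = "ABBADACCBBCCDCDA"
print(diagonal_hits(q, d))   # {-5: 1, 3: 3, -3: 1}
```

Diagonal 3 collects three hits (ABB, CBB, BBC), which is exactly the kind of diagonal the next slides focus on.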

28
Matches / hits
Plot the hits by offset in the query versus offset in the database string. Lots of hits on the same diagonal: might be a good alignment.

29
Don't do the full DP: instead do e.g. "banded DP" around diagonals with many matches.
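A minimal sketch of banded DP (global alignment restricted to a band around one diagonal; the function name, band convention, and scoring parameters are illustrative):

```python
def banded_score(x, y, diag, width, match=1, mismatch=-1, gap=-2):
    """Alignment score using only cells within `width` of diagonal j - i = diag.
    Returns -inf if the band never reaches the final cell."""
    NEG = float("-inf")
    # full-size table for clarity; only the band cells are ever filled
    score = [[NEG] * (len(y) + 1) for _ in range(len(x) + 1)]
    score[0][0] = 0
    for i in range(len(x) + 1):
        lo = max(0, i + diag - width)
        hi = min(len(y), i + diag + width)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)
            if j > 0:
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[len(x)][len(y)]

print(banded_score("ACGT", "ACGT", diag=0, width=1))  # 4: all matches
```

Only O(width · n) cells are computed, which is the entire point: the k-tuple analysis picks the diagonals, and the DP is confined to them.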

30
Some options
- If there is no diagonal with multiple matches, don't run DP at all.
- Don't just allow exact k-tuple matches; generate a 'neighborhood' of similar tuples for the query.
- …

31
Personal experience
"Database architecture" practical assignment. MonetDB: a main-memory DBMS. SWISS-PROT decomposed into 3-tuples: 150k strings, 150M 3-tuples. Task: find database strings with more than one match on the same diagonal.

32
Personal experience
A 43-character query string: ~500k matches (in ~122k strings); ~32k diagonals with more than one match, in ~25k strings. With some implementation effort: ~1 second (kudos to Monet here!).

33
Personal experience
From 150k strings to 15k 'probable' strings in 1 second. This discards 90% of the database for almost no work, and even gives extra information to speed up subsequent calculations. But it might discard otherwise good matches.

34
Personal experience
A tiny query (6 characters): 122 diagonals in 119 strings, in no time at all. An actual protein from the database (250-character query): ~285k diagonals in ~99k strings, in about 5 seconds.

35
BLAST/FASTA conclusion
- Not guaranteed to give the same result.
- But tend to work well.

36
Questions?

37
Statistical significance
What do the scores mean?
Score parameters (PAM, BLOSUM)

38
What do the scores mean?
We are calculating 'optimal' scores, but what do they mean? We used log-odds to get an additive scoring scheme. The question: is an alignment biologically meaningful, or just the best alignment between random strings?
1. Bayesian approach
2. Classical approach

39
Bayesian approach
Interested in: the probability of a match given the strings, P(M | x,y). Already known: the probability of the strings given the models, i.e. P(x,y | M) and P(x,y | R). So: Bayes' rule.

40
Bayesian approach
Bayes' rule gives us:
P(M | x,y) = P(x,y | M) P(M) / P(x,y)
…then rewrite.

41
Bayesian approach
P(M | x,y) = σ(S'), where
S' = log( P(x,y|M) / P(x,y|R) ) + log( P(M) / P(R) )
The first term is our alignment score!
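The elided rewrite steps, assuming every pair of strings comes from either the match model M or the random model R (so P(x,y) = P(x,y|M)P(M) + P(x,y|R)P(R)):

```latex
\begin{align*}
P(M \mid x,y)
  &= \frac{P(x,y \mid M)\,P(M)}{P(x,y \mid M)\,P(M) + P(x,y \mid R)\,P(R)} \\
  &= \frac{1}{1 + \dfrac{P(x,y \mid R)\,P(R)}{P(x,y \mid M)\,P(M)}}
   = \frac{1}{1 + e^{-S'}} = \sigma(S'),\\
\text{where } S'
  &= \log\frac{P(x,y \mid M)}{P(x,y \mid R)} + \log\frac{P(M)}{P(R)}.
\end{align*}
```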

42
Take care!
- Requires that the substitution matrix contains probabilities.
- Mind the prior probabilities: when matching against a database of N strings, subtract a log(N) term.
- The alignment score is for the optimal alignment between the strings; it ignores other possible good alignments.

43
Classical approach
Call the maximum score among N random strings M_N. P(M_N < x) means "the probability that the best match from a search of a large number N of unrelated sequences has score lower than x"; it follows an extreme value distribution. Consider x = our score: if this probability is very large, it is likely that our match was not just random.
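The usual closed form for this extreme value distribution (the Gumbel form of the Karlin-Altschul statistics; the constants K and λ depend on the scoring scheme and are not given in these slides):

```latex
\[
  P(M_N < x) \;\approx\; \exp\!\left(-K N e^{-\lambda x}\right),
\]
% a match is then considered significant when K N e^{-\lambda x} is small.
```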

44
Correcting for length
The score is additive, so longer strings get higher scores!

45
Correcting for length
If a match with any string should be equally likely a priori, correct for this bias by subtracting log(length) from the score. Because the score is a log-odds value, this 'is a division' that normalizes the score.

46
Scoring parameters
How do we get the values in the cost matrices?
1. Just count frequencies?
2. PAM
3. BLOSUM

47
Just count frequencies?
This would be a maximum likelihood estimate, but:
- it needs lots of confirmed alignments;
- alignments show different amounts of divergence.

48
PAM
Proteins grouped by 'family', with a phylogenetic tree per family. The PAM1 matrix gives substitution probabilities for 1 unit of evolutionary time; PAMn is obtained as (PAM1)^n. PAM250 is often used.
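The PAMn = (PAM1)^n construction, on a toy 2-letter alphabet (illustrative numbers; a real PAM1 is 20x20 and estimated from confirmed alignments):

```python
import numpy as np

pam1 = np.array([[0.99, 0.01],
                 [0.01, 0.99]])   # rows sum to 1: per-unit-time probabilities
pam250 = np.linalg.matrix_power(pam1, 250)
print(pam250)                     # off-diagonal entries grow as changes accumulate
```

Note how the small off-diagonal probability 0.01 compounds to roughly 0.5 after 250 steps; any inaccuracy in PAM1 compounds the same way, which is the objection the next slide raises.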

49
BLOSUM
Long-term PAM matrices are inaccurate: inaccuracies in PAM1 multiply, and there are actual differences between short-term and long-term changes. The various BLOSUM matrices are determined directly for different levels of divergence, which solves both problems.

50
Gap penalties
There is no proper time-dependent model, but it seems reasonable that:
- the expected number of gaps is linear in time;
- the distribution of gap lengths is constant.

51
Questions?

52
What have we seen?
- Linear-space DP
- Heuristics: BLAST, FASTA
- What the scores mean
- Available substitution matrices: PAM, BLOSUM

53
Last chance for questions…
