 Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk.

Presentation on theme: "Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk."— Presentation transcript:

Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk

Content  Dynamic programming Improve to linear space  Heuristics for database searches BLAST, FASTA  Statistical significance What the scores mean Score parameters (PAM, BLOSUM)

Dynamic programming  Improve to linear space

Dynamic programming  NW and SW run in O( nm ) time  And use O( nm ) space For proteins, this is okay. But DNA strings are huge!  Improve this to O( n+m ) space while still using O( nm ) time.

Basic idea  Full DP requires O( nm ) space

Basic idea for linear space  Cells don’t directly depend on cells more than one column to the left.  So keep only two columns; forget the rest No back-pointers! How to find the alignment?

 If we happen to know a cell on the optimal alignment… Divide and conquer

 We could then repeat this trick! Divide and conquer

 But how to find such a point? Divide and conquer

 Determine where the optimal alignment crossed a certain column: at every cell, remember at which row it crossed the column. Modified DP

 Always only two columns at a time. Clearly O( n+m ) space.  But what about the time? Space analysis

 Using linear space DP at every step of a divide and conquer scheme. What are we doing?

 We do more work now, but how much?  Look at case with two identical strings.  n 2 Time analysis n2n2

 We do more work now, but how much?  Look at case with two identical strings.  n 2 + ½ n 2 Time analysis ¼ n 2

 We do more work now, but how much?  Look at case with two identical strings.  n 2 + ½ n 2 + ¼ n 2 Time analysis 1/16 n 2

 We do more work now, but how much?  Look at case with two identical strings.  n 2 + ½ n 2 + ¼ n 2 + … < 2n 2 Time analysis et cetera…

 We do more work now, but how much?  Look at case with two identical strings.  n 2 + ½ n 2 + ¼ n 2 + … < 2n 2  Along the same lines, the algorithm in general is still O( nm ).  And actually only about twice as much work! Time analysis et cetera…

Questions?

Heuristics for database search  BLAST, FASTA

Searching a database  Database of strings  Query string  Select best matching string(s) from the database.

Algorithms we already know  Until now, the algorithms were exact and correct (For the model!)  But for this purpose, too slow: Say 100M strings with 1000 chars each Say 10M cells per second Leads to ~3 hours search time

Getting faster  Full DP is too expensive  Space issue: fixed. With guaranteed same result  Now the time issue

BLAST and FASTA  Heuristics to prevent doing full DP  - Not guaranteed same result. - But tend to work well.  Idea: First spend some time to analyze strings Don’t calculate all DP cells; Only some, based on analysis.

Basic idea  In a good alignment, it is likely that several short parts match exactly.  A C C A B B D B C D C B B C B A  A B B A D A C C B B C C D C D A  A C C A B B D B C D C B B C B A  A B B A D A C C B B C C D C D A  A C C A B B D B C D C B B C …  A B B A D A C C B B C C D C D A

k-tuples  Decompose strings into k-tuples with corresponding offset.  E.g. with k=3, “ A C C A B B” becomes  0: A C C  1: C C A  2: C A B  3: A B B  Do this for the database and the query

Example strings  A C C A B B D B C D C B B C B A  A B B A D A C C B B C C D C D A  A C C A B B D B C D C B B C B A  A B B A D A C C B B C C D C D A  A C C A B B D B C D C B B C …  A B B A D A C C B B C C D C D A

3-tuple join 0: ACC 1: CCA 2: CAB 3: ABB 4: BBD 5: BDB 6: DBC 7: BCD 8: CDC 9: DCB 10: CBB 11: BBC 12: BCB 13: CBA 0: ABB 1: BBA 2: BAD 3: ADA 4: DAC 5: ACC 6: CCB 7: CBB 8: BBC 9: BCC 10: CCD 11: CDC 12: DCD 13: CDA 3 -5 3 -3

Matches / hits  Lots on the same diagonal: might be a good alignment. Offset in query Offset in db string

 Do e.g. “banded DP” around diagonals with many matches Don’t do full DP Offset in query Offset in db string

Some options  If no diagonal with multiple matches, don’t DP at all.  Don’t just allow exact ktup matches, but generate ‘neighborhood’ for query tuples.  …

Personal experience  “Database architecture” practical assignment  MonetDB: main memory DBMS  SWISS-PROT decomposed into 3-tuples 150k strings 150M 3-tuples  Find database strings with more than one match on the same diagonal.

Personal experience 43 char query string  ~500k matches (in ~122k strings)  ~32k diagonals with more than one match  in ~25k strings With some implementation effort: ~1s (Kudos for Monet here!)

Personal experience  From 150k strings to 15k ‘probable’ strings in 1 second.  This discards 90% percent of database for almost no work.  And even gives extra information to speed up subsequent calculations.  … but might discard otherwise good matches.

Personal experience Tiny query 6 char query  122 diagonals in 119 strings in no time at all An actual protein from the database: 250 char query  ~285k diagonals in ~99k strings in about 5 seconds

BLAST/FASTA conclusion  - Not guaranteed same result.  - But tend to work well.

Questions?

Statistical significance  What do the scores mean?  Score parameters (PAM, BLOSUM)

What do the scores mean?  We are calculating ‘optimal’ scores, but what do they mean? Used log-odds to get an additive scoring scheme  Biologically meaningful versus Just the best alignment between random strings 1.Bayesian approach 2.Classical approach

Bayesian approach  Interested in: Probability of a match, given the strings P( M | x,y )  Already know: Probability of strings given the models, i.e. P( x,y | M ) and P( x,y | R )  So … Bayes rule.

Bayesian approach  Bayes rule gives us: P( x,y | M ) P( M ) P( M | x,y ) = P( x,y )  …rewrite… …rewrite… …rewrite…

Bayesian approach  P( M | x,y ) = σ( S’ ) S’ = log( P(x,y|M)/P(x,y|R) ) + log( P(M)/P(R) ) Our score!

Take care!  Requires that substitution matrix contains probabilities.  Mind the prior probabilities: when matching against a database, subtract a log(N) term.  Alignment score is for the optimal alignment between the strings; ignores possible other good alignments.

Classical approach  Call the maximum score among N random strings: M N.  P( M N < x ) means “Probability that the best match from a search of a large number N of unrelated sequences has score lower than x.” is an Extreme Value Distribution  Consider x = our score. Then if this probability is very large: likely that our match was not just random.

Correcting for length  Additive score  So longer strings have higher scores!

Correcting for length  If match with any string should be equally likely Correct for this bias by subtracting log(length) from the score Because score is log-odds, this `is a division’ that normalizes score

Scoring parameters  How to get the values in cost matrices? 1.Just count frequencies? 2.PAM 3.BLOSUM

Just count frequencies?  Would be maximum likelihood estimate  - Need lots of confirmed alignments  - Different amounts of divergence

PAM  Grouped proteins by ‘family.’  Phylogenetic tree  PAM1 matrix probabilities for 1 unit of time.  PAMn as (PAM1) n  PAM250 often used.

BLOSUM  Long-term PAM are inaccurate: Inaccuracies in PAM1 multiply! Actually differences between short-term and long-term changes.  Different BLOSUM matrices are specifically determined for different levels of divergence Solves both problems.

Gap penalties  No proper time-dependent model  But seems reasonable that: expected number of gaps linear in time length of gaps constant distribution

Questions?

What have we seen?  Linear space DP  Heuristics: BLAST, FASTA  What the scores mean  Available substitution matrices: PAM, BLOSUM

Last chance for questions…

Download ppt "Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk."

Similar presentations