
1
Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk

2
Content
- Dynamic programming: improve to linear space
- Heuristics for database searches: BLAST, FASTA
- Statistical significance: what the scores mean
- Score parameters (PAM, BLOSUM)

3
Dynamic programming: improve to linear space

4
Dynamic programming
Needleman-Wunsch and Smith-Waterman run in O(nm) time and use O(nm) space. For proteins this is okay, but DNA strings are huge! Goal: improve this to O(n+m) space while still using O(nm) time.

5
Basic idea
Full DP requires O(nm) space.

6
Basic idea for linear space
Cells don't directly depend on cells more than one column to the left, so keep only two columns and forget the rest. But then there are no back-pointers: how do we find the alignment?
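The two-column idea can be sketched as follows (a minimal Python sketch; the scoring parameters are illustrative, not from the slides):

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score in O(min-side) space: keep only two columns."""
    prev = [j * gap for j in range(len(y) + 1)]      # column for x[:0]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,    # diagonal: align x[i-1] with y[j-1]
                          prev[j] + gap,      # gap in y
                          curr[j - 1] + gap)  # gap in x
        prev = curr                           # forget everything older
    return prev[-1]

print(nw_score("ACGT", "AGT"))  # best score: 3 matches, 1 gap = 1
```

Note that this yields only the score; recovering the alignment itself is exactly the problem the next slides address.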

7
Divide and conquer
If we happen to know a cell on the optimal alignment, we can split the problem there and solve each half separately.

8
Divide and conquer
We could then repeat this trick on each half!

9
Divide and conquer
But how do we find such a point?

10
Modified DP
Determine where the optimal alignment crosses a certain column (say, the middle one): at every cell, remember at which row the best path to that cell crossed that column.
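A sketch of this modified DP: alongside the two score columns it carries, for every cell, the row at which the best path crossed the middle column (function and parameter names are mine, not from the slides):

```python
def middle_crossing(x, y, match=1, mismatch=-1, gap=-2):
    """Linear-space NW score plus the row where the optimal alignment
    crosses the middle column of x (a sketch)."""
    mid = len(x) // 2
    prev = [j * gap for j in range(len(y) + 1)]
    # crossing row for each cell; known immediately if mid is column 0
    pcross = list(range(len(y) + 1)) if mid == 0 else [None] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        ccross = [pcross[0]] + [None] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [(prev[j - 1] + s, pcross[j - 1]),   # diagonal
                       (prev[j] + gap, pcross[j]),         # gap in y
                       (curr[j - 1] + gap, ccross[j - 1])] # gap in x
            curr[j], ccross[j] = max(options, key=lambda t: t[0])
        if i == mid:                 # in column mid, the crossing row is j itself
            ccross = list(range(len(y) + 1))
        prev, pcross = curr, ccross
    return prev[-1], pcross[-1]

print(middle_crossing("ACGT", "ACGT"))  # score 4, crosses column 2 at row 2
```

Knowing this crossing row is exactly the split point the divide-and-conquer scheme needs.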

11
Space analysis
We only ever keep two columns at a time: clearly O(n+m) space. But what about the time?

12
What are we doing?
Using linear-space DP at every step of a divide-and-conquer scheme.

13
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The first pass costs n² cell computations.

14
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The second level consists of two subproblems of ¼n² each, so: n² + ½n².

15
Time analysis
We do more work now, but how much? Look at the case of two identical strings. The next level adds four subproblems of 1/16 n² each, so: n² + ½n² + ¼n².

16
Time analysis
We do more work now, but how much? Look at the case of two identical strings. Et cetera: n² + ½n² + ¼n² + … < 2n².

17
Time analysis
We do more work now, but how much? Look at the case of two identical strings: n² + ½n² + ¼n² + … < 2n². Along the same lines, the algorithm in general is still O(nm), and in fact only about twice as much work as the full DP.
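The geometric bound can be checked numerically (n = 1000 is an arbitrary choice):

```python
# Each divide-and-conquer level halves the total DP area, so the work
# is a geometric series: n² + n²/2 + n²/4 + ... < 2n².
n = 1000
work, area = 0, n * n
while area >= 1:
    work += area
    area //= 2            # next level covers half the remaining area
assert work < 2 * n * n   # the series never reaches 2n²
print(work / (n * n))     # approaches 2
```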

18
Questions?

19
Heuristics for database search: BLAST, FASTA

20
Searching a database
Given: a database of strings and a query string. Task: select the best-matching string(s) from the database.

21
Algorithms we already know
Until now, the algorithms were exact and correct (for the model!), but for this purpose they are too slow. Say 100M strings of 1000 characters each, and 10M DP cells per second: that leads to roughly 3 hours of search time.
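A back-of-the-envelope check of the slide's numbers (assuming the total work is about one DP cell per database character, i.e. ~10^11 cells):

```python
strings = 100_000_000     # 100M database strings
chars = 1000              # characters per string
rate = 10_000_000         # DP cells evaluated per second
cells = strings * chars   # ~1e11 cells in total
seconds = cells / rate
print(seconds / 3600)     # ~2.8 hours
```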

22
Getting faster
Full DP is too expensive. The space issue is fixed, with a guaranteed identical result; now for the time issue.

23
BLAST and FASTA
Heuristics that avoid doing the full DP:
- Not guaranteed to give the same result.
- But tend to work well.
Idea: first spend some time analyzing the strings; then don't calculate all DP cells, only some, based on that analysis.

24
Basic idea
In a good alignment, it is likely that several short parts match exactly. Example strings:
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

25
k-tuples
Decompose strings into k-tuples with their offsets. E.g. with k=3, "A C C A B B" becomes:
0: A C C
1: C C A
2: C A B
3: A B B
Do this for the database and for the query.
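The decomposition itself is a one-liner (Python sketch):

```python
def ktuples(s, k=3):
    """All k-tuples of s, each with its offset."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]

print(ktuples("ACCABB"))  # [(0, 'ACC'), (1, 'CCA'), (2, 'CAB'), (3, 'ABB')]
```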

26
Example strings
A C C A B B D B C D C B B C B A
A B B A D A C C B B C C D C D A

27
3-tuple join
Query:    0: ACC  1: CCA  2: CAB  3: ABB  4: BBD  5: BDB  6: DBC  7: BCD  8: CDC  9: DCB  10: CBB  11: BBC  12: BCB  13: CBA
Database: 0: ABB  1: BBA  2: BAD  3: ADA  4: DAC  5: ACC  6: CCB  7: CBB  8: BBC  9: BCC  10: CCD  11: CDC  12: DCD  13: CDA
Matching tuples give diagonals (query offset minus database offset): ABB: 3, ACC: -5, CBB: 3, BBC: 3, CDC: -3.
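The join can be sketched in Python; a hit at query offset i and database offset j lies on diagonal i - j (function name is mine):

```python
from collections import defaultdict

def diagonal_hits(query, db_string, k=3):
    """Join the k-tuple lists of two strings; count hits per diagonal."""
    index = defaultdict(list)                 # k-tuple -> offsets in db_string
    for j in range(len(db_string) - k + 1):
        index[db_string[j:j + k]].append(j)
    diagonals = defaultdict(int)              # diagonal -> number of hits
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return dict(diagonals)

q = "ACCABBDBCDCBBCBA"
d = "ABBADACCBBCCDCDA"
print(diagonal_hits(q, d))   # {-5: 1, 3: 3, -3: 1}
```

Diagonal 3 collects three hits (ABB, CBB, BBC), which is exactly the kind of diagonal the next slides focus on.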

28
Matches / hits
Plot the hits by offset in the query versus offset in the database string. Lots of hits on the same diagonal: might be a good alignment.

29
Don't do the full DP: instead do e.g. "banded DP" around diagonals with many matches.
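A minimal sketch of banded DP (global alignment restricted to a band around one diagonal; the function name, band convention, and scoring parameters are illustrative):

```python
def banded_score(x, y, diag, width, match=1, mismatch=-1, gap=-2):
    """Alignment score using only cells within `width` of diagonal j - i = diag.
    Returns -inf if the band never reaches the final cell."""
    NEG = float("-inf")
    # full-size table for clarity; only the band cells are ever filled
    score = [[NEG] * (len(y) + 1) for _ in range(len(x) + 1)]
    score[0][0] = 0
    for i in range(len(x) + 1):
        lo = max(0, i + diag - width)
        hi = min(len(y), i + diag + width)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)
            if j > 0:
                best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
    return score[len(x)][len(y)]

print(banded_score("ACGT", "ACGT", diag=0, width=1))  # 4: all matches
```

Only O(width · n) cells are computed, which is the entire point: the k-tuple analysis picks the diagonals, and the DP is confined to them.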

30
Some options
- If there is no diagonal with multiple matches, don't run DP at all.
- Don't just allow exact k-tuple matches; generate a 'neighborhood' of similar tuples for the query.
- …

31
Personal experience
"Database architecture" practical assignment. MonetDB: a main-memory DBMS. SWISS-PROT decomposed into 3-tuples: 150k strings, 150M 3-tuples. Task: find database strings with more than one match on the same diagonal.

32
Personal experience
A 43-character query string: ~500k matches (in ~122k strings); ~32k diagonals with more than one match, in ~25k strings. With some implementation effort: ~1 second (kudos to Monet here!).

33
Personal experience
From 150k strings to 15k 'probable' strings in 1 second. This discards 90% of the database for almost no work, and even gives extra information to speed up subsequent calculations. But it might discard otherwise good matches.

34
Personal experience
A tiny query (6 characters): 122 diagonals in 119 strings, in no time at all. An actual protein from the database (250-character query): ~285k diagonals in ~99k strings, in about 5 seconds.

35
BLAST/FASTA conclusion
- Not guaranteed to give the same result.
- But tend to work well.

36
Questions?

37
Statistical significance
What do the scores mean?
Score parameters (PAM, BLOSUM)

38
What do the scores mean?
We are calculating 'optimal' scores, but what do they mean? We used log-odds to get an additive scoring scheme. The question: is an alignment biologically meaningful, or just the best alignment between random strings?
1. Bayesian approach
2. Classical approach

39
Bayesian approach
Interested in: the probability of a match given the strings, P(M | x,y). Already known: the probability of the strings given the models, i.e. P(x,y | M) and P(x,y | R). So: Bayes' rule.

40
Bayesian approach
Bayes' rule gives us:
P(M | x,y) = P(x,y | M) P(M) / P(x,y)
…then rewrite.

41
Bayesian approach
P(M | x,y) = σ(S'), where
S' = log( P(x,y|M) / P(x,y|R) ) + log( P(M) / P(R) )
The first term is our alignment score!
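The elided rewrite steps, assuming every pair of strings comes from either the match model M or the random model R (so P(x,y) = P(x,y|M)P(M) + P(x,y|R)P(R)):

```latex
\begin{align*}
P(M \mid x,y)
  &= \frac{P(x,y \mid M)\,P(M)}{P(x,y \mid M)\,P(M) + P(x,y \mid R)\,P(R)} \\
  &= \frac{1}{1 + \dfrac{P(x,y \mid R)\,P(R)}{P(x,y \mid M)\,P(M)}}
   = \frac{1}{1 + e^{-S'}} = \sigma(S'),\\
\text{where } S'
  &= \log\frac{P(x,y \mid M)}{P(x,y \mid R)} + \log\frac{P(M)}{P(R)}.
\end{align*}
```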

42
Take care!
- Requires that the substitution matrix contains probabilities.
- Mind the prior probabilities: when matching against a database of N strings, subtract a log(N) term.
- The alignment score is for the optimal alignment between the strings; it ignores other possible good alignments.

43
Classical approach
Call the maximum score among N random strings M_N. P(M_N < x) means "the probability that the best match from a search of a large number N of unrelated sequences has score lower than x"; it follows an extreme value distribution. Consider x = our score: if this probability is very large, it is likely that our match was not just random.
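The usual closed form for this extreme value distribution (the Gumbel form of the Karlin-Altschul statistics; the constants K and λ depend on the scoring scheme and are not given in these slides):

```latex
\[
  P(M_N < x) \;\approx\; \exp\!\left(-K N e^{-\lambda x}\right),
\]
% a match is then considered significant when K N e^{-\lambda x} is small.
```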

44
Correcting for length
The score is additive, so longer strings get higher scores!

45
Correcting for length
If a match with any string should be equally likely a priori, correct for this bias by subtracting log(length) from the score. Because the score is a log-odds value, this 'is a division' that normalizes the score.

46
Scoring parameters
How do we get the values in the cost matrices?
1. Just count frequencies?
2. PAM
3. BLOSUM

47
Just count frequencies?
This would be a maximum likelihood estimate, but:
- it needs lots of confirmed alignments;
- alignments show different amounts of divergence.

48
PAM
Proteins grouped by 'family', with a phylogenetic tree per family. The PAM1 matrix gives substitution probabilities for 1 unit of evolutionary time; PAMn is obtained as (PAM1)^n. PAM250 is often used.
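The PAMn = (PAM1)^n construction, on a toy 2-letter alphabet (illustrative numbers; a real PAM1 is 20x20 and estimated from confirmed alignments):

```python
import numpy as np

pam1 = np.array([[0.99, 0.01],
                 [0.01, 0.99]])   # rows sum to 1: per-unit-time probabilities
pam250 = np.linalg.matrix_power(pam1, 250)
print(pam250)                     # off-diagonal entries grow as changes accumulate
```

Note how the small off-diagonal probability 0.01 compounds to roughly 0.5 after 250 steps; any inaccuracy in PAM1 compounds the same way, which is the objection the next slide raises.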

49
BLOSUM
Long-term PAM matrices are inaccurate: inaccuracies in PAM1 multiply, and there are actual differences between short-term and long-term changes. The various BLOSUM matrices are determined directly for different levels of divergence, which solves both problems.

50
Gap penalties
There is no proper time-dependent model, but it seems reasonable that:
- the expected number of gaps is linear in time;
- the distribution of gap lengths is constant.

51
Questions?

52
What have we seen?
- Linear-space DP
- Heuristics: BLAST, FASTA
- What the scores mean
- Available substitution matrices: PAM, BLOSUM

53
Last chance for questions…
