PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R94922059 林語君.


1 PatternHunter II: Highly Sensitive and Fast Homology Search
Ming Li, Bin Ma, Derek Kisman, John Tromp
R94922059 林語君

2 Overview
- Homology search
- Local alignment algorithms
- PH I
- PH II
- Multiple spaced seeds
- Computing hit probability
- Finding a good seed set
- PH II design
- Performance

3 Local alignment
- Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987): SSearch
- FastA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
- BLAST (Altschul et al., 1990; Altschul et al., 1997)
- BLAST family: BLASTN, BLASTP, etc.
- MEGABLAST

4 PatternHunter
- Seed
- Tradeoff: sensitivity vs. computation
- Consecutive k letters: k=11 in BLASTN, k=28 in MEGABLAST
- Nonconsecutive k letters: a spaced seed, described by a model whose weight is k
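To make the difference between consecutive and spaced seeds concrete, here is a minimal sketch. It assumes the usual convention that a seed model is a 0/1 string whose 1s mark positions that must match, and that a region is a 0/1 string of matches and mismatches. The function names `seed_weight` and `hit_positions` are hypothetical, chosen only for this illustration; the weight-11 model is the first seed listed on slide 13.

```python
def seed_weight(seed):
    """Weight of a seed model = number of required-match positions ('1's)."""
    return seed.count("1")


def hit_positions(seed, region):
    """Start offsets at which `seed` hits `region`.

    `region` is a string of '1' (match) and '0' (mismatch). The seed hits
    at offset s if every '1' position of the seed lines up with a '1'
    in the region; '0' positions of the seed are "don't care".
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_start = len(region) - len(seed)
    return [
        s
        for s in range(last_start + 1)
        if all(region[s + i] == "1" for i in required)
    ]


# Spaced model taken from the seed lists on slide 13.
spaced = "111010010100110111"
consecutive = "1" * 11  # BLASTN-style seed of the same weight

print(seed_weight(spaced), seed_weight(consecutive))
print(hit_positions("1101", "0110111"))  # spaced toy seed
print(hit_positions("111", "0110111"))   # consecutive toy seed
```

Both seeds have the same weight, so they cost about the same per random hit, but they tolerate mismatches in different places.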

5 PatternHunter II
- Genome Informatics 14 (2003)
- Extends the single optimized spaced seed of PH to multiple seeds
- Speed: at the level of BLASTN (MEGABLAST)
- Sensitivity: approaching Smith-Waterman (SSearch)

6 Definition
- A homologous region, R
- A seed hits R
- A seed set A = {a_1, …, a_k} hits R
- Similarity: R has p = x% identities
- Sensitivity: hit probability
- Optimal (DP) = 1

7 Computing Hit Probability
- NP-hard on multiple seeds
- DP on 1 seed
- Extend DP to multiple seeds

8 Computing Hit Probability of Multiple Seeds
- Let A = {a_1, …, a_k} be a set of k seeds and R a random region of length L with similarity level p.
- Binary string b is a suffix of R[0:i]
- Answer: f(L, ε), where ε is the empty string

9 Computing Hit Probability of Multiple Seeds

10 Computing Hit Probability of Multiple Seeds
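The recurrence from these slides did not survive in the transcript, so the following is only a sketch of one way to compute the hit probability of a seed set by dynamic programming, not necessarily the formulation used in the paper. It assumes each position of R is a match independently with probability p, and it tracks, for every recent suffix of the region, the probability that no seed has hit so far. The name `hit_probability` is hypothetical. The state space grows as 2^(span-1), so this version is meant for short toy seeds.

```python
def _hits_at_end(seed, window):
    """True if `seed` hits when aligned with the last len(seed) bits of `window`."""
    if len(window) < len(seed):
        return False
    tail = window[-len(seed):]
    return all(t == "1" for c, t in zip(seed, tail) if c == "1")


def hit_probability(seeds, length, p):
    """Probability that at least one seed hits a random region.

    The region has `length` positions; each is a match ('1') independently
    with probability `p` (an assumption of this sketch).
    """
    span = max(len(s) for s in seeds)
    # Map: last (span - 1) bits seen -> probability of that suffix
    # with no seed having hit anywhere so far.
    no_hit = {"": 1.0}
    for _ in range(length):
        nxt = {}
        for suffix, prob in no_hit.items():
            for bit, weight in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if any(_hits_at_end(s, window) for s in seeds):
                    continue  # this probability mass is now counted as a hit
                key = window[-(span - 1):] if span > 1 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * weight
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


print(hit_probability(["11"], 3, 0.5))
print(hit_probability(["11", "101"], 3, 0.5))
```

For length 3 and p = 0.5 the results can be checked by listing all eight regions: three of them contain "11", and one more (101) is caught only by the seed "101".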

11 Finding a Good Seed Set
- NP-hard, both for one optimal seed and for multiple seeds
- Greedy approach

12 Finding a Good Seed Set
- Compute the first seed a_1, which maximizes the hit probability of {a_1}
- Compute the second seed a_2, which maximizes the hit probability of {a_1, a_2}
- Repeat until either the desired number of seeds or the desired hit probability is reached
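The greedy steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidates are enumerated exhaustively for a small weight and span, positions of R are independent matches with probability p, and the names `hit_probability`, `candidate_seeds`, and `greedy_seed_set` are hypothetical.

```python
from itertools import combinations


def hit_probability(seeds, length, p):
    """P(some seed hits) for a region of i.i.d. match bits, P(match) = p."""
    span = max(len(s) for s in seeds)
    no_hit = {"": 1.0}  # recent suffix -> probability, no hit so far
    for _ in range(length):
        nxt = {}
        for suffix, prob in no_hit.items():
            for bit, w in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                hit = any(
                    len(window) >= len(s)
                    and all(t == "1" for c, t in zip(s, window[-len(s):]) if c == "1")
                    for s in seeds
                )
                if not hit:
                    key = window[-(span - 1):] if span > 1 else ""
                    nxt[key] = nxt.get(key, 0.0) + prob * w
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def candidate_seeds(weight, max_span):
    """All models with `weight` ones (weight >= 2) that start and end with '1'."""
    out = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            bits = ["0"] * span
            bits[0] = bits[-1] = "1"
            for i in inner:
                bits[i] = "1"
            out.append("".join(bits))
    return out


def greedy_seed_set(candidates, length, p, max_seeds=None, target=None):
    """Add, one at a time, the seed that most raises the combined hit probability."""
    chosen, prob = [], 0.0
    while True:
        if max_seeds is not None and len(chosen) >= max_seeds:
            break  # reached the desired number of seeds
        if target is not None and prob >= target:
            break  # reached the desired hit probability
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: hit_probability(chosen + [c], length, p))
        chosen.append(best)
        prob = hit_probability(chosen, length, p)
    return chosen, prob


seeds, prob = greedy_seed_set(candidate_seeds(3, 5), length=16, p=0.7, max_seeds=2)
print(seeds, round(prob, 4))
```

Each round re-evaluates every remaining candidate together with the seeds already chosen, which mirrors the two stopping rules on the slide but gives no guarantee that the final set is jointly optimal.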

13 Finding a Good Seed Set
- May not optimize the combined hit probability
- Good enough
- Optimal: 16 weight, 11 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111}
- Greedy: 16 weight, 12 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}

14 Performance of the seeds
- Curves, from low to high
- Solid: weight 11, k = 1, 2, 4, 8, 16 seeds
- Dashed: 1 seed, weight = 10, 9, 8, 7

15 Performance of the seeds
- Reducing the weight by 1 increases the expected number of hits by a factor of 4
- Doubling the number of seeds increases the expected number of hits by a factor of 2
- Better: multiple seeds
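The two factors above follow from a simple count of random hits. As a rough sketch, assume unrelated DNA sequences with independent, uniformly distributed nucleotides, so that a weight-w seed matches at a given pair of positions with probability 4^(-w), and ignore edge effects. The function name `expected_random_hits` and its parameters are hypothetical.

```python
def expected_random_hits(num_seeds, weight, query_len, db_len, alphabet_size=4):
    """Approximate expected number of chance seed hits between unrelated sequences.

    Assumes i.i.d. uniform letters: each of the `weight` required positions
    matches with probability 1 / alphabet_size, and every seed is tried at
    roughly query_len * db_len position pairs (edge effects ignored).
    """
    per_position = alphabet_size ** (-weight)
    return num_seeds * query_len * db_len * per_position


base = expected_random_hits(num_seeds=1, weight=11, query_len=1000, db_len=1_000_000)
lighter = expected_random_hits(num_seeds=1, weight=10, query_len=1000, db_len=1_000_000)
doubled = expected_random_hits(num_seeds=2, weight=11, query_len=1000, db_len=1_000_000)

print(lighter / base)  # effect of reducing the weight by 1
print(doubled / base)  # effect of doubling the number of seeds
```

Under these assumptions, two weight-11 seeds cost half as many random hits as one weight-10 seed, which is the sense in which multiple seeds are the better way to buy sensitivity.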

16 PH II Performance
- Compare with BLAST (BLASTN), Smith-Waterman (SSearch)
- Sensitivity of SSearch = 1
- Alignment score
- BLAST methods (hash, DP)
- match=1, mismatch=-1, gapopen=-5, gapextension=-1
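To show how these scoring parameters combine, here is a small sketch that scores an already-aligned pair of sequences. It assumes the affine convention in which a gap of length g costs gapopen + g * gapextension; the slide does not say which convention was used, so treat that as an assumption. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned strings of equal length, with '-' marking gaps.

    Assumed convention: a run of g gap characters in one row costs
    gap_open + g * gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if gap_row != row:
                score += gap_open  # a new gap run starts here
            score += gap_extend
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


print(alignment_score("ACGT", "ACGT"))
print(alignment_score("ACGGT", "AC--T"))
```

With these values a single gap is expensive compared with a mismatch, so short ungapped matches dominate the score of weakly similar regions.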

17 PH II Performance
- Curves, from low to high
- Solid: PH II with 1, 2, 4, 8 seeds, weight 11
- Dashed: BLASTN, seed weight 11

18 Complexity Proof
- Finding optimal spaced seeds: NP-hard
- Finding one optimal seed: NP-hard
- Computing the hit probability of multiple seeds: NP-hard

