
1 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller and David J. Lipman

2 Introduction to BLAST BLAST is a heuristic approximation to dynamic programming based local alignment. Finds locally maximal segment pairs with scores over a cutoff. Has a formal statistical theory to assess the significance of scores.
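The statistical theory mentioned above is usually summarized by the Karlin-Altschul relation E = K·m·n·e^(−λS), which gives the expected number of chance segment pairs scoring at least S for a query of length m and a database of length n. A minimal Python sketch follows; the function names and the numeric values of λ and K are illustrative assumptions, not values taken from the slides.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance segment pairs scoring >= raw_score:
    E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameter values only (assumption, not from the slides).
LAM, K = 0.318, 0.134

if __name__ == "__main__":
    s = 50
    print("bit score:", bit_score(s, LAM, K))
    print("E-value  :", e_value(s, 250, 1_000_000, LAM, K))
```

In terms of the normalized score the same quantity is E = m·n·2^(−S'), which is why normalized scores can be compared across scoring systems.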

3 Basic Algorithm Looks for word pairs of length w with score at least T. These hits are then extended to check for segment pairs with score greater than S (where S > T). Tradeoff: lowering T reduces the probability of missing segment pairs (increases sensitivity) but increases the number of hits to be extended.
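To make the tradeoff on T concrete, the sketch below enumerates the "neighborhood" of a query word: every word of the same length whose aligned score with it is at least T. The scoring function `toy_score` is a hypothetical stand-in for a real substitution matrix, and brute-force enumeration is used only for clarity.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    """Hypothetical substitution score (assumption): +5 match, -1 mismatch."""
    return 5 if a == b else -1

def neighborhood(query_word, threshold, score=toy_score):
    """Return all words of len(query_word) whose aligned score with
    query_word is >= threshold (brute force over 20^w candidates)."""
    w = len(query_word)
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=w):
        s = sum(score(q, c) for q, c in zip(query_word, candidate))
        if s >= threshold:
            hits.append("".join(candidate))
    return hits

if __name__ == "__main__":
    for t in (15, 9):
        print("T =", t, "->", len(neighborhood("ACD", t)), "neighborhood words")
```

With this toy scoring, lowering T from 15 to 9 admits every word with a single mismatch, so the number of words that can produce a hit grows sharply.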

4 Scanning for hits Two approaches: – For each length-w word that scores at least T against a query word, the matching query positions are stored in an array with 20^w entries, and hits are detected by array lookup. – A deterministic finite automaton (DFA) for the appropriate words is generated and used to scan the sequences; a Mealy machine (acceptance on transitions) is used for efficiency.
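The first approach can be sketched as follows. This version uses a Python dictionary keyed by word instead of a dense 20^w array, which is a substitution made for brevity; the `neighbors` parameter is a hypothetical hook for supplying the words that score at least T against a query word, and defaults to exact matching.

```python
from collections import defaultdict

def build_lookup(query, w, neighbors=lambda word: [word]):
    """Map each indexed word to the query positions it should hit.
    `neighbors` (assumption) returns the words scoring >= T against a
    query word; the default indexes only the exact word."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        for word in neighbors(query[i:i + w]):
            table[word].append(i)
    return table

def scan(subject, table, w):
    """Slide a window of length w along the subject sequence and return
    every (query_pos, subject_pos) hit found by table lookup."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits

if __name__ == "__main__":
    table = build_lookup("ACDEFG", 3)
    print(scan("TTCDEFTT", table, 3))
```

Each database position costs one lookup, so the scan time is dominated by the number of hits rather than by the size of the table.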

5 Other Issues Hit extension is simplified by stopping once the score falls a fixed amount below the best score found for shorter extensions. The various parameters are chosen based on experiments using random sequences. Combinations of maximal segment pairs (MSPs) can be used to get better scores for matching sequences.
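The stopping rule for hit extension can be illustrated with a small sketch that extends an ungapped alignment in one direction only. The scoring function `toy_score` and the parameter name `x_drop` are assumptions made for illustration; a full implementation would extend in both directions and use a real substitution matrix.

```python
def toy_score(a, b):
    """Hypothetical substitution score (assumption): +5 match, -4 mismatch."""
    return 5 if a == b else -4

def extend_right(query, subject, q_start, s_start, x_drop, score=toy_score):
    """Extend an ungapped alignment rightward from (q_start, s_start).
    Stop once the running score has dropped x_drop or more below the best
    score seen so far; return (best_score, length_at_best)."""
    best = running = 0
    best_len = length = 0
    i, j = q_start, s_start
    while i < len(query) and j < len(subject):
        running += score(query[i], subject[j])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running >= x_drop:
            break  # score has fallen too far below the best: give up
        i += 1
        j += 1
    return best, best_len

if __name__ == "__main__":
    print(extend_right("ACDEFWWWW", "ACDEFKKKK", 0, 0, x_drop=10))
```

A larger `x_drop` lets the extension cross longer low-scoring stretches at the cost of more work per hit.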

6 Two-Hit Method Original BLAST (one-hit): each hit is extended to determine whether it lies in a high-scoring alignment; extension consumes >90% of processing time. Hit: a short word pair whose aligned score is ≥ T. Two-hit method: extension is invoked only if there are two non-overlapping word pairs on the same diagonal; lowering T yields more hits, but only a few are extended; 3x faster. T: threshold parameter; as T ↑, speed ↑, probability of missing weak similarities ↑

7 Two-Hit Method Algorithm Scan the database for hits (word pairs scoring ≥ T). Seek pairs of non-overlapping hits found within distance A of one another on the same diagonal. Invoke (ungapped) extension to determine whether the hits lie within a statistically significant alignment with the query. Extend until the alignment score has dropped ≥ X below the maximum score yet attained.
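The diagonal bookkeeping behind the two-hit rule can be sketched as below. Hits are (query_pos, subject_pos) pairs, the diagonal is their difference, and a hit triggers extension only when an earlier non-overlapping hit lies on the same diagonal within the window. The function name, the use of a dictionary, and the choice of "distance ≤ window" as the boundary convention are assumptions for illustration.

```python
def two_hit_triggers(hits, w, window):
    """Return the hits that would trigger ungapped extension.
    hits   : iterable of (query_pos, subject_pos) word hits
    w      : word length (hits closer than w on a diagonal overlap)
    window : maximum distance A between the two hits"""
    last_on_diag = {}  # diagonal -> subject position of most recent hit
    triggers = []
    for q_pos, s_pos in sorted(hits, key=lambda h: h[1]):
        diag = s_pos - q_pos
        prev = last_on_diag.get(diag)
        if prev is not None:
            dist = s_pos - prev
            if dist < w:
                continue  # overlaps the most recent hit: ignore it
            if dist <= window:
                triggers.append((q_pos, s_pos))
        last_on_diag[diag] = s_pos
    return triggers

if __name__ == "__main__":
    hits = [(0, 10), (5, 15), (20, 31), (2, 12), (1, 11)]
    print(two_hit_triggers(hits, w=3, window=40))
```

Only one integer per diagonal is stored, so the check adds little to the cost of the scan while filtering out most isolated hits.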

8 Gapped Alignments Original BLAST treats gapped alignments implicitly: it locates several distinct HSPs within the same database sequence and calculates statistical significance on the combined result. Gapped BLAST triggers a gapped extension for any HSP exceeding a moderate score S_g. Gapped extension takes longer to execute, but few HSPs undergo it. HSP: high-scoring segment pair; locally optimal.

9 Advantage of New Heuristic for Generating Gapped Alignments Two or more HSPs may each have a low score independently but be statistically significant together. Only one of the constituent HSPs needs to be found to generate a successful combined result, so T can be increased.

10 Older Gapped Alignments Confine the dynamic programming to a banded section of the full path graph. The optimal gapped alignment may lie outside this band. As the width of the band ↑, speed ↓

11 New Heuristic for Generating Gapped Alignments Starting from a seed within an HSP, dynamic programming proceeds in both directions through the path graph. Only cells whose optimal local alignment score falls no more than X_g below the best score yet found are considered. The region of the path graph explored adapts to the alignment being constructed. Seed: the central residue pair of the segment with the highest alignment score along the HSP.
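The pruning rule can be sketched with a simplified dynamic program that extends in one direction only from a seed at the start of both sequences. This is not the paper's implementation: it assumes a toy match/mismatch score and a linear gap penalty (all parameter values are hypothetical), and a full version would run the same procedure backward from the seed as well.

```python
def xdrop_gapped_extend(a, b, x_drop, match=5, mismatch=-4, gap=-6):
    """Best score of a gapped alignment starting at the beginning of a and b,
    keeping only cells whose score is within x_drop of the best so far."""
    best = 0
    prev = {0: 0}  # sparse row: column -> score of surviving cells
    j = 1
    while j <= len(b) and j * gap >= best - x_drop:
        prev[j] = j * gap  # leading gap characters in a
        j += 1
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(min(prev), len(b) + 1):
            candidates = []
            if j in prev:
                candidates.append(prev[j] + gap)       # gap in b
            if j - 1 in prev:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append(prev[j - 1] + sub)   # aligned pair
            if j - 1 in cur:
                candidates.append(cur[j - 1] + gap)    # gap in a
            if not candidates:
                if j > max(prev):
                    break      # nothing reachable further right
                continue
            s = max(candidates)
            if s >= best - x_drop:  # otherwise the cell is pruned
                cur[j] = s
                best = max(best, s)
        if not cur:
            break  # every cell in this row was pruned
        prev = cur
    return best

if __name__ == "__main__":
    print(xdrop_gapped_extend("ACDEFGHIKL", "ACDEGHIKL", x_drop=10))
```

Because surviving cells are tracked per row, the explored region follows the alignment wherever it drifts instead of being confined to a fixed band.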

12 New Gapped BLAST Ungapped extension of the second hit is invoked for two non-overlapping hits of score ≥ T within distance A of one another. If the HSP generated has normalized score ≥ S_g, gapped extension is triggered. The resulting gapped alignment is reported if it is statistically significant (low enough E-value). Runs on average 3x faster than the original BLAST.

13 PSI-BLAST: Overview Results of an initial BLAST search are used to construct position-specific scores. BLAST is repeated using the new scores until no new sequences are found. Position-specific scores improve the ability of successive BLAST iterations to detect remote homologs.

14 Position-specific score matrix Dimensions: L × 20, where L is the query length. A "multiple alignment" is created using all segments with E-value below a threshold. The alignment is based on the pairwise alignments. Columns with gaps in the query are ignored. For each column C a reduced alignment M_C is created.

15 M_C includes all sequences with a residue in C and all columns in which all of those sequences are represented. A sequence weighting method is used to generate the observed residue frequencies. The score for residue i in column C is given by log(Q_i / P_i), where P_i is the background frequency of residue i and Q_i is a weighted combination of the observed frequency and a pseudocount frequency.
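The column score log(Q_i / P_i) can be sketched as follows. This is a simplified illustration: the pseudocount frequency is taken to be the background frequency itself, and the mixing weights `alpha` and `beta` as well as the function name are hypothetical, so the numbers will not match PSI-BLAST output.

```python
import math
from collections import defaultdict

def column_scores(residues, weights, background, alpha, beta):
    """Score each residue type for one alignment column.
    residues   : residue observed in this column for each sequence
    weights    : one sequence weight per residue
    background : dict residue -> background frequency P_i
    alpha/beta : relative weight of observed vs. pseudocount frequencies"""
    total = sum(weights)
    freq = defaultdict(float)  # weighted observed frequencies f_i
    for r, wt in zip(residues, weights):
        freq[r] += wt / total
    scores = {}
    for aa, p in background.items():
        # Simplification (assumption): pseudocount frequency = background.
        q = (alpha * freq[aa] + beta * p) / (alpha + beta)
        scores[aa] = math.log(q / p)
    return scores

if __name__ == "__main__":
    bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # toy 4-letter alphabet
    print(column_scores(["A", "A", "C"], [1, 1, 2], bg, alpha=1, beta=1))
```

Residues seen more often than the background expects receive positive scores, unseen residues receive negative but finite scores thanks to the pseudocount term.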

