Published by Mildred Hall. Modified over 9 years ago.
1
Heuristic methods for sequence alignment in practice
Sushmita Roy (sroy@biostat.wisc.edu)
BMI/CS 576, www.biostat.wisc.edu/bmi576/
Sep 27th, 2012
2
Heuristic alignment motivation
Exact dynamic programming is too slow for large databases with high query traffic.
Heuristic methods give a fast approximation to dynamic programming:
– FASTA [Pearson & Lipman, 1988]
– BLAST [Altschul et al., 1990; Altschul et al., Nucleic Acids Research 1997]
3
Heuristic alignment motivation
Consider the task of searching the RefSeq collection of sequences with a query sequence:
– the most recent release of the database contains 17,368,769 proteins and 2,700,925 genomic sequences
– searching the proteins alone entails about 17,000,000 × (300 × 300) ≈ 1.5 × 10^12 matrix operations (assuming a query of length 300 and an average sequence length of 300)
Faster methods are needed to search databases in practice.
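The operation count on this slide can be checked with a short calculation. This is a minimal sketch that assumes one cell update per (query position, database position) pair; the throughput of 10^9 cell updates per second is a hypothetical figure chosen only for illustration.

```python
# Back-of-the-envelope cost of exact DP alignment against every protein.
num_proteins = 17_000_000   # rounded protein count from the slide
query_len = 300             # query length assumed on the slide
avg_db_len = 300            # average database sequence length assumed on the slide

cells = num_proteins * query_len * avg_db_len   # DP matrix cells to fill

# Hypothetical throughput, not a measured number.
cells_per_second = 1e9
seconds = cells / cells_per_second

print(f"{cells:.2e} cell updates, about {seconds / 60:.1f} minutes per query")
```

Even under this optimistic assumption a single query takes tens of minutes, which is the motivation for pre-filtering with word hits.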
4
Heuristic alignment
Heuristic algorithm: a problem-solving method that is not guaranteed to find the optimal solution, but is efficient and finds good solutions.
Key heuristics in BLAST:
– a good alignment is made up of short stretches of matches (seeds)
– seeds are extended to make longer alignments
Key tradeoff made: sensitivity vs. speed
5
Scoring matrices
Proteins: BLOSUM matrices
DNA: separate scores for matches, mismatches, and gaps
6
Overview of BLAST (Basic Local Alignment Search Tool)
Given: query sequence q, word length w, word score threshold T, segment score threshold S
– compile a list of “words” (of length w) that score at least T when compared to words from q
– scan the database for matches to words in the list
– extend all matches to seek high-scoring alignments
Return: alignments scoring at least S
7
Determining query words
Given:
– query sequence: QLNFSAGW
– word length w = 2 (the default is usually w = 3 for protein, 12 for DNA)
– word score threshold T = 9
Step 1: determine all words of length w in the query sequence:
QL LN NF FS SA AG GW
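Step 1 amounts to sliding a window of length w along the query. A minimal sketch follows; the function name `query_words` is made up for this illustration and is not part of BLAST.

```python
def query_words(q, w):
    """Return all overlapping words of length w in q, in order of occurrence."""
    return [q[i:i + w] for i in range(len(q) - w + 1)]

print(query_words("QLNFSAGW", 2))
```

A query of length n yields n - w + 1 words, here 8 - 2 + 1 = 7.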
8
Determining query words
Step 2: determine all words that score at least T when compared to a word in the query sequence

words from sequence    query words with score ≥ T (T = 9)
QL                     QL=9
LN                     LN=10
NF                     NF=12, NY=9
…                      …
SA                     none
…                      …
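Step 2 can be sketched as a brute-force enumeration of candidate words. The score table below is deliberately partial: it holds only the pairs needed for the word NF over the three-letter alphabet N, F, Y, with values chosen to agree with the slide (NF=12, NY=9) and assumed to match BLOSUM62. The names `PARTIAL_SCORES`, `pair_score`, and `neighborhood` are hypothetical.

```python
from itertools import product

# Partial substitution scores (assumption: BLOSUM62 values; NOT a full matrix).
PARTIAL_SCORES = {
    ("N", "N"): 6, ("F", "F"): 6,
    ("F", "Y"): 3, ("N", "F"): -3, ("N", "Y"): -2,
}

def pair_score(a, b):
    """Symmetric lookup in the partial table above."""
    if (a, b) in PARTIAL_SCORES:
        return PARTIAL_SCORES[(a, b)]
    return PARTIAL_SCORES[(b, a)]

def neighborhood(word, alphabet, T, score=pair_score):
    """Return {candidate: score} for every candidate scoring at least T against word."""
    hits = {}
    for cand in product(alphabet, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= T:
            hits["".join(cand)] = s
    return hits

print(neighborhood("NF", "NFY", T=9))
```

Enumerating all candidates costs 20^w score evaluations per query word with the full amino-acid alphabet, which is affordable for w = 3; real implementations may prune this search.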
9
Scanning the database
Search the database for all occurrences of query words.
Approach:
– index database sequences into a table of words (pre-computed)
– index query words into a table (at query time)
[Figure: a word table (NP, NS, NT, NW, NY) linking the query sequence QLNFSAGW to database sequences; for example, NY occurs in MFNYT, STNYD… and NP occurs in NPGAT, TSQRPNP…]
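The pre-computed table can be pictured as a dictionary from each word to the places it occurs. The sketch below uses the four database fragments visible on the slide; the sequence identifiers s1 to s4 and the function names are invented for illustration.

```python
from collections import defaultdict

def build_word_index(db, w):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index

def find_hits(index, words):
    """Look up each query word; words absent from the database are skipped."""
    return {wd: index[wd] for wd in words if wd in index}

# Hypothetical ids for the fragments shown on the slide.
db = {"s1": "MFNYT", "s2": "STNYD", "s3": "NPGAT", "s4": "TSQRPNP"}
index = build_word_index(db, 2)
print(find_hits(index, ["NP", "NS", "NT", "NW", "NY"]))
```

Building the index is done once per database, so each query only pays for dictionary lookups of its word list.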
10
Extending hits BLAST extends hits into local alignments The original version of BLAST extended each hit separately
11
Extending hits in original BLAST
– extend hits in both directions (without allowing gaps)
– terminate extension in one direction when the score falls a certain distance below the best score found for shorter extensions
– return segment pairs scoring at least S
[Figure: an initial hit with its best extension b and current extension c]
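The termination rule above can be sketched as follows. This is an illustrative version, not BLAST's actual code: the drop-off distance is passed in as `x_drop`, the scoring function is a toy match/mismatch scheme, and all function names are assumptions.

```python
def extend_one_direction(q, d, qi, di, step, score, x_drop):
    """Ungapped extension from (qi, di) in direction step (+1 or -1).

    Returns (best score gain, number of positions kept).
    """
    best = current = 0
    best_len = length = 0
    while 0 <= qi < len(q) and 0 <= di < len(d):
        current += score(q[qi], d[di])
        length += 1
        if current > best:
            best, best_len = current, length
        elif best - current > x_drop:
            break                      # fell too far below the best score seen
        qi += step
        di += step
    return best, best_len

def extend_hit(q, d, qi, di, w, score, x_drop):
    """Extend the word hit q[qi:qi+w] / d[di:di+w] in both directions."""
    seed = sum(score(a, b) for a, b in zip(q[qi:qi + w], d[di:di + w]))
    right, rlen = extend_one_direction(q, d, qi + w, di + w, +1, score, x_drop)
    left, llen = extend_one_direction(q, d, qi - 1, di - 1, -1, score, x_drop)
    return seed + left + right, (qi - llen, qi + w + rlen)   # score, query span

def match_mismatch(a, b):
    """Toy DNA scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(extend_hit("ACGTACGT", "TTGTACGA", 2, 2, 3, match_mismatch, x_drop=2))
```

In this example the hit GTA grows to the right into GTACG and stays put on the left, where both neighbouring pairs are mismatches. A segment pair found this way would then be reported only if its score reaches S.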
12
Query pre-processing
Some regions of the sequence are low complexity
– repeats
Identify low-complexity regions and flag them before running BLAST
– DNA sequence: DUST
– Protein: SEG
13
Sensitivity vs. running time
The main parameter controlling the sensitivity vs. running-time trade-off is T (the threshold for what becomes a query word)
– small T: greater sensitivity, more hits to expand
– large T: lower sensitivity, fewer hits to expand
14
Speeding up BLAST: the two-hit method
The extension step typically accounts for 90% of BLAST’s execution time.
Key intuition:
– an HSP (high-scoring segment pair) is longer than a word and is made up of closely located words
– extend only when two words are on the same diagonal within distance A of each other
To maintain sensitivity, lower the T parameter:
– more single hits are found
– but only a small fraction have an associated second hit
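The two-hit trigger can be sketched by grouping word hits by diagonal (database offset minus query offset). The sketch assumes that "distance" means the difference between the start offsets of the two hits and that a hit overlapping the previous one on its diagonal is ignored; `two_hit_seeds` is a made-up name.

```python
from collections import defaultdict

def two_hit_seeds(hits, w, A):
    """Return pairs of non-overlapping hits on one diagonal within distance A.

    hits is a list of (query_offset, db_offset) word hits for one database sequence.
    """
    by_diagonal = defaultdict(list)
    for qi, di in hits:
        by_diagonal[di - qi].append(qi)

    seeds = []
    for diag, q_offsets in by_diagonal.items():
        q_offsets.sort()
        last = None                    # most recent hit kept on this diagonal
        for qi in q_offsets:
            if last is not None and qi - last < w:
                continue               # overlaps the previous hit: ignore it
            if last is not None and qi - last <= A:
                seeds.append(((last, last + diag), (qi, qi + diag)))
            last = qi
    return seeds

hits = [(0, 10), (5, 15), (6, 16), (30, 12), (100, 110)]
print(two_hit_seeds(hits, w=3, A=40))
```

Only the pair (0, 10) and (5, 15) triggers an extension here: (6, 16) overlaps its predecessor, (100, 110) is too far away, and (30, 12) is alone on its diagonal.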
15
The two-hit method
[Figure: ‘+’ hits with T > 12; ‘ ’ hits with T > 10; extend these cases (A < 40)]
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
16
Gapped BLAST
To improve sensitivity:
– trigger a gapped alignment if the two-hit extension has a sufficiently high score S_g
– find the length-11 segment with the highest score; use the central pair in this segment as the seed
– run DP both forward and backward from the seed
– prune cells when the local alignment score falls a certain distance below the best score so far
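The seed-selection step (best length-11 segment, then its central pair) can be sketched with a sliding-window sum over the ungapped HSP. The function and parameter names are hypothetical, the scoring is a toy match/mismatch scheme, and the handling of HSPs shorter than the window (use their own midpoint) is an assumption of this sketch.

```python
def unit_score(a, b):
    """Toy scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def gapped_seed(q, d, q_start, d_start, length, score, window=11):
    """Return the (query, database) position of the central pair of the best window."""
    pair_scores = [score(q[q_start + k], d[d_start + k]) for k in range(length)]
    if length <= window:
        centre = length // 2           # assumption: short HSPs use their midpoint
    else:
        current = sum(pair_scores[:window])
        best, best_start = current, 0
        for k in range(1, length - window + 1):
            # Slide the window one position: add the new pair, drop the old one.
            current += pair_scores[k + window - 1] - pair_scores[k - 1]
            if current > best:
                best, best_start = current, k
        centre = best_start + window // 2
    return q_start + centre, d_start + centre

q = "GGGGG" + "ACGTACGTACG" + "TTTT"
d = "CCCCC" + "ACGTACGTACG" + "AAAA"
print(gapped_seed(q, d, 0, 0, len(q), unit_score))
```

The best window covers the 11 matching positions 5 to 15, so the seed is the pair at position 10 in both sequences; the forward and backward DP would start from there.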
17
[Screenshot: NCBI BLAST web form, with the query and database fields labeled]
http://blast.ncbi.nlm.nih.gov/Blast.cgi
18
BLAST Results
19
BLAST programs

Program    Query             Database
BLASTP     Protein           Protein
BLASTN     DNA               DNA
BLASTX     Translated DNA    Protein
TBLASTN    Protein           Translated DNA
TBLASTX    Translated DNA    Translated DNA
20
BLAST comments
Uses word hits to pre-filter low-quality matches
– may miss some good matches
– fast: empirically, 10 to 50 times faster than Smith-Waterman
– gapped BLAST: improves sensitivity
– key parameters: T, HSP score for gap extension, seed for gap extension
Large impact:
– NCBI’s BLAST server handles more than 500,000 queries a day
– most used bioinformatics program in the world
21
Other alignment problems
Whole-genome sequence alignment
– O(n^2) is too high for megabase-length alignments
– hash-and-extend methods are still expensive
– Idea 1: align gene by gene, and concatenate the alignments
– MUMmer: suffix tree, longest increasing subsequence, Smith-Waterman algorithm
  MUM: Maximal Unique Match
  needs an O(n) implementation of the suffix tree
– exploits the closeness of the sequences to be aligned (e.g. strains)
Aligning reads to a reference genome
– Burrows-Wheeler Aligner (BWA)
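The longest-increasing-subsequence step can be illustrated on its own. The sketch assumes the MUMs have already been found and sorted by their position in the first genome, so that `values` holds their positions in the second genome; the LIS then picks a largest set of MUMs that appear in the same order in both. This is a generic O(n log n) LIS routine, not MUMmer's code.

```python
from bisect import bisect_left

def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence of values."""
    tails = []           # tails[k]: index of the smallest tail of a run of length k+1
    tail_values = []     # values[tails[k]], kept sorted for binary search
    prev = [None] * len(values)
    for i, v in enumerate(values):
        k = bisect_left(tail_values, v)
        if k == len(tails):
            tails.append(i)
            tail_values.append(v)
        else:
            tails[k] = i
            tail_values[k] = v
        prev[i] = tails[k - 1] if k > 0 else None
    # Walk the predecessor links back from the end of the longest run.
    out = []
    i = tails[-1] if tails else None
    while i is not None:
        out.append(values[i])
        i = prev[i]
    return out[::-1]

# Hypothetical positions in genome B of MUMs sorted by position in genome A.
print(longest_increasing_subsequence([1, 2, 5, 3, 4, 6]))
```

In the example the MUM at position 5 is out of order relative to its neighbours and is dropped; the gaps between the retained MUMs are what a Smith-Waterman style alignment would then fill in.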
22
Suffix Tree for atgtgtgtc$
[Figure: suffix tree for atgtgtgtc$, with edges labeled by substrings (such as gt, c$, gtc$) and leaves numbered 1 to 10]
Slide credit: Adam M Phillippy
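The ten suffixes that label the leaves of this tree can be listed directly. The sketch below is not a suffix tree: it simply sorts the suffixes (a naive suffix array, quadratic in the worst case), which is enough to see which suffixes share the prefixes gt and t that become internal edges. Positions are 1-based, on the assumption that this matches the leaf numbering in the figure.

```python
def sorted_suffixes(text):
    """Return (1-based start position, suffix) pairs in lexicographic order."""
    suffixes = [(text[i:], i + 1) for i in range(len(text))]
    return [(pos, suf) for suf, pos in sorted(suffixes)]

for pos, suf in sorted_suffixes("atgtgtgtc$"):
    print(pos, suf)
```

The terminator $ sorts before every letter, so the suffix consisting of $ alone comes first; suffixes 7, 5, and 3 all start with gt, and suffixes 8, 6, 4, and 2 all start with t.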