Statistical preliminaries
P_i: background probability with which amino acid i occurs at random at any position.
E: expected number of distinct HSPs with normalized score at least S.
s_ij: score for aligning the pair of letters (i, j).
q_ij: target frequency of the aligned pair of letters (i, j) within HSPs (high-scoring segment pairs).
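For reference, the quantities above are related by the standard Karlin-Altschul statistics used by BLAST. The symbols λ and K (statistical parameters of the scoring system), the raw score S_raw, and the sequence lengths m and n do not appear on the slide and are introduced here as assumptions, following the usual notation:

```latex
s_{ij} = \frac{1}{\lambda}\,\ln\frac{q_{ij}}{P_i P_j},
\qquad
S = \frac{\lambda S_{\mathrm{raw}} - \ln K}{\ln 2},
\qquad
E = \frac{N}{2^{S}}, \quad N = mn .
```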
BLAST: Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman)
The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.
The maximal segment pair measure
A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from 2 sequences (for DNA: identities +5, mismatches -4).
The MSP score may be computed in time proportional to the product of the two sequence lengths. (How?)
An exact procedure is too time consuming, so BLAST approximates the MSP score heuristically.
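One way to answer the "(How?)" is to note that an ungapped segment pair lies on a single diagonal, so the best run on each diagonal can be found with a running maximum. The sketch below is an illustration under that observation, not BLAST's own code; the function name `msp_score` and its default parameters are assumptions, with the DNA scores taken from the slide.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Best score over all equal-length, ungapped segment pairs of a and b.

    Runs in O(len(a) * len(b)) time: every cell (i, j) is visited once,
    grouped by diagonal d = j - i.
    """
    best = 0
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)      # first row on this diagonal
        j = i + d           # matching column
        running = 0
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            # Restart the segment whenever the running score drops below 0.
            running = max(0, running + s)
            best = max(best, running)
            i += 1
            j += 1
    return best
```

For example, `msp_score("AGATCGAT", "GATCGA")` finds the six consecutive identities and returns 30.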
BLAST
1) Build the hash table for Sequence A.
2) Scan Sequence B for hits.
3) Extend hits.
BLAST Step 1: Build the hash table for Sequence A (3-tuple example).
For DNA sequences: Seq. A = AGATCGAT
AAA
AAC
..
AGA  1
..
ATC  3
..
CGA  5
..
GAT  2, 6
..
TCG  4
..
TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T.
The higher T is, the lower the sensitivity, but the faster the search.
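The two table-building rules above can be sketched as follows. This is a minimal illustration, not BLAST's implementation: the function names are hypothetical, positions are 1-based to match the slide, and `toy_score` is a made-up scoring function standing in for a real substitution matrix.

```python
from collections import defaultdict
from itertools import product

def build_index(seq, w=3):
    """Map every w-tuple of seq to the 1-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - w + 1):
        index[seq[pos:pos + w]].append(pos + 1)
    return dict(index)

def neighborhood_index(seq, alphabet, score, T, w=3):
    """Protein variant: store every word xyz with Score(xyz, word) >= T."""
    index = defaultdict(list)
    for pos in range(len(seq) - w + 1):
        word = seq[pos:pos + w]
        for cand in product(alphabet, repeat=w):
            if sum(score(x, y) for x, y in zip(cand, word)) >= T:
                index["".join(cand)].append(pos + 1)
    return dict(index)

def toy_score(x, y):
    # Hypothetical scores for illustration only (not a real matrix).
    return 4 if x == y else -1

dna_index = build_index("AGATCGAT")
protein_index = neighborhood_index("ELVIS", "ELVIS", toy_score, T=7)
```

With these toy scores and T = 7, a word is stored when it differs from a query word in at most one position; raising T to 12 would keep exact matches only.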
BLAST Step 2: Scan sequence B for hits.
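The scan can be sketched as a single pass over the words of sequence B, looking each one up in the table built from sequence A. The name `find_hits` and the 0-based coordinates are assumptions made for this example.

```python
from collections import defaultdict

def find_hits(seq_a, seq_b, w=3):
    """Return (i, j) pairs where seq_a[i:i+w] == seq_b[j:j+w] (0-based)."""
    # Step 1: hash table for sequence A.
    index = defaultdict(list)
    for i in range(len(seq_a) - w + 1):
        index[seq_a[i:i + w]].append(i)
    # Step 2: scan sequence B left to right and report every lookup match.
    hits = []
    for j in range(len(seq_b) - w + 1):
        for i in index.get(seq_b[j:j + w], ()):
            hits.append((i, j))
    return hits
```

Each hit lies on diagonal j - i, which is the quantity the later extension steps work with.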
BLAST Step 3: Extend hits.
Terminate if the score of the extension fades away (that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions).
BLAST 2.0 reduces the time spent in extension, and also considers gapped alignments.
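The termination rule above can be sketched as an ungapped extension in both directions that stops once the running score falls more than a fixed amount below the best score seen so far. The function name, the `x_drop` parameter name and its default value are assumptions for illustration.

```python
def extend_hit(a, b, i, j, w=3, match=5, mismatch=-4, x_drop=10):
    """Extend the word hit a[i:i+w] / b[j:j+w] without gaps.

    Returns (score, start_in_a, start_in_b, length) of the best segment pair.
    """
    def pair(x, y):
        return match if x == y else mismatch

    def extend(pa, pb, step):
        best = run = 0
        best_len = steps = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            run += pair(a[pa], b[pb])
            steps += 1
            if run > best:
                best, best_len = run, steps
            elif best - run > x_drop:
                break  # the score has faded away: stop extending
            pa += step
            pb += step
        return best, best_len

    seed = sum(pair(a[i + k], b[j + k]) for k in range(w))
    right, right_len = extend(i + w, j + w, +1)
    left, left_len = extend(i - 1, j - 1, -1)
    return seed + left + right, i - left_len, j - left_len, w + left_len + right_len
```

For instance, `extend_hit("AGATCGAT", "TTGATC", 1, 2)` grows the GAT hit by one matching C on the right and returns `(20, 1, 2, 4)`.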
Two-Hit Method
In BLAST 1.0, the extension step accounts for 90% of the total time.
Observations: an HSP of interest is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal within a short distance of one another.
Invoke an extension only when two non-overlapping hits are found within distance A of each other on the same diagonal.
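Demonstration
Recent[i]: the most recent hit found on the i-th diagonal (always increasing).
New hit more than distance A from Recent[i]: no extension.
New hit that overlaps Recent[i]: ignored.
New hit within distance A of Recent[i]: extend!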
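The bookkeeping with Recent[i] can be sketched as a dictionary keyed by diagonal. This is a simplified illustration: the function name is hypothetical, hits are assumed to arrive in scan order of sequence B, and (unlike a full implementation) the extension itself is not performed, only the trigger decision.

```python
def two_hit_triggers(hits, w=3, A=40):
    """Return the hits (i, j) that would trigger an extension.

    hits must be ordered by increasing j, so positions on each diagonal
    only ever increase.
    """
    recent = {}      # diagonal (j - i) -> j of the most recent hit
    triggers = []
    for i, j in hits:
        d = j - i
        last = recent.get(d)
        if last is not None:
            dist = j - last
            if dist < w:
                continue            # overlaps the most recent hit: ignored
            if dist < A:
                triggers.append((i, j))   # second hit within distance A
        recent[d] = j               # too far away or first hit: just record
    return triggers
```

With `w=3` and `A=40`, the hits `[(0, 0), (1, 1), (5, 5), (60, 60), (2, 10)]` produce a single trigger at `(5, 5)`: `(1, 1)` overlaps the first hit, `(60, 60)` is too far from `(5, 5)`, and `(2, 10)` is the first hit on its diagonal.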
Discussion
T must be lowered. This yields more single hits, but the majority are dismissed.
Speed: twice as rapid as the one-hit method.
Sensitivity: almost the same.
Gapped BLAST
Original BLAST: finds several distinct HSPs; all HSPs related to one alignment should be found.
Now: find only one HSP, the seed, then use the two-hit method. T can be raised, which makes the search faster.
Find all HSPs vs. find one HSP for one optimal alignment.
For example, suppose the result should be found with probability > 0.95, and let p be the probability of missing an HSP.
Original, with 2 HSPs: (1-p)(1-p) > 0.95, so p < 0.025.
Now: p^2 < 0.05, so p < 0.22.
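The two bounds in the example can be checked numerically. The function names below are made up for this sketch, and independence of the HSP misses is assumed, as in the slide's own arithmetic.

```python
def max_miss_prob_all_hsps(target, n_hsps=2):
    """All n HSPs must be found: (1 - p)^n > target."""
    return 1 - target ** (1 / n_hsps)

def max_miss_prob_any_hsp(target, n_hsps=2):
    """One HSP suffices, so failure needs all n missed: 1 - p^n > target."""
    return (1 - target) ** (1 / n_hsps)

original_bound = max_miss_prob_all_hsps(0.95)   # about 0.025
gapped_bound = max_miss_prob_any_hsp(0.95)      # about 0.22
```

The tolerable miss probability per HSP grows by roughly a factor of nine, which is what allows T to be raised.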
Gapped BLAST (contd)
A gapped extension takes much longer to execute than an ungapped extension, but by performing very few of them the fraction of the total time can be kept low.
Trigger a gapped extension for any HSP exceeding score S_g.
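A gapped extension from a seed pair can be sketched with plain dynamic programming in both directions. This is a deliberately simplified illustration, not the Gapped BLAST procedure itself: it uses a linear gap penalty, explores the full DP table instead of pruning cells whose score drops too far, and all names and default scores are assumptions.

```python
def best_extension(a, b, match=5, mismatch=-4, gap=-6):
    """Best score of a gapped alignment of some prefix of a with some prefix of b."""
    prev = [k * gap for k in range(len(b) + 1)]
    best = 0  # the empty extension scores 0
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return best

def gapped_extend(a, b, i, j, match=5, mismatch=-4, gap=-6):
    """Score of the best gapped alignment passing through the seed pair (i, j)."""
    seed = match if a[i] == b[j] else mismatch
    forward = best_extension(a[i + 1:], b[j + 1:], match, mismatch, gap)
    backward = best_extension(a[:i][::-1], b[:j][::-1], match, mismatch, gap)
    return seed + forward + backward
```

For `gapped_extend("ACGTACGT", "ACGTCGT", 0, 0)` the result is 29: seven identities and one gap, where an ungapped extension from the same seed would stop at 20.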
Example
Original BLAST locates only the first and the last ungapped alignments; E-value > 50 times.
PSI-BLAST
Position-specific score matrices vs. substitution matrices: a position-specific score matrix is used in the ordinary way.
Iterated search using position-specific score matrices:
For a BLAST run, the matrix is constructed automatically from the output.
Use this matrix in place of the query for the next run.
For proteins with |query| = L, the position-specific matrix is L * 20.
Benefit: better at detecting weak relationships.
Construct position-specific matrix
1. Construct multiple alignment M from the output.
2. For every column C of M:
  1) Find the reduced alignment M_c of column C.
  2) Calculate the scores in column C of the position-specific matrix.
Construct multiple alignment M
Collect the sequence segments output with E-value below a threshold. (Why?)
Identical sequences are dropped.
Pairwise alignment columns in which the query has an inserted gap are ignored.
The multiple alignment M has the same length (number of columns) as the query.
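The construction can be sketched by projecting each pairwise alignment onto the query coordinates. The data layout is an assumption made for this example: each hit is a tuple (0-based query start, aligned query string, aligned subject string), a space marks a column where a sequence is not represented, and the E-value filter is taken to have been applied already.

```python
def project_onto_query(query_len, q_start, aligned_q, aligned_s):
    """Turn one pairwise alignment into a row with exactly query_len columns."""
    row = [" "] * query_len          # " " = sequence not represented here
    pos = q_start
    for qc, sc in zip(aligned_q, aligned_s):
        if qc == "-":
            continue                 # query has an inserted gap: column ignored
        row[pos] = sc                # residue, or "-" for a gap in the subject
        pos += 1
    return "".join(row)

def build_alignment(query, hits):
    """Stack the projected rows under the query, dropping identical rows."""
    rows = [query]
    for q_start, aligned_q, aligned_s in hits:
        row = project_onto_query(len(query), q_start, aligned_q, aligned_s)
        if row not in rows:
            rows.append(row)
    return rows

M = build_alignment("ELVIS", [(0, "EL-VI", "ELAVI"),
                              (1, "LVIS", "L-IS"),
                              (1, "LVIS", "L-IS")])
```

Every row of `M` has the same number of columns as the query, and the duplicated third hit is dropped.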
Calculate position-specific matrix score
The scores for a given alignment column should depend not only on the residues appearing in that column, but on those in other columns as well.
Find reduced M_c of column C
R: the sequences that contribute a residue in column C.
M_c: those columns of M in which all the sequences of R are represented.
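The two definitions translate directly into a pair of filters. In this sketch the alignment is a list of equal-length strings, a space marks a column where a sequence is not represented, and a gap character "-" is treated as represented; these conventions and the function name are assumptions for illustration.

```python
def reduced_alignment(M, c):
    """Return (R, columns, M_c) for column c of the multiple alignment M."""
    # R: sequences that contribute to column c.
    R = [row for row in M if row[c] != " "]
    # Keep only the columns in which every sequence of R is represented.
    columns = [k for k in range(len(M[0]))
               if all(row[k] != " " for row in R)]
    M_c = ["".join(row[k] for k in columns) for row in R]
    return R, columns, M_c

M = ["ELVIS", "ELVI ", " L-IS"]
R, columns, M_c = reduced_alignment(M, 4)
```

For column 4 of this toy alignment, the second row drops out of R and the first column is excluded, leaving `["LVIS", "L-IS"]`.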
Calculate scores in column C of the position-specific matrix
The scores are related to the observed residue frequencies f_i and to the number of independent residues in column C (N_c).
Score: log(Q_i / P_i).
Q_i: estimated probability for residue i to be found in column C.
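A minimal sketch of the score log(Q_i / P_i) follows. The way Q_i is estimated here is a simplification chosen for illustration, not the estimator the slide refers to: observed frequencies f_i (unweighted counts) are blended with the background P_i, with weight N_c - 1 on the data and a hypothetical pseudocount weight `beta`. The three-letter alphabet and its background frequencies are toy values.

```python
import math
from collections import Counter

def column_scores(column, background, n_c, beta=10.0):
    """Scores log(Q_i / P_i) for one alignment column.

    column: residues observed in the column (must contain at least one
    residue from the background alphabet); n_c: number of independent
    observations; beta: assumed pseudocount weight.
    """
    residues = [r for r in column if r in background]
    counts = Counter(residues)
    alpha = n_c - 1
    scores = {}
    for residue, p in background.items():
        f = counts[residue] / len(residues)          # observed frequency f_i
        q = (alpha * f + beta * p) / (alpha + beta)  # simplified estimate of Q_i
        scores[residue] = math.log(q / p)
    return scores

scores = column_scores("CCCA", {"A": 0.5, "C": 0.25, "G": 0.25}, n_c=3)
```

Residues seen more often than the background predicts (here C) get positive scores, and the others get negative scores; with a small N_c the estimate stays close to the background.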
BLAST applied to position-specific matrices
Scale with s_ij.