Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004
Preview How to speed up pairwise alignment How to statistically evaluate alignments and database matches How to use BLAST What database should you search?
Need for fast pairwise alignment Even O(mn) is too slow or memory intensive for some applications –Shotgun assembly (Phrap) –Database search –Alignment of long sequences
History FASTA (1983) –First practical tool for fast pairwise alignment –Restricts alignment to path-graph “hot-spots” BLAST (1990) (and WU-BLAST) –Basic local alignment search tool –Refinement of ideas for locating “hot-spots” –Searching dB for “hot-spots” in sublinear time –Probability of alignment scores Gapped BLAST and FASTA3 (1998, 2000) –The two programs have largely converged –Can now both produce gapped alignments with accurate statistics
Why is BLAST so fast? Exclusion of database sequences unlikely to contain good alignments Preprocessing of database to enable fast matching algorithms Filling out only a small part of the path graph
QuerySubjectFlavor Nucleotide BLASTN Protein BLASTP Nucleotide (translated) ProteinBLASTX ProteinNucleotide (translated) TBLASTN Nucleotide (translated) Nucleotide (translated) TBLASTX
How BLAST works Segment pair - a pair of equal length substrings in query and dB Locally maximal segment pair (LMSP) - ungapped score would decrease by extending or shortening either end High scoring pair (HSP) - the LMSP of highest score between the query and a single DB sequence BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold
How BLAST works Query broken into short overlapping words (default 11nt or 3aa) Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query LRQ LRE LRO LRT LRG LRT LRQNMLAC RQN
How BLAST works Preprocessed database searched for exact matches to all high scoring words –Generation of words and search for matches are both linear in length of query A memory intensive data structure allows a fast algorithm –We will see how this can be achieved with suffix trees later
How BLAST works Segment pairs on the same diagonal and within A cells of each other are extended in both directions –Extension proceeds until the dropoff from the highest score is too great –If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP –Extension is the time consuming step of BLAST Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step HSPs are joined by a gapped alignment
Alignment probabilities Each HSP has an associated score ( s ij ) We know the scores for every possible amino acid pair i, j We can compute the frequency of each i, j in a random alignment Why not simply compute the probability of obtaining a score at least as good as S? ELVIS ||| | ELVYS
BLAST statistics Scores are inflated over a naive probability model for two reasons –We have optimized the alignment –We are taking the MSP from that alignment and not a random pair of aligned segments –We are only reporting the highest scoring alignments after searching a large database The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution
opt E() < :== : one = represents 122 library sequences := :* :* :* :== * :====== * :=============== * :===========================*=== :======================================*======= :===============================================*========== :====================================================*======= :=====================================================*== :==================================================* :========================================= * :=================================== * :============================ * :========================= * :===================== * :===================* :===============* :============* :=========* :=======*= :======* :====*= :===*= :==*= :==* :=*= :=* :=* :* :* inset = represents 2 library sequences :* :* :======================*= :* :=================*========== :* :=============*== :* :==========*=== :* :=======*=== :* :===== * :* :====* :* :===*= :* :==* :* :==* :* :=* :* :=* :* :* :* :* > :* :*=================================
Bit scores, E-values Raw score cannot be compared among searches Bit score is a function of raw score, scoring matrix and amino acid frequencies in database. –Can be compared among searches –Requires parameters and K Expect value (E) depends on length of query (m) and subject (n) –Cannot be compared among searches of different dBs
Caution Preceding theory does not hold for global alignments
Where do and K come from? q i is the frequency of residue i p ij is the frequency of aligned residues i and j with score s ij is the unique solution to K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency
How to interpret E E is the expected # of matches S’ ≥ S’ obs –Given the lengths of the two sequences and assuming both are random –Will be approximately Poisson under null hypothesis E can be much smaller or much greater than 1 –When E << 1, it is approximately equal to the probability that any match would have S’ ≥ S’ obs –There may be many matches with E << 1 if there are multiple homologs in the dB
Additional considerations Edge effects –Length of expected random alignment must be substracted from length of sequence and query Multiple comparisons –When searching D such sequences for a given E threshold, the expected frequency of matches is E’ ~ DE –BLAST actually reports E’
Gapped versus ungapped alignments For gapped alignments – and K can be theoretically computed For ungapped alignments – and K must be simulated for each matrix and dB E values are approximate for gapped alignment –Assumes gaps are expensive/rare BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself
Calculating critical scores Two sequences of length 100 What is S’ crit for E ≤ 0.05? Answer: 13.3 bits
Masking low-complexity regions Alignment statistics require that symbols occur randomly in strings –Long substrings of one or a few symbols violate this assumption –Sequences are preprocessed to identify such low- complexity regions (LCR) LCR are masked –Do not contribute to alignment or score –Appear as X’s in BLAST output Tools include DUST (for DNA) and SEG (protein) Other types of repeats may also be masked by BLAST
Complexity where L=window length f i is frequency of symbol i, i=1..n Example: GGGG L=4, f G =4, f A = f C = f T =0, 4!=4*3*2*1=24, 0!=1 Complexity=(1/4) * log 4 (24/4*1*1*1) = 0
Which matches should you care about? Beware of redundancy Search the right set of sequences and no more Choosing a significance threshold requires judgement
Are you searching the right dB? Protein –nr (Non-redundant, sort of) –Swissprot (Annotated) –Pat (Patented) –PDB (3D structures in Protein Data Bank) –Yeast, E. coli, Drosophila, Human, etc. –Month Nucleotide –nt (nucleotide version of nr) –EST (Expressed sequence tags) –GSS (Genome survey sequence - low pass) –HTGS (High Throughput Genomic Sequence) –Chromosome (completed genomes, chromosomes) –etc.
BLAST parameters -W wordsize [Integer] default = 11 nucleotides, 3 proteins -G Cost to open gap [Integer] default = 5 nucleotides, 11 proteins -E Cost to extend gap [Integer] default = 2 nucleotides, 1 proteins Dropoffs for blast extensions Scoring matrix for proteins -q Penalty for nucleotide mismatch [Integer] default = -3 -r reward for nucleotide match [Integer] default = 1 Masking of low-complexity and repeat sequences -e expect value [Real] default = 10
Making your own BLAST dB Start with a MultiFASTA file Formatdb program from NCBI converts MultiFASTA file into BLAST dB –Words are preprocessed –K and are calculated for allowed scoring matrices MultiFASTA file can also be used a query to do many searches against one database All-by-all search - same file as query and dB
Don’t use BLAST blindly When BLAST is not appropriate –Perfect matches (faster methods available) Primer sequences (also too short) Assembly (more specialized tools exist) –Really distant homology (pairwise alignment is not sufficiently sensitive) Pay attention to program choice/parameters when –Aligning cDNA against a genome –Aligning two very long sequences –Aligning sequences with many repeats
Cautionary tale: BRCA1 Initial BLAST search showed match to granins with E = Granins are typically found in endocrine cell secretory vesicles –Is a cancer protein excreted outside of the cell?! Now known to be a spurious match
Summary To attain high speed when searching a dB, BLAST –Sacrifices some sensitivity by only extending HSPs –Preprocesses the database and holds it in memory Meaningful statistics are key to BLAST’s widespread use Conversion of –Raw score to bit score: depends on scoring matrix and amino acid frequencies –Bit score to E value: depends on database size BLAST is easy to use and versatile, which makes it awfully tempting to misuse
Reading assignment Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs