Alignment to a database of sequences Biology 162 Computational Genetics Todd Vision 7 Sep 2004.


Preview
  How to speed up pairwise alignment
  How to statistically evaluate alignments and database matches
  How to use BLAST
  What database should you search?

Need for fast pairwise alignment
  Even O(mn) is too slow or too memory intensive for some applications
    – Shotgun assembly (Phrap)
    – Database search
    – Alignment of long sequences

History
  FASTA (1983)
    – First practical tool for fast pairwise alignment
    – Restricts alignment to path-graph "hot-spots"
  BLAST (1990) (and WU-BLAST)
    – Basic local alignment search tool
    – Refinement of ideas for locating "hot-spots"
    – Searching dB for "hot-spots" in sublinear time
    – Probability of alignment scores
  Gapped BLAST and FASTA3 (1998, 2000)
    – The two programs have largely converged
    – Can now both produce gapped alignments with accurate statistics

Why is BLAST so fast?
  Exclusion of database sequences unlikely to contain good alignments
  Preprocessing of database to enable fast matching algorithms
  Filling out only a small part of the path graph

  Query                     Subject                   Flavor
  Nucleotide                Nucleotide                BLASTN
  Protein                   Protein                   BLASTP
  Nucleotide (translated)   Protein                   BLASTX
  Protein                   Nucleotide (translated)   TBLASTN
  Nucleotide (translated)   Nucleotide (translated)   TBLASTX

How BLAST works
  Segment pair: a pair of equal-length substrings in query and dB
  Locally maximal segment pair (LMSP): ungapped score would decrease by extending or shortening either end
  High scoring pair (HSP): the LMSP of highest score between the query and a single dB sequence
  BLAST heuristically finds a gapped alignment for one or more HSPs that has a score above some threshold

How BLAST works
  Query broken into short overlapping words (default 11 nt or 3 aa)
  Up to 50 high scoring word neighbors (with minimum score T) determined for each word in query
  [Figure: query LRQNMLAC broken into overlapping words LRQ, RQN, ...; the word LRQ is shown with neighbors LRE, LRO, LRT, LRG]
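
The word-generation step can be sketched as follows. This is a minimal illustration, not BLAST's own code: the function name `overlapping_words` is hypothetical, and the neighbor step (which needs a scoring matrix and the threshold T) is deliberately left out.

```python
def overlapping_words(query: str, w: int = 3) -> list[str]:
    """Return every length-w substring of the query, in left-to-right order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Example query taken from the slide; w = 3 is the protein default quoted there.
words = overlapping_words("LRQNMLAC", w=3)
print(words)
```

A query of length m yields m - w + 1 words, which is why word generation is linear in the length of the query.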

How BLAST works
  Preprocessed database searched for exact matches to all high scoring words
    – Generation of words and search for matches are both linear in length of query
  A memory intensive data structure allows a fast algorithm
    – We will see how this can be achieved with suffix trees later
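
One way to picture the preprocessing is a lookup table from each database word to the places it occurs. The sketch below uses a plain hash table as a stand-in (not the suffix tree mentioned above, and not BLAST's actual data structure); `build_word_index`, `find_hits` and the toy database are hypothetical names and data. Only the exact query words are looked up here, whereas BLAST would also look up their high-scoring neighbors.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], w: int) -> dict[str, list[tuple[str, int]]]:
    """Map every length-w word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_hits(query: str, index, w: int) -> list[tuple[str, int, int, int]]:
    """Return (name, query_pos, db_pos, diagonal) for every exact word match."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for name, dpos in index.get(query[qpos:qpos + w], []):
            hits.append((name, qpos, dpos, dpos - qpos))
    return hits


# Toy database, built once; each query is then answered by dictionary lookups.
db = {"seq1": "AAMLRQNMLAC", "seq2": "GGGGGGGG"}
index = build_word_index(db, w=3)
hits = find_hits("LRQNMLAC", index, w=3)
print(hits)
```

The diagonal (database position minus query position) is recorded because hits that share a diagonal are the ones that can belong to the same ungapped segment pair.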

How BLAST works
  Segment pairs on the same diagonal and within A cells of each other are extended in both directions
    – Extension proceeds until the dropoff from the highest score is too great
    – If the score is sufficiently high, and no other overlapping LMSP is higher, the highest scoring extension is considered to be an HSP
    – Extension is the time consuming step of BLAST
  Requiring two close segment pairs on the same diagonal means that most dB sequences can be discarded before the extension step
  HSPs are joined by a gapped alignment
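
The dropoff rule for extension can be sketched as below. This is an illustrative, ungapped, rightward-only version under assumed nucleotide scores of +1 for a match and -3 for a mismatch; the function name `extend_right` and the parameter `x_drop` are hypothetical, and leftward extension would mirror it.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 x_drop: int, match: int = 1, mismatch: int = -3) -> tuple[int, int]:
    """Extend an ungapped alignment to the right along one diagonal.

    Stops once the running score has fallen more than x_drop below the best
    score seen so far. Returns (best score, length of the best extension).
    """
    best = score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # dropoff from the highest score is too great
    return best, best_len


# Eight matching positions followed by mismatches.
result = extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, x_drop=5)
print(result)
```

Because the best score and its length are remembered, the positions scanned after the peak are discarded when the extension stops.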

Alignment probabilities
  Each HSP has an associated score (S = Σ s_ij)
  We know the scores for every possible amino acid pair i, j
  We can compute the frequency of each i, j in a random alignment
  Why not simply compute the probability of obtaining a score at least as good as S?

    ELVIS
    ||| |
    ELVYS

BLAST statistics
  Scores are inflated over a naive probability model for three reasons
    – We have optimized the alignment
    – We are taking the MSP from that alignment and not a random pair of aligned segments
    – We are only reporting the highest scoring alignments after searching a large database
  The expected number of matches that meet or surpass our observed score is given by the extreme value (or Gumbel) distribution

[Figure: FASTA score histogram comparing observed counts of library sequences (=) with the counts expected under the extreme value distribution (*) for each optimal score bin. In the main plot one = represents 122 library sequences; in the inset one = represents 2 library sequences.]

Bit scores, E-values
  Raw score cannot be compared among searches
  Bit score is a function of raw score, scoring matrix and amino acid frequencies in database
    – Can be compared among searches
    – Requires parameters λ and K
  Expect value (E) depends on length of query (m) and subject (n)
    – Cannot be compared among searches of different dBs
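
The two conversions can be sketched as follows, assuming the standard Karlin-Altschul relations S' = (λS - ln K) / ln 2 and E = m n 2^(-S'). The function names are hypothetical, and the values λ = 0.318 and K = 0.134 are illustrative assumptions only; real values depend on the scoring matrix, gap costs and residue frequencies.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance matches scoring at least `bits`: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative (assumed) parameters, not values taken from the slides.
LAM, K = 0.318, 0.134
bits = bit_score(50, LAM, K)
e_value = expect_value(bits, m=100, n=1000)
print(round(bits, 1), e_value)
```

Once a score is in bits, the matrix-specific parameters drop out of the E formula, which is why bit scores can be compared among searches while raw scores cannot.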

Caution Preceding theory does not hold for global alignments

Where do λ and K come from?
  q_i is the frequency of residue i
  p_ij is the frequency of aligned residues i and j with score s_ij
  λ is the unique positive solution to
    $\sum_{i,j} q_i q_j e^{\lambda s_{ij}} = 1$
  K is related to the random walk of locally maximal scores for a given scoring matrix and amino acid frequency
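
As an illustration of how λ can be found numerically, the sketch below solves the standard Karlin-Altschul condition, sum over i, j of q_i q_j exp(λ s_ij) = 1, by bisection. The function name `solve_lambda`, the uniform nucleotide frequencies and the +1/-3 scoring are assumptions chosen for the example; the method requires a negative expected score and at least one positive score.

```python
import math


def solve_lambda(freqs: dict[str, float], score, tol: float = 1e-10) -> float:
    """Find the positive root of sum_ij q_i q_j exp(lam * s_ij) - 1 by bisection."""
    symbols = list(freqs)

    def f(lam: float) -> float:
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                   for a in symbols for b in symbols) - 1.0

    lo, hi = 1e-6, 1.0      # f(lo) < 0 when the expected score is negative
    while f(hi) <= 0:       # widen the bracket until f changes sign
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Assumed toy model: uniform base frequencies, match = +1, mismatch = -3.
uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, lambda a, b: 1 if a == b else -3)
print(round(lam, 3))
```

Computing K takes considerably more work and is not attempted here.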

How to interpret E
  E is the expected # of matches S' ≥ S'_obs
    – Given the lengths of the two sequences and assuming both are random
    – Will be approximately Poisson under null hypothesis
  E can be much smaller or much greater than 1
    – When E << 1, it is approximately equal to the probability that any match would have S' ≥ S'_obs
    – There may be many matches with E << 1 if there are multiple homologs in the dB
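
Under the Poisson approximation, the probability of seeing at least one chance match with S' ≥ S'_obs is P = 1 - exp(-E). A minimal sketch (the function name is hypothetical) shows why P is close to E only when E is small:

```python
import math


def p_at_least_one(e_value: float) -> float:
    """Poisson probability of one or more chance matches given expectation E."""
    return -math.expm1(-e_value)  # same as 1 - exp(-E), but accurate for tiny E


for e in (0.001, 0.01, 1.0, 10.0):
    print(e, p_at_least_one(e))
```

For E of 1 or more the two quantities diverge: P can never exceed 1, while E has no upper bound.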

Additional considerations
  Edge effects
    – Length of expected random alignment must be subtracted from length of sequence and query
  Multiple comparisons
    – When searching D such sequences for a given E threshold, the expected frequency of matches is E' ~ DE
    – BLAST actually reports E'

Gapped versus ungapped alignments
  For ungapped alignments
    – λ and K can be theoretically computed
  For gapped alignments
    – λ and K must be simulated for each matrix and dB
  E values are approximate for gapped alignment
    – Assumes gaps are expensive/rare
  BLAST shows E value for the sum of HSP scores, rather than that of the gapped alignment itself

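Calculating critical scores
  Two sequences of length 100
  What is S'_crit for E ≤ 0.05?
  Answer: from E = mn 2^(-S'), S'_crit = log2(mn / E) = log2(10,000 / 0.05) ≈ 17.6 bits
    – For comparison, log2(mn) ≈ 13.3 bits is the score at which E = 1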
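
The critical score can be checked with a few lines, assuming the relation E = m n 2^(-S') and ignoring edge effects. The function name is hypothetical. Under that assumption, two sequences of length 100 need about 17.6 bits for E ≤ 0.05, while 13.3 bits, log2(mn), corresponds to E = 1.

```python
import math


def critical_bit_score(m: int, n: int, e_threshold: float) -> float:
    """Smallest bit score S' with m * n * 2**(-S') <= e_threshold."""
    return math.log2(m * n / e_threshold)


print(round(critical_bit_score(100, 100, 0.05), 1))
print(round(critical_bit_score(100, 100, 1.0), 1))
```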

Masking low-complexity regions
  Alignment statistics require that symbols occur randomly in strings
    – Long substrings of one or a few symbols violate this assumption
    – Sequences are preprocessed to identify such low-complexity regions (LCR)
  LCR are masked
    – Do not contribute to alignment or score
    – Appear as X's in BLAST output
  Tools include DUST (for DNA) and SEG (protein)
  Other types of repeats may also be masked by BLAST

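Complexity
  $\text{Complexity} = \frac{1}{L} \log_n \left( \frac{L!}{\prod_{i=1}^{n} f_i!} \right)$
  where L = window length, f_i is the frequency (count) of symbol i in the window, i = 1..n
  Example: GGGG
    L = 4, f_G = 4, f_A = f_C = f_T = 0, 4! = 4*3*2*1 = 24, 0! = 1
    Complexity = (1/4) * log_4(24 / (24*1*1*1)) = 0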
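
The worked example can be reproduced with a short function. This is a sketch assuming the complexity measure (1/L) log_n(L! / product of f_i!), with n the alphabet size and f_i the count of symbol i in the window; the function name is hypothetical and this is not the code used by DUST or SEG.

```python
import math
from collections import Counter


def complexity(window: str, alphabet_size: int = 4) -> float:
    """(1/L) * log_n(L! / prod(f_i!)) for one window of a sequence."""
    length = len(window)
    arrangements = math.factorial(length)
    for count in Counter(window).values():
        arrangements //= math.factorial(count)  # exact: multinomial coefficient
    return math.log(arrangements, alphabet_size) / length


for w in ("GGGG", "AAGG", "ACGT"):
    print(w, round(complexity(w), 3))
```

A window made of a single repeated symbol scores zero, and the score rises as the symbol counts become more even, so windows below a chosen cutoff would be the ones flagged as low complexity.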

Which matches should you care about?
  Beware of redundancy
  Search the right set of sequences and no more
  Choosing a significance threshold requires judgement

Are you searching the right dB?
  Protein
    – nr (Non-redundant, sort of)
    – Swissprot (Annotated)
    – Pat (Patented)
    – PDB (3D structures in Protein Data Bank)
    – Yeast, E. coli, Drosophila, Human, etc.
    – Month
  Nucleotide
    – nt (nucleotide version of nr)
    – EST (Expressed sequence tags)
    – GSS (Genome survey sequence - low pass)
    – HTGS (High Throughput Genomic Sequence)
    – Chromosome (completed genomes, chromosomes)
    – etc.

BLAST parameters
  -W  Word size [Integer], default = 11 for nucleotides, 3 for proteins
  -G  Cost to open gap [Integer], default = 5 for nucleotides, 11 for proteins
  -E  Cost to extend gap [Integer], default = 2 for nucleotides, 1 for proteins
  Dropoffs for BLAST extensions
  Scoring matrix for proteins
  -q  Penalty for nucleotide mismatch [Integer], default = -3
  -r  Reward for nucleotide match [Integer], default = 1
  Masking of low-complexity and repeat sequences
  -e  Expect value [Real], default = 10

Making your own BLAST dB
  Start with a MultiFASTA file
  The formatdb program from NCBI converts a MultiFASTA file into a BLAST dB
    – Words are preprocessed
    – K and λ are calculated for allowed scoring matrices
  A MultiFASTA file can also be used as a query to do many searches against one database
  All-by-all search: same file as query and dB

Don't use BLAST blindly
  When BLAST is not appropriate
    – Perfect matches (faster methods available)
        Primer sequences (also too short)
        Assembly (more specialized tools exist)
    – Really distant homology (pairwise alignment is not sufficiently sensitive)
  Pay attention to program choice/parameters when
    – Aligning cDNA against a genome
    – Aligning two very long sequences
    – Aligning sequences with many repeats

Cautionary tale: BRCA1 Initial BLAST search showed match to granins with E = Granins are typically found in endocrine cell secretory vesicles –Is a cancer protein excreted outside of the cell?! Now known to be a spurious match

Summary
  To attain high speed when searching a dB, BLAST
    – Sacrifices some sensitivity by only extending HSPs
    – Preprocesses the database and holds it in memory
  Meaningful statistics are key to BLAST's widespread use
  Conversion of
    – Raw score to bit score: depends on scoring matrix and amino acid frequencies
    – Bit score to E value: depends on database size
  BLAST is easy to use and versatile, which makes it awfully tempting to misuse

Reading assignment Gibson & Muse, Chapter 2: Genome Sequencing and Annotation, pgs