BLAST Sequence alignment, E-value & Extreme value distribution.

Presentation transcript:

BLAST Sequence alignment, E-value & Extreme value distribution

Database searching
Using pairwise alignments to search databases for similar sequences.
[Figure: a query sequence searched against a database]

Database sizes
PDB: 169,581 sequences; 39,601,444 total letters
UniProt: 522,019 sequences; 184,241,293 total letters
Nr: 12,346,870 sequences; 4,221,182,711 total letters

Database searching
– The most common use of pairwise sequence alignments is to search databases for related sequences. For instance: find the probable function of a newly isolated protein by identifying similar proteins with known function.
– Most often, local alignment ("Smith-Waterman") is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known.
– Full Smith-Waterman is often too time-consuming for searching large databases, so heuristic methods are used (FASTA, BLAST).

Database searching: heuristic search algorithms
FASTA (Pearson 1995)
– Uses heuristics to avoid calculating the full dynamic programming matrix
– Speeds up searches by an order of magnitude compared to full Smith-Waterman
– The statistical side of FASTA is still stronger than that of BLAST
BLAST (Altschul 1990, 1997)
– Uses rapid word lookup methods to completely skip most of the database entries
– Extremely fast: one order of magnitude faster than FASTA, two orders of magnitude faster than Smith-Waterman
– Almost as sensitive as FASTA

BLAST flavors
BLASTN
– Nucleotide query sequence, nucleotide database
BLASTP
– Protein query sequence, protein database
BLASTX
– Nucleotide query sequence, protein database
– Compares all six reading frames of the query with the database
TBLASTN
– Protein query sequence, nucleotide database
– "On the fly" six-frame translation of the database
TBLASTX
– Nucleotide query sequence, nucleotide database
– Compares all reading frames of the query with all reading frames of the database
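The "six reading frames" used by BLASTX, TBLASTN and TBLASTX can be illustrated with a small Python sketch. It only cuts a DNA string into its six codon-aligned frames (three on the forward strand, three on the reverse complement) and does not translate them; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
def reverse_complement(dna):
    """Return the reverse complement of an upper- or lower-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna.upper()))


def six_frames(dna):
    """Return the six reading frames, each trimmed to a whole number of codons."""
    frames = []
    for strand in (dna.upper(), reverse_complement(dna)):
        for offset in range(3):
            sub = strand[offset:]
            # Drop the trailing 1 or 2 bases that do not form a full codon.
            frames.append(sub[: len(sub) - len(sub) % 3])
    return frames


print(six_frames("ATGCCCGGT"))
# ['ATGCCCGGT', 'TGCCCG', 'GCCCGG', 'ACCGGGCAT', 'CCGGGC', 'CGGGCA']
```

Each frame would then be translated to protein before comparison, which is why the translated searches cost roughly six times (TBLASTX: thirty-six times) as many comparisons per sequence pair.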

Searching on the web: BLAST at NCBI
– Very fast computer dedicated to running BLAST searches
– Many databases that are always up to date (e.g. NR and Human Genome)
– Nice, simple web interface
– But you still need knowledge about BLAST to use it properly

Searching on the web: BLAST at NCBI

Searching on the web: Best hits

Searching on the web: BLAST at NCBI

Searching on the web: worse hits
Still high sequence coverage among the worst hits: are they OK hits?

When is a database hit significant?
Problem:
– Even unrelated sequences can be aligned (yielding a low score)
– How do we know if a database hit is meaningful?
– When is an alignment score sufficiently high?
Solution:
– Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).
– Compare actual scores to the distribution of random scores.
– Is the real score much higher than you'd expect by chance?

Extreme value distributions
– How can we estimate the probability of an extreme event?
– Can we estimate how likely it is that a pair of sequences aligns by chance?

Random alignment scores follow extreme value distributions
– The exact shape and location of the distribution depend on the exact nature of the database and the query sequence
– Searching a database of unrelated sequences results in scores that follow an extreme value distribution

Significance of a hit: one possible solution
(1) Align the query sequence to all sequences in the database and note the scores
(2) Fit the actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution
(3) Use the fitted extreme value distribution to predict how many random hits to expect for any given score (the "E-value")
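Steps (1) to (3) can be sketched in Python under simplifying assumptions. The sketch below skips the mixture model of step (2) and treats every score as random; it fits a Gumbel (extreme value) distribution by the method of moments, which is a substitute chosen here for brevity and not necessarily what FASTA or BLAST do. The function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def expected_random_hits(score, mu, beta, n_sequences):
    """E-value: expected number of random hits scoring at least `score`."""
    # Gumbel CDF is exp(-exp(-(x - mu) / beta)); the tail is 1 minus that.
    p_tail = 1.0 - math.exp(-math.exp(-(score - mu) / beta))
    return n_sequences * p_tail


# Hypothetical scores from aligning one query against unrelated sequences.
random_scores = [31, 35, 28, 40, 33, 37, 30, 45, 34, 36]
mu, beta = fit_gumbel(random_scores)
print(expected_random_hits(60, mu, beta, n_sequences=10_000))
```

A real search would fit the distribution on thousands of scores, not ten; the toy list only keeps the example self-contained.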

Significance of a hit: example
– Search against a database of 10,000 sequences.
– An extreme value distribution (blue) is fitted to the distribution of all scores.
– It is found that 99.9% of the blue distribution has a score below 112.
– This means that when searching a database of 10,000 sequences you'd expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons.
– 10 is the E-value of a hit with score 112. You want E-values well below 1!
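The arithmetic of this example can be written out in a few lines of Python. The function name is hypothetical; the second call reuses the Nr size from the "Database sizes" slide purely to show how the same score becomes less significant in a larger database, assuming the same fitted distribution applies.

```python
def e_value_from_tail(fraction_below, database_size):
    """Expected number of random hits at or above a score, given the
    fraction of the random-score distribution that lies below it."""
    return (1.0 - fraction_below) * database_size


# The slide's example: 99.9% of random scores are below 112, 10,000 sequences.
print(round(e_value_from_tail(0.999, 10_000), 6))  # 10.0

# Same tail fraction against a database the size of Nr (12,346,870 sequences).
print(round(e_value_from_tail(0.999, 12_346_870), 2))
```

With the larger database the expected number of random hits at score 112 rises to roughly 12,347, so the score threshold for a trustworthy hit has to rise as well.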

Database searching: E-values in BLAST
– BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores
– For this reason BLAST only allows certain combinations of substitution matrices and gap penalties
– This also means that the fit is based on a different data set than the one you are working on
A word of caution: BLAST tends to overestimate the significance of its matches
– E-values from BLAST are fine for identifying sure hits
– One should be careful using BLAST's E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of to ).

Searching on the web: worse hits
Still high sequence coverage among the worst hits: are they OK hits?

BLAST heuristics
Best possible search:
– Do a full pairwise alignment (Smith-Waterman) between the query sequence and all sequences in the database.
– ("ssearch" does this.)
BLAST speeds up the search by at least two orders of magnitude by pre-screening the database sequences and only performing the full dynamic programming on "promising" sequences.
This is done by indexing all database sequences in a so-called suffix tree, which makes it very fast to search for perfectly matching sub-strings.
– A suffix tree is the quickest possible way (so far) to search for the longest matching sub-string between two strings.
When a BLAST search is run, candidate sequences from the database are picked based on perfect matches to small sub-sequences in the query sequence. (BLASTN and BLASTP do this differently; more about this in a moment.)
– Full Smith-Waterman is then performed on these sequences.
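To make the cost of the "best possible search" concrete, here is a minimal Python sketch of the Smith-Waterman recurrence that returns only the best local alignment score. The match and mismatch values are borrowed from the BLASTN slide; the linear gap penalty of -2 is an assumption for illustration, and a protein search would use a substitution matrix such as BLOSUM62 instead.

```python
def smith_waterman_score(a, b, match=1, mismatch=-3, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("GGACGTGG", "CCACGTCC"))    # 4
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))   # 6
```

Every cell of a len(a) x len(b) matrix is visited for every database sequence, which is the work the pre-screening step is meant to avoid.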

Blast search method - I
Query sequence: PQGELV
Make a list of all possible k-mer words (length 3 for proteins):
– PQG (score 15)
– QGE (score 9)
– GEL (score 12)
– ELV (score 10)
Assign scores from Blosum62 and use the words with score > 11:
– PQG & GEL
Mutate words such that the score is still > 11:
– PQG (score 15) is similar to PEG (score 13)
In total we get: PQG, GEL and PEG
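The word-list step above can be sketched in Python. The k-mer extraction is generic; the scores are copied from the slide as plain data and are not recomputed from a substitution matrix, and the names are hypothetical.

```python
def kmer_words(sequence, k=3):
    """Return all overlapping words of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmer_words("PQGELV")

# Word scores as given on the slide (not recomputed here).
slide_scores = {"PQG": 15, "QGE": 9, "GEL": 12, "ELV": 10}
threshold = 11

seeds = [word for word in words if slide_scores[word] > threshold]

print(words)  # ['PQG', 'QGE', 'GEL', 'ELV']
print(seeds)  # ['PQG', 'GEL']
```

The "mutate words" step would add neighbours such as PEG to `seeds` whenever their score against the original word stays above the threshold.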

Blast search method - II
– Make k-mers (word size 3) of all sequences in the database
– Store them in a suffix tree (a fast tree structure to search for identical matches)
– Find all database sequences that have at least 2 matches among our 3 words PQG, GEL & PEG
– Find the database hit and extend the alignment (High-scoring Segment Pair):
Query: M E T P Q G I A V
Database: P Q G E L V
HSP: PQGI (score )
– If 2 HSPs in the query sequence are < 40 positions apart: full dynamic programming alignment on the query and hit sequences
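The lookup step can be sketched in Python as follows. Note the substitutions: a plain dictionary of k-mers stands in for the suffix tree named on the slide, the 40-position window check and the HSP extension are left out, and the toy database and function names are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((seq_id, pos))
    return index


def candidates(index, query_words, min_hits=2):
    """Ids of sequences hit by query words at >= min_hits distinct positions."""
    hits = defaultdict(set)
    for word in query_words:
        for seq_id, pos in index.get(word, []):
            hits[seq_id].add(pos)
    return sorted(s for s, positions in hits.items() if len(positions) >= min_hits)


# Hypothetical toy database.
database = {"db1": "PQGELV", "db2": "METPQGIAV", "db3": "AAAAAA"}
index = build_word_index(database)
print(candidates(index, ["PQG", "GEL", "PEG"]))  # ['db1']
```

Only `db1` contains two of the words (PQG and GEL); `db2` contains PQG once and is skipped, so the expensive alignment would run on a single sequence out of three.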

BLASTP
Alignment matrix:
– PAM and BLOSUM series (default: BLOSUM 62)
Notice: these alignment matrices incorporate knowledge about protein evolution.
Heuristics:
– 2 x "near match" within a window
– Default word length: 3 aa
– Default window length: 40 aa
[Figure: all sequences are filtered down to the subset to align; labels: match => word size, 40 aa window]

BLASTN
Alignment matrix:
– Perfect match: 1
– Mismatch: -3
Notice: all mismatches are equally penalized:
– E.g. A:G == A:C == A:T
– More advanced models for DNA evolution do exist.
Heuristics:
– Perfect match "word" of size 7, 11 (default) or 15
[Figure: all sequences are filtered down to the subset to align; labels: match => word size; potential matches of length < word size are not seen by BLAST]
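The BLASTN scoring scheme is simple enough to write out directly. This Python sketch scores two equal-length, already aligned nucleotide strings without gaps, using the +1 / -3 values from the slide; the function name is hypothetical.

```python
def blastn_ungapped_score(a, b, match=1, mismatch=-3):
    """Score an ungapped alignment of two equal-length nucleotide strings."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


print(blastn_ungapped_score("ACGTACGT", "ACGTACGT"))  # 8
print(blastn_ungapped_score("ACGTACGT", "ACCTACGT"))  # 4

# Every mismatch costs the same, whichever bases are involved.
print(blastn_ungapped_score("A", "G"), blastn_ungapped_score("A", "C"))  # -3 -3
```

A single mismatch cancels three matches, which is why BLASTN alignments are dominated by long runs of identical bases.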

BLAST Exercise