Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,

Slides:

Advertisements

Similar presentations

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.

BLAST Sequence alignment, E-value & Extreme value distribution.

Last lecture summary.

Measuring the degree of similarity: PAM and blosum Matrix

Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.

Introduction to Bioinformatics

Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.

Heuristic alignment algorithms and cost matrices

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

Slide 1 EE3J2 Data Mining Lecture 20 Sequence Analysis 2: BLAST Algorithm Ali Al-Shahib.

1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.

Introduction to bioinformatics

Sequence similarity.

Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.

Similar Sequence Similar Function Charles Yan Spring 2006.

Sequence Alignment III CIS 667 February 10, 2004.

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.

Rationale for searching sequence databases June 22, 2005 Writing Topics due today Writing projects due July 8 Learning objectives- Review of Smith-Waterman.

Sequence alignment, E-value & Extreme value distribution

BLAST Basic Local Alignment Search Tool. BLAST החכה BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence [[רצף.

Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

BLAST: Basic Local Alignment Search Tool Urmila Kulkarni-Kale Bioinformatics Centre University of Pune.

1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.

Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.

Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.

An Introduction to Bioinformatics

Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.

BLAST : Basic local alignment search tool B L A S T !

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.

Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.

Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.

Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.

Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,

Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?

1 P6a Extra Discussion Slides Part 1. 2 Section A.

Comp. Genomics Recitation 3 The statistics of database searching.

Sequence Alignment Csc 487/687 Computing for bioinformatics.

Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.

Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.

BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.

Basic Local Alignment Search Tool BLAST Why Use BLAST?

Database search. Overview ： 1. FastA ： is suitable for protein sequence searching 2. BLAST ： is suitable for DNA, RNA, protein sequence searching.

Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.

Sequence Alignment.

Construction of Substitution matrices

David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.

Step 3: Tools Database Searching

©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Copyright OpenHelix. No use or reproduction without express written consent1.

Summer Bioinformatics Workshop 2008 BLAST Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State University – Rochester Center

BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.

What is sequencing? Video: WlxM (Illumina video) WlxM.

9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.

Blast Basic Local Alignment Search Tool

Basics of BLAST Basic BLAST Search - What is BLAST?

Identifying templates for protein modeling:

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool (BLAST)

Basic Local Alignment Search Tool

BLAST Slides adapted & edited from a set by

Sequence alignment, E-value & Extreme value distribution

BLAST Slides adapted & edited from a set by

Presentation transcript:

Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek, California 2.Department of Plant and Microbial Biology, University of California Berkley, Berkeley, California 3.Department of Integrative Biology, University of South Florida, Tampa, Florida

Kerfeld and Scott, PLoS Biology What is BLAST? Query tool to retrieve homologous genes from a database Sequence Database (Target) BLAST Putative Function

Kerfeld and Scott, PLoS Biology Different Forms of BLAST blastp blastn blastx tblastn Tblastx BLAST2 Biology Workbench ( is a good source of toolshttp://workbench/sdsc.edu/

Kerfeld and Scott, PLoS Biology Starts with a Query Sequence in FASTA Format Amino acid sequence of a protein, in FASTA format: >ribosomal protein L7/L12 [Thiomicrospira crunogena XCL-2] MAITKDDILEAVANMSVMEVVELVEAMEEKFGVSAAAVAVAGPAGDAGAA GEEQTEFDVVLTGAGDNKVAAIKAVRGATGLGLKEAKSAVESAPFTLKEG VSKEEAETLANELKEAGIEVEVK Nucleotide sequence of a gene, in FASTA format: >gi| : Thiomicrospira crunogena XCL-2 ATGGCAATTACAAAAGACGATATTTTAGAAGCAGTTGCTAACATGTCAGTAATGGAAG TTGTTGAACTTGTTGAAGCAATGGAAGAGAAGTTTGGTGTTTCTGCAGCAGCAGTTGC GGTTGCAGGTCCTGCAGGTGATGCTGGCGCTGCTGGTGAAGAACAAACAGAGTTTGAC GTTGTCTTGACTGGTGCTGGTGACAACAAAGTTGCAGCAATCAAAGCCGTTCGTGGCG CAACTGGTCTTGGGCTTAAAGAAGCGAAAAGTGCAGTTGAAAGTGCACCATTTACGCT TAAAGAGGGTGTTTCTAAAGAAGAAGCAGAAACTCTTGCAAATGAGCTTAAAGAAGCA GGTATTGAAGTCGAAGTTAAATAA

Kerfeld and Scott, PLoS Biology Key Aspects of FASTA format > ribosomal proteinL7/L12 MAITKDDILEAVANMSVMEVVELVEA MEEKFGVSAAAVAVAGPAGDAGAA GEEQTEFDVVLTGAGDNKVAAIKAVR GATGLGLKEAKSAVESAPFTLKEG VSKEEAETLANELKEAGIEVEVK ”description line” (not read as sequence data) Begins with > Ends with a hard return Sequence data (amino acid in this case)

Kerfeld and Scott, PLoS Biology NCBI BLAST Interface (blastp: Proteins) (Paste FASTA format sequence here)

Kerfeld and Scott, PLoS Biology NCBI BLAST Results Page: Potential homologs retrieved from database

Kerfeld and Scott, PLoS Biology Mindless BLAST: Believing that E tells the Whole Story We’re done, right?

Kerfeld and Scott, PLoS Biology Avoid Mindless BLAST by Getting to Know BLAST BLAST algorithm is –Based on molecular evolutionary concepts –Reasonably easy to understand –Responsive to user-modified parameters

Kerfeld and Scott, PLoS Biology Overview of BLAST 1.Segment the query sequence 2.Use the query sequence segments to scan the database 3.Extend the segments to verify the matching sequences (hits) from the database 4.Create a list of hits, with best matches first 5.Summarized in Fig. 1

Kerfeld and Scott, PLoS Biology BLAST Phase 1: Segmenting the query sequence; scoring potential word matches--compile BLAST: –Segments the query sequence into pieces (“words”) Default word length: 3 amino acids or 11 nucleic acids –Creates a list of synonyms and their scores for comparing query words to target words Uses scoring matrix to calculate scores for synonyms that might be found in the database –Saves the scores (and synonyms) exceeding a given threshold T

Kerfeld and Scott, PLoS Biology BLAST Phase 2: Scanning the database BLAST –Scans the database for matches to the word list with acceptable T values –Requires two matches (“hits”) within the target sequence –Sets aside sequences with matches above T for further analysis …………..SWITEASFSPPGIM….. SWIPGI Possible match from the database Words

Kerfeld and Scott, PLoS Biology BLAST Phase 3: Extending the hits BLAST –Searches 5’ and 3’ of the word hit on both the query and target sequence –Adds up the score for sequence identity or similarity until value exceeds S –Alignment is dropped from subsequent analyses if value never exceeds S

Kerfeld and Scott, PLoS Biology The Importance of Scoring Matrices in the BLAST Algorithm Segments query sequence into “words” and scores potential word matches Scans this list for alignments that meet a threshold score T –uses a scoring matrix to calculate this Uses this list of ‘synonyms’ to scan the database Extends the alignments to see if they meet a cutoff score S –uses a scoring matrix to calculate this Reports the alignments that exceed S

Kerfeld and Scott, PLoS Biology A Substitution Matrix (BLOSUM 62 partial) Identity and Similarity… RGIKFSTWV R G I K F 6 1 S T 5 0 W 11-3 V 4

Kerfeld and Scott, PLoS Biology Biochemistry Digression Biological basis for scores? How do proteins evolve? How do we know? Correlated changes (Structure sometimes reveals common ancestry that is no longer apparent in the primary structure)—See, for example,

Kerfeld and Scott, PLoS Biology How Scoring Matrices Were Built Scores in the matrix are based primarily on the frequency with which a given residue in the query sequence aligns with another residue in a homologous sequence in the database. Because these frequencies generally cannot be known a priori, they must be based on empirical evidence. Choice of which related sequences to use as empirical data for determination of frequencies differentiates each scoring matrix and its benefits.

Kerfeld and Scott, PLoS Biology PAM and BLOSUM Matrices Both are empirically based: –Rely on similarity scores derived by aligning amino acid sequences from proteins known to be homologous PAM (1978): –Similarity scores were based on closely related proteins and extrapolated out for more distantly related ones BLOSUM (1992): –Similarity scores were based on distantly related proteins

Kerfeld and Scott, PLoS Biology Selecting Scoring Matrices Choose a matrix appropriate to the suspected degree of sequence identity between the query and its target sequences PAM: empirically derived for close relatives BLOSUM: empirically derived for distant relatives

Kerfeld and Scott, PLoS Biology Raw Scores (S values) from an Alignment S = (  M ij ) – cO – dG, where M = score from a similarity matrix for a particular pair of amino acids (ij) c = number of gaps O = penalty for the existence of a gap d = total length of gaps G = per-residue penalty for extending the gap

Kerfeld and Scott, PLoS Biology Limitations of Raw Scores S values depend on the substitution matrix, gap penalties Impossible to compare S values from hits retrieved from BLAST searches when different matrices and gap penalties are used

Kerfeld and Scott, PLoS Biology Going from Raw Scores to Bit Scores S’ = [ S-ln(K)]/ln(2) where S’ = bit score and K = normalizing parameters of the specific matrices and search spaces –Larger raw scores result in larger bit scores –Allows user to compare scores obtained by using different matrices and search spaces

Kerfeld and Scott, PLoS Biology Limitations of Bit Scores How high does a bit score have to be to suggest common ancestry? –Hard to evaluate hits as homologs or not, based solely on bit scores

Kerfeld and Scott, PLoS Biology E-value Number of distinct alignments with scores greater than or equal to a given value expected to occur in a search against a database of known size, based solely on chance, not homology. –Large E-values suggest that the query sequence and retrieved sequence similarities are due to chance –Small E-values suggest that the sequence similarities are due to shared ancestry (or potentially convergent evolution)

Kerfeld and Scott, PLoS Biology Calculating E-values E = (n × m) / 2 S’ where m = effective length of the query sequence = length of query sequence – average length of alignments (Controls for fewer alignments occurring at the ends of the query sequence) n = effective length of the database sequence (total number of bases) The value of E decreases exponentially with increasing S

Kerfeld and Scott, PLoS Biology BLAST as an Experiment: Parameters to manipulate in a BLAST search Expect Word size Matrix Gap costs Filter Mask

Kerfeld and Scott, PLoS Biology E value Threshold Alignments will be reported with E-values less than or equal to the expect values threshold –Setting a larger E threshold will result in more reported hits –Setting a smaller E threshold will result in fewer reported hits

Kerfeld and Scott, PLoS Biology Filter: Low complexity –Replaces the following with N (nucleotides) or X (amino acids) Dinucleotide repeats Amino acid repeats Leader sequences Stretches of hydrophobic residues Mask: Lower case –Replaces lowercase letters in sequence with N or X Lowercase letters typically indicate base or amino acid not known with certainty Filter and Mask

Kerfeld and Scott, PLoS Biology Parameter Summary is Found at the Bottom of the Output…..

Kerfeld and Scott, PLoS Biology Evaluating BLAST Results

Kerfeld and Scott, PLoS Biology Examine the BLAST Alignment Does it cover the whole length of both the query and subject sequences?

Kerfeld and Scott, PLoS Biology High E-value: Discovery of a Distant Homolog or Garbage? Take another look at the target (subject) sequence(s) that have high E-values –Similar length? –Recurring motifs? –Similar biological functions? Use target sequences as query sequences for another BLAST search –Does the original query sequence come up in report?