© Wiley Publishing. 2007. All Rights Reserved. Searching Sequence Databases.



Learning Objectives: Find out why similarity searches are so important. Understand the relationship between homology, similarity, and identity. Be able to run BLAST and interpret its output. Understand the concept of E-values. Know how to ask biological questions with BLAST.

Outline: Biological meaning of sequence similarity. Homology, identity, and similarity. Running BLAST. Interpreting BLAST output. Making a biological analysis with BLAST. Running PSI-BLAST, the iterated BLAST variant.

Sequence Similarity: As a rule of thumb, two protein sequences with more than 25% identity over 100 amino acids are homologues, and so are two DNA sequences with more than 70% identity over 100 nucleotides. Homologous sequences have a common ancestor (proteins and DNA), a similar 3D structure (proteins), and often a similar function (proteins).

Homology: When two proteins have less than 25% identity, they can be homologous or non-homologous. Within this range of identity, it’s impossible to say which is true. This range of identity is called the “Twilight Zone”.

Homology, Similarity, and Identity: Identity is a measure made on an alignment: Sequence A can be “32% identical to” Sequence B. Similarity is a measure of how close two amino acids are to identical; for instance, isoleucine and leucine are similar. Homology is a property that exists or does not exist: Sequence A IS or IS NOT homologous to Sequence B, so Sequence A cannot be “40% homologous to” B. Homology is established on the basis of measured similarity or identity.
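To make "identity is a measure made on an alignment" concrete, here is a minimal Python sketch. The function name and the choice to skip gapped columns are assumptions for illustration; BLAST itself reports identities over the full alignment length, gaps included, so its percentages can differ from this one.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap.

    Both inputs must be rows of the same pairwise alignment (equal length).
    Skipping gapped columns is an illustrative convention, not BLAST's.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # ignore columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Toy five-residue alignment: four of five columns match.
    print(percent_identity("MATWL", "MATPL"))
```

Note that the result is a percentage of identity, never a "percentage of homology": the number is evidence from which homology is inferred.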

How to Establish Homology: Compare Protein A with every other protein in a database such as Swiss-Prot. Identify a Protein B that is, say, 40% identical to your protein. Specialists prefer using E-values, but the idea is the same (more on this in a minute). You can conclude that A and B are probably homologous if they are very similar. It’s like saying, “John and Nancy are probably brother and sister because they are very similar.” If you know the structure or the function of B, then A probably has the same structure or function.

In-silico Biology: Once you have established that two proteins (A and B) are homologous, you can extrapolate what you know from one to the other. It’s like making a virtual experiment. This is in-silico biology!

BLAST: BLAST stands for Basic Local Alignment Search Tool. BLAST is a tool for comparing one sequence with all the other sequences in a database. BLAST can compare DNA sequences and protein sequences. BLAST is more accurate for comparing protein sequences than for comparing DNA sequences.

BLAST (cont’d.): BLAST makes local alignments: it only aligns what can be aligned and ignores the rest. BLAST is very fast: you need only a few minutes to search Swiss-Prot on a standard PC. Many BLAST flavors are available for a variety of tasks.

Many BLAST Flavors...

BLASTing a Protein Sequence

Running blastp: Choose one of the public servers: NCBI (www.ncbi.nlm.nih.gov/blast), EBI (www.ebi.ac.uk/blast), or EMBNet (www.expasy.ch/blast). Select a database to search: NR to find any protein sequence, Swiss-Prot to find proteins with known functions, or PDB to find proteins with known structures. Cut and paste your sequence. Click the BLAST button.

Reading BLAST Output: Graphic display: overview of the alignments. Hit list: gives the score of each match. Alignments: details of each alignment.

The Graphic Display: The horizontal axis (0-700) corresponds to your protein (the query). Color codes indicate each match’s quality: red is very good, green is acceptable, black is bad. Thin lines join independent matches on the same sequence.

The Hit List: Sequence accession number: depends on the database. Description: taken from the database. Bit score: high bit score = good match. E-value: low E-value = good match. Links: Genome Uniref, database of transcripts.
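The same hit-list fields can be read programmatically. The sketch below assumes the hits were saved in the 12-column tabular format written by command-line BLAST+ with `-outfmt 6`, which the slides do not cover; the sample rows and the helper names are invented for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text: str) -> list:
    """Turn tab-separated hit lines into dictionaries with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def best_hits(hits: list, max_evalue: float) -> list:
    """Keep hits at or below the E-value cutoff, lowest E-value first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# Invented example rows, not real search results.
SAMPLE = (
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t2e-110\t390\n"
    "query1\tsubjB\t31.0\t180\t110\t5\t10\t185\t22\t200\t0.5\t35.2\n"
    "query1\tsubjC\t45.2\t190\t95\t3\t5\t190\t8\t195\t3e-20\t95.1\n"
)

if __name__ == "__main__":
    for hit in best_hits(parse_hits(SAMPLE), max_evalue=1e-4):
        print(hit["sseqid"], hit["bitscore"], hit["evalue"])
```

The cutoff is passed in as a parameter so that whichever E-value threshold you settle on stays visible at the call site.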

The E-Values: E-value means expectation value. The E-value is the measure most commonly used for judging whether a sequence similarity is significant. It answers the question: how many times is a match at least as good expected to happen by chance? This estimate is based on the similarity measure. If a match is highly unexpected, it probably results from something other than chance. Common origin is the most likely explanation. This is how homology is inferred.
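As a rough illustration of how the E-value depends on both the score and the size of the search, the sketch below uses the textbook relation E = m × n × 2^(−S'), where S' is the bit score, m the query length, and n the total length of the database. This relation is not stated in the slides, and real BLAST applies effective-length corrections, so treat the numbers as approximate.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths rather than BLAST's effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 100 million residues.
    for score in (30, 50, 100):
        print(score, expected_hits(score, 300, 100_000_000))
```

Two consequences follow directly from the formula: each extra bit of score halves the E-value, and the same alignment gets a larger E-value when the database grows.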

Which Value for Your E-Values? Low E-value means good hit. 1 = bad E-value. 10 e-3 = borderline E-value. 10 e-4 = good E-value. 10 e-10 = very good E-value. E-values lower than 10 e-4 indicate possible homology. E-values higher than 10 e-4 require extra evidence to support homology.

Why Use E-Values? E-values make it possible to compare alignments of different lengths. E-values are used by most sequence comparison programs: PSI-BLAST, domain searches, FASTA. E-values always have the same meaning, so you can compare the output of different programs.

The Alignments: Look for clusters of identity. Gray residues are low-complexity regions. Grayed-out regions have been removed from your sequence to avoid false hits.

BLASTing DNA Sequences: The BLAST program you need depends on whether your DNA sequence is coding or non-coding. BLASTing DNA sequences is less accurate than BLASTing protein sequences. If your sequence is coding, blastx and tblastx will translate it for you in its six possible reading frames.
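The six reading frames are three offsets on the forward strand plus three on the reverse complement. The following Python sketch shows the idea using the standard genetic code; it illustrates the concept only and is not how blastx or tblastx are implemented, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons appear as `*`; a frame that reads through a long stretch without stops is a hint that it may be the coding frame.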

BLASTing DNA Sequences

Asking the Right Question with BLAST

The BLAST Way of Doing Things: The original BLAST paper is the fourth-most-cited scientific publication: 21,000 citations for BLAST and 18,000 citations for PSI-BLAST. BLAST has changed many aspects of modern biology. The following slides show more BLAST procedures. They are not necessarily the best procedures, but they are effective ways of getting the job done on the spot.

Gene-Hunting with BLAST: Cut your genome sequence into short (2-5 kb) overlapping pieces. Use blastx to BLAST each piece of the genome against NR (the non-redundant protein database). This works better if you have no introns (bacteria). The complicated alternative is to run a gene-prediction program. Predicting a Protein Function.
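The cutting step can be sketched in a few lines of Python. The slide gives only the piece size (2-5 kb); the default overlap of 500 bases below is an assumed value chosen for illustration, and the function name is hypothetical.

```python
def overlapping_chunks(sequence: str, chunk_size: int = 5000, overlap: int = 500) -> list:
    """Split a sequence into (start, piece) tuples that overlap their neighbors.

    chunk_size follows the 2-5 kb suggestion; overlap is an assumed default.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(sequence):
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # this piece already reaches the end of the sequence
        start += step
    return chunks


if __name__ == "__main__":
    # Tiny demonstration with letters instead of a real genome.
    for start, piece in overlapping_chunks("ABCDEFGHIJKL", chunk_size=5, overlap=2):
        print(start, piece)
```

Keeping the start coordinate with each piece makes it possible to map a blastx hit back to its position in the genome. The overlap means a gene that straddles one cut point is still contained in a neighboring piece, provided the overlap is long enough.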

In-silico Analysis with BLAST: Use blastp to BLAST your protein sequence against Swiss-Prot. If you get a good hit (more than 25 percent identity) over the complete length of the protein, you’ve solved your problem: your protein very probably has the same function as the Swiss-Prot protein. The complicated alternative is to conduct domain analysis or wet-lab experiments. Predicting a Protein Function.

Structural Analysis with BLAST: Use blastp to BLAST your protein against PDB (the database of protein structures). If you get a good hit (more than 25 percent identity), you know that your protein and this good hit have a similar 3D structure. The complicated alternative is to do homology modeling, X-ray, or NMR analysis of your protein. Predicting a Protein 3D Structure.

Gathering Members of a Protein Family: Use blastp (or its more powerful cousin PSI-BLAST) and run it against NR (the non-redundant protein database). After you have all the members of the family, you can make a multiple-sequence alignment (see Chapter 9) and draw a phylogenetic tree. The complicated alternative is to use PCR for cloning your sequences. Finding Protein Family Members.

Some Reasons for Changing the Default Parameters

PSI-BLAST: PSI-BLAST is Position-Specific Iterated BLAST. More sensitive than BLAST: finds matches BLAST would not find. More specific than BLAST: reports fewer false matches. A bit slower than BLAST. PSI-BLAST finds remote homologues and will let you identify very distant members of your protein family. PSI-BLAST uses the results of each iteration to increase its specificity.

PSI-BLAST Iterations: PSI-BLAST uses the best results of the first iteration to build a profile (PSSM). PSI-BLAST uses the profile to re-scan the database. PSI-BLAST keeps re-scanning until it stops finding new matches.
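To give a feel for what a profile is, here is a deliberately simplified Python sketch that builds a position-specific scoring matrix from a few aligned hits. It assumes a uniform background frequency and flat pseudocounts; PSI-BLAST's own procedure (sequence weighting, observed background frequencies, substitution-matrix-based pseudocounts) is more elaborate, and the function names here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences: list, pseudocount: float = 1.0) -> list:
    """Return one {amino acid: log-odds score} dictionary per alignment column.

    Simplified: uniform background and flat pseudocounts (both assumptions).
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned_sequences
                         if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score(pssm: list, segment: str) -> float:
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["MKV", "MKI", "MRV"])  # toy alignment
    print(score(profile, "MKV"), score(profile, "AAA"))
```

Unlike a fixed substitution matrix, each column has its own scores, so a position that is strictly conserved in the family penalizes substitutions more heavily than a variable one. That is what lets the re-scan pick up distant family members.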

Some Tips for Using PSI-BLAST: If your protein is multi-domain, search one domain at a time. PSI-BLAST is slower than normal BLAST because of the iterations. You can feed PSI-BLAST with your own PSSM; use the NCBI server for this purpose.

Going Farther: Each BLAST online server is unique, so shop around to find the right database. If you need to look for exact matches between a sequence and a genome, use BLAT (no, it’s not a typo); you can find it at genome.ucsc.edu. If you want something more accurate than BLAST, use the Smith-Waterman algorithm; it’s also slower than BLAST. You can find it at www-btls.jst.go.jp.
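For reference, the core of the Smith-Waterman algorithm is a small dynamic-programming recurrence. The Python sketch below computes only the best local alignment score, using simple match/mismatch scores and a linear gap penalty. Those scoring values are assumptions chosen for illustration; real protein searches use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGMATWLCC", "TTMATWLAA"))
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. BLAST avoids most of that work with heuristics, which is why it is faster but can miss alignments that Smith-Waterman would find.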