BLAST and Multiple Sequence Alignment

Presentation transcript:

BLAST and Multiple Sequence Alignment
Learning objectives: learn the basics of BLAST, PSI-BLAST, and multiple sequence alignment.

Which program should one use?
Most researchers use methods for determining local similarities:
  Smith-Waterman (gold standard)
  FASTA
  BLAST
FASTA and BLAST do not find every possible alignment of the query with a database sequence. They are used because they run faster than Smith-Waterman (S-W).

BLAST Basic Local Alignment Search Tool Three phases: 1) List of high scoring words 2) Scan the sequence database 3) Extend hits

The threshold and word size
The program declares a hit when a word from a database sequence scores >= T against a word taken from the query sequence under the chosen scoring matrix. Scoring against a threshold, rather than requiring exact matches, allows the word size (W) to be kept high (for speed) without sacrificing sensitivity. If T is increased, the number of background hits is reduced and the program will run faster.

Phase 1: Compile a list of high-scoring words above threshold T.
Query sequence (human p53): ...RCPHHERCSD...
Words derived from query sequence: RCP, CPH, PHH, HHE, …
List of words above threshold T:
Word | Scores from BLOSUM scoring matrix | Total score
RCP | 5 + 9 + 7 | 21
KCP | 2 + 9 + 7 | 18
QCP | 1 + 9 + 7 | 17
ECP | 0 + 9 + 7 | 16
...
Note: On the slide, a line drawn across this table marks the threshold. Word size is 3.
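The word list in Phase 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the `SCORES` dictionary holds just the substitution scores quoted on the slide (not a full BLOSUM matrix), the function names are hypothetical, and the threshold value 17 is an assumption chosen for illustration because the slide does not state T.

```python
# Only the substitution scores quoted on the slide; a real search
# would load the full 20x20 scoring matrix instead.
SCORES = {
    ("R", "R"): 5, ("K", "R"): 2, ("Q", "R"): 1, ("E", "R"): 0,
    ("C", "C"): 9, ("P", "P"): 7,
}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs not listed above."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def query_words(query, w=3):
    """All overlapping words of length w taken from the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(candidate, query_word):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(candidate, query_word))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words whose score against query_word is >= threshold."""
    scored = [(c, word_score(c, query_word)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


if __name__ == "__main__":
    print(query_words("RCPHHERCSD")[:4])
    # T = 17 is an assumed threshold, not a value given on the slide.
    print(neighborhood("RCP", ["RCP", "KCP", "QCP", "ECP"], threshold=17))
```

With the assumed threshold of 17, ECP (score 16) falls below the cutoff and is dropped from the list, while the other three words are kept.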

Phase 2: Scan the database for short segments that match the list of acceptable words (words scoring above or equal to threshold T).
Phase 3: Extend the hits and terminate when the tabulated score drops below a cutoff score.
Query  EVVRRCPHHERCSD
       EVVRRCPHHER S+
Sbjct  EVVRRCPHHERSSE  (Ch. hamster p53 O09185)
If the hit is extended far enough, the query/subject segment is called a High Scoring Segment Pair (HSP).
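The extension step in Phase 3 can be sketched as an ungapped rightward extension. Several things here are assumptions rather than details from the slides: the function name `extend_right` is hypothetical, the scoring is a simple match/mismatch scheme (+5/-4) instead of a BLOSUM matrix, and the cutoff is interpreted as an "X-drop" rule (stop once the running score falls more than `x_drop` below the best score seen so far).

```python
def extend_right(query, subject, q_start, s_start, seed_len,
                 x_drop=10, match=5, mismatch=-4):
    """Extend an exact seed to the right without gaps.

    Returns (best_score, best_length), where best_length counts residues
    from the start of the seed to the end of the best-scoring extension.
    """
    score = seed_len * match          # the seed is assumed to match exactly
    best_score, best_len = score, seed_len
    i = seed_len
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        if score > best_score:
            best_score, best_len = score, i + 1
        elif best_score - score > x_drop:
            break                     # score dropped too far below the best
        i += 1
    return best_score, best_len


if __name__ == "__main__":
    q = "EVVRRCPHHERCSD"   # query segment from the slide
    s = "EVVRRCPHHERSSE"   # subject segment from the slide
    score, length = extend_right(q, s, q_start=4, s_start=4, seed_len=3)
    print(score, q[4:4 + length])
```

A full implementation would also extend to the left of the seed and would use the scoring matrix from Phase 1; the one-directional version is kept short to show only the stopping rule.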

What are the different BLAST programs?
blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
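To make "translated in all reading frames" concrete, the following sketch produces the six-frame translation of a DNA sequence using the standard genetic code. The function names are hypothetical, incomplete trailing codons are simply dropped, and stop codons are written as `*`; it is an illustration of the idea, not how the BLAST programs themselves are implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Reverse complement of an upper-case A/C/G/T string."""
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frames("ATGGCCACCTGGCTG"):
        print(frame)
```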

What are the different BLAST programs? (continued)
psi-blast compares a protein sequence to a protein database. It performs the comparison in an iterative fashion in order to detect homologs that are evolutionarily distant.
blast2 compares two protein or two nucleotide sequences.

The E value (false positive expectation value) The Expect value (E) is a parameter that describes the number of “hits” one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship). The higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences. The E value is used as a convenient way to create a significance threshold for reporting results. When the E value is increased from the default value of 10 prior to a sequence search, a larger list with more low-similarity scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.

E value (Karlin-Altschul statistics)
$E = K \cdot m \cdot n \cdot e^{-\lambda S}$
where K is a scaling factor (constant), m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, and S is the similarity score.
If S increases, E decreases exponentially.
If the decay constant increases, E decreases exponentially.
If m•n increases, the “search space” increases and there is a greater chance for a random “hit”, so E increases.
A larger database will increase E. However, a larger query sequence often decreases E. Why?
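The behavior of the formula is easy to explore numerically. The sketch below is a direct transcription of E = K·m·n·e^(-λS); the function name and the default values of `K` and `lam` are placeholders chosen only for illustration, not parameters taken from the slides.

```python
import math


def e_value(score, m, n, K=0.041, lam=0.267):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lam * score).

    K and lam are placeholder constants for illustration only; in practice
    they depend on the scoring matrix and gap penalties in use.
    """
    return K * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    base = e_value(score=60, m=300, n=1_000_000)
    bigger_db = e_value(score=60, m=300, n=2_000_000)
    higher_score = e_value(score=70, m=300, n=1_000_000)
    print(f"E at S=60:            {base:.3g}")
    print(f"E with 2x database:   {bigger_db:.3g}")   # scales linearly with n
    print(f"E at S=70:            {higher_score:.3g}")  # falls exponentially with S
```

Because E is linear in m•n but exponential in S, a modest increase in score outweighs a large increase in search space.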

Thought problem
A homolog to a query sequence resides in two databases. One is the UniProtKB/SwissProt database and the other is the PDB database. After performing a BLAST search against the UniProtKB database you obtain an E value of 1. After performing the BLAST search against the PDB database you obtain an E value of 0.0625. What are the relative sizes of the two databases?

Using BLAST to get quick answers to bioinformatics problems
Task | BLAST method | Trad. method
Predict protein function (1) | Perform blastp on PIR or Swiss-Prot database | Perform wet-lab experiment
Predict protein function (2) | Perform tblastn on NR database |
Predict protein structure | Perform blastp against PDB | Structure prediction software, X-ray crystallography, NMR

Using BLAST to get quick answers to bioinformatics problems (cont.)
Task | BLAST method | Trad. method
Locate genes in a genome | Divide genome into 2-5 kb sequences. Perform blastx against NR protein database | Run gene prediction software. Perform microarray analysis or RNAs
Find distantly related proteins | Perform psi-blast | No traditional method
Identify DNA sequence | Perform blastn | Screen genomic DNA library

Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive. This is due to:
  retrotransposons
  ALU region
  microsatellites
  centromeric sequences, telomeric sequences
  5’ Untranslated Region of ESTs
Example of EST with simple low complexity region:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC

Filtering Repetitive Sequences and Masking Options available for user.

PSI-BLAST
PSI = position-specific iterative.
A position-specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of the initial BLAST search. The normal E value threshold is used.
The PSSM is used as the new scoring matrix for a second BLAST search. A low E value threshold is used (E = 0.001).
Result: 1) obtains distantly related sequences; 2) finds the important residues that provide function or structure.
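The idea of a PSSM can be illustrated with a small log-odds calculation over aligned, ungapped segments. This is a simplified sketch under stated assumptions: a uniform background frequency of 1/20, a fixed pseudocount of 1 per residue, and no sequence weighting. PSI-BLAST itself uses a more elaborate scheme, and the function names and example segments here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_segments, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumes equal-length, ungapped segments and a uniform background
    frequency (1/20); both are simplifications for illustration.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_segments[0])):
        counts = Counter(seg[col] for seg in aligned_segments)
        total = len(aligned_segments) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment by summing the column-specific residue scores."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


if __name__ == "__main__":
    hits = ["RCPHH", "RCPHH", "KCPHH"]   # hypothetical aligned HSP segments
    pssm = build_pssm(hits)
    print(round(score_with_pssm(pssm, "RCPHH"), 2))
    print(round(score_with_pssm(pssm, "KCPHH"), 2))
    print(round(score_with_pssm(pssm, "AAAAA"), 2))
```

Unlike a general matrix such as BLOSUM, the score for a residue depends on the column: positions that are conserved among the initial hits reward the conserved residue strongly, which is what lets the second search pick up distant homologs.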

A PSSM

Steps to multiple alignment
1. Create the alignment.
2. Edit the alignment to ensure that regions of functional or structural similarity are preserved.
3. Use the alignment for:
  Phylogenetic analysis
  Structure analysis
  Finding conserved motifs to deduce function
  Design of PCR primers

Multiple Sequence Alignment Collection of three or more protein (or nucleic acid) sequences partially or completely aligned. Aligned residues tend to occupy corresponding positions in the 3-D structure of each aligned protein.

Practical use of MSA
  Helps to place a protein into a group of related proteins, which provides insight into function, structure and evolution.
  Helps to detect homologs.
  Identifies sequencing errors.
  Identifies important regulatory regions in the promoters of genes.

Clustal W (Thompson et al., 1994) CLUSTAL=Cluster alignment The underlying concept is that groups of sequences are phylogenetically related. If they can be aligned then one can construct a phylogenetic tree.

Flowchart of computation steps in Clustal W (Thompson et al., 1994)
1. Pairwise alignment: calculation of distance matrix
2. Creation of unrooted neighbor-joining (NJ) tree
3. Rooted NJ tree (guide tree) and calculation of sequence weights
4. Progressive alignment following the guide tree

Step 1-Pairwise alignments
Compare each sequence with each other and calculate a distance matrix.
     A     B     C
A    -
B   .87    -
C   .59   .60    -
(A, B and C are the different sequences.)
Each number represents the number of exact matches divided by the sequence length (ignoring gaps). Thus, the higher the number, the more closely related the two sequences are. In this matrix, sequence A is 87% identical to sequence B.

Step 1-Pairwise alignments
Compare each sequence with each other and calculate pairwise alignment scores.
human  EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN  480
dog    EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV  477
mouse  GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE   476

Step 1-Calculation of Distance Matrix
Use the Distance Matrix to create a Guide Tree to determine the “order” of the sequences.
              1     2     3     4     5     6     7
Hbb-Hu  1     -
Hbb-Ho  2    .17    -
Hba-Hu  3    .59   .60    -
Hba-Ho  4    .59   .59   .13    -
Myg-Ph  5    .77   .77   .75   .75    -
Gib-Pe  6    .81   .82   .73   .74   .80    -
Lgb-Lu  7    .87   .86   .86   .88   .93   .90    -
$D = 1 - I$
where D is the difference score and
$I = \dfrac{\text{number of identical aa's in pairwise global alignment}}{\text{total number of aa's in shortest sequence}}$
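The difference score D = 1 - I can be computed directly from a pair of globally aligned sequences. The sketch below assumes the two sequences have already been aligned, with `-` marking gaps; the function name and the example sequences are hypothetical.

```python
def difference_score(aligned_a, aligned_b, gap="-"):
    """D = 1 - I, with I = identical positions / length of the shorter sequence.

    Both inputs must be rows of the same global alignment (equal length,
    gaps written as '-'). Sequence lengths are counted without gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    shortest = min(len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, "")))
    return 1.0 - identical / shortest


if __name__ == "__main__":
    # Hypothetical toy alignment: 4 identities, shorter sequence has 5 residues.
    print(round(difference_score("MA-TWL", "MAKTPL"), 2))
```

Applying this function to every pair of sequences fills the lower triangle of a matrix like the one above, which is then used to build the guide tree.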

Step 2-Create unrooted NJ tree
(Figure: unrooted NJ tree with leaves Hba-Ho, Hba-Hu, Hbb-Ho, Hbb-Hu, Myg-Ph, Gib-Pe, Lgb-Lu)

Step 3-Create Rooted NJ Tree
(Figure: rooted tree with a weight for each sequence and the resulting alignment order)
Order of alignment:
1 Hba-Hu vs Hba-Ho
2 Hbb-Hu vs Hbb-Ho
3 A vs B
4 Myg-Ph vs C
5 Gib-Pe vs D
6 Lgb-Lu vs E

Step 4-Progressive alignment

Step 4-Progressive alignment Scoring during progressive alignment

Rules for alignment
Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure), and therefore gap penalties are reduced for such stretches.
Gap penalties for closely related sequences are lowered compared to more distantly related sequences (“once a gap always a gap” rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.
Alignments of proteins of known structure show that gaps in proteins do not occur more frequently than every eight residues. Therefore gap penalties are increased when a new gap would fall within 8 residues of an existing gap. This gives a lower alignment score in that region.
A gap weight is assigned after each aa according to the frequency with which a gap naturally occurs after that aa.

Amino acid weight matrices As we know, there are many scoring matrices that one can use depending on the relatedness of the aligned proteins. As the alignment proceeds to longer branches the aa scoring matrices are changed to more divergent scoring matrices. The length of the branch is used to determine which matrix to use and contributes to the alignment score.

Example of Sequence Alignment using Clustal W
* (asterisk) represents identity
: represents high similarity
. represents low similarity

Multiple Alignment Considerations
Quality of guide tree: it would be good to have a set of closely related sequences in the alignment to set the pattern for more divergent sequences.
If the initial alignments have a problem, the problem is magnified in subsequent steps.
CLUSTAL W is best when aligning sequences that are related to each other over their entire lengths.
Do not use when there are variable N- and C-terminal regions.
If the protein is enriched for G, P, S, N, Q, E, K, R then these residues should be removed from the gap penalty list. (What types of residues are these?)
Reference: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/