©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.

Presentation transcript:

©CMBI 2005
Database Searching: BLAST
Database Searching
Sequence Alignment
Scoring Matrices
Significance of an alignment
BLAST, algorithm
BLAST, parameters
BLAST, output
Alignment significance in BLAST

Database Searching
Identify similarities between novel query sequences, whose structures and functions are unknown and uncharacterized, and sequences in (public) databases whose structures and functions have been elucidated.
N.B. The similarity might span the entire query sequence or just part of it!

Database Searching (2)
The query sequence is compared (aligned) with every sequence in the database.
High-scoring database sequences are assumed to be evolutionarily related to the query sequence.
If sequences are related by divergence from a common ancestor, they are said to be homologous.
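The exhaustive comparison described on this slide can be illustrated with a minimal sketch. It uses a Smith-Waterman style local alignment score with simple match/mismatch/gap values; the scoring values, the function names and the toy database are assumptions made for illustration and are not taken from the slides. A real search would use a substitution matrix and a faster heuristic.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def search_database(query, database):
    """Score the query against every database sequence and rank the hits."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Hypothetical toy database, for illustration only.
toy_database = {
    "hitA": "MMKWLIAPTT",
    "hitB": "GGSGGSGGSG",
    "hitC": "TTKWLVAPMM",
}
ranked = search_database("KWLIAP", toy_database)
for score, name in ranked:
    print(name, score)
```

The highest-scoring entries are the ones that would be proposed as candidate homologs; whether a given score is meaningful is a separate, statistical question.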

Sequence Alignment (J. Leunissen)
The purpose of a sequence alignment is to line up, in any number of sequences, all residues that were derived from the same residue position in the ancestral gene or protein.
gap = insertion or deletion
[Figure: alignment diagram with sequences labeled A and B]

Scoring Matrix/Substitution Matrix
Used to score the quality of an alignment.
Contains scores for pairs of residues (amino acids or nucleic acids) in a sequence alignment.
For protein/protein comparisons: a 20 x 20 matrix of similarity scores, where identical amino acids and those of similar character (e.g. Ile, Leu) give higher scores than those of different character (e.g. Ile, Asp).
Symmetric.

Substitution Matrices
Not all amino acids are equal
Residues mutate more easily to similar ones
Residues at surface mutate more easily
Aromatics mutate preferably into aromatics
Mutations tend to favor some substitutions
Core tends to be hydrophobic
Selection tends to favor some substitutions
Cysteines are dangerous at the surface
Cysteines in bridges seldom mutate

PAM250 Matrix

Scoring example
The score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1: TCCPSIVARSN
sequence 2: SCCPSISARNT
=> alignment score = 46
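The sum in this scoring example can be reproduced with a short sketch. The lookup table holds only the residue pairs needed here, with values as they appear in the commonly tabulated PAM250 matrix; treat the values and the helper names as assumptions to check against the matrix you actually use.

```python
# Subset of PAM250 (assumed values; only the pairs this example needs).
# Keys are stored with the two residues in alphabetical order.
PAM250_SUBSET = {
    ("C", "C"): 12, ("P", "P"): 6, ("R", "R"): 6, ("I", "I"): 5,
    ("A", "A"): 2, ("S", "S"): 2, ("S", "T"): 1, ("N", "S"): 1,
    ("N", "T"): 0, ("S", "V"): -1,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric, so order does not matter."""
    return PAM250_SUBSET[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2):
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = alignment_score("TCCPSIVARSN", "SCCPSISARNT")
print(score)
```

With these values the eleven pair scores are 1, 12, 12, 6, 2, 5, -1, 2, 6, 1 and 0, which add up to the 46 quoted on the slide.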

Dayhoff Matrix (1)
The group of Dayhoff created a scoring matrix from a dataset of closely similar protein sequences that could be aligned unambiguously.
Then they counted all mutations (and non-mutations) and calculated the mutation frequencies.
With a bit of math, they converted these frequencies into the famous Dayhoff matrix (also called PAM matrix).

Dayhoff Matrix (2)
Given the frequency of Leu and Val in my sequences, do I see more mutations of V → L than I would expect by chance?
Score of mutation A → B = log ( observed A → B mutation rate / expected A → B mutation rate )
This is called a log-odds score and can be negative, zero, or positive.
When using a log-odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
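A small sketch may help show how a log-odds score changes sign. It assumes one common formulation in which the expected frequency of a pair is the product of the two residues' background frequencies; all frequencies below are invented for illustration and are not Dayhoff's data.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, base=10.0):
    """Log of observed pair frequency over the frequency expected by chance."""
    expected = freq_a * freq_b  # chance expectation if residues pair independently
    return math.log(observed_pair_freq / expected, base)


# Hypothetical background frequencies for Leu and Val.
freq_leu, freq_val = 0.09, 0.07

more_than_chance = log_odds(0.0126, freq_leu, freq_val)   # observed twice as often
same_as_chance = log_odds(0.0063, freq_leu, freq_val)     # observed as expected
less_than_chance = log_odds(0.00315, freq_leu, freq_val)  # observed half as often

print(more_than_chance, same_as_chance, less_than_chance)
```

A substitution seen more often than chance predicts gets a positive score, one seen exactly as often gets a score near zero, and one seen less often gets a negative score.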

Significance of alignment (1)
When is an alignment statistically significant?
In other words: How different is the alignment score found from the scores obtained by aligning random sequences to the query sequence?
Or: What is the probability that an alignment with this score could have arisen by chance?

Significance of alignment (2)
Database size = 20 x 10^6 letters

peptide    #hits
A          1 x 10^6
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
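The hit counts on this slide follow from a simple estimate, sketched below under two assumptions that the slide does not state explicitly: all 20 amino acids are equally frequent, and edge effects at sequence ends are ignored. Each extra residue then divides the expected number of exact matches by 20.

```python
DATABASE_LETTERS = 20 * 10**6  # database size from the slide
ALPHABET_SIZE = 20             # assumption: 20 equally likely amino acids


def expected_hits(peptide):
    """Expected number of exact occurrences of a peptide in a random database."""
    return DATABASE_LETTERS / ALPHABET_SIZE ** len(peptide)


hits = {p: expected_hits(p) for p in
        ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]}
for peptide, count in hits.items():
    print(f"{peptide:<8} {count:g}")
```

The last three values come out as 6.25, 0.3125 and 0.015625, which the slide rounds to 6, 0.3 and 0.015.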

BLAST
Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?
BLAST finds the highest scoring locally optimal alignments between a query sequence and a database.
Very fast algorithm
Can be used to search extremely large databases
Sufficiently sensitive and selective for most purposes
Robust – the default parameters can usually be used

BLAST – Algorithm
Step 1: Read/understand the user's query sequence.
Step 2: Use hashing technology to select several thousand likely candidates.
Step 3: Do a real alignment between the query sequence and those likely candidates. 'Real alignment' is a main topic of this course.
Step 4: Present the output to the user.
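Step 2 can be illustrated with a simplified sketch: index every short word of the database in a hash table, then keep only the sequences that share enough words with the query. The word size, the threshold, the function names and the toy database are assumptions for illustration; the actual BLAST implementation is more elaborate (for proteins it also considers similar, not only identical, words).

```python
from collections import defaultdict

WORD_SIZE = 3  # assumption: short words of three residues


def build_word_index(database, word_size=WORD_SIZE):
    """Map every word to the set of database sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].add(name)
    return index


def select_candidates(query, index, word_size=WORD_SIZE, min_shared_words=2):
    """Return sequences sharing at least min_shared_words words with the query."""
    shared = defaultdict(int)
    for i in range(len(query) - word_size + 1):
        for name in index.get(query[i:i + word_size], ()):
            shared[name] += 1
    kept = [name for name, count in shared.items() if count >= min_shared_words]
    return sorted(kept, key=lambda name: -shared[name])


# Hypothetical toy database, for illustration only.
toy_database = {
    "seqA": "MKWLIAPYTT",
    "seqB": "GGGGSGGGGS",
    "seqC": "AAKWLIAGGG",
}
word_index = build_word_index(toy_database)
candidates = select_candidates("KWLIAPY", word_index)
print(candidates)
```

Only the surviving candidates would then go through the slower 'real alignment' of Step 3, which is what keeps the overall search fast.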

Basic BLAST Algorithms

Program    Query            Database
BLASTP     protein          protein
BLASTN     DNA              DNA
BLASTX     translated DNA   protein
TBLASTN    protein          translated DNA
TBLASTX    translated DNA   translated DNA

PSI-BLAST
Position-Specific Iterated BLAST
Distant relationships are often best detected by motif or profile searches rather than pair-wise comparisons.
PSI-BLAST first performs a BLAST search.
PSI-BLAST uses the information from the significant BLAST alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching.
PSI-BLAST may be iterated until no new significant alignments are found.
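To make the idea of a position-specific score matrix concrete, here is a deliberately simplified sketch. It assumes ungapped hits of equal length, a uniform background frequency and a fixed pseudocount; PSI-BLAST itself uses sequence weighting and more refined pseudocounts, so this is an illustration of the concept and not its actual procedure.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def build_pssm(aligned_hits, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("hits must be aligned to the same length")
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in aligned_hits)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment against the matrix, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical significant hits from a first search round.
first_round_hits = ["KWLIAP", "KWLVAP", "RWLIAP", "KWLIAP"]
pssm = build_pssm(first_round_hits)
conserved_score = score_with_pssm(pssm, "KWLIAP")
unrelated_score = score_with_pssm(pssm, "DDDDDD")
print(conserved_score, unrelated_score)
```

Unlike a fixed substitution matrix, the score for a residue here depends on the position: a tryptophan is rewarded at the fully conserved second column and penalised elsewhere.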

BLAST Input
Steps in running BLAST:
Enter your query sequence (cut-and-paste)
Select the database(s) you want to search
Choose output parameters
Choose alignment parameters (scoring matrix, filters, ...)
Example query =
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN

BLAST Output (1)

BLAST Output (2)
A high score, or preferably clusters of high scores, indicates a likely relationship.
A low probability indicates that a match is unlikely to have arisen by chance.

BLAST Output (3)
Low scores with high probabilities suggest that matches have arisen by chance.

Alignment Significance in BLAST
P-value (probability): relates the score for an alignment to the likelihood that it arose by chance. The closer to zero, the greater the confidence that the hit is real.
E-value (expect value): the number of alignments with a score at least as high as the one observed that would be expected by chance in that database (e.g. if E=10, 10 matches with scores this high are expected to be found by chance).
A match will be reported if its E is below the threshold. Lower E thresholds are more stringent, and report fewer matches.
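The two measures are related. Under the usual assumption that chance hits follow a Poisson distribution, the probability of seeing at least one chance hit is P = 1 - exp(-E). The sketch below uses that relation; the function names and the default threshold of 10 are chosen for illustration only.

```python
import math


def p_value_from_e_value(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # expm1 keeps precision for very small E


def is_reported(e_value, threshold=10.0):
    """A match is reported when its E-value is below the chosen threshold."""
    return e_value < threshold


p_small = p_value_from_e_value(1e-5)
p_large = p_value_from_e_value(10)
print(p_small, p_large)
print(is_reported(0.001, threshold=0.01), is_reported(5, threshold=0.01))
```

For small values E and P are nearly identical, while for E = 10 the probability of at least one chance hit is almost 1, which is why a hit with such an E-value carries little evidence on its own.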

BLAST Output (4)

BLAST Output (5)

BLAST Output (6)

Low complexity filter

Local implementation - Blast in MRS
MRS also contains a BLAST. This BLAST is simpler, has fewer options, and knows fewer databases, but is faster.

Blast in MRS
MRS Blast remembers all your queries from one session and stores them in a table. The one you are running is in that table too. Multiple BLASTs can run at one time.
[Figure labels: Still running, Ready]

Blast hitlist in MRS

Blast hitlist expansion in MRS

Low complexity motifs visible

Routing

Routing to Clustal

Routing MRS to Blast