BLAST: A Case Study Lecture 25

BLAST: Introduction
The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters. BLAST was developed to find sequences of nucleotides or amino acids in a database that match a query sequence. For example, searching the human genome for AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT produces a list of sequences scored by similarity. This system helps scientists find genetic homologues across individuals and species.

Using BLAST
There are several interfaces to BLAST, and it often appears as one component of a larger suite of informatics tools. The National Center for Biotechnology Information (NCBI) hosts the primary website and a server farm dedicated to BLAST. From the website, a user enters a query, selects a database, chooses a variant of BLAST to use, and sets program parameters. Results appear in seconds.

BLAST Results
The NCBI BLAST tool returns results in several modes, with information centered on similarity scores. In addition to a list of matches, the tool returns a graphical view of the list that visualizes the alignments, a detailed textual view of each match, and a mapping of the matches to a visual representation of an entire genome.

How BLAST Works (Stage 1)
The core BLAST algorithm has three distinct stages. In the first stage, the system splits the query sequence into overlapping words of a constant size, W. Assuming W is 4, the nucleotide query AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT produces the words
AGCT GCTT CTTT … GCCT CCTT CTTT
BLAST matches these against every possible four-letter word over the sequence alphabet (256 words for nucleotides) to build similarity scores. The words whose similarity scores exceed a threshold move on to the later stages; the rest are discarded.
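The first stage can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation: the function names (query_words, match_count, neighborhood) are hypothetical, and the similarity measure is a toy count of matching positions chosen only to keep the example short.

```python
from itertools import product


def query_words(query, w):
    """Return every overlapping length-w word in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def match_count(a, b):
    """Toy similarity (an assumption): positions where two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))


def neighborhood(word, threshold, alphabet="ACGT"):
    """All words over the alphabet whose similarity to `word` meets the threshold."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [cand for cand in candidates if match_count(word, cand) >= threshold]


query = "AGCTTTTCTCTTCTGTCAACCCCACACGCCTTT"
words = query_words(query, 4)
print(words[:3], words[-3:])

# Words that agree with AGCT in at least 3 of 4 positions:
# the word itself plus 4 positions x 3 alternative letters.
kept = neighborhood("AGCT", threshold=3)
print(len(kept))
```

With W = 4 the 33-letter query yields 30 words, and only 13 of the 256 possible four-letter words survive a threshold of 3 for AGCT, which is what keeps the later stages small.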

Side Note: Similarity in BLAST
To score the similarity of two words, BLAST uses a scheme related to edit distance: it compares the words position by position and adds up a score for each aligned pair of characters. For example, counting one point per matching position, comparing AGCT to ACCC gives a score of 2, whereas comparing it to GGCT gives 3. However, some substitutions (due to mutation) are more likely than others, especially in the case of amino acids. BLAST therefore accepts a scoring matrix for protein strings (e.g., the Point Accepted Mutation matrix PAM70). For nucleotide strings, users can specify distinct scores for matches and mismatches. BLAST also includes procedures for identifying and penalizing gaps.
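A short Python sketch of both kinds of scoring follows. The function names and the numbers in the toy matrix are hypothetical and chosen only for illustration; real protein matrices such as PAM70 are derived from observed substitution frequencies.

```python
def score_words(a, b, match=1, mismatch=0):
    """Sum a per-position score for two equal-length words."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def score_with_matrix(a, b, matrix):
    """Score two equal-length words by looking up each aligned pair in a matrix."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))


# Toy nucleotide matrix (values are assumptions): identical letters score +2,
# transitions (A<->G, C<->T) score -1, all other substitutions score -2.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
toy_matrix = {
    (x, y): 2 if x == y else (-1 if (x, y) in TRANSITIONS else -2)
    for x in "ACGT"
    for y in "ACGT"
}

print(score_words("AGCT", "ACCC"))                        # 2 matching positions
print(score_words("AGCT", "GGCT"))                        # 3 matching positions
print(score_words("AGCT", "GGCT", match=1, mismatch=-1))  # 3 - 1 = 2
print(score_with_matrix("AGCT", "GGCT", toy_matrix))      # 2 + 2 + 2 - 1 = 5
```

The matrix version shows why unequal substitution scores matter: two words with the same number of mismatches can receive different scores depending on which letters were substituted.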

How BLAST Works (Stages 2 and 3)
At this point, BLAST has built a set of W-length words that exceed a user-provided threshold. During the second stage, the system searches for all occurrences of these words within the database. In the third stage, BLAST extends each of these W-length matches in both directions to get the final similarity score. The system also calculates the E-value for the score, a statistical measure of significance: the number of alignments scoring at least that high that would be expected by chance in a database of that size. Smaller E-values indicate more significant matches.
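The seed-and-extend idea behind stages 2 and 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLAST implementation: it seeds only on exact query words (no neighborhood words), searches a single subject string rather than a database, extends without gaps, and stops extending once the running score falls more than x_drop below the best score seen. The names seed_hits, extend, and x_drop are hypothetical.

```python
def seed_hits(query, subject, w):
    """Stage 2: (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for s in range(len(subject) - w + 1):
        index.setdefault(subject[s:s + w], []).append(s)
    return [
        (q, s)
        for q in range(len(query) - w + 1)
        for s in index.get(query[q:q + w], [])
    ]


def extend(query, subject, q, s, w, match=1, mismatch=-1, x_drop=2):
    """Stage 3: grow a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best extension.
    """
    def walk(qi, si, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score has dropped too far; give up on this direction
            qi += step
            si += step
        return best, best_len

    right, right_len = walk(q + w, s + w, 1)
    left, left_len = walk(q - 1, s - 1, -1)
    total = w * match + left + right
    return total, q - left_len, s - left_len, w + left_len + right_len


query = "AGCTTTTCTC"
subject = "GGAGCTTTACTCAA"
hits = seed_hits(query, subject, 4)
print(hits)  # three seeds on the same diagonal

# Extending each seed; seeds on one diagonal collapse to the same segment.
segments = {extend(query, subject, q, s, 4) for q, s in hits}
print(segments)
```

In this example all three seeds extend to the same ten-letter segment (nine matches and one mismatch, for a score of 8), which shows how a short exact match guides the search toward a longer, imperfect alignment.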

Knowledge and Search in BLAST
BLAST differs from many of the informatics tools that we have considered in the course. Essentially, it finds a sequence’s nearest neighbors within a database with minimal concern for the content. Unlike discovery or analysis tools, BLAST gathers information and leaves the interpretation to the user. However, like many discovery tools, BLAST relies on domain knowledge to carry out heuristic search.
Knowledge: match/mismatch costs for amino acid and nucleotide sequences.
Heuristic Search: an approximate scoring scheme tells BLAST where to look more closely.

What Makes BLAST a Successful Tool?
Google Scholar identifies over 28,000 citations of the original BLAST paper. One of the key reasons for the system’s popularity is that it addresses problems commonly encountered in biology: finding genetic homologues across organisms, and determining the source organism of a sequenced genome (e.g., the Global Ocean Sampling Expedition). Technical issues also contributed to BLAST’s success: it was much faster than competing software; it was distributed and maintained by the National Institutes of Health; and it has continually evolved to meet new challenges and to integrate with new databases and other technologies.

BLAST: Summary
A key insight in BLAST was to iteratively refine a solution: first, find a reduced set of short words to use as a heuristic for locating similar strings; then, find matches to those short words and extend them to refine the candidate solution. This strategy accounts for the computational gains that this system makes over others that seek exact comparisons. The continued success of BLAST is attributable to the speed with which it can find sequence matches, its availability over the internet, its integration with other biological tools, and the fact that it addresses a specific need of biologists.