Bioinformatics and BLAST

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Sequence alignment Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.
Universiteit Utrecht BLAST CD Session 2 | Wednesday 4 May 2005 Bram Raats Lee Provoost.
Heuristic alignment algorithms and cost matrices
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
1 Lesson 3 Aligning sequences and searching databases.
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Local alignment
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
An Introduction to Bioinformatics
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Bioinformatics Computing 1 CMP 807 – Day 2 Kevin Galens.
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Basics of BLAST Basic BLAST Search - What is BLAST?
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Local alignment
Sequence alignment, Part 2
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Bioinformatics and BLAST

Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research

Recap from Last Time Genome consists of entire DNA sequence of a species DNA is the blueprint Contains individual genes Genes are composed of four nucleotides (A,C,G,T) RNA transcribes DNA into proteins Proteins consist of 20 amino acids Proteins perform the actual functions

Recap from Last Time Identify similarities among the same gene from pig, sheep, and rabbit Things that were hard: Manually matching letters Deciding what “similar” meant Things that would have made it harder: What if you had the whole genome? What if there were missing letters/gaps?

Why do similarity search? Similarity indicates conserved function Human and mouse genes are more than 80% similar at sequence level But these genes are small fraction of genome Most sequences in the genome are not recognizably similar Comparing sequences helps us understand function Locate similar gene in another species to understand your new gene Rosetta stone

Issues to consider Dealing with gaps Do we want gaps in alignment? What are disadvantages of Many small gaps? Some big gaps?

Warning: similarity not transitive! If 1 is “similar” to 2, and 3 is “similar” to 2, is 1 similar to 3? Not necessarily AAAAAABBBBBB is similar to AAAAAA and BBBBBB But AAAAAA is not similar to BBBBBB “not transitive unless alignments are overlapping”

Summary Why are biological sequences similar to one another? Start out similar, follow different paths Knowledge of how and why sequences change over time can help you interpret similarities and differences between them

BLAST Basic Local Alignment Search Tool Algorithm for comparing a given sequence against sequences in a database A match between two sequences is an alignment Many BLAST databases and web services available

Example BLAST questions Which bacterial species have a protein that is related in lineage to a protein whose amino-acid sequence I know? Where does the DNA I’ve sequenced come from? What other genes encode proteins that exhibit structures similar to the one I’ve just determined?

Background: Identifying Similarity Algorithms to match sequences: Needleman-Wunsch Smith Waterman BLAST

Needleman-Wunsch Global alignment algorithm An example: align COELACANTH and PELICAN Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps COELACANTH P-ELICAN-- COELACANTH -PELICAN--

Needleman-Wunsch Details Two-dimensional matrix Diagonal when two letters align Horizontal when letters paired to gaps C O E L A N T H P I -

Needleman-Wunsch In reality, each cell of matrix contains score and pointer Score is derived from scoring scheme (-1 or +1 in our example) Pointer is an arrow that points up, left, or diagonal After initializing matrix, compute the score and arrow for each cell

Algorithm For each cell, compute Match score: sum of preceding diagonal cell and score of aligning the two letters (+1 if match, -1 if no match) Horizontal gap score: sum of score to the left and gap score (-1) Vertical gap score: sum of score above and gap score (-1) Choose highest score and point arrow towards maximum cell When you finish, trace arrows back from lower right to get alignment

Smith-Waterman Modification of Needleman-Wunsch Edges of matrix initialized to 0 Maximum score never less than 0 No pointer unless score greater than 0 Trace-back starts at highest score (rather than lower right) and ends at 0 How do these changes affect the algorithm?

Global vs. Local Global – both sequences aligned along entire lengths Local – best subsequence alignment found Global alignment of two genomic sequences may not align exons Local alignment would only pick out maximum scoring exon

Complexity O(mn) time and memory This is impractical for long sequences! Observation: during fill phase of the algorithm, we only use two rows at a time Instead of calculating whole matrix, calculate score of maximum scoring alignment, and restrict search along diagonal

Other Observations Most boxes have a score of 0 – wasted computation Idea: make alignments where positive scores most likely (approximation) BLAST

Caveats Alignments play by computational, not biological, rules Similarity metrics may not capture biology Approximation may be preferred to reduce computational costs Any two sequences can be aligned challenge is finding the proper meaning

BLAST Set of programs that search sequence databases for statistically significant similarities Complex- requires multiple steps and many parameters Five traditional BLAST programs: BLASTN – nucleotides BLASTSP, BLASTX, TBLASTN, TBLASTX - proteins

BLAST Algorithm Consider a graph with one sequence along X axis and one along Y axis Each pair of letters has score Alignment is a sequence of paired letters (may contain gaps)

Observations Recall Smith-Waterman will find maximum scoring alignment between two sequences But in practice, may have several good alignments or none What we really want is all statistically significant alignments

Observations (continued) Searching entire search space is expensive! BLAST can explore smaller search space Tradeoff: faster searches but may miss some hits

BLAST Overview Three heuristic layers: seeding, extension, and evaluation Seeding – identify where to start alignment Extension – extending alignment from seeds Evaluation – Determine which alignments are statistically significant

Seeding Idea: significant alignments have words in common Word is a defined number of letters Example: MGQLV contains 3-letter words MGQ, GQL, QLV BLAST locates all common words in a pair of sequences, then uses them as seeds for the alignment Eliminates a lot of the search space

What is a word hit? Simple definition: two identical words In practice, some good alignments may not contain identical words Neighborhood – all words that have a high similarity score to the word – at least as big as a threshold T Higher values of T reduce number of hits Word size W also affects number of hits Adjusting T and W controls both speed and sensitivity of BLAST

Some notes on scoring Amino acid scoring matrices measure similarity Mutations likely to produce similar amino acids Basic idea: amino acids that are similar should have higher scores Phenylalanine (F) frequently pairs with other hydrophobic amino acids (Y,W,M,V,I,L) Less frequently with hydrophilic amino acids (R,K,D,E, etc.)

Scoring Matrices PAM (Percent Accepted Mutation) Theoretical approach Based on assumptions of mutation probabilities BLOSUM (BLOcks SUbstitution Matrix) Empirical Constructed from multiply aligned protein families Ungapped segments (blocks) clustered based on percent identity

Seeding Implementation Details BLASTN (nucleotides)– seeds always identical, T never used To speed up BLASTN, increase W BLASTP uses W size 2 or 3 To speed up protein searches, set W=3 and T to a large value

Extension Once search space is seeded, alignments generated by extending in both directions from the seed Example: The quick brown fox jumps over the lazy dog. The quiet brown cat purrs when she sees him. Can align first six characters How far should we continue?

Extension (continued) X parameter – How much is score allowed to drop off after last maximum? Example (assume identical scores +1 and mismatch scores -1) The quick brown fox jump The quiet brown cat purr 123 45654 56789 876 5654 <- score 000 00012 10000 123 4345 <-drop off score

What is a good value of X? Small X risks premature termination But little benefit to a very large X Generally better to use large value W and T better for controlling speed than X

Implementation Details Extension differs in BLASTN and BLASTP Nucleotide sequences can be stored in compressed state (2 bits per nucleotide) If sequence contains N (unknown), replace with random nucleotide Two-bit approximation may cause extension to terminate prematurely

Evaluation Determine which alignments are statistically significant Simplest: throw out alignments below a score threshold S In practice, determining a good threshold complicated by multiple high scoring pairs (HSPs)

Implementation Details Recall three phases: seeding, extension, evaluation In reality, two rounds of extension and evaluation, gapped and ungapped Gapped extension and evaluation only if ungapped alignments exceed thresholds

Storage/Implementation Most BLAST databases are either a collection of files and scripts or simple relational schema What are limitations of these approaches?

Limitations of BLAST Can only search for a single query (e.g. find all genes similar to TTGGACAGGATCGA) What about more complex queries? “Find all genes in the human genome that are expressed in the liver and have a TTGGACAGGATCGA (allowing 1 or 2 mismatches) followed by GCCGCG within 40 symbols in a 4000 symbol stretch upstream from the gene”

Evaluating complex queries Idea: write a script that make multiple calls to a BLAST database An example query plan: Search for all instances of the two patterns in the human genome Combine results to find all pairs within 40 symbols of each other Consult a gene database to see if pair is upstream of any known gene Consult database to check if gene expressed in liver