Fa05CSE 182 L3: Blast: Local Alignment and other flavors.



Fa05CSE 182 An Example. Align s=TCAT with t=TGCAA. Match score = 1, Mismatch score = -1, Indel score = -1. Score A1? Score A2?
A1:  T C A T -
     T G C A A
A2:  T - C A T
     T G C A A

Fa05CSE 182 Sequence Alignment. Recall: instead of computing the optimum alignment, we are computing the score of the optimum alignment. Let S[i,j] denote the score of the optimum alignment of the prefixes s[1..i] and t[1..j].

Fa05CSE 182 An O(nm) algorithm for score computation. For i = 1 to n, for j = 1 to m: S[i,j] = max { S[i-1,j-1] + C(s_i,t_j), S[i-1,j] + C(s_i,-), S[i,j-1] + C(-,t_j) }. The iteration ensures that all values on the right are computed in earlier steps.
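The score computation can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `global_score` and its keyword parameters are assumptions, and the defaults are the scores from the TCAT/TGCAA example (match 1, mismatch -1, indel -1).

```python
def global_score(s, t, match=1, mismatch=-1, indel=-1):
    """Score of an optimum global alignment of s and t (hypothetical helper)."""
    n, m = len(s), len(t)
    # S[i][j] = score of the optimum alignment of s[1..i] and t[1..j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Base case: a prefix aligned against the empty string is all indels.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + indel
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # s_i aligned to t_j
                          S[i - 1][j] + indel,     # s_i aligned to a gap
                          S[i][j - 1] + indel)     # t_j aligned to a gap
    return S[n][m]
```

For the example above, `global_score("TCAT", "TGCAA")` evaluates to 1, the score of alignment A2.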

Fa05CSE 182 Base case (Initialization): S[0,0] = 0; S[i,0] = S[i-1,0] + C(s_i,-) for i = 1..n; S[0,j] = S[0,j-1] + C(-,t_j) for j = 1..m.

Fa05CSE 182 A tableau approach. [Figure: DP table with rows indexed by s (1..n) and columns indexed by t; cell (i,j) is shown with its neighbors (i-1,j-1), (i-1,j) and (i,j-1).] Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells.

Fa05CSE 182 Alignment Table. [Figure: DP table for s = TCAT and t = TGCAA.]

Fa05CSE 182 Alignment Table. [Figure: filled DP table for s = TCAT and t = TGCAA.] S[4,5] = 1 is the score of an optimum alignment. Therefore, A2 is an optimum alignment. We know how to obtain the optimum score. How do we get the best alignment?

Fa05CSE 182 Computing Optimum Alignment. At each cell, we have 3 choices. We maintain additional information M[i,j] to record the choice at each step.
For i = 1 to n
  For j = 1 to m
    If (S[i,j] = S[i-1,j-1] + C(s_i,t_j)) M[i,j] = `\' (came from cell (i-1,j-1))
    If (S[i,j] = S[i-1,j] + C(s_i,-)) M[i,j] = `|' (came from cell (i-1,j))
    If (S[i,j] = S[i,j-1] + C(-,t_j)) M[i,j] = `--' (came from cell (i,j-1))

Fa05CSE 182 Computing Optimal Alignments. [Figure: DP table for s = TCAT and t = TGCAA with the recorded choices M[i,j].]

Fa05CSE 182 Retrieving Opt. Alignment (s = TCAT, t = TGCAA).
M[4,5] = `\' implies that S[4,5] = S[3,4] + C(T,A), so the alignment ends with the column
  T
  A
M[3,4] = `\' implies that S[3,4] = S[2,3] + C(A,A), so the alignment ends with the columns
  A T
  A A

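Fa05CSE 182 Retrieving Opt. Alignment (cont'd).
M[2,3] = `\' implies that S[2,3] = S[1,2] + C(C,C), so the alignment ends with the columns
  C A T
  C A A
M[1,2] = `--' implies that S[1,2] = S[1,1] + C(-,G), so the alignment ends with the columns
  - C A T
  G C A A
The remaining column aligns T with T, giving
  T - C A T
  T G C A A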

Fa05CSE 182 Algorithm to retrieve optimal alignment
RetrieveAl(i,j)
  if (M[i,j] == `\') return (RetrieveAl(i-1,j-1) . column(s_i, t_j))
  else if (M[i,j] == `|') return (RetrieveAl(i-1,j) . column(s_i, -))
  else if (M[i,j] == `--') return (RetrieveAl(i,j-1) . column(-, t_j))
Here "." appends one alignment column, written (top, bottom). The recursion stops at (0,0); along row 0 and column 0 only the gap moves are possible.
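A minimal Python sketch of the fill-and-retrieve procedure, under stated assumptions: the name `global_align` is hypothetical, ties are broken in favor of the diagonal move (the slides do not say how ties are resolved), and the retrieval is written as a loop rather than a recursion. The pointer symbols follow the ones used in RetrieveAl.

```python
def global_align(s, t, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned_s, aligned_t) for an optimum global alignment."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    M = [[None] * (m + 1) for _ in range(n + 1)]   # recorded choice per cell
    for i in range(1, n + 1):
        S[i][0], M[i][0] = S[i - 1][0] + indel, "|"
    for j in range(1, m + 1):
        S[0][j], M[0][j] = S[0][j - 1] + indel, "--"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            choices = [(S[i - 1][j - 1] + sub, "\\"),
                       (S[i - 1][j] + indel, "|"),
                       (S[i][j - 1] + indel, "--")]
            # max() keeps the first maximum, so the diagonal wins ties.
            S[i][j], M[i][j] = max(choices, key=lambda c: c[0])
    # Walk the recorded choices back from (n, m) to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if M[i][j] == "\\":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == "|":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

On the running example, `global_align("TCAT", "TGCAA")` returns `(1, "T-CAT", "TGCAA")`, which is alignment A2.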

Fa05CSE 182 Summary. An optimal alignment of strings of length n and m can be computed in O(nm) time. There is a tight connection between computation of the optimal score and computation of the optimal alignment; this is true for all DP-based solutions.

Fa05CSE 182 Global versus Local Alignment. Consider s = ACCACCCCTT and t = ATCCCCACAT. [Figure: two different alignments of s against t.]

Fa05CSE 182 Blast Outputs Local Alignment. [Figure: schematic of a region of the query aligned to a region of a db sequence.]

Fa05CSE 182 Local Alignment. Compute the maximum scoring alignment over all pairs of sub-intervals (a,b) of s and (a',b') of t. How can we compute this efficiently?

Fa05CSE 182 Local Alignment. Recall that in global alignment, we compute the best score for all prefix pairs s(1..i) and t(1..j). Instead, compute the best score for all sub-alignments that end in s(i) and t(j), that is, alignments of s(a..i) with t(a'..j) for any start points a and a'. What changes in the recurrence?

Fa05CSE 182 Local Alignment. The original recurrence still works, except when the optimum score S[i,j] is negative. When S[i,j] < 0, it means that the optimum local alignment cannot include the point (i,j). So, we must reset the score to 0.

Fa05CSE 182 Local Alignment Trick (Smith-Waterman algorithm): S[i,j] = max { 0, S[i-1,j-1] + C(s_i,t_j), S[i-1,j] + C(s_i,-), S[i,j-1] + C(-,t_j) }. How can we compute the local alignment itself?
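A short Python sketch of the reset-to-zero trick. The name `local_score` and the default scores are assumptions for illustration; the function returns the best score together with the cell where it is reached.

```python
def local_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best local alignment score and the cell (i, j) where it ends."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(0,                       # start a fresh alignment here
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + indel,
                          S[i][j - 1] + indel)
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell
```

One way to recover the local alignment itself is to record the choice at each cell as in the global case, start the walk back at `best_cell`, and stop at the first cell whose score is 0.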

Fa05CSE 182 Generalizing Gap Cost. It is more likely for gaps to be contiguous. The penalty for a gap of length l should be affine in l: a gap-opening cost plus l times a gap-extension cost.

Fa05CSE 182 Using affine gap penalties What is the time taken for this? What are the values that l can take? Can we get rid of the extra Dimension?

Fa05CSE 182 Affine gap penalties. Define D[i,j]: score of the best alignment, given that the final column is a 'deletion' (s_i is aligned to a gap), that is, an optimum alignment of s[1..i-1] and t[1..j] followed by the column (s[i], -). Define I[i,j]: score of the best alignment, given that the final column is an insertion (t_j is aligned to a gap), that is, an optimum alignment of s[1..i] and t[1..j-1] followed by the column (-, t[j]).

Fa05CSE 182 O(nm) solution for affine gap costs
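The slide's recurrence is not in the transcript, so the following is one common way to realize an O(nm) computation with the D and I tables defined above, not necessarily the lecture's exact formulation. The function name and the parameters `gap_open` and `gap_extend` are assumptions; a gap of length l costs `gap_open + l * gap_extend`.

```python
NEG_INF = float("-inf")

def affine_score(s, t, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score with affine gap costs (illustrative sketch)."""
    n, m = len(s), len(t)
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best score overall
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column is (s_i, -)
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column is (-, t_j)
    S[0][0] = 0
    for i in range(1, n + 1):
        D[i][0] = S[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        I[0][j] = S[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # Either open a new gap after any alignment, or extend an open gap.
            D[i][j] = max(S[i - 1][j] + gap_open + gap_extend,
                          D[i - 1][j] + gap_extend)
            I[i][j] = max(S[i][j - 1] + gap_open + gap_extend,
                          I[i][j - 1] + gap_extend)
            S[i][j] = max(S[i - 1][j - 1] + sub, D[i][j], I[i][j])
    return S[n][m]
```

Each cell does a constant amount of work on three tables, so the extra dimension over gap lengths is avoided. With `gap_open=0` the scheme reduces to the linear indel cost used earlier.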

Fa05CSE 182 Alignment Space? How much space do we need? What if the query and database are each 1Mbp?

Fa05CSE 182 Alignment (Linear Space) Score computation For i = 1 to n For j = 1 to m
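Since each row of the table depends only on the previous row, the score can be computed while keeping two rows in memory. A minimal sketch (the function name is an assumption; scores default to the earlier example):

```python
def last_row_scores(s, t, match=1, mismatch=-1, indel=-1):
    """Return the row S[n, 0..m] of the global DP table using O(m) space."""
    m = len(t)
    prev = [j * indel for j in range(m + 1)]        # row 0
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * m                # column 0 of row i
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,        # diagonal
                          prev[j] + indel,          # from the row above
                          curr[j - 1] + indel)      # from the left
        prev = curr                                 # discard the older row
    return prev
```

The last entry of the returned row is S[n,m]; the alignment itself cannot be read off, because the recorded choices are no longer stored.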

Fa05CSE 182 Linear Space Alignment. In linear space, we can compute each row of the D.P. table. We still need to compute the optimum path from the origin (0,0) to (n,m).

Fa05CSE 182 Linear Space (cont'd). At i = n/2, we know the scores of all the optimal paths ending at that row. Define F[j] = S[n/2,j]. One of these j is on the true path. Which one?

Fa05CSE 182 Backward alignment. Let S'[i,j] be the optimal score of aligning the suffixes s[i+1..n] and t[j+1..m]. Define B[j] = S'[n/2,j]. One of these j is on the true path. Which one?

Fa05CSE 182 Forward, Backward computation. At the optimal coordinate j, F[j] + B[j] = S[n,m]. In O(nm) time and O(m) space, we can compute one of the coordinates on the optimum path.

Fa05CSE 182 Linear Space Alignment
Align(1..n, 1..m)
  - For all 0 <= j <= m, compute F[j] = S(n/2, j)
  - For all 0 <= j <= m, compute B[j] = S'(n/2, j)
  - j* = argmax_j { F[j] + B[j] }
  - X = Align(1..n/2, 1..j*)
  - Y = Align(n/2+1..n, j*+1..m)
  - Return X, j*, Y
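The divide-and-conquer scheme can be sketched in Python as below. Everything here is illustrative: the function names are assumptions, the scores are fixed at match 1, mismatch -1, indel -1, and the single-character base case relies on a mismatch costing less than two indels. String slicing is used for brevity, so this shows the structure of the method rather than a tuned implementation.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1

def last_row(s, t):
    """Last row of the global DP table for s against t, in O(len(t)) space."""
    prev = [j * INDEL for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * INDEL] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + sub, prev[j] + INDEL, curr[j - 1] + INDEL)
        prev = curr
    return prev

def align(s, t):
    """Optimum global alignment of s and t by splitting s at its midpoint."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        k = t.find(s)          # align to a matching character if one exists,
        k = max(k, 0)          # otherwise accept a mismatch at position 0
        return "-" * k + s + "-" * (len(t) - k - 1), t
    mid = len(s) // 2
    F = last_row(s[:mid], t)                   # F[j]: s[1..mid] vs t[1..j]
    B = last_row(s[mid:][::-1], t[::-1])       # reversed suffix scores
    m = len(t)
    # B[m - j] is the score of s[mid+1..n] against t[j+1..m].
    j_star = max(range(m + 1), key=lambda j: F[j] + B[m - j])
    left = align(s[:mid], t[:j_star])
    right = align(s[mid:], t[j_star:])
    return left[0] + right[0], left[1] + right[1]

def alignment_score(a, b):
    """Score two gapped rows column by column."""
    return sum(INDEL if "-" in (x, y) else MATCH if x == y else MISMATCH
               for x, y in zip(a, b))
```

For the running example, `align("TCAT", "TGCAA")` returns `("T-CAT", "TGCAA")`, and `alignment_score` of that pair is 1.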

Fa05CSE 182 Linear Space complexity. T(nm) = c.nm + T(nm/2) = c.nm (1 + 1/2 + 1/4 + ...) <= 2c.nm = O(nm). Space = O(m).

Fa05CSE 182 Summary We considered the basics of sequence alignment –Opt score computation –Reconstructing alignments –Local alignments –Affine gap costs –Space saving measures Can we recreate Blast?

Fa05CSE 182 Blast and local alignment Concatenate all of the database sequences to form one giant sequence. Do local alignment computation with the query.

Fa05CSE 182 Large database search. Query (m), Database (n). Database size n = 100M, query size m = 1000. O(nm) = 10^11 computations.

Fa05CSE 182 Why is Blast Fast?

Fa05CSE 182 Silly Question! True or False: No two people in New York City have the same number of hairs.

Fa05CSE 182 Observations. Much of the database is random from the query's perspective. Consider a random DNA string of length n, with Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25. Assume for the moment that the query is all A's (length m). What is the probability that an exact match to the query can be found?

Fa05CSE 182 Basic probability. Probability that there is a match starting at a fixed position i = 0.25^m. What is the probability that some position i has a match? Dependencies confound probability estimates.

Fa05CSE 182 Basic Probability: Expectation. Q: Toss a coin: each time it comes up heads, you get a dollar. What is the money you expect to get after n tosses? Let X_i be the amount earned in the i-th toss. Total money you expect to earn: E[X_1 + ... + X_n] = E[X_1] + ... + E[X_n], which is n/2 for a fair coin.

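Fa05CSE 182 Expected number of matches. The expected number of matches can still be computed, because expectations add even when the events are dependent. Let X_i = 1 if there is a match starting at position i, X_i = 0 otherwise. Expected number of matches = E[sum_i X_i] = sum_i E[X_i] = sum_i Pr[match starting at i].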

Fa05CSE 182 Expected number of exact matches is small! Expected number of matches = n * 0.25^m.
- If n = 10^7, m = 10, then expected number of matches = 9.54
- If n = 10^7, m = 11, expected number of hits = 2.38
- If n = 10^7, m = 12, expected number of hits = 0.6 < 1
Bottom line: an exact match to a substring of the query is unlikely just by chance.
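The figures above follow directly from the formula n * 0.25^m. A few lines of Python reproduce them (the function name is an assumption; like the slide, it counts n start positions rather than n - m + 1):

```python
def expected_exact_matches(n, m, p=0.25):
    """Expected number of exact matches of a length-m word in a random
    string of length n, where each letter matches with probability p."""
    return n * p ** m

for m in (10, 11, 12):
    print(m, round(expected_exact_matches(10**7, m), 2))
```

The expected count drops by a factor of 4 for every extra letter in the word.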

Fa05CSE 182 Observation 2. What is the pigeonhole principle? Suppose we are looking for a database string with greater than 90% identity to the query (length 100). Partition the query into 10 substrings of size 10. There are fewer than 10 mismatches to spread over 10 substrings, so at least one substring must match the database string exactly.

Fa05CSE 182 Why is this important? Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. Assume that the mismatches are randomly distributed. What is the probability that there is no stretch of 10 bp where the query and the subject match exactly? Rough calculations show that it is very low. So the probability of an exact match of a short query substring to a truly similar subject is very high.
- The rough calculation does not take dependencies into account
- Reality is better because the matches are not randomly distributed
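The slide's equation is not in the transcript. One rough calculation consistent with the remark about dependencies treats each of the 91 windows of length 10 as if it were independent of the others; that independence is an assumption made only for this estimate, and the function name and parameters are hypothetical.

```python
def prob_no_exact_window(length=100, window=10, identity=0.8):
    """Rough estimate of the chance that no window of the given size
    matches exactly, treating overlapping windows as independent."""
    windows = length - window + 1            # 91 start positions
    p_window = identity ** window            # one window matches exactly
    return (1 - p_window) ** windows

print(prob_no_exact_window())   # on the order of 1e-5 under these assumptions
```

Because overlapping windows are in fact strongly dependent, this number should be read as an order-of-magnitude illustration of the argument, not as the true probability.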

Fa05CSE 182 Just the Facts Consider the set of all substrings of the query string of fixed length W. –Prob. of exact match to a random database string is very low. –Prob. of exact match to a true homolog is very high. –Keyword Search (exact matches) is MUCH faster than sequence alignment

Fa05CSE 182 BLAST. Consider all (m-W+1) query words of size W (default = 11). Scan the database (length n) for exact matches to all such words. For all regions that hit, extend using a dynamic programming alignment. This can be many orders of magnitude faster than SW over the entire string.
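The word-scanning step can be sketched as follows. This is a simplified illustration of the idea on the slide, not BLAST's actual implementation: `seed_hits` is a hypothetical helper that indexes the query words in a dictionary and reports every exact word match as a (query position, database position) pair, using 0-based positions.

```python
from collections import defaultdict

def seed_hits(query, database, W=11):
    """Exact matches between query words of size W and the database."""
    index = defaultdict(list)
    for q in range(len(query) - W + 1):          # all m-W+1 query words
        index[query[q:q + W]].append(q)
    hits = []
    for d in range(len(database) - W + 1):       # one pass over the database
        for q in index.get(database[d:d + W], ()):
            hits.append((q, d))
    return hits
```

Each hit would then be handed to a dynamic programming alignment restricted to a region around it, for example one of the local alignment routines discussed earlier.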

Fa05CSE 182 Why is BLAST fast? Assume that keyword searching does not consume any time and that alignment computation is the expensive step. Query m = 1000, random Db n = 10^7, no true positives (TP).
- SW = O(nm) = 1000 * 10^7 = 10^10 computations
- BLAST, W = 11: E(#11-mer hits) = 1000 * (1/4)^11 * 10^7 = 2384
- Number of computations = 2384 * 100 * 100 = 2.384 * 10^7
- Ratio = 10^10 / (2.384 * 10^7) = about 420
Further speed improvements are possible.
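The arithmetic on this slide can be checked with a few lines of Python. The function name is an assumption, and the `window` parameter stands for the factor of 100 that appears twice in the slide's computation count.

```python
def blast_speedup(m=1000, n=10**7, W=11, window=100):
    """Return (SW computations, expected hits, BLAST computations, ratio)."""
    sw = n * m                            # full Smith-Waterman table
    hits = m * n * 0.25 ** W              # expected random W-mer hits
    blast = hits * window * window        # one small alignment per hit
    return sw, hits, blast, sw / blast

sw, hits, blast, ratio = blast_speedup()
print(sw, round(hits), round(blast), round(ratio, 1))
```

Under these assumptions the ratio simplifies to 4^W / window^2, so it depends only on the word size and the size of the region aligned around each hit.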

Fa05CSE 182 Keyword Matching. How fast can we match keywords? Hash table/Db index? What is the size of the hash table for words of length 11? Suffix trees? What is the size of the suffix trees? Trie based search: we will do this in class. [Figure: example index entry, AATCA -> 567.]
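As a preview of trie-based search, here is a minimal sketch using nested dictionaries. It is a plain trie with a restart at every text position, not the more refined dictionary-matching automaton, and the names `build_trie` and `trie_search` are assumptions.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root

def trie_search(text, trie):
    """Report (start position, word) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node, j = trie, start
        while True:
            if None in node:                       # a keyword ends here
                hits.append((start, node[None]))
            if j >= len(text) or text[j] not in node:
                break
            node = node[text[j]]
            j += 1
    return hits
```

For example, `trie_search("POTASTPOTATO", build_trie(["POTATO", "TATTOO", "POT"]))` returns `[(0, "POT"), (6, "POT"), (6, "POTATO")]`. Each start position costs at most the length of the longest keyword, regardless of how many keywords are stored.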

Fa05CSE 182 Related notes How to choose the alignment region? –Extend greedily until the score falls below a certain threshold What about protein sequences? –Default word size = 3, and mismatches are allowed. Like sequences, BLAST has been evolving continuously –Banded alignment –Seed selection –Scanning for exact matches, keyword search versus database indexing

Fa05CSE 182 P-value computation. How significant is a score? What happens to significance when you change the score function? A simple empirical method: compute a distribution of scores against a random database. Use an estimate of the area under the curve to get the probability. OR, fit the distribution to one of the standard distributions.

Fa05CSE 182 Z-scores for alignment. The initial assumption was that the scores followed a normal distribution. Z-score computation: for any alignment with score S, shuffle one of the sequences many times, and recompute the alignment score each time. Get the mean and standard deviation of the shuffled scores; Z = (S - mean) / (standard deviation). Look up a table to get a P-value.
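The shuffling procedure can be sketched as follows. All names are hypothetical, the local alignment scorer uses the simple match/mismatch/indel scores from earlier slides, and the number of shuffles and the fixed seed are arbitrary choices made for the illustration.

```python
import math
import random
import statistics

def local_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best local alignment score, keeping two rows of the table."""
    best, prev = 0, [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            curr.append(max(0, prev[j - 1] + sub, prev[j] + indel,
                            curr[j - 1] + indel))
        best = max(best, max(curr))
        prev = curr
    return best

def z_score(s, t, shuffles=100, seed=0):
    """Z-score of the observed score against scores of shuffled copies of t."""
    rng = random.Random(seed)
    observed = local_score(s, t)
    letters = list(t)
    scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)                 # preserves the composition of t
        scores.append(local_score(s, "".join(letters)))
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    if std == 0:                             # all shuffled scores identical
        return math.inf if observed > mean else 0.0
    return (observed - mean) / std
```

Shuffling keeps the letter composition of the sequence while destroying its order, so the shuffled scores approximate what chance similarity alone would produce.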

Fa05CSE 182 Blast E-value. The initial (and natural) assumption was that scores followed a Normal distribution. In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an extreme value distribution, whose tail decays exponentially. Practical consequence: a longer tail, so previously significant hits are now not so significant.

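Fa05CSE 182 Exponential distribution. Random database, Pr(1) = p. What is the expected number of hits to a sequence of k 1's? By the same expectation argument as before, about n * p^k for a database of length n. Instead, consider a random binary matrix. Expected # of diagonals of k 1's: about n * m * p^k for an n x m matrix.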

Fa05CSE 182 As you increase k, the number decreases exponentially. The number of diagonal runs of k 1's can be approximated by a Poisson process. In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul). As the score increases, the number of alignments that achieve the score decreases exponentially.

Fa05CSE 182 Blast E-value. Choose a scoring function such that the expected score between a pair of residues is < 0. The E-value is the expected number of chance alignments with at least a particular score. For small values, E-value and P-value are the same.
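The slide's formula is not in the transcript. The sketch below uses the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S) and the Poisson relation P = 1 - exp(-E). The default values of `K` and `lam` are placeholders chosen only for illustration, since the real constants depend on the scoring system; the function names are likewise assumptions.

```python
import math

def e_value(score, m, n, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring at least `score`
    for query length m and database length n (K, lam are placeholders)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such chance alignment, assuming the
    number of chance alignments is Poisson with mean e."""
    return -math.expm1(-e)      # 1 - exp(-e), accurate for small e

for e in (1.0, 0.1, 0.01):
    print(e, round(p_value(e), 5))
```

Since 1 - exp(-E) is approximately E when E is small, the two values agree closely for significant hits and diverge only as E approaches 1.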

Fa05CSE 182 Blast Variants
1. What is mega-blast? Longer seeds.
2. What is discontiguous mega-blast? Seeds with don't care values.
3. Phi-Blast/Psi-Blast? Later.
4. BLAT? Database pre-processing.
5. PatternHunter? Seeds with don't care values.

Fa05CSE 182 Keyword Matching. [Figure: trie-based matching of dictionary words such as POTATO against the text POTASTPOTATO.]
