Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.

Presentation transcript:

Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching

Fa05CSE 182 Silly Quiz

Fa05CSE 182 PAM 1 distance
Two sequences are 1 PAM apart if they differ in 1% of the residues (1% mismatch).
$\mathrm{PAM}_1(a,b) = \Pr[\text{residue } a \text{ substitutes residue } b \text{ when the sequences are 1 PAM apart}]$

Fa05CSE 182 PAM 1

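Fa05CSE 182 Generating Higher PAMs
$\mathrm{PAM}_2(a,b) = \sum_c \mathrm{PAM}_1(a,c)\,\mathrm{PAM}_1(c,b)$
$\mathrm{PAM}_2 = \mathrm{PAM}_1 \times \mathrm{PAM}_1$ (matrix multiplication)
$\mathrm{PAM}_{250} = \mathrm{PAM}_1 \times \mathrm{PAM}_{249} = \mathrm{PAM}_1^{250}$
[Figure: the PAM 1 and PAM 2 matrices, with residues a, b, c marked]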
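
The repeated matrix multiplication can be sketched in a few lines of Python. This is only an illustration: the 3-letter `pam1` matrix is made-up toy data (not Dayhoff's values), and `mat_mul` and `pam_n` are hypothetical helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return PAM_n, the n-th matrix power of PAM_1 (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical toy PAM_1 over a 3-letter alphabet: every row sums to 1 and
# roughly 99% of the probability mass stays on the diagonal (1% mismatch).
pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

pam2 = pam_n(pam1, 2)      # PAM_2(a,b) = sum_c PAM_1(a,c) * PAM_1(c,b)
pam250 = pam_n(pam1, 250)  # PAM_250 = PAM_1 multiplied by itself 250 times
```

In this toy setting the diagonal entries shrink as n grows, which mirrors the idea that substitutions accumulate with evolutionary distance.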

Fa05CSE 182 Scoring residues
A reasonable score function C(a,b) is given as follows:
– Look at ‘high quality’ alignments.
– C(a,b) should be high when a,b are seen together more often than is expected by chance.
– C(a,b) should be low otherwise.
How often would you expect to see a,b together just by chance?
– $P_a P_b$, where $P_a$ is the frequency of residue a.
Let $P_{ab}$ be the probability that a and b are aligned in a high-quality alignment.
A good scoring function is the log-odds score:
– $C(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right)$
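
As a rough sketch of the log-odds idea, the Python below estimates C(a,b) from a list of aligned residue pairs. The function name `log_odds_scores` and the tiny `columns` data set are assumptions made for illustration. Each aligned column is counted in both orders so that C(a,b) = C(b,a).

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Estimate C(a,b) = log10(P_ab / (P_a * P_b)) from aligned residue pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count each column in both orders so the estimate is symmetric.
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # P_a is the marginal of the symmetric pair distribution.
    residue_freq = Counter()
    for (a, _b), count in pair_counts.items():
        residue_freq[a] += count / total

    return {
        (a, b): math.log10((count / total) / (residue_freq[a] * residue_freq[b]))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical toy data: 16 aligned columns over the residues A and V.
columns = [("A", "A")] * 8 + [("V", "V")] * 6 + [("A", "V")] * 2
scores = log_odds_scores(columns)
```

With this toy data the identical pairs score above zero and the A/V pair scores below zero, because A and V are aligned less often than their frequencies alone would predict.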

Fa05CSE 182 Scoring alignments
To compute $P_{ab}$, we need ‘high-quality’ alignments.
How can you get quality alignments?
– Use Smith-Waterman (SW), but that needs the scoring function
– Build alignments manually
– Use Dayhoff’s theory to extrapolate from high-identity alignments

Fa05CSE 182 Scoring using PAM matrices
Suppose we know that two sequences are 250 PAMs apart.
$S(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{P_{a|b}}{P_a}\right) = \log_{10}\left(\frac{\mathrm{PAM}_{250}(a,b)}{P_a}\right)$
How does it help?
– $S_{250}(A,V) \gg S_1(A,V)$
– Scoring human vs. Drosophila should use a higher PAM matrix than scoring human vs. mouse.
– An alignment with a smaller % identity could still have a higher score and be more significant.
[Figure: human (hum), mouse (mus), and Drosophila (dros)]

Fa05CSE 182 PAM250 based scoring matrix
$S_{250}(a,b) = \log_{10}\left(\frac{P_{ab}}{P_a P_b}\right) = \log_{10}\left(\frac{\mathrm{PAM}_{250}(a,b)}{P_a}\right)$

Fa05CSE 182 BLOSUM series of Matrices
Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions.
A more direct method is based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
BLOSUM60: Merge all proteins that have greater than 60% identity. Then, compute the substitution probability.
– In practice BLOSUM62 seems to work very well.

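Fa05CSE 182 PAM vs. BLOSUM
What is the correspondence?
[Figure: the PAM series (PAM1, PAM2, ..., PAM250) set against the BLOSUM series (Blosum1, Blosum2, ..., Blosum62, ..., Blosum100)]
The two series run in opposite directions: a higher PAM number models more divergent sequences, while a higher BLOSUM number is built from more similar sequences.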

Fa05CSE 182 P-value computation
We use text filtering to filter the database quickly.
The matching regions are expanded into alignments, which are scored using SW and an appropriate scoring matrix.
The results are presented in order of score.
How significant is the top-scoring hit if it has a score S?
Expect/E-value (score S) = number of times we would expect a random query to generate a score of S or better.
How can we compute the E-value?

Fa05CSE 182 What is a distribution function
Given a collection of numbers (scores)
– 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, ...
Plot its distribution as follows:
– X-axis = each number
– Y-axis = count/frequency/probability of seeing that number
– More generally, the x-axis can be a range to accommodate real numbers
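
A quick way to see such a distribution is to tabulate the counts. The sketch below uses only the scores listed on the slide (the list there continues) and prints a crude text histogram. The variable names are illustrative.

```python
from collections import Counter

# Only the values listed explicitly on the slide.
scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]

counts = Counter(scores)
total = len(scores)

# x-axis: each distinct score; y-axis: relative frequency of that score.
distribution = {x: counts[x] / total for x in sorted(counts)}

for x, freq in distribution.items():
    print(f"{x}: {'#' * counts[x]}  ({freq:.2f})")
```

For real-valued scores, the same idea applies after grouping values into ranges (bins).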

Fa05CSE 182 P-value computation
How significant is a score? What happens to significance when you change the score function?
A simple empirical method:
– Compute a distribution of scores against a random database.
– Use an estimate of the area under the curve to get the probability.
– OR, fit the distribution to one of the standard distributions.
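
The empirical method can be sketched as follows, assuming the scores against a random database have already been collected into a list. `empirical_p_value` and `null_scores` are hypothetical names, and the numbers are toy values rather than real alignment scores.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of random-database scores that are at least as large as `observed`.

    This approximates the area under the tail of the empirical distribution.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return at_least / len(null_scores)


# Toy scores standing in for alignments against a random database.
null_scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]
p = empirical_p_value(6, null_scores)  # 4 of the 14 scores are >= 6
```

An estimate of this kind is only as fine-grained as the number of random scores collected, which is one reason to fit a standard distribution instead.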

Fa05CSE 182 Z-scores for alignment
Initial assumption was that the scores followed a normal distribution.
Z-score computation:
– For any alignment with score S, shuffle one of the sequences many times, and recompute the alignment each time. Get the mean and standard deviation of the shuffled scores.
– Compute the Z-score, $Z = (S - \text{mean}) / \text{standard deviation}$, and look it up in a table to get a P-value.
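
The shuffling procedure might look like the sketch below. The scoring parameters (match +1, mismatch -1, gap -1), the number of shuffles, and the function names `local_score` and `z_score` are all assumptions chosen for illustration.

```python
import random
import statistics


def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def z_score(s, t, shuffles=200, seed=0):
    """Z-score of the s-vs-t score relative to scores against shuffled copies of t."""
    rng = random.Random(seed)
    observed = local_score(s, t)
    null = []
    for _ in range(shuffles):
        letters = list(t)
        rng.shuffle(letters)  # shuffling preserves the composition of t
        null.append(local_score(s, "".join(letters)))
    mean = statistics.mean(null)
    sd = statistics.stdev(null)
    if sd == 0:
        return 0.0  # degenerate case: every shuffle gave the same score
    return (observed - mean) / sd
```

For example, `z_score("ACGTTGCAAGCTAGCTTACG", "ACGTTGCAAGCTAGCTTACG")` compares a sequence with itself and should yield a clearly positive Z-score.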

Fa05CSE 182 Blast E-value
Initial (and natural) assumption was that scores followed a normal distribution.
In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an extreme value distribution, whose tail decays exponentially.
Practical consequence:
– Longer tail.
– Previously significant hits now not so significant.

Fa05CSE 182 Altschul Karlin statistics
For simplicity, assume that the database is a binary string, and so is the query.
– Let match score = 1,
– mismatch score = $-\infty$,
– indel = $-\infty$ (no gaps allowed).
What does it mean to get a score k?

Fa05CSE 182 Exponential distribution
Random database, Pr(1) = p.
What is the expected number of hits to a sequence of k 1’s?
Instead, consider a random binary matrix: what is the expected number of diagonals of k 1s?

Fa05CSE 182 As you increase k, the number decreases exponentially.
The number of diagonal runs of k 1s can be approximated by a Poisson process.
In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul).
As the score increases, the number of alignments that achieve the score decreases exponentially.
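
To make the exponential decrease concrete, here is a small sketch under the assumption that every cell of an n x m matrix is independently 1 with probability p. A run of k diagonal 1s starting at a given cell then has probability p^k, so by linearity of expectation the expected number of such start cells is (n-k+1)(m-k+1)p^k, roughly n·m·p^k. The function name and the sizes used are illustrative.

```python
def expected_diagonal_runs(n, m, p, k):
    """Expected number of cells (i, j) at which k consecutive diagonal 1s start,
    in an n x m matrix whose cells are independently 1 with probability p."""
    starts = max(n - k + 1, 0) * max(m - k + 1, 0)
    return starts * p ** k


# Each increment of k multiplies the expectation by roughly p.
for k in (10, 11, 12, 13):
    print(k, expected_diagonal_runs(1000, 1000, 0.5, k))
```

Note that this counts overlapping runs separately, so it is an approximation to the number of distinct long runs rather than an exact count.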

Fa05CSE 182 Blast E-value
Choose a scoring function such that the expected score between a pair of residues is < 0.
Expected number of alignments with a score S
For small values, the E-value and the P-value are approximately the same.
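
Assuming the standard Karlin-Altschul form, the expected number of chance alignments with score at least S is E = K·m·n·e^(-λS) for a query of length m and a database of length n, and the corresponding P-value is 1 - e^(-E). The sketch below uses placeholder values for K and λ, which in practice depend on the scoring matrix and residue frequencies.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring >= score: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such alignment, assuming a Poisson count: 1 - exp(-e)."""
    return -math.expm1(-e)


# Placeholder parameters (K and lam are NOT values for any real scoring matrix).
e = e_value(score=100, m=300, n=1_000_000, K=0.1, lam=0.3)
p = p_value(e)  # nearly identical to e because e is very small
```

When E is small, 1 - e^(-E) is approximately E, which is why the two values agree. For larger E they diverge, since a P-value can never exceed 1.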

Fa05CSE 182 The last step in Blast
We have discussed
– Alignments
– Db filtering using keywords
– Scoring matrices
– E-values and P-values
The last step: Database filtering requires us to scan a large sequence fast for matching keywords.

Fa05CSE 182 Dictionary Matching, R.E. matching, and position specific scoring

Fa05CSE 182 Keyword search
Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to a keyword.
Question: Given a collection of strings (keywords), find all positions in a database string where a keyword matches.

Fa05CSE 182 Dictionary Matching
Q: Given k words ($s_i$ has length $l_i$), and a database of size n, find all matches to these words in the database string.
How fast can this be done?
Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
Database: POTASTPOTATO

Fa05CSE 182 Dict. Matching & string matching
How fast can you do it, if you only had one word of length m?
– Trivial algorithm: O(nm) time
– Pre-processing O(m), search O(n) time.
Dictionary matching
– Trivial algorithm: $O((l_1 + l_2 + l_3 + \dots)\,n)$
– Using a keyword tree: $O(l_p n)$ ($l_p$ is the length of the longest pattern)
– Aho-Corasick: O(n) after preprocessing $O(l_1 + l_2 + \dots)$
We will consider the most general case.
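
The slide does not name the single-word algorithm. One well-known method that achieves O(m) preprocessing and O(n) search is Knuth-Morris-Pratt, sketched below as an assumption about what is meant. The function names are illustrative.

```python
def failure_table(word):
    """fail[i] = length of the longest proper prefix of word[:i+1] that is also its suffix."""
    fail = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k
    return fail


def find_all(word, text):
    """Return the start index of every occurrence of word in text."""
    if not word:
        return []
    fail = failure_table(word)
    hits = []
    k = 0  # number of characters of word currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != word[k]:
            k = fail[k - 1]  # fall back without moving backwards in text
        if ch == word[k]:
            k += 1
        if k == len(word):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```

Because the text pointer never moves backwards, the search is linear in n. The same fall-back idea reappears in dictionary matching, generalized from one word to a trie.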

Fa05CSE 182 Direct Algorithm
[Figure: the pattern POTATO slid along the database POTASTPOTATO, restarting after each mismatch]
Observations:
– When we mismatch, we (should) know something about where the next match will be.
– When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

Fa05CSE 182 The Trie Automaton
Construct an automaton A from the dictionary
– A[v,x] describes the transition from node v to a node w upon reading x.
– A[u,’T’] = v, and A[u,’S’] = w
– Special root node r
– Some nodes are terminal, and labeled with the index of the dictionary word.
Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
[Figure: trie for the dictionary, with root r and labeled nodes u, v, w]

Fa05CSE 182 An $O(l_p n)$ algorithm for keyword matching
Start with the first position in the db, and the root node.
If successful transition
– Increment ‘current’ pointer
– Move to a new node
– If terminal node, “success”
Else
– Retract ‘current’ pointer
– Increment ‘start’ pointer
– Move to root & repeat
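
The steps above can be sketched in Python as follows. The trie is stored as a list of dictionaries, so `A[v][x]` plays the role of A[v,x]. The function names and the `(start, word_index)` output format are choices made for this illustration.

```python
def build_trie(words):
    """Build the automaton: A[v][x] is the node reached from node v on reading x."""
    A = [{}]        # node 0 is the root r
    terminal = {}   # terminal node -> 1-based index of the dictionary word
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                A[v][x] = len(A) - 1
            v = A[v][x]
        terminal[v] = index
    return A, terminal


def trie_match(words, db):
    """Report (start, word_index) for every dictionary word occurring in db."""
    A, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):      # the 'start' pointer
        v, current = 0, start         # begin at the root
        while current < len(db) and db[current] in A[v]:
            v = A[v][db[current]]     # successful transition: move to a new node
            current += 1              # and increment the 'current' pointer
            if v in terminal:
                hits.append((start, terminal[v]))
        # Failure: 'current' is retracted and 'start' advances via the for loop.
    return hits
```

For the slide's example, `trie_match(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")` reports word 1 (POTATO) at position 6, counting from 0. Each start position walks at most $l_p$ characters, which gives the $O(l_p n)$ bound.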

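Fa05CSE 182 Illustration
[Figure: the trie for the dictionary (terminal nodes labeled 1, 2, 3) traversed against the database POTASTPOTATO, showing the pointers l and c and the current node v]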

Fa05CSE 182 Idea for improving the time
Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently.
If some other pattern j is to match
– then some prefix of pattern j must equal a suffix of the first c-l characters of pattern i.
Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
Database: POTASTPOTATO
[Figure: pattern i = POTASSIUM fails after matching POTAS; pattern j = TASTE has the prefix TAS, which is a suffix of POTAS]

Fa05CSE 182 Improving speed of dictionary matching
Every node v corresponds to a string $s_v$ that is a prefix of some pattern.
Define F[v] to be the node u (other than v) such that $s_u$ is the longest proper suffix of $s_v$.
If we fail to match at v, we should jump to F[v], and commence matching from there.
Let $lp[v] = |s_u|$.
[Figure: trie for the dictionary with failure links]
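
A compact sketch of the failure-link idea, in the style of Aho-Corasick, is given below. The list `F` holds the failure links F[v], computed breadth-first so that each link depends only on shallower nodes. The names `build_automaton`, `search`, and `out` are assumptions made for this illustration.

```python
from collections import deque


def build_automaton(words):
    """Return (A, F, out): trie transitions, failure links, word indices per node."""
    A, F, out = [{}], [0], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in A[v]:
                A.append({})
                F.append(0)
                out.append([])
                A[v][x] = len(A) - 1
            v = A[v][x]
        out[v].append(index)
    queue = deque(A[0].values())   # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for x, w in A[v].items():
            u = F[v]
            while u and x not in A[u]:
                u = F[u]
            F[w] = A[u].get(x, 0)  # longest proper suffix of s_w that is a node
            out[w] = out[w] + out[F[w]]  # words ending at a suffix of s_w
            queue.append(w)
    return A, F, out


def search(words, db):
    """Report (start, word_index) for each occurrence, scanning db once."""
    A, F, out = build_automaton(words)
    hits, v = [], 0
    for pos, x in enumerate(db):
        while v and x not in A[v]:
            v = F[v]               # on failure, jump to F[v] instead of the root
        v = A[v].get(x, 0)
        for index in out[v]:
            hits.append((pos - len(words[index - 1]) + 1, index))
    return hits
```

On the slide's example, the scan fails after matching POTAS and jumps to the node for TAS rather than restarting. It eventually reports POTATO at position 6, counting from 0.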

Fa05CSE 182 End of L5