Your friend has a hobby of generating random bit strings and finding patterns in them. One day she comes to you, excited, and says: I found the strangest.

Presentation transcript:

Your friend has a hobby of generating random bit strings and finding patterns in them. One day she comes to you, excited, and says: "I found the strangest thing in my randomly generated string: a row of length 30, which was as follows –" If her original string was of length 2B, should she be surprised?

We considered the basics of sequence alignment:
– Optimal score computation
– Reconstructing alignments
– Local alignments
– Affine gap costs
– Space-saving measures
– Scoring matrices for DNA and protein sequences
The local alignment algorithm (for biological strings) was given by Temple Smith and Mike Waterman, and is called the Smith-Waterman algorithm (SW).
Can we recreate BLAST?

The output of BLAST is a collection of alignments sorted in decreasing order of scores. Where do we stop?

Consider:
– a large random DNA database (size n)
– a random query DNA string (size m)
– a score function: match/mismatch, etc.
Use SW/BLAST to search the database against the query, and choose all alignments that score above a threshold 's'.
– All hits are bogus.
Q: How many hits do we get above score s?
Q: What is the probability that we get even one hit above s?

BLAST: The matching regions are expanded into alignments, which are scored using SW and an appropriate scoring matrix. The results are presented in order of decreasing score.
The score is just a number. How significant is the top-scoring hit if it has a score S?
Expect/E-value(score S) = number of times we would expect to see a random match generate a score S or better.
P-value = probability of seeing a random hit with score S or better.
How can we compute the E-value and the P-value?

[Schematic: a query-to-db hit, reported with the fields Q beg, S beg, Q end, S end, S, Id]

Some quantities can be reasonably guessed by taking a statistical sample, others cannot:
– Average weight of a group of 100 people
– Average height of a group of 100 people
– Average grade on a test
Give an example of a quantity that cannot.
When the distribution and its expectation are known, it is easy to tell when you are seeing something significant. If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn.

P-value(11): the probability that a specific value (11) or something more extreme is achieved by chance.
E-value(11): the number of times we expect to see a specific value (11) or something more extreme just by chance.
E.g., look at alignment scores obtained by chance:
– 1, 2, 8, 3, 5, 3, 6, 12, 4, 4, 1, 5, 3, 6, 7, 15
Compute a distribution:
– 1-2   XXX
– 3-4   XXXXX
– 5-6   XXXX
– 7-8   XX
– 9-10
– 11-12 X
– 13-14
– 15-16 X
P-value(11) = 2/16
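The counting on this slide can be checked with a short script. This is only a sketch of the empirical method: it takes the sixteen scores exactly as listed, bins them in ranges of width 2, and reports the fraction of scores that are 11 or larger. The function names are made up for the illustration.

```python
from collections import Counter
from fractions import Fraction

# Chance alignment scores exactly as listed on the slide (16 values).
scores = [1, 2, 8, 3, 5, 3, 6, 12, 4, 4, 1, 5, 3, 6, 7, 15]

def empirical_p_value(samples, x):
    """Fraction of samples that are at least as extreme (large) as x."""
    hits = sum(1 for s in samples if s >= x)
    return Fraction(hits, len(samples))

def histogram(samples, width=2):
    """Return (low, high, count) for the bins 1..width, width+1..2*width, ..."""
    bins = Counter((s - 1) // width for s in samples)
    top = max(bins)
    return [(b * width + 1, (b + 1) * width, bins.get(b, 0))
            for b in range(top + 1)]

for low, high, count in histogram(scores):
    print(f"{low:>2}-{high:<2} {'X' * count}")

# Two of the sixteen scores (12 and 15) are >= 11.
print("P-value(11) =", empirical_p_value(scores, 11))
```

With so few samples the estimate is coarse; a real search would need many more chance scores, or a fitted distribution, to estimate small P-values.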

Given a collection of numbers (scores):
– 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
Plot its distribution as follows:
– X-axis = each number
– Y-axis = count/frequency/probability of seeing that number
– More generally, the x-axis can be a range, to accommodate real numbers

Goal: Compute P-value(x).
A simple empirical method: Compute a distribution of scores against a random database. Use an estimate of the area under the curve to get the probability. OR, fit the distribution to one of the standard distributions.
[Figure: score distribution with x and P-val(x) marked]

The initial assumption was that the scores followed a normal distribution.
Z-score computation:
– For any alignment with score S, shuffle one of the sequences many times, and recompute the alignment score each time. Get the mean and standard deviation of the shuffled scores; the Z-score is (S - mean) / standard deviation.
– Look up a table to get a P-value
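The shuffling procedure can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap -2), the number of shuffles, and the example sequences are all assumptions chosen for illustration, and the aligner is a plain score-only Smith-Waterman, not BLAST's implementation.

```python
import random
import statistics

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def shuffle_z_score(query, target, trials=200, seed=0):
    """Z-score of the real score against scores for shuffled copies of target."""
    rng = random.Random(seed)
    real = local_score(query, target)
    letters = list(target)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        null_scores.append(local_score(query, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        return float("inf") if real > mean else 0.0
    return (real - mean) / sd

# Hypothetical example: the query occurs exactly inside the target.
query = "ACGTTGCAAGGCTTACGATC"
target = "TTGACGTTGCAAGGCTTACGATCCAGT"
print("score:", local_score(query, target))
print("z-score:", shuffle_z_score(query, target))
```

Turning the Z-score into a P-value through a normal table is only as good as the normality assumption behind it.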

The initial (and natural) assumption was that scores followed a normal distribution.
In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential-tailed (extreme value) distribution.
Practical consequence:
– Longer tail.
– Previously significant hits are now not so significant.

For simplicity, assume that the database is a binary string, and so is the query.
– Let match score = 1,
– mismatch score = $-\infty$,
– indel = $-\infty$ (no gaps allowed)
What does it mean to get a score k?

[Schematic: query (length m) against database (length n)]
Database size n = 100M, query size m = 1000.

For a typical query, there are only a few 'real hits' in the database. Much of the database is random from the query's perspective.
Consider a random DNA string of length n.
– Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
Assume for the moment that the query is all A's (length m). What is the probability that an exact match to the query can be found?

Probability that there is a match starting at a fixed position i = $0.25^m$.
What is the probability that some position i has a match? Dependencies confound probability estimates.

Q: Toss a coin many times. If it is HEADS, and the previous toss was HEADS too, you get a dollar. How much money do you expect to get after n tosses?
– Let $X_i$ be the amount earned in the i-th toss
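One way to answer the question, assuming a fair coin: tosses 2 through n each pay a dollar with probability 1/4, so by linearity of expectation the total is (n - 1)/4, even though consecutive payouts are dependent. The sketch below checks that claim by brute force over all toss sequences for small n; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def expected_payout_exact(n):
    """Average payout over all 2**n equally likely toss sequences."""
    total = 0
    for tosses in product("HT", repeat=n):
        total += sum(
            1 for i in range(1, n)
            if tosses[i] == "H" and tosses[i - 1] == "H"
        )
    return Fraction(total, 2 ** n)

def expected_payout_formula(n):
    """Linearity of expectation: tosses 2..n each pay 1 with probability 1/4."""
    return Fraction(n - 1, 4)

# The two agree for every small n, dependencies notwithstanding.
for n in range(1, 11):
    assert expected_payout_exact(n) == expected_payout_formula(n)

print(expected_payout_formula(10))  # 9/4
```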

The expected number of matches can still be computed.
– Let $X_i = 1$ if there is a match starting at position i,
– $X_i = 0$ otherwise
– Expected number of matches = $E\left[\sum_i X_i\right] = \sum_i E[X_i] = n \cdot 0.25^m$

Expected number of matches = $n \cdot 0.25^m$
– If $n = 10^7$, m = 10, then the expected number of matches ≈ 9.54
– If $n = 10^7$, m = 11, the expected number of hits ≈ 2.38
– If $n = 10^7$, m = 12, the expected number of hits ≈ 0.6 < 1
Bottom line: An exact match to a sufficiently long substring of the query (here, 12 or more letters) is unlikely just by chance.
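The formula is easy to evaluate directly. A minimal sketch, following the slide in treating all n positions as possible start positions (the exact count would be n - m + 1); the function name is made up for the example.

```python
def expected_exact_matches(n, m, p=0.25):
    """Expected number of start positions where a fixed length-m word matches.

    Each position matches with probability p**m, and expectations add
    even though overlapping positions are not independent.
    """
    return n * p ** m

for m in (10, 11, 12):
    print(m, round(expected_exact_matches(10**7, m), 2))
```

Each extra letter in the word divides the expected count by 4, which is why the count drops below 1 so quickly.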

Random database, Pr(1) = p.
What is the expected number of hits to a sequence of k 1's?
Instead, consider a random binary matrix: what is the expected number of diagonals of k 1s?
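A possible way to make the matrix question concrete, under assumptions the slide does not spell out: every cell is 1 independently with probability p, and a "diagonal of k 1s" is counted once per start cell, so a longer run contributes several overlapping counts. Under those assumptions linearity of expectation gives (rows - k + 1)(cols - k + 1) p^k, which the sketch checks by brute force on a tiny matrix with p = 1/2.

```python
from fractions import Fraction
from itertools import product

def expected_diagonal_runs(rows, cols, k, p):
    """Number of possible start cells times the chance that k cells are all 1."""
    starts = max(rows - k + 1, 0) * max(cols - k + 1, 0)
    return starts * p ** k

def count_diagonal_runs(matrix, k):
    """Count start cells (i, j) with matrix[i+d][j+d] == 1 for d = 0..k-1."""
    rows, cols = len(matrix), len(matrix[0])
    return sum(
        all(matrix[i + d][j + d] == 1 for d in range(k))
        for i in range(rows - k + 1)
        for j in range(cols - k + 1)
    )

def brute_force_expectation(rows, cols, k):
    """Average the count over every 0/1 matrix (p = 1/2); tiny sizes only."""
    total = 0
    for bits in product((0, 1), repeat=rows * cols):
        matrix = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
        total += count_diagonal_runs(matrix, k)
    return Fraction(total, 2 ** (rows * cols))

half = Fraction(1, 2)
print(expected_diagonal_runs(3, 4, 2, half), brute_force_expectation(3, 4, 2))
```

In a real dot matrix of two sequences the cells are not fully independent, so this is only the idealized version of the count.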

As you increase k, the number decreases exponentially.
The number of diagonal runs of length k can be approximated by a Poisson process.
In ungapped alignments, we replace the coin tosses by column scores, but the behavior does not change (Karlin & Altschul).
As the score increases, the number of alignments that achieve the score decreases exponentially.

Choose a scoring scheme such that the expected score between a pair of residues is < 0.
Expected number of alignments with a score S.
For small values, the E-value and the P-value are approximately the same.
Bit score.
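The formulas did not survive in this transcript. The standard Karlin-Altschul forms are E = K·m·n·e^(-λS) for the expected number of chance alignments with score at least S, P = 1 - e^(-E) for the chance of at least one, and bit score S' = (λS - ln K) / ln 2, so that E = m·n·2^(-S'). The sketch below evaluates them; the values of λ and K are placeholders picked for illustration, not parameters of any particular scoring matrix.

```python
import math

def e_value(score, m, n, lam, K):
    """Expected number of chance alignments with score >= `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance hit, with Poisson-distributed counts."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e

def bit_score(score, lam, K):
    """Normalized score S' such that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)

# Placeholder statistical parameters (assumptions for this example only).
lam, K = 0.3, 0.1
m, n = 1000, 10**8  # query and database sizes used earlier in the slides

for s in (60, 80, 100, 120):
    e = e_value(s, m, n, lam, K)
    print(s, round(bit_score(s, lam, K), 1), e, p_value(e))
```

For small E the printed P-value is nearly identical to E, since 1 - e^(-E) ≈ E, and the two diverge once E approaches 1.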