OUTLINE: Scoring Matrices; Probability of matching runs; Quality of a database match.

Scoring Matrices
Two alternative models for differences in DNA / protein sequences:
– Random: all sequences are random selections from a given pool of residues.
– Nonrandom: sequences are related through an evolutionary process.

Scoring Matrices
– Random: $p_a$ is the fraction of amino acid a in the pool (the probability of occurrence of that amino acid).
– Nonrandom: $q_{a,b}$ is the probability of finding the particular residues a and b aligned.

Scoring Matrices
These two models can be compared:
– $q_{a,b} / (p_a p_b)$ → odds ratio.
– If $q_{a,b} > p_a p_b$ → the nonrandom model is more likely to produce the alignment of these residues.

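Scoring Matrices
However, we need a single score for the whole alignment.
– Assume: each position in an alignment is regarded as independent.
– Odds ratio of an alignment: $\prod_i \frac{q_{a_i,b_i}}{p_{a_i} p_{b_i}}$, where $a_i$ and $b_i$ are the residues aligned at position $i$.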

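Scoring Matrices
Log-odds ratio: taking the logarithm of the odds ratio gives an additive score, $S = \sum_i s(a_i, b_i)$ with $s(a,b) = \log \frac{q_{a,b}}{p_a p_b}$, where $a_i$ and $b_i$ are the residues aligned at position $i$.
– Negative value: the probability of the two residues being aligned is greater in the random model than in the nonrandom model.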

Scoring Matrices
EXAMPLE: Suppose M occurs in the sequences with frequency 0.01 and L occurs with frequency 0.1. By random pairing, you expect a fraction 0.01 × 0.1 = 0.001 of amino acid pairs to be M-L. If the observed frequency of M-L is actually 0.003, the score for matching M-L will be $\log_2(0.003 / 0.001) = \log_2(3) = 1.585$.
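The M-L example can be checked numerically. The following is a minimal sketch: the function name `log_odds_score` is an illustrative choice that does not appear in the slides, and base-2 logarithms are assumed because the example uses them.

```python
import math

def log_odds_score(q_ab, p_a, p_b, base=2.0):
    """Log of the observed pair frequency over the frequency expected by chance."""
    expected = p_a * p_b  # random model: the two residues are drawn independently
    return math.log(q_ab / expected, base)

# Values from the slide: M at 0.01, L at 0.1, observed M-L pairs at 0.003.
score_ml = log_odds_score(q_ab=0.003, p_a=0.01, p_b=0.1)
print(round(score_ml, 3))

# A pair observed less often than chance predicts gets a negative score.
# The frequency 0.0005 is a made-up value for illustration.
print(log_odds_score(q_ab=0.0005, p_a=0.01, p_b=0.1) < 0)
```

A positive score favours the nonrandom (related) model for that residue pair, and a negative score favours the random model.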

Probability of matching runs
Statistical significance measures:
– p-value: the probability that at least one sequence will produce the same or a better score by chance.
– E-value: the expected number of sequences that will produce the same or a better score by chance.

Probability of matching runs
Analysis of coin tosses:
– “H” indicates a head.
– p: probability of a head (p = 0.5).
– Probability of 5 heads in a run: $0.5^5 = 0.031$.
– Expected number of times that a run of 5 heads (5H) occurs in 14 coin tosses: such a run can start at 14 − 5 + 1 = 10 positions, so 10 × 0.031 = 0.31.
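The expected count for 14 tosses can be confirmed by brute force. This is a sketch only: it assumes a fair coin, counts overlapping windows (so six heads in a row contribute two windows of 5H), and uses helper names that are illustrative rather than taken from the slides.

```python
from itertools import product

def count_runs(tosses, run_length):
    """Count the windows of `run_length` consecutive tosses that are all heads."""
    target = "H" * run_length
    return sum(
        tosses[i:i + run_length] == target
        for i in range(len(tosses) - run_length + 1)
    )

def expected_runs_formula(n, run_length, p=0.5):
    """(n - l + 1) possible start positions, each all heads with probability p**l."""
    return (n - run_length + 1) * p ** run_length

def expected_runs_exact(n, run_length):
    """Average the window count over all 2**n equally likely fair-coin sequences."""
    total = sum(
        count_runs("".join(seq), run_length)
        for seq in product("HT", repeat=n)
    )
    return total / 2 ** n

print(expected_runs_formula(14, 5))  # 10 start positions times 0.5**5
print(expected_runs_exact(14, 5))    # enumeration over 16384 sequences
```

Both calls agree because expectation is linear: each of the 10 windows is all heads with probability 0.5^5, regardless of how the windows overlap.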

Probability of matching runs
Analysis of coin tosses:
– The expected number of runs of heads of length $l$ in $n$ tosses: $E(l) \approx n p^l$.
– Expected length $R$ of the longest run in $n$ tosses (Erdős-Rényi): $R = \log_{1/p}(n)$, the length at which $n p^R = 1$.

Probability of matching runs
Analysis of coin tosses:
– Example: $n = 20$ → $R = \log_2(20) = 4.3$ (in 20 coin tosses we expect a run of about 4.3 heads to occur once).
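As a small sketch of the longest-run estimate, the code below evaluates $R = \log_{1/p}(n)$ and measures the longest run in a given toss string. Both function names and the sample string are hypothetical, chosen only for illustration.

```python
import math

def expected_longest_run(n, p=0.5):
    """R = log base 1/p of n, i.e. the run length l at which n * p**l equals 1."""
    return math.log(n) / math.log(1.0 / p)

def longest_run(tosses, symbol="H"):
    """Length of the longest block of consecutive `symbol` characters."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best

print(round(expected_longest_run(20), 1))  # the slide's example, n = 20
print(longest_run("HTHHHTTHHHHT"))         # made-up toss string
```

The formula is an approximation for large n; individual toss sequences will scatter around it.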

Probability of matching runs
DNA / protein sequences: probability of an individual match, assuming all 20 amino acids are equally frequent: p = 1 / 20 = 0.05

Probability of matching runs
Expected number of matches when comparing two sequences of lengths 8 and 6: 8 × 6 × 0.05 = 2.4

Probability of matching runs
Expected number of two successive matches: 8 × 6 × 0.05 × 0.05 = 0.12
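The two products above follow the pattern m × n × p^l. The sketch below computes that expectation and, for comparison, counts matches between two concrete sequences the way a dot matrix would. The function names and the two example sequences are assumptions made for illustration, and the formula ignores edge effects, as the slides do.

```python
def expected_matches(m, n, p, length=1):
    """Expected number of matching runs of `length`: m * n * p**length."""
    return m * n * p ** length

def count_matches(seq_a, seq_b, length=1):
    """Count position pairs (i, j) where the two length-`length` words are identical."""
    return sum(
        seq_a[i:i + length] == seq_b[j:j + length]
        for i in range(len(seq_a) - length + 1)
        for j in range(len(seq_b) - length + 1)
    )

print(round(expected_matches(8, 6, 0.05), 2))            # single matches
print(round(expected_matches(8, 6, 0.05, length=2), 2))  # two successive matches

# Hypothetical sequences of lengths 8 and 6 sharing the word "CDE".
print(count_matches("ACDEFGHI", "CDEKLM"))
print(count_matches("ACDEFGHI", "CDEKLM", length=2))
```

Observed counts well above the expectation, as with the shared word here, are what suggests that two sequences are related rather than random.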

Probability of matching runs
– Expected number of matches of length $l$: $E(l) \approx m n p^l$.
– Expected longest match between two sequences of lengths $m$ and $n$: $R = \log_{1/p}(m n)$,
where $p$ is the probability of occurrence of a single residue.

Probability of matching runs
Example:
– DNA seq: m = 32, n = 32 → $R = \log_4(32 \times 32) = 5$
– Amino acid seq: m = 100, n = 80 → $R = \log_{20}(100 \times 80) = 3$
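Both examples can be reproduced with a few lines. This sketch assumes equally frequent residues (p = 1/4 for DNA, p = 1/20 for amino acids) and uses the hypothetical name `expected_longest_match`.

```python
import math

def expected_longest_match(m, n, p):
    """R = log base 1/p of (m * n), the length at which m * n * p**R equals 1."""
    return math.log(m * n) / math.log(1.0 / p)

dna = expected_longest_match(32, 32, 1 / 4)        # DNA example
protein = expected_longest_match(100, 80, 1 / 20)  # amino acid example
print(round(dna, 2), round(protein, 2))
```

Because R grows only logarithmically with m × n, doubling both sequence lengths adds just one residue to the expected longest DNA match.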

Probability of matching runs
Under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores. Statistics for the scores of local alignments, unlike those of global alignments, are well understood.

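Probability of matching runs
The optimal ungapped local alignment score follows the Gumbel extreme value distribution: because we always choose the best-scoring alignment, its score is the maximum of many candidate scores, and such maxima follow an extreme value distribution.
Probability of obtaining an alignment with score $S$ greater than a value $x$: $P(S > x) = 1 - \exp(-e^{-\lambda (x - u)})$, where $u$ is the characteristic (location) value and $\lambda$ is the decay (scale) parameter of the distribution.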
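A minimal sketch of the tail probability, assuming the standard Gumbel form $P(S > x) = 1 - \exp(-e^{-\lambda (x - u)})$. The parameter values u = 30 and λ = 0.3 are arbitrary illustrations, not fitted to any scoring system, and `gumbel_tail` is a hypothetical name.

```python
import math

def gumbel_tail(x, u, lam):
    """P(S > x) = 1 - exp(-exp(-lam * (x - u))) for a Gumbel-distributed score S."""
    # -expm1(-t) equals 1 - exp(-t) and stays accurate when t is tiny.
    return -math.expm1(-math.exp(-lam * (x - u)))

u, lam = 30.0, 0.3  # illustrative values only
for x in (30, 40, 50, 60):
    print(x, gumbel_tail(x, u, lam))
```

For scores well above u the probability is close to $e^{-\lambda (x - u)}$, so it falls off exponentially as the score increases.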

Quality of a database match
How good is an alignment? How believable are the results of a database search?

Quality of a database match
The alignments reported are selected according to the alignment score. We need to know:
– Whether the score is greater than we would expect from aligning the sequence with a random sequence.

Quality of a database match
Statistical significance measures:
– p-value: the probability that at least one sequence will produce the same or a better score by chance.
– E-value: the expected number of sequences that will produce the same or a better score by chance.

Quality of a database match
Score → significance of the score.
– By applying the Gumbel extreme value distribution, it is possible to estimate the probability that two random sequences align with a score greater than or equal to the observed alignment score.
– This probability is reported as an E-value or a p-value.

Quality of a database match
E-value depends on:
– The sequence length,
– The number of sequences in the database,
– The alignment score.

Quality of a database match
A good E-value:
– The smaller the E-value, the more significant the alignment.
– The threshold value is generally set to 0.01 or 0.001.
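The dependence of the E-value on sequence length, database size and score can be sketched with the Karlin-Altschul form $E = K m n e^{-\lambda S}$, which the slides do not state explicitly and which is assumed here. The p-value then follows as $1 - e^{-E}$ if chance hits are treated as Poisson events. The values of K, λ, the query length and the database length below are illustrative assumptions, not output of any real search.

```python
import math

def e_value(score, m, n, K, lam):
    """Assumed Karlin-Altschul form: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)

def p_value_from_e(e):
    """Probability of at least one chance hit, 1 - exp(-E), under a Poisson model."""
    return -math.expm1(-e)

K, lam = 0.134, 0.318       # illustrative parameter values
m, n = 250, 50_000_000      # hypothetical query length and total database length

for score in (60, 80, 100):
    e = e_value(score, m, n, K, lam)
    print(score, e, p_value_from_e(e), e <= 0.001)
```

For small E the two measures nearly coincide (an E-value of 0.01 corresponds to a p-value of about 0.00995), which is why either can be compared against thresholds such as 0.01 or 0.001.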
