Position-Specific Substitution Matrices

PSSM
A regular substitution matrix uses the same scores for any given pair of amino acids regardless of where in the protein they are.
This is only an approximation: within a family of related proteins, some residues are very important for function and hardly change at all, while others can vary quite a bit.
Position-specific substitution matrices (PSSMs) are an approach to this problem: developing a different set of substitution scores for each position in a set of aligned proteins.
This requires a set of aligned, related proteins.
Gaps can be a problem: do you have a separate gap opening and extension penalty for each position, or do you use the same values for all positions?
–Most PSSMs use a single set of values.
–Hidden Markov Models address this question specifically.
PSI-BLAST is the primary general-purpose use of PSSMs.

Some Aligned Sequences
gi| |gb|ABS |   AVPLMQPEAPIVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi| |emb|CAB |  AVPLMQPEAPFVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi| |gb|ABV |   AVPLMQPESPIVGTGMEYVSGKDSGAAVICRYPGVVERVEAKNIWVRRYE
gi| |gb|AAU |   AVPLMQPESPIVGTGMEYVSAKDSGAAVICRHPGIVERVEAKNIWVRRYE
BMQ_0128        AVPLLNPEAPIVGTGMEYVSGKDSGAAVICKYPGVVERVEAKQIIVRRYE
gi| |gb|AAS |   AVPLMNPESPIVGTGMEYVSAKDSGAAVICKHPGIVERVEAREVWVRRYV
gi| |dbj|BAB |  AVPLLVPEAPIVGTGMEHVSAKDSGAAIVSKHRGIVERVTAKEIWVRRLE
gi| |dbj|BAD |  AVPLLVPEAPLVGTGMEHVSAKDSGAAVVSKYAGIVERVTAKEIWVRRIE
                ****: **:*:******:**.******::.:: *:**** *::: ***

Making a PSSM
There are several variations on the theme, but the best way is analogous to how substitution matrices are made, using the log-odds method.
Start with a set of aligned sequences.
For each position, count the number of each type of amino acid that has occurred. The frequency of amino acid a in column u is $q_{u,a}$.
–Note that we aren't counting substitutions here, since in a multiple alignment we don't know how the different sequences are related.
We also need to know the frequency of amino acid a among sequences in general, $p_a$.
The odds ratio is the frequency of amino acid a observed in column u (the result of real-world evolution) divided by the frequency expected if amino acids occurred completely at random: $q_{u,a} / p_a$.
Finally, take the logarithm so scores can be added.
–$m_{u,a}$ is the score used for amino acid a in column u. This needs to be done for all amino acids in all columns.
–$m_{u,a} = \log(q_{u,a} / p_a)$
It is possible to weight the scores to compensate for bias in the original sequence selection.
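The log-odds recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the background frequencies $p_a$ are taken to be uniform (1/20 each) rather than measured from a database, the logarithm is base 2 (the slide does not specify a base), and the function names are made up for this example. The input is the set of eight aligned sequences from the "Some Aligned Sequences" slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The eight aligned sequences from the "Some Aligned Sequences" slide.
ALIGNMENT = [
    "AVPLMQPEAPIVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE",
    "AVPLMQPEAPFVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE",
    "AVPLMQPESPIVGTGMEYVSGKDSGAAVICRYPGVVERVEAKNIWVRRYE",
    "AVPLMQPESPIVGTGMEYVSAKDSGAAVICRHPGIVERVEAKNIWVRRYE",
    "AVPLLNPEAPIVGTGMEYVSGKDSGAAVICKYPGVVERVEAKQIIVRRYE",
    "AVPLMNPESPIVGTGMEYVSAKDSGAAVICKHPGIVERVEAREVWVRRYV",
    "AVPLLVPEAPIVGTGMEHVSAKDSGAAIVSKHRGIVERVTAKEIWVRRLE",
    "AVPLLVPEAPLVGTGMEHVSAKDSGAAVVSKYAGIVERVTAKEIWVRRIE",
]

# Assumption: uniform background p_a. Real values come from a sequence database.
uniform_background = {a: 1 / 20 for a in AMINO_ACIDS}


def log_odds_pssm(alignment, background):
    """Return one dict per column mapping amino acid a -> m[u][a] = log2(q[u][a] / p[a])."""
    n_seqs = len(alignment)
    pssm = []
    for u in range(len(alignment[0])):
        counts = Counter(seq[u] for seq in alignment)
        column = {}
        for a in AMINO_ACIDS:
            q = counts[a] / n_seqs
            # An amino acid never seen in this column gets a score of -infinity.
            column[a] = math.log2(q / background[a]) if q > 0 else float("-inf")
        pssm.append(column)
    return pssm


pssm = log_odds_pssm(ALIGNMENT, uniform_background)
print(round(pssm[0]["A"], 2))  # column 1 is always A
print(round(pssm[4]["M"], 2))  # column 5 is M in 5 of 8 sequences
print(pssm[0]["W"])            # W never occurs in column 1
```

Unobserved amino acids come out as negative infinity here, which is exactly the missing data problem.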

The Missing Data Problem
You are trying to determine the frequency of all 20 amino acids at each position in the sequence.
There are inevitably some amino acids that never occur in certain positions.
–However, if they do occur in a new sequence, their score $m_{u,a} = \log(q_{u,a} / p_a)$ would be negative infinity, the logarithm of 0. This is not a useful score.
The simplest solution is to start counting at 1 instead of 0. The added counts are referred to as pseudocounts.
–Normally the frequency of amino acid a in column u is $q_{u,a} = n_{u,a} / N$, where $n_{u,a}$ is the number of times a occurs in column u and N is the number of sequences being examined.
–With pseudocounts, $q_{u,a} = (n_{u,a} + 1) / (N + 20)$. The N + 20 term is because there are 20 amino acids.
Slightly more sophisticated is using the proportions of each amino acid in the database, $p_a$. The sum of all 20 $p_a$ is 1.
–$q_{u,a} = (n_{u,a} + p_a) / (N + 1)$
–Related to this is using data from a substitution matrix as the source of the proportions.
By adding constants, you can vary the proportions of pseudocounts and real counts, depending on how much real data you have.
More sophisticated methods also exist.
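Both pseudocount formulas are easy to check numerically. The sketch below applies them to a single alignment column, written as the string "MMMMLMLL" (column 5 of the eight aligned sequences shown earlier). The uniform background and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def freq_add_one(column):
    """q = (n + 1) / (N + 20): every amino acid starts with a count of 1."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + 1) / (n_seqs + 20) for a in AMINO_ACIDS}


def freq_background(column, background):
    """q = (n + p_a) / (N + 1): one pseudocount shared out in proportion to p_a."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + background[a]) / (n_seqs + 1) for a in AMINO_ACIDS}


column = "MMMMLMLL"
# Assumption: uniform background; real p_a values come from a sequence database.
background = {a: 1 / 20 for a in AMINO_ACIDS}

q_add_one = freq_add_one(column)
q_background = freq_background(column, background)

# Both sets of frequencies still sum to 1, and no frequency is 0,
# so every log-odds score is finite.
w_score = math.log2(q_add_one["W"] / background["W"])
print(round(sum(q_add_one.values()), 6), round(sum(q_background.values()), 6))
print(round(w_score, 3))  # score for W, which never occurs in this column
```

With only 8 sequences, the add-one scheme contributes 20 pseudocounts against 8 real counts, so it pulls the frequencies strongly toward uniform. The background-weighted scheme adds a single pseudocount in total and distorts the data much less.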

Information and Entropy
The modern theory of information was developed by Claude Shannon in 1948.
–The basis for most modern communication. A common application is ZIP files, which compress information.
Entropy is a measure of the uncertainty of the results of an event.
Entropy = number of bits (binary, yes/no decisions) needed to store or communicate the results.
–The result of a coin flip, with 2 equally likely outcomes, needs 1 bit to describe.
–Rolling a die, with 6 equally likely outcomes, needs somewhat more than 2 bits to describe.
–Related to the concept of entropy in thermodynamics.
The entropy of an event (H) is -1 times the sum, over all possible outcomes, of the probability of each outcome ($p_x$) times the base 2 logarithm of that probability:
–$H = -\sum_x p_x \log_2 p_x$
–Units are bits.
–$p \log_2 p = 0$ when $p = 0$, by convention.
Thus for a coin flip, $p_H = p_T = 1/2$. The base 2 log of 1/2 is -1, so $H = -(1/2 \times (-1) + 1/2 \times (-1)) = 1$, or 1 bit of information.
–For a 6-sided die, each possible outcome has a 1/6 probability. $\log_2(1/6) = -2.58$, so rolling a die has $H = -6 \times 1/6 \times (-2.58) = 2.58$ bits of information.
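The coin and die numbers can be reproduced with a short Python function. This is a minimal sketch; the function name is chosen for this example, and outcomes with probability 0 are skipped to implement the convention that they contribute nothing.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping outcomes with p = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


coin_entropy = entropy([1 / 2, 1 / 2])
die_entropy = entropy([1 / 6] * 6)

print(coin_entropy)           # 1.0 bit
print(round(die_entropy, 2))  # 2.58 bits
```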

Information Content
Outcomes with different probabilities affect the entropy. Entropy is maximal when all outcomes are equally likely. Entropy is 0 when there is only 1 possible outcome.
–Imagine a loaded die, where the probability of a 6 is 1/2 and the probability of any other number is 1/10.
–$H = -(1/2 \log_2(1/2) + 5 \cdot (1/10) \log_2(1/10)) = -(1/2 \cdot (-1) + 5 \cdot (1/10) \cdot (-3.322)) = 2.16$ bits
–Compare this with a fair die, which has an entropy of 2.58 bits. The fair die's outcome is more uncertain than the loaded die's.
The information of an event is the loss of uncertainty concerning an outcome. It is the difference between the maximum possible entropy (with all outcomes equally likely) and the actual entropy calculated with the outcomes having their different probabilities.
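As a check on the loaded-die arithmetic, the following sketch computes both the entropy and the information (maximum entropy minus actual entropy). The function names are chosen for this example only.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, skipping outcomes with p = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information(probabilities):
    """Maximum entropy (all outcomes equally likely) minus the actual entropy."""
    h_max = math.log2(len(probabilities))
    return h_max - entropy(probabilities)


# Loaded die: faces 1-5 each have probability 1/10, face 6 has probability 1/2.
loaded_die = [1 / 10] * 5 + [1 / 2]
fair_die = [1 / 6] * 6

loaded_entropy = entropy(loaded_die)
loaded_information = information(loaded_die)
fair_information = information(fair_die)

print(round(loaded_entropy, 2))      # 2.16 bits
print(round(loaded_information, 2))  # 0.42 bits
```

Knowing that the die is loaded removes about 0.42 bits of uncertainty compared with a fair die. For the fair die, the information is 0 (up to floating-point rounding).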

Sequence Logos
A sequence logo is a visual representation of a PSSM, showing the relative importance of different positions and which residues contribute the most.
–Based on Shannon information theory.
Consider a single position in a set of aligned protein sequences.
–If all 20 amino acids are equally likely, the entropy of that position is $H_{max} = -20 \times (1/20) \log_2(1/20) = -\log_2(1/20) = 4.32$ bits.
–The information I of position u is $I = H_{max} - H_u$.
–I = 0 when all amino acids are equally likely.
If there is only 1 amino acid ever found at a position (completely conserved), there is no uncertainty about it, so its entropy is 0 and the information content is the maximum, 4.32 bits.
A more complicated example: say that this position has a 1/3 chance of being R and a 2/3 chance of being K.
–$H_u = -(1/3 \log_2(1/3) + 2/3 \log_2(2/3)) = -(-0.528 - 0.390) = 0.918$ bits
–$I = 4.32 - 0.92 = 3.40$ bits.
For a sequence logo, the relative frequency of each amino acid is multiplied by the position's information content, which is then converted into a height.
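The letter heights for one logo column can be sketched as follows, using the R/K example above. This is an illustration under simple assumptions: the column is given as a string of observed residues, frequencies are raw counts with no pseudocounts, and no small-sample correction is applied (logo-drawing tools may include one). The function name is hypothetical.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=20):
    """Return {amino acid: height in bits} for one alignment column."""
    counts = Counter(column)
    n_seqs = len(column)
    freqs = {a: c / n_seqs for a, c in counts.items()}
    h_u = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - h_u  # I = H_max - H_u
    # Each letter's height is its relative frequency times the column's information.
    return {a: f * info for a, f in freqs.items()}


heights = logo_heights("RKK")  # 1/3 R, 2/3 K
print({a: round(h, 2) for a, h in heights.items()})

conserved = logo_heights("AAAA")  # completely conserved column
print(round(conserved["A"], 2))   # full height, log2(20) bits
```

The heights in a column always add up to that column's information content, so conserved positions produce tall stacks and variable positions produce short ones.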

PSI-BLAST
Part of the BLAST programs available at NCBI.
Finds new family members that are not hit by the original query.
An iterative process:
–First the database (usually nr) is searched with an initial query sequence, and all hits with e-values better than some cutoff (default = 0.005) are taken.
–These aligned sequences are used to construct a PSSM.
–The PSSM is then used to search the database again.
–If new sequences better than the e-value cutoff are found, the PSSM is updated to include them, and the search is run again.
–Eventually, no new sequences are found and the PSI-BLAST search is complete.
Considerably slower than regular BLAST.
You have to start each iteration manually, at the top of the Descriptions area.
After 3 iterations with ORF00135 we get no more new hits.
With "conserved hypothetical protein" BMQ_0196 (next slide) we get new hits for at least 4 iterations, and also extensions of the match length for many hits.
–Most are hypothetical genes, but some mention possible functions.
Unfortunately, you can't download the PSSM, but you can save it and re-use it if you like.
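The iterate-until-nothing-new logic can be illustrated with a toy model. This is not how PSI-BLAST itself searches: the sketch assumes ungapped sequences of equal length, a four-sequence made-up database, add-one pseudocounts with a uniform background, and a raw score cutoff standing in for the e-value cutoff. All names and data are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(members):
    """Log-odds PSSM (base 2) with add-one pseudocounts and a uniform background."""
    n_seqs = len(members)
    pssm = []
    for u in range(len(members[0])):
        counts = Counter(seq[u] for seq in members)
        pssm.append({a: math.log2(((counts[a] + 1) / (n_seqs + 20)) / (1 / 20))
                     for a in AMINO_ACIDS})
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores along an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


def iterative_search(query, database, cutoff):
    """Rebuild the PSSM from all accepted sequences until a pass finds nothing new."""
    members = [query]
    iterations = 0
    while True:
        iterations += 1
        pssm = build_pssm(members)
        new_hits = [s for s in database
                    if score(pssm, s) >= cutoff and s not in members]
        if not new_hits:
            return members, iterations
        members.extend(new_hits)


# Toy database: AVPLLN resembles the query; AVPILN resembles AVPLLN more than the query.
toy_database = ["AVPLLN", "AVPILN", "GKDSGW"]
family, n_iterations = iterative_search("AVPLMQ", toy_database, cutoff=3.0)
print(family, n_iterations)
```

In this toy run, the first pass accepts only AVPLLN. The updated PSSM then scores AVPILN above the cutoff on the second pass, and the third pass finds nothing new, so the search stops. The unrelated sequence GKDSGW is never accepted.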

Another Sequence for PSI-BLAST
>BMQ_0196 | QMB1551_chromosome: | conserved hypothetical protein
MDKLMNRSWVMKIIALLLAFMLYLSVNLDDGASSSNKILNRSSSANTGVETLTDVPVQVS
YNEKNRIVRGVPDTVIMTLEGPKNILAQTKLQKDYQAYIDLDNLSLGQHRVKVQYRNISD
NLNVVVKPDIVNVTIEERDSKQFSVEASYDKNKVKNGYEAGEATVSPRAVTVTGASSQLD
QVAYVKAIIDLDNASKTVTKQATVVALDKNLNKLNVTVQPETVNVTIPVRNISKKVPIDV
IQEGTPGDGVNITKLEPKTDTVKIIGPSDSLEKIDKIDNIPVDVTGITKSKDIKVNVPVP
DGIDSVSPKQITVHVEVDKQGDEKDAEETDASAAETKSFKNLPVSLTGQSSKYTYELLSP
TSVDADVKGPKSDLDKLTKSGISLSANVGNLSAGEHTVPIIINSPDSVTSTLSTKQAKVR
VTAKKQSGTNDEQTDDKETSGSTSDKETSGSTSDKETKPDTGTGSGTNPGTGNSGDSADK
PSEETDTPEDNTDTPTDSTETGDDSSNQSDENSTPVDGQTDNTSGN