Scoring Matrices: Scoring matrices, PSSMs, and HMMs. BIO520 Bioinformatics, Jim Lund. Reading: Ch 6.1.



Alignment scoring matrix DNA matrix: [4 x 4 matrix with rows and columns labeled A, C, G, T; score values not recovered]

Alignment scoring matrix Protein matrix:

Use of a scoring matrix P L S - - C F G G L T - A C H L Score = 3

Consensus sequences
Different ways to describe a consensus, from crude to refined:
–Consensus site
–Sequence logos
–Position Specific Score Matrix (PSSM)
–Hidden Markov Model (HMM)

Consensus sequences and sequence logos
[Figure: sequence logo]
Consensus sequence: GTMGFGLPAAIGAKLARPDRRVVAIDGDGSFQMTVQELST

Constructing (and using) a consensus sequence
1. Collect sequences.
2. Align sequences (consensus sites are descriptions of the alignment).
3. Condense the aligned sequences into a model (a consensus sequence, PSSM, or HMM).
4. Apply the resulting scoring matrix in alignments/searches.

Position Specific Score Matrix (PSSM)
A position specific scoring matrix (PSSM) is a matrix based on the amino acid frequencies (or nucleic acid frequencies) at every position of a multiple alignment. From these frequencies, scores are calculated so that the matrix assigns higher scores to residues that appear more often than expected by chance at a given position.

Creating a PSSM: Example
NTEGEWI
NITRGEW
NIAGECC
Amino acid frequencies are computed at every position of the alignment.
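The per-position counts for this example can be tallied with a few lines of Python. This is a minimal sketch: the function name `column_counts` is made up for illustration, and it assumes an ungapped alignment in which all sequences have the same length.

```python
from collections import Counter

# The three aligned sequences from the slide.
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]


def column_counts(seqs):
    """Return one Counter of residue counts per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) yields the alignment column by column.
    return [Counter(column) for column in zip(*seqs)]


counts = column_counts(alignment)
for position, counter in enumerate(counts, start=1):
    n = sum(counter.values())
    # Relative frequency of each residue observed in this column.
    freqs = {aa: count / n for aa, count in sorted(counter.items())}
    print(position, freqs)
```

Residues that never occur in a column simply have a count of zero here, which is the problem that pseudocounts address.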

Creating a PSSM: Example
Amino acids that do not appear at a specific position of a multiple alignment must also be considered in order to model every possible sequence and have calculable log-odds scores. A simple procedure called pseudocounts assigns minimal scores to residues that do not appear at a certain position of the alignment according to the following equation:
$$\mathrm{Score}(i, j) = \frac{\mathrm{Frequency}(i, j) + \mathrm{pseudocount}}{N + 20 \times \mathrm{pseudocount}}$$
Where
–Frequency(i, j) is the count of occurrences of residue i in column j.
–pseudocount is a number greater than or equal to 1.
–N is the number of sequences in the multiple alignment.
–20 is the number of amino acid types.

Creating a PSSM: Example
In this example, N = 3 and let’s use pseudocount = 1:
Score(N) at position 1 = 3/3 = 1.
Score(I) at position 1 = 0/3 = 0.
Readjust:
Score(I) at position 1 -> (0+1) / (3+20) = 1/23 ≈ 0.043
Score(N) at position 1 -> (3+1) / (3+20) = 4/23 ≈ 0.174
The PSSM is obtained by taking the logarithm of the values obtained above divided by the background frequency of the residues. To simplify for this example we’ll assume that every amino acid appears equally often in protein sequences, i.e. f_i = 0.05 for every i:
PSSM Score(I) at position 1 = log(0.043 / 0.05), which is negative because 0.043 < 0.05.
PSSM Score(N) at position 1 = log(0.174 / 0.05), which is positive because 0.174 > 0.05.
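The whole calculation for the example alignment can be sketched in Python. The function name `build_pssm` is hypothetical, and the logarithm base is an assumption (base 10 here, passed in as a parameter), since the slide only says "log". The pseudocount adjustment follows the arithmetic of the example, (count + pseudocount) / (N + 20 × pseudocount), with a uniform background of 0.05.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]


def build_pssm(seqs, pseudocount=1.0, background=None, log=math.log10):
    """Return a list (one dict per column) of log-odds scores.

    log defaults to base 10, which is an assumption; pass math.log2 or
    math.log to use another base.
    """
    n = len(seqs)
    if background is None:
        # Uniform background, as in the slide's simplification (f_i = 0.05).
        background = {aa: 1 / len(ALPHABET) for aa in ALPHABET}
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        scores = {}
        for aa in ALPHABET:
            # Pseudocount-adjusted frequency, e.g. (0 + 1) / (3 + 20) = 1/23.
            adjusted = (counts[aa] + pseudocount) / (n + len(ALPHABET) * pseudocount)
            scores[aa] = log(adjusted / background[aa])
        pssm.append(scores)
    return pssm


pssm = build_pssm(alignment)
print("Score(N) at position 1:", pssm[0]["N"])
print("Score(I) at position 1:", pssm[0]["I"])
```

Whatever base is chosen, the sign of each score is the same: residues seen more often than the background get positive scores, and the rest get negative ones.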

Creating a PSSM: Example
The matrix assigns positive scores to residues that appear more often than expected by chance and negative scores to residues that appear less often than expected by chance.

Using a PSSM
To search for matches to a PSSM, scan along the sequence using a window the length (L) of the PSSM. The matrix is slid along the sequence one residue at a time, and the scores of the residues in every region of length L are added. Regions whose scores are higher than an empirically predetermined threshold are reported.
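The sliding-window search can be sketched as follows. The function name `scan_with_pssm`, the toy matrix, the threshold, and the default score for residues missing from a column are all made-up values for illustration, not taken from the slides.

```python
def scan_with_pssm(sequence, pssm, threshold, default=-1.0):
    """Slide a window of len(pssm) along sequence and report hits.

    pssm is a list of dicts (one per position) mapping residue -> score.
    Residues missing from a column get the (hypothetical) default score.
    Returns a list of (start_index, score) for windows above threshold.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add the position-specific score of each residue in the window.
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if score > threshold:
            hits.append((start, score))
    return hits


# Toy three-position PSSM with invented scores.
toy_pssm = [
    {"N": 2.0, "I": -1.0},
    {"I": 1.0, "T": 0.5},
    {"A": 0.5, "E": 0.5, "T": 0.5},
]

hits = scan_with_pssm("GGNIAGG", toy_pssm, threshold=3.0)
print(hits)
```

In practice the threshold would be chosen empirically, as the slide notes, for example by scoring known family members and unrelated sequences.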

Advantages of PSSM
–Weights sequence according to observed diversity specific to the family of interest.
–Minimal assumptions.
–Easy to compute.
–Can be used in comprehensive evaluations.

More sophisticated PSSMs
From less to more complicated:
1. PSSM with pseudocounts.
2. Giving pseudocounts less weight when more alignment data is available.
3. Weight pseudocount amino acids by their frequency of occurrence in proteins.
4. Instead of giving pseudocounts all the same value, weight them by their similarity to the consensus (like BLOSUM62 does) at each position (PSI-BLAST method).
5. Combine 2 & 4 (Dirichlet mixture method).
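Methods 2 and 3 can be illustrated together with one common form of background-weighted pseudocounts. This is a sketch under assumptions: the formula (count + B × f) / (N + B), the name `smoothed_probability`, and the total pseudocount weight B are illustrative choices rather than the exact scheme of any particular program.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def smoothed_probability(count, n_seqs, background_freq, total_pseudocount=1.0):
    """Estimate a residue probability at one alignment position.

    total_pseudocount (B) is shared among residues in proportion to their
    background frequency (method 3). Because B is fixed while n_seqs grows,
    the pseudocounts matter less as more sequences are aligned (method 2).
    """
    return (count + total_pseudocount * background_freq) / (n_seqs + total_pseudocount)


# Position 1 of the example alignment: N observed 3 times in 3 sequences,
# with the uniform background of 0.05 used in the slides.
column = {aa: 0 for aa in ALPHABET}
column["N"] = 3
probs = {aa: smoothed_probability(c, 3, 0.05) for aa, c in column.items()}
print(probs["N"], probs["I"])

# An unseen residue gets less probability as the alignment grows.
print(smoothed_probability(0, 3, 0.05), smoothed_probability(0, 300, 0.05))
```

With this form the estimated probabilities in a column still sum to 1, so they can be divided by the background frequencies and log-transformed in the usual way.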

A PSSM column with a perfectly conserved isoleucine, with different methods used to calculate the scores. [Figure panels: Method 1 and standard BLOSUM62 matrix; Method 5]

Using Hidden Markov models to describe sequence alignment profiles
A profile HMM can represent a sequence alignment profile similar to how a PSSM does. Like a PSSM, a profile HMM includes information on the amino acid consensus at each position in the alignment. A profile HMM also has position-specific scores for gap insertion and extension.

Background: Creating HMMs
To create an HMM to model data we need to determine two things:
–The structure/topology of the HMM: states and transitions.
–The values of the parameters: emission and transition probabilities.
Determining the parameters is called “training”.

An HMM structure/topology
M = match state (score the aa in the sequence at this position in the profile)
I = insertion (w.r.t. profile - insert gap characters in profile)
D = deletion (w.r.t. sequence - insert gap characters in sequence)
M1 is the first aa in the profile, M2 is the second, etc.
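One way to make the topology concrete is to list the state-to-state transitions for a short profile. The sketch below is a simplified illustration, not HMMER's exact architecture: it uses only the seven transition types named in the HMMER example (m->m, m->i, m->d, i->m, i->i, d->m, d->d), omits begin and end states, and the function name `profile_hmm_transitions` is made up.

```python
def profile_hmm_transitions(length):
    """List (source, destination) state pairs for a profile of given length.

    States are named M1..Mn (match), Ik (insert after position k) and
    Dk (delete position k). Begin/end states are left out for brevity.
    """
    transitions = []
    for k in range(1, length):
        transitions += [
            (f"M{k}", f"M{k + 1}"),  # m->m: next residue matches next position
            (f"M{k}", f"I{k}"),      # m->i: start an insertion after position k
            (f"M{k}", f"D{k + 1}"),  # m->d: skip profile position k+1
            (f"I{k}", f"M{k + 1}"),  # i->m: end the insertion
            (f"I{k}", f"I{k}"),      # i->i: extend the insertion
            (f"D{k}", f"M{k + 1}"),  # d->m: end the deletion
            (f"D{k}", f"D{k + 1}"),  # d->d: extend the deletion
        ]
    return transitions


transitions = profile_hmm_transitions(4)
for src, dst in transitions:
    print(f"{src} -> {dst}")
```

Each of these transitions carries its own probability, which is what gives a profile HMM position-specific gap opening and extension costs.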

Example HMMER parameters
[Figure: excerpt of a HMMER model file; numeric values not recovered. Recoverable labels: NULE (...); HMM header with residue columns A C D E F G H (...); transition columns m->m m->i m->d i->m i->i d->m d->d b->m m->e; some entries shown as *; the record ends with //.]

A profile HMM with match state probabilities shown. “PATH” is the consensus sequence.

Building a profile HMM
–Pick an HMM structure/topology.
–Estimate initial parameters.
–Train the HMM by running sequences through it. Transitions that get used are given higher probabilities, those rarely used are given lower probabilities.
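The idea that frequently used transitions end up with higher probabilities can be sketched by counting. This is a deliberate simplification: it assumes the state path of each training sequence is already known (for example, read off an existing alignment), whereas training on unaligned sequences uses iterative procedures. The function name `estimate_transitions`, the toy topology, and the paths are hypothetical.

```python
from collections import defaultdict


def estimate_transitions(paths, allowed, pseudocount=1.0):
    """Turn observed state paths into transition probabilities.

    paths:   list of state-name lists, one per training sequence.
    allowed: dict mapping each source state to its permitted destinations.
    A pseudocount keeps unused (but allowed) transitions above zero.
    """
    counts = defaultdict(float)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            if dst not in allowed.get(src, ()):
                raise ValueError(f"transition {src} -> {dst} is not in the topology")
            counts[(src, dst)] += 1
    probs = {}
    for src, dsts in allowed.items():
        total = sum(counts[(src, d)] + pseudocount for d in dsts)
        for d in dsts:
            # Normalise so the outgoing probabilities of src sum to 1.
            probs[(src, d)] = (counts[(src, d)] + pseudocount) / total
    return probs


# Toy topology around the first profile position.
allowed = {"M1": ["M2", "I1", "D2"], "I1": ["M2", "I1"]}
# Two sequences go straight from M1 to M2; one has an insertion.
paths = [["M1", "M2"], ["M1", "M2"], ["M1", "I1", "M2"]]

probs = estimate_transitions(paths, allowed)
for (src, dst), p in sorted(probs.items()):
    print(f"{src} -> {dst}: {p:.3f}")
```

Emission probabilities for the match and insert states can be estimated the same way, by counting which residues each state emits.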

Protein profile HMMs
Better (in theory) representations than PSSMs.
–More complicated.
–Not hand-tuned by curators.
Used in some protein profile databases:
–Pfam
–SMART
Difficult to describe in human readable formats (Schuster-Böckler et al., 2004).