1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.

Slides:



Advertisements
Similar presentations
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Advertisements

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Sequence analysis course
Introduction to Bioinformatics Algorithms Sequence Alignment.
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Scoring matrices Identity PAM BLOSUM.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Substitution matrices
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Basics of Sequence Alignment and Weight Matrices and DOT Plot
Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
Thursday and Friday Dr Michael Carton Formerly VO’F group, now National Disease Surveillance Centre (NDSC) Wed (tomorrow) 10am - this suite booked for.
An Introduction to Bioinformatics
Substitution Numbers and Scoring Matrices
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Tutorial 4 Substitution matrices and PSI-BLAST 1.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Pairwise Sequence Analysis-III
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Techniques for Protein Sequence Alignment and Database Searching (part2) G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Introduction to bioinformatics lecture 8
DNA sequences alignment measurement Lecture 13. Introduction Measurement of “strength” alignment Nucleic acid and amino acid substitutions Measurement.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1)From which.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Alignment IV BLOSUM Matrices
1-month Practical Course
Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment
1-month Practical Course Genome Analysis Iterative homology searching
Presentation transcript:

1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands ibi.vu.nl C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

Amino acid exchange matrices How do we get one? And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure 20  20

A 2 R -2 6 N D C Q E G H I L K M F P S T W Y V B Z A R N D C Q E G H I L K M F P S T W Y V B Z PAM250 matrix amino acid exchange matrix (log odds) Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation

Amino acids are not equal: 1. Some are easily substituted because they have similar: physico-chemical properties chemical structure 2. Some mutations between amino acids occur more often due to similar codons The two above observations give us ways to define substitution matrices Amino acid exchange matrices

The 20 common amino acids

Substitution matrices ACGT A C G T A 2 R -2 6 N D C Q E G H I L K M F P S T W Y V B Z A R N D C Q E G H I L K M F P S T W Y V B Z

A 2 R -2 6 N D C Q E G H I L K M F P S T W Y V B Z A R N D C Q E G H I L K M F P S T W Y V B Z PAM250 matrix WR exchange is too large (due to paucity of data)

PAM model The scores derived through the PAM model are an accurate description of the information content (or the relative entropy) of an alignment (Altschul, 1991). PAM-1 corresponds to about 1 million years of evolution PAM-120 has the largest information content of the PAM matrix series (Altschul, 1991) PAM-250 is the traditionally most popular matrix

Summary Dayhoff’s PAM-matrices  Derived from global alignments of closely related sequences.  Matrices for greater evolutionary distances are extrapolated from those for lesser ones.  The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.  Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples.  Extensions of Dayhoff’s methodology: > Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275). > Gonnett and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology. > Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.

The BLOSUM matrices (BLOcks SUbstitution Matrix)  The BLOSUM series of matrices were created by Steve Henikoff and colleagues (PNAS 89:10915).  Derived from local, un-gapped alignments of distantly related sequences.  All matrices are directly calculated; no extrapolations are used.  Again: the observed frequency of each pair is compared to the expected frequency (which is essentially the product of the frequencies of each residue in the dataset). Then: Log-odds matrix.

The Blocks Database  The Blocks Database contains multiple alignments of conserved regions in protein families.  Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.  The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.  The database can be searched by and World Wide Web (WWW) servers ( to classify protein and nucleotide sequences.

The Blocks Database Gapless alignment blocks

The BLOSUM series  BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.  The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks (in the BLOCKS database) used to construct the matrix (all blocks have >=62% sequence identity);  No extrapolations are made in going to higher evolutionary distances  High number - closely related sequences Low number - distant sequences  BLOSUM62 is the most popular: best for general alignment.

BLOSUM62 Matrix, log-odds representation

The log-odds matrix for BLOSUM62

 Based on an explicit evolutionary model  Derived from small, closely related proteins with ~15% divergence  Higher PAM numbers to detect more remote sequence similarities  Errors in PAM 1 are scaled 250X in PAM 250  Based on empirical frequencies  Uses much larger, more diverse set of protein sequences (30-90% ID)  Lower BLOSUM numbers to detect more remote sequence similarities  Errors in BLOSUM arise from errors in alignment PAM versus BLOSUM

Comparing exchange matrices  To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair.

BlosumEntropyPAMEntropy To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value which describes the amount of information available per aligned residue pair. As two protein sequences diverge over time, information about the evolutionary process at work is lost (e.g. back mutations). Therefore, matrices with larger entropy values are more sensitive to less divergent sequences, while matrices with smaller entropy values are more sensitive to distantly related sequences. Comparing exchange matrices

Specialized matrices  Claverie (J.Mol.Biol 234:1140) developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences. These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs which blindly translate DNA (e.g. BLASTX, TBLASTN).

Specialized matrices  Rather than starting from alignments generated by sequence comparison, Rissler et al (1988) and later Overington et al (1992) only considered proteins for which an experimentally determined three dimensional structure was available.  They then aligned similar proteins on the basis of their structure rather than sequence and used the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Rissler or Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures (particularly in the Rissler et al study) limited the reliability of their statistics.  Overington et al (1992) developed further matrices that consider the local environment of the amino acids.