Scoring Matrices, June 19, 2008


Scoring Matrices
June 19, 2008
Learning objectives: Understand how scoring matrices are constructed.
Workshop: Use different BLOSUM matrices in the Dotter program to determine their effects on output when comparing squid p53 and human p53. Create your own scoring matrix and use it to compare two protein sequences. Explain to the instructor the rationale behind your scoring matrix.

Scoring Matrices
Scoring matrices appear in all analyses involving sequence comparisons. Each scoring matrix implicitly represents a particular theory of relationships. Understanding the theory underlying a given scoring matrix helps in choosing the right matrix.

Scoring Matrices
By convention, the entries of a scoring matrix are indexed by the numbers of their row and column. For example, $M_{12}$ refers to the entry at the first row and the second column. In general, $M_{ij}$ refers to the entry at the ith row and the jth column.
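
As a minimal sketch of this indexing convention, a scoring matrix can be stored as a nested list and looked up by row and column. The three-letter alphabet, the scores, and the function names below are hypothetical and chosen only for illustration. Python counts from 0, so the entry at the first row and second column is `M[0][1]`.

```python
# Hypothetical 3-letter alphabet and made-up scores, for illustration only.
ALPHABET = "ACD"
M = [
    [4, 0, -2],   # row 1: A paired with A, C, D
    [0, 9, -3],   # row 2: C paired with A, C, D
    [-2, -3, 6],  # row 3: D paired with A, C, D
]

def entry(i, j):
    """Return the entry at row i, column j, counting from 1 as on the slide."""
    return M[i - 1][j - 1]

def pair_score(a, b):
    """Look up the score for aligning residue a with residue b."""
    return M[ALPHABET.index(a)][ALPHABET.index(b)]

def alignment_score(seq1, seq2):
    """Sum the pair scores over two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))
```

For example, `alignment_score("ACD", "ADD")` adds the A/A, C/D and D/D entries of this toy matrix.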

Two major scoring matrices for amino acid sequence comparisons
PAM: derived from sequences known to be closely related (e.g., chimpanzee and human). Generally ranges from PAM 1 to PAM 500.
BLOSUM: derived from sequences not closely related (e.g., E. coli and human). Ranges from BLOSUM 10 to BLOSUM 100.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
Started by Margaret Dayhoff, 1978.
A series of matrices describing the extent to which two amino acids have been interchanged in evolution.
The PAM 1 scoring matrix was obtained by aligning very similar sequences. Other PAMs were obtained by mathematical extrapolation.
Assumptions: 1) neighbor independence; 2) positional independence; and 3) historical independence.
Dayhoff, M. O., Atlas of Protein Sequence and Structure, Natl. Biomed. Res. Found., Silver Spring MD, 1978.

Protein families used to construct Dayhoff’s scoring matrix
Protein               PAMs per 100 million years
IgG kappa C region    37
Kappa casein          33
Serum albumin         26
Cytochrome C          0.9
Histone H3            0.14
Histone H4            0.10

Calculation of relative mutability of amino acid
1. Find the frequency of amino acid change at a certain position in the protein.
2. Divide this “change frequency” by the frequency with which the amino acid occurs in all proteins. This gives the mutability of the amino acid.
3. Multiply the alanine mutability by a factor to get the value 100.
4. Multiply the 19 other amino acid mutabilities by the same factor.
Result: relative mutabilities.
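
The scaling procedure above can be sketched as follows. The counts are made up for illustration and are not Dayhoff's data; the function name and the dictionary layout are also assumptions.

```python
def relative_mutabilities(changes, occurrences, reference="Ala"):
    """Scale each amino acid's mutability so that the reference equals 100.

    changes[aa]     -- how often aa was observed to change (hypothetical counts)
    occurrences[aa] -- how often aa occurs overall (hypothetical counts)
    """
    # Mutability = change frequency divided by frequency of occurrence.
    raw = {aa: changes[aa] / occurrences[aa] for aa in changes}
    # One common factor brings the reference amino acid to 100.
    factor = 100.0 / raw[reference]
    return {aa: raw[aa] * factor for aa in raw}

# Made-up counts for three amino acids, not Dayhoff's data.
changes = {"Ala": 50, "Asn": 67, "Trp": 9}
occurrences = {"Ala": 1000, "Asn": 1000, "Trp": 1000}
rel = relative_mutabilities(changes, occurrences)
```

Because every value is multiplied by the same factor, the ratios between amino acids are unchanged; only the scale is fixed to alanine = 100.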

Relative mutabilities of amino acids
Asn 134    Ser 120    Asp 106    Glu 102    Ala 100
Thr  97    Ile  96    Met  94    Gln  93    Val  74
His  66    Arg  65    Lys  56    Pro  56    Gly  49
Tyr  41    Phe  41    Leu  40    Cys  20    Trp  18

Why are the mutabilities different?
An amino acid has a high mutability when a similar amino acid can replace it (e.g., Asp for Glu). Conversely, amino acids with low mutabilities are unique and cannot readily be replaced.

Tally all pairwise replacements
Next, tally the amino acid replacements "accepted" by natural selection in all pairwise sequence comparisons.

[Table: numbers of accepted point mutations, multiplied by 10. Axes: original amino acids and replacement amino acids, each in the order A R N D C Q E G H I L K M F P S T W Y V. Only one value survives in the transcript: 30, for the pair A and R.]

Creation of a mutation probability matrix
The accepted point mutation data from the previous slide and the mutability of each amino acid are used to create a mutation probability matrix:
$M_{ij} = \dfrac{m_j A_{ij}}{\sum_i A_{ij}}$
where $m_j$ is the mutability of amino acid j and $A_{ij}$ is the number of accepted point mutations between amino acids i and j.
$M_{ij}$ shows the probability that an original amino acid j (in columns) will be replaced by amino acid i (in rows) over a defined evolutionary interval. For PAM 1, an average of 1% of the amino acids have changed.
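
A minimal sketch of this construction, using a hypothetical three-residue alphabet and made-up numbers. Two assumptions go beyond the slide: each mutability `m[j]` is taken to be already scaled to the probability that residue j changes during the interval, and the diagonal is set to `1 - m[j]` so that every column sums to 1.

```python
def mutation_probability_matrix(A, m):
    """Build M where M[i][j] is the probability that original residue j
    (column) is replaced by residue i (row).

    A -- symmetric matrix of accepted point mutation counts (diagonal ignored)
    m -- m[j] is the probability that residue j changes during the interval
         (assumed already scaled; the scaling is not given on the slide)
    """
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Total accepted mutations involving original residue j.
        col_total = sum(A[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - m[j]  # assumption: residue j stays unchanged
            else:
                M[i][j] = m[j] * A[i][j] / col_total
    return M

# Hypothetical counts and mutabilities for a 3-residue alphabet.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
m = [0.010, 0.020, 0.005]
M = mutation_probability_matrix(A, m)
```

With these inputs the off-diagonal entries of column j add up to `m[j]`, so the probability of any change and the probability of no change together account for every outcome.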

PAM1 mutational probability matrix
[Table not recovered. Axes: original amino acid and replacement amino acid. The values in each column sum to 10,000.]

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
[Table not recovered. Columns: observed % amino acid difference; evolutionary distance in PAMs.]

Final Scoring Matrix is the Log-Odds Score Matrix
$S(a,b) = 10 \log_{10}(M_{ab}/P_b)$
where a is the original amino acid, b is the replacement amino acid, $M_{ab}$ is the mutation probability matrix entry (from PAM 250), and $P_b$ is the normalized frequency of amino acid b.
Example: $S(a, \text{alanine}) = 10 \log_{10}(0.13/0.087) = 1.7$ (round to 2).
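
The alanine example can be checked with a few lines of Python. The function name is an assumption; the two input values are the ones quoted on the slide.

```python
import math

def log_odds_score(m_ab, p_b):
    """Return 10 * log10(M_ab / P_b) without rounding."""
    return 10 * math.log10(m_ab / p_b)

# Values from the slide's alanine example.
score = log_odds_score(0.13, 0.087)
rounded = round(score)
```

A positive score means the pair occurs more often than expected by chance at this evolutionary distance; a negative score means it occurs less often.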

At this evolutionary distance, there is a 13% chance that the second sequence will also have an alanine.

Summary of PAM Scoring Matrix
PAM = a unit of evolution (1 PAM = an average of 1 point mutation per 100 amino acids).
Accepted mutation means a fixed point mutation.
Based on a comparison of 71 groups of closely related proteins (>85% identity), yielding 1,572 changes.
Different PAM matrices are derived from the PAM 1 matrix by matrix multiplication.
The matrices are converted to log-odds matrices.
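
Deriving a higher PAM matrix by matrix multiplication can be sketched as below. The two-residue "PAM 1" matrix is hypothetical and only mimics the property that 1% of residues change; a real PAM 1 matrix is 20 by 20. The function names are assumptions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Return pam1 raised to the n-th power by repeated multiplication."""
    size = len(pam1)
    # Start from the identity matrix (zero evolutionary distance).
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result

# Hypothetical 2-residue "PAM 1": each column sums to 1, 1% of residues change.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = pam_n(pam1, 250)
```

Each multiplication applies one more PAM unit of change, so the columns still sum to 1 while the diagonal shrinks as the distance grows.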

BLOSUM Matrix (BLOcks SUbstitution Matrix)
Blocks Sum: created from the BLOCKS database.
A series of matrices describing the extent to which two amino acids are interchangeable in conserved structures of proteins.
The number in the series represents the threshold percent identity between sequences considered in the calculation (e.g., for BLOSUM62, sequences that are more than 62% identical are removed or merged into a single representative).

BLOSUM
BLOSUMs are built from distantly related sequences within conserved blocks of sequences.
BLOSUMs are built from the BLOCKS database (the BLOCKS database is a secondary database that derives information from the PROSITE family database).

BLOSUM (cont. 1)
Version 8.0 of the Blocks Database consists of 2884 blocks based on 770 protein families documented in PROSITE.
Hypothetical entry in a BLOCK record:
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC

Building BLOSUM Matrices
1. To build the BLOSUM 62 matrix, one must eliminate sequences that are identical in more than 62% of their amino acid positions. This is done either by removing sequences from the BLOCK or by finding a cluster of similar sequences and replacing the cluster with a single representative sequence.
2. Next, the probability for a pair of amino acids to be placed in the same column is calculated. In the block on the previous slide, this would be the probability of replacement of A with A, A with B, A with C, and B with C. This gives the value $q_{ij}$.
3. Next, one calculates the frequency with which the replacement amino acid exists in nature, $f_i$.

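Building BLOSUM Matrices (cont.)
4. Finally, we calculate the log-odds ratio $s_{ij} = \log_2(q_{ij}/e_{ij})$, where $e_{ij}$ is the frequency of the pair expected by chance from the background frequencies ($f_i^2$ when $i = j$, and $2 f_i f_j$ otherwise). This value is entered into the matrix.

Which BLOSUM to use?
BLOSUM    Identity
80        80%
62        62% (usually the default value)
35        35%
If you are comparing sequences that are very similar, use BLOSUM 80.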
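
Steps 2 to 4 can be sketched as follows for a gap-free block. The sketch assumes the expected pair frequency is derived from the background frequencies of both residues, as in Henikoff and Henikoff's formulation, and it skips the clustering of step 1. The one-column block and the function name are hypothetical, and pairs that never occur in the block get no score.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_scores(block):
    """Compute log2 odds scores from a gap-free block of equal-length sequences."""
    # Step 2: count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Step 3: background frequency of each residue among the counted pairs.
    f = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            f[a] += q_ab
        else:
            f[a] += q_ab / 2
            f[b] += q_ab / 2

    # Step 4: log-odds of observed versus expected pair frequency.
    scores = {}
    for (a, b), q_ab in q.items():
        expected = f[a] * f[b] if a == b else 2 * f[a] * f[b]
        scores[(a, b)] = math.log2(q_ab / expected)
    return q, dict(f), scores

# Hypothetical block: a single column holding nine A residues and one S.
block = ["A"] * 9 + ["S"]
q, f, scores = blosum_scores(block)
```

In this toy column the A/A pair is slightly rarer than chance predicts and the A/S pair slightly more common, so the first score is just below zero and the second just above.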

Which Scoring Matrix to use?
Matrices               Evolutionary distance    Sequences
PAM-1, BLOSUM-100      Small                    High identity within short sequences
PAM-250, BLOSUM-20     Large                    Low identity within long sequences

Workshop