Sequence Alignment Csc 487/687 Computing for bioinformatics.


Refining the Scoring Scheme - Scoring Matrix: The aim is to measure the relative probability of any particular substitution. The relative frequencies of such changes are used to form a scoring matrix for substitutions, in which a likely change scores higher than a rare one.

Scoring matrix for nucleic acid sequences: A simple scheme for substitutions is +1 for a match and -1 for a mismatch. A more complicated scheme is based on the higher frequency of transition mutations than transversion mutations. Transitions are a <-> g and t <-> c (purine to purine, pyrimidine to pyrimidine); transversions are (a or g) <-> (t or c).
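The two nucleotide schemes can be sketched in a few lines of Python. The +1/-1 values are the ones given on the slide; the separate transition and transversion penalties (-1 and -2) and all function names are illustrative assumptions, not values from the course.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"t", "c"}


def simple_score(x, y):
    """Simple scheme: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def ts_tv_score(x, y, match=1, transition=-1, transversion=-2):
    """Penalise transversions more than transitions (penalty values are assumed)."""
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return transition if same_class else transversion


def score_ungapped(s, t, score):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(score(x, y) for x, y in zip(s, t))
```

For example, `score_ungapped("aagt", "agct", ts_tv_score)` adds one match, one transition, one transversion and one match.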

Refining the Scoring Scheme - Scoring Matrix: The scheme should return high values for alignments of homologous proteins. It should give higher scores to alignments of amino acids that are often seen in corresponding positions in homologous proteins.

Scoring Matrices - Importance of scoring matrices: Scoring matrices appear in all analyses involving sequence comparisons. The choice of matrix can strongly influence the outcome of the analysis. Scoring matrices implicitly represent a particular theory of relationships. Understanding the theory underlying a given scoring matrix can aid in making a proper choice.

Identity Matrix: The simplest type of scoring matrix.

     L  I  C  A
L    1  0  0  0
I       1  0  0
C          1  0
A             1

Similarity: It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar. [Figure: chemical structures of leucine and isoleucine] Should they get a 0 (non-identical), a 1 (identical), or something in between?

Scoring matrices: Give a score for each pair of amino acids. The score should reflect the degree of "biological relatedness" and the "probability" that two amino acids occurring in different sequences have a common ancestor. A scoring matrix should be symmetric.

Substitution matrices: Give the probability that an amino acid a is changed to amino acid b (in a certain evolutionary time). A substitution matrix is generally not symmetric.

Scoring matrices, types: the identity matrix (scoring 0/1); matrices using the distances in the genetic code; matrices using amino acid similarities based on physico-chemical properties; scoring matrices based on experimental data (PAM, BLOSUM).

DAYHOFF's PAM MATRICES: Based on experimental data and defined for a given evolutionary time interval. Sequences from 34 superfamilies were used.
1. Divide the sequences into groups (71) of homologous sequences, and make a multiple alignment for each of them.
2. Construct evolutionary trees for each group, and estimate the mutations that have occurred.
3. Define an evolutionary model to explain the evolution.
4. Construct substitution matrices: for each amino acid pair (a, b), an estimate of the probability that an amino acid a has mutated to an amino acid b in the time interval.
5. Construct scoring matrices from the substitution matrices.
Note that a and b are variables that stand for any amino acid.

Example

The model of evolution: The probability of a mutation in a position is independent of the position and its neighbouring residues, and of previous mutations in the position. A biological (evolutionary) clock is assumed, meaning a constant rate of mutation. This means that evolutionary time can be measured in the number of mutations (here, substitutions). The measure is PAM (Point Accepted Mutations): 1 PAM is one accepted mutation per 100 residues.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix: A 1-PAM unit is equivalent to 1 mutation found in a stretch of 2 aligned sequences, each containing 100 amino acids.

Example 1:
..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
  |||||||||||||| |||||||||||||||||||||||||||||||||||
..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
length = 100, 1 mismatch, PAM distance = 1

A k-PAM unit is equivalent to k 1-PAM units (or $M^k$).

Substitution matrix $M^1$

Calculate $M^z$ by matrix multiplication, shown here for z = 2. z = 2 means two mutations per 100 residues. A residue a can have changed to residue b after 2 PAM for the following reasons:
1. a is mutated to b in the first PAM and unchanged in the next, with probability $M_{ab} M_{bb}$.
2. a is unchanged in the first PAM and changed to b in the next, with probability $M_{aa} M_{ab}$.
3. a is mutated to an amino acid x in the first PAM, and then to b in the next, with probability $M_{ax} M_{xb}$, x being any amino acid other than a and b.
These three cases are mutually exclusive, hence

$$(M^2)_{ab} = M_{ab} M_{bb} + M_{aa} M_{ab} + \sum_{x \neq a,b} M_{ax} M_{xb} = \sum_{x} M_{ax} M_{xb}$$
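The three cases can be checked numerically with a small sketch. The 3-letter alphabet and the probabilities in `M1` below are made up for illustration (a real PAM matrix is 20 x 20); the helper names are also assumptions.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][x] * B[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, z):
    """Return M multiplied by itself z times (z >= 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(z):
        result = mat_mul(result, M)
    return result


# Hypothetical 1-PAM-like matrix: row a holds the probabilities of a -> b,
# so each row sums to 1.
M1 = [[0.98, 0.01, 0.01],
      [0.02, 0.97, 0.01],
      [0.01, 0.03, 0.96]]

M2 = mat_pow(M1, 2)

# The three cases for a = 0, b = 1, with x = 2 as the only other residue.
a, b, x = 0, 1, 2
three_cases = M1[a][b] * M1[b][b] + M1[a][a] * M1[a][b] + M1[a][x] * M1[x][b]
```

Here `three_cases` and `M2[a][b]` agree up to floating-point rounding, and repeating the multiplication gives matrices for larger PAM distances.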

The final scoring matrix is the log-odds scoring matrix: $S(a,b) = 10 \log_{10}(M_{ab} / P_b)$, where a is the original amino acid, b is the replacement amino acid, $M_{ab}$ is the entry of the mutational probability matrix, and $P_b$ is the frequency of amino acid b.
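As a small worked sketch of the log-odds formula, the function below evaluates it for a single pair. The function name and the input numbers are illustrative assumptions, not values from a real PAM matrix.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(M_ab / P_b)."""
    return 10 * math.log10(m_ab / p_b)


# Made-up inputs: the substitution is twice as likely as chance,
# exactly as likely as chance, and ten times less likely than chance.
more_likely = log_odds_score(0.10, 0.05)
as_chance = log_odds_score(0.05, 0.05)
less_likely = log_odds_score(0.01, 0.10)
```

A positive score means the substitution is seen more often than expected from the frequency of b alone, zero means exactly as often, and a negative score means less often.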

$M^{250}$

PAM-250 scoring matrix

BLOSUM (Henikoff & Henikoff): These matrices perform best in identifying distant relationships. They make use of the much larger amount of data that has become available since Dayhoff's work, and are based on the BLOCKS database of aligned protein sequences.

BLOSUM (Henikoff & Henikoff): Make multiple alignments and discover blocks not containing gaps (over 2,000 blocks were used).

...KIFIMK GDEVK...
...NLFKTR GDSKK...
   KIFKTK GDPKA
   KLFESR GDAER
   KIFKGR GDAAK

For each column in each block, they counted the number of occurrences of each pair of amino acids (210 different pairs, 20*21/2). A block of length w from an alignment of n sequences has wn(n-1)/2 occurrences of amino acid pairs. Let $h_{ab}$ be the number of occurrences of the pair (a, b) in all blocks ($h_{ab} = h_{ba}$), and let T be the total number of pairs. Then $f_{ab} = h_{ab} / T$.
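The pair counting can be sketched as follows, using the first block from the slide as input. The function names are assumptions, and this covers only the counting step, not the full BLOSUM construction.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino acid pairs, column by column, in an ungapped block."""
    h = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            # Sort so that (a, b) and (b, a) share one key, i.e. h_ab = h_ba.
            h[tuple(sorted((x, y)))] += 1
    return h


def pair_frequencies(h):
    """f_ab = h_ab / T, with T the total number of pairs."""
    total = sum(h.values())
    return {pair: n / total for pair, n in h.items()}


block = ["KIFIMK", "NLFKTR", "KIFKTK", "KLFESR", "KIFKGR"]
h = count_pairs(block)
f = pair_frequencies(h)
T = sum(h.values())
```

With w = 6 columns and n = 5 sequences, T equals wn(n-1)/2 = 60, matching the formula above.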

Gap weighting in CLUSTAL-W: For aligning DNA sequences, the identity matrix is used for substitutions, with gap penalties of 10 for gap initiation and 0.1 for gap extension by one residue. For aligning protein sequences, the BLOSUM62 matrix is used, with gap penalties of 11 for gap initiation and 1 for gap extension by one residue.
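A gap initiation penalty combined with a per-residue extension penalty gives an affine gap cost. The sketch below assumes that every residue in the gap, including the first, pays the extension penalty; programs differ on this convention, so treat the formula and the function name as assumptions rather than a description of CLUSTAL-W's internals.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap of `length` residues: gap_open + gap_extend * length.

    Assumed convention: the extension penalty is charged for every gapped
    residue, including the first one.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Penalty values quoted on the slide.
dna_gap = affine_gap_penalty(5, gap_open=10, gap_extend=0.1)
protein_gap = affine_gap_penalty(3, gap_open=11, gap_extend=1)
```

Because opening a gap costs far more than extending it, one long gap is penalised less than several short gaps covering the same number of residues.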