Sequence Alignment Csc 487/687 Computing for bioinformatics.

Refining the Scoring Scheme - Scoring Matrix
A scoring matrix measures the relative probability of each particular substitution. The relative frequencies of such changes are used to form the scoring matrix for substitutions, so that a likely change scores higher than a rare one.

Scoring matrix for nucleic acid sequences
A simple scheme for substitutions: +1 for a match, -1 for a mismatch.
A more complicated scheme is based on the higher frequency of transition mutations than transversion mutations. Transitions are a↔g (purine to purine) and t↔c (pyrimidine to pyrimidine); transversions exchange a purine (a or g) with a pyrimidine (t or c).
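A transition/transversion scheme of this kind can be sketched in a few lines of Python. The penalty values below (-1 for a transition, -2 for a transversion) and the function names are illustrative assumptions, not values given on the slide.

```python
# Purines and pyrimidines; a transition stays within one class.
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair (penalty values are assumed)."""
    x, y = x.lower(), y.lower()
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def ungapped_score(s, t):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(x, y) for x, y in zip(s, t))
```

With these assumed values, aligning "acgt" against "gcga" gives one transition (a/g), two matches, and one transversion (t/a).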

Refining the Scoring Scheme - Scoring Matrix
The scheme should return high values for alignments of homologous proteins. It should give higher scores to alignments of amino acids that are often seen in corresponding positions in homologous proteins.

Scoring Matrices
Importance of scoring matrices:
Scoring matrices appear in all analyses involving sequence comparisons.
The choice of matrix can strongly influence the outcome of the analysis.
Scoring matrices implicitly represent a particular theory of relationships.
Understanding the theory underlying a given scoring matrix can aid in making a proper choice.

Identity Matrix
Simplest type of scoring matrix

        L   I   C   A
    L   1   0   0   0
    I       1   0   0
    C           1   0
    A               1

Similarity
It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
[Figure: chemical structures of leucine and isoleucine]
Should leucine and isoleucine get a 0 (non-identical), a 1 (identical), or something in between?

Scoring matrices
Give a score for each pair of amino acids.
Should reflect the degree of "biological relatedness" and the "probability" that two amino acids occurring in different sequences have a common ancestor.
Should be symmetric.

Substitution matrices
Give the probability that an amino acid a is changed to amino acid b (in a certain evolutionary time).
Are generally not symmetric.

Scoring matrices
Identity matrix (scoring 0/1)
Use of the distances in the genetic code
Use of amino acid similarities based on physico-chemical properties
Scoring matrices based on experimental data (PAM, BLOSUM)

DAYHOFF's PAM-MATRICES
Based on experimental data, over a given evolutionary time interval.
Sequences from 34 superfamilies were used.
1. Divide the sequences into groups (71) of homologous sequences, and make a multiple alignment for each of them.
2. Construct evolutionary trees for each group, and estimate the mutations that have occurred.
3. Define an evolutionary model to explain the evolution.
4. Construct substitution matrices: for each amino acid pair (a,b), an estimate of the probability that amino acid a has mutated to amino acid b in the time interval.
5. Construct scoring matrices from the substitution matrices.
Note that a and b are variables that stand for any amino acid.

Example

The model of the evolution
The probability of a mutation in a position is independent of:
the position and its neighbouring residues
previous mutations in the position
A biological (evolutionary) clock is assumed, meaning a constant rate of mutation. This means that evolutionary time can be measured in number of mutations (here substitutions).
The measure is PAM (Point Accepted Mutation): 1 PAM is one accepted mutation per 100 residues.

The Point-Accepted-Mutation (PAM) model of evolution and the PAM scoring matrix
A 1-PAM unit is equivalent to 1 mutation found between 2 aligned sequences, each containing 100 amino acids.
Example 1:

    ..CNGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..
      |||||||||||||| |||||||||||||||||||||||||||||||||||
    ..CNGTTDQVDKIVKIRNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQV..

length = 100, 1 mismatch, PAM distance = 1
A k-PAM unit is equivalent to k 1-PAM units (or $M^k$).
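As a rough illustration of the example above, the observed number of differences per 100 aligned residues can be counted directly. This is only a sketch: the function name is hypothetical, the sequences are assumed to be aligned without gaps, and observed differences approximate the PAM distance only for closely related sequences, because repeated substitutions at the same position go uncounted.

```python
def differences_per_100(seq_a, seq_b):
    """Observed mismatches per 100 aligned positions (ungapped input assumed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return 100.0 * mismatches / len(seq_a)


# Toy data: two 100-residue sequences differing at a single position.
toy_a = "A" * 100
toy_b = "A" * 14 + "R" + "A" * 85
toy_distance = differences_per_100(toy_a, toy_b)
```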

Substitution matrix $M^1$

Calculate $M^z$ by matrix multiplication, shown for z = 2
z = 2 means two mutations per 100 residues. A residue a can be changed to residue b after 2 PAM for the following reasons:
1. a is mutated to b in the first PAM and unchanged in the next, with probability $M_{ab} M_{bb}$
2. a is unchanged in the first PAM and changed to b in the next, with probability $M_{aa} M_{ab}$
3. a is mutated to an amino acid x in the first PAM and then to b in the next, with probability $M_{ax} M_{xb}$, x being any amino acid other than a and b
These three cases are mutually exclusive, hence

$$(M^2)_{ab} = M_{ab} M_{bb} + M_{aa} M_{ab} + \sum_{x \neq a,b} M_{ax} M_{xb} = \sum_{x} M_{ax} M_{xb}$$
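The sum over the three cases is exactly one entry of a matrix product, which a small sketch can confirm. The 3-letter alphabet and the probabilities in `M1` below are made-up toy values (a real PAM matrix is 20 x 20), and the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][x] * B[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(M, k):
    """Return M multiplied by itself k times (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    result = M
    for _ in range(k - 1):
        result = matmul(result, M)
    return result


# Toy 1-step substitution matrix; row a, column b holds P(a -> b).
# Each row sums to 1. The values are invented for illustration.
M1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.03, 0.96],
]

M2 = matrix_power(M1, 2)

# The three cases from the text, for a = 0, b = 1 and the one other state x = 2.
a, b, x = 0, 1, 2
by_cases = M1[a][b] * M1[b][b] + M1[a][a] * M1[a][b] + M1[a][x] * M1[x][b]
```

Here `M2[a][b]` and `by_cases` agree up to floating-point rounding, and each row of `M2` still sums to 1.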

Final Scoring Matrix is the Log-Odds Scoring Matrix

$$S(a,b) = 10 \log_{10}\left(\frac{M_{ab}}{P_b}\right)$$

a: original amino acid
b: replacement amino acid
$M_{ab}$: entry of the mutational probability matrix
$P_b$: frequency of amino acid b
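The log-odds formula can be evaluated directly. In the sketch below the function name and the input numbers are illustrative assumptions, and rounding to the nearest integer is included as an assumed convention (published PAM tables list integer scores).

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(M_ab / P_b), rounded to the nearest integer."""
    if m_ab <= 0 or p_b <= 0:
        raise ValueError("probabilities must be positive")
    return round(10 * math.log10(m_ab / p_b))


# Made-up inputs: a substitution seen ten times more often than chance,
# and one seen exactly as often as chance.
favoured = log_odds_score(0.1, 0.01)
neutral = log_odds_score(0.05, 0.05)
```

A positive score marks a substitution that occurs more often than expected from the frequency of b alone, a score of 0 marks one that occurs as often as expected, and a negative score marks one that occurs less often.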

PAM-250 scoring matrix

BLOSUM (Henikoff & Henikoff)
Performs best in identifying distant relationships.
Makes use of the much larger amount of data that has become available since Dayhoff's work.
Based on the BLOCKS database of aligned protein sequences.

BLOSUM (Henikoff & Henikoff)
Make multiple alignments and discover blocks not containing gaps (over 2,000 blocks were used).

    ...KIFIMK.......GDEVK......
       NLFKTR       GDSKK...
       KIFKTK       GDPKA
       KLFESR       GDAER
       KIFKGR       GDAAK

For each column in each block, they counted the number of occurrences of each pair of amino acids (210 different pairs, 20*21/2).
A block of length w from an alignment of n sequences has $wn(n-1)/2$ occurrences of amino acid pairs.
Let $h_{ab}$ be the number of occurrences of the pair (a,b) in all blocks ($h_{ab} = h_{ba}$).
Let T be the total number of pairs.
$f_{ab} = h_{ab}/T$
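The pair counting can be sketched for a single block. The rows below are the first block as read from the slide; the function and variable names are hypothetical, and the sketch covers only the counting step described here, not the sequence clustering that the full BLOSUM procedure also applies.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered amino acid pairs in every column of a gap-free block."""
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of a block must have the same length")
    h = Counter()
    for col in range(width):
        column = [row[col] for row in block]
        for x, y in combinations(column, 2):
            # Sort so that (a, b) and (b, a) share one key: h_ab = h_ba.
            h[tuple(sorted((x, y)))] += 1
    return h


block = ["KIFIMK", "NLFKTR", "KIFKTK", "KLFESR", "KIFKGR"]
h = count_pairs(block)
T = sum(h.values())                        # total number of pairs
f = {pair: n / T for pair, n in h.items()}  # f_ab = h_ab / T
```

For this block w = 6 and n = 5, so T should equal 6 * 5 * 4 / 2 = 60, which matches the count the code produces.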

Gap weighting
CLUSTAL-W
For aligning DNA sequences:
Identity matrix for substitution
Gap penalties: 10 for gap initiation and 0.1 for gap extension by one residue
For aligning protein sequences:
BLOSUM62 matrix
Gap penalties: 11 for gap initiation and 1 for gap extension by one residue
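A penalty with separate initiation and extension terms is an affine gap penalty. The sketch below assumes that a gap of length L costs the initiation penalty plus L times the extension penalty; some programs instead charge extension only for L - 1 residues, so this convention and the function name are assumptions rather than a description of CLUSTAL-W internals.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap: open + length * extend (assumed convention)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


# Penalties for a gap of 3 residues, using the values quoted above.
dna_gap = affine_gap_penalty(3, gap_open=10, gap_extend=0.1)
protein_gap = affine_gap_penalty(3, gap_open=11, gap_extend=1)
```

Because extension is much cheaper than initiation, one long gap is penalised less than several short gaps of the same total length.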
