Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen

Sequence alignment 1.Sequence alignment is the most important technique used in bioinformatics 2.Infer properties from one protein to another 1.Homologous sequences often have similar biological functions 3.Most information can be deduced from a sequence if the 3D-structure is known 4.3D-structure determination is very time consuming (X-ray, NMR) 1.Several mg of pure protein is required (> 100mg) 2.Make crystal, solve structure, 1-3 years 3.Large facilities are needed to produce X-ray 1.Rotating anode or synchrotron 5.Determining primary sequence is fast, cheap 6.Structure more conserved than sequence

Protein class & folds

What can we learn from sequence alignment Find similar sequence from another organism Information from the known sequence can be inherited Layers of conserved information: Structure > function > sequence where, ‘>’ means more conserved than Structure (3D) is the most conserved feature Proteins with different function may still share the same structure Proteins with different may still share the same function Often same function if 40-50% sequence identity Often same protein fold if above 30% seqience identity

Sequence alignment M V S T A M L T S A Antal identiske aa, % id ? Alignment score using identity matrix? Similar aa can be substituted therefore other types of substitution matrices are used. MVSTA M10000 V01000 S00100 T00010 A00001

Blosum & PAM matrices Blosum matrices are the most commonly used substitution matrices –Blosum50, Blosum62, blosum80 PAM - Percent Accepted Mutations –PAM-0 is the identity matrix –PAM-1 diagonal small deviations from 1, off-diag has small deviations from 0 –The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed –PAM-250 is PAM-1 multiplied by itself 250 times

Blocks Database

BLOSUM = BLOck SUbstitution Matrices Focus on conserved domains, MSA's (multiple sequence alignment) are ungapped blocks. Compute pairwise amino acid alignment counts –Count amino acid replacement frequencies directly from columns in blocks –Sample bias: Cluster sequences that are x% similar. Do not count amino acid pairs within a cluster. Do count amino acid pairs across clusters, treating clusters as an "average sequence". Normalize by the number of sequences in the cluster. BLOSUM x matrices –Sequences that are x% similar were clustered during the construction of the matrix.

Log-odds scores Log-odds scores are given by Log( Observation/Expected) The log-odd score of matching amino acid j with amino acid i in an alignment is where P ij is the frequency of observation i aligned with j, and Q i, Q j are the frequency if amino acids i and j in the data set. The log-odd score is (in bit units) –Where, Log 2 (x)=log n (x)/log n (2) –S has been normalized to half bits, therefore the factor 2

Log-odds scores BLOSUM is a log-likelihood matrix: Likelihood of observing j given you have i is –P(j|i) = P ij /P i The prior likelihood of observing j is –Q j, which is simply the frequency The log-likelihood score is –S ij = 2log 2 (P(j|i)/log(Q j ) = 2log 2 (P ij /(Q i Q j )) –Where, Log 2 (x)=log n (x)/log n (2) –S has been normalized to half bits, therefore the factor 2

Example of a scoring matrix BLOSUM80 A R N D C Q E G H I L K M F P S T W Y V A 7 -3 -3 -3 -1 -2 -2 0 -3 -3 -3 -1 -2 -4 -1 2 0 -5 -4 -1 R -3 9 -1 -3 -6 1 -1 -4 0 -5 -4 3 -3 -5 -3 -2 -2 -5 -4 -4 N -3 -1 9 2 -5 0 -1 -1 1 -6 -6 0 -4 -6 -4 1 0 -7 -4 -5 D -3 -3 2 10 -7 -1 2 -3 -2 -7 -7 -2 -6 -6 -3 -1 -2 -8 -6 -6 C -1 -6 -5 -7 13 -5 -7 -6 -7 -2 -3 -6 -3 -4 -6 -2 -2 -5 -5 -2 Q -2 1 0 -1 -5 9 3 -4 1 -5 -4 2 -1 -5 -3 -1 -1 -4 -3 -4 E -2 -1 -1 2 -7 3 8 -4 0 -6 -6 1 -4 -6 -2 -1 -2 -6 -5 -4 G 0 -4 -1 -3 -6 -4 -4 9 -4 -7 -7 -3 -5 -6 -5 -1 -3 -6 -6 -6 H -3 0 1 -2 -7 1 0 -4 12 -6 -5 -1 -4 -2 -4 -2 -3 -4 3 -5 I -3 -5 -6 -7 -2 -5 -6 -7 -6 7 2 -5 2 -1 -5 -4 -2 -5 -3 4 L -3 -4 -6 -7 -3 -4 -6 -7 -5 2 6 -4 3 0 -5 -4 -3 -4 -2 1 K -1 3 0 -2 -6 2 1 -3 -1 -5 -4 8 -3 -5 -2 -1 -1 -6 -4 -4 M -2 -3 -4 -6 -3 -1 -4 -5 -4 2 3 -3 9 0 -4 -3 -1 -3 -3 1 F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1 0 -5 0 10 -6 -4 -4 0 4 -2 P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12 -2 -3 -7 -6 -4 S 2 -2 1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2 7 2 -6 -3 -3 T 0 -2 0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3 2 8 -5 -3 0 W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3 0 -7 -6 -5 16 3 -5 Y -4 -4 -4 -6 -5 -3 -5 -6 3 -3 -2 -4 -3 4 -6 -3 -3 3 11 -3 V -1 -4 -5 -6 -2 -4 -4 -6 -5 4 1 -4 1 -2 -4 -3 0 -5 -3 7

An example N AA = 14 N AD = 5 N AV = 5 N DA = 5 N DD = 8 N DV = 2 N VA = 5 N VD = 2 N VV = 2 1: VVAD 2: AAAD 3: DVAD 4: DAAA P AA = 14/48 P AD = 5/48 P AV = 5/48 P DA = 5/48 P DD = 8/48 P DV = 2/48 P VA = 5/48 P VD = 2/48 P VV = 2/48 Q A = 8/16 Q D = 5/16 Q V = 3/16 MSA

Example continued Q A =0.50 Q D =0.31 Q V =0.19 Q A Q A = 0.25 Q A Q D = 0.16 Q A Q V = 0.09 Q D Q A = 0.16 Q D Q D = 0.10 Q D Q V = 0.06 Q V Q A = 0.09 Q V Q D = 0.06 Q V Q V = 0.03 P AA = 0.29 P AD = 0.10 P AV = 0.10 P DA = 0.10 P DD = 0.17 P DV = 0.04 P VA = 0.10 P VD = 0.04 P VV = 0.04 1: VVAD 2: AAAD 3: DVAD 4: DAAA MSA

So what does this mean? Q A Q A = 0.25 Q A Q D = 0.16 Q A Q V = 0.09 Q D Q A = 0.16 Q D Q D = 0.10 Q D Q V = 0.06 Q V Q A = 0.09 Q V Q D = 0.06 Q V Q V = 0.03 P AA = 0.29 P AD = 0.10 P AV = 0.10 P DA = 0.10 P DD = 0.17 P DV = 0.04 P VA = 0.10 P VD = 0.04 P VV = 0.04 BLOSUM is a log-likelihood matrix: S ij = 2log 2 (P ij /(Q i Q j )) S AA = 0.44 S AD =-1.17 S AV = 0.30 S DA =-1.17 S DD = 1.54 S DV =-0.98 S VA = 0.30 S VD =-0.98 S VV = 0.49

The Scoring matrix ADV A0.44-1.170.30 D-1.171.54-0.98 V0.30-0.980.49 1: VVAD 2: AAAD 3: DVAD 4: DAAA MSA

And what does the BLOSUMXX mean? Cluster sequence Blocks at XX% identity To statistics only across clusters Normalize statistics according to cluster size A) B) AV AP AL VL AV GL GV } min XX % identify

And what does the BLOSUMXX mean? A) B) AV AP AL VL AV GL GV

And what does the BLOSUMXX mean? High Blosum values mean high similarity between clusters –Conserved substitution allowed Low Blosum values mean low similarity between clusters –Less conserved substitutions allowed

BLOSUM80 A R N D C Q E G H I L K M F P S T W Y V A 7 -3 -3 -3 -1 -2 -2 0 -3 -3 -3 -1 -2 -4 -1 2 0 -5 -4 -1 R -3 9 -1 -3 -6 1 -1 -4 0 -5 -4 3 -3 -5 -3 -2 -2 -5 -4 -4 N -3 -1 9 2 -5 0 -1 -1 1 -6 -6 0 -4 -6 -4 1 0 -7 -4 -5 D -3 -3 2 10 -7 -1 2 -3 -2 -7 -7 -2 -6 -6 -3 -1 -2 -8 -6 -6 C -1 -6 -5 -7 13 -5 -7 -6 -7 -2 -3 -6 -3 -4 -6 -2 -2 -5 -5 -2 Q -2 1 0 -1 -5 9 3 -4 1 -5 -4 2 -1 -5 -3 -1 -1 -4 -3 -4 E -2 -1 -1 2 -7 3 8 -4 0 -6 -6 1 -4 -6 -2 -1 -2 -6 -5 -4 G 0 -4 -1 -3 -6 -4 -4 9 -4 -7 -7 -3 -5 -6 -5 -1 -3 -6 -6 -6 H -3 0 1 -2 -7 1 0 -4 12 -6 -5 -1 -4 -2 -4 -2 -3 -4 3 -5 I -3 -5 -6 -7 -2 -5 -6 -7 -6 7 2 -5 2 -1 -5 -4 -2 -5 -3 4 L -3 -4 -6 -7 -3 -4 -6 -7 -5 2 6 -4 3 0 -5 -4 -3 -4 -2 1 K -1 3 0 -2 -6 2 1 -3 -1 -5 -4 8 -3 -5 -2 -1 -1 -6 -4 -4 M -2 -3 -4 -6 -3 -1 -4 -5 -4 2 3 -3 9 0 -4 -3 -1 -3 -3 1 F -4 -5 -6 -6 -4 -5 -6 -6 -2 -1 0 -5 0 10 -6 -4 -4 0 4 -2 P -1 -3 -4 -3 -6 -3 -2 -5 -4 -5 -5 -2 -4 -6 12 -2 -3 -7 -6 -4 S 2 -2 1 -1 -2 -1 -1 -1 -2 -4 -4 -1 -3 -4 -2 7 2 -6 -3 -3 T 0 -2 0 -2 -2 -1 -2 -3 -3 -2 -3 -1 -1 -4 -3 2 8 -5 -3 0 W -5 -5 -7 -8 -5 -4 -6 -6 -4 -5 -4 -6 -3 0 -7 -6 -5 16 3 -5 Y -4 -4 -4 -6 -5 -3 -5 -6 3 -3 -2 -4 -3 4 -6 -3 -3 3 11 -3 V -1 -4 -5 -6 -2 -4 -4 -6 -5 4 1 -4 1 -2 -4 -3 0 -5 -3 7 = 9.4 = -2.9

BLOSUM30 A R N D C Q E G H I L K M F P S T W Y V A 4 -1 0 0 -3 1 0 0 -2 0 -1 0 1 -2 -1 1 1 -5 -4 1 R -1 8 -2 -1 -2 3 -1 -2 -1 -3 -2 1 0 -1 -1 -1 -3 0 0 -1 N 0 -2 8 1 -1 -1 -1 0 -1 0 -2 0 0 -1 -3 0 1 -7 -4 -2 D 0 -1 1 9 -3 -1 1 -1 -2 -4 -1 0 -3 -5 -1 0 -1 -4 -1 -2 C -3 -2 -1 -3 17 -2 1 -4 -5 -2 0 -3 -2 -3 -3 -2 -2 -2 -6 -2 Q 1 3 -1 -1 -2 8 2 -2 0 -2 -2 0 -1 -3 0 -1 0 -1 -1 -3 E 0 -1 -1 1 1 2 6 -2 0 -3 -1 2 -1 -4 1 0 -2 -1 -2 -3 G 0 -2 0 -1 -4 -2 -2 8 -3 -1 -2 -1 -2 -3 -1 0 -2 1 -3 -3 H -2 -1 -1 -2 -5 0 0 -3 14 -2 -1 -2 2 -3 1 -1 -2 -5 0 -3 I 0 -3 0 -4 -2 -2 -3 -1 -2 6 2 -2 1 0 -3 -1 0 -3 -1 4 L -1 -2 -2 -1 0 -2 -1 -2 -1 2 4 -2 2 2 -3 -2 0 -2 3 1 K 0 1 0 0 -3 0 2 -1 -2 -2 -2 4 2 -1 1 0 -1 -2 -1 -2 M 1 0 0 -3 -2 -1 -1 -2 2 1 2 2 6 -2 -4 -2 0 -3 -1 0 F -2 -1 -1 -5 -3 -3 -4 -3 -3 0 2 -1 -2 10 -4 -1 -2 1 3 1 P -1 -1 -3 -1 -3 0 1 -1 1 -3 -3 1 -4 -4 11 -1 0 -3 -2 -4 S 1 -1 0 0 -2 -1 0 0 -1 -1 -2 0 -2 -1 -1 4 2 -3 -2 -1 T 1 -3 1 -1 -2 0 -2 -2 -2 0 0 -1 0 -2 0 2 5 -5 -1 1 W -5 0 -7 -4 -2 -1 -1 1 -5 -3 -2 -2 -3 1 -3 -3 -5 20 5 -3 Y -4 0 -4 -1 -6 -1 -2 -3 0 -1 3 -1 -1 3 -2 -2 -1 5 9 1 V 1 -1 -2 -2 -2 -3 -3 -3 -3 4 1 -2 0 1 -4 -1 1 -3 1 5 = 8.3 = -1.16

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

Similar presentations

Presentation on theme: "Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen.

Similar presentations

Presentation on theme: "Sequence alignment & Substitution matrices By Thomas Nordahl & Morten Nielsen."— Presentation transcript:

Similar presentations

About project

Feedback