2 Next-generation sequencing (NGS) The completion of the human genome was just the start of the modern DNA sequencing era: “high-throughput next-generation sequencing” (NGS). New approaches reduce time and cost. The Holy Grail of sequencing: a complete human genome for less than $1000. 1st generation: Sanger dideoxy method. 2nd generation: sequencing by synthesis (pyrosequencing). 3rd generation: single-molecule sequencing.
3 Sequence alignment What is sequence alignment? Three flavors of sequence alignment. Point mutations, indels.
4 Homology 'Central dogma of bioinformatics'. Sequences diverge. Conserved residues. Sequences are homologous, orthologous, paralogous. The variation between sequences: changes that occurred during evolution in the form of substitutions (mutations) and/or indels.
6 Scoring systems I DNA and protein sequences can be aligned so that the number of identically matching pairs is maximized. Counting the number of matches gives us a score (3 in this case). A higher score means a better alignment. This procedure can be formalized using a substitution matrix. [Figure: example alignment of two short DNA sequences; identity matrix over A, T, C, G with 1 on the diagonal.]
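Counting matches with an identity matrix can be sketched in a few lines of Python. This is only an illustration: the two aligned rows below are hypothetical (they are not the sequences from the slide figure), and the function names are made up for this example. A gap column is assumed to score 0.

```python
def identity_score(a, b):
    """Identity-matrix score for one column: 1 for identical residues, 0 otherwise."""
    # A gap character never counts as a match (assumption of this sketch).
    return 1 if a == b and a != "-" else 0


def alignment_score(row_a, row_b):
    """Sum the identity scores over all columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(identity_score(a, b) for a, b in zip(row_a, row_b))


# Hypothetical aligned rows; "-" marks a gap.
print(alignment_score("ACGTAC", "AC-TTC"))  # 4
```

Replacing `identity_score` with a lookup in a different substitution matrix changes the scoring scheme without changing the summation.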
7 Scoring systems II Identity matrix: nucleic acids, OK; proteins, not enough. Amino acids are not exchanged with the equal probability that might be assumed theoretically. For example, substitution of aspartic acid (D) by glutamic acid (E) is frequently observed, while a change from aspartic acid to tryptophan (W) is very rare. For nucleotide sequences the identity matrix is usually good enough. For protein sequences the identity matrix is not sufficient to describe biological and evolutionary processes.
8 Scoring systems II Why is that? Triplet-based genetic code: GAT (D) → GAA (E) needs one nucleotide change, GAT (D) → TGG (W) needs three. D and E have similar properties, but D and W differ considerably. D is hydrophilic, W is hydrophobic; a D → W mutation can greatly alter the 3D structure and consequently the function.
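The genetic-code argument can be checked by counting how many nucleotide positions differ between two codons. A minimal sketch (the function name is chosen for this example only):

```python
def codon_differences(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must have exactly three nucleotides")
    return sum(1 for a, b in zip(codon_a, codon_b) if a != b)


print(codon_differences("GAT", "GAA"))  # 1: D -> E needs a single point mutation
print(codon_differences("GAT", "TGG"))  # 3: D -> W needs all three positions to change
```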
12 Length penalties We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
or
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can favor the first kind of alignment by penalizing a new gap more than the extension of an existing gap. A single event that deletes a longer stretch of sequence is more likely than many independent short deletions.
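The effect of penalizing gap opening more than gap extension can be sketched as follows. The penalty values (10 for opening, 1 for each extension) and the function names are assumptions made for this illustration, not values from the slides. The deleted stretch in the first alignment is written here as seven gap characters.

```python
import re


def linear_gap_cost(gapped_row, per_position=10):
    """Every gap position costs the same, however the gaps are distributed."""
    return per_position * gapped_row.count("-")


def affine_gap_cost(gapped_row, gap_open=10, gap_extend=1):
    """Opening a gap costs gap_open; each further position in the same gap costs gap_extend."""
    cost = 0
    for run in re.findall(r"-+", gapped_row):  # each run of "-" is one gap
        cost += gap_open + gap_extend * (len(run) - 1)
    return cost


one_long_gap = "ACGTCTGAT-------ATAGTCTATCT"
many_short_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"

# Both rows contain seven gap positions, so a linear penalty cannot tell them apart.
print(linear_gap_cost(one_long_gap), linear_gap_cost(many_short_gaps))  # 70 70
# The affine penalty clearly prefers the single long gap.
print(affine_gap_cost(one_long_gap), affine_gap_cost(many_short_gaps))  # 16 61
```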
14 Substitution matrices Substitution (score) matrices give scores for amino acid substitutions. A higher score means that the substitution is more likely. Conservative substitutions conserve the physical and chemical properties of the amino acids and limit structural/functional disruption. Substitution matrices should reflect: the physicochemical properties of amino acids; the different frequencies of individual amino acids occurring in proteins; the interchangeability of the genetic code.
15 PAM matrices I How to assign scores? Let’s get nature (evolution) involved! If you choose a set of proteins with very similar sequences, you can do the alignment manually. Also, if the sequences in your set are similar, there is a high probability that each amino acid difference is due to a single mutation. From the frequencies of mutations in the set of similar protein sequences, probabilities of substitutions can be derived. This is exactly the approach taken by Margaret Dayhoff in 1978 to construct the PAM (Point Accepted Mutation) matrices. Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.
16 PAM matrices II Substitutions were found in alignments of 71 groups of very similar (at least 85% identity) protein sequences. These mutations do not significantly alter the protein function. Hence they are called accepted mutations (accepted by natural selection). Probabilities that any one amino acid would mutate into any other were calculated. If I know the probabilities for individual amino acids, what is the probability for the given sequence? The product. But to calculate the score, we would like to sum, not multiply. How to achieve this? Take the logarithm. Excellent discussion of the derivation and use of PAM matrices: George DG, Barker WC, Hunt LT. Mutation data matrix and its uses. Methods Enzymol. 1990, 183: PMID:
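The step from a product of probabilities to a sum of scores rests on the identity log(p1 · p2 · … · pn) = log p1 + log p2 + … + log pn. A small sketch with hypothetical per-position probabilities (the numbers are invented for this example and are not taken from Dayhoff's data):

```python
import math

# Hypothetical per-position probabilities for a short alignment.
position_probabilities = [0.10, 0.05, 0.20]

# Probability of the whole alignment: product of the per-position values.
product = math.prod(position_probabilities)

# Additive score: sum of the logarithms of the same values.
log_sum = sum(math.log10(p) for p in position_probabilities)

# The logarithm of the product equals the sum of the logarithms.
print(math.isclose(math.log10(product), log_sum))  # True
```

Working with sums of logarithms also avoids the numerical underflow that a product of many small probabilities would cause for long sequences.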
17 PAM matrices III Dayhoff’s definition of an accepted mutation was thus based on empirically observed amino acid substitutions. The unit used is the PAM. Two sequences are 1 PAM apart if 99% of their residues are identical. The PAM1 matrix is the result of computing the probability of one substitution per 100 amino acids. The PAM1 matrix represents probabilities of point mutations over a certain evolutionary time: in Drosophila 1 PAM corresponds to ~2.62 million years, in human 1 PAM corresponds to ~4.58 million years. Henikoff S, Henikoff JG. Amino acid substitution matrices. Advances in Protein Chemistry 54: 73-97, 2000. The estimate of the length of 1 PAM is taken from Introduction to Computational Biology – An Evolutionary Approach, by B. Haubold, T. Wiehe.
19 Higher PAM matrices What to do if I want to get probabilities over a much longer evolutionary time? Dayhoff proposed a model of evolution that is a Markov process. Such a Markov process can be treated as a linear dynamical system.
20 Linear dynamical system I A new species of frog has been introduced into an area where it has too few natural predators. In an attempt to restore the ecological balance, a team of scientists is considering introducing a species of bird which feeds on this frog. Experimental data suggests that the population of frogs and birds from one year to the next can be modeled by linear relationships. Specifically, it has been found that if the quantities $F_k$ and $B_k$ represent the populations of the frogs and birds in the $k$th year, then
$B_{k+1} = 0.6 B_k + 0.4 F_k$
$F_{k+1} = -0.35 B_k + 1.4 F_k$
The question is this: in the long run, will the introduction of the birds reduce or eliminate the frog population growth?
21 Linear dynamical system II
$\begin{pmatrix} F_{k+1} \\ B_{k+1} \end{pmatrix} = \begin{pmatrix} 1.4 & -0.35 \\ 0.4 & 0.6 \end{pmatrix} \begin{pmatrix} F_k \\ B_k \end{pmatrix}$
So this system evolves in time according to $x(k+1) = A x(k)$. Such a system is called a discrete linear dynamical system; the matrix $A$ is called the transition matrix. If we need to know the state of the system at time $k = 50$, we have to compute $x(50) = A^{50} x(0)$. And the same is true for Dayhoff’s model of evolution. If we need to obtain probability matrices for a higher percentage of accepted mutations (i.e. covering a longer evolutionary time), we take matrix powers. Let’s say we want PAM120: 120 mutations fixed on average per 100 residues. We compute $\mathrm{PAM1}^{120}$.
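The computation of $x(50) = A^{50} x(0)$ can be sketched in plain Python. The transition matrix below is read off the two population equations with the state ordered as (frogs, birds); the starting populations are hypothetical, and the helper names are chosen for this example only.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(a, n):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    size = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = a
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Transition matrix for the state vector (F_k, B_k).
A = [[1.4, -0.35],
     [0.4, 0.6]]

# Hypothetical starting populations: 100 frogs, 10 birds.
x0 = [[100.0], [10.0]]

# State after 50 years in one step: x(50) = A^50 x(0).
x50 = mat_mul(mat_pow(A, 50), x0)
print(x50)
```

The same `mat_pow` call applied to a 20 × 20 PAM1 probability matrix with exponent 120 would give the PAM120 probabilities; no PAM1 values are included here because the slides do not list them.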
22 Higher PAM matrices Biologically, the PAM120 matrix means that in 100 amino acids there have been on average 120 substitutions, while in PAM250 there have been 2.5 amino acid mutations at each site. This may sound unusual, but remember that over evolutionary time it is possible that an alanine was changed to glycine, then to valine, and then back to alanine. These are called silent substitutions.
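Why more than one mutation per site does not mean that every residue differs can be illustrated with a deliberately simplified toy model. The assumptions are loud ones: all 20 amino acids are treated as equally likely, each site mutates with probability 0.01 per PAM step, and a mutation picks any of the other 19 amino acids with equal probability. This is not Dayhoff's PAM1 matrix, only a sketch of the repeated-substitution effect.

```python
def expected_identity(pam_steps, mutation_rate=0.01, alphabet_size=20):
    """Probability that a site shows its original residue after pam_steps steps (toy model)."""
    p_same = 1.0
    back_rate = mutation_rate / (alphabet_size - 1)  # chance to mutate back to the original
    for _ in range(pam_steps):
        # Either the site still has the original residue and does not mutate,
        # or it has a different residue and mutates back to the original.
        p_same = p_same * (1.0 - mutation_rate) + (1.0 - p_same) * back_rate
    return p_same


for steps in (1, 120, 250):
    print(steps, round(expected_identity(steps), 3))
```

Even after 250 steps the expected identity in this toy model stays above the 1/20 background, because multiple substitutions at one site and reversions hide part of the change.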
23 PAM 120 [Figure: PAM120 matrix with amino acids grouped as small, polar; small, nonpolar; polar or acidic; basic; large, hydrophobic; aromatic. Source: Zvelebil, Baum, Understanding bioinformatics.] Positive score: the frequency of substitutions is greater than would have occurred by random chance. Zero score: the frequency is equal to that expected by chance. Negative score: the frequency is less than would have occurred by random chance.
24 PAM matrices assumptions Mutation of an amino acid is independent of previous mutations at the same position (Markov process requirement). Only PAM1 was “measured”; all others are extrapolations (i.e. predictions based on some model). Each amino acid position is equally mutable. Mutations are assumed to be independent of surrounding residues. Forces responsible for sequence evolution over short times are the same as those over longer times. PAM matrices are based on the protein sequences available in 1978 (bias towards small, globular proteins). A newer generation of Dayhoff-type matrices exists, e.g. PET91.
25 How to calculate score? [Figure: substitution matrix, BLOSUM62 shown here. Source: Selzer, Applied bioinformatics.]
26 Protein vs. DNA sequences Given the choice of aligning DNA or protein, it is often more informative to compare protein sequences. There are several reasons for this: Many changes in DNA do not change the amino acid that is specified; in particular, a change at the 3rd position of a codon often leaves the encoded amino acid unchanged. Many amino acids share related biophysical properties. Though these amino acids are not identical, they can be more easily substituted for each other. These relationships can be accounted for using scoring systems. When is it appropriate to compare nucleic acid sequences? When confirming the identity of a DNA sequence in a database search, searching for polymorphisms, or confirming the identity of a cloned cDNA. When a protein-coding nucleotide sequence is analyzed, it is usually preferable to study the encoded protein sequence.
27 Similarity vs. identity Similarity refers to the percentage of aligned residues that can be more readily substituted for each other. Such residues have similar physicochemical characteristics, and the selective pressure results in some mutations being accepted and others being eliminated.
$S = \frac{2 L_s}{L_a + L_b} \times 100$
where $L_s$ is the number of aligned residues with similar characteristics and $L_a$, $L_b$ are the total lengths of each sequence.
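The similarity formula is simple arithmetic; a short sketch with hypothetical counts shows how it is applied (the function and parameter names are chosen for this example):

```python
def percent_similarity(similar_aligned, length_a, length_b):
    """S = [(Ls * 2) / (La + Lb)] * 100 for two aligned sequences."""
    if length_a + length_b == 0:
        raise ValueError("at least one sequence must be non-empty")
    return (similar_aligned * 2) / (length_a + length_b) * 100


# Hypothetical example: 40 similar aligned residues, both sequences 100 residues long.
print(percent_similarity(40, 100, 100))  # 40.0
```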
28 Homology vs. similarity Two sequences are homologous when they are descended from a common ancestor sequence. Similarity can be quantified: “two sequences share 40% similarity”. But NOT “two sequences share 40% homology”. Just “two sequences are homologous”. Homology is a qualitative statement, and it is a conclusion about a common ancestral relationship drawn from a sequence similarity comparison. Homology is like pregnancy: you’re either pregnant, or you’re not. You cannot be 80% pregnant.
29 Gaps How will I score this alignment?
V D S - C Y
V E S L C Y
Gaps can’t be inserted freely. Indels are relatively slow evolutionary processes, and alignments with large gaps do not make biological sense. Each gap is penalized with a gap penalty. The gap penalty is an adjustable parameter. Let’s use a gap penalty of -11. With BLOSUM62 substitution scores: S = (4 + 2 + 4 + 9 + 7) - 11 = 26 - 11 = 15
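The score of 15 can be reproduced with a short script. It assumes the BLOSUM62 matrix shown on the earlier slide; only the five BLOSUM62 entries needed for this alignment are included, and the names `SCORES` and `score_alignment` are made up for this sketch.

```python
# The BLOSUM62 entries needed for this alignment (not the full matrix).
SCORES = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "S"): 4,
    ("C", "C"): 9,
    ("Y", "Y"): 7,
}


def score_alignment(row_a, row_b, gap_penalty=-11):
    """Sum substitution scores over the columns; each gap column adds gap_penalty."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap_penalty
        elif (a, b) in SCORES:
            total += SCORES[(a, b)]
        else:
            total += SCORES[(b, a)]  # the matrix is symmetric
    return total


print(score_alignment("VDS-CY", "VESLCY"))  # 15
```

A pair missing from `SCORES` raises a `KeyError`; a full matrix would be needed for arbitrary sequences.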
30 Gap penalty Affine gap penalty: different penalties for opening a gap and for extending it, with a constant penalty for each extension. If the gap penalty is high, fewer gaps will be inserted. If you’re searching for sequences that are a strict match for your query sequence, the gap penalty should be set high. This will retrieve regions with very closely related sequences. If the gap penalty is low, more and larger gaps will be inserted. If you are searching for similarity between distantly related sequences, the gap penalty should be set low.
31 [Figure: alignments obtained with two gap penalty settings. Source: Zvelebil, Baum, Understanding bioinformatics.] (A) High gap penalty. Gaps have been inserted only at the beginning and end. Percentage identity = 10%. (B) Low gap penalty. More gaps. Percentage identity = 18%.