Pairwise Alignment. Sequences are related. Phylogenetic tree of globin-type proteins found in humans.


1 Pairwise Alignment

2 Sequences are related. Phylogenetic tree of globin-type proteins found in humans

3 Definition of pairwise alignment: the process of lining up two or more sequences to achieve maximal levels of identity (or similarity, in the case of amino acid sequences).

4 What for? A few examples: Determining whether two sequences from two entries found by a keyword search are similar or identical. Focusing on differences (genes sequenced in different labs, alternative splicing, SNPs, mutations). Finding similar (conserved) regions in two sequences. More…

5 How do we align two sequences?
ATTGCAGTGATCG
ATTGCGTCGATCG

Solution 1:
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
10 matches |, 3 mismatches

Solution 2:
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
12 matches |, 2 gaps -

6 Which alignment is better? We will use a scoring scheme:

               Scheme 1   Scheme 2
Match             +1         +1
Mismatch          -1          0
Indel (gap)       -2         -2

Solution 1 (10 matches, 3 mismatches):
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
Scheme 1: 10 x 1 + 3 x (-1) = 7
Scheme 2: 10 x 1 + 3 x 0 = 10

Solution 2 (12 matches, 2 gaps):
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
Scheme 1: 12 x 1 + 2 x (-2) = 8
Scheme 2: 12 x 1 + 2 x (-2) = 8
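
The arithmetic on this slide can be checked with a short sketch. The function name `score_alignment` and its parameters are illustrative and do not come from the slides; the sketch assumes the two rows are already aligned, have equal length, and use `-` to mark a gap.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of two pre-aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


solution1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

for mismatch in (-1, 0):  # the two mismatch scores used on the slide
    s1 = score_alignment(*solution1, mismatch=mismatch)
    s2 = score_alignment(*solution2, mismatch=mismatch)
    print(f"mismatch={mismatch}: solution 1 = {s1}, solution 2 = {s2}")
```

With a mismatch score of -1 the gapped solution wins (8 versus 7); with a mismatch score of 0 the ungapped one wins (10 versus 8), which is the point of the slide.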

7 Changing the scores of the matrix scheme can change the final score of a given aligned segment. So how do we determine our matrix schemes?

8 The mechanistic rationale: what happens during DNA synthesis?

9 Biological causes of mismatches. Accumulation of mutations in a segment of the sequence that is less crucial for function can create a stretch of mismatches. (Any residue can be subject to back mutations.) Very common.

Original sequence  ATTGCAGTGATCG
                   |||||   |||||
Emerging sequence  ATTGCGTCGATCG

Original sequence  ATTGCAGTGATCG
                   ||||| | |||||
Emerging sequence  ATTGCGGCGATCG
May reflect 2 or 4 independent mutations.

10 Biological causes of gaps (indel: insertion / deletion). A single mutation can create a gap. Unequal crossover in meiosis can lead to insertion or deletion of strings of bases. DNA slippage during replication can result in the repetition of a string. Retrovirus insertions. Translocations of DNA between chromosomes. These events are less common than events leading to single mutations. Are all gaps equal?

11 Consider the following pair of sequences:

A sequence with a short gap:
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC

A sequence with a long gap:
ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
|||||||||||||||||||||||||||                    |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGXXXXXXXXXXXXXXXXXXXATTTAGTTCGCTC

Two options for gap scoring:
1. Keep the score the same regardless of gap length: use a zero gap extension penalty and penalize only when a gap is opened.
2. Make the penalty grow as a linear function of gap length: add a gap extension penalty. A purely length-proportional penalty penalizes several small gaps to the same extent as one large gap of the same total length.
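
The two gap-scoring options can be compared with a small sketch. The function name, the convention (opening cost charged once, extension cost charged for every position after the first) and the numeric values are assumptions made for illustration; alignment programs differ in how they define these penalties.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Penalty for one gap: gap_open once, gap_extend per additional position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Option 1: zero extension penalty, so a 1-base and a 20-base gap cost the same.
print(gap_penalty(1, gap_open=10, gap_extend=0))    # 10
print(gap_penalty(20, gap_open=10, gap_extend=0))   # 10

# Option 2, purely linear (opening cost equals extension cost):
# four gaps of length 5 cost the same as one gap of length 20.
print(4 * gap_penalty(5, gap_open=2, gap_extend=2))  # 40
print(gap_penalty(20, gap_open=2, gap_extend=2))     # 40

# Opening cost higher than extension cost: several small gaps cost more
# than one long gap of the same total length.
print(4 * gap_penalty(5, gap_open=10, gap_extend=0.5))  # 48.0
print(gap_penalty(20, gap_open=10, gap_extend=0.5))     # 19.5
```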

12 Gap penalties can penalize for: gap opening; gap extension; gap ending (ClustalW, multiple alignment); gap separation (minimum distance between 2 gaps) [ClustalW].

13 What happens to the alignment if we change the gap penalties? Gap opening Gap extension

14 How would a global alignment be affected by: high gap-opening penalties? High gap-extension penalties? Would a local alignment be affected in the same way, and to the same extent?

15 When comparing nucleotide or amino acid sequences, the comparison score is given by the carrot-and-stick method.

ATTGCAGTGATCGATTGCAGT-GATCG
|||||   |||||||||| || |||||
ATTGCGTCGATCGATTGC-GTCGATCG

Reward: matches ( | ). Penalties: mismatches; gaps ( - ), with gap opening and gap extension. Permissions: minimal space between two gaps.
So far, when nucleotide sequences were considered, all mismatches received the same (negative) score.

16 Ex: Pairwise alignments
43.2% identity; Global alignment score: 374

alpha V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
      : :.:.:. : : ::::.. : :.::: :....: :..: : ::: :.
beta  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP

alpha QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
      .::.::::: :.....::.:.......::.:: ::.::: ::.::.. :..:: :.
beta  KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF

alpha PAEFTPAVHASLDKFLASVSTVLTSKYR
      :::: :.:..:.:.:...:. ::.
beta  GKEFTPPVQAAYQKVVAGVANALAHKYH

17 Pairwise alignment. Percent identity is not a good measure of alignment quality:
100.000% identity in 3 aa overlap
SPA
:::
SPA
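
A minimal sketch of why percent identity alone can mislead: it ignores how long the overlap is. The function name and the choice of denominator (the number of aligned columns, gaps included) are assumptions; programs differ in how they define the denominator.

```python
def percent_identity(row1, row2):
    """Identical columns as a percentage of all aligned columns."""
    if len(row1) != len(row2) or not row1:
        raise ValueError("rows must be non-empty and of equal length")
    identical = sum(1 for x, y in zip(row1, row2) if x == y and x != "-")
    return 100.0 * identical / len(row1)


# A 3-residue overlap is 100% identical yet says little about relatedness.
print(percent_identity("SPA", "SPA"))

# The 13-column ungapped DNA alignment from slide 5 has a lower identity
# (10 of 13 columns) but is far more informative.
print(round(percent_identity("ATTGCAGTGATCG", "ATTGCGTCGATCG"), 1))
```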

18 Pairwise alignments: alignment score. 43.2% identity; Global alignment score: 374 (the alpha/beta globin alignment shown on slide 16).

19 Global alignment. An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A short example:
NLGPSTKDFGKISESREFDNQ
 |      ||||    |
QLNQLERSFGKINMRLEDALV

20 Local alignment. An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:
NLGPSTKDDFGKILGPSTKDDQ
         ||||
QNQLERSSNFGKINQLERSSNN
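
The slides do not name an algorithm. As a sketch, the global and local definitions correspond to the standard dynamic-programming recurrences of Needleman-Wunsch (global) and Smith-Waterman (local). The function names, the linear gap penalty and the default scores below are assumptions chosen for illustration; only the best score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score (Needleman-Wunsch recurrence)."""
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: b aligned to gaps only
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # column 0: a aligned to gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of segments (Smith-Waterman recurrence)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart instead of carrying a bad prefix.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share only the segment "SPA".
print(global_score("GGSPAGG", "TTSPATT"))  # forced end to end: flanks are penalized
print(local_score("GGSPAGG", "TTSPATT"))   # only the shared segment is scored
```

The only difference between the two recurrences is the floor at zero and taking the maximum over the whole table, which is what lets the local version ignore poorly matching flanks.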

21 Applying GLOBAL vs. applying LOCAL. [Figure: global alignment and local alignment of the same sequences; regions labelled "Few mismatches" and "Several mismatches".]

22 If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" one of the two copies if using local alignment, which presents only the best scoring alignment. Emboss [best solution] vs. Lalign (Embnet) [several solutions]

23 Pairwise alignments: conservative substitutions. 43.2% identity; Global alignment score: 374 (the alpha/beta globin alignment shown on slide 16).

24 However, in the case of amino acids: Not all matches are equal. Not all mismatches are equal!

25 Amino acid properties. Serine (S) and Threonine (T) have similar physicochemical properties. Aspartic acid (D) and Glutamic acid (E) have similar properties. => Substitution of S/T or E/D occurs relatively often during evolution. => Substitution of S/T or E/D should result in scores that are only moderately lower than identities.

26 All Amino Acids Are Equal… Some amino acids are non-polar (hydrophobic). All other aa are polar (hydrophilic); some of these are acidic or basic.

27 http://teachline.ls.huji.ac.il/~72332/mouse/aa-properties.html Each amino acid is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the amino acid's role in the 3-D structure and function of the protein. So how can we score matches and mismatches?

28 To that end, amino acid substitution matrices were developed (BLOSUM, PAM).

29 Amino Acid Substitution Matrices. The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other. These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families. These scoring systems have a probabilistic foundation.
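
The probabilistic foundation is usually written as a log-odds score. The notation below ($p_{ab}$, $q_a$, $q_b$, $\lambda$) is an assumption of this sketch and does not appear on the slides.

```latex
s(a,b) \;=\; \frac{1}{\lambda}\,\log\frac{p_{ab}}{q_a\,q_b}
```

Here $p_{ab}$ is the frequency with which residues $a$ and $b$ are found aligned in trusted alignments, $q_a$ and $q_b$ are their background frequencies, and $\lambda$ is a scaling constant. The score is positive when a pair is observed more often than expected by chance and negative when it is observed less often.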

30 PAM series - Percent Accepted Mutation (accepted by natural selection). All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (a total of 1572 observed changes). PAM matrices are based on global sequence alignments; these include both highly conserved and highly mutable regions. Some of the protein families are: Ig kappa chain, kappa casein, lactalbumin, hemoglobin, myoglobin, insulin, histone H4, ubiquitin.

31 PAM series - Percent Accepted Mutation (Accepted by natural selection) * Varying degrees of conservation

32 The PAM250 matrix is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, i.e. 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations, this corresponds to sequences that are about 20 percent identical. A smaller PAM number means less diversity between the compared sequences, so it is better suited for more conserved sequences; PAM1 corresponds to 99% identity between sequences. Various degrees of conservation.

33 The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids. All the PAM data come from closely related proteins (>85% amino acid identity).

34 BLOSUM series - Blocks Substitution Matrix (Henikoff S. & Henikoff JG., PNAS, 1992). A substitution matrix based on alignments in the BLOCKS database: conserved regions (blocks) of families of proteins. Family members have identical biochemical functions and show common motifs: common blocks of local alignment not containing gaps. The BLOCKS database contains thousands of groups of multiple sequence alignments. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

35 Extracting probabilities from Blocks - example

AACDAA
AAADCR
DRCGNA
ANCNAR
CRKDAN
AAKNCR

Substitutions counted in column 1: AA, AD, AA, AC, AA, AD, AA, AC, AA, DA, DC, DA, AC, AA, CA
6 AA (P(AA) = 6/15), 4 AD (P(AD) = 4/15), 4 AC, 1 DC, …
Statistics of substitutions and log-odds computation as described for PAM.
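
The counting step can be sketched as follows. It assumes the 36 residues on the slide are read row by row as six sequences of six columns, which reproduces the slide's column 1 counts; the function name `column_pair_counts` is illustrative.

```python
from collections import Counter
from itertools import combinations

# The block from the slide, read row by row (an assumption about its layout).
block = ["AACDAA", "AAADCR", "DRCGNA", "ANCNAR", "CRKDAN", "AAKNCR"]


def column_pair_counts(block, col):
    """Count unordered residue pairs between all sequence pairs in one column."""
    column = [seq[col] for seq in block]
    counts = Counter()
    for x, y in combinations(column, 2):
        counts["".join(sorted((x, y)))] += 1   # 'DA' and 'AD' are the same pair
    return counts


counts = column_pair_counts(block, 0)
total = sum(counts.values())                   # 6 sequences give 15 pairs
for pair, n in sorted(counts.items()):
    print(f"{pair}: {n}/{total}")
```

Six sequences give 15 pairs per column, which is where the denominator in P(AA) = 6/15 comes from.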

36 Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.

37 Blosum62 scoring matrix

38 Using an amino acid substitution matrix. Gap penalties (not included in this example) are treated as previously described. Notice that matches and mismatches don’t all have the same values.
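
As a sketch of how such a matrix is applied, the lookup below uses a small excerpt of BLOSUM62 restricted to five residues (S, T, D, E, W). The helper names are illustrative, and a real application would load the full 20 x 20 matrix rather than this excerpt.

```python
# Excerpt of BLOSUM62 for S, T, D, E and W only; the matrix is symmetric,
# so each unordered pair is stored once.
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("D", "E"): 2,
    ("S", "D"): 0, ("S", "E"): 0, ("T", "D"): -1, ("T", "E"): -1,
    ("S", "W"): -3, ("T", "W"): -2, ("D", "W"): -4, ("E", "W"): -3,
}


def pair_score(x, y):
    """Look up a residue pair in either order."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def score_ungapped(row1, row2):
    """Sum substitution scores column by column (no gaps in this example)."""
    if len(row1) != len(row2):
        raise ValueError("rows must have equal length")
    return sum(pair_score(x, y) for x, y in zip(row1, row2))


print(score_ungapped("SDW", "SDW"))  # identities: 4 + 6 + 11
print(score_ungapped("SDW", "TEW"))  # conservative substitutions: 1 + 2 + 11
print(score_ungapped("SDW", "WSD"))  # non-conservative: -3 + 0 - 4
```

The three identities score 4, 6 and 11, and the conservative pairs S/T and D/E still score above zero, which illustrates that neither matches nor mismatches are all equal.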

39 Different matrices give somewhat different scores, but same general trends are observed. What trends?

40 A substitution is more likely to occur between amino acids with similar biochemical properties.

41 The likelihood of a substitution is also affected by the degree of degeneracy of the genetic code for the different amino acids.

42 How do we choose the most appropriate scoring matrix? Blosum matrices are more commonly used than PAM matrices. The Blosum matrices are best for detecting local alignments. The Blosum62 matrix is the best for detecting the majority of weak protein similarities. The Blosum45 matrix is the best for detecting long and weak alignments.

43 Rat versus mouse RBP Rat versus bacterial lipocalin

44 The following matrices are roughly equivalent:
PAM100   BLOSUM90
PAM120   BLOSUM80
PAM160   BLOSUM60
PAM200   BLOSUM52
PAM250   BLOSUM45

45 Limitations. Substitution matrices do not take into account long-range interactions between residues. They assume that identical residues are equal (whereas in real life a residue at the active site is under different evolutionary constraints than the same residue outside the active site). They assume the rate of evolution to be constant.

46 DNA Substitution Matrices. Transitions: purine - purine, pyrimidine - pyrimidine. Transversions: purine - pyrimidine, pyrimidine - purine.
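
A DNA substitution matrix can distinguish these classes instead of giving every mismatch the same score. The sketch below is illustrative: the function name and the numeric scores are hypothetical, chosen only so that changes within a class (purine to purine, pyrimidine to pyrimidine) are penalized less than changes between classes.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(x, y, match=1, within_class=-1, between_classes=-2):
    """Score a nucleotide pair; the default values are hypothetical."""
    for base in (x, y):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return within_class if same_class else between_classes


print(dna_pair_score("A", "G"))  # purine - purine
print(dna_pair_score("C", "T"))  # pyrimidine - pyrimidine
print(dna_pair_score("A", "C"))  # purine - pyrimidine
```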

47 Definitions. Conservation: the extent to which nucleotide or protein sequences are related. It can be evaluated by identity and similarity. Identity ( | ): the extent to which two sequences are invariant. Similarity ( . : ): changes at a specific position of an amino acid sequence that preserve the physico-chemical properties of the original residue.

48 There are many ways to align two sequences. Several ways to present the pairwise alignment Do not blindly trust your alignment to be the only truth. In particular, gapped regions may be quite variable. Sequences sharing less than 20% identity are difficult to align.

49 Dotplots: visual sequence comparison. 1. Place two sequences along the axes of the plot. 2. Place a dot at grid points where the two sequences have identical residues. 3. Diagonals correspond to conserved regions.
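
The three steps can be sketched as a text dotplot. The function name and the use of `*` and `.` as plot characters are choices made for this illustration; real dotplot tools usually add window and threshold filtering, which is omitted here.

```python
def dotplot(seq_x, seq_y):
    """Return text rows: '*' where the residues are identical, '.' elsewhere.

    seq_x runs along the horizontal axis and seq_y along the vertical axis.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


# The two DNA sequences from slide 5: the shared ends show up as two
# diagonal runs, interrupted where the sequences differ.
seq_x = "ATTGCAGTGATCG"
seq_y = "ATTGCGTCGATCG"
print("  " + seq_x)
for residue, row in zip(seq_y, dotplot(seq_x, seq_y)):
    print(residue + " " + row)
```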

