2 Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters
3 Sequence Comparison - Motivation
Nucleotide:
• Learn about evolutionary relationships
• Find genes, domains, signals …
Protein:
• Classify protein families (function, structure)
• Identify common domains (function, structure)
5 How do we align two sequences?
ATTGCAGTGATCG
ATTGCGTCGATCG

Solution 1:
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
10 matches (|), 3 mismatches

Solution 2:
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
12 matches (|), 2 gaps (-)
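6 Which alignment is better?
We will use a scoring scheme:
• Match: +1
• Mismatch: -1 (first scheme) or 0 (second scheme)
• Indel (gap): -2

Solution 1 (10 matches, 3 mismatches):
ATTGCAGTGATCG
|||||   |||||
ATTGCGTCGATCG
• Mismatch = -1: 10 × 1 + 3 × (-1) = 7
• Mismatch = 0: 10 × 1 + 3 × 0 = 10

Solution 2 (12 matches, 2 gaps):
ATTGCAGT-GATCG
||||| || |||||
ATTGC-GTCGATCG
• Either scheme: 12 × 1 + 2 × (-2) = 8

With mismatch = -1, Solution 2 scores higher (8 vs. 7); with mismatch = 0, Solution 1 scores higher (10 vs. 8). Which alignment is "better" therefore depends on the scoring scheme.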
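
The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `score_alignment` and its parameters are illustrative, and it assumes the two rows are already aligned (equal length, "-" marking a gap).

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="-"):
    """Score an existing pairwise alignment column by column.

    Every gap position costs the same fixed penalty (no affine gaps here).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap          # indel column
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # substitution
    return total


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

print(score_alignment(*solution_1))              # 7
print(score_alignment(*solution_2))              # 8
print(score_alignment(*solution_1, mismatch=0))  # 10
print(score_alignment(*solution_2, mismatch=0))  # 8
```

Changing only the mismatch score flips which solution wins, which is the point of the slide.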
7 Scoring Alignments - intuition
Similar sequences evolved from a common ancestor.
Evolution changed the sequences from this ancestral sequence by mutations:
• Replacement: one letter replaced by another
• Deletion: deletion of a letter
• Insertion: insertion of a letter
Scoring of sequence similarity should examine how many operations took place.
8 Causes for sequence (dis)similarity
• mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
• insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)
• deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
• indel: an insertion or a deletion
9 Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
10 Gap Opening
• The gap-opening penalty defines the cost of opening a gap in one of the sequences.
• If you raise the gap-opening penalty above the default, local alignments that contain gaps may be split into several shorter alignments.
11 Affine Gap Penalties
In nature, a run of adjacent indels often arises from a single event rather than from a series of single-nucleotide events:

ATA__GC
ATATTGC
One gap of length 2: this is more likely.

ATAG_GC
AT_GTGC
Two gaps of length 1: this is less likely.

Normal (per-position) scoring would give the same score to both alignments. An affine gap penalty distinguishes them:
Gap = Gapopen + Len * Gapextend
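
To see the formula at work, here is a small sketch that scores the two alignments from this slide. The function name `affine_score` and the penalty values (open = -2, extend = -1, match = +1) are assumptions chosen for illustration; the slides give only the form of the formula.

```python
def affine_score(top, bottom, match=1, mismatch=-1,
                 gap_open=-2, gap_extend=-1, gap_chars="-_"):
    """Score an existing alignment with Gap = gap_open + length * gap_extend.

    A gap run is a maximal stretch of consecutive gap columns in the same row.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev_gap_row = None  # row ("top"/"bottom") holding the previous column's gap
    for a, b in zip(top, bottom):
        if a in gap_chars:
            gap_row = "top"
        elif b in gap_chars:
            gap_row = "bottom"
        else:
            gap_row = None

        if gap_row is None:
            total += match if a == b else mismatch
        else:
            if gap_row != prev_gap_row:
                total += gap_open      # a new gap run starts here
            total += gap_extend        # paid once per gap position
        prev_gap_row = gap_row
    return total


one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")

print(affine_score(*one_long_gap))    # 1
print(affine_score(*two_short_gaps))  # -1

# With no opening cost, every gap position costs the same and the two tie.
print(affine_score(*one_long_gap, gap_open=0, gap_extend=-2))    # 1
print(affine_score(*two_short_gaps, gap_open=0, gap_extend=-2))  # 1
```

Under the affine penalty, the alignment with a single longer gap scores higher, matching the "more likely" label on the slide.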
12 Gap penalties lead to:
Increasing the penalties for gap opening and extension:
• The alignment will contain fewer gaps and more mismatches.
Decreasing the penalties for gap opening and extension:
• The alignment will contain more gaps (of varied lengths) and fewer mismatches.
Keeping the gap-opening penalty the same and increasing the gap-extension penalty:
• Very long gaps will not be tolerated; they will be replaced with additional gaps of medium length and with mismatches.
14 Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment.
Sequences that align well globally are also easily identified by local alignment methods, so global alignment is now somewhat deprecated as a technique.
[Figure: schematic of global vs. local alignment]
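
The slides do not name an algorithm, but a global alignment score is usually computed by dynamic programming (the Needleman-Wunsch approach). The sketch below is one such implementation under assumed scores (match +1, mismatch -1, fixed gap -2); the function name is illustrative and only the optimal score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of an alignment that uses every character of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ATTGCAGTGATCG", "ATTGCGTCGATCG"))              # 8
print(global_alignment_score("ATTGCAGTGATCG", "ATTGCGTCGATCG", mismatch=0))  # 10
```

Because both sequences must be consumed end to end, unrelated flanking regions always pull the global score down.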
15 Local alignment
Local alignment methods find related regions within sequences; these regions can consist of a subset of the characters within each sequence.
For example, a stretch of positions in sequence A might be aligned with positions 50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins can be identified as being related.
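
A common way to compute a local alignment score is the Smith-Waterman dynamic program, which the slides do not name explicitly. The following is a minimal sketch under assumed scores (match +1, mismatch -1, fixed gap -2); the function name and the example sequences are made up for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                   # start a fresh alignment here
                prev[j - 1] + pair,  # extend diagonally
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


# A shared segment (GATCG) embedded in unrelated flanking sequence.
print(local_alignment_score("TTTTGATCGAAAA", "CCCGATCGCCC"))  # 5
```

The zero in the `max` lets the alignment restart anywhere, so unrelated flanks are simply ignored and only the conserved segment contributes to the score.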
17 Global vs. Local
Use global alignment if:
• You expect, based on some biological information, that your sequences will match over the entire length.
• Your sequences are of similar length.
Use local alignment if:
• You expect that only certain parts of the two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
• Your sequences are very different in length.
• You want to search a sequence database (we will talk about it in detail later).
18 Emboss [best solution] vs. Lalign (Embnet) [several solutions]
If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it is possible to "miss" one of the two copies when using a local alignment tool that presents only the best-scoring alignment.
19 Comparing nucleotides
• Every match got the same score.
• Every mismatch got the same score.
• Gaps: we chose the penalties, but the defaults are usually good.
However …
20 In the case of amino acids (aa)
• Not all matches are the same.
• Different mismatches get different scores.
21 Amino acid properties
• Serine (S) and Threonine (T) have similar physicochemical properties.
• Aspartic acid (D) and Glutamic acid (E) have similar properties.
=> Substitution of S/T or E/D occurs relatively often during evolution.
=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities.
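
As an illustration of scores that depend on the residue pair, the sketch below looks up pairs in a small excerpt of a substitution matrix. The ten values are quoted from the standard BLOSUM62 table for S, T, D and E only and should be checked against a published copy before real use; the helper names are hypothetical.

```python
# Excerpt of BLOSUM62 restricted to S, T, D, E (symmetric, stored once per pair).
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("S", "T"): 1, ("D", "E"): 2,
    ("S", "D"): 0, ("S", "E"): 0, ("T", "D"): -1, ("T", "E"): -1,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a residue pair in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def score_ungapped(top, bottom, matrix=BLOSUM62_EXCERPT):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(top) != len(bottom):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(top, bottom))


print(score_ungapped("STDE", "STDE"))  # 20: identities, each with its own score
print(score_ungapped("STDE", "TSED"))  # 6: S/T and D/E swaps still score positively
print(score_ungapped("SSTT", "DEDE"))  # -2: dissimilar pairs score zero or below
```

Note that the identities are not all worth the same (S/S is 4, D/D is 6), and that the S/T and D/E substitutions are penalized far less than unrelated pairs.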
22 So how can we score matches and mismatches?
• Each aa is characterized by a combination of features (size, charge, etc.).
• The relative importance of each feature may vary according to the role of the aa in the 3-D structure and function of the protein.
23 Amino Acid Substitution Matrices
• The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other.
• These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families.
• These scoring systems have a probabilistic foundation.
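
The slides do not spell out the probabilistic foundation. Matrices of this kind are commonly expressed as log-odds scores: the logarithm of how often a pair is observed in trusted alignments relative to how often it would occur by chance. The sketch below assumes that form with a half-bit scaling (2 × log2), and the frequencies used are invented toy numbers, not real data.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    observed: frequency of the residue pair in trusted alignments
    expected: frequency of the pair expected by chance alone
    """
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(observed=0.02, expected=0.01))   # 2: seen twice as often as chance
print(log_odds_score(observed=0.01, expected=0.01))   # 0: exactly as often as chance
print(log_odds_score(observed=0.005, expected=0.02))  # -4: seen four times less often
```

Positive scores mark substitutions that evolution tolerates more often than chance would predict, and negative scores mark those that are selected against.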
24 PAM series - Percent Accepted Mutation (Accepted by natural selection)
• All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (total of protein sequences).
• PAM matrices are based on global sequence alignments; these include both highly conserved and highly mutable regions.
Some of the protein families are:
• Ig kappa chain
• Kappa casein
• Lactalbumin
• Hemoglobin α
• Myoglobin
• Insulin
• Histone H4
• Ubiquitin
25 Various degrees of conservation
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.
• Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids. The same position can change more than once, so the two proteins still share some identical residues.
• All the PAM data come from closely related proteins (>85% amino acid identity).
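
In the Dayhoff model, the extrapolation is done by multiplying the PAM1 mutation probability matrix by itself: PAM250 corresponds to PAM1 raised to the 250th power. The sketch below shows the idea on a hypothetical two-letter alphabet with a 1% chance of change per step; the real PAM1 is a 20 × 20 matrix with different values, and the names `mat_mul`, `mat_pow` and `TOY_PAM1` are illustrative.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": each letter stays the same with probability 0.99 per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0][0])  # probability that a position shows its original letter
print(toy_pam250[0][1])  # probability that it shows the other letter
```

Even after 250 steps the probability of seeing the original letter stays just above one half in this toy case, because repeated changes at one position can revert it. This is why "250 changes per 100 positions" does not mean that the two sequences differ everywhere.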
26 PAM series - Percent Accepted Mutation (Accepted by natural selection)
Varying degrees of conservation
27 Blocks Substitution Matrices - the BLOSUM Family of Matrices
Blocks Substitution Matrices (Henikoff and Henikoff, 1992)
• Blocks are short conserved patterns, 3-60 aa long.
• Proteins can be divided into families by common blocks.
• Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
• Example: BLOSUM62 is derived from blocks in which sequences sharing 62% identity or more are clustered and counted as a single sequence, so the substitution counts reflect sequences that are less than 62% identical.
[Figure: a protein family with blocks A, B, C, D]
30 Summary
• BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps.
• PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.
31 PAM versus BLOSUM
PAM:
• Based on an explicit evolutionary model
• Derived from a small set of closely related proteins with ~15% divergence
• Higher PAM numbers detect more remote sequence similarities
• Errors in PAM1 are scaled 250X in PAM250
BLOSUM:
• Based on empirical frequencies
• Uses a much larger, more diverse set of protein sequences (30-90% ID)
• Lower BLOSUM numbers detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment
32 Guidelines
• Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
• Higher PAMs and lower BLOSUMs find longer, weaker local alignments.
• No single matrix answers all questions.
33 Guidelines
• BLOSUM is generally better than PAM for local alignments.
• The default matrix is often an identity matrix for DNA and BLOSUM62 for proteins.
• When using BLOSUM80 instead of BLOSUM45, local alignments tend to be shorter.
• Low PAMs have the same effects as high BLOSUMs.
• The BLOSUM number indicates a percent identity, while the PAM number is proportional to the percent of accepted mutations.