Presentation on theme: "Sequence Alignment and Phylogeny"— Presentation transcript:
1. Sequence Alignment and Phylogeny
B I O I N F O R M A T I C S
| | |           | | |     |
B I O L O G Y - M A T H - S
Dr Peter Smooker,
2. Uses of alignments
To determine the relationship (i.e. distance) between two sequences (pairwise alignment)
To search databanks for the presence of homologues
To look for sequence conservation in families of proteins
To use molecular approaches to phylogeny
3. Comments/Caveats
When sequences are aligned, we assume they share a common ancestor
Protein fold is more conserved than protein sequence
DNA sequences are less informative than protein sequences
Two sequences can always be aligned; we need to determine what is a meaningful result
4. Homology
Proteins or genes are defined as homologous if they can be said to have shared an ancestor
Genes or proteins are either homologues or they are not; there is no such thing as percent homology. There is percent identity or similarity of the sequences
5. “Ologies”
Homology - descent from a common ancestor
Orthology - descent from a speciation event
Paralogy - descent from a duplication event
Xenology - descent from a horizontal transfer event
6. When Is Homology Real?
As a general rule, in a pairwise alignment:
>25% identical amino acids: proteins will have a similar folding pattern, most likely homologous
18-25% identical: twilight zone, tantalizing
<18% identical: cannot determine from alignment
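The rule of thumb above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function names are made up for this example, and the choice to count identity only over columns where neither sequence has a gap is an assumption (other denominators are also in use).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap.

    Assumption: gapped columns are ignored; other conventions divide by the
    full alignment length or by the shorter sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def homology_zone(identity):
    """Map a percent identity onto the three zones from the slide."""
    if identity > 25.0:
        return "most likely homologous"
    if identity >= 18.0:
        return "twilight zone"
    return "cannot determine from alignment"


# The title-slide alignment: 7 matches over 12 ungapped columns.
print(homology_zone(percent_identity("BIOINFORMATICS", "BIOLOGY-MATH-S")))
```

The thresholds are only a guide for pairwise protein alignments; a short alignment can exceed 25% identity by chance.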
7. Measuring Sequence Similarity
Two measures of the distance between two strings:
Hamming distance: for strings of equal length, the number of positions with mismatches
Levenshtein distance: for strings that need not be of equal length, the minimum number of edit operations (substitutions, insertions, deletions) needed to change one string into the other
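Both distances are short to write down. The sketch below is one straightforward way to compute them; the function names are chosen for this example, and the Levenshtein version uses the usual row-by-row dynamic programming with unit cost for every edit.

```python
def hamming(a, b):
    """Number of mismatched positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(hamming("GATTACA", "GACTATA"))
print(levenshtein("kitten", "sitting"))
```

The same table-filling idea, with a substitution matrix and gap penalties in place of unit costs, underlies pairwise sequence alignment.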
9. Protein Alignments - Substitution Matrices
When sequences diverge over time, they accumulate mutations: some are deleterious, some are neutral, some are advantageous
Some changes are more likely than others
This can be examined and the relative probability of a change occurring calculated
Substitution matrices have been developed
10. Matrices. PAM = Percent Accepted Mutation
Matrices are derived from families of proteins with a set level of identity.
PAM matrices proposed by Margaret Dayhoff. Based on sequences with > 85% identity. The PAM 1 matrix was computed. Extrapolated for larger evolutionary distances
11. PAM Matrices
PAM          0    30   80   110   200   250
% identity   100  75   50   40    25    20
The PAM250 matrix corresponds to proteins of average 20% identity (the lowest we can reasonably be confident about). It was derived by the extrapolation of observed substitution frequencies. PAM250 refers to 250 substitutions per 100 amino acids.
12. Definition of PAM from BLAST literature
One "PAM" corresponds to an average change in 1% of all amino acid positions. After 100 PAMs of evolution, not every residue will have changed: some will have mutated several times, perhaps returning to their original state, and others not at all. Thus it is possible to recognize as homologous proteins separated by much more than 100 PAMs. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates.
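The point that 100 PAMs does not mean 100% of residues differ can be checked with a toy calculation. The sketch below assumes a deliberately simplified model that is not Dayhoff's data: 20 residue types, a 1% chance of change per PAM step, and every change equally likely to land on any of the other 19 residues. The function name is hypothetical.

```python
def expected_identity(n_pams, n_states=20, step_change=0.01):
    """Expected fraction of positions still matching the ancestor after n_pams steps.

    Toy model (an assumption): at each step a residue changes with probability
    step_change, moving to each of the other residues with equal probability.
    """
    back_to_original = step_change / (n_states - 1)
    p_same = 1.0
    for _ in range(n_pams):
        # Stay unchanged, or mutate back from a different residue.
        p_same = p_same * (1.0 - step_change) + (1.0 - p_same) * back_to_original
    return p_same


for n in (1, 100, 250):
    print(n, round(100 * expected_identity(n), 1))
```

Even under this crude model, a little under 40% of positions still match after 100 PAMs. Real PAM matrices are built from observed, unequal substitution frequencies, so the actual percentages differ.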
13. BLOSUM Matrices
Developed by S and JG Henikoff
Made use of a much larger amount of data
Based on the BLOCKS database of aligned protein domains
Sequences more similar than an identity threshold are clustered and weighted as a single sequence. For example, the common BLOSUM62 matrix is built with sequences of 62% identity or higher clustered together
14. BLOCKS
The substitutions in each aligned column are identified and a score for each substitution calculated and inserted into the matrix.
15. Which Matrix to use?
In BLASTP, the following matrices are offered:
PAM 30
PAM 70
BLOSUM 80
BLOSUM 62 (default)
BLOSUM 45
In PAM, greater numbers = more evolutionary distance. Reverse for BLOSUM
16. Which Matrix to use?
Generally, BLOSUM matrices perform better than PAM matrices for local alignment searches
Use the matrix appropriate for the task: if you expect a close match, use a low PAM or high BLOSUM number
If you use the default (usually BLOSUM 62) and find nothing, go to a matrix derived from a more evolutionarily distant dataset
17. Scoring
Score of mutation i -> j = log (observed i -> j / expected i -> j)
Expected i -> j is simply calculated from the frequencies of the amino acids
Result is multiplied by 10. Scores are added.
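As a rough sketch of this scoring scheme: the code below assumes a base-10 logarithm and takes the expected frequency to be the product of the two amino acid frequencies, neither of which the slide states explicitly. The function names and input frequencies are invented for illustration. The small lookup table holds only the six PAM250 values shown on the next slide.

```python
import math


def log_odds_score(observed, freq_i, freq_j):
    """10 * log10(observed / expected), rounded to an integer.

    Assumptions: base-10 log, and expected = freq_i * freq_j.
    """
    expected = freq_i * freq_j
    return round(10 * math.log10(observed / expected))


# The six PAM250 entries visible in the transcript (keys sorted alphabetically).
PAM250_FRAGMENT = {
    ("A", "A"): 2,
    ("A", "R"): -2,
    ("R", "R"): 6,
    ("A", "N"): 0,
    ("N", "R"): 0,
    ("N", "N"): 2,
}


def alignment_score(seq_a, seq_b, matrix):
    """Add up the matrix scores of each aligned residue pair (no gaps handled)."""
    return sum(matrix[tuple(sorted(pair))] for pair in zip(seq_a, seq_b))


print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as expected
print(alignment_score("ARN", "ARR", PAM250_FRAGMENT))
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios.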
18. PAM250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
19. Different amino acids that give a high positive score are usually functionally equivalent
Scores below 0 indicate that those substitutions are rarely observed
24. Significance
Two values are given: the bit score and the E-value.
The E-value is a statistical estimate of the number of matches expected to give that score or better by chance in a database of that size; the lower the E-value, the less likely the match is due to chance
The bit score is derived from the raw score (calculated from the BLOSUM or PAM lookup matrix) but is normalised
25. Bit Score
Bit scores are normalised with respect to the scoring system. Hence they can be compared across different searches (using different matrices)
In particular: to convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed
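The conversion is easy to try out. The sketch below implements the formula from the slide and adds the companion relation E = m * n * 2^(-S') from BLAST statistics, where m and n are the query and database lengths. The lambda and K values in the example call are placeholders for illustration only; real values depend on the matrix and gap costs and are reported in the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2, as given on the slide."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`: m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder parameters (assumed for illustration, not taken from the slides).
bits = bit_score(raw_score=100, lam=0.267, k=0.041)
print(round(bits, 1), e_value(bits, query_len=300, db_len=1_000_000))
```

Each extra bit halves the E-value, which is why bit scores from searches run with different matrices can be compared directly.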
26. Multiple Sequence Alignment
To quote Lesk: “One amino acid sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud”
27. Multiple Sequence Alignment
Multiple sequence alignments can offer a considerable amount of information over a pairwise alignment.
Regions of similarity (especially distant similarity) can be detected
Regions of functional significance can often be detected
Evolutionary relationships can be examined, and trees drawn.
28. MSAs are computationally expensive
If we use dynamic programming, then rather than a 2D array as for pairwise comparison, we have an n-dimensional array. Computational time grows as M^n, where M is the sequence length and n is the number of sequences. Difficult for n=4, impossible for higher values.
Use a heuristic approach. Most common is the CLUSTAL algorithm
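To get a feel for the growth, one can simply count the cells of the dynamic programming array. This small sketch assumes all sequences have the same length and counts one extra position per axis for the leading gap row; the function name is made up for this example.

```python
def dp_cells(seq_len, n_seqs):
    """Cells in an n-dimensional DP array for n sequences of equal length."""
    return (seq_len + 1) ** n_seqs


for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n))
```

For sequences of 300 residues the count climbs from tens of thousands of cells for a pair to billions for four sequences, before any per-cell work is considered.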
29. Progressive Alignment
Iterative pairwise alignment
Two most similar sequences aligned first, then next most similar to that pair, etc.
A very popular progressive alignment algorithm is CLUSTAL W
30. CLUSTAL W - Steps
A matrix of pairwise distances between all sequences is constructed. This determines the similarity between all sequences to be aligned.
A guide tree (dendrogram), or inferred phylogeny, is built
The alignment is constructed based on the guide tree.
Generally results in a near-optimal alignment
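The first two steps (distance matrix, then guide tree) can be sketched on toy data. This is not CLUSTAL W's actual procedure: it assumes equal-length sequences so that a simple mismatch fraction can stand in for a pairwise-alignment distance, and it joins clusters by average linkage (UPGMA) rather than the neighbour-joining method CLUSTAL W uses. All names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatched positions (toy stand-in for an alignment distance)."""
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Join the two closest clusters repeatedly; returns nested tuples of names."""
    # Step 1: matrix of pairwise distances between all sequences.
    dist = {frozenset((p, q)): p_distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}

    def cluster_distance(members_1, members_2):
        # Average linkage: mean distance over all leaf pairs.
        total = sum(dist[frozenset((x, y))] for x in members_1 for y in members_2)
        return total / (len(members_1) * len(members_2))

    # Step 2: build the tree. Each entry is (subtree, list of leaf names).
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]][1],
                                                   clusters[ij[1]][1]))
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((tree_i, tree_j), leaves_i + leaves_j))
    return clusters[0][0]


toy = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTAAAA", "D": "TTTTAATT"}
print(guide_tree(toy))
```

A progressive aligner would then walk this tree from the leaves inward, aligning the closest pair first and merging alignments at each internal node.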
31. CLUSTAL W
A major problem in MSA is the selection of an appropriate matrix for alignments consisting of divergent and closely related sequences
CLUSTAL W (weighted) assigns weights to a sequence dependent on how divergent it is from the two most closely related sequences
Adapts gap penalties and scoring matrix to suit
32. An example (from our research)
Some definitions:
Phylogeny: Evolutionary history (“tree of life”)
Molecular phylogeny: Determined using sequence data
Bootstrapping: A statistical process to evaluate phylogenetic trees. The data is resampled 1000 times (generally) and the support for each branch determined
Homology modelling: Predicting the structure of a protein based on the experimentally derived structure of a homologue
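The resampling step of bootstrapping can be illustrated as follows. In the usual phylogenetic bootstrap, the columns of the alignment are sampled with replacement to build pseudo-alignments of the same size. The sketch below covers only that step, with a hypothetical function name and a fixed seed for repeatability. Building a tree from each replicate and counting how often each branch recurs is left out.

```python
import random


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Yield pseudo-alignments made by sampling alignment columns with replacement."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)
    for _ in range(n_replicates):
        # Pick `length` column indices, allowing repeats.
        columns = [rng.randrange(length) for _ in range(length)]
        yield ["".join(row[c] for c in columns) for row in alignment]


toy_alignment = ["ACGT", "ACGA", "TCGA"]
for replicate in bootstrap_replicates(toy_alignment, n_replicates=3):
    print(replicate)
```

The support for a branch is then the percentage of replicate trees that contain it, which is what a statement such as "100% bootstrap support" refers to.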
34. Liver fluke (Fasciola spp.)
Trematode (flatworm) parasite
Infects ruminants, humans
Has a complex life-cycle
Secretes proteins (excretory/secretory material)
Major secreted protein is cathepsin L in adults
35. Cysteine proteases
Digest proteins: cleave between adjacent amino acids.
Not random cleavage, different proteases show a preference for different targets.
36. There are a number of Fasciola cathepsin L sequences known.
At least 30 full sequences now known
Only one contains an indel
Protein sequences 46-99% identical
37. What are the differences between the two classes of CatL that account for the substrate specificity?
Presumed to be due to changes affecting the S2 subsite of the enzyme.
38. Homology Modelling
FhCatL modelled on the known crystal structure of human CatL.
Models of CatL2 and CatL5 (functional equivalent of CatL1) compared, especially around the S2 subsite of the enzyme.
39. Homology Modelling
Three substitutions in residues lining the S2 subsite were observed (L5 -> L2)
L69Y: Makes substantial contacts with the P2 Phe
N161T: Side chain points away from pocket
G163A: Bottom of pocket, no substantial contact with P2 Phe
40. [Figure: GRASP electrostatic surface potential, panels L2 and L5]
The architecture around the S2 pocket is substantially influenced by a Y or L at position 69.
Made mutant, expressed in yeast, performed kinetic analysis.
41. Conclusions
The L69Y change does affect the substrate specificity
69Y allows increased catalysis of substrates with a P2 proline
There are other, more subtle changes between L5 and L2
45. Fasciola CatLs form a monophyletic clade
Fasciola sequences aligned to the family of papain-like cysteine proteases
100% bootstrap support for clade
All Fasciola sequences arose after divergence from Schistosoma
Probably all parasitic CatLs have diverged after speciation (Sajid and McKerrow)
50. Evolutionary Timeframe
First observed divergence (clade A) 135 MYA
F. hepatica and F. gigantica predicted to diverge approx. 19-25 MYA
Confirmed by constructing a neighbour-joining tree using glutathione S-transferase sequences: 19 +/- 5.2 MYA
51. Practice runs - 1. BLAST
Go to the BLAST server at NCBI
Note the different “flavours” of BLAST that can be performed.
Go to Protein-Protein BLAST. Look at the format and the searching parameters.
Paste in sequence 1 and run the BLAST
52. Sequence 1
What is it? (note that a conserved domain is detected)
From what organism (should be 100% match)?
What is the organism that has the closest relative?
What is meant by “positives”?
53. For interest, use sequence 2 to run a BLAST
This is the mRNA sequence from which the protein sequence is translated (note: choose your BLAST flavour carefully!)
Is the same result obtained?
54. Practice runs - 2. CLUSTAL W
Go to
Upload (or paste) Seq3.txt, run the tool
Does the dendrogram resemble that previously demonstrated?