Multiple sequence alignment (vWF)
Rat     QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit  QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat     REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
The problem: find a conservation score for each position.
Finding conserved regions from an alignment
S1 KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2 KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN
S3 KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN
S4 KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN
S5 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN
S6 KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS
Solution 1: assign a score of 1 if the position is fully conserved and a score of 0 if it is variable.
Problem: this method is very crude. It cannot distinguish a position with a single deviating sequence from one that differs in every sequence.
Finding conserved regions from an alignment
S1 KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2 KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN
S3 KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN
S4 KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN
S5 KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN
S6 KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS
Solution 2: count the number of character states at each position.
Problem: this method does not take the evolutionary tree into account.
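The two naive scores (Solution 1 and Solution 2) can be sketched in a few lines of Python. This is only an illustration: the function names `binary_scores` and `state_counts` are made up here, and the alignment is the six-sequence example from the slide.

```python
# The six-sequence example alignment shown on the slide.
ALIGNMENT = {
    "S1": "KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",
    "S2": "KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN",
    "S3": "KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN",
    "S4": "KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN",
    "S5": "KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN",
    "S6": "KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS",
}


def columns(alignment):
    """Return the alignment as a list of columns (one tuple per position)."""
    return list(zip(*alignment.values()))


def binary_scores(alignment):
    """Solution 1: 1 if every sequence has the same residue, else 0."""
    return [1 if len(set(col)) == 1 else 0 for col in columns(alignment)]


def state_counts(alignment):
    """Solution 2: number of distinct residues observed at each position."""
    return [len(set(col)) for col in columns(alignment)]


if __name__ == "__main__":
    print("binary:", "".join(str(s) for s in binary_scores(ALIGNMENT)))
    print("states:", "".join(str(c) for c in state_counts(ALIGNMENT)))
```

Both scores look at each column in isolation, which is exactly the limitation the slide points out: neither knows how the sequences are related.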
Evolutionary forces (e.g., mutation and selection) are the source of sequence variation.
[Figure: phylogenetic tree with leaves S1, S2, S3, S6, S5, S4]
A phylogenetic tree represents the history of evolution for the entire sequence. It is inferred from all positions or from external data (e.g., fossils, other genes).
[Figure: phylogenetic tree with leaves S1, S2, S3, S6, S5, S4]
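Mapping changes onto the tree
[Figure: tree with leaf states S1(K), S2(A), S3(A), S6(A), S5(K), S4(K); internal nodes labeled K or A]
Alignment (two positions; the first is mapped onto the tree):
S1 K K
S2 A A
S3 A A
S4 K A
S5 K K
S6 A K
3 K's, 3 A's and one replacement.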
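Mapping changes onto the tree
[Figure: tree with leaf states S1(K), S2(A), S3(A), S6(K), S5(K), S4(A); internal nodes labeled K or A]
Alignment (two positions; the second is mapped onto the tree):
S1 K K
S2 A A
S3 A A
S4 K A
S5 K K
S6 A K
3 K's, 3 A's and 3 replacements.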
Maximum Parsimony (MP)
When the phylogenetic tree is known, we compute for each position the minimum number of changes needed to “explain” the data. The more changes, the more variable the position.
Mapping changes onto the tree
[Figure: tree with leaf states S1(K), S2(A), S3(A), S6(A), S5(K), S4(K); internal nodes labeled K or A]
Maximum parsimony score = 1 -> conserved.
Mapping changes onto the tree
[Figure: tree with leaf states S1(K), S2(A), S3(A), S6(K), S5(K), S4(A); internal nodes labeled K or A]
Maximum parsimony score = 3 -> variable.
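The per-position parsimony score can be computed with Fitch's algorithm, a standard method for this task (the slides do not name a specific algorithm). The sketch below is a minimal version under one loud assumption: the tree topology is not recoverable from the slide text, so `TREE` is a hypothetical topology chosen only because it reproduces the scores of 1 and 3 quoted above.

```python
def fitch(node, states):
    """Return (candidate ancestral states, minimum number of changes).

    `node` is either a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to its residue at the position of interest.
    """
    if isinstance(node, str):  # leaf: its observed residue, no changes yet
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:  # children can agree: no extra change needed here
        return common, left_cost + right_cost
    # children disagree: one replacement must have occurred below this node
    return left_set | right_set, left_cost + right_cost + 1


# ASSUMPTION: hypothetical rooted topology, not taken from the slides.
TREE = ((("S2", "S6"), "S3"), (("S1", "S4"), "S5"))

# The two alignment positions shown on the slides.
COLUMN_1 = {"S1": "K", "S2": "A", "S3": "A", "S4": "K", "S5": "K", "S6": "A"}
COLUMN_2 = {"S1": "K", "S2": "A", "S3": "A", "S4": "A", "S5": "K", "S6": "K"}

if __name__ == "__main__":
    for name, column in (("position 1", COLUMN_1), ("position 2", COLUMN_2)):
        _, score = fitch(TREE, column)
        print(name, "parsimony score =", score)
```

Both positions have three K's and three A's, so the state count cannot tell them apart; only the tree reveals that one needs a single replacement and the other needs three.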
What if the tree is not known…
S1 K K
S2 A A
S3 A A
S4 K A
S5 K K
S6 A K
[Figure: candidate tree with leaves S1, S2, S3, S6, S5, S4]
The score of each tree is the sum of scores over all positions. If the tree is not known, we choose the tree with the lowest score: the maximum parsimony tree.
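As a rough sketch of this selection step, the code below scores a handful of candidate trees by summing the per-position parsimony score and keeps the lowest. The three candidate topologies and their names are hypothetical; a real search has to explore a vastly larger tree space, and with only two positions ties between trees are expected.

```python
def min_changes(node, column):
    """Fitch parsimony: (candidate states, minimum changes) for one position."""
    if isinstance(node, str):
        return {column[node]}, 0
    (ls, lc), (rs, rc) = min_changes(node[0], column), min_changes(node[1], column)
    common = ls & rs
    return (common, lc + rc) if common else (ls | rs, lc + rc + 1)


def tree_score(tree, alignment):
    """Score of a tree = sum of per-position scores over all positions."""
    names = list(alignment)
    length = len(alignment[names[0]])
    total = 0
    for i in range(length):
        column = {name: alignment[name][i] for name in names}
        total += min_changes(tree, column)[1]
    return total


# The two-position alignment from the slide.
ALIGNMENT = {"S1": "KK", "S2": "AA", "S3": "AA", "S4": "KA", "S5": "KK", "S6": "AK"}

# ASSUMPTION: hypothetical candidate topologies, for illustration only.
CANDIDATES = {
    "tree_a": ((("S2", "S6"), "S3"), (("S1", "S4"), "S5")),
    "tree_b": ((("S2", "S3"), "S4"), (("S1", "S5"), "S6")),
    "tree_c": ((("S2", "S3"), "S6"), (("S1", "S5"), "S4")),
}

if __name__ == "__main__":
    scores = {name: tree_score(tree, ALIGNMENT) for name, tree in CANDIDATES.items()}
    best = min(scores, key=scores.get)  # first of the lowest-scoring trees
    print(scores, "-> chosen:", best)
```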
Parsimony has many shortcomings. To name a few:
(1) All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His).
(2) It cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc.).
(3) Its statistical basis is questionable.
Maximum likelihood uses a probabilistic model of evolution
Each amino acid has a certain probability of changing, and this probability depends on the evolutionary distance. Evolutionary distances are inferred from the entire set of sequences.
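To make "probability depends on the evolutionary distance" concrete, here is a minimal sketch using the simplest possible model, in which all 20 amino acids are equally frequent and every replacement is equally likely. This model is an assumption made for illustration; the slides do not say which model is used, and practical analyses normally rely on empirical replacement matrices that give, for example, Leu->Ile a higher probability than Leu->His. The distance `t` is measured in expected replacements per site.

```python
import math

N_STATES = 20  # number of amino acids


def p_same(t):
    """Probability that a site shows the same amino acid after distance t."""
    decay = math.exp(-N_STATES * t / (N_STATES - 1))
    return 1 / N_STATES + (1 - 1 / N_STATES) * decay


def p_specific_change(t):
    """Probability of ending in one particular different amino acid."""
    return (1 - p_same(t)) / (N_STATES - 1)


if __name__ == "__main__":
    for t in (0.01, 0.1, 1.0, 10.0):
        print(f"t={t:5.2f}  P(same)={p_same(t):.3f}  "
              f"P(one specific other)={p_specific_change(t):.4f}")
```

At short distances the residue is almost certainly unchanged; at very long distances the probability approaches 1/20, meaning the site carries no memory of its starting state.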
Evolutionary distances
Positions can be conserved for two reasons: functional constraints, or short evolutionary time. Five replacements in 10 positions between two chimps is considered very variable. Five replacements in 10 positions between human and cucumber is not considered that variable. Maximum likelihood takes this information into account.
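The chimp versus cucumber comparison can be illustrated numerically. The sketch below asks how probable it is to see exactly 5 differences in 10 independent positions at a short and at a long evolutionary distance. Everything quantitative here is an assumption: the equal-rates amino-acid model, the independence of sites, and the two distances (0.01 and 1.0 expected replacements per site), which are invented stand-ins for "two chimps" and "human and cucumber".

```python
import math


def p_diff(d, n_states=20):
    """Probability that two sequences differ at a site at distance d
    (equal-rates model: all replacements equally likely)."""
    return (1 - 1 / n_states) * (1 - math.exp(-n_states * d / (n_states - 1)))


def prob_k_differences(k, n, d):
    """Binomial probability of exactly k differing sites out of n."""
    p = p_diff(d)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# ASSUMPTION: illustrative distances, not estimates for real species.
CLOSE_PAIR = 0.01
DISTANT_PAIR = 1.0

if __name__ == "__main__":
    print("close pair  :", prob_k_differences(5, 10, CLOSE_PAIR))
    print("distant pair:", prob_k_differences(5, 10, DISTANT_PAIR))
```

The same observation is wildly surprising for the close pair and unremarkable for the distant pair, which is why a conservation score should be calibrated against evolutionary distance.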
Maximum Parsimony            | Maximum Likelihood
All changes counted the same | Different probabilities for the different types of substitutions
Statistically questionable   | Statistically robust
Ignores biological context   | Accounts for biological context
The likelihood computations
[Figure: tree with nodes labeled X, Y, Z, C, K, M, A and branch lengths t1 to t6]
With likelihood models we can:
1. Infer the phylogenetic tree
2. Compute conservation for each site
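The likelihood of one alignment position on a tree is usually computed with Felsenstein's pruning algorithm, which sums over all possible residues at the internal nodes. The sketch below is a minimal version under stated assumptions: the equal-rates amino-acid model, a hypothetical topology grouping C with K and M with A, and invented branch lengths. The figure's actual topology and branch lengths are not recoverable from the text.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """P(residue a becomes b over branch length t), equal-rates model."""
    decay = math.exp(-N * t / (N - 1))
    return 1 / N + (1 - 1 / N) * decay if a == b else (1 - decay) / N


def partial_likelihoods(node):
    """Map each residue s to P(observed leaves below node | node is s).

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in AMINO_ACIDS}
    result = {s: 1.0 for s in AMINO_ACIDS}
    for child, branch_length in node:
        below = partial_likelihoods(child)
        for s in AMINO_ACIDS:
            result[s] *= sum(transition_prob(s, c, branch_length) * below[c]
                             for c in AMINO_ACIDS)
    return result


def site_likelihood(tree):
    """Likelihood of one position, with a uniform prior at the root."""
    root = partial_likelihoods(tree)
    return sum(root[s] / N for s in AMINO_ACIDS)


# ASSUMPTION: hypothetical topology and branch lengths.
EXAMPLE_TREE = [([("C", 0.1), ("K", 0.2)], 0.05), ([("M", 0.1), ("A", 0.3)], 0.05)]

if __name__ == "__main__":
    print("site likelihood:", site_likelihood(EXAMPLE_TREE))
```

Tree inference maximizes the product of such site likelihoods over topologies and branch lengths; per-site conservation reuses the same computation with the tree held fixed.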
Maximum likelihood tree reconstruction
This is computationally very demanding, but efficient algorithms that find approximate solutions have been developed.
Back to conservation: ‘rate of evolution’
Given a multiple sequence alignment (MSA), we define:
Conserved site = slow-evolving site
Variable site = fast-evolving site
We estimate the rate of evolution for each site in the alignment.
Evolutionary rates
We model the rate by assuming that each site i in the sequence has a different rate, r_i, relative to the average rate over all sites. A site of rate 2 evolves twice as fast as the average.
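The "relative to the average" convention amounts to rescaling per-site rates so that their mean is 1. The sketch below shows only that normalization step; the raw values are invented numbers, and how the raw rates are estimated (by maximum likelihood in this approach) is outside its scope.

```python
def relative_rates(raw_rates):
    """Rescale per-site rates so that the average over all sites is 1."""
    mean = sum(raw_rates) / len(raw_rates)
    return [r / mean for r in raw_rates]


# ASSUMPTION: made-up raw rate estimates for four sites.
RAW = [0.2, 0.4, 1.0, 0.4]

if __name__ == "__main__":
    for i, r in enumerate(relative_rates(RAW), start=1):
        label = "faster than average" if r > 1 else "slower than average"
        print(f"site {i}: r_i = {r:.2f} ({label})")
```

In this toy input the third site has twice the mean raw rate, so its relative rate r_i is 2: it evolves twice as fast as the average site.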
“conseq” (http://consurf.tau.ac.il/~consurf/conseq/html/form.html)
Bcl-XL: a key regulator influencing the release of apoptosis-promoting factors from mitochondria.
Melamed D., et al. J. Virol (2004) 78:9675-9688
Conseq was used to study 11 unstructured amino acids in the Capsid Domain (CA) of the Gag protein. The Capsid Domain of the Gag protein makes a major contribution to the assembly process of the virion particle.
Integrating the 3D information We map each color onto the 3D structure.
Integrating the 3D information: validation of the method (1) Do the results make sense for biologists?
Example: Bcl-XL protein (PDB ID 1bxl)
Conservation pattern in the Bcl-XL protein, using an alignment of 53 homologues from Protomap.
Primary signal: Bak/Bcl-XL interface.
Secondary signal: BH4 homology region, found only in the Bcl-2 subfamily (BH4 may interact with CED-4).
The Structure of Human Src Tyrosine Kinase (Adapted from: Branden and Tooze, 1999)
SH2-SH3 interface ML results (233 SH2 homologues)
Web-Server We developed a Web server applying this method. Using this server, one can enter a single PDB structure, and the server finds homologous sequences, produces the alignment and the tree, calculates the conservation scores, and visualizes the results on the 3D structure…