1 Combinatorial optimisation in protein structure prediction and recognition: Background, review, and research direction Speaker: Vicky Mak
2 What’s in this talk?
- What is protein structure prediction and recognition?
- Who has done what before?
- What’s interesting and hasn’t been done?
- Being critical about others’ work is easy.
- Doing something brilliant is difficult.
- This talk addresses the easy problem.
3 Combining two amino acids
[Figure: two amino acids before and after joining. Labels: amino group, residue, α-carbon, carboxy group, N-terminal, C-terminal.]
4 Protein: polypeptide chain
A polypeptide chain is a chain of amino acids linked together by peptide bonds. All amino acids share the same backbone and differ only in their residues. There are 20 such amino acids, and different combinations of them make different proteins. A protein sequence can contain from tens to thousands of amino acids.
[Figure labels: N-terminal, C-terminal]
5 An example
[Figure labels: α-helix, β-sheet]
Primary structure: the sequence of individual amino acids. Secondary structure: α-helices and β-sheets. The green chain defines a tertiary structure; so does the blue chain. Quaternary structure: green and blue chains together.
6 Motivation
- Notice: it is the 3-D structure of a protein that matters (2 different sequences can have exactly the same structure!)
- We need to know the “shape” of a protein in order to develop antibodies that “bind” that shape: fold prediction.
- Antibodies produced against one protein may also work for another protein that “looks similar”: structure recognition.
7 Structure prediction
8 HP models (ab initio prediction)
- Given a sequence of amino acids, determine the structure from scratch.
- Hydrophobic-hydrophilic (HP) model
  - proposed by Dill (1985)
- Two groups of amino acids:
  - Hydrophobic acids (H)
  - Hydrophilic acids (P)
- Self-avoiding walks on lattices
- Objective: minimise global free energy
  - That is, it is good to put as many hydrophobic acids as close together as possible.
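The objective above can be made concrete with a minimal sketch. It assumes the 2-D square lattice, a unit credit of -1 per H-H contact, and that a contact means two hydrophobic acids on neighbouring lattice points that are not consecutive in the chain. The function names `hh_contacts` and `hp_energy` are illustrative and not from the talk.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of an HP sequence folded on the 2-D square lattice.

    sequence: string over {'H', 'P'}.
    coords: list of (x, y) tuples, one lattice point per residue, in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("need exactly one lattice point per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("walk is not self-avoiding")
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {point: i for i, point in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is seen once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


def hp_energy(sequence, coords):
    """Energy under the assumed unit credit of -1 per H-H contact."""
    return -hh_contacts(sequence, coords)


# Four residues folded around a unit square: the two end H's touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))  # prints -1
```

Minimising `hp_energy` over all self-avoiding walks is then the optimisation problem the slide describes.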
9 HP model on lattices: a 2-dimensional example
[Figure legend: hydrophobic acids, hydrophilic acids]
10 HP model on lattices: a 2-dimensional example
[Figure legend: hydrophobic acids, hydrophilic acids. Caption: fold with 5 hydrophobic contacts]
11 Previous work on HP models
- Most previous work involves complete enumeration of self-avoiding random walks on various lattices (e.g. Lau and Dill (1989), Irback and Troein (2002))
  - Irback and Troein (2002) managed sequences with up to 25 amino acids
- Unger and Moult (1993): hybrid Genetic Algorithm and Simulated Annealing (2-D)
  - Opt for sizes 36, 48, 60 (Opt?! How do they know?)
  - Shakhnovich et al. (1991) tried SA on acid problems. (Only 1 found the global minimum. Inappropriate local search is to blame.)
- Backofen (2001): constraint programming approach
  - tested problems of size 27-36, time: 20 min to 1 hr 38 min (opt)
- IP models proposed recently in Greenberg, Hart and Lancia (2002). No numerical results reported as yet.
  - (See pages 1-4 of pdf file)
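To illustrate what complete enumeration means here, the following is a minimal sketch of an exhaustive depth-first search over self-avoiding walks on the 2-D square lattice. It is not the method of any of the cited papers; the function name `best_fold` and the unit-credit contact count are assumptions for illustration. The number of walks grows exponentially with the chain length, which is consistent with enumeration being reported only for short sequences.

```python
def best_fold(sequence):
    """Exhaustively search 2-D square-lattice folds of an HP sequence.

    Returns (max_contacts, coords), where coords is one optimal
    self-avoiding walk given as a list of (x, y) tuples.
    """
    n = len(sequence)
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    best = {"contacts": -1, "coords": None}

    def extend(path, index_at, contacts):
        i = len(path)  # index of the residue to place next
        if i == n:
            if contacts > best["contacts"]:
                best["contacts"], best["coords"] = contacts, list(path)
            return
        x, y = path[-1]
        for dx, dy in moves:
            point = (x + dx, y + dy)
            if point in index_at:  # keep the walk self-avoiding
                continue
            gained = 0
            if sequence[i] == "H":
                for ex, ey in moves:
                    j = index_at.get((point[0] + ex, point[1] + ey))
                    # j < i - 1 excludes the chain neighbour of residue i.
                    if j is not None and j < i - 1 and sequence[j] == "H":
                        gained += 1
            index_at[point] = i
            path.append(point)
            extend(path, index_at, contacts + gained)
            path.pop()
            del index_at[point]

    if n == 0:
        return 0, []
    # Fixing the first step removes the rotational copies of every fold.
    start = [(0, 0), (1, 0)][:n]
    extend(start, {p: k for k, p in enumerate(start)}, 0)
    return best["contacts"], best["coords"]


contacts, coords = best_fold("HHPPHH")
print(contacts, coords)
```

Only rotations are removed by fixing the first step; mirror images are still enumerated twice, which is one instance of the symmetry issue raised on the next slide.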
12 Problems with IP models
- Dealing with symmetry
  - Methods are suggested in Greenberg, Hart and Lancia (2002) and in Backofen’s PhD thesis.
  - What about other lattices?
- Number of lattice points unnecessarily large.
  - Lau and Dill (1989) proposed maximal compact chain conformations: lattice walks in which every point is occupied by exactly one amino acid.
    - E.g. a 3x3x3 cubic lattice for a 27-amino-acid chain
  - Maybe not that tight, but definitely not n^2.
  - May be a union of some of those maximal compact chain conformations.
13 Let’s be critical
- Cubic lattices are probably not good enough, but they are a good start anyway.
  - Faulon, Rintoul and Young (2002) tried 2-D honeycomb, 2-D square, 3-D diamond and 3-D cubic lattices. Agarwala et al. used a triangular lattice (constrained SAW, no optimisation involved).
- Use an energy matrix rather than a simple unit credit for each HH interaction? (Different hydrophobicity)
  - The energy released by putting different pairs of H-acids together differs, and depends on how far apart they are in the sequence!
  - Dill’s HP model is too simplified.
  - Besides, interactions between H-acids should be defined differently from the lattice domain and neighbourhood.
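14 Under the old definitions, suppose the marked acids [figure missing] are hydrophobic; the configurations shown [figure missing] are all the same.
15 But surely some configurations [figure missing] look better than others [figure missing].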
16 Research opportunities
- Exact algorithms
  - Alternative ILP formulations (with tight LP relaxation bounds)
    - Distinguish the lattice neighbourhood from the hydrophobic interaction neighbourhood (use Euclidean distance for the latter).
  - Development of solution methodologies
  - Modify Dill’s model to deal with reality
    - Alternative lattices (apply optimisation techniques as opposed to complete or simple constrained enumeration).
    - More complicated hydrophobicity (Atkins and Hart (1999) discussed a fixed energy matrix and proved NP-hardness).
    - Previous methods use either constraint programming or integer linear programming. Why not a hybrid CP and ILP approach?
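The distinction between the lattice neighbourhood and a Euclidean interaction neighbourhood can be sketched as follows. This is an illustration under assumptions, not a formulation from the talk: the function name `hh_interactions` and the choice of cutoff values are hypothetical, and chain neighbours are excluded from the count.

```python
import math


def hh_interactions(sequence, coords, cutoff):
    """Count non-consecutive H-H pairs within a Euclidean distance cutoff."""
    n = len(sequence)
    count = 0
    for i in range(n):
        for j in range(i + 2, n):  # skip chain neighbours
            if sequence[i] == "H" and sequence[j] == "H":
                # Small tolerance so that pairs exactly at the cutoff count.
                if math.dist(coords[i], coords[j]) <= cutoff + 1e-9:
                    count += 1
    return count


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# With cutoff 1 only lattice neighbours interact: the two H's sit on a
# diagonal of the unit square, so nothing is counted.
print(hh_interactions("HPHP", square, 1.0))           # prints 0
# With cutoff sqrt(2) the diagonal pair is counted as an interaction.
print(hh_interactions("HPHP", square, math.sqrt(2)))  # prints 1
```

With cutoff 1 this reduces to the usual lattice contact count; a larger cutoff rewards folds that the unit-distance definition cannot tell apart.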
17 Research opportunities
- No method so far can manage a sequence with >100 amino acids
  - Heuristics:
    - Meta-heuristics: still room for research; try different neighbourhood schemes, or tailor-made search techniques that consider folding patterns
    - Development of problem-specific or greedy heuristics. At least these will provide quick initial bounds for exact methods.
18 Structure recognition
19
- Sequence alignment
  - Comparing a sequence of amino acids with known sequences in the Protein Data Bank (PDB) at the primary-structure level.
  - Does this sequence look like that sequence?
  - Methods are well developed, e.g. BLAST.
- Fold recognition
  - Comparing the structure of an unknown protein with known protein structures in the PDB.
    - Contact Map Optimisation (primary-structure comparisons)
    - Arthur Lesk’s model (secondary-structure comparisons)
    - Ip et al.’s model (secondary-structure comparisons)
20 Contact Map Optimisation
- Comparing the 3-D structures of two sequences of amino acids, e.g. s=(s_1..s_m) and t=(t_1..t_n). (Assuming you already know what each of them looks like, and you now want to know how much they look like each other.)
- Construct an undirected graph for each of s and t, with amino acids as vertices.
- For each sequence, two amino acids that are within a certain Euclidean distance of each other are connected by an edge.
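The graph construction just described can be sketched in a few lines. The function name `contact_map` and the toy coordinates are assumptions for illustration; the slide does not say whether chain neighbours are excluded, so this sketch keeps every pair within the threshold.

```python
import math
from itertools import combinations


def contact_map(coords, threshold):
    """Return the contact edges {(i, j), i < j} of one folded sequence.

    coords: one 3-D point per amino acid, in chain order.
    threshold: Euclidean distance below which two acids are in contact.
    """
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) <= threshold
    }


# Toy coordinates, not taken from any real protein.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0)]
print(sorted(contact_map(coords, 1.0)))  # prints [(0, 1), (0, 3), (1, 2)]
```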
21 Contact Map Optimisation
[Figure: the graphs of s and t, with vertices s_1, s_2, ..., s_m and t_1, t_2, ..., t_n]
22 Contact Map Optimisation One way of mapping. 4 pairs of edges mapped.
23 Contact Map Optimisation Another way of mapping. 5 edges mapped.
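The quantity being compared across mappings can be sketched as follows: given the contact edges of s and t and a mapping of some vertices of s to vertices of t, count the edges of s whose two endpoints are mapped onto an edge of t. The function names are illustrative, and the requirement that the mapping preserves sequence order (no crossings) is taken from the no-crossing constraints discussed later in the talk.

```python
def is_noncrossing(alignment):
    """True if the (i, j) pairs are strictly increasing in both coordinates."""
    pairs = sorted(alignment)
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(pairs, pairs[1:]))


def shared_contacts(contacts_s, contacts_t, alignment):
    """Count contacts of s that the alignment maps onto contacts of t.

    contacts_s, contacts_t: sets of vertex pairs (edges of each contact map).
    alignment: iterable of (i, j) pairs meaning s_i is mapped to t_j.
    """
    alignment = list(alignment)
    if not is_noncrossing(alignment):
        raise ValueError("alignment must be one-to-one and order-preserving")
    image = dict(alignment)
    edges_t = {tuple(sorted(edge)) for edge in contacts_t}
    count = 0
    for u, v in contacts_s:
        if u in image and v in image:
            if tuple(sorted((image[u], image[v]))) in edges_t:
                count += 1
    return count


contacts_s = {(0, 2), (1, 3)}
contacts_t = {(0, 3), (1, 4)}
print(shared_contacts(contacts_s, contacts_t, [(0, 0), (1, 1), (2, 3), (3, 4)]))  # prints 2
```

Contact Map Optimisation asks for the non-crossing alignment that maximises this count.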
24 Wait a minute...
- Remember from the HP models that amino acids are divided into two groups. What is the point of mapping a hydrophobic amino acid in one graph to a hydrophilic amino acid in the other, or vice versa???
- Adding constraints so that only amino acids of the same group can be matched might be helpful!!!
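A minimal sketch of the suggested restriction: before building any model, keep only the vertex pairs whose H/P classes agree. The function name `compatible_pairs` is hypothetical, and the sequences are assumed to be already translated into H/P strings, since the talk does not fix a particular hydrophobicity classification.

```python
def compatible_pairs(hp_s, hp_t):
    """List the (i, j) pairs where s_i and t_j belong to the same H/P group."""
    return [
        (i, j)
        for i, a in enumerate(hp_s)
        for j, b in enumerate(hp_t)
        if a == b
    ]


hp_s, hp_t = "HP", "HPH"
pairs = compatible_pairs(hp_s, hp_t)
print(pairs)                               # prints [(0, 0), (0, 2), (1, 1)]
print(len(pairs), len(hp_s) * len(hp_t))   # prints 3 6
```

In an ILP setting this amounts to dropping the alignment variables for incompatible pairs, so the model shrinks before it is ever solved.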
25 Who has done what?
- No one noticed the HP issue, so the models aren’t 100% cool.
- Lancia et al. (2001) ILP model (see pages 5-6 of pdf file)
  - The LP relaxation of the no-crossing constraints is typically weak, hence clique constraints (exponentially many) are introduced.
  - The problem can be converted to a maximum independent set problem, for which clique inequalities are facet-defining.
  - O(n^2) time separation for cliques.
  - Root-node LP relaxation: from 1 min to 2 hours for acids and contacts. (The more alike the two proteins are, the faster the LP relaxation can be solved!)
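For orientation, one common way to write a contact map ILP of this kind is sketched below. The notation is an assumption made here and may differ from the model on the referenced pdf pages: $x_{ij}=1$ if $s_i$ is mapped to $t_j$, $y_{(i,k)(j,l)}=1$ if contact $(i,k)$ of $s$ is mapped onto contact $(j,l)$ of $t$, and $E_s$, $E_t$ are the contact edge sets with $i<k$ and $j<l$.

```latex
\begin{align*}
\max\ & \sum_{(i,k)\in E_s}\ \sum_{(j,l)\in E_t} y_{(i,k)(j,l)} \\
\text{s.t.}\ & y_{(i,k)(j,l)} \le x_{ij}, \qquad y_{(i,k)(j,l)} \le x_{kl}
  && \forall (i,k)\in E_s,\ (j,l)\in E_t \\
& x_{ij} + x_{kl} \le 1
  && \forall (i,j)\ne(k,l) \text{ with } (i\le k,\ j\ge l) \text{ or } (i\ge k,\ j\le l) \\
& x_{ij},\ y_{(i,k)(j,l)} \in \{0,1\}
\end{align*}
% Clique strengthening: for any set C of pairwise crossing pairs (i,j),
%   \sum_{(i,j) \in C} x_{ij} \le 1 .
```

The pairwise no-crossing constraints are the ones with a weak LP relaxation; each clique inequality replaces many of them with a single, tighter constraint, at the price of there being exponentially many cliques to separate.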
26 Who has done what?
- Heuristic approaches:
  - Lancia et al. (2001)
    - Genetic algorithm (GA)
    - Steepest ascent local search
- Results of Lancia et al.
  - Exact algorithm
    - Gaps: 0 to >5% (Mostly >5%; exactly how much??)
  - Heuristics
    - Same story as above. GA much better than LS.
- Work on similar topics can also be found in Havel et al. (1979), Martin et al. (1992) and so on.
27 Let’s be critical...
- Even just the LP relaxation of the IP formulation without no-crossing constraints takes a long time to solve when comparing pairs of real protein sequences with amino acids.
  - Tried comparing two sequences with 120+ amino acids; it took more than 10 hours!!!
- We really should consider the HP issue, and maybe even aggregate certain amino acids!
28 Let’s be critical...
- A big problem with the model: a 3-D example
Consider the following sequence [figure missing]. Two different structures give the same objective value under the ILP formulation of Lancia et al., assuming acids within a Euclidean distance of 3^(1/3) are connected by an edge.
29 Research opportunities
- Exact methods
  - New ILP formulation.
  - Alternative solution methodologies for solving the ILPs, now that we know the ILP models are huge and solving them is hard.
- Heuristics
  - Problem-specific heuristics.
  - Different neighbourhood searches for meta-heuristics.
30 Arthur Lesk’s model
- Compare the structures of two protein sequences by inspecting relations between secondary structures
[Figure caption: Does the blue protein look like the green protein?]
33 [Figure labels: Protein sequence 1, Protein sequence 2]
34 Similar to CMO...
[Figure; labels not recoverable]
35 Useful papers and websites
- Greenberg, H.J., Hart, W.E., Lancia, G. “Opportunities for Combinatorial Optimization in Computational Biology”
- http://www.dkfz-heidelberg.de/tbi/bioinfo/ProteinStructure/
- Christian Lemmen and Thomas Lengauer. “Computational methods for the structural alignment of molecules”, Journal of Computer-Aided Molecular Design, 2000.