Probabilistic Methods for Interpreting Electron-Density Maps Frank DiMaio University of Wisconsin – Madison Computer Sciences Department
3D Protein Structure [Figure: protein chain with labels: backbone, sidechain, C-alpha]
3D Protein Structure [Figure: amino-acid sequence ALA LEU PRO VAL ARG … and an unknown (???) 3D structure]
High-Throughput Structure Determination: Protein-structure determination is important for understanding the function of a protein, understanding mechanisms, and finding targets for drug design. Some proteins produce poor density maps. Interpreting poor electron-density maps requires a great deal of human labor. I aim to automatically interpret poor-quality electron-density maps.
Electron-Density Map Interpretation. GIVEN: a 3D electron-density map and the (linear) amino-acid sequence.
Electron-Density Map Interpretation. FIND: an all-atom protein model.
My Focus [Figure: density-map resolution axis from 1.0 Å to 4.0 Å, marking the ranges addressed by Morris et al. (2003), Ioerger et al. (2002), Terwilliger (2003), and my focus]
Thesis Contributions: (1) A probabilistic approach to protein-backbone tracing (DiMaio et al., Intelligent Systems for Molecular Biology, 2006). (2) Improved template matching in electron-density maps (DiMaio et al., IEEE Conference on Bioinformatics and Biomedicine, 2007). (3) Creating all-atom protein models using particle filtering (DiMaio et al., under review). (4) Pictorial structures for atom-level molecular modeling (DiMaio et al., Advances in Neural Information Processing Systems, 2004). (5) Improving the efficiency of belief propagation (DiMaio and Shavlik, IEEE International Conference on Data Mining, 2006). (6) Iterative phase improvement in ACMI.
ACMI Overview. Phase 1: Local pentapeptide search (ISMB 2006, BIBM 2007): independent amino-acid search; templates model the 5-mer conformational space. Phase 2: Coarse backbone model (ISMB 2006, ICDM 2006): protein structural constraints refine the local search; a Markov random field (MRF) models pairwise constraints. Phase 3: Sample all-atom models: particle filtering samples high-probability structures; probabilities from the MRF guide particle trajectories.
5-mer Lookup [Figure: a 5-mer from the sequence …SAW C VKFEKPADKNGKTE… is looked up in a protein database]. ACMI searches the map for each template independently. A spherical-harmonic decomposition allows a rapid search over all template rotations.
Spherical-Harmonic Decomposition [Figure: a function $f(\theta, \varphi)$ on the sphere]
5-mer Fast Rotation Search [Figure: a pentapeptide fragment from the PDB (the "template") and the electron-density map. The template's calculated (expected) density in a 5 Å sphere and the sampled region of map density in a 5 Å sphere are each sampled in spherical shells.]
5-mer Fast Rotation Search [Figure: the shell-sampled map region and template density are each converted to spherical-harmonic coefficients; the fast-rotation function (Navaza 2006, Risbo 1996) then gives the correlation coefficient as a function of rotation.]
Convert Scores to Probabilities: Scanning the density map for a fragment gives correlation coefficients $t_i(u_i)$ over the density map. Bayes' rule converts these into a probability distribution over the density map, $P(\text{5-mer at } u_i \mid \text{EDM})$.
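The conversion from correlation scores to probabilities can be illustrated with a minimal sketch. It assumes, purely for illustration, that scores at matching and non-matching locations are Gaussian-distributed; the means, standard deviations, and prior below are hypothetical values, not ACMI's.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def scores_to_posterior(scores, prior=0.01, match=(0.6, 0.15), background=(0.1, 0.15)):
    """Apply Bayes' rule: P(match | score) from assumed score likelihoods.

    prior, match=(mean, std), and background=(mean, std) are hypothetical.
    """
    scores = np.asarray(scores, dtype=float)
    like_match = gaussian_pdf(scores, *match)
    like_background = gaussian_pdf(scores, *background)
    evidence = prior * like_match + (1.0 - prior) * like_background
    return prior * like_match / evidence

# Three example correlation coefficients: poor, middling, and strong matches.
posterior = scores_to_posterior([0.05, 0.3, 0.7])
```

With equal standard deviations the posterior rises monotonically with the correlation score, so the ranking of map locations is preserved while the values become comparable across templates.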
Probabilistic Backbone Model: A trace assigns a position and orientation $u_i = \{x_i, q_i\}$ to each amino acid $i$. The probability of a trace $U = \{u_i\}$ is $P(U \mid \text{EDM})$. This full joint probability is intractable to compute, so we approximate it using a pairwise Markov random field.
Pairwise Markov-Field Model: Joint probabilities are defined on a graph as a product of vertex and edge potentials. [Figure: graph over residues GLY, LYS, LEU, SER, ALA]
ACMI's Backbone Model: Observational potentials tie the map to the model. [Figure: graph over residues GLY, LYS, LEU, SER, ALA]
ACMI's Backbone Model: Adjacency constraints ensure that adjacent amino acids are ~3.8 Å apart and in the proper orientation. Occupancy constraints ensure that nonadjacent amino acids do not occupy the same 3D space. [Figure: graph over residues GLY, LYS, LEU, SER, ALA]
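A rough sketch of the two kinds of edge potential follows. The functional forms are assumptions made for illustration: a Gaussian around the 3.8 Å spacing for adjacency (orientation is ignored here) and a hard step for occupancy. The width `sigma` and the threshold `min_separation` are hypothetical parameters.

```python
import numpy as np

CA_DISTANCE = 3.8  # angstroms between adjacent C-alphas, as stated on the slide

def adjacency_potential(x_i, x_j, sigma=0.3):
    """Assumed Gaussian potential peaking when the two positions are 3.8 A apart."""
    d = np.linalg.norm(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.exp(-0.5 * ((d - CA_DISTANCE) / sigma) ** 2))

def occupancy_potential(x_i, x_j, min_separation=3.0):
    """Assumed step potential: zero if two nonadjacent residues overlap in space."""
    d = np.linalg.norm(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return 0.0 if d < min_separation else 1.0

ideal = adjacency_potential((0.0, 0.0, 0.0), (3.8, 0.0, 0.0))
stretched = adjacency_potential((0.0, 0.0, 0.0), (6.0, 0.0, 0.0))
clash = occupancy_potential((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
clear = occupancy_potential((0.0, 0.0, 0.0), (5.0, 0.0, 0.0))
```

The adjacency potential is sharply informative, while the occupancy potential is flat almost everywhere; that flatness is what makes the occupancy potentials "weak".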
Backbone Model Potential: The joint potential is a product of three kinds of term: constraints between adjacent amino acids, constraints between all other amino-acid pairs, and observational ("template-matching") probabilities.
Inferring Backbone Locations: We want to find the backbone layout that maximizes the joint probability defined by the backbone model potential. Exact methods are intractable, so we use belief propagation (Pearl 1988) to approximate the marginal distributions.
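Belief Propagation Example [Figure: nodes LYS 31 and LEU 32 with estimated marginals $\hat{p}_{LYS31}$ and $\hat{p}_{LEU32}$; message $m_{LYS31 \to LEU32}$ is passed]
Belief Propagation Example [Figure: nodes LYS 31 and LEU 32 with estimated marginals $\hat{p}_{LYS31}$ and $\hat{p}_{LEU32}$; message $m_{LEU32 \to LYS31}$ is passed]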
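The two-node exchange above can be sketched with standard sum-product message passing on a tiny discrete grid. This is a toy: the 1D five-point grid, the observational potentials, and the "exactly one grid step apart" edge potential are all invented for illustration and stand in for ACMI's 3D potentials.

```python
import numpy as np

def pass_message(edge_potential, sender_obs, other_incoming):
    """Sum-product message from a sender node to a receiver node.

    edge_potential[a, b] is the potential for sender at a and receiver at b.
    other_incoming holds messages into the sender from nodes other than the receiver.
    """
    product = sender_obs.copy()
    for m in other_incoming:
        product = product * m
    message = edge_potential.T @ product  # sum over the sender's locations
    return message / message.sum()

def belief(obs, incoming):
    """Estimated marginal: observational potential times all incoming messages."""
    b = obs.copy()
    for m in incoming:
        b = b * m
    return b / b.sum()

grid = np.arange(5)
# Hypothetical adjacency potential: neighbors must sit exactly one grid step apart.
edge = (np.abs(grid[:, None] - grid[None, :]) == 1).astype(float)
obs_lys31 = np.array([0.05, 0.8, 0.05, 0.05, 0.05])
obs_leu32 = np.array([0.3, 0.1, 0.3, 0.2, 0.1])

m_lys31_to_leu32 = pass_message(edge, obs_lys31, [])
m_leu32_to_lys31 = pass_message(edge, obs_leu32, [])
belief_leu32 = belief(obs_leu32, [m_lys31_to_leu32])
belief_lys31 = belief(obs_lys31, [m_leu32_to_lys31])
```

On a two-node graph one exchange already yields the exact marginals; on ACMI's densely connected graph the same update is iterated and the result is only approximate.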
Scaling BP to Proteins (DiMaio and Shavlik, ICDM 2006): The naïve implementation is $O(N^2 G^2)$, where $N$ is the number of amino acids in the protein and $G$ is the number of points in the discretized density map. Each message passed takes $O(G^2)$ computation, which drops to $O(G \log G)$ when computed as a Fourier-space multiplication. $O(N^2)$ messages are computed and stored; approximating the $(N-3)$ occupancy messages with 1 message gives $O(N)$ messages using a message accumulator. The improved implementation is $O(N G \log G)$.
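The Fourier-space speedup can be sketched under one assumption: the edge potential depends only on the displacement between the two locations, so the message is a convolution. The example uses a 1D periodic grid and a made-up kernel; ACMI's grid is 3D and its potentials also involve orientation, which this sketch ignores.

```python
import numpy as np

def message_direct(kernel, belief):
    """O(G^2) circular convolution: out[j] = sum_i kernel[(j - i) mod G] * belief[i]."""
    G = len(belief)
    out = np.zeros(G)
    for j in range(G):
        for i in range(G):
            out[j] += kernel[(j - i) % G] * belief[i]
    return out

def message_fft(kernel, belief):
    """Same convolution in O(G log G) as a pointwise product in Fourier space."""
    return np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(belief)))

rng = np.random.default_rng(0)
G = 64
belief = rng.random(G)
belief /= belief.sum()
kernel = np.zeros(G)
kernel[[1, G - 1]] = 0.5  # hypothetical potential: neighbor is one step left or right

slow = message_direct(kernel, belief)
fast = message_fft(kernel, belief)
```

Both functions return the same message up to floating-point error; only the cost differs.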
Occupancy Message Approximation: To pass a message from $i$ to $j$, BP combines the occupancy edge potential with the product of incoming messages to $i$ except the one from $j$.
Occupancy Message Approximation: The "weak" potentials between nonadjacent amino acids let us approximate this by combining the occupancy edge potential with the product of all incoming messages to $i$.
Occupancy Message Approximation: Each node sends its outgoing occupancy message product to a central accumulator (ACC). Then each node's incoming message product is computed in constant time.
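One way to picture the accumulator is sketched below. It assumes each node sends a single outgoing occupancy message, stored as one row of a strictly positive array, and works in log space so that "dividing out" a node's own contribution becomes a subtraction. The sketch ignores the detail that adjacent neighbors exchange adjacency messages rather than occupancy messages.

```python
import numpy as np

def incoming_products(outgoing):
    """Incoming occupancy product for every node via a shared accumulator.

    outgoing: array of shape (N, G), strictly positive; row i is node i's single
    outgoing occupancy message over G grid points.
    """
    log_out = np.log(outgoing)
    accumulator = log_out.sum(axis=0)            # built once, O(N G)
    log_incoming = accumulator[None, :] - log_out  # one subtraction per node and grid point
    return np.exp(log_incoming)

rng = np.random.default_rng(1)
outgoing = rng.random((4, 6)) + 0.1  # hypothetical messages for 4 nodes on 6 grid points
incoming = incoming_products(outgoing)
```

Computing each node's product directly would multiply N - 1 messages per node, for O(N²) work overall; the accumulator replaces that with one pass to build it and one subtraction per node.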
BP Output: After some number of iterations, BP gives probability distributions over Cα locations. [Figure: marginal distributions for residues ALA, LEU, PRO, VAL, ARG, …]
ACMI's Backbone Trace: Independently choose the Cα locations that maximize each approximate marginal distribution.
Example: 1XRI (3.3 Å resolution density map, 39° mean phase error). Result: 93% complete; RMSd in Å (value missing). [Figure: trace colored by prob(AA at location), from HIGH to LOW]
Test-Set Density Maps (raw data) [Figure: density-map resolution (Å) versus density-map mean phase error (deg.)]
Experimental Accuracy [Figure: % Cα's located within 2 Å of some Cα / the correct Cα, that is, % backbone correctly placed and % amino acids correctly identified, for ACMI, ARP/wARP, Textal, and Resolve]
Experimental Accuracy on a Per-Protein Basis [Figure: ACMI % Cα's located versus ARP/wARP, Resolve, and Textal % Cα's located]
Problems with ACMI: Biologists want the location of all atoms. All Cα's lie on a discrete grid. The maximum-marginal backbone model may be physically unrealistic. A lot of information is ignored: multiple models may better represent conformational variation within the crystal. [Figure: three structures with probabilities 0.4, 0.35, and 0.25, and the maximum-marginal structure]
ACMI with Particle Filtering (ACMI-PF). Idea: Represent the protein using a set of static 3D all-atom protein models.
Particle Filtering Overview (Doucet et al. 2000): Given some Markov process $x_{1:K} \in X$ with observations $y_{1:K} \in Y$, particle filtering approximates some posterior probability distribution over $X$ using a set of $N$ weighted point estimates.
Particle Filtering Overview: The Markov process gives a recursive formulation. Use an importance function $q(x_k \mid x_{0:k-1}, y_k)$ to grow particles. Recursive weight update: $w_k \propto w_{k-1} \, \dfrac{p(y_k \mid x_k) \, p(x_k \mid x_{k-1})}{q(x_k \mid x_{0:k-1}, y_k)}$.
Particle Filtering for Protein Structures: A particle refers to one specific 3D layout of some subsequence of the protein. At each iteration, advance the particle's trajectory by placing an additional amino acid's atoms.
Particle Filtering for Protein Structures: Alternate extending the chain left and right. An iteration alternately places (1) the Cα position $b_{k+1}$ given $b_k$, and (2) all sidechain atoms $s_k$ given $b_{k-1:k+1}$.
Particle Filtering for Protein Structures. Key idea: Use the conditional distribution $p(b_k \mid b^i_{k-1}, \text{Map})$ to advance particle trajectories. Construct this conditional distribution from BP's marginal distributions.
Particle Filtering for Protein Structures: Algorithm
    place "seeds" b_k^i for each particle i = 1…N
    while amino acids remain
        place b_{k+1}^i / b_{j-1}^i given b_{j:k}^i, for each i = 1…N
        place s_k^i given b_{k-1:k+1}^i, for each i = 1…N
        optionally resample the N particles
    end while
Backbone Step (for particle $i$): place $b^i_{k+1}$ given $b^i_k$, for each $i = 1 \dots N$.
(1) Sample $L$ candidates $b^{1 \dots L}_{k+1}$ from the $b_{k-1}$–$b_k$–$b_{k+1}$ pseudoangle distribution.
(2) Weight each sample by its ACMI-computed approximate marginal, $\hat{p}_{k+1}(b^l_{k+1})$.
(3) Select $b_{k+1}$ with probability proportional to sample weight.
(4) Update the particle weight as the sum of the sample weights.
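The four steps of the backbone step can be sketched as follows. Several pieces are stand-ins: the pseudoangle distribution is assumed to be a normal distribution with hypothetical mean and spread, `marginal` is a hypothetical callable in place of ACMI's approximate marginal, and the particle weight is multiplied by the sum of sample weights, which is one reading of step (4).

```python
import numpy as np

CA_DISTANCE = 3.8  # angstroms between adjacent C-alphas

def sample_candidates(b_prev, b_k, L, rng, mean_deg=110.0, std_deg=20.0):
    """Step 1: draw L positions 3.8 A from b_k at sampled pseudoangles (assumed normal)."""
    axis = (b_prev - b_k) / np.linalg.norm(b_prev - b_k)
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(axis, helper)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(axis, e1)  # axis, e1, e2 form an orthonormal frame
    theta = np.radians(rng.normal(mean_deg, std_deg, size=L))  # b_prev-b_k-b_next angle
    phi = rng.uniform(0.0, 2.0 * np.pi, size=L)                # rotation about the axis
    directions = (np.cos(theta)[:, None] * axis
                  + np.sin(theta)[:, None]
                  * (np.cos(phi)[:, None] * e1 + np.sin(phi)[:, None] * e2))
    return b_k + CA_DISTANCE * directions

def backbone_step(b_prev, b_k, weight, marginal, L, rng):
    """Steps 2-4: weight candidates, pick one, and update the particle weight."""
    candidates = sample_candidates(b_prev, b_k, L, rng)
    sample_weights = np.array([marginal(c) for c in candidates])  # step 2
    total = sample_weights.sum()
    if total <= 0.0:
        return candidates[0], 0.0  # no candidate has support; particle weight drops to zero
    choice = rng.choice(L, p=sample_weights / total)              # step 3
    return candidates[choice], weight * total                     # step 4 (assumed product)

rng = np.random.default_rng(0)
target = np.array([3.8, 0.0, 0.0])
# Hypothetical marginal: a Gaussian bump around an invented target location.
marginal = lambda x: float(np.exp(-0.5 * np.sum((x - target) ** 2)))
b_next, w_next = backbone_step(np.array([-3.8, 0.0, 0.0]), np.zeros(3), 1.0, marginal, 50, rng)
```

Because candidates are drawn from the geometric prior and weighted by the map-derived marginal, a particle that wanders into poorly supported density accumulates a small weight and is likely to be discarded at resampling.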
Sidechain Step (for particle $i$): place $s^i_k$ given $b^i_{k-1:k+1}$, for each $i = 1 \dots N$.
(1) Sample $s_k$ from a database of sidechain conformations (Protein Data Bank).
(2) For each sidechain conformation, compute the probability of the density map given the sidechain, $p_k(\text{EDM} \mid s^l_k)$.
(3) Select a sidechain conformation from this weighted distribution.
(4) Update the particle weight as the sum of the sample weights.
Particle Resampling [Figure: weighted particles (wt = 0.1 to 0.4) before and after resampling]
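Resampling can be sketched with plain multinomial resampling, one common scheme; the slides do not say which variant ACMI-PF uses, so treat the choice as an assumption. The particle labels and weights below are illustrative.

```python
import numpy as np

def resample(particles, weights, rng):
    """Draw len(particles) particles with replacement, in proportion to their weights.

    Returns the resampled particles and uniform weights.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize in case the weights do not sum to 1
    n = len(particles)
    indices = rng.choice(n, size=n, p=w)
    return [particles[i] for i in indices], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
particles = ["A", "B", "C", "D", "E"]     # hypothetical stand-ins for partial structures
weights = [0.1, 0.4, 0.3, 0.1, 0.2]
new_particles, new_weights = resample(particles, weights, rng)
```

High-weight particles tend to be duplicated and low-weight ones dropped, which concentrates the fixed particle budget on the more probable partial structures.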
Amino-Acid Sampling Order: Begin at some amino acid $k$ with probability [formula missing]. At each step, move left to right with probability [formula missing; in terms of $j$ and $k$].
Experimental Methodology: Run ACMI-PF 10 times with 100 particles each. Return the highest-weight particle from each run. Each run samples amino acids in a different order. Refine each structure for 10 iterations in Refmac5. Compare the 10-structure model to others using $R_{free}$.
ACMI-PF Versus ACMI-Naïve [Figure: refined $R_{free}$ versus number of ACMI-PF runs]. Additionally, ACMI-PF's models have fewer gaps (10 vs. 28) and lower sidechain RMS error (2.1 Å vs. 2.3 Å).
ACMI-PF Versus Others [Figure: ACMI-PF $R_{free}$ versus ARP/wARP, Resolve, and Textal $R_{free}$]
ACMI-PF Example: 2A3Q (2.3 Å resolution, 66° phase error). Result: 1.79 Å RMSd, 92% complete.
ACMI Overview. Phase 1: Local pentapeptide search (ISMB 2006, BIBM 2007): independent amino-acid search; templates model the 5-mer conformational space. Phase 2: Coarse backbone model (ISMB 2006, ICDM 2006): protein structural constraints refine the local search; a Markov random field (MRF) models pairwise constraints. Phase 3: Sample all-atom models: particle filtering samples high-probability structures; probabilities from the MRF guide particle trajectories. Phase 4: Iterative phase improvement: use the particle-filtering models to improve density-map quality; rerun the entire pipeline on the improved density map; repeat until convergence.
Phase Problem: Intensities are measured by X-ray crystallography. Phases are experimentally estimated (e.g. MAD, MIR).
Density-Map Phasing [Figure: the same density map at 0°, 30°, 60°, and 75° mean phase error]
Iterative Phase Improvement [Figure: initial density map, predicted 3D model, revised density map]
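The core of one phase-improvement round can be sketched as combining measured amplitudes with phases computed from the predicted model. This is a deliberately simplified, unweighted combination on a synthetic periodic grid; the function names and the random "density" are hypothetical, and real refinement programs weight the terms rather than substituting phases outright.

```python
import numpy as np

def revise_map(observed_amplitudes, model_density):
    """Keep the measured amplitudes, take phases from the model, and invert to a new map."""
    model_structure_factors = np.fft.fftn(model_density)
    model_phases = np.angle(model_structure_factors)
    combined = observed_amplitudes * np.exp(1j * model_phases)
    return np.real(np.fft.ifftn(combined))

def mean_phase_error_deg(phases_a, phases_b):
    """Mean absolute phase difference in degrees, wrapped to [-180, 180]."""
    diff = np.angle(np.exp(1j * (phases_a - phases_b)))
    return float(np.degrees(np.mean(np.abs(diff))))

rng = np.random.default_rng(0)
true_density = rng.random((8, 8, 8))                      # synthetic stand-in for a map
true_structure_factors = np.fft.fftn(true_density)
observed_amplitudes = np.abs(true_structure_factors)      # what the experiment measures
true_phases = np.angle(true_structure_factors)

# With a perfect model the revised map reproduces the true density.
perfect_map = revise_map(observed_amplitudes, true_density)
# A noisy model yields phases with a nonzero mean phase error.
noisy_model = true_density + 0.5 * rng.random((8, 8, 8))
noisy_error = mean_phase_error_deg(np.angle(np.fft.fftn(noisy_model)), true_phases)
```

The better the predicted model, the closer its phases are to the true ones, so the revised map improves and the pipeline can be rerun on it.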
ACMI-PF's Phase Improvement [Figure: error in initial phases versus error in ACMI-PF's phases, both in degrees of mean phase error]
Two-Iteration ACMI [Figure: % backbone located in iteration 1 versus % backbone located in iteration 2]
Future Work: Many-Iteration ACMI [Figure: average % uninterpreted AAs and average mean phase error versus number of ACMI iterations]
Conclusions: ACMI's three steps construct a set of all-atom protein models from a density map. A novel message approximation allows inference on large, highly connected models. The resulting protein models are more accurate than those produced by other methods.
Ongoing and Future Work: Incorporate additional structural-biology background knowledge. Incorporate more complex potential functions. Continue work on iterative phase improvement. Generalize my algorithms to other 3D image data.
Acknowledgements. Advisor: Jude Shavlik. Committee: George Phillips, Charles Dyer, David Page, Mark Craven. Collaborators: Ameet Soni, Dmitry Kondrashov, Eduard Bitto, Craig Bingman, 6th floor MSCers, Center for Eukaryotic Structural Genomics. Funding: UW-Madison Graduate School, NLM 1T15 LM, NLM 1R01 LM008796.