Protein Function Analysis using Computational Mutagenesis Iosif Vaisman Laboratory for Structural Bioinformatics proteins.gmu.edu CASB workshop, 9/23/10.

1 Protein Function Analysis using Computational Mutagenesis Iosif Vaisman Laboratory for Structural Bioinformatics proteins.gmu.edu CASB workshop, 9/23/10 Department of Bioinformatics and Computational Biology

2 Delaunay simplices classification

3 Protein representation (Crambin)

4 Neighbor identification in proteins: Voronoi/Delaunay Tessellation in 2D
Panels: Voronoi Tessellation; Delaunay Tessellation
A Delaunay simplex is defined by points whose Voronoi polyhedra have a common vertex.
A Delaunay simplex is always a triangle in 2D space and a tetrahedron in 3D space.
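The 2D case can be illustrated with a brute-force sketch. It relies on the standard empty-circumcircle property of Delaunay triangles (three points form a Delaunay triangle when no other point lies inside their circumcircle), which is equivalent to the Voronoi-vertex definition on this slide. The function names and the four sample points are hypothetical, and the exhaustive search is only suitable for tiny inputs; it is not the tessellation code used in the presentation.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if the points are collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay_2d(points):
    """Brute-force 2D Delaunay triangulation: keep every triple whose circumcircle is empty."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        others = (p for n, p in enumerate(points) if n not in (i, j, k))
        if all((px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9 for px, py in others):
            triangles.append((i, j, k))
    return triangles


def neighbor_pairs(triangles):
    """Two points are neighbors when they share at least one Delaunay triangle."""
    return sorted({pair for t in triangles for pair in combinations(sorted(t), 2)})


# Hypothetical example: a triangle with one interior point (index 3).
sample_points = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0), (2.0, 1.0)]
sample_triangles = delaunay_2d(sample_points)
```

In this example the outer triangle is rejected because the interior point falls inside its circumcircle, so every resulting triangle uses the interior point as a vertex.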

5 Neighbor identification in proteins: Voronoi/Delaunay Tessellation in 2D
Panels: Voronoi Tessellation; Delaunay Tessellation

6 Delaunay tessellation of Crambin

7 Delaunay Tessellation of Protein Structure
Abstract each amino acid to a point: Cα or center of mass, with atomic coordinates taken from the Protein Data Bank (PDB).
Delaunay tessellation: 3D "tiling" of space into non-overlapping, irregular tetrahedral simplices. Each simplex objectively defines a quadruplet of nearest-neighbor amino acids at its vertices.
Figure labels: D3, K4, R5, L6, F7, A22, G62, C63, S64; D (Asp)

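8 Compositional propensities of Delaunay simplices
$q_{ijkl} = \log \dfrac{f_{ijkl}}{p_{ijkl}}$, where $f_{ijkl}$ is the observed quadruplet frequency, $p_{ijkl} = C\, a_i a_j a_k a_l$, and $a$ is the residue frequency.
$C = \dfrac{4!}{\prod_{i}^{n} (t_i!)}$, where $t_i$ counts the residues of the $i$-th distinct type among the $n$ distinct types in the quadruplet.
AAAA: C = 4! / 4! = 1
AAAV: C = 4! / (3! x 1!) = 4
AAVV: C = 4! / (2! x 2!) = 6
AAVR: C = 4! / (2! x 1! x 1!) = 12
AVRS: C = 4! / (1! x 1! x 1! x 1!) = 24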
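The formulas on this slide can be sketched in a few lines. The function names are hypothetical, the base-10 logarithm is an assumption (the slide only says "log"), and the frequencies in the usage lines are made-up numbers rather than values from any dataset.

```python
import math
from collections import Counter


def permutation_factor(quad):
    """C = 4! / prod(t_i!), with t_i the count of each distinct residue type in the quadruplet."""
    c = math.factorial(len(quad))
    for t in Counter(quad).values():
        c //= math.factorial(t)
    return c


def expected_frequency(quad, residue_freq):
    """p_ijkl = C * a_i * a_j * a_k * a_l."""
    p = float(permutation_factor(quad))
    for residue in quad:
        p *= residue_freq[residue]
    return p


def log_likelihood(quad, observed_freq, residue_freq):
    """q_ijkl = log(f_ijkl / p_ijkl); base 10 is an assumption, not stated on the slide."""
    return math.log10(observed_freq / expected_frequency(quad, residue_freq))


# Made-up inputs: uniform residue frequencies and an arbitrary observed frequency.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
q_example = log_likelihood("AVRS", 1.5e-3, uniform)
```

With uniform residue frequencies of 0.05, the expected frequency of AVRS is 24 x 0.05^4 = 1.5e-4, so an observed frequency ten times higher gives a score of 1 under the base-10 assumption.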

9 Counting Quadruplets assuming order independence among residues comprising Delaunay simplices, the maximum number of all possible combinations of quadruplets forming such simplices is
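The count described here can be checked directly: with 20 amino acid types and order ignored, the quadruplets are the multisets of size 4, of which there are C(20 + 4 - 1, 4). The sketch below computes that closed form, confirms it by enumeration, and breaks it down by the five composition patterns from the previous slide. Variable names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
n = len(AMINO_ACIDS)

# Closed form: number of multisets of size 4 drawn from n types.
closed_form = math.comb(n + 3, 4)

# Direct enumeration of unordered quadruplets, repeats allowed.
enumerated = sum(1 for _ in combinations_with_replacement(AMINO_ACIDS, 4))

# Breakdown by composition pattern (letters stand for distinct residue types).
by_pattern = {
    "AAAA": n,                        # one type used four times
    "AAAV": n * (n - 1),              # one type three times, another once
    "AAVV": math.comb(n, 2),          # two types, twice each
    "AAVR": n * math.comb(n - 1, 2),  # one type twice, two others once
    "AVRS": math.comb(n, 4),          # four distinct types
}
```

All three routes agree on 8855 distinct quadruplets.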

10 Log-likelihood of amino acid quadruplets with different compositions

11 Log-likelihood of amino acid quadruplets

12

13 Computational Mutagenesis Methodology
Observations:
o Relatively few mutant and wt structures of the same protein have been solved
o Tessellations of mutant and wt protein structures are very similar or identical
Approach:
o Obtain the topological score (TS_mut) and 3D-1D potential profile vector (Q_mut) for any mutant protein by using the wt structure tessellation as a template
o Simply change the residue label at the given point(s) and re-compute
Figure: mutation R5 → I5 among residues D3, K4, L6, F7, A22, G62, C63, S64. The wt simplices s(R,D,A,L), s(R,D,K,S), s(R,S,C,G), s(R,G,F,L) with (TS_wt, Q_wt) become s(I,D,A,L), s(I,D,K,S), s(I,S,C,G), s(I,G,F,L) with (TS_mut, Q_mut).

14 Computational Mutagenesis Methodology
Scalar "Residual Score" of a mutant: TS_mut – TS_wt = (mutant – wt) topological score difference (empirical measure of relative structural change due to mutation)
Vector "Residual Profile" of a mutant: R = Q_mut – Q_wt = (mutant – wt) 3D-1D potential profile difference (environmental perturbation score at every position in the structure)
Denote the components of R by EC_i = q_i,mut – q_i,wt = relative Environmental Change at position i
Geometric property: if the mutant is due to a single substitution at position j, then
o EC_j ≡ mutant residual score ("epicenter" of impact)
o The only other nonzero EC components correspond to neighboring positions that participate in simplices with j
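The label-swap procedure and the geometric property can be sketched as follows. The positions and the four simplices around R5 follow the hypothetical close-up in the slides; position 8 (V) and its simplex are an extra assumption added only to show a position that does not neighbor the mutation. The per-residue numbers and the additive `toy_score` are made up for illustration and are not the four-body potential used in the presentation.

```python
import math

# Hypothetical wild-type neighborhood: position -> residue (position 8 is an added assumption).
WT = {3: "D", 4: "K", 5: "R", 6: "L", 7: "F", 8: "V", 22: "A", 62: "G", 63: "C", 64: "S"}

# Simplices as quadruplets of positions; the last one does not involve position 5.
SIMPLICES = [(5, 3, 22, 6), (5, 3, 4, 64), (5, 64, 63, 62), (5, 62, 7, 6), (7, 8, 62, 63)]

# Made-up per-residue values; a real potential would score whole quadruplets.
TOY = {"A": 0.1, "C": 0.5, "D": -0.3, "F": 0.4, "G": 0.0, "I": 0.6,
       "K": -0.4, "L": 0.5, "R": -0.2, "S": -0.1, "V": 0.3}


def toy_score(quad):
    """Stand-in for s(i,j,k,l); purely illustrative."""
    return sum(TOY[r] for r in quad)


def topological_score(seq, simplices, score=toy_score):
    """TS: sum of scores over all simplices of the (wild-type) tessellation."""
    return sum(score(tuple(seq[p] for p in s)) for s in simplices)


def potential_profile(seq, simplices, score=toy_score):
    """Q: for each position, sum of scores of the simplices that use it as a vertex."""
    q = {p: 0.0 for p in seq}
    for s in simplices:
        value = score(tuple(seq[p] for p in s))
        for p in s:
            q[p] += value
    return q


def mutate(seq, position, new_residue):
    """Change only the residue label; the tessellation is reused as a template."""
    mutant = dict(seq)
    mutant[position] = new_residue
    return mutant


mutant = mutate(WT, 5, "I")  # R5 -> I5
residual_score = topological_score(mutant, SIMPLICES) - topological_score(WT, SIMPLICES)
q_wt, q_mut = potential_profile(WT, SIMPLICES), potential_profile(mutant, SIMPLICES)
ec = {p: q_mut[p] - q_wt[p] for p in WT}  # residual profile components EC_i
```

Here `ec[5]` equals the residual score because the only simplices whose scores change are those containing position 5, and `ec[8]` is zero because position 8 shares no simplex with it.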

15 Approach 1: Protein Topological Score (TS)
o Obtained by summing the log-likelihood scores of all simplicial quadruplets defined by the protein tessellation
o Global measure of protein sequence-structure compatibility
o Total (empirical or statistical) potential of the protein
TS = ∑_î s(î), with the sum taken over all simplex quadruplets î in the entire tessellation.
Figure: close-up view of only the four (hypothetical) simplices that use R at position 5 as a vertex: s(R,D,A,L), s(R,D,K,S), s(R,S,C,G), s(R,G,F,L), among residues D3, K4, R5, L6, F7, A22, G62, C63, S64.

16 Approach 2: Residue Environment Scores
For each amino acid position, locally sum the log-likelihood scores s(i,j,k,l) of only those simplex quadruplets that include it as a vertex.
Example: q_5 = q(R5) = ∑_(i,j,k,l) s(i,j,k,l), with the sum taken over all simplex quadruplets (i,j,k,l) that include amino acid R5; in the figure these are s(R,D,A,L), s(R,D,K,S), s(R,S,C,G), s(R,G,F,L).
The scores of all amino acid positions in the protein structure form a 3D-1D Potential Profile vector Q = (q_1, q_2, ..., q_N), where N = length of the primary sequence in the solved structure.
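The global and local sums can be compared on the four simplices around R5. In the sketch below the simplices are keyed by position, and the four score values are made-up placeholders rather than real log-likelihoods. Because every simplex contributes its score to each of its four vertices, the profile entries add up to four times the topological score.

```python
import math

# Simplices around R5, keyed by vertex positions; the score values are made up.
SIMPLEX_SCORES = {
    (5, 3, 22, 6): 0.31,    # s(R,D,A,L)
    (5, 3, 4, 64): -0.12,   # s(R,D,K,S)
    (5, 64, 63, 62): 0.05,  # s(R,S,C,G)
    (5, 62, 7, 6): 0.27,    # s(R,G,F,L)
}


def topological_score(simplex_scores):
    """TS: sum over every simplex in the tessellation."""
    return sum(simplex_scores.values())


def potential_profile(simplex_scores):
    """Q: per-position sum over the simplices that include that position."""
    q = {}
    for simplex, score in simplex_scores.items():
        for position in simplex:
            q[position] = q.get(position, 0.0) + score
    return q


ts = topological_score(SIMPLEX_SCORES)
profile = potential_profile(SIMPLEX_SCORES)
q5 = profile[5]  # R5 is a vertex of all four simplices in this close-up
```

In this restricted close-up q_5 coincides with TS only because all four listed simplices contain R5; in a full tessellation TS would also include simplices elsewhere in the structure.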

17 Reversibility Analysis
Diagram labels:
o S1,E1: 'reference' PDB
o S2,E2: mutant PDB
o S1,E2: calculated mutant
o S2,E1: calculated 'reference'
Arrows: Forward Mutation; Reverse Mutation

18 Reversibility of mutations (T4 lysozyme)
Columns: Protein, Mutation, Score change
1l63 T26E
l E26T
l63 A82S
l S82A
l63 V87M
cu3 M87V
l63 A93C
l C93A
l63 T152S
goj S152T 1.12

19 Reversibility Analysis

20 Functional Effects of Amino Acid Substitutions
Change in protein stability:
o Effect on melting temperature: ΔTm = Tm (mutant) – Tm (wt)
o Effect on thermal denaturation: ΔΔG = ΔG (mutant) – ΔG (wt)
o Effect on denaturant denaturation: ΔΔG_H2O = ΔG_H2O (mutant) – ΔG_H2O (wt)
Change in protein activity:
o Mutant enzymatic activity relative to wt
o Mutant strength of DNA binding relative to wt
Disease potential of human coding nsSNPs:
o Neutral polymorphism or disease-associated mutation?
For protein targets of inhibitor drugs:
o Continued susceptibility or (degree of) resistance that patients with the mutant protein have to the inhibitor
o Inhibitor binding energy to mutant target relative to wt

21 Examples of Experimental Mutagenesis Data

22 Example: HIV-1 Protease (PR)

23 HIV-1 PR Dataset Example: Residual Profiles of 536 Experimental Mutants ……

24 Experimental Mutants: Residual Scores Elucidate the Structure-Function Relationship
Panels: 536 HIV-1 protease mutants; 4041 lac repressor mutants; 630 hIL-3 mutants; 371 gene V protein mutants

25 Universal Model Approach: 8635 Experimental Mutants from 7 Proteins

26 Universal Model Approach: 980 Experimental Mutants from 20 Proteins

27 Structure-Function Correlation Based on Residual Scores: nsSNPs
o 1790 nsSNPs corresponding to single amino acid substitutions in several hundred proteins with tessellatable structures
o Function: 1332 nsSNPs associated with disease; 458 neutral
o Data obtained from Swiss-Prot and HPI

28 Structure-Function Correlation Based on Residual Scores: Drug Susceptibility

29 Algorithm Performance: 2015 T4 Lysozyme Mutants

30 Learning Curves for HIV-1 protease and T4 lysozyme mutants

31 Real-World Application: T4 Lysozyme Predictions
Experimental data (not part of training set) obtained from the ProTherm database
Result: predictions match experiments for 30/35 (~86%) of the mutants

32 T4 Lysozyme Mutational Array
Training set mutants (n = 2015): Active / Inactive
Predicted test set mutants (n = 1101): Active / Inactive

33 GVP Mutational Array

34 Support Vector Regression Capriotti et al. SVM regression (for comparison): r = 0.71, Standard Error = 1.3 kcal/mol, y = x –

35 Conclusions
o Computational mutagenesis derived from a four-body, knowledge-based statistical potential uniquely characterizes each protein mutant using both sequential and structural features
o Attributes correlate well with mutant function, which makes them valuable for developing accurate machine-learning-based predictive models

36 Acknowledgements
Structural Bioinformatics Laboratory (GMU):
o Tariq Alsheddi (structure alignment)
o David Bostick (topological similarity)
o Andrew Carr (functional sites, visualization)
o Sunita Kumari (structural genomics)
o Yong Luo (evolutionary structure analysis)
o Majid Masso (mutagenesis, HIV-1 protease, LAC repressor, T4 lysozyme, SNP)
o Ewy Mathe (mutagenesis, p53)
o Olivia Peters (protein-protein interfaces)
o Vadim Ravich (HIV RT mutagenesis)
o Greg Reck (hydration potentials, amyloids)
o Todd Taylor (statistical potentials, secondary structure, topology, protein stability)
o Bill Zhang (mutagenesis, BRCA1)
Collaborators: John Grefenstette (GMU), Curt Jamison (GMU), Dmitri Klimov (GMU), Dan Carr (GMU), Estela Blaisten (GMU), Vladimir Karginov (IB)
Unpublished data: Clyde Hutchison (UNC), Ron Swanstrom (UNC)
Funding: NSF, NIH-Innovative Biologics, GMU-INOVA Research Fund

37

38

39

40 Evaluating Algorithm Performance
Overall goal: develop a model from known examples that accurately predicts the class (or value) of instances that have not yet been assayed experimentally (potentially great savings of time and money)
Ideal situation: split a large original dataset into 3 subsets
o Training set (learn model)
o Validation set (optimize model by tweaking model parameters)
o Test set (evaluate model on new data not used to develop model)
o Errors measured at each step (resubstitution, validation, generalization)
Approaches: tenfold cross-validation (10-fold CV); leave-one-out CV (i.e., jackknife or N-fold CV, N = dataset size); % split (e.g., use only 2/3 for training, 1/3 held out for testing)

41 Evaluating Algorithm Performance
10-fold CV
o Randomly split the dataset instances into 10 equally sized subsets
o Hold out subset 1; combine subsets 2-10 into one training set for learning a model; use the trained model to predict classes of instances in subset 1
o Repeat the previous step 9 more times (e.g., hold out subset 2, combine subsets 1 and 3-10 to train a model, use the model to predict subset 2, etc.)
o We end up with 10 models, each trained using 90% of the original dataset, and each used to predict the held-out 10% subset
o In the end, each instance has one class prediction, which is compared to its actual class
LOOCV (leave-one-out CV, jackknife, or N-fold CV)
o Similar to above, but each subset contains only 1 instance
o Deterministic: no randomness in which instances are grouped as subsets
o Overall prediction accuracy provides a rough idea of how a model trained with the full dataset will perform
% split (self-explanatory)
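The hold-out loop described above can be sketched with the standard library alone. The function names, the fixed random seed, and the majority-class "model" in the usage lines are hypothetical placeholders; any learner with separate fit and predict steps could be plugged in. Setting k equal to the dataset size gives LOOCV.

```python
import random
from collections import Counter


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds after a seeded shuffle."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def cross_val_predictions(xs, ys, fit, predict, k, seed=0):
    """Return one prediction per instance, each made by a model that never saw it."""
    predictions = [None] * len(xs)
    for train, test in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            predictions[i] = predict(model, xs[i])
    return predictions


# Placeholder learner: always predicts the majority class of its training set.
def fit_majority(train_xs, train_ys):
    return Counter(train_ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


toy_xs = list(range(25))
toy_ys = ["Pos"] * 18 + ["Neg"] * 7
cv_predictions = cross_val_predictions(toy_xs, toy_ys, fit_majority, predict_majority, k=10)
```

The resulting list of per-instance predictions is what gets compared with the actual classes to fill a confusion matrix.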

42 Evaluating Algorithm Performance
Assume instances belong to two generic classes (Pos/Neg).
Results of comparing predictions with actual classes based on the approaches described (10-fold CV, LOOCV, % split) can be summarized in a confusion matrix:
                 Predicted as
                 Pos    Neg
Actual    Pos    TP     FN
class     Neg    FP     TN
Classification performance measures:
o accuracy = (TP+TN) / (TP+FP+TN+FN)
o sensitivity = TP / (TP+FN)
o specificity = TN / (TN+FP)
o precision = TP / (TP+FP)
o BER = 0.5 × [FP / (FP+TN) + FN / (FN+TP)]
o MCC = (TP×TN – FP×FN) / √[(TP+FN)(TP+FP)(TN+FN)(TN+FP)]
o AUC = area under ROC curve (plot of sensitivity vs. 1 – specificity)
For regression models: correlation coefficient, standard error
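The measures listed on this slide translate directly into code. The sketch below is a minimal version that assumes every denominator is nonzero; the function name and the counts in the usage line are made up for illustration.

```python
import math


def classification_measures(tp, fn, fp, tn):
    """Compute the slide's measures from confusion-matrix counts (nonzero denominators assumed)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # Balanced error rate: mean of the false positive and false negative rates.
        "ber": 0.5 * (fp / (fp + tn) + fn / (fn + tp)),
        # Matthews correlation coefficient.
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp)),
    }


# Made-up counts: 50 actual positives and 50 actual negatives.
measures = classification_measures(tp=40, fn=10, fp=5, tn=45)
```

For these counts the accuracy is 0.85, sensitivity 0.8, specificity 0.9, and BER 0.15.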

43 ROC Curve
o Plot of true positive rate (sensitivity) versus false positive rate (1 – specificity) in the unit square
o AUC = probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
o AUC ~ 0.5 (ROC close to the diagonal line joining points (0,0) and (1,1)) suggests no signal in the dataset and that the trained model is not likely to perform any better than random guessing
o AUC = 1 (piecewise linear ROC joining (0,0) to (0,1) and (0,1) to (1,1)) indicates a perfect classifier
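The ranking interpretation of AUC can be computed directly by comparing every positive instance with every negative one. In the sketch below the function name and the classifier scores are hypothetical, and counting a tie as half a win is a common convention adopted here as an assumption.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive is scored higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores for three positive and three negative instances.
example_auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Eight of the nine pairs are ranked correctly in this example, so the value is 8/9; fully separated scores give 1, and identical scores for every instance give 0.5.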

