

Conformational Space

- Conformation of a molecule: specification of the relative positions of all atoms in 3D space
- Typical parameterizations:
  - List of coordinates of atom centers
  - List of torsional angles (e.g., dihedral angles such as φ and ψ for a protein)
- Conformational space: the space of all conformations

Conformational Space
[Figure: conformational space with coordinates $q_1, q_2, \dots, q_i, q_j, \dots, q_{N-1}, q_N$]

Conformational Space
[Figure: parameters $q_0, q_1, q_3, q_4, \dots, q_n$]

Relation to Robotics/Graphics
[Figure: parameters $q_0, q_1, q_2, q_3, q_4, \dots, q_n$ and a curve parameterized by $t$ in configuration space]

Need for a Metric
- Simulation and sampling techniques can produce millions of conformations
- Which conformations are similar?
- Which ones are close to the folded one?
- Do some conformations form small clusters (e.g., key intermediates while folding)?

Metric in Conformational Space
- A metric over conformational space $C$ is a function $d : C \times C \to \mathbb{R}^{+} \cup \{0\}$, $(c, c') \mapsto d(c, c')$, such that:
  - $d(c, c') = 0 \iff c = c'$ (non-degeneracy)
  - $d(c, c') = d(c', c)$ (symmetry)
  - $d(c, c') + d(c', c'') \ge d(c, c'')$ (triangle inequality)

But not all metrics are “good”
- Euclidean metric: $d(c, c') = \left[ \sum_{i=1}^{n} \left( |\phi_i - \phi_i'|^2 + |\psi_i - \psi_i'|^2 \right) \right]^{1/2}$

Metric in Conformational Space
- A “good” metric should measure how well the atoms in two conformations can be aligned
- Usual metrics: cRMSD, dRMSD

RMSD
- Given two sets of $n$ points in $\mathbb{R}^3$: $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_n\}$
- The RMSD between $A$ and $B$ is:
  $\mathrm{RMSD}(A, B) = \left[ \frac{1}{n} \sum_{i=1}^{n} \|a_i - b_i\|^2 \right]^{1/2}$
  where $\|a_i - b_i\|$ denotes the Euclidean distance between $a_i$ and $b_i$ in $\mathbb{R}^3$
- $\mathrm{RMSD}(A, B) = 0$ iff $a_i = b_i$ for all $i$
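The definition above can be sketched in a few lines. This is a minimal illustration, assuming points are given as (x, y, z) tuples; the function name `rmsd` and the sample points are made up for the example.

```python
import math

def rmsd(A, B):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    if len(A) != len(B) or not A:
        raise ValueError("A and B must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(A, B):
        # Squared Euclidean distance between corresponding points a_i and b_i
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / len(A))

# Made-up example: only the second point differs, by 2 units along y
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rmsd(A, B))  # sqrt((0 + 4) / 2) = sqrt(2), about 1.414
```

Note that no alignment is performed here: the two point sets are compared exactly where they are.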

cRMSD
- Molecule $M$ with $n$ atoms $a_1, \dots, a_n$
- Two conformations $c$ and $c'$ of $M$
- $a_i(c)$ is the position of $a_i$ when $M$ is at $c$
- $\mathrm{cRMSD}(c, c')$ is the minimized RMSD between the two sets of atom centers:
  $\mathrm{cRMSD}(c, c') = \min_T \left[ \frac{1}{n} \sum_{i=1}^{n} \|a_i(c) - T(a_i(c'))\|^2 \right]^{1/2}$
  where the minimization is over all possible rigid-body transforms $T$
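The slides do not say how the minimization over $T$ is carried out. One common way is the Kabsch method (centroid alignment followed by an SVD of a 3x3 covariance matrix), sketched below with NumPy. The function name `crmsd` and the test data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def crmsd(P, Q):
    """RMSD between two (n, 3) arrays after optimal rigid superposition of Q onto P."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Optimal translation: move both centroids to the origin
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix
    H = Q0.T @ P0
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1) so that reflections are excluded
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P0 - Q0 @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Made-up check: Q is a rotated and translated copy of P
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
t = 0.7
Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t),  np.cos(t), 0.0],
               [0.0,        0.0,       1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 3.0])
print(crmsd(P, Q))  # close to 0, up to floating-point error
```

The centroids and the covariance matrix are accumulated in one pass over the n atoms, and the SVD is always on a 3x3 matrix, so the cost grows linearly with n.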

cRMSD
- cRMSD satisfies the triangle inequality
- cRMSD takes linear time to compute
- Often, cRMSD is restricted to a subset of atoms, e.g., the Cα atoms on a protein’s backbone

Representation Restricted to Cα Atoms
[Figure: protein 1tph]
- The positions of the amino-acid residue centers (Cα atoms) mainly determine the structure of a protein.
- In structural comparison, one usually works only with the backbone of Cα atoms and neglects the other atoms.

Possible project: Design a method for efficiently finding nearest neighbors in a sampled conformation space of a protein, using the cRMSD metric.

dRMSD
- Molecule $M$ with $n$ atoms $a_1, \dots, a_n$
- Two conformations $c$ and $c'$ of $M$
- $\{d_{ij}(c)\}$: $n \times n$ symmetric intra-molecular distance matrix of $M$ at $c$
- $\mathrm{dRMSD}(c, c') = \left[ \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( d_{ij}(c) - d_{ij}(c') \right)^2 \right]^{1/2}$
- $\{d_{ij}\}$ is usually restricted to a subset of atoms, e.g., the Cα atoms on a protein’s backbone

Intra-Molecular Distance Matrix
[Figure: distances between Cα pairs of a protein with 142 residues. Darker squares represent shorter distances.]



dRMSD
- Molecule $M$ with $n$ atoms $a_1, \dots, a_n$
- Two conformations $c$ and $c'$ of $M$
- $\{d_{ij}(c)\}$: $n \times n$ symmetric intra-molecular distance matrix of $M$ at $c$
- $\mathrm{dRMSD}(c, c') = \left[ \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( d_{ij}(c) - d_{ij}(c') \right)^2 \right]^{1/2}$
- $\{d_{ij}\}$ is usually restricted to a subset of atoms, e.g., the Cα atoms on a protein’s backbone
- Advantage: no aligning transform
- Drawback: takes quadratic time to compute
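A direct transcription of the dRMSD formula, as a minimal sketch. It assumes conformations are lists of (x, y, z) tuples; the names `distance_matrix` and `drmsd` and the sample points are illustrative only.

```python
import math

def distance_matrix(points):
    """n x n symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def drmsd(c1, c2):
    """dRMSD between two conformations with the same number of atoms."""
    n = len(c1)
    if n != len(c2) or n < 2:
        raise ValueError("conformations must have the same size, at least 2 atoms")
    d1, d2 = distance_matrix(c1), distance_matrix(c2)
    total = 0.0
    # Sum over the n(n-1)/2 pairs i < j: this double loop is the quadratic cost
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += (d1[i][j] - d2[i][j]) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))

# Made-up two-atom example: the single distance changes from 1 to 3
c1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
c2 = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(drmsd(c1, c2))  # 2.0
```

Because only internal distances enter the formula, translating or rotating either conformation leaves the result unchanged, which is why no aligning transform is needed.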

Is dRMSD a metric?
- $\mathrm{dRMSD}(c, c') = \left[ \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( d_{ij}(c) - d_{ij}(c') \right)^2 \right]^{1/2}$ is a metric in the $n(n-1)/2$-dimensional space where a conformation $c$ is represented by $\{d_{ij}(c)\}$
- But in this representation, the same point represents both a conformation and its mirror image
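The mirror-image caveat can be checked numerically. In this minimal sketch, `drmsd` and `signed_volume` are illustrative helper names and the four points are made up; the signed volume is used only to show that the two point sets have opposite handedness.

```python
import math
from itertools import combinations

def drmsd(c1, c2):
    """dRMSD computed directly from pairwise distances (mean over all pairs i < j)."""
    pairs = list(combinations(range(len(c1)), 2))
    total = sum((math.dist(c1[i], c1[j]) - math.dist(c2[i], c2[j])) ** 2
                for i, j in pairs)
    return math.sqrt(total / len(pairs))

def signed_volume(p0, p1, p2, p3):
    """Determinant of the edge vectors from p0; its sign flips under reflection."""
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    c = [p3[k] - p0[k] for k in range(3)]
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

conf = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirrored = [(-x, y, z) for x, y, z in conf]   # reflection through the plane x = 0

print(drmsd(conf, mirrored))                              # 0.0
print(signed_volume(*conf), signed_volume(*mirrored))     # 6.0 -6.0
```

The two point sets cannot be superimposed by a rigid-body transform, yet their dRMSD is zero, so non-degeneracy fails if conformations are taken as 3D structures rather than as distance vectors.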

k-Nearest-Neighbors Problem
Given a set $S$ of conformations of a protein and a query conformation $c$, find the $k$ conformations in $S$ most similar to $c$ (w.r.t. cRMSD, dRMSD, or another metric).
Can be done in time $O(N(\log k + L))$, where:
- $N$ = size of $S$
- $L$ = time to compare two conformations
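One way to reach the $O(N(\log k + L))$ bound is a single scan of $S$ that keeps the best $k$ candidates in a bounded heap. The sketch below assumes this approach; the function name `k_nearest`, the `dist` callback, and the one-dimensional sample data are illustrative.

```python
import heapq

def k_nearest(S, query, k, dist):
    """Return the k items of S closest to query, nearest first."""
    heap = []  # max-heap of at most k entries, stored as (-distance, index)
    for idx, c in enumerate(S):
        d = dist(query, c)                       # one comparison: cost L
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))      # O(log k)
        elif d < -heap[0][0]:
            # Closer than the current k-th best: replace the worst candidate
            heapq.heapreplace(heap, (-d, idx))   # O(log k)
    return [S[idx] for _, idx in sorted(heap, reverse=True)]

# Made-up example with numbers standing in for conformations
S = [5.0, 1.0, 9.0, 3.0, 7.0]
print(k_nearest(S, 4.5, 2, lambda a, b: abs(a - b)))  # [5.0, 3.0]
```

Any of the measures from the earlier slides (cRMSD, dRMSD) could be passed as `dist`; only the per-comparison cost $L$ changes.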

k-Nearest-Neighbors Problem
The total time needed to compute the $k$ nearest neighbors of every conformation in $S$ is $O(N^2(\log k + L))$.
Much too long for large datasets, where $N$ ranges from tens of thousands to millions!
Can be improved by:
1. Reducing $L$
2. Using a more efficient algorithm (e.g., kd-tree)

kd-Tree
In a d-dimensional space, where d > 2, range searching for a point takes $O(d\, n^{1-1/d})$ time, where $n$ here is the number of points stored in the tree.
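For readers unfamiliar with the data structure, here is a minimal kd-tree sketch. It shows a nearest-neighbor query rather than the range search quoted above, and it assumes points are plain coordinate tuples compared with the Euclidean distance; `build_kdtree` and `nearest` are illustrative names, not a library API.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])              # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                     # median point splits the set
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best match
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Made-up data: 200 random points in 3D
rng = random.Random(1)
pts = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
tree = build_kdtree(pts)
dist_to_nn, nn = nearest(tree, (0.5, 0.5, 0.5))
```

A kd-tree needs vectors with a coordinate-wise structure, so it applies to fixed-length vector representations of conformations rather than to cRMSD, which requires an alignment per comparison.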

k-Nearest-Neighbors Problem
Idea: simplify the protein’s description

- cRMSD → $O(n)$ time
- dRMSD → $O(n^2)$ time
Assume that each conformation is described by the coordinates of the $n$ Cα atoms

This representation is highly redundant
- Proximity along the chain entails spatial proximity
- Atoms can’t bunch up, hence atoms far apart along the chain are, on average, spatially distant
[Figure: two chain positions $c_i$ and $c_j$]

m-Averaged Approximation
- Cut the backbone into fragments of m Cα atoms
- Replace each fragment by the centroid of the m Cα atoms
- Simplified cRMSD and dRMSD
[Figure: 3n coordinates → 3n/m coordinates]
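The two steps above can be sketched as follows. This is a minimal illustration assuming the backbone is a list of (x, y, z) tuples in chain order; the name `m_average` is made up, and keeping a shorter final fragment when m does not divide n is an assumption, since the slides do not say how the remainder is handled.

```python
def m_average(coords, m):
    """Replace each run of m consecutive points by its centroid."""
    averaged = []
    for start in range(0, len(coords), m):
        fragment = coords[start:start + m]   # assumption: the last fragment may be shorter than m
        centroid = tuple(sum(p[k] for p in fragment) / len(fragment) for k in range(3))
        averaged.append(centroid)
    return averaged

# Made-up straight "backbone" of 8 atoms along the x axis
backbone = [(float(i), 0.0, 0.0) for i in range(8)]
print(m_average(backbone, 4))  # [(1.5, 0.0, 0.0), (5.5, 0.0, 0.0)]
```

The averaged conformation can then be passed to any cRMSD or dRMSD routine in place of the full backbone, with n replaced by roughly n/m.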

Evaluation: Test Sets [Lotan and Schwarzer, 2003]
- 8 diverse proteins
- Decoy sets of N = 10,000 conformations from the Park-Levitt set [Park et al, 1997]
- Correlation: [Table: columns m, cRMSD, dRMSD]
- Higher correlation for random sets (→ greater savings)

Running Times

Further Reduction for dRMSD
1) Stack m-averaged distance matrices as vectors of a matrix $A$

[Figure: the $r \times N$ matrix $A$; column $a_i$ is the vector of elements of the distance matrix of the $i$-th conformation ($i = 1$ to $N$)]

Further Reduction for dRMSD
1) Stack m-averaged distance matrices as vectors of a matrix $A$
2) Compute the SVD $A = U D V^T$

SVD Decomposition
$A = U D V^T$, with $A$ of size $r \times N$, $U$ of size $r \times r$, $D$ of size $r \times r$, and $V^T$ of size $r \times N$
- Column $a_j$ of $A$: vector of elements of the distance matrix of the $j$-th conformation ($j = 1$ to $N$)
- $U$: orthonormal (rotation) matrix
- $D$: diagonal matrix

SVD Decomposition
- $D = \mathrm{diag}(s_1, s_2, \dots, s_r)$ with $s_1 \ge s_2 \ge \dots \ge s_r \ge 0$ (singular values)

SVD Decomposition
- $V^T$: matrix with orthonormal rows $v_1^T, \dots, v_r^T$
- $v_j$ and $v_k$ ($j \ne k$) are orthogonal unit $N \times 1$ vectors

SVD Decomposition
[Figure: r-dimensional space with coordinate systems (x, y) and (X, Y)]
- The representation of $A$ in the space (X, Y) does not depend on the coordinate system!

SVD Decomposition
- The rows of $D V^T$ are $s_1 v_1^T, s_2 v_2^T, \dots, s_r v_r^T$, with $\|s_1 v_1\| \ge \|s_2 v_2\| \ge \dots$

SVD Decomposition
- $p$ principal components: keep the first $p$ rows $s_1 v_1^T, \dots, s_p v_p^T$ and set the remaining singular values $s_{p+1}, \dots, s_r$ to 0

Further Reduction for dRMSD
1) Stack m-averaged distance matrices as vectors of a matrix $A$
2) Compute the SVD $A = U D V^T$
3) Project onto $p$ principal components
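The three steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: `distance_vector` and `pca_reduce` are made-up names, the conformations are random stand-ins for m-averaged centroids, and, following the slides, no mean-centering is applied before the SVD.

```python
import numpy as np

def distance_vector(coords):
    """Upper-triangle entries of the intra-molecular distance matrix, as a vector of length r."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(coords), k=1)]

def pca_reduce(A, p):
    """Project the columns of the r x N matrix A onto the first p left singular vectors."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    # U[:, :p].T @ A equals the first p rows of D V^T, i.e. the rows s_i v_i^T
    return U[:, :p].T @ A

# Step 1: stack distance vectors as columns of A (50 made-up conformations of 6 points)
rng = np.random.default_rng(0)
confs = rng.normal(size=(50, 6, 3))
A = np.stack([distance_vector(c) for c in confs], axis=1)   # shape (15, 50)

# Steps 2 and 3: SVD, then keep p = 5 principal components
Y = pca_reduce(A, p=5)                                       # shape (5, 50)

r = A.shape[0]
exact = np.linalg.norm(A[:, 0] - A[:, 1]) / np.sqrt(r)      # dRMSD between conformations 0 and 1
approx = np.linalg.norm(Y[:, 0] - Y[:, 1]) / np.sqrt(r)     # p-term approximation
```

Since $U$ is orthonormal, distances between columns of $A$ equal distances between columns of $D V^T$, so truncating to $p$ rows can only shrink a distance. The approximate value is therefore never larger than the exact dRMSD.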

Correlation between dRMSD and its PC-reduced approximation
- Computing the approximation is reduced to summing up 12 to 20 terms (instead of ~80 to 200, since the proteins have 54 to 76 amino acids)

Complexity of SVD
- The SVD of an $r \times N$ matrix, where $N > r$, takes $O(r^2 N)$ time
- Here $r \sim (n/m)^2$
- So the time complexity is $O((n/m)^4 N)$, i.e. $O(n^4 N)$ for fixed $m$
- Would be too costly without m-averaging
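A quick worked check of these orders of magnitude. The sketch assumes $r$ is counted as the number of distinct centroid pairs, $k(k-1)/2$ with $k = \lceil n/m \rceil$, which matches $r \sim (n/m)^2$ up to a constant factor; the helper name `distance_vector_length` is made up.

```python
def distance_vector_length(n, m):
    """Number r of distinct pairs among the ceil(n/m) fragment centroids."""
    k = -(-n // m)            # ceil(n / m), assuming a shorter last fragment is kept
    return k * (k - 1) // 2

for n in (54, 76):
    r_full = distance_vector_length(n, 1)   # no averaging
    r_avg = distance_vector_length(n, 4)    # 4-averaging
    # The SVD cost scales as r^2 N, so the saving is the squared ratio of the two r values
    print(n, r_full, r_avg, round((r_full / r_avg) ** 2))
# 54 1431 91 247
# 76 2850 171 278
```

Under these assumptions, 4-averaging shrinks the $r^2$ factor of the SVD cost by a factor of roughly 250 for proteins of this size.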

Evaluation for 1CTF Decoy Sets [Lotan and Schwarzer, 2003]
- N = 100,000, k = 100, 4-averaging, 16 PCs
- 70% correct, with furthest NN off by 20%
- Brute-force: 84 h
- Brute-force + m-averaging: 4.8 h
- Brute-force + m-averaging + PC: 41 min
- kd-tree + m-averaging + PC: 19 min
- Speedup greater than ×200
- The 6k approximate NNs contain all k true NNs
- Use m-averaging and PC reduction as fast filters
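The "fast filter" idea can be sketched as a two-stage search: shortlist candidates with the cheap approximate distance, then rank only the shortlist with the exact one. The function name `filter_and_refine`, its parameters, and the toy data are assumptions; the default `factor=6` simply mirrors the 6k figure reported above for this data set and is not a general guarantee.

```python
import heapq
import math

def filter_and_refine(S, query, k, cheap_dist, exact_dist, factor=6):
    """Approximate k-NN: shortlist factor*k candidates cheaply, then rank them exactly."""
    shortlist = heapq.nsmallest(factor * k, S, key=lambda c: cheap_dist(query, c))
    return heapq.nsmallest(k, shortlist, key=lambda c: exact_dist(query, c))

# Made-up 2D data: the cheap distance looks only at the first coordinate
S = [(float(i), 0.1 * i) for i in range(100)]
query = (50.2, 5.0)
cheap = lambda q, c: abs(q[0] - c[0])
result = filter_and_refine(S, query, 3, cheap, math.dist)
brute_force = heapq.nsmallest(3, S, key=lambda c: math.dist(query, c))
```

With this layout the exact distance is evaluated only `factor * k` times instead of once per element of `S`, which is where the saving comes from when the exact measure is expensive.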