Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations Fabian Schwarzer Itay Lotan.


1 Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations Fabian Schwarzer Itay Lotan

2 Motivation
SRS:
1. Sample conformations
2. Create edges between “neighboring” conformations
Ab initio structure prediction:
1. Generate a large decoy set
2. Cluster based on similarity
When the number of conformations is large, finding neighboring (similar) conformations is costly.

3 Similarity Measures Given the backbone Cα atom positions of two conformations – how similar are they? –Hard to define when comparing two different proteins –Straightforward when comparing two conformations of the same protein.

4 Similarity Measures We are interested in comparing conformations of the same protein, so there is a trivial correspondence between the two point sets. The two most common measures are: –cRMS deviation –dRMS deviation

5 cRMS T is the rigid-body transform that optimally aligns P and Q. cRMS is a metric, but the space is not Euclidean. There is a closed-form solution for T. Complexity is linear in the number of points (plus a 4×4 eigenvector computation).
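The slide's formula did not survive in the transcript. The standard definition of cRMS, written with the symbols the slide uses (P, Q, T), is given below as a reconstruction; the point names p_i, q_i and the count N are notation assumed here, not taken from the slide.

```latex
\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert p_i - T q_i \right\rVert^{2}}
```

Here p_i and q_i are the i-th Cα positions of the two conformations, N is the number of points, and the minimum is taken over all rigid-body transforms T.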

6 dRMS D is the internal distance matrix of a conformation. dRMS is a metric over a Euclidean space. Complexity is quadratic in the number of points (the size of the protein).
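The dRMS formula is also missing from the transcript. One common way to write it, using the internal distance matrix D from the slide, is sketched below; the normalization by the number of point pairs and the symbols p_i, q_i, N are assumptions, since the original equation is not available.

```latex
D^{P}_{ij} = \lVert p_i - p_j \rVert, \qquad
\mathrm{dRMS}(P, Q) = \sqrt{\frac{2}{N(N-1)} \sum_{i<j} \left( D^{P}_{ij} - D^{Q}_{ij} \right)^{2}}
```

Up to the constant factor, this is the Euclidean distance between the two conformations' internal distance matrices written as vectors of N(N-1)/2 entries, which is why the space is Euclidean and the cost is quadratic in N.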

7 k Nearest Neighbors Find the k nearest neighbors of every conformation in the set. Currently the fastest algorithm in practice for high dimensionality is brute force:
For each conformation q in the set:
–Compute the distance to all other conformations
–Find the k nearest conformations
Complexity is O(n^2 log k).
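The brute-force loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes dRMS as the distance (normalized by the number of point pairs), and the function names `internal_distances`, `drms`, and `brute_force_knn` are made up for this example.

```python
import heapq
import math
from itertools import combinations


def internal_distances(conf):
    """Flatten the upper triangle of the internal distance matrix into a list."""
    return [math.dist(a, b) for a, b in combinations(conf, 2)]


def drms(dp, dq):
    """dRMS between two conformations, given their internal-distance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(dp, dq)) / len(dp))


def brute_force_knn(confs, k):
    """For each conformation, return the indices of its k nearest neighbors.

    confs is a list of conformations, each a list of (x, y, z) tuples of equal
    length. Every pair is compared, and a bounded heap keeps the k best, which
    gives the O(n^2 log k) behavior in the number of conformations n.
    """
    vecs = [internal_distances(c) for c in confs]
    result = []
    for i, vi in enumerate(vecs):
        candidates = ((drms(vi, vj), j) for j, vj in enumerate(vecs) if j != i)
        result.append([j for _, j in heapq.nsmallest(k, candidates)])
    return result
```

Precomputing the internal-distance vectors once per conformation keeps the inner loop down to a single pass over two vectors.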

8 k Nearest Neighbors The literature has a number of efficient nearest-neighbor algorithms: –kd-trees are the most prevalent. We cannot use these algorithms directly: –They require a Euclidean space, which rules out cRMS –They are not efficient in high dimensions, which rules out full dRMS. We reduce the dimensionality of dRMS to make kd-trees applicable.

9 Uniform Simplification Cut the sequence into m equal subsequences. Average the coordinates of the Cα atoms in each subsequence. Use the averaged coordinates a_i when computing cRMS and dRMS. (Figure: backbone chain with averaged points a_0, a_1, ..., a_m.)
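The averaging step can be sketched as follows. This is a minimal illustration: the function name `uniform_simplify` is hypothetical, and the handling of chain lengths that m does not divide evenly (subsequence lengths differing by at most one) is an assumption, since the slide does not say how that case is treated.

```python
def uniform_simplify(coords, m):
    """Average consecutive C-alpha coordinates into m points.

    coords: list of (x, y, z) tuples in chain order.
    m: number of subsequences, with 1 <= m <= len(coords).
    """
    n = len(coords)
    if not 1 <= m <= n:
        raise ValueError("m must be between 1 and the number of atoms")
    averaged = []
    for s in range(m):
        # Integer arithmetic splits the chain into m contiguous, non-empty runs.
        start, end = s * n // m, (s + 1) * n // m
        chunk = coords[start:end]
        averaged.append(tuple(sum(axis) / len(chunk) for axis in zip(*chunk)))
    return averaged
```

For example, a four-atom chain reduced with m = 2 yields the midpoints of atoms 1-2 and atoms 3-4.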

10 Uniform Simplification - Results There is a high correlation between the full and the averaged representation when using cRMS and dRMS: –Proteins with 60–75 AA: r > 0.95 for m > 12 –Protein with 374 AA: r > 0.95 for m > 16. Even with m = 12, the internal distance matrix used by dRMS has 12·11/2 = 66 independent entries, which is too many dimensions for a kd-tree. Further reduction is needed.

11 Proteins (PDB ID, number of residues): 1HTB (374), 4PTI (58), 1R69 (63), 1CTF (68)

12 Further Reduction using SVD We apply SVD to the reduced distance matrices (stacked as vectors). We project the reduced matrices onto the most important singular vectors to further reduce the dimensionality.
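A minimal NumPy sketch of this projection is shown below. It assumes each row of `X` is one conformation's flattened internal-distance vector (for example, 66 entries for m = 12). The function name `svd_reduce` and the mean-centering step are assumptions made for this example; the slides do not say whether the data were centered.

```python
import numpy as np


def svd_reduce(X, d):
    """Project the rows of X onto the top-d right singular vectors.

    X: array of shape (n_conformations, n_distances).
    Returns the reduced coordinates (n_conformations, d), the basis (d, n_distances),
    and the column mean that was subtracted before the SVD (an assumption here).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:d]
    return Xc @ basis.T, basis, mean
```

Because dRMS is a scaled Euclidean distance between distance vectors, Euclidean distances between the reduced rows approximate dRMS up to that scale factor, with the approximation quality depending on how much variance the discarded singular vectors carried.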

13 Further Reduction – Results Averaging before creating the internal distances vector makes SVD feasible. For proteins with 60–75 AA, dRMS using only 20 parameters was highly correlated (r > 0.90) with dRMS using the full representation. 20 dimensions is not too many for kd-trees.

14 Finding k Nearest Neighbors We tested the actual ability of the reduced representation to find NNs. For decoy sets, 80 of the 100 true NNs (using dRMS) were found using the reduced representation. Results are better (90) when the data set contains uniformly sampled conformations. The maximal relative error was 10%–20% (0.5 Å–1.5 Å). The average relative error was < 5%.

15 Using kd-trees We used the ANN implementation (UMD kd-tree software). The data set contained 100,000 conformations. We want to find 100 NNs for each conformation.
Full rep., cRMS (brute force): ~52 h
Ave. rep., cRMS (brute force): ~35 h
Full rep., dRMS (brute force): ~84 h
Ave. rep., dRMS (brute force): ~4.8 h
SVD red. rep., dRMS (brute force): 41 min
SVD red. rep., dRMS (kd-tree): 19 min
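For readers who want to try the kd-tree step, here is a small sketch that uses SciPy's `cKDTree` as a stand-in for the ANN library named on the slide; it is not the authors' setup. The function name `kdtree_knn` is hypothetical, and `points` is assumed to hold one reduced vector (for example, 20 SVD coordinates) per conformation, with no exact duplicates.

```python
import numpy as np
from scipy.spatial import cKDTree


def kdtree_knn(points, k):
    """Return (distances, indices) of the k nearest neighbors of every row.

    points: array of shape (n, d) with Euclidean coordinates.
    Assumes no duplicate rows, so each point's closest hit is itself.
    """
    tree = cKDTree(points)
    # Ask for k + 1 neighbors, then drop the first column (the query point itself).
    dists, idx = tree.query(points, k=k + 1)
    return dists[:, 1:], idx[:, 1:]
```

The query returns exact Euclidean neighbors in the reduced space; any mismatch with the true dRMS neighbors comes from the averaging and SVD reduction, not from the tree.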

16 Why Does Averaging Work? The mean distance of the i-th point from the origin is O(N^0.5), and its standard deviation is also O(N^0.5). There is a very high correlation between dRMS using the full distances vector and dRMS using only the distances between “highly” separated points. The amount of distortion added by averaging has a mean of 0 and a standard deviation of O(n^0.5).

17 Conjecture: The important differences between two conformations are found in the distances between “highly” separated points. These distances are large and are therefore distorted only a little by averaging.

