Published by Poppy Blair. Modified over 8 years ago.
1
Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS, 248th National Meeting, San Francisco, CA, August 14th, 2014
2
Chemical space - 10^60
3
Navigation in chemical space
4
Clustering
5
Science dimensions
6
~30 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
A structure-centric hub for web-searching
7
ChemSpider
8
Properties
9
Classification
10
ChemSpider Data Slices
11
Tagging in ChemSpider
12
RSC Archive – since 1841
13
DERA - Digitally Enabling RSC Archive
14
Twelve broad categories
15
Largest category is 30 times the size of the smallest
16
200 subcategories
17
How does it work?
Latent Semantic Analysis to build feature sets for (1) articles and (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff) build a category vector.
All articles whose cosine similarity to the category vector is greater than an adjustable threshold go into the category.
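The final thresholding step can be illustrated with a minimal sketch. It assumes the article and category vectors already live in the same latent space. The LSA step itself is not shown, and the toy vectors, identifiers and the threshold value of 0.8 are hypothetical, not taken from the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def assign_to_category(article_vectors, category_vector, threshold):
    """Return ids of articles whose similarity to the category exceeds the threshold."""
    return [
        article_id
        for article_id, vector in article_vectors.items()
        if cosine_similarity(vector, category_vector) > threshold
    ]


# Hypothetical toy data: three articles in a 3-dimensional latent space.
articles = {
    "article_a": [0.9, 0.1, 0.0],
    "article_b": [0.1, 0.8, 0.3],
    "article_c": [0.7, 0.3, 0.1],
}
category = [1.0, 0.2, 0.0]

selected = assign_to_category(articles, category, threshold=0.8)
print(selected)
```

Raising or lowering `threshold` trades precision against recall for the category, which is presumably why the slide describes it as adjustable.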
18
RSC Data Repository
25
Structures similarity
Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Two example fingerprints, labelled X and Y: 01010110 and 01101101 (bit positions 01234567)
26
Structures similarity
Molecule Similarity
Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Examples:
Length: 0101011001010110
Sparse fingerprint (sFP): 0100010000000010
Dense fingerprint (dFP): 1101011001110111
Feature list: 01010110 becomes 1,3,5,6
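The properties above can be sketched in a few lines. The bit strings are the ones shown on the slide; the function names are hypothetical, and storing fingerprints as Python strings is an assumption made for readability rather than the representation used by the authors.

```python
def to_feature_list(bits):
    """Indices (0-based) of the 1-bits in a fingerprint given as a '0'/'1' string."""
    return [i for i, bit in enumerate(bits) if bit == "1"]


def from_feature_list(indices, length):
    """Rebuild the bit string from a feature list and the fingerprint length."""
    on_bits = set(indices)
    return "".join("1" if i in on_bits else "0" for i in range(length))


def density(bits):
    """Fraction of 1-bits in the fingerprint."""
    return bits.count("1") / len(bits)


features = to_feature_list("01010110")
print(features)                      # [1, 3, 5, 6]
print(density("0100010000000010"))   # sparse fingerprint: 0.1875
print(density("1101011001110111"))   # dense fingerprint: 0.6875
```

For a sparse fingerprint the feature list holds only a handful of integers regardless of the vector length, which is the memory saving the slide refers to.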
27
Structures similarity
Molecule Similarity
Molecules as binary vectors
Various chemoinformatics dis-/similarity measures:
– Euclidean distance
– Cosine similarity (inner product)
Most frequently used: Tanimoto coefficient (refs. 2, 3)
– Corresponds to Jaccard index
– Metric (as the distance 1 - Tanimoto)
– [0.0, 1.0] (dissimilar to similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
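For binary fingerprints the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in either. A minimal sketch on feature lists follows, using the two example fingerprints from the earlier slide. The function names and the choice to return 1.0 for two empty fingerprints are assumptions.

```python
def to_feature_list(bits):
    """Indices (0-based) of the 1-bits in a '0'/'1' string."""
    return [i for i, bit in enumerate(bits) if bit == "1"]


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as feature lists."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


first = to_feature_list("01010110")   # [1, 3, 5, 6]
second = to_feature_list("01101101")  # [1, 2, 4, 5, 7]

similarity = tanimoto(first, second)
print(round(similarity, 4))  # 2 shared bits out of 7 set in either: 0.2857
```

Working on feature lists means the cost depends on the number of 1-bits rather than on the fingerprint length, which suits sparse fingerprints.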
28
Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
ZINC all purchasable set: ~17 x 10^6 compounds (sFP)
Tanimoto cutoff analysis: 0.76
Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
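The slide does not spell out the algorithm, so the following is only a sketch under stated assumptions: that "CCs" stands for connected components, that two compounds are linked when their Tanimoto similarity reaches the cutoff, and that each connected component of the resulting graph forms one cluster. The toy feature lists are hypothetical, and the brute-force pairwise loop is for illustration only; it would not scale to the ~17 x 10^6 compounds mentioned above.

```python
from itertools import combinations


def tanimoto(features_a, features_b):
    """Tanimoto similarity of two fingerprints given as feature lists."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def connected_component_clusters(fingerprints, cutoff):
    """Group fingerprint indices into connected components of the cutoff graph."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the cutoff (O(n^2) comparisons).
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy fingerprints as feature lists.
toy = [
    [1, 2, 3, 4],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 5, 6],
    [10, 11, 12],
    [10, 11, 12, 13],
]

clusters = connected_component_clusters(toy, cutoff=0.76)
print(clusters)  # [[0, 1, 2], [3], [4]]
```

In the toy data, compounds 0 and 2 have a similarity of only about 0.67 but end up in the same cluster because both are linked to compound 1. This chaining is inherent to a connected-components approach. Compounds 3 and 4 score 0.75, just below the cutoff, and stay separate.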
29
Federated linked system
30
Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16