
Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals Valery Tkachenko, Ken Karapetyan,


1 Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals. Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor. ACS, 248th National Meeting, San Francisco, CA, August 14th, 2014

2 Chemical space - 10^60

3 Navigation in chemical space

4 Clustering

5 Science dimensions

6 ~30 million chemicals and growing Data sourced from >500 different sources Crowdsourced curation and annotation Ongoing deposition of data from our journals and our collaborators A structure centric hub for web-searching

7 ChemSpider

8 Properties

9 Classification

10 ChemSpider Data Slices

11 Tagging in ChemSpider

12 RSC Archive – since 1841

13 DERA - Digitally Enabling RSC Archive

14 Twelve broad categories

15 Largest category is 30 times the size of the smallest

16 200 subcategories

17 How does it work? Latent Semantic Analysis to build feature sets for (1) articles (2) categories. Features: words, citations and pairs of words. Domain experts (Journal Development staff) build a category vector. All articles with a cosine similarity greater than an adjustable threshold go into the category.
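The thresholding step on this slide can be sketched in a few lines. This is a minimal illustration, not the RSC implementation: it skips the Latent Semantic Analysis step entirely and assumes articles and the category are already available as sparse feature vectors. The feature names, weights, threshold value, and the helper names `cosine_similarity` and `assign_articles` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def assign_articles(articles, category_vector, threshold):
    """Return the ids of all articles whose similarity exceeds the threshold."""
    return [
        article_id
        for article_id, vector in articles.items()
        if cosine_similarity(vector, category_vector) > threshold
    ]


# Hypothetical category vector, as a domain expert might build it.
category = {"catalysis": 1.0, "palladium": 0.8}

# Hypothetical article feature vectors.
articles = {
    "a1": {"catalysis": 0.9, "palladium": 0.7},
    "a2": {"polymer": 1.0, "catalysis": 0.1},
}

selected = assign_articles(articles, category, threshold=0.5)
print(selected)
```

Raising or lowering `threshold` trades category size against precision, which is presumably why the slide describes it as adjustable.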

18 RSC Data Repository

19

20

21

22

23

24

25 Structures similarity: Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Two example fingerprints, labelled X and Y: 01010110, 01101101 (bit positions 0 to 7)

26 Structures similarity: Molecule Similarity
Important fingerprint properties:
1. Length: length of the binary vector (example: 0101011001010110)
2. Density: fraction of 1-bits
–Sparse fingerprint (sFP): 0100010000000010
–Dense fingerprint (dFP): 1101011001110111
Various fingerprint types exist
–Different atom typing and generation procedure
–Different properties (length, density, ...)
Alternative representation: Feature list
–Store only index numbers of vector positions
–Memory-efficient storage
–Example: 01010110 becomes 1,3,5,6
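The two fingerprint properties and the feature-list representation can be checked directly on the bit strings shown on this slide. The sketch below assumes fingerprints are given as strings of "0" and "1" with positions counted from 0; the function names are illustrative only.

```python
def to_feature_list(bits):
    """Return the indices of the 1-bits (the feature-list representation)."""
    return [index for index, bit in enumerate(bits) if bit == "1"]


def density(bits):
    """Fraction of positions in the fingerprint that are set to 1."""
    return bits.count("1") / len(bits)


features = to_feature_list("01010110")
sparse_density = density("0100010000000010")  # sparse fingerprint from the slide
dense_density = density("1101011001110111")   # dense fingerprint from the slide

print(features)        # [1, 3, 5, 6]
print(sparse_density)  # 0.1875
print(dense_density)   # 0.6875
```

For sparse fingerprints the feature list stores only a handful of integers instead of the full bit vector, which is where the memory saving comes from.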

27 Structures similarity: Molecule Similarity
Molecules as binary vectors
Various chemoinformatics dis-/similarity measures:
–Euclidean distance
–Cosine similarity (normalized inner product)
Most frequently used: Tanimoto coefficient [2,3]
–Corresponds to Jaccard index
–Metric (as the distance 1 − Tanimoto)
–Range [0.0, 1.0] (dissimilar → similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
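For binary fingerprints the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in at least one of them. A minimal sketch on feature lists follows, using the two 8-bit example strings from slide 25. Returning 1.0 for two empty fingerprints is a convention assumed here, not something the slides specify.

```python
def to_feature_list(bits):
    """Return the indices of the 1-bits of a fingerprint given as a bit string."""
    return [index for index, bit in enumerate(bits) if bit == "1"]


def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as feature lists."""
    set_x, set_y = set(x), set(y)
    union = len(set_x | set_y)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(set_x & set_y) / union


first = to_feature_list("01010110")   # bits 1, 3, 5, 6
second = to_feature_list("01101101")  # bits 1, 2, 4, 5, 7

similarity = tanimoto(first, second)  # 2 shared bits out of 7 set in either
distance = 1.0 - similarity           # the corresponding Jaccard distance
print(similarity, distance)
```

Working on feature lists rather than full bit vectors keeps the cost proportional to the number of set bits, which suits sparse fingerprints.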

28 Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
ZINC all purchasable set: ~17×10^6 compounds (sFP)
Tanimoto cutoff analysis: 0.76
Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
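One way to read this slide is that compounds are linked whenever their Tanimoto similarity reaches the cutoff, and clusters are the connected components ("CCs") of the resulting graph. That reading is an assumption; the slides do not spell out the algorithm. The toy sketch below uses a brute-force all-pairs comparison with a union-find structure, which is only practical for small sets and says nothing about how the 17 million compound run was actually implemented. The fingerprints are made up for illustration.

```python
from itertools import combinations


def tanimoto(x, y):
    """Tanimoto coefficient of two fingerprints given as sets of feature indices."""
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def connected_components(fingerprints, cutoff):
    """Group fingerprint indices into connected components of the similarity graph."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the cutoff (O(n^2) comparisons).
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sparse fingerprints as feature sets.
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 5, 6},
    {10, 11, 12},
    {20, 21},
]

clusters = connected_components(fingerprints, cutoff=0.76)
print(clusters)
```

Note that compounds 0 and 2 have a similarity of only 4/6, below the cutoff, yet land in the same component because both are linked to compound 1. This chaining effect is inherent to connected-component clustering.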

29 Federated linked system

30 Thank you Email: tkachenkov@rsc.org Slides: http://www.slideshare.net/valerytkachenko16

