Clustering the Royal Society of Chemistry chemical repository to enable enhanced navigation across millions of chemicals
Valery Tkachenko, Ken Karapetyan, Antony Williams, Oliver Kohlbacher, Philipp Thiel, Colin Batchelor
ACS 248th National Meeting, San Francisco, CA, August 14th, 2014

Chemical space - 10^60

Navigation in chemical space

Clustering

Science dimensions

~30 million chemicals and growing
Data sourced from >500 different sources
Crowdsourced curation and annotation
Ongoing deposition of data from our journals and our collaborators
A structure-centric hub for web-searching

ChemSpider

Properties

Classification

ChemSpider Data Slices

Tagging in ChemSpider

RSC Archive – since 1841

DERA - Digitally Enabling RSC Archive

Twelve broad categories

Largest category is 30 times the size of the smallest

200 subcategories

How does it work?
Latent Semantic Analysis to build feature sets for (1) articles and (2) categories.
Features: words, citations and pairs of words.
Domain experts (Journal Development staff) build a category vector.
All articles whose cosine similarity to the category vector is greater than an adjustable threshold go into the category.
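The final thresholding step can be illustrated with a minimal sketch. It is not the DERA implementation: the Latent Semantic Analysis projection is left out, citations are not used as features, and the function names, sample texts and the 0.3 threshold are all hypothetical. It only shows how articles with a cosine similarity above a threshold are assigned to a category.

```python
import math
from collections import Counter


def features(text):
    """Count words and adjacent word pairs (citations omitted in this toy example)."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return Counter(words + pairs)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign(articles, category_vector, threshold):
    """Return the ids of articles whose similarity exceeds the threshold."""
    return [
        article_id
        for article_id, text in articles.items()
        if cosine(features(text), category_vector) > threshold
    ]


# Hypothetical articles and a hand-built category vector.
articles = {
    "A": "palladium catalysed cross coupling of aryl halides",
    "B": "protein folding dynamics from molecular simulation",
}
category = features("catalysed cross coupling palladium catalysis")
selected = assign(articles, category, threshold=0.3)
```

Raising or lowering `threshold` trades category purity against coverage, which is presumably why the slide describes it as adjustable.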

RSC Data Repository

Structures similarity
Molecule Similarity
Similarity?
Suitable in silico representation: 2D binary fingerprints
Bit index: 01234567
X: 01101101
Y: 01010110

Structures similarity
Molecule Similarity
Important fingerprint properties:
1. Length: length of the binary vector
2. Density: fraction of 1-bits
Various fingerprint types exist
– Different atom typing and generation procedure
– Different properties (length, density, ...)
Alternative representation: Feature list
– Store only index numbers of vector positions
– Memory-efficient storage
Length: 0101011001010110
Sparse fingerprint (sFP): 0100010000000010
Dense fingerprint (dFP): 1101011001110111
Feature list: 01010110 becomes 1,3,5,6
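The two fingerprint properties and the feature-list representation can be checked on the bit strings from the slide. This is a small sketch with hypothetical helper names. It assumes fingerprints are given as strings of "0" and "1" and that positions are counted from 0, which matches the slide's example (01010110 gives 1,3,5,6).

```python
def fp_length(fp):
    """Length of the binary vector."""
    return len(fp)


def fp_density(fp):
    """Fraction of 1-bits in the fingerprint."""
    return fp.count("1") / len(fp)


def to_feature_list(fp):
    """Indices (0-based) of the positions that hold a 1-bit."""
    return [i for i, bit in enumerate(fp) if bit == "1"]


def from_feature_list(indices, length):
    """Rebuild the bit string from a feature list and the vector length."""
    on = set(indices)
    return "".join("1" if i in on else "0" for i in range(length))


sparse = "0100010000000010"  # sFP example from the slide
dense = "1101011001110111"   # dFP example from the slide

sparse_density = fp_density(sparse)    # 3 of 16 bits set
dense_density = fp_density(dense)      # 11 of 16 bits set
features = to_feature_list("01010110")
```

For a sparse fingerprint the feature list stores only a few integers instead of the whole vector, which is the memory saving the slide refers to.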

Structures similarity
Molecule Similarity
Molecules as binary vectors
Various chemoinformatics dis-/similarity measures:
– Euclidean distance
– Cosine similarity (inner product)
Most frequently used: Tanimoto Coefficient (refs. 2, 3)
– Corresponds to Jaccard index
– Metric
– [0.0, 1.0] (dissimilar to similar)
2. Jaccard P., Bulletin de la Société Vaudoise des Sciences Naturelles (1901), 37, 547-579
3. Tanimoto T.T., IBM Internal Report (1957)
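For binary fingerprints the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in either. The sketch below computes it on feature lists for the two 8-bit fingerprints shown earlier in the deck. The function names are hypothetical, and returning 1.0 for two empty fingerprints is an assumed convention, not something the slides state.

```python
def on_bits(fp):
    """Set of 0-based positions holding a 1-bit in a bit string."""
    return {i for i, bit in enumerate(fp) if bit == "1"}


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of on-bit positions."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


first = on_bits("01101101")   # positions 1, 2, 4, 5, 7
second = on_bits("01010110")  # positions 1, 3, 5, 6
similarity = tanimoto(first, second)  # 2 shared bits out of 7 set in either
```

Because the measure is symmetric, the order of the two fingerprints does not matter.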

Full Similarity Matrix Clustering
Results: Clustering the Available Chemspace
ZINC all purchasable set: ~17 x 10^6 compounds (sFP)
Tanimoto cutoff analysis: 0.76
Opteron, 64 threads, 100 GB main memory
Total run-time: 64 hours
CCs decomposition: 12 hours
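One way to read this slide, assuming "CCs" stands for connected components, is that compounds are linked whenever their Tanimoto similarity reaches the cutoff and the resulting graph is split into connected components. The sketch below shows that idea on four toy fingerprints with a union-find structure. It is a naive all-pairs illustration, not the authors' parallel implementation, and the fingerprints are made up. The value 0.76 is borrowed from the slide only as an illustrative default.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cluster(fingerprints, cutoff=0.76):
    """Group fingerprints into connected components of the cutoff graph."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent links to the component root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Full similarity matrix: every pair is compared once, O(n^2) comparisons.
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto(fingerprints[i], fingerprints[j]) >= cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(fingerprints)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical feature-list fingerprints.
fps = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 5, 6},  # 5/6 similar to the first
    {1, 2, 3, 4, 6},     # 5/6 similar to the second, only 4/6 to the first
    {10, 11, 12},        # shares no bits with the others
]
clusters = cluster(fps)
```

Note that the first and third fingerprints fall below the cutoff when compared directly but still end up in the same cluster through the second one. This chaining is a property of connected-component clustering. The quadratic number of comparisons also suggests why a set of about 17 million compounds needs many threads and many hours.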

Federated linked system

Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16
