1 Measuring Scholarly Impact and Beyond. Ying Ding, Indiana University

2 Outline
– New Research Analytics
– Tool: Data2Knowledge Platform
– Future Directions

3 Next Generation of Bibliometrics
Newly developed methods allow in-depth analysis of scholarly communication:
– Topic modeling (e.g., Latent Dirichlet Allocation)
– Information extraction (e.g., entity extraction)
– Social network analysis (e.g., community detection)
Big data demonstrates the power of connected data to enable knowledge discovery:
– Structured data
– Unstructured data
– Social media data
Ding, Y., Rousseau, R., & Wolfram, D. (Eds.) (2014). Measuring scholarly impact: Methods and practice. Springer.

4 Content-based Impact Analysis
There are two levels:
– Syntactic level (position)
  – Papers cited in different sections of the articles
  – How many times papers are mentioned in one article
– Semantic level (semantics)
  – Citation sentiment analysis: sentence level, window size
  – Concept-based: knowledge concept level (e.g., topic, knowledge unit/entity, or bio-entities)
Ding, Y., Song, M., Wang, X., Zhang, G., Zhai, C., & Chambers, T. (2014). Content-based citation analysis: The next generation of citation analysis. Journal of the American Society for Information Science & Technology, 65(9), 1820-1833.
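The syntactic level can be illustrated with a minimal sketch that counts how often each cited work is mentioned, overall and per section. The section texts, the "(Author, Year)" citation pattern, and the names `sections` and `count_mentions` are assumptions made for illustration; they do not come from the cited paper.

```python
import re
from collections import Counter, defaultdict

# Hypothetical input: section name -> section text with "(Author, Year)" citations.
# "Smith, 2010" and "Lee, 2012" are placeholder citations, not real references.
sections = {
    "Introduction": "Topic models were used earlier (Smith, 2010). See also (Lee, 2012).",
    "Methods": "We follow (Smith, 2010) and extend (Smith, 2010) with entities.",
}

# Simplified citation pattern: one capitalized surname and a four-digit year.
CITATION = re.compile(r"\(([A-Z][A-Za-z]+), (\d{4})\)")

def count_mentions(sections):
    """Return (mentions per cited work, mentions per section per cited work)."""
    total = Counter()
    by_section = defaultdict(Counter)
    for name, text in sections.items():
        for author, year in CITATION.findall(text):
            key = f"{author} {year}"
            total[key] += 1
            by_section[name][key] += 1
    return total, by_section

total, by_section = count_mentions(sections)
```

In this toy input the same work is cited once in the introduction and twice in the methods section, which is the kind of positional signal the syntactic level is meant to capture.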

5 Semantic Level
Concepts:
– Topics
– Major entities in research
  – Bio-entities
  – Knowledge units (e.g., domain theories, well-established algorithms)
Why not keywords?
– Literal words are ambiguous
– They are not normalized
– But they can be a starting point for extracting concepts
– Concept: keywords, synonyms (students vs. pupils), antonyms (birth vs. death), homonyms (pupil (student) vs. pupil (part of the eye)), etc.
Jeong, Y., Song, M., & Ding, Y. (2014). Content-based author co-citation analysis. Journal of Informetrics, 8(1), 197-211.

6 EntityMetrics
Entitymetrics is defined as using entities (i.e., evaluative entities or knowledge entities) in the measurement of impact, knowledge usage, and knowledge transfer, to facilitate knowledge discovery.
Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS One, 8(8): 1-14.

7 EntityMetrics

8 PubMed Entities [Figure: entity types Drug, Disease, Protein, Pathway, Gene; relation: Cite]

9 Entity Graph: Heterogeneous Entity Graph [Figure: node types Drug, Disease, Protein, Pathway, Gene; example nodes Bcl-2 Inhibitor, p53, AMPK, Diabetes, Cancer, STAT3, Metformin, PI3K, Breast Cancer; edge labels ci (cite), co (co-occur), ci/co]

10 Metformin-Related Entity-Entity Citation Network
Data: 4,770 articles retrieved from PubMed Central, with 134,844 references and 1,969 bio-entities (880 genes, 376 drugs, and 713 diseases)

11 Metformin-Related Entity-Entity Citation Network

12 Dataset

13 Main Path

14 Main Path

15 Main Path

16 Main Path

17 Entity Citation Network vs. Entity Co-Occurrence Network
Gene-Gene Co-Occurrence Network (GG) vs. Gene-Cite-Gene Network (GCG)
– The GCG network shares many genes with the GG network and, as a result, is a competitive complement to it
– Basing gene relationships on citation relations relaxes the assumption that gene interaction is limited to the same article, and opens up a new opportunity to analyze gene interaction across a wider spectrum of datasets
– 1,149 gene pairs from GCG were found in GG. Of these, 164 pairs were not found in GG before 2005 but were found in GCG before 2005. In particular, the PARK2 and PINK1 gene pair ranks fifth by co-occurrence frequency in the GG network, implying that the pair has been studied intensively since 2005
Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS One, 8(12), e84639.
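The difference between the two networks can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline from the cited paper: the toy papers, the citation list, and the helper names (`cooccurrence_pairs`, `citation_pairs`) are hypothetical, and the gene assignments carry no biological meaning.

```python
from itertools import combinations, product

# Hypothetical toy data: genes mentioned in each paper, and (citing, cited) links.
genes_in = {
    "P1": {"PARK2", "PINK1"},
    "P2": {"PINK1"},
    "P3": {"PARK2", "LRRK2"},
}
cites = [("P2", "P3"), ("P1", "P3")]

def pair(a, b):
    """Normalize an undirected gene pair so (a, b) and (b, a) compare equal."""
    return tuple(sorted((a, b)))

def cooccurrence_pairs(genes_in):
    """GG: two genes are linked when they appear in the same paper."""
    pairs = set()
    for genes in genes_in.values():
        pairs.update(pair(a, b) for a, b in combinations(sorted(genes), 2))
    return pairs

def citation_pairs(genes_in, cites):
    """GCG: a gene in the citing paper is linked to a gene in the cited paper."""
    pairs = set()
    for citing, cited in cites:
        for a, b in product(genes_in[citing], genes_in[cited]):
            if a != b:
                pairs.add(pair(a, b))
    return pairs

gg = cooccurrence_pairs(genes_in)
gcg = citation_pairs(genes_in, cites)
shared = gg & gcg      # pairs visible in both networks
only_gcg = gcg - gg    # pairs that only the citation relation reveals
```

Here the LRRK2 and PINK1 pair never shares a paper, so it appears only in `only_gcg`; that is the sense in which the citation relation extends the same-article assumption.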

18 Big Data in Life Sciences
There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
– 69 million compounds and 449,392 bioassays (PubChem)
– 59 million compound bioactivities (PubChem Bioassay)
– 4,763 drugs (DrugBank)
– 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
– 14 million human nucleotide sequences (EMBL)
– 22 million life sciences publications, with 800,000 new each year (PubMed)
– A multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics, …)
Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
– Biological assay with percent inhibition, IC50, etc.
– Crystal structure of a ligand/protein complex
– Co-occurrence in a paper abstract
– Computational experiment (docking, predictive model)
– Statistical relationship
– System association (e.g., involvement in the same pathways or cellular processes)
Wild, D. J., Ding, Y., Sheth, A. P., Harland, L., Gifford, E. M., & Lajiness, M. S. (2012). System chemical biology and the Semantic Web: What they mean for the future of drug discovery research. Drug Discovery Today (impact factor=6.422), 17(9-10), 469-474.

19 [Figure: entity types Drug, Protein, DNA, Tissue, Patient, RNA, Cell, Pathway, Disease; source formats Text, CSV, Table, HTML, XML]

20 Chem2Bio2RDF
Integrated data sources:
– NCI Human Tumor Cell Lines Data
– PubChem Compound Database
– PubChem Bioassay Database
– PubChem Descriptions of all PubChem bioassays
– Pub3D: a similarity-searchable database of minimized 3D structures for PubChem compounds
– DrugBank
– MRTD: an implementation of the Maximum Recommended Therapeutic Dose set
– Medline: IDs of papers indexed in Medline, with SMILES of chemical structures
– ChEMBL chemogenomics database
– KEGG Ligand pathway database
– Comparative Toxicogenomics Database
– PhenoPred Data
– HuGEpedia: an encyclopedia of human genetic variation in health and disease
Scale: 31m chemical structures, 59m bioactivity data points, 3m/19m publications, ~5,000 drugs
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., & Wild, D. (2010). Chem2Bio2RDF: A semantic framework for linking and mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11, 255.

21 [Figure: architecture. Data sources: UniProt, Bio2RDF, LODD, Chem2Bio2RDF, others. Components: RDF triple store, SPARQL endpoints, dereferenceable URIs, browsing, PlotViz visualization, Cytoscape plugin, linked path generation and ranking, third-party tools]

22 Chen, B., Ding, Y., & Wild, D. J. (2012). Improving integrative searching of systems chemical biology data using semantic annotation. Journal of Cheminformatics, 4:6 (doi:10.1186/1758-2946-4-6).

23 Semantic Graph Mining: Path Finding Algorithm
Dijkstra’s algorithm [Figure: example graph with nodes numbered 1 to 26]
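As a reminder of how path finding with Dijkstra's algorithm works on a weighted graph, here is a minimal sketch. The entity names are borrowed from the entity graph slide, but the edges and weights are hypothetical and make no biological claim; the slides do not say how edge weights are assigned in the actual system.

```python
import heapq

def dijkstra(graph, source, target):
    """Return (cost, path) of a cheapest path, or (inf, []) if unreachable.

    graph maps each node to a dict of {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry for an already settled node
        done.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical toy graph: edges and weights are illustrative only.
graph = {
    "Metformin": {"AMPK": 1.0, "Diabetes": 2.0},
    "AMPK": {"p53": 1.0},
    "Diabetes": {"p53": 3.0},
    "p53": {"Breast Cancer": 1.0},
}
cost, path = dijkstra(graph, "Metformin", "Breast Cancer")
```

With these weights the route through AMPK (total cost 3.0) beats the route through Diabetes (total cost 6.0).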

24 Bio-LDA
Latent Dirichlet Allocation (LDA)
– The core of a group of powerful statistical modeling techniques for automated extraction of latent topics from large document collections
Bio-LDA
– Extended LDA model with bio-terms as a latent variable
– Bio-terms: compound, gene, drug, disease, protein, side effect, pathways
– Calculate bio-term entropies over topics
– Use the Kullback-Leibler divergence as the non-symmetric distance measure for two bio-terms over topics
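The two measures named above have standard definitions, sketched here for discrete distributions over topics. The topic distributions for the two bio-terms are hypothetical numbers, and the `eps` guard against zero probabilities is an implementation assumption, not something stated in the slides.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats.

    eps guards against division by zero when q assigns zero probability
    to a topic that p uses (an assumption made for this sketch).
    """
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

# Hypothetical distributions of two bio-terms over three topics.
term_a = [0.7, 0.2, 0.1]
term_b = [0.4, 0.4, 0.2]

d_ab = kl_divergence(term_a, term_b)
d_ba = kl_divergence(term_b, term_a)
```

`d_ab` and `d_ba` differ, which is what "non-symmetric" means here: the distance from one bio-term to another depends on the direction of comparison.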

25 Example: Topic 10
Bio-LDA was applied to 336,899 PubMed article abstracts from 2009 to extract 50 topics.
Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J., & Wild, D. (2011). Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLoS One, 6(3): e17243. doi:10.1371/journal.pone.0017243.

26 Thiazolidinediones (TZDs): revolutionary treatment for type II diabetes
– Troglitazone (Rezulin): withdrawn in 2000 (liver disease)
– Rosiglitazone (Avandia): restricted in 2010 (cardiac disease)
– Pioglitazone: ???? (decreases blood sugar levels, but was associated with bladder tumors and has been withdrawn in some countries)
[Figure: rosiglitazone bound to PPAR-γ]

27
– PPARG: TZD target
– SAA2: involved in inflammatory response; implicated in cardiovascular disease (Current Opinion in Lipidology, 15(3), 269-278, 2004)
– APOE: apolipoprotein E3, essential for lipoprotein catabolism; implicated in cardiovascular disease
– ADIPOQ: adiponectin, involved in fatty acid metabolism; implicated in metabolic syndrome, diabetes, and cardiovascular disease
– CYP2C8: cytochrome P450, present in cardiovascular tissue and involved in metabolism of xenobiotics
– CDKN2A: tumor suppressor gene
– SLC29A1: membrane transporter

28 Semantic Prediction
Chen, B., Ding, Y., & Wild, D. (2012). Assessing drug target association using semantic linked data. PLoS Computational Biology, 8(7): e1002574. doi:10.1371/journal.pcbi.1002574.

29 Example: Troglitazone and PPARG
Association score: 2385.9. Association significance: 9.06 x 10^-6 => missing link predicted

30 Topology is important for association [Figure: two paths over the nodes Cmpd 1, Protein 1, and Cmpd 2; edge labels: hasSubstructure, bind]

31 Semantics is important for association [Figure: example paths among Cmpd 1, Cmpd 2, Protein 1, and Protein 2; edge labels: bind, PPI, hasGO, hasSideeffect, hasSubstructure; intermediate nodes: GO:00001, hypertension, substructure1]

32 SLAP Pipeline: path filtering

33 Cross-check with SEA
– SEA analysis (Nature 462, 175-181, 2009) predicted 184 new compound-target pairs, 30 of which were experimentally tested
– 23 of these pairs were experimentally validated (<15 µM), including 15 aminergic GPCR targets and 8 that crossed major receptor classification boundaries
– 9 of the aminergic GPCR target pairings were correctly predicted by SLAP (p<0.05); for the other 6, the compounds were not present in our set
– 1 of the 8 cross-boundary pairs was predicted

34 Assessing drug similarity from biological function
– Took 157 drugs with 10 known therapeutic indications and created SLAP profiles against 1,683 human targets
– A Pearson correlation > 0.9 between SLAP profiles was used to create associations between drugs
– Drugs with the same therapeutic indication unsurprisingly cluster together
– Some drugs with similar profiles have different indications: potential for use in drug repurposing?
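The thresholding step can be sketched in a few lines. Only the Pearson correlation and the 0.9 cut-off come from the slide; the drug names, the four-target profiles, and the function names are hypothetical, and real profiles would span 1,683 targets.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / (sx * sy)

# Hypothetical association profiles (one score per target) for three drugs.
profiles = {
    "drug_A": [0.9, 0.1, 0.8, 0.2],
    "drug_B": [0.8, 0.2, 0.9, 0.1],
    "drug_C": [0.1, 0.9, 0.2, 0.8],
}

def similar_pairs(profiles, threshold=0.9):
    """Link two drugs when the correlation of their profiles exceeds threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) > threshold
    ]

links = similar_pairs(profiles)
```

In this toy set only drug_A and drug_B are linked; drug_C is strongly anti-correlated with both and stays unconnected.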

35 Data2Knowledge platform
– AMiner: mining knowledge from articles (researcher profiling, expert search, topic analysis, reviewer suggestion)
– PMiner: mining knowledge from patents (competitor analysis, company search, patent summarization)
– SLAP: mining drug discovery data (predicting targets, repurposing drugs, heterogeneous graph search)
– Mining more data …

36 AMiner
Functions: research profiling, integration, interest analysis, topic analysis, course search, expert search, association, disambiguation, suggestion, geo search, collaboration recommendation
– Researchers: 31,222,410
– Publications: 69,962,333
– Conferences/Journals: 330,236
– Citations: 133,196,029
– Knowledge Concepts: 7,854,301

37 Expert Search: basic info, citation statistics, research interests, publications, social network

38 Expertise Search: finding top experts, top conferences, and highly cited papers for “data mining”

39 Geographic Search: finding the hottest regions for “data mining”

40 Conference Analysis
– Which year is the most successful year in KDD’s history?
– Who are the highly cited authors?
– What is the author nationality distribution for the highly cited KDD papers in the past years?

41 Reviewer Suggestion
– Interest matching
– COI avoidance
– Load balancing
– Forecasting review quality

42 Cross-domain Collaboration Recommendation
– What are the cross-domain topics on which you can work in the target domain?
– Who are the best collaborators on each of these topics?

43 Topic Browser: 200 topics have been discovered automatically from the academic articles

44 Academic Performance Measure: academic statistics, new stars

45 Widely used
– The largest publisher: Elsevier
– Conferences: KDD 2010, KDD 2011, KDD 2012, KDD 2013, KDD 2014, WSDM 2011, WSDM 2014, ICDM 2011, ICDM 2012, SocInfo 2011, ICMLA 2011, WAIM 2011, etc.

46 What is PMiner?
Current patent analysis systems focus on search
– Google Patent, WikiPatent, FreePatentsOnline
PMiner is designed for an in-depth analysis of patent activity at the topic level
– Topic-driven modeling of patents
– Heterogeneous network co-ranking
– Intelligent competitive analysis
– Patent summarization
Patent data: > 3.8M patents, > 2.4M inventors, > 400K companies, > 10M citation relationships
Journal data: > 2k journal papers, > 3.7k authors
The crawled data is growing to > 300 gigabytes.
J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, and A. K. Usadi. PatentMiner: Topic-driven Patent Analysis and Mining. KDD'12. pp. 1366-1374.

47 Patent Search: topics of search results, top inventors, top patents, top companies

48 Topic-based Analysis for “Microsoft”

49 A court decision in 08/2012: Samsung’s Galaxy smartphone infringed a series of Apple’s iPhone patents. Besides 4 appearance design patents, these included 3 software patents, known as 381, 915, and 163, which respectively cover “bounce back”, “pinch-to-zoom”, and “tap-to-zoom”. All 3 software patents belong to the following three patent categories: active solid-state devices (touch screen), computer graphics processing (graph scaling), and selective visual display systems (tap to select).

50 Demo
Y. Yang, J. Tang, J. Keomany, Y. Zhao, Y. Ding, J. Li, and L. Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks. CIKM’12. pp. 1432-1441.


52 Semantic Publishing: turn literature knowledge into actionable data to generate more powerful knowledge

53 [Figure: a passage of article text mentioning Gene 1, Disease 2, and the citation (XYZ, 2002), annotated across Paper 1, 2, …]
– Paper 1: Gene 1, Disease 2, (XYZ, 2002)
– Paper 2: Gene 1, Disease 2, (XYZ, 2002)
– ……
– Paper n: Gene 1, Disease 2, (XYZ, 2002)
This demonstrates strong evidence of the connection/relationship between the concepts Gene 1 and Disease 2, or between the concept types GENE and DISEASE
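One way to read this slide is as evidence aggregation: collect the papers that mention both entities, together with the references they all cite. The sketch below assumes papers have already been annotated with entities and citations; the data structure and the name `collect_evidence` are hypothetical and only mirror the placeholders on the slide.

```python
# Hypothetical annotations mirroring the slide: each paper lists the entities
# it mentions and the references it cites.
papers = {
    "Paper 1": {"entities": {"Gene 1", "Disease 2"}, "cites": {"XYZ, 2002"}},
    "Paper 2": {"entities": {"Gene 1", "Disease 2"}, "cites": {"XYZ, 2002"}},
    "Paper 3": {"entities": {"Gene 1"}, "cites": {"XYZ, 2002"}},
}

def collect_evidence(papers, a, b):
    """Return (papers mentioning both a and b, references cited by all of them)."""
    supporting = sorted(pid for pid, p in papers.items() if {a, b} <= p["entities"])
    shared = None
    for pid in supporting:
        cites = papers[pid]["cites"]
        shared = set(cites) if shared is None else shared & cites
    return supporting, sorted(shared or [])

supporting, shared_refs = collect_evidence(papers, "Gene 1", "Disease 2")
```

The more papers end up in `supporting`, the stronger the evidence for a relationship between the two entities, and `shared_refs` points to the sources that the supporting papers have in common.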

54 Knowledge Graph

55 Ongoing Initiatives
– Semantic Scholar: a new service for scientific literature search and discovery, focusing on semantics and textual understanding (Allen Institute for Artificial Intelligence)
– Automated hypothesis generation: the ever-expanding volume of publications poses a fundamental threat to scientists on the one hand and, on the other, creates a unique opportunity to discover new knowledge, because the accumulated publications contain minable knowledge ( helps-researchers-decide-what-they-should-be-looking )
– Becoming critical: WWW2015 workshops
  – 2nd Workshop on Knowledge Extraction from Text (KET 2015)
  – 2015 Workshop on “Semantics, Analytics and Visualization: Enhancing Scholarly Data” (SAVE-SD 2015)
  – The 2nd WWW Workshop on Big Scholarly Data: Towards the Web of Scholars (BigScholar)

56 Knowledge is power!

57 Acknowledgements
Thanks to all the collaborators:
– David Wild, Kyle Stirling, Judy Qiu (Indiana)
– Jie Tang, Juanzi Li, Jing Zhang, Zhanpeng Fang, Yang Yang (Tsinghua)
– Jim Walson (Panoscopix)
– Bing He (Johns Hopkins)
– Chengxiang Zhai, Chi Wang, Brian Foote (UIUC)
– Eric M. Gifford, Huijun Wang (Merck)
– Bin Chen (Stanford)
– Michael S. Lajiness (Eli Lilly)
