Presentation is loading. Please wait.

Presentation is loading. Please wait.

Measuring Scholarly Impact and Beyond

Similar presentations

Presentation on theme: "Measuring Scholarly Impact and Beyond"— Presentation transcript:

1 Measuring Scholarly Impact and Beyond
Ying Ding Indiana University

2 Outline New Research Analytics Tool: Data2Knowledge Platform
Future Directions

3 Next Generation of Bibliometrics
Newly developed methods allow in-depth analysis of scholarly communication Topic modeling (e.g., Latent Dirichlet Allocation) Information Extraction (e.g., Entity Extraction) Social Network Analysis (e.g., Community Detection) Big data demonstrates the power of connected data to enable knowledge discovery Structured data Unstructured data Social media data Ding, Y., Rousseau, R., & Wolfram, D. (Eds.) (2014). Measuring scholarly impact: Methods and practice. Springer.

4 Content-based Impact Analysis
There are two levels: Syntactic level (position) Papers cited in different sections of the articles How many times papers are mentioned in one article Semantic level (semantics) Citation Sentiment Analysis: sentence level, window-size Concept-based: knowledge concept level (e.g., topic, knowledge unit/entity, or bio-entities) Ding, Y., Song, M., Wang, X., Zhang, G., Zhai, C., & Chambers, T. (2014). Content-based citation analysis: The next generation of citation analysis. Journal of the American Society for Information Science & Technology, 65(9),

5 Semantic Level Concepts Why not keyword Topics
Major entities in research Bio entities Knowledge unit (e.g., domain theories, well-established algorithms) Why not keyword Ambiguous literal words Not normalized But can be a starting point to extract concepts. Concept (keywords, synonyms (students vs. pupils), antonyms (birth vs. death), homonyms (pupil (student) vs. pupil (part of eye), etc.) Jeong, Y., Song, M., & Ding, Y. (2014). Content-based Author co-citation analysis. Journal of Informetrics, 8(1),

6 EntityMetrics Entitymetrics is defined as using entities (i.e., evaluative entities or knowledge entities) in the measurement of impact, knowledge usage, and knowledge transfer, to facilitate knowledge discovery. Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS One, 8(8): 1-14.

7 EntityMetrics

8 PubMed Entities Drug Disease Protein Pathway Gene Cite

9 Entity Graph Heterogeneous Entity Graph Bcl-2 Inhibitor Diabetes p53
Cancer STAT3 Metformin ci: cite co: co-occur ci/co: Drug Disease Breast Cancer AMPK P13K Protein Pathway Gene

10 Metformin related entity-entity citation network
Data: 4,770 articles retrieved from PubMed Central with 134,844 references, and 1,969 bio-entities (i.e., 880 genes, 376 drugs, and 713 diseases)

11 Metformin related entity-entity citation network
Gene OTC Disease UlcerDrug nitric oxid Gene casp3 Gene ube2v1

12 Dataset

13 Main Path

14 Main Path

15 Main Path

16 Main Path

17 Entity Citation Network vs. Entity Co-Occurrence Network
Gene Gene Co-Occurrence Network (GG) vs. Gene Cite Gene Network (GCG) The GCG network shares many genes with the GG network and as a result is a competitive complement to the GG network Using gene relationships based on citation relation extends the assumption of gene interaction being limited to the same article and opens up a new opportunity to analyze gene interaction from a wider spectrum of datasets. 1,149 gene pairs from GCG were found in GG. A total of 164 pairs out of 1,149 were not found in GG before 2005, but were found in GCG before In particular, the PARK2 and PINK1 gene pair ranks fifth by co-occurrence frequency in the GG network, implying the gene pair has highly been studied since 2005 Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS One, 8(12), e84639

18 Big Data in Life Sciences
There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters there is in the public domain information on: 69 million compounds and 449,392 bioassays (PubChem) 59 million compound bioactivities (PubChem Bioassay) 4,763 drugs (DrugBank) 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB) 14 million human nucleotide sequences (EMBL) 22 million life sciences publications - 800,000 new each year (PubMed) Multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …) Even more important are the relationships between these entities. For example a chemical compound can be linked to a gene or a protein target in a multitude of ways: Biological assay with percent inhibition, IC50, etc Crystal structure of ligand/protein complex Co-occurrence in a paper abstract Computational experiment (docking, predictive model) Statistical relationship System association (e.g. involved in same pathways cellular processes) Wild, D. J., Ding, Y., Sheth, A. P., Harland, L., Gifford, E. M., & Lajiness, M. S. (2012). System chemical biology and the Semantic Web: What they mean for the future of drug discovery research. Drug Discovery Today (impact factor=6.422), 17(9-10),

19 Text CSV Table HTML XML Patient Disease Tissue Cell Pathway DNA RNA Protein Drug

20 Chem2Bio2RDF NCI Human Tumor Cell Lines Data PubChem Compound Database
PubChem Bioassay Database PubChem Descriptions of all PubChem bioassays Pub3D: A similarity-searchable database of minimized 3D structures for PubChem compounds Drugbank MRTD: An implementation of the Maximum Recommended Therapeutic Dose set Medline: IDs of papers indexed in Medline, with SMILES of chemical structures ChEMBL chemogenomics database KEGG Ligand pathway database Comparative Toxicogenomics Database PhenoPred Data HuGEpedia: an encyclopedia of human genetic variation in health and disease. 31m chemical structures 59m bioactivity data points 3m/19m publications ~5,000 drugs Chen, B., Dong, X., Jiao, Dazhi, Wang, H., Zhu, Q., Ding, Y. and Wild, D. (2010). Chem2Bio2RDF: A semantic framework for linking and mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 2010, 11, 255.

21 RDF Triple store Dereferenable URI PlotViz: Visualization Bio2RDF
Browsing RDF Triple store Cytoscape Plugin Chem2Bio2RDF Linked Path Generation and Ranking LODD uniprot Others SPARQL ENDPOINTS Third party tools

22 Chen, B., Ding, Y., & Wild, D. J. (2012). Improving integrative searching of systems chemical biology data using semantic annotation. Journal of Cheminformatics, 4:6 (doi: / ).

23 Semantic graph mining: Path Finding Algorithm
5 15 2 8 13 23 3 6 14 19 9 16 24 26 1 21 10 18 4 25 Node 18 has been visited by node 1, so we can find the shortest path through 18 is 1 – 10 – 18 – 21 – 26 7 17 11 20 22 12 Dijkstra’s algorithm

24 Bio-LDA Latent Dirichlet Allocation (LDA) Bio-LDA
The core of the group of powerful statistical modeling techniques for automated extraction of latent topics from large document collections Bio-LDA Extended LDA model with Bio-terms as latent variable Bio-terms: compound, gene, drug, disease, protein, side effect, pathways Calculate bio-term entropies over topics Use the Kullback-Leibler divergence as the non-symmetric distance measure for two bio- terms over topics

25 Example: Topic 10 Venlafaxine: prescription drug, Treats depression. Effexor XR® also treats panic disorder, social anxiety disorder, and generalized anxiety disorder. HTR1A: HTR1A (5-hydroxytryptamine (serotonin) receptor 1A, G protein-coupled) is a protein-coding gene. Diseases associated with HTR1A include anxiety disorder, Inactivation of this gene in mice results in behavior consistent with an increased anxiety andstress response. HTR2A: HTR2A (5-hydroxytryptamine (serotonin) receptor 2A, G protein-coupled) is a protein-coding gene. Diseases associated with HTR2A include major depressive disorder, and seasonal affective disorder, Mutations in this gene areassociated with susceptibility to schizophrenia and obsessive-compulsive disorder, and are also associated withresponse to the antidepressant citalopram in patients with major depressive disorder (MDD). Bipolar disorder is a condition in which a person has periods of depression and periods of being extremely happy or being cross or irritable. Apply Bio-LDA on 336,899 PubMed article abstracts in 2009 and extract 50 topics Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J. & Wild, D. (2011): Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLos One 6(3): e doi: /journal.pone

26 Thiazolinediones (TZDs) – revolutionary treatment for type II Diabetes
Troglitazone (Rezulin): withdrawn in 2000 (liver disease) Rosiglitazone (Avandia): restricted in 2010 (cardiac disease) Rosiglitazone bound into PPAR-γ Pioglitazone: ???? (does decrease blood sugar levels, was associated with bladder tumors and has been withdrawn in some countries.)

27 PPARG: TZD target SAA2: Involved in inflammatory response implicated in cardiovascular disease (Current Opinion in Lipidology 15,3,, ) APOE: Apolipoprotein E3 essential for lipoprotein catabolism. Implicated in cardiovascular disease. ADIPOQ: Adiponectin involved in fatty acid metabolism. Implicated in metabolic syndrome, diabetes and cardiovascular disease CYP2C8: Cytochrome P450 present in cardiovascular tissue and involved in metabolism of xenobiotics CDKN2A: Tumor suppression gene SLC29A1: Membrane transporter Pioglitazone increases quantity of ADIPOQ in blood, Pioglitazone can cause fluid retention and peripheral edema. As a result, it may precipitate congestive heart failure

28 Semantic Prediction
Chen, B., Ding, Y., & Wild, D. (2012). Assessing Drug Target Association using Semantic Linked Data. PLoS Computational Biology, 8(7): e doi: /journal.pcbi ,

29 Example: Troglitazone and PPARG
Association score: Association significance: 9.06 x 10-6 => missing link predicted

30 Topology is important for association
Cmpd 1 Cmpd 2 Protein 1 hasSubstructure bind hasSubstructure Cmpd 1 Cmpd 2 Protein 1 bind hasSubstructure hasSubstructure Benzen ring

31 Semantics is important for association
Cmpd 1 Protein 2 Cmpd 2 Protein 1 bind bind bind Cmpd1 Protein 2 GO:00001 Protein 1 bind hasGO hasGO Cmpd1 Protein 2 Protein 1 bind PPI Cmpd1 hypertension Cmpd 2 Protein 1 hasSideeffect hasSide ffect bind Cmpd1 substructure1 Cmpd 2 Protein 1 hasSubstructure hasSubstructure bind

32 SLAP Pipeline Path filtering

33 Cross-check with SEA SEA analysis (Nature 462, , 2009) predicts 184 new compound-target pairs, 30 of which were experimentally tested 23 of these pairs were experimentally validated (<15uM) including 15 aminergic GPCR targets and 8 which crossed major receptor classification boundaries 9 of the aminergic GPCR target pairings were correctly predicted by SLAP (p<0.05) – for the other 6 compounds were not present in our set 1 of the 8 cross-boundary pairs was predicted

34 Assessing drug similarity from biological function
Took 157 drugs with 10 known therapeutic indications, and created SLAP profiles against 1,683 human targets Pearson correlation between profiles > 0.9 from SLAP was used to create associations between drugs Drugs with the same therapeutic indication unsurprisingly cluster together Some drugs with similar profile have different indications – potential for use in drug repurposing? Migraine is a chronic neurological disorder characterized by recurrent moderate to severe headaches often in association with a number of autonomic nervous system symptoms. Allergic rhinitis [rainaitis] (hay fever) is an allergic inflammation of the nasal airways. It occurs when an allergen, such as pollen, dust or animal dander (particles of shed skin and hair) is inhaled by an individual with a sensitized immune system Insomnia, or sleeplessness, is a sleep disorder in which there is an inability to fall asleep or to stay asleep as long as desired

35 Data2Knowledge platform…
AMiner Mining knowledge from articles: Researcher profiling Expert search Topic analysis Reviewer suggestion PMiner Mining knowledge from patents: Competitor analysis Company search Patent summarization Mining drug discovery data Predicting targets Repurposing drugs Heterogeneous graph search SLAP ... Mining more data…

36 Researchers: 31,222,410 Publications: 69,962,333 Conferences/Journals: 330,236 Citations: 133,196,029 Knowledge Concepts: 7,854,301 AMiner Research profiling Integration Interest analysis Topic analysis Course search Expert search Association Disambiguation Suggestion Geo search Collaboration recommendation

37 Expert Search Basic Info. Social Network Citation statistics
Research Interests Publications

38 and highly cited papers
Expertise Search Finding top experts, top conferences, and highly cited papers for “data mining”

39 Finding the most hot regions on “data mining”
Geographic Search Finding the most hot regions on “data mining”

40 Conference Analysis Which year is the most successful year in the KDD’s history? Who are the highly cited authors? What is author nationality distribution for the highly cited KDD papers in the past years?

41 Reviewer Suggestion Interest matching COI avoiding Load balancing
Forecast review quality

42 Cross-domain Collaboratinon Recommendation
What are the cross-domain topics on which you can work in the target domain? Who are the best collaborators on each of these topics?

43 Topic Browser 200 topics have been discovered automatically from the academic articles

44 Academic Performance Measure
Academic Statistics New Stars

45 Widely used.. …… The largest publisher: Elsevier Conferences KDD 2010
WSDM 2011 WSDM 2014 ICDM 2011 ICDM 2012 SocInfo 2011 ICMLA 2011 WAIM 2011 etc. ……

46 What is PMiner? Current patent analysis systems focus on search
Google Patent, WikiPatent, FreePatentsOnline PMiner is designed for an in-depth analysis of patent activity at the topic-level Topic-driven modeling of patents Heterogeneous network co-ranking Intelligent competitive analysis Patent summarization * Patent data: > 3.8M patents > 2.4M inventors > 400K companies > 10M citation relationships * Journal data: > 2k journal papers > 3.7k authors The crawled data is increasing to >300 Gigabytes. J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, and A. K. Usadi. PatentMiner: Topic-driven Patent Analysis and Mining. KDD'12. pp

47 Topics of search results
Patent Search Topics of search results Top Patents Top Inventors Top Companies

48 Topic-based Analysis for “Microsoft”
Also, for a specific company, the system analyzes its patent trend, that is, in a year like 2010, how many patents the company issued. Analysis results of hot terms trends, and top inventors trends also shows here [click]. Additionally, we mine the competitive relationships between companies in this work. For example, here is a list of Microsoft’s competitors generated by the system. We see that IBM and Apple are on the list. We also give a comparison between competitors. [40]

49 A court decision in 08/2012: Samsung’s Galaxy smart phone infringed upon a series of patents of Apple’s iphone, besides 4 appearance design patents, 3 software patents so-called 381, 915, and 163 are included, respectively cover "bounce back" , “pinch-to-zoom”, and “tap-to-zoom”. The above 3 software patents all belong to the following three patent categories: active solid-state devices (touch screen), computer graphics processing (graph scaling), and selective visual display systems (tap to select).

50 Demo Y. Yang, J. Tang, J. Keomany, Y. Zhao, Y. Ding, J. Li, and L. Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks. CIKM’12. pp

51 What is going on Now and future

52 Semantic Publishing Turn literature knowledge into actionable data to generate more powerful knowledge

Paper 1, 2, … AAAAAAAAAAAAAAAAAAAAAA Gene1 AAAAAAA(XYZ, 2002), AAAAAAA Disease2AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Paper n: Gene 1, Disease 2, (XYZ, 2002) …… Paper 2: Gene 1, Disease 2, (XYZ, 2002) Paper 1: Gene 1, Disease 2, (XYZ, 2002) Demonstrating strong evidence of the connection/relationship between concept Gene 1 and Disease 2, or concept GENE and DISEASE

54 Knowledge Graph

55 Ongoing Initiatives Semantic Scholar Automated Hypothesis Generation
Semantic Scholar is a new service for scientific literature search and discovery, focusing on semantics and textual understanding. (Alan Institute of Artificial Intelligence) Automated Hypothesis Generation The ever-expanding volumes of publications, on the one side, pose a fundamental threat to scientists, on the other side, generate a unique opportunity to discover new knowledge as the accumulated publications contain minable knowledge ( new-type-software-helps-researchers-decide-what-they-should-be-looking) Becoming critical WWW2015 workshops 2nd Workshop on Knowledge Extraction from Text (KET 2015). 2015 Workshop on “Semantics, Analytics and Visualization: Enhancing Scholarly Data” (SAVE-SD 2015) The 2nd WWW Workshop on Big Scholarly Data: Towards the Web of Scholars (BigScholar)

56 Knowledge is power!

57 Acknowledgements Thanks to all the collaborators:
David Wild , Kyle Stirling, Judy Qiu (Indiana) Jie Tang, Juanzi Li, Jing Zhang, Zhanpeng Fang, Yang Yang (TsingHua) Jim Walson (Panoscopix) Bing He (Johns Hopkins) Chengxiang Zhai, Chi Wang, Brian Foote (UIUC) Eric M. Gifford, Huijun Wang (Merck) Bin Chen (Stanford) Michael S. Lajiness (Eli Lilly)

Download ppt "Measuring Scholarly Impact and Beyond"

Similar presentations

Ads by Google