Presentation on theme: "Measuring Scholarly Impact and Beyond"— Presentation transcript:
1 Measuring Scholarly Impact and Beyond. Ying Ding, Indiana University
2 Outline New Research Analytics Tool: Data2Knowledge Platform Future Directions
3 Next Generation of Bibliometrics. Newly developed methods allow in-depth analysis of scholarly communication: topic modeling (e.g., Latent Dirichlet Allocation); information extraction (e.g., entity extraction); social network analysis (e.g., community detection). Big data demonstrates the power of connected data to enable knowledge discovery: structured data; unstructured data; social media data. Ding, Y., Rousseau, R., & Wolfram, D. (Eds.) (2014). Measuring scholarly impact: Methods and practice. Springer.
4 Content-based Impact Analysis. There are two levels. Syntactic level (position): papers cited in different sections of the articles; how many times papers are mentioned in one article. Semantic level (semantics): citation sentiment analysis (sentence level, window size); concept-based analysis at the knowledge concept level (e.g., topic, knowledge unit/entity, or bio-entities). Ding, Y., Song, M., Wang, X., Zhang, G., Zhai, C., & Chambers, T. (2014). Content-based citation analysis: The next generation of citation analysis. Journal of the American Society for Information Science & Technology, 65(9),
5 Semantic Level. Concepts: topics; major entities in research; bio entities; knowledge units (e.g., domain theories, well-established algorithms). Why not keywords: ambiguous literal words; not normalized; but they can be a starting point to extract concepts. Concept (keywords, synonyms (students vs. pupils), antonyms (birth vs. death), homonyms (pupil (student) vs. pupil (part of eye)), etc.). Jeong, Y., Song, M., & Ding, Y. (2014). Content-based Author co-citation analysis. Journal of Informetrics, 8(1),
6 EntityMetrics. Entitymetrics is defined as using entities (i.e., evaluative entities or knowledge entities) in the measurement of impact, knowledge usage, and knowledge transfer, to facilitate knowledge discovery. Ding, Y., Song, M., Han, J., Yu, Q., Yan, E., Lin, L., & Chambers, T. (2013). Entitymetrics: Measuring the impact of entities. PLoS One, 8(8): 1-14.
17 Entity Citation Network vs. Entity Co-Occurrence Network. Gene Gene Co-Occurrence Network (GG) vs. Gene Cite Gene Network (GCG). The GCG network shares many genes with the GG network and as a result is a competitive complement to the GG network. Using gene relationships based on citation relations relaxes the assumption that gene interaction is limited to the same article, and opens up a new opportunity to analyze gene interaction from a wider spectrum of datasets. 1,149 gene pairs from GCG were found in GG. A total of 164 pairs out of 1,149 were not found in GG before 2005, but were found in GCG before ... In particular, the PARK2 and PINK1 gene pair ranks fifth by co-occurrence frequency in the GG network, implying that the gene pair has been highly studied since 2005. Song, M., Han, N., Kim, Y., Ding, Y., & Chambers, T. (2013). Discovering implicit entity relation with the gene-citation-gene network. PLoS One, 8(12), e84639
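The contrast between the two networks can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the paper IDs, the toy gene lists, the citation list, and the function names `cooccurrence_pairs` and `citation_pairs` are all hypothetical, and gene pairs are treated as undirected for simplicity.

```python
from itertools import combinations, product

def cooccurrence_pairs(genes_by_paper):
    """GG-style edges: gene pairs mentioned together in the same paper."""
    pairs = set()
    for genes in genes_by_paper.values():
        for a, b in combinations(sorted(set(genes)), 2):
            pairs.add((a, b))
    return pairs

def citation_pairs(genes_by_paper, citations):
    """GCG-style edges: a gene in the citing paper paired with a gene in the cited paper."""
    pairs = set()
    for citing, cited in citations:
        citing_genes = genes_by_paper.get(citing, ())
        cited_genes = genes_by_paper.get(cited, ())
        for a, b in product(citing_genes, cited_genes):
            if a != b:
                pairs.add(tuple(sorted((a, b))))  # undirected simplification
    return pairs

# Hypothetical toy data: which genes each paper mentions, and who cites whom.
genes_by_paper = {
    "p1": ["PARK2", "PINK1"],
    "p2": ["SNCA"],
    "p3": ["PINK1", "LRRK2"],
}
citations = [("p2", "p1"), ("p3", "p1")]  # (citing, cited)

gg = cooccurrence_pairs(genes_by_paper)
gcg = citation_pairs(genes_by_paper, citations)
shared = gg & gcg        # pairs found in both networks
only_gcg = gcg - gg      # pairs visible only through citation links
```

In this toy setup `only_gcg` holds the pairs that never co-occur in a single paper, which is the kind of implicit relation the citation-based network is meant to surface.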
18 Big Data in Life Sciences. There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, there is in the public domain information on: 69 million compounds and 449,392 bioassays (PubChem); 59 million compound bioactivities (PubChem Bioassay); 4,763 drugs (DrugBank); 9 million protein sequences (SwissProt) and 58,000 3D structures (PDB); 14 million human nucleotide sequences (EMBL); 22 million life sciences publications, with 800,000 new each year (PubMed); a multitude of other sets (drugs, toxicogenomics, chemogenomics, metagenomics …). Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways: biological assay with percent inhibition, IC50, etc.; crystal structure of ligand/protein complex; co-occurrence in a paper abstract; computational experiment (docking, predictive model); statistical relationship; system association (e.g., involved in the same pathways or cellular processes). Wild, D. J., Ding, Y., Sheth, A. P., Harland, L., Gifford, E. M., & Lajiness, M. S. (2012). System chemical biology and the Semantic Web: What they mean for the future of drug discovery research. Drug Discovery Today (impact factor=6.422), 17(9-10),
20 Chem2Bio2RDF. Data sources: NCI Human Tumor Cell Lines Data; PubChem Compound Database; PubChem Bioassay Database; PubChem Descriptions of all PubChem bioassays; Pub3D: a similarity-searchable database of minimized 3D structures for PubChem compounds; DrugBank; MRTD: an implementation of the Maximum Recommended Therapeutic Dose set; Medline: IDs of papers indexed in Medline, with SMILES of chemical structures; ChEMBL chemogenomics database; KEGG Ligand pathway database; Comparative Toxicogenomics Database; PhenoPred Data; HuGEpedia: an encyclopedia of human genetic variation in health and disease. Scale: 31m chemical structures; 59m bioactivity data points; 3m/19m publications; ~5,000 drugs. Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., & Wild, D. (2010). Chem2Bio2RDF: A semantic framework for linking and mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11, 255.
21 [Architecture diagram] Labels: RDF triple store; dereferenceable URI; PlotViz (visualization); Bio2RDF; browsing; Cytoscape plugin; Chem2Bio2RDF; linked path generation and ranking; LODD; UniProt; others; SPARQL endpoints; third-party tools.
22 Chen, B., Ding, Y., & Wild, D. J. (2012). Improving integrative searching of systems chemical biology data using semantic annotation. Journal of Cheminformatics, 4:6 (doi: / ).
23 Semantic graph mining: Path Finding Algorithm. Dijkstra’s algorithm. [Figure: graph with numbered nodes.] Node 18 has been visited by node 1, so the shortest path through 18 is 1 – 10 – 18 – 21 – 26.
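As a rough illustration of the path-finding step, here is a textbook single-source Dijkstra search in Python. It is only a sketch: the slide's graph did not survive transcription, so the edges and weights below are hypothetical, chosen so that the path named on the slide (1 – 10 – 18 – 21 – 26) comes out shortest. The platform itself may use a different variant.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

# Hypothetical edges and weights (not taken from the slide's figure).
graph = {
    1: {10: 1, 5: 1},
    5: {1: 1, 15: 1},
    10: {1: 1, 18: 1},
    15: {5: 1, 26: 3},
    18: {10: 1, 21: 1},
    21: {18: 1, 26: 1},
    26: {21: 1, 15: 3},
}
path, cost = shortest_path(graph, 1, 26)
```

With these assumed weights the route through node 18 costs 4, while the alternative through nodes 5 and 15 costs 5.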
24 Bio-LDA. Latent Dirichlet Allocation (LDA): the core of a group of powerful statistical modeling techniques for automated extraction of latent topics from large document collections. Bio-LDA: an extended LDA model with bio-terms as latent variables. Bio-terms: compound, gene, drug, disease, protein, side effect, pathway. Calculate bio-term entropies over topics. Use the Kullback-Leibler divergence as the non-symmetric distance measure for two bio-terms over topics.
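The two measures named on this slide are standard and easy to sketch. The Python below assumes each bio-term is already represented as a probability distribution over topics; the example distributions and the names `entropy` and `kl_divergence` are illustrative, not taken from the Bio-LDA implementation.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a term's distribution over topics."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q); eps guards against zeros in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical topic distributions for two bio-terms over three topics.
term_a = [0.7, 0.2, 0.1]
term_b = [0.4, 0.4, 0.2]

h_a = entropy(term_a)                  # low entropy: term concentrated in few topics
d_ab = kl_divergence(term_a, term_b)   # distance from a to b
d_ba = kl_divergence(term_b, term_a)   # generally differs from d_ab
```

Because the divergence is not symmetric, `d_ab` and `d_ba` differ, which is why the slide calls it a non-symmetric distance measure.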
25 Example: Topic 10. Venlafaxine: prescription drug; treats depression. Effexor XR® also treats panic disorder, social anxiety disorder, and generalized anxiety disorder. HTR1A: HTR1A (5-hydroxytryptamine (serotonin) receptor 1A, G protein-coupled) is a protein-coding gene. Diseases associated with HTR1A include anxiety disorder. Inactivation of this gene in mice results in behavior consistent with an increased anxiety and stress response. HTR2A: HTR2A (5-hydroxytryptamine (serotonin) receptor 2A, G protein-coupled) is a protein-coding gene. Diseases associated with HTR2A include major depressive disorder and seasonal affective disorder. Mutations in this gene are associated with susceptibility to schizophrenia and obsessive-compulsive disorder, and are also associated with response to the antidepressant citalopram in patients with major depressive disorder (MDD). Bipolar disorder is a condition in which a person has periods of depression and periods of being extremely happy or being cross or irritable. Bio-LDA was applied to 336,899 PubMed article abstracts from 2009 to extract 50 topics. Wang, H., Ding, Y., Tang, J., Dong, X., He, B., Qiu, J., & Wild, D. (2011). Finding complex biological relationships in recent PubMed articles using Bio-LDA. PLoS One 6(3): e doi: /journal.pone
26 Thiazolidinediones (TZDs): revolutionary treatment for type II diabetes. Troglitazone (Rezulin): withdrawn in 2000 (liver disease). Rosiglitazone (Avandia): restricted in 2010 (cardiac disease). [Figure: rosiglitazone bound into PPAR-γ.] Pioglitazone: ???? (does decrease blood sugar levels, was associated with bladder tumors, and has been withdrawn in some countries).
27 PPARG: TZD target. SAA2: involved in inflammatory response; implicated in cardiovascular disease (Current Opinion in Lipidology 15,3,, ). APOE: Apolipoprotein E3, essential for lipoprotein catabolism; implicated in cardiovascular disease. ADIPOQ: adiponectin, involved in fatty acid metabolism; implicated in metabolic syndrome, diabetes, and cardiovascular disease. CYP2C8: cytochrome P450 present in cardiovascular tissue and involved in metabolism of xenobiotics. CDKN2A: tumor suppressor gene. SLC29A1: membrane transporter. Pioglitazone increases the quantity of ADIPOQ in blood. Pioglitazone can cause fluid retention and peripheral edema; as a result, it may precipitate congestive heart failure.
28 Semantic Prediction http://chem2bio2rdf.org/slap Chen, B., Ding, Y., & Wild, D. (2012). Assessing Drug Target Association using Semantic Linked Data. PLoS Computational Biology, 8(7): e doi: /journal.pcbi ,
29 Example: Troglitazone and PPARG. Association score: Association significance: 9.06 x 10^-6 => missing link predicted
30 Topology is important for association. [Diagram: two graph configurations over the nodes Cmpd 1, Cmpd 2, and Protein 1, connected by hasSubstructure and bind edges; the shared substructure shown is a benzene ring.]
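31 Semantics is important for association. [Diagram: five path types between Cmpd1 and Protein1. (1) Cmpd1, Protein2, Cmpd 2, Protein1 linked by bind, bind, bind. (2) Cmpd1, Protein2, GO:00001, Protein1 linked by bind, hasGO, hasGO. (3) Cmpd1, Protein2, Protein1 linked by bind, PPI. (4) Cmpd1, hypertension, Cmpd 2, Protein1 linked by hasSideEffect, hasSideEffect, bind. (5) Cmpd1, substructure1, Cmpd 2, Protein1 linked by hasSubstructure, hasSubstructure, bind.]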
33 Cross-check with SEA. SEA analysis (Nature 462, , 2009) predicts 184 new compound-target pairs, 30 of which were experimentally tested. 23 of these pairs were experimentally validated (<15 µM), including 15 aminergic GPCR targets and 8 which crossed major receptor classification boundaries. 9 of the aminergic GPCR target pairings were correctly predicted by SLAP (p<0.05); for the other 6, the compounds were not present in our set. 1 of the 8 cross-boundary pairs was predicted.
34 Assessing drug similarity from biological function. Took 157 drugs with 10 known therapeutic indications, and created SLAP profiles against 1,683 human targets. A Pearson correlation > 0.9 between SLAP profiles was used to create associations between drugs. Drugs with the same therapeutic indication unsurprisingly cluster together. Some drugs with similar profiles have different indications: potential for use in drug repurposing? Migraine is a chronic neurological disorder characterized by recurrent moderate to severe headaches, often in association with a number of autonomic nervous system symptoms. Allergic rhinitis (hay fever) is an allergic inflammation of the nasal airways. It occurs when an allergen, such as pollen, dust, or animal dander (particles of shed skin and hair), is inhaled by an individual with a sensitized immune system. Insomnia, or sleeplessness, is a sleep disorder in which there is an inability to fall asleep or to stay asleep as long as desired.
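The profile-correlation step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the drug names and the four-target profiles are made up (the slide's profiles span 1,683 targets), and `pearson` and `similar_drug_pairs` are hypothetical helper names, not part of SLAP.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sx * sy)

def similar_drug_pairs(profiles, threshold=0.9):
    """Drug pairs whose target-association profiles correlate above the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) > threshold
    ]

# Hypothetical association scores of three drugs against four targets.
profiles = {
    "drug_a": [0.9, 0.1, 0.8, 0.2],
    "drug_b": [0.8, 0.2, 0.9, 0.1],
    "drug_c": [0.1, 0.9, 0.2, 0.7],
}
pairs = similar_drug_pairs(profiles)
```

Pairs returned here would become edges in the drug-drug association network; a pair with different known indications is then a candidate to examine for repurposing.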
35 Data2Knowledge platform… AMiner: mining knowledge from articles (researcher profiling, expert search, topic analysis, reviewer suggestion). PMiner: mining knowledge from patents (competitor analysis, company search, patent summarization). SLAP: mining drug discovery data (predicting targets, repurposing drugs, heterogeneous graph search). Mining more data…
37 Expert Search: Basic Info.; Social Network; Citation statistics; Research Interests; Publications
38 Expertise Search: Finding top experts, top conferences, and highly cited papers for “data mining”
39 Geographic Search: Finding the hottest regions for “data mining”
40 Conference Analysis. Which year is the most successful year in KDD’s history? Who are the highly cited authors? What is the author nationality distribution for the highly cited KDD papers in the past years?
45 Widely used… The largest publisher: Elsevier. Conferences: KDD 2010, WSDM 2011, WSDM 2014, ICDM 2011, ICDM 2012, SocInfo 2011, ICMLA 2011, WAIM 2011, etc.
46 What is PMiner? Current patent analysis systems focus on search: Google Patent, WikiPatent, FreePatentsOnline. PMiner is designed for an in-depth analysis of patent activity at the topic level: topic-driven modeling of patents; heterogeneous network co-ranking; intelligent competitive analysis; patent summarization. Patent data: > 3.8M patents; > 2.4M inventors; > 400K companies; > 10M citation relationships. Journal data: > 2k journal papers; > 3.7k authors. The crawled data is increasing to > 300 gigabytes. J. Tang, B. Wang, Y. Yang, P. Hu, Y. Zhao, X. Yan, B. Gao, M. Huang, P. Xu, W. Li, and A. K. Usadi. PatentMiner: Topic-driven Patent Analysis and Mining. KDD'12. pp
47 Patent Search: Topics of search results; Top Patents; Top Inventors; Top Companies
48 Topic-based Analysis for “Microsoft”. For a specific company, the system also analyzes its patent trend, that is, how many patents the company was issued in a given year, such as 2010. Analysis results for hot-term trends and top-inventor trends are also shown here. Additionally, we mine the competitive relationships between companies in this work. For example, here is a list of Microsoft’s competitors generated by the system. We see that IBM and Apple are on the list. We also give a comparison between competitors.
49 A court decision in 08/2012: Samsung’s Galaxy smartphone infringed upon a series of Apple’s iPhone patents. Besides 4 appearance design patents, 3 software patents, the so-called 381, 915, and 163, are included; they cover “bounce back”, “pinch-to-zoom”, and “tap-to-zoom”, respectively. The above 3 software patents all belong to the following three patent categories: active solid-state devices (touch screen), computer graphics processing (graph scaling), and selective visual display systems (tap to select).
50 Demo. Y. Yang, J. Tang, J. Keomany, Y. Zhao, Y. Ding, J. Li, and L. Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks. CIKM’12. pp
55 Ongoing Initiatives. Semantic Scholar: a new service for scientific literature search and discovery, focusing on semantics and textual understanding (Allen Institute for Artificial Intelligence). Automated Hypothesis Generation: the ever-expanding volume of publications on the one hand poses a fundamental threat to scientists, and on the other hand creates a unique opportunity to discover new knowledge, as the accumulated publications contain minable knowledge (http://www.economist.com/news/science-and-technology/ new-type-software-helps-researchers-decide-what-they-should-be-looking). Becoming critical: WWW2015 workshops: 2nd Workshop on Knowledge Extraction from Text (KET 2015); 2015 Workshop on “Semantics, Analytics and Visualization: Enhancing Scholarly Data” (SAVE-SD 2015); the 2nd WWW Workshop on Big Scholarly Data: Towards the Web of Scholars (BigScholar).
57 Acknowledgements. Thanks to all the collaborators: David Wild, Kyle Stirling, Judy Qiu (Indiana); Jie Tang, Juanzi Li, Jing Zhang, Zhanpeng Fang, Yang Yang (Tsinghua); Jim Walson (Panoscopix); Bing He (Johns Hopkins); Chengxiang Zhai, Chi Wang, Brian Foote (UIUC); Eric M. Gifford, Huijun Wang (Merck); Bin Chen (Stanford); Michael S. Lajiness (Eli Lilly)