Semantics-Empowered Text Exploration for Knowledge Discovery
  Delroy Cameron, Pablo N. Mendes, Amit P. Sheth
    Knowledge Enabled Information and Services Science Center (Kno.e.sis), Department of Computer Science and Engineering, Wright State University, Dayton, OH
  Victor Chan
    Division of Biosciences and Performance, Human Effectiveness Directorate, Air Force Research Lab (AFRL), Wright-Patterson Air Force Base, Dayton, OH
  48th ACM Southeast Conference (ACMSE), Oxford, Mississippi, April 15-17, 2010
OUTLINE
  Background
  Paradigm Shift
  Demo
  Architecture
  Experimental Results
  Future Work
  Conclusion
BACKGROUND
  IR Systems - Interaction Paradigm
    Manually seek information
    Hyperlinked Documents
    Document-Centric Model
  Basis - Interaction Paradigm
    Keyword Search
    Document Browsing
BACKGROUND
  Interaction Sequence
    1. Assemble Keywords and Search
    2. Document Selection
    3. Document Inspection
    4. Aggregation/Organization
  Information Need: What is the role of Magnesium in relation to Migraine?
  Keyword query: Magnesium migraine
LIMITATIONS
  Query Reformulations
    Impatient users
    Recognition over Recall
  Constrained navigation
    Hyperlink dependent - a priori
  Fuzzy User Interests
    Haiti Earthquake: Recovery, Relief, Political Climate, Crime
  Ineffective for Exploratory Search
    Search-and-Sift
    Query: Father of the Web
    Answer: Sir Tim Berners-Lee
  Amit P. Sheth, Cartic Ramakrishnan: Relationship Web: Blazing Semantic Trails between Web Resources. IEEE Internet Computing 11(4): (2007)
MOTIVATION
  Users are information seekers
  Information is embedded in documents
  A priori hyperlink dependent
  Semantic Web Standards
    Entity Identification (Semantic Annotations)
    Relationship and Triple Identification
    Explore documents/information via relationships
PARADIGM SHIFT
  Search Hit > Annotated Hit
    Bag of annotated words/phrases
    Annotated phrase is known entity
    Entity is Subject/Object of Triple
  Navigation driven by relationships
    Entity[Document] Relationship Entity[Document]
  Contextual Navigation (relationships as context)
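The relationship-driven navigation on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Semantic Browser's implementation: the `Triple` and `TripleNavigator` names, the `related_to` predicate, and the example triples are all hypothetical, and the trail is modeled as a plain list of the triples a user followed.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


class TripleNavigator:
    """Index triples by subject and record every navigation step."""

    def __init__(self, triples):
        self._by_subject = defaultdict(list)
        for t in triples:
            self._by_subject[t.subject].append(t)
        self.trail = []  # sequential log of the triples the user followed

    def options(self, entity):
        """Return the triples whose subject is `entity`."""
        return list(self._by_subject.get(entity, []))

    def follow(self, triple):
        """Log the chosen triple and return the entity it leads to."""
        self.trail.append(triple)
        return triple.obj


# Placeholder triples for illustration only; they are not taken from
# UMLS, HPCO, or any result reported in the talk.
triples = [
    Triple("Magnesium", "related_to", "Serotonin"),
    Triple("Serotonin", "related_to", "Migraine"),
]

nav = TripleNavigator(triples)
current = "Magnesium"
while nav.options(current):
    # Always take the first available relationship in this sketch.
    current = nav.follow(nav.options(current)[0])

print(current)
print(nav.trail)
```

In a real interface the user, not the loop, would choose which relationship to follow; the logging step is what turns a browsing session into a reusable trail.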
CONTRIBUTIONS
  1. Novel Information Exploration Paradigm
     Data-Centric Model
  2. Demonstrate use of background knowledge
     Named Entities, Relationships
  3. Prototype Implementation
     Semantic annotations for navigation
  4. Aggregation Utilities
     Saving, bookmarking, publishing, etc.
DEMO
ARCHITECTURE
  Figure 1: System Components and Architecture
  Semantic Browser
    Search Workbench (SERP): annotated entities provide anchors that serve as entry points to navigation
    Semantic Trail Log: sequential record of each triple navigated by a user
    Save, Publish, Organize: utilities provided for promoting, bookmarking, and saving search results
  Spotter Module
    Trie-based Spotter for Named Entity Identification, used ultimately for document annotation
  Controlled Vocabulary
    992,281 DBpedia terms
    15,742 HPCO terms
    5,232 UMLS terms
  Document Corpus
    Medline (19 million abstracts): articles saved using Lucene. Indexed as of Aug
    Yahoo: indexed documents accessed as a Web Service using Yahoo Search BOSS
  Background Knowledge
    Linked Open Data
    HPCO Ontology
    UMLS
IMPLEMENTATION
  Spotter Module
    Example text: Dietary restriction with hypomagnesia is normally associated with diminished urinary excretion.
    Term: magnesium
    UMLS Controlled Vocabulary (Entity Label, PubMed ID)
      Magnesium Deficiency, C
      Dietary restriction with hypomagnesia, C
      Magnesium, EntityID:
    This process is called Spotting and uses a Trie data structure.
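A trie-based spotter along these lines might look like the sketch below. The slide only states that spotting uses a Trie; the token-level trie, the longest-match policy, the simple whitespace tokenizer, and the `Spotter` and `TrieNode` names are assumptions made for illustration. The entity IDs (`ID-1` and so on) are placeholders, not real UMLS identifiers.

```python
class TrieNode:
    """One node per token; entity_id is set where a vocabulary term ends."""

    def __init__(self):
        self.children = {}
        self.entity_id = None


class Spotter:
    """Token-level trie over a controlled vocabulary (assumed design)."""

    def __init__(self, vocabulary):
        self.root = TrieNode()
        for label, entity_id in vocabulary.items():
            node = self.root
            for token in label.lower().split():
                node = node.children.setdefault(token, TrieNode())
            node.entity_id = entity_id

    def spot(self, text):
        """Return (matched phrase, entity id) pairs, preferring longest matches."""
        tokens = [t.strip(".,;:()").lower() for t in text.split()]
        spots = []
        i = 0
        while i < len(tokens):
            node = self.root
            match = None
            j = i
            # Walk the trie as far as the text allows, remembering the
            # last position at which a complete term was seen.
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.entity_id is not None:
                    match = (i, j, node.entity_id)
            if match:
                start, end, entity_id = match
                spots.append((" ".join(tokens[start:end]), entity_id))
                i = end  # resume after the matched phrase
            else:
                i += 1
        return spots


# Labels come from the slide; the IDs are hypothetical placeholders.
vocabulary = {
    "Magnesium Deficiency": "ID-1",
    "Dietary restriction with hypomagnesia": "ID-2",
    "Magnesium": "ID-3",
}
spotter = Spotter(vocabulary)
text = ("Dietary restriction with hypomagnesia is normally associated "
        "with diminished urinary excretion.")
spots = spotter.spot(text)
print(spots)
```

Because the trie is walked once per starting token, lookup cost depends on the length of the longest vocabulary term rather than on the size of the vocabulary, which is one plausible reason to prefer a trie for a vocabulary of this scale.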
ARCHITECTURE
  Document Corpus
    Medline Lucene Index - 19 million abstracts Aug
    REST Endpoint:
    XML Response (or JSON)
    Keyword queries, Document IDs
  Background Knowledge
    UMLS (Unified Medical Language System): 5,232 entities and 16,540 triples
    HPCO (Human Performance & Cognition Ontology): 15,742 entities and 22,298 triples
EVALUATION
  Search User Interfaces
    Semantic Browser (Medline + UMLS)
    PubMed
    Yahoo
  Evaluation Metrics
    Interface Design
    Useful Features
    Motivation to Explore
    Information Novelty
    Effectiveness of Task outcome
    Required Cognitive Load
    Overall Satisfaction
  Rank Feature on [1-5] scale
  Normalized Relative Aggregated Scores
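The slide does not say how the normalized relative aggregated scores were computed. One plausible reading, sketched here purely as an assumption, is to average each metric's 1-5 ratings per interface, sum those averages, and divide by the best interface's total. The function name, the interface and metric names, and the ratings below are all made up for illustration and are not results from the study.

```python
def normalized_relative_scores(ratings):
    """Aggregate 1-5 ratings per interface and scale relative to the best.

    ratings: {interface: {metric: [scores on a 1-5 scale]}}
    Returns {interface: score in (0, 1]}, where the best interface gets 1.0.
    This aggregation scheme is an assumption, not the authors' method.
    """
    totals = {}
    for interface, by_metric in ratings.items():
        # Mean rating per metric, then summed across metrics.
        means = [sum(scores) / len(scores) for scores in by_metric.values()]
        totals[interface] = sum(means)
    best = max(totals.values())
    return {name: total / best for name, total in totals.items()}


# Placeholder ratings; not data from the evaluation.
ratings = {
    "interface_a": {"metric_1": [4, 5], "metric_2": [3, 3]},
    "interface_b": {"metric_1": [2, 2], "metric_2": [2, 2]},
}
scores = normalized_relative_scores(ratings)
print(scores)
```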
CONCLUSION
  Novel Information Exploration Paradigm
  Semantic Browser supports Contextual Navigation
    Identify Named Entities and Relationships
    Provide Semantic Annotations
    Utilities for Aggregation
  Semantic Trails to Knowledge Discovery
FUTURE WORK
  Formal Model for Paradigm Shift
  Improved Spotter: Additional Vocabularies, Context, Rule Based
  Relationship Ranking
  Document Re-ranking
  Trail Logs Analysis
ACKNOWLEDGEMENTS
  People
    Cartic Ramakrishnan, Bilal Gonen, Aditya Dhoke, Wesley Workman, Rodrigo Gama, Guilherme de Napoli
  Air Force Research Lab
    Human Effectiveness Directorate, Wright-Patterson Air Force Base
  National Science Foundation Award
    SemDis: Discovering Complex Relationships in the Semantic Web. No Wright State University No. IIS to University of Georgia
QUESTIONS
TERMINOLOGY
  Semantic Web: an extension of the current web in which data is expressed in a common vocabulary such that the data becomes machine processable.
  Ontology: a specification of concepts and the relationships between them.
  Triple: a ternary relation containing an entity pair and a relationship that expresses the link between them, i.e. subject-predicate-object.
  Entity/Concept: an instance of a thing.
  URI: a unique identifier for any resource/entity/thing on the web.
  LOD (Linked Open Data): a semantic web initiative to provide a repository of semantically connected datasets.