Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.



Introduction
► Most Web pages present no explicit semantic information about their data and objects.
► The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
  ▪ Ex: The entity “UGA” pointing to
  ▪ Using entity disambiguation

Introduction
► We use background knowledge in the form of an ontology
► Our contributions are two-fold:
  ▪ A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
  ▪ An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers

Background
► Sesame Repository
  ▪ Open source RDF repository
  ▪ We chose Sesame over Jena and BRAHMS because it can store large amounts of information without depending on memory storage alone
  ▪ We use Sesame’s native mode because our dataset is typically too large to fit into memory and the database option is too slow for update operations

Dataset 1: DBLP Ontology
► DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
  ▪ 3,079,414 entities (447,121 are authors)
  ▪ We used a SAX parser to parse the DBLP XML file, which is available online
  ▪ Created relationships such as “co-author”
  ▪ Added information regarding affiliations
  ▪ Added information regarding areas of interest
  ▪ Added alternate spellings for international characters
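A minimal sketch of how co-author relationships might be derived while streaming DBLP-style XML through a SAX parser. It is an illustration under stated assumptions, not the authors' code: the handler class, the choice of record tags, and the inline sample data are all hypothetical.

```python
import itertools
import xml.sax
from collections import defaultdict

# Record types treated as publications (an assumed subset of DBLP's element names).
PUBLICATION_TAGS = {"article", "inproceedings"}

class CoAuthorHandler(xml.sax.ContentHandler):
    """Collects co-author pairs from each publication record."""

    def __init__(self):
        super().__init__()
        self.coauthors = defaultdict(set)  # author name -> set of co-author names
        self._authors = []                 # authors of the record being read
        self._text = None                  # text buffer while inside <author>

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self._authors = []
        elif name == "author":
            self._text = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "author" and self._text is not None:
            self._authors.append("".join(self._text).strip())
            self._text = None
        elif name in PUBLICATION_TAGS:
            # Every pair of authors on the same record becomes a co-author link.
            for a, b in itertools.combinations(self._authors, 2):
                self.coauthors[a].add(b)
                self.coauthors[b].add(a)

# Hypothetical sample data in the spirit of the DBLP file.
SAMPLE = b"""<dblp>
  <article><author>A. Author</author><author>B. Writer</author><title>T1</title></article>
  <inproceedings><author>A. Author</author><author>C. Scholar</author><title>T2</title></inproceedings>
</dblp>"""

handler = CoAuthorHandler()
xml.sax.parseString(SAMPLE, handler)
coauthors = dict(handler.coauthors)
```

Streaming with SAX keeps memory use flat regardless of file size, which matters for a dataset with millions of entities.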

Dataset 2: DBWorld Posts
► DBWorld
  ▪ Mailing list announcing upcoming conferences in the databases field
  ▪ Created an HTML scraper that downloads every post with “Call for Papers”, “Call for Participation” or “CFP” in its subject
  ▪ Unstructured text
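The subject filter described above could look roughly like this. The function name and the case-insensitive matching are assumptions; the slide only lists the three subject markers.

```python
import re

# The three markers come from the slide; ignoring case is an assumption.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bCFP\b", re.IGNORECASE
)

def is_cfp_post(subject: str) -> bool:
    """True if a post's subject line marks it as a call for papers/participation."""
    return CFP_PATTERN.search(subject) is not None

# Hypothetical subject lines.
subjects = [
    "CFP: Workshop on Semantic Data",
    "2nd Call for Participation: Database Symposium",
    "Job opening: postdoc position",
]
selected = [s for s in subjects if is_cfp_post(s)]
```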

Overview of System Architecture

Approach
► Entity Names
  ▪ Entity attribute that represents the name of the entity
  ▪ Can contain more than one name

Approach
► Text-proximity Relationships
  ▪ Relationships that can be expected to be in text-proximity of the entity
  ▪ Nearness measured in character spaces

Approach
► Text Co-occurrence Relationships
  ▪ Similar to text-proximity relationships except proximity is not relevant

Approach
► Popular Entities
  ▪ The intuition is to specify relationships that bias the choice toward the most popular entity, on the assumption that it is the right one
  ▪ This should be used with care, depending on the domain
  ▪ DBLP ex: the number of papers the entity has authored

Approach
► Semantic Relationships
  ▪ Entities can be related to one another through their collaboration network
  ▪ DBLP ex: Entities are related to one another through co-author relationships

Algorithm
► The idea is to spot entity names in text and assign each potential match a confidence score
► This confidence score is adjusted as the algorithm progresses and represents the certainty that the spotted entity refers to a particular object in the ontology

Algorithm – Flow Chart

Algorithm
► Spotting Entity Names
  ▪ Search document for entity names within the ontology
  ▪ Each entity in the ontology that matches a name found in the document becomes a candidate entity
  ▪ Assign initial confidence scores for candidate entities based on these formulas:
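The spotting step might be sketched as follows. It stops at candidate generation: the initial-score formulas are not reproduced in this transcript, so no scores are assigned. The `Candidate` tuple, the function name, and the example URIs are hypothetical.

```python
import re
from collections import namedtuple

# One candidate per (entity, occurrence); offset is the character position in the text.
Candidate = namedtuple("Candidate", "uri name offset")

def spot_entity_names(text, names_by_uri):
    """Return a candidate for every occurrence of any entity name in the text."""
    candidates = []
    for uri, names in names_by_uri.items():
        for name in names:
            # Word boundaries avoid matching a name inside a longer word.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                candidates.append(Candidate(uri, name, match.start()))
    return sorted(candidates, key=lambda c: c.offset)

# Hypothetical ontology fragment: two distinct people share the name "John Smith".
ontology_names = {
    "ex:person/1": ["John Smith", "J. Smith"],
    "ex:person/2": ["John Smith"],
    "ex:person/3": ["Mary Jones"],
}
text = "Program chair: John Smith (University of Georgia)"
candidates = spot_entity_names(text, ontology_names)
```

Both `ex:person/1` and `ex:person/2` become candidates for the same span, which is exactly the ambiguity the later steps are meant to resolve.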

Algorithm
► Spotting Literal Values of Text-proximity Relationships
  ▪ Only consider relationships from candidate entities
  ▪ Substantially increase confidence score if within proximity
  ▪ Ex: Entity affiliation found next to entity name
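A rough sketch of a proximity check measured in characters. The window size and the boost amount are illustrative assumptions; the slides do not give the actual values.

```python
def within_proximity(text, name_offset, name, literal, window=50):
    """True if `literal` lies entirely within `window` characters of the spotted name."""
    start = max(0, name_offset - window)
    end = min(len(text), name_offset + len(name) + window)
    return literal in text[start:end]

def apply_proximity_boost(score, found, boost=0.25):
    """Raise the score by `boost` (assumed value) when the literal was found nearby."""
    return min(1.0, score + boost) if found else score

text = "Program chair: John Smith (University of Georgia)"
near = within_proximity(text, 15, "John Smith", "University of Georgia")
score = apply_proximity_boost(0.5, near)
```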

Algorithm
► Spotting Literal Values of Text Co-occurrence Relationships
  ▪ Only consider relationships from candidate entities
  ▪ Increase confidence score if found within the document (location does not matter)
  ▪ Ex: Entity’s areas of interest found in the document
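Co-occurrence differs from proximity only in ignoring location, so a sketch can search the whole document. The per-hit boost and the case-insensitive match are assumptions.

```python
def cooccurrence_boost(text, score, literals, boost=0.125):
    """Raise `score` once for each literal found anywhere in the document."""
    lowered = text.lower()
    hits = sum(1 for literal in literals if literal.lower() in lowered)
    return min(1.0, score + boost * hits)

# Hypothetical areas of interest for one candidate entity.
doc = "Topics of interest include the semantic web and databases."
boosted = cooccurrence_boost(doc, 0.5, ["Semantic Web", "data mining"])
```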

Algorithm
► Using Popular Entities
  ▪ Slightly increase the confidence score of candidate entities based on the number of popular entity relationships
  ▪ Valuable when used as a tie-breaker
  ▪ Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
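The tie-breaker could be sketched as below. The 15-publication cutoff comes from the slide's example; the size of the boost is an assumption chosen to be small relative to the other steps.

```python
def popularity_boost(score, publication_count, min_publications=15, boost=0.0625):
    """Slightly raise the score of candidates with more than `min_publications` papers."""
    if publication_count > min_publications:
        return min(1.0, score + boost)
    return score

# Two otherwise tied candidates: the more prolific author edges ahead.
prolific = popularity_boost(0.5, 40)
occasional = popularity_boost(0.5, 3)
```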

Algorithm
► Using Semantic Relationships
  ▪ Use relationships among entities to boost confidence scores of candidate entities
  ▪ Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If another candidate entity is found and is below the threshold, that entity’s confidence score is increased

Algorithm
► If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes
► This is an iterative step and always converges
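The semantic-relationship step and its repetition can be sketched as a fixed-point loop. The threshold, the boost, and the rule that each (source, target) link is applied at most once are assumptions made here so that termination is easy to see; the slides do not spell out these details.

```python
def propagate(scores, related, threshold=0.75, boost=0.25):
    """Boost below-threshold candidates related to confident ones until nothing changes."""
    scores = dict(scores)   # work on a copy
    used = set()            # (source, target) links already applied
    changed = True
    while changed:
        changed = False
        confident = [e for e, s in scores.items() if s >= threshold]
        for source in confident:
            for target in related.get(source, ()):
                if (
                    target in scores
                    and scores[target] < threshold
                    and (source, target) not in used
                ):
                    scores[target] = min(1.0, scores[target] + boost)
                    used.add((source, target))
                    changed = True
    return scores

# Hypothetical candidates linked by co-author relationships: a -> b -> c -> d.
initial = {"a": 0.875, "b": 0.5, "c": 0.5, "d": 0.25}
coauthor_links = {"a": ["b"], "b": ["c"], "c": ["d"]}
final = propagate(initial, coauthor_links)
```

In this example `b` crosses the threshold in the first pass, which lets it lift `c` in the second pass; `d` is boosted once but stays below the threshold. Because each link fires at most once and the number of links is finite, the loop must stop.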

Output
► XML format
  ▪ URI – the DBLP URL of the entity
  ▪ Entity name
  ▪ Confidence score
  ▪ Character offset – the location of the entity in the document
► This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
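One way the four listed fields might be serialized. The element names and the example URI are hypothetical; the actual schema appears only on a slide image that is not part of this transcript.

```python
import xml.etree.ElementTree as ET

def to_xml(entities):
    """Serialize disambiguated entities; element names here are illustrative only."""
    root = ET.Element("entities")
    for e in entities:
        node = ET.SubElement(root, "entity")
        ET.SubElement(node, "uri").text = e["uri"]
        ET.SubElement(node, "name").text = e["name"]
        ET.SubElement(node, "confidence").text = f'{e["confidence"]:.2f}'
        ET.SubElement(node, "offset").text = str(e["offset"])
    return ET.tostring(root, encoding="unicode")

xml_text = to_xml([
    {"uri": "http://example.org/person/1", "name": "John Smith",
     "confidence": 0.9, "offset": 15},
])
```

Keeping the character offset in the output is what makes later conversion to inline markup such as Microformats or RDFa straightforward, since the annotation can be placed back at the exact span.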

Output

Output - Microformat

Evaluation: Gold Standard Set
► We evaluate our system using a gold standard set of documents
  ▪ 20 manually disambiguated documents
  ▪ Randomly chose 20 consecutive posts from DBWorld
  ▪ We use precision and recall as the evaluation measures for our system

Evaluation: Gold Standard Set

Evaluation: Precision & Recall
► We define set A as the set of unique names identified using the disambiguated dataset
► We define set B as the set of entities found by our method
► The intersection of these sets represents the set of entities correctly identified by our method

Evaluation: Precision & Recall
► Precision is the proportion of correctly disambiguated entities with regard to B
► Recall is the proportion of correctly disambiguated entities with regard to A
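With A as the gold standard set and B as the set found by the method, the two measures reduce to set arithmetic: precision = |A ∩ B| / |B| and recall = |A ∩ B| / |A|. A small sketch, using made-up entity identifiers:

```python
def precision_recall(gold, found):
    """Precision and recall of `found` (set B) against `gold` (set A)."""
    correct = gold & found
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: 8 gold entities, 4 found, 3 of them correct.
gold = {"e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8"}
found = {"e1", "e2", "e3", "x9"}
precision, recall = precision_recall(gold, found)
```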

Evaluation: Results
► Precision and recall when compared to entire gold standard set:

  Correct Disambiguation | Found Entities | Total Entities | Precision | Recall
                         |                |                |         % | 79.4%

► Precision and recall on a per document basis:

Related Work
► Semex:
  ▪ Personal information management system that works with a user’s desktop
  ▪ Takes advantage of a predictable structure
  ▪ The results of disambiguated entities are propagated to other ambiguous entities, which could then be reconciled based on recently reconciled entities much like our work does

Related Work
► KIM:
  ▪ An application that aims to perform automatic ontology population
  ▪ Contains an entity recognition portion that uses natural language processors
  ▪ Evaluations performed on human-annotated corpora
  ▪ Missed many entities, and its results contained many false positives

Conclusion
► Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
► This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text

Thank you!