Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar. 5th International Semantic Web Conference, Athens, GA, Nov. 5–9, 2006

Presentation transcript:

Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar
5th International Semantic Web Conference, Athens, GA, Nov. 5–9, 2006
Acknowledgement: NSF-ITR-IDM Award # ‘SemDIS: Discovering Complex Relationships in the Semantic Web’

2 The Question is …
How do we determine the most likely match of a named entity in unstructured text?
Example: which “A. Joshi” is this text referring to, out of, say, 20 candidate entities in a populated ontology?

3 “likely match” = confidence score
– The idea is to spot entity names in text and assign each potential match a confidence score
– The confidence score represents the degree of certainty that a spotted entity refers to a particular object in the ontology

4 Our Approach, three steps
1. Spot entity names
   - assign an initial confidence score
2. Adjust the confidence score using:
   - proximity relationships (text)
   - co-occurrence relationships (text)
   - connections (graph)
   - popular entities (graph)
3. Iterate again to propagate results
   - finish when confidence scores are no longer updated

5 Spotting Entity Names
– Search the document for entity names within the ontology
– Each match becomes a “candidate entity”
– Assign initial confidence scores
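The spotting step can be sketched as below. This is only an illustration, not the authors' implementation: the `Candidate` record, the `spot_entities` function, the name-to-URI dictionary, the example URIs, and the initial score of 0.5 are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str           # ontology identifier of the candidate (hypothetical URIs below)
    name: str          # entity name as it appears in the text
    start: int         # character offset where the name starts
    end: int           # character offset just past the name
    confidence: float  # initial confidence score (assumed value)


def spot_entities(text, ontology_names, initial_score=0.5):
    """Return one candidate per (name occurrence, ontology entity with that name).

    ontology_names maps a name to the list of ontology URIs that share it.
    """
    candidates = []
    for name, uris in ontology_names.items():
        for match in re.finditer(re.escape(name), text):
            for uri in uris:
                candidates.append(
                    Candidate(uri, name, match.start(), match.end(), initial_score)
                )
    return candidates


# Toy data: two ontology entities share the name "A. Joshi".
text = "Invited talk by A. Joshi on the Semantic Web."
ontology_names = {"A. Joshi": ["ex:a_joshi_1", "ex:a_joshi_2"]}
candidates = spot_entities(text, ontology_names)
for c in candidates:
    print(c)
```

Every ambiguous name yields several candidates at the same position; the later steps are what tell them apart.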

6 Using Text-proximity Relationships
– Relationships that can be expected to appear in close text-proximity to the entity
– Proximity is measured in character spaces
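A minimal sketch of a text-proximity check, assuming that the ontology supplies a list of related literals for each candidate (for example an affiliation) and that a fixed character window is used. The function name, window size, and boost value are hypothetical and not taken from the slides.

```python
def proximity_boost(text, start, end, related_literals, window=100, boost=0.2):
    """Return a confidence boost for related literals found near the mention.

    Only the text within `window` characters before `start` and after `end`
    is inspected; distance is counted in characters.
    """
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    neighborhood = text[lo:hi]
    hits = sum(1 for literal in related_literals if literal in neighborhood)
    return boost * hits


# Toy document: the affiliation is close to the name, "IBM" is far away.
text = "Keynote: A. Joshi (University of Maryland). " + "x" * 300 + " Sponsored by IBM."
start = text.index("A. Joshi")
end = start + len("A. Joshi")
near = proximity_boost(text, start, end, ["University of Maryland", "IBM"], window=40)
print(near)
```

Only the affiliation falls inside the window here, so only one literal contributes to the boost.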

7 Using Co-occurrence Relations
– Similar to text-proximity, except that proximity is not relevant, i.e., location within the document does not matter
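For contrast, a co-occurrence check looks at the whole document rather than at a window around the mention. The sketch below assumes the related literals are, say, a candidate's areas of interest; the case-insensitive matching and the boost value are choices made for this example only.

```python
def cooccurrence_boost(text, related_literals, boost=0.1):
    """Return a boost for every related literal that occurs anywhere in the text."""
    lowered = text.lower()
    hits = sum(1 for literal in related_literals if literal.lower() in lowered)
    return boost * hits


# Toy data: one of the two areas of interest occurs somewhere in the post.
post = "CFP: Workshop on Semantic Web and Data Mining. PC members include A. Joshi."
score = cooccurrence_boost(post, ["semantic web", "robotics"])
print(score)
```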

8 Using Popular Entities (graph)
– Intention: bias the right entity to be the most popular entity
– This should be used with care, depending on the domain
– Good for tie-breaking
– DBLP scenario: the entity with more papers, e.g., only two “A. Joshi” entities have >50 papers
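One way to read "good for tie-breaking" is sketched below: popularity (here, a paper count) is consulted only when several candidates share the top confidence score. The function name, the bonus value, and the paper counts are invented for illustration.

```python
def break_ties_by_popularity(scores, paper_counts, bonus=0.05):
    """Give a small bonus to the most popular candidate among those tied at the top.

    scores maps candidate URI -> confidence; paper_counts maps URI -> number of papers.
    A new dict is returned; the input is left unchanged.
    """
    adjusted = dict(scores)
    top = max(adjusted.values())
    tied = [uri for uri, score in adjusted.items() if score == top]
    if len(tied) > 1:
        most_popular = max(tied, key=lambda uri: paper_counts.get(uri, 0))
        adjusted[most_popular] += bonus
    return adjusted


# Toy data: two candidates are tied; the one with more papers wins the tie.
scores = {"ex:a_joshi_1": 0.6, "ex:a_joshi_2": 0.6, "ex:a_joshi_3": 0.3}
paper_counts = {"ex:a_joshi_1": 12, "ex:a_joshi_2": 57, "ex:a_joshi_3": 80}
adjusted = break_ties_by_popularity(scores, paper_counts)
print(max(adjusted, key=adjusted.get))
```

Note that the third candidate has the most papers but is not boosted, because it was not among the tied leaders; this keeps popularity from overriding textual evidence.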

9 Using Relations to other Entities
– Entities can be related to one another through their collaboration network
– ‘Neighboring’ entities get a boost in their confidence score, i.e., propagation
– This is the ‘iterative’ step in our approach; it starts with the entities having the highest confidence scores
– Example: “Conference Program Committee Members:”
  - Professor Smith
  - Professor Smith’s co-editor in a recent book
  - Professor Smith’s recently graduated Ph.D. advisee
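The propagation step might look like the following sketch. It assumes a confidence threshold above which an entity counts as resolved, a fixed boost per resolved neighbor, and a cap on the number of rounds; none of these values or names come from the slides.

```python
def propagate(confidence, neighbors, threshold=0.8, boost=0.1, max_rounds=10):
    """Iteratively boost candidates that are related to already-confident ones.

    confidence maps URI -> score; neighbors maps URI -> set of related URIs
    (e.g., co-authors in the ontology). Each entity propagates at most once.
    """
    scores = dict(confidence)
    settled = set()  # entities that have already propagated their boost
    for _ in range(max_rounds):
        # Start with the highest-scoring entities that have not propagated yet.
        ready = sorted(
            (uri for uri, s in scores.items() if s >= threshold and uri not in settled),
            key=lambda uri: -scores[uri],
        )
        if not ready:
            break  # no score was updated in this round: finished
        for uri in ready:
            settled.add(uri)
            for other in neighbors.get(uri, ()):
                if other in scores and other not in settled:
                    scores[other] = min(1.0, scores[other] + boost)
    return scores


# Toy version of the program-committee example.
confidence = {"smith": 0.9, "coeditor": 0.75, "advisee": 0.65, "stranger": 0.75}
neighbors = {"smith": {"coeditor", "advisee"}, "coeditor": {"advisee"}}
result = propagate(confidence, neighbors)
print(result)
```

In this toy run Smith lifts the co-editor over the threshold, the co-editor then lifts the advisee, and the unrelated candidate keeps its original score.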

10 In Summary, ontology-driven
Using “clues”
– from the text where the entity appears
– from the ontology
Example: RDF/XML snippet of a person’s metadata

11 Overview of System Architecture

12 Once no more iterations are needed
Output of results: XML format
– URI
– Confidence score
– Entity name (as it appears in the text)
– Start and end position (location in a document)
Can easily be converted to other formats
– Microformats, RDFa, ...
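The four output fields listed above could be serialized as in this sketch, which uses Python's standard `xml.etree.ElementTree`. The element names (`entities`, `entity`, `uri`, and so on) and the sample values are assumptions; the actual schema used by the system is not given in the slides.

```python
import xml.etree.ElementTree as ET


def to_xml(matches):
    """Serialize disambiguation results to an XML string (hypothetical element names)."""
    root = ET.Element("entities")
    for m in matches:
        entity = ET.SubElement(root, "entity")
        ET.SubElement(entity, "uri").text = m["uri"]
        ET.SubElement(entity, "confidence").text = f"{m['confidence']:.2f}"
        ET.SubElement(entity, "name").text = m["name"]
        ET.SubElement(entity, "start").text = str(m["start"])
        ET.SubElement(entity, "end").text = str(m["end"])
    return ET.tostring(root, encoding="unicode")


# Toy result for one spotted name.
matches = [
    {
        "uri": "http://example.org/person/a_joshi_2",
        "confidence": 0.85,
        "name": "A. Joshi",
        "start": 16,
        "end": 24,
    }
]
xml_text = to_xml(matches)
print(xml_text)
```

Because the positions are kept, the same records could be used to wrap the original text spans in microformat or RDFa markup.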

13 Sample Output

14 Sample Output - Microformat

15 Evaluation: Gold Standard Set
We evaluate our method using a gold standard set of documents
– Randomly chose 20 consecutive posts from DBWorld
– Set of manually disambiguated documents: two humans validated the ‘right’ entity match
– We used precision and recall as the evaluation measures for our system

16 Evaluation, sample DBWorld post

17 Sample disambiguated document

18 Using DBLP data as ontology
Converted DBLP’s bibliographic data to RDF
– 447,121 authors
– A SAX parser to convert DBLP’s XML data to RDF
– Created relationships such as “co-author”
– Added:
  - affiliations (for a subset of authors)
  - areas of interest (for a subset of authors)
  - spellings for international characters
Lessons learned led us to create SwetoDblp (containing many improvements)
[DBLP] [SwetoDblp]
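A SAX-based conversion of this kind can be sketched with Python's standard `xml.sax` module. The handler below is an assumption-laden toy: it handles only `article` and `inproceedings` records, emits triples as plain tuples, and uses made-up predicate names (`ex:author_of`, `ex:co_author`). The real DBLP file also relies on a DTD for character entities, which this small sample avoids.

```python
import xml.sax
from itertools import combinations

PUBLICATION_TAGS = ("article", "inproceedings")


class DblpHandler(xml.sax.ContentHandler):
    """Collect (subject, predicate, object) triples from DBLP-style XML."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._pub = None      # key of the publication currently being read
        self._authors = []    # authors of the current publication
        self._buffer = None   # text of the <author> element being read, if any

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self._pub = attrs.get("key")
            self._authors = []
        elif name == "author" and self._pub is not None:
            self._buffer = []

    def characters(self, content):
        # SAX may deliver element text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author" and self._buffer is not None:
            author = "".join(self._buffer).strip()
            self._authors.append(author)
            self.triples.append((author, "ex:author_of", self._pub))
            self._buffer = None
        elif name in PUBLICATION_TAGS:
            # Every pair of authors on the same publication is a co-author pair.
            for a, b in combinations(self._authors, 2):
                self.triples.append((a, "ex:co_author", b))
            self._pub = None


# Toy record in the shape of a DBLP entry (names and key are invented).
sample = """<dblp>
  <article key="journals/ex/1">
    <author>Alice Example</author>
    <author>Bob Example</author>
    <title>A Hypothetical Paper</title>
  </article>
</dblp>"""

handler = DblpHandler()
xml.sax.parseString(sample.encode("utf-8"), handler)
for triple in handler.triples:
    print(triple)
```

A streaming parser suits this task because the full bibliography is far too large to load comfortably as a single in-memory tree.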

19 Evaluation, Precision & Recall
– We define set A as the set of unique names identified using the disambiguated dataset (i.e., exact results)
– We define set B as the set of entities found by our method
– A ∩ B represents the set of entities correctly identified by our method

20 Evaluation, Precision & Recall
– Precision is the proportion of correctly disambiguated entities with regard to B, i.e., |A ∩ B| / |B|
– Recall is the proportion of correctly disambiguated entities with regard to A, i.e., |A ∩ B| / |A|
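With A as the gold-standard set and B as the set found by the method, both measures reduce to set arithmetic. The sketch below uses made-up identifiers and counts purely to show the computation; they are not the evaluation data.

```python
def precision_recall(gold, found):
    """Return (precision, recall) for set A = gold and set B = found."""
    correct = gold & found  # A ∩ B: entities correctly identified
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy sets: 5 gold entities, 4 found, 3 of them correct.
gold = {"e1", "e2", "e3", "e4", "e5"}
found = {"e1", "e2", "e3", "e9"}
precision, recall = precision_recall(gold, found)
print(precision, recall)
```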

21 Evaluation, Results
Precision and recall (compared to gold standard)
Precision and recall on a per document basis:
Correct Disambiguation | Found Entities | Total Entities | Precision | Recall
 |  |  | % | 79.4%

22 Related Work
Semex Personal Information Management:
– The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities (much like our work does)
– Takes advantage of a predictable structure, such as fields where an email or name is expected to appear
Our approach works with unstructured data
[Semex] Dong, Halevy, Madhavan, SIGMOD-2005

23 Related Work
KIM
– Contains an entity recognition portion that uses natural language processing
– Evaluations performed on human-annotated corpora
SCORE Technology (now,
– Uses associations from a knowledge base, yet implementation details are not available (commercial product)
[SCORE] Sheth et al., Internet Computing, 6(4), 2002
[KIM] Popov et al., ISWC-2003

24 Conclusions
– Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
– This work is among the first to successfully use relationships for identifying named entities in text without relying on the structure of the text

25 Future Work
– Improvements on spotting, e.g., canonical names (Tim = Timothy)
– Integration/deployment as a UIMA component, which allows analysis across a document collection for applications such as semantic annotation and search
– Further evaluations
  - using different datasets and document sets
  - comparing with other methods
  - determining the best contributing factor in disambiguation
  - measuring how far down the list we missed the ‘right’ entity
[UIMA] IBM’s Unstructured Information Management Architecture

26 Scalability, Semantics, Automation
– Usage of background knowledge in the form of a (large) populated ontology
– Flexibility to use a different ontology, but the ontology must ‘fit’ the domain
– It’s an ‘automatic’ approach, yet a human defines threshold values (and some weights)

27 References
1. Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide Web Conference, Edinburgh, Scotland (2006)
2. DBWorld. April 9,
3. Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proc. of SIGMOD, Baltimore, MD (2005)
4. Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proc. of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002)
5. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proc. of the 2nd Intl. Semantic Web Conf., Sanibel Island, Florida (2003)
6. Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing semantic content for the Web. IEEE Internet Computing, 6(4) (2002)
7. Zhu, J., Uren, V., Motta, E.: ESpotter: Adaptive Named Entity Recognition for Web Browsing. 3rd Professional Knowledge Management Conference, Kaiserslautern, Germany (2005)
Evaluation datasets at: