Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.

Presentation transcript:

Combining terminology resources and statistical methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
Presented by George Demetriou
Natural Language Processing Group, University of Sheffield, UK

Introduction
Combining techniques for entity recognition:
  Dictionary-based term recognition
  Filtering of ambiguous terms
  Statistical entity recognition
How do the techniques compare, separately and in combination?
When combined, can we retain the advantages of both?

Semantic annotation of clinical text
Our basic task is semantic annotation of clinical text.
For the purposes of this paper, we ignore:
  Modifiers such as negation
  Relations and coreference
These are the subject of other papers.
Example text: "Punch biopsy of skin. No lesion on the skin surface following fixation."
[Figure: the example text annotated with the entity types Investigation, Locus and Condition]

Entity recognition in specialist domains
Specialist domains, e.g. medicine, are rich in:
  Complex terminology
  Terminology resources and ontologies
We might expect these resources to be of use in entity recognition.
We might expect annotation using these resources to add value to the text, providing additional information to applications.

Ambiguity in term resources
Most term resources have not been designed with NLP applications in mind.
When used for dictionary lookup, many suffer from problems of ambiguity:
  "I": iodine, an iodine test, or the personal pronoun
  "be": bacterial endocarditis, or the root of a verb
Various techniques can overcome this:
  Filtering or elimination of problematic terms
  Use of context: in our case, statistical models

Corpus: the CLEF gold standard
For experiments, we used a manually annotated gold standard:
  Careful construction of a schema and guidelines
  Double annotation with a consensus step
  Measurement of Inter Annotator Agreement (IAA)
  (Roberts et al. 2008, LREC bio text mining workshop)
For the experiments reported, we use 77 gold standard documents.

Entity types

Dictionary lookup: Termino
Termino is loaded from external resources.
FSM matchers are compiled out of Termino.
[Figure labels: External ontologies; External databases; External terminologies; Termino database; Termino matchers; Termino annotators; Link back to resources]

Finding entities with Termino
Termino is loaded with selected terms from UMLS (600K terms).
Pre-processing includes tokenisation and morphological analysis.
Lookup is against the roots of tokens.
[Figure labels: GATE application pipeline; Application texts; Linguistic pre-processing; Termino term recognition; Termino; Annotated texts]
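
Lookup against token roots can be sketched as below. This is only an illustration of the idea, not Termino's implementation: the root table stands in for a real morphological analyser, the term dictionary and its entity types are invented examples rather than UMLS content, and a simple longest-match scan stands in for the compiled FSM matchers.

```python
# Toy stand-in for morphological analysis (hypothetical entries).
ROOTS = {"lesions": "lesion", "biopsies": "biopsy", "cells": "cell"}

# Toy term dictionary keyed by tuples of roots (hypothetical, not UMLS).
TERMS = {
    ("punch", "biopsy"): "Investigation",
    ("lesion",): "Condition",
    ("skin",): "Locus",
}
MAX_LEN = max(len(key) for key in TERMS)


def root(token):
    """Return the root of a token, falling back to the lower-cased token."""
    return ROOTS.get(token.lower(), token.lower())


def lookup(tokens):
    """Longest-match dictionary lookup over token roots.

    Returns (start, end, entity_type) triples with token offsets.
    """
    roots = [root(t) for t in tokens]
    found = []
    i = 0
    while i < len(roots):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(roots) - i), 0, -1):
            key = tuple(roots[i:i + n])
            if key in TERMS:
                found.append((i, i + n, TERMS[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


tokens = "Punch biopsies of skin showed no lesions".split()
matches = lookup(tokens)
```

Because matching is done on roots, the plural forms "biopsies" and "lesions" are found even though only the singular forms are in the dictionary.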

Filtering problematic terms
Many UMLS terms are not suitable for NLP:
  Ambiguity with common general language words
To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results.
A supplementary list of missing terms was compiled by domain experts (6 terms).
Creation of these lists took a couple of hours.

Creating the filter list
1. Add all unique terms of 1 character to the list
2. For all unique terms of <= 6 characters:
   i. Add to the list if it matches a common general language word or abbreviation
   ii. Add to the list if it has a numeric component
   iii. Reject from the list if it is an obvious technical term
   iv. Reject from the list if none of the above apply
3. Filter list size: 232 terms
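
The steps above can be sketched as a small function. In the slides these decisions were made by manual inspection; here two hypothetical sets (`common_words` and `technical_terms`) stand in for the human judgements, and the sketch assumes that rejecting an obvious technical term takes precedence over the two "add" rules, which the slide does not state.

```python
import re


def build_filter_list(terms, common_words, technical_terms):
    """Build a filter list of problematic terms.

    common_words and technical_terms are hypothetical stand-ins for the
    manual judgements described on the slide.
    """
    filter_list = set()
    for term in set(terms):
        if len(term) == 1:
            # Step 1: all single-character terms are filtered.
            filter_list.add(term)
        elif len(term) <= 6:
            # Step 2.iii (assumed to take precedence): keep technical terms.
            if term in technical_terms:
                continue
            # Steps 2.i and 2.ii: common word/abbreviation, or contains a digit.
            if term.lower() in common_words or re.search(r"\d", term):
                filter_list.add(term)
            # Step 2.iv: otherwise the term is not added.
    return filter_list


terms = ["I", "be", "all", "t1", "femur", "desmin", "peritoneum", "B12"]
result = build_filter_list(
    terms,
    common_words={"i", "be", "all"},
    technical_terms={"B12"},
)
```

Terms longer than six characters, such as "peritoneum", are never considered for filtering in this sketch.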

Entities found by Termino
UMLS alone gives poor precision, due to term ambiguity with general language words.
Adding in the filter list improves precision with little loss in recall.
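
For reference, precision and recall over a set of predicted entities can be computed as in the sketch below. It assumes exact matching of (start, end, type) triples against the gold standard, which the slides do not specify, and the spans are invented for illustration rather than taken from the reported results.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted entities against gold entities.

    Entities are (start, end, type) triples; only exact matches count.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented spans: 4 gold entities, 5 predictions of which 3 are correct.
gold = {(0, 12, "Investigation"), (16, 20, "Locus"),
        (25, 31, "Condition"), (39, 43, "Locus")}
predicted = {(0, 12, "Investigation"), (16, 20, "Locus"),
             (25, 31, "Condition"), (13, 15, "Drug"), (32, 34, "Drug")}

p, r = precision_recall(predicted, gold)
```

Removing the two spurious predictions, as a filter list aims to do, would raise precision to 1.0 in this toy case while leaving recall unchanged.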

Statistical entity recognition
Statistical entity recognition allows us to model context.
We use an SVM implementation provided with GATE.
Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE.

Features for machine learning
Token kind (e.g. number, word)
Orthographic type (e.g. lower case, upper case)
Morphological root
Affix
Generalised part of speech:
  The first two characters of the Penn Treebank tag
Termino recognised terms
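
A per-token feature record of this kind might look like the sketch below. It is not GATE's feature extraction: the root, part-of-speech tag and Termino flag are passed in as if produced by earlier pipeline stages, the feature names and value labels are illustrative, and the affix is assumed to be whatever follows the root in the token.

```python
def token_kind(token):
    """Coarse token kind; the labels are illustrative."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "other"


def orth_type(token):
    """Coarse orthographic type; the labels are illustrative."""
    if token.isupper():
        return "allCaps"
    if token[:1].isupper() and token[1:].islower():
        return "upperInitial"
    if token.islower():
        return "lowercase"
    return "mixed"


def features(token, root, pos_tag, in_termino):
    """Build a feature dictionary for one token.

    root, pos_tag and in_termino are assumed to come from earlier
    processing (morphological analysis, POS tagging, Termino lookup).
    """
    lowered = token.lower()
    return {
        "kind": token_kind(token),
        "orth": orth_type(token),
        "root": root,
        # Assumption: the affix is the part of the token after its root.
        "affix": lowered[len(root):] if lowered.startswith(root) else "",
        # Generalised POS: first two characters of the Penn Treebank tag.
        "pos": pos_tag[:2],
        "termino": in_termino,
    }


f = features("Lesions", "lesion", "NNS", True)
```

Truncating the tag to two characters collapses, for example, NN, NNS, NNP and NNPS into a single "NN" class.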

Finding entities: ML
[Figure labels, GATE training pipeline: Gold standard annotated texts (human annotated); Linguistic processing; Term model learning; Statistical model of text]
[Figure labels, GATE application pipeline: Application texts; Linguistic processing; Term model application; Annotated texts]

Finding entities: ML + Termino
[Figure labels, GATE training pipeline: Gold standard annotated texts (human annotated); Linguistic processing; Termino term recognition; Termino; Term model learning; Statistical model of text]
[Figure labels, GATE application pipeline: Application texts; Linguistic processing; Termino term recognition; Term model application; Annotated texts]

Entities found by SVM
Statistical entity recognition alone gives higher precision (P) than dictionary lookup, but lower recall (R).
The combined system gains from the higher recall of dictionary lookup, with no loss in precision.

Linkage to external resources
Example text: "The peritoneum contains deposits of tumour... the tumour cells are negative for desmin."
Semantic annotation allows us to link texts to existing domain resources, giving more intelligent indexing and making additional information available to applications.

Linkage to external resources
UMLS links terms to Concept Unique Identifiers (CUIs).
Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI.
If the SVM finds an entity when Termino has found nothing, the entity cannot be linked to a CUI.
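
The linking rule can be sketched as follows. The sketch assumes an entity is "associated with" a Termino term when their spans coincide exactly, which the slides do not spell out, and the CUI strings are made-up placeholders rather than real UMLS identifiers.

```python
def link_cuis(entities, termino_matches):
    """Attach CUIs to recognised entities where a Termino term underlies them.

    entities: list of ((start, end), entity_type) pairs from the recogniser.
    termino_matches: dict mapping (start, end) spans to lists of CUIs.
    Exact span equality is an assumption made for this sketch.
    """
    linked = []
    for span, entity_type in entities:
        linked.append({
            "span": span,
            "type": entity_type,
            # Empty list: found by the SVM only, so no CUI is available.
            "cuis": termino_matches.get(span, []),
        })
    return linked


# Placeholder CUIs, invented for illustration only.
termino_matches = {
    (4, 14): ["C0000001"],
    (36, 42): ["C0000002", "C0000003"],  # ambiguous term: two candidates
}
entities = [((4, 14), "Locus"), ((36, 42), "Condition"), ((81, 87), "Drug")]

linked = link_cuis(entities, termino_matches)
unlinked = [e for e in linked if not e["cuis"]]
ambiguous = [e for e in linked if len(e["cuis"]) > 1]
```

Entities with more than one candidate CUI are the ambiguous cases that still need resolution.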

CUIs assigned
At least one CUI can be automatically assigned to 83% of the terms in the gold standard.
Some are ambiguous, and resolution is needed.

Availability
Most of the software is open source and can be downloaded as part of GATE.
We are currently packaging Termino for public release.
We are currently preparing a UK research ethics committee application for release of the annotated gold standard.

Conclusions
Dictionary lookup gives good recall but poor precision, due to term ambiguity.
Much ambiguity is due to a few terms, which can be filtered with little loss in recall.
Combining dictionary lookup with statistical models of context improves precision.
A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system.

Questions?