Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.


1 Combining terminology resources and statistical methods for entity recognition: an evaluation
Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo
Presented by George Demetriou, Natural Language Processing Group, University of Sheffield, UK

2 Introduction
Combining techniques for entity recognition:
- Dictionary-based term recognition
- Filtering of ambiguous terms
- Statistical entity recognition
How do the techniques compare, separately and in combination? When combined, can we retain the advantages of both?

3 Semantic annotation of clinical text
Our basic task is the semantic annotation of clinical text. For the purposes of this paper, we ignore:
- Modifiers such as negation
- Relations and coreference
These are the subject of other papers.
[Annotated example, with Investigation, Condition and Locus entities marked: "Punch biopsy of skin. No lesion on the skin surface following fixation."]

4 Entity recognition in specialist domains
Specialist domains, e.g. medicine, are rich in:
- Complex terminology
- Terminology resources and ontologies
We might expect these resources to be of use in entity recognition, and we might expect annotation using these resources to add value to the text, providing additional information to applications.

5 Ambiguity in term resources
- Most term resources have not been designed with NLP applications in mind
- When used for dictionary lookup, many suffer from problems of ambiguity:
  - I: iodine, an iodine test, or the personal pronoun
  - be: bacterial endocarditis, or the root of a verb
- Various techniques can overcome this:
  - Filtering or elimination of problematic terms
  - Use of context: in our case, statistical models

6 Corpus: the CLEF gold standard
- For experiments, we used a manually annotated gold standard:
  - Careful construction of a schema and guidelines
  - Double annotation with a consensus step
  - Measurement of Inter Annotator Agreement (IAA)
  (Roberts et al. 2008, LREC bio text mining workshop)
- For the experiments reported, we use 77 gold standard documents

7 Entity types

8 Dictionary lookup: Termino
[Architecture diagram: external terminologies, ontologies and databases are loaded into the Termino database; Termino matchers and annotators link annotations back to the source resources]
- Termino is loaded from external resources
- FSM matchers are compiled out of Termino

9 Finding entities with Termino
[GATE application pipeline: application texts → linguistic pre-processing → Termino term recognition → annotated texts]
- Termino loaded with selected terms from UMLS (600K terms)
- Pre-processing includes tokenisation and morphological analysis
- Lookup is against the roots of tokens

10 Filtering problematic terms
- Many UMLS terms are not suitable for NLP: ambiguity with common general language words
- To identify the most problematic of these, we ran Termino over a separate development corpus and manually inspected the results
- A supplementary list of missing terms was compiled by domain experts (6 terms)
- Creation of these lists took a couple of hours

11 Creating the filter list
1. Add all unique terms of 1 character to the list
2. For all unique terms of <= 6 characters:
   i. Add to the list if it matches a common general language word or abbreviation
   ii. Add to the list if it has a numeric component
   iii. Reject from the list if it is an obvious technical term
   iv. Reject from the list if none of the above apply
3. Filter list size: 232 terms
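The heuristics above can be sketched in code. This is an illustrative reconstruction: the word list is a stand-in for the common words and abbreviations identified on the development corpus, and the "obvious technical term" judgement (rule 2iii) was a manual step, represented here only by terms falling through to rejection.

```python
# Placeholder for the manually compiled common word/abbreviation list.
COMMON_WORDS = {"be", "i", "all", "aim", "lead"}

def has_numeric(term):
    return any(ch.isdigit() for ch in term)

def build_filter_list(unique_terms):
    filter_list = set()
    for term in unique_terms:
        t = term.lower()
        if len(t) == 1:                  # rule 1: single-character terms
            filter_list.add(term)
        elif len(t) <= 6:
            if t in COMMON_WORDS:        # rule 2i: common word or abbreviation
                filter_list.add(term)
            elif has_numeric(t):         # rule 2ii: numeric component
                filter_list.add(term)
            # rules 2iii/2iv: technical terms and everything else are
            # rejected, i.e. left out of the filter list
    return filter_list

print(sorted(build_filter_list(["I", "be", "B12", "lesion", "biopsy"])))
# → ['B12', 'I', 'be']
```

Terms placed on the list are suppressed during lookup; "lesion" and "biopsy" survive as genuine medical terms.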

12 Entities found by Termino
- UMLS alone gives poor precision, due to term ambiguity with general language words
- Adding in the filter list improves precision with little loss in recall
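The precision and recall figures compared on these slides can be computed with a minimal scorer over gold and predicted entity spans. An exact-match criterion is assumed here; the paper's actual matching criterion may differ.

```python
def score(gold, predicted):
    """Precision and recall over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                         # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {(0, 2, "Investigation"), (3, 4, "Locus"), (6, 7, "Condition")}
pred = {(0, 2, "Investigation"), (3, 4, "Locus"), (5, 6, "Drug")}
p, r = score(gold, pred)
print(round(p, 2), round(r, 2))  # → 0.67 0.67
```

A spurious match (the hypothetical "Drug" span) lowers precision; a missed gold entity lowers recall, which is exactly the trade-off the filter list and the combined system manipulate.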

13 Statistical entity recognition
- Statistical entity recognition allows us to model context
- We use an SVM implementation provided with GATE
- Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE

14 Features for machine learning
- Token kind (e.g. number, word)
- Orthographic type (e.g. lower case, upper case)
- Morphological root
- Affix
- Generalised part of speech (the first two characters of the Penn Treebank tag)
- Termino recognised terms
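A token-level feature extractor along these lines can be sketched as below. This is an assumption-laden illustration, not GATE's feature extraction: the root and affix rules are crude stand-ins, and the POS tag is supplied by hand since no tagger is bundled here.

```python
def token_features(token, pos_tag):
    """Illustrative features mirroring the slide's list (Termino terms omitted)."""
    return {
        "kind": "number" if token.isdigit() else "word",   # token kind
        "orth": "upper" if token[0].isupper() else "lower",  # orthographic type
        "root": token.lower().rstrip("s"),                 # crude morphological root
        "affix": token[-3:].lower(),                       # simple suffix feature
        "pos": pos_tag[:2],                                # generalised POS: first 2 chars
    }

print(token_features("Biopsies", "NNS"))
# → {'kind': 'word', 'orth': 'upper', 'root': 'biopsie', 'affix': 'ies', 'pos': 'NN'}
```

Truncating "NNS" to "NN" collapses singular and plural nouns into one class, which is what generalising the Penn Treebank tag to its first two characters achieves.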

15 Finding entities: ML
[GATE training pipeline: gold standard annotated texts (human annotated) → linguistic processing → term model learning → statistical model of text]
[GATE application pipeline: application texts → linguistic processing → term model application → annotated texts]

16 Finding entities: ML + Termino
[As on the previous slide, with a Termino term recognition step added to both pipelines: gold standard annotated texts (human annotated) → linguistic processing → Termino term recognition → term model learning → statistical model of text; application texts → linguistic processing → Termino term recognition → term model application → annotated texts]

17 Entities found by SVM
- Statistical entity recognition alone gives a higher P than dictionary lookup, but a lower R
- The combined system gains from the higher R of dictionary lookup, with no loss in P

18 Linkage to external resources
- Semantic annotation allows us to link texts to existing domain resources
- This gives more intelligent indexing and makes additional information available to applications
Example: "The peritoneum contains deposits of tumour... the tumour cells are negative for desmin."

19 Linkage to external resources
- UMLS links terms to Concept Unique Identifiers (CUIs)
- Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI
- If the SVM finds an entity where Termino has found nothing, the entity cannot be linked to a CUI
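The linkage rule reduces to a simple lookup, sketched below. The term-to-CUI table and the CUI values are placeholders, not real UMLS identifiers; in the actual system the mapping comes from the UMLS data loaded into Termino.

```python
# Placeholder CUIs for illustration only; real UMLS CUIs differ.
term_to_cui = {"peritoneum": ["C0000001"], "desmin": ["C0000002"]}

def link_entity(entity_text, termino_match):
    """Return candidate CUIs if the entity is backed by a Termino term."""
    if termino_match is None:
        return []  # SVM-only entity: no underlying term, so no linkage
    return term_to_cui.get(termino_match.lower(), [])

print(link_entity("peritoneum", "peritoneum"))   # → ['C0000001']
print(link_entity("tumour cells", None))         # SVM found it, Termino did not → []
```

A term may map to several CUIs, which is why the next slide notes that some assignments are ambiguous and need resolution.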

20 CUIs assigned
- At least one CUI can be automatically assigned to 83% of the terms in the gold standard
- Some are ambiguous, and resolution is needed

21 Availability
- Most of the software is open source and can be downloaded as part of GATE
- We are currently packaging Termino for public release
- We are currently preparing a UK research ethics committee application for release of the annotated gold standard

22 Conclusions
- Dictionary lookup gives good recall but poor precision, due to term ambiguity
- Much of the ambiguity is due to a small number of terms, which can be filtered out with little loss in recall
- Combining dictionary lookup with statistical models of context improves precision
- A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system

23 Questions? http://www.clinical-escience.org http://www.clef-user.com

