Named-Entity Recognition with Character-Level Models
Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning
Stanford University
CoNLL-2003: Seventh Conference on Natural Language Learning
2 Unknown Words are a Central Challenge for NER
- Recognizing known named entities (NEs) is relatively simple and accurate
- Recognizing novel NEs requires recognizing context and/or word-internal features
- External context and frequent internal words (e.g. Inc.) are the most commonly used features
- Internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002)
  - Staffordshire
  - Abdul-Karim al-Kabariti
  - CentrInvest
3 Are Names Self-Describing?
- NO: names can be opaque/ambiguous
  - Word-Level: Washington occurs as LOC, PER, and ORG
  - Char-Level: -ville suggests LOC, but there are exceptions like Neville
- YES: names can be highly distinctive/descriptive
  - Word-Level: National Bank is a bank (i.e. ORG)
  - Char-Level: Cotramoxazole is clearly a drug name
- Question: Overall, how informative are names alone?
4 How Internally Descriptive are Isolated Named Entities?
- Classification accuracy of pre-segmented CoNLL NEs without context is ~90%
- Using character n-grams as features instead of words yields a 25% error reduction
- On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of errors
[Chart: NE Classification Accuracy (%) [not CoNLL task]]
5 Exploiting Word-Internal Features
- Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)
  - e.g. Mikheev 97, Wacholder et al 97, Bikel et al 97
  - Features are usually language-dependent (e.g. morphology)
- Our approach: use char n-grams as the primary representation
  - Use all substrings as classification features
  - Char n-grams subsume word features
  - Features are language-independent (assuming the language is alphabetic)
  - Similar in spirit to Cucerzan and Yarowsky (99), but uses ALL char n-grams vs. just prefix/suffix
- Example: #Tom# yields #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
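The "all substrings" feature set on this slide can be illustrated with a minimal Python sketch. The function name `char_ngrams` and the choice to skip substrings made up only of the boundary marker are assumptions made here to match the #Tom# example; the slide does not say how the original system enumerated its features.

```python
def char_ngrams(word, boundary="#"):
    """Return every substring of the boundary-marked word, longest first.

    Hypothetical helper for illustration; substrings consisting only of
    the boundary marker are skipped, matching the #Tom# example.
    """
    marked = f"{boundary}{word}{boundary}"
    grams = []
    for n in range(len(marked), 0, -1):          # n-gram length, longest first
        for i in range(len(marked) - n + 1):     # start offset
            gram = marked[i:i + n]
            if gram.strip(boundary):             # drop bare "#" grams
                grams.append(gram)
    return grams


print(char_ngrams("Tom"))
```

Because the full marked word (#Tom#) is itself one of the substrings, the word-identity feature is a special case of this representation, which is the sense in which char n-grams subsume word features.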
6 Character-Feature Based Classifier
- Model I: Independent classification at each word
  - maxent classifiers, trained using conjugate gradient
  - equal-scale Gaussian priors for smoothing
  - trained models with >800K features in ~2 hrs
- POS tags and contextual features complement n-grams

Description        Added Features                              Overall F1 (English Dev.)
Words              w_0
Official Baseline  -
Char N-Grams       n(w_0)
POS Tags           t_0
Simple Context     w_-1, w_0, t_-1, t_1
More Context       w_-1, w_0, w_0, w_1, t_-1, t_0, t_0, w_1
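For reference, the standard form of a maxent classifier with a Gaussian prior can be written out as below. The notation (features f_i, weights \lambda_i, a single shared variance \sigma^2 standing in for "equal-scale") is an assumption made here; the slide does not give the formula or the value of \sigma.

```latex
% Conditional maxent model over classes c given a datum d
P(c \mid d; \lambda) =
  \frac{\exp \sum_i \lambda_i f_i(c, d)}
       {\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}

% Training objective: conditional log-likelihood penalized by a
% zero-mean Gaussian prior with the same variance for every weight
\hat{\lambda} = \arg\max_{\lambda}
  \sum_{(c, d)} \log P(c \mid d; \lambda)
  \; - \; \sum_i \frac{\lambda_i^2}{2 \sigma^2}
```

The penalty term pulls each weight toward zero, which is what keeps a model with several hundred thousand sparse n-gram features from overfitting.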
7 Character-Based CMM
- Model II: Joint classifications along the sequence
- Previous classification decisions are clearly relevant:
  - Grace Road is a single location, not a person + location
- Include neighboring classification decisions as features
- Perform joint inference across the chain of classifiers
  - Conditional Markov Model (CMM, a.k.a. maxent Markov model)
  - Borthwick 1999, McCallum et al 2000
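Joint inference over a chain of local classifiers can be sketched with a Viterbi search, which is one standard choice for a first-order CMM; the slide does not say which search procedure was actually used. Everything named below (`viterbi_cmm`, `toy_log_prob`, the `TOY` probability table) is hypothetical and hand-set to mirror the Grace Road example.

```python
import math


def viterbi_cmm(n_positions, states, local_log_prob, start="<S>"):
    """Best state sequence under a first-order conditional Markov model.

    local_log_prob(prev_state, state, i) is a hypothetical callback giving
    log P(state | prev_state, observations at position i).
    """
    if n_positions == 0:
        return []
    # best[s] = (log score of the best path ending in s, that path)
    best = {s: (local_log_prob(start, s, 0), [s]) for s in states}
    for i in range(1, n_positions):
        new_best = {}
        for s in states:
            prev = max(states, key=lambda p: best[p][0] + local_log_prob(p, s, i))
            score = best[prev][0] + local_log_prob(prev, s, i)
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    final = max(states, key=lambda s: best[s][0])
    return best[final][1]


# Toy local model for "Grace Road" (made-up numbers, not from the paper):
# keyed by (position, previous state) -> distribution over the current state.
TOY = {
    (0, "<S>"): {"PER": 0.55, "LOC": 0.45},   # "Grace" alone looks like a person
    (1, "PER"): {"PER": 0.50, "LOC": 0.50},
    (1, "LOC"): {"PER": 0.10, "LOC": 0.90},   # "Road" strongly continues a LOC
}


def toy_log_prob(prev_state, state, i):
    return math.log(TOY[(i, prev_state)][state])


print(viterbi_cmm(2, ["PER", "LOC"], toy_log_prob))
```

With these toy numbers, labeling each word independently would start with PER for Grace, while scoring whole sequences prefers LOC LOC, which is the behavior the slide motivates.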
8 Character-Based CMM
- Final extra features:
  - Letter-type patterns for each word
    - United -> Xx, 12-month -> d-x, etc.
  - Conjunction features
    - E.g., previous state and current signature
  - Repeated last words of multi-word names
    - E.g., Jones after having seen Doug Jones
  - ... and a few more

Description      Added Features                          Overall F1 (English Dev)
More Context     w_-1, w_0, w_0, w_1, t_-1, t_0, t_0, w_1
Simple Sequence  s_-1, s_-1, t_-1, t_0
More Sequence    s_-2, s_-1, s_-2, s_-1, t_-1, t_0
Final            misc. extra features
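The letter-type patterns above can be reproduced with a short Python sketch. The mapping used here (uppercase to X, lowercase to x, digit to d, other characters kept as-is, repeated types collapsed) is an assumption inferred from the two examples on the slide, and `word_signature` is a hypothetical name.

```python
def word_signature(word):
    """Collapse a word into its letter-type pattern, e.g. United -> Xx.

    Hypothetical helper; the type mapping is inferred from the slide's
    examples rather than taken from the original system.
    """
    signature = []
    for ch in word:
        if ch.isupper():
            kind = "X"
        elif ch.islower():
            kind = "x"
        elif ch.isdigit():
            kind = "d"
        else:
            kind = ch                      # keep punctuation literally
        if not signature or signature[-1] != kind:
            signature.append(kind)         # collapse runs of the same type
    return "".join(signature)


for w in ["United", "12-month", "CentrInvest"]:
    print(w, word_signature(w))
```

A conjunction feature such as "previous state and current signature" could then be formed by pairing the previous label with this string, for example ("ORG", "Xx").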
9 Final Results
- Drop from English dev to test is largely due to inconsistent labeling
- Lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when faced with weak evidence
10 Conclusions
- Character substrings are valuable and underexploited model features
  - Named entities are internally quite descriptive
  - 25-30% error reduction vs. word-level models
- Discriminative maxent models allow productive feature engineering
  - 30% error reduction vs. basic model
- What distinguishes our approach?
  - More and better features
  - Regularization is crucial for preventing overfitting
11 References
- Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, pages 194--201.
- Andrew Borthwick. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.
- Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Joint SIGDAT Conference on EMNLP and VLC.
- Shai Fine, Yoram Singer, and Naftali Tishby. 1998. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41--62.
12 References (cont.)
- Andrew McCallum, Dayne Freitag, and Fernando Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML 2000.
- Andrei Mikheev. 1997. Automatic rule induction for unknown-word guessing. Computational Linguistics, 23(3):405--423.
- Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In EMNLP 1, pages 133--142.
- Joseph Smarr and Christopher D. Manning. 2002. Classifying unknown proper noun phrases without context. Technical Report dbpubs/2002-46, Stanford University, Stanford, CA.
- Nina Wacholder, Yael Ravin, and Misook Choi. 1997. Disambiguation of proper names in text. In ANLP 5, pages 202--208.
13 CoNLL Named Entity Recognition
- Task: Predict semantic label of each word in text

Foreign    NNP  I-NP  ORG
Ministry   NNP  I-NP  ORG
spokesman  NN   I-NP  O
Shen       NNP  I-NP  PER
Guofang    NNP  I-NP  PER
told       VBD  I-VP  O
Reuters    NNP  I-NP  ORG
:          :    O     O
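The four-column layout in this example (word, POS tag, chunk tag, entity label) can be read with a few lines of Python. This is a minimal sketch: `Token` and `parse_conll` are hypothetical names, the label set follows the simplified ORG/PER/O labels on the slide, and blank lines are assumed to separate sentences.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    chunk: str
    ne: str


def parse_conll(lines):
    """Group whitespace-separated four-column lines into sentences.

    Hypothetical helper; a blank line ends the current sentence, and a
    line without exactly four columns raises TypeError.
    """
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:                     # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences


SAMPLE = """\
Foreign NNP I-NP ORG
Ministry NNP I-NP ORG
spokesman NN I-NP O
Shen NNP I-NP PER
Guofang NNP I-NP PER
told VBD I-VP O
Reuters NNP I-NP ORG
: : O O
""".splitlines()

sentence = parse_conll(SAMPLE)[0]
print([t.word for t in sentence if t.ne != "O"])
```

Each `Token` then carries the fields w_0 and t_0 that the feature tables on the earlier slides refer to, with the `ne` column as the label to predict.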