
1 CS 6998 NLP for the Web, Columbia University, 04/22/2010
Analyzing Wikipedia and Gold-Standard Corpora for NER Training
William Y. Wang, Computer Science
Nothman et al. 2009, EACL

2 Outline
1. Motivation
2. NER and Gold-Standard Corpora
3. The Problem: Cross-corpus Performance
4. Wikipedia for NER
5. Results
6. Conclusion and My Observations

3 Motivation
1. Manual annotation is "expensive": (1) costly, (2) time-consuming, (3) brings extra problems. Can we use linguistic resources to create an NER corpus automatically?
2. How well does NER performance transfer across corpora?
3. How can we utilize Web resources (e.g. Wikipedia) to improve NER?

4 NER Gold Corpora
1. MUC-7: locations (LOC), organizations (ORG), personal names (PER)
2. CoNLL-03: LOC, ORG, PER, miscellaneous (MISC)
3. BBN: 54 tags in the Penn Treebank

Corpus      Tags   Train Tokens   Dev Tokens   Test Tokens
MUC-7          3         83,601       18,655        60,436
CoNLL-03       4        203,621       51,362        46,435
BBN           54        901,894      142,218       129,654

5 Problem: Poor Cross-corpus Performance
(Rows: training corpus; columns: test corpus)

            With MISC           Without MISC
Train    CoNLL    BBN       MUC   CoNLL    BBN
MUC          —      —      73.5    55.5   67.5
CoNLL     81.2   62.3      65.9    82.1   62.4
BBN       54.7   86.7      77.9    53.9   88.4

6 Corpus and Error Analysis
N-gram tag variation: check the tags of all n-grams that appear multiple times to see whether their NE tags are consistent.
Entity type frequency: (1) the POS tag together with the NE tag (e.g. nationalities often appear with JJ or NNPS); (2) wordtypes; (3) wordtypes with function words (e.g. Bank of New England -> Aaa of Aaa Aaa). A sketch of these checks follows below.
Tag sequence confusion: look into the details of the confusion matrix.
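To make the wordtype mapping and the n-gram consistency check concrete, here is a minimal Python sketch. This is my own illustration, not the authors' code; the function-word list and all names are assumptions.

```python
from collections import defaultdict

# Assumed function-word list; the paper's exact list is not given here.
FUNCTION_WORDS = {"of", "the", "and", "for"}

def wordtype(token):
    """Map a token to a coarse capitalization pattern, keeping
    function words literal, e.g. 'Bank' -> 'Aaa', 'of' -> 'of'."""
    if token.lower() in FUNCTION_WORDS:
        return token.lower()
    if token[0].isupper():
        return "Aaa"
    if token.isdigit():
        return "000"
    return "aaa"

def entity_wordtype(entity):
    # "Bank of New England" -> "Aaa of Aaa Aaa"
    return " ".join(wordtype(t) for t in entity.split())

def ngram_tag_variation(tagged_ngrams):
    """tagged_ngrams: iterable of (ngram, ne_tag) pairs.
    Returns the n-grams that occur with more than one NE tag,
    i.e. the annotation inconsistencies the analysis looks for."""
    tags = defaultdict(set)
    for ngram, tag in tagged_ngrams:
        tags[ngram].add(tag)
    return {ng: ts for ng, ts in tags.items() if len(ts) > 1}
```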

7 Using Wikipedia to Build an NER Corpus
1. Classify all articles into entity classes
2. Split Wikipedia articles into sentences
3. Label NEs according to link targets (a sketch of this step follows the list)
4. Select sentences for inclusion in a corpus
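A schematic sketch of step 3 under assumed data structures; the link-span format, the article_class lookup, and the BIO output are my illustration, not the authors' implementation.

```python
def label_sentence(tokens, links, article_class):
    """tokens: list of words in one sentence;
    links: {(start, end): target_article_title} for the wiki links
           covering token spans [start, end);
    article_class: {title: 'LOC'|'ORG'|'PER'|'MISC'} from step 1."""
    labels = ["O"] * len(tokens)
    for (start, end), target in sorted(links.items()):
        cls = article_class.get(target)
        if cls is None:
            continue                 # link target is not a known entity
        labels[start] = "B-" + cls
        for i in range(start + 1, end):
            labels[i] = "I-" + cls
    return labels
```

Step 4 then keeps only sentences whose labels can be trusted, e.g. sentences in which the capitalized words are all covered by labelled links.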

8 Improving Wikipedia NER
Baseline: 58.9% and 62.3% on CoNLL and BBN
1. Infer extra links using Wikipedia disambiguation pages (sketched below)
2. Personal titles: not all preceding titles indicate PER (e.g. Prime Minister of Australia)
3. Previously missed JJ entities (e.g. American / MISC)
4. Miscellaneous changes
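One way to read improvement 1 is as alias inference: a disambiguation page lists candidate articles for a surface form, and an unlinked mention of that form can inherit a class when the candidates agree. The sketch below is a hedged illustration under that assumption; the data structures and names are mine, and the paper's actual inference is richer.

```python
def infer_aliases(disambig_pages, article_class):
    """disambig_pages: {surface_form: [candidate_article_titles]};
    article_class: {title: entity class}. A surface form is assigned a
    class only when all of its entity candidates agree on it."""
    aliases = {}
    for surface, candidates in disambig_pages.items():
        classes = {article_class.get(c) for c in candidates} - {None}
        if len(classes) == 1:        # unambiguous across candidates
            aliases[surface] = classes.pop()
    return aliases
```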

9 Results

            With MISC           Without MISC
Train    CoNLL    BBN       MUC   CoNLL    BBN
MUC          —      —      82.3    54.9   69.3
CoNLL     85.9   61.9      69.9    86.9   60.2
BBN       59.4   86.5      80.2    59.0   88.0
WP0       62.8   69.7        —     64.7   70.0
WP1       67.2   73.4      75.3    67.7   73.6
WP2       69.0   74.0      76.6    69.4   75.1
WP3       68.9   73.5      77.2    69.5   73.7
WP4       66.2   72.3      75.6    67.3   73.3

DEV set results (higher than, but similar to, test-set results)

10 Conclusion
The NER training corpus has a huge impact on performance on its corresponding test set
Annotation-free Wikipedia NER corpora were created
Wikipedia data performs better on the cross-corpus NER task
There is still much room for improvement

11 Comments
What I like about this paper:
- The scope of the paper is unique (analogy: cross-cultural studies)
- It utilizes novel linguistic resources to solve basic NLP problems
- Good results
- Relatively clear and easy to understand
What I don't like about this paper:
- The overall method for improving Wikipedia NER training is not a principled approach

12 Overall Assessment: 8/10

13 Thank you!

