
1 Improving NER in Arabic using a Morphological Tagger
Benjamin Farber, Dayne Freitag (Fair Isaac)
Nizar Habash, Owen Rambow (Columbia-CCLS), habash@ccls.columbia.edu

Confidential. The material in this presentation is the property of Fair Isaac Corporation, is provided for the recipient only, and shall not be used, reproduced, or disclosed without Fair Isaac Corporation's express consent. © 2008 Fair Isaac Corporation.

2 Overview
- Named Entity Recognition (NER)
- NER for Arabic: the Challenges
- Using Morphological Analysis and Disambiguation
- Error Analysis

3 Named Entity Recognition: Mention Detection
Sample below from Automatic Content Extraction (ACE):
  NEW YORK, March 19 (AFP) Media tycoon Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment…
Our objective: find name mentions
- Geo-political entity (GPE)
- Organization
- Person

4 NER: The Approach
Token stream:           Media tycoon Barry Diller on Wednesday quit as chief of Vivendi …
BIO-encoded mentions:   O     O      B:person I:person O O O O O O B:org …
Model: structured perceptron label-sequence model
Extracted mentions:
- Person: "Barry Diller"
- Organization: "Vivendi Universal Entertainment"
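As a minimal sketch of what the BIO encoding means, the hypothetical decoder below recovers typed mention spans from a tag sequence; the function name and tokenization are illustrative, not the paper's code:

```python
# Minimal sketch: decode a BIO tag sequence into typed mention spans.
# Illustrative only; not the system's actual implementation.

def decode_bio(tokens, tags):
    """Return (type, start, end, text) tuples from BIO tags like 'B:person'."""
    mentions, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last span
        if tag.startswith("B:") or tag == "O":
            if start is not None:                   # close the open mention
                mentions.append((etype, start, i, " ".join(tokens[start:i])))
                start, etype = None, None
        if tag.startswith("B:"):
            start, etype = i, tag[2:]
        elif tag.startswith("I:") and start is None:
            start, etype = i, tag[2:]               # tolerate I without a B

    return mentions

tokens = ["Media", "tycoon", "Barry", "Diller", "on", "Wednesday", "quit"]
tags   = ["O", "O", "B:person", "I:person", "O", "O", "O"]
print(decode_bio(tokens, tags))  # [('person', 2, 4, 'Barry Diller')]
```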

5 The Role of Features
State-of-the-art methods rely on word-local features, typically of the form F(S, i) ∈ {0, 1} for sequence S and position i.
Classes of feature:
- Word identity (e.g., word at index is "the")
- Orthographic (e.g., word is capitalized)
- Lexical (e.g., word is a noun, or in a list of cities)
Example, for the tokens "Media tycoon Barry Diller":
  Media:  word_media:1  word_the:0 capitalized:1 numeric:0 …
  tycoon: word_tycoon:1 word_the:0 capitalized:0 numeric:0 …
  Barry:  word_barry:1  word_the:0 capitalized:1 numeric:0 in_name_list:1 …
  Diller: word_diller:1 word_the:0 capitalized:1 numeric:0 in_name_list:0 …
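To make the F(S, i) ∈ {0, 1} form concrete, here is a small sketch of word-local feature extraction; the feature names and NAME_LIST are assumptions for illustration, not the system's actual feature set:

```python
# Sketch of word-local binary features F(S, i) -> {0, 1}.
# Feature names and NAME_LIST are illustrative assumptions.

NAME_LIST = {"barry", "diller"}   # stand-in for a lexical resource

def features(sentence, i):
    word = sentence[i]
    return {
        "word_" + word.lower(): 1,                       # word identity
        "capitalized": int(word[:1].isupper()),          # orthographic
        "numeric": int(word.isdigit()),                  # orthographic
        "in_name_list": int(word.lower() in NAME_LIST),  # lexical
    }

sent = ["Media", "tycoon", "Barry", "Diller"]
print(features(sent, 2))
# {'word_barry': 1, 'capitalized': 1, 'numeric': 0, 'in_name_list': 1}
```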

6 Challenges of Arabic NER
English features: word identity, orthographic, gazetteers. Arabic features: word identity only.
- Dearth of lexical features (future work)
- Orthographic ambiguity: omission of short vowels increases lexical ambiguity, making word identity a less reliable feature (addressed in this study)
- Clitics and affixation: Arabic is rich in affixes; what is the best tokenization? (addressed in this study)
  Examples: للارمن = ل+ ال+ ارمن "for the Armenians"; والحكومة = و+ ال+ حكومة "and the government"

7 Features Based on Term Clusters
- Distributional term clustering using unlabeled corpora
- Source of features for NER (Miller et al., 2004; Freitag, 2004)
- Boolean features reflecting cluster membership
Example English clusters:
  clinton dole brown johnson gingrich king yeltsin …
  texas london california florida chicago boston tokyo …
Example Arabic clusters:
  جاكسون jAkswn Jackson, وايزمن wAyzmn Weisman, سولانا swlAnA Solana, بيكر bykr Baker, بوش bw$ Bush, سادات sAdAt Sadat, كول kwl Kohl …
  البرتو Albrtw Alberto, ستيفان styfAn Stephan, يان yAn Jan, كريس krys Chris, دانيال dAnyAl Daniel, كارل kArl Carl, توماس twmAs Thomas …
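A sketch of how cluster membership becomes a Boolean feature; the cluster IDs and contents below are invented for illustration, not the clusters induced in the study:

```python
# Sketch: Boolean NER features from distributional term clusters.
# Cluster IDs and contents are illustrative, not the actual clusters.

CLUSTERS = {
    "c17": {"clinton", "dole", "brown", "johnson", "gingrich"},      # person-like
    "c42": {"texas", "london", "california", "florida", "chicago"},  # place-like
}

# Invert once: token -> cluster id
TOKEN2CLUSTER = {tok: cid for cid, toks in CLUSTERS.items() for tok in toks}

def cluster_features(word):
    """Fire a single membership feature for the word's cluster, if any."""
    cid = TOKEN2CLUSTER.get(word.lower())
    return {f"cluster_{cid}": 1} if cid else {}

print(cluster_features("London"))  # {'cluster_c42': 1}
```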

8 Morphological Analysis and Disambiguation for Arabic (MADA)
Buckwalter Arabic Morphological Analyzer (BAMA) output:

  ;;WORD byn
  bay˜ana = [bay˜an_1 POS:V +PV +S:3MS BW:+bay˜an/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate
  bayonu  = [bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among
  biyn    = [biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben

Which BAMA analysis is correct? MADA (Habash and Rambow, 2005; Roth et al., 2008):
- Combination of classifiers on orthogonal dimensions of Arabic morphology
- 96% disambiguation accuracy
- 99.3% word-level PATB tokenization accuracy

9 The Initial Experiment
Two new features:
- GlossCap: the English gloss is capitalized
- OOV: no entry exists for the word in our morphological database
Two enhanced NER models, with the OOV feature in both:
- BAMA only: GlossCap is true if the gloss of any analysis returned by BAMA is capitalized.
- MADA: GlossCap is true only if the gloss of the analysis selected by MADA is capitalized.
For byn above, only the proper-noun analysis glosses to a capitalized word ("Ben"), so the BAMA variant always fires on byn, while the MADA variant fires only when MADA selects that analysis.
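A minimal sketch of the two GlossCap variants, assuming each analysis is represented as a dict with a "gloss" field; the data layout mimics BAMA's output for byn but is an assumption, not BAMA's actual API:

```python
# Sketch of the two GlossCap variants; the analysis records below mimic
# BAMA's output for "byn" but the data structure is an assumption.

analyses = [
    {"lemma": "bay~an_1", "pos": "V",  "gloss": "declare/demonstrate"},
    {"lemma": "bayona_1", "pos": "N",  "gloss": "between/among"},
    {"lemma": "biyn_1",   "pos": "PN", "gloss": "Ben"},
]

def gloss_cap_bama(analyses):
    """True if ANY BAMA analysis has a capitalized gloss."""
    return any(a["gloss"][:1].isupper() for a in analyses)

def gloss_cap_mada(selected):
    """True only if the single MADA-selected analysis has a capitalized gloss."""
    return selected["gloss"][:1].isupper()

print(gloss_cap_bama(analyses))     # True (because of "Ben")
print(gloss_cap_mada(analyses[0]))  # False if MADA picks the verb reading
```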

10 Results

             Base   BAMA   MADA
  F1         0.667  0.676  0.715
  Precision  0.714  0.713  0.735
  Recall     0.627  0.643  0.697

- Base: recall limited
- BAMA: marginal improvement
- MADA: 7-point improvement in recall (0.627 to 0.697) while also improving precision!

11 Overview
- Named Entity Recognition (NER)
- NER for Arabic: the Challenges
- Using Morphological Analysis and Disambiguation
- Error Analysis

12 Error Analysis

13 Spans or Tags?
Correct NER = span is correct AND tag is correct.
Question: how hard is each component of the problem?
Evaluate performance:
- on Span AND Tag (S&T) (same evaluation as before)
- on Span only (S)
Note: different evaluation set, thus different numbers for S&T compared to earlier.

             Base           MADA
             S&T    S       S&T    S
  F1         0.650  0.695   0.696  0.757
  Precision  0.703  0.752   0.723  0.787
  Recall     0.604  0.646   0.670  0.730

Conclusion: the harder problem in NER is the correct identification of the spans.
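A sketch of the two evaluation modes, treating system and gold output as sets of (start, end, label) triples; the mention representation is an assumption for illustration:

```python
# Sketch: Span&Tag vs Span-only scoring over (start, end, label) triples.
# The mention representation is an assumption for illustration.

def prf(system, gold):
    """Precision, recall, F1 over exact set matches."""
    tp = len(system & gold)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def spans_only(mentions):
    """Drop the label so only the span has to match."""
    return {(s, e) for s, e, _ in mentions}

gold = {(2, 4, "person"), (10, 13, "org")}
sys_ = {(2, 4, "person"), (10, 13, "gpe")}      # right span, wrong label

print(prf(sys_, gold))                          # Span&Tag: (0.5, 0.5, 0.5)
print(prf(spans_only(sys_), spans_only(gold)))  # Span only: (1.0, 1.0, 1.0)
```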

14 Errors by Type
Categorizing all errors in the development set by type:
- 44% are recall errors (we miss an NE)
- 16% are precision errors (we propose a false NE)
- 25% are span errors (we propose one or more false spans that overlap with a gold span)
- only 15% are label errors (the span is correct, the label is not)
Confirms the previous result that labels are not an important source of errors.
Recall errors (we do not find the NE) often involve very common entities.
Major way to improve results: improve recall, perhaps by use of a gazetteer.

15 System Combination Experiments
We have three systems; can we combine them?
- Baseline without morphology
- System with analyzer only (BAMA)
- System with disambiguated analysis (MADA)
Three combinations (see the sketch after the table):
- Oracle: choose the correct system
- Union: choose an NE if any system chooses it
- Intersection: choose an NE only if all systems choose it

         Base   BAMA   MADA   Union  Inters  Oracle
  Pre    0.703  0.701  0.723  0.612  0.843   0.880
  Rec    0.604  0.619  0.670  0.731  0.536   0.731
  F-1    0.650  0.657  0.696  0.666  0.655   0.798

Precision/recall tradeoff: union trades precision for recall, intersection the reverse. The oracle shows high potential.
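A minimal sketch of the union and intersection combiners over the three systems' mention sets; mentions are modeled as (start, end, label) triples, a representation assumed for illustration (the oracle is omitted since it needs gold labels):

```python
# Sketch: union / intersection combination of per-system mention sets.
# Mentions are (start, end, label) triples; the representation is assumed.

def union_combine(*system_outputs):
    """Keep an NE if ANY system proposes it: higher recall, lower precision."""
    out = set()
    for mentions in system_outputs:
        out |= mentions
    return out

def intersection_combine(*system_outputs):
    """Keep an NE only if ALL systems propose it: higher precision, lower recall."""
    out = set(system_outputs[0])
    for mentions in system_outputs[1:]:
        out &= mentions
    return out

base = {(2, 4, "person")}
bama = {(2, 4, "person"), (7, 9, "org")}
mada = {(2, 4, "person"), (7, 9, "org"), (12, 13, "gpe")}

print(union_combine(base, bama, mada))         # all three mentions
print(intersection_combine(base, bama, mada))  # only (2, 4, 'person')
```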

16 Conclusion
- Morphological disambiguation in context using MADA helps NER.
- More precisely, what helps NER is the case (uppercase/lowercase) of the English gloss of the MADA-selected entry!
Ideas for improvement:
- Gazetteer for recall improvement
- Use of lexemes (lemmatization, performed by MADA) in clustering
- Can other MT-based techniques help NER in NER-resource-poorer languages?

