
1 Cross-Domain Bootstrapping for Named Entity Recognition
Ang Sun, Ralph Grishman, New York University
July 28, 2011, Beijing. EOS Workshop, SIGIR 2011

2 Outline
1. Named Entity Recognition (NER)
2. Domain Adaptation Problem for NER
3. Cross-Domain Bootstrapping
   3.1 Feature Generalization with Word Clusters
   3.2 Instance Selection Based on Multiple Criteria
4. Conclusion

3 1. Named Entity Recognition (NER)
NER has two tasks: identification (finding the boundaries of a name) and classification (assigning it a type such as PERSON, ORG, or GPE).
Example: "U.S. Defense Secretary Donald H. Rumsfeld discussed the resolution ..."
Here "U.S." is a GPE, "Defense" an ORG, and "Donald H. Rumsfeld" a PERSON.

4 2. Domain Adaptation Problem for NER
The NYU NER system performs well on in-domain data (F-measure 83.08) but poorly on out-of-domain data (F-measure 65.09).
Source domain (news articles): George Bush; Donald H. Rumsfeld; ... Department of Defense ...
Target domain (reports on terrorism): Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud; ... Al-Qaeda in Iraq ...

5 2. Domain Adaptation Problem for NER
1. No annotated data from the target domain.
2. Many words are out-of-vocabulary.
3. Naming conventions differ:
   1. Length, short vs. long:
      source: George Bush; Donald H. Rumsfeld
      target: Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud
   2. Capitalization: weaker in the target domain.
4. Name variation occurs often in the target domain: Shaikh, Shaykh, Sheikh, Sheik, ...
We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.

6 3. Cross-Domain Bootstrapping
1. Train a tagger from labeled source data.
2. Tag all unlabeled target data with the current tagger.
3. Select good tagged words and add these to the labeled data.
4. Re-train the tagger.
(Diagram: labeled source data, with feature generalization, trains the tagger; the tagger labels unlabeled target data, e.g. "President Assad"; instance selection based on multiple criteria feeds selected instances back into the labeled data.)
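The four-step loop above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's system: the lookup-table "tagger" and the length-based `select` are toy stand-ins for the MEMM tagger and the multi-criteria selection described later.

```python
def train(labeled):
    """Toy 'tagger': a lookup table from word to its observed tag."""
    model = {}
    for word, tag in labeled:
        model[word] = tag
    return model

def tag_word(model, word):
    # Unknown words default to the not-a-name class.
    return model.get(word, "O")

def select(tagged, min_len=4):
    """Toy stand-in for multi-criteria instance selection:
    keep name-tagged tokens that look name-like (here: long enough)."""
    return [(w, t) for w, t in tagged if t != "O" and len(w) >= min_len]

def bootstrap(labeled_source, unlabeled_target, rounds=2):
    labeled = list(labeled_source)
    for _ in range(rounds):
        model = train(labeled)                                        # 1. train tagger
        tagged = [(w, tag_word(model, w)) for w in unlabeled_target]  # 2. tag target data
        labeled.extend(select(tagged))                                # 3. select good instances
    return train(labeled)                                             # 4. final re-trained tagger
```

The real algorithm replaces each stand-in with the components the following slides describe.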

7 3.1 Feature Generalization with Word Clusters
The source model:
- A sequential model, assigning name classes to a sequence of tokens.
- Each name type is split into two classes, e.g. B_PER (beginning of PERSON) and I_PER (continuation of PERSON).
- A Maximum Entropy Markov Model (McCallum et al., 2000).
- Customary features.
Example tagging: U.S. = B_GPE, Defense = B_ORG, Secretary = O, Donald = B_PER, H. = I_PER, Rumsfeld = I_PER.

8 3.1 Feature Generalization with Word Clusters
The source/seed model uses customary features extracted from the context window (t_i-2, t_i-1, t_i, t_i+1, t_i+2).
Example features for "Donald" in "U.S. Defense Secretary Donald H. Rumsfeld":
  currentToken = Donald
  wordType_currentToken = initial_capitalized
  previousToken_-1 = Secretary
  previousToken_-1_class = O
  previousToken_-2 = Defense
  nextToken_+1 = H.
  ...
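A sketch of this window-based feature extraction, using the feature names from the slide. The `word_type` helper is an assumption: only the `initial_capitalized` value is attested here, and a real system would distinguish more word shapes.

```python
def word_type(token):
    """Illustrative word-shape feature; real systems use a richer typology."""
    return "initial_capitalized" if token[:1].isupper() else "lower_case"

def extract_features(tokens, i, prev_classes):
    """Features for position i from the window (t_i-2 .. t_i+2).
    prev_classes holds the name classes already assigned to earlier tokens."""
    feats = {
        "currentToken": tokens[i],
        "wordType_currentToken": word_type(tokens[i]),
    }
    if i >= 1:
        feats["previousToken_-1"] = tokens[i - 1]
        feats["previousToken_-1_class"] = prev_classes[i - 1]
    if i >= 2:
        feats["previousToken_-2"] = tokens[i - 2]
    if i + 1 < len(tokens):
        feats["nextToken_+1"] = tokens[i + 1]
    if i + 2 < len(tokens):
        feats["nextToken_+2"] = tokens[i + 2]
    return feats
```

On the slide's example sentence, the token "Donald" at position 3 yields exactly the feature values listed above.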

9 3.1 Feature Generalization with Word Clusters
Build a word hierarchy from a 10M-word corpus (source + target) using the Brown word clustering algorithm, and represent each word as a bit string:
  110100011     John, James, Mike, Steven
  11010011101   Abdul, Mustafa, Abi, Abdel
  11010011111   Shaikh, Shaykh, Sheikh, Sheik
  111111110     Qaeda, Qaida, qaeda, QAEDA
  00011110000   FBI, FDA, NYPD
  000111100100  Taliban

10 3.1 Feature Generalization with Word Clusters
Add an additional layer of features that includes word clusters. A lexical feature such as currentToken = John fires only for that word, but a cluster-prefix feature such as currentPrefix3 = 110 also fires for target words in the same branch of the hierarchy.
To avoid commitment to a single cluster, cut the word hierarchy at different levels.
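Cutting the hierarchy at several levels can be sketched as below. The bit strings come from the slide's cluster table; the particular prefix lengths (3, 6, 9) are illustrative assumptions, not the paper's settings. Note how a source word ("John") and a target word ("Abdul") share short prefixes but diverge at longer ones.

```python
# Bit strings copied from the Brown-cluster table on the previous slide.
CLUSTERS = {
    "John":   "110100011",
    "Abdul":  "11010011101",
    "Sheikh": "11010011111",
}

def cluster_prefix_features(word, lengths=(3, 6, 9)):
    """Cluster-prefix features at several cuts of the hierarchy
    (prefix lengths here are illustrative)."""
    bits = CLUSTERS.get(word)
    if bits is None:
        return {}
    return {f"currentPrefix{n}": bits[:n] for n in lengths if n <= len(bits)}
```

Because "John" (110100011) and "Abdul" (11010011101) share the prefix 110100, a prefix feature learned from source-domain occurrences of "John" also fires for the target-domain word "Abdul".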

11 3.1 Feature Generalization with Word Clusters
Performance on the target domain:
- The test set contains 23K tokens.
- PERSON/ORGANIZATION/GPE: 771/585/559 instances; all other tokens belong to the not-a-name class.
- Word-cluster features give a 4-point improvement in F-measure.

12 3.2 Instance Selection Based on Multiple Criteria
Single-domain bootstrapping uses a confidence measure as the single selection criterion. In a cross-domain setting, however, the most confidently labeled instances are highly correlated with the source domain and contain little information about the target domain.
We propose multiple criteria.
Criterion 1: Novelty. Prefer target-specific instances: promote "Abdul" instead of "John".
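One simple way to realize the novelty criterion is to score a word by how rare it is in the source-domain training data, so out-of-vocabulary target words score highest. This scoring function is an assumption for illustration; the slide only states the preference, not a formula.

```python
def novelty(word, source_counts):
    """Hypothetical novelty score: 1.0 for words unseen in the source
    domain, decreasing as the word's source-domain frequency grows."""
    return 1.0 / (1.0 + source_counts.get(word, 0))
```

With source counts like {"John": 500}, an unseen target word such as "Abdul" scores 1.0 while "John" scores near 0, so selection promotes "Abdul" instead of "John", as the slide intends.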

13 3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence. Prefer confidently labeled instances.
- Local confidence: based on local features.

14 3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence.
- Global confidence: based on corpus statistics. Example occurrences of "Abdul" across the corpus:
   1. Prime Minister Abdul Karim Kabariti   PER
   2. warlord General Abdul Rashid Dostum   PER
   3. President A.P.J. Abdul Kalam will     PER
   4. President A.P.J. Abdul Kalam has      PER
   5. Abdullah bin Abdul Aziz ,             PER
   6. at King Abdul Aziz University         ORG
   7. Nawab Mohammed Abdul Ali ,            PER
   8. Dr Ali Abdul Aziz Al                  PER
   9. Nayef bin Abdul Aziz said             PER
  10. leader General Abdul Rashid Dostum    PER

15 3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence.
- Global confidence: based on corpus statistics.
- Combined confidence: the product of local and global confidence.
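A sketch of the combined score. The product of local and global confidence is stated on the slide; the particular global measure used here, the fraction of corpus occurrences tagged with the majority class, is an illustrative assumption consistent with the "Abdul" table (9 of 10 occurrences tagged PER).

```python
from collections import Counter

def global_confidence(taggings):
    """Assumed global measure: share of corpus occurrences of a word
    that received its majority class. taggings is a list of class labels."""
    counts = Counter(taggings)
    return counts.most_common(1)[0][1] / len(taggings)

def combined_confidence(local_conf, taggings):
    """Combined confidence: product of local and global confidence (per slide)."""
    return local_conf * global_confidence(taggings)
```

For "Abdul" in the table above (nine PER taggings, one ORG), the global confidence is 0.9, so an instance with local confidence 0.8 gets a combined score of 0.72.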

16 3.2 Instance Selection Based on Multiple Criteria
Criterion 3: Density. Prefer representative instances, which can be seen as centroid instances.
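One concrete reading of density is an instance's average similarity to the other candidates, so the centroid-like instance ranks first. Both the averaging scheme and the Jaccard similarity over feature sets are assumptions for illustration; the slide does not give the paper's exact formula.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature collections (assumed similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def density(instances):
    """Density of each instance: its average similarity to all other
    candidates, so a centroid-like instance gets the highest score."""
    scores = []
    for i, x in enumerate(instances):
        others = [jaccard(x, y) for j, y in enumerate(instances) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores
```

An instance that shares features with many candidates (a common context pattern) scores high; an outlier with rare features scores low and is deprioritized.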

17 3.2 Instance Selection Based on Multiple Criteria
Criterion 4: Diversity. Prefer a set of diverse instances rather than a set of similar ones.
Consider the context pattern ", said * in his":
- It is a highly confident instance.
- It is a high-density, representative instance.
- BUT continuing to promote such an instance would not gain additional benefit.

18 3.2 Instance Selection Based on Multiple Criteria
Putting all criteria together:
1. Novelty: filter out source-dependent instances.
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set.
3. Density: rank instances in the candidate set in descending order of density.
4. Diversity:
   1. accept the first instance (with the highest density) in the candidate set,
   2. then select other candidates based on the diff measure.
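The combined procedure can be sketched as below. This is a hypothetical reconstruction: the per-instance scores are assumed to be precomputed, and the `diff` measure (here 1 minus Jaccard overlap of feature sets, with an arbitrary threshold) stands in for the paper's unspecified diff measure.

```python
def diff(a, b):
    """Hypothetical diff measure: 1 - Jaccard overlap of feature sets."""
    a, b = set(a), set(b)
    return 1.0 - (len(a & b) / len(a | b) if a | b else 0.0)

def select_instances(instances, top_k=3, min_diff=0.2):
    """Apply the four criteria in order. Each instance is a dict with
    precomputed fields: 'novel' (bool), 'conf', 'density', 'feats'."""
    # 1. Novelty: filter out source-dependent instances.
    pool = [x for x in instances if x["novel"]]
    # 2. Confidence: top-ranked instances form the candidate set.
    pool.sort(key=lambda x: x["conf"], reverse=True)
    candidates = pool[:top_k]
    # 3. Density: rank candidates in descending order of density.
    candidates.sort(key=lambda x: x["density"], reverse=True)
    # 4. Diversity: accept the densest, then only sufficiently different ones.
    selected = candidates[:1]
    for cand in candidates[1:]:
        if all(diff(cand["feats"], s["feats"]) >= min_diff for s in selected):
            selected.append(cand)
    return selected
```

A near-duplicate of an already selected instance fails the diff threshold and is skipped, which is exactly the point of the diversity criterion: repeated copies of one pattern add no new information.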

19 3.2 Instance Selection Based on Multiple Criteria
Results.

20 4. Conclusion
- Proposed a general cross-domain bootstrapping algorithm for adapting a model trained only on a source domain to a target domain.
- Improved the source model's F score by around 7 points.
- This was achieved (1) without using any annotated data from the target domain and (2) without explicitly encoding any target-domain-specific knowledge into the system.
- The improvement is largely due to (1) the feature generalization of the source model with word clusters and (2) the multi-criteria instance selection method.

