1 Rapid Training of Information Extraction with Local and Global Data Views. Dissertation Defense, Ang Sun, Computer Science Department, New York University, April 30, 2012. Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian.

2 Outline
I. Introduction
II. Relation Type Extension: Active Learning with Local and Global Data Views
III. Relation Type Extension: Bootstrapping with Local and Global Data Views
IV. Cross-Domain Bootstrapping for Named Entity Recognition
V. Conclusion

3 Part I Introduction

4 Tasks
1. Named Entity Recognition (NER)
2. Relation Extraction (RE)
   i. Relation Extraction between Names
   ii. Relation Mention Extraction

5 NER
Bill Gates, born October 28, 1955 in Seattle, is the former chief executive officer (CEO) and current chairman of Microsoft.
NER identifies and classifies the names in the sentence:
Name | Type
Bill Gates | PERSON
Seattle | LOCATION
Microsoft | ORGANIZATION

6 RE
i. Relation Extraction between Names
Adam, a data analyst for ABC Inc.
NER finds Adam and ABC Inc.; RE extracts Employment(Adam, ABC Inc.)

7 ii. Relation Mention Extraction – Entity Extraction
Adam, a data analyst for ABC Inc.
Entity Mention | Entity
Adam | {Adam, a data analyst}
a data analyst | {Adam, a data analyst}
ABC Inc. | {ABC Inc.}

8 RE
ii. Relation Mention Extraction
Adam, a data analyst for ABC Inc.
RE extracts Employment(a data analyst, ABC Inc.)

9 Prior Work – Supervised Learning
Learn with labeled data – e.g. <Adam, ABC Inc., Employment>

10 Prior Work – Supervised Learning
Token-level annotation (P = PERSON, O = other):
O.J./P Simpson/P was/O arrested/O and/O charged/O with/O murdering/O his/O ex-wife/O ,/O Nicole/P Brown/P Simpson/P ,/O and/O her/O friend/O Ronald/P Goldman/P in/O 1994/O ./O
Expensive!

11 Prior Work – Supervised Learning
Expensive: a trained model is typically domain-dependent
– Porting it to a new domain usually involves annotating data from scratch

12 Prior Work – Supervised Learning
Annotation is tedious! Annotating the O.J. Simpson sentence token by token adds up quickly (15 minutes … 1 hour … 2 hours)

13 Prior Work – Semi-supervised Learning
Learn with both:
– labeled data (small)
– unlabeled data (large)
The learning is an iterative process (a minimal sketch follows):
1. Train an initial model with the labeled data
2. Apply the model to tag unlabeled data
3. Select good tagged examples as additional training examples
4. Re-train the model
5. Repeat from Step 2
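To make the loop concrete, here is a minimal self-training sketch in Python. The classifier interface (fit/predict_proba), the batch size k, and confidence-based selection are illustrative assumptions, not the specific system from this talk.

```python
def self_train(clf, X_labeled, y_labeled, unlabeled, n_iter=10, k=5):
    """Generic self-training loop; clf must expose fit() and predict_proba()."""
    X, y = list(X_labeled), list(y_labeled)
    pool = list(unlabeled)
    for _ in range(n_iter):
        clf.fit(X, y)                        # Steps 1 and 4: (re)train the model
        if not pool:
            break
        probs = clf.predict_proba(pool)      # Step 2: tag the unlabeled data
        # Step 3: keep the k most confidently tagged examples
        ranked = sorted(range(len(pool)),
                        key=lambda i: max(probs[i]), reverse=True)
        for i in ranked[:k]:
            X.append(pool[i])
            y.append(max(range(len(probs[i])), key=probs[i].__getitem__))
        kept = set(ranked[:k])
        pool = [x for i, x in enumerate(pool) if i not in kept]
    return clf
```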

14 Prior Work – Semi-supervised Learning
Problem 1: Semantic Drift
Example 1: a learner for PERSON names ends up learning flower names, because women's first names intersect with names of flowers (Rose, …)
Example 2: a learner for LocatedIn relation patterns ends up learning patterns for other relations (birthPlace, governorOf, …)

15 Prior Work – Semi-supervised Learning
Problem 2: Lack of a good stopping criterion. Most systems
– either use a fixed number of iterations
– or use a labeled development set to detect the right stopping point

16 Prior Work – Unsupervised Learning
Learn with only unlabeled data.
Unsupervised Relation Discovery:
– context-based clustering
– group pairs of named entities with similar contexts into the same relation cluster

17 Prior Work – Unsupervised Learning
Unsupervised Relation Discovery (Hasegawa et al., 2004)

18 Prior Work – Unsupervised Learning
Unsupervised Relation Discovery:
– The semantics of the clusters are usually unknown
– Some clusters are coherent → we can consistently label them
– Some are mixed, containing different topics → difficult to label

19 Part II Relation Type Extension: Active Learning with Local and Global Data Views

20 Relation Type Extension
Extend a relation extraction system to new types of relations.
ACE 2004 Relations:
Type | Example
EMP-ORG | the CEO of Microsoft
PHYS | a military base in Germany
GPE-AFF | U.S. businessman
PER-SOC | his ailing father
ART | US helicopters
OTHER-AFF | Cuban-American people
Multi-class Setting:
– Target relation: one of the ACE relation types
– Labeled data: 1) a few labeled examples of the target relation (possibly by random selection); 2) all labeled auxiliary relation examples
– Unlabeled data: all other examples in the ACE corpus

21 Relation Type Extension
Extend a relation extraction system to new types of relations (ACE 2004 relation types as on slide 20).
Binary Setting:
– Target relation: one of the ACE relation types
– Labeled data: a few labeled examples of the target relation (possibly by random selection)
– Unlabeled data: all other examples in the ACE corpus

22 LGCo-Testing
LGCo-Testing := co-testing with local and global views.
The general idea:
1. Train one classifier on the local view (the sentence that contains the pair of entities)
2. Train another classifier on the global view (distributional similarities between relation instances)
3. Reduce annotation cost by requesting labels only for contention data points

23 The local view
President Clinton traveled to the Irish border for an evening ceremony.
Token sequence features:
Feature | Value
Words before entity 1 | {NIL}
Words between | {travel, to}
Words after entity 2 | {for, an}
# words between | 2
Token pattern coupled with entity types | PERSON_traveled_to_LOCATION
Syntactic parsing tree feature – path of phrase labels connecting E1 and E2, augmented with the head word of the top phrase | NP--S--traveled--VP--PP

24 The local view
President Clinton traveled to the Irish border for an evening ceremony.
Dependency parsing tree feature – shortest path connecting the two entities, coupled with entity types: PER_nsubj'_traveled_prep_to_LOC

25 The local view
The local view classifier:
– Binary Setting: MaxEnt binary classifier
– Multi-class Setting: MaxEnt multi-class classifier

26 The global view
The general idea:
1. Compile a corpus of 2,000,000,000 tokens into a database of 7-grams (* * * * * * *)
2. Represent each relation instance as a relational phrase:
Relation Instance | Relational Phrase
Clinton traveled to the Irish border for … | traveled to
… his brother said that … | his brother
3. Compute distributional similarities between phrases using the 7-gram database
4. Build a relation classifier based on the k-nearest-neighbor idea

27 Compute distributional similarities
President Clinton traveled to the Irish border for an evening ceremony.
Query the 7-gram database for * * * traveled to * *; sample matches (count | 7-gram):
3 | 's headquarters here traveled to the U.S.
4 | laundering after he traveled to the country
3 | , before Paracha traveled to the United
3 | have never before traveled to the United
3 | had in fact traveled to the United
4 | two Cuban grandmothers traveled to the United
3 | officials who recently traveled to the United
6 | President Lee Teng-hui traveled to the United
4 | 1996, Clinton traveled to the United
4 | commission members have traveled to the United
4 | De Tocqueville originally traveled to the United
4 | Fernando Henrique Cardoso traveled to the United
3 | Elian 's grandmothers traveled to the United

28 Compute distributional similarities
Ang arrived in Seattle on Wednesday.
Query the 7-gram database for * * * arrived in * *; sample matches (count | 7-gram):
4 | Arafat, who arrived in the other
5 | of sorts has arrived in the new
5 | inflation has not arrived in the U.S.
3 | Juan Miguel Gonzalez arrived in the U.S.
3 | it almost certainly arrived in the New
44 | to Brazil, arrived in the country
4 | said Annan had arrived in the country
21 | he had just arrived in the country
5 | had not yet arrived in the country
3 | when they first arrived in the country
3 | day after he arrived in the country
5 | children who recently arrived in the country
4 | Iraq Paul Bremer arrived in the country
3 | head of counterterrorism arrived in the country
3 | election monitors have arrived in the country

29 Compute distributional similarities
President Clinton traveled to the Irish border …
– Represent each phrase as a feature vector of contextual tokens
– Compute the cosine similarity between two feature vectors
– Feature weight?

30 Feature weight: use frequency?
Features for traveled to (sorted by frequency; columns are ranks 1-10, 11-20, 21-30, 31-40):
R1_the | L1_have | R4_to | R3_in
L2_, | R2_and | R1_Washington | L2_and
R2_to | R2_in | R1_New | L1_He
L1_who | L3_. | R4_, | R1_a
R2_, | L1_and | R2_on | L1_also
L1_, | R3_to | R4_the | R3_a
L1_had | L2_who | R1_China | R2_with
L1_he | R2_for | L4_, | L3_the
L3_, | L4_. | L2_the | L2_when
L1_has | R3_the | R3_, | L1_then
Features for arrived in (sorted by frequency; same column layout):
R1_the | R1_Beijing | R2_in | R3_a
R2_on | L1_had | R3_, | R4_a
L1_who | R2_to | R2_for | R4_the
L2_, | R3_on | R3_for | L3_,
L1_, | L4_. | R2_from | R1_a
L3_. | R3_the | R4_for | L1_they
R2_, | R1_New | R3_to | R4_to
L1_he | R2_. | L3_the | R1_Moscow
L1_has | L2_when | R3_capital | L5_.
L1_have | R4_, | L2_the | L3_The

31 Feature weight: use tf-idf
tf: the number of corpus instances of P having feature f, divided by the number of instances of P
idf: the total number of phrases in the corpus, divided by the number of phrases with at least one instance having feature f
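Below is a sketch of this weighting plus the cosine step from slide 29, in Python. The function names and the aggregated feature-count inputs are illustrative; the slide defines idf as a plain ratio, so no log is applied (a log over idf is a common variant, noted in a comment).

```python
import math

def tfidf_vector(feature_counts, n_instances_of_p, phrase_df, n_phrases):
    """feature_counts: {feature -> #instances of phrase P with that feature}.
    tf  = count / #instances of P          (per the slide)
    idf = #phrases / #phrases having >= 1 instance with the feature
    (taking a log of idf is a common variant, omitted to follow the slide)."""
    return {f: (c / n_instances_of_p) * (n_phrases / phrase_df[f])
            for f, c in feature_counts.items() if f in phrase_df}

def cosine(u, v):
    """Cosine similarity between two sparse feature-weight dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```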

32 Feature weight: use tf-idf
Features for traveled to (sorted by tf-idf; columns are ranks 1-10, 11-20, 21-30, 31-40):
L1_had | L1_He | L1_then | R1_Beijing
L1_who | R1_New | L1_she | R1_London
L1_he | L2_who | L1_also | R2_for
L1_has | R1_China | R2_York | R2_in
L1_have | R2_, | R1_Afghanistan | L2_when
R2_to | L1_recently | L1_Zerhouni | R1_Baghdad
L1_, | R1_Thenia | L1_Clinton | R1_Mexico
R1_the | L1_and | L1_they | L2_He
R1_Washington | R1_Europe | L3_Nouredine | R4_to
L2_, | R1_Cuba | R2_and | R2_United
Features for arrived in (sorted by tf-idf; same column layout):
L1_who | R1_Baghdad | L1_they | R1_Seoul
R2_on | R1_Moscow | R2_Sunday | L5_.
R1_Beijing | L1_delegation | R2_Tuesday | R1_Damascus
L1_has | R3_capital | R1_Washington | R2_,
L1_he | R1_New | R3_Monday | R3_Wednesday
L1_have | L3_. | R2_Wednesday | R3_Thursday
L1_had | L2_, | R3_Sunday | R2_from
L1_, | L1_He | R2_Thursday | R1_Amman
R1_Cairo | L2_when | R3_Tuesday | L3_Minister
R1_the | R2_Monday | R2_York | R1_Belgrade

33 Compute distributional similarities
Sample of similar phrases (phrase | sim.):
for traveled to: | for his family:
visited | 0.779 | his staff | 0.792
arrived in | 0.763 | his brother | 0.789
worked in | 0.751 | his friends | 0.780
lived in | 0.719 | his children | 0.769
served in | 0.686 | their families | 0.753
consulted with | 0.672 | his teammates | 0.746
played for | 0.670 | his wife | 0.725

34 The global view classifier
k-nearest-neighbor classifier: classify an unlabeled example based on the closest labeled examples (using the phrase similarities from slide 33).
Labeled: President Clinton traveled to the Irish border → PHYS-LocatedIn; … his brother said that … → PER-SOC
Unlabeled: Ang Sun arrived in Seattle on Wednesday. → ?
sim(arrived in, traveled to) = 0.763, sim(arrived in, his brother) = 0.012 → PHYS-LocatedIn
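A sketch of the k-NN idea in Python: label an unlabeled relational phrase by its closest labeled phrases under the precomputed distributional similarity. The phrase_sim callable and the similarity-weighted vote are illustrative assumptions.

```python
def knn_label(phrase, labeled_phrases, phrase_sim, k=1):
    """labeled_phrases: [(phrase, relation_label)];
    phrase_sim(a, b) -> distributional similarity, e.g.
    phrase_sim('arrived in', 'traveled to') == 0.763."""
    nearest = sorted(labeled_phrases,
                     key=lambda pl: phrase_sim(phrase, pl[0]),
                     reverse=True)[:k]
    votes = {}                       # similarity-weighted vote over labels
    for p, label in nearest:
        votes[label] = votes.get(label, 0.0) + phrase_sim(phrase, p)
    return max(votes, key=votes.get)
```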

35 LGCo-Testing Procedure in Detail
Use KL-divergence to quantify the disagreement between the two classifiers.
KL-divergence:
– 0 for identical distributions
– maximal when the distributions are peaked and prefer different class labels
Selection:
– Rank instances in descending order of KL-divergence
– Pick the top 5 instances to request human labels in a single iteration
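A sketch of the selection step in Python. The epsilon smoothing is an assumption to guard against zero probabilities; the talk does not specify how zeros or ties are handled.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over two class distributions; 0 iff p == q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_queries(local_dists, global_dists, n=5):
    """Rank instances by disagreement between the two views and
    return the indices of the n contention points to annotate."""
    scores = [kl_divergence(p, q) for p, q in zip(local_dists, global_dists)]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n]
```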

36 Active Learning Baselines
– RandomAL: random sample selection
– UncertaintyAL: local view classifier; uncertainty-based sample selection
– UncertaintyAL+: local view classifier (with phrase cluster features); uncertainty-based sample selection
– SPCo-Testing: co-testing (sequence view classifier and parsing view classifier); sample selection by KL-divergence

37 Results for PER-SOC (Multi-class Setting)
Annotation speed: 4 instances per minute, i.e. 200 instances per hour (the annotator takes a 10-minute break each hour).
Supervised: 36K instances, 180 hours. LGCo-Testing: 300 instances, 1.5 hours.
Results for other types of relations show similar trends (in both binary and multi-class settings).

38 Precision-recall Curve of LGCo-Testing (Multi-class setting)

39 Comparing LGCo-Testing in the Two Settings
F1 difference (in percentage points) = F1 of active learning minus F1 of supervised learning.
The reduction in annotation cost from incorporating auxiliary types is more pronounced in early learning stages (#labels < 200) than in later ones.

40 Part III Relation Type Extension: Bootstrapping with Local and Global Data Views

41 Basic Idea Consider a bootstrapping procedure to discover semantic patterns for extracting relations between named entities

42 Basic Idea It starts from a few seed patterns, which are used to extract named entity (NE) pairs; these pairs in turn yield more semantic patterns learned from the corpus.

43 Basic Idea
Semantic drift occurs because
1) a pair of names may be connected by patterns belonging to multiple relations
2) the bootstrapping procedure looks at the patterns in isolation
Example: the pair <Bill Clinton, Arkansas> is connected by the patterns {visit, born in, fly to, governor of, arrive in, campaign in, …}

44 Unguided Bootstrapping vs. Guided Bootstrapping
Unguided – NE Pair Ranker:
– uses local evidence
– looks at the patterns in isolation
Guided – NE Pair Ranker:
– uses global evidence
– takes into account the clusters (C_i) of patterns

45 Unguided Bootstrapping
Initial settings:
– the seed patterns for the target relation R have precision 1, all other patterns 0
– all NE pairs have confidence 0

46 Unguided Bootstrapping
Step 1: Use seed patterns to match new NE pairs and evaluate the pairs.
– If many of the k patterns connecting the two names are high-precision patterns, then the name pair should have a high confidence
– The confidence of an NE pair is estimated from the precisions of the patterns connecting it (see the reconstruction below)
– Problem: this over-rates NE pairs which are connected by patterns belonging to multiple relations
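The slide's own equation is not in the transcript; a plausible reconstruction, assuming the Snowball-style noisy-or that matches the description ("many high-precision patterns ⇒ high confidence"), is:

```latex
\mathrm{Conf}(N_i) \;=\; 1 - \prod_{j=1}^{k}\bigl(1 - \mathrm{Prec}(p_j)\bigr)
```

where p_1, …, p_k are the patterns connecting the two names. Under this form, any pattern with nonzero precision raises the pair's confidence, which is exactly why pairs connected by patterns of multiple relations get over-rated.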

47 Unguided Bootstrapping
Step 2: Use NE pairs to search for new patterns and rank the patterns.
– Similarly, for a pattern p: if many of the NE pairs it matches are highly confident, then p has many supporters and should be ranked high
– The confidence of a pattern is estimated from |H|, the number of unique NE pairs matched by p, and Sup(p), the sum of the support from the |H| pairs

48 Unguided Bootstrapping
Step 2 (continued):
– Sup(p) is the sum of the support p gets from the |H| pairs
– The precision of p is given by the average confidence of the NE pairs matched by p
– This normalizes the precision to range from 0 to 1; as a result, the confidence of each NE pair is also normalized to between 0 and 1
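Written out directly from the description above (the slide's formula image is not in the transcript), with H the set of unique NE pairs matched by p:

```latex
\mathrm{Sup}(p) \;=\; \sum_{h=1}^{|H|} \mathrm{Conf}(N_h),
\qquad
\mathrm{Prec}(p) \;=\; \frac{\mathrm{Sup}(p)}{|H|}
```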

49 Unguided Bootstrapping
Step 3: Accept patterns – accept the K top-ranked patterns from Step 2.
Step 4: Loop or stop – the procedure decides whether to repeat from Step 1 or to terminate. Most systems simply do NOT know when to stop.

50 Guided Bootstrapping
Pattern clusters – clustering steps:
I. Extract features for patterns
II. Compute the tf-idf value of extracted features
III. Compute the cosine similarity between patterns
IV. Build a pattern hierarchy by complete linkage
Sample features for “X visited Y”, as in “Jordan visited China”.

51 Guided Bootstrapping
Pattern clusters:
– We cut the pattern hierarchy at a similarity threshold of 0.005 to generate clusters
– This cutoff is chosen by trying a series of thresholds and taking the maximal one that places the seed patterns for each relation into a single cluster
– We define the target cluster C_t as the one containing the seeds

52 Guided Bootstrapping Pattern cluster example – Top 15 patterns in the Located-in Cluster

53 Guided Bootstrapping
Step 1: Use seed patterns to match new NE pairs and evaluate the pairs. Global_Conf(N_i) measures the degree of association between N_i and the target cluster C_t: the number of times patterns in C_t match N_i, divided by the total number of pattern instances matching N_i.

54 Guided Bootstrapping
Step 1: Why does this give a better confidence estimate?
For a drifting pair (e.g. <Bill Clinton, Arkansas> from slide 43) evaluated for the Located-in relation:
– Local_Conf(N_i) is very high
– Global_Conf(N_i) is very low (less than 0.1)
– so Conf(N_i) is low: the high Local_Conf(N_i) is discounted by the low Global_Conf(N_i)

55 Guided Bootstrapping
Step 2: Use NE pairs to search for new patterns and rank the patterns.
– All the measurement functions are the same as those used in unguided bootstrapping
– However, with the better ranking of NE pairs from Step 1, the patterns are also ranked better
Step 3: Accept patterns – we also accept the K top-ranked patterns

56 Guided Bootstrapping
Step 4: Loop or stop. Since each pattern in our corpus has a cluster membership, we can monitor semantic drift easily and stop naturally:
– the procedure drifts when it tries to accept patterns which do not belong to the target cluster
– we stop when the procedure tends to accept more patterns outside of the target cluster
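A sketch of this stopping test in Python; the 0.5 drift threshold is an illustrative choice, not a value from the talk.

```python
def should_stop(newly_accepted, cluster_of, target_cluster, max_outside=0.5):
    """Stop when the bootstrapping procedure starts accepting more
    patterns outside the target cluster than inside it."""
    outside = sum(1 for p in newly_accepted
                  if cluster_of.get(p) != target_cluster)
    return outside / len(newly_accepted) > max_outside
```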

57 Experiments
Pattern clusters: computed from a corpus of 1.3 billion tokens.
Evaluation data: ACE 2004 training data (no relation annotation between each pair of names); we take advantage of entity co-reference information to automatically re-annotate the relations, and the annotation was reviewed by hand.
Evaluation method: direct evaluation with strict pattern match.

58 Experiments
Red: guided bootstrapping; Blue: unguided bootstrapping. Drift: the percentage of false positives belonging to ACE relations other than the target relation.

59 Experiments
Red: guided bootstrapping; Blue: unguided bootstrapping. Drift: defined as above.

60 Experiments
Red: guided bootstrapping; Blue: unguided bootstrapping. Drift: defined as above.

61 Experiments
Guided bootstrapping terminates while precision is still high and recall is still reasonable; it also effectively prevented semantic drift.

62 Part IV Cross-Domain Bootstrapping for Named Entity Recognition
Semi-supervised NER across domains: adapt a tagger from a source domain to a target domain.

63 NER Model
– Maximum Entropy Markov Model (MEMM; McCallum et al., 2000): a Maximum Entropy classifier decoded with the Viterbi algorithm
– Split each name type into two classes: B_PER (beginning of PERSON), I_PER (continuation of PERSON)
Example: U.S./B_GPE Defense/B_ORG Secretary/O Donald/B_PER H./I_PER Rumsfeld/I_PER

64 NER Model
– Estimate the name class of each individual token t_i
– Extract a feature vector from the local context window (t_i-2, t_i-1, t_i, t_i+1, t_i+2)
– Learn feature weights using a Maximum Entropy model
For the token Donald in “U.S. Defense Secretary Donald H. Rumsfeld”:
Feature | Value
currentToken | Donald
wordType_currentToken | initial_capitalized
previousToken_-1 | Secretary
previousToken_-1_class | O
previousToken_-2 | Defense
nextToken_+1 | H.
… | …

65 NER Model
– Estimate the name classes of the whole token sequence: search for the most likely tag path (argmax over all paths)
– Use dynamic programming: with N := number of name classes and L := length of the token sequence, there are N^L possible paths, and Viterbi finds the best one without enumerating them
Tag lattice per token: B-PER, I-PER, B-ORG, I-ORG, B-GPE, I-GPE, O
U.S./B_GPE Defense/B_ORG Secretary/O Donald/B_PER H./I_PER Rumsfeld/I_PER
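A compact Viterbi sketch in Python. log_prob is an assumed callable returning log P(class | previous class, token i), i.e. the MEMM's MaxEnt score; the decoder finds the best of the N^L paths by exploring only O(L·N²) transitions.

```python
def viterbi(tokens, classes, log_prob):
    """Return the most likely class path for the token sequence."""
    # Initialize with the first token (no previous class)
    best = {c: (log_prob(None, c, 0), [c]) for c in classes}
    for i in range(1, len(tokens)):
        # For each class, keep the best-scoring extension of any previous path
        best = {
            c: max(((score + log_prob(prev, c, i), path + [c])
                    for prev, (score, path) in best.items()),
                   key=lambda sp: sp[0])
            for c in classes
        }
    return max(best.values(), key=lambda sp: sp[0])[1]
```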

66 Domain Adaptation Problems
Source domain (news articles): George Bush, Donald H. Rumsfeld, …, Department of Defense, …
Target domain (reports on terrorism): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, …, Al-Qaeda in Iraq, …
Q (target domain): What is the weight of the feature currentToken=Abdul?
A (source domain): Sorry, I don’t know. I’ve never seen this guy in my training data.

67 Domain Adaptation Problems
1. Many words are out-of-vocabulary
2. Naming conventions are different:
   1. Length: short vs. long
   2. Capitalization: weaker in the target domain
3. Name variation occurs often in the target domain: Shaikh, Shaykh, Sheikh, Sheik, …
We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.

68 The Benefits of Incorporating Global Data View – Feature Generalization
The global data view comes to the rescue: build a word hierarchy from a 10M-word corpus (source + target) using the Brown word clustering algorithm.
Bit string | Examples
110100011 | John, James, Mike, Steven
11010011101 | Abdul, Mustafa, Abi, Abdel
11010011111 | Shaikh, Shaykh, Sheikh, Sheik
111111110 | Qaeda, Qaida, qaeda, QAEDA
00011110000 | FBI, FDA, NYPD
000111100100 | Taliban

69 The Benefits of Incorporating Global Data View – Feature Generalization
Add an additional layer of features that includes word clusters: currentToken = John also yields currentPrefix3 = 110, and the prefix feature fires for target words too.
To avoid commitment to a single cluster: cut the word hierarchy at different levels.
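A sketch of the generalized feature extraction in Python. The cluster table is a toy excerpt of the bit strings from the previous slide, and the prefix lengths are illustrative; the point is that rare target-domain words share prefix features with frequent source-domain words.

```python
# Toy excerpt of Brown-cluster bit strings (from the table on slide 68)
BROWN = {"John": "110100011", "Abdul": "11010011101",
         "Shaikh": "11010011111", "Sheikh": "11010011111"}

def token_features(token, prefix_lengths=(3, 5, 8)):
    """Lexical feature plus cluster-prefix features at several depths."""
    feats = {f"currentToken={token}"}
    bits = BROWN.get(token)
    if bits:
        for n in prefix_lengths:
            feats.add(f"currentPrefix{n}={bits[:n]}")
    return feats

# token_features("John") and token_features("Abdul") now share
# currentPrefix3=110: the prefix feature fires for target words too.
```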

70 The Benefits of Incorporating Global Data View – Feature Generalization
Performance on the target domain:
Model | P | R | F1
Source_Model | 70.02 | 61.86 | 65.69
Source_Model + Word Clusters | 72.82 | 66.61 | 69.58

71 The Benefits of Incorporating Global Data View – Instance Selection
Cross-domain bootstrapping algorithm:
1. Train a tagger from labeled source data (with feature generalization)
2. Tag all unlabeled target data with the current tagger
3. Select good tagged words (e.g. President Assad) using multiple criteria and add them to the labeled data
4. Re-train the tagger

72 The Benefits of Incorporating Global Data View – Instance Selection
Multiple criteria:
– Criterion 1: Novelty – prefer target-specific instances (promote Abdul instead of John)
– Criterion 2: Confidence – prefer confidently labeled instances

73 The Benefits of Incorporating Global Data View -- Instance Selection  Criterion 2: Confidence - prefer confidently labeled instances  Local confidence: based on local features

74 The Benefits of Incorporating Global Data View – Instance Selection
Criterion 2: Confidence – global confidence, based on corpus statistics (occurrences of Abdul):
1. Prime Minister Abdul Karim Kabariti → PER
2. warlord General Abdul Rashid Dostum → PER
3. President A.P.J. Abdul Kalam will → PER
4. President A.P.J. Abdul Kalam has → PER
5. Abdullah bin Abdul Aziz , → PER
6. at King Abdul Aziz University → ORG
7. Nawab Mohammed Abdul Ali , → PER
8. Dr Ali Abdul Aziz Al → PER
9. Nayef bin Abdul Aziz said → PER
10. leader General Abdul Rashid Dostum → PER

75 The Benefits of Incorporating Global Data View -- Instance Selection  Criterion 2: Confidence  Global confidence  Combined confidence: product of local and global confidence

76 The Benefits of Incorporating Global Data View -- Instance Selection  Criterion 3: Density - prefer representative instances which can be seen as centroid instances

77 The Benefits of Incorporating Global Data View – Instance Selection
Criterion 4: Diversity – prefer a set of diverse instances instead of similar ones.
“, said * in his” is a highly confident, high-density, representative instance, BUT continuing to promote such instances would not gain additional benefit.

78 The Benefits of Incorporating Global Data View – Instance Selection
Putting all criteria together (a sketch follows):
1. Novelty: filter out source-dependent instances
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set
3. Density: rank instances in the candidate set in descending order of density
4. Diversity: accept the first instance (with the highest density) in the candidate set, then select further candidates based on the diff measure
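A sketch of the combined pipeline in Python. The novelty, confidence, density, and diff callables stand in for the four criteria above; top_k and min_diff are illustrative thresholds, not values from the talk.

```python
def select_instances(instances, novelty, confidence, density, diff,
                     top_k=100, min_diff=0.5):
    # 1. Novelty: filter out source-dependent instances
    novel = [x for x in instances if novelty(x)]
    # 2. Confidence: the top-ranked instances form the candidate set
    candidates = sorted(novel, key=confidence, reverse=True)[:top_k]
    # 3. Density: rank candidates in descending order of density
    candidates.sort(key=density, reverse=True)
    # 4. Diversity: accept the densest candidate, then only candidates
    #    sufficiently different (per the diff measure) from those selected
    selected = candidates[:1]
    for x in candidates[1:]:
        if all(diff(x, s) >= min_diff for s in selected):
            selected.append(x)
    return selected
```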

79 The Benefits of Incorporating Global Data View -- Instance Selection  Results

80 Part V Conclusion

81 Contribution
The main contribution is the use of both local and global evidence for fast system development:
– The co-testing procedure reduced annotation cost by 97%
– The use of pattern clusters as the global view in bootstrapping not only greatly improved the quality of learned patterns but also contributed to a natural stopping criterion
– Feature generalization and instance selection in cross-domain bootstrapping improved the source model's performance on the target domain by 7% F1 without annotating any target-domain data

82 Future Work
Active learning for relation type extension:
– conduct real-world active learning
– combine semi-supervised learning with active learning to further reduce annotation cost
Semi-supervised learning for relation type extension:
– better seed selection strategy
Cross-domain bootstrapping for named entity recognition:
– extract dictionary-based features to further generalize lexical features
– combine with distantly annotated data to further improve performance

83 Thanks!

84 ?

85 Backup slides

86 Experimental Setup for Active Learning
ACE 2004 data: 4.4K relation instances, 45K non-relation instances.
5-fold cross validation:
– roughly 36K unlabeled instances (45K ÷ 5 × 4)
– random initialization (repeated 10 times)
– 50 runs in total
– each iteration selects 5 instances for annotation; 200 iterations are performed


