
1 Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani, Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)

2 What is Information Extraction?
Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships.
Recent commercial applications:
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

3 IE Approaches
- Hand-constructed rules
- Supervised learning: still costly to train and port to new domains
  - 3-6 months to port to a new domain (Cardie 98)
  - 20,000 words to learn named entity extraction (Seymore et al. 99)
  - 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
- Semi-supervised learning

4 Semi-Supervised Approaches
Several algorithms have been proposed for different tasks (semantic tagging, text categorization) and tested on different corpora: Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
Goal: systematically analyze and test
- the assumptions underlying the algorithms
- the effectiveness of the algorithms on a common set of problems and a common corpus

5 Tasks
Extract noun phrases belonging to the following semantic classes:
- Locations
- Organizations
- People

6 Aren’t you missing the obvious? (Named Entity Extraction?)
Acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names
But not all instances are proper nouns: *by the river*, *customer*, *client*

7 Use context to disambiguate
- A lot of NPs are unambiguous: "the corporation"
- A lot of contexts are also unambiguous: "subsidiary of"
- But as always, there are exceptions… and a LOT of them in this case: customer, John Hancock, Washington

8 Bootstrapping Approaches
Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in, traveled to
Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs
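The two-model idea above can be pictured as a bipartite co-occurrence structure. A minimal sketch in Python (the phrases and contexts here are toy data, not from the talk's corpus):

```python
from collections import defaultdict

# Toy (noun phrase, context) co-occurrence pairs; "_" marks the NP slot.
occurrences = [
    ("new york", "located in _"),
    ("china", "located in _"),
    ("china", "traveled to _"),
    ("place we met last time", "traveled to _"),
]

# The two redundant views of each instance.
np_to_ctx = defaultdict(set)
ctx_to_np = defaultdict(set)
for np, ctx in occurrences:
    np_to_ctx[np].add(ctx)
    ctx_to_np[ctx].add(np)

# Labeling one NP as a location labels its contexts,
# and those contexts in turn reach previously unlabeled NPs.
seed = "new york"
reached_ctx = np_to_ctx[seed]
reached_nps = set().union(*(ctx_to_np[c] for c in reached_ctx))
print(reached_nps)  # "china" is reached via the shared "located in _" context
```

Each hop across the bipartite links is one half-step of a bootstrapping iteration; the algorithms below differ mainly in how they score and commit to these hops.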

9 Interesting Dimensions for Bootstrapping Algorithms
- Incremental vs. Iterative
- Symmetric vs. Asymmetric
- Probabilistic vs. Heuristic

10 Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999): Incremental, Asymmetric, Heuristic
- Co-Training (Blum & Mitchell, 1998): Incremental, Symmetric, Probabilistic(?)
- Co-EM (Nigam & Ghani, 2000): Iterative, Symmetric, Probabilistic
Baselines:
- Seed-labeling: label all NPs that match the seeds
- Head-labeling: label all NPs whose head matches the seeds
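The two baselines are simple enough to state in code. A sketch, assuming NPs are lowercase token strings and taking the head to be the last token of the phrase (a common simplification the slides don't spell out):

```python
def seed_label(nps, seeds):
    """Seed-labeling: label an NP only if it matches a seed exactly."""
    return {np for np in nps if np in seeds}

def head_label(nps, seeds):
    """Head-labeling: label an NP if its head noun matches a seed.
    Assumes the head is the final token of the phrase."""
    return {np for np in nps if np.split()[-1] in seeds}

seeds = {"company", "companies"}
nps = ["the company", "a software company", "company picnic", "xerox"]
print(seed_label(nps, seeds))  # set() -- no exact matches
print(head_label(nps, seeds))  # {'the company', 'a software company'}
```

Head-labeling generalizes seed-labeling, which is why it makes a stronger baseline for multi-word NPs.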

11 Data Set
- ~4200 corporate web pages (WebKB project at CMU)
- Test data marked up manually by labeling every NP with one or more of the following semantic categories: location, organization, person, none
- Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

12 Seeds
Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, rayonier

13 Intuition Behind Bootstrapping
Noun phrases: the dog, australia, france, the canary islands
Contexts: ran away, travelled to, is beautiful
Noun phrases and contexts co-occur, so each column can be used to label the other.

14 Co-Training (Blum & Mitchell, 98)
Incremental, symmetric, probabilistic
1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add the n top-scoring contexts for both the positive and negative class
4. Use the new contexts to label all NPs
5. Add the n top-scoring NPs for both the positive and negative class
6. Loop
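The six steps can be sketched as a loop over the two views. Here context and NP scores are naive co-occurrence counts with the current labeled sets, standing in for the classifier scores of the actual algorithm; all data and names are illustrative:

```python
from collections import Counter

def cotrain(pairs, pos_seeds, neg_seeds, n=2, rounds=5):
    """Incremental co-training over (NP, context) co-occurrence pairs.
    Scores are raw co-occurrence counts with the current labeled sets."""
    pos_nps, neg_nps = set(pos_seeds), set(neg_seeds)  # step 1: NP seeds
    pos_ctx, neg_ctx = set(), set()
    for _ in range(rounds):                            # step 6: loop
        # Steps 2-3: use labeled NPs to score contexts, keep top n per class.
        pos_s, neg_s = Counter(), Counter()
        for np, ctx in pairs:
            if np in pos_nps:
                pos_s[ctx] += 1
            elif np in neg_nps:
                neg_s[ctx] += 1
        pos_ctx |= {c for c, _ in pos_s.most_common(n)}
        neg_ctx |= {c for c, _ in neg_s.most_common(n)}
        # Steps 4-5: use labeled contexts to score NPs, keep top n per class.
        pos_s, neg_s = Counter(), Counter()
        for np, ctx in pairs:
            if ctx in pos_ctx:
                pos_s[np] += 1
            elif ctx in neg_ctx:
                neg_s[np] += 1
        pos_nps |= {p for p, _ in pos_s.most_common(n)}
        neg_nps |= {p for p, _ in neg_s.most_common(n)}
    return pos_nps, pos_ctx

pairs = [("france", "traveled to _"), ("china", "traveled to _"),
         ("the dog", "_ ran away")]
nps, ctxs = cotrain(pairs, {"france"}, {"the dog"})
print(nps)  # "china" is pulled in via the shared "traveled to _" context
```

Because only the top n items per round are committed, early scoring mistakes are locked in, which is the cost of the incremental design.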

15 Co-EM (Nigam & Ghani, 2000)
Iterative, Symmetric, Probabilistic
- Similar to Co-Training
- Probabilistically labels and adds all NPs and contexts to the labeled set
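Co-EM's key difference from co-training is that nothing is committed incrementally: every NP and context carries a soft label, and all of them are re-estimated each iteration. A minimal sketch, using simple probability averaging in place of the paper's Naive Bayes estimator (data and names are illustrative):

```python
from collections import defaultdict

def co_em(pairs, seed_probs, iterations=10):
    """Iterative soft labeling over two views.
    seed_probs maps seed NPs to P(positive class); seeds stay clamped.
    Every other NP starts at 0.5 and is re-estimated each iteration."""
    np_p = defaultdict(lambda: 0.5, seed_probs)
    np_ctxs, ctx_nps = defaultdict(list), defaultdict(list)
    for np, ctx in pairs:
        np_ctxs[np].append(ctx)
        ctx_nps[ctx].append(np)
    ctx_p = {}
    for _ in range(iterations):
        # Label ALL contexts from the current NP probabilities...
        ctx_p = {c: sum(np_p[n] for n in nps) / len(nps)
                 for c, nps in ctx_nps.items()}
        # ...then relabel ALL non-seed NPs from the context probabilities.
        for n, ctxs in np_ctxs.items():
            if n not in seed_probs:
                np_p[n] = sum(ctx_p[c] for c in ctxs) / len(ctxs)
    return dict(np_p), ctx_p

pairs = [("france", "traveled to _"), ("china", "traveled to _"),
         ("the dog", "_ ran away")]
np_p, ctx_p = co_em(pairs, {"france": 1.0, "the dog": 0.0})
print(round(np_p["china"], 3))  # converges toward 1.0 across iterations
```

The soft labels never force a hard decision on an ambiguous item, which matters for the ambiguity results later in the talk.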

16 Meta-Bootstrapping (Riloff & Jones, 99)
Incremental, Asymmetric, Heuristic
Two-level process:
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
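The frequency-and-diversity scoring in Riloff & Jones (1999) is the RlogF metric: score = (F/N) * log2(F), where F is the number of distinct known category members a context extracts and N the total number of distinct NPs it extracts. A sketch under that reading:

```python
import math

def rlogf(extracted_nps, category_members):
    """RlogF score for one context: F/N rewards reliability,
    log2(F) rewards extracting many distinct category members."""
    nps = set(extracted_nps)
    f = len(nps & category_members)
    n = len(nps)
    if f == 0:
        return float("-inf")  # never extracts a known category member
    return (f / n) * math.log2(f)

members = {"france", "china", "germany"}
print(rlogf(["france", "china", "the dog"], members))  # (2/3)*log2(2) = 0.666...
print(rlogf(["france"], members))                      # 1.0*log2(1) = 0.0
```

Note how a context that extracts a single seed scores 0: diversity, not just precision, drives the ranking.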

17 Common Assumptions
Seeds:
- Seed density in the corpus
- Head-labeling accuracy
- Syntactic-semantic agreement
Redundancy:
- Feature sets are redundant and sufficient
- Labeling disagreement

18 Feature Set Ambiguity
Feature sets: NPs and contexts.
If the feature sets were redundantly sufficient, either one alone would be enough to correctly classify the instance.
Calculate the ambiguity for each feature set.
Example of an ambiguous instance: Washington, <went to _>, <visit _>

19 2% NP Ambiguity

Ambiguity  Class(es)           Number of NPs
1          None                3574
1          Location            114
1          Organization        451
1          Person              189
2          Location, None      6
2          Organization, None  31
2          Person, None        25
2          Loc, Org            6
2          Org, Person         13
3          Loc, Org, None      13
3          Org, Person, None   13

20 36% Context Ambiguity

Ambiguity  Class(es)            Number of Contexts
1          None                 1068
1          Location             25
1          Organization         98
1          Person               59
2          Location, None       51
2          Organization, None   271
2          Person, None         206
2          Loc, Org             5
2          Org, Person          50
3          Loc, Org, None       18
3          Org, Person, None    83
4          Loc, Org, Per, None  6
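The headline percentage follows directly from the counts in the table. A quick arithmetic check (counts copied from this slide):

```python
# Context label counts from the table above.
unambiguous = 1068 + 25 + 98 + 59                  # exactly one label
ambiguous = 51 + 271 + 206 + 5 + 50 + 18 + 83 + 6  # two or more labels
total = unambiguous + ambiguous
print(total, round(100 * ambiguous / total))  # 1940 contexts, 36% ambiguous
```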

21 Labeling Disagreement
Agreement among human labelers: same set of instances, but different levels of information shown
- NP only
- Context only
- NP and context
- NP, context, and the entire sentence from the corpus

22 Labeling Disagreement
- 90.5% agreement when the NP, context, and sentence are all given
- 88.5% when the sentence is not given

23 Results
Comparing bootstrapping algorithms (Meta-Bootstrapping, Co-Training, Co-EM) on each class (Locations, Organizations, People).

24-26 [Results plots comparing Co-EM, MetaBoot, and Co-Training, one per semantic class; plot data not recoverable from the transcript]

27 More Results
- Bootstrapping outperforms both baselines
- The improvement is less pronounced for the "people" class
- Ambiguous classes don't benefit as much from bootstrapping?

28 Why does Co-EM work well?
- Co-EM outperforms Meta-Bootstrapping and Co-Training
- Co-EM is probabilistic and does not make hard classifications
- Its soft labels reflect the ambiguity among classes

29 Summary
- Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
- Probabilistic bootstrapping with redundant feature sets is effective, even for ambiguous classes
- Co-EM performs robustly even when the underlying assumptions are violated

30 Ongoing Work
- Varying initial seed size and type
- Collecting the training corpus automatically (from the Web)
- Incorporating the user in the loop (active learning)

