
1 Information Extraction with Unlabeled Data
Rayid Ghani
Joint work with: Rosie Jones (CMU), Tom Mitchell (CMU & WhizBang! Labs), Ellen Riloff (University of Utah)

2 The Vision
[Architecture diagram: a Training Program learns Extractor Models from training sentences and answers; Information Extraction yields Entities, Relations, and Events stored in a Data Base with Time Line, Geo Display, Link Analysis, and Tables views]

3 What is IE?
Analyze unrestricted text in order to extract information about pre-specified types of events, entities, or relationships

4 Practical / Commercial Applications
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

5 Where is the world now?
- MUC helped drive information extraction research, but most systems were fine-tuned for terrorist activities
- Commercial systems can detect names of people, locations, and companies (only for proper nouns)
- Very costly to train and port to new domains:
  - 3-6 months to port to a new domain (Cardie 98)
  - 20,000 words to learn named entity extraction (Seymore et al 99)
  - 7,000 labeled examples to learn MUC extraction rules (Soderland 99)

6 IE Approaches
- Hand-constructed rules
- Supervised learning
- Semi-supervised learning

7 Goal
Can you start with 5-10 seeds and learn to extract other instances?
Example tasks:
- Locations
- Products
- Organizations
- People

8 Aren’t you missing the obvious?
Not really! You could acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names
But not all instances are proper nouns: *by the river*, *customer*, *client*

9 Use context to disambiguate
- Many NPs are unambiguous: “the corporation”
- Many contexts are also unambiguous: “subsidiary of”
- But as always, there are exceptions... and a LOT of them in this case: “customer”, John Hancock, Washington

10 Bootstrapping Approaches
Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in, traveled to
Learn two models (sketched below):
- Use NPs to label contexts
- Use contexts to label NPs
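A minimal sketch of the two-model loop, assuming the corpus has been reduced to (noun phrase, context) co-occurrence pairs; the greedy top-k scoring here is illustrative only, not the exact method of any of the algorithms cited on the next slide:

```python
from collections import Counter

def bootstrap(pairs, seed_nps, n_iterations=10, top_k=5):
    """Mutually grow a lexicon of NPs and a set of extraction contexts.

    pairs: list of (noun_phrase, context) co-occurrences from the corpus.
    seed_nps: small set of seed noun phrases for one semantic class.
    """
    nps = set(seed_nps)
    contexts = set()
    for _ in range(n_iterations):
        # Score each context by how many known NPs it co-occurs with.
        ctx_scores = Counter(c for np, c in pairs if np in nps)
        contexts |= {c for c, _ in ctx_scores.most_common(top_k)}
        # Score each unknown NP by how often it appears in learned contexts.
        np_scores = Counter(np for np, c in pairs
                            if c in contexts and np not in nps)
        nps |= {np for np, _ in np_scores.most_common(top_k)}
    return nps, contexts

# Illustrative usage with toy data:
pairs = [("new york", "located in"), ("china", "traveled to"),
         ("paris", "located in"), ("paris", "traveled to")]
lexicon, patterns = bootstrap(pairs, seed_nps={"new york", "china"})
```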

11 Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999)
- Co-Training (Blum & Mitchell, 1999)
- Co-EM (Nigam & Ghani, 2000)

12 Data Set
- ~5000 corporate web pages (4000 for training)
- Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, product, none
- Preprocessed (parsed) to generate extraction patterns using AutoSlog (Riloff, 1996)
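For intuition, a toy stand-in for extraction-pattern generation: it captures the words immediately around a known NP. AutoSlog itself relies on full syntactic analysis and a richer pattern inventory; this only illustrates the idea:

```python
def context_patterns(sentence, noun_phrase):
    """Toy pattern generator: keep the couple of words to the
    left and right of a known noun phrase as context patterns."""
    tokens = sentence.lower().split()
    np_tokens = noun_phrase.lower().split()
    patterns = []
    for i in range(len(tokens) - len(np_tokens) + 1):
        if tokens[i:i + len(np_tokens)] == np_tokens:
            left = " ".join(tokens[max(0, i - 2):i])
            right = " ".join(tokens[i + len(np_tokens):i + len(np_tokens) + 2])
            patterns.append((left + " <NP>").strip())
            patterns.append(("<NP> " + right).strip())
    return patterns

print(context_patterns("IBM is a subsidiary of nothing", "IBM"))
# ['<NP>', '<NP> is a']
```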

13 Evaluation Criteria
- Every test NP is labeled with a confidence score by the learned model
- Calculate precision and recall at different thresholds:
  - Precision = Correct / Found
  - Recall = Correct / Max that can be found
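A minimal sketch of sweeping the confidence threshold, assuming each test NP carries a model score and a flag for whether it truly belongs to the class (variable names are illustrative):

```python
def precision_recall_curve(scored_nps, total_positives, thresholds):
    """scored_nps: list of (confidence, is_correct) for every test NP.
    total_positives: number of NPs that truly belong to the class."""
    curve = []
    for t in thresholds:
        found = [ok for score, ok in scored_nps if score >= t]
        correct = sum(found)
        # Convention: precision is 1.0 when nothing exceeds the threshold.
        precision = correct / len(found) if found else 1.0
        recall = correct / total_positives
        curve.append((t, precision, recall))
    return curve

# Usage: points = precision_recall_curve(preds, 120, [0.5, 0.7, 0.9])
```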

14 Seeds

15 Results

16 Active Learning
Can we do better by keeping the user in the loop? If we can ask the user to label any examples, which examples should they be?
- Selected randomly
- Selected according to their density/frequency
- Selected according to disagreement between NP and context (KL divergence to the mean, weighted by density)

17 NP-Context Disagreement: KL Divergence
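The slide's formula did not survive transcription; the sketch below is one plausible reading of "KL divergence to the mean weighted by density", where p and q are the class distributions the NP model and the context model assign to a candidate, and the weight is the candidate's corpus frequency. The exact weighting used in the talk may differ:

```python
import math

def kl(p, m):
    """KL divergence D(p || m) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def disagreement(p, q, density):
    """Density-weighted KL-to-the-mean between the NP model's and the
    context model's class distributions for one candidate example."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return density * (kl(p, m) + kl(q, m)) / 2

# Example: the two views disagree sharply on a frequent NP -> high score.
score = disagreement([0.9, 0.1], [0.2, 0.8], density=15)
```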

18 Results

20 What if you’re really lazy?
- Previous experiments assumed a training set was available
- What if you don’t have a set of documents that can be used to train?
- Can we start from only the seeds?

21 Collecting Training Data from the Web
Use the seed words to generate web queries. Simple approaches (see the sketch below):
- For each seed word, fetch all documents returned
- Only fetch documents where N or more seed words appear
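A minimal sketch of the second strategy, assuming a hypothetical `search(query)` function that returns (url, text) hits from some web search API; the function name and filtering rule are illustrative:

```python
def collect_training_docs(seeds, search, min_seeds=2):
    """Fetch documents for each seed query, keeping only pages in
    which at least `min_seeds` distinct seed words appear."""
    kept = {}
    for seed in seeds:
        for url, text in search(seed):   # hypothetical search API
            lowered = text.lower()
            hits = sum(1 for s in seeds if s.lower() in lowered)
            if hits >= min_seeds:
                kept[url] = text         # dict de-duplicates by URL
    return kept
```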

22 Collecting Training Data from the Web
[Diagram: Seeds -> Query Generator -> WWW -> Documents -> Text Filter]

23 Interleaved Data Collection (sketched below)
1. Select a seed word with uniform probability
2. Get documents containing that seed word
3. Run bootstrapping on the new documents
4. Select new seed words that are learned with high confidence
5. Repeat
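A minimal sketch of the interleaved loop, reusing the hypothetical `search` function and the `bootstrap` sketch from slide 10; the round count and the pair extractor `extract_pairs` are illustrative assumptions:

```python
import random

def interleaved_collection(seeds, search, extract_pairs, n_rounds=20):
    """Alternate between fetching documents for one seed and running
    bootstrapping, promoting confidently learned NPs to new seeds."""
    lexicon = set(seeds)
    pairs = []
    for _ in range(n_rounds):
        seed = random.choice(sorted(lexicon))     # uniform pick over seeds
        for _url, text in search(seed):
            pairs.extend(extract_pairs(text))     # (NP, context) pairs
        # One bootstrapping pass over everything gathered so far;
        # bootstrap() is the mutual-bootstrapping sketch from slide 10.
        learned, _contexts = bootstrap(pairs, lexicon, n_iterations=1)
        lexicon |= learned                        # promote new seed words
    return lexicon
```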

24 Seed-Word Density

25 Summary
- Starting with 10 seed words, extract NPs matching specific semantic classes
- Probabilistic bootstrapping is an effective technique
- Asking the user helps only if done intelligently
- The Web is an excellent resource for training data that can be collected automatically => personal information extraction systems

