Bootstrapping Information Extraction with Unlabeled Data Rayid Ghani Accenture Technology Labs Rosie Jones Carnegie Mellon University & Overture (With contributions from Tom Mitchell and Ellen Riloff)

What is Information Extraction?
Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships.
Recent commercial applications:
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

IE Approaches
- Hand-constructed rules
- Supervised learning: still costly to train and port to new domains
  - 3-6 months to port to a new domain (Cardie 98)
  - 20,000 words to learn named entity extraction (Seymore et al 99)
  - 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
- Semi-supervised learning

Semi-Supervised Approaches
Several algorithms have been proposed for different tasks (semantic tagging, text categorization) and tested on different corpora: Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
Goal: systematically analyze and test
- the assumptions underlying the algorithms
- the effectiveness of the algorithms on a common set of problems and a common corpus

Tasks
Extract noun phrases belonging to the following semantic classes:
- Locations
- Organizations
- People

Aren’t you missing the obvious?
Why not just acquire lists of proper nouns?
- Locations: countries, states, cities
- Organizations: online databases
- People: name lists
Or use named entity extraction?
But not all instances are proper nouns: *by the river*, *customer*, *client*

Use context to disambiguate
Many NPs are unambiguous: "the corporation"
Many contexts are also unambiguous: "subsidiary of"
But as always there are exceptions, and a LOT of them in this case: customer, John Hancock, Washington

Bootstrapping Approaches
Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in, traveled to
Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs

Interesting Dimensions for Bootstrapping Algorithms
- Incremental vs. Iterative
- Symmetric vs. Asymmetric
- Probabilistic vs. Heuristic

Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999): Incremental, Asymmetric, Heuristic
- Co-Training (Blum & Mitchell, 1998): Incremental, Symmetric, Probabilistic(?)
- Co-EM (Nigam & Ghani, 2000): Iterative, Symmetric, Probabilistic
Baselines:
- Seed-labeling: label all NPs that match the seeds
- Head-labeling: label all NPs whose head matches the seeds
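The two baselines can be sketched as follows; the tokenized-NP representation and the head-equals-last-token rule are simplifying assumptions for illustration, not part of the original system:

```python
def seed_label(np_tokens, seeds):
    """Seed-labeling baseline: label the NP only if the whole phrase
    matches a seed word or phrase."""
    return " ".join(np_tokens) in seeds


def head_label(np_tokens, seeds):
    """Head-labeling baseline: label the NP if its head noun matches a
    seed. Taking the last token as the head is a simplification that
    works for most English NPs."""
    return np_tokens[-1] in seeds


seeds = {"company", "united states"}
seed_label(["the", "company"], seeds)  # False: the whole phrase is not a seed
head_label(["the", "company"], seeds)  # True: the head noun "company" is a seed
```

Head-labeling is the more permissive baseline: it generalizes each seed to every NP headed by that word, which is exactly why the slides treat it as a stronger comparison point than exact seed matching.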

Data Set
~4200 corporate web pages (WebKB project at CMU)
Test data marked up manually by labeling every NP with one or more of the following semantic categories: location, organization, person, none
Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

Seeds
- Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
- People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
- Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, ravonier

Intuition Behind Bootstrapping
(Figure: noun phrases such as "the dog", "australia", "france", "the canary islands" linked to the contexts they occur in, such as "ran away", "travelled to", "is beautiful".)

Co-Training (Blum & Mitchell, 98)
Incremental, symmetric, probabilistic
1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add the n top-scoring contexts for both the positive and negative class
4. Use the new contexts to label all NPs
5. Add the n top-scoring NPs for both the positive and negative class
6. Loop
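A minimal runnable sketch of this loop, assuming a toy co-occurrence corpus and a simple count-based scoring function (both illustrative stand-ins, not the paper's model):

```python
# Toy corpus of (noun phrase, context) co-occurrences.
PAIRS = [
    ("australia", "travelled to <x>"),
    ("france", "travelled to <x>"),
    ("china", "located in <x>"),
    ("australia", "located in <x>"),
    ("the dog", "<x> ran away"),
    ("the cat", "<x> ran away"),
]


def score(item, positives, negatives, pairs, view):
    """Count-based score: +1 for each co-occurrence with a positively
    labeled item in the other view, -1 for each with a negative one."""
    other = 1 - view
    total = 0
    for pair in pairs:
        if pair[view] == item:
            if pair[other] in positives:
                total += 1
            if pair[other] in negatives:
                total -= 1
    return total


def co_train(pairs, pos_nps, neg_nps, n=1, iterations=2):
    nps = {p[0] for p in pairs}
    ctxs = {p[1] for p in pairs}
    pos_nps, neg_nps = set(pos_nps), set(neg_nps)
    pos_ctx, neg_ctx = set(), set()
    for _ in range(iterations):
        # Steps 2-3: use labeled NPs to score every unlabeled context,
        # then add the n best as positive and the n worst as negative.
        ranked = sorted(ctxs - pos_ctx - neg_ctx,
                        key=lambda c: score(c, pos_nps, neg_nps, pairs, 1),
                        reverse=True)
        pos_ctx |= set(ranked[:n])
        neg_ctx |= set(ranked[n:][-n:])
        # Steps 4-5: the symmetric update for noun phrases.
        ranked = sorted(nps - pos_nps - neg_nps,
                        key=lambda p: score(p, pos_ctx, neg_ctx, pairs, 0),
                        reverse=True)
        pos_nps |= set(ranked[:n])
        neg_nps |= set(ranked[n:][-n:])
    return pos_nps, pos_ctx


learned_nps, learned_ctx = co_train(PAIRS, {"australia"}, {"the dog"})
```

Starting from the single positive seed "australia", the two location contexts are pulled in first, and they in turn label "france" and "china" as locations while "the cat" follows "the dog" into the negative set.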

Co-EM (Nigam & Ghani, 2000)
Iterative, Symmetric, Probabilistic
Similar to Co-Training, but probabilistically labels and adds all NPs and contexts to the labeled set on every iteration
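The key difference from co-training can be sketched as alternating soft-labeling passes over every item at once; the mean-of-neighbors update below is a simplification of the actual co-EM estimator, and the corpus and seed probabilities are illustrative:

```python
# Toy corpus of (noun phrase, context) co-occurrences.
PAIRS = [
    ("australia", "travelled to <x>"),
    ("france", "travelled to <x>"),
    ("the dog", "<x> ran away"),
    ("the cat", "<x> ran away"),
]


def relabel(pairs, probs, src, dst):
    """Give every item in view `dst` the mean class probability of the
    items in view `src` it co-occurs with; unseen items count as 0.5."""
    acc = {}
    for pair in pairs:
        acc.setdefault(pair[dst], []).append(probs.get(pair[src], 0.5))
    return {item: sum(v) / len(v) for item, v in acc.items()}


def co_em(pairs, seed_np_probs, iterations=5):
    np_probs = dict(seed_np_probs)
    for _ in range(iterations):
        ctx_probs = relabel(pairs, np_probs, 0, 1)  # all NPs label all contexts
        np_probs = relabel(pairs, ctx_probs, 1, 0)  # all contexts label all NPs
        np_probs.update(seed_np_probs)              # keep the seed labels fixed
    return np_probs, ctx_probs


np_probs, ctx_probs = co_em(PAIRS, {"australia": 1.0, "the dog": 0.0})
```

Because every item keeps a probability rather than a hard label, an ambiguous NP can sit near 0.5 indefinitely instead of being committed to the wrong class early, which is the property the later slides credit for co-EM's robustness.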

Meta-Bootstrapping (Riloff & Jones, 99)
Incremental, Asymmetric, Heuristic
Two-level process:
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
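The frequency-and-diversity scoring can be sketched with the RlogF metric from Riloff & Jones (1999): F is the number of distinct known category members a context extracts, N the total number of distinct NPs it extracts. The lexicon and extraction lists below are toy examples:

```python
from math import log2


def rlogf(extracted_nps, category_lexicon):
    """RlogF-style context score: rewards contexts that extract many
    distinct known category members (diversity, via log2(F)) and that
    extract mostly category members (reliability, via F/N)."""
    distinct = set(extracted_nps)
    f = len(distinct & category_lexicon)
    n = len(distinct)
    return (f / n) * log2(f) if f > 0 else 0.0


lexicon = {"australia", "france", "china"}
rlogf(["australia", "france", "australia", "china"], lexicon)  # (3/3) * log2(3)
rlogf(["australia", "the dog", "customer"], lexicon)           # (1/3) * log2(1) = 0.0
```

Note that a context extracting only one known member scores 0, so a single lucky co-occurrence is never enough: a context must be both reliable and diverse to survive a round.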

Common Assumptions
Seeds:
- Seed density in the corpus
- Head-labeling accuracy
- Syntactic-semantic agreement
Redundancy:
- Feature sets are redundant and sufficient
Labeling disagreement

Feature Set Ambiguity
Feature sets: NPs and contexts
If the feature sets were redundantly sufficient, either one alone would be enough to correctly classify an instance
Calculate the ambiguity for each feature set
Example of an ambiguous NP: Washington (in contexts such as "went to", "visit")
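One way to compute that ambiguity is the fraction of features labeled with more than one semantic class; the toy labeling below is illustrative, not the paper's data:

```python
def ambiguity(feature_classes):
    """Fraction of features (NPs or contexts) that were assigned more
    than one semantic class in the labeled test data."""
    ambiguous = sum(1 for classes in feature_classes.values() if len(classes) > 1)
    return ambiguous / len(feature_classes)


# Toy example: "washington" is both a location and a person.
np_classes = {
    "washington": {"location", "person"},
    "australia": {"location"},
    "the dog": {"none"},
    "customer": {"person"},
}
ambiguity(np_classes)  # 0.25
```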

2% NP Ambiguity
(Table of NP counts by ambiguity type: single classes (None, Location, Organization, Person) and multi-class combinations (Location+None, Organization+None, Person+None, Loc+Org, Org+Person, Loc+Org+None, Org+Person+None). Only 2% of NPs are ambiguous between classes.)

36% Context Ambiguity
(Table of context counts by the same ambiguity types, plus contexts ambiguous among all of Loc, Org, Person, and None, of which there are 6. 36% of contexts are ambiguous, far more than NPs.)

Labeling Disagreement
Agreement among human labelers on the same set of instances, given different levels of information:
- NP only
- Context only
- NP and context
- NP, context, and the entire sentence from the corpus

Labeling Disagreement
90.5% agreement when the NP, context, and sentence are all given
88.5% when the sentence is not given

Results
Comparing bootstrapping algorithms: Meta-Bootstrapping, Co-Training, Co-EM
Classes: Locations, Organizations, Person

(Results graphs comparing Co-EM, MetaBoot, and Co-Training, presumably one per semantic class.)

More Results
- Bootstrapping outperforms both baselines
- The improvement is less pronounced for the "people" class
- Ambiguous classes don't benefit as much from bootstrapping?

Why does co-EM work well?
Co-EM outperforms Meta-Bootstrapping and Co-Training
Co-EM is probabilistic and does not make hard classifications, which better reflects the ambiguity among the classes

Summary
- Starting with 10 seed words, extract NPs belonging to specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
- Probabilistic bootstrapping with redundant feature sets is effective, even for ambiguous classes
- Co-EM performs robustly even when the underlying assumptions are violated

Ongoing Work
- Varying initial seed size and type
- Collecting training corpora automatically from the Web
- Incorporating the user in the loop (active learning)