Modeling Missing Data in Distant Supervision for Information Extraction
Alan Ritter, Luke Zettlemoyer, Mausam, Oren Etzioni

Distant Supervision for Information Extraction

Input: text + database
Output: relation extractor
Motivation:
– Domain independence: doesn't rely on manual annotations
– Leverage lots of data: large existing text corpora + databases
– Scale to lots of relations

[Bunescu and Mooney, 2007] [Snyder and Barzilay, 2007] [Wu and Weld, 2007] [Mintz et al., 2009] [Hoffmann et al., 2011] [Surdeanu et al., 2012] [Takamatsu et al., 2012] [Riedel et al., 2013] …

Heuristics for Labeling Training Data

Person            Birth Location
Barack Obama      Honolulu
Mitt Romney       Detroit
Albert Einstein   Ulm
Nikola Tesla      Smiljan
…                 …

"Barack Obama was born on August 4, 1961 at … in the city of Honolulu..."
"Birth notices for Barack Obama were published in the Honolulu Advertiser…"
"Born in Honolulu, Barack Obama went on to become…"
…

Matched pairs: (Barack Obama, Honolulu), (Mitt Romney, Detroit), (Albert Einstein, Ulm)

e.g. [Mintz et al., 2009]
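As a concrete illustration, here is a minimal sketch of the labeling heuristic in Python. The `db` dictionary, the toy corpus, and the simple substring matching are all illustrative placeholders, not the pipeline from any of the cited papers:

```python
# Toy database of (entity1, entity2) -> relation facts.
db = {
    ("Barack Obama", "Honolulu"): "BirthLocation",
    ("Mitt Romney", "Detroit"): "BirthLocation",
    ("Albert Einstein", "Ulm"): "BirthLocation",
}

def label_sentences(sentences, db):
    """Yield (sentence, entity pair, relation) training triples.

    Any sentence mentioning both entities of a known fact is labeled a
    positive example of that relation; pairs absent from the database
    would be treated as negatives -- the closed-world assumption that
    this talk argues against.
    """
    for sent in sentences:
        for (e1, e2), rel in db.items():
            if e1 in sent and e2 in sent:
                yield sent, (e1, e2), rel

corpus = [
    "Barack Obama was born on August 4, 1961 in the city of Honolulu.",
    "Born in Honolulu, Barack Obama went on to become president.",
]
for sent, pair, rel in label_sentences(corpus, db):
    print(rel, pair, "<-", sent)
```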

Problem: Missing Data

Most previous work assumes no missing data during training
Closed-world assumption: all propositions not in the DB are false
This leads to errors in the training data:
– Missing in DB -> false negatives
– Missing in text -> false positives

Let's treat these as missing (hidden) variables   [Xu et al., 2013] [Min et al., 2013]

NMAR Example: Flipping a Bent Coin

Flip a bent coin 1000 times; goal: estimate the probability of heads
But!
– Heads => hide the result
– Tails => hide with probability 0.2
Need to model the missing data to get an unbiased estimate of the probability of heads   [Little & Rubin 1986]
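A quick simulation makes the bias concrete. The true bias of 0.3 and the closed-form correction below are my own illustrative choices, not numbers from the talk:

```python
import random

random.seed(0)
p_heads = 0.3
flips = [random.random() < p_heads for _ in range(1000)]

observed = []
for heads in flips:
    if heads:
        continue                      # heads are always hidden
    if random.random() < 0.2:
        continue                      # tails are hidden with probability 0.2
    observed.append(heads)

# Naive estimate that ignores missingness: every observed flip is tails.
naive = sum(observed) / len(observed)
print(f"naive estimate of P(heads): {naive:.3f}")   # 0.000

# Modeling the missingness: E[#observed] = N * (1 - p) * 0.8,
# so solving for p from the observed count removes the bias.
n, n_obs = len(flips), len(observed)
p_hat = 1 - n_obs / (0.8 * n)
print(f"missingness-aware estimate: {p_hat:.3f}")   # close to 0.3
```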

Distant Supervision: Not Missing at Random (NMAR)

Prop is false => hide the result
Prop is true => hide with some probability   [Little & Rubin 1986]

Distant supervision heuristic during learning:
– Missing propositions are false
Better idea: treat them as hidden variables
– Problem: they are not missing at random

Solution: jointly model missing data + information extraction

Distant Supervision (Binary Relations)

Sentences -> relation mentions, scored by local extractors
Relation mentions -> aggregate relations (Born-In, Lived-In, children, etc.) via a deterministic OR, e.g. (Barack Obama, Honolulu)
Learning: maximize the conditional likelihood

[Hoffmann et al., 2011]
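In symbols, the factorization this slide depicts can be sketched roughly as below, where z_i is the relation expressed by sentence x_i (or NONE), y_r is the aggregate variable for relation r, and phi are the local extractor features. The notation is mine, following the spirit of [Hoffmann et al., 2011]:

```latex
p(\mathbf{y}, \mathbf{z} \mid \mathbf{x}; \theta) \;\propto\;
  \prod_{i} \exp\!\big(\theta \cdot \phi(x_i, z_i)\big)
  \prod_{r} \mathbb{1}\!\Big[\, y_r = \bigvee_{i} \mathbb{1}[z_i = r] \Big]
```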

Learning

Structured perceptron (gradient-based update): online, MAP-based learning
– Max assignment to the z's conditioned on Freebase: a weighted edge-cover problem (can be solved exactly)
– Max assignment to the z's, unconstrained: trivial
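Spelled out, the MAP-based perceptron update described here is roughly the following (a sketch in my notation, consistent with the factorization above):

```latex
\begin{aligned}
\mathbf{z}^{*} &= \operatorname*{arg\,max}_{\mathbf{z}\ \text{consistent with Freebase}} \;\theta \cdot \phi(\mathbf{x}, \mathbf{z})
  && \text{(weighted edge cover; solved exactly)}\\
\hat{\mathbf{z}} &= \operatorname*{arg\,max}_{\mathbf{z}} \;\theta \cdot \phi(\mathbf{x}, \mathbf{z})
  && \text{(unconstrained; trivial)}\\
\theta &\leftarrow \theta + \phi(\mathbf{x}, \mathbf{z}^{*}) - \phi(\mathbf{x}, \hat{\mathbf{z}})
\end{aligned}
```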

Missing Data Problems…

Two assumptions drive learning:
– Not in DB -> not mentioned in text
– In DB -> must be mentioned at least once
Leads to errors in training data:
– False positives
– False negatives

Changes
(model diagram)

Modeling Missing Data

(factor graph diagram: a hidden "mentioned in text" variable is paired with an observed "mentioned in DB" variable, with factors that encourage agreement between the two)

[Ritter et al., TACL 2013]
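One way to read the "encourage agreement" factors is as asymmetric soft penalties linking the hidden "mentioned in text" variable t_r to the observed "mentioned in DB" variable d_r. This is my reconstruction of the diagram; the penalty parameters alpha_r and beta_r are the knobs the soft constraints introduce:

```latex
\psi(t_r, d_r) =
\begin{cases}
  -\alpha_r & \text{if } t_r = 1,\ d_r = 0 \quad \text{(extracted from text, missing from the DB)}\\
  -\beta_r  & \text{if } t_r = 0,\ d_r = 1 \quad \text{(in the DB, never found in the text)}\\
  \;\;\,0   & \text{otherwise (agreement)}
\end{cases}
```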

Learning

Old parameter updates vs. new parameter updates (missing data model): the form of the update equations doesn't change much…
This is the difficult part: with soft constraints, the constrained MAP problem is no longer a weighted edge cover

MAP Inference

Find the z that maximizes the score: optimization with soft constraints, over the database, the sentences, the aggregate "mentioned in text" variables, and the sentence-level hidden variables
Exact inference: A* search
– Slow, memory intensive
Approximate inference: local search
– With carefully chosen search operators
– Only missed an optimal solution in 3 out of > 100,000 cases

Exact Inference: A* Search

Hypothesis: a partial assignment to z
Heuristic: an upper bound on the best score achievable from the partial assignment
– Pick relation mentions independently
– For each aggregate factor, check if we can improve the overall score by flipping it
– Can lead to inconsistencies, but gives a good upper bound
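Here is a sketch of that admissible heuristic as code; the data structures (`mention_scores`, `aggregate_gain`) are hypothetical simplifications of the real factor graph:

```python
def astar_heuristic(partial, mention_scores, aggregate_gain):
    """Admissible upper bound on the best completion of `partial`.

    partial        -- dict: mention index -> relation already assigned
    mention_scores -- list of dicts: mention_scores[i][r] = score of
                      assigning relation r to mention i
    aggregate_gain -- dict: relation r -> score gained if r's aggregate
                      factor were flipped to its best state

    The bound gives each unassigned mention its independently best
    relation and flips each aggregate factor only when doing so helps;
    the combined choice may be inconsistent, so it never underestimates
    the true optimum -- exactly what A* needs.
    """
    bound = 0.0
    for i, scores in enumerate(mention_scores):
        if i in partial:
            bound += scores[partial[i]]    # committed part of the hypothesis
        else:
            bound += max(scores.values())  # optimistic independent completion
    for gain in aggregate_gain.values():
        bound += max(0.0, gain)            # flip an aggregate only if it helps
    return bound
```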

Approximate Inference: Local Search

Start with a full (random) assignment to z
Search operators define neighboring states
At each step pick the highest-scoring neighbor, until there are none better
Basic search operators:
– Change each z_i
Aggregate search operators:
– Change all z_i's assigned to relation r to r'

Almost always finds the exact solution!
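A minimal sketch of this hill-climbing loop, covering both operator families; the `score` callable and the relation inventory are placeholders:

```python
import random

def neighbors(z, relations):
    """Yield neighboring assignments under the two operator families."""
    # Basic operator: change a single mention's label.
    for i in range(len(z)):
        for r in relations:
            if r != z[i]:
                yield z[:i] + [r] + z[i + 1:]
    # Aggregate operator: relabel every mention of r as r2 at once
    # (the global move that helps escape local optima).
    for r in relations:
        for r2 in relations:
            if r2 != r:
                yield [r2 if zi == r else zi for zi in z]

def local_search(num_mentions, relations, score):
    """Greedy hill climbing: move to the best neighbor until none improves."""
    z = [random.choice(relations) for _ in range(num_mentions)]
    best = score(z)
    while True:
        cand, cand_score = max(
            ((n, score(n)) for n in neighbors(z, relations)),
            key=lambda t: t[1],
        )
        if cand_score <= best:
            return z
        z, best = cand, cand_score
```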

Aggregate Search Operator: Intuition

Allows for global moves in the search space
Not as likely to get stuck in local optima
cf. type-level MCMC [Liang et al., 2010]

Side Information

Entity coverage in the database:
– Popular entities have good coverage in Freebase / Wikipedia
– So we are unlikely to extract new facts about them

Experiments

(precision/recall curves; red: MultiR [Hoffmann et al., 2011], black: soft constraints, green: missing data model)

Automatic Evaluation

Hold out facts from Freebase, and evaluate precision and recall against them
Problems:
– Extractions that are correct but missing from Freebase get marked as precision errors
– These are the extractions we really care about: new facts, not contained in Freebase
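A toy illustration of the held-out evaluation and the bias just described: any correct extraction absent from the held-out Freebase slice counts against precision. All the facts below are made up for the example:

```python
# Held-out slice of the database (toy data).
held_out = {("Barack Obama", "BornIn", "Honolulu")}

extractions = {
    ("Barack Obama", "BornIn", "Honolulu"),  # in the slice: counted correct
    ("Mitt Romney", "BornIn", "Detroit"),    # suppose this is true but held
                                             # out of the slice: counted as a
                                             # precision error anyway
}

tp = len(extractions & held_out)
precision = tp / len(extractions)  # 0.50: understates the true precision
recall = tp / len(held_out)        # 1.00
print(f"precision={precision:.2f} recall={recall:.2f}")
```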

Automatic Evaluation
(results figure)

Automatic Evaluation: Discussion

Correct predictions will be missing from the DB, so the evaluation underestimates precision
The evaluation is also biased:
– Systems that make predictions for more frequent entity pairs will do better
– Hard constraints => explicitly trained to predict facts already in Freebase

[Riedel et al., 2013]

Distant Supervision for Twitter NER

PRODUCT list: Lumina 925, iPhone, Macbook pro, Nexus 7, …

Example tweets:
"Nokia parodies Apple’s “Every Day” iPhone ad to promote their Lumia 925 smartphone"
"new LUMIA 925 phone is already running the next WINDOWS"
"Buy the Lumina 925 :)"

Extracted: Lumina 925, iPhone, Macbook Pro

[Ritter et al., 2011]

Weakly Supervised Named Entity Classification
(results figure)

Experiments: Summary

Big improvement in the sentence-level evaluation, compared against human judgments
We do worse on the aggregate evaluation:
– The constrained system is explicitly trained to predict only those things already in Freebase
– Using (soft) constraints, we are more likely to extract infrequent facts missing from Freebase
GOAL: extract new things that aren't already contained in the database

Contributions

New model that explicitly allows for missing data:
– Missing in text
– Missing in the database
Inference becomes more difficult:
– Exact inference: A* search
– Approximate inference: local search with carefully chosen search operators
Results:
– Big improvement by allowing for missing data
– Side information -> even better
Lots of room for better missing data models