Indirect Supervision Protocols for Learning in Natural Language Processing
Ming-Wei Chang, James Clarke, Dan Goldwasser, Dan Roth and Vivek Srikumar
This work is supported by DARPA funding under the Bootstrap Learning and the Machine Reading Programs.

Abstract
Many natural language processing (NLP) tasks require making decisions over sets of interdependent output variables. Current approaches to this problem, known as structured prediction, rely on annotated data to learn a model mapping inputs to the output variables. Unfortunately, providing this form of supervision is difficult and often requires highly specialized knowledge that is not commonly available, so alleviating the supervision effort is a major challenge in scaling machine learning techniques to real-world NLP problems. We investigate methods for learning structured prediction models from an indirect supervision signal that is considerably easier to obtain. We suggest three indirect learning protocols for common NLP learning scenarios, show how to obtain the indirect supervision signal for several common NLP tasks such as semantic parsing, textual entailment, paraphrase identification, transliteration, POS tagging and information extraction, and demonstrate empirically that this signal can be used effectively for these learning tasks.

I. Learning over Constrained Latent Representations (LCLR)
Many NLP tasks can be phrased as decision problems over complex linguistic structures that are not expressed directly in the input. Successful learning depends on correctly encoding these, often latent, structures as features for the learning system. For example, deciding whether one sentence paraphrases another depends on aligning the sentences' constituents and extracting features from that alignment.

Learning an Intermediate Representation: Most existing approaches separate the representation decision from the classification decision in a two-stage pipeline. We follow the intuition that good representations are those that facilitate the learning task, and suggest a joint learning framework that learns both together. The framework is applicable to a wide range of applications, and we show how to encode domain-specific knowledge into the learning process using an Integer Linear Programming (ILP) formulation.

Learning Framework: We aim to learn a binary classification function f(x) defined over a weight vector u and a latent representation h constrained by application-specific semantics C. In our formulation h is a binary vector indicating the features used to represent a given example, and learning corresponds to solving an optimization problem that minimizes the loss over the binary training data. However, since the prediction itself is defined by a maximization over h, the resulting optimization problem is non-convex; we present a novel iterative optimization algorithm for solving it.

Empirical Evaluation: To demonstrate the effectiveness of the framework we evaluate it on three NLP tasks: textual entailment, transliteration identification and paraphrase identification. For transliteration identification, for instance, the classifier f(איטליה, Italy) must answer Yes or No based on a latent character-level (phonetic) alignment between the two strings. Results show that our framework outperforms existing systems that take either two-stage or joint approaches.
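The LCLR objective and its iterative solver appear on the original poster only as images, so no formula survives in this transcript. Below is a minimal, illustrative sketch of how such a latent-representation classifier could be trained, assuming a user-supplied feature function phi(x, h), a generator candidates(x) that enumerates representations satisfying the constraints C, binary labels y in {-1, +1}, and a squared-hinge loss with simple subgradient updates. The names and the alternating schedule are assumptions for illustration, not the authors' exact algorithm (which performs inference with an ILP).

```python
import numpy as np

def score(u, x, h, phi):
    """Linear score of example x under latent representation h."""
    return float(np.dot(u, phi(x, h)))

def best_representation(u, x, candidates, phi):
    """Inference step: pick the feasible latent representation h (from the
    constraint set C) that maximizes the score. The poster solves this with
    an ILP; here we assume a small, explicitly enumerable candidate set."""
    return max(candidates(x), key=lambda h: score(u, x, h, phi))

def train_lclr(data, phi, candidates, dim, lam=0.1, epochs=20, lr=0.01):
    """Alternating optimization sketch: (1) fix u and choose the best latent
    representation for each example, (2) fix the representations and take
    subgradient steps on a regularized squared-hinge loss over the binary
    labels y in {-1, +1}."""
    u = np.zeros(dim)
    for _ in range(epochs):
        # Step 1: latent inference under the current weights.
        latent = [best_representation(u, x, candidates, phi) for x, _ in data]
        # Step 2: update weights against the fixed representations.
        for (x, y), h in zip(data, latent):
            margin = y * score(u, x, h, phi)
            if margin < 1:  # squared hinge: loss = (1 - margin)^2
                u += lr * (2 * (1 - margin) * y * phi(x, h) - lam * u)
            else:
                u -= lr * lam * u
    return u
```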
II. Learning by Inventing Binary Labels
We present a novel approach for structured prediction that exploits the observation that structured output problems often have a companion learning problem: determining whether a given input possesses a good structure. For example, transliteration identification and phonetic alignment are companion problems, since only positive binary examples (true transliteration pairs) have good phonetic alignments. While obtaining direct supervision for structures is difficult, it is often very easy to obtain indirect supervision for the companion binary decision problem.

Learning Protocol: Our setting extends the standard structured learning setting (e.g., structured SVM), in which learning is defined as a minimization problem over a structured loss function L_s. We are interested in adding the binary information to this optimization. We develop a large-margin framework that jointly learns from both direct and indirect forms of supervision, exploiting the intuition that negative binary examples cannot have a good structure and using it to push the structured decisions towards better structures. Following this intuition, we formalize the connection between the binary decision and the structured one, define a loss function for the binary classification problem, and formulate a joint optimization problem over both losses.

Our experimental results exhibit the significant contribution of the easy-to-obtain indirect binary supervision on three NLP structure learning problems: phonetic alignment, information extraction and POS tagging.
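The formal definitions referred to above (the structured loss L_s, the binary companion loss, and the joint objective) were shown on the poster as images that did not survive extraction. The sketch below gives one plausible reading of the joint large-margin objective, assuming user-supplied phi (features over a structure), structures(x) (candidate structures) and delta (structured error); the specific hinge forms and trade-off constants are assumptions, not the authors' published formulation.

```python
import numpy as np

def structured_hinge(w, x, y_gold, phi, structures, delta):
    """Standard structured-SVM loss L_s: margin violation of the gold
    structure against the highest-scoring competing structure."""
    gold = np.dot(w, phi(x, y_gold))
    worst = max(np.dot(w, phi(x, y)) + delta(y, y_gold) for y in structures(x))
    return max(0.0, worst - gold)

def binary_companion_loss(w, x, label, phi, structures):
    """Companion binary loss, following the poster's intuition: a positive
    example should have SOME high-scoring structure, while a negative
    example should have NO high-scoring structure. label is +1 or -1."""
    best = max(np.dot(w, phi(x, y)) for y in structures(x))
    return max(0.0, 1.0 - label * best)

def joint_objective(w, structured_data, binary_data, phi, structures, delta,
                    lam=0.1, c_s=1.0, c_b=1.0):
    """Joint large-margin objective combining direct (structured) and
    indirect (binary) supervision; the trade-off constants c_s and c_b
    are illustrative."""
    reg = 0.5 * lam * float(np.dot(w, w))
    s_loss = sum(structured_hinge(w, x, y, phi, structures, delta)
                 for x, y in structured_data)
    b_loss = sum(binary_companion_loss(w, x, lbl, phi, structures)
                 for x, lbl in binary_data)
    return reg + c_s * s_loss + c_b * b_loss
```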
III. Driving Semantic Parsing from the World's Response
In this work we aim to alleviate the supervision burden by using a new form of supervision derived from the world's response to the learned model's actions. Consider the task of converting natural language questions into logical queries over a database:

NL query: "What is the largest state that borders Texas?"
Logical query: largest(state(next_to(const(texas))))
Query result (returned by the interactive computer environment): New Mexico

Current approaches to semantic parsing rely on annotated training data mapping sentences to logical forms. We instead exploit the ability to query the database, reducing the supervision effort to providing the correct responses to the text queries rather than annotating at the logical-form level: previous systems learn from logical forms, while our approach uses only the responses. We define the prediction task as mapping the input text (denoted x) to a logical representation (denoted z) by aligning their constituents (denoted y).

In this setting the learner has access to a feedback function providing a weak binary supervision signal that indicates whether the predicted query's response is identical to the expected one. However, the learning task requires a stronger signal indicating the correctness of the hypothesized query's components rather than of its result. Our key challenge is therefore: how can this weak binary signal be used to learn the complex structured prediction task? We propose two algorithms: (1) a direct approach, which uses the binary supervision signal directly by formulating the decision as a binary task, and (2) an aggressive approach, which iteratively learns a structured model from positively labeled predictions. Our experimental results show that these algorithms can use this supervision to recover the correct queries (R250), and that, given new data (Q250), the learned model generates correct queries with high accuracy compared to fully supervised models.
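To make the aggressive protocol concrete, here is a minimal sketch of response-driven training: predict a logical query, execute it against the database, and keep only the predictions whose response matches the expected answer as positive structured training examples for the next round. The functions parse, execute_query and train_structured are hypothetical placeholders standing in for the semantic parser, the database interface and the structured learner; they are not part of the authors' released code.

```python
def response_driven_training(questions, answers, parse, train_structured,
                             execute_query, initial_model=None, iterations=5):
    """Sketch of the 'aggressive' response-driven protocol: repeatedly
    (1) parse each question into a logical query with the current model,
    (2) execute the query against the database, and (3) keep only the
    (question, query) pairs whose response matches the expected answer,
    treating them as fully labeled structures for the next training round."""
    model = initial_model
    for _ in range(iterations):
        positive_pairs = []
        for question, expected in zip(questions, answers):
            logical_query = parse(model, question)   # e.g. largest(state(next_to(const(texas))))
            response = execute_query(logical_query)  # e.g. "New Mexico"
            if response == expected:                 # weak binary feedback from the world
                positive_pairs.append((question, logical_query))
        if not positive_pairs:
            break
        model = train_structured(positive_pairs)     # retrain on self-labeled structures
    return model
```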