CS 546 Machine Learning in NLP. Structured Prediction: Theories and Applications to Natural Language Processing. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign.

Presentation transcript:

Page 1: CS 546 Machine Learning in NLP. Structured Prediction: Theories and Applications to Natural Language Processing. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign.
- What's the class about
- How I plan to teach it
- Requirements
- Questions / Time?

Page 2: Comprehension
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
A process that maintains and updates a collection of propositions about the state of affairs. This is an Inference Problem.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1926. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

Page 3: Why is it difficult?
Diagram: the mapping between Meaning and Language is complicated by Ambiguity (one form, many meanings) and Variability (one meaning, many forms).

Page 4: Context Sensitive Paraphrasing
He used a Phillips head to tighten the screw.
The bank owner tightened security after a spate of local crimes.
The Federal Reserve will aggressively tighten monetary policy.
Candidate paraphrases for "tighten": Loosen, Strengthen, Step up, Toughen, Improve, Fasten, Impose, Intensify, Ease, Beef up, Simplify, Curb, Reduce. Which candidates are acceptable depends on the context.

Page 5: Beyond Single Variable Decisions
Example: Relation Extraction, "Works_for"
- Jim Carpenter works for the U.S. Government.
- The American government employed Jim Carpenter.
- Jim Carpenter was fired by the US Government.
- Jim Carpenter worked in a number of important positions. ... As a press liaison for the IRS, he made contacts in the White House.
- Top Russian interior minister Yevgeny Topolov met yesterday with his US counterpart, Jim Carpenter.
- Former US Secretary of Defence Jim Carpenter spoke today...

Page 6: Textual Entailment
Text: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
Hypothesis: Yahoo acquired Overture.
Is it true that...? (Textual Entailment)
- Overture is a search company
- Google is a search company
- ...
- Google owns Overture
A key problem in natural language understanding is to abstract over the inherent syntactic and semantic variability in natural language.

Page 7: Learning
- The process of abstraction has to be driven by statistical learning methods.
- Over the last decade or so it became clear that machine learning methods are necessary in order to support this process of abstraction.
- But...

Page 8: Classification: Ambiguity Resolution
- Illinois' bored of education [board]
- Nissan Car and truck plant; plant and animal kingdom
- (This Art) (can N) (will MD) (rust V) vs. V, N, N
- The dog bit the kid. He was taken to a veterinarian; a hospital
- Tiger was in Washington for the PGA Tour: Finance; Banking; World News; Sports
- Important or not important; love or hate

Page 9: Classification
- The goal is to learn a function f: X → Y that maps observations in a domain to one of several categories.
- Task: decide which of {board, bored} is more likely in the given context:
  - X: some representation of: "The Illinois' _______ of education met yesterday..."
  - Y: {board, bored}
- Typical learning protocol (see the sketch below):
  - Observe a collection of labeled examples (x, y) ∈ X × Y.
  - Use it to learn a function f: X → Y that is consistent with the observed examples and (hopefully) performs well on new, previously unobserved examples.
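To make this protocol concrete, here is a minimal sketch (not from the course materials) of the board/bored disambiguation task using bag-of-context-words features and a plain perceptron; the toy data, feature window, and update rule are illustrative assumptions.

```python
# Minimal sketch of the board/bored disambiguation task (illustrative only).
# The features, toy data, and plain perceptron update are assumptions,
# not the course's actual setup.
from collections import defaultdict

def features(sentence, blank_index):
    """Bag of words in a +/-2 window around the blank position."""
    words = sentence.split()
    window = words[max(0, blank_index - 2):blank_index] + words[blank_index + 1:blank_index + 3]
    return {f"w={w.lower()}": 1.0 for w in window}

# Toy labeled examples (x, y) in X x Y, with Y = {board, bored}
train = [
    ("The Illinois ___ of education met yesterday", 2, "board"),
    ("The students were ___ during the long lecture", 3, "bored"),
    ("She joined the ___ of directors last spring", 3, "board"),
    ("He was ___ of waiting for the bus", 2, "bored"),
]

weights = defaultdict(float)  # positive score -> "board", negative -> "bored"

def predict(phi):
    score = sum(weights[f] * v for f, v in phi.items())
    return "board" if score >= 0 else "bored"

for _ in range(10):  # a few perceptron epochs over the toy data
    for sent, idx, y in train:
        phi = features(sent, idx)
        if predict(phi) != y:
            sign = 1.0 if y == "board" else -1.0
            for f, v in phi.items():
                weights[f] += sign * v

print(predict(features("The local ___ of education voted", 2)))  # -> "board"
```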

Page 10: Classification is Well Understood
- Theoretically: generalization bounds (one standard form is shown below)
  - How many examples does one need to see in order to guarantee good behavior on previously unobserved examples?
- Algorithmically: good learning algorithms for linear representations.
  - Can deal with very high dimensionality (10^6 features)
  - Very efficient in terms of computation and number of examples. On-line.
- Key issues remaining:
  - Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking.
  - What are the features? No good theoretical understanding here.
  - Is it sufficient for making progress in NLP?
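For reference, one standard PAC-style statement of such a bound (a textbook result for a consistent learner over a finite hypothesis class H, not a bound specific to this course): with probability at least 1 - δ over m i.i.d. examples, any consistent hypothesis has true error at most ε provided m is large enough.

```latex
% Textbook sample-complexity bound for a consistent learner over a finite
% hypothesis class H (illustrative; not specific to the course's setting).
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\quad\Longrightarrow\quad
\Pr\!\left[\,\mathrm{err}_D(\hat{h}) \le \epsilon\,\right] \;\ge\; 1-\delta .
```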

Page 11: Coherency in Semantic Role Labeling
Predicate-arguments generated should be consistent across phenomena (verb, nominalization, preposition).
The touchdown scored by Cutler cemented the victory of the Bears.
- Verb predicate: score. A0: Cutler (scorer); A1: The touchdown (points scored).
- Nominalization predicate: win. A0: the Bears (winner).
- Preposition sense: 11(6), "the object of the preposition is the object of the underlying verb of the nominalization".
Linguistic constraints:
- A0: the Bears → Sense(of): 11(6)
- A0: Cutler → Sense(by): 1(1)

Page 12: Semantic Parsing
X: "What is the largest state that borders New York and Maryland?"
Y: largest(state(next_to(state(NY)) AND next_to(state(MD))))
Successful interpretation involves multiple decisions:
- What entities appear in the interpretation? Does "New York" refer to a state or a city?
- How to compose fragments together? state(next_to()) vs. next_to(state())

Page 13: Learning and Inference
Natural language decisions are structured:
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
- It is essential to make coherent decisions in a way that takes the interdependencies into account: joint, global inference.
- Unlike "standard" classification problems, in most interesting NLP problems there is a need to predict values for multiple interdependent variables.
- These are typically called Structured Output Problems, and they will be the focus of this class.

Page 14: Statistics or Linguistics?
- Statistical approaches were very successful in NLP.
- But it has become clear that there is a need to move from strictly data-driven approaches to knowledge-driven approaches.
- Knowledge: linguistics, background world knowledge.
- How to incorporate knowledge into learning and decision making?
- In many respects Structured Prediction addresses this question. This also distinguishes it from the "standard" study of probabilistic models.

Page 15: This Class
- Problems that will motivate us
- Perspectives we'll develop
- What we'll do, and how

Page 16: Some Examples
- Part of Speech Tagging
  - This is a sequence labeling problem.
  - The simplest example where (it seems that) the decision with respect to one word depends on the decision with respect to others (see the Viterbi sketch below).
- Named Entity Recognition
  - This is a sequence segmentation problem.
  - Not all segmentations are possible; there are dependencies among assignments of values to different segments.
- Relation Extraction
  - Works_for(Jim, US-government); co-reference resolution
- Semantic Role Labeling
  - Decisions here build on previous decisions (pipeline process).
  - Clear constraints among decisions.
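As a concrete illustration of decisions that depend on one another, here is a minimal Viterbi decoding sketch for sequence labeling; the tag set, toy lexicon, and scores are assumptions made for the example, not the course's model.

```python
# Minimal Viterbi decoding sketch for sequence labeling (e.g., POS tagging).
# The tag set and the emission/transition scores are toy assumptions.

def viterbi(words, tags, emit, trans):
    """Return the highest-scoring tag sequence under
    score(y) = sum_i emit(words[i], y[i]) + sum_i trans(y[i-1], y[i])."""
    # best[i][t] = best score of any tag sequence for words[:i+1] ending in tag t
    best = [{t: emit(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev_scores = {p: best[i - 1][p] + trans(p, t) for p in tags}
            p_star = max(prev_scores, key=prev_scores.get)
            best[i][t] = prev_scores[p_star] + emit(words[i], t)
            back[i][t] = p_star
    # Recover the argmax sequence by following back-pointers.
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

# Toy scores: nouns follow determiners, verbs follow nouns.
TAGS = ["DET", "N", "V"]
def emit(w, t):
    lexicon = {"the": "DET", "dog": "N", "kid": "N", "bit": "V"}
    return 1.0 if lexicon.get(w) == t else 0.0
def trans(p, t):
    good = {("DET", "N"), ("N", "V"), ("V", "DET")}
    return 0.5 if (p, t) in good else 0.0

print(viterbi("the dog bit the kid".split(), TAGS, emit, trans))
# -> ['DET', 'N', 'V', 'DET', 'N']
```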

Page 17: Semantic Role Labeling
Who did what to whom, when, where, why,...
I left my pearls to my daughter in my will.
[I]_A0 left [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC.
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
Constraints: overlapping arguments are not allowed; if A2 is present, A1 must also be present.
How to express the constraints on the decisions? How to "enforce" them?

Page 18: Algorithmic Approach
- Identify argument candidates
  - Pruning [Xue & Palmer, EMNLP'04]
  - Argument Identifier: binary classification (A-Perc)
- Classify argument candidates
  - Argument Classifier: multi-class classification (A-Perc)
- Inference (see the sketch below)
  - Use the estimated probability distribution given by the argument classifier
  - Use structural and linguistic constraints
  - Infer the optimal global output
Example sentence with candidate arguments marked: "I left my nice pearls to her" (nested and overlapping bracketed spans).
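A minimal sketch of the final inference step: given per-candidate label probabilities from the argument classifier, pick the highest-scoring joint assignment that satisfies the structural constraints (no overlapping arguments; if A2 is present, A1 must be present). The candidate spans, probabilities, and brute-force enumeration are illustrative assumptions; in practice this argmax is typically formulated as an integer linear program.

```python
# Brute-force constrained inference over SRL argument candidates (illustrative).
# Candidate spans, probabilities, and constraints are toy assumptions.
from itertools import product
import math

LABELS = ["A0", "A1", "A2", "NONE"]

# (start, end) token spans for "I left my nice pearls to her" with
# P(label | candidate) from a hypothetical argument classifier.
candidates = [
    ((0, 1), {"A0": 0.80, "A1": 0.10, "A2": 0.05, "NONE": 0.05}),  # "I"
    ((2, 5), {"A0": 0.05, "A1": 0.70, "A2": 0.10, "NONE": 0.15}),  # "my nice pearls"
    ((3, 5), {"A0": 0.05, "A1": 0.50, "A2": 0.10, "NONE": 0.35}),  # "nice pearls"
    ((5, 7), {"A0": 0.05, "A1": 0.10, "A2": 0.60, "NONE": 0.25}),  # "to her"
]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def legal(assignment):
    spans = [c[0] for c, y in zip(candidates, assignment) if y != "NONE"]
    # No two arguments may overlap.
    if any(overlaps(s, t) for i, s in enumerate(spans) for t in spans[i + 1:]):
        return False
    labels = set(assignment)
    # If A2 is present, A1 must also be present.
    return not ("A2" in labels and "A1" not in labels)

def score(assignment):
    return sum(math.log(p[y]) for (_, p), y in zip(candidates, assignment))

best = max((a for a in product(LABELS, repeat=len(candidates)) if legal(a)), key=score)
print(best)  # -> ('A0', 'A1', 'NONE', 'A2')
```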

Page 19: Inference with General Constraint Structure: Recognizing Entities and Relations
Dole's wife, Elizabeth, is a native of N.C.   (E1: Dole, E2: Elizabeth, E3: N.C.; R12, R23)
Local classifier distributions:
- E1: other 0.05, per 0.85, loc 0.10
- E2: other 0.10, per 0.60, loc 0.30
- E3: other 0.05, per 0.50, loc 0.45
- R12: irrelevant 0.05, spouse_of 0.45, born_in 0.50
- R23: irrelevant 0.10, spouse_of 0.05, born_in 0.85
How to guide the global inference? Do we also want to learn jointly? (A small worked sketch follows.)
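A small sketch of the joint inference question this slide raises, using the distributions above: enumerate joint assignments, score each by the product of local probabilities, and enforce type constraints such as spouse_of requiring two people and born_in requiring a person and a location. The constraint set and the search by enumeration are illustrative assumptions; in the literature this is typically solved with an integer linear program.

```python
# Joint entity-relation inference by enumeration (illustrative sketch).
# Local probabilities follow the slide; the type constraints and the
# brute-force search are assumptions made for this example.
from itertools import product

ENT = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},
    "E2": {"other": 0.10, "per": 0.60, "loc": 0.30},
    "E3": {"other": 0.05, "per": 0.50, "loc": 0.45},
}
REL = {
    ("E1", "E2"): {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},
    ("E2", "E3"): {"irrelevant": 0.10, "spouse_of": 0.05, "born_in": 0.85},
}
# A relation label is only consistent with certain argument types.
ALLOWED = {
    "spouse_of": {("per", "per")},
    "born_in": {("per", "loc")},
    "irrelevant": {(a, b) for a in ENT["E1"] for b in ENT["E1"]},  # any pair
}

def consistent(ents, rels):
    return all((ents[a], ents[b]) in ALLOWED[r] for (a, b), r in rels.items())

best_score, best = 0.0, None
for e1, e2, e3 in product(ENT["E1"], ENT["E2"], ENT["E3"]):
    for r12, r23 in product(REL[("E1", "E2")], REL[("E2", "E3")]):
        ents = {"E1": e1, "E2": e2, "E3": e3}
        rels = {("E1", "E2"): r12, ("E2", "E3"): r23}
        score = (ENT["E1"][e1] * ENT["E2"][e2] * ENT["E3"][e3]
                 * REL[("E1", "E2")][r12] * REL[("E2", "E3")][r23])
        if consistent(ents, rels) and score > best_score:
            best_score, best = score, (ents, rels)

print(best)
# Independent local decisions would pick born_in for R12, but the joint,
# constrained argmax prefers spouse_of(E1, E2) together with born_in(E2, E3).
```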

Page 20: A Lot More Problems
- POS / Shallow Parsing / NER
- SRL
- Parsing / Dependency Parsing
- Information Extraction / Relation Extraction
- Co-reference Resolution
- Transliteration
- Textual Entailment

Page 21: Computational Issues
[Diagram: a MODEL connects the two problems below.]
- The Learning Problem: how to train the model?
  - Difficulty of annotating data; decouple?
  - Joint Learning vs. Joint Inference
  - Indirect supervision; constraints-driven learning; semi-supervised learning
- The Inference Problem: how to solve / make decisions?

Page 22: Perspective
- Models: generative / discriminative
- Training: supervised, semi-supervised, indirect supervision
- Knowledge: features, model structure, declarative information
- Approach: unify treatment; demystify; survey key ML techniques used in structured NLP

Page 23: This Course
- Structure: lectures by me; gradually, more lectures by you; a few external talks
- Teaching: on the board, formal, mathematically rigorous
- Assignments:
  - Projects: two (group) projects, reports and presentations
  - Presentations: group paper presentations/tutorials (1-2 papers each)
  - 4 critical surveys
  - Assignment 0: Machine Learning preparation; short work sheets
- Expectations: this is an advanced course. I view my role as guiding you through the material and helping you in your first steps as a researcher. I expect that your participation in class, reading assignments and presentations will reflect independence, mathematical rigor and critical thinking.
Questions? Class time?

Page 24: Encouraging Collaborative Learning
Dimension I: Model
- On-line, Perceptron based (a structured perceptron sketch follows below)
- Exponential models (Logistic Regression, HMMs, CRFs)
- SVMs
- Constrained Conditional Models (including banks of classifiers)
- (Deep Neural Networks? Markov Logic Networks?)
Dimension II: Tasks
- Basic sequential and structured prediction
- Training paradigms (unsupervised & semi-supervised; indirect supervision)
- Latent representations
- Inference and approximate inference
- Applications
Note that there is a lot of overlap; several things can belong to several categories. It will make it more interesting.
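Since the perceptron-based models above meet the structured-prediction tasks in the structured perceptron, here is a minimal sketch of its update rule: predict the highest-scoring structure under the current weights, then update on the difference between the gold and predicted feature vectors. The feature map, toy data, and exhaustive argmax are illustrative assumptions; real systems use Viterbi or ILP inference for the argmax.

```python
# Minimal structured perceptron sketch (illustrative).
# Feature map, toy data, and exhaustive argmax over tag sequences are
# assumptions made for this example.
from collections import defaultdict
from itertools import product

TAGS = ["DET", "N", "V"]

def phi(words, tags):
    """Emission + transition indicator features for a (sentence, tag sequence) pair."""
    feats = defaultdict(float)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[f"emit:{w}|{t}"] += 1.0
        feats[f"trans:{prev}>{t}"] += 1.0
        prev = t
    return feats

def score(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def argmax(w, words):
    # Exhaustive search over tag sequences; fine for toy sentences only.
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(w, phi(words, tags)))

train = [
    ("the dog bit the kid".split(), ["DET", "N", "V", "DET", "N"]),
    ("the kid saw the dog".split(), ["DET", "N", "V", "DET", "N"]),
]

w = defaultdict(float)
for _ in range(5):  # a few epochs over the toy data
    for words, gold in train:
        pred = list(argmax(w, words))
        if pred != gold:  # structured update: reward gold features, penalize predicted ones
            for f, v in phi(words, gold).items():
                w[f] += v
            for f, v in phi(words, pred).items():
                w[f] -= v

print(list(argmax(w, "the dog saw the kid".split())))
# should recover ['DET', 'N', 'V', 'DET', 'N'] on this toy data
```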

Page 25: Project Groups
Groups: EXP Models/CRF; Structured SVM; MLN; CCM; Structured Perceptron; NN (deep learning)
Project: relation (and entity) extraction
- The project will be on identifying relations in text. It is likely that we will be using the ACE data.
- There will be several tasks, and possibly some optional extensions.
- Next week we will introduce the task and give some background reading.

Page 26: Tutorial Groups
Groups: EXP Models/CRF; Structured SVM; MLN; CCM; Structured Perceptron; NN (deep learning); Optimization; Features; Inference; Latent Representations