Collective Word Sense Disambiguation
David Vickrey, Ben Taskar, Daphne Koller


Word Sense Disambiguation
The electricity plant supplies 500 homes with power.
vs.
A plant requires water and sunlight to survive.
Clues: the surrounding context words.
Tricky: That plant produces bottled water.

WSD as Classification
Senses s_1, s_2, …, s_k correspond to classes c_1, c_2, …, c_k
Features: properties of the context of the word occurrence
– Subject or verb of the sentence
– Any word occurring within 4 words of the occurrence
Document: the set of features corresponding to an occurrence
Example: The electricity plant supplies 500 homes with power.
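The bag-of-words context features described above can be sketched as follows; the function name and the 4-word window default are illustrative, not the authors' code:

```python
def context_features(tokens, i, window=4):
    """Bag-of-words features: words within `window` positions of occurrence i."""
    feats = set()
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            feats.add("ctx=" + tokens[j].lower())
    return feats

sent = "The electricity plant supplies 500 homes with power".split()
print(sorted(context_features(sent, 2)))  # features for the occurrence of "plant"
```

Richer features like "subject or verb of the sentence" would require a parser on top of this.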

Simple Approaches
Only features are which words appear in context
– Naïve Bayes
– Discriminative, e.g. SVM
Problems:
– Feature set not rich enough
– Data extremely sparse: "space" occurs 38 times in a corpus of 200,000 words
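A minimal Naïve Bayes baseline of the kind mentioned above, using only context words as features. This is a sketch with made-up training data, not the original system:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Multinomial Naive Bayes over context words, with add-one smoothing."""
    def fit(self, docs, senses):
        self.sense_counts = Counter(senses)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, s in zip(docs, senses):
            self.word_counts[s].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        def score(s):
            total = sum(self.word_counts[s].values())
            lp = math.log(self.sense_counts[s])  # log prior
            for w in words:
                lp += math.log((self.word_counts[s][w] + 1) /
                               (total + len(self.vocab)))
            return lp
        return max(self.sense_counts, key=score)

# Toy labeled contexts for two senses of "plant" (hypothetical sense labels)
train_docs = [["electricity", "power", "homes"],
              ["water", "sunlight", "grow"],
              ["factory", "power", "supplies"]]
train_senses = ["plant/factory", "plant/flora", "plant/factory"]
clf = NaiveBayesWSD().fit(train_docs, train_senses)
print(clf.predict(["power", "electricity"]))  # -> plant/factory
```

With only a handful of occurrences per sense, these estimates are exactly as sparse as the slide warns.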

Available Data
WordNet – electronic thesaurus
– Words grouped by meaning into synsets
– Slightly over 100,000 synsets
– For nouns and verbs, a hierarchy over synsets
(Example hierarchy: Animal → {Mammal, Bird}; Mammal → {Dog, Hound, Canine} → {Retriever, Terrier})
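The hypernym hierarchy can be walked programmatically. The snippet below uses a hand-coded toy fragment mirroring the slide's example rather than real WordNet data; in practice one would query WordNet itself (e.g. via NLTK's WordNet interface):

```python
# Toy fragment of a hypernym hierarchy (not real WordNet data).
hypernym = {
    "retriever": "dog/hound/canine",
    "terrier": "dog/hound/canine",
    "dog/hound/canine": "mammal",
    "mammal": "animal",
    "bird": "animal",
}

def hypernym_chain(synset):
    """Walk from a synset up to the root of the hierarchy."""
    chain = [synset]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

print(hypernym_chain("retriever"))
# -> ['retriever', 'dog/hound/canine', 'mammal', 'animal']
```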

Available Data
– Corpus of around 400,000 words labeled with synsets from WordNet
– Sample sentences from WordNet
– Very sparse for most words

What Hasn't Worked
Intuition: the context of "dog" is similar to the context of "retriever"
– Use the hierarchy to determine possibly useful data
– Using cross-validation, learn what data is actually useful
This hasn't worked out very well

Why?
– Lots of parameters (not even counting parameters estimated using MLE)
  > 100K for one model, ~20K for another
– Not much data (400K words)
  "a", "the", "and", "of", "to" occur ~65K times (together)
– Hierarchy may not be very useful
  Hand-built; not designed for this task
– Features not very expressive
Luke is looking at this more closely using an SVM

Collective WSD
Ideas:
– Determine the senses of all words in a document simultaneously
  Allows for richer features
– Train on unlabeled data as well as labeled
  Lots and lots of unlabeled text available

Model
Variables:
– S_1, S_2, …, S_n – synsets
– W_1, W_2, …, W_n – words, always observed
(Chain-structured graphical model: each word W_i is generated from its synset S_i.)

Model
Each synset is generated from the previous context – the size of the context is a parameter (here 4):

P(S, W) = ∏_{i=1..n} P(W_i | S_i) · P(S_i | S_{i-3}, S_{i-2}, S_{i-1})

P(S_i = s | S_{i-3}, S_{i-2}, S_{i-1}) = (1 / Z(s_{i-3}, s_{i-2}, s_{i-1})) · exp(λ_s(s_{i-3}) + λ_s(s_{i-2}) + λ_s(s_{i-1}) + λ_s)

P(W) = Σ_S P(S, W)
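The log-linear conditional above can be sketched directly as a softmax over candidate synsets; the weight tables and synset names below are hypothetical placeholders:

```python
import math

def transition_probs(prev3, lam, bias, domain):
    """P(S_i = s | s_{i-3}, s_{i-2}, s_{i-1}): log-linear in the lambda weights,
    normalized by Z over the candidate synsets in `domain`."""
    scores = {s: math.exp(sum(lam[s].get(p, 0.0) for p in prev3) + bias[s])
              for s in domain}
    z = sum(scores.values())  # Z(s_{i-3}, s_{i-2}, s_{i-1})
    return {s: v / z for s, v in scores.items()}

# Hypothetical weights: the bird sense of "eagle" is boosted when the bird
# sense of "hawk" appears in the preceding context.
lam = {"eagle(bird)": {"hawk(bird)": 2.0}, "eagle(golf)": {}}
bias = {"eagle(bird)": 0.0, "eagle(golf)": 0.0}
p = transition_probs(("hawk(bird)", "sky(n)", "soar(v)"), lam, bias,
                     ["eagle(bird)", "eagle(golf)"])
print(p)
```

Because the weights depend only on synset identities, the same table is shared across every occurrence of these words.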

Learning
Two sets of parameters:
– P(W_i | S_i) – given current estimates of the marginals P(S_i), expected counts
– λ_s(s') – for s' ∈ Domain(S_{i-1}), s ∈ Domain(S_i), gradient descent on the log likelihood gives:

λ_s(s') += Σ_{s_{i-3}, s_{i-2}} [ P(w, s_{i-3}, s_{i-2}, s', s) − P(w, s_{i-3}, s_{i-2}, s') · P(s | s_{i-3}, s_{i-2}, s') ]
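The λ update above can be read as accumulating differences between joint and factored marginals. Below is a hedged sketch of one gradient step, assuming the required marginals have already been computed by inference; all data structures and values are hypothetical:

```python
def lambda_gradient(marg_joint, marg_prev, cond, lr=0.1):
    """One gradient step for lambda_s(s'), following the slide's update:
    sum over (s_{i-3}, s_{i-2}) of
      P(w, s_{i-3}, s_{i-2}, s', s) - P(w, s_{i-3}, s_{i-2}, s') * P(s | s_{i-3}, s_{i-2}, s').
    Keys of the three dicts are (s_{i-3}, s_{i-2}) pairs; values are the
    corresponding marginals from inference."""
    grad = sum(marg_joint[k] - marg_prev[k] * cond[k] for k in marg_joint)
    return lr * grad

# Toy marginals for a single (s_{i-3}, s_{i-2}) context:
marg_joint = {("s3", "s2"): 0.2}
marg_prev = {("s3", "s2"): 0.5}
cond = {("s3", "s2"): 0.3}
step = lambda_gradient(marg_joint, marg_prev, cond)
print(step)
```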

Efficiency
Only need to calculate marginals over contexts
– Forwards-backwards
Issue: some words have many possible synsets (40-50) – want very fast inference
– Possibly prune values?
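The forwards-backwards computation can be sketched as below. For brevity this uses a first-order chain rather than the model's 3-synset context, and the emission/transition tables are toy values:

```python
def forward_backward(words, domains, emit, trans):
    """Per-position marginals P(S_i | W) for a first-order chain.
    domains[i] lists candidate synsets for word i; emit(w, s) and trans(p, s)
    return (unnormalized) probabilities."""
    n = len(words)
    # Forward pass (uniform prior over the first position's senses)
    alpha = [{s: emit(words[0], s) for s in domains[0]}]
    for i in range(1, n):
        alpha.append({s: emit(words[i], s) *
                         sum(alpha[-1][p] * trans(p, s) for p in domains[i - 1])
                      for s in domains[i]})
    # Backward pass
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in domains[n - 1]}
    for i in range(n - 2, -1, -1):
        beta[i] = {s: sum(trans(s, t) * emit(words[i + 1], t) * beta[i + 1][t]
                          for t in domains[i + 1])
                   for s in domains[i]}
    # Combine and normalize
    marginals = []
    for i in range(n):
        unnorm = {s: alpha[i][s] * beta[i][s] for s in domains[i]}
        z = sum(unnorm.values())
        marginals.append({s: v / z for s, v in unnorm.items()})
    return marginals

# Toy example: the "flora" sense of "plant" transitions to "water" more readily.
words = ["plant", "water"]
domains = [["factory", "flora"], ["water(n)"]]
emit = lambda w, s: {("plant", "factory"): 0.5, ("plant", "flora"): 0.5,
                     ("water", "water(n)"): 1.0}.get((w, s), 0.0)
trans = lambda p, s: {("flora", "water(n)"): 0.9,
                      ("factory", "water(n)"): 0.1}.get((p, s), 0.0)
m = forward_backward(words, domains, emit, trans)
print(m[0])
```

With 40-50 candidate synsets per word and a 3-word context, the transition sum grows quickly, which is why pruning candidate values is attractive.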

WordNet and Synsets
The model uses WordNet to determine the domain of S_i
– Synset information should be more reliable
This allows us to learn without any labeled data
Consider the synsets {eagle, hawk}, {eagle (golf shot)}, and {hawk (to sell)}
– Since parameters depend only on the synset, even without labeled data, we can find the correct clustering

Richer Features
Heuristic: "one sense per discourse" – usually, within a document, any given word takes only one of its possible senses
Can capture this using long-range links
– Could assume each word is independent of all occurrences besides the ones immediately before and after
– Or, could use approximate inference (Kikuchi)
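As a crude alternative to long-range links, the "one sense per discourse" heuristic can be applied as post-processing, collapsing each word's per-occurrence predictions to its majority sense. This is a sketch; the slide instead proposes modeling the constraint inside the graphical model:

```python
from collections import Counter

def one_sense_per_discourse(doc_predictions):
    """Given (word, predicted_sense) pairs for one document, force each word
    to take its majority sense across the whole document."""
    by_word = {}
    for word, sense in doc_predictions:
        by_word.setdefault(word, []).append(sense)
    majority = {w: Counter(ss).most_common(1)[0][0] for w, ss in by_word.items()}
    return [(w, majority[w]) for w, _ in doc_predictions]

preds = [("plant", "factory"), ("plant", "factory"), ("plant", "flora")]
print(one_sense_per_discourse(preds))
```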

Richer Features
Can reduce feature sparsity using the hierarchy (e.g., replace all occurrences of "dog" and "cat" with "animal")
– Need collective classification to do this
Could add "global" hidden variables to try to capture the document subject
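Replacing words by a coarser hypernym class can be sketched as a simple token mapping; the `coarse` table below is a hypothetical stand-in for classes derived from the WordNet hierarchy:

```python
# Hypothetical coarse-sense map; a real system would derive it from WordNet.
coarse = {"dog": "animal", "cat": "animal", "retriever": "animal"}

def generalize(tokens):
    """Replace specific words with a hypernym class to reduce feature sparsity."""
    return [coarse.get(t, t) for t in tokens]

print(generalize(["the", "dog", "chased", "the", "cat"]))
# -> ['the', 'animal', 'chased', 'the', 'animal']
```

Note the mapping is only safe once the sense is known (a disambiguated "dog" maps to "animal"; an ambiguous one may not), which is why collective classification is needed.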

Advanced Parameters
Lots of parameters – regularization likely helpful
Could tie parameters together based on similarity in the WordNet hierarchy
– Ties in with what I was working on before
– More data in this situation (unlabeled)

Experiments
Soon