Part 4: Supervised Methods of Word Sense Disambiguation

Outline
What is Supervised Learning?
Task Definition
Single Classifiers
–Naïve Bayesian Classifiers
–Decision Lists and Trees
Ensembles of Classifiers

What is Supervised Learning?
Collect a set of examples that illustrate the various possible classifications or outcomes of an event.
Identify patterns in the examples associated with each particular class of the event.
Generalize those patterns into rules.
Apply the rules to classify a new event.

Learn from these examples: “when do I go to the store?”
[Table of four example days with columns: Day, Go to Store?, Hot Outside?, Slept Well?, Ate Well?]

Outline
What is Supervised Learning?
Task Definition
Single Classifiers
–Naïve Bayesian Classifiers
–Decision Lists and Trees
Ensembles of Classifiers

Task Definition
Supervised WSD: a class of methods that induce a classifier from manually sense-tagged text using machine learning techniques.
Resources
–Sense Tagged Text
–Dictionary (implicit source of sense inventory)
–Syntactic Analysis (POS tagger, Chunker, Parser, …)
Scope
–Typically one target word per context
–Part of speech of target word resolved
–Lends itself to “lexical sample” formulation
Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs.

Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can’t walk there.

Two Bags of Words (co-occurrences in the “window of context”)
FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West

Simple Supervised Approach
Given a sentence S containing “bank”:
For each word Wi in S
   If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1;
   If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1;
If Sense_1 > Sense_2 then print “Financial”
else if Sense_2 > Sense_1 then print “River”
else print “Can’t Decide”;
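As a concrete illustration, here is a minimal Python sketch of the counting heuristic above; the bags are abridged to a few content words from the lists on the previous slide, so their exact contents are illustrative rather than the full bags.

```python
# Minimal sketch of the bag-counting heuristic above; the bags are
# abridged to a few content words from the previous slide (illustrative).
FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "criminals",
                      "deposit", "overdraft", "robbers"}
RIVER_BANK_BAG = {"campus", "catfish", "mississippi", "muddy",
                  "planted", "pole", "river", "walk"}

def disambiguate_bank(sentence):
    """Count overlaps with each bag and pick the sense with more matches."""
    words = sentence.lower().split()
    sense_1 = sum(1 for w in words if w in FINANCIAL_BANK_BAG)
    sense_2 = sum(1 for w in words if w in RIVER_BANK_BAG)
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("I went to the bank to deposit my check"))   # Financial
print(disambiguate_bank("The bank is pretty muddy near the river"))  # River
```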

Supervised Methodology
Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
–One tagged word per instance / lexical sample disambiguation
Select a set of features with which to represent context.
–co-occurrences, collocations, POS tags, verb-object relations, etc.
Convert sense-tagged training instances to feature vectors (see the sketch below).
Apply a machine learning algorithm to induce a classifier.
–Form – structure or relation among features
–Parameters – strength of feature interactions
Convert a held-out sample of test data into feature vectors.
–“correct” sense tags are known but not used
Apply the classifier to test instances to assign a sense tag.
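A minimal sketch of the feature-vector conversion step referenced above, assuming binary bag-of-words features over a tiny illustrative training sample (the variable names are mine, not from the slides):

```python
# Sketch: convert sense-tagged instances to binary bag-of-words feature
# vectors, one feature per vocabulary word (tiny illustrative sample).
tagged = [
    ("my bank charges too much for an overdraft", "bank/1"),
    ("the river bank is pretty muddy", "bank/2"),
]

vocabulary = sorted({w for sent, _ in tagged for w in sent.split() if w != "bank"})

def to_vector(sentence):
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocabulary]

X = [to_vector(sent) for sent, _ in tagged]   # feature vectors
y = [sense for _, sense in tagged]            # sense tags
# X and y can now be handed to any learner (Naive Bayes, decision tree, ...).
```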

Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifier
Decision Lists and Trees
Ensembles of Classifiers

Naïve Bayesian Classifier
The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997); Word Sense Disambiguation is no exception.
Assumes conditional independence among features, given the sense of a word.
–The form of the model is assumed, but the parameters are estimated from training instances.
When applied to WSD, the features are often a “bag of words” drawn from the training data.
–Usually thousands of binary features that indicate whether a word is present in the context of the target word (or not).

Bayesian Inference
Given the observed features, what is the most likely sense?
Estimate the probability of the observed features given the sense.
Estimate the unconditional probability of the sense.
The unconditional probability of the features is a normalizing term; it doesn’t affect sense classification.

Naïve Bayesian Model
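The model formula on this slide did not come through in the transcript; the following is a standard statement of the naïve Bayesian model consistent with the description on the previous slides (a sketch, not necessarily the exact slide content):

```latex
\begin{align*}
\hat{s} &= \arg\max_{s} P(s \mid f_1, \dots, f_n)
         = \arg\max_{s} \frac{P(f_1, \dots, f_n \mid s)\, P(s)}{P(f_1, \dots, f_n)} \\
        &\approx \arg\max_{s} P(s) \prod_{i=1}^{n} P(f_i \mid s)
\end{align*}
```

The conditional independence assumption replaces P(f_1, …, f_n | s) with the product of the individual P(f_i | s), and the denominator can be dropped because it does not depend on the sense s.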

The Naïve Bayesian Classifier
–Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25
–Given that “credit” occurs 200 times with bank/1 and 4 times with bank/2:
P(F1=“credit”) = 204/2,000 = .102
P(F1=“credit”|S=1) = 200/1,500 = .133
P(F1=“credit”|S=2) = 4/500 = .008
–Given a test instance that has the single feature “credit”:
P(S=1|F1=“credit”) = .133 * .75 / .102 = .978
P(S=2|F1=“credit”) = .008 * .25 / .102 = .020
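The same arithmetic, reproduced as a short Python check (counts taken from the slide; the slide’s .978 and .020 reflect rounded intermediate values):

```python
# Reproduce the worked example from the raw counts on the slide.
count_s1, count_s2 = 1500, 500        # instances of bank/1 and bank/2
credit_s1, credit_s2 = 200, 4         # co-occurrences of "credit" with each sense
total = count_s1 + count_s2

p_s1, p_s2 = count_s1 / total, count_s2 / total   # 0.75, 0.25
p_credit = (credit_s1 + credit_s2) / total        # 0.102
p_credit_given_s1 = credit_s1 / count_s1          # ~0.133
p_credit_given_s2 = credit_s2 / count_s2          # 0.008

print(p_credit_given_s1 * p_s1 / p_credit)        # ~0.98  (slide: .978)
print(p_credit_given_s2 * p_s2 / p_credit)        # ~0.02  (slide: .020)
```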

Comparative Results
(Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line…
(Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line…
(Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule Based Learner, a Probabilistic Model, etc. when disambiguating line and 12 other words…
…All found that the Naïve Bayesian Classifier performed as well as any of the other methods!

Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

Decision Lists and Trees
Very widely used in Machine Learning.
Decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988).
Represent the disambiguation problem as a series of questions (presence of a feature) that reveal the sense of a word.
–A list decides between two senses after one positive answer.
–A tree allows a decision among multiple senses after a series of answers.
Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes.
–More descriptive and easier to interpret.

Decision List for WSD (Yarowsky, 1994)
Identify collocational features from sense-tagged data.
Word immediately to the left or right of the target:
–I have my bank/1 statement.
–The river bank/2 is muddy.
Pair of words to the immediate left or right of the target:
–The world’s richest bank/1 is here in New York.
–The river bank/2 is muddy.
Words found within k positions to the left or right of the target, for some window size k:
–My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
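A small sketch of extracting these collocational features for one occurrence of the target word; the feature names and the window size k are illustrative choices, not prescribed by the slide:

```python
# Sketch: Yarowsky-style collocational features for the target at index i
# in a tokenized sentence; k is the co-occurrence window size (illustrative).
def collocational_features(tokens, i, k=10):
    feats = {}
    if i > 0:
        feats["word_left"] = tokens[i - 1]
    if i + 1 < len(tokens):
        feats["word_right"] = tokens[i + 1]
    if i > 1:
        feats["pair_left"] = " ".join(tokens[i - 2:i])
    if i + 2 < len(tokens):
        feats["pair_right"] = " ".join(tokens[i + 1:i + 3])
    for w in tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]:
        feats["in_window_" + w] = True
    return feats

tokens = "the river bank is muddy".split()
print(collocational_features(tokens, tokens.index("bank")))
```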

Building the Decision List
Sort the order of the collocation tests using the log of conditional probabilities.
Words most indicative of one sense (and not the other) will be ranked highly.
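The scoring formula itself is not in the transcript; the standard decision-list score, consistent with the worked example on the next slide (which implies a natural logarithm), is a sketch along these lines:

```latex
\mathrm{score}(f) = \left| \log \frac{P(S=1 \mid f)}{P(S=2 \mid f)} \right|,
\qquad \text{e.g. } \left| \ln \tfrac{0.978}{0.020} \right| \approx 3.89
```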

Computing the DL Score
–Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25
–Given that “credit” occurs 200 times with bank/1 and 4 times with bank/2:
P(F1=“credit”) = 204/2,000 = .102
P(F1=“credit”|S=1) = 200/1,500 = .133
P(F1=“credit”|S=2) = 4/500 = .008
–From Bayes Rule…
P(S=1|F1=“credit”) = .133 * .75 / .102 = .978
P(S=2|F1=“credit”) = .008 * .25 / .102 = .020
–DL Score = abs(log(.978 / .020)) = 3.89

Using the Decision List
Sort by DL-score, then go through the test instance looking for a matching feature. The first match reveals the sense…
DL-score   Feature              Sense
3.89       credit within bank   Bank/1 (financial)
2.20       bank is muddy        Bank/2 (river)
1.09       pole within bank     Bank/2 (river)
0.00       of the bank          N/A
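A minimal sketch of applying such a list: the entries come from the table above, but the feature tests are simplified to single-word membership checks for illustration.

```python
# Sketch: apply a decision list; the highest-scoring matching test wins.
decision_list = [
    (3.89, lambda ctx: "credit" in ctx, "Bank/1 (financial)"),
    (2.20, lambda ctx: "muddy" in ctx, "Bank/2 (river)"),
    (1.09, lambda ctx: "pole" in ctx, "Bank/2 (river)"),
]

def classify(context_words, default="Bank/1 (financial)"):
    for score, test, sense in sorted(decision_list, key=lambda e: e[0], reverse=True):
        if test(context_words):
            return sense
    return default   # fall back (e.g., most frequent sense) when nothing matches

print(classify({"the", "bank", "is", "pretty", "muddy"}))   # Bank/2 (river)
```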

Learning a Decision Tree
Identify the feature that most “cleanly” divides the training data into the known senses.
–“Cleanliness” is measured by information gain or gain ratio.
–Create subsets of the training data according to feature values.
Find another feature that most cleanly divides a subset of the training data.
Continue until each subset of training data is “pure”, or as clean as possible.
Well-known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993).
In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000).
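A short sketch of the “cleanliness” criterion, information gain, computed over parallel lists of feature values and sense labels; the toy data below is illustrative only.

```python
import math
from collections import Counter

# Sketch: information gain, one way to measure how "cleanly" a feature
# splits the training data into senses.
def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: does the presence of "credit" split bank/1 vs. bank/2 cleanly?
has_credit = [1, 1, 1, 0, 0, 0]
senses = ["bank/1", "bank/1", "bank/1", "bank/2", "bank/2", "bank/1"]
print(information_gain(has_credit, senses))   # ~0.46 bits
```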

Supervised WSD with Individual Classifiers
Most supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well.
Features tend to differentiate among methods more than the learning algorithms do.
Good sets of features tend to include:
–Co-occurrences or keywords (global)
–Collocations (local)
–Bigrams (local and global)
–Part of speech (local)
–Predicate-argument relations (verb-object, subject-verb, …)
–Heads of Noun and Verb Phrases

Convergence of Results
The accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another.
–Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.
–Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.
–Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task…
What to do next?

Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers

Classifier error has two components (bias and variance).
–Some algorithms (e.g., decision trees) try to build a representation of the training data – Low Bias / High Variance
–Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias / Low Variance
Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy.
“Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996); see the sketch below.
–Sample with replacement from the training data to learn multiple decision trees.
–Outliers in the training data will tend to be obscured/eliminated.
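A minimal sketch of bagging decision trees, assuming scikit-learn is available and that X, y are feature vectors and sense labels like those built earlier; the helper names are mine, not from the slides.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Sketch: bagging by resampling the training data with replacement and
# training one decision tree per bootstrap sample.
def bagged_trees(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, x):
    votes = Counter(tree.predict([x])[0] for tree in trees)
    return votes.most_common(1)[0][0]                 # majority vote
```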

Ensemble Considerations
Must choose different learning algorithms with significantly different bias/variance characteristics.
–Naïve Bayesian Classifier versus Decision Tree
Must choose feature representations that yield significantly different (independent?) views of the training data.
–Lexical versus syntactic features
Must choose how to combine the classifiers (two simple schemes are sketched below).
–Simple majority voting
–Averaging of probabilities across multiple classifier outputs
–Maximum Entropy combination (e.g., Klein et al., 2002)
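Two of the simpler combination schemes above, sketched under the assumption that each component classifier follows the scikit-learn predict / predict_proba convention and shares the same class ordering:

```python
import numpy as np
from collections import Counter

# Sketch: simple majority voting and probability averaging over an
# ensemble of classifiers (assumes scikit-learn-style predict/predict_proba).
def majority_vote(classifiers, x):
    votes = Counter(clf.predict([x])[0] for clf in classifiers)
    return votes.most_common(1)[0][0]

def average_probabilities(classifiers, x):
    probs = np.mean([clf.predict_proba([x])[0] for clf in classifiers], axis=0)
    return classifiers[0].classes_[int(np.argmax(probs))]
```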

Ensemble Results
(Pedersen, 2000) achieved state of the art for the interest and line data using an ensemble of Naïve Bayesian Classifiers.
–Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.
–Classifiers combined by a weighted vote.
(Florian and Yarowsky, 2002) achieved state of the art for the Senseval-1 and Senseval-2 data using a combination of six classifiers.
–Rich set of collocational and syntactic features.
–Combined via a linear combination of the top three classifiers.
Many Senseval-2 and Senseval-3 systems employed ensemble methods.

References
(Black, 1988) E. Black. (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32).
(Breiman, 1996) L. Breiman. (1996) The heuristics of instability in model selection. Annals of Statistics (24).
(Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29).
(Domingos, 2000) P. Domingos. (2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI.
(Florian and Yarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP.
(Kelly and Stone, 1975) E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
(Klein et al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning. (2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Proceedings of Senseval-2.
(Leacock et al., 1993) C. Leacock, G. Towell, and E. Voorhees. (1993) Corpus-based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology.
(Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of EMNLP.
(Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation, Southern Methodist University.
(Pedersen, 2000) T. Pedersen. (2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
(Quinlan, 1986) J.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning (1).
(Quinlan, 1993) J.R. Quinlan. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
(Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL.
(Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.