
1 Advances in Word Sense Disambiguation Tutorial at AAAI-2005 July 9, 2005 Rada Mihalcea University of North Texas http://www.cs.unt.edu/~rada Ted Pedersen University of Minnesota, Duluth http://www.d.umn.edu/~tpederse [Note: slides have been modified/deleted/added] [For those interested in lexical semantics, I suggest getting the entire tutorial]

2 Definitions Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. –Sense inventory usually comes from a dictionary or thesaurus. –Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. –Unsupervised techniques

3 Word Senses The meaning of a word in a given context Word sense representations –With respect to a dictionary chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down" chair = the position of professor; "he was awarded an endowed chair in economics" –With respect to the translation in a second language chair = chaise chair = directeur –With respect to the context where it occurs (discrimination) “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”

4 Approaches to Word Sense Disambiguation Knowledge-Based Disambiguation –use of external lexical resources such as dictionaries and thesauri –discourse properties Supervised Disambiguation –based on a labeled training set –the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category) Unsupervised Disambiguation –based on unlabeled corpora –the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)

5 All Words Word Sense Disambiguation Attempt to disambiguate all open-class words in a text “He put his suit over the back of the chair” Use information from dictionaries –Definitions / examples for each meaning Find similarity between definitions and current context Position in a semantic network Find that “table” is closer to “chair/furniture” than to “chair/person” Use discourse properties A word exhibits the same sense in a discourse / in a collocation

6 All Words Word Sense Disambiguation Minimally supervised approaches –Learn to disambiguate words using small annotated corpora –E.g. SemCor – corpus where all open-class words are disambiguated 200,000 running words Most frequent sense

7 Targeted Word Sense Disambiguation (we saw this in the previous lecture notes) Disambiguate one target word “Take a seat on this chair” “The chair of the Math Department” WSD is viewed as a typical classification problem –use machine learning techniques to train a system Training: –Corpus of occurrences of the target word, each occurrence annotated with the appropriate sense –Build feature vectors: a vector of relevant linguistic features that represents the context (e.g., a window of words around the target word) Disambiguation: –Disambiguate the target word in new unseen text

8 Unsupervised Disambiguation Disambiguate word senses: –without supporting tools such as dictionaries and thesauri –without a labeled training text Without such resources, word senses are not labeled –We cannot say “chair/furniture” or “chair/person” We can: –Cluster/group the contexts of an ambiguous word into a number of groups –Discriminate between these groups without actually labeling them

9 Unsupervised Disambiguation Hypothesis: same senses of words will have similar neighboring words Disambiguation algorithm –Identify context vectors corresponding to all occurrences of a particular word –Partition them into regions of high density –Assign a sense to each such region “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”

10 Bounds on Performance Upper and lower bounds on performance: –Measure of how well an algorithm performs relative to the difficulty of the task Upper bound: –Human performance –Around 97%–99% with few and clearly distinct senses –Inter-judge agreement: For words with clear and distinct senses – 95% and up For polysemous words with related senses – 65%–70% Lower bound (or baseline): –Assignment of a random sense / the most frequent sense 90% is excellent for a word with 2 equiprobable senses 90% is trivial for a word with 2 senses with probability ratios of 9 to 1

11 References (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992. (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995. (Senseval) Senseval evaluation exercises http://www.senseval.org

12 Part 3: Knowledge-based Methods for Word Sense Disambiguation

13 Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods

14 Task Definition Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text Resources –Yes: Machine Readable Dictionaries Raw corpora –No: Manually annotated corpora Scope –All open-class words

15 Machine Readable Dictionaries In recent years, most dictionaries have been made available in Machine Readable format (MRD) –Oxford English Dictionary –Collins –Longman Dictionary of Contemporary English (LDOCE) Thesauri – add synonymy information –Roget's Thesaurus Semantic networks – add more semantic relations –WordNet –EuroWordNet

16 MRD – A Resource for Knowledge-based WSD For each word in the language vocabulary, an MRD provides: –A list of meanings –Definitions (for all word meanings) –Typical usage examples (for most word meanings) WordNet definitions/examples for the noun plant 1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles" 2. a living organism lacking the power of locomotion 3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant" 4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience

17 MRD – A Resource for Knowledge-based WSD A thesaurus adds: –An explicit synonymy relation between word meanings A semantic network adds: –Hypernymy/hyponymy (IS-A), meronymy (PART-OF), antonymy, entailment, etc. WordNet synsets for the noun “plant” 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life” {plant, flora, plant life} hypernym: {organism, being} hyponym: {house plant}, {fungus}, … meronym: {plant tissue}, {plant part} holonym: {Plantae, kingdom Plantae, plant kingdom}

18 Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods

19 Lesk Algorithm (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: 1. Retrieve from MRD all sense definitions of the words to be disambiguated 2. Determine the definition overlap for all possible sense combinations 3. Choose senses that lead to highest overlap Example: disambiguate PINE CONE PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees Pine#1 ∩ Cone#1 = 0 Pine#2 ∩ Cone#1 = 0 Pine#1 ∩ Cone#2 = 1 Pine#2 ∩ Cone#2 = 0 Pine#1 ∩ Cone#3 = 2 Pine#2 ∩ Cone#3 = 0
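The original Lesk procedure above can be sketched in a few lines. The stoplist and the naive plural stripping below are assumptions made for this toy illustration (real implementations use proper tokenization and stemming); with them the winning pair for PINE CONE is still pine#1 / cone#3.

```python
# A minimal sketch of the original Lesk algorithm on the PINE/CONE example.
# The STOP set and the crude 's'-stripping ("trees" -> "tree") are toy assumptions.
STOP = {"of", "a", "the", "or", "to", "with", "which", "this", "whether"}

def words(definition):
    """Lower-cased content words of a definition, with a crude plural strip."""
    out = set()
    for w in definition.lower().split():
        if w not in STOP:
            out.add(w[:-1] if w.endswith("s") else w)
    return out

def lesk(defs_a, defs_b):
    """Return the (sense_a, sense_b) index pair with the highest definition overlap."""
    pairs = [(i, j) for i in range(len(defs_a)) for j in range(len(defs_b))]
    return max(pairs, key=lambda p: len(words(defs_a[p[0]]) & words(defs_b[p[1]])))

PINE = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]
CONE = ["solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees"]
```

With these definitions, `lesk(PINE, CONE)` selects pine#1 and cone#3, which share "evergreen" and "tree(s)".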

20 Lesk Algorithm: A Simplified Version Original Lesk definition: measure overlap between sense definitions for all words in context –Identify simultaneously the correct senses for all words in context Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and the current context –Identify the correct sense for one word at a time Search space significantly reduced

21 Lesk Algorithm: A Simplified Version Algorithm for simplified Lesk: 1. Retrieve from MRD all sense definitions of the word to be disambiguated 2. Determine the overlap between each sense definition and the current context 3. Choose the sense that leads to highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness Pine#1 ∩ Sentence = 1 Pine#2 ∩ Sentence = 0
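The simplified variant only has to score each sense of one word against the surrounding sentence. A sketch, with the stoplist again a toy assumption:

```python
# Simplified Lesk: score each sense definition of PINE against the context
# sentence only, instead of against every sense combination of every word.
STOP = {"a", "in", "of", "with", "or", "through"}

def tokens(text):
    return {w for w in text.lower().split() if w not in STOP}

def simplified_lesk(sense_definitions, context):
    """Return (index of best sense, per-sense overlap scores)."""
    ctx = tokens(context)
    scores = [len(tokens(d) & ctx) for d in sense_definitions]
    return scores.index(max(scores)), scores

PINE = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]
best, scores = simplified_lesk(PINE, "Pine cones hanging in a tree")
```

Here the context shares "tree" with the first definition only, so Pine#1 wins with scores [1, 0], matching the slide.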

22 Evaluations of Lesk Algorithm Initial evaluation by M. Lesk –50–70% on short samples of manually annotated text, with respect to the Oxford Advanced Learner’s Dictionary Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004) –Original Lesk: 35% –Simplified Lesk: 47% Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004) –Original Lesk: 42% –Simplified Lesk: 58%

23 Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Preferences Measures of Semantic Similarity Heuristic-based Methods

24 Selectional Preferences A way to constrain the possible meanings of words in a given context E.g. “Wash a dish” vs. “Cook a dish” –WASH-OBJECT vs. COOK-FOOD Capture information about possible relations between semantic classes –Common sense knowledge Alternative terminology –Selectional Restrictions –Selectional Preferences –Selectional Constraints

25 Acquiring Selectional Preferences From annotated corpora –We saw this in the previous lecture notes From raw corpora –Frequency counts –Information theory measures –Class-to-class relations

26 Preliminaries: Learning Word-to-Word Relations An indication of the semantic fit between two words 1. Frequency counts –Pairs of words connected by a syntactic relation 2. Conditional probabilities –Condition on one of the words

27 Learning Selectional Preferences (1) (p. 14 in Chapter 19; you won’t be responsible for this formula) Word-to-class relations (Resnik 1993) –Quantify the contribution of a semantic class using all the concepts subsumed by that class [formula omitted from the transcript]

28 Outline Task definition –Machine Readable Dictionaries Algorithms based on Machine Readable Dictionaries Selectional Restrictions Measures of Semantic Similarity Heuristic-based Methods

29 Semantic Similarity Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976) Use this property for WSD – Identify related meanings for words that share a common context

30 See Figure 19.6 in the chapter Basic idea: the shorter the path between two senses in a semantic network, the more similar they are. So you can see that nickel and dime are closer to budget than they are to Richter scale (in Figure 19.6)

31 Semantic Similarity Metrics (1) Input: two concepts (same part of speech) Output: similarity measure Path-based similarity (Leacock and Chodorow 1998): Similarity(C1,C2) = -log(Path(C1,C2) / 2D), where Path(C1,C2) is the length of the shortest path between the two concepts and D is the taxonomy depth –E.g. Similarity(wolf,dog) = 0.60 Similarity(wolf,bear) = 0.42 Information content (Resnik 1995): IC(C) = -log P(C), where P(C) = probability of seeing a concept of type C in a large corpus –Probability of seeing a concept = probability of seeing instances of that concept
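The path-based idea can be sketched over a toy IS-A hierarchy. The hierarchy and the depth value below are invented for illustration; a real system walks WordNet's noun taxonomy:

```python
import math

# Toy IS-A taxonomy (child -> parent); an assumption for illustration only.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "carnivore",
          "bear": "carnivore", "carnivore": "mammal", "mammal": "animal"}

def path_to_root(concept):
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_length(c1, c2):
    """Edges on the shortest path through the least common subsumer."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    for steps2, node in enumerate(p2):
        if node in p1:
            return p1.index(node) + steps2
    raise ValueError("concepts share no subsumer")

def lch_similarity(c1, c2, depth=6):
    # Leacock-Chodorow: -log(path / 2D), D = (assumed) taxonomy depth
    return -math.log(path_length(c1, c2) / (2.0 * depth))
```

In this toy taxonomy wolf–dog (2 edges, via canine) scores higher than wolf–bear (3 edges, via carnivore), reproducing the ordering on the slide though not its exact numbers.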

32 Semantic Similarity Metrics (2) Similarity using information content (Resnik 1995): define similarity between two concepts as the information content of their Least Common Subsumer, Similarity(C1,C2) = IC(LCS(C1,C2)) –Alternatives (Jiang and Conrath 1997) Other metrics: –Similarity using information content (Lin 1998) –Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) –Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995) –Adapted Lesk algorithm (Banerjee and Pedersen 2002)

33 Most Frequent Sense (1) Identify the most often used meaning and use this meaning by default Word meanings exhibit a Zipfian distribution –E.g. distribution of word senses in SemCor Example: “plant/flora” is used more often than “plant/factory” – annotate any instance of PLANT as “plant/flora”

34 Most Frequent Sense (2) (you aren’t responsible for this) Method 1: Find the most frequent sense in an annotated corpus Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) 1. Given a word w, find the top k distributionally similar words N_w = {n1, n2, …, nk}, with associated similarity scores {dss(w,n1), dss(w,n2), …, dss(w,nk)} 2. For each sense ws_i of w, identify the similarity with the words n_j, using the sense of n_j that maximizes this score 3. Rank senses ws_i of w based on the total similarity score

35 Most Frequent Sense (3) Word senses –pipe#1 = tobacco pipe –pipe#2 = tube of metal or plastic Distributionally similar words –N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …} For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity) –pipe#1 – tube (#3) = 0.3 –pipe#2 – tube (#1) = 0.6 Compute score for each sense pipe#i –score(pipe#1) = 0.25 –score(pipe#2) = 0.73 Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
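The three steps of the McCarthy et al. ranking can be sketched as below. Every number in `NEIGHBORS` and `SENSE_SIM` is invented for this illustration (only the pipe#1/tube = 0.3 and pipe#2/tube = 0.6 pairs echo the slide); a real implementation would draw neighbors from a distributional thesaurus and similarities from a WordNet measure:

```python
# Toy inputs (assumptions): distributional neighbors of "pipe" with their
# dss scores, and sense/neighbor similarities already maximized over the
# neighbor's senses.
NEIGHBORS = {"tube": 0.8, "cable": 0.6}
SENSE_SIM = {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6,
             ("pipe#1", "cable"): 0.2, ("pipe#2", "cable"): 0.5}

def predominant_sense(senses, neighbors, sense_sim):
    """Rank senses by the sum over neighbors of dss * normalized similarity."""
    scores = {}
    for s in senses:
        total = 0.0
        for n, dss in neighbors.items():
            norm = sum(sense_sim[(t, n)] for t in senses)  # normalize per neighbor
            total += dss * sense_sim[(s, n)] / norm
        scores[s] = total
    return max(scores, key=scores.get), scores

best, scores = predominant_sense(["pipe#1", "pipe#2"], NEIGHBORS, SENSE_SIM)
```

With these toy numbers the tube-like sense pipe#2 wins, mirroring the slide's conclusion.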

36 One Sense Per Discourse A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992) What does this mean? E.g. if the ambiguous word PLANT occurs 10 times in a discourse, all instances of “plant” carry the same meaning Evaluation: –8 words with two-way ambiguity, e.g. plant, crane, etc. –98% of the two-word occurrences in the same discourse carry the same meaning The grain of salt: performance depends on granularity –(Krovetz 1998) experiments with words with more than two senses –Performance of “one sense per discourse” measured on SemCor is approx. 70%

37 One Sense per Collocation A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993) –Strong for adjacent collocations –Weaker as the distance between words increases An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs Evaluation: –97% precision on words with two-way ambiguity Finer granularity: –(Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses –70% precision on SemCor words

38 References (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995. (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001. (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002. (Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992. (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992. (Halliday and Hasan 1976) Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman. (Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003. (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press. (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997. (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998. (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986. (Lin 1998) Lin, D. An information theoretic definition of similarity. ICML 1998.

39 References (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000. (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995. (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999. (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000. (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on semantic networks with application to word sense disambiguation. COLING 2004. (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. CICLING 2003. (Rada et al 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989. (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993. (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995. (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004. (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.

40 Part 4: Supervised Methods of Word Sense Disambiguation [This section has been deleted]

41 Part 5: Minimally Supervised Methods for Word Sense Disambiguation

42 Outline Task definition –What does “minimally” supervised mean? Bootstrapping algorithms –Co-training –Self-training –Yarowsky algorithm Using the Web for Word Sense Disambiguation –Web as a corpus –Web as collective mind

43 Task Definition Supervised WSD = learning sense classifiers starting with annotated data Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision Examples –Automatically bootstrap a corpus starting with a few human annotated examples –Use monosemous relatives / dictionary definitions to automatically construct sense tagged data

44 Outline Task definition –What does “minimally” supervised mean? Bootstrapping algorithms –Co-training –Self-training –Yarowsky algorithm Using the Web for Word Sense Disambiguation –Web as a corpus –Web as collective mind

45 Bootstrapping Recipe Ingredients –(Some) labeled data –(Large amounts of) unlabeled data –(One or more) basic classifiers Output –Classifier that improves over the basic classifiers

46 [Figure: bootstrapping on the word “plant”. Raw contexts — “… plants#1 and animals …”, “… industry plant#2 …”, “… building the only atomic plant …”, “… plant growth is retarded …”, “… a herb or flowering plant …”, “… a nuclear power plant …”, “… building a new vehicle plant …”, “… the animal and plant life …”, “… the passion-fruit plant …” — are fed to Classifier 1 and Classifier 2, which produce newly labeled examples such as “… plant#1 growth is retarded …” and “… a nuclear power plant#2 …”]

47 Co-training/Self-training Given: –A set L of labeled training examples –A set U of unlabeled examples –Classifiers C_i 1. Create a pool of examples U' –choose P random examples from U 2. Loop for I iterations –Train C_i on L and label U' –Select the G most confident examples and add them to L maintain distribution in L –Refill U' with examples from U keep U' at constant size P
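The loop above can be sketched generically. The nearest-mean "classifier" and the 1-D toy data below are assumptions used only to make the sketch runnable, and the "maintain distribution in L" refinement is simplified away:

```python
# A generic self-training loop following the recipe above, on toy 1-D data.
def train(L):
    """L: list of (x, label). Returns clf(x) -> (label, confidence margin)."""
    groups = {}
    for x, y in L:
        groups.setdefault(y, []).append(x)
    centers = {y: sum(v) / len(v) for y, v in groups.items()}
    def clf(x):
        y = min(centers, key=lambda c: abs(x - centers[c]))
        rest = [abs(x - centers[c]) for c in centers if c != y]
        return y, (min(rest) - abs(x - centers[y])) if rest else 1.0
    return clf

def self_train(L, U, iterations=5, pool_size=4, growth=2):
    L, U = list(L), list(U)
    pool, U = U[:pool_size], U[pool_size:]       # step 1: create pool U'
    for _ in range(iterations):                  # step 2: loop for I iterations
        clf = train(L)                           # train on L and label U'
        ranked = sorted(pool, key=lambda x: -clf(x)[1])
        for x in ranked[:growth]:                # add G most confident to L
            L.append((x, clf(x)[0]))
        pool = ranked[growth:] + U[:growth]      # refill U' from U
        U = U[growth:]
        if not pool:
            break
    return train(L)

seeds = [(0.0, "A"), (10.0, "B")]
unlabeled = [0.5, 9.5, 1.0, 9.0, 1.5, 8.5, 2.0, 8.0]
clf = self_train(seeds, unlabeled)
```

Starting from one seed per class, the final classifier labels points near 0 as "A" and points near 10 as "B".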

48 Co-training (Blum and Mitchell 1998) Two classifiers –independent views –[independence condition can be relaxed] Co-training in Natural Language Learning –Statistical parsing (Sarkar 2001) –Co-reference resolution (Ng and Cardie 2003) –Part of speech tagging (Clark, Curran and Osborne 2003) –...

49 Self-training (Nigam and Ghani 2000) One single classifier Retrain on its own output Self-training for Natural Language Learning –Part of speech tagging (Clark, Curran and Osborne 2003) –Co-reference resolution (Ng and Cardie 2003) – several classifiers through bagging

50 Yarowsky Algorithm (Yarowsky 1995) Similar to co-training Relies on two heuristics and a decision list –One sense per collocation: nearby words provide strong and consistent clues as to the sense of a target word –One sense per discourse: the sense of a target word is highly consistent within a single document

51 Learning Algorithm A decision list is used to classify instances of the target word: “the loss of animal and plant species through extinction …” Classification is based on the highest ranking rule that matches the target context
LogL | Collocation | Sense
9.31 | flower (within +/- k words) | A (living)
9.24 | job (within +/- k words) | B (factory)
9.03 | fruit (within +/- k words) | A (living)
9.02 | plant species | A (living)
... | ... | ...

52 Bootstrapping Algorithm All occurrences of the target word are identified A small training set of seed data is tagged with word sense (in the figure: Sense-A: life, Sense-B: factory)

53 Bootstrapping Algorithm Seed set grows and residual set shrinks …

54 Bootstrapping Algorithm Convergence: stop when the residual set stabilizes

55 Bootstrapping Algorithm Iterative procedure: –Train decision list algorithm on seed set –Classify residual data with decision list –Create new seed set by identifying samples that are tagged with a probability above a certain threshold –Retrain classifier on new seed set Selecting training seeds –Initial training set should accurately distinguish among possible senses –Strategies: Select a single, defining seed collocation for each possible sense. E.g. “life” and “manufacturing” for target plant Use words from dictionary definitions Hand-label most frequent collocates
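The iterative procedure can be sketched end to end. The seed collocations follow the slide ("life" and "manufacturing" for plant), but the residual contexts, the smoothing constant, and the confidence threshold are toy assumptions:

```python
import math
from collections import defaultdict

def learn_decision_list(labeled, smooth=0.1):
    """labeled: list of (context_words, sense). Returns rules
    (log-likelihood, collocate, sense), sorted best first."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = {s for _, s in labeled}
    for ctx, s in labeled:
        for w in set(ctx):
            counts[w][s] += 1
    rules = []
    for w, per in counts.items():
        for s in senses:
            other = sum(per[t] for t in senses if t != s)
            rules.append((math.log((per[s] + smooth) / (other + smooth)), w, s))
    return sorted(rules, reverse=True)

def classify(rules, ctx):
    """Apply the highest-ranking rule whose collocate appears in the context."""
    ws = set(ctx)
    for score, w, s in rules:
        if w in ws:
            return s, score
    return None, float("-inf")

def bootstrap(seed, residual, rounds=10, threshold=1.0):
    labeled, residual = list(seed), list(residual)
    for _ in range(rounds):
        rules = learn_decision_list(labeled)   # train on current seed set
        keep = []
        for ctx in residual:                   # classify residual data
            s, score = classify(rules, ctx)
            if score > threshold:              # keep confident taggings
                labeled.append((ctx, s))
            else:
                keep.append(ctx)
        if len(keep) == len(residual):         # residual set stabilized
            break
        residual = keep
    return learn_decision_list(labeled)

seed = [(["plant", "life"], "living"), (["manufacturing", "plant"], "factory")]
residual = [["plant", "life", "animal"],
            ["animal", "plant", "species"],
            ["manufacturing", "plant", "equipment"],
            ["equipment", "plant", "automobile"]]
rules = bootstrap(seed, residual)
```

Note how the second and fourth contexts can only be labeled after an intermediate round has promoted "animal" and "equipment" into the decision list, which is the essence of the bootstrap.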

56 Evaluation Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.) Performance of multiple models compared with: –supervised decision lists –unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses
Word | Senses | Supervised | Unsupervised (Schütze) | Unsupervised (Bootstrapping)
plant | living/factory | 97.7 | 92 | 98.6
space | volume/outer | 93.9 | 90 | 93.6
tank | vehicle/container | 97.1 | 95 | 96.5
motion | legal/physical | 98.0 | 92 | 97.9
… | … | … | … | …
Avg. | – | 96.1 | 92.2 | 96.5

57 The Web as a Corpus [This topic has been deleted] Use the Web as a large textual corpus –Build annotated corpora using monosemous relatives –Bootstrap annotated corpora starting with few seeds Similar to (Yarowsky 1995) Use the (semi)automatically tagged data to train WSD classifiers

58 References (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002. (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998. (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD. (Clark, Curran and Osborne 2003) Clark, S. and Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003. (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999. (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002. (Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004. (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003. (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000. (Sarkar 2001) Sarkar, A. Applying cotraining methods to statistical parsing. Proceedings of NAACL 2001. (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.

59 Part 6: Unsupervised Methods of Word Sense Discrimination

60 Outline What is Unsupervised Learning? Task Definition Agglomerative Clustering LSI/LSA Sense Discrimination Using Parallel Texts

61 What is Unsupervised Learning? Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with those of any other cluster Note! If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning –Supervised classification identifies features that trigger a sense tag –Unsupervised clustering finds similarity between contexts

62 Task Definition Word sense discrimination reduces to the problem of finding the instances of a target word that occur in the most similar contexts and placing them in a common cluster

63 Agglomerative Clustering Create a similarity matrix of instances to be discriminated –Results in a symmetric “instance by instance” matrix, where each cell contains the similarity score between a pair of instances –Typically a first order representation, where similarity is based on the features observed in the pair of instances Apply Agglomerative Clustering algorithm to matrix –To start, each instance is its own cluster –Form a cluster from the most similar pair of instances –Repeat until the desired number of clusters is obtained Advantages: high quality clustering Disadvantages: computationally expensive, must carry out exhaustive pairwise comparisons
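The merge loop above can be sketched as average-link agglomerative clustering over an "instance by instance" similarity matrix. The 4x4 matrix below is a toy assumption with two obvious groups:

```python
# Average-link agglomerative clustering over a precomputed similarity matrix.
def agglomerative(sim, k):
    """sim: symmetric similarity matrix; merge pairs until k clusters remain."""
    clusters = [[i] for i in range(len(sim))]   # each instance starts alone
    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters.pop(b)          # merge the most similar pair
    return clusters

SIM = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.9],
       [0.1, 0.1, 0.9, 1.0]]
clusters = agglomerative(SIM, 2)
```

The exhaustive pairwise `max` on every merge is exactly the expense the slide warns about; production systems use priority queues or sampling instead.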

64 Measuring Similarity (you don’t need to know these) Integer Values –Matching Coefficient –Jaccard Coefficient –Dice Coefficient Real Values –Cosine
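For reference, the four measures named above can be written in a few lines: the set-based coefficients operate on first-order word sets of two contexts, and cosine operates on real-valued context vectors. The example contexts are invented for illustration:

```python
import math

def matching(a, b):
    return len(a & b)

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

a = {"sit", "chair", "seat"}          # toy context word sets
b = {"chair", "seat", "department"}
```

On these sets matching is 2, Jaccard 2/4, and Dice 4/6; cosine of identical vectors is 1 and of orthogonal vectors 0.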

65 Evaluation of Unsupervised Methods If sense-tagged text is available, it can be used for evaluation Assume that sense tags represent “true” clusters, and compare these to discovered clusters –Find the mapping of clusters to senses that attains maximum accuracy Pseudo-words are especially useful, since it is hard to find data that is discriminated –Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate. –http://www.d.umn.edu/~kulka020/kanaghaName.html Baseline algorithm – group all instances into one cluster; this will reach “accuracy” equal to the majority classifier
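The "find the mapping that attains maximum accuracy" step can be sketched by brute force over all cluster-to-sense assignments, which is fine for the handful of senses typical here (larger problems would use the Hungarian algorithm):

```python
from itertools import permutations

def cluster_accuracy(clusters, gold):
    """clusters, gold: parallel lists of cluster ids and sense tags.
    Tries every one-to-one mapping of cluster ids to senses and reports
    the best accuracy. Extra clusters beyond the number of senses are
    simply left unmapped (and thus always counted wrong)."""
    cids, senses = sorted(set(clusters)), sorted(set(gold))
    best = 0
    for perm in permutations(senses):
        mapping = dict(zip(cids, perm))
        best = max(best, sum(mapping.get(c) == g
                             for c, g in zip(clusters, gold)))
    return best / len(gold)
```

For instance, two discovered clusters over five instances with gold tags a, a, b, b, a score 0.8 under the best mapping, while the single-cluster baseline collapses to the majority-class accuracy of 0.6, exactly as the slide describes.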

66 Sense Discrimination Using Parallel Texts There is controversy as to what exactly is a “word sense” (e.g., Kilgarriff, 1997) It is sometimes unclear how fine grained sense distinctions need to be to be useful in practice Parallel text may present a solution to both problems! –Text in one language and its translation into another Resnik and Yarowsky (1997) suggest that word sense disambiguation concern itself with sense distinctions that manifest themselves across languages –A “bill” in English may be a “pico” (bird jaw) or a “cuenta” (invoice) in Spanish

67 Parallel Text Parallel text can be found on the Web and there are several large corpora available (e.g., UN Parallel Text, Canadian Hansards) Manual annotation of sense tags is not required! However, text must be word aligned (translations identified between the two languages) –http://www.cs.unt.edu/~rada/wpt/ Workshop on Parallel Text, NAACL 2003 Given word-aligned parallel text, sense distinctions can be discovered (e.g., Li and Li, 2002; Diab, 2002)
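The core of the idea is simple once alignment exists: instances of an English word are grouped by their aligned translation, and each translation group is a candidate sense. The tiny alignment below is an invented example echoing the "bill" discussion above:

```python
# Discover candidate senses from word-aligned parallel text by grouping
# occurrences of a target word by their aligned translation.
def senses_by_translation(alignments, target):
    """alignments: list of (english_word, aligned_translation) pairs,
    one per token occurrence. Returns {translation: [occurrence ids]}."""
    groups = {}
    for i, (en, tr) in enumerate(alignments):
        if en == target:
            groups.setdefault(tr, []).append(i)
    return groups

ALIGNED = [("the", "el"), ("bill", "pico"), ("of", "de"), ("the", "el"),
           ("bird", "pájaro"), ("pay", "pagar"), ("the", "la"),
           ("bill", "cuenta"), ("a", "un"), ("bill", "pico")]
groups = senses_by_translation(ALIGNED, "bill")
```

The occurrences of "bill" split into a "pico" group and a "cuenta" group, i.e. exactly the cross-lingual sense distinction Resnik and Yarowsky advocate.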

68 References (Diab, 2002) Diab, M. and Resnik, P. An unsupervised method for word sense tagging using parallel corpora. Proceedings of ACL, 2002. (Firth, 1957) A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, Oxford University Press, Oxford. (Kilgarriff, 1997) “I don’t believe in word senses”. Computers and the Humanities (31) pp. 91-113. (Li and Li, 2002) Word translation disambiguation using bilingual bootstrapping. Proceedings of ACL, pp. 343-351. (McQuitty, 1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement (26) pp. 825-831. (Miller and Charles, 1991) Contextual correlates of semantic similarity. Language and Cognitive Processes, 6 (1) pp. 1-28. (Pedersen and Bruce, 1997) Distinguishing word sense in untagged text. Proceedings of EMNLP2, pp. 197-207. (Purandare and Pedersen, 2004) Word sense discrimination by clustering contexts in vector and similarity spaces. Proceedings of the Conference on Natural Language Learning, pp. 41-48. (Resnik and Yarowsky, 1997) A perspective on word sense disambiguation methods and their evaluation. The ACL-SIGLEX Workshop Tagging Text with Lexical Semantics, pp. 79-86. (Schütze, 1998) Automatic word sense discrimination. Computational Linguistics, 24 (1) pp. 97-123.

69 Outline [most of this section deleted] Where to get the required ingredients? –Machine Readable Dictionaries –Machine Learning Algorithms –Sense Annotated Data –Raw Data Where to get WSD software? How to get your algorithms tested? –Senseval

70 Senseval Evaluation of WSD systems http://www.senseval.org Senseval 1: 1999 – about 10 teams Senseval 2: 2001 – about 30 teams Senseval 3: 2004 – about 55 teams Senseval 4: 2007(?) Provides sense annotated data for many languages, for several tasks –Languages: English, Romanian, Chinese, Basque, Spanish, etc. –Tasks: Lexical Sample, All words, etc. Provides evaluation software Provides results of other participating systems

71 Thank You! Rada Mihalcea (rada@cs.unt.edu) –http://www.cs.unt.edu/~rada Ted Pedersen (tpederse@d.umn.edu) –http://www.d.umn.edu/~tpederse

