
1 Lecture 22: Word Similarity
Topics: word similarity; thesaurus-based word similarity; intro to distribution-based word similarity
Readings: NLTK book Chapter 2 (WordNet); text Chapter 20
April 8, 2013, CSCE 771 Natural Language Processing

2 Overview (CSCE 771, Spring 2013)
Last Time (Programming): features in NLTK; NL queries to SQL; NLTK support for interpretations and models; propositional and predicate logic support; Prover9
Today: last lecture's slides 25-29; features in NLTK; computational lexical semantics
Readings: text Chapters 19, 20; NLTK book Chapter 10
Next Time: Computational Lexical Semantics II

3 ACL Anthology: http://aclweb.org/anthology-new/

4 Figure 20.8 Summary of thesaurus similarity measures

5 WordNet similarity functions: path_similarity(), lch_similarity(), wup_similarity(), res_similarity(), jcn_similarity(), lin_similarity()

6 Examples: but first a pop quiz. How do you get hypernyms from WordNet?

7 Example: P(c) values [figure: a WordNet-style taxonomy from entity, through physical thing / abstraction and living thing / non-living thing, down to mammals (cat, dog, whale: right, minke), amphibians (frog), reptiles (snake), plus novel, idea, pacifier#1, pacifier#2; color code: blue nodes are from WordNet, red nodes are invented]

8 Example: counts (made up) [figure: the same taxonomy, annotated with made-up counts]

9 Example: P(c) values [figure: the same taxonomy, annotated with P(c) values derived from the counts]

10 Example [figure: the same taxonomy]

11 sim_Lesk(cat, dog) = ???
(42) S: (n) dog#1 (dog%1:05:00::), domestic dog#1 (domestic_dog%1:05:00::), Canis familiaris#1 (canis_familiaris%1:05:00::) (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night"
(18) S: (n) cat#1 (cat%1:05:00::), true cat#1 (true_cat%1:05:00::) (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats)
(1) S: (n) wolf#1 (wolf%1:05:00::) (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)
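A minimal sketch of the Lesk overlap idea: count the non-stopword words shared by two senses' glosses. The glosses below are abridged from the WordNet entries on this slide; the stopword list is an illustrative assumption.

```python
# Minimal sketch of the Lesk gloss-overlap measure: sim_Lesk counts
# words shared by two senses' dictionary glosses (stopwords removed).
STOPWORDS = {"a", "of", "the", "that", "by", "in", "and", "to", "no", "has"}

def lesk_overlap(gloss1: str, gloss2: str) -> int:
    """Number of distinct non-stopword tokens the two glosses share."""
    tokens1 = {w for w in gloss1.lower().split() if w not in STOPWORDS}
    tokens2 = {w for w in gloss2.lower().split() if w not in STOPWORDS}
    return len(tokens1 & tokens2)

dog_gloss = ("a member of the genus canis that has been domesticated "
             "by man since prehistoric times")
cat_gloss = "feline mammal usually having thick soft fur and no ability to roar"

print(lesk_overlap(dog_gloss, cat_gloss))
```

Real Lesk implementations typically also weight multi-word overlaps more heavily than single-word ones.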

12 Problems with thesaurus-based methods: we don't always have a thesaurus; even when we do, recall suffers (missing words, missing phrases); thesauri work less well for verbs and adjectives, which have less hyponymy structure. (Distributional Word Similarity, D. Jurafsky)

13 Distributional models of meaning: vector-space models of meaning; they offer higher recall than hand-built thesauri, though probably less precision; the intuition is that a word's meaning is reflected in the contexts it occurs in. (Distributional Word Similarity, D. Jurafsky)

14 Word Similarity: Distributional Methods, 20.31, tezguino example (Nida)
A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.
What do you know about tezguino?

15 Term-document matrix
Collect a set of documents and identify a collection of important, discriminatory terms (words).
Build a matrix of terms x documents, where each cell holds the term frequency tf_{w,d}: the number of times word w occurs in document d.
Each document is then a vector in Z^V (Z = integers; N = natural numbers would be more accurate but perhaps misleading).
Example follows. (Distributional Word Similarity, D. Jurafsky)

16 Example term-document matrix, subset of terms = {battle, soldier, fool, clown} (Distributional Word Similarity, D. Jurafsky)

          As You Like It   12th Night   Julius Caesar   Henry V
battle          1               1             8            15
soldier         2               2            12            36
fool           37              58             1             5
clown           6             117             0             0
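The table above can be sketched as a term-document matrix in plain Python, with each document read off as a column vector of term frequencies:

```python
# The term-document matrix above as a dict of term -> row of counts.
docs = ["As You Like It", "12th Night", "Julius Caesar", "Henry V"]
tf = {
    "battle":  [1, 1, 8, 15],
    "soldier": [2, 2, 12, 36],
    "fool":    [37, 58, 1, 5],
    "clown":   [6, 117, 0, 0],
}

def doc_vector(doc_name: str) -> list:
    """Column of the matrix: frequencies of every term in one document."""
    j = docs.index(doc_name)
    return [tf[term][j] for term in tf]

print(doc_vector("Julius Caesar"))  # [8, 12, 1, 0]
```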

17 Figure 20.9 Term-in-context matrix for word similarity (co-occurrence vectors)
A window of 20 words (10 before, 10 after) over the Brown corpus records which words occur together.
A non-Brown example: "The Graduate School requires that all PhD students be admitted to candidacy at least one year prior to graduation. Passing ..."
[figure: small term-in-context table from Brown, 10 words before and 10 after]
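The windowed counting scheme described above can be sketched as follows; this runs on a tiny toy corpus rather than Brown, and the window size is a parameter:

```python
# Sketch of building co-occurrence vectors with a symmetric word window,
# as described above (toy corpus instead of the Brown corpus).
from collections import Counter, defaultdict

def cooccurrence(tokens, window=10):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[w][tokens[j]] += 1
    return vectors

toy = "we make tezguino out of corn everybody likes tezguino".split()
vecs = cooccurrence(toy, window=2)
print(vecs["tezguino"])
```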

18 Pointwise Mutual Information
Use a tf-idf (inverse document frequency) weighting instead of raw counts (the idf intuition again).
Pointwise mutual information (PMI): do events x and y occur together more often than they would if they were independent?
PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
PMI between words; positive PMI between two words (PPMI) replaces negative values with 0.
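A small numerical sketch of the PMI formula above; the probabilities are made up for illustration:

```python
# Sketch of PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with made-up numbers.
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    return math.log2(p_xy / (p_x * p_y))

# Two events that always co-occur: P(x,y) = 0.5 vs. an independence
# baseline of P(x)P(y) = 0.25, so PMI = log2(2) = 1.
print(pmi(0.5, 0.5, 0.5))  # 1.0
```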

19 Computing PPMI
Start from a matrix F with W rows (words) and C columns (contexts), where f_ij is the frequency of word w_i in context c_j.
Then p_ij = f_ij / (sum of all cells), p_i* = (row i sum) / total, p_*j = (column j sum) / total, and
PPMI_ij = max(0, log2 [ p_ij / (p_i* p_*j) ])
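The recipe above, sketched on a tiny made-up count matrix (W = 2 words, C = 2 contexts; the counts are invented for illustration):

```python
# Sketch of computing a PPMI matrix from a small made-up count matrix F.
import math

F = [[4, 0],
     [1, 3]]

total = sum(sum(row) for row in F)                       # grand total of counts
row_p = [sum(row) / total for row in F]                  # p_i* per word
col_p = [sum(F[i][j] for i in range(len(F))) / total     # p_*j per context
         for j in range(len(F[0]))]

def ppmi(i: int, j: int) -> float:
    """max(0, log2(p_ij / (p_i* p_*j))); defined as 0 for zero counts."""
    if F[i][j] == 0:
        return 0.0
    p_ij = F[i][j] / total
    return max(0.0, math.log2(p_ij / (row_p[i] * col_p[j])))

P = [[ppmi(i, j) for j in range(2)] for i in range(2)]
print(P)
```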

20 Example: computing PPMI. We need counts, so let's make some up. (Note on the slide: this table still needs to be edited to contain counts.)

21 Associations
PMI-assoc: assoc_PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
Lin-assoc, where f is composed of a relation r and a word w': assoc_LIN(w, f) = log2 [ P(w, f) / (P(r|w) P(w'|w)) ]
t-test assoc (20.41)

22 Figure 20.10 Co-occurrence vectors
A dependency-based parser (a special case of shallow parsing) identifies relations from "I discovered dried tangerines." (20.32):
discover(subject I); I(subject-of discover); tangerine(obj-of discover); tangerine(adj-mod dried)

23 Figure 20.11 Objects of the verb drink (Hindle, 1990)

24 Vectors review: dot product, vector length, cosine similarity (sim_cosine)
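The three operations on this slide can be sketched in a few lines, applied here to the battle and soldier rows of the term-document matrix from slide 16:

```python
# Sketch of dot product, vector length, and cosine similarity.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def length(v):
    return math.sqrt(dot(v, v))

def sim_cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|)"""
    return dot(u, v) / (length(u) * length(v))

battle = [1, 1, 8, 15]
soldier = [2, 2, 12, 36]
print(sim_cosine(battle, soldier))
```

Because cosine similarity normalizes by vector length, a frequent word and a rare word with the same distribution of contexts come out as similar.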

25 Figure 20.12 Similarity of vectors

26 Figure 20.13 Vector similarity summary

27 Figure 20.14 Hand-built patterns for hypernyms (Hearst, 1992)

28 Figure 20.15

29 Figure 20.16

30 How to do this in NLTK: http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf
NLTK 3.0a1 released February 2013; this version adds support for NLTK's graphical user interfaces. http://nltk.org/nltk3-alpha/
Forum question: which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use such a function for word clustering and the Yarowsky algorithm, to find similar collocations in a large text.
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics
http://en.wikipedia.org/wiki/Portal:Linguistics
http://en.wikipedia.org/wiki/Yarowsky_algorithm
http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html

