Frequency Estimates for Statistical Word Similarity Measures Presenter: Cosmin Adrian Bejan Egidio Terra and C.L.A. Clarke School of Computer Science University.


1 Frequency Estimates for Statistical Word Similarity Measures
Egidio Terra and C.L.A. Clarke, School of Computer Science, University of Waterloo
Presenter: Cosmin Adrian Bejan

2 Introduction
• A comparative study of two methods for estimating the word co-occurrence frequencies required by word similarity measures, which are used to solve human-oriented language tests.
• Examples of such tests:
  • determine the best synonym in a set of alternatives A = {A_1, A_2, A_3, A_4} for a specific target word TW in a context C = {w_1', w_2', …, w_n'} \ TW;
  • determine the best synonym when no context is available.

3 Measuring Word Similarity
• The notion of co-occurrence of two words can be depicted by a contingency table:
  • each dimension represents a random discrete variable W_i with range A = {w_i, ¬w_i};
  • each cell represents the joint frequency, where N_max is the maximum number of co-occurrences.

4 Similarity between two words
• Pointwise Mutual Information
• χ² test
• Likelihood ratio
• Average Mutual Information
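The formulas on this slide did not survive in the transcript. As a rough illustration only, the sketch below uses the standard textbook definitions of pointwise mutual information and the Pearson χ² statistic for a 2×2 contingency table; the exact variants used in the presentation may differ. The function names `pmi` and `chi_square_2x2` are hypothetical.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y))).

    Textbook definition; assumed here, not taken from the slide.
    """
    return math.log2(p_xy / (p_x * p_y))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a = count(w1, w2), b = count(w1, not w2),
    c = count(not w1, w2), d = count(not w1, not w2).
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table sums to zero")
    return n * (a * d - b * c) ** 2 / denominator
```

With independent words (`p_xy == p_x * p_y`) the PMI is zero, and a table whose rows are proportional gives a χ² of zero.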

5 Context-supported similarity
• Cosine of Pointwise Mutual Information
• L1 norm
• Contextual Average Mutual Information
• Contextual Jensen-Shannon Divergence
• Pointwise Mutual Information of Multiple Words
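The slide lists these measures without their formulas. The sketch below shows generic versions of three of them (cosine, L1 distance, and Jensen-Shannon divergence) applied to sparse vectors stored as dictionaries keyed by context word. It assumes that each word is represented by a vector of association scores or probabilities over context words; how the presentation builds those vectors is not recoverable from the transcript, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute differences over the union of the keys."""
    keys = set(u) | set(v)
    return sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in keys)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0.0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0.0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total
```

Cosine is a similarity (higher means closer), while the L1 distance and the Jensen-Shannon divergence are dissimilarities (lower means closer), so rankings of candidate synonyms have to be sorted in opposite directions.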

6 Window-oriented approach
• f_{w_i}: frequency of w_i
• f_{w_1,w_2}: co-occurrence frequency of w_1 and w_2
• N: size of the corpus in words
• P(w_i) = f_{w_i} / N
• f_{w_1,w_2} is estimated by the number of windows in which the two words co-occur.
• N_{wt}: number of windows of size t
• P(w_1, w_2) = f_{w_1,w_2} / N_{wt}
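The definitions above can be sketched as follows. This is a minimal illustration that assumes the corpus is a single list of tokens and that windows are every run of t consecutive tokens, so that N_wt = N - t + 1; the slide does not say how windows are positioned, so that choice is an assumption. The name `window_estimates` is hypothetical.

```python
def window_estimates(tokens, w1, w2, t):
    """Return (P(w1), P(w2), P(w1, w2)) using sliding windows of size t."""
    n = len(tokens)
    if n == 0:
        raise ValueError("empty corpus")
    # Assumption: one window starts at every token position that fits.
    n_wt = max(n - t + 1, 0)
    f_12 = 0
    for i in range(n_wt):
        window = tokens[i:i + t]
        if w1 in window and w2 in window:
            f_12 += 1
    p_1 = tokens.count(w1) / n
    p_2 = tokens.count(w2) / n
    p_12 = f_12 / n_wt if n_wt else 0.0
    return p_1, p_2, p_12


tokens = "a b c a d".split()
p_a, p_b, p_ab = window_estimates(tokens, "a", "b", t=2)
```

Note that the marginals are normalised by N while the joint estimate is normalised by N_wt, exactly as on the slide.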

7 Document-oriented approach
• df_{w_i}: document frequency of a word w_i, i.e. the number of documents in which the word appears
• D: the number of documents
• P(w_i) = df_{w_i} / D
• df_{w_1,w_2}: co-occurrence frequency of two words, i.e. the number of documents in which both words occur
• P(w_1, w_2) = df_{w_1,w_2} / D
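A minimal sketch of the document-oriented estimates, assuming the collection is given as a list of tokenised documents. The name `document_estimates` is hypothetical.

```python
def document_estimates(docs, w1, w2):
    """Return (P(w1), P(w2), P(w1, w2)) from document frequencies."""
    d = len(docs)
    if d == 0:
        raise ValueError("empty collection")
    vocab_per_doc = [set(doc) for doc in docs]
    df_1 = sum(1 for vocab in vocab_per_doc if w1 in vocab)
    df_2 = sum(1 for vocab in vocab_per_doc if w2 in vocab)
    df_12 = sum(1 for vocab in vocab_per_doc if w1 in vocab and w2 in vocab)
    return df_1 / d, df_2 / d, df_12 / d


docs = [["a", "b"], ["a"], ["c"], ["b", "a"]]
p_a, p_b, p_ab = document_estimates(docs, "a", "b")
```

Unlike the window-oriented approach, repeated occurrences of a word inside one document count only once, and the same denominator D is used for both marginal and joint probabilities.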

8 Results for TOEFL test set

9 Results for TS1 and context

