Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words Dmitry Davidov, Ari Rappoport The Hebrew University.


1 Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words Dmitry Davidov, Ari Rappoport The Hebrew University ACL 2006

2 Introduction Goal: discovering word categories, i.e., sets of words sharing a significant aspect of their meaning. Prior approaches use context feature vectors, or pattern-based discovery with a manually prepared pattern set (e.g., "x and y") that requires POS tagging or partial or full parsing.

3 Pattern Candidates High frequency word (HFW): a word appearing more than T_H times per million words, e.g., and, or, from, to. Content word (CW): a word appearing less than T_C times per million words.

4 Pattern Candidates Meta-patterns obey the following constraints: at most 4 words; exactly two content words; no two consecutive CWs. Examples: CHC, CHCH, CHHC, and HCHC, as in "from x to y" (HCHC), "x and y" (CHC), "x and a y" (CHHC).
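The meta-pattern constraints above can be sketched in code. This is a minimal illustration, not the authors' implementation; the thresholds and the toy frequency table are assumptions for the example.

```python
# Sketch of meta-pattern matching. T_H, T_C and the frequency table
# are illustrative assumptions, not values from the paper.
T_H = 100   # HFW: appears more than T_H times per million words
T_C = 50    # CW: appears fewer than T_C times per million words

FREQ_PER_MILLION = {"from": 5000, "to": 6000, "and": 9000,
                    "a": 8000, "book": 12, "newspaper": 9}

META_PATTERNS = {"CHC", "CHCH", "CHHC", "HCHC"}

def word_class(w):
    """Classify a word as HFW ('H'), CW ('C'), or neither (None)."""
    f = FREQ_PER_MILLION.get(w, 0)
    if f > T_H:
        return "H"
    if f < T_C:
        return "C"
    return None

def match_meta_pattern(words):
    """Return the meta-pattern signature if the sequence obeys the
    constraints: at most 4 words, exactly two CWs, no adjacent CWs."""
    if len(words) > 4:
        return None
    classes = [word_class(w) for w in words]
    if None in classes:
        return None
    sig = "".join(classes)
    if sig.count("C") != 2 or "CC" in sig:
        return None
    return sig if sig in META_PATTERNS else None

print(match_meta_pattern(["from", "book", "to", "newspaper"]))  # HCHC
```

A sequence like "book newspaper" is rejected because its two CWs are consecutive.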

5 Symmetric Patterns To find a usable subset of the pattern candidates, we focus on symmetric patterns. Examples: "x belongs to y" (asymmetric relationship); "x and y" (symmetric relationship).

6 Symmetric Patterns We use the single-pattern graph G(P) to identify symmetric patterns. There is a directed arc A(x, y) from node x to node y iff the words x and y both appear in an instance of pattern P as its two CWs, with x preceding y. SymG(P), the symmetric subgraph of G(P), contains only the bidirectional arcs and nodes of G(P).
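The graph construction above can be sketched directly: arcs are ordered CW pairs, and the symmetric subgraph keeps an arc only when its reverse is also present. A minimal sketch, with pattern instances given as (x, y) pairs:

```python
# Minimal sketch of G(P) and SymG(P). Pattern instances are assumed
# to be pre-extracted as ordered (x, y) CW pairs.
def build_arcs(instances):
    """Directed arcs A(x, y): x precedes y as the two CWs of pattern P."""
    return set(instances)

def symmetric_subgraph(arcs):
    """Keep only bidirectional arcs: (x, y) such that (y, x) also exists."""
    return {(x, y) for (x, y) in arcs if (y, x) in arcs}

arcs = build_arcs([("book", "newspaper"), ("newspaper", "book"),
                   ("book", "pencil")])
# ('book', 'pencil') is dropped: no reverse arc was observed.
print(symmetric_subgraph(arcs))
```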

7 Symmetric Patterns We compute three measures on G(P): M1 counts the proportion of words that can appear in both slots of the pattern; M2 and M3 count the proportions of symmetric nodes and symmetric edges in G(P).
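A hedged sketch of the three measures. The exact definitions here are simplified readings of the slide: M1 = fraction of nodes appearing in both slots, M2 = fraction of nodes touched by a bidirectional arc, M3 = fraction of arcs that are bidirectional; the paper's precise normalizations may differ.

```python
# Simplified M1/M2/M3 on G(P); normalization choices are assumptions.
def symmetry_measures(arcs):
    left = {x for x, _ in arcs}        # words seen in the first slot
    right = {y for _, y in arcs}       # words seen in the second slot
    nodes = left | right
    sym_arcs = {(x, y) for (x, y) in arcs if (y, x) in arcs}
    sym_nodes = {n for arc in sym_arcs for n in arc}
    m1 = len(left & right) / len(nodes)   # words usable in both slots
    m2 = len(sym_nodes) / len(nodes)      # symmetric-node proportion
    m3 = len(sym_arcs) / len(arcs)        # symmetric-edge proportion
    return m1, m2, m3

print(symmetry_measures({("a", "b"), ("b", "a"), ("a", "c")}))
```

For a highly symmetric pattern such as "x and y", all three measures should be close to 1; for "x belongs to y" they should be low.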

8 Symmetric Patterns We remove patterns that appear in the corpus less than T_P times per million words, patterns that are not in the top ZT of any of the three lists, and patterns that are in the bottom ZB of at least one of the lists.

9 Discovery of Categories Words that are highly interconnected are good candidates to form a category. The word relationship graph G merges all of the single-pattern graphs into a single unified graph. A clique is a set of nodes in a graph that are pairwise adjacent, i.e., a complete subgraph.

10 The Clique-Set Method Strong n-cliques: subgraphs containing n nodes that are all bidirectionally interconnected. A clique Q defines a category that contains the nodes in Q plus all nodes that are (1) at least unidirectionally connected to all nodes in Q and (2) bidirectionally connected to at least one node in Q.

11 The Clique-Set Method We use 2-cliques. For each 2-clique, a category is generated containing the nodes of all triangles that include that 2-clique and at least one additional bidirectional arc. For example: 'book and newspaper', 'newspaper and book', 'book and note', 'note and book', and 'note and newspaper'.
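The 2-clique category rule can be sketched as follows, using the book/newspaper/note example. This is a simplified reading of the definition on the previous slide (z must be connected in some direction to both clique members and bidirectionally to at least one), not the authors' code:

```python
# Sketch of category generation from 2-cliques (symmetric arcs).
def categories_from_arcs(arcs):
    """For each symmetric arc (x, y), build the category {x, y} plus
    every node z connected (in some direction) to both x and y and
    bidirectionally connected to at least one of them."""
    def connected(a, b):
        return (a, b) in arcs or (b, a) in arcs
    def bidir(a, b):
        return (a, b) in arcs and (b, a) in arcs
    nodes = {n for arc in arcs for n in arc}
    cats = set()
    for (x, y) in arcs:
        if x >= y or not bidir(x, y):   # each unordered 2-clique once
            continue
        cat = {x, y}
        for z in nodes - {x, y}:
            if connected(z, x) and connected(z, y) and (bidir(z, x) or bidir(z, y)):
                cat.add(z)
        cats.add(frozenset(cat))
    return cats

arcs = {("book", "newspaper"), ("newspaper", "book"),
        ("book", "note"), ("note", "book"), ("note", "newspaper")}
print(categories_from_arcs(arcs))
```

On the slide's example, both symmetric arcs (book, newspaper) and (book, note) yield the same category {book, newspaper, note}: note joins the first triangle through its unidirectional arc to newspaper and its bidirectional arc to book.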

12 The Clique-Set Method A pair of nodes connected by a symmetric arc can appear in more than a single category. Suppose the nodes define exactly three strongly connected triangles, ABC, ABD, ACE: arc (A,B) yields the category {A,B,C,D}, while arc (A,C) yields the category {A,C,B,E}.

13 Category Merging To cover as many words as possible, we use the smallest clique: a single symmetric arc. This creates redundant categories. We use two simple merge heuristics: if two categories are identical, we treat them as one; given two categories Q and R, we merge them iff there's more than a 50% overlap between them.
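The two merge heuristics can be sketched as a greedy loop. The slide does not say which set the 50% overlap is measured against; normalizing by the smaller category is an assumption here.

```python
# Greedy category merging. Overlap normalized by the smaller category
# (an assumption; the slide leaves the denominator unspecified).
def merge_categories(categories, threshold=0.5):
    # Identical categories collapse to one via frozenset deduplication.
    cats = [set(c) for c in {frozenset(c) for c in categories}]
    merged = True
    while merged:
        merged = False
        for i in range(len(cats)):
            for j in range(i + 1, len(cats)):
                overlap = len(cats[i] & cats[j]) / min(len(cats[i]), len(cats[j]))
                if overlap > threshold:
                    cats[i] |= cats[j]   # union the two categories
                    del cats[j]
                    merged = True
                    break
            if merged:
                break
    return cats
```

For example, {a, b, c} and {b, c, d} share 2 of 3 words (67% overlap) and merge into {a, b, c, d}, while a disjoint {x, y} survives untouched.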

14 Corpus Windowing To increase category quality and remove categories that are too context-specific, instead of running the algorithm on the whole corpus we divide the corpus into windows of equal size and select only those categories that appear in at least two of the windows.
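The windowing step reduces to: split, discover per window, keep categories seen at least twice. A minimal sketch, where `discover` is a placeholder for the full category-discovery pipeline:

```python
from collections import Counter

# Sketch of corpus windowing; `discover` stands in for the whole
# pattern-and-clique pipeline and is an assumed callable.
def windowed_categories(tokens, n_windows, discover):
    """Run discovery per window; keep categories found in >= 2 windows."""
    size = len(tokens) // n_windows
    counts = Counter()
    for i in range(n_windows):
        window = tokens[i * size:(i + 1) * size]
        for cat in discover(window):
            counts[frozenset(cat)] += 1
    return [set(c) for c, n in counts.items() if n >= 2]
```

A category produced by only one window (e.g., one dominated by a single topic-specific document) is discarded as too context-specific.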

15 Evaluation Three corpora: two English and one Russian. BNC, containing about 100M words. Dmoz, 68GB, containing about 8.2G words from 50M web pages. The Russian corpus was assembled from many web sites and carefully filtered for duplicates, yielding 33GB and 4G words.

16 Evaluation Thresholds The thresholds T_H, T_C, T_P, ZT, ZB were set to 100, 50, 20, 100, 100. The window size was determined by memory size; about 5% of the different words in the corpus were assigned into categories. V = number of different words in the corpus; W = number of words belonging to at least one category; C = number of categories; AS = average category size.

17 Human Judgment Evaluation Part I: subjects were given 40 triplets of words and asked to rank them from 1 to 5 (1 is best). The 40 triplets were obtained as follows: 20 were selected at random from our non-overlapping categories; 10 were selected from the categories produced by k-means (Pantel and Lin, 2002); 10 were generated by random selection of content words.

18 Human Judgment Evaluation Part II: subjects were given the full categories of the triplets that were graded 1 or 2 in Part I. They were asked to grade the categories from 1 (worst) to 10 (best) according to how well the full category met the expectations they had when seeing only the triplet.

19 Human Judgment Evaluation Shared meaning: the average percentage of triplets that were given scores of 1 or 2.

20 WordNet-Based Evaluation We took 10 WN subsets, referred to as 'subjects', and selected at random 10 pairs of words from each subject. For each pair we found the largest of our discovered categories containing it.

21 WordNet-Based Evaluation For each found category C containing N words: Precision: the number of words present in both C and WN, divided by N. Precision*: the number of correct words divided by N (correct words are either words that appear in the WN subtree, or words whose entry in the American Heritage Dictionary fits the subject). Recall: the number of words present in both C and WN, divided by the number of (single) words in WN.
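The three metrics reduce to simple set arithmetic. A minimal sketch, where `extra_correct` stands in for the words judged correct via the American Heritage Dictionary:

```python
# Sketch of the WordNet-based metrics for one discovered category.
def precision_recall(category, wn_subtree, extra_correct=frozenset()):
    """Return (precision, precision*, recall) for category C against a
    WN subject subtree; `extra_correct` models dictionary-validated words."""
    n = len(category)
    in_wn = category & wn_subtree
    precision = len(in_wn) / n
    precision_star = len(in_wn | (category & extra_correct)) / n
    recall = len(in_wn) / len(wn_subtree)
    return precision, precision_star, recall

# Toy example: 'foo' is outside WN but dictionary-validated.
print(precision_recall({"cat", "dog", "foo"},
                       {"cat", "dog", "bird"},
                       extra_correct={"foo"}))
```

Precision* can only exceed plain precision, since it additionally credits correct words missing from the WN subtree.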

22 WordNet-Based Evaluation Number of correctly discovered words (New) that are not in WN. The Table also shows the number of WN words (:WN)

23 WordNet-Based Evaluation BNC 'all words' precision of 90.47%. This metric was reported to be 82% in (Widdows and Dorow, 2002).

24 Discussion Evaluating the task is difficult, e.g., {nightcrawlers, chicken, shrimp, liver, leeches}. This is the first pattern-based lexical acquisition method that is fully unsupervised, requiring no corpus annotation or manually provided patterns or words. Categories are generated using a novel graph clique-set algorithm.

