Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words Dmitry Davidov, Ari Rappoport The Hebrew University ACL 2006

Introduction
Goal: discovering word categories, i.e., sets of words sharing a significant aspect of their meaning.
Previous approaches: context feature vectors, and pattern-based discovery using a manually prepared pattern set (e.g., "x and y"), requiring POS tagging or partial or full parsing.

Pattern Candidates
High frequency word (HFW): a word appearing more than T_H times per million words, e.g., and, or, from, to ...
Content word (CW): a word appearing less than T_C times per million words.
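
As a rough illustration, here is a minimal Python sketch of this frequency split; the tokenization and the exact threshold semantics are assumptions, with T_H and T_C set to the values reported later in the talk:

```python
from collections import Counter

def classify_words(tokens, t_h=100, t_c=50):
    """Split the vocabulary into high frequency words (HFWs) and
    content words (CWs) by frequency per million tokens."""
    counts = Counter(tokens)
    per_million = 1_000_000 / len(tokens)
    hfw, cw = set(), set()
    for word, n in counts.items():
        freq = n * per_million        # occurrences per million words
        if freq > t_h:
            hfw.add(word)
        if freq < t_c:
            cw.add(word)
    return hfw, cw
```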

Pattern Candidates
Meta-patterns obey the following constraints: at most 4 words, exactly two content words, and no two consecutive CWs.
This allows exactly four meta-patterns: CHC, CHCH, CHHC, and HCHC.
Examples: from x to y (HCHC), x and y (CHC), x and a y (CHHC).
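
A sketch of how concrete pattern candidates could be extracted by sliding a window over the corpus; representing a concrete pattern as its HFW sequence with '*' in the CW slots is my own convention, not the paper's:

```python
META_PATTERNS = {"CHC", "CHCH", "CHHC", "HCHC"}

def pattern_instances(tokens, hfw, cw):
    """Yield (pattern, x, y) for every 3- or 4-token window whose
    HFW/CW signature matches one of the four allowed meta-patterns;
    x and y are the two content words in order of appearance."""
    for size in (3, 4):
        for i in range(len(tokens) - size + 1):
            window = tokens[i:i + size]
            sig = "".join("C" if t in cw else "H" if t in hfw else "?"
                          for t in window)
            if sig in META_PATTERNS:
                x, y = (t for t, s in zip(window, sig) if s == "C")
                pattern = tuple("*" if s == "C" else t
                                for t, s in zip(window, sig))
                yield pattern, x, y
```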

Symmetric Patterns
To find a usable subset of the pattern candidates, we focus on symmetric patterns.
Example: "x belongs to y" expresses an asymmetric relationship, while "x and y" expresses a symmetric one.

Symmetric Patterns
We use a single-pattern graph G(P) to identify symmetric patterns.
There is a directed arc A(x, y) from node x to node y iff the words x and y both appear in an instance of pattern P as its two CWs, with x preceding y in P.
SymG(P), the symmetric subgraph of G(P), contains only the bidirectional arcs and nodes of G(P).
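
A minimal sketch of these two graph constructions, assuming the instance stream from the extraction step above; each graph is stored simply as a set of directed (x, y) arcs:

```python
from collections import defaultdict

def build_pattern_graphs(instances):
    """Build one directed graph G(P) per concrete pattern P: an arc
    (x, y) is added whenever x precedes y as the two CWs of an
    instance of P."""
    graphs = defaultdict(set)          # pattern -> set of (x, y) arcs
    for pattern, x, y in instances:
        graphs[pattern].add((x, y))
    return graphs

def symmetric_subgraph(arcs):
    """SymG(P): keep only the bidirectional arcs of G(P)."""
    return {(x, y) for (x, y) in arcs if (y, x) in arcs}
```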

Symmetric Patterns
We compute three measures on G(P):
M1 counts the proportion of words that can appear in both slots of the pattern.
M2 and M3 count the proportion of symmetric nodes and of symmetric arcs in G(P), respectively.
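
Spelling the three measures out in code, as a reconstruction of the verbal definitions above (the exact formulas in the paper may differ in detail):

```python
def symmetry_measures(arcs):
    """Compute (M1, M2, M3) on a pattern graph G(P) given as a set of
    directed (x, y) arcs. Assumes the graph is non-empty."""
    left = {x for x, _ in arcs}        # words seen in the first slot
    right = {y for _, y in arcs}       # words seen in the second slot
    nodes = left | right
    sym_arcs = {(x, y) for (x, y) in arcs if (y, x) in arcs}
    sym_nodes = {n for arc in sym_arcs for n in arc}
    m1 = len(left & right) / len(nodes)   # words usable in both slots
    m2 = len(sym_nodes) / len(nodes)      # proportion of symmetric nodes
    m3 = len(sym_arcs) / len(arcs)        # proportion of symmetric arcs
    return m1, m2, m3
```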

Symmetric Patterns We removed patterns that appear in the corpus less than T P times per million words We remove patterns that are not in the top ZT in any of the three lists We remove patterns that are in the bottom ZB in at least one of the lists

Discovery of Categories
Words that are highly interconnected are good candidates to form a category.
The word relationship graph G is obtained by merging all of the single-pattern graphs into a single unified graph.
A clique is a set of nodes in a graph that are pairwise adjacent, i.e., a complete subgraph.

The Clique-Set Method
Strong n-cliques: subgraphs containing n nodes that are all bidirectionally interconnected.
A clique Q defines a category that contains the nodes in Q plus all of the nodes that are (1) at least unidirectionally connected to all nodes in Q, and (2) bidirectionally connected to at least one node in Q.

The Clique-Set Method
In practice we use 2-cliques. For each 2-clique, a category is generated that contains the nodes of all triangles that include the 2-clique and at least one additional bidirectional arc (see the sketch below).
For example, the instances "book and newspaper", "newspaper and book", "book and note", "note and book" and "note and newspaper" yield the category {book, newspaper, note}: (book, newspaper) and (book, note) are bidirectional arcs, and the unidirectional arc (note, newspaper) closes the triangle.
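
A sketch of this 2-clique category generation over the unified graph, assuming the same set-of-arcs representation as before; treating nodes as orderable strings is an implementation convenience:

```python
def clique_set_categories(arcs):
    """One category per 2-clique (bidirectional arc) {x, y}: add every
    node z forming a triangle with x and y (connected to both in at
    least one direction) where at least one of the two z-arcs is
    itself bidirectional."""
    def conn(a, b):
        return (a, b) in arcs or (b, a) in arcs
    def bidir(a, b):
        return (a, b) in arcs and (b, a) in arcs

    nodes = {n for arc in arcs for n in arc}
    categories = []
    for x, y in arcs:
        if x < y and bidir(x, y):      # visit each 2-clique once
            cat = {x, y}
            for z in nodes - {x, y}:
                if conn(z, x) and conn(z, y) and \
                        (bidir(z, x) or bidir(z, y)):
                    cat.add(z)
            categories.append(cat)
    return categories
```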

The Clique-Set Method
A pair of nodes connected by a symmetric arc can appear in more than a single category.
Suppose the graph contains exactly three strongly connected triangles: ABC, ABD, ACE. Then the arc (A, B) yields the category {A, B, C, D}, and the arc (A, C) yields the category {A, C, B, E}.

Category Merging
In order to cover as many words as possible, we use the smallest clique: a single symmetric arc. This creates redundant categories.
We use two simple merge heuristics: if two categories are identical we treat them as one, and given two categories Q, R, we merge them iff there is more than a 50% overlap between them.
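
A minimal sketch of the merge loop; measuring the 50% overlap against the smaller of the two categories is an assumption, since the slide does not fix the denominator:

```python
def merge_categories(categories):
    """Repeatedly merge any two categories whose overlap exceeds 50%
    of the smaller one; identical categories collapse automatically."""
    cats = [set(c) for c in categories]
    merged = True
    while merged:
        merged = False
        for i in range(len(cats)):
            for j in range(i + 1, len(cats)):
                overlap = len(cats[i] & cats[j])
                if overlap > 0.5 * min(len(cats[i]), len(cats[j])):
                    cats[i] |= cats[j]   # absorb the second category
                    del cats[j]
                    merged = True
                    break
            if merged:
                break
    return cats
```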

Corpus Windowing
To increase category quality and remove categories that are too context-specific, instead of running the algorithm on the whole corpus, we divide the corpus into windows of equal size and select only those categories that appear in at least two of the windows.
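
A sketch of the windowing wrapper; `discover` stands for the whole pipeline above (token list in, list of category sets out), and approximating "the same category appears in two windows" by set containment is my simplification:

```python
def windowed_categories(tokens, n_windows, discover):
    """Run the discovery pipeline on equal-sized corpus windows and
    keep a category only if a matching category (subset or superset)
    also appears in a later window."""
    size = len(tokens) // n_windows
    windows = [discover(tokens[i * size:(i + 1) * size])
               for i in range(n_windows)]
    kept = []
    for i, cats in enumerate(windows):
        later = [c for w in windows[i + 1:] for c in w]
        for cat in cats:
            if any(cat <= other or other <= cat for other in later):
                kept.append(cat)
    return kept
```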

Evaluation
Three corpora, two English and one Russian:
BNC, containing about 100M words.
Dmoz, 68GB, containing about 8.2G words from 50M web pages.
A Russian corpus assembled from many web sites and carefully filtered for duplicates, yielding 33GB and 4G words.

Evaluation
The thresholds T_H, T_C, T_P, Z_T, Z_B were determined by memory size: 100, 50, 20, 100, 100, respectively.
Window size: 5% of the corpus.
Reported statistics include the percentage of the different words in the corpus assigned into categories, where V = number of different words in the corpus, W = number of words belonging to at least one category, C = number of categories, and AS = average category size.

Human Judgment Evaluation
Part I: subjects were given 40 triplets of words and asked to rank them on a 1-5 scale (1 is best).
The 40 triplets were obtained as follows: 20 were selected at random from our non-overlapping categories, 10 were selected from the categories produced by k-means (Pantel and Lin, 2002), and 10 were generated by random selection of content words.

Human Judgment Evaluation
Part II: subjects were given the full categories of the triplets that were graded 1 or 2 in Part I. They were asked to grade the categories from 1 (worst) to 10 (best) according to how well the full category met the expectations they had when seeing only the triplet.

Human Judgment Evaluation
The "shared meaning" score is the average percentage of triplets that were given scores of 1 or 2.

WordNet-Based Evaluation
We took 10 WordNet (WN) subsets, referred to as "subjects", and selected at random 10 pairs of words from each subject. For each pair we found the largest of our discovered categories containing it.

WordNet-Based Evaluation
For each found category C containing N words:
Precision: the number of words present in both C and WN, divided by N.
Precision*: the number of correct words divided by N (correct words are either words that appear in the WN subtree, or words whose entry in the American Heritage Dictionary shows them to belong to the subject).
Recall: the number of words present in both C and WN, divided by the number of (single) words in WN.
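
Spelled out as formulas, as a direct transcription of the verbal definitions above, where WN(s) denotes the set of single words in the subject's WN subtree:

$$\mathrm{Precision}(C) = \frac{|C \cap WN(s)|}{N}, \qquad \mathrm{Precision}^{*}(C) = \frac{|\{w \in C : w \text{ is correct}\}|}{N}, \qquad \mathrm{Recall}(C) = \frac{|C \cap WN(s)|}{|WN(s)|}$$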

WordNet-Based Evaluation
The table also reports the number of correctly discovered words (New) that are not in WN, alongside the number of WN words.

WordNet-Based Evaluation
On the BNC we obtain an "all words" precision of 90.47%; this metric was reported to be 82% in (Widdows and Dorow, 2002).

Discussion
Evaluating this task is difficult; e.g., the category {nightcrawlers, chicken, shrimp, liver, leeches} looks strange at first but is coherent (all are used as fishing bait).
This is the first pattern-based lexical acquisition method that is fully unsupervised, requiring no corpus annotation and no manually provided patterns or words.
Categories are generated using a novel graph clique-set algorithm.