1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,

Presentation transcript:

1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering, 23(7), Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka

2 Outline 1.Introduction 2.Related Work 3.Method 4.Experiments 5.Conclusion

3 1. Introduction (1/5) Semantic Similarity –Web mining: community extraction, relation detection, & entity disambiguation. –Information retrieval: to retrieve a set of documents that is semantically related to a given user query. –Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.

4 1. Introduction (2/5) Web search engines –Page count: the number of pages that contain the query words. –Snippets: a brief window of text extracted by a search engine around the query term in a document.

5 1. Introduction (3/5) Page count –In Google, the page count for “apple” AND “computer” is 288,000,000; for “banana” AND “computer” it is 3,590,000. –By this measure, “apple” is much more similar to “computer” than “banana” is.

6 1. Introduction (4/5) Snippets –“Jaguar” AND “cat” –Jaguar is the largest cat → X is the largest Y

7 1. Introduction (5/5) Web search engine (Google) + Page count + Snippets → Semantic Similarity

8 2. Related Work (1/2) Normalized Google Distance (NGD) –Cilibrasi & Vitanyi. –P and Q: the two words; NGD(P,Q): the distance between P and Q; H(P), H(Q): the page counts for the words P and Q; H(P,Q): the page count for the query “P AND Q”.
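The NGD formula itself did not survive in the transcript. As a hedged sketch, the commonly cited definition of NGD can be computed from the page counts defined above. The parameter n (the total number of pages indexed by the search engine) is part of that definition but is not listed on this slide, and the function name and the example counts are hypothetical.

```python
import math

def ngd(h_p, h_q, h_pq, n):
    """Normalized Google Distance from page counts (sketch).

    h_p, h_q: page counts H(P), H(Q); h_pq: page count H(P,Q);
    n: assumed total number of indexed pages (not given on the slide).
    """
    if min(h_p, h_q, h_pq) <= 0:
        # Logarithms are undefined for zero counts; treat as maximally distant.
        return math.inf
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = ngd(1000, 1000, 10, 10**6)
```

A value of 0 means the two words always co-occur; larger values mean they co-occur less often than their individual frequencies would suggest.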

9 2. Related Work (2/2) Co-occurrence Double-Checking (CODC) –Chen et al. –The measure uses the number of occurrences of P in the top-ranking snippets for the query Q in Google; H(P): the page count for query P; α: a constant in this model, which is experimentally set to the value 0.15.

10 3. Method 1.Outline 2.Page Count-Based Co-Occurrence Measures 3.Lexical Pattern Extraction 4.Lexical Pattern Clustering 5.Measuring Semantic Similarity 6.Training

Outline

Page Count-Based Co-Occurrence Measures (1/2) P∩Q denotes the conjunction query “P AND Q”.

Page Count-Based Co-Occurrence Measures (2/2) N: the number of documents indexed by the search engine.
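The formulas for these two slides are missing from the transcript. The sketch below assumes the four measures are the Jaccard, overlap, Dice, and pointwise mutual information coefficients rewritten in terms of page counts, with H(P∩Q) the count for “P AND Q” and N the index size. The cutoff c (co-occurrence counts at or below it are treated as noise), the default values, and all function names are assumptions for illustration.

```python
import math

def web_jaccard(h_p, h_q, h_pq, c=5):
    # Jaccard coefficient on page counts; small co-occurrence counts map to 0.
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq, c=5):
    # Overlap (Simpson) coefficient on page counts.
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq, c=5):
    # Dice coefficient on page counts.
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq, n, c=5):
    # Pointwise mutual information; n is the number of indexed documents.
    if h_pq <= c:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

# Hypothetical page counts, not taken from any real search engine.
scores = (
    web_jaccard(100, 50, 25),
    web_overlap(100, 50, 25),
    web_dice(100, 50, 25),
    web_pmi(100, 50, 25, n=1000),
)
```

All four functions take raw page counts, so they can be fed directly from three queries: P, Q, and “P AND Q”.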

Lexical Pattern Extraction (1/2) Conditions:
1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence may skip one or more words, but no more than g words consecutively, and the total number of words skipped must not exceed G.
4. All negation contractions in a context are expanded (for example, didn’t becomes did not), and the word not is never skipped when generating subsequences. This ensures that the snippet “X is not a Y” does not produce the subsequence “X is a Y”.

Lexical Pattern Extraction (2/2) A snippet retrieved for the query “ostrich*******bird.” Extracted patterns: “X, a large Y”; “X a flightless Y”; “X, large Y lives”.
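The four extraction conditions can be illustrated with a brute-force sketch. It enumerates word subsequences of an already tokenized snippet in which the two query words have been replaced by X and Y, and keeps those that satisfy the limits L, g, and G. This is not the extraction algorithm used by the authors, only a direct check of the stated conditions; the function name is hypothetical, and contractions are assumed to have been expanded beforehand.

```python
from itertools import combinations

def extract_patterns(tokens, max_len=5, max_gap=2, max_total_skip=4):
    """Return all subsequences of tokens that satisfy the four conditions.

    max_len is L, max_gap is g, max_total_skip is G.
    """
    patterns = set()
    n = len(tokens)
    for size in range(2, min(max_len, n) + 1):
        for idx in combinations(range(n), size):
            chosen = [tokens[i] for i in idx]
            # Condition 1: exactly one X and exactly one Y.
            if chosen.count("X") != 1 or chosen.count("Y") != 1:
                continue
            # Condition 3: limit consecutive and total skipped words.
            gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
            if any(gap > max_gap for gap in gaps) or sum(gaps) > max_total_skip:
                continue
            # Condition 4: the word "not" must never be skipped.
            skipped = set(range(idx[0], idx[-1] + 1)) - set(idx)
            if any(tokens[i] == "not" for i in skipped):
                continue
            patterns.add(" ".join(chosen))
    return patterns

negated = extract_patterns("X is not a Y".split())
```

Enumerating all index combinations is exponential in the snippet length, so this is only practical for the short text windows that search engines return.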

Lexical Pattern Clustering (1/2) word-pair frequency: total occurrence: a_j: a pattern in pattern vector a. (P_i, Q_j): a word pair.

Lexical Pattern Clustering (2/2)
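The content of this slide is missing from the transcript. As a hedged sketch of what a sequential pattern clustering pass could look like, the code below represents each pattern by its word-pair frequency vector, visits patterns in decreasing order of total occurrence, and assigns each pattern to the most similar existing cluster if the cosine similarity exceeds the clustering threshold θ, otherwise opening a new cluster. Using the sum of member vectors as the cluster centroid is an assumption, as are all names and the toy frequencies.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sequential_cluster(pattern_vectors, theta):
    """pattern_vectors maps pattern -> {word_pair: frequency}."""
    order = sorted(pattern_vectors,
                   key=lambda p: sum(pattern_vectors[p].values()),
                   reverse=True)
    clusters = []  # each cluster: {"centroid": dict, "members": list}
    for pattern in order:
        vec = pattern_vectors[pattern]
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > theta:
            best["members"].append(pattern)
            for key, value in vec.items():
                best["centroid"][key] = best["centroid"].get(key, 0) + value
        else:
            clusters.append({"centroid": dict(vec), "members": [pattern]})
    return [cluster["members"] for cluster in clusters]

# Toy frequencies, invented for illustration only.
toy = {
    "X is a Y": {("ostrich", "bird"): 3, ("lion", "cat"): 2},
    "X is a kind of Y": {("ostrich", "bird"): 1, ("lion", "cat"): 1},
    "X and Y": {("salt", "pepper"): 4},
}
groups = sequential_cluster(toy, theta=0.5)
```

Each pattern is visited once, so the pass is linear in the number of patterns times the number of clusters formed so far.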

Measuring Semantic Similarity (1/5) Weight to a pattern a_i in a cluster c_j: The jth feature for a word pair (P, Q):

Measuring Semantic Similarity (2/5) Feature vector for a word pair (P, Q) :
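The weight and feature formulas are missing from the transcript. One plausible reading, sketched below under stated assumptions, is that a pattern's weight within a cluster is its total occurrence divided by the summed total occurrence of all patterns in that cluster, and that the jth feature is the weighted sum of how often each pattern in cluster c_j occurs with the word pair (P, Q). The final feature vector is assumed to concatenate these cluster features with the four page-count measures. All names and numbers are hypothetical.

```python
def cluster_feature(cluster, total_occurrence, pair_frequency):
    """Weighted frequency of one cluster's patterns for a single word pair.

    cluster: list of patterns in c_j
    total_occurrence: pattern -> total occurrence over all word pairs
    pair_frequency: pattern -> frequency with the word pair (P, Q)
    """
    total = sum(total_occurrence[a] for a in cluster)
    if total == 0:
        return 0.0
    return sum(total_occurrence[a] / total * pair_frequency.get(a, 0)
               for a in cluster)

def feature_vector(clusters, total_occurrence, pair_frequency, page_count_scores):
    # Cluster-based features followed by the page-count-based measures.
    features = [cluster_feature(c, total_occurrence, pair_frequency)
                for c in clusters]
    return features + list(page_count_scores)

# Invented numbers for illustration.
mu = {"a1": 3, "a2": 1, "a3": 2}
freq = {"a1": 2, "a2": 4}
vector = feature_vector([["a1", "a2"], ["a3"]], mu, freq, [0.2, 0.5, 0.33, 2.3])
```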

Measuring Semantic Similarity (3/5) Train a two-class SVM: ( synonymous / nonsynonymous ) Semantic similarity:

Measuring Semantic Similarity (4/5) Distance: b: the bias term and the hyperplane. a_k: the Lagrange multiplier. f_k: support vector. K(f_k, f): the value of the kernel function. f: the instance to classify.

Measuring Semantic Similarity (5/5) The probability: Log likelihood:
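The distance and probability formulas are missing from the transcript. A minimal sketch, assuming the usual SVM decision function (a sum over support vectors of label times Lagrange multiplier times kernel value, plus the bias b) and a sigmoid that maps the signed distance to a probability of the synonymous class, is shown below. The sigmoid parameters lam and mu are assumed to be fitted by maximizing the log likelihood on training data; here they are simply passed in, and every name and number is hypothetical.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision_value(f, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    # Signed distance of instance f from the separating hyperplane.
    return sum(y * a * kernel(sv, f)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

def similarity(distance, lam, mu):
    # Sigmoid mapping of the SVM output to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(lam * distance + mu))

# Toy model with a single support vector, invented for illustration.
d = decision_value((1.0, 0.0), [(1.0, 0.0)], [1], [2.0], bias=-1.0)
sim = similarity(d, lam=-1.0, mu=0.0)
```

With a negative lam, instances farther on the synonymous side of the hyperplane receive similarity scores closer to 1.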

Training (1/5) Number of Patterns Extracted for Training Data. Synonymous: (A, B), (C, D). Nonsynonymous: (A, D), (C, B).

Training (2/5) L = 5, g = 2, G = 4, & T = 5, for lexical pattern extraction conditions. Distribution of patterns extracted from synonymous word pairs.

Training (3/5) Average similarity versus clustering threshold θ.

Training (4/5) The centroid vector of all feature vectors: The average Mahalanobis distance: |W|: the number of word pairs in W. C^-1: the inverse of the intercluster correlation matrix.
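The two formulas on this slide are missing from the transcript. As a sketch under stated assumptions, the centroid is taken to be the component-wise mean of the feature vectors of the word pairs in W, and the average distance is the mean over W of the Mahalanobis quadratic form between the centroid and each feature vector. Whether the original applies a square root to the quadratic form cannot be recovered from the slide, so the code leaves it out. The inverse matrix C^-1 is passed in rather than computed, and all names are hypothetical.

```python
def centroid(vectors):
    # Component-wise mean of equally sized feature vectors.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def mahalanobis_sq(u, v, c_inv):
    # Quadratic form (u - v)^T C^-1 (u - v), with c_inv given as nested lists.
    diff = [a - b for a, b in zip(u, v)]
    size = len(diff)
    return sum(diff[i] * c_inv[i][j] * diff[j]
               for i in range(size) for j in range(size))

def average_distance(vectors, c_inv):
    # Mean distance between the centroid and every feature vector in W.
    center = centroid(vectors)
    return sum(mahalanobis_sq(center, f, c_inv) for f in vectors) / len(vectors)

# Two toy feature vectors and an identity matrix standing in for C^-1.
identity = [[1.0, 0.0], [0.0, 1.0]]
avg = average_distance([[0.0, 0.0], [2.0, 0.0]], identity)
```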

Training (5/5) Distribution of patterns extracted from nonsynonymous word pairs.

28 4. Experiments 1.Benchmark Data Sets 2.Semantic Similarity 3.Community Mining

29 5. Conclusion (1/3) 1.A semantic similarity measure using both page counts and snippets retrieved from a web search engine for two words. 2.Four word co-occurrence measures were computed using page counts. 3.A lexical pattern extraction algorithm to extract numerous semantic relations that exist between two words.

30 5. Conclusion (2/3) 4.A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation. 5.Both page counts-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair. 6.A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.

31 5. Conclusion (3/3) Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings. The proposed method improved the F-score in a community mining task, thereby underlining its usefulness in real-world tasks that include named entities not adequately covered by manually created resources.

32 The End~ Thanks for your attention!!