1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words
Danushka Bollegala, Yutaka Matsuo, & Mitsuru Ishizuka
IEEE Trans. on Knowledge & Data Engineering, 23(7), 2011.
Presenter: Guan-Yu Chen

2 Outline
1. Introduction
2. Related Work
3. Method
4. Experiments
5. Conclusion

3 1. Introduction (1/5)
Semantic Similarity
– Web mining: community extraction, relation detection, & entity disambiguation.
– Information retrieval: to retrieve a set of documents that is semantically related to a given user query.
– Natural language processing: word sense disambiguation, textual entailment, & automatic text summarization.

4 1. Introduction (2/5)
Web search engines
– Page count: the number of pages that contain the query words.
– Snippets: a brief window of text extracted by a search engine around the query term in a document.

5 1. Introduction (3/5)
Page count
– In Google, the page count for “apple” AND “computer” is 288,000,000, while the count for “banana” AND “computer” is 3,590,000.
– The much larger count suggests that “apple” is more closely related to “computer” than “banana” is.

6 1. Introduction (4/5)
Snippets
– For the query “Jaguar” AND “cat”, a snippet such as “Jaguar is the largest cat” yields the lexical pattern “X is the largest Y”.

7 1. Introduction (5/5)
Web search engine (Google) + Page count + Snippets → Semantic Similarity

8 2. Related Work (1/2)
Normalized Google Distance (NGD)
– Cilibrasi & Vitanyi, 2007.
P and Q: the two words.
NGD(P,Q): the distance between P and Q.
H(P), H(Q): the page counts for the words P and Q.
H(P,Q): the page count for the query “P AND Q”.
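The NGD formula itself appeared only as an image on this slide. Its published form (Cilibrasi & Vitanyi, 2007), restated here in LaTeX with the slide's notation, where N is the total number of documents indexed by the search engine (defined on slide 13), is:

```latex
\mathrm{NGD}(P,Q) = \frac{\max\{\log H(P),\, \log H(Q)\} - \log H(P,Q)}
                         {\log N - \min\{\log H(P),\, \log H(Q)\}}
```

A smaller NGD means that the pages containing P and the pages containing Q overlap heavily, i.e., the two words are more closely related.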

9 2. Related Work (2/2)
Co-occurrence Double-Checking (CODC)
– Chen et al., 2006.
F(P@Q): the number of occurrences of P in the top-ranking snippets returned by Google for the query Q.
H(P): the page count for the query P.
α: a constant of the model, experimentally set to 0.15.
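The CODC formula was also shown as an image. The definition reported by Chen et al. (2006), reconstructed here from the published paper rather than from the slide itself, is:

```latex
\mathrm{CODC}(P,Q) =
\begin{cases}
0, & \text{if } F(P@Q) = 0,\\[6pt]
\exp\!\left( \log \left[ \dfrac{F(P@Q)}{H(P)} \times \dfrac{F(Q@P)}{H(Q)} \right]^{\alpha} \right), & \text{otherwise.}
\end{cases}
```

CODC thus double-checks the co-occurrence in both directions: P must appear in the snippets for Q and Q in the snippets for P, otherwise the similarity is zero.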

10 3. Method
1. Outline
2. Page Count-Based Co-Occurrence Measures
3. Lexical Pattern Extraction
4. Lexical Pattern Clustering
5. Measuring Semantic Similarity
6. Training

11 3.1 Outline

12 3.2 Page Count-Based Co-Occurrence Measures (1/2)
P∩Q denotes the conjunction query “P AND Q”.

13 3.2 Page Count-Based Co-Occurrence Measures (2/2)
N: the number of documents indexed by the search engine.
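The measures themselves appeared as images on slides 12 and 13. The four page-count-based measures used in the paper are web variants of the Jaccard, Overlap (Simpson), Dice, and PMI coefficients; restated in LaTeX from the published definitions (the paper additionally sets each measure to 0 when H(P∩Q) falls below a small threshold, a detail omitted here):

```latex
\mathrm{WebJaccard}(P,Q) = \frac{H(P \cap Q)}{H(P) + H(Q) - H(P \cap Q)} \qquad
\mathrm{WebOverlap}(P,Q) = \frac{H(P \cap Q)}{\min\{H(P),\, H(Q)\}}
\mathrm{WebDice}(P,Q) = \frac{2\, H(P \cap Q)}{H(P) + H(Q)} \qquad
\mathrm{WebPMI}(P,Q) = \log_2 \frac{H(P \cap Q)/N}{\bigl(H(P)/N\bigr)\bigl(H(Q)/N\bigr)}
```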

14 3.3 Lexical Pattern Extraction (1/2)
Conditions (illustrated in the code sketch after this list):
1. A subsequence must contain exactly one occurrence of each of X and Y.
2. The maximum length of a subsequence is L words.
3. A subsequence is allowed to skip one or more words. However, we do not skip more than g words consecutively, and the total number of words skipped in a subsequence must not exceed G.
4. We expand all negation contractions in a context; for example, didn’t is expanded to did not. We do not skip the word not when generating subsequences. This ensures that from the snippet X is not a Y we do not produce the subsequence X is a Y.
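As a concrete illustration of conditions 1 to 4, here is a minimal Python sketch of the subsequence enumeration. It is not the authors' implementation: the function name and the brute-force enumeration are assumptions, and the snippet is assumed to be already tokenized with the two query words replaced by the markers X and Y.

```python
from itertools import combinations

def extract_patterns(tokens, L=5, g=2, G=4):
    """Enumerate lexical patterns (subsequences) from a tokenized snippet in which
    the two query words have already been replaced by the markers 'X' and 'Y'.

    Conditions from the slide:
      1. exactly one 'X' and one 'Y' per subsequence,
      2. at most L words per subsequence,
      3. at most g consecutively skipped words and at most G skipped words in total,
      4. the word 'not' is never skipped (contractions are assumed to have been
         expanded upstream, e.g. "didn't" -> "did not").
    """
    patterns = set()
    n = len(tokens)
    for start in range(n - 1):
        # a span holds at most L kept words plus G skipped words
        for end in range(start + 2, min(n, start + L + G) + 1):
            span = tokens[start:end]
            inner = list(range(1, len(span) - 1))   # first and last token of the span are kept
            max_extra = min(L - 2, len(inner))      # how many inner tokens may also be kept
            for k in range(max_extra + 1):
                for chosen in combinations(inner, k):
                    kept = [0, *chosen, len(span) - 1]
                    words = [span[i] for i in kept]
                    if words.count("X") != 1 or words.count("Y") != 1:
                        continue                    # condition 1
                    skipped = [i for i in inner if i not in chosen]
                    if len(skipped) > G:
                        continue                    # condition 3: total skips
                    gaps = [b - a - 1 for a, b in zip(kept, kept[1:])]
                    if any(gap > g for gap in gaps):
                        continue                    # condition 3: consecutive skips
                    if any(span[i] == "not" for i in skipped):
                        continue                    # condition 4
                    patterns.add(" ".join(words))
    return patterns

# Example: extract_patterns("X , a large Y".split()) includes "X , a large Y" and "X , large Y".
```

The brute-force enumeration is acceptable here because L and G are small (5 and 4 in the paper's setting), so each span contains at most a handful of optional positions.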

15 3.3 Lexical Pattern Extraction (2/2)
Patterns extracted from a snippet retrieved for the query “ostrich*******bird”:
X, a large Y
X a flightless Y
X, large Y lives

16 3.4 Lexical Pattern Clustering (1/2)
word-pair frequency:
total occurrence:
a_j: a pattern in the pattern vector a.
(P_i, Q_j): a word pair.

17 3.4 Lexical Pattern Clustering (2/2)
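This slide showed the clustering algorithm as a figure. Below is a minimal Python sketch of greedy sequential clustering as the paper describes it: patterns are processed in descending order of total occurrence, and each pattern joins the most similar existing cluster if the cosine similarity between its word-pair frequency vector and the cluster centroid exceeds the threshold θ, otherwise it starts a new cluster. The data structures and helper names are my assumptions.

```python
import numpy as np

def greedy_cluster(pattern_vectors, theta):
    """Greedy sequential clustering of lexical patterns.

    pattern_vectors: dict mapping a pattern string to a numpy vector whose j-th
    element is the pattern's frequency with the j-th word pair. Patterns are
    processed in descending order of total occurrence; each pattern either joins
    the most similar cluster (cosine similarity with the centroid above theta)
    or starts a new cluster of its own.
    """
    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return 0.0 if denom == 0 else float(np.dot(u, v) / denom)

    clusters = []  # each cluster: {"patterns": [...], "centroid": vector}
    ordering = sorted(pattern_vectors, key=lambda p: pattern_vectors[p].sum(), reverse=True)
    for pattern in ordering:
        vec = pattern_vectors[pattern]
        sims = [cosine(vec, c["centroid"]) for c in clusters]
        if sims and max(sims) > theta:
            best = clusters[int(np.argmax(sims))]
            best["patterns"].append(pattern)
            # update the centroid as the mean of the member vectors
            best["centroid"] = np.mean([pattern_vectors[p] for p in best["patterns"]], axis=0)
        else:
            clusters.append({"patterns": [pattern], "centroid": vec.copy()})
    return clusters
```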

18 3.5 Measuring Semantic Similarity (1/5)
Weight of a pattern a_i in a cluster c_j:
The jth feature for a word pair (P, Q):

19 3.5 Measuring Semantic Similarity (2/5)
Feature vector for a word pair (P, Q):
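The weighting and feature formulas on the previous slide and the feature vector on this one appeared as images. The sketch below assumes a plausible reading: each pattern cluster contributes one feature, computed as an occurrence-weighted sum of the word pair's frequencies with the cluster's patterns, and the four page-count-based measures are appended. Treat the exact weighting as an assumption, not the paper's definition.

```python
import numpy as np

def feature_vector(pair_pattern_freq, clusters, pattern_total_occurrence, page_count_features):
    """Build the feature vector for a word pair (P, Q).

    pair_pattern_freq:        dict pattern -> frequency of the pattern in snippets for (P, Q)
    clusters:                 list of clusters, each a list of pattern strings
    pattern_total_occurrence: dict pattern -> total occurrence over all word pairs
                              (used here as the assumed per-pattern weight)
    page_count_features:      the four page-count measures for (P, Q)
    """
    cluster_features = []
    for cluster in clusters:
        norm = sum(pattern_total_occurrence[a] for a in cluster) or 1.0
        score = sum(
            (pattern_total_occurrence[a] / norm) * pair_pattern_freq.get(a, 0.0)
            for a in cluster
        )
        cluster_features.append(score)
    # one feature per pattern cluster, followed by the page-count-based measures
    return np.array(cluster_features + list(page_count_features), dtype=float)
```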

20 3.5 Measuring Semantic Similarity (3/5)
Train a two-class SVM (synonymous / nonsynonymous).
Semantic similarity:
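A minimal sketch of this step using scikit-learn (the library choice is mine, not the paper's): train a two-class SVM on feature vectors of synonymous (label 1) and nonsynonymous (label 0) word pairs, then read off the positive-class probability as the similarity score.

```python
from sklearn.svm import SVC

def train_similarity_svm(X_train, y_train):
    """X_train: feature vectors of word pairs; y_train: 1 = synonymous, 0 = nonsynonymous.
    probability=True makes scikit-learn fit a sigmoid over the SVM outputs
    (Platt scaling), mirroring the probability step on the following slides."""
    model = SVC(kernel="rbf", probability=True)
    model.fit(X_train, y_train)
    return model

def semantic_similarity(model, feature_vec):
    """Similarity of a word pair = posterior probability of the synonymous class."""
    return model.predict_proba([feature_vec])[0][1]
```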

21 3.5 Measuring Semantic Similarity (4/5)
Distance:
b: the bias term of the separating hyperplane.
a_k: the Lagrange multiplier.
f_k: a support vector.
K(f_k, f): the value of the kernel function.
f: the instance to classify.
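The distance expression was shown as an image; given the variables listed on this slide, it is the standard kernel-SVM decision value. The class label y_k of support vector f_k is implicit on the slide and added here:

```latex
d(f) = \sum_{k} a_k\, y_k\, K(f_k, f) + b
```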

22 3.5 Measuring Semantic Similarity (5/5)
The probability:
Log likelihood:
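Both expressions on this slide were images. A standard reconstruction in the spirit of Platt scaling (the sigmoid parameters λ and μ are my notation, not the slide's) converts the distance d(f) into a posterior probability and fits the sigmoid by maximizing the log likelihood over the training pairs:

```latex
P(y = 1 \mid f) = \frac{1}{1 + \exp\bigl(-\lambda\, d(f) - \mu\bigr)}
\mathcal{L} = \sum_{i} \Bigl[ y_i \log P(y_i = 1 \mid f_i) + (1 - y_i) \log\bigl(1 - P(y_i = 1 \mid f_i)\bigr) \Bigr]
```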

23 3.6 Training (1/5)
Number of patterns extracted for training data.
Synonymous pairs: (A, B), (C, D).
Nonsynonymous pairs formed by crossing them: (A, D), (C, B).

24 3.6 Training (2/5)
Parameter settings for the lexical pattern extraction conditions: L = 5, g = 2, G = 4, & T = 5.
Distribution of patterns extracted from synonymous word pairs.

25 3.6 Training (3/5)
Average similarity versus clustering threshold θ.

26 3.6 Training (4/5)
The centroid vector of all feature vectors:
The average Mahalanobis distance:
|W|: the number of word pairs in W.
C^{-1}: the inverse of the inter-cluster correlation matrix.
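The two formulas were images; with x_{(P,Q)} denoting the feature vector of a word pair and c the centroid, the quantities described on this slide reduce to the usual centroid and average Mahalanobis distance (the square-root form is my assumption; the squared form is also common):

```latex
c = \frac{1}{|W|} \sum_{(P,Q) \in W} x_{(P,Q)} \qquad
\bar{D} = \frac{1}{|W|} \sum_{(P,Q) \in W} \sqrt{\bigl(x_{(P,Q)} - c\bigr)^{\top} C^{-1} \bigl(x_{(P,Q)} - c\bigr)}
```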

27 3.6 Training (5/5)
Distribution of patterns extracted from nonsynonymous word pairs.

28 4. Experiments
1. Benchmark Data Sets
2. Semantic Similarity
3. Community Mining

29 5. Conclusion (1/3)
1. A semantic similarity measure using both page counts and snippets retrieved from a web search engine for two words.
2. Four word co-occurrence measures were computed using page counts.
3. A lexical pattern extraction algorithm to extract the numerous semantic relations that exist between two words.

30 5. Conclusion (2/3)
4. A sequential pattern clustering algorithm was proposed to identify different lexical patterns that describe the same semantic relation.
5. Both page count-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.
6. A two-class SVM was trained using those features extracted for synonymous and nonsynonymous word pairs selected from WordNet synsets.

31 5. Conclusion (3/3)
Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures, achieving a high correlation with human ratings.
The proposed method also improved the F-score in a community mining task, underlining its usefulness in real-world tasks that involve named entities not adequately covered by manually created resources.

32 The End~ Thanks for your attention!!

