Enhancing Text Clustering by Leveraging Wikipedia Semantics
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)


1 Enhancing Text Clustering by Leveraging Wikipedia Semantics
Presenter: Cheng-Hui Chen
Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen
SIGIR, 2008

2 Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

3 Motivation
• Most traditional text clustering methods ignore the semantic relationships between key terms.
• They lack an effective word sense disambiguation method.
• As a result, synonymy and polysemy are not easy to handle.

4 Objectives
• Enhance the clustering result by obtaining a more accurate distance measure.
• The generated thesaurus serves as a controlled vocabulary that bridges the variety of idiolects and terminologies present in the document corpus.

5 Methodology
• Wikipedia Thesaurus
• Traditional text clustering
• The framework

6 Methodology
• Wikipedia Thesaurus
─ Wikipedia Concept
─ Synonymy
─ Polysemy
─ Hypernymy (Hierarchical Relation)
─ Associative relations
    Content-based measure
    Out-linked category-based measure
    Combination of the two measures

7 Methodology
• Traditional text clustering
─ Traditional text similarity measure
    Compute cosine similarity between document vectors.
─ Traditional text representation enrichment strategies
    Generated new features replace, or are appended to, the original document, and a new vector representation is constructed.
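The traditional measure on this slide can be illustrated with a minimal sketch. It assumes plain term-frequency vectors built from whitespace tokens (the slide does not say which term weighting is used); the function names `term_vector` and `cosine_similarity` are illustrative, not from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Count lowercased whitespace tokens into a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


doc_a = term_vector("data mining finds patterns in data")
doc_b = term_vector("machine learning finds patterns")
print(cosine_similarity(doc_a, doc_b))
```

Only the shared terms "finds" and "patterns" contribute to the score here, which is the weakness the motivation slide points at: related documents that use different vocabulary look dissimilar.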

8 Methodology
• The framework
─ Mapping text documents into Wikipedia concept sets
    Synonymy, polysemy and hypernymy occur frequently in text documents, so terms must be allocated accurately to Wikipedia concepts.
─ Enriching the similarity measure with hierarchical relations
    Similarity measure using category vectors
    Considering the original document content, the similarity measure can be represented as:
    Use the category decay factor μ
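The formula itself is not preserved in this transcript, so the following is only a sketch of one way a category decay factor could work. It assumes that a category reached k levels above a concept's direct categories is weighted by μ^k, and that two category vectors are compared by cosine similarity. The toy hierarchy, `category_vector`, and `max_depth` are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data: concept -> direct categories, category -> parent categories.
CONCEPT_CATEGORIES = {
    "Data mining": ["Data analysis"],
    "Machine learning": ["Artificial intelligence"],
}
CATEGORY_PARENTS = {
    "Data analysis": ["Computer science"],
    "Artificial intelligence": ["Computer science"],
}


def category_vector(concepts, mu=0.5, max_depth=3):
    """Spread each concept's weight up the hierarchy, decayed by mu per level (assumed scheme)."""
    vec = defaultdict(float)
    for concept, weight in concepts.items():
        frontier = set(CONCEPT_CATEGORIES.get(concept, ()))
        for depth in range(max_depth):
            if not frontier:
                break
            for category in frontier:
                vec[category] += weight * mu ** depth
            frontier = {p for c in frontier for p in CATEGORY_PARENTS.get(c, ())}
    return dict(vec)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_a = {"Data mining": 1.0}
doc_b = {"Machine learning": 1.0}
print(cosine_similarity(category_vector(doc_a), category_vector(doc_b)))
```

In this toy setup the two documents share no concept and no direct category, yet they receive a non-zero similarity through the shared ancestor "Computer science"; a smaller μ makes distant ancestors count less.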

9 Methodology
• Enriching the similarity measure with synonym and associative relations
─ The expanded weighted concept set
─ Add the set C_ext to C_b and get the extended C_b as:
─ Define the similarity
    Example: C_a = {(CS, 1), (ML, 1)}, C_b = {(DM, 1), (DB, 1)}, similarity = 0.57
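The expansion and similarity formulas are missing from this transcript, so the sketch below is an assumption, not the authors' definition. It assumes that each concept of C_a absent from C_b is added to C_b with a weight equal to the sum of its relatedness to C_b's concepts (scaled by their weights), and that similarity is the cosine between C_a and the extended C_b. The relatedness values are invented for illustration and are not intended to reproduce the 0.57 on the slide.

```python
import math

# Hypothetical pairwise relatedness scores between Wikipedia concepts.
RELATEDNESS = {
    frozenset(("CS", "DM")): 0.2,
    frozenset(("CS", "DB")): 0.3,
    frozenset(("ML", "DM")): 0.5,
    frozenset(("ML", "DB")): 0.1,
}


def relatedness(x, y):
    """Look up a symmetric relatedness score; unknown pairs count as unrelated."""
    return 1.0 if x == y else RELATEDNESS.get(frozenset((x, y)), 0.0)


def extend(target, other):
    """Return target plus the concepts of `other` it lacks, weighted by relatedness (assumed rule)."""
    extended = dict(target)
    for concept in other:
        if concept not in target:
            weight = sum(w * relatedness(concept, t) for t, w in target.items())
            if weight > 0:
                extended[concept] = weight
    return extended


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


c_a = {"CS": 1.0, "ML": 1.0}
c_b = {"DM": 1.0, "DB": 1.0}
extended_b = extend(c_b, c_a)
print(extended_b, cosine_similarity(c_a, extended_b))
```

Without the expansion the two concept sets are disjoint and their cosine similarity is zero; after it, the associative relations give them a positive score.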

10 Methodology
• The combination
    Set α and β to equal weights: α = β = 1/3
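The combination formula is not preserved in this transcript. One reading consistent with "equal weights" at α = β = 1/3 is a convex combination in which the content-based similarity receives the remaining weight 1 − α − β. The sketch below encodes that assumption; `combined_similarity` and its argument names are hypothetical.

```python
def combined_similarity(s_content, s_category, s_concept, alpha=1 / 3, beta=1 / 3):
    """Weighted sum of three similarity scores (assumed form, not confirmed by the slide)."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return (1 - alpha - beta) * s_content + alpha * s_category + beta * s_concept


# With alpha = beta = 1/3, each of the three scores contributes one third.
print(combined_similarity(0.3, 0.6, 0.9))
```

Under this reading, setting α = β = 0 falls back to the traditional content-only measure.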

11 Experiments
• Evaluation Criteria
─ M = {M_1, M_2, ..., M_n} represents the n manually labeled clusters; C = {C_1, C_2, ..., C_n} represents the n clusters generated using our algorithm.
─ Precision of C_i and M_j is defined:
─ The purity of the clustering result is defined:
─ The corresponding inverse purity is defined:
    Example (C and M): N_1 = 180, N_2 = 100, N_1 ∩ N_2 = 30; Inpurity = 30%, Purity = 33.33%
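The three definitions are missing from this transcript. The sketch below uses the standard definitions, which are assumed (not confirmed) to match the slide: precision(C_i, M_j) = |C_i ∩ M_j| / |C_i|, purity as the size-weighted average of each generated cluster's best precision, and inverse purity as the same computation with the roles of C and M swapped. The toy document IDs are invented.

```python
def precision(a, b):
    """Fraction of the items in set a that also appear in set b."""
    return len(a & b) / len(a)


def purity(clusters, labels):
    """Size-weighted average, over clusters, of the best precision against any labeled class."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * max(precision(c, m) for m in labels) for c in clusters)


def inverse_purity(clusters, labels):
    """Purity computed with the roles of generated clusters and labeled classes swapped."""
    return purity(labels, clusters)


# Toy example: six documents, two manual classes, two generated clusters.
manual = [{1, 2, 3, 4}, {5, 6}]
generated = [{1, 2, 3}, {4, 5, 6}]
print(purity(generated, manual), inverse_purity(generated, manual))
```

The sketch assumes every cluster and class is non-empty and that both partitions cover the same documents.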

12 Experiments
• BASE1: Traditional text document similarity measure.
• BASE2: Improved with Gabrilovich's feature generation technique on Wikipedia.
• BASE3: K-Means clustering with Hotho's text document representation enrichment with WordNet.

14 Conclusions
• Our method improves clustering performance compared with previous methods.
• Future work
─ Use the multilingual relations to explore applications in cross-language information retrieval and cross-language text categorization.

15 Comments
• Advantages
─ Improved text clustering performance.
• Applications
─ Clustering
─ Information Retrieval

