Enhancing Text Clustering by Leveraging Wikipedia Semantics

Presentation transcript:

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Enhancing Text Clustering by Leveraging Wikipedia Semantics
Presenter: Cheng-Hui Chen
Authors: Jian Hu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen
SIGIR, 2008

Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation
• Most traditional text clustering methods ignore important information about the semantic relationships between key terms.
• They lack an effective word sense disambiguation method.
• Synonymy and polysemy are therefore difficult problems to handle.

Objectives
• Enhance the clustering result by obtaining a more accurate distance measure.
• The generated thesaurus serves as a controlled vocabulary that bridges the variety of idiolects and terminologies present in the document corpus.

Methodology
• Wikipedia thesaurus
• Traditional text clustering
• The framework

Methodology
• Wikipedia thesaurus
─ Wikipedia concept
─ Synonymy
─ Polysemy
─ Hypernymy (hierarchical relation)
─ Associative relations: content-based measure, out-linked category-based measure, and a combination of the two measures

Methodology
• Traditional text clustering
─ Traditional text similarity measure: compute the cosine similarity between document vectors.
─ Traditional text representation enrichment strategies: newly generated features replace, or are appended to, the original document, and a new vector representation is constructed.
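The traditional similarity measure named on this slide can be illustrated with a minimal sketch. It assumes raw term-frequency vectors built by whitespace tokenization; the slide does not specify the weighting scheme, so the function name `cosine_similarity` and the tokenization are illustrative choices, not the presenters' implementation.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: lowercase whitespace tokenization and unweighted term counts.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))

    # An empty document has no direction; treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two documents with no terms in common score 0 under this measure even when they are about the same topic, which is the weakness the Wikipedia-based enrichment is meant to address.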

Methodology
• The framework
─ Mapping text documents into Wikipedia concept sets: because synonymy, polysemy and hypernymy occur frequently in text documents, terms must be located accurately in Wikipedia.
─ Enriching the similarity measure with hierarchical relations: the similarity is measured using category vectors, which use the category decay factor μ.
Considering the original document content, the similarity measure can be represented as:

Methodology
• Enriching the similarity measure with synonym and associative relations
─ The expanded weighted concept set
─ Add the set C_ext to C_b to get the extended C_b as:
─ Define the similarity
Example: C_a = {(CS, 1), (ML, 1)}, C_b = {(DM, 1), (DB, 1)}, giving 0.57

Methodology
• The combination
─ α and β are set to equal weights: α = β = 1/3
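The combination formula itself is not preserved in the transcript. As a rough sketch only, the code below assumes a convex combination in which the content-based similarity receives the remaining weight 1 − α − β; this form, the function name `combined_similarity`, and its parameter names are assumptions for illustration and may differ from the authors' definition.

```python
def combined_similarity(sim_content: float,
                        sim_category: float,
                        sim_concept: float,
                        alpha: float = 1 / 3,
                        beta: float = 1 / 3) -> float:
    """Blend three similarity scores into one.

    Assumed form (not taken from the slide):
        (1 - alpha - beta) * content + alpha * category + beta * concept
    With alpha = beta = 1/3 all three scores get equal weight.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    return ((1 - alpha - beta) * sim_content
            + alpha * sim_category
            + beta * sim_concept)
```

Under this assumption, setting α = β = 0 falls back to the traditional content-only measure.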

Experiments
• Evaluation criteria
─ M = M_1, M_2, ..., M_n represent the n manually labeled clusters, and C = C_1, C_2, ..., C_n represent the n clusters generated by the algorithm.
─ Precision of C_i and M_j is defined:
─ The purity of the clustering result is defined:
─ The corresponding inverse purity is defined:
Example (clusters C and M): N_1 = 180, N_2 = 100, N_1 ∩ N_2 = 30; inverse purity = 30%, purity = 33.33%
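The formulas for these criteria are missing from the transcript. The sketch below assumes the commonly used definitions: the precision of a cluster C_i against a labeled class M_j is |C_i ∩ M_j| / |C_i|, purity is the size-weighted average of each cluster's best precision, and inverse purity swaps the roles of C and M. The function names and the label-list input format are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def purity(clusters: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Purity of `clusters` against `reference`.

    Both arguments give one label per document, in the same document order.
    Assumed definition: (1 / N) * sum over clusters of the largest overlap
    between that cluster and any reference class.
    """
    if len(clusters) != len(reference):
        raise ValueError("both labelings must cover the same documents")
    total = len(clusters)
    if total == 0:
        return 0.0

    # overlap[(c, m)] = number of documents in cluster c and reference class m.
    overlap = Counter(zip(clusters, reference))

    # For each cluster, keep its largest overlap with any reference class.
    best_overlap = {}
    for (cluster_id, _class_id), count in overlap.items():
        best_overlap[cluster_id] = max(best_overlap.get(cluster_id, 0), count)

    return sum(best_overlap.values()) / total


def inverse_purity(clusters: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Inverse purity: purity with the roles of the two labelings swapped."""
    return purity(reference, clusters)
```

Putting every document into a single cluster maximizes inverse purity while lowering purity, which is why the two measures are reported together.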

Experiments
• BASE1: Traditional text document similarity measure.
• BASE2: Improved with Gabrilovich’s feature generation technique on Wikipedia.
• BASE3: K-Means clustering with Hotho’s text document representation enrichment with WordNet.

Conclusions
• The proposed method improves clustering performance compared with previous methods.
• Future work
─ Use multilingual relations to explore applications in cross-language information retrieval and cross-language text categorization.

Comments
• Advantages
─ Improved text clustering performance.
• Applications
─ Clustering
─ Information retrieval