Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Wei Xu,


1 Document Clustering Based On Non-negative Matrix Factorization
Authors: Wei Xu, Xin Liu, Yihong Gong. ACM SIGIR, 2003
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Graduate: Yu Cheng Chen

2 Outline
• Motivation
• Objective
• Introduction
• The Proposed Method
• Performance Evaluations
• Conclusions
• Personal Opinion
• Review

3 Motivation
• Traditional clustering methods make strong simplifying assumptions about the distribution of the document corpus to be clustered.
• Previous research has performed document clustering using latent semantic indexing (LSI) or spectral clustering based on graph partitioning theory.
• These approaches all have drawbacks.

4 Objective
• Propose a novel document clustering method, based on the non-negative factorization of the term-document matrix, that addresses the drawbacks above.

5 Introduction

6 The Proposed Method
• Document Representation
• Let $W = \{f_1, f_2, \ldots, f_m\}$ be the complete vocabulary set of the document corpus.
• The term-frequency vector $X_i$ of document $d_i$ is defined as
  $X_i = [x_{1i}, x_{2i}, \ldots, x_{mi}]^T$, with $x_{ji} = t_{ji} \cdot \log(n / idf_j)$,
  where $t_{ji}$ is the term frequency of word $f_j$ in document $d_i$, $idf_j$ is the number of documents containing word $f_j$, and $n$ is the total number of documents in the corpus.

7 The Proposed Method
• Using $X_i$ as the $i$-th column, we construct the $m \times n$ term-document matrix $X$.
• This matrix is used to perform the non-negative factorization.
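The construction of the term-document matrix can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `term_document_matrix`, the toy corpus, the weighting `tf * log(n / df)` and the scaling of each column to unit Euclidean length are all assumptions made for the example.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build the m x n weighted term-document matrix X as a list of rows.

    docs is a list of token lists. Column i holds the weighted
    term-frequency vector of document i, scaled to unit length
    (the weighting and the scaling are assumptions of this sketch).
    """
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # df[w] = number of documents that contain word w
    df = Counter(w for doc in docs for w in set(doc))
    columns = []
    for doc in docs:
        tf = Counter(doc)
        col = [tf[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in col))
        columns.append([x / norm if norm > 0 else 0.0 for x in col])
    # Turn the list of n columns into m rows of length n.
    X = [[columns[i][j] for i in range(n)] for j in range(len(vocab))]
    return vocab, X

# Hypothetical toy corpus of three tokenized documents.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["cherry", "cherry", "date"]]
vocab, X = term_document_matrix(docs)
```

Because a word can appear in at most n documents, the logarithm is never negative, so every entry of X is non-negative, which is what the factorization requires.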

8 The Proposed Method
• Document Clustering Based on NMF
• NMF is a matrix factorization algorithm that finds a non-negative factorization of a given non-negative matrix.
• Here the goal of NMF is to factorize $X$ into the non-negative $m \times k$ matrix $U$ and the non-negative $k \times n$ matrix $V^T$ that minimize the following objective function:
  $J = \frac{1}{2} \lVert X - UV^T \rVert$

9 The Proposed Method
• Here $\lVert \cdot \rVert$ denotes the squared sum of all the elements in the matrix.
• This is a typical constrained optimization problem and can be solved using the Lagrange multiplier method. Let $U = [u_{ij}]$, $V = [v_{ij}]$.
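The derivation on this slide did not survive in the transcript. As a rough sketch, the code below uses the standard multiplicative update rules for this kind of problem, assuming J is half the sum of squared entries of X - UV^T. The helper names (`matmul`, `transpose`, `objective`, `nmf`), the random initialization, the iteration count and the small `eps` guard against division by zero are assumptions of the example, not details taken from the slides.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(r) for r in zip(*A)]

def objective(X, U, V):
    """J = 1/2 * sum of squared entries of X - U V^T."""
    R = matmul(U, transpose(V))
    return 0.5 * sum((x - r) ** 2
                     for x_row, r_row in zip(X, R)
                     for x, r in zip(x_row, r_row))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factorize non-negative X (m x n) into U (m x k) and V (n x k)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # Strictly positive random start so no entry is stuck at zero.
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    Xt = transpose(X)
    for _ in range(iters):
        # u_ij <- u_ij * (X V)_ij / (U V^T V)_ij
        XV = matmul(X, V)
        UVtV = matmul(U, matmul(transpose(V), V))
        U = [[U[i][j] * XV[i][j] / (UVtV[i][j] + eps) for j in range(k)]
             for i in range(m)]
        # v_ij <- v_ij * (X^T U)_ij / (V U^T U)_ij
        XtU = matmul(Xt, U)
        VUtU = matmul(V, matmul(transpose(U), U))
        V = [[V[i][j] * XtU[i][j] / (VUtU[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return U, V
```

Each update multiplies an entry by a ratio of non-negative quantities, so U and V stay non-negative without any explicit projection step. The result depends on the random start, so different seeds can give different factorizations.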

10 The Proposed Method
• Note that the solution minimizing the criterion function $J$ is not unique.
• If $U$ and $V$ are a solution to $J$, then $UD$ and $VD^{-1}$ also form a solution for any positive diagonal matrix $D$.
• To make the solution unique, we further require that the Euclidean length of each column vector of matrix $U$ is one.
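The unit-length requirement on the columns of U can be met by dividing each column of U by its Euclidean norm and multiplying the matching column of V by the same norm, which leaves the product UV^T unchanged. The function name `normalize` and the small matrices below are hypothetical, chosen only to make the arithmetic easy to check by hand.

```python
import math

def normalize(U, V):
    """Rescale so every column of U has unit Euclidean length.

    Column j of V is multiplied by the norm that column j of U is
    divided by, so U V^T is unchanged. Assumes no column of U is all zero.
    """
    k = len(U[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in U)) for j in range(k)]
    U_n = [[row[j] / norms[j] for j in range(k)] for row in U]
    V_n = [[row[j] * norms[j] for j in range(k)] for row in V]
    return U_n, V_n

# Column norms of U are 5 and 2.
U = [[3.0, 0.0], [4.0, 2.0]]
V = [[1.0, 2.0], [0.0, 1.0], [2.0, 2.0]]
U_n, V_n = normalize(U, V)
```

In the notation of the slide, this corresponds to choosing D as the diagonal matrix whose entries are the reciprocals of the column norms of U.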

11 The Proposed Method
• Each element $u_{ij}$ of matrix $U$ represents the degree to which term $f_i$ belongs to cluster $j$.
• Each element $v_{ij}$ of matrix $V$ indicates the degree to which document $i$ is associated with cluster $j$.
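Given this reading of V, one natural way to obtain cluster labels is to assign each document to the cluster with the largest entry in its row of V. The slides do not spell out the assignment rule in the surviving text, so the rule, the function name `assign_clusters` and the example values are assumptions of this sketch.

```python
def assign_clusters(V):
    """Return, for each document (row of V), the index of its largest entry."""
    return [max(range(len(row)), key=row.__getitem__) for row in V]

# Hypothetical V for three documents and two clusters.
V = [[0.9, 0.1],
     [0.2, 0.7],
     [0.6, 0.3]]
labels = assign_clusters(V)
```

Here documents 0 and 2 fall into cluster 0 and document 1 into cluster 1. Ties go to the lower cluster index, because `max` returns the first maximal element.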

12 The Proposed Method
• The algorithm is composed of the following steps:

13 The Proposed Method
• NMF vs. SVD

14 Performance Evaluation
• Data Corpora

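15 Performance Evaluation
• Evaluation Metrics
• Accuracy (AC) and mutual information (MI)
• $AC = \frac{1}{n} \sum_{i=1}^{n} \delta(\alpha_i, \mathrm{map}(l_i))$
  where $n$ denotes the total number of documents, $l_i$ is the cluster label assigned to document $d_i$, $\alpha_i$ is the label provided by the document corpus, $\delta(x, y)$ equals 1 if $x = y$ and 0 otherwise, and $\mathrm{map}(l_i)$ maps each cluster label to the equivalent label from the corpus.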
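A small sketch of the accuracy metric follows. Cluster labels are arbitrary, so they must first be matched to the corpus labels; this example finds the best one-to-one matching by brute force over permutations, which is an assumption made for brevity and is only practical for a handful of clusters. The function name `clustering_accuracy` and the sample labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of documents whose mapped cluster label equals the true label.

    The mapping from cluster labels to true labels is the one-to-one
    assignment that maximizes the number of matches, found by trying
    every assignment.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes is not handled in this sketch")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for a, l in zip(true_labels, cluster_labels)
                   if mapping[l] == a)
        best = max(best, hits)
    return best / len(true_labels)

# Hypothetical labels for five documents.
true_labels = ["a", "a", "b", "b", "c"]
cluster_labels = [1, 1, 0, 0, 0]
ac = clustering_accuracy(true_labels, cluster_labels)
```

With these labels the best mapping sends cluster 1 to "a" and cluster 0 to "b", so four of the five documents match. For a larger number of clusters an assignment algorithm would be a better choice than enumerating permutations.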

16 Performance Evaluation
• Mutual information
• $MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \cdot \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}$
• $MI(C, C')$ takes values between 0 and $\max(H(C), H(C'))$, where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$.
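The mutual information between two labelings can be estimated from label counts, as in the sketch below. Dividing by the larger of the two entropies to get a score between 0 and 1 is an assumption suggested by the upper bound quoted on the slide. The function names and the sample labelings are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information in bits between two labelings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def normalized_mi(a, b):
    """MI divided by max(H(a), H(b)); undefined if both entropies are zero."""
    return mutual_information(a, b) / max(entropy(a), entropy(b))

# Same partition with the labels swapped, then an unrelated partition.
same = normalized_mi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike accuracy, this measure needs no mapping between cluster labels and corpus labels: relabeling the clusters leaves the value unchanged.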

17 Performance Evaluation
• Performance Evaluations and Comparisons

18 Conclusions
• The important benefits of our algorithm are:
1. Each axis in the space derived by NMF has a much more straightforward correspondence with a document cluster than in the space derived by SVD.
2. Document clustering results can be derived directly, without additional clustering operations.
3. Document clustering accuracy is higher than that of the other document clustering methods compared.

19 Personal Opinion ……

20 Review

