Document Clustering Based on Non-negative Matrix Factorization


1 Document Clustering Based on Non-negative Matrix Factorization
Wei Xu et al., SIGIR '03. Presented by Weilin Xu, 11/06/2014

2

3 Motivation
Document clustering applications serve a variety of information needs, from narrow to broad:
News clustering
Extractive summarization
Why are traditional clustering methods not applicable?
Agglomerative clustering (bottom-up hierarchical): computationally expensive, O(n² log n)
Document partitioning (flat clustering), e.g. K-means: relies on harsh simplifying assumptions about the data distribution (compact cluster shapes in K-means, independent dimensions in Naïve Bayes, Gaussianity in Gaussian mixture models)

4 Motivation
Existing document clustering methods:
Latent semantic indexing (LSI): based on singular value decomposition; coefficients may be negative
Graph-based spectral clustering (presented by Jinghe)
Both lead to computing singular vectors or eigenvectors of certain matrices, and the two are equivalent under certain conditions.
Both need an additional clustering step:
LSI projects each document into the singular-vector space, then applies a traditional clustering method.
Spectral methods compute singular vectors or eigenvectors of certain graph affinity matrices.

5 Proposed method
Non-negative matrix factorization (NMF)-based clustering
X ≈ UVᵀ (a factorization, and an approximation)
U: term-concept matrix; V: document-concept matrix
Coefficients are non-negative
The latent semantic space is not required to be orthogonal

6 Roadmap Representations Method NMF vs. SVD Experiment Summary

7 Representations
X (m×n) = [x_11 ⋯ x_1n; ⋮ ⋱ ⋮; x_m1 ⋯ x_mn]: the term-document matrix
Each column is a document vector, e.g. the TF-IDF vector of that document
X (m×n) ≈ U (m×k) V (n×k)ᵀ
U (m×k): term-topic matrix; V (n×k): document-topic matrix
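As a minimal illustration of how the term-document matrix X can be built (the function and variable names here are my own, not from the paper), a plain-Python TF-IDF construction:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build an m-by-n term-document matrix of TF-IDF weights:
    rows are terms, columns are documents, like X in the slides."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(w for d in docs for w in set(d.split()))
    X = [[0.0] * n for _ in range(len(vocab))]
    for j, d in enumerate(docs):
        tf = Counter(d.split())  # raw term counts in document j
        for w, c in tf.items():
            # Classic tf * idf weighting; terms in every document get 0.
            X[index[w]][j] = c * math.log(n / df[w])
    return X, vocab
```

This uses the simplest tf × log(n/df) variant; real systems often smooth or normalize the weights.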

8 Method
X ≈ UVᵀ
Minimize the objective function J = ‖X − UVᵀ‖² (squared Frobenius norm)
Updating equations (derivation omitted):
u_ik ← u_ik · (XV)_ik / (U VᵀV)_ik
v_jk ← v_jk · (XᵀU)_jk / (V UᵀU)_jk
Normalize the columns of U (rescaling V accordingly) to obtain a unique solution for U and V
Convergence of the updates is guaranteed
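The multiplicative updates and the final cluster assignment (each document goes to the topic with the largest entry in its row of V) can be sketched in Python with NumPy; the function name, iteration count, and epsilon are illustrative choices, not from the paper:

```python
import numpy as np

def nmf_cluster(X, k, n_iter=200, eps=1e-9, seed=0):
    """Sketch of NMF-based clustering via multiplicative updates.

    X: (m, n) non-negative term-document matrix.
    Returns U (m, k), V (n, k) with X ~= U @ V.T, plus one cluster
    label per document (argmax over the rows of V).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((n, k))
    for _ in range(n_iter):
        # Multiplicative updates minimizing ||X - U V^T||_F^2;
        # eps guards against division by zero.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    # Normalize columns of U to unit Euclidean length and rescale V
    # so U @ V.T is unchanged and document weights stay comparable.
    norms = np.linalg.norm(U, axis=0) + eps
    U /= norms
    V *= norms
    labels = V.argmax(axis=1)
    return U, V, labels
```

Because the updates are multiplicative, entries of U and V stay non-negative as long as the initialization is non-negative.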

9 NMF vs. SVD

10 Experiment - Corpora (table of corpora with #doc and #cluster per corpus omitted)

11 Experiment - Compared methods
Spectral clustering with Average Association (AA): equivalent to LSI + K-means when the similarity is the inner product ⟨Xi, Xj⟩
Spectral clustering with Normalized Cut (NC)
Standard form of NMF (NMF)
NC-weighted form of NMF (NMF-NCW)

12 Experiment - Evaluations
Accuracy: map cluster labels to ground-truth labels (the best map is found with the Kuhn-Munkres algorithm), then count correctly labeled documents
Normalized mutual information between cluster labels and ground-truth labels
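For a small number of clusters, the best label map can be found by brute force over permutations; the paper uses the Kuhn-Munkres algorithm, which computes the same optimum more efficiently for large k. A sketch (function and argument names are illustrative):

```python
from itertools import permutations

def clustering_accuracy(labels, truth, k):
    """Accuracy = (1/n) * max over label maps of correctly
    matched documents. Brute force over all k! permutations is
    equivalent to Kuhn-Munkres for small k."""
    n = len(labels)
    best = 0
    for perm in permutations(range(k)):
        # perm[l] is the ground-truth label assigned to cluster l.
        matched = sum(1 for l, t in zip(labels, truth) if perm[l] == t)
        best = max(best, matched)
    return best / n
```

Note that accuracy is invariant to renaming clusters: predicting [0, 0, 1, 1] against truth [1, 1, 0, 0] scores 1.0.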

13 Experiment - Results (figure omitted)
Overall ranking: NMF-NCW > NC > NMF > AA

14 Experiment - Results (figure omitted)
Overall ranking: NMF-NCW > NC > NMF > AA

15 Summary
An NMF-based document partitioning method
Differs from SVD-based LSI and spectral clustering:
The latent semantic space need not be orthogonal
Directions in the latent semantic space take only non-negative values
No additional clustering step is needed
Outperforms the best existing methods

16 Thanks

