Document Clustering Based on Non-negative Matrix Factorization


Document Clustering Based on Non-negative Matrix Factorization
Wei Xu et al., SIGIR '03
Presented by Weilin Xu, 11/06/2014

Motivation

Document clustering serves a wide variety of information needs, narrow and broad: news clustering, extractive summarization, and more.

Why are traditional clustering methods not well suited?
- Agglomerative clustering (bottom-up hierarchical): computationally expensive, O(n^2 log n)
- Document partitioning (flat clustering), e.g. K-means: rests on harsh simplifying assumptions about the data distribution, such as compact cluster shapes in K-means, independent dimensions in Naïve Bayes, and Gaussian clusters in the Gaussian mixture model

Motivation

Existing document clustering methods:
- Latent semantic indexing (LSI): based on the singular value decomposition; coefficients may be negative. Each document is projected into the singular-vector space, and clustering is then performed there with a traditional method.
- Graph-based spectral clustering (presented earlier by Jinghe): leads to computing singular vectors or eigenvectors of certain graph affinity matrices.

The two are equivalent under certain conditions, and both require an additional clustering step after the decomposition.
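The LSI pipeline just described (project each document into the singular-vector space, then run a traditional clustering method on the projections) can be sketched with NumPy. This is an illustrative toy, not the paper's code: the tiny k-means and the 4-document matrix are invented for the example.

```python
import numpy as np

def lsi_project(X, k):
    """Project the columns (documents) of term-document matrix X
    into the k-dimensional singular-vector space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

def kmeans(points, k, iters=50):
    """Tiny k-means with deterministic (evenly spaced) initialization."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Two obvious document groups over disjoint vocabulary (columns = documents)
X = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 3., 2.],
              [0., 0., 2., 3.]])
labels = kmeans(lsi_project(X, 2), 2)  # documents 0,1 vs. 2,3 separate
```

Note that the clustering step is separate from the decomposition, which is exactly the extra work the NMF approach below avoids.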

Proposed method

Non-negative matrix factorization (NMF)-based clustering: X ≈ UV^T, where U is the term-concept matrix and V is the document-concept matrix.
- All coefficients are non-negative
- The latent semantic space is not necessarily orthogonal
- "Factorization" here is really an approximation

Roadmap
- Representations
- Method
- NMF vs. SVD
- Experiment
- Summary

Representations

X_{m×n} = [x_ij], the term-document matrix; column i is, for example, the TF-IDF vector of document i.

X_{m×n} ≈ U_{m×k} (V_{n×k})^T, where U_{m×k} is the term-topic matrix and V_{n×k} is the document-topic matrix.
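To make the x_ij entries concrete, here is a minimal TF-IDF term-document matrix over a toy three-document corpus (raw-count TF with a plain log IDF; the corpus and the weighting variant are illustrative, and real systems usually use smoothed or normalized forms).

```python
import math
from collections import Counter

docs = [["nmf", "matrix", "factorization"],
        ["matrix", "svd"],
        ["nmf", "clustering"]]
vocab = sorted({t for d in docs for t in d})
n = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term

# x_ij = tf(term i, doc j) * idf(term i); rows = terms, columns = documents
X = [[d.count(t) * math.log(n / df[t]) for d in docs] for t in vocab]
```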

Method

X ≈ UV^T. Minimize the Frobenius-norm objective

    J = (1/2) ||X - UV^T||_F^2

(derivation omitted). Multiplicative updating equations:

    u_ik <- u_ik (XV)_ik / (U V^T V)_ik
    v_jk <- v_jk (X^T U)_jk / (V U^T U)_jk

Normalize the columns of U (rescaling the columns of V to compensate) to obtain a unique solution for U and V. Convergence of these updates is guaranteed.
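The updating equations can be sketched directly in NumPy. This is a minimal sketch, assuming random non-negative initialization and a fixed iteration count (the paper iterates until convergence); the toy matrix is invented for the example.

```python
import numpy as np

def nmf(X, k, iters=500, eps=1e-9, seed=0):
    """Multiplicative updates for X ~= U V^T under the Frobenius objective.
    U is m x k (term-topic), V is n x k (document-topic)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    # Normalize columns of U; rescale V so the product U V^T is unchanged
    norms = np.linalg.norm(U, axis=0)
    U /= norms
    V *= norms
    return U, V

X = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 3., 2.],
              [0., 0., 2., 3.]])
U, V = nmf(X, 2)
labels = V.argmax(axis=1)  # cluster = largest entry in each document's row of V
```

Unlike the LSI pipeline, the cluster assignment falls directly out of V, with no additional clustering step.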

NMF vs. SVD
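The slide's figure did not survive transcription, but the key contrast is easy to demonstrate numerically: singular vectors from SVD are orthogonal and generally contain negative entries, so they cannot be read directly as topic weights, whereas NMF factors are non-negative by construction (the toy matrix is illustrative).

```python
import numpy as np

# Toy term-document matrix with two obvious topics
X = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [0., 0., 3., 2.],
              [0., 0., 2., 3.]])

# Full SVD: the trailing singular vectors of this matrix must mix signs
U_svd, s, Vt = np.linalg.svd(X)
svd_has_negatives = bool((U_svd < 0).any())  # True
```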

Experiment - Corpora

(Table of evaluation corpora, listing #doc and #cluster for each; the values did not survive transcription.)

Experiment – Compared methods
- Spectral clustering with Average Association (AA): equivalent to LSI + K-means when the inner product <x_i, x_j> is used as the similarity measure
- Spectral clustering with Normalized Cut (NC)
- Standard form of NMF (NMF)
- NC-weighted form of NMF (NMF-NCW)

Experiment - Evaluations
- Accuracy: AC = (1/n) Σ_i δ(map(r_i), l_i), where r_i is the cluster label of document i, l_i is its true label, and the best mapping map(·) is found with the Kuhn-Munkres algorithm
- Normalized mutual information between the cluster assignment and the true class labels
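The map(·) step is an assignment problem, and SciPy's linear_sum_assignment implements the Kuhn-Munkres (Hungarian) algorithm. Here is a sketch of the accuracy metric built on it (the toy label vectors are invented for the example):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """AC = (1/n) * sum_i delta(map(r_i), l_i), with map() chosen by the
    Kuhn-Munkres algorithm to maximize the number of matched documents."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    k = int(max(true_labels.max(), cluster_labels.max())) + 1
    matches = np.zeros((k, k), dtype=int)  # matches[c, t]: cluster c, true t
    for t, c in zip(true_labels, cluster_labels):
        matches[c, t] += 1
    rows, cols = linear_sum_assignment(-matches)  # negate to maximize
    return matches[rows, cols].sum() / len(true_labels)

acc = clustering_accuracy([0, 0, 1, 1, 1], [1, 1, 0, 0, 1])  # 4 of 5 match
```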

Experiment - Results

Overall ranking: NMF-NCW > NC > NMF > AA (the result plots did not survive transcription)


Summary
- An NMF-based document partitioning method
- Differs from SVD-based LSI and spectral clustering:
  - the latent semantic space need not be orthogonal
  - latent semantic directions contain only non-negative values
  - no additional clustering step is required
- Outperforms the best existing methods

Thanks