
Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning Jinghe Zhang 10/28/2014 CS 6501 Information Retrieval.


1 Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning Jinghe Zhang 10/28/2014 CS 6501 Information Retrieval

2 Outline
– Introduction
– Bipartite Graph Model
– Graph Partitioning
– Experimental Results
– Conclusions
– References

3 Introduction
Clustering: grouping together of similar objects
Document Clustering
– Purpose: to facilitate future navigation and search
– Features: words
– Document collection: represented as a word-by-document matrix A, which is very sparse, with almost 99% of its entries being zero
– Existing methods: agglomerative clustering; the k-means algorithm; projection-based models (latent semantic analysis), etc.
Graph-theoretic techniques
– Vertices: documents; edges: similarity between vertices
– Limitation: computationally prohibitive for large collections
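As a minimal sketch of the word-by-document matrix A described on this slide, the Python below builds A from a toy collection and measures the fraction of zero entries. The three documents, the whitespace tokenizer, and the names `build_word_by_document` and `sparsity` are illustrative assumptions, not part of the presentation.

```python
from collections import Counter


def build_word_by_document(docs):
    """Return (vocab, A), where A maps (word_index, doc_index) -> count.

    Only nonzero entries are stored, which suits a matrix that is mostly zeros.
    """
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = {}
    for j, d in enumerate(docs):
        for w, count in Counter(d.lower().split()).items():
            A[(index[w], j)] = count
    return vocab, A


def sparsity(A, n_words, n_docs):
    """Fraction of matrix entries that are zero."""
    return 1.0 - len(A) / (n_words * n_docs)


# Toy collection (assumed for illustration only).
docs = ["wing lift drag wing", "patient blood drug", "drag lift flow"]
vocab, A = build_word_by_document(docs)
zero_fraction = sparsity(A, len(vocab), len(docs))
```

On a real collection with thousands of words and documents the zero fraction approaches the 99% figure quoted on the slide, which is why only the nonzero entries are stored.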

4 Introduction (cont’d)
Word Clustering
– Words are clustered on the basis of the documents in which they co-occur
– Assumption: words that appear together should be associated with similar concepts
– Purpose: thesaurus construction
Co-clustering of documents and words
– Existing algorithms: cluster documents based on word distributions, or cluster words by co-occurrence in documents
– Proposed problem: dual clustering
  – Finding minimum cut vertex partitions in a bipartite graph between documents and words
  – Finding a globally optimal solution to a relaxation of the graph partitioning problem: a spectral algorithm

5 Graph Model

6 Bipartite Graph Model
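The figure and formulas of this slide did not survive in the transcript. As a hedged sketch of the model named in its title, assume (following the cited Dhillon 2001 paper) that words and documents form the two vertex sets of a bipartite graph, with an edge of weight A[i][j] joining word i to document j. The cut of a co-partition is then the total weight of edges whose word and document land in different clusters. The toy matrix and the function name `bipartite_cut` are illustrative assumptions.

```python
def bipartite_cut(A, word_labels, doc_labels):
    """Sum the weights of word-document edges that cross cluster boundaries."""
    return sum(
        A[i][j]
        for i in range(len(A))
        for j in range(len(A[0]))
        if word_labels[i] != doc_labels[j]
    )


# Toy matrix (assumed): rows are words, columns are documents.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 2],
]

# A co-partition that follows the block structure cuts a single edge.
good = bipartite_cut(A, [0, 0, 1, 1], [0, 0, 1, 1])
# A co-partition that ignores the block structure cuts far more weight.
bad = bipartite_cut(A, [0, 1, 0, 1], [0, 1, 0, 1])
```

Because every edge runs between a word and a document, a low-cut partition of the graph clusters both vertex sets at once, which is the duality the presentation exploits.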

7 Graph Partitioning
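The formulas of this slide were lost in extraction. The sketch below shows spectral bipartitioning in the spirit of the cited Dhillon (2001) paper, under these assumptions: the matrix is normalized as D1^(-1/2) A D2^(-1/2), where D1 and D2 hold the word and document degrees; the second left and right singular vectors are rescaled by the same degree factors; and the resulting one-dimensional values are split by a small hand-written 2-means loop. The function name `spectral_cocluster_2way`, the toy matrix, and the 2-means initialization are assumptions, and every row and column of A is assumed to have a positive sum.

```python
import numpy as np


def spectral_cocluster_2way(A, n_iter=20):
    """Split words (rows) and documents (columns) of A into two co-clusters."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)  # word degrees (assumed > 0)
    d2 = A.sum(axis=0)  # document degrees (assumed > 0)

    # Normalized matrix: An[i, j] = A[i, j] / sqrt(d1[i] * d2[j]).
    An = A / np.sqrt(np.outer(d1, d2))
    U, s, Vt = np.linalg.svd(An, full_matrices=False)

    # Second singular vectors, rescaled, stacked as one value per vertex.
    z = np.concatenate([U[:, 1] / np.sqrt(d1), Vt[1, :] / np.sqrt(d2)])

    # 1-D 2-means (Lloyd's iterations), initialized at the extremes of z.
    centers = np.array([z.min(), z.max()])
    for _ in range(n_iter):
        labels = (np.abs(z - centers[0]) > np.abs(z - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = z[labels == k].mean()

    m = A.shape[0]
    return labels[:m], labels[m:]


# Toy matrix (assumed): two blocks joined by one weak word-document edge.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 2],
]
word_labels, doc_labels = spectral_cocluster_2way(A)
```

The cluster numbering is arbitrary because singular vectors are defined only up to sign, so only the grouping of words with documents is meaningful.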

8 Experimental Results
Experiments
– Data sources
  Medline: 1,033 medical abstracts
  Cranfield: 1,400 aeronautical systems abstracts
  Cisi: 1,460 information retrieval abstracts
  Yahoo_K1 and Yahoo_K5: articles from 6 categories: business (142), entertainment (1,389), health (494), politics (114), sports (141), and technology (60)
– Preprocessing: removal of stop words, as well as rare and very frequent words

Name              # Docs  # Words  # Nonzeros (A)
Bipartitioning
MedCran            2,433    5,042         117,987
MedCran_All        2,433   17,162         224,325
MedCisi            2,493    5,447         109,119
MedCisi_All        2,493   19,194         213,453
Multipartitioning
Classic3           3,893    4,303         176,347
Classic3_30docs       30    1,073           1,585
Classic3_150docs     150    3,652           7,960
Yahoo_K5           2,340    1,458         237,969
Yahoo_K1           2,340   21,839         349,792

9 Experimental Results (cont’d)
In each table, rows are the clusters found and columns are the true classes.

Table 1: Bipartitioning results for MedCran and MedCisi
MedCran
Medline  Cranfield
  1,026          0
      7      1,400
MedCisi
Medline  Cisi
    970      0
     63  1,460

Table 2: Bipartitioning results for MedCran_All and MedCisi_All
MedCran_All
Medline  Cranfield
  1,014          0
     19      1,400
MedCisi_All
Medline  Cisi
    925      0
    108  1,460

Table 3: Multipartitioning results for Classic3
Medline   Cisi  Cranfield
    965      0          0
     65  1,458         10
      3      2      1,390

Table 4: Multipartitioning results for Classic3_30docs and Classic3_150docs
Classic3_30docs
Medline  Cisi  Cranfield
      9     0          0
      0    10          0
      1     0  [missing]
Classic3_150docs
Medline  Cisi  Cranfield
     49     0          0
      0    50          0
      1     0  [missing]
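One common way to summarize a confusion matrix such as Table 3 in a single number is purity: the fraction of documents that belong to the majority class of their cluster. The slides do not report this figure, so the sketch below is an illustration only. The function name `purity` is an assumption, and the counts are the Classic3 results read with rows as clusters and columns as Medline, Cisi, and Cranfield.

```python
def purity(confusion):
    """Fraction of items that belong to the majority class of their cluster."""
    total = sum(sum(row) for row in confusion)
    return sum(max(row) for row in confusion) / total


# Table 3 (Classic3): rows are clusters; columns are Medline, Cisi, Cranfield.
classic3 = [
    [965, 0, 0],
    [65, 1458, 10],
    [3, 2, 1390],
]
score = purity(classic3)  # (965 + 1458 + 1390) / 3893
```

Here 3,813 of the 3,893 documents sit in the majority class of their cluster, so the three clusters line up closely with the three source collections.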

10 Experimental Results (cont’d)

Table 5: Multipartitioning results for Yahoo_K1
Bus  Entertain  Health  Politics  Sports  Tech
120         82       0        52       0    57
  0        833       0         1     100     0
  0        259       0         0       0     0
 22        215     102        61       1     3
  0          0     392         0       0     0
  0          0       0         0      40     0

Table 6: Multipartitioning results for Yahoo_K5
Bus  Entertain  Health  Politics  Sports  Tech
120        113       0         1       0    59
  0      1,175       0         0     136     0
 19         95       4        73       5     1
  1          6     217         0       0     0
  0          0     273         0       0     0
  2          0       0        40       0     0

11 Conclusions
– Introduced the novel idea of modeling a document collection as a bipartite graph
– Proposed a spectral algorithm for co-clustering words and documents
– Provided an optimal solution to a real relaxation of the NP-complete co-clustering objective
– The proposed algorithm performs well on real examples

12 References
– I.S. Dhillon. “Co-clustering documents and words using bipartite spectral graph partitioning.” In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269-274. ACM, 2001.
– C.J. Crouch. “A cluster-based approach to thesaurus construction.” In ACM SIGIR, pp. 309-320, 1988.

