Document Clustering (文件分類)
Sung-Chien Lin (林頌堅)
Department of Library and Information Studies, Shih-Hsin University (世新大學圖書資訊學系)

Contents
Research on Document Clustering
Possible Applications of Document Clustering
Document Clustering in a Networked Environment
Conclusions

Research on Document Clustering

Document Clustering
Definition
  – Documents with similar properties are assigned to automatically created groups
Importance
  – To improve the efficiency and effectiveness of retrieval
      Time
      Space
      Quality
  – To determine the structure of the literature of a field
      Exploration of latent information in documents
      Reduction of users’ cognitive load

Block Diagram
[Figure: block diagram. Document Set, Feature Extraction, Features, Clustering, Clustered Documents, Applications. A "Determining Clustering Parameters" block feeds the Clustering step.]
Clustering parameters shown in the diagram
  – Cluster Structure
      Nonhierarchical
      Hierarchical
  – Halting Criteria
      Number of Desired Clusters
      Number of Iterations

Research on Document Clustering
Features to represent documents
  – Linguistic structure in documents
      Co-occurrences of terms
      Semantic structure
  – Meta-data of documents
      Authors
      Citation
        Co-citation: based on the documents that cite the examined documents
        Bibliographic coupling: based on the documents that are cited by the examined documents

Research on Document Clustering
Measures of relevance between documents
  – Highly dependent on the choice of features used to represent documents
  – Several relevance measures
      Vector space model (VSM, Salton)
      Latent semantic indexing (LSI, Schütze)
        – Based on the Singular Value Decomposition (SVD) algorithm
        – Reduction of the dimensions of feature vectors in VSM
        – Exploiting latent semantic features of documents
Measure of relevance between documents d_i and d_j
  – w_ik and w_jk: weights of the kth term in d_i and d_j
  – Frequency of the kth term in d_i
  – Inverse document frequency of the kth term
  – L: vocabulary size
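The relevance formula itself is not in the transcript. The sketch below is one common reading of the definitions on this slide, not necessarily the exact form used in the talk: it assumes w_ik is the term frequency multiplied by an inverse document frequency of log(N / df_k), and that relevance is the cosine of the two weight vectors over the L vocabulary terms. The function names and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with w_ik = tf_ik * idf_k."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed idf variant: log(N / df_k); a term found in every document gets 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine of two sparse weight vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "document clustering groups similar documents",
    "clustering similar documents improves retrieval",
    "zebra habitats in east africa",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing several terms
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing no terms
```

With these sample documents the first pair scores above zero and the second pair scores exactly zero, because the two documents have no term in common.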

Research on Document Clustering
Clustering algorithms
  – Agglomerative hierarchical clustering algorithm (AHC)
      Algorithm
        1. Put each document in the collection into its own cluster
        2. Identify the two closest clusters and combine them into a new cluster
        3. Repeat Step 2 until the halting criteria are met
      O(N^2)
  – K-Means algorithm
      O(NK)
  – Buckshot algorithm
      Fast, linear-time algorithm
      A K-Means algorithm where the initial cluster centroids are created by applying AHC to a sample of the documents in the collection
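The three AHC steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a precomputed symmetric similarity matrix, "closest" is taken to mean the highest average pairwise similarity (group-average linkage), and the halting criterion is the number of desired clusters. It recomputes cluster similarities on every pass, so it is slower than the O(N^2) quoted on the slide. The name `agglomerative` and the sample matrix are hypothetical.

```python
def agglomerative(sim, num_clusters):
    """Group-average AHC over a symmetric N x N similarity matrix."""
    num_clusters = max(1, num_clusters)
    # Step 1: every document starts in its own cluster.
    clusters = [[i] for i in range(len(sim))]

    def average_similarity(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    # Step 3: repeat until the halting criterion (cluster count) is met.
    while len(clusters) > num_clusters:
        # Step 2: find the two closest (most similar) clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_similarity(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters


# Illustrative similarities: documents 0 and 1 are alike, as are 2 and 3.
sim = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
]
result = sorted(sorted(c) for c in agglomerative(sim, 2))
```

Buckshot would apply a routine like this only to a sample of the collection and use the resulting clusters to seed K-Means.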

Possible Applications of Document Clustering

Query Routing
Documents are distributed across several information servers
  – Relevant documents are clustered and placed on one server or on nearby servers
  – A description is generated to represent all of the documents in a cluster
When retrieval takes place
  – Identifying relevant clusters based on the relevance between the query and the cluster descriptions
  – Forwarding the query to the servers for those clusters
  – Merging the results
An example
  – Query: document clustering
  – [Figure: clusters labeled Library Science, Computer Science, Zoology, Geology]
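The routing steps above can be sketched as a small function pair. Everything here is an assumption for illustration: cluster descriptions are short keyword strings, relevance is plain term-count cosine, each server returns (document id, score) pairs, and merging keeps the best score per document. The server names echo the example clusters, while the descriptions and scores are invented.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words term counts for a short text."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query, descriptions, top_n=2):
    """Return up to top_n servers whose cluster description matches the query."""
    q = bag(query)
    scored = [(cosine(q, bag(desc)), server) for server, desc in descriptions.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [server for score, server in ranked[:top_n] if score > 0]


def merge(results_per_server):
    """Merge per-server (doc_id, score) lists, keeping the best score per doc."""
    merged = {}
    for results in results_per_server:
        for doc_id, score in results:
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    return sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical cluster descriptions, one per server.
descriptions = {
    "library_science": "library catalog document classification clustering retrieval",
    "computer_science": "algorithm document clustering machine learning",
    "zoology": "animal species behavior habitat",
    "geology": "rock mineral earthquake sediment",
}
targets = route("document clustering", descriptions)
combined = merge([[("d1", 0.9), ("d2", 0.4)], [("d2", 0.7), ("d3", 0.5)]])
```

For the query "document clustering", only the two servers whose descriptions mention those terms are selected; the zoology and geology servers are never contacted.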

Cluster-based Browsing
The problem of expressing a vague information need as a formal query
Scatter/Gather (Cutting et al., SIGIR’92)
  – Clustering documents into topic-coherent groups
  – Presenting descriptive summaries of the clusters to users
  – Users browse the cluster hierarchy and select the clusters that look promising
  – Documents in the selected clusters are clustered again and new summaries are generated
  – Finally, documents are retrieved
[Figure: top-level clusters Library Science, Computer Science, Zoology, Geology; second-level clusters Information Retrieval, Library Automation]

Result Set Clustering
Users’ queries are often very short (about 1-3 words)
  – The result set includes both relevant and irrelevant documents
Clustering the documents in the result set according to their degree of relevance
  – Helping users figure out their real information needs
  – Making relevant documents easy to retrieve
An example
  – Query: Multimedia
  – [Figure: result clusters labeled Hypermedia, Virtual Reality, Video]

Result Set Expansion
Relevant documents may not match the input query well
Clustering related documents with sophisticated features and clustering algorithms in the data preparation phase
Retrieving a core set of documents that match the query
Expanding the results with documents that do not match the query but are clustered with documents in the core set
[Figure: Query, Core Set, Expanded Result Set]
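A minimal sketch of the core-set and expansion steps, assuming the clusters were computed offline and are available as a document-to-cluster mapping. Matching here is simple "all query terms present", which stands in for whatever retrieval function is actually used. The names `expand_results` and `cluster_of` and the sample data are hypothetical.

```python
def expand_results(query, docs, cluster_of):
    """Return (core, expansion) lists of document ids.

    docs maps doc_id -> text; cluster_of maps doc_id -> precomputed cluster id.
    """
    terms = set(query.lower().split())
    # Core set: documents containing every query term.
    core = [
        doc_id for doc_id, text in docs.items()
        if terms <= set(text.lower().split())
    ]
    core_clusters = {cluster_of[doc_id] for doc_id in core}
    # Expansion: non-matching documents that share a cluster with the core set.
    expansion = [
        doc_id for doc_id in docs
        if doc_id not in core and cluster_of[doc_id] in core_clusters
    ]
    return core, expansion


docs = {
    "d1": "document clustering for retrieval",
    "d2": "grouping similar texts automatically",
    "d3": "earthquake sediment analysis",
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}  # assumed output of offline clustering
core, expansion = expand_results("document clustering", docs, cluster_of)
```

Here "d2" shares no words with the query but is returned in the expansion because it was clustered with "d1".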

Query Refinement
Terms in queries do not match the information needs of users
Dynamically computing and suggesting recall- and precision-enhancing terms for a given query
Term suggestion
  – Grouping retrieved documents into topic-cohesive clusters
  – Terms in centroid documents: general concepts
  – Terms in marginal documents: specific concepts

Document Clustering in a Networked Environment

Web Pages vs. Plain Texts
The lexical distributions of these two kinds of documents are significantly different
  – Web pages include more proper nouns and terms but fewer verbs
Information in web pages may be in multimedia form
  – Currently difficult to represent and retrieve
Web pages contain rich link information
  – More than 90% of web pages include tags
  – Each web page contains 15 links on average
Term-based clustering techniques designed for plain texts are not directly applicable to web pages
Link structure provides useful information for determining relevance among web pages

HTML Tags in Web Pages
Tags provide helpful information for understanding the meaning expressed by the pages
  – Tags for web composition
      Bold, Italic, Underline, Font
  – Tags for document structures
      Title, Header, Headline, List Items
  – Tags for link structures across pages
      Anchor
  – Tagged terms are information that the authors consider important
Tagged terms could be given higher weights to enhance the effectiveness of retrieval
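One way to act on the last point is to weight each term by the most important tag that encloses it. The sketch below uses Python's standard `html.parser` module. The weight table is entirely hypothetical (the slides give no numbers), and terms outside any listed tag get a weight of 1.0.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; the source does not specify any values.
TAG_WEIGHTS = {
    "title": 5.0, "h1": 4.0, "h2": 3.0,
    "a": 2.0, "b": 2.0, "i": 1.5, "u": 1.5, "li": 1.5,
}


class WeightedTermParser(HTMLParser):
    """Accumulate term weights, boosting terms inside emphasized tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among all currently open tags.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for term in data.lower().split():
            self.weights[term] += weight


page = (
    "<html><head><title>Document Clustering</title></head>"
    "<body><p>clustering of <b>web</b> pages</p>"
    "<a href='x.html'>retrieval</a></body></html>"
)
parser = WeightedTermParser()
parser.feed(page)
weights = parser.weights
```

In this page "clustering" appears once in the title and once in body text, so it accumulates the title weight plus the default weight; such weights could replace raw term frequencies in a term-weighting scheme.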

An Example of a Web Page
[Figure: screenshot of a web page with callouts for Anchor Text, List Item, and Tag]

Connectivity Analysis
A link between two pages establishes a relation between the two pages
The similarity between two pages could be estimated using
  – The length of the shortest path between the two pages
  – The path length from the two pages to their least common ancestor
  – The path length from the two pages to their greatest common descendants
[Figure: link graph over pages A to J]
E is more similar to A than D
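The first estimate, shortest path length, can be sketched with a breadth-first search. Two assumptions are made here that the slide does not state: links are treated as undirected, and the path length d is turned into a similarity with 1 / (1 + d). The page names are invented and do not reproduce the graph in the figure.

```python
from collections import deque


def shortest_path_length(links, source, target):
    """Number of links on the shortest path, or None if the pages are unconnected."""
    # Assumption: a link relates both pages, so edges are treated as undirected.
    neighbors = {}
    for page, outs in links.items():
        for out in outs:
            neighbors.setdefault(page, set()).add(out)
            neighbors.setdefault(out, set()).add(page)
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in neighbors.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def link_similarity(links, a, b):
    """Map path length d to a similarity in (0, 1]; unconnected pages score 0."""
    dist = shortest_path_length(links, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


# Hypothetical site: page -> pages it links to.
links = {"home": ["news", "about"], "news": ["story"], "about": [], "orphan": []}
near = link_similarity(links, "home", "news")
far = link_similarity(links, "home", "story")
```

Pages one link apart score higher than pages two links apart, and a page with no links to the rest of the site scores zero against every other page.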

Information of Link Structure
Authority page: a page that contains a lot of information about the topic
  – Authority: if a page p has a link to page q, the authors of page p confer authority on q
  – Link popularity → page authority
Hub page: a page that has links to authority pages
Mutually reinforcing relationship
  – A good hub page points to many good authority pages
  – A good authority page is pointed to by many good hub pages
[Figure: Hubs pointing to Authorities]
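The mutually reinforcing relationship can be computed iteratively, in the spirit of the Kleinberg report listed in the references. The sketch below is a simplified version, assuming a fixed number of iterations and L2 normalization after each round; the function name `hits` and the tiny link graph are illustrative.

```python
import math


def normalize(scores):
    """Scale scores to unit L2 norm; leave them unchanged if all are zero."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else scores


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a page -> outgoing-links mapping."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        auth = normalize(auth)
        # A page's hub score is the sum of the authority scores it links to.
        hub = normalize({p: sum(auth[q] for q in links.get(p, ())) for p in pages})
    return hub, auth


# Hypothetical graph: two hub pages pointing at three candidate authorities.
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "a1": [], "a2": [], "a3": [],
}
hub, auth = hits(links)
```

In this graph "a1" and "a2" are pointed to by both hubs and end up with higher authority than "a3", while "h2" becomes the stronger hub because it points to all three authorities.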

Information of Anchor Text
The text around links pointing to a page is often a description of the page
  – The information in anchor text could be used to determine the relevance of the link
[Figure: distribution of “Yahoo” in the anchor texts of 5000 web pages pointing to Yahoo!]
From: http://decweb.ethz.ch/WWW7/1898/com1898.htm

Conclusions

Document clustering is an important technique for improving efficiency and effectiveness in information retrieval
  – It has a wide range of possible applications
Technologies of document clustering
  – Extraction of features to represent documents
  – Relevance functions between documents
  – Clustering algorithms
Retrieval of web information relies more and more on information about the web structure

Important References
P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, 24(5), 577-597.
E. Rasmussen, “Clustering Algorithms,” Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442.
D. R. Cutting, D. Karger and J. O. Pedersen, “A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR’92, 318-329.
J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” IBM Research Report RJ 10076, May 1997.
