Web Document Clustering


1 Web Document Clustering
Jae-Hoon Kim, Division of IT Engineering, College of Engineering, Korea Maritime University
KRISTAL 2006

2 Contents
- IR and Web Search
- Problems with Web Search Results
- What Is Document Clustering?
- Why Cluster Documents in Web Search Results?
- Applications of Document Clustering
- Clustering Approaches
- More Applications
- Discussion

3 Typical IR Task
Given:
- A corpus of textual natural-language documents.
- A user query in the form of a textual string.
Find:
- A ranked set of documents that are relevant to the query.

4 IR System
[Diagram: a document corpus and a query string are fed into the IR system, which returns a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

5 Relevance
Relevance is a subjective judgment and may include:
- Being on the proper subject.
- Being timely (recent information).
- Being authoritative (from a trusted source).
- Satisfying the goals of the user and his/her intended use of the information (information need).

6 Keyword Search
The simplest notion of relevance is that the query string appears verbatim in the document.
A slightly less strict notion is that the words of the query appear frequently in the document, in any order (bag of words).
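These two notions of relevance can be sketched in a few lines of Python (the function names and the example document are illustrative, not from the slides):

```python
import re

def verbatim_match(query, document):
    """Strictest notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: count how often the query's words appear anywhere
    in the document, in any order (bag of words)."""
    tokenize = lambda text: re.findall(r"\w+", text.lower())
    doc_words = tokenize(document)
    return sum(doc_words.count(w) for w in tokenize(query))

doc = "The search engine indexes each engine part page."
assert not verbatim_match("engine part search", doc)
assert bag_of_words_score("engine part search", doc) == 4  # engine x2, part, search
```

Note that the bag-of-words score ignores word order entirely, which is exactly why "engine" alone cannot distinguish search engines from car parts, the ambiguity motivating clustering later in these slides.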

7 Intelligent IR
Taking into account:
- the meaning of the words used.
- the order of words in the query.
- the authority of the source.
Adapting to the user based on direct or indirect feedback.

8 Web Search
An application of IR to HTML documents on the World Wide Web.
Differences from typical IR:
- Must assemble the document corpus by spidering the web.
- Can use the structural layout information in HTML (XML).
- Can use the link structure of the web.
- Documents change uncontrollably.

9 Web Search System
[Diagram: a web spider crawls the web to build the document corpus; the IR system takes a query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]

10 Web users' information needs
A classification based on the type of answer expected by the web user:
1. Question answering
   Type of answer: a very short answer (a single sentence or part of a document).
   Example query: "What is the population of Korea?"
2. Named page finding / homepage finding
   Type of answer: a single document containing specific information.
   Example query: "Author instructions for the KISS journal"
3. Topic relevance
   Type of answer: a range of documents relevant to the topic of the information need.
   Example query: "Korea's immigration policy"
4. Online services
   Type of answer: a range of documents that allow the user to access a particular service.
   Example query: "download mp3"
5. Topic distillation
   Type of answer: a range of documents that are relevant key resources on the topic (relevance + quality).
   Example query: "highway safety"

11 Web search results
Big problem:
- A long list of search results, ranked by relevance to the given query.
  E.g., "engine": search engine, car part, Engine Corp.
- A time-consuming task when multiple sub-topics of the given query are mixed together.
Possible solution:
- Cluster the web search results (online).
Evidence:
- Relationships between the results.
- Cluster hypothesis (van Rijsbergen, 1979): "Closely related documents tend to be relevant to the same requests."
Potential benefits:
- Aids user-engine interaction and browsing.
- Helps users express their information needs.

12 An example: Vivisimo

13 What is clustering?
Partition unlabeled examples into disjoint clusters such that:
- Examples within a cluster are very similar.
- Examples in different clusters are very different.
The act of grouping similar objects into sets:
- unsupervised classification
- no predefined classes
Typical applications:
- to gain insight into data
- as a preprocessing step

14 What is clustering? (cont.)

15 What is document clustering?
To group similar or related documents into a common cluster.
E.g., Scatter/Gather.

16 Why not just document clustering?
Web search results clustering is a version of document clustering, but:
- Billions of pages.
- Constantly changing.
- Data mainly unstructured and heterogeneous.
- Additional information to consider (e.g., links, click-through data).

17 Why cluster web documents?
1. For whole-corpus analysis/navigation: better user interface.
2. For improving recall in search applications: better search results.
3. For better navigation of search results: effective "user recall" will be higher.
4. For speeding up vector space retrieval: faster search.

18 1. Corpus analysis/navigation
A standard IR index is an alphabetical term list (Aardvark, 15; Blueberry, 200; Capricorn, 1; ...; Giraffes, 45-59), whereas document clusters act like a table of contents, which is sometimes far more useful.
[Example table of contents: 1. Science of Cognition (Motivations: Intellectual Curiosity, Practical Applications; History of Cognitive Psychology); 2. The Neural Basis of Cognition (The Nervous System, Organization of the Brain, The Visual System); 3. Perception and Attention (Sensory Memory, Attention and Sensory Information Processing).]

19 1. Corpus analysis/navigation (cont.)
Document clustering:
- Can induce a tree of topics.
- Allows the user to browse through the corpus to find information.
- Crucial need: meaningful labels for topic nodes.
Yahoo!: a manual hierarchy; often not available for a new document collection.

20 For visualizing a document collection and its themes
Wise et al., "Visualizing the non-visual", PNNL.
ThemeScapes, Cartia (mountain height = cluster size).

21 2. Improvement of search recall
Cluster hypothesis: documents with similar text are related.
Therefore, to improve search recall:
- Cluster the docs in the corpus a priori (pre-clustering).
- When a query matches a doc D, also return the other docs in the cluster containing D.
The hope: the query "car" will also return docs containing "automobile", because clustering grouped docs containing "car" together with those containing "automobile".

22 3. Better navigation of search results
For grouping search results thematically: clusty.com / Vivisimo.

23 3. Better navigation of search results: Kartoo.com

24 3. Better navigation of search results: iboogie.com

25 3. Better navigation of search results: mooter.com

26 3. Better navigation of search results
Grouping documents by the sense of a word can also be viewed as clustering.
Given the results of a search (say "jaguar", or "NLP"), partition them into groups of related docs.
This can be viewed as a form of word sense disambiguation. E.g., "jaguar" may have senses: the car company, the animal, the football team, the video game.

27 4. Speeding up vector space retrieval
In vector space retrieval, finding the doc vectors nearest to the query vector entails computing the similarity of the query to every doc, which is slow for some applications.
By clustering the docs in the corpus a priori, we can search only the clusters closest to the query: inexact, but it avoids the exhaustive similarity computation.

28 What Is A Good Clustering?
Internal criterion: a good clustering produces high-quality clusters in which:
- the intra-class (that is, intra-cluster) similarity is high
- the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used.
External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes. This is assessable with gold-standard data.

29 Main issues
- Online or offline clustering?
- What to use as input: entire documents, snippets, structure information (links), other data (e.g., click-through); whether to use stop-word lists, stemming, etc.
- How to define similarity: content (e.g., the vector-space model), link analysis, usage statistics.
- How to group similar documents?
- How to label the groups?

30 Components of a Clustering System
1. Feature selection/extraction: choosing the most effective subset of features from the patterns.
2. Feature representation: one or more transformations of the input features (attributes) to produce new salient features.
3. Pattern proximity (similarity measure): the distance between pairs of patterns (feature vectors).
4. Clustering (grouping): hard (a partition of the data into groups) or fuzzy (each pattern belongs to each cluster with a variable degree).
5. Data abstraction: extracting a compact representation of a data set (automatic analysis, human-oriented).
6. Evaluation: measuring the performance of competing algorithms.

31 1. Feature Selection/Extraction
What is the representation of a document? A document = a set of features.
What are the features?
- Linguistic structure in documents: co-occurrences of terms (n-grams); semantic structure (case structure).
- Metadata of documents: authors, citations.
  - Co-citation: documents that are cited together by the same documents.
  - Bibliographic coupling: documents that cite the same documents.

32 1. Feature Selection/Extraction (cont.)
Which terms to use as axes for the vector space?
- Better to use the highest-weight mid-frequency words: the most discriminating terms.
- Pseudo-linguistic heuristics, e.g.:
  - drop stop-words
  - stemming/lemmatization
  - use only nouns/noun phrases
A good clustering should "figure out" some of these.
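The heuristics above can be sketched as a toy preprocessing pipeline. This is a minimal illustration, not a real stemmer: the stop-word list is tiny and `crude_stem` only strips a few common suffixes (a real system would use a proper stemmer such as Porter's):

```python
import re

# Toy stop-word list; real systems use lists of hundreds of words.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def crude_stem(word):
    """Strip a few common English suffixes (very rough stand-in for stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop-words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

assert preprocess("Clustering the clustered documents") == ["cluster", "cluster", "document"]
```

After this step, "clustering" and "clustered" map to the same axis of the vector space, which is exactly the point of the heuristic.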

33 2. Feature representation
Documents are transformed into vectors of weighted terms.
Data matrix (two modes): how to compute x_ij? With tf*idf.
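The tf*idf weighting of x_ij can be sketched as follows (a minimal version using raw term frequency and idf = log(N / df); many variants exist, and the tokenization here is naive whitespace splitting):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute x_ij = tf(i, j) * idf(j) for document i and term j.
    tf is the raw count of term j in document i; idf(j) = log(N / df(j)),
    where df(j) is the number of documents containing term j."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return vectors

docs = ["car engine repair", "car automobile", "engine design"]
vecs = tfidf_vectors(docs)
# "car" occurs in 2 of 3 docs (idf = log(3/2)); "repair" is rarer (idf = log 3).
assert abs(vecs[0]["car"] - math.log(3 / 2)) < 1e-9
assert abs(vecs[0]["repair"] - math.log(3)) < 1e-9
```

A term appearing in every document gets idf = log(1) = 0, so it contributes nothing to similarity: the mid-frequency, discriminating terms from the previous slide end up with the highest weights.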

34 3. Pattern proximity (similarity measure)
Dissimilarity matrix (one mode): how to compute d_ij?
Dissimilarity = distance; similarity = the inverse of distance.

35 Proximity Measures
Dissimilarity:
- Euclidean distance
- Manhattan distance
- Weighted distance
Similarity:
- Inner product
- Cosine coefficient
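The measures listed above, written out for plain numeric vectors (a sketch; the weighted distance is omitted, since it is just Euclidean distance with a per-dimension weight inside the sum):

```python
import math

def euclidean(x, y):
    """Euclidean distance: sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def inner_product(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine coefficient: inner product normalized by the vector lengths."""
    norm = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / norm

assert euclidean((0, 0), (3, 4)) == 5.0
assert manhattan((0, 0), (3, 4)) == 7
assert abs(cosine((1, 0), (1, 1)) - 1 / math.sqrt(2)) < 1e-9
```

For document vectors the cosine coefficient is usually preferred over raw distances, because it is insensitive to document length.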

36 4. Clustering Approach and/or Algorithm
(1) Using contents of documents
(2) Using users' usage logs
(3) Using current search engines
(4) Using hyperlinks
(5) Other classical methods

37 (1) Using Contents of Documents
Based on the snippets returned by web search engines; the clusters can be as good as those created from the full text of the web documents.
Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm with three logical steps:
(1) document "cleaning"
(2) identifying base clusters using a suffix tree
(3) combining these base clusters into clusters

38 (2) Using users' usage logs
Relevance information is objectively reflected by the usage logs.
An experimental result:
- Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b
- Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images
- Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm, ...

39 (3) Using current web search engines: MetaCrawler

40 (4) Using hyperlinks
Cluster web documents based on both textual and hyperlink information; the hyperlink structure is used as the dominant factor in the similarity metric.
Kleinberg's HITS (Hypertext Induced Topic Selection) algorithm:
- based purely on hyperlink information
- finds authority and hub documents for a user query
- covers only the most popular topics and leaves out the less popular ones

41 Other classical approaches

42 Hierarchical Clustering
Clusters are created in levels, actually creating a set of clusters at each level.
Agglomerative (bottom-up):
- Initially, each item is in its own cluster.
- Clusters are iteratively merged together.
Divisive (top-down):
- Initially, all items are in one cluster.
- Large clusters are successively divided.

43 Hierarchical Clustering (cont.)
Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Diagram: agglomerative clustering merges points a, b, c, d, e step by step into one cluster; divisive clustering proceeds in the reverse direction.]

44 Agglomerative Example
[Figure: five points A-E with their distance matrix; varying the distance threshold from 1 to 5 shows which clusters are merged at each level.]

45 Distance Between Two Clusters
- Single-link method (nearest neighbor): minimum distance over all cross-cluster pairs.
- Complete-link method (furthest neighbor): maximum distance over all cross-cluster pairs.
- Average-link method: average distance over all cross-cluster pairs.
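The three linkage criteria can be written out directly (a sketch for 2-D points under Euclidean distance; `math.dist` requires Python 3.8+):

```python
import math

def single_link(c1, c2):
    """Nearest neighbor: minimum distance over all cross-cluster pairs."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Furthest neighbor: maximum distance over all cross-cluster pairs."""
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

c1, c2 = [(0, 0), (1, 0)], [(3, 0), (5, 0)]
assert single_link(c1, c2) == 2.0    # nearest pair: (1,0)-(3,0)
assert complete_link(c1, c2) == 5.0  # furthest pair: (0,0)-(5,0)
assert average_link(c1, c2) == 3.5   # (3 + 5 + 2 + 4) / 4
```

An agglomerative algorithm repeatedly merges the pair of clusters with the smallest linkage value; swapping in a different linkage function is all it takes to switch between the single-link and complete-link behaviors shown on the following slides.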

46 Single-Link Method
[Figure: four points a-d with Euclidean distance matrices; single-link merges (1) {a,b}, (2) {a,b,c}, (3) {a,b,c,d}.]

47 Complete-Link Method
[Figure: the same four points; complete-link merges (1) {a,b}, (2) {c,d}, (3) {a,b,c,d}.]

48 Example of the Chaining Effect
[Figure: on the same data, single-link produces 10 chained clusters while complete-link produces 2.]

49 Effect of Bias towards Spherical Clusters
[Figure: single-link (2 clusters) vs. complete-link (2 clusters) on the same data.]

50 Partitioning clustering
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find the partition into k clusters that optimizes the chosen partitioning criterion:
- Global optimum: exhaustively enumerate all partitions.
- Heuristic methods: the k-means and k-medoids algorithms.
  - k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.

51 K-means algorithm
1. Initial cluster centers are randomly selected.
2. Assign each object to a cluster using the distances between the centers and the object.
3. Re-compute the center of each cluster.
4. Return to step 2 until the stopping criterion is satisfied.
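The four steps above can be sketched for 2-D points as follows. For reproducibility this sketch seeds the centers with the first k points rather than a random selection (MacQueen's original uses random initialization, and the result can depend on it):

```python
import math

def kmeans(points, k, iters=100):
    """A minimal k-means sketch following steps 1-4 above."""
    centers = points[:k]  # step 1 (deterministic here; random in the original)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 2: assign each object to its nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # step 3: re-compute the center (mean) of each cluster
        new_centers = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:  # step 4: stop when the centers no longer move
            break
        centers = new_centers
    return clusters, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centers = kmeans(points, 2)
assert sorted(len(c) for c in clusters) == [3, 3]
```

On these two well-separated groups the algorithm converges in a few iterations, splitting the six points into the two obvious clusters of three.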

52 K-means algorithm: example

53 What is the problem with k-means?
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.
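The contrast between mean and medoid can be made concrete with a small sketch (illustrative only; this is just the medoid-selection step, not the full PAM algorithm, and it assumes Euclidean distance):

```python
import math

def medoid(cluster):
    """The medoid is the object in the cluster whose total distance to all
    other objects of the cluster is smallest (most centrally located)."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

# An outlier drags the *mean* far away from the data, but the *medoid* stays put.
cluster = [(1, 1), (2, 2), (3, 3), (100, 100)]
mean = tuple(sum(v) / len(cluster) for v in zip(*cluster))
assert mean == (26.5, 26.5)        # distorted by the outlier
assert medoid(cluster) == (2, 2)   # still among the typical points
```

Because the reference point must be an actual object, a single extreme value cannot pull it outside the data, which is exactly the robustness k-medoids buys over k-means.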

54 5. Data Abstraction: cluster labelling
How do we generate meaningful descriptions, or labels, for clusters? This is an open issue.
Possible solutions:
- List the 5-10 most frequent features of the cluster (features: index terms, noun phrases, proper names, ...).
- Show features that occur frequently in this cluster and not frequently in the others (differential labeling).
- Summarize multiple documents in the cluster.
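Differential labeling can be sketched with a simple count-difference score (an illustrative scoring choice; real systems typically use features like noun phrases and more principled scores such as mutual information):

```python
from collections import Counter

def differential_labels(cluster_docs, other_docs, n=5):
    """Pick terms that are frequent in this cluster but infrequent elsewhere,
    scoring each term by (count in cluster) - (count in the other clusters)."""
    in_counts = Counter(t for doc in cluster_docs for t in doc.lower().split())
    out_counts = Counter(t for doc in other_docs for t in doc.lower().split())
    scored = {t: c - out_counts[t] for t, c in in_counts.items()}
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:n]]

cluster = ["jaguar car dealer", "jaguar car review", "car engine"]
others = ["jaguar animal habitat", "football team news"]
labels = differential_labels(cluster, others, n=2)
assert labels[0] == "car"  # frequent here (3 times) and absent elsewhere
```

Note that "jaguar", though frequent in the cluster, is penalized because it also occurs in the other clusters: the differential score surfaces "car" as the discriminating label.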

55 6. Cluster validation
How can we tell whether a clustering is good or not?
- Ask users whether they agree with the clusters; agreement with human clustering is problematic (Macskassy et al., 1998).
- Use statistical techniques to measure qualities of the clustering, e.g., purity or divergence from a random clustering.
- For information retrieval: do we manage to cluster relevant documents together? This is the "holy grail" for cluster-based IR; it can be evaluated through cluster-based retrieval.

56 Approaches to evaluation
- Anecdotal: "I wrote this clustering algorithm and look what it found!" No benchmarks; no comparison possible.
- User inspection: experts evaluate the results and score them. Expensive and time-consuming.
- Ground-"truth" comparison: compare clustering results to a known taxonomy such as Yahoo. The static prior taxonomy may be incomplete or wrong in places.
- Microeconomic/utility: the net economic gain produced by an approach (vs. another approach). How do various clustering methods affect the quality of what is retrieved? Compare two IR algorithms: (1) send the query, present ranked results; (2) send the query, cluster the results, present the clusters.
- Purely quantitative measures: the probability of generating the clusters found; the average distance between cluster members.

57 Cluster Validity Assessment
d(Ci, Cj): inter-cluster distance between clusters Ci and Cj.
D(Ck): intra-cluster distance of cluster Ck.

58 Inter-cluster distance: d(Ci, Cj)
- Single linkage
- Complete linkage
- Average linkage
- Centroid linkage
- Average-to-centroids linkage

59 Intra-cluster distance: D(Ck)
- Complete diameter
- Average diameter
- Centroid diameter

60 Davies-Bouldin (DB) Index
- K: the number of clusters.
- Si: the scatter within cluster Ci (the average distance of its points to its center zi).
- Mij: the distance between the center zi of cluster Ci and the center zj of cluster Cj.
DB = (1/K) * sum over i of max over j != i of (Si + Sj) / Mij

61 Davies-Bouldin (DB) Index (cont.)
The objective: minimize the index over the number of clusters. The optimal number of clusters K is the one whose clustering minimizes the DB index. Small values of DB correspond to good clusterings.
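A sketch of the index for 2-D point clusters, assuming the standard Davies-Bouldin definition (within-cluster scatter Si, center distance Mij, DB = mean over i of max over j != i of (Si + Sj)/Mij):

```python
import math

def db_index(clusters):
    """Davies-Bouldin index of a list of clusters of 2-D points."""
    # cluster centers z_i and within-cluster scatters S_i
    centers = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
    scatter = [sum(math.dist(p, z) for p in c) / len(c)
               for c, z in zip(clusters, centers)]
    K = len(clusters)
    total = 0.0
    for i in range(K):  # worst-case similarity of cluster i to any other cluster
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(K) if j != i)
    return total / K

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 4)], [(3, 3), (3, 7)]]
assert db_index(tight) < db_index(loose)  # smaller DB = better clustering
```

Compact, well-separated clusters give small scatters and large center distances, hence a small DB value, matching the slide's "small values of DB correspond to good clusters".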

62 Dunn's Index
The objective: identify sets of clusters that are compact and well separated, i.e., maximize the inter-cluster distances and minimize the intra-cluster distances.
Large values of VD correspond to good clusterings.
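A sketch of Dunn's index VD, assuming the common formulation with single-link inter-cluster distance and complete diameter as the intra-cluster distance (other combinations of the distances from slides 58-59 are also used):

```python
import math

def dunn_index(clusters):
    """Dunn's index V_D: the minimum inter-cluster distance (single link)
    divided by the maximum intra-cluster distance (complete diameter)."""
    inter = min(math.dist(a, b)
                for i, c1 in enumerate(clusters)
                for c2 in clusters[i + 1:]
                for a in c1 for b in c2)
    intra = max(math.dist(a, b) for c in clusters for a in c for b in c)
    return inter / intra

compact = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
overlapping = [[(0, 0), (5, 5)], [(5, 6), (10, 10)]]
assert dunn_index(compact) > dunn_index(overlapping)  # larger V_D = better
```

Since the separation appears in the numerator and the diameter in the denominator, compact and well-separated clusterings maximize VD, matching the slide.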

63 Other Cluster Validity Measures
Calinski-Harabasz (CH) index, Index I, Xie-Beni, F-measure, silhouette, symmetry, graph-based boundary analysis, partition coefficient, separation index, classification entropy (CE), etc.

64 Other clustering applications
- Relevance feedback (Iwayama, 2000).
- Search engine query logs: cluster queries on the basis of the documents that users selected for those queries (Beeferman & Berger, 2000).
- Recommender systems: cluster users based on preferences, purchases, etc. (Sarwar et al., 2002).
- Topic detection and tracking (TDT) and novelty detection: cluster news stories together and try to find emerging new topics (Allan, 2001); cluster previously seen information and measure the novelty of incoming information by its distance to what has already been seen (TREC Novelty track, 2002, 2003).

65 Discussion
In general, clustering presents problems and challenges:
- Selection of attributes for clustering.
- Selection of the clustering method.
- Generation of cluster representations.
- Validity/quality of the generated clustering.
- Updating clustering structures (e.g., inserting or deleting a document from a hierarchy).
- Effectiveness gains are not always evident.
- Computational expense.

66 Looking back
Document clustering has been applied to IR for over 30 years; the research focus has shifted over the years: efficiency (70s), effectiveness (80s), other applications such as browsing and visualisation (90s), dynamic clustering (00s), ...?
But there are still many unresolved issues:
- Cluster representation.
- Cluster-based search strategies.
- Dynamic clustering (on a per-query basis).
- Algorithmic aspects.
- Document representation for clustering (e.g., reduce noise by using only the most important part of a document).
- New applications for clustering in IR.

67 As a conclusion
Clustering is a useful tool for IR. It is theoretically solid and intuitively plausible. Although it has been used for over 30 years, there are still open issues, and it remains an interesting area of research.

68 Some references
If you only read one article/reference:
- Tombros, A. PhD thesis, Chapter 3 (optionally 4 & 5).
- Willett, P. Recent trends in hierarchic document clustering: a critical review. Information Processing & Management, 24(5), 1988.
More than worth a look:
- van Rijsbergen, C.J. Information Retrieval. London: Butterworths, 2nd edition, 1979.
Also recommended:
- Hearst, M.A. and Pedersen, J.O. Re-examining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of ACM SIGIR '96, pages 76-84, 1996.

69 Some references (cont.)
- Zamir, O. and Etzioni, O. Web document clustering: a feasibility demonstration. In Proceedings of ACM SIGIR '98, pages 46-54, 1998.
- Iwayama, M. Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering. In Proceedings of ACM SIGIR '00, pages 10-16, 2000.
- Beeferman, D. and Berger, A. Agglomerative clustering of a search engine query log. In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, 2000.
- Sarwar, B.M., Karypis, G., Konstan, J., and Riedl, J. Recommender systems for large-scale e-commerce: scalable neighborhood formation using clustering. In Proceedings of the 5th International Conference on Computer and Information Technology, 2002.
- Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. Topic detection and tracking pilot study. In Topic Detection and Tracking Workshop Report, 2001.

70 Some references (cont.)
- Macskassy, S.A., Banerjee, A., Davison, B.D., and Hirsh, H. Human performance on clustering web pages: a preliminary study. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference (KDD-98), 1998.
- Goldszmidt, M. and Sahami, M. A probabilistic approach to full-text document clustering. Technical Report ITAD-433-MS, SRI International.
- Bradley, P., Fayyad, U., and Reina, C. Scaling EM (expectation-maximization) clustering to large databases. Microsoft Research Technical Report MSR-TR-98, 1998.
- Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of ACM SIGIR '92, 1992.

71 Acknowledgement
- C. Manning and P. Raghavan: Why cluster documents
- I. Corna: Cluster validation
- A. Tombros: Clustering for IR

72 External Evaluation of Cluster Quality
Assesses a clustering with respect to ground truth.
Assume there are C gold-standard classes, while our clustering algorithm produces k clusters π1, π2, ..., πk, where cluster πi has ni members.
A simple measure: purity, the ratio between the size of the dominant class in cluster πi and the size of cluster πi.
Others: the entropy of classes within clusters (or the mutual information between classes and clusters).

73 Purity
[Figure: seventeen points of three classes (x, o, d) distributed over three clusters with class counts (5, 1, 0), (1, 4, 1), and (2, 0, 3).]
Cluster I: purity = (1/6) * max(5, 1, 0) = 5/6
Cluster II: purity = (1/6) * max(1, 4, 1) = 4/6
Cluster III: purity = (1/5) * max(2, 0, 3) = 3/5
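The three purity computations above can be reproduced directly (the class labels x, o, d stand for the three classes in the figure):

```python
from fractions import Fraction

def purity(member_classes):
    """Purity of one cluster: (size of its dominant class) / (cluster size).
    `member_classes` lists the gold-standard class of each cluster member."""
    dominant = max(member_classes.count(c) for c in set(member_classes))
    return Fraction(dominant, len(member_classes))

# The three clusters from the slide, with class counts (5,1,0), (1,4,1), (2,0,3):
cluster1 = ["x"] * 5 + ["o"] * 1
cluster2 = ["x"] * 1 + ["o"] * 4 + ["d"] * 1
cluster3 = ["x"] * 2 + ["d"] * 3
assert purity(cluster1) == Fraction(5, 6)
assert purity(cluster2) == Fraction(4, 6)
assert purity(cluster3) == Fraction(3, 5)
```

Purity rewards clusters dominated by a single class, but note that it is trivially maximized by singleton clusters, which is one reason the entropy and mutual-information measures from the previous slide are also used.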

