Information Retrieval Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London
Hierarchical Clustering Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents. Divisive (top-down): start with all documents in a single cluster; eventually each document forms a cluster on its own. This is a recursive application of a (flat) partitional clustering algorithm, e.g., k-means with k=2 (bisecting k-means). Agglomerative (bottom-up): start with each document as its own cluster; eventually all documents belong to the same cluster.
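A minimal sketch of the divisive (bisecting k-means) approach, assuming scikit-learn is available; the helper name bisecting_kmeans and the toy data are illustrative, not part of the lecture:

```python
# Bisecting (divisive) clustering: repeatedly split the largest cluster with k-means (k=2).
# Illustrative sketch only; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    clusters = [np.arange(len(X))]            # start: all documents in one cluster
    while len(clusters) < num_clusters:
        # pick the largest current cluster and split it into two with flat k-means
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.rand(20, 5)                     # toy document vectors
print([len(c) for c in bisecting_kmeans(X, 4)])
```

Recursing until every document is a singleton yields the full top-down hierarchy; stopping earlier gives a flat clustering.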
Dendrogram Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster. The number of clusters k is not required in advance.
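One way to obtain such a cut in practice is with SciPy's hierarchical-clustering utilities; in the sketch below, the distance threshold and the toy vectors are arbitrary and chosen only for illustration:

```python
# Build a dendrogram (merge history) and "cut" it at a distance level to get flat clusters.
# Assumes SciPy and NumPy; values are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 4)                    # toy document vectors
Z = linkage(X, method='single')              # agglomerative merge history (the dendrogram)
labels = fcluster(Z, t=1.0, criterion='distance')   # cut the tree at distance 1.0
print(labels)                                # cluster id per document; k was not fixed in advance
```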
Dendrogram – Example Clusters of News Stories: Reuters RCV1
Dendrogram – Example Clusters of Things that People Want: ZEBO
HAC Hierarchical Agglomerative Clustering Start with each document in a separate cluster. Repeat until there is only one cluster: among the current clusters, determine the pair of clusters, c_i and c_j, that are most similar (single-link, complete-link, etc.), then merge c_i and c_j into a single cluster. The history of merging forms a binary tree (hierarchy).
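A minimal sketch of this merge loop in plain NumPy, assuming cosine similarity between toy document vectors; the helper names (hac, single_link) are illustrative, and a real implementation would cache cluster similarities rather than recompute them on every iteration:

```python
# Naive HAC loop: repeatedly merge the most similar pair of clusters.
# Illustrative sketch only; assumes NumPy.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_link(X, ca, cb):
    # single-link: strongest link between any member of ca and any member of cb
    return max(cosine(X[a], X[b]) for a in ca for b in cb)

def hac(X, cluster_sim=single_link):
    clusters = [[i] for i in range(len(X))]   # each document starts as its own cluster
    history = []                              # sequence of merges = the binary tree
    while len(clusters) > 1:
        # find the most similar pair of current clusters
        best, (i, j) = max(
            ((cluster_sim(X, clusters[i], clusters[j]), (i, j))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[0])
        history.append((clusters[i], clusters[j], best))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history

X = np.random.rand(6, 4)                      # toy document vectors
for left, right, s in hac(X):
    print(left, '+', right, 'sim=%.3f' % s)
```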
Single-Link The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine similarity) between their members: sim(c_i, c_j) = max over x in c_i, y in c_j of sim(x, y). After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is: sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) ).
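A tiny NumPy illustration of the update rule above: after a merge, the new cluster's row in a cluster-similarity matrix is just the element-wise maximum of the two merged rows (the matrix values below are made up for illustration):

```python
# Incremental single-link update after merging clusters i and j.
# Illustrative sketch only; assumes NumPy.
import numpy as np

S = np.array([[1.0, 0.8, 0.3, 0.2],
              [0.8, 1.0, 0.5, 0.1],
              [0.3, 0.5, 1.0, 0.4],
              [0.2, 0.1, 0.4, 1.0]])   # toy cluster-to-cluster similarity matrix

i, j = 0, 1                            # merge clusters 0 and 1
new_row = np.maximum(S[i], S[j])       # sim(c_i ∪ c_j, c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
print(new_row)                         # [1.  1.  0.5 0.4]
```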