# Information Retrieval Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 For the MSc Computer Science Programme Dell Zhang.

## Presentation on theme: "Information Retrieval Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 For the MSc Computer Science Programme Dell Zhang."— Presentation transcript:

Information Retrieval Lecture 7 Introduction to Information Retrieval (Manning et al. 2007) Chapter 17 For the MSc Computer Science Programme Dell Zhang Birkbeck, University of London

Yahoo! Hierarchy http://dir.yahoo.com/science dairy crops agronomy forestry AI HCI craft missions botany evolution cell magnetism relativity courses agriculturebiologyphysicsCSspace... … (30)...

Hierarchical Clustering Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.  Divisive (top-down) Start with all documents belong to the same cluster. Eventually each node forms a cluster on its own. Recursive application of a (flat) partitional clustering algorithm, e.g., k Means ( k =2)  Bi-secting k Means.  Agglomerative (bottom-up) Start with each document being a single cluster. Eventually all documents belong to the same cluster.

Dendrogram Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster. The number of clusters k is not required in advance.

Dendrogram – Example Clusters of News Stories: Reuters RCV1

Dendrogram – Example Clusters of Things that People Want: ZEBO

HAC Hierarchical Agglomerative Clustering  Starts with each doc in a separate cluster.  Repeat until there is only one cluster: Among the current clusters, determine the pair of clusters, c i and c j, that are most similar.  (Single-Link, Complete-Link, etc.) Then merges c i and c j to a single cluster.  The history of merging forms a binary tree or hierarchy.

Single-Link The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine- similarity) between their members: After merging c i and c j, the similarity of the resulting cluster to another cluster, c k, is:

HAC – Example

d1 d2 d3 d4 d5 d1,d2 d4,d5 d3 d3,d4,d 5 As clusters agglomerate, docs are likely to fall into a dendrogram.