
1 Motivation
A web query is usually only two or three words long.
– Prone to ambiguity
– Example: "keyboard"
  – An input device for a computer
  – A musical instrument
How can we ease the document selection process for a user?

2 Motivation
Use clusters to represent each topic!
– The user can quickly disambiguate the query or drill down into a specific topic.

3 Motivation
Moreover, with a precomputed clustering of the corpus, the search for documents similar to a query can be carried out efficiently.
– Cluster pruning
We will learn how to cluster a collection of documents into groups.

4 Cluster pruning: preprocessing
Pick √N docs at random: call these leaders.
For every other doc, pre-compute its nearest leader.
– Docs attached to a leader: its followers.
– Likely: each leader has ~√N followers.

5 Cluster pruning: query processing
Process a query as follows:
– Given query Q, find its nearest leader L.
– Seek the K nearest docs from among L's followers.
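A minimal Python sketch of cluster pruning over document vectors. The dictionary-based vector representation, the cosine helper, and all names are illustrative assumptions; the slides do not prescribe an implementation.

    import math
    import random

    def cosine(u, v):
        """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def preprocess(docs):
        """Pick sqrt(N) random leaders and attach every other doc to its nearest leader."""
        n = len(docs)
        leaders = random.sample(range(n), max(1, int(math.sqrt(n))))
        followers = {l: [] for l in leaders}
        for d in range(n):
            if d in followers:
                continue
            best = max(leaders, key=lambda l: cosine(docs[d], docs[l]))
            followers[best].append(d)
        return followers

    def query(q, docs, followers, k):
        """Find the nearest leader, then return the k nearest docs among its followers."""
        leader = max(followers, key=lambda l: cosine(q, docs[l]))
        candidates = [leader] + followers[leader]
        return sorted(candidates, key=lambda d: cosine(q, docs[d]), reverse=True)[:k]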

6 Visualization
(Figure: a query point, the leaders, and the followers attached to each leader.)

7 What is Cluster Analysis?
Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
Cluster analysis
– Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
Unsupervised learning: no predefined classes

8 Quality: What Is Good Clustering?
A good clustering method will produce high-quality clusters with
– high intra-class similarity
– low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.

9 Similarity measures
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables.
Weights should be associated with different variables based on the application and data semantics.
It is hard to define "similar enough" or "good enough" – the answer is typically highly subjective.

10 Vector Objects
Vector objects: keywords in documents.
Cosine measure (similarity)
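A standard formulation of the cosine measure between two document vectors d_1 and d_2 with term weights w_{i,1} and w_{i,2} (shown here in LaTeX; the exact formula on the slide is not reproduced in the transcript):

    \mathrm{sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
    = \frac{\sum_i w_{i,1}\, w_{i,2}}{\sqrt{\sum_i w_{i,1}^2}\;\sqrt{\sum_i w_{i,2}^2}}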

11 Within a cluster and between clusters
(Figure: intra-cluster versus inter-cluster distances.)

12 Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the "middle" of a cluster
Radius: square root of the average squared distance from the points of the cluster to its centroid
Diameter: square root of the average squared distance between all pairs of points in the cluster
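In symbols, for a cluster of N points x_1, …, x_N, one common formulation of these three quantities (not taken verbatim from the slide) is:

    C_m = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
    R_m = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert x_i - C_m \rVert^2}, \qquad
    D_m = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} \lVert x_i - x_j \rVert^2}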

13 Typical Alternatives to Calculate the Similarity between Clusters
Single link: largest similarity between an element in one cluster and an element in the other.
Complete link: smallest similarity between an element in one cluster and an element in the other.
Average link: average similarity between an element in one cluster and an element in the other.
Centroid: distance between the centroids of the two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j).
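A small Python sketch of the three element-wise linkage measures; the pairwise similarity function (e.g. the cosine helper from the cluster-pruning sketch) and the function name are illustrative assumptions.

    def linkage_similarities(cluster_a, cluster_b, sim):
        """Single, complete and average link similarity between two clusters.

        cluster_a, cluster_b: lists of document vectors; sim: pairwise similarity function.
        """
        pairwise = [sim(a, b) for a in cluster_a for b in cluster_b]
        return {
            "single": max(pairwise),    # largest pairwise similarity
            "complete": min(pairwise),  # smallest pairwise similarity
            "average": sum(pairwise) / len(pairwise),
        }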

14 Document Clustering

15 Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
(Figure: five objects a–e merged step by step, 0 to 4, in the agglomerative direction (AGNES), and split step by step, 4 to 0, in the divisive direction (DIANA).)

16 Hierarchical agglomerative clustering (HAC)
HAC is widely used in document clustering.

17 Nearest Neighbor, Level 2, k = 7 clusters. (From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt)

18 Nearest Neighbor, Level 3, k = 6 clusters.

19 Nearest Neighbor, Level 4, k = 5 clusters.

20 Nearest Neighbor, Level 5, k = 4 clusters.

21 Nearest Neighbor, Level 6, k = 3 clusters.

22 Nearest Neighbor, Level 7, k = 2 clusters.

23 Nearest Neighbor, Level 8, k = 1 cluster.

24 Hierarchical Clustering
Calculate the similarity between all possible combinations of two profiles.
The two most similar clusters are grouped together to form a new cluster.
Calculate the similarity between the new cluster and all remaining clusters, and repeat.

25 HAC
The hierarchical merging process leads to a tree called a dendrogram.
– The earlier merges happen between groups with a large similarity.
– This value becomes lower and lower for later merges.
– The user can cut across the dendrogram at a suitable level to get any desired number of clusters.
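A compact Python sketch of HAC and of "cutting across the dendrogram", using SciPy; the library choice, the toy term-frequency matrix, and the average-link/cosine settings are assumptions for illustration, not the slides' prescribed method.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy term-frequency matrix: one row per document.
    docs = np.array([
        [2, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 3, 1],
        [0, 1, 2, 1],
        [3, 0, 0, 0],
    ], dtype=float)

    # Agglomerative clustering with average link on cosine distance.
    tree = linkage(docs, method="average", metric="cosine")

    # Cut the dendrogram to obtain a desired number of clusters, e.g. 2.
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)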

26 Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
– Global optimum: exhaustively enumerate all partitions.
– Heuristic method: k-means (MacQueen '67) – each cluster is represented by the center of the cluster.
  – Hard assignment
  – Soft assignment

27 K-Means with hard assignment
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
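A minimal NumPy sketch of these steps with hard assignment; the random initialization and the convergence test are illustrative choices, not part of the original slides.

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        """Hard-assignment k-means: alternate nearest-centroid assignment and centroid update."""
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        # Step 1: pick k distinct objects as initial cluster centers (one way to form an initial partition).
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        assignment = None
        for _ in range(max_iter):
            # Step 3: assign each object to the cluster with the nearest seed point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            new_assignment = dists.argmin(axis=1)
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break  # Step 4: stop when no assignment changes.
            assignment = new_assignment
            # Step 2: recompute each centroid as the mean point of its cluster.
            for c in range(k):
                members = points[assignment == c]
                if len(members) > 0:
                    centroids[c] = members.mean(axis=0)
        return centroids, assignment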

28 The K-Means Clustering Method: Example
(Figure: with K = 2, arbitrarily choose K objects as initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the clusters stabilize.)

29 K-means with "soft" assignment
Each cluster c is represented as a vector μ_c in term space.
– It is not necessarily the centroid of some documents.
The goal of soft k-means is to find the μ_c so as to minimize the quantization error.
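One standard way to write the quantization error, assuming each document d is charged to its nearest cluster vector (the exact formula shown on the slide is not reproduced in the transcript):

    E = \sum_{d} \lVert d - \mu_{c(d)} \rVert^2, \qquad c(d) = \arg\min_{c} \lVert d - \mu_c \rVert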

30 K-means with "soft" assignment
A simple strategy is to iteratively reduce the error between the mean vectors and the documents they are closest to.
We repeatedly scan through the documents and, for each document d, accumulate a "correction" Δμ_c for the μ_c that is closest to d (typically Δμ_c += η (d − μ_c)).
After scanning once through all documents, all μ_c are updated in a batch:
– μ_c ← μ_c + Δμ_c
η is called the learning rate.

31 K-means with "soft" assignment
The contribution from d need not be limited to only the μ_c that is closest to it.
– The contribution can be shared among many clusters, the portion for cluster c being directly related to the current similarity between μ_c and d (one example is sketched below).
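A short NumPy sketch of one soft-assignment pass in which each document's correction is shared across clusters in proportion to a softmax over its (negative) distances to the μ_c. The softmax weighting, the learning rate, and the temperature are illustrative assumptions; the slide's own sharing formula is not reproduced in the transcript.

    import numpy as np

    def soft_kmeans_pass(docs, mu, eta=0.1, temperature=1.0):
        """One batch pass of soft k-means: accumulate shared corrections, then update all mu_c."""
        delta = np.zeros_like(mu)
        for d in docs:
            # Share d's contribution among clusters; more similar mu_c get a larger share.
            sims = -np.linalg.norm(mu - d, axis=1) / temperature
            weights = np.exp(sims - sims.max())
            weights /= weights.sum()
            # Each cluster's correction pulls mu_c toward d, scaled by its share.
            delta += eta * weights[:, None] * (d - mu)
        return mu + delta  # batch update: mu_c <- mu_c + delta_mu_c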

32 Comments on the K-Means Method
Strength: relatively efficient – O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Comment: often terminates at a local optimum.

