1
Clustering Documents
2
Overview Clustering is the process of partitioning a set of data into a set of meaningful subclasses, where every item in a subclass shares a common trait. It helps a user understand the natural grouping or structure in a data set. Unsupervised learning ◦ Cluster, category, group, class ◦ No training data for a classifier to learn how to group ◦ Documents that share the same properties are placed in the same cluster Design parameters: cluster size, number of clusters, similarity measure ◦ Square root of n, where n is the number of documents ◦ LSI
3
What is clustering? A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.
4
Outliers Outliers are objects that do not belong to any cluster or that form clusters of very small cardinality. In some applications we are interested in discovering the outliers rather than the clusters (outlier analysis).
5
Why do we cluster? Clustering: given a collection of data objects, group them so that they are ◦ similar to one another within the same cluster ◦ dissimilar to the objects in other clusters Clustering results are used: ◦ As a stand-alone tool to get insight into the data distribution; visualization of clusters may unveil important information ◦ As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering
6
Applications of clustering? Image Processing ◦ cluster images based on their visual content Web ◦ Cluster groups of users based on their access patterns on webpages ◦ Cluster webpages based on their content Bioinformatics ◦ Cluster similar proteins together (similarity wrt chemical structure and/or functionality etc) Many more…
7
The clustering task Group observations into groups so that the observations belonging to the same group are similar, whereas observations in different groups are different Basic questions: ◦ What does “similar” mean? ◦ What is a good partition of the objects? I.e., how is the quality of a solution measured? ◦ How do we find a good partition of the observations?
8
Observations to cluster Real-value attributes/variables ◦ e.g., salary, height Binary attributes ◦ e.g., gender (M/F), has_cancer (T/F) Nominal (categorical) attributes ◦ e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.) Ordinal/Ranked attributes ◦ e.g., military rank (soldier, sergeant, lieutenant, captain, etc.) Variables of mixed types ◦ multiple attributes with various types
9
Aim of Clustering Partition unlabeled examples into clusters such that: ◦ Examples within a cluster are very similar ◦ Examples in different clusters are very different
10
Clustering Example (figure of example data points grouped into clusters)
11
Cluster Organization For a “small” number of documents, simple/flat clustering is acceptable ◦ Search a smaller set of clusters for relevancy ◦ If a cluster is relevant, the documents in the cluster are also relevant Problem: looking for broader or more specific documents ◦ Hierarchical clustering has a tree-like structure that supports this
12
Dendrogram A dendrogram presents the progressive, hierarchy-forming merging process pictorially.
13
Visualization of a Dendrogram
14
Example
D1: Human machine interface for computer applications
D2: A survey of user opinion of computer system response time
D3: The EPS user interface management system
D4: System and human system engineering testing of the EPS system
D5: The generation of the random binary and ordered trees
D6: The intersection graphs of paths in a tree
D7: Graph minors: A survey
15
(Figure: dendrogram over D1-D7, with clusters ranging from broad near the root to specific near the leaves.)
16
Cluster Parameters A minimum and maximum size of clusters ◦ A large cluster size means one cluster attracts many documents, producing multi-topic themes A matching threshold value for including documents in a cluster ◦ Minimum degree of similarity ◦ Affects the number of clusters ◦ A high threshold means fewer documents can join a cluster, hence a larger number of clusters The degree of overlap between clusters ◦ Some documents deal with more than one topic ◦ A low degree of overlap gives greater separation of clusters A maximum number of clusters
17
Cluster-Based Search Inverted file organization ◦ Query keywords must exactly match word occurrences A clustered file organization instead matches a keyword against a set of cluster representatives ◦ Each cluster representative consists of popular words related to a common topic In flat clustering, the query is compared against the centroids of the clusters (see the sketch below) ◦ Centroid: the average representative of a group of documents, built from the composite text of all member documents
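As a rough illustration of the flat case, a query vector can be scored against each cluster's centroid and the clusters ranked by that score. The sketch below assumes numeric term-weight vectors and uses cosine similarity; the function names are illustrative, not taken from the slides.

```python
import numpy as np

def centroid(doc_vectors):
    """Average representative built from a cluster's member document vectors."""
    return np.mean(np.asarray(doc_vectors, dtype=float), axis=0)

def rank_clusters(query_vec, clusters):
    """Return cluster ids ordered by cosine similarity between the query and each centroid."""
    query_vec = np.asarray(query_vec, dtype=float)
    scores = {}
    for cid, docs in clusters.items():
        c = centroid(docs)
        denom = np.linalg.norm(query_vec) * np.linalg.norm(c) or 1.0  # guard against zero vectors
        scores[cid] = float(np.dot(query_vec, c) / denom)
    return sorted(scores, key=scores.get, reverse=True)
```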
18
Automatic Document Classification Searching vs. Browsing Disadvantages in using inverted index files ◦ information pertaining to a document is scattered among many different inverted-term lists ◦ information relating to different documents with similar term assignment is not in close proximity in the file system Approaches ◦ inverted-index files (for searching) + clustered document collection (for browsing) ◦ clustered file organization (for searching and browsing)
19
(Figure: typical clustered file organization, a hierarchy of the highest-level centroid, supercentroids, centroids, and documents, with a typical search path traced from the top down.)
20
Cluster Generation vs. Cluster Search Cluster generation ◦ Cluster structure is generated only once. ◦ Cluster maintenance can be carried out at relatively infrequent intervals. ◦ Cluster generation process may be slower and more expensive. Cluster search ◦ Cluster search operations may have to be performed continually. ◦ Cluster search operations must be carried out efficiently.
21
Hierarchical Cluster Generation Two strategies ◦ pairwise item similarities ◦ heuristic methods Models ◦ Divisive Clustering (top down) The complete collection is assumed to represent one complete cluster. The collection is then subsequently broken down into smaller pieces. ◦ Hierarchical Agglomerative Clustering (bottom up) Individual item similarities are used as a starting point. A gluing operation collects similar items, or groups, into larger groups.
22
Searching with a taxonomy Two ways to search a document collection organized in a taxonomy Top Down Search ◦ Start at the root ◦ Progressively compare the query with cluster representatives ◦ A single error at a higher level => wrong path => incorrect cluster Bottom Up Search ◦ Compare the query with the most specific clusters at the lowest level ◦ A high number of low-level clusters increases computation time ◦ Use an inverted index for the low-level representatives
23
Aim of Clustering again? Partitioning data into classes with high intra-class similarity and low inter-class similarity Is it well-defined?
24
What is Similarity? Clearly, a subjective and problem-dependent measure
25
How Similar Are Clusters? Ex1: Two clusters or one cluster?
26
How Similar Are Clusters? Ex2: A cluster or outliers?
27
Similarity Measures Most clustering methods ◦ use a matrix of similarity computations ◦ compute similarities between documents Homework: What similarity measures are used in text mining? Discuss their advantages and disadvantages. Where appropriate, comment on the application areas for each similarity measure. List your references and use your own words.
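For orientation (not a substitute for the homework), the cosine measure over term-frequency vectors is one of the most widely used document similarity measures in text mining; a minimal sketch:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two documents' term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Example with two of the titles above (D1 and D3 fragments):
print(cosine_similarity("human machine interface", "user interface management system"))
```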
28
Linking Methods Clique Star String
29
Clustering Methods There are many methods to compute clusters; the general problem is NP-complete ◦ Each candidate solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible ◦ Each trial may produce a different cluster organization
30
Stable Clustering Results should be independent of the initial order of documents Clusters should not be substantially different when new documents are added to the collection Results from consecutive runs should not differ significantly
31
K-Means A heuristic with complexity O(n log n) ◦ Matrix-based algorithms are O(n^2) Begins with an initial set of clusters ◦ Pick the cluster centroids randomly ◦ Use matrix-based similarity on a small subset ◦ Use a density test to pick cluster centers from the sample data: D_i is a cluster center if at least n other documents have a similarity to it greater than a threshold A set of documents that are sufficiently dissimilar must exist in the collection
32
K-Means Algorithm 1. Select k documents from the collection to form k initial singleton clusters 2. Repeat until the termination conditions are satisfied: i. For every document d, find the cluster i whose centroid is most similar and assign d to cluster i ii. For every cluster i, recompute the centroid based on the current member documents iii. Check for termination: minimal or no changes in the assignment of documents to clusters 3. Return a list of clusters
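A minimal sketch of this loop, assuming documents are numeric term-weight vectors and that a similarity function such as cosine is supplied; the names and the iteration cap are illustrative choices:

```python
import random

def k_means(doc_vectors, k, sim, max_iter=20):
    """Sketch of the k-means loop above.
    doc_vectors: list of numeric vectors; sim(a, b): similarity function (e.g. cosine)."""
    # Step 1: pick k documents as initial singleton clusters (random choice here).
    centroids = [list(v) for v in random.sample(doc_vectors, k)]
    assignment = [None] * len(doc_vectors)
    for _ in range(max_iter):
        # Step 2i: assign every document to the most similar centroid.
        new_assignment = [max(range(k), key=lambda c: sim(d, centroids[c]))
                          for d in doc_vectors]
        # Step 2iii: terminate when assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2ii: recompute each centroid as the average of its current members.
        for c in range(k):
            members = [d for d, a in zip(doc_vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Step 3: return the list of clusters (document indices per cluster).
    return [[i for i, a in enumerate(assignment) if a == c] for c in range(k)]
```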
33
Simulated Annealing Avoids local optima by searching randomly ◦ Downhill move: a new solution with a higher (better) value than the previous solution ◦ Uphill move: a worse solution is accepted in order to escape local minima; the frequency of uphill moves decreases during the “life cycle” Analogy to crystal formation during annealing
34
Simulated Annealing Algorithm 1. Get an initial set of clusters and set the temperature to T 2. Repeat until the temperature is reduced to the minimum ◦ Run a loop x times: find a new set of clusters by altering the membership of some documents; compare the values of the new and old sets of clusters; if there is an improvement, accept the new set of clusters, otherwise accept it with probability p ◦ Reduce the temperature based on the cooling schedule 3. Return the final set of clusters
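A minimal sketch of that loop, assuming we maximize a user-supplied value() function (e.g., total within-cluster similarity); the acceptance probability exp(delta/T), the cooling factor alpha, and the other parameter names are illustrative choices, not prescribed by the slides:

```python
import math
import random

def anneal_clusters(docs, k, value, T=1.0, T_min=0.01, alpha=0.9, moves_per_T=50):
    """Sketch of the simulated-annealing loop above.
    value(clustering) returns a score to maximize for a list of cluster labels."""
    # Step 1: initial set of clusters (random assignment) and starting temperature T.
    clustering = [random.randrange(k) for _ in docs]
    # Step 2: repeat until the temperature reaches the minimum.
    while T > T_min:
        for _ in range(moves_per_T):                      # run a loop x times
            candidate = clustering[:]
            candidate[random.randrange(len(docs))] = random.randrange(k)  # alter membership
            delta = value(candidate) - value(clustering)
            # Accept improvements; accept worse solutions with probability p = exp(delta / T).
            if delta > 0 or random.random() < math.exp(delta / T):
                clustering = candidate
        T *= alpha                                        # cooling schedule
    # Step 3: return the final set of clusters.
    return clustering
```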
35
Simulated Annealing Simple to implement Solutions are reasonably good and avoid local minima Successful in other optimization tasks The initial set is very important Adjusting the size of clusters is difficult
36
Genetic Algorithm
37
Pick two parent solutions x and y from the set of all solutions, with a preference for solutions with a higher fitness score. Use a crossover operation to combine x and y and generate a new solution z. Periodically mutate a solution by randomly exchanging two documents within it. (A sketch follows below.)
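A minimal sketch of these three operations over cluster-assignment vectors (one cluster label per document); the one-point crossover, the fitness signature, and the replacement policy are illustrative assumptions:

```python
import random

def crossover(x, y):
    """Combine two parent cluster assignments into a child (one-point crossover)."""
    cut = random.randrange(1, len(x))
    return x[:cut] + y[cut:]

def mutate(z):
    """Randomly exchange the cluster labels of two documents in a solution."""
    i, j = random.sample(range(len(z)), 2)
    z[i], z[j] = z[j], z[i]
    return z

def evolve(population, fitness, generations=100, mutation_rate=0.1):
    """Sketch of the loop described above; assumes a population of at least two solutions."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]       # preference for high fitness
        child = crossover(*random.sample(parents, 2))
        if random.random() < mutation_rate:                # periodic mutation
            child = mutate(child)
        population = ranked[:-1] + [child]                 # replace the weakest solution
    return max(population, key=fitness)
```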
38
Learn scatter/gather algorithm
39
Extra material
40
Hierarchical Agglomerative Clustering Basic procedure 1. Place each of the N documents into a class of its own. 2. Compute all pairwise document-document similarity coefficients (N(N-1)/2 coefficients). 3. Form a new cluster by combining the most similar pair of current clusters i and j; update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j. 4. Repeat step 3 while the number of clusters left is greater than 1.
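A minimal sketch of this procedure, assuming the pairwise similarities are given as a full symmetric matrix; instead of editing matrix rows it recomputes cluster-to-cluster similarity through a linkage function (max for single link, min for complete link), which is an implementation choice rather than the slides' exact bookkeeping:

```python
def hac(sim_matrix, linkage=max):
    """Sketch of the agglomerative procedure above.
    sim_matrix[i][j]: pairwise document similarities (step 2, full symmetric matrix).
    linkage=max gives single link, linkage=min gives complete link."""
    # Step 1: place each document into a class of its own.
    clusters = [{i} for i in range(len(sim_matrix))]
    merges = []
    # Step 4: repeat step 3 while more than one cluster remains.
    while len(clusters) > 1:
        # Step 3: find the most similar pair of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(sim_matrix[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))        # record the merge and its level
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] | clusters[b]]
    return merges                                           # the dendrogram as a merge list
```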
41
How to Combine Clusters? Intercluster similarity ◦ Single-link ◦ Complete-link ◦ Group-average link Single-link clustering ◦ Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class ◦ The similarity between a pair of clusters is taken to be the similarity between their most similar pair of items ◦ Each cluster member will be more similar to at least one member of that same cluster than to any member of another cluster
42
How to Combine Clusters? (Continued) Complete-link clustering ◦ Each document has a similarity to all other documents in the same class that exceeds the threshold value ◦ The similarity between the least similar pair of items from the two clusters is used as the cluster similarity ◦ Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster
43
How to Combine Clusters? (Continued) Group-average link clustering ◦ a compromise between the extremes of the single-link and complete-link systems ◦ all objects contribute to the intercluster similarity ◦ resulting in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters
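Side by side, the three intercluster similarity rules can be written as small functions over a pairwise similarity matrix; a sketch (the function and parameter names are illustrative):

```python
def single_link(sim, c1, c2):
    """Cluster similarity: similarity of the MOST similar pair across the two clusters."""
    return max(sim[i][j] for i in c1 for j in c2)

def complete_link(sim, c1, c2):
    """Cluster similarity: similarity of the LEAST similar pair across the two clusters."""
    return min(sim[i][j] for i in c1 for j in c2)

def group_average_link(sim, c1, c2):
    """Cluster similarity: average of all pairwise similarities across the two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)
```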
44
Example for Agglomerative Clustering Items A-F (6 items); 6(6-1)/2 = 15 pairwise similarities, processed in decreasing order
45
Single Link Clustering
Similarity matrix:
      A    B    C    D    E    F
A     .   .3   .5   .6   .8   .9
B    .3    .   .4   .5   .7   .8
C    .5   .4    .   .3   .5   .2
D    .6   .5   .3    .   .4   .1
E    .8   .7   .5   .4    .   .3
F    .9   .8   .2   .1   .3    .
Step 1: pair AF 0.9; merge A and F at 0.9. Updated matrix, using sim(AF,X) = max(sim(A,X), sim(F,X)):
      AF    B    C    D    E
AF     .   .8   .5   .6   .8
B     .8    .   .4   .5   .7
C     .5   .4    .   .3   .5
D     .6   .5   .3    .   .4
E     .8   .7   .5   .4    .
Step 2: pair AE 0.8; E joins at 0.8, giving AEF, with sim(AEF,X) = max(sim(AF,X), sim(E,X)).
46
Single Link Clustering (Cont.)
Updated matrix for AEF:
       AEF    B    C    D
AEF      .   .8   .5   .6
B       .8    .   .4   .5
C       .5   .4    .   .3
D       .6   .5   .3    .
Step 3: pair BF 0.8; B joins at 0.8, giving ABEF, with sim(ABEF,X) = max(sim(AEF,X), sim(B,X)). Note that E and B sit at the same level (0.8) in the dendrogram.
Updated matrix for ABEF:
        ABEF    C    D
ABEF       .   .5   .6
C         .5    .   .3
D         .6   .3    .
Step 4: pair BE 0.7; B and E are already in the same cluster, so nothing changes. Next, sim(ABDEF,X) = max(sim(ABEF,X), sim(D,X)) is used.
47
Single Link Clustering (Cont.)
Step 5: pair AD 0.6; D joins at 0.6, giving ABDEF. Updated matrix:
         ABDEF    C
ABDEF        .   .5
C           .5    .
Step 6: pair AC 0.5; C joins at 0.5, and all six documents now form a single cluster.
(Dendrogram join levels: A-F at 0.9; E and B at 0.8; D at 0.6; C at 0.5.)
48
Single-Link Clusters Similarity level 0.7 (i.e., similarity threshold) ◦ ABEF ◦ C ◦ D Similarity level 0.5 (i.e., similarity threshold) ◦ ABEFCD
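As a quick check of this worked example, single-link clusters at a given threshold are exactly the connected components of the "similarity >= threshold" graph. A small sketch over the matrix above (the union-find bookkeeping is just for illustration):

```python
docs = "ABCDEF"
sim = {("A","B"): .3, ("A","C"): .5, ("A","D"): .6, ("A","E"): .8, ("A","F"): .9,
       ("B","C"): .4, ("B","D"): .5, ("B","E"): .7, ("B","F"): .8,
       ("C","D"): .3, ("C","E"): .5, ("C","F"): .2,
       ("D","E"): .4, ("D","F"): .1, ("E","F"): .3}

def single_link_clusters(threshold):
    """Connected components of the graph whose edges are pairs with similarity >= threshold."""
    parent = {d: d for d in docs}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)          # union the two components
    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return sorted(map(sorted, groups.values()))

print(single_link_clusters(0.7))   # [['A','B','E','F'], ['C'], ['D']]  (matches the slide)
print(single_link_clusters(0.5))   # [['A','B','C','D','E','F']]        (matches the slide)
```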
49
Complete-Link Cluster Generation
Similarity matrix: the same 6x6 matrix as in the single-link example. Cluster similarity rule: sim(AF,X) = min(sim(A,X), sim(F,X)).
Step 1: pair AF 0.9; new cluster A-F at 0.9.
Step 2: pair AE 0.8; check EF; pairs covered: (A,E) (A,F).
Step 3: pair BF 0.8; check AB; pairs covered: (A,E) (A,F) (B,F).
50
Complete-Link Cluster Generation (Cont.)
Step 4: pair BE 0.7; new cluster B-E at 0.7. Updated matrix after merging AF (min rule):
      AF    B    C    D    E
AF     .   .3   .2   .1   .3
B     .3    .   .4   .5   .7
C     .2   .4    .   .3   .5
D     .1   .5   .3    .   .4
E     .3   .7   .5   .4    .
Step 5: pair AD 0.6; check DF; pairs covered: (A,D) (A,E) (A,F) (B,E) (B,F).
Step 6: pair AC 0.5; check CF; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,E) (B,F).
Step 7: pair BD 0.5; check DE; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F).
51
Complete-Link Cluster Generation (Cont.)
Updated matrix after merging BE:
      AF   BE    C    D
AF     .   .3   .2   .1
BE    .3    .   .4   .4
C     .2   .4    .   .3
D     .1   .4   .3    .
Step 8: pair CE 0.5; check BC; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,D) (B,E) (B,F) (C,E).
Step 9: pair BC 0.4; check CE (0.5, already covered); C joins B-E, forming B-C-E at 0.4.
Step 10: pair DE 0.4; check BD (0.5), DE; pairs covered: (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E).
Step 11: pair AB 0.3; check AC (0.5), AE (0.8), BF (0.8), CF, EF; pairs covered: (A,B) (A,C) (A,D) (A,E) (A,F) (B,C) (B,D) (B,E) (B,F) (C,E) (D,E).
52
Complete-Link Cluster Generation (Cont.)
Updated matrix after forming BCE:
       AF   BCE    D
AF      .    .2   .1
BCE    .2     .   .3
D      .1    .3    .
Step 12: pair CD 0.3; check BD (0.5), DE (0.4); D joins, forming B-C-D-E at 0.3.
Step 13: pair EF 0.3; check BF (0.8), CF, DF; pairs covered now include (C,D) (C,E) (D,E) (E,F).
Step 14: pair CF 0.2; check BF (0.8), EF (0.3), DF; pairs covered now include (C,F).
53
Complete-Link Cluster Generation (Cont.)
Updated matrix after forming BCDE:
        AF   BCDE
AF       .     .1
BCDE    .1      .
Step 15: pair DF 0.1; the last pair; AF and BCDE merge at 0.1.
(Final dendrogram join levels: A-F at 0.9; B-E at 0.7; C at 0.4; D at 0.3; the two remaining clusters join at 0.1.)
54
Complete-Link Clusters
Similarity level 0.7 (threshold): AF (joined at 0.9), BE (0.7), C, D
Similarity level 0.4: AF (0.9), BCE (B-E at 0.7, C at 0.4), D
Similarity level 0.3: AF (0.9), BCDE (B-E at 0.7, C at 0.4, D at 0.3)
55
Group Average Link Clustering Group-average link clustering ◦ uses the average values of the pairwise links within a cluster to determine similarity ◦ all objects contribute to the intercluster similarity ◦ resulting in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters
56
Comparison The Behavior of Single-Link Cluster ◦ The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect. ◦ Each element is usually attached to only one other member of the same cluster at each similarity level. ◦ It is sufficient to remember the list of previously clustered single items.
57
Comparison The Behavior of Complete-Link Cluster ◦ Complete-link process produces a much larger number of small, tightly linked groupings. ◦ Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level. ◦ It is necessary to remember the list of all item pairs previously considered in the clustering process. Comparison ◦ The complete-link clustering system may be better adapted to retrieval than the single-link clusters. ◦ A complete-link cluster generation is more expensive to perform than a comparable single-link process.
58
How to Generate Similarity
D_i = (w_{1,i}, w_{2,i}, ..., w_{t,i}): document vector for D_i
L_j = (l_{j,1}, l_{j,2}, ..., l_{j,n_j}): inverted list for term T_j
l_{j,i} denotes the document identifier of the i-th document listed under term T_j
n_j denotes the number of postings for term T_j

for j = 1 to t (for each of the t possible terms)
  for i = 1 to n_j (for all n_j entries on the j-th list)
    compute sim(D_{l_{j,i}}, D_{l_{j,k}}) for i+1 <= k <= n_j
  end for
end for
59
Similarity without Recomputation
set S_{j,i} = 0 for 1 <= j <= N
for j = 1 to N (for each document in the collection)
  for each term k in document D_j, take up inverted list L_k
    for i = 1 to n_k (for each document identifier on list L_k)
      if j < l_{k,i} or S_{j,i} = 1, take up the next document
      else compute sim(D_j, D_{l_{k,i}}) and set S_{j,i} = 1
    end for
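A minimal sketch of the same idea in code, assuming documents are given as lists of terms and a similarity function is supplied; only documents that share at least one term are compared, and a "computed" set plays the role of the S flags so no pair is computed twice:

```python
from collections import defaultdict

def pairwise_similarities(docs, sim):
    """Compute similarities only between documents that share a term, each pair once.
    docs: list of term lists; sim(a, b): similarity between two term lists."""
    # Build inverted lists: term -> list of document identifiers (ascending by construction).
    inverted = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for t in set(terms):
            inverted[t].append(doc_id)
    computed = set()                        # plays the role of the S_{j,i} flags
    scores = {}
    for term, postings in inverted.items():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                i, j = postings[a], postings[b]
                if (i, j) not in computed:  # skip pairs already handled via another term
                    computed.add((i, j))
                    scores[(i, j)] = sim(docs[i], docs[j])
    return scores
```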
60
Heuristic Clustering Methods Hierarchical clustering strategies ◦ use all pairwise similarities between items ◦ the cluster-generation process is relatively expensive ◦ produce a unique set of well-formed clusters for each set of data, regardless of the order in which the similarity pairs are introduced into the clustering process Heuristic clustering methods ◦ produce rough cluster arrangements at relatively little expense ◦ e.g., single-pass clustering
61
Single-Pass Clustering Heuristic Methods Item 1 is first taken and placed into a cluster of its own. Each subsequent item is then compared against all existing clusters ◦ Compute the similarities between all existing centroids and the new incoming item ◦ The item is placed in a previously existing cluster whenever it is sufficiently similar to that cluster ◦ When an item is added to an existing cluster, the corresponding centroid must then be appropriately updated If a new item is not sufficiently similar to any existing cluster, it forms a cluster of its own (see the sketch below)
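A minimal sketch of this single pass, assuming numeric document vectors, a supplied similarity function, and a fixed threshold; the running-average centroid update is one reasonable reading of "appropriately updated":

```python
def single_pass(doc_vectors, sim, threshold):
    """Sketch of the single-pass heuristic described above."""
    clusters = []                                   # each cluster: (centroid, [member ids])
    for doc_id, vec in enumerate(doc_vectors):
        if not clusters:
            clusters.append((list(vec), [doc_id]))  # item 1 starts its own cluster
            continue
        # Compare the incoming item against all existing centroids.
        best = max(range(len(clusters)), key=lambda c: sim(vec, clusters[c][0]))
        centroid, members = clusters[best]
        if sim(vec, centroid) >= threshold:
            members.append(doc_id)
            # Update the centroid as the running average of its members.
            n = len(members)
            clusters[best] = ([(x * (n - 1) + y) / n for x, y in zip(centroid, vec)], members)
        else:
            clusters.append((list(vec), [doc_id]))  # not similar enough: start a new cluster
    return clusters
```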
62
Single-Pass Clustering Heuristic Methods (Continued) Characteristics ◦ Produces uneven cluster structures ◦ Produces cluster arrangements that vary according to the order of the individual items Solutions ◦ cluster splitting: controls cluster sizes ◦ variable similarity thresholds: control the number of clusters and the overlap among clusters
63
Cluster Splitting Addition of one more item to cluster A Splitting cluster A into two pieces A’ and A’’ Splitting superclusters S into two pieces S’ and S’’
64
Cluster Searching Cluster centroid: the average vector of all the documents in a given cluster Strategies ◦ top down: the query is first compared with the highest-level centroids ◦ bottom up: only the lowest-level centroids are stored; the higher-level cluster structure is disregarded
65
Top-down entire-clustering search 1. Initialize by adding the top item to the active node list 2. Take the centroid with the highest query similarity from the active node list; if the number of singleton items in the subtree headed by that centroid is not larger than the number of items wanted, then retrieve these singleton items and eliminate the centroid from the active node list; else eliminate that centroid from the active node list and add its sons to the active node list 3. If the number of items retrieved >= the number wanted, then stop; else repeat step 2
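A minimal sketch of this search, assuming each tree node exposes illustrative .children and .documents attributes (the singleton items in its subtree) and that a query_sim(node) scoring function is supplied; here the subtree size is compared against the number of items still to be retrieved, which appears consistent with the worked trace on the next slide:

```python
import heapq
import itertools

def top_down_search(root, query_sim, n_wanted):
    """Sketch of the top-down entire-clustering search above."""
    retrieved = []
    tie = itertools.count()                              # tiebreaker so nodes are never compared
    # Step 1: initialize the active node list with the top (root) centroid.
    active = [(-query_sim(root), next(tie), root)]       # max-heap via negated similarity
    while active and len(retrieved) < n_wanted:          # step 3: stop once enough items retrieved
        _, _, node = heapq.heappop(active)               # step 2: highest query-similarity centroid
        if len(node.documents) <= n_wanted - len(retrieved):
            retrieved.extend(node.documents)             # subtree small enough: take its singletons
        else:
            for child in node.children:                  # otherwise expand: add its sons to the list
                heapq.heappush(active, (-query_sim(child), next(tie), child))
    return retrieved
```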
66
Active node list | Number of singleton items in subtree | Retrieved items
(1,0.2) | 14 | (too big)
(2,0.5), (4,0.7), (3,0) | 6 | (too big)
(2,0.5), (8,0.8), (9,0.3), (3,0) | 2 | I, J
(2,0.5), (9,0.3), (3,0) | 4 | (too big)
(5,0.6), (6,0.5), (9,0.3), (3,0) | 2 | A, B
67
Bottom-up Individual-Cluster Search Take a specified number of low-level centroids if there are enough singleton items in those clusters to equal the number of items wanted, then retrieve the number of items wanted in ranked order; else add additional low-level centroids to list and repeat test
68
Active centroid list: (8,.8), (4,.7), (5,.6) Ranked documents from clusters: (I,.9), (L,.8), (A,.8), (K,.6), (B,.5), (J,.4), (N,.4), (M,.2) Retrieved items: I, L, A