Information Organization: Clustering


1 Information Organization: Clustering

2 Clustering (Unsupervised Learning)
Similar items are grouped together into clusters.

A = (blue, circle, small)
B = (red, square, large)
C = (blue, square, large)
D = (red, circle, small)
E = (red, circle, small)
F = (blue, circle, small)

Object-attribute array (1 = attribute present):

                 A  B  C  D  E  F
  t1 (1=blue)    1  0  1  0  0  1
  t2 (1=circle)  1  0  0  1  1  1
  t3 (1=small)   1  0  0  1  1  1

3 Clustering: Procedure
1. Construct an object-attribute array.
   - Array element = presence/absence, occurrence frequency, or a measure of importance (e.g., tf*idf, term relevance weight).
2. Compute a measure of association between objects (e.g., Dice, Jaccard, cosine similarity; see the sketch below).
3. Construct an object-object association array.
   - Array element = similarity measure.
4. Identify the pairwise associations above a given threshold.
   - Two objects are related if their strength of association (e.g., similarity) is greater than or equal to some threshold value.
5. Apply a clustering criterion (rules for combining related objects).
   - Single-link criterion: an object is related to at least one member of an existing cluster, i.e., similarity to the closest element in the cluster >= threshold.
   - Complete-link criterion: an object is related to all members of an existing cluster, i.e., similarity to the farthest element in the cluster >= threshold.
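To make step 2 concrete, here is a minimal Python sketch of the three association measures over binary object-attribute vectors, using the shapes example from slide 2 (the function names are ours, not the slide's; the measures assume vectors are not all-zero):

```python
import math

def dice(x, y):
    """Dice coefficient: 2|X intersect Y| / (|X| + |Y|) for binary vectors."""
    inter = sum(a & b for a, b in zip(x, y))
    return 2 * inter / (sum(x) + sum(y))

def jaccard(x, y):
    """Jaccard coefficient: |X intersect Y| / |X union Y|."""
    inter = sum(a & b for a, b in zip(x, y))
    union = sum(a | b for a, b in zip(x, y))
    return inter / union

def cosine(x, y):
    """Cosine similarity: x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Object-attribute columns from the shapes example: (t1=blue, t2=circle, t3=small)
A = (1, 1, 1)   # blue, circle, small
D = (0, 1, 1)   # red, circle, small
print(dice(A, D), jaccard(A, D), round(cosine(A, D), 3))
# -> 0.8  0.666...  0.816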

4 Clustering: Example

[Object-attribute array: documents D1-D6 x terms t1-t6, entries = occurrence frequencies]
[Object-object association array: Dice similarities between document pairs, with values such as 2/6, 4/6, 4/5, 2/5, 4/8, 6/8, 4/7, 2/7]

Threshold = 0.6
Single-link clusters: C1 = (D1, D3, D5), C2 = (D2, D4)
Complete-link clusters: C1 = (D1, D3) or (D1, D5), C2 = (D2, D4)
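A minimal sketch of the single-link criterion applied to an association array: pairs at or above the threshold are linked, and clusters grow by merging any two clusters that share a related pair. The similarity values below are assumptions chosen to be consistent with the slide's stated single-link result; the slide's full matrix is not recoverable from the transcript.

```python
def single_link(objects, sim, threshold):
    """Group objects into clusters: two clusters merge if at least one
    cross-cluster pair has similarity >= threshold (single link)."""
    clusters = [{o} for o in objects]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(sim.get(frozenset((a, b)), 0) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Illustrative similarities (assumptions consistent with the slide's result):
sim = {frozenset(p): v for p, v in {
    ("D1", "D3"): 4/6, ("D1", "D5"): 4/5,
    ("D2", "D4"): 6/8, ("D3", "D5"): 2/5,
}.items()}
print(single_link(["D1", "D2", "D3", "D4", "D5", "D6"], sim, 0.6))
# e.g. [{'D1', 'D3', 'D5'}, {'D2', 'D4'}, {'D6'}]
```

Replacing `any` with `all` in the merge test gives the complete-link criterion: a merge then requires every cross-cluster pair to clear the threshold, which is why complete link stops at (D1, D3) or (D1, D5) when sim(D3, D5) falls below 0.6.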

5 Clustering: Problem

[Document-document similarity matrix for D1-D7; only fragments survive in the transcript: 0.40, 0.33, 0.57, 0.10, 0.50, 0.67, 0.29, 0.11, 0.25, 0.00]

Threshold = 0.67:
  Single & complete link: C1 = (D2, D7)
Threshold = 0.57:
  Single link: C1 = (D1, D4, D6), C2 = (D2, D7)
  Complete link: C1 = (D1, D4) or (D1, D6), C2 = (D2, D7)
Threshold = 0.5:
  Single link: C1 = (D1, D2, D3, D4, D6, D7)
  Complete link: C1 = (D1, D4) or (D1, D6) or (D1, D7), C2 = (D2, D7) or (D3, D7)

The problem: small changes in the threshold produce very different clusterings, and single link tends to chain loosely related documents into one large cluster.
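The sketch below illustrates the threshold-sensitivity problem by running single-link clustering (connected components via a small union-find) at the slide's three thresholds. The similarity values are hypothetical, chosen only so that the single-link output matches the slide's results at each threshold.

```python
# Hypothetical pairwise similarities (not the slide's full matrix).
sims = {("D1", "D4"): 0.57, ("D4", "D6"): 0.60, ("D1", "D6"): 0.50,
        ("D2", "D7"): 0.67, ("D3", "D7"): 0.55, ("D1", "D7"): 0.50}

def components(nodes, edges):
    """Connected components via union-find: single-link clusters."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

docs = [f"D{i}" for i in range(1, 8)]
for t in (0.67, 0.57, 0.50):
    edges = [pair for pair, s in sims.items() if s >= t]
    # Report only non-singleton clusters, as the slide does.
    print(t, [c for c in components(docs, edges) if len(c) > 1])
```

At 0.67 only (D2, D7) survive; at 0.57 the chain D1-D4-D6 appears; at 0.5 a single chain swallows six of the seven documents.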

6 Clustering Algorithms: Types

Hierarchical vs. flat
- Hierarchical: induce a hierarchy of clusters of decreasing generality (less efficient than flat).
- Flat (non-hierarchical): all clusters are at the same level.

Hard vs. soft
- Hard: assign each item to a cluster (binary decision).
- Soft: assign each item a probability of belonging to each cluster.

Disjunctive vs. non-disjunctive
- Disjunctive: an item must belong to only one cluster.
- Non-disjunctive: an item can be part of more than one cluster.

Iterative
1. Start with an initial set of clusters.
2. Reassign items to improve the clusters.
3. Repeat step 2 until convergence.

Linkage vs. non-linkage
- Linkage: link together similar items to identify clusters.
- Non-linkage: start with clusters, assign items to clusters, then evaluate and reassign items.

7 Clustering: Examples

[Four example diagrams: hard/non-hierarchical/disjunctive, hard/non-hierarchical/non-disjunctive, soft/non-hierarchical/non-disjunctive, hard/hierarchical/disjunctive]

Soft cluster membership (probability of each document belonging to each cluster; some cells were lost in the transcript):

        C1     C2     C3
  D1  0.123  0.543  0.231
  D2  0.434  0.232
  D3  0.013  0.512
  D4  0.444  0.277  0.435
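A soft assignment like the table above can be turned into a hard, disjunctive one by giving each document to its most probable cluster. The sketch below uses only the two rows of the table that survived intact:

```python
# Soft membership probabilities from the slide (D2/D3 omitted: values lost).
soft = {
    "D1": {"C1": 0.123, "C2": 0.543, "C3": 0.231},
    "D4": {"C1": 0.444, "C2": 0.277, "C3": 0.435},
}

# Harden: each document goes to its highest-probability cluster.
hard = {doc: max(probs, key=probs.get) for doc, probs in soft.items()}
print(hard)  # {'D1': 'C2', 'D4': 'C1'}
```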

8 Clustering: Hierarchical vs Non-Hierarchical
Agglomerative (bottom-up): start at the bottom and merge a pair of clusters into a single cluster (see the SciPy sketch below).

Algorithm:
1. Create a doc-doc similarity matrix; initially, each document is its own cluster.
2. Combine the two most similar clusters into one.
3. Update the doc-doc similarity matrix.
4. Go to step 2 and repeat until there is only one cluster left.

Linkage methods:
- Single linkage: proximity to the closest element in another cluster (maximum similarity).
- Complete linkage: proximity to the most distant element (minimum similarity).
- Mean: proximity to the mean (centroid).

Divisive (top-down): start at the top and split one cluster into two new clusters, choosing the split that produces the two clusters with the largest dissimilarity.

Non-hierarchical: find the best grouping of items into k clusters, e.g., k-means clustering.
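A hedged sketch of agglomerative clustering using SciPy's hierarchy module; the toy document vectors are assumptions, and `method='single'`, `'complete'`, and `'average'` correspond to the linkage methods listed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy document-feature vectors (assumptions, not the slide's data).
docs = np.array([[1.0, 1.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])

dist = pdist(docs)                      # condensed doc-doc distance matrix (Euclidean)
tree = linkage(dist, method="single")   # bottom-up merge history (dendrogram data)
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
print(labels)                           # cluster id per document, e.g. [1 1 2 1]
```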

9 k-means Clustering

Features:
- Iterative, hard, flat (non-hierarchical), non-linkage.
- n items are assigned to k clusters so that the average distance to the cluster mean is minimized.
- Uses Euclidean distance.

Algorithm:
1. Select k (the number of clusters).
2. Select k initial cluster centers c1, ..., ck:
   a. Randomly assign each item to a cluster.
   b. Calculate the centroid of each cluster.
3. For each item:
   a. Calculate the distance to each cluster centroid.
   b. Assign the item to the closest cluster.
4. Go to step 2b and repeat until convergence.

Issues:
- Must select k in advance.
- Must initialize the clusters, e.g., by random selection of k documents as initial centers.
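A minimal NumPy implementation following the slide's steps; the data, the seed, and the convergence test (centroids stop moving) are our choices, and initialization here picks k random items as centers, one of the options the slide mentions:

```python
import numpy as np

def kmeans(items, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct items as the starting centroids.
    centroids = items[rng.choice(len(items), size=k, replace=False)]
    for _ in range(iters):
        # Assign each item to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(items[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new = np.array([items[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # convergence: centroids stop moving
            break
        centroids = new
    return labels, centroids

items = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(items, k=2)
print(labels)   # e.g. [0 0 1 1]
```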

10 Automatic Thesaurus

Typical use:

Query refinement
- Heteronym: words spelled the same way but differing in pronunciation (e.g., bow).
- Homonym: words pronounced or spelled the same way but with distinctly different meanings.
  - Homograph: spelled the same but differ in meaning (e.g., fair, bank); different concepts.
  - Homophone: pronounced the same but differ in meaning (e.g., bare and bear).
- Polyseme: a word with multiple related meanings (e.g., mole, branch, bank); similar concepts.

Query expansion
- Synonym: e.g., bank -> financial institution.
- Hypernym: e.g., car -> vehicle.
- Hyponym: e.g., car -> SUV, van, sedan.

Term clustering example (see the sketch below):
- Build term vectors: rows in the inverted index.
- Cluster by term-term similarity.
- Premise: terms are related if they often appear in the same document (term-term co-occurrence).
- Problems:
  - A very frequent term will co-occur with everything.
  - Very general terms will co-occur with other general terms.
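A small sketch of the premise: with an inverted index mapping terms to the documents containing them, term-term association can be scored by document co-occurrence (Dice coefficient here; the toy index is an assumption):

```python
# Toy inverted index: term -> set of document ids containing the term.
inverted_index = {
    "bank":      {1, 2, 3, 5},
    "financial": {1, 2, 5},
    "river":     {3, 4},
    "money":     {1, 2},
}

def term_sim(t1, t2, index):
    """Dice coefficient over the terms' document sets (co-occurrence)."""
    d1, d2 = index[t1], index[t2]
    return 2 * len(d1 & d2) / (len(d1) + len(d2))

print(term_sim("bank", "financial", inverted_index))  # 6/7 ~ 0.857
print(term_sim("bank", "river", inverted_index))      # 2/6 ~ 0.333
```

The frequent-term problem is visible in this formulation: a term present in nearly every document overlaps heavily with every other term, which is one reason tf*idf-style down-weighting is applied before clustering terms.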

