Information Organization: Clustering


Clustering

Unsupervised learning: similar items are grouped together into clusters.

Example objects and their attributes (t1: 1 = blue, t2: 1 = circle, t3: 1 = small):
A = (blue, circle, small)
B = (red, square, large)
C = (blue, square, large)
D = (red, circle, small)
E = (red, circle, small)
F = (blue, circle, small)

[Slide figure: the six objects A-F drawn as shapes, together with their object-attribute table over t1-t3.]

Clustering: Procedure

1. Construct an Object-Attribute array
   - array element = presence/absence, occurrence frequency, or a measure of importance (e.g. tf*idf, term relevance weight)
2. Compute a measure of association between objects
   - e.g. Dice, Jaccard, or Cosine similarity
3. Construct an Object-Object Association array
   - array element = similarity measure
4. Identify the pairwise associations above a given threshold
   - two objects are related if the strength of association (e.g. similarity) is greater than or equal to the threshold value
5. Apply a clustering criterion (rules for combining related objects), as sketched in the code below
   - Single Link criterion: an object is related to at least one member of an existing cluster, i.e. its similarity to the closest element in the cluster is above the threshold
   - Complete Link criterion: an object is related to all members of an existing cluster, i.e. its similarity to the farthest element in the cluster is above the threshold
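The following is a minimal Python sketch of this procedure for the toy objects from the earlier slide, assuming the Dice coefficient as the association measure and a union-find pass to realize the single link criterion; the function names and the 0.6 threshold are illustrative and not part of the original slides.

```python
from itertools import combinations

def dice(a, b):
    """Dice coefficient between two binary attribute vectors."""
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(map(bool, a)) + sum(map(bool, b))
    return 2 * overlap / total if total else 0.0

def single_link_clusters(objects, threshold):
    """Single link clustering over a dict {object id -> binary attribute vector}."""
    # Steps 2-3: object-object association array, kept here as a dict of pairs
    sims = {(i, j): dice(objects[i], objects[j])
            for i, j in combinations(objects, 2)}

    # Steps 4-5: link every pair at or above the threshold; under the single
    # link criterion the clusters are the connected components of that graph.
    parent = {i: i for i in objects}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), s in sims.items():
        if s >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in objects:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

# Objects from the introductory slide: t1 = blue, t2 = circle, t3 = small
objects = {
    "A": (1, 1, 1), "B": (0, 0, 0), "C": (1, 0, 0),
    "D": (0, 1, 1), "E": (0, 1, 1), "F": (1, 1, 1),
}
print(single_link_clusters(objects, threshold=0.6))
```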

Clustering: Example

[Slide tables: an Object-Attribute array (terms t1-t6 by documents D1-D6, entries are occurrence counts) and the resulting Object-Object Association array, whose pairwise values include 2/6, 4/6, 4/5, 2/5, 4/8, 6/8, 4/7, 2/7; the full layouts are not recoverable from this transcript.]

Threshold = 0.6
Single Link Clusters: C1 = (D1, D3, D5), C2 = (D2, D4)
Complete Link Clusters: C1 = (D1, D3) or (D1, D5), C2 = (D2, D4)

Clustering: Problem

[Slide tables: a Document-Document similarity matrix over D1-D7, shown once per threshold; the visible entries are 0.40, 0.33, 0.57, 0.10, 0.50, 0.67, 0.29, 0.11, 0.25, 0.00, but the full matrix layout is not recoverable from this transcript.]

Threshold = 0.67
Single & Complete Link: C1 = (D2, D7)

Threshold = 0.57
Single Link: C1 = (D1, D4, D6), C2 = (D2, D7)
Complete Link: C1 = (D1, D4) or (D1, D6), C2 = (D2, D7)

Threshold = 0.5
Single Link: C1 = (D1, D2, D3, D4, D6, D7)
Complete Link: C1 = (D1, D4) or (D1, D6) or (D1, D7), C2 = (D2, D7) or (D3, D7)

The same similarity matrix yields different clusters depending on the threshold and on whether the single or complete link criterion is used.
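To make the threshold sensitivity concrete, here is a small sketch (reusing the union-find idea from the earlier example) that computes single link clusters directly from a pairwise similarity table at several thresholds. The document ids and similarity values below are hypothetical stand-ins, since the slide's full matrix is not recoverable.

```python
def single_link_from_sims(ids, sims, threshold):
    """Single link clusters = connected components of the 'similarity >= threshold' graph.

    ids: list of document ids; sims: dict {(i, j): similarity}.
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), s in sims.items():
        if s >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

# Hypothetical documents and similarities (illustrative values only).
docs = ["D1", "D2", "D3", "D4"]
sims = {("D1", "D2"): 0.40, ("D1", "D3"): 0.33, ("D1", "D4"): 0.57,
        ("D2", "D3"): 0.29, ("D2", "D4"): 0.11, ("D3", "D4"): 0.25}

for t in (0.67, 0.57, 0.5):
    print(t, single_link_from_sims(docs, sims, t))
```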

Clustering Algorithms: Types

Hierarchical vs. Flat
- Hierarchical: induce a hierarchy of clusters of decreasing generality (less efficient than flat)
- Flat (non-hierarchical): all clusters are at the same level, with no hierarchy among them

Hard vs. Soft
- Hard: assign each item to a cluster (a binary membership decision)
- Soft: assign each item a probability of belonging to each cluster

Disjunctive vs. Non-disjunctive
- Disjunctive: an item must belong to only one cluster
- Non-disjunctive: an item can be part of more than one cluster

Iterative
1. Start with an initial set of clusters
2. Reassign items to improve the clusters
3. Repeat step 2 until convergence

Linkage vs. Non-linkage
- Linkage: link similar items together to identify clusters
- Non-linkage: start with clusters, assign items to clusters, evaluate, and reassign items

Clustering: Examples

[Slide figures: four example clusterings of documents D1-D4 into clusters C1-C3, labelled (1) hard, non-hierarchical, disjunctive; (2) hard, non-hierarchical, non-disjunctive; (3) soft, non-hierarchical, non-disjunctive, shown as a table of membership probabilities over C1-C3 (e.g. D1 = 0.123, 0.543, 0.231; D4 = 0.444, 0.277, 0.435); and (4) hard, hierarchical, disjunctive. The full tables are not recoverable from this transcript.]

Clustering: Hierarchical vs. Non-Hierarchical

Agglomerative (bottom-up): start at the bottom and repeatedly merge a pair of clusters into a single cluster.

Algorithm (see the sketch below):
1. Create a doc-doc similarity matrix; initially, each document is its own cluster
2. Combine the two most similar clusters into one
3. Update the doc-doc similarity matrix
4. Go to step 2 and repeat until only one cluster is left

Linkage methods:
- Single-linkage: proximity of the closest elements in the two clusters (maximum similarity)
- Complete-linkage: proximity of the most distant elements (minimum similarity)
- Mean: proximity of the cluster means (centroids)

Divisive (top-down): start at the top and split one cluster into two new clusters, choosing the split that produces the two new clusters with the largest dissimilarity.

Non-Hierarchical: find the best grouping of items into k clusters, e.g. k-means clustering.
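A minimal sketch of the agglomerative loop above, assuming the input is a pairwise similarity table and supporting the single and complete link methods; the agglomerate function name and the example similarities are illustrative only.

```python
def agglomerate(sim, linkage="single"):
    """Naive bottom-up clustering over a pairwise similarity dict.

    sim: dict {(i, j): similarity} for every unordered pair of items.
    Returns the sequence of merges as (cluster_a, cluster_b, similarity).
    """
    def pair_sim(a, b):
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def cluster_sim(ca, cb):
        vals = [pair_sim(a, b) for a in ca for b in cb]
        # single link = closest pair (max similarity),
        # complete link = farthest pair (min similarity)
        return max(vals) if linkage == "single" else min(vals)

    clusters = [frozenset([i]) for i in {x for pair in sim for x in pair}]
    merges = []
    while len(clusters) > 1:
        # find the two most similar clusters and merge them into one
        ca, cb = max(((a, b) for idx, a in enumerate(clusters)
                      for b in clusters[idx + 1:]),
                     key=lambda ab: cluster_sim(*ab))
        merges.append((set(ca), set(cb), cluster_sim(ca, cb)))
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca | cb]
    return merges

# Hypothetical similarities for three documents
sim = {("D1", "D2"): 0.9, ("D1", "D3"): 0.2, ("D2", "D3"): 0.4}
for a, b, s in agglomerate(sim, linkage="single"):
    print(a, "+", b, "merged at similarity", s)
```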

k-means clustering

Features: iterative, hard, flat (non-hierarchical), non-linkage. The n items are assigned to k clusters so that the average distance to the cluster mean is minimized; uses Euclidean distance.

Algorithm (see the sketch below):
1. Select k (the number of clusters)
2. Select k initial cluster centers c1, ..., ck
   a. randomly assign each item to a cluster
   b. calculate the centroid of each cluster
3. For each item
   a. calculate the distance to each cluster centroid
   b. assign the item to the closest cluster
4. Go to step 2.b and repeat until convergence

Issues:
- must select the number of clusters k
- must initialize the clusters, e.g. by random assignment of items or by random selection of k documents as initial centers
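A minimal NumPy sketch of the algorithm above, assuming random assignment for initialization and Euclidean distance; the kmeans function name, the empty-cluster fallback, and the toy data are illustrative, not part of the original slides.

```python
import numpy as np

def kmeans(items, k, iters=100, seed=0):
    """Plain k-means: items is an (n, d) array; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2.a: initialize by randomly assigning each item to one of k clusters
    assign = rng.integers(0, k, size=len(items))
    for _ in range(iters):
        # Step 2.b: centroid of each cluster (fall back to a random item if empty)
        centroids = np.array([items[assign == c].mean(axis=0)
                              if np.any(assign == c)
                              else items[rng.integers(len(items))]
                              for c in range(k)])
        # Step 3: Euclidean distance of every item to every centroid,
        # then reassign each item to its closest centroid
        dists = np.linalg.norm(items[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # Step 4: converged
            break
        assign = new_assign
    return assign, centroids

# Tiny hypothetical example: two obvious groups in 2-D
items = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
print(kmeans(items, k=2))
```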

Automatic Thesaurus

Typical use:

Query Refinement
- heteronym: words that are spelled the same way but differ in pronunciation (e.g. bow)
- homonym: words that are pronounced or spelled the same way but have distinctly different meanings
  - homograph: spelled the same but differing in meaning (e.g. fair, bank); different concepts
  - homophone: pronounced the same but differing in meaning (e.g. bare and bear)
- polyseme: a word with multiple related meanings (e.g. mole, branch, bank); similar concepts

Query Expansion
- synonym: e.g. bank -> financial institution
- hypernym: e.g. car -> vehicle
- hyponym: e.g. car -> SUV, van, sedan

Term Clustering Example (see the sketch below)
- Build term vectors: the rows of the inverted index
- Cluster by term-term similarity
- Premise: terms are related if they often appear in the same document (term-term co-occurrence)
- Problems: a very frequent term will co-occur with everything; very general terms will co-occur with other general terms
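A small sketch of the term clustering idea, assuming a toy inverted index and cosine similarity over its rows; the terms, documents, and names below are hypothetical. The resulting term-term similarities could then be fed to a clustering criterion such as the single link routine sketched earlier.

```python
from math import sqrt

# Toy inverted index: term -> set of documents containing it (hypothetical data).
inverted_index = {
    "bank":    {"d1", "d2", "d3"},
    "finance": {"d1", "d2"},
    "loan":    {"d2", "d3"},
    "river":   {"d4"},
}

def cosine(docs_a, docs_b):
    """Cosine similarity of two binary term vectors (rows of the inverted index)."""
    overlap = len(docs_a & docs_b)
    return overlap / sqrt(len(docs_a) * len(docs_b)) if docs_a and docs_b else 0.0

# Term-term co-occurrence similarities; related terms share many documents.
terms = sorted(inverted_index)
for i, t1 in enumerate(terms):
    for t2 in terms[i + 1:]:
        print(t1, t2, round(cosine(inverted_index[t1], inverted_index[t2]), 2))
```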