
Lecture 5: Similarity and Clustering (Chap 4, Chakrabarti) Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/10/21

Similarity and Clustering

Motivation
Problem 1: A query word could be ambiguous.
–E.g., the query "Star" retrieves documents about astronomy, plants, animals, etc.
–Solution: Visualisation – cluster the document responses to a query along lines of different topics.
Problem 2: Manual construction of topic hierarchies and taxonomies.
–Solution: Preliminary clustering of large samples of web documents.
Problem 3: Speeding up similarity search.
–Solution: Restrict the search for documents similar to a query to the most representative cluster(s).

Example
Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)

Example: Concept Clustering

Task: Evolve measures of similarity to cluster a collection of documents/terms into groups, so that similarity within a cluster is larger than similarity across clusters.
Cluster Hypothesis: Given a 'suitable' clustering of a collection, if the user is interested in a document/term d/t, he is likely to be interested in the other members of the cluster to which d/t belongs.
Similarity measures
–Represent documents by TFIDF vectors
–Distance between document vectors
–Cosine of the angle between document vectors
Issues
–Large number of noisy dimensions
–Notion of noise is application dependent
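A minimal sketch of the TFIDF-vector and cosine-similarity representation described above. The use of scikit-learn and the toy documents are assumptions for illustration; the lecture does not prescribe a particular library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "star astronomy telescope galaxy",
    "star plant garden flower",
    "galaxy telescope observation astronomy",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # documents as TFIDF vectors

# Cosine of the angle between document vectors: values near 1 mean similar.
print(cosine_similarity(X))
```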

Clustering (cont.)
Collaborative filtering: clustering of two or more kinds of objects that have a bipartite relationship
Two important paradigms:
–Bottom-up agglomerative clustering
–Top-down partitioning
Visualisation techniques: embedding of the corpus in a low-dimensional space
Characterising the entities:
–Internally: vector space model, probabilistic models
–Externally: measure of similarity/dissimilarity between pairs
Learning: supplement stock algorithms with experience with the data

Clustering: Parameters
Similarity measure:
–Cosine similarity: s(d1, d2) = (d1 · d2) / (|d1| |d2|)
Distance measure:
–Euclidean distance: |d1 − d2|
Number "k" of clusters
Issues
–Large number of noisy dimensions
–Notion of noise is application dependent

Clustering: Formal specification
Partitioning Approaches
–Bottom-up clustering
–Top-down clustering
Geometric Embedding Approaches
–Self-organization map
–Multidimensional scaling
–Latent semantic indexing
Generative models and probabilistic approaches
–Single topic per document
–Documents correspond to mixtures of multiple topics

Partitioning Approaches
Partition the document collection into k clusters {D1, D2, ..., Dk}
Choices:
–Minimize intra-cluster distance: Σ_i Σ_{d1,d2 in Di} δ(d1, d2)
–Maximize intra-cluster semblance: Σ_i Σ_{d1,d2 in Di} s(d1, d2)
If cluster representations μ_i are available
–Minimize Σ_i Σ_{d in Di} δ(d, μ_i)
–Maximize Σ_i Σ_{d in Di} s(d, μ_i)
Soft clustering
–d assigned to D_i with 'confidence' z_{d,i}
–Find the z_{d,i} so as to minimize Σ_i Σ_d z_{d,i} δ(d, μ_i) or maximize Σ_i Σ_d z_{d,i} s(d, μ_i)
Two ways to get partitions: bottom-up clustering and top-down clustering

Bottom-up clustering (HAC)
HAC: Hierarchical Agglomerative Clustering
Initially G is a collection of singleton groups, each with one document
Repeat
–Find Γ, Δ in G with the maximum similarity measure s(Γ ∪ Δ)
–Merge group Γ with group Δ
For each Γ keep track of its best Δ
Use the above information to plot the hierarchical merging process (DENDROGRAM)
To get the desired number of clusters: cut across any level of the dendrogram
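A small sketch of bottom-up (agglomerative) clustering using SciPy; the random "documents" and the choice of cosine distance with group-average linkage are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.random((20, 5))                             # 20 "documents", 5 features

Z = linkage(X, method="average", metric="cosine")   # group-average HAC

# Cut the dendrogram to obtain a desired number of clusters, e.g. k = 3.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) would plot the hierarchical merging process with matplotlib.
```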

Dendrogram
A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure
Typically s(Γ ∪ Δ) decreases with an increasing number of merges
Self-Similarity
–Average pairwise similarity between documents in Γ: s(Γ) = (1 / (|Γ| choose 2)) Σ_{d1,d2 in Γ} s(d1, d2)
–s(d1, d2) = inter-document similarity measure (say, cosine of TFIDF vectors)
–Other criteria: maximum/minimum pairwise similarity between documents in the clusters

Computation
Un-normalized group profile vector: p̂(Γ) = Σ_{d in Γ} p(d)
Self-similarity can be computed from it: s(Γ) = (p̂(Γ) · p̂(Γ) − |Γ|) / (|Γ| (|Γ| − 1)), and p̂(Γ ∪ Δ) = p̂(Γ) + p̂(Δ)
Can show: an O(n² log n) algorithm with O(n²) space

Similarity
Normalized document profile: p(d) = v(d) / |v(d)|, where v(d) is the TFIDF vector of d
Profile for document group Γ: p(Γ) = Σ_{d in Γ} p(d) / |Σ_{d in Γ} p(d)|

Switch to top-down
Bottom-up
–Requires quadratic time and space
Top-down or move-to-nearest
–Internal representation for documents as well as clusters
–Partition documents into 'k' clusters
–Two variants: "hard" (0/1) assignment of documents to clusters, and "soft" assignment, where documents belong to clusters with fractional scores
–Termination: when the assignment of documents to clusters ceases to change much, or when cluster centroids move negligibly over successive iterations

Top-down clustering
Hard k-means: initially choose k arbitrary 'centroids', then repeat
–Assign each document to the nearest centroid
–Recompute centroids
Soft k-means:
–Don't break close ties between document assignments to clusters
–Don't make documents contribute to a single cluster which wins narrowly
–The contribution from document d for updating cluster centroid μc is related to the current similarity between μc and d
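A minimal sketch of the hard k-means loop just described (arbitrary initial centroids, assign-to-nearest, recompute); the random data, k, and iteration count are assumptions.

```python
import numpy as np

def hard_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary init
    for _ in range(iters):
        # Assign each document to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned documents.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = np.random.default_rng(1).random((30, 4))
labels, centroids = hard_kmeans(X, k=3)
print(labels)
```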

Combining Approach: Seeding 'k' clusters
Randomly sample O(√(kn)) documents
Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn log n) time
Top-down clustering: iterate assign-to-nearest O(1) times (see the sketch below)
–Move each document to the nearest cluster
–Recompute cluster centroids: total time taken is O(kn)
Overall time: O(kn log n)
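A sketch of the seeding idea above: HAC on a random sample supplies the k initial centroids for a few k-means passes over the full collection. The sample size, data, and use of SciPy/scikit-learn are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 8))
k = 5

# Bottom-up group-average clustering on a small random sample.
sample = X[rng.choice(len(X), size=int(np.sqrt(k * len(X))), replace=False)]
Z = linkage(sample, method="average")
sample_labels = fcluster(Z, t=k, criterion="maxclust")
seeds = np.array([sample[sample_labels == c].mean(axis=0) for c in range(1, k + 1)])

# Top-down pass: a few assign-to-nearest / recompute iterations from the seeds.
km = KMeans(n_clusters=k, init=seeds, n_init=1, max_iter=5).fit(X)
print(km.labels_[:20])
```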

Choosing 'k'
Mostly problem driven
Could be 'data driven' only when either
–Data is not sparse
–Measurement dimensions are not too noisy
Interactive
–The data analyst interprets the results of structure discovery

Choosing 'k': Approaches
Hypothesis testing:
–Null hypothesis (H0): the underlying density is a mixture of 'k' distributions
–Requires regularity conditions on the mixture likelihood function (Smith '85)
Bayesian estimation:
–Estimate the posterior distribution on k, given the data and a prior on k
–Difficulty: computational complexity of the integration
–The AutoClass algorithm (Cheeseman '98) uses approximations
–(Diebolt '94) suggests sampling techniques

Choosing 'k': Approaches (cont.)
Penalised likelihood
–Accounts for the fact that L_k(D) is a non-decreasing function of k
–Penalise the number of parameters
–Examples: Bayesian Information Criterion (BIC), Minimum Description Length (MDL), MML
–Assumption: penalised criteria are asymptotically optimal (Titterington 1985)
Cross-validation likelihood
–Find the ML estimate on part of the training data
–Choose the k that maximises the average of the M cross-validated likelihoods on held-out data D_test
–Cross-validation techniques: Monte Carlo Cross-Validation (MCCV), v-fold cross-validation (vCV)
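A hedged sketch of the penalised-likelihood idea: score candidate values of k with BIC. Using a Gaussian mixture from scikit-learn and synthetic blob data is an assumption; the lecture's argument applies to mixtures in general.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from 3 well-separated blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

bic_scores = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores[k] = gm.bic(X)          # lower BIC = better penalised fit

best_k = min(bic_scores, key=bic_scores.get)
print(bic_scores, "-> chosen k =", best_k)
```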

Visualisation techniques
Goal: embedding of the corpus in a low-dimensional space
Hierarchical Agglomerative Clustering (HAC)
–Lends itself easily to visualisation
Self-Organization Map (SOM)
–A close cousin of k-means
Multidimensional Scaling (MDS)
–Minimize the distortion of inter-point distances in the low-dimensional embedding as compared to the dissimilarity given in the input data
Latent Semantic Indexing (LSI)
–Linear transformations to reduce the number of dimensions

Self-Organization Map (SOM)
Like soft k-means
–Determine the association between clusters and documents
–Associate a representative vector μc with each cluster c and iteratively refine it
Unlike k-means
–Embed the clusters in a low-dimensional space right from the beginning
–A large number of clusters can be initialised even if eventually many are to remain devoid of documents
Each cluster can be a slot in a square/hexagonal grid; the grid structure defines the neighborhood N(c) for each cluster c
Also involves a proximity function h(γ, c) between clusters γ and c

SOM: Update Rule
Like a neural network
–Data item d activates the neuron γd (closest cluster) as well as the neighborhood neurons
–E.g., Gaussian neighborhood function: h(γd, c) = exp(−||γd − c||² / (2σ²(t)))
–Update rule for node c under the influence of d: μc(t+1) = μc(t) + η(t) h(γd, c) (d − μc(t))
–Where σ(t) is the neighborhood width and η(t) is the learning rate parameter
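A compact sketch of the SOM update rule above on a square grid. The grid size, decay schedules, and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 6, 6, 4
weights = rng.random((grid_w, grid_h, dim))          # one vector per grid slot
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1).astype(float)

X = rng.random((200, dim))                           # "documents"

for t, d in enumerate(X):
    eta = 0.5 * (1 - t / len(X))                     # decaying learning rate
    sigma = 2.0 * (1 - t / len(X)) + 0.5             # decaying neighborhood width
    # Winner: the grid node whose weight vector is closest to d.
    dists = np.linalg.norm(weights - d, axis=2)
    win = np.unravel_index(dists.argmin(), dists.shape)
    # Gaussian neighborhood around the winner, on grid coordinates.
    h = np.exp(-np.sum((coords - coords[win]) ** 2, axis=2) / (2 * sigma ** 2))
    # Update every node toward d, weighted by the neighborhood function.
    weights += eta * h[..., None] * (d - weights)
```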

SOM: Example I
A SOM computed from over a million documents taken from 80 Usenet newsgroups. Light areas have a high density of documents.

SOM: Example II
Another example of a SOM at work: the sites listed in the Open Directory have been organized within a map of Antarctica.

Multidimensional Scaling (MDS)
Goal
–"Distance preserving" low-dimensional embedding of documents
Symmetric inter-document distances d_ij
–Given a priori or computed from the internal representation
Coarse-grained user feedback
–The user provides the similarity between documents i and j
–With increasing feedback, prior distances are overridden
Objective: minimize the stress of the embedding,
  stress = Σ_{i,j} (d̂_ij − d_ij)² / Σ_{i,j} d_ij²
where d̂_ij is the distance between i and j in the low-dimensional embedding

MDS: issues
Stress is not easy to optimize
Iterative hill climbing
1. Points (documents) are assigned random coordinates by an external heuristic
2. Points are moved by a small distance in the direction of locally decreasing stress
For n documents
–Each point takes O(n) time to be moved
–Totally O(n²) time per relaxation
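A brief sketch of stress-minimizing MDS using scikit-learn's implementation. Note it uses SMACOF rather than the hill climbing sketched above; treating the two as interchangeable here, and the random "documents", are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
docs = rng.random((40, 20))                    # 40 "documents" in 20-d

D = pairwise_distances(docs, metric="cosine")  # symmetric inter-document distances

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                  # 2-d "distance preserving" embedding
print(coords.shape, "final stress:", mds.stress_)
```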

FastMap [Faloutsos '95]
No internal representation of the documents is available
Goal
–Find a projection from an 'n'-dimensional space to a space with a smaller number 'k' of dimensions
Iterative projection of the documents along lines of maximum spread
Each 1-D projection preserves distance information

Best line
Pivots for a line: two points (a and b) that determine it
Avoid exhaustive checking by picking pivots that are far apart
First coordinate of a point x on the "best line" (with pivot a as the origin):
  x1 = (d(a,x)² + d(a,b)² − d(b,x)²) / (2 d(a,b))

Iterative projection
For i = 1 to k:
1. Find the next (i-th) "best" line
   –A "best" line is one which gives the maximum variance of the point set in the direction of the line
2. Project the points on the line
3. Project the points on the "hyperspace" orthogonal to the above line

Projection
Purpose
–To correct the inter-point distances by taking into account the components already accounted for by the first pivot line: d'(x,y)² = d(x,y)² − (x1 − y1)²
Project recursively, up to a 1-D space
Time: O(nk)
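A sketch of FastMap under the assumptions above: only a pairwise distance matrix is available and we want k coordinates per object. The pivot-picking heuristic and random data are assumptions.

```python
import numpy as np

def fastmap(D, k):
    """D: (n, n) symmetric distance matrix; returns (n, k) coordinates."""
    n = len(D)
    D2 = D.astype(float) ** 2                  # work with squared distances
    coords = np.zeros((n, k))
    for i in range(k):
        # Heuristic pivots: start anywhere, then take the farthest point twice.
        a = 0
        b = D2[a].argmax()
        a = D2[b].argmax()
        if D2[a, b] == 0:                      # all remaining distances are zero
            break
        dab2 = D2[a, b]
        # First coordinate of every point x on the pivot line (a as origin).
        x = (D2[a] + dab2 - D2[b]) / (2 * np.sqrt(dab2))
        coords[:, i] = x
        # Correct distances for the component along the pivot line.
        D2 = D2 - (x[:, None] - x[None, :]) ** 2
        D2 = np.maximum(D2, 0)                 # guard against numerical negatives
    return coords

rng = np.random.default_rng(0)
pts = rng.random((30, 10))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
print(fastmap(D, k=2)[:5])
```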

Issues
Detecting noise dimensions
–Bottom-up dimension composition is too slow
–The definition of noise depends on the application
Running time
–Distance computation dominates
–Random projections
–Sublinear time without losing small clusters
Integrating semi-structured information
–Hyperlinks and tags embed similarity clues
–A link is worth how many words?

Expectation Maximization (EM)
–Pick k arbitrary 'distributions'
–Repeat:
  Find the probability that document d is generated from distribution f, for all d and f
  Estimate the distribution parameters from the weighted contributions of the documents

Extended similarity
Example: "Where can I fix my scooter?" vs. "A great garage to repair your 2-wheeler is at ..."
"auto" and "car" co-occur often, so documents having related words are related
Useful for search and clustering
Two basic approaches
–Hand-made thesaurus (WordNet)
–Co-occurrence and associations (documents in which "car" and "auto" appear together suggest car ≈ auto)

Latent semantic indexing
The term-document matrix A (terms t × documents d) is factored by SVD as A = U D V^T, where U is t × r, D is the r × r diagonal matrix of singular values, and V is d × r
Keeping only the top k singular values maps each document to a k-dim vector; related terms such as "car" and "auto" end up close in this space

SVD: Singular Value Decomposition
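A minimal LSI sketch: truncated SVD of a tiny term-document matrix. The toy counts and terms are assumptions; rows are terms, columns are documents.

```python
import numpy as np

terms = ["car", "auto", "garage", "star", "galaxy"]
#            d1 d2 d3 d4
A = np.array([[2, 1, 0, 0],    # car
              [1, 2, 0, 0],    # auto
              [1, 1, 0, 0],    # garage
              [0, 0, 2, 1],    # star
              [0, 0, 1, 2]])   # galaxy

U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)

k = 2                                         # keep the top-k singular values
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T        # each document as a k-dim vector
term_vecs = U[:, :k] * s[:k]                  # each term as a k-dim vector

print(doc_vecs)
print(dict(zip(terms, term_vecs.round(2).tolist())))
```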

Probabilistic Approaches to Clustering
There is no need for IDF to determine the importance of a term; the model captures the notion of stopwords vs. content-bearing words
There is no need to define distances and similarities between entities
The assignment of entities to clusters need not be "hard"; it is probabilistic

Generative Distributions for Documents
Patterns (documents, images, audio) are generated by random processes that follow specific distributions
Assumption: term occurrences are independent events
Given Φ (the parameter set, one probability φ_t per term), the probability of generating document d as a set of terms is
  Pr(d|Φ) = Π_{t in d} φ_t × Π_{t not in d} (1 − φ_t)
W is the vocabulary; thus there are 2^|W| possible documents

Generative Distributions for Documents (cont.)
Model term counts: multinomial distribution
Given Φ (the parameter set, with term probabilities θ_t)
–l_d: document length
–n(d,t): number of times term t appears in document d
–Σ_t n(d,t) = l_d
The document event d comprises l_d and the set of counts {n(d,t)}
Probability of d:
  Pr(d|Φ) = Pr(l_d) × (l_d! / Π_t n(d,t)!) × Π_t θ_t^{n(d,t)}
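A hedged sketch of the multinomial document model above: log Pr(d|Φ) for a toy vocabulary and parameter set (both are assumptions; the document-length prior Pr(l_d) is assumed uniform and dropped).

```python
import math
from collections import Counter

theta = {"car": 0.4, "auto": 0.3, "star": 0.2, "galaxy": 0.1}   # Φ: term probs

def log_prob_multinomial(doc_tokens, theta):
    counts = Counter(doc_tokens)            # n(d, t)
    l_d = sum(counts.values())              # document length
    # Multinomial coefficient l_d! / prod_t n(d,t)!  (Pr(l_d) itself ignored).
    logp = math.lgamma(l_d + 1) - sum(math.lgamma(n + 1) for n in counts.values())
    logp += sum(n * math.log(theta[t]) for t, n in counts.items())
    return logp

print(log_prob_multinomial(["car", "auto", "car", "galaxy"], theta))
```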

Mixture Models & Expectation Maximization (EM)
Estimate the Web: Φ_web
Probability of a Web page d: Pr(d | Φ_web)
Φ_web = {Φ_arts, Φ_science, Φ_politics, ...}
Probability of d belonging to topic y: Pr(d | Φ_y)

Mixture Model
Given observations X = {x_1, x_2, ..., x_n}
Find Φ to maximize the likelihood L(Φ|X) = Pr(X|Φ) = Π_i Σ_y α_y Pr(x_i | Φ_y)
Challenge: the component memberships are unknown (hidden) data Y = {y_i}

Expectation Maximization (EM) algorithm
Classic approach to solving the problem
–Maximize L(Φ|X,Y) = Pr(X,Y|Φ)
Expectation step: starting from an initial guess Φ^g, compute the expected complete-data log-likelihood
  Q(Φ, Φ^g) = E_{Y|X,Φ^g} [ log Pr(X,Y|Φ) ]
(equivalently, compute Pr(y_i | x_i, Φ^g) for every observation)

Expectation Maximization (EM) algorithm (cont.)
Maximization step: choose the new Φ to maximize Q(Φ, Φ^g)
–Lagrangian optimization with a Lagrange multiplier to enforce the condition Σ_y α_y = 1 on the mixing proportions

The EM algorithm
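A sketch of EM for a mixture of term multinomials (one Φ_y per topic), matching the E-step and M-step above; the toy count matrix, number of topics, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(50, 6)).astype(float)   # n(d, t): 50 docs, 6 terms
n_docs, n_terms = X.shape
k = 2

alpha = np.full(k, 1.0 / k)                 # mixing proportions, sum to 1
theta = rng.dirichlet(np.ones(n_terms), k)  # term distributions, one per topic

for _ in range(30):
    # E-step: responsibilities Pr(y = c | d, current parameters).
    log_r = np.log(alpha) + X @ np.log(theta).T          # (n_docs, k)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate alpha and theta from weighted document contributions.
    alpha = r.mean(axis=0)                               # satisfies sum = 1
    theta = (r.T @ X) + 1e-9
    theta /= theta.sum(axis=1, keepdims=True)

print("mixing proportions:", alpha.round(3))
```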

Multiple Cause Mixture Model (MCMM)
Soft disjunction:
–c: topics or clusters
–a_{d,c}: activation of document d for cluster c
–γ_{c,t}: normalized measure of causation of term t by cluster c
–Belief that term t appears in d: 1 − Π_c (1 − a_{d,c} γ_{c,t})
–g(d): goodness of the beliefs for document d under the binary model
For the document collection {d}, the aggregate goodness is Σ_d g(d)
Fix γ_{c,t} and improve a_{d,c}; then fix a_{d,c} and improve γ_{c,t}; i.e., find a local maximum of the aggregate goodness by iterating

Aspect Model
Generative model for multi-topic documents [Hofmann]
–Induce cluster (topic) probability Pr(c)
EM-like procedure to estimate the parameters Pr(c), Pr(d|c), Pr(t|c)
–E-step: Pr(c|d,t) ∝ Pr(c) Pr(d|c) Pr(t|c)
–M-step: Pr(t|c) ∝ Σ_d n(d,t) Pr(c|d,t), Pr(d|c) ∝ Σ_t n(d,t) Pr(c|d,t), Pr(c) ∝ Σ_d Σ_t n(d,t) Pr(c|d,t)
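A compact sketch of the aspect-model (PLSA) EM updates just listed; the toy count matrix, number of topics, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 3, size=(20, 8)).astype(float)   # n(d, t)
n_docs, n_terms = N.shape
k = 3

Pc = np.full(k, 1.0 / k)                              # Pr(c)
Pd_c = rng.dirichlet(np.ones(n_docs), k)              # Pr(d|c), shape (k, n_docs)
Pt_c = rng.dirichlet(np.ones(n_terms), k)             # Pr(t|c), shape (k, n_terms)

for _ in range(50):
    # E-step: Pr(c | d, t) proportional to Pr(c) Pr(d|c) Pr(t|c).
    joint = Pc[:, None, None] * Pd_c[:, :, None] * Pt_c[:, None, :]   # (k, d, t)
    post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
    # M-step: re-estimate parameters from expected counts n(d,t) Pr(c|d,t).
    exp_counts = post * N[None, :, :]                                  # (k, d, t)
    Pt_c = exp_counts.sum(axis=1); Pt_c /= Pt_c.sum(axis=1, keepdims=True)
    Pd_c = exp_counts.sum(axis=2); Pd_c /= Pd_c.sum(axis=1, keepdims=True)
    Pc = exp_counts.sum(axis=(1, 2)); Pc /= Pc.sum()

print("Pr(c):", Pc.round(3))
```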

Aspect Model (cont.)
Documents and queries are "folded in" to the clusters

Aspect Model (cont.)
Similarity between documents and queries

Collaborative recommendation
People = records, movies = features
People and features are both to be clustered
–Mutual reinforcement of similarity
Need advanced models
(From "Clustering methods in collaborative filtering", by Ungar and Foster)

A model for collaboration
People and movies belong to unknown classes
–P_k = probability that a random person is in class k
–P_l = probability that a random movie is in class l
–P_kl = probability of a class-k person liking a class-l movie
Gibbs sampling: iterate
–Pick a person or a movie at random and assign it to a class with probability proportional to P_k or P_l
–Estimate the new parameters
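A heavily simplified sketch of the person/movie co-clustering model above, with a Gibbs-style resampling loop. The binary "likes" matrix, the class counts, the exact conditional used when reassigning, and resampling only people (not movies) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(30, 20))        # R[i, j] = 1 if person i likes movie j
K, L = 3, 2                                  # number of person / movie classes

pc = rng.integers(0, K, size=R.shape[0])     # person class assignments
mc = rng.integers(0, L, size=R.shape[1])     # movie class assignments

def estimate():
    Pk = np.bincount(pc, minlength=K) / len(pc)
    Pl = np.bincount(mc, minlength=L) / len(mc)
    Pkl = np.array([[R[pc == k][:, mc == l].mean()
                     if np.any(pc == k) and np.any(mc == l) else 0.5
                     for l in range(L)] for k in range(K)])
    return Pk, Pl, np.clip(Pkl, 0.01, 0.99)

for _ in range(200):
    Pk, Pl, Pkl = estimate()
    i = rng.integers(R.shape[0])             # pick a person at random
    likes = R[i]                             # their row of the "likes" matrix
    # Class weights: prior P_k times likelihood of the observed likes.
    logp = np.log(Pk) + np.array([
        np.sum(likes * np.log(Pkl[k, mc]) + (1 - likes) * np.log(1 - Pkl[k, mc]))
        for k in range(K)])
    p = np.exp(logp - logp.max()); p /= p.sum()
    pc[i] = rng.choice(K, p=p)               # reassign person i

print("person class sizes:", np.bincount(pc, minlength=K))
```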

Hyperlinks as similarity indicators