
1 Information Retrieval Search Engine Technology (8)
http://tangra.si.umich.edu/clair/ir09
Prof. Dragomir R. Radev, radev@umich.edu

2 SET/IR – W/S 2009 … 13. Clustering …

3

4

5 Clustering
– Exclusive/overlapping clusters
– Hierarchical/flat clusters
– The cluster hypothesis: documents in the same cluster are relevant to the same query. How do we use it in practice?

6 Representations for document clustering
Typically vector-based:
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.
Each document is represented as a vector in an n-dimensional space. Similar documents appear nearby in the vector space (distance measures are needed).
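
For illustration, a minimal Python sketch of a bag-of-words vector representation with cosine similarity as the distance measure (the example documents are made up):

    import math
    from collections import Counter

    def cosine(doc1, doc2):
        # Bag-of-words vectors as term-count dictionaries
        v1, v2 = Counter(doc1.split()), Counter(doc2.split())
        dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
        norm = math.sqrt(sum(c * c for c in v1.values())) * \
               math.sqrt(sum(c * c for c in v2.values()))
        return dot / norm if norm else 0.0

    print(cosine("the cat sat", "the cat ran"))  # ~0.67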

7 Scatter-gather
Introduced by Cutting, Karger, and Pedersen. Iterative process:
– Show terms for each cluster
– User picks some of them
– System produces new clusters
Example: http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html

8 k-means
Iteratively determine which cluster each point belongs to, then adjust the cluster centroid, then repeat. Needed: a small number k of desired clusters. Makes hard assignment decisions. Example: Weka

9 k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while
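
A direct Python transcription of this pseudocode, as a minimal sketch (random initialization and a fixed iteration budget in place of the "while further improvement" test; all names are ours):

    import random

    def dist(a, b):
        # squared Euclidean distance between two dense vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(docs, k, iters=20):
        # arbitrary initial centroids: k random documents
        centroids = [list(c) for c in random.sample(docs, k)]
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for d in docs:
                # assign d to the cluster whose centroid is closest
                c = min(range(k), key=lambda i: dist(d, centroids[i]))
                clusters[c].append(d)
            for i, members in enumerate(clusters):
                if members:
                    # recompute centroid as the mean of assigned documents
                    centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
        return clusters, centroids

    clusters, centroids = kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], k=2)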

10 K-means (cont’d)
In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting cluster centroids as seeds for k-means (the “Buckshot” approach of Cutting et al.).
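
A hedged sketch of this seeding strategy using SciPy's agglomerative clustering and scikit-learn's KMeans; it assumes N is large enough that the sqrt(N)-sized sample can actually yield k clusters:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    def buckshot(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # HAC on a random sample of size sqrt(N)
        idx = rng.choice(len(X), size=int(np.sqrt(len(X))), replace=False)
        sample = X[idx]
        labels = fcluster(linkage(sample, method='average'),
                          t=k, criterion='maxclust')
        # cluster centroids of the sample become the k-means seeds
        seeds = np.array([sample[labels == i].mean(axis=0)
                          for i in range(1, k + 1)])
        return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)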

11 Example
Cluster the following six vectors into two groups: A, B, C, D, E, F (the vector values appear on the slide and are not preserved in this transcript).

12 Weka
A general environment for machine learning (e.g., for classification and clustering). Book by Witten and Frank. www.cs.waikato.ac.nz/ml/weka

cd /data2/tools/weka-3-4-7
export CLASSPATH=$CLASSPATH:./weka.jar
java weka.clusterers.SimpleKMeans -t ~/e.arff
java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff
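
The contents of e.arff from the slide are not preserved; a minimal ARFF file of the kind SimpleKMeans accepts might look like this (relation, attribute names, and values are made up):

    @relation example
    @attribute x numeric
    @attribute y numeric
    @data
    1.0,0.0
    0.9,0.1
    0.0,1.0
    0.1,0.9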

13 Demos
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
http://www.cc.gatech.edu/~dellaert/html/software.html
http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
http://www.ece.neu.edu/groups/rpl/projects/kmeans/

14 Probability and likelihood
The likelihood of model parameters θ given observed data x is L(θ; x) = P(x | θ): the same quantity as the probability of the data, read as a function of the parameters. Example: what is the likelihood in this case? (The worked example appears on the slide and is not preserved in this transcript.)

15 Bayesian formulation
Posterior ∝ likelihood × prior, i.e., P(θ | x) ∝ P(x | θ) P(θ).
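
A tiny numeric illustration in Python (the two hypotheses and all the numbers are invented purely for illustration):

    # posterior ∝ likelihood × prior, then normalize
    priors      = {'h1': 0.7, 'h2': 0.3}
    likelihoods = {'h1': 0.2, 'h2': 0.6}   # P(data | h)
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnorm.values())
    posterior = {h: p / z for h, p in unnorm.items()}
    # posterior: h1 ≈ 0.44, h2 ≈ 0.56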

16 E-M algorithms [Dempster et al. 77] Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values. [McCallum & Nigam 98]

17 E-M algorithm
Initialize the probability model
Repeat:
– E-step: use the best available current classifier to classify some data points
– M-step: modify the classifier based on the classes produced by the E-step
Until convergence
A soft clustering method.
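
As a concrete illustration, a minimal sketch of EM fitting a two-component 1-D Gaussian mixture in Python (the model, the synthetic data, and all names are our assumptions; the lecture does not prescribe this particular setup):

    import numpy as np

    rng = np.random.default_rng(0)
    # synthetic data drawn from two Gaussians
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

    # arbitrary initial model
    mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
    weights = np.array([0.5, 0.5])

    def gauss(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    for _ in range(50):
        # E-step: soft assignment (responsibilities) under the current model
        resp = np.stack([w * gauss(x, m, s)
                         for w, m, s in zip(weights, mu, sigma)])
        resp /= resp.sum(axis=0)
        # M-step: re-estimate parameters from the soft assignments
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        weights = nk / len(x)

    print(mu, sigma, weights)  # means should approach roughly -2 and 3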

18–21 EM example (four slides of figures from Chris Bishop; not preserved in this transcript)

22 Demos
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://lcn.epfl.ch/tutorial/english/gaussian/html/
http://www.cs.cmu.edu/~alad/em/
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps

23 “Online” centroid method

24 Centroid method

25 Online centroid-based clustering
Each incoming document is compared to the existing centroids: if sim ≥ T (for a similarity threshold T), it joins the closest cluster and that centroid is updated; if sim < T for all clusters, it starts a new cluster. (Diagram on slide.)
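
A minimal Python sketch of this thresholded online method (the function names and the running-mean centroid update are our assumptions; the slide only shows the sim ≥ T / sim < T branches):

    def dot_sim(a, b):
        # toy similarity: dot product (assumes roughly unit-length vectors)
        return sum(x * y for x, y in zip(a, b))

    def online_cluster(docs, T, sim=dot_sim):
        centroids, clusters = [], []
        for d in docs:
            best, best_sim = None, T
            for i, c in enumerate(centroids):
                s = sim(d, c)
                if s >= best_sim:
                    best, best_sim = i, s
            if best is None:
                # sim < T for every existing cluster: start a new one
                centroids.append(list(d))
                clusters.append([d])
            else:
                # sim ≥ T: join closest cluster, update centroid (running mean)
                clusters[best].append(d)
                n = len(clusters[best])
                centroids[best] = [(c * (n - 1) + x) / n
                                   for c, x in zip(centroids[best], d)]
        return clusters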

26 Sample centroids

27 Evaluation of clustering
– Formal definition
– Objective function
– Purity (considering the majority class in each cluster)
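
A short sketch of computing purity, the fraction of all items accounted for by each cluster's majority class (the labels are hypothetical):

    from collections import Counter

    def purity(clusters):
        # clusters: list of lists of gold-class labels
        n = sum(len(c) for c in clusters)
        return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

    print(purity([['x', 'x', 'y'], ['y', 'y', 'y', 'x']]))  # (2+3)/7 ≈ 0.71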

28 RAND index
Accuracy when preserving object-object relationships:
RI = (TP + TN) / (TP + FP + FN + TN)
In the example:

29 RAND index

                  Same cluster   Different cluster
  Same class      TP = 20        FN = 24
  Different class FP = 20        TN = 72

RI = (20 + 72) / 136 = 0.68
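
The same index computed directly from per-item cluster and class labels, as a sketch (brute force over all pairs):

    from itertools import combinations

    def rand_index(cluster_labels, class_labels):
        # RI = (TP + TN) / (TP + FP + FN + TN) over all pairs of items
        tp = fp = fn = tn = 0
        for i, j in combinations(range(len(cluster_labels)), 2):
            same_cluster = cluster_labels[i] == cluster_labels[j]
            same_class = class_labels[i] == class_labels[j]
            if same_cluster and same_class:
                tp += 1
            elif same_cluster:
                fp += 1
            elif same_class:
                fn += 1
            else:
                tn += 1
        return (tp + tn) / (tp + fp + fn + tn)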

30 Hierarchical clustering methods
Single-linkage:
– one close pair is sufficient to merge two clusters
– disadvantage: long chains
Complete-linkage:
– all pairs have to match
– disadvantage: too conservative
Average-linkage
Demo

31 Non-hierarchical methods
Also known as flat clustering:
– Centroid method (online)
– K-means
– Expectation maximization

32 Hierarchical clustering
(Figure: eight points, labeled 1–8, to be clustered.)
Single link produces straggly clusters (e.g., ((1 2)(5 6))).

33

34 Hierarchical agglomerative clustering
Dendrograms: http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
/data2/tools/clustering
E.g., language similarity (dendrogram shown on slide).

35 Clustering using dendrograms

REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A (sentence boundaries not preserved in this transcript).
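
A sketch of this merge loop using SciPy's implementation, which also draws the dendrogram (the data points are invented; swap method='single' for 'complete' or 'average' to compare the linkage criteria from slide 30):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # eight made-up 2-D points
    X = np.array([[0, 0], [0, 1], [4, 0], [4, 1],
                  [8, 0], [8, 1], [2, 5], [6, 5]])
    # repeatedly merges the closest pair until one node is left
    Z = linkage(X, method='single')
    dendrogram(Z)
    plt.show()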

36 Paper reading
Mark Newman’s paper “The structure and function of complex networks” (sections I, II, III, IV, VI, VII, and VIIIa)

