5Algorithm 0 Recommend to you the most popular restaurants say # positive votes minus # negative votesIgnores your culinary preferencesAnd judgements of those with similar preferencesHow can we exploit the wisdom of “like-minded” people?
9Similarity between two people Similarity between their preference vectors.Inner products are a good start.Dave has similarity 3 with Estiebut -2 with Cindy.Perhaps recommend Black Buoy to Daveand Bakehouse to Bob, etc.
10Algorithm 1.1You give me your preferences and I need to give you a recommendation.I find the person “most similar” to you in my database and recommend something he likes.Aspects to consider:No attempt to discern cuisines, etc.What if you’ve been to all the restaurants he has?Do you want to rely on one person’s opinions?
11Algorithm 1.kYou give me your preferences and I need to give you a recommendation.I find the k people “most similar” to you in my database and recommend what’s most popular amongst them.Issues:A priori unclear what k should beRisks being influenced by “unlike minds”
12Slightly more sophisticated attempt Group similar users together into clustersYou give your preferences and seek a recommendation, thenFind the “nearest cluster” (what’s this?)Recommend the restaurants most popular in this clusterFeatures:avoids data sparsity issuesstill no attempt to discern why you’re recommended what you’re recommendedhow do you cluster?
14How do you cluster? Must keep similar people together in a cluster Separate dissimilar peopleFactors:Need a notion of similarity/distanceVector space? Normalization?How many clusters?Fixed a priori?Completely data driven?Avoid “trivial” clusters - too large or small
15Looking beyond Clustering people for restaurant recommendations Clustering other things(documents, web pages)Other approachesto recommendationAmazon.comGeneral unsupervised machine learning.
17Navigating search results Given the results of a search (say jaguar), partition into groups of related docssense disambiguationApproach followed by Uni Essex siteKruschwitz / al Bakouri / LungleyOther examples: IBM InfoSphere Data Explorer(was: vivisimo.com)
18Results list clustering example Jaguar Motor Cars’ home pageMike’s XJS resource pageVermont Jaguar owners’ clubCluster 2:Big catsMy summer safari tripPictures of jaguars, leopards and lionsCluster 3:Jacksonville Jaguars’ Home PageAFC East Football Teams
20Why cluster documents? For improving recall in search applications For speeding up vector space retrievalCorpus analysis/navigationSense disambiguation in search results
21Improving search recall Cluster hypothesis - Documents with similar text are relatedErgo, to improve search recall:Cluster docs in corpus a prioriWhen a query matches a doc D, also return other docs in the cluster containing DHope: docs containing automobile returned on a query for car becauseclustering grouped together docs containing car with those containing automobile.Why might this happen?
22Speeding up vector space retrieval In vector space retrieval, must find nearest doc vectors to query vectorThis would entail finding the similarity of the query to every doc - slow!By clustering docs in corpus a priorifind nearest docs in cluster(s) close to queryinexact but avoids exhaustive similarity computationExercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.
23Corpus analysis/navigation Given a corpus, partition it into groups of related docsRecursively, can induce a tree of topicsAllows user to browse through corpus to home in on informationCrucial need: meaningful labels for topic nodes.
24CLUSTERING DOCUMENTS IN A (VERY) LARGE COLLECTION: GOOGLE NEWS Notice traders traders
25CLUSTERING DOCUMENTS IN A VERY LARGE COLLECTION: JRC’S NEWS EXPLORER Notice traders traders
26What makes docs “related”? Ideal: semantic similarity.Practical: statistical similarityWe will use cosine similarity.Docs as vectors.For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.We will describe algorithms in terms of cosine similarity.
27DOCUMENTS AS BAGS OF WORDS INDEXDOCUMENTbroad may rally ralliedsignal stock stocks tech technology traders traders trendbroad tech stock rally may signal trend - traders.technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.Notice traders traders
28Doc as vectorEach doc j is a vector of tfidf values, one component for each term.Can normalize to unit length.So we have a vector spaceterms are axes - aka featuresn docs live in this spaceeven with stemming, may have dimensionsdo we really want to use all terms?
29TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE Also: t-score, chi-squareFREQUENCY of term i in document kNumber of documents with term i
30Intuitiont 3D2D3D1xyt 1t 2D4Postulate: Documents that are “close together”in vector space talk about the same things.
32Two flavors of clustering Given n docs and a positive integer k, partition docs into k (disjoint) subsets.Given docs, partition into an “appropriate” number of subsets.E.g., for query results - ideal value of k not known up front - though UI may impose limits.Can usually take an algorithm for one flavor and convert to the other.
33Thought experimentConsider clustering a large set of computer science documentswhat do you expect to see in the vector space?
34Thought experimentConsider clustering a large set of computer science documentswhat do you expect to see in the vector space?NLPGraphicsAITheoryArch.
35Decision boundariesCould we use these blobs to infer the subject of a new document?NLPGraphicsAITheoryArch.
36Deciding what a new doc is about Check which region the new doc falls intocan output “softer” decisions as well.NLPGraphicsAITheoryArch.= AI
37Setup Given “training” docs for each category Theory, AI, NLP, etc.Cast them into a decision spacegenerally a vector space with each doc viewed as a bag of wordsBuild a classifier that will classify new docsEssentially, partition the decision spaceGiven a new doc, figure out which partition it falls into
38Supervised vs. unsupervised learning This setup is called supervised learning in the terminology of Machine LearningIn the domain of text, various namesText classification, text categorizationDocument classification/categorization“Automatic” categorizationRouting, filtering …In contrast, the earlier setting of clustering is called unsupervised learningPresumes no availability of training samplesClusters output may not be thematically unified.
39“Which is better?” Depends Can use in combination on your settingon your applicationCan use in combinationAnalyze a corpus using clusteringHand-tweak the clusters and label themUse clusters as training input for classificationSubsequent docs get classifiedComputationally, methods quite different
40What more can these methods do? Assigning a category label to a document is one way of adding structure to it.Can add others, e.g.,extract from the docpeopleplacesdatesorganizations …This process is known as information extractioncan also be addressed using supervised learning.
41Clustering: a bit of terminology REPRESENTATIVECENTROIDOUTLIER
42Key notion: cluster representative In the algorithms to follow, will generally need a notion of a representative point in a clusterRepresentative should be some sort of “typical” or central point in the cluster, e.g.,point inducing smallest radii to docs in clustersmallest squared distances, etc.point that is the “average” of all docs in the clusterNeed not be a document
43Key notion: cluster centroid Centroid of a cluster = component-wise average of vectors in a cluster - is a vector.Need not be a doc.Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).Centroid
44(Outliers in centroid computation) Can ignore outliers when computing centroid.What is an outlier?Lots of statistical definitions, e.g.moment of point to centroid > M some cluster moment.Say 10.CentroidOutlier
45Clustering algorithms Partitional vs. hierarchicalAgglomerativeK-means
46Partitional Clustering A Partitional ClusteringOriginal Points
47Hierarchical Clustering Traditional Hierarchical ClusteringTraditional DendrogramNon-traditional Hierarchical ClusteringNon-traditional Dendrogram
48Agglomerative clustering Given target number of clusters k.Initially, each doc viewed as a clusterstart with n clusters;Repeat:while there are > k clusters, find the “closest pair” of clusters and merge them.
49“Closest pair” of clusters Many variants to defining closest pair of clustersClusters whose centroids are the most cosine-similar… whose “closest” points are the most cosine-similar… whose “furthest” points are the most cosine-similar
50Example: n=6, k=3, closest pair of centroids Centroid aftersecond step.d1d2Centroid after first step.
51Issues Have to support finding closest pairs continually compare all pairs?Potentially n3 cosine similarity computationsTo avoid: use approximations.“points” are changing as centroids change.Changes at each step are not localizedon a large corpus, memory management an issuesometimes addressed by clustering a sample.Why?What’sthis?
52ExerciseConsider agglomerative clustering on n points on a line. Explain how you could avoid n3 distance computations - how many will your scheme use?
53“Using approximations” In standard algorithm, must find closest pair of centroids at each stepApproximation: instead, find nearly closest pairuse some data structure that makes this approximation easier to maintainsimplistic example: maintain closest pair based on distances in projection on a random lineRandom line
54A partitional clustering algorithm: K-MEANS Given k - the number of clusters desired.Each cluster associated with a centroid.Each point assigned to the cluster with the closest centroid.Iterate.
55Different algorithm: k-means Given k - the number of clusters desired.Iterative algorithm.More locality within each iteration.Hard to get good bounds on the number of iterations.
56Basic iteration At the start of the iteration, we have k centroids. Each doc assigned to the nearest centroid.All docs assigned to the same centroid are averaged to compute a new centroid;thus have k new centroids.
60K-means Clustering – Details Initial centroids are often chosen randomly.Clusters produced vary from one run to another.The centroid is (typically) the mean of the points in the cluster.‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.K-means will converge for common similarity measures mentioned above.Most of the convergence happens in the first few iterations.Often the stopping condition is changed to ‘Until relatively few points change clusters’Complexity is O( n * K * I * d )n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
61Effect of initial choice of centroids: Two different K-means Clusterings Original PointsOptimal ClusteringSub-optimal Clustering
63k-means clustering Begin with k docs as centroids could be any k docs, but k random docs are better.Repeat Basic Iteration until termination condition satisfied.
64Termination conditions Several possibilities, e.g.,A fixed number of iterations.Doc partition unchanged.Centroid positions don’t change.Does this mean that the docs in a cluster are unchanged?
65Convergence Why should the k-means algorithm ever reach a fixed point? A state in which clusters don’t change.k-means is a special case of a general procedure known as the EM algorithm.Under reasonable conditions, known to converge.Number of iterations could be large.
66ExerciseConsider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see?Is agglomerative clustering likely to produce different results?
67k not specified in advance Say, the results of a query.Solve an optimization problem: penalize having lots of clustersapplication dependant, e.g., compressed summary of search results list.Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
68k not specified in advance Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroidDefine the Total Benefit to be the sum of the individual doc Benefits.Why is there always a clustering of Total Benefit n?
69Penalize lots of clusters For each cluster, we have a Cost C.Thus for a clustering with k clusters, the Total Cost is kC.Define the Value of a cluster to be =Total Benefit - Total Cost.Find the clustering of highest Value, over all choices of k.
70Evaluating K-means Clusters Most common measure is Sum of Squared Error (SSE)For each point, the error is the distance to the nearest clusterTo get SSE, we square these errors and sum them.x is a data point in cluster Ci and mi is the representative point for cluster Cican show that mi corresponds to the center (mean) of the clusterGiven two clusters, we can choose the one with the smallest errorOne easy way to reduce SSE is to increase K, the number of clustersA good clustering with smaller K can have a lower SSE than a poor clustering with higher K
71Back to agglomerative clustering In a run of agglomerative clustering, we can try all values of k=n,n-1,n-2, … 1.At each, we can measure our Value, then pick the best choice of k.
72ExerciseSuppose a run of agglomerative clustering finds k=7 to have the highest Value amongst all k. Have we found the highest-Value clustering amongst all clusterings with k=7?
73Limitations of K-means K-means has problems when clusters are of differingSizesDensitiesNon-globular shapesK-means has problems when the data contains outliers.
74Limitations of K-means: Differing Sizes Original PointsK-means (3 Clusters)
75Limitations of K-means: Differing Density Original PointsK-means (3 Clusters)
76Limitations of K-means: Non-globular Shapes Original PointsK-means (2 Clusters)
77Hierarchical clustering As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.d3d5d1d3,d4,d5d4d2d1,d2d4,d5d3
78Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical treeCan be visualized as a dendrogramA tree like diagram that records the sequences of merges or splits
79Strengths of Hierarchical Clustering Do not have to assume any particular number of clustersAny desired number of clusters can be obtained by ‘cutting’ the dendogram at the proper levelThey may correspond to meaningful taxonomiesExample in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
80Hierarchical Clustering Two main types of hierarchical clusteringAgglomerative:Start with the points as individual clustersAt each step, merge the closest pair of clusters until only one cluster (or k clusters) leftDivisive:Start with one, all-inclusive clusterAt each step, split a cluster until each cluster contains a point (or there are k clusters)Traditional hierarchical algorithms use a similarity or distance matrixMerge or split one cluster at a time
81Agglomerative vs. Divisive Clustering Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.Divisive (partitional, top-down) separate all examples immediately into clusters.
82Hierarchical Agglomerative Clustering (HAC) More popular hierarchical clustering techniqueAssumes a similarity function for determining the similarity of two instances.Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.The history of merging forms a binary tree or hierarchy.
83Agglomerative Clustering Algorithm Basic algorithm is straightforwardCompute the proximity matrixLet each data point be a clusterRepeatMerge the two closest clustersUpdate the proximity matrixUntil only a single cluster remainsKey operation is the computation of the proximity of two clustersDifferent approaches to defining the distance between clusters distinguish the different algorithms
84Starting SituationStart with clusters of individual points and a proximity matrixp1p3p5p4p2. . ..Proximity Matrix
85Intermediate Situation After some merging steps, we have some clustersC2C1C3C5C4C3C4Proximity MatrixC1C5C2
86Intermediate Situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.C2C1C3C5C4C3C4Proximity MatrixC1C5C2
87After Merging The question is “How do we update the proximity matrix?” C2 U C5C1C3C4C1?? ? ? ?C2 U C5C3C3?C4?C4Proximity MatrixC1C2 U C5
88How to Define Inter-Cluster Similarity p1p3p5p4p2. . ..Similarity?MINMAXGroup AverageDistance Between CentroidsOther methods driven by an objective functionWard’s Method uses squared errorProximity Matrix
89How to Define Inter-Cluster Similarity p1p3p5p4p2. . ..MINMAXGroup AverageDistance Between CentroidsOther methods driven by an objective functionWard’s Method uses squared errorProximity Matrix
90How to Define Inter-Cluster Similarity p1p3p5p4p2. . ..MINMAXGroup AverageDistance Between CentroidsOther methods driven by an objective functionWard’s Method uses squared errorProximity Matrix
91How to Define Inter-Cluster Similarity p1p3p5p4p2. . ..MINMAXGroup AverageDistance Between CentroidsOther methods driven by an objective functionWard’s Method uses squared errorProximity Matrix
93List of issues/applications covered Term vs. document space clusteringMulti-lingual docsFeature selectionSpeeding up scoringBuilding navigation structures“Automatic taxonomy induction”Labeling
94Term vs. document spaceThus far, we clustered docs based on their similarities in terms spaceFor some applications, e.g., topic analysis for inducing navigation structures, can “dualize”:use docs as axesrepresent (some) terms as vectorsproximity based on co-occurrence of terms in docsnow clustering terms, not docs
95Term vs. document space If terms carefully chosen (say nouns) fixed number of pairs for distance computationindependent of corpus sizeclusters have clean descriptions in terms of noun phrase co-occurrence - easier labeling?left with problem of binding docs to these clusters
96Multi-lingual docs E.g., News Explorer, Canadian government docs. Every doc in English and equivalent French.Must cluster by concepts rather than languageSimplest: pad docs in one lang with dictionary equivalents in the otherthus each doc has a representation in both languagesAxes are terms in both languages
97Feature selection Which terms to use as axes for vector space? Huge body of (ongoing) researchIDF is a form of feature selectioncan exaggerate noise e.g., mis-spellingsPseudo-linguistic heuristics, e.g.,drop stop-wordsstemming/lemmatizationuse only nouns/noun phrasesGood clustering should “figure out” some of these
98Clustering to speed up scoring From CS276a, recall sampling and pre-groupingWanted to find, given a query Q, the nearest docs in the corpusWanted to avoid computing cosine similarity of Q to each of n docs in the corpus.
99Sampling and pre-grouping First run a clustering phasepick a representative leader for each cluster.Process a query as follows:Given query Q, find its nearest leader L.Seek nearest docs from L’s followers onlyavoid cosine similarity to all docs.
100Navigation structure Given a corpus, agglomerate into a hierarchy Throw away lower layers so you don’t have n leaf topics each having a single docMany principled methods for this pruning such as MDL.d3d5d1d3,d4,d5d4d2d1,d2d4,d5d3
101Navigation structureCan also induce hierarchy top-down - e.g., use k-means, then recur on the clusters.Need to figure out what k should be at each point.Topics induced by clustering need human ratificationcan override mechanical pruning.Need to address issues like partitioning at the top level by language.
102Major issue - labelingAfter clustering algorithm finds clusters - how can they be useful to the end user?Need pithy label for each clusterIn search results, say “Football” or “Car” in the jaguar example.In topic trees, need navigational cues.Often done by hand, a posteriori.
103How to Label Clusters Show titles of typical documents Titles are easy to scanAuthors create them for quick scanning!But you can only show a few titles which may not fully represent clusterShow words/phrases prominent in clusterMore likely to fully represent clusterUse distinguishing words/phrasesBut harder to scan
104LabelingCommon heuristics - list 5-10 most frequent terms in the centroid vector.Drop stop-words; stem.Differential labeling by frequent termsWithin the cluster “Computers”, child clusters all have the word computer as frequent terms.Discriminant analysis of sub-tree centroids.
105The biggest issue in clustering? How do you compare two alternatives?Computation (time/space) is only one metric of performanceHow do you look at the “goodness” of the clustering produced by a method