
1 Information Bottleneck Method & Double Clustering + α
2003. 4. 29. Summarized by Byoung-Hee Kim

2 Double Clustering
First obtain word clusters to represent documents in a reduced dimensional space. Then cluster the documents using the word-cluster representation.
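As a concrete (hypothetical) sketch of the two stages, with scikit-learn's KMeans standing in for the information-bottleneck clustering introduced on the following slides (all names are illustrative, not from the original):

# Minimal sketch of double clustering. KMeans stands in for the
# information-bottleneck word clustering described on later slides.
import numpy as np
from sklearn.cluster import KMeans

def double_cluster(counts, n_word_clusters=50, n_doc_clusters=10):
    """counts: (n_docs, n_words) array of word counts."""
    # Stage 1: cluster words by their distribution over documents.
    word_profiles = counts.T / np.maximum(counts.sum(axis=0)[:, None], 1)
    word_labels = KMeans(n_clusters=n_word_clusters, n_init=10).fit_predict(word_profiles)

    # Stage 2: re-represent each document by its total count per word cluster ...
    reduced = np.zeros((counts.shape[0], n_word_clusters))
    for w, c in enumerate(word_labels):
        reduced[:, c] += counts[:, w]

    # ... and cluster the documents in this much lower-dimensional space.
    doc_labels = KMeans(n_clusters=n_doc_clusters, n_init=10).fit_predict(reduced)
    return word_labels, doc_labels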

3 Traditional Clustering Methods (for text classification)
Represent a document as a vector of weights for the terms that occur in the document:

          w_1   w_2   w_3   ...   w_124080   w_124081   ...   w_n
  doc_1:  0.0   0.75  0.0   ...   0.0        0.13       ...   0.0
  doc_2:  0.6   0.21  0.0   ...   0.36       0.0        ...   0.0

This representation has many disadvantages:
● High dimensionality
● Sparseness
● Loss of word-ordering information
Clustering documents using the distances between pairs of such vectors is troublesome. The Information Bottleneck is an alternative method that does not rely on vector distances.
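A quick way to see the first two problems, using scikit-learn's CountVectorizer purely for illustration (the library choice is an assumption, not part of the original slides):

# Toy illustration of the sparse bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "information bottleneck clusters words"]
X = CountVectorizer().fit_transform(docs)   # sparse (n_docs, n_terms) matrix

print(X.shape)      # (3, 11): one dimension per distinct term
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"fraction of nonzero entries: {density:.2f}")
# On a real corpus the vocabulary runs into the hundreds of thousands
# and the density drops to a tiny fraction of a percent.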

4 Dimensionality Reduction
Dimensionality reduction is beneficial for improved accuracy and efficiency when clustering documents. Common techniques:
● Latent semantic indexing (LSI)
● Information gain and mutual information measures
● Chi-squared statistic
● Term strength algorithm
● Distributional clustering
  - Clusters words based on their distribution across documents
  - The Information Bottleneck is a distributional clustering method

5 Distributional Clustering
Most clustering algorithms start either
● from pairwise 'distances' between points (pairwise clustering), or
● with a distortion measure between a data point and a class centroid (vector quantization).
Their main problem is the arbitrary choice of the distance or distortion measure.
A distributional clustering method instead
● finds a cluster hierarchy of the members of a set Y (e.g. documents),
● based on the similarity of their conditional distributions w.r.t. the members of another set X (e.g. words), as sketched below.
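For instance, the conditional distributions and a KL-based comparison can be computed directly from a word-document count matrix (a toy sketch; the smoothing constant is an illustrative choice):

# Comparing documents by the similarity of their conditional word
# distributions p(x|y), the quantity distributional clustering uses.
import numpy as np

def conditional_word_dists(counts, smoothing=1e-9):
    """counts: (n_docs, n_words). Returns p(x|y) for each document y."""
    counts = counts + smoothing          # avoid zero rows / log(0)
    return counts / counts.sum(axis=1, keepdims=True)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

counts = np.array([[4, 1, 0, 2],
                   [3, 2, 0, 1],
                   [0, 0, 5, 4]], dtype=float)
P = conditional_word_dists(counts)
print(kl(P[0], P[1]))   # small: docs 0 and 1 use words similarly
print(kl(P[0], P[2]))   # large: doc 2 has a very different word profile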

6 The Information Bottleneck Method – idea
The Information Bottleneck:
● Find a mapping between X and a compressed representation X̃, characterized by a conditional probability distribution p(x̃|x).
  - For example, if X is the set of words, X̃ is a new representation of words where |X̃| ≤ |X|.
● Suppose the variable X is an observation of the variable of interest Y.
  - x ∈ X is evidence concerning the outcome y ∈ Y.
  - For example, x ∈ X is a word and y ∈ Y is a document.
● We want the mapping from X to X̃ to preserve as much information about Y as possible.
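The trade-off described above is exactly the variational problem posed in Tishby, Pereira & Bialek (1999), cited in the references:

% Information Bottleneck objective (Tishby, Pereira & Bialek, 1999):
% minimize over the stochastic mapping p(\tilde{x} \mid x)
\mathcal{L}\big[p(\tilde{x} \mid x)\big]
  = I(X;\tilde{X}) - \beta\, I(\tilde{X};Y)
% I(X; \tilde{X}) measures compression, I(\tilde{X}; Y) measures the
% preserved relevant information, and \beta > 0 sets the trade-off.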

7 Entropy
Entropy measures the uncertainty about a discrete random variable X:
  H(X) = − Σ_x p(x) log p(x)
Entropy is a lower bound on the average number of bits needed to accurately encode X.
The conditional entropy of X given Y describes the amount of uncertainty about X that remains after an observation of Y:
  H(X|Y) = − Σ_y p(y) Σ_x p(x|y) log p(x|y)
Relative entropy, or Kullback-Leibler (KL) distance, measures the distance between two distributions p and q:
  D_KL(p ‖ q) = Σ_x p(x) log ( p(x) / q(x) )
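These three definitions translate directly into code; a minimal NumPy version (function names are my own, and distributions are given as probability arrays):

import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(X|Y) from a joint p(x, y) given as a 2-D array (x rows, y columns)."""
    p_y = joint.sum(axis=0)
    return float(sum(p_y[j] * entropy(joint[:, j] / p_y[j])
                     for j in range(joint.shape[1]) if p_y[j] > 0))

def kl_divergence(p, q):
    """D(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))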

8 Mutual Information
The mutual information of X and Y measures the amount of uncertainty about X that is resolved by observing Y:
  I(X; Y) = H(X) − H(X|Y) = Σ_{x,y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
This is also the relative entropy between the joint distribution of X and Y and the product of their marginal distributions. Note that I(X; Y) = I(Y; X): mutual information is symmetric.
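Continuing the sketch above, mutual information can be computed either as H(X) − H(X|Y) or as the KL divergence the slide mentions; the two routes agree:

import numpy as np
# (entropy, conditional_entropy, kl_divergence as defined in the previous sketch)

def mutual_information(joint):
    """I(X;Y) = D( p(x,y) || p(x)p(y) ) for a 2-D joint distribution."""
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    return kl_divergence(joint.ravel(), np.outer(p_x, p_y).ravel())

joint = np.array([[0.25, 0.05],
                  [0.05, 0.65]])
p_x = joint.sum(axis=1)
print(mutual_information(joint))                   # ~0.43 bits
print(entropy(p_x) - conditional_entropy(joint))   # same value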

9 Hierarchical Agglomerative Clustering
[Figure: dendrogram over elements X1 … X6, illustrating a hard clustering hierarchy]
Hard clustering:
● Input: joint probability distribution p(x, y)
● Output: a partition of X into m clusters
● Initialization:
  - Construct X̃ = X (every element in its own singleton cluster).
  - Initialize the 'cost matrix' d_{i,j}: calculate the cost of merging each pair {x̃_i, x̃_j}.
● Loop: for m = |X| − 1, …, 1:
  - Find the indices {i, j} for which d_{i,j} is minimized.
  - Merge: replace {x̃_i, x̃_j} by the merged cluster x̃*.
  - Update: X̃ = (X̃ \ {x̃_i, x̃_j}) ∪ {x̃*}.
  - Update the costs d w.r.t. x̃*.
● End for
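Below is a minimal, deliberately unoptimized Python sketch of this loop. Following Slonim & Tishby (2000), the merge cost is taken to be the information about Y lost by a merge, (p_i + p_j) · JS(p(y|x̃_i), p(y|x̃_j)); it reuses kl_divergence from the entropy sketch, and all names are illustrative:

import numpy as np

def js_merge_cost(p_i, p_j, py_i, py_j):
    """Information lost by merging clusters i and j (weighted JS divergence)."""
    pi, pj = p_i / (p_i + p_j), p_j / (p_i + p_j)
    m = pi * py_i + pj * py_j
    return (p_i + p_j) * (pi * kl_divergence(py_i, m) + pj * kl_divergence(py_j, m))

def agglomerative_ib(joint, n_clusters):
    """joint: p(x, y) as (|X|, |Y|) array with positive row sums."""
    clusters = [[i] for i in range(joint.shape[0])]
    p = list(joint.sum(axis=1))                            # cluster priors p(x~)
    py = [joint[i] / p[i] for i in range(joint.shape[0])]  # conditionals p(y|x~)
    while len(clusters) > n_clusters:
        # Find the pair whose merge loses the least information about Y.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: js_merge_cost(p[ab[0]], p[ab[1]], py[ab[0]], py[ab[1]]))
        py[i] = (p[i] * py[i] + p[j] * py[j]) / (p[i] + p[j])  # merged conditional
        p[i] += p[j]
        clusters[i] += clusters[j]
        del clusters[j], p[j], py[j]                           # j > i, safe to delete
    return clusters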

10 Clustering on the 'NCI60' Data Set
[Figure: clustering results on the NCI60 gene-expression data set. Left: single clustering. Right: double clustering (G = 100).]
Source: Scherf et al. (2000)

11 Disadvantages of Agglomerative Clustering
● Time complexity is O(N_d^3 · N_w) (can be reduced to O(N_d^2 · N_w) by caching pairwise merge costs).
● Space complexity is O(N_d^2).
● Solutions are not guaranteed to be even locally optimal (greedy).
A simple sequential clustering approach can successfully resolve these difficulties.
(N_d ~ #documents, N_w ~ #words)

12 Sequential IB
Advantages of sequential clustering:
● Time complexity is linear in N_d: O(L · N_t · N_d · N_w) ≪ O(N_d^3 · N_w)
● Space complexity is O(N_t · N_d) ≪ O(N_d^2)
● Locally optimal solutions are guaranteed.
(N_t ~ #clusters, L ~ #loops)
In contrast to agglomerative clustering, sequential optimization provides (locally) optimal solutions with much better complexity. Applying this idea in the IB framework yields the sIB algorithm, with promising results for document clustering; a sketch follows.
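A minimal sketch of the sequential procedure, reusing js_merge_cost from the agglomerative sketch (the draw-and-reassign structure follows Slonim, Friedman & Tishby (2002); details such as the convergence test are simplified assumptions):

import numpy as np

def sequential_ib(joint, n_clusters, n_loops=10, seed=0):
    """joint: p(x, y) as (|X|, |Y|) array with positive row sums."""
    rng = np.random.default_rng(seed)
    n = joint.shape[0]
    labels = rng.integers(n_clusters, size=n)   # random initial hard partition
    p_x = joint.sum(axis=1)
    for _ in range(n_loops):
        changed = False
        for i in rng.permutation(n):
            old = labels[i]
            labels[i] = -1                      # draw x_i out of its cluster
            costs = []
            for t in range(n_clusters):
                members = np.flatnonzero(labels == t)
                if len(members) == 0:
                    costs.append(0.0)           # empty cluster: free to join
                    continue
                p_t = p_x[members].sum()
                py_t = joint[members].sum(axis=0) / p_t
                py_i = joint[i] / p_x[i]
                # Cost of re-merging x_i into cluster t (information loss).
                costs.append(js_merge_cost(p_x[i], p_t, py_i, py_t))
            labels[i] = int(np.argmin(costs))
            changed |= labels[i] != old
        if not changed:                         # no reassignment: local optimum
            break
    return labels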

13 References
● Pereira, F.C., Tishby, N., and Lee, L., "Distributional clustering of English words", Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 183-190, 1993.
● Scherf, U., Ross, D.T., Waltham, M., Smith, L.H., Lee, J.K., Tanabe, L., Kohn, K.W., Reinhold, W.C., Myers, T.G., Andrews, D.T., Scudiero, D.A., Eisen, M.B., Sausville, E.A., Pommier, Y., Botstein, D., Brown, P.O., and Weinstein, J.N., "A gene expression database for the molecular pharmacology of cancer", Nature Genetics, Vol. 24, pp. 236-244, 2000.
● Slonim, N. and Tishby, N., "Document clustering using word clusters via the information bottleneck method", Proceedings of SIGIR-2000, pp. 208-215, 2000.
● Tishby, N., Pereira, F.C., and Bialek, W., "The information bottleneck method", Proceedings of the 37th Allerton Conference on Communication and Computation, pp. 368-377, 1999.
● Slonim, N., Friedman, N., and Tishby, N., "Unsupervised document classification using sequential information maximization", Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.

