Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)

Problem Definition
Given a matrix of customers versus products, simultaneously group the customers and the products into customer groups and product groups. The same applies to documents and words, or users and preferences, and so on.

Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no “magic numbers”
3. Scalable to large graphs

Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize the KL-divergence.
3. The number of row and column groups must be given by the user.

Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically, using the MDL principle.

Related Work
- K-means and variants: dimensionality curse, and the number of clusters must be chosen
- “Frequent itemsets”: the user must specify the “support”
- Information retrieval: the number of “concepts” must be chosen
- Graph partitioning: the number of partitions and a measure of imbalance between clusters must be specified

What makes a cross-association “good”?
A better clustering (1) groups similar nodes together and (2) uses as few groups as necessary. Rearranged this way, the matrix consists of a few, homogeneous blocks, and that implies good compression.

Main Idea
Good compression implies better clustering. Rearranging the binary matrix by row and column groups partitions it into blocks. For block $i$, let $n_i^1$ and $n_i^0$ be its numbers of ones and zeros, and let $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$. Then

Total encoding cost = $\sum_i (n_i^1 + n_i^0) \cdot H(p_i^1)$ (code cost) + $\sum_i$ [cost of describing $n_i^1$ and $n_i^0$] (description cost).
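As a minimal sketch of the code-cost term (not the authors’ reference implementation; the names binary_entropy and block_code_cost are illustrative), the bits needed to encode one block follow directly from its counts of ones and zeros:

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def block_code_cost(n1, n0):
    """Bits to encode one block: (n1 + n0) * H(p1), where p1 = n1 / (n1 + n0)."""
    n = n1 + n0
    return n * binary_entropy(n1 / n) if n else 0.0
```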

Examples
With one row group and one column group, the code cost is high but the description cost is low. With m row groups and n column groups (one group per row and per column), the code cost is low but the description cost is high. In both extremes the total encoding cost, $\sum_i (n_i^1 + n_i^0) \cdot H(p_i^1)$ (code cost) + $\sum_i$ [cost of describing $n_i^1$ and $n_i^0$] (description cost), suffers; the optimum lies in between.
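As a toy illustration of this tradeoff (reusing block_code_cost from the sketch above; the matrix and numbers are invented for the example):

```python
# 4x4 matrix with a pure 2x2 block of ones in one corner, zeros elsewhere.
# One row group x one column group: a single mixed block.
single = block_code_cost(n1=4, n0=12)   # 16 * H(0.25), about 12.98 bits

# Two row groups x two column groups isolating the dense corner:
# four pure 2x2 blocks, each costing zero bits to code.
split = sum(block_code_cost(n1, n0)
            for n1, n0 in [(4, 0), (0, 4), (0, 4), (0, 4)])   # 0.0 bits

# The code cost drops to zero, but the description cost (the per-block
# counts) grows with the number of blocks; MDL trades the two off.
```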

What makes a cross-association “good”?
Of the two groupings compared earlier, the better one is the one with the low total encoding cost (code cost plus description cost, as defined above).

Algorithms
The search grows the number of row groups k and column groups l alternately: (k, l) = (1, 2) → (2, 2) → (2, 3) → (3, 3) → (3, 4) → (4, 4) → (4, 5) → ..., here ending at k = 5 row groups and l = 5 column groups.

Algorithms
Outline: start with the initial matrix; find good groups for fixed k and l; choose better values for k and l; repeat, lowering the encoding cost at every step, until the final cross-associations are reached.

Fixed k and l
First, the inner step of the outline: finding good groups for fixed k and l.

Fixed k and l
Swaps: for each row, swap it to the row group which minimizes the code cost.
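A hedged sketch of one row-swap pass, assuming A is a dense 0/1 NumPy array, row_of and col_of hold the current group assignments, and k, l are the group counts (all names are illustrative; the actual implementation works with sparse matrices):

```python
import numpy as np

def swap_rows(A, row_of, col_of, k, l, eps=1e-9):
    """One row-swap pass: reassign every row to the row group whose
    block densities encode that row most cheaply (column groups fixed)."""
    # Density of ones in each block (row group g, column group j).
    ones = np.zeros((k, l))
    size = np.zeros((k, l))
    for g in range(k):
        rows = (row_of == g)
        for j in range(l):
            cols = (col_of == j)
            ones[g, j] = A[np.ix_(rows, cols)].sum()
            size[g, j] = rows.sum() * cols.sum()
    p = (ones + eps) / (size + 2 * eps)   # smoothed to avoid log(0)

    # Per-row counts of ones and zeros within each column group.
    row_ones = np.stack([A[:, col_of == j].sum(axis=1) for j in range(l)], axis=1)
    row_zeros = np.array([(col_of == j).sum() for j in range(l)]) - row_ones

    # cost[r, g] = bits to code row r under row group g's densities.
    cost = -(row_ones @ np.log2(p).T + row_zeros @ np.log2(1 - p).T)
    return cost.argmin(axis=1)   # new row-group assignment
```

Each reassignment can only keep or lower the code cost, so repeated passes cannot increase it.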

Fixed k and l
Ditto for column swaps... and repeat until the code cost stops dropping.
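Because rows and columns play symmetric roles, the same routine can serve for a column-swap pass via the transpose (a usage sketch with the names defined above):

```python
# Column swaps: apply the row-swap routine to the transposed matrix,
# with the two assignments (and k and l) exchanged.
col_of = swap_rows(A.T, col_of, row_of, l, k)
```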

Choosing k and l
Next, the outer step of the outline: choosing better values for k and l.

Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k + 1.
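A sketch of this split heuristic, reusing block_code_cost and NumPy from the earlier snippets (the bookkeeping is simplified: the reference point base is not updated as rows move):

```python
def split_row_group(A, row_of, col_of, k, l):
    """Split step (sketch): peel rows out of the row group with the
    highest code cost (entropy) per row, into a new row group k."""
    def cost_per_row(rows):
        if not rows.any():
            return 0.0
        total = 0.0
        for j in range(l):
            block = A[np.ix_(rows, col_of == j)]
            n1 = int(block.sum())
            total += block_code_cost(n1, block.size - n1)
        return total / rows.sum()

    # 1. Find the row group R with the maximum entropy per row.
    R = max(range(k), key=lambda g: cost_per_row(row_of == g))

    # 2. Choose the rows of R whose removal reduces R's entropy per row ...
    new_row_of = row_of.copy()
    base = cost_per_row(row_of == R)
    for r in np.flatnonzero(row_of == R):
        remaining = (new_row_of == R)
        remaining[r] = False
        if cost_per_row(remaining) < base:
            new_row_of[r] = k      # 3. ... send them to the new group k,
    return new_row_of, k + 1       #    and set k = k + 1.
```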

Choosing k and l
The split works similarly for column groups too.

Algorithms
Putting it together: swaps find good groups for fixed k and l, splits choose better values for k and l, and the two alternate, starting from the initial matrix and lowering the encoding cost until the final cross-associations emerge.
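Under the same assumptions, the pieces above combine into a sketch of the whole loop; total_cost uses a deliberately crude description cost (roughly log2(block size + 1) bits per block count), where the paper derives a more careful one:

```python
def total_cost(A, row_of, col_of, k, l):
    """Simplified total encoding cost: per-block code costs plus a
    crude description cost for the per-block counts."""
    code = desc = 0.0
    for g in range(k):
        rows = (row_of == g)
        for j in range(l):
            block = A[np.ix_(rows, col_of == j)]
            n1 = int(block.sum())
            code += block_code_cost(n1, block.size - n1)
            desc += math.log2(block.size + 1)
    return code + desc

def cross_associate(A, max_iters=20):
    """Outer loop (sketch): alternate splits and swap passes, keeping a
    change only if the total cost drops. Column splits and swaps would
    mirror the row steps on A.T; they are omitted here for brevity."""
    row_of = np.zeros(A.shape[0], dtype=int)
    col_of = np.zeros(A.shape[1], dtype=int)
    k = l = 1
    best = total_cost(A, row_of, col_of, k, l)
    for _ in range(max_iters):
        cand, cand_k = split_row_group(A, row_of, col_of, k, l)  # split
        cand = swap_rows(A, cand, col_of, cand_k, l)             # swaps
        cost = total_cost(A, cand, col_of, cand_k, l)
        if cost >= best:
            break                    # no improvement: stop
        row_of, k, best = cand, cand_k, cost
    return row_of, col_of, k, l
```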

Experiments
“Customer-Product” graph with Zipfian sizes, no noise: the method recovers k = 5 row groups and l = 5 column groups.

Experiments
“Caveman” graph with Zipfian cave sizes, noise = 10%: k = 6 row groups, l = 8 column groups.

Experiments
“White Noise” graph: k = 2 row groups, l = 3 column groups.

Experiments
“CLASSIC” graph of documents & words: k = 15, l = 19.

Experiments
“GRANTS” graph of documents & words (NSF grant proposals and the words in their abstracts): k = 41, l = 28.

Experiments
“Who-trusts-whom” graph of epinions.com users: k = 18, l = 16.

Experiments
“Clickstream” graph of users and webpages: k = 15, l = 13.

Experiments
Running time (in seconds) versus number of non-zeros, for both splits and swaps: linear in the number of “ones”, hence scalable.

Conclusions
The desiderata are met:
- Simultaneously discover row and column groups
- Fully automatic: no “magic numbers”
- Scalable to large graphs

Fixed k and l
Swaps are the step that finds good groups for fixed k and l in the outline above, lowering the encoding cost.

Experiments
“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 column groups.

Aim
Given any binary matrix, a “good” cross-association will have low cost. But how can we find such a cross-association?

Main Idea
Good compression implies better clustering, so minimize the total cost:

Total encoding cost = $\sum_i \text{size}_i \cdot H(p_i)$ (code cost) + [cost of describing the cross-associations] (description cost).

Main Idea
How well does a cross-association compress the matrix?
- Encode the matrix in a lossless fashion
- Compute the encoding cost
- Low encoding cost → good compression → good clustering
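Putting the sketches together on a tiny matrix (all helpers as defined earlier; the exact numbers depend on the simplified cost model):

```python
A = np.zeros((6, 6), dtype=int)
A[:3, :3] = 1                        # one dense "cave" of ones

trivial = total_cost(A, np.zeros(6, int), np.zeros(6, int), 1, 1)
row_of, col_of, k, l = cross_associate(A)
found = total_cost(A, row_of, col_of, k, l)
print(k, l, trivial, found)          # the discovered grouping costs fewer bits
```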