Fully Automatic Cross-Associations Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Dharmendra Modha (IBM) Christos Faloutsos (CMU and IBM)


1 Fully Automatic Cross-Associations Deepayan Chakrabarti (CMU) Spiros Papadimitriou (CMU) Dharmendra Modha (IBM) Christos Faloutsos (CMU and IBM)

2 Problem Definition
[figure: a Customers × Products matrix, reordered into Customer Groups × Product Groups]
Simultaneously group customers and products, or documents and words, or users and preferences, …

3 Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large graphs

4 Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically, using the MDL principle.

5 Related Work
K-means and variants: dimensionality curse; choosing the number of clusters
"Frequent itemsets": user must specify "support"
Information Retrieval: choosing the number of "concepts"
Graph Partitioning: number of partitions; measure of imbalance between clusters

6 What makes a cross-association "good"?
[figure: two alternative groupings of the same matrix into row and column groups, shown side by side] Why is one better?
Better Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
This yields a few, homogeneous blocks, which implies Good Compression.

7 Main Idea
Good Compression implies Better Clustering.
Total Encoding Cost of the binary matrix =
Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]
+ Σ_i (cost of describing n_i^1 and n_i^0)   [Description Cost]
where, for each block i, n_i^1 and n_i^0 count its ones and zeros, and p_i^1 = n_i^1 / (n_i^1 + n_i^0).
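To make the Code Cost term concrete, here is a minimal Python sketch (our own illustration, not the paper's code; the name block_code_cost and the numpy dependency are assumptions) that computes (n_i^1 + n_i^0) · H(p_i^1) for one block:

    import numpy as np

    def block_code_cost(n1, n0):
        # Bits to encode a block with n1 ones and n0 zeros, charging
        # H(p1) bits per cell, where p1 = n1 / (n1 + n0).
        n = n1 + n0
        if n == 0 or n1 == 0 or n0 == 0:
            return 0.0  # an empty or perfectly uniform block costs nothing
        p1 = n1 / n
        return n * -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))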

8 Examples
One row group, one column group: low Description Cost, but high Code Cost.
m row groups, n column groups (one per row and column): low Code Cost, but high Description Cost.
Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]  +  Σ_i (cost of describing n_i^1 and n_i^0)   [Description Cost]
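As a worked check of this trade-off (a toy example of ours, reusing the block_code_cost sketch above): take an 8×8 matrix with two 4×4 all-ones "caves" on the diagonal and zeros elsewhere.

    # One row group, one column group: a single block with 32 ones, 32 zeros.
    block_code_cost(32, 32)   # 64 * H(0.5) = 64.0 bits: high Code Cost

    # Two row groups and two column groups aligned with the caves:
    # two all-ones blocks and two all-zeros blocks, all perfectly uniform.
    2 * block_code_cost(16, 0) + 2 * block_code_cost(0, 16)   # 0.0 bits

Going further, to one group per row and per column, would also give 0 bits of Code Cost but a much larger Description Cost.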

9 What makes a cross-association "good"?
[figure: the two alternative groupings again] Why is this one better? Because its Total Encoding Cost is low:
Total Encoding Cost = Σ_i (n_i^1 + n_i^0) · H(p_i^1)   [Code Cost]  +  Σ_i (cost of describing n_i^1 and n_i^0)   [Description Cost]

10 Algorithms
The search alternately grows the number of row groups k and column groups l:
k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → … up to k = 5 row groups, l = 5 col groups.

11 Algorithms
Start with the initial matrix → find good groups for fixed k and l → choose better values for k and l → (repeat) → final cross-associations (here, k = 5 row groups, l = 5 col groups).
Every step lowers the encoding cost.

12 Fixed k and l
The first part of the loop: find good groups for a fixed k and l, lowering the encoding cost.

13 Fixed k and l
Swaps: for each row, swap it to the row group that minimizes the code cost. (A sketch follows.)
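A minimal sketch of one swap pass, assuming a dense 0/1 numpy array A, integer label arrays, numpy imported as np from the earlier sketch, and the block-density code length from slide 7. The names are ours, and the actual implementation works on sparse matrices with incremental count updates:

    def swap_rows_once(A, row_labels, col_labels, k, l):
        # Move each row to the row group whose block densities give the
        # row the cheapest code length; one pass over all rows.
        eps = 1e-9
        # x[r, j] = number of ones that row r has inside column group j
        x = np.stack([A[:, col_labels == j].sum(axis=1) for j in range(l)], axis=1)
        w = np.array([(col_labels == j).sum() for j in range(l)])  # col-group sizes
        for r in range(A.shape[0]):
            best_g, best_cost = row_labels[r], np.inf
            for g in range(k):
                in_g = row_labels == g
                ones = x[in_g].sum(axis=0)              # ones per block (g, j)
                cells = np.maximum(in_g.sum() * w, 1)   # cells per block (g, j)
                p = np.clip(ones / cells, eps, 1 - eps) # block densities
                # code length of row r under group g's densities
                cost = -np.sum(x[r] * np.log2(p) + (w - x[r]) * np.log2(1 - p))
                if cost < best_cost:
                    best_g, best_cost = g, cost
            row_labels[r] = best_g
        return row_labels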

14 Fixed k and l
Ditto for column swaps … and repeat …

15 Choosing k and l
The second part of the loop: choose better values for k and l, again lowering the encoding cost.

16 Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to a new row group, and set k = k + 1.
(A sketch of this split step follows.)
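One simplistic reading of these three steps, as a sketch. Here per_row_entropy is our own scaffolding; it reuses block_code_cost from the earlier sketch, and step 2 is evaluated row by row against the group's original per-row entropy:

    def per_row_entropy(A, row_labels, col_labels, g, l):
        # Code cost of row group g's blocks, divided by its number of rows.
        rows = np.where(row_labels == g)[0]
        if len(rows) == 0:
            return 0.0
        cost = 0.0
        for j in range(l):
            block = A[np.ix_(rows, np.where(col_labels == j)[0])]
            n1 = int(block.sum())
            cost += block_code_cost(n1, block.size - n1)
        return cost / len(rows)

    def split_step(A, row_labels, col_labels, k, l):
        # 1. Row group R with the maximum entropy per row.
        R = max(range(k), key=lambda g: per_row_entropy(A, row_labels, col_labels, g, l))
        base = per_row_entropy(A, row_labels, col_labels, R, l)
        moved = False
        for r in np.where(row_labels == R)[0]:
            row_labels[r] = k  # tentatively move r to a new group...
            if per_row_entropy(A, row_labels, col_labels, R, l) < base:
                moved = True   # 2. ...keep it out if R's per-row entropy drops
            else:
                row_labels[r] = R
        # 3. A new row group exists only if some row actually moved.
        return k + 1 if moved else k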

17 Choosing k and l
Split: the same procedure applies to column groups, increasing l.

18 Algorithms
The complete loop: start with the initial matrix → find good groups for fixed k and l (Swaps) → choose better values for k and l (Splits) → final cross-associations.
Each pass lowers the encoding cost. (A sketch of the full loop follows.)
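A much-simplified sketch of the whole loop, assuming the swap_rows_once, split_step, and block_code_cost sketches above. For brevity it tracks only the Code Cost; the paper's stopping rule uses the full MDL total (Code Cost plus Description Cost) and reverts splits that do not pay off, so this toy version can behave differently:

    def total_code_cost(A, row_labels, col_labels, k, l):
        # Sum of block code costs over all k*l blocks (Description Cost omitted).
        cost = 0.0
        for g in range(k):
            for j in range(l):
                block = A[np.ix_(np.where(row_labels == g)[0],
                                 np.where(col_labels == j)[0])]
                n1 = int(block.sum())
                cost += block_code_cost(n1, block.size - n1)
        return cost

    def cross_associations(A, max_outer=20):
        m, n = A.shape
        row_labels, col_labels = np.zeros(m, dtype=int), np.zeros(n, dtype=int)
        k = l = 1
        best = total_code_cost(A, row_labels, col_labels, k, l)
        for _ in range(max_outer):
            rl, cl = row_labels.copy(), col_labels.copy()
            k2 = split_step(A, rl, cl, k, l)          # try a row split...
            l2 = split_step(A.T, cl, rl, l, k2)       # ...and a column split
            rl = swap_rows_once(A, rl, cl, k2, l2)    # re-run swap passes
            cl = swap_rows_once(A.T, cl, rl, l2, k2)  # (columns = rows of A.T)
            cost = total_code_cost(A, rl, cl, k2, l2)
            if cost >= best:
                break                                 # no improvement: stop
            row_labels, col_labels, k, l, best = rl, cl, k2, l2, cost
        return row_labels, col_labels, k, l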

19 Experiments
"Customer-Product" graph with Zipfian sizes, no noise: found k = 5 row groups, l = 5 col groups.

20 Experiments
"Caveman" graph with Zipfian cave sizes, noise = 10%: found k = 6 row groups, l = 8 col groups.

21 Experiments
"White Noise" graph: found k = 2 row groups, l = 3 col groups.

22 Experiments
"CLASSIC" graph of documents (rows) and words (columns): k = 15, l = 19.

23 Experiments
"GRANTS" graph of NSF grant proposals (rows) and words in abstracts (columns): k = 41, l = 28.

24 Experiments
"Who-trusts-whom" graph of epinions.com users: k = 18, l = 16.

25 Experiments
"Clickstream" graph of users (rows) and webpages (columns): k = 15, l = 13.

26 Experiments
[plot: time (secs) vs. number of non-zeros, for both splits and swaps] Running time is linear in the number of "ones": scalable.

27 Conclusions
Desiderata:
1. Simultaneously discover row and column groups
2. Fully automatic: no "magic numbers"
3. Scalable to large graphs

28 Fixed k and l
Detail of the loop: start with the initial matrix → find good groups for fixed k and l (swaps) → choose better values for k and l → final cross-associations; the swaps lower the encoding cost.

29 Experiments
"Caveman" graph with Zipfian cave sizes, no noise: found k = 5 row groups, l = 5 col groups.

30 Aim
Given any binary matrix, a "good" cross-association (e.g., one with k = 5 row groups and l = 5 col groups) will have low cost. But how can we find such a cross-association?

31 Main Idea
Good Compression implies Better Clustering, so: minimize the total cost.
Total Encoding Cost = Σ_i size_i · H(p_i)   [Code Cost]  +  cost of describing the cross-associations   [Description Cost]
where size_i is the number of cells in block i and p_i is its density of ones.

32 Main Idea
Good Compression implies Better Clustering.
How well does a cross-association compress the matrix?
- Encode the matrix in a lossless fashion
- Compute the encoding cost
- Low encoding cost → good compression → good clustering
(A small worked check follows.)
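A tiny worked check of this idea, using the total_code_cost sketch from slide 18's loop (our own toy data): on an 8×8 two-cave matrix, the planted 2×2 grouping compresses losslessly to 0 bits of Code Cost, while the trivial single-group encoding costs 64 bits.

    A = np.zeros((8, 8), dtype=int)
    A[:4, :4] = 1
    A[4:, 4:] = 1
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    total_code_cost(A, np.zeros(8, int), np.zeros(8, int), 1, 1)  # 64.0 bits
    total_code_cost(A, labels, labels, 2, 2)                      # 0.0 bits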

