1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links Xiaoxin Yin, Jiawei Han Univ. of Illinois at Urbana-Champaign Philip S. Yu IBM T.J. Watson.

1 LinkClus: Efficient Clustering via Heterogeneous Semantic Links
Xiaoxin Yin, Jiawei Han (Univ. of Illinois at Urbana-Champaign)
Philip S. Yu (IBM T.J. Watson Research Center)

2 A Motivating Example
[Figure: linked objects of three types. Authors: Tom, Mike, Cathy, John, Mary. Proceedings: sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05. Conferences: sigmod, vldb, aaai.]
Questions:
Q1: How to cluster objects of each type?
Q2: How to define similarity between objects of each type?

3 Link-based Similarities
Two objects are similar if they are linked with similar objects.
[Figure: link graph among authors (Tom, Mike, Cathy, John), proceedings (sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05), and conferences (sigmod, vldb)]
SimRank (Jeh & Widom, 2002):
– The similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y.
– Very expensive to compute: for a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities.
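
The averaging rule above can be sketched as a naive iteration. This is only an illustration, not the authors' code: it assumes an undirected link graph given as a hypothetical `neighbors` dictionary, and it includes the decay factor `c` from the original SimRank formulation, which the slide does not mention. The tiny author/proceedings graph is made up for the example.

```python
from itertools import product


def simrank(neighbors, c=0.8, iterations=5):
    """Naive SimRank on an undirected link graph (illustrative sketch).

    neighbors: dict mapping each object to the list of objects it links to.
    c: decay factor (an assumption taken from the SimRank formulation).
    """
    nodes = list(neighbors)
    # Every object is fully similar to itself and, initially, to nothing else.
    sim = {(x, y): 1.0 if x == y else 0.0 for x in nodes for y in nodes}
    for _ in range(iterations):
        new_sim = {}
        for x, y in product(nodes, repeat=2):
            if x == y:
                new_sim[(x, y)] = 1.0
                continue
            nx, ny = neighbors[x], neighbors[y]
            if not nx or not ny:
                new_sim[(x, y)] = 0.0
                continue
            # Average similarity over all pairs of linked objects.
            total = sum(sim[(a, b)] for a in nx for b in ny)
            new_sim[(x, y)] = c * total / (len(nx) * len(ny))
        sim = new_sim
    return sim


# Made-up toy graph: two authors share a proceedings, a third does not.
neighbors = {
    "Tom": ["sigmod03"],
    "Mike": ["sigmod03"],
    "Mary": ["aaai04"],
    "sigmod03": ["Tom", "Mike"],
    "aaai04": ["Mary"],
}
sim = simrank(neighbors)
```

The dictionary holds one entry per pair of objects and every update visits every pair of neighbors, which is where the quadratic space and time costs come from.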

4 Observation 1: Hierarchical Structures
Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals).
[Figure: A hierarchical structure of products in Walmart, with nodes All; electronics, grocery, apparel; DVD, camera, TV]
[Figure: Relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004)]

5 Observation 2: Distribution of Similarity
A power-law distribution exists in similarities:
– 56% of similarity entries are in [0.005, 0.015]
– 1.4% of similarity entries are larger than 0.1
Our goal: design a data structure that stores the significant similarities and compresses the insignificant ones.
[Figure: Distribution of SimRank similarities among DBLP authors]

6 Our Data Structure: SimTree
Each leaf node represents an object.
Each non-leaf node represents a group of similar lower-level nodes.
Similarities between siblings are stored.
[Figure: example SimTree with nodes Consumer electronics, Apparels, Digital Cameras, TVs, Canon A40 digital camera, Sony V3 digital camera]
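
One possible in-memory layout for such a tree is sketched below. The class name `SimNode`, its methods, the arrangement of the example nodes, and the similarity values are all assumptions made for illustration; the slide only states that leaves are objects, inner nodes are groups, and sibling similarities are stored.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass(eq=False)
class SimNode:
    """A SimTree node: a leaf is an object, an inner node is a group."""
    name: str
    parent: Optional["SimNode"] = field(default=None, repr=False)
    children: List["SimNode"] = field(default_factory=list)
    # Similarities between pairs of this node's children (i.e. siblings).
    child_sim: Dict[FrozenSet[str], float] = field(default_factory=dict)

    def add_child(self, child: "SimNode") -> "SimNode":
        child.parent = self
        self.children.append(child)
        return child

    def set_sibling_sim(self, a: "SimNode", b: "SimNode", value: float) -> None:
        if a.parent is not self or b.parent is not self:
            raise ValueError("both nodes must be children of this node")
        self.child_sim[frozenset((a.name, b.name))] = value

    def sibling_sim(self, a: "SimNode", b: "SimNode") -> float:
        # Pairs without a stored value default to 0.0 in this sketch.
        return self.child_sim.get(frozenset((a.name, b.name)), 0.0)

    def leaves(self) -> List["SimNode"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Node names come from the slide; the hierarchy and numbers are made up.
root = SimNode("Consumer electronics")
cameras = root.add_child(SimNode("Digital Cameras"))
tvs = root.add_child(SimNode("TVs"))
canon = cameras.add_child(SimNode("Canon A40 digital camera"))
sony = cameras.add_child(SimNode("Sony V3 digital camera"))
root.set_sibling_sim(cameras, tvs, 0.3)
cameras.set_sibling_sim(canon, sony, 0.9)
```

Because only sibling pairs carry a stored value, the number of stored similarities grows with the number of nodes and the fan-out rather than with the number of object pairs.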

7 Similarity Defined by a SimTree
Path-based node similarity:
sim_p(n_7, n_8) = s(n_7, n_4) × s(n_4, n_5) × s(n_5, n_8)
Similarity between two nodes is the average similarity between the nodes linked with them in other SimTrees.
Adjustment ratio for x = (average similarity between x and all other nodes) / (average similarity between x’s parent and all other nodes)
[Figure: SimTree over nodes n_1 to n_9. Edge values shown: 0.9, 1.0, 0.9, 0.8, 0.2, 0.3, 0.8, 0.9. Callouts mark the similarity between two sibling nodes n_1 and n_2 and the adjustment ratio for node n_7.]
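
A minimal sketch of the path-based product follows. It assumes both nodes sit at the same level of one tree, and it uses three hypothetical dictionaries (`parent`, `ratio`, `sibling_sim`). The numeric values are illustrative only, because the transcript does not say which edge each figure value belongs to.

```python
from math import isclose


def path_similarity(a, b, parent, ratio, sibling_sim):
    """Multiply the values along the tree path between nodes a and b.

    parent: node -> parent node
    ratio: node -> adjustment ratio s(node, parent)
    sibling_sim: frozenset({u, v}) -> stored similarity of siblings u and v
    Assumes a and b are at the same depth and share an ancestor.
    """
    if a == b:
        return 1.0
    result = 1.0
    # Climb both sides until the two current nodes are siblings.
    while parent[a] != parent[b]:
        result *= ratio[a] * ratio[b]
        a, b = parent[a], parent[b]
    return result * sibling_sim[frozenset((a, b))]


# Illustrative values; "p" is a hypothetical shared parent of n4 and n5.
parent = {"n7": "n4", "n8": "n5", "n4": "p", "n5": "p"}
ratio = {"n7": 0.8, "n8": 0.9}
sibling_sim = {frozenset(("n4", "n5")): 0.2}

value = path_similarity("n7", "n8", parent, ratio, sibling_sim)
```

With these made-up numbers the product is 0.8 × 0.2 × 0.9, which mirrors the three-factor form of sim_p(n_7, n_8).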

8 Overview of LinkClus
Initialize a SimTree for objects of each type.
Repeat:
– For each SimTree, update the similarities between its nodes using similarities in other SimTrees. The similarity between two nodes x and y is the average similarity between the objects linked with them.
– Adjust the structure of each SimTree: assign each node to the parent node that it is most similar to.

9 Initialization of SimTrees
The “SimTrees” before initialization:
– Each leaf node has similarity 1 to itself and 0 to others
Initializing a SimTree:
– Repeatedly find groups of tightly related nodes, which are merged into a higher-level node
[Figure: leaf nodes 10 to 24 of ST_2 and leaf nodes l to y of ST_1]

10 Initialization of SimTrees (continued)
Tightness of a group of nodes:
– For a group of nodes {n_1, …, n_k}, its tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of {n_1, …, n_k}
[Figure: nodes n_1 and n_2 linked to leaf nodes 1 to 5 in another SimTree; the tightness of {n_1, n_2} is 3]
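
The tightness definition amounts to a set intersection, as in the sketch below. The function name `tightness` and the `links` dictionary are hypothetical, and the specific links are assumed so that the result matches the slide's value of 3 (the transcript does not record which leaves each node is linked to).

```python
def tightness(group, links):
    """Count leaf nodes in the other tree that are linked to every node in group.

    links: node -> set of leaf nodes (in another SimTree) it is connected to.
    """
    neighbor_sets = [links[node] for node in group]
    return len(set.intersection(*neighbor_sets))


# Assumed links, chosen so that {n1, n2} share exactly three leaves.
links = {
    "n1": {1, 2, 3, 4},
    "n2": {2, 3, 4, 5},
}
shared = tightness(["n1", "n2"], links)
```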

11 Initialization of SimTrees (continued)
Finding tight groups is reduced to frequent pattern mining: the tightness of a group of nodes is the support of a frequent pattern.
Procedure of initializing a tree:
– Start from leaf nodes (level 0)
– At each level l, find non-overlapping groups of similar nodes with frequent pattern mining
[Figure: nodes n_1 to n_4 linked to leaf nodes 1 to 9, giving the transactions {n1}, {n1, n2}, {n2}, {n1, n2}, {n2, n3, n4}, {n4}, {n3, n4}, from which the groups g_1 and g_2 are found]
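
The reduction can be illustrated with a deliberately simple stand-in for a frequent pattern miner: count the support of every node pair across the transactions, then greedily pick non-overlapping pairs. The pair-only restriction, the greedy selection, and the `min_support` threshold are assumptions of this sketch, not the procedure from the talk. The transactions are the seven that survive in the transcript.

```python
from collections import Counter
from itertools import combinations


def pair_supports(transactions):
    """Support of each node pair: how many transactions contain both nodes."""
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return counts


def greedy_groups(transactions, min_support=2):
    """Pick non-overlapping pairs in order of decreasing support."""
    counts = pair_supports(transactions)
    used, groups = set(), []
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for pair, support in ranked:
        if support >= min_support and used.isdisjoint(pair):
            groups.append(pair)
            used.update(pair)
    return groups


# One transaction per leaf node of the other tree: the nodes linked to it.
transactions = [
    {"n1"}, {"n1", "n2"}, {"n2"}, {"n1", "n2"},
    {"n2", "n3", "n4"}, {"n4"}, {"n3", "n4"},
]
groups = greedy_groups(transactions)
```

Here the support of a pair is exactly its tightness, and the two selected pairs play the role of the groups g_1 and g_2 in the figure.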

12 Updating Similarities Between Nodes
The initial similarities can seldom capture the relationships between objects.
Iteratively update similarities:
– Similarity between two nodes is the average similarity between the objects linked with them
[Figure: ST_1 with nodes a to z and ST_2 with nodes 0 to 24; n_a is linked to n_10, n_11, n_12 and n_b to n_13, n_14]
sim(n_a, n_b) = average similarity between {n_10, n_11, n_12} and {n_13, n_14}, which takes O(3 × 2) time

13 Aggregation-based Similarity Computation
[Figure: n_a in ST_1 is linked to n_10, n_11, n_12 (under n_4 in ST_2) and n_b to n_13, n_14 (under n_5); values shown: 0.2, 0.9, 1.0, 0.8, 0.9, 1.0]
For each node n_k ∈ {n_10, n_11, n_12} and n_l ∈ {n_13, n_14}, their path-based similarity is sim_p(n_k, n_l) = s(n_k, n_4) · s(n_4, n_5) · s(n_5, n_l).
After aggregation, we reduce quadratic-time computation to linear-time computation: sim(n_a, n_b) takes O(3 + 2) time.
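
The saving comes from factoring the shared middle term out of the average, which the sketch below checks numerically. The values are the ones printed on the slide; which leaf carries which value is inferred from their order, though the result does not depend on that assignment. The function names are hypothetical.

```python
from itertools import product
from math import isclose

# Values from the slide; the per-leaf assignment is inferred from their order.
s_to_n4 = {"n10": 0.9, "n11": 1.0, "n12": 0.8}   # s(n_k, n_4)
s_to_n5 = {"n13": 0.9, "n14": 1.0}               # s(n_5, n_l)
s_n4_n5 = 0.2                                    # s(n_4, n_5)


def pairwise_average(left, right, middle):
    """Average of s(k, n4) * s(n4, n5) * s(n5, l) over all 3 x 2 pairs."""
    pairs = product(left.values(), right.values())
    total = sum(l * middle * r for l, r in pairs)
    return total / (len(left) * len(right))


def aggregated_average(left, right, middle):
    """Same quantity from two one-pass averages: 3 + 2 terms."""
    avg_left = sum(left.values()) / len(left)
    avg_right = sum(right.values()) / len(right)
    return avg_left * middle * avg_right


slow = pairwise_average(s_to_n4, s_to_n5, s_n4_n5)
fast = aggregated_average(s_to_n4, s_to_n5, s_n4_n5)
```

Both routes give the same number because the sum over pairs of a product factors into a product of sums when the middle term is constant.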

14 Simweights of Linkages
Simweight between nodes n_a and n_4: the average similarity and total weight of the linkages between them.
– n_a has a linkage of weight 1 and similarity 1 to each leaf node it is linked with: a:(1,1), b:(1,1)
– simweight(n_a, n_4) = ((0.9 + 1.0 + 0.8) / 3, 3), where the first component is the weighted average similarity of the linkages between n_a and the children of n_4, and the second is the total weight of those linkages
[Figure: same example as the previous slide, annotated with the simweights a:(0.9,3) at n_4 and b:(0.95,2) at n_5]

15 Computing Similarity with Simweights
To compute sim(n_a, n_b):
– Find all pairs of sibling nodes n_i and n_j such that n_a is linked with n_i and n_b with n_j
– Calculate the similarity (and weight) between n_a and n_b w.r.t. n_i and n_j
– Calculate the weighted average similarity between n_a and n_b w.r.t. all such pairs
sim(n_a, n_b) = simweight(n_a, n_4).sim × s(n_4, n_5) × simweight(n_b, n_5).sim = 0.9 × 0.2 × 0.95 = 0.171
sim(n_a, n_b) can be computed from aggregated similarities.
[Figure: n_a and n_b with simweights a:(0.9,3) at n_4 and b:(0.95,2) at n_5, and s(n_4, n_5) = 0.2]
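
The three steps can be sketched as a weighted average over sibling pairs. Weighting each pair by the product of the two linkage weights is an assumption of this sketch: it is what makes the result equal to a plain average over the underlying leaf pairs. Pairs with no stored sibling similarity are simply skipped here, and the function and variable names are hypothetical.

```python
from math import isclose


def similarity_from_simweights(sw_a, sw_b, sibling_sim):
    """Weighted average similarity between two nodes from their simweights.

    sw_a, sw_b: node -> (average similarity, total weight) for n_a and n_b.
    sibling_sim: frozenset({n_i, n_j}) -> stored similarity of the siblings.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for n_i, (sim_a, w_a) in sw_a.items():
        for n_j, (sim_b, w_b) in sw_b.items():
            s_ij = sibling_sim.get(frozenset((n_i, n_j)))
            if s_ij is None:
                continue  # not a stored sibling pair; ignored in this sketch
            pair_weight = w_a * w_b  # assumed: number of underlying leaf pairs
            weighted_sum += pair_weight * sim_a * s_ij * sim_b
            total_weight += pair_weight
    return weighted_sum / total_weight if total_weight else 0.0


# The slide's example: a:(0.9, 3) at n4, b:(0.95, 2) at n5, s(n4, n5) = 0.2.
sw_a = {"n4": (0.9, 3)}
sw_b = {"n5": (0.95, 2)}
sibling_sim = {frozenset(("n4", "n5")): 0.2}

result = similarity_from_simweights(sw_a, sw_b, sibling_sim)
```

With a single sibling pair the weights cancel and the result reduces to the slide's product 0.9 × 0.2 × 0.95.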

16 Adjusting SimTree Structures
After similarities change, the tree structure also needs to be changed:
– If a node is more similar to its parent’s sibling, then move it to be a child of that sibling
– Try to move each node to the parent’s sibling that it is most similar to, under the constraint that each parent node can have at most c children
[Figure: SimTree with nodes n_1 to n_9; n_7 is shown in two positions, with edge values 0.8 and 0.9]
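
One simple way to respect the capacity constraint is a greedy pass, sketched below. The processing order (most confident child first), the function name, and all similarity values except n_7's 0.8 and 0.9 are assumptions; the talk does not spell out how conflicts over a full parent are resolved.

```python
def reassign_children(sim_to_parent, capacity):
    """Greedily give each child its most similar candidate parent with room.

    sim_to_parent: child -> {candidate parent: similarity}
    capacity: maximum number of children per parent (the slide's c).
    Returns child -> chosen parent; a child stays unassigned if all are full.
    """
    load = {}
    assignment = {}
    # Children with the strongest preference choose first (an assumption).
    order = sorted(sim_to_parent, key=lambda ch: -max(sim_to_parent[ch].values()))
    for child in order:
        ranked = sorted(sim_to_parent[child].items(), key=lambda item: -item[1])
        for candidate, _ in ranked:
            if load.get(candidate, 0) < capacity:
                assignment[child] = candidate
                load[candidate] = load.get(candidate, 0) + 1
                break
    return assignment


# n7's values echo the figure; the other numbers are made up.
sim_to_parent = {
    "n7": {"n4": 0.8, "n5": 0.9},
    "n8": {"n4": 0.1, "n5": 0.95},
    "n9": {"n4": 0.2, "n5": 0.85},
}
assignment = reassign_children(sim_to_parent, capacity=2)
```

In this toy run n_7 moves under n_5 because 0.9 exceeds 0.8, while n_9 falls back to n_4 once n_5 already has two children.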

17 Complexity
For two types of objects, N in each, and M linkages between them:

                             Time             Space
Updating similarities        O(M (log N)^2)   O(M + N)
Adjusting tree structures    O(N)             O(N)
LinkClus                     O(M (log N)^2)   O(M + N)
SimRank                      O(M^2)           O(N^2)

18 Empirical Study
Generating clusters using a SimTree:
– Suppose K clusters are to be generated
– Find the level in the SimTree whose number of nodes is closest to K
– Merge the most similar nodes or divide the largest nodes on that level to get K clusters
Accuracy:
– Measured with manually labeled data
– Accuracy of clustering: percentage of pairs of objects in the same cluster that share a common label
Efficiency and scalability:
– Scalability w.r.t. number of objects, clusters, and linkages
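
The accuracy measure can be written down directly, as in the sketch below. It assumes each labeled object may carry several labels (stored as a set) and that unlabeled objects are left out of the count; the names and the toy clustering are made up.

```python
from itertools import combinations


def pairwise_accuracy(cluster_of, labels_of):
    """Fraction of same-cluster pairs of labeled objects sharing a label.

    cluster_of: object -> cluster id
    labels_of: object -> set of manually assigned labels
    """
    same_cluster = 0
    share_label = 0
    for a, b in combinations(sorted(labels_of), 2):
        if cluster_of[a] != cluster_of[b]:
            continue
        same_cluster += 1
        if labels_of[a] & labels_of[b]:
            share_label += 1
    return share_label / same_cluster if same_cluster else 0.0


# Made-up clustering and labels for illustration.
cluster_of = {"Tom": 0, "Mike": 0, "Cathy": 0, "John": 0, "Mary": 1}
labels_of = {
    "Tom": {"db"},
    "Mike": {"db"},
    "Cathy": {"db", "ai"},
    "John": {"ai"},
    "Mary": {"ai"},
}
accuracy = pairwise_accuracy(cluster_of, labels_of)
```

In this toy case cluster 0 contains six pairs, four of which share a label, so the measure is 4/6.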

19 Approaches in Comparison
SimRank (Jeh & Widom, KDD 2002)
– Computes pair-wise similarities
Pruned-SimRank (P-SimRank)
– Only computes similarities between objects that are linked to the same object
SimRank with FingerPrints (F-SimRank) (Fogaras & Rácz, WWW 2005)
– Pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity
ReCom (Wang et al., SIGIR 2003)
– Iteratively clusters objects using cluster labels of linked objects

20 DBLP Dataset
We use the 4170 most productive authors and 154 well-known conferences with the most proceedings.
– Manually labeled research areas of the 400 most productive authors according to their home pages (or publications)
– Manually labeled areas of the 154 conferences according to their calls for papers
[Figure: DBLP schema with relations Authors, Publishes, Publications, Proceedings, and Conferences; attributes shown: author-id, author-name, email, paper-id, title, proc-id, conference, year, location, publisher]

21 Accuracy

Approach     Accr-Author   Accr-Conf   Average time
LinkClus     0.957         0.723       76.7
SimRank      0.958         0.760       1020
ReCom        0.907         0.457       43.1
F-SimRank    0.908         0.583       83.6

22 Accuracy (continued)
Accuracy vs. running time:
– LinkClus is almost as accurate as SimRank (the most accurate), and is much more efficient

23 Email Dataset
F. Nielsen. Email dataset. http://www.imm.dtu.dk/~rem/data/Email-1431.zip
370 emails on conferences, 272 on jobs, and 789 spam emails

Approach     Accuracy   Total time (sec)
LinkClus     0.8026     1579.6
SimRank      0.7965     39160
ReCom        0.5711     74.6
F-SimRank    0.3688     479.7
CLARANS      0.4768     8.55

24 Scalability (1)
Tested on synthetic datasets, with randomly generated clusters.
Scalability w.r.t. number of objects:
– Number of clusters is fixed (40)

25 Scalability (2)
Scalability w.r.t. number of objects and clusters:
– Each cluster has a fixed size (100 objects)

26 Scalability (3)
Scalability w.r.t. number of linkages from each object

27 Conclusions
With our data structure SimTree, LinkClus can compress the pair-wise similarities while achieving high accuracy.
Experimental results show that LinkClus is a highly accurate and scalable approach for clustering multi-typed linked objects.

28 Thank you
Questions and comments

