1
1 Cluster Ranking with an Application to Mining Mailbox Networks. Ziv Bar-Yossef (Technion, Google), Ido Guy (Technion, IBM), Ronny Lempel (IBM), Yoelle Maarek (Google), Vova Soroka (IBM)
2
2 Clustering. A network: an undirected graph with non-negative edge weights. w(u,v): “similarity” between u and v. The weights do not necessarily correspond to a proper metric; the induced distance may violate the triangle inequality. Examples: Social networks: w(u,v) = strength of the relationship between u and v. Biological networks: w(u,v) = genetic similarity between species u and v. Document networks: w(u,v) = topical similarity between u and v. Image networks: w(u,v) = color similarity/proximity between u and v. Clustering: partitioning of the network into regions of similarity, e.g. communities in social networks, species families in biological networks, groups of documents on the same topic, segments of an image.
3
3 The cluster abundance problem. Problem: clustering algorithms sometimes produce masses of clusters, e.g. on large networks or with fuzzy/soft clustering. Needle-in-a-haystack problem: which are the important clusters?
4
4 Cluster ranking Goals: Define a cluster strength measure Assigns a strength score to each subset of nodes Design cluster ranking algorithm Outputs the clusters in the network, ordered by their strength
5
5 A simple example. strength(C) = |C| if C is a clique; strength(C) = 0 if C is not a clique. Cluster ranking: {a,b,c}, {d,e,f}, {c,g}, {g,f}. [Figure: example network on nodes a, b, c, d, e, f, g.]
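The toy measure above can be tried out with a brute-force sketch. The edge list is an assumption: it contains exactly the edges implied by the four listed clusters, since the slide's figure is not part of the transcript. The names `strength` and `rank_cliques` are illustrative, and only maximal cliques are kept, mirroring the listed ranking.

```python
from itertools import combinations

# Assumed edges: exactly those implied by the clusters listed on the slide.
EDGES = {frozenset(e) for e in ["ab", "ac", "bc", "de", "df", "ef", "cg", "gf"]}
NODES = sorted({v for e in EDGES for v in e})

def strength(cluster):
    """Return |C| if C is a clique, 0 otherwise."""
    is_clique = all(frozenset(p) in EDGES for p in combinations(cluster, 2))
    return len(cluster) if is_clique else 0

def rank_cliques():
    """Rank maximal cliques of size >= 2 by strength, strongest first."""
    cliques = [frozenset(c)
               for k in range(2, len(NODES) + 1)
               for c in combinations(NODES, k)
               if strength(c) > 0]
    # Keep a clique only if no other clique strictly contains it.
    maximal = [c for c in cliques if not any(c < d for d in cliques)]
    return sorted(maximal, key=lambda c: (-strength(c), sorted(c)))

ranking = rank_cliques()
for c in ranking:
    print(sorted(c), strength(c))
```

Enumerating all subsets is exponential, so this is only meant for networks of a handful of nodes.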
6
6 Our contributions Cluster ranking framework New cluster strength measure Properly captures similarity among cluster members Applicable to both weighted and unweighted networks Arbitrary similarity weights Efficiently computable Cluster ranking algorithm Application to mining communities in “personal mailbox networks”
7
7 Cluster strength measure: Unweighted networks. Which is a stronger cluster? [Figure: two unweighted networks, G1 and G2.] Cohesion = measure of strength for unweighted clusters. Cohesive cluster = does not “easily” break into pieces.
8
8 Edge separators. Edge separator: a subset of the network’s edges whose removal breaks the network into two or more connected components. All previous work: cohesion(C) = “density” of the “sparsest” edge separator. Different notions of density for edge separators: conductance [KannanVempalaVetta00], normalized cut [ShiMalik00], relative neighborhoods [FlakeLawrenceGiles00], edge betweenness [GirvanNewman02], modularity [GirvanNewman04].
9
9 Edge separators are not good enough. True: sparse edge separator ⇒ noncohesive cluster. False: no sparse edge separator ⇒ cohesive cluster. [Figure: counterexample involving a clique of size m and vertices u and v.]
10
10 Vertex separators. Vertex separator: a subset of the network’s vertices whose removal breaks the network into two or more connected components. Our strength measure: cohesion(C) = “density” of the “sparsest” vertex separator. A separator S that splits the network into parts A and B is “sparse” if S is small and A, B are “balanced”. [Figure: separator S between parts A and B.]
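A minimal brute-force sketch of vertex-separator cohesion follows. The slide does not give a formula for separator density, so the sparsity |S| / (min(|A|, |B|) + |S|) used here is an assumption, as is the convention that a network with no separator (a clique) has cohesion 1. The function name `cohesion` is illustrative.

```python
from itertools import product

def cohesion(nodes, edges):
    """Minimum sparsity over all vertex separators (S, A, B).

    Assumed sparsity: |S| / (min(|A|, |B|) + |S|).
    Assumed convention: cohesion is 1 when no separator exists (cliques).
    """
    nodes = sorted(nodes)
    best = 1.0
    # Label every node as separator (S) or as one of the two sides (A, B).
    for labels in product("SAB", repeat=len(nodes)):
        a, b, s = labels.count("A"), labels.count("B"), labels.count("S")
        if a == 0 or b == 0:
            continue
        side = dict(zip(nodes, labels))
        # S separates A from B only if no edge joins the two sides.
        if any({side[u], side[v]} == {"A", "B"} for u, v in edges):
            continue
        best = min(best, s / (min(a, b) + s))
    return best

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
path = [("a", "b"), ("b", "c")]
print(cohesion("abc", triangle))  # clique: no separator exists
print(cohesion("abc", path))      # S = {b} separates a from c
```

The 3^n enumeration is only practical for very small networks; the deck itself notes that finding the sparsest vertex separator is NP-hard.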
11
11 Vertex separators are better. Sparse edge separator ⇒ sparse vertex separator ⇒ noncohesive cluster. Sparse vertex separator ⇒ noncohesive cluster. [Figure: the example with a clique of size m and vertices u and v.]
12
12 Cluster strength measure: Weighted networks. Which is a stronger cluster? [Figure: two weighted networks, G1 and G2, with edge weights 1 and 10.] Cohesion is no longer the sole factor determining cluster strength.
13
13 Thresholding. Traditional approach for dealing with weighted networks: transform the weighted network into an unweighted network G_T by applying a threshold T. [Figure: G1, G2, and their thresholded versions G_T for T < 1 and for 1 ≤ T < 5.] No threshold is suitable.
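Thresholding can be sketched in a few lines. The strict comparison (keep an edge only if its weight exceeds T) is an assumption inferred from the ranges "T < 1" and "1 ≤ T < 5" on the slide; the function name `threshold` and the example weights are hypothetical.

```python
def threshold(weighted_edges, t):
    """Return the unweighted edge set of G_T: edges whose weight exceeds t."""
    return {frozenset(edge) for edge, w in weighted_edges.items() if w > t}

# Hypothetical weighted triangle: one strong edge and two weak ones.
example = {("a", "b"): 5, ("b", "c"): 1, ("a", "c"): 1}

print(threshold(example, 0))  # all three edges survive
print(threshold(example, 1))  # only the weight-5 edge survives
print(threshold(example, 5))  # no edge survives
```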
14
14 Integrated cohesion. Which is a stronger cluster? Small T ⇒ G1 is stronger; large T ⇒ G2 is stronger. Integrated cohesion: the area under the curve of cohesion(G_T) as a function of T. A strong cluster sustains high cohesion while the threshold increases. [Figure: cohesion(G_T) plotted against T for G1 and G2.]
15
15 C-Rank - Cluster Ranking Algorithm Candidate identification Ranking by strength score Elimination of non-maximal clusters
16
16 Candidate identification: Unweighted networks. Given an unweighted network G: find a sparse vertex separator S of G; the network splits into disconnected components A_1,…,A_k; clusters = S ∪ A_1,…,S ∪ A_k; recurse on S ∪ A_1,…,S ∪ A_k. [Figure: separator S surrounded by components A_1,…,A_5.]
17
17 Candidate identification - Example. Sparse separator: S = {c,d}. Connected components: A_1 = {a,b}, A_2 = {e}. Add back {c,d} to A_1 and A_2. [Figure: network on nodes a, b, c, d, e with components A_1 and A_2 marked.]
18
18 Candidate identification - Example. Sparse separator: S = {c,d}. Connected components: A_1 = {a,b}, A_2 = {e}. Add back {c,d} to A_1 and A_2. Since both resulting clusters are cliques, no recursive calls are made. [Figure: the clusters S ∪ A_1 and S ∪ A_2.]
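The recursion in this example can be sketched as follows. This is not the authors' implementation: the separator is found by brute force rather than by vertex betweenness, the sparsity |S| / (min(|A|, |B|) + |S|) is an assumption, and the example edges (a clique on a, b, c, d plus e attached to c and d) are an assumption chosen to match the separator and components named on the slide. All function names are illustrative.

```python
from itertools import combinations, product

def is_clique(nodes, adj):
    return all(v in adj[u] for u, v in combinations(sorted(nodes), 2))

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    remaining, comps = set(nodes), []
    while remaining:
        start = remaining.pop()
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u] & remaining:
                remaining.discard(v)
                comp.add(v)
                stack.append(v)
        comps.append(comp)
    return comps

def sparsest_separator(nodes, adj):
    """Brute force over all (S, A, B) labelings; tiny networks only."""
    nodes = sorted(nodes)
    best_score, best_sep = None, None
    for labels in product("SAB", repeat=len(nodes)):
        a, b, s = labels.count("A"), labels.count("B"), labels.count("S")
        if a == 0 or b == 0:
            continue
        side = dict(zip(nodes, labels))
        # adj is symmetric, so checking A -> B covers both directions.
        if any(side[u] == "A" and side.get(v) == "B"
               for u in nodes for v in adj[u]):
            continue
        score = s / (min(a, b) + s)  # assumed sparsity measure
        if best_score is None or score < best_score:
            best_score = score
            best_sep = {u for u in nodes if side[u] == "S"}
    return best_sep

def candidates(nodes, adj):
    """Collect the clusters S ∪ A_i recursively; stop at cliques."""
    found = []
    def recurse(current):
        if is_clique(current, adj):
            return
        sep = sparsest_separator(current, adj)
        for comp in components(set(current) - sep, adj):
            cluster = frozenset(sep | comp)
            if cluster not in found:
                found.append(cluster)
                recurse(cluster)
    recurse(frozenset(nodes))
    return found

# Assumed example network: clique {a,b,c,d}, with e adjacent to c and d.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"},
       "c": {"a", "b", "d", "e"}, "d": {"a", "b", "c", "e"},
       "e": {"c", "d"}}
clusters = candidates(adj, adj)
print([sorted(c) for c in clusters])
```

Each cluster S ∪ A_i is strictly smaller than the network it came from, so the recursion terminates.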
19
19 Mailbox networks. Nodes: contacts appearing in the headers of messages in a person’s mailbox, excluding the mailbox owner. Edges: connect contacts who co-occur in the same message header. Edge weights: frequency of co-occurrence. This is an egocentric social network: it reflects the subjective perspective of the mailbox owner.
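The construction described above can be sketched as a co-occurrence count. The header format (a list of contact names per message), the function name `mailbox_network`, and the three sample headers are hypothetical; real mailboxes would need header parsing and address normalization first.

```python
from collections import Counter
from itertools import combinations

def mailbox_network(headers, owner):
    """Map each unordered contact pair to its co-occurrence count."""
    weights = Counter()
    for contacts in headers:
        # The mailbox owner is excluded from the network.
        others = sorted(set(contacts) - {owner})
        for pair in combinations(others, 2):
            weights[pair] += 1
    return weights

# Hypothetical headers: every contact named in each message.
headers = [
    ["a", "b", "c", "d", "owner"],
    ["c", "d", "e", "owner"],
    ["b", "owner"],
]
net = mailbox_network(headers, "owner")
for (u, v), w in sorted(net.items()):
    print(u, v, w)
```

Here the pair (c, d) co-occurs in two headers and gets weight 2, while the message between b and the owner adds no edge.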
20
20 Mining mailbox networks Motivation Advanced email client features Automatic group completion and correction Automatic group classification (colleagues, friends, spouse, etc.) Identification of “spam groups” and management of blocked lists Intelligence & law enforcement Mine mailboxes of suspected terrorists and criminals Our Goal Given: A mailbox network G Output: A ranking of communities in G
21
21 Ziv Bar-Yossef’s top 10 communities
Description | Member IDs | Weight | Rank
grad student + co-advisor | 1,2 | 163 | 1
FOCS program committee | 3-19 | 41 | 2
old car pool | 20,21,22,23,24 | 39.2 | 3
new car pool | 20,21,22,23,24,25 | 28.5 | 4
colleagues | 26,27 | 28 | 5
colleagues | 28,29 | 28 | 6
colleagues | 26,30,31 | 25 | 7
department committee | 32,33,34 | 19 | 8
jokes forwarding group | 35-53 | 15.9 | 9
reading group | 54-67 | 15 | 10
22
22 Experiments. Enron Email Dataset (http://www.cs.cmu.edu/~enron/): made publicly available during the investigation of the Enron fraud; ~150 mailboxes of Enron employees; more than 500,000 messages. Compared with another clustering algorithm: EB-Rank, an adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework.
23
23 Relative recall
24
24 Score comparison
25
25 Conclusions The cluster ranking problem as a novel framework for clustering Integrated cohesion as a strength measure for overlapping clusters in weighted networks C-Rank: A new cluster ranking algorithm Application: mining mailbox networks
26
26 Thank You
27
27 Integrated cohesion. Which is a stronger cluster? Note: to compute the integral, we need G_T only for the values of T that equal the distinct edge weights. [Figure: cohesion(G_T) plotted against T for G1 and G2.]
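The note above suggests a simple way to compute the integral: cohesion(G_T) is a step function that can only change at the distinct edge weights, so the area is a finite sum. The sketch below rests on several assumptions that the slides do not state: edges are kept when their weight exceeds T, cohesion is the brute-force vertex-separator measure with sparsity |S| / (min(|A|, |B|) + |S|), cliques have cohesion 1, and the integral stops at the maximum weight (beyond it the network has no edges). Function names and example weights are illustrative.

```python
from itertools import product

def cohesion(nodes, edges):
    """Assumed measure: min of |S| / (min(|A|, |B|) + |S|); 1 for cliques."""
    nodes = sorted(nodes)
    best = 1.0
    for labels in product("SAB", repeat=len(nodes)):
        a, b, s = labels.count("A"), labels.count("B"), labels.count("S")
        if a == 0 or b == 0:
            continue
        side = dict(zip(nodes, labels))
        if any({side[u], side[v]} == {"A", "B"} for u, v in edges):
            continue
        best = min(best, s / (min(a, b) + s))
    return best

def integrated_cohesion(nodes, weighted_edges):
    """Area under the step function T -> cohesion(G_T)."""
    total, prev = 0.0, 0.0
    for w in sorted(set(weighted_edges.values())):
        # On [prev, w) the thresholded network is constant: weights > prev.
        kept = [e for e, ew in weighted_edges.items() if ew > prev]
        total += (w - prev) * cohesion(nodes, kept)
        prev = w
    return total

# Hypothetical weighted triangles.
uniform = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5}
uneven = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 2}
print(integrated_cohesion("abc", uniform))  # 5 * 1
print(integrated_cohesion("abc", uneven))   # 2 * 1 + 3 * 0.5
```

The uniform triangle stays a clique until T reaches 5, while the uneven one degrades to a path at T = 2, which is why it scores lower.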
28
28 Integrated cohesion - Example. [Figure: a weighted network G and the plot of cohesion(G_T) against T.] First step of the curve: cohesion = 1, contributing 3 to the integral.
29
29 Integrated cohesion - Example. Second step of the curve: cohesion = 0.667, contributing 2.333 to the integral.
30
30 Integrated cohesion - Example. Third step of the curve: cohesion = 0.333, contributing 1 to the integral. int_cohesion(G) = 3 + 2.333 + 1 = 6.333.
31
31 Cluster subsumption and maximality. C is maximal iff partitioning any superset of C into clusters leaves C intact. S = sparsest separator of C; (C_1, C_2): induced cover of C. S = sparsest separator of D; (D_1, D_2): induced cover of D. C_1 ⊆ D_1, C_2 ⊆ D_2 ⇒ D subsumes C ⇒ C is not maximal. [Figure: cluster C inside cluster D, with separator S and parts C_1, C_2, D_1, D_2.]
32
32 Candidate identification: Weighted networks. Apply a threshold T=0 on G. [Figure: weighted network G on nodes a, b, c, d, e.]
33
33 Candidate identification: Weighted networks. Unweighted candidate identification on G_0.
34
34 Candidate identification: Weighted networks. Recurse on ‘abcd’ and ‘cde’ separately.
35
35 Candidate identification: Weighted networks. Apply threshold T=2 on ‘abcd’.
36
36 Candidate identification: Weighted networks. Apply threshold T=2 on ‘abcd’. Recurse on ‘abc’. No recursive call on singleton ‘d’.
37
37 Candidate identification: Weighted networks. Apply threshold T=5 on ‘abc’.
38
38 Candidate identification: Weighted networks. Apply threshold T=5 on ‘abc’. No recursive call on singletons ‘a’, ‘b’, ‘c’.
39
39 Candidate identification: Weighted networks. Final candidate list: ‘abcde’, ‘abcd’, ‘abc’, ‘cde’. [Figure: the weighted network G.]
40
40 Computing sparse vertex separators. Complexity of Sparsest Vertex Separator: NP-hard. Can be approximated in polynomial time via Semidefinite Programming (SDP) [FeigeHajiaghayiLee05], but SDP might be inefficient in practice. We find sparse vertex separators via Vertex Betweenness [Freeman77]: efficiently computable via dynamic programming; works well empirically; in the worst case, the approximation can be weak.
42
42 Normalized Vertex Betweenness (NVB) [Freeman77]. Vertex Betweenness (VB) of a node v: number of shortest paths passing through v. Example (figure with a clique of size m and a vertex v): VB is ~m^2 for v and 0 for the other vertices. Normalized Vertex Betweenness (NVB) of a node: VB normalized to take values in [0,1]. NVB(G): maximum NVB value over all nodes. Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G)). In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G)).
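A small sketch of the quantities on this slide follows. Two choices are assumptions, since the transcript does not preserve the slide's formulas: betweenness counts, for each pair of other nodes, the fraction of their shortest paths that pass through v (Freeman's definition), and normalization divides by the number of such pairs, C(n-1, 2). The example network (two triangles sharing a vertex v) and all function names are hypothetical.

```python
from collections import deque
from math import comb

def shortest_path_counts(adj, s):
    """BFS from s: distance to, and number of shortest paths to, each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def vertex_betweenness(adj):
    nodes = sorted(adj)
    info = {s: shortest_path_counts(adj, s) for s in nodes}
    vb = dict.fromkeys(nodes, 0.0)
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are disconnected
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    vb[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return vb

def nvb(adj):
    """Maximum normalized betweenness; assumes at least 3 nodes."""
    n = len(adj)
    return max(vertex_betweenness(adj).values()) / comb(n - 1, 2)

# Hypothetical network: triangles {a,b,v} and {c,d,v} sharing vertex v.
adj = {"a": {"b", "v"}, "b": {"a", "v"}, "c": {"d", "v"},
       "d": {"c", "v"}, "v": {"a", "b", "c", "d"}}
vb = vertex_betweenness(adj)
bound = 1 / (1 + len(adj) * nvb(adj))
print(vb, nvb(adj), bound)
```

In this network all four cross-triangle pairs route through v, so v has the only nonzero betweenness, and removing v is also what splits the network.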
43
43 Candidate identification: Weighted networks. Ideal algorithm: iterate over all possible thresholds T and output all clusters in G_T. Somewhat inefficient. Actual algorithm: 1) Apply threshold T = min weight in G. 2) Output the clusters of G_T. 3) For each clique C in G_T, recurse on C.
44
44 C-Rank: Analysis Theorem: C-Rank is guaranteed to output all the maximal clusters. Lemma: C-Rank runs in time polynomial in its output length.
45
45 Mailbox networks. An egocentric social network that reflects the subjective perspective of the mailbox owner. Nodes: contacts appearing in message headers, excluding the mailbox owner. Edges: connect contacts who co-occur in the same message header. Edge weights: frequency of co-occurrence. Example messages: a → b, c, d, and owner; c → d, e, and owner. [Figure: the resulting network on nodes a, b, c, d with unit weights.]
46
46 Mailbox networks (example, continued). Messages: a → b, c, d, and owner; c → d, e, and owner; b → owner. [Figure: the network on nodes a, b, c, d, e with updated weights.]
47
47 Mailbox networks (example, continued). Messages: a → b, c, d, and owner; c → d, e, and owner; b → owner. [Figure: the network on nodes a, b, c, d, e with updated weights.]
48
48 Ido Guy’s top 10 communities
Description | Member IDs | Weight | Rank
project1 core team | 1,2 | 184 | 1
spouse | 3 | 87 | 2
advisor | 4 | 75 | 3
project2 core team | 1,5,6,7 | 70.3 | 4
former advisor | 8 | 62 | 5
project1 new team | 1,2,9,10,11,12 | 48.2 | 6
academic course staff | 13-25 | 46.9 | 7
project2 extended team (IBM) | 1,5,6,7,26-30 | 46.7 | 8
project1 old team | 1,2,9,10,31 | 42.3 | 9
project2 extended team (IBM+Lucent) | 1,5,6,7,26-30,32-35 | 41.3 | 10
49
49 Estimated precision