1 Cluster Ranking with an Application to Mining Mailbox Networks
Ziv Bar-Yossef (Technion, Google), Ido Guy (Technion, IBM), Ronny Lempel (IBM), Yoelle Maarek (Google), Vova Soroka (IBM)

2 Clustering
A network: undirected graph with non-negative edge weights
- w(u,v): “similarity” between u and v
- Weights do not necessarily correspond to a proper metric: the induced distance may not respect the triangle inequality
Examples:
- Social networks: w(u,v) = strength of relationship between u and v
- Biological networks: w(u,v) = genetic similarity between species u and v
- Document networks: w(u,v) = topical similarity between u and v
- Image networks: w(u,v) = color similarity/proximity between u and v
Clustering: partitioning of the network into regions of similarity
- Communities in social networks
- Species families in biological networks
- Groups of documents on the same topic
- Segments of an image

3 The cluster abundance problem
Problem: sometimes a clustering algorithm produces masses of clusters
- Large networks
- Fuzzy/soft clustering
Needle in a haystack problem: which are the important clusters?

4 Cluster ranking
Goals:
- Define a cluster strength measure: assigns a strength score to each subset of nodes
- Design a cluster ranking algorithm: outputs the clusters in the network, ordered by their strength

5 A simple example
strength(C) = |C| if C is a clique; strength(C) = 0 if C is not a clique.
Cluster ranking:
- {a,b,c}, {d,e,f}
- {c,g}, {g,f}
[Figure: example network on nodes a, b, c, d, e, f, g]
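The clique-based strength on this slide is easy to try out. The sketch below is only an illustration: it assumes the figure contains exactly the eight edges implied by the four listed clusters (the drawing itself is not part of the transcript), and the names `EDGES`, `strength` and `rank_clusters` are made up for the example. The brute-force enumeration is meant for tiny networks only.

```python
from itertools import combinations

# Assumed edge set: the edges implied by the clusters listed on the slide.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "g"), ("g", "f")]

def is_clique(nodes, adj):
    """True if every pair of the given nodes is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def strength(cluster, adj):
    """strength(C) = |C| if C is a clique, 0 otherwise."""
    return len(cluster) if is_clique(cluster, adj) else 0

def rank_clusters(edges):
    """Return the maximal cliques, strongest first (brute force)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    cliques = [frozenset(c)
               for k in range(2, len(nodes) + 1)
               for c in combinations(nodes, k)
               if is_clique(c, adj)]
    # Keep only cliques that are not contained in a larger clique.
    maximal = [c for c in cliques if not any(c < d for d in cliques)]
    return sorted(maximal, key=lambda c: (-strength(c, adj), sorted(c)))

for cluster in rank_clusters(EDGES):
    print(sorted(cluster), len(cluster))
```

Under the assumed edge set, the two triangles come first with strength 3, followed by the two edges through g with strength 2, which matches the ranking on the slide.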

6 Our contributions
- Cluster ranking framework
- New cluster strength measure
  - Properly captures similarity among cluster members
  - Applicable to both weighted and unweighted networks
  - Arbitrary similarity weights
  - Efficiently computable
- Cluster ranking algorithm
- Application to mining communities in “personal mailbox networks”

7 Cluster strength measure: Unweighted networks
Which is a stronger cluster?
Cohesion = measure of strength for unweighted clusters
- Cohesive cluster = does not “easily” break into pieces
[Figure: two example clusters, G1 and G2]

8 Edge separators
Edge separator: a subset of the network’s edges whose removal breaks the network into two or more connected components.
All previous work: cohesion(C) = “density” of “sparsest” edge separator
Different notions of density for edge separators:
- Conductance [KannanVempalaVetta00]
- Normalized cut [ShiMalik00]
- Relative neighborhoods [FlakeLawrenceGiles00]
- Edge betweenness [GirvanNewman02]
- Modularity [GirvanNewman04]

9 Edge separators are not good enough
True: sparse edge separator ⇒ noncohesive cluster
False: no sparse edge separator ⇒ cohesive cluster
[Figure: clique of size m, with labeled vertices u and v]

10 Vertex separators
Vertex separator: a subset of the network’s vertices whose removal breaks the network into two or more connected components.
Our strength measure: cohesion(C) = “density” of “sparsest” vertex separator
A separator S that splits the rest of the network into parts A and B is “sparse” if
- S is small
- A, B are “balanced”
[Figure: parts A and B joined through the separator S]
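The definition of a vertex separator can be checked mechanically. The following sketch assumes the five-node network used later in the candidate identification example (edges read off the worked example, so an assumption about the drawing); `components` and `is_vertex_separator` are illustrative names. It only tests whether a set separates the network and does not attempt to score sparsity or balance.

```python
def components(nodes, edges):
    """Connected components of the graph induced on `nodes`."""
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def is_vertex_separator(nodes, edges, S):
    """True if removing the vertices in S leaves two or more components."""
    return len(components(set(nodes) - set(S), edges)) >= 2

# Assumed example network: a 4-clique on a, b, c, d plus e attached to c and d.
NODES = "abcde"
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("c", "e"), ("d", "e")]

print(is_vertex_separator(NODES, EDGES, {"c", "d"}))  # separates {a,b} from {e}
print(is_vertex_separator(NODES, EDGES, {"c"}))       # a single vertex is not enough here
```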

11 Vertex separators are better
Sparse edge separator ⇒ sparse vertex separator ⇒ noncohesive cluster
Sparse vertex separator ⇒ noncohesive cluster
[Figure: clique of size m, with labeled vertices u and v]

12 Cluster strength measure: Weighted networks
Which is a stronger cluster?
Cohesion is no longer the sole factor determining cluster strength
[Figure: weighted clusters G1 and G2; edge-weight labels 10, 1, 1, 1]

13 Thresholding
Traditional approach for dealing with weighted networks
Transforms the weighted network G into an unweighted network G_T by a threshold T
- Threshold T < 1
- Threshold 1 ≤ T < 5
- No threshold is suitable
[Figure: G1 and G2, with edge-weight labels 1 and 5, and their thresholded versions G_T]
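Thresholding itself is a one-line filter. The sketch below assumes that G_T keeps exactly the edges whose weight is strictly greater than T, which is how the worked example later in the deck behaves (T=2 removes the weight-2 edges). The function name and the toy weights are hypothetical.

```python
def threshold(weighted_edges, T):
    """Unweighted edges of G_T: pairs whose weight is strictly above T (assumed rule)."""
    return [(u, v) for u, v, w in weighted_edges if w > T]

# Hypothetical toy network with one heavy and one light edge.
toy = [("a", "b", 5), ("b", "c", 1)]

print(threshold(toy, 0.5))  # both edges survive
print(threshold(toy, 1))    # only the weight-5 edge survives
print(threshold(toy, 5))    # nothing survives
```

The toy network shows the difficulty named on the slide: a low threshold treats the light and heavy edges alike, while a high threshold discards the light edge altogether.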

14 Integrated cohesion
Which is a stronger cluster?
- Small T: G1 is stronger
- Large T: G2 is stronger
Integrated cohesion: area under the curve
- Strong cluster: sustains high cohesion while increasing threshold
[Figure: Cohesion(G_T) plotted against the threshold T, for G1 and for G2]
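Written as a formula, “area under the curve” could read as follows. This is a sketch of the definition rather than a quotation from the talk: it assumes G_T denotes the network thresholded at T, writes w_max for the largest edge weight, and assumes that nothing is contributed beyond w_max because G_T has no edges there.

```latex
\mathrm{int\_cohesion}(G) \;=\; \int_{0}^{w_{\max}} \mathrm{cohesion}(G_T)\, dT
```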

15 C-Rank - Cluster Ranking Algorithm Candidate identification Ranking by strength score Elimination of non-maximal clusters

16 Candidate identification: Unweighted networks
Given an unweighted network G:
- Find a sparse vertex separator S of G
- Network splits into disconnected components A_1, …, A_k
- Clusters = S ∪ A_1, …, S ∪ A_k
- Recurse on S ∪ A_1, …, S ∪ A_k
[Figure: separator S surrounded by components A_1, …, A_5]

17 Candidate identification - Example
- Sparse separator: S = {c,d}
- Connected components: A_1 = {a,b}, A_2 = {e}
- Add back {c,d} to A_1 and A_2
[Figure: network on nodes a, b, c, d, e with components A_1 and A_2 marked]

18 Candidate identification - Example
- Sparse separator: S = {c,d}
- Connected components: A_1 = {a,b}, A_2 = {e}
- Add back {c,d} to A_1 and A_2
- Since both components are cliques, no recursive calls are made
[Figure: the two clusters S ∪ A_1 = {a,b,c,d} and S ∪ A_2 = {c,d,e}]

19 Mailbox networks
Nodes: contacts appearing in headers of messages in a person’s mailbox
- Excluding the mailbox owner
Edges: connect contacts who co-occur in the same message header
- Edge weights: frequency of co-occurrence
This is an egocentric social network
- Reflects the subjective perspective of the mailbox owner
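The construction above can be sketched in a few lines. The example assumes that each message is reduced to the set of addresses in its header (sender and recipients together) and that “frequency” means a raw count of shared messages; the function name `mailbox_network` and the three sample messages are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mailbox_network(headers, owner):
    """Map each unordered pair of contacts to the number of headers they share."""
    weights = Counter()
    for contacts in headers:
        # The mailbox owner is excluded from the network.
        others = sorted(set(contacts) - {owner})
        for u, v in combinations(others, 2):
            weights[(u, v)] += 1
    return weights

# Three hypothetical messages in the owner's mailbox.
headers = [
    ["a", "b", "c", "d", "owner"],
    ["c", "d", "e", "owner"],
    ["b", "owner"],
]

network = mailbox_network(headers, "owner")
for (u, v), w in sorted(network.items()):
    print(u, v, w)
```

In this sample, c and d appear together in two headers and so get weight 2, while the message between b and the owner alone adds no edge at all.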

20 Mining mailbox networks
Motivation
- Advanced email client features
  - Automatic group completion and correction
  - Automatic group classification (colleagues, friends, spouse, etc.)
  - Identification of “spam groups” and management of blocked lists
- Intelligence & law enforcement
  - Mine mailboxes of suspected terrorists and criminals
Our goal
- Given: a mailbox network G
- Output: a ranking of communities in G

21 Ziv Bar-Yossef’s top 10 communities
Description | Member IDs | Weight | Rank
grad student + co-advisor | 1,2 | 163 | 1
FOCS program committee | 3-19 | 41 | 2
old car pool | 20,21,22,23,24 | 39.2 | 3
new car pool | 20,21,22,23,24,25 | 28.5 | 4
colleagues | 26,27 | 28 | 5
colleagues | 28,29 | 28 | 6
colleagues | 26,30,31 | 25 | 7
department committee | 32,33,34 | 19 | 8
jokes forwarding group | 35-53 | 15.9 | 9
reading group | 54-67 | 15 | 10

22 Experiments
Enron Email Dataset (http://www.cs.cmu.edu/~enron/)
- Made publicly available during the investigation of the Enron fraud
- ~150 mailboxes of Enron employees
- More than 500,000 messages
Compared with another clustering algorithm
- EB-Rank: adaptation of the popular edge betweenness algorithm [GirvanNewman02] to our framework

23 Relative recall

24 Score comparison

25 Conclusions
- The cluster ranking problem as a novel framework for clustering
- Integrated cohesion as a strength measure for overlapping clusters in weighted networks
- C-Rank: a new cluster ranking algorithm
- Application: mining mailbox networks

26 Thank You

27 Integrated cohesion
Which is a stronger cluster?
- Note: to compute the integral, G_T is needed only for those T that equal one of the distinct edge weights
[Figure: Cohesion(G_T) plotted against the threshold T, for G1 and for G2]
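Because G_T changes only when T crosses an edge weight, the integral reduces to a finite sum over the distinct weights. The sketch below illustrates that observation under stated assumptions: weights are positive, G_T keeps edges with weight strictly above T, and the cohesion measure is passed in as a function. The `connected_indicator` used in the demonstration is a deliberately crude stand-in (1 if the thresholded network is connected, 0 otherwise) and is not the cohesion measure of the talk.

```python
def integrated_cohesion(nodes, weighted_edges, cohesion):
    """Area under T -> cohesion(G_T), summed over intervals between distinct weights."""
    breakpoints = sorted({w for _, _, w in weighted_edges})
    area, lo = 0.0, 0.0
    for hi in breakpoints:
        # G_T is the same graph for every T in [lo, hi).
        kept = [(u, v) for u, v, w in weighted_edges if w > lo]
        area += (hi - lo) * cohesion(nodes, kept)
        lo = hi
    return area

def connected_indicator(nodes, edges):
    """Stand-in cohesion (assumption): 1.0 if the graph is connected, else 0.0."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for y in adj[stack.pop()] - seen:
            seen.add(y)
            stack.append(y)
    return 1.0 if len(seen) == len(adj) else 0.0

# Hypothetical inputs: a uniform triangle and a path with one weak edge.
triangle = [("a", "b", 5), ("a", "c", 5), ("b", "c", 5)]
path = [("a", "b", 10), ("b", "c", 1)]

print(integrated_cohesion(["a", "b", "c"], triangle, connected_indicator))
print(integrated_cohesion(["a", "b", "c"], path, connected_indicator))
```

With the stand-in measure, the triangle stays connected until T reaches 5, whereas the path falls apart as soon as T reaches 1 despite its heavy edge, so the triangle accumulates the larger area.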

28 Integrated cohesion - Example
Cohesion = 1
[Figure: weighted example network G and the first step of the plot of Cohesion(G_T) against T; area 3]

29 Integrated cohesion - Example
Cohesion = 0.667
[Figure: thresholded network and the second step of the plot; area 2.333]

30 Integrated cohesion - Example
Cohesion = 0.333
[Figure: thresholded network and the third step of the plot; area 1]
int_cohesion(G) = 3 + 2.333 + 1 = 6.333

31 Cluster subsumption and maximality
C is maximal iff partitioning any superset of C into clusters leaves C intact.
- S = sparsest separator of C; (C_1, C_2): induced cover of C
- S = sparsest separator of D; (D_1, D_2): induced cover of D
- C_1 ⊆ D_1, C_2 ⊆ D_2
D subsumes C ⇒ C is not maximal
[Figure: cluster C inside cluster D, both split by the same separator S into C_1, C_2 and D_1, D_2]

32 Candidate identification: Weighted networks
Apply a threshold T=0 on G
[Figure: weighted network G on nodes a, b, c, d, e, with edge weights 2 and 5]

33 Candidate identification: Weighted networks
Unweighted candidate identification on G_0
[Figure: unweighted network G_0 on nodes a, b, c, d, e]

34 Candidate identification: Weighted networks
Recurse on ‘abcd’ and ‘cde’ separately
[Figure: the two clusters ‘abcd’ and ‘cde’]

35 Candidate identification: Weighted networks
Apply threshold T=2 on ‘abcd’
[Figure: ‘abcd’ with edge weights 2 and 5]

36 Candidate identification: Weighted networks
Apply threshold T=2 on ‘abcd’
- Recurse on ‘abc’
- No recursive call on singleton ‘d’

37 Candidate identification: Weighted networks
Apply threshold T=5 on ‘abc’
[Figure: triangle on a, b, c with all edge weights 5]

38 Candidate identification: Weighted networks
Apply threshold T=5 on ‘abc’
- No recursive call on singletons ‘a’, ‘b’, ‘c’

39 Candidate identification: Weighted networks
Final candidate list:
- ‘abcde’
- ‘abcd’
- ‘abc’
- ‘cde’
[Figure: the weighted network G]

40 Computing sparse vertex separators
Complexity of Sparsest Vertex Separator
- NP-hard
- Can be approximated in polynomial time via Semidefinite Programming (SDP) [FeigeHajiaghayiLee05]
- SDP might be inefficient in practice
We find sparse vertex separators via Vertex Betweenness [Freeman77]
- Efficiently computable via dynamic programming
- Works well empirically
- In the worst case, the approximation can be weak

42 Normalized Vertex Betweenness (NVB) [Freeman77]
Vertex Betweenness (VB) of a node v: number of shortest paths passing through v
- Ex: ~m^2 for v, 0 for the other vertices
Normalized Vertex Betweenness (NVB): VB divided by a normalization factor, to get values in [0,1]
NVB(G): maximum NVB value over all nodes
Theorem: cohesion(G) ≥ 1/(1 + |G| · NVB(G))
- In practice: cohesion(G) ≈ 1/(1 + |G| · NVB(G))
[Figure: clique of size m and vertex v]
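Raw vertex betweenness can be computed with one breadth-first search per source node. The sketch below uses Brandes' accumulation scheme, which is one well-known dynamic-programming method and an assumption on my part; the slides do not say which procedure the authors used. It follows the standard fractional definition (a pair of nodes with several shortest paths splits its unit of credit among them) and counts each unordered pair once. The normalization factor is not given in the transcript, so `cohesion_lower_bound` simply takes an already normalized value as input.

```python
from collections import deque

def vertex_betweenness(adj):
    """Betweenness of every node in an undirected, unweighted graph."""
    vb = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                vb[w] += delta[w]
    # Every unordered pair was visited from both of its endpoints.
    return {v: b / 2 for v, b in vb.items()}

def cohesion_lower_bound(n, nvb):
    """Right-hand side of the theorem: 1 / (1 + |G| * NVB(G))."""
    return 1 / (1 + n * nvb)

# Hypothetical test graphs: a star and a 4-cycle.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}

print(vertex_betweenness(star))
print(vertex_betweenness(cycle))
```

In the star, every shortest path between two leaves passes through the center, which makes the center the obvious separator candidate; in the cycle, no node stands out.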

43 Candidate identification: Weighted networks
Ideal algorithm:
- Iterate over all possible thresholds T
- Output all clusters in G_T
- Somewhat inefficient
Actual algorithm:
1) Apply threshold T = min weight in G
2) Output clusters of G_T
3) For each clique C in G_T: recurse on C
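The recursion can be sketched end to end on the five-node example from slides 32 to 39. This is one reading of the slides, not the authors' implementation, and it rests on several assumptions: the first call uses T=0 as in the worked example, G_T keeps edges with weight strictly above T, a brute-force smallest separator stands in for a “sparse” one (ignoring balance), and only clusters that are cliques in G_T trigger a recursive call with the threshold raised to their minimum weight. All function names are made up for the illustration.

```python
from itertools import combinations

def components(nodes, adj):
    """Connected components of the subgraph induced on `nodes`."""
    nodes, comps = set(nodes), []
    while nodes:
        stack, comp = [nodes.pop()], set()
        while stack:
            x = stack.pop()
            comp.add(x)
            nxt = adj.get(x, set()) & nodes
            nodes -= nxt
            stack.extend(nxt)
        comps.append(frozenset(comp))
    return comps

def is_clique(nodes, adj):
    return all(v in adj.get(u, set()) for u, v in combinations(nodes, 2))

def split(nodes, adj):
    """Pieces S ∪ A_i for a smallest vertex separator S (brute-force stand-in)."""
    for k in range(len(nodes) - 1):
        for S in combinations(sorted(nodes), k):
            comps = components(nodes - set(S), adj)
            if len(comps) >= 2:
                return [frozenset(S) | A for A in comps]
    return []

def unweighted_candidates(nodes, adj):
    """Recursive splitting; cliques and singletons are not split further."""
    if len(nodes) < 2 or is_clique(nodes, adj):
        return []
    found = []
    for piece in split(nodes, adj):
        if len(piece) >= 2:
            found += [piece] + unweighted_candidates(piece, adj)
    return found

def weighted_candidates(nodes, wedges, T=0):
    nodes = frozenset(nodes)
    inside = [(u, v, w) for u, v, w in wedges
              if u in nodes and v in nodes and w > T]
    adj = {}
    for u, v, _ in inside:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    for comp in components(nodes, adj):
        if len(comp) < 2:
            continue  # no recursive call on singletons
        for c in [comp] + unweighted_candidates(comp, adj):
            found.append(c)
            if is_clique(c, adj):
                # Raise the threshold to the smallest weight inside c.
                lowest = min(w for u, v, w in inside if u in c and v in c)
                found += weighted_candidates(c, wedges, lowest)
    return list(dict.fromkeys(found))  # drop repeats, keep first-seen order

# Edge weights assumed from the figure: a heavy triangle abc, light edges elsewhere.
W = [("a", "b", 5), ("a", "c", 5), ("b", "c", 5), ("a", "d", 2),
     ("b", "d", 2), ("c", "d", 2), ("c", "e", 2), ("d", "e", 2)]

for cluster in weighted_candidates("abcde", W):
    print("".join(sorted(cluster)))
```

On the assumed weights this produces the same four candidates as the final list on slide 39, although the order in which they are printed may differ.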

44 C-Rank: Analysis
Theorem: C-Rank is guaranteed to output all the maximal clusters.
Lemma: C-Rank runs in time polynomial in its output length.

45 Mailbox networks
An egocentric social network
- Reflects the subjective perspective of the mailbox owner
Nodes: contacts appearing in message headers
- Excluding the mailbox owner
Edges: connect contacts who co-occur in the same message header
- Edge weights: frequency of co-occurrence
Example message: a; b, c, d, and owner
[Figure: network on nodes a, b, c, d with edge weights 1]

46 Mailbox networks
Example messages: a; b, c, d, and owner. c; d, e, and owner. b; owner.
[Figure: network on nodes a, b, c, d, e with edge weights 1 and 2]

47 Mailbox networks
Example messages: a; b, c, d, and owner. c; d, e, and owner. b; owner.
[Figure: network on nodes a, b, c, d, e with edge weights 1 and 2]

48 Ido Guy’s top 10 communities
Description | Member IDs | Weight | Rank
project1 core team | 1,2 | 184 | 1
spouse | 3 | 87 | 2
advisor | 4 | 75 | 3
project2 core team | 1,5,6,7 | 70.3 | 4
former advisor | 8 | 62 | 5
project1 new team | 1,2,9,10,11,12 | 48.2 | 6
academic course staff | 13-25 | 46.9 | 7
project2 extended team (IBM) | 1,5,6,7,26-30 | 46.7 | 8
project1 old team | 1,2,9,10,31 | 42.3 | 9
project2 extended team (IBM+Lucent) | 1,5,6,7,26-30,32-35 | 41.3 | 10

49 Estimated precision
