
Slide 1: AutoPart: Parameter-Free Graph Partitioning and Outlier Detection. Deepayan Chakrabarti (deepay@cs.cmu.edu)

Slide 2: Problem Definition
Group people in a social network, or species in a food web, or proteins in protein-interaction graphs …
[Figure: people-by-people adjacency matrix, with node groups]

Slide 3: Reminder
Graph: N nodes and E directed edges.

Slide 4: Problem Definition
Goals:
[#1] Find groups (of people, species, proteins, etc.)
[#2] Find outlier edges (“bridges”)
[#3] Compute inter-group “distances” (how similar are two groups of proteins?)

Slide 5: Problem Definition
Desired properties:
Fully automatic (estimate the number of groups)
Scalable
Allows incremental updates

Slide 6: Related Work
Graph partitioning: METIS (Karypis+ 1998); spectral partitioning (Ng+ 2001)
Clustering techniques: k-means and variants (Pelleg+ 2000, Hamerly+ 2003); information-theoretic co-clustering (Dhillon+ 2003)
LSI (Deerwester+ 1990)
Limitations: each method requires choosing the number of “concepts”, a measure of imbalance between clusters, or the number of partitions; or it considers rows and columns separately; or it is not fully automatic.

Slide 7: Outline
Problem Definition
Related Work
Finding clusters in graphs
Outliers and inter-group distances
Experiments
Conclusions

Slide 8: Outline (next: finding clusters in graphs)
- What is a good clustering?
- How can we find such a clustering?

Slide 9: What is a “good” clustering?
[Figure: adjacency matrix, unclustered versus grouped into node groups. Why is the grouped version better?]
A good clustering:
1. Groups similar nodes together
2. Uses as few groups as necessary
Good clustering implies a few homogeneous blocks, which implies good compression.

Slide 10: Main Idea
Good compression implies good clustering. Encode the binary adjacency matrix block by block: for block i, containing n_i^1 ones and n_i^0 zeros, let p_i^1 = n_i^1 / (n_i^1 + n_i^0). Then
Total Encoding Cost = Σ_i [cost of describing n_i^1, n_i^0, and the groups]  (description cost)
                    + Σ_i (n_i^1 + n_i^0) · H(p_i^1)  (code cost)
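The code-cost term on this slide can be checked with a minimal Python sketch (the function name and the edge-case handling are my own; the slide only gives the formula):

```python
import math

def block_code_cost(n1, n0):
    """Bits to encode a binary block with n1 ones and n0 zeros,
    i.e. (n1 + n0) * H(p1) where p1 = n1 / (n1 + n0)."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # a perfectly homogeneous block needs no code bits
    p1 = n1 / n
    h = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    return n * h

print(block_code_cost(100, 0))   # 0.0  -- all-ones block: free to encode
print(block_code_cost(50, 50))   # 100.0 -- maximally mixed block: 1 bit/cell
```

A homogeneous block costs nothing beyond its description, which is why minimizing this cost favors clusterings with a few clean blocks.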

Slide 11: Examples
One node group: low description cost, high code cost.
n node groups: high description cost, low code cost.
The total encoding cost, description cost plus code cost Σ_i (n_i^1 + n_i^0) · H(p_i^1), balances these two extremes.

Slide 12: What is a “good” clustering?
The grouped matrix is better because it achieves a low total encoding cost (description cost plus code cost Σ_i (n_i^1 + n_i^0) · H(p_i^1)).

Slide 13: Outline (next: how can we find such a clustering?)

Slide 14: Algorithms
Example: k = 5 node groups.

Slide 15: Algorithms
Start with the initial matrix → find good groups for fixed k → choose better values for k → … → final grouping, lowering the encoding cost at each step.

Slide 16: Algorithms (flowchart repeated)

Slide 17: Reassign
For a fixed number of groups k: for each node, reassign it to the group that minimizes the code cost.

Slide 18: Algorithms (flowchart repeated; next step: choose better values for k)

Slide 19: Choosing k
Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.
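The three steps above can be sketched as follows, under the assumption that each group's rows are already summarized as (ones, zeros) counts; the names `entropy_per_node` and `split` are illustrative, not from the slides:

```python
import math

def entropy_per_node(rows):
    """Average code cost in bits per node, over rows given as
    (n1, n0) one/zero counts."""
    total = 0.0
    for n1, n0 in rows:
        n = n1 + n0
        if n and n1 and n0:
            p = n1 / n
            total += n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return total / max(len(rows), 1)

def split(group_rows):
    """One Split step. group_rows maps group id -> {node id: (n1, n0)}.
    Returns the group R with maximum entropy per node and the nodes
    whose removal lowers that entropy; they seed group k+1."""
    worst = max(group_rows,
                key=lambda g: entropy_per_node(list(group_rows[g].values())))
    rows = group_rows[worst]
    base = entropy_per_node(list(rows.values()))
    moved = [v for v in rows
             if entropy_per_node([c for u, c in rows.items() if u != v]) < base]
    return worst, moved

# Group 1 mixes a noisy node (2) with a clean one (3); the noisy
# node is peeled off into the new group.
groups = {0: {0: (10, 0), 1: (10, 0)}, 1: {2: (5, 5), 3: (10, 0)}}
print(split(groups))  # (1, [2])
```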

Slide 20: Algorithms
The loop alternates Reassign (find good groups for fixed k) with Splits (choose better values for k), lowering the encoding cost until the final grouping.

Slide 21: Algorithms
Properties:
- Fully automatic: the number of groups is found automatically.
- Scalable: O(E) time.
- Allows incremental updates: reassign a new node/edge to the group with least cost, and continue.

Slide 22: Outline (next: outliers and inter-group distances)

Slide 23: Outlier Edges
Outliers are deviations from “normality”, and they lower the quality of compression. So, find the edges whose removal maximally reduces the encoding cost: these are the outlier edges between node groups.
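One way to turn this into a per-edge score, as a sketch: the bits saved by deleting a single ‘1’ from its block. Edges sitting in otherwise near-empty blocks (bridges between groups) save the most; the function name is illustrative.

```python
import math

def code_cost(n1, n0):
    """Bits to encode a block with n1 ones and n0 zeros."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0
    p = n1 / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def edge_outlier_score(n1, n0):
    """Bits saved by removing one edge (a '1') from its block."""
    return code_cost(n1, n0) - code_cost(n1 - 1, n0 + 1)

# A lone edge in a sparse block is a strong outlier; an edge inside
# a dense mixed block barely changes the cost.
print(edge_outlier_score(1, 99))   # large saving
print(edge_outlier_score(50, 50))  # tiny saving
```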

Slide 24: Inter-cluster Distances
Two groups are “close” if merging them does not increase the cost by much:
distance(i, j) = relative increase in cost on merging groups i and j.

Slide 25: Inter-cluster Distances
distance(i, j) = relative increase in cost on merging i and j. Example, for three groups:
        Grp1   Grp2
Grp2    5.5
Grp3    4.5    5.1
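The distance definition follows directly from the cost function. A sketch, with illustrative names: `group_distance` takes the (ones, zeros) counts of the two groups' blocks before and after the merge, and measures the increase relative to a caller-supplied total encoding cost.

```python
import math

def code_cost(n1, n0):
    """Bits to encode a block with n1 ones and n0 zeros."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0
    p = n1 / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def group_distance(blocks_i, blocks_j, blocks_merged, total_cost):
    """distance(i, j): relative increase in encoding cost if groups
    i and j are merged. Each blocks_* is a list of (n1, n0) counts."""
    before = sum(code_cost(n1, n0) for n1, n0 in blocks_i + blocks_j)
    after = sum(code_cost(n1, n0) for n1, n0 in blocks_merged)
    return (after - before) / total_cost

# Merging an all-ones block with an all-zeros block creates one
# maximally mixed block, so the cost jumps and the distance is large.
print(group_distance([(10, 0)], [(0, 10)], [(10, 10)], 100.0))  # 0.2
```

Two groups with similar connectivity patterns would merge into blocks that are still homogeneous, giving a distance near zero.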

Slide 26: Outline (next: experiments)

Slide 27: Experiments
A “quasi block-diagonal” graph with noise = 10%.

Slide 28: Experiments
DBLP dataset: 6,090 authors in SIGMOD, ICDE, VLDB, PODS, and ICDT; 175,494 “dots”, one “dot” per co-citation.

Slide 29: Experiments
k = 8 author groups found (one group contains Stonebraker, DeWitt, and Carey).

Slide 30: Experiments
Inter-group distances among the author groups Grp1 through Grp8.

Slide 31: Experiments
Epinions dataset: 75,888 users; 508,960 “dots”, one “dot” per “trust” relationship. k = 19 user groups found, including a small dense “core”.

Slide 32: Experiments
Running time (in seconds) is linear in the number of “dots”, so the method is scalable.

Slide 33: Conclusions
Goals: find groups; find outliers; compute inter-group “distances”.
Properties: fully automatic; scalable; allows incremental updates.

