
Slide 1: AutoPart: Parameter-Free Graph Partitioning and Outlier Detection. Deepayan Chakrabarti (deepay@cs.cmu.edu)

Slide 2: Problem Definition
Group people in a social network, or species in a food web, or proteins in protein-interaction graphs …
[Figure: people-by-people adjacency matrix, with node groups]

Slide 3: Reminder
Graph: N nodes and E directed edges.

Slide 4: Problem Definition
Goals:
[#1] Find groups (of people, species, proteins, etc.)
[#2] Find outlier edges (“bridges”)
[#3] Compute inter-group “distances” (how similar are two groups of proteins?)

Slide 5: Problem Definition
Desired properties:
Fully automatic (estimate the number of groups)
Scalable
Allows incremental updates

Slide 6: Related Work
Graph partitioning: METIS (Karypis+ 1998); spectral partitioning (Ng+ 2001)
Clustering techniques: k-means and variants (Pelleg+ 2000, Hamerly+ 2003); information-theoretic co-clustering (Dhillon+ 2003)
LSI (Deerwester+ 1990)
Limitations: each method requires choosing the number of “concepts”, a measure of imbalance between clusters, or the number of partitions; or it considers rows and columns separately; or it is not fully automatic.

Slide 7: Outline
Problem Definition
Related Work
Finding clusters in graphs
Outliers and inter-group distances
Experiments
Conclusions

Slide 8: Outline (next: finding clusters in graphs)
- What is a good clustering?
- How can we find such a clustering?

Slide 9: What is a “good” clustering?
[Figure: adjacency matrix, unclustered versus grouped into node groups. Why is the grouped version better?]
A good clustering:
1. Groups similar nodes together
2. Uses as few groups as necessary
Good clustering implies a few homogeneous blocks, which implies good compression.

Slide 10: Main Idea
Good compression implies good clustering. Encode the binary adjacency matrix block by block: for block i, containing n_i^1 ones and n_i^0 zeros, let p_i^1 = n_i^1 / (n_i^1 + n_i^0). Then
Total Encoding Cost = Σ_i [cost of describing n_i^1, n_i^0, and the groups]  (description cost)
                    + Σ_i (n_i^1 + n_i^0) · H(p_i^1)  (code cost)
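The code-cost term on this slide can be checked with a minimal Python sketch (the function name and the edge-case handling are my own; the slide only gives the formula):

```python
import math

def block_code_cost(n1, n0):
    """Bits to encode a binary block with n1 ones and n0 zeros,
    i.e. (n1 + n0) * H(p1) where p1 = n1 / (n1 + n0)."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # a perfectly homogeneous block needs no code bits
    p1 = n1 / n
    h = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)
    return n * h

print(block_code_cost(100, 0))   # 0.0  -- all-ones block: free to encode
print(block_code_cost(50, 50))   # 100.0 -- maximally mixed block: 1 bit/cell
```

A homogeneous block costs nothing beyond its description, which is why minimizing this cost favors clusterings with a few clean blocks.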

Slide 11: Examples
One node group: low description cost, high code cost.
n node groups: high description cost, low code cost.
The total encoding cost, description cost plus code cost Σ_i (n_i^1 + n_i^0) · H(p_i^1), balances these two extremes.

Slide 12: What is a “good” clustering?
The grouped matrix is better because it achieves a low total encoding cost (description cost plus code cost Σ_i (n_i^1 + n_i^0) · H(p_i^1)).

Slide 13: Outline (next: how can we find such a clustering?)

Slide 14: Algorithms
Example: k = 5 node groups.

Slide 15: Algorithms
Start with the initial matrix → find good groups for fixed k → choose better values for k → … → final grouping, lowering the encoding cost at each step.

Slide 16: Algorithms (flowchart repeated)

Slide 17: Reassign
For a fixed number of groups k: for each node, reassign it to the group that minimizes the code cost.

Slide 18: Algorithms (flowchart repeated; next step: choose better values for k)

Slide 19: Choosing k
Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.
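The three steps above can be sketched as follows, under the assumption that each group's rows are already summarized as (ones, zeros) counts; the names `entropy_per_node` and `split` are illustrative, not from the slides:

```python
import math

def entropy_per_node(rows):
    """Average code cost in bits per node, over rows given as
    (n1, n0) one/zero counts."""
    total = 0.0
    for n1, n0 in rows:
        n = n1 + n0
        if n and n1 and n0:
            p = n1 / n
            total += n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return total / max(len(rows), 1)

def split(group_rows):
    """One Split step. group_rows maps group id -> {node id: (n1, n0)}.
    Returns the group R with maximum entropy per node and the nodes
    whose removal lowers that entropy; they seed group k+1."""
    worst = max(group_rows,
                key=lambda g: entropy_per_node(list(group_rows[g].values())))
    rows = group_rows[worst]
    base = entropy_per_node(list(rows.values()))
    moved = [v for v in rows
             if entropy_per_node([c for u, c in rows.items() if u != v]) < base]
    return worst, moved

# Group 1 mixes a noisy node (2) with a clean one (3); the noisy
# node is peeled off into the new group.
groups = {0: {0: (10, 0), 1: (10, 0)}, 1: {2: (5, 5), 3: (10, 0)}}
print(split(groups))  # (1, [2])
```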

Slide 20: Algorithms
The loop alternates Reassign (find good groups for fixed k) with Splits (choose better values for k), lowering the encoding cost until the final grouping.

Slide 21: Algorithms
Properties:
- Fully automatic: the number of groups is found automatically.
- Scalable: O(E) time.
- Allows incremental updates: reassign a new node/edge to the group with least cost, and continue.

Slide 22: Outline (next: outliers and inter-group distances)

Slide 23: Outlier Edges
Outliers are deviations from “normality”, and they lower the quality of compression. So, find the edges whose removal maximally reduces the encoding cost: these are the outlier edges between node groups.
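One way to turn this into a per-edge score, as a sketch: the bits saved by deleting a single ‘1’ from its block. Edges sitting in otherwise near-empty blocks (bridges between groups) save the most; the function name is illustrative.

```python
import math

def code_cost(n1, n0):
    """Bits to encode a block with n1 ones and n0 zeros."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0
    p = n1 / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def edge_outlier_score(n1, n0):
    """Bits saved by removing one edge (a '1') from its block."""
    return code_cost(n1, n0) - code_cost(n1 - 1, n0 + 1)

# A lone edge in a sparse block is a strong outlier; an edge inside
# a dense mixed block barely changes the cost.
print(edge_outlier_score(1, 99))   # large saving
print(edge_outlier_score(50, 50))  # tiny saving
```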

Slide 24: Inter-cluster Distances
Two groups are “close” if merging them does not increase the cost by much:
distance(i, j) = relative increase in cost on merging groups i and j.

Slide 25: Inter-cluster Distances
distance(i, j) = relative increase in cost on merging i and j. Example, for three groups:
        Grp1   Grp2
Grp2    5.5
Grp3    4.5    5.1
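The distance definition follows directly from the cost function. A sketch, with illustrative names: `group_distance` takes the (ones, zeros) counts of the two groups' blocks before and after the merge, and measures the increase relative to a caller-supplied total encoding cost.

```python
import math

def code_cost(n1, n0):
    """Bits to encode a block with n1 ones and n0 zeros."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0
    p = n1 / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def group_distance(blocks_i, blocks_j, blocks_merged, total_cost):
    """distance(i, j): relative increase in encoding cost if groups
    i and j are merged. Each blocks_* is a list of (n1, n0) counts."""
    before = sum(code_cost(n1, n0) for n1, n0 in blocks_i + blocks_j)
    after = sum(code_cost(n1, n0) for n1, n0 in blocks_merged)
    return (after - before) / total_cost

# Merging an all-ones block with an all-zeros block creates one
# maximally mixed block, so the cost jumps and the distance is large.
print(group_distance([(10, 0)], [(0, 10)], [(10, 10)], 100.0))  # 0.2
```

Two groups with similar connectivity patterns would merge into blocks that are still homogeneous, giving a distance near zero.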

Slide 26: Outline (next: experiments)

Slide 27: Experiments
A “quasi block-diagonal” graph with noise = 10%.

Slide 28: Experiments
DBLP dataset: 6,090 authors in SIGMOD, ICDE, VLDB, PODS, and ICDT; 175,494 “dots”, one “dot” per co-citation.

Slide 29: Experiments
k = 8 author groups found (one group contains Stonebraker, DeWitt, and Carey).

Slide 30: Experiments
Inter-group distances among the author groups Grp1 through Grp8.

Slide 31: Experiments
Epinions dataset: 75,888 users; 508,960 “dots”, one “dot” per “trust” relationship. k = 19 user groups found, including a small dense “core”.

Slide 32: Experiments
Running time (in seconds) is linear in the number of “dots”, so the method is scalable.

Slide 33: Conclusions
Goals: find groups; find outliers; compute inter-group “distances”.
Properties: fully automatic; scalable; allows incremental updates.

