The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria
Brian S. Mitchell & Spiros Mancoridis


1 The Search Landscape of Graph Partitioning Problems using Coupling and Cohesion as the Clustering Criteria
Brian S. Mitchell & Spiros Mancoridis
{bmitchel,smancori}@mcs.drexel.edu
http://www.mcs.drexel.edu/~{bmitchel,smancori}
Department of Computer Science, Software Engineering Research Group, http://serg.mcs.drexel.edu
Drexel University, Philadelphia, PA, USA
10/05/2002

2 Software Clustering with Bunch
Drexel University Software Engineering Research Group (SERG) http://serg.mcs.drexel.edu
[Diagram: Source Code (example: void main() { printf("hello"); }) -> Source Code Analysis Tools (Acacia, Chava) -> MDG File -> Bunch Clustering Tool -> Partitioned MDG File -> Visualization Tool. The example MDG contains modules M1 to M8. Bunch components shown: Bunch GUI, Clustering Algorithms, Clustering Tools, Programming API.]

3 Software Clustering as a Search Problem
[Diagram: Source Code (example: void main() { printf("hello"); }) -> Source Code Analysis Tools (Acacia, Chava) -> MDG (modules M1 to M8) -> Software Clustering Search Algorithms -> “GOOD” MDG Partition]
Search loop shown on the slide: bP = null; while(searching()) { p = selectNext(); if(p.isBetter(bP)) bP = p; } return bP;
SEARCH SPACE: Set of All MDG Partitions. Total = 4140 Partitions

4 The Search Space is Enormous
Number of modules = number of MDG partitions:
1 = 1, 2 = 2, 3 = 5, 4 = 15, 5 = 52
6 = 203, 7 = 877, 8 = 4140, 9 = 21147, 10 = 115975
11 = 678570, 12 = 4213597, 13 = 27644437, 14 = 190899322, 15 = 1382958545
16 = 10480142147, 17 = 82864869804, 18 = 682076806159, 19 = 5832742205057, 20 = 51724158235372
$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \lor k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$
The number of MDG partitions grows very quickly as the number of modules in the system increases.
A 15-module system is about the limit for performing exhaustive analysis.
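The table on this slide can be reproduced from the recurrence. The sketch below assumes the usual reading of S(n, k) as the number of ways to split n modules into exactly k non-empty clusters, so the total for n modules is the sum over all k. The function names are illustrative and do not come from Bunch.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to split n modules into exactly k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    # Either the n-th module forms a new cluster by itself,
    # or it joins one of the k clusters that already exist.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def partition_count(n: int) -> int:
    """Total number of partitions of an n-module MDG."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (8, 13, 15, 20):
        print(n, partition_count(n))
```

For 8 modules this gives the 4140 partitions quoted on the earlier search-problem slide.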

5 Our Assumption…
“Well designed software systems are organized into cohesive clusters that are loosely interconnected.”
- We designed a measurement called MQ that embodies our assumption
- The MQ measurement balances cohesion and coupling
- We apply MQ to partitions of the MDG

6 Not all Partitions of the MDG are Good Solutions
[Figure: an MDG with modules M1 to M6, shown next to a “Good Partition!” and a “Bad Partition!” of the same graph]
MQ( Good Partition ) > MQ( Bad Partition )

7 The Software Clustering Problem: Algorithm Objectives
“Find a good partition of the MDG.”
- A partition is the decomposition of a set of elements (i.e., all the nodes of the graph) into mutually disjoint clusters.
- A good partition is a partition where:
  - highly interdependent nodes are grouped in the same clusters
  - independent nodes are assigned to separate clusters
- The better the partition the higher the MQ
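The slides do not give the MQ formula, so the following is only a sketch under an assumption: it uses a cluster-factor style formulation (each cluster scores 2·intra / (2·intra + inter), and MQ is the sum over clusters), which is one published variant of MQ and may differ from the variant behind these results. The six-module graph and both partitions are hypothetical, chosen to mirror the good/bad comparison above.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def mq(edges: Iterable[Edge], cluster_of: Dict[Hashable, int]) -> float:
    """Sum of per-cluster factors; higher means more cohesion, less coupling.

    Assumed formulation (not taken from the slides):
    factor = 2 * intra / (2 * intra + inter), or 0 when intra == 0.
    """
    intra = {c: 0 for c in set(cluster_of.values())}
    inter = {c: 0 for c in intra}
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] += 1          # edge stays inside one cluster
        else:
            inter[cu] += 1          # edge crosses a boundary and
            inter[cv] += 1          # penalises both clusters
    return sum(
        2 * intra[c] / (2 * intra[c] + inter[c])
        for c in intra
        if intra[c] > 0
    )


# Hypothetical MDG: two triangles joined by a single edge.
EDGES = [
    ("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
    ("M4", "M5"), ("M5", "M6"), ("M4", "M6"),
    ("M3", "M4"),
]
GOOD = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1, "M6": 1}
BAD = {"M1": 0, "M4": 0, "M2": 1, "M5": 1, "M3": 2, "M6": 2}

if __name__ == "__main__":
    print("good:", mq(EDGES, GOOD))
    print("bad: ", mq(EDGES, BAD))
```

Under this formulation the partition that keeps each triangle together scores higher than the one that splits every dependent pair, which matches the objective stated above.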

8 Bunch Hill Climbing Clustering Algorithm
[Flowchart boxes: Generate a Random Decomposition of MDG; Current Partition; Iteration Step; Generate Next Neighbor; Measure MQ; Compare to Best Neighboring Partition; Neighbor Partition Better?; New Best Neighboring Partition; Best Neighboring Partition for Iteration; Convergence]
A neighbor partition is created by altering the current partition slightly.
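One reading of this flowchart is a steepest-ascent hill climb: score every neighbor of the current partition, move to the best one if it improves the score, and stop when none does. The sketch below follows that reading and is not Bunch's implementation. The neighbor rule (move one module to another cluster or to a new one), the stand-in scoring function, and the example graph are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, Hashable, Iterator, List, Tuple

Partition = Dict[Hashable, int]


def neighbors(p: Partition) -> Iterator[Partition]:
    """Partitions reachable by moving one module to a different cluster."""
    labels = set(p.values())
    fresh = max(labels) + 1                 # label for a brand-new cluster
    for node, current in p.items():
        for target in labels | {fresh}:
            if target != current:
                q = dict(p)
                q[node] = target
                yield q


def hill_climb(nodes: List[Hashable],
               score: Callable[[Partition], float],
               rng: random.Random) -> Partition:
    """Climb from a random partition until no neighbor scores higher."""
    current = {n: rng.randrange(len(nodes)) for n in nodes}
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for q in neighbors(current):
            s = score(q)
            if s > best_score:              # strictly better neighbors only
                best, best_score = q, s
        if best is None:                    # converged: local optimum
            return current
        current, current_score = best, best_score


def make_score(edges: List[Tuple[Hashable, Hashable]]):
    """Stand-in objective (assumed, not from the slides)."""
    def score(p: Partition) -> float:
        intra: Dict[int, int] = {}
        inter: Dict[int, int] = {}
        for u, v in edges:
            if p[u] == p[v]:
                intra[p[u]] = intra.get(p[u], 0) + 1
            else:
                for c in (p[u], p[v]):
                    inter[c] = inter.get(c, 0) + 1
        return sum(2 * i / (2 * i + inter.get(c, 0)) for c, i in intra.items())
    return score


# Hypothetical MDG: two triangles joined by one edge.
NODES = ["M1", "M2", "M3", "M4", "M5", "M6"]
EDGES = [("M1", "M2"), ("M2", "M3"), ("M1", "M3"),
         ("M4", "M5"), ("M5", "M6"), ("M4", "M6"), ("M3", "M4")]

if __name__ == "__main__":
    result = hill_climb(NODES, make_score(EDGES), random.Random(0))
    print(result, make_score(EDGES)(result))
```

Because each move must strictly increase the score, the loop always terminates, but only at a local optimum; different seeds can end in different partitions.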

9 Bunch Hill Climbing Clustering Algorithm
[Flowchart repeated from the previous slide]
A neighbor partition is created by altering the current partition slightly.
Other Things of Interest
- We have implemented a family of hill-climbing algorithms
- We also implemented an exhaustive algorithm and a genetic algorithm

10 Hierarchical Clustering (1): Nested View
[Figure: four numbered panels (1 to 4), with a “Default” label]

11 Hierarchical Clustering (2): Consolidated View
[Figure: four numbered panels (1 to 4), with a “Default” label]

12 Hierarchical Clustering (3): Tree View

13 Hierarchical Clustering (3): Tree View
Observations
The number of levels for a given system’s clustering hierarchy is bounded by O(log_2 N), because Bunch places at least 2 nodes in each cluster.
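The bound can be spelled out in a few lines. This derivation assumes that each level of the hierarchy is formed by clustering the clusters of the level below, and that every cluster holds at least 2 of those items; the symbols c_k and the 1000-module example are introduced here for illustration.

```latex
% c_k = number of clusters at level k, with c_0 = N modules.
% Each cluster groups at least 2 items from the level below, so
\[
  c_k \le \frac{c_{k-1}}{2}
  \quad\Longrightarrow\quad
  c_k \le \frac{N}{2^{k}} .
\]
% A level must contain at least one cluster, c_k \ge 1, hence
\[
  1 \le \frac{N}{2^{k}}
  \quad\Longrightarrow\quad
  k \le \log_2 N .
\]
% Example: N = 1000 gives k \le \lfloor \log_2 1000 \rfloor = 9 levels.
```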

14 Evaluating The Software Clustering Results
Over the past few years we have spent a lot of time evaluating Bunch’s software clustering results:
- Empirically
- Semi-formally
- Measuring Similarity

15 What We Know
- Given a particular MDG, the results produced by Bunch converge to a family of related solutions
- The search space is large, and the probability of finding a good solution by random sampling is infinitesimal

16 Software Clustering using Graph Partitioning Techniques
- Running Bunch multiple times produces a family of related clustering results
- Bunch starts with a random partition of the MDG, and makes random moves to explore the search space

17 Software Clustering using Graph Partitioning Techniques
How related are these clustering results?

18 Software Clustering using Graph Partitioning Techniques
Given that there are 27,644,437 distinct partitions of this MDG, there is a lot of agreement…
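Agreement between two clustering runs can be put on a numeric scale. The slides do not say which similarity measure was used here, so the sketch below substitutes a standard one, the Rand index: the fraction of module pairs that both runs treat the same way (together in both, or apart in both). The function name and example partitions are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable


def pair_agreement(a: Dict[Hashable, int], b: Dict[Hashable, int]) -> float:
    """Rand index of two partitions over the same set of modules."""
    if set(a) != set(b):
        raise ValueError("both partitions must cover the same modules")
    pairs = list(combinations(list(a), 2))
    if not pairs:
        raise ValueError("need at least two modules")
    agree = sum(
        (a[u] == a[v]) == (b[u] == b[v])    # same verdict for this pair?
        for u, v in pairs
    )
    return agree / len(pairs)


# Hypothetical results of three clustering runs on a 4-module system.
RUN_A = {"M1": 0, "M2": 0, "M3": 1, "M4": 1}
RUN_B = {"M1": 5, "M2": 5, "M3": 7, "M4": 7}   # same grouping, other labels
RUN_C = {"M1": 0, "M2": 0, "M3": 0, "M4": 1}

if __name__ == "__main__":
    print(pair_agreement(RUN_A, RUN_B))
    print(pair_agreement(RUN_A, RUN_C))
```

Comparing pairs rather than cluster labels matters because two runs can produce the same grouping under different label numbers.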

19 Software Clustering using Graph Partitioning Techniques
Why Some Modules Don’t Agree…
- Library Modules
- Isomorphism
- Omnipresent Module Influences

20 Special Modules
- Isomorphic – Modules that are connected to multiple clusters with equal strength
- Library – All edges fan-in
- Driver – All edges fan-out
- Omnipresent – Modules that are strongly connected to many other modules in the system
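Three of these categories can be detected from the MDG alone, as the sketch below illustrates. It is not Bunch's detection logic: the `omnipresent_ratio` threshold (a module counts as omnipresent when it touches at least that many times the average number of neighbors) is a hypothetical parameter, and isomorphic modules are left out because identifying them requires a partition as well as the graph.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def classify_modules(edges: Iterable[Tuple[Hashable, Hashable]],
                     omnipresent_ratio: float = 3.0) -> Dict[Hashable, str]:
    """Label each module as library, driver, omnipresent, or ordinary."""
    fan_in = defaultdict(set)    # module -> modules that depend on it
    fan_out = defaultdict(set)   # module -> modules it depends on
    modules = set()
    for src, dst in edges:
        modules.update((src, dst))
        fan_out[src].add(dst)
        fan_in[dst].add(src)

    neighbor_count = {m: len(fan_in[m] | fan_out[m]) for m in modules}
    average = sum(neighbor_count.values()) / len(modules) if modules else 0.0

    labels = {}
    for m in modules:
        if fan_in[m] and not fan_out[m]:
            labels[m] = "library"        # all edges fan-in
        elif fan_out[m] and not fan_in[m]:
            labels[m] = "driver"         # all edges fan-out
        elif neighbor_count[m] >= omnipresent_ratio * average:
            labels[m] = "omnipresent"    # assumed threshold rule
        else:
            labels[m] = "ordinary"
    return labels


if __name__ == "__main__":
    demo = [("main", "a"), ("main", "b"), ("a", "util"),
            ("b", "util"), ("a", "b")]
    print(classify_modules(demo))
```

Library and driver are checked before omnipresent here, so a heavily used library module is reported as a library; swapping the order is an equally defensible choice.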

21 Clustering a System Many Times (1)…
[Figure panels: RCS, Dot. Series: Random, Bunch]

22 Clustering a System Many Times (2)…
[Figure panels: Swing, Bunch. Series: Random, Bunch]

23 Clustering a System Many Times (2)…
[Figure panels: Swing, Bunch. Series: Random, Bunch]
Observations
- As the number of clusters increased in the random samples, MQ decreased
- Bunch converged to a consistent “family” of solutions, no matter where the random starting point was generated
- Some solutions were multi-modal
- Random solutions were consistently worse than Bunch’s solutions.

24 Example - Detailed Results: Bunch System
The search space has some inherent structure, as random clusters constrained to the area where Bunch converged did not produce better MQ values.
[Figure labels: 77%, 23%]

25 Understanding the Search Space
There are characteristics of Bunch’s clustering algorithms that are interesting:
- It seems unusual that the clustering algorithms produce consistent MQ values given the large search space
- Other approaches [spectral methods] to solving the clustering problem using Bunch’s MQ have not produced better clustering results
- The median clustering level is a good tradeoff between cluster size and number of clusters
  - Harman et al. examined using a target granularity [GECCO’02] to bias the desired cluster sizes

26 Investigating the Search Space
Examined multiple systems of different sizes:
- 15 open source systems developed in C, C++, or Java
- 13 randomly generated graphs with different properties that we wanted to investigate
We clustered each MDG 500 times and examined the clustering data to gain some insight into the search space.

27 Example: Median Clustering Level
[Figure panels: swing, Kerberos v.5. Plot label: Cumulative MQ]

28 Example: Median Clustering Level
[Figure panels: telnetd, php. Plot label: MQ]

29 Example: Median Clustering Level
[Figure panels: bash, mod_ssl, ping_libc, elm, lynx, mailx. X Axis: MQ Value]

30 Example: Median Clustering Level – Random Bipartite Graphs
[Figure panels: bip-100-1, bip-100-2, bip-100-5, bip-100-25, bip-100-75. X Axis: MQ Value]

31 Example: Median Clustering Level – Random Graphs
[Figure panels: rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, rnd-100-75. X Axis: MQ Value]

32 Example: Median Clustering Level – Random “Circle” Graphs
[Figure panels: circle-50, circle-100, circle-150. X Axis: MQ Value]

33 MQ versus #Clusters
[Figure panels: krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx. X Axis: #Clusters. Y Axis: MQ Value]

34 MQ versus #Clusters
[Figure panels: bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150. X Axis: #Clusters. Y Axis: MQ Value]

35 Internal- versus External Edges
[Figure panels: krb5, swing, telnetd, php, bash, mod_ssl, ping_libc, elm, lynx, mailx. X Axis: External Edges. Y Axis: Internal Edges]

36 Internal- versus External Edges
[Figure panels: bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, cir-150. X Axis: External Edges. Y Axis: Internal Edges]

37 Real Systems

38 Random Systems

39 Real Systems

40 Random Systems

41 What we Learned From Studying the Search Landscape
Not all modules are “equal” - Some modules:
- Are connected to many other modules
- Are connected to few other modules
- Have a large fan-in
- Have a large fan-out
- Are uniformly connected to other system components
- Are not uniformly connected to other system components
Some modules may have a more “natural” home than others with respect to their assigned cluster

42 What we Learned From Studying the Search Landscape
- Bunch tends to converge to a consistent solution with respect to MQ
- There is a very low probability of finding one of these partitions by random selection
- The partitions found by Bunch are a very small subset of the overall search landscape
- The degree of isomorphism in the clustering results was larger than expected

43 What we Learned From Studying the Search Landscape
- When examining the median level of the clustering hierarchy we observed that all systems tend to converge to at most 2 levels
- The systems that we studied range from under 100 modules to several thousand modules
- The number of levels in the clustering hierarchy is bounded by O(log_2 N)
- We expect that studying systems with several hundred thousand modules would produce results where the median level converges to more than 2 levels.
  - We observed this in very sparse graphs (e.g., rnd-100-1, and bip-100-1)

44 Conclusions (1)
Understanding the search landscape is important
- A single run of Bunch is helpful, but it does not highlight modules/classes that tend to drift between clusters
- Analysis of many Bunch runs helps build a mental model of the search landscape

45 Conclusions (2)
A best practice for program understanding
- Cluster a system many times in order to understand the search landscape
- Identify and separate omnipresent, library and supplier modules
- Identify modules that tend to drift between many subsystems
  - Assign to other clusters manually, or influence the clustering algorithm by adjusting the edge weights
  - Bunch supports manual and semi-automatic clustering features to help with this type of analysis

46 Questions
Special Thanks To: AT&T Research Sun Microsystems DARPA NSF US Army SEMINAL Group

