Brian Mitchell (bmitchel@mcs.drexel.edu) - Drexel University
MCS680: Foundations Of Computer Science (MCS680-FCS)

Case Study: Automatic Techniques For Software Modularization

    /* Sums every entry of a size-by-size adjacency matrix. */
    int MSTWeight(int **graph, int size)
    {
        int i, j;
        int weight = 0;                  /* O(1) */
        for (i = 0; i < size; i++)       /* n iterations */
            for (j = 0; j < size; j++)   /* n iterations per outer pass: O(n) */
                weight += graph[i][j];   /* O(1) */
        return weight;                   /* O(1) */
    }

Running Time = 2 O(1) + O(n^2) = O(n^2)

Introduction
This topic reinforces the concepts of set and graph theory by demonstrating a current research area:
- Algorithms for Automatic Software Modularization
This research was conducted by Drexel faculty:
- Brian Mitchell
- Spiros Mancoridis
- Chris Rorres

Software Engineering Problem
Software maintenance is an arduous task because of the difficulties associated with understanding the intricate relationships that exist between the source code components:
- The design document is inaccurate
- The original system architect/designer is no longer available for consultation
With no mechanism for gaining insight into the system design and structure, the software maintenance practitioner is often forced to modify the source code without a thorough understanding of the system's organization.
Heavily used software systems also change rapidly:
- Use of an "ad-hoc" maintenance approach will negatively affect the system design

Software Engineering Problem
Software engineers have long known of the difficulties associated with maintaining software systems whose only current documentation is the source code.
This leads to decay in the design, because source code changes are made without an understanding of the system structure:
- The size of modern software systems is beyond a programmer's cognitive ability to determine the effect of a local change on the entire system
- Changes made to the source code without an understanding of its organization usually contradict one or more aspects of the original design
The goal is to give the programmer a tool that visualizes the modularization of the system.

Other Work In Field
Top-Down Approaches
- Tools such as "Rigi" and "Arch" have been developed to perform a modularization of a software system
  - They still require somebody familiar with the system to provide feedback and/or set system-specific parameters
Bottom-Up Approaches
- Software Reflection Model
  - Used to capture and exploit the differences that exist between the actual source code organization and the designer's high-level model of the system's modularization
  - Streamlines the learning process
- The Orphan Adoption Problem
  - Given the name of a new software resource (an orphan), this tool outputs the name of the subsystem that has been chosen as the parent for the orphan

Our Automatic Modularization Tool
Implements algorithms that we developed that:
- Are fully automatic
- Recursively generate a hierarchical view of the system organization based solely on information extracted from the source code
Fully automatic techniques are useful to programmers who lack familiarity with the system. The system architect can also use them to compare the documented modularization with the one created by our tool and learn from the differences.

Software System Organization
Software systems contain a finite set of software components and a collection of relationships that govern how the software components interact with each other.
Typical software components:
- Classes, Modules
- Variables, Macros
- Structures
Typical software relationships:
- Import
- Export
- Inherit
The system structure can be represented as a resource dependency graph:
- The information required to build this graph can be obtained by parsing the source code

Example Resource Dependency Graph: Plan9
The following resource dependency graph was automatically generated by scanning the source code of the file system of the Plan9 operating system.
- Access to the source code was provided by AT&T Labs
[Figure: resource dependency graph of the Plan9 file system]

Goals of Research
The goal of our research is to automatically partition the components of a system into clusters that maximize cohesion and minimize coupling.
The clusters, once discovered, represent a higher-level abstraction of the system's organization by grouping related software components into subsystems.
Each subsystem contains a collection of modules that either:
- Cooperate to perform some high-level function in the overall system (for example: scanner, parser, code generator)
- Provide a set of related services that are used throughout the system (for example: import library, file manager, memory manager)

Automatically Modularized Visualization of Plan9 OS
The following graph was derived by our clustering utility.
[Figure: clustered graph of the Plan9 file system]
Formal definitions for cohesion, coupling and modularization quality must now be developed in order to illustrate our process.

Architecture of our Clustering Environment
[Figure: processing pipeline. Source code modules -> CIA utility (scans and parses the source code) -> XREF database -> Awk script (query, format) -> clustering engine -> DOT file -> DOTTY utility -> clustered graph display]

Quantifying Cohesion
Cohesion is an indication of the strength of the relationships that exist between modules that are grouped into a cluster.
- High cohesion = strong encapsulation
We define cohesion (H) as a measurement of the intra-edge dependencies between the components in a particular cluster.
- Formally, the cohesion $H_i$ of cluster i, consisting of $N_i$ components and $\mu_i$ intra-edge dependencies, is:
$$H_i = \frac{\mu_i}{N_i^2}$$
This measurement is the fraction of the maximum possible number of intra-edge dependencies, which is $N_i^2$.
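As a worked instance of the cohesion measure, assume a hypothetical cluster with $N_i = 3$ modules and $\mu_i = 4$ intra-edge dependencies (the numbers are chosen only for illustration):

```latex
H_i = \frac{\mu_i}{N_i^2} = \frac{4}{3^2} = \frac{4}{9} \approx 0.44
```

A cluster in which every module depended on every module, itself included, would reach the maximum value $H_i = 1$.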

Quantifying Coupling
Coupling (C) is a measurement of the inter-edge dependencies between the components of two distinct clusters.
The coupling $C_{i,j}$ between clusters i and j, consisting of $N_i$ and $N_j$ components respectively, with $\varepsilon_{i,j}$ inter-edge dependencies, is:
$$C_{i,j} = \begin{cases} 0 & \text{if } i = j \\ \dfrac{\varepsilon_{i,j}}{2 N_i N_j} & \text{if } i \neq j \end{cases}$$
This measurement is the fraction of the maximum possible number of inter-edge dependencies between clusters i and j. Each of the $N_i N_j$ module pairs can depend on each other in either direction, so the maximum is $2 N_i N_j$.
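A worked instance may help here. It assumes that dependencies are directed, so that the maximum number of inter-edges between two clusters is $2 N_i N_j$. Take a hypothetical pair of clusters with $N_i = 3$ and $N_j = 2$ modules, joined by $\varepsilon_{i,j} = 3$ inter-edge dependencies:

```latex
C_{i,j} = \frac{\varepsilon_{i,j}}{2 N_i N_j} = \frac{3}{2 \cdot 3 \cdot 2} = \frac{3}{12} = 0.25
```

Two clusters with no dependencies between them have $C_{i,j} = 0$, which is the value the clustering tries to approach.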

Modularization Quality
Modularization Quality (MQ) is defined as the measurement of the "goodness" of a particular system modularization.
- Specifically, the MQ of a modularization with k clusters combines $H_i$, the cohesion of the i-th cluster, with $C_{i,j}$, the coupling between the i-th and j-th clusters
- This measurement shows the trade-off between cohesion and coupling by:
  - Rewarding many small, highly cohesive clusters
  - Penalizing too many inter-edges
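The MQ formula is not reproduced in this transcript. The sketch below assumes one normalization that matches the description: MQ is the average cohesion minus the average coupling over the k(k-1)/2 cluster pairs, and MQ equals the cohesion of the single cluster when k = 1. The function names, the edge-set representation and the example graph are all hypothetical.

```python
from itertools import combinations

def cohesion(cluster, edges):
    """H_i: number of intra-cluster edges divided by the maximum, N_i^2."""
    intra = sum(1 for u, v in edges if u in cluster and v in cluster)
    return intra / len(cluster) ** 2

def coupling(a, b, edges):
    """C_ij: edges between clusters a and b (either direction) over 2*N_i*N_j."""
    inter = sum(1 for u, v in edges
                if (u in a and v in b) or (u in b and v in a))
    return inter / (2 * len(a) * len(b))

def mq(partition, edges):
    """Assumed MQ: average cohesion minus average pairwise coupling."""
    k = len(partition)
    avg_h = sum(cohesion(c, edges) for c in partition) / k
    if k == 1:
        return avg_h
    pairs = list(combinations(partition, 2))  # k(k-1)/2 cluster pairs
    avg_c = sum(coupling(a, b, edges) for a, b in pairs) / len(pairs)
    return avg_h - avg_c

# Hypothetical five-module system with directed dependencies.
edges = {("M1", "M2"), ("M2", "M3"), ("M3", "M1"), ("M4", "M5"), ("M3", "M4")}
partition = [{"M1", "M2", "M3"}, {"M4", "M5"}]
print(mq(partition, edges))
```

For this example the cohesions are 3/9 and 1/4 and the only coupling is 1/12, so the value is 7/24 - 2/24 = 5/24. Under this normalization MQ lies between -1 and 1.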

Modularization Quality Example
[Figure: example modularization with three subsystems. Subsystem 1 = {M1, M2, M3}; Subsystem 2 = {M4, M5}; Subsystem 3 = {M6, M7, M8}]

Partitions of a Set
We must construct a data model to represent a partition (a clustering) of a software system.
Consider the source code organization for system S.
- $S = \{M_1, M_2, \ldots, M_n\}$
- Let a collection $\Pi = \{A_1, A_2, \ldots, A_k\}$ be a set of non-empty subsets such that each $A_i \subseteq S$. $\Pi$ is a partition of S if:
  - The subsets are a covering of S: $A_1 \cup A_2 \cup \cdots \cup A_k = S$
  - The subsets are mutually exclusive: $A_i \cap A_j = \emptyset$ for $i \neq j$
Each subset $A_i$ is called a cluster of the partition.
A partition of S into k non-empty clusters is called a k-partition of S.
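The partition conditions can be checked mechanically. The following is a minimal sketch that assumes clusters are given as Python sets; the function name `is_partition` is made up for illustration.

```python
def is_partition(clusters, s):
    """Return True if `clusters` is a partition of the set `s`."""
    clusters = [set(c) for c in clusters]
    if any(len(c) == 0 for c in clusters):
        return False                       # every cluster must be non-empty
    if any(not c <= s for c in clusters):
        return False                       # every cluster must be a subset of S
    union = set().union(*clusters)
    covers = union == s                    # the clusters cover S
    # Sizes add up to the size of the union only if no element is shared.
    disjoint = sum(len(c) for c in clusters) == len(union)
    return covers and disjoint

S = {"M1", "M2", "M3"}
print(is_partition([{"M1", "M2"}, {"M3"}], S))        # a 2-partition of S
print(is_partition([{"M1", "M2"}, {"M2", "M3"}], S))  # M2 appears twice
```

The first call satisfies both conditions. The second fails the mutual-exclusion test because M2 belongs to two clusters.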

Number of k-Partitions of a Set
Let S be a set of n elements. The number of k-partitions of an n-set satisfies the recurrence equation:
$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$
The entries $S_{n,k}$ are called Stirling numbers (of the second kind).
Stirling numbers govern the number of k-partitions of a set.
Stirling numbers grow exponentially with respect to the size of S.
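The standard recurrence for Stirling numbers of the second kind translates directly into a memoized function. This is a small sketch; the name `stirling2` is chosen here for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split an n-element set into k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    # Element n either forms a new cluster on its own (first term)
    # or joins one of the k clusters of the remaining n-1 elements.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

print([stirling2(5, k) for k in range(1, 6)])
```

For n = 5 the values are 1, 15, 25, 10 and 1, which sum to 52 partitions in total.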

Clustering: Optimal Solution
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$, where each $M_i$ is a module in the software system
- Let G be the graph representing the relationships between the modules in S
- Generate every partition of set S
- Evaluate MQ for each partition
- The partition with the largest MQ is the optimal solution
The algorithm works well for sets of up to 15 elements; beyond that, the number of partitions becomes too large to enumerate in a reasonable timeframe.
Clearly, sub-optimal techniques must be employed for large sets.
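The exhaustive search can be sketched as a recursive partition generator plus a `max` over a scoring function. The generator below is one common way to enumerate set partitions, not necessarily the one used in the authors' tool. The `score` argument stands in for MQ, and the toy score in the usage lines is purely hypothetical.

```python
def all_partitions(items):
    """Yield every partition of the list `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        # Put `first` into each existing cluster in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or give `first` a cluster of its own.
        yield [[first]] + smaller

def best_partition(items, score):
    """Exhaustive search: return a partition with the largest score."""
    return max(all_partitions(items), key=score)

modules = ["M1", "M2", "M3", "M4"]
print(sum(1 for _ in all_partitions(modules)))                 # partitions of a 4-set
print(best_partition(modules, lambda p: -abs(len(p) - 2)))     # toy score: prefer 2 clusters
```

Because every partition is visited once, the running time grows with the total number of partitions, which is what limits this approach to small systems.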

How many partitions are there?
The following table gives the total number of partitions (summed over all values of k) of a system with N modules.

 N    Number of partitions
 1    1
 2    2
 3    5
 4    15
 5    52
 6    203
 7    877
 8    4140
 9    21147
 10   115975
 11   678570
 12   4213597
 13   27644437
 14   190899322
 15   1382958545
 16   10480142147
 17   82864869804
 18   682076806159
 19   5832742205057
 20   51724158235372
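These totals are the Bell numbers. One way to reproduce them without enumerating partitions is the Bell triangle, sketched below; the function name `bell_numbers` is illustrative.

```python
def bell_numbers(count):
    """Return [B_1, ..., B_count], the total partition counts for 1..count elements."""
    row = [1]
    result = []
    for _ in range(count):
        result.append(row[-1])          # the last entry of each row is a Bell number
        nxt = [row[-1]]                 # the next row starts with that entry
        for x in row:
            nxt.append(nxt[-1] + x)     # each entry = left neighbour + entry above-left
        row = nxt
    return result

for n, b in enumerate(bell_numbers(20), start=1):
    print(n, b)
```

Running the loop prints one line per N and can be compared against the table row by row.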

Sub-Optimal Modularization Strategy
The search space required for enumerating all possible partitions is too large for most software systems.
- We need to develop a search strategy that quickly discovers an acceptable sub-optimal clustering
Generic Sub-Optimal Algorithm
1. Construct a resource dependency graph G that represents the relationships between the modules in S.
2. Generate uniformly distributed random clusterings of S. We use a combinatorial algorithm to accomplish this task because our sub-optimal techniques require the generation of many random clusterings.
3. Iteratively improve a randomly generated clustering, by measuring its MQ, until no further improvement is possible. This task is accomplished by heuristically moving modules in S between the generated clusters.
4. Repeat this process until an acceptable sub-optimal result is determined.

Neighboring Partition
We need a way to improve a partition's MQ.
We define a partition NP to be a neighbor of a partition P if and only if:
- NP is exactly the same as P except that a single element of a cluster in P is in a different cluster in NP
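A neighbor generator can be sketched as follows. It assumes a partition is stored as a dictionary from module name to cluster label, and, as a simplification, it only moves a module into another cluster that already exists and never opens a new one. The representation and the name `neighbors` are hypothetical.

```python
def neighbors(assignment):
    """Yield every assignment that differs from `assignment` by moving one module."""
    clusters = set(assignment.values())
    for module, current in assignment.items():
        for target in clusters:
            if target != current:
                moved = dict(assignment)   # copy so the original is left untouched
                moved[module] = target
                yield moved

P = {"M1": 0, "M2": 0, "M3": 1}
for NP in neighbors(P):
    print(NP)
```

With n modules and k clusters this yields n(k-1) candidates, which is the set that a hill-climbing step has to search.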

Generic Sub-Optimal Algorithm
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$, where each $M_i$ is a module in the software system
- Let G be the graph representing the relationships between the modules in S
- Generate a random partition P of set S
- If possible, find a neighboring partition NP that has an improved MQ over P
- If an improved neighboring partition is found, let P = NP
- P is the sub-optimal solution
A variety of algorithms for finding sub-optimal solutions are possible, depending on how "improved" is defined.

Steepest-Ascent Hill Climbing (SAHC) Algorithm
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$, where each $M_i$ is a module in the software system
- Let G be the graph representing the relationships between the modules in S
- Generate a random partition P of set S
- Repeat
  - Find the best neighboring partition BNP of P
  - If MQ(BNP) > MQ(P), let P = BNP
- Until no further improved BNPs can be found
- P is the sub-optimal solution
BNP may be expensive to calculate:
- All neighboring partitions of P must be examined
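The SAHC loop can be sketched generically, with the neighbor generator and the quality measure passed in as functions. The parameter names are made up for illustration. In the clustering setting `start` would be a random partition, `neighbors` would enumerate neighboring partitions and `score` would be MQ. The usage lines use a toy integer landscape only to keep the example short.

```python
def sahc(start, neighbors, score):
    """Steepest-ascent hill climbing: always move to the best neighbor."""
    current = start
    while True:
        candidates = list(neighbors(current))   # every neighbor is examined
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) <= score(current):
            return current                      # local maximum reached
        current = best

# Toy landscape: states are integers and the single peak is at 3.
result = sahc(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
print(result)
```

Each iteration costs one score evaluation per neighbor, which is why the best neighbor can be expensive to find on large systems.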

Next-Ascent Hill Climbing (NAHC) Algorithm
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$, where each $M_i$ is a module in the software system
- Let G be the graph representing the relationships between the modules in S
- Generate a random partition P of set S
- Repeat
  - Find a better neighboring partition bNP of P
  - If MQ(bNP) > MQ(P), let P = bNP
- Until no further improved bNPs can be found
- P is the sub-optimal solution
A bNP is discovered by randomly searching the set of neighboring partitions until a partition with a higher MQ is found.
- Usually, not all NPs will have to be examined
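A generic sketch of the NAHC loop follows. It examines neighbors in random order and accepts the first one that improves the score. The function and parameter names are hypothetical, the optional `rng` argument exists only to make runs repeatable, and the toy integer landscape in the usage lines stands in for partitions scored by MQ.

```python
import random

def nahc(start, neighbors, score, rng=None):
    """Next-ascent hill climbing: take the first improving neighbor found."""
    rng = rng or random.Random()
    current = start
    improved = True
    while improved:
        improved = False
        candidates = list(neighbors(current))
        rng.shuffle(candidates)                  # search the neighbors in random order
        for candidate in candidates:
            if score(candidate) > score(current):
                current = candidate              # accept without checking the rest
                improved = True
                break
    return current

result = nahc(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2, random.Random(0))
print(result)
```

Compared with steepest ascent, each step is cheaper because the scan stops at the first improvement, at the cost of possibly taking more steps.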

A Genetic Algorithm Framework
Our experimentation with the SAHC and NAHC algorithms has shown that, given an initial random starting partition:
- The algorithms will converge to a local maximum
- However, not all initial partitions converge to an acceptable result
Therefore we must either:
- Run the experiment many times using different initial partitions and pick the experiment that results in the largest MQ
- Or devise an approach that works with a population of randomly generated initial partitions and concurrently improves them until all of the initial samples converge
  - The partition in the final population with the largest MQ is the sub-optimal solution
This approach lends itself to being implemented with a Genetic Algorithm.

Genetic Algorithms
Genetic algorithms were first developed by John Holland et al. at the University of Michigan.
Genetic algorithms have been applied to many problems that involve exploring large search spaces.
Characteristics of GAs:
- Combine survival-of-the-fittest techniques with a structured and randomized information exchange
  - Facilitates innovative algorithms that parallel the natural selection process
GAs are more than a randomized search: they exploit historical data to speculate on new information that is expected to yield improved results.

Genetic Search Sub-Optimal Clustering Algorithm
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$, where each $M_i$ is a module in the software system
- Let G be the graph representing the relationships between the modules in S
- Generate a population of random partitions of set S
- Repeat
  - Randomly select a percentage of partitions from the population and improve them using the SAHC or NAHC technique
  - Generate a new population (from the current one) by using a biased wheel that favors partitions with larger MQ
  - Let the new population become the current population
- Until no improvement is seen for t generations, until the population has converged, or until the maximum number of generations has been executed
- The partition P in the final generation with the largest MQ is the sub-optimal solution
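The biased-wheel step can be sketched with fitness-proportional (roulette-wheel) selection. This is an assumption about what the wheel does, not a description of the authors' implementation. The sketch requires non-negative fitness values with a positive total, so an MQ that can be negative would first need shifting, for example by using MQ + 1. The function name is hypothetical.

```python
import random

def biased_wheel(population, fitness, rng=None):
    """Draw a new population of the same size, favoring fitter members.

    Members are drawn with replacement; the chance of picking a member is
    proportional to its (non-negative) fitness.
    """
    rng = rng or random.Random()
    weights = [fitness(p) for p in population]
    return rng.choices(population, weights=weights, k=len(population))

# Hypothetical population of three labeled partitions with assumed scores.
scores = {"P1": 0.10, "P2": 0.30, "P3": 0.60}
print(biased_wheel(list(scores), scores.get, random.Random(42)))
```

Fitter partitions tend to appear several times in the new population while weak ones tend to disappear, which is what drives the population toward convergence.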

Agglomerative Clustering
The previous algorithms discovered subsystems based on the graph that was formed by recovering the relationships that exist between the source code components.
In most systems, however, we are interested in finding a hierarchy of subsystems that captures the higher-order relationships that exist in the software.
Wrapping our algorithms with an agglomerative clustering engine solves this problem.

Agglomerative Clustering Algorithm
Algorithm
- Let $S = \{M_1, M_2, \ldots, M_n\}$
- Let G be the resource dependency graph
- Let Q be a queue
- Repeat
  - Find a maximal partition (Pmax) of S using the Optimal, SAHC or NAHC algorithm
  - Save partition Pmax on Q
  - Now let $S = \{C_1, C_2, \ldots, C_n\}$, where each $C_i$ is a cluster in Pmax
  - Build a new graph G by treating each cluster in Pmax as a single element. If there is at least one edge between any two clusters in Pmax, then there is an edge between their representative nodes in G
- Until Pmax has coalesced into a single cluster
- Q contains a hierarchy of partitions
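The agglomerative wrapper can be sketched with two small functions. `collapse` builds the next-level graph and `agglomerate` runs the outer loop. The clustering step is passed in as `find_partition`, standing in for the Optimal, SAHC or NAHC search. All names are hypothetical, and the early exit when a level merges nothing is a safeguard added in this sketch rather than part of the algorithm as stated.

```python
def collapse(partition, edges):
    """One node per cluster; an edge wherever two clusters share at least one edge."""
    owner = {m: idx for idx, cluster in enumerate(partition) for m in cluster}
    return {(owner[u], owner[v]) for u, v in edges if owner[u] != owner[v]}

def agglomerate(modules, edges, find_partition):
    """Return the list of partitions found at each level of the hierarchy."""
    levels = []
    nodes, graph = set(modules), set(edges)
    while len(nodes) > 1:
        partition = find_partition(nodes, graph)
        levels.append(partition)
        if len(partition) == len(nodes):
            break                          # nothing merged: stop instead of looping forever
        graph = collapse(partition, graph)
        nodes = set(range(len(partition)))  # clusters become the next level's elements
    return levels

def pair_up(nodes, graph):
    """Hypothetical stand-in for the clustering search: group sorted nodes in pairs."""
    ordered = sorted(nodes)
    return [set(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]

for level in agglomerate(["a", "b", "c", "d"], {("a", "b"), ("b", "c"), ("c", "d")}, pair_up):
    print(level)
```

The stand-in `pair_up` ignores the graph entirely. A real run would plug in a search that maximizes MQ at each level, and the returned list then plays the role of the queue Q.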

Where to Get the Clustering Engine
We have implemented the clustering engines and applied them to many examples.
The system can be downloaded from the Drexel University Software Engineering Research Group (SERG) homepage at:
- http://www.mcs.drexel.edu/~serg
The clustering engine was developed using the Java 1.1 programming language.

Compiler Example
[Figure: clustered graph for the compiler example]

Boxer (Autolayout Utility) Example
[Figure: clustered graph for the Boxer autolayout utility example]
