Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects
Erika Camargo and Ochimizu Koichiro
Japan Advanced Institute of Science and Technology (JAIST)
ESEM 2009

Contents
1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work

Abstract
Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.
First attempt at a solution: simple log data transformations.
[Figure: P(y=1), the probability of a fault-prone class, plotted against X, a design-complexity metric]

Background
Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models.
Among these metrics are the Chidamber & Kemerer (CK) metrics:
  – The 80th and 20th percentiles of their distributions can be used to determine high and low values
  – Their thresholds cannot be determined before use and should be derived and used locally
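As a rough illustration of deriving such thresholds locally, the sketch below computes the 20th and 80th percentile cut-offs for one project's metric values. The function name `percentile_thresholds` and the WMC values are made up for illustration and are not data from the study; the interpolation method is also an assumption, since the slides do not say how the percentiles were computed.

```python
import statistics

def percentile_thresholds(values):
    """Return (low, high) cut-offs at the 20th and 80th percentiles."""
    # quantiles(n=5) yields the four cut points at 20%, 40%, 60% and 80%
    cuts = statistics.quantiles(values, n=5, method="inclusive")
    return cuts[0], cuts[3]

# Hypothetical WMC values for a single project (illustrative only)
wmc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
low, high = percentile_thresholds(wmc)
print(f"low threshold: {low}, high threshold: {high}")
```

Running the same function on another project's values would generally give different cut-offs, which is the sense in which the thresholds are local.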

Problem Analysis
Can an LR model built with this kind of metric work well with different software projects?
[Figure: P(y=1) against X = Number of Methods for a small software project and a large software project, with regions marked LEAST FAULTY and MOST FAULTY; axis ticks at 10 and 20]

Case Study
1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open-source project (1st release of the Mylyn system, with 638 Java classes).
  – Independent variables: CK-CBO, CK-RFC, CK-WMC (one per model)
  – Dependent variable: defects (from Bugzilla and CVS)
3. Testing of these models with 2 other, smaller projects (with 11 and 13 Java classes).
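A univariate LR model of the kind listed in step 2 can be sketched as follows. This is a minimal illustration, not the authors' procedure: it fits P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by plain gradient ascent on the log-likelihood, and the CBO values, fault labels, learning rate and function names are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_univariate_lr(xs, ys, lr=0.05, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of the log-likelihood
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(b0, b1, x):
    """Estimated probability that a class with metric value x is fault-prone."""
    return sigmoid(b0 + b1 * x)

# Toy data (illustrative only): one metric value per class, 1 = fault-prone
cbo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
faulty = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

b0, b1 = fit_univariate_lr(cbo, faulty)
print(predict(b0, b1, 2), predict(b0, b1, 9))
```

Testing on another project, as in step 3, amounts to calling `predict` with that project's metric values while keeping `b0` and `b1` fixed.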

Challenge
[Figure not recovered. Its annotation states that the data shown produce biased regression estimates and reduce the predictive power of regression models.]
Projects:
  – BNS: Banking system (2006) *
  – CRS: Cruise control system (2005) *
  – ECS: E-commerce system (2006) *
  – ELCS: Elevator control system (2003) *
  – FACS: Factory automation system (2005) *
  – GMF: Graphic Modeling Framework **
  – MYL: Mylyn system **
(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
(**) Eclipse project

The RFC data of BNS is more spread out than the RFC data of MYL.
[Figure not recovered; the project legend is the same as on the Challenge slide.]

Case Study
Solution: simple data transformation using log10.
  – $\mathrm{LCBO} = \log_{10}(\mathrm{CBO} + 1)$
  – $\mathrm{LTCBO} = \log_{10}(\mathrm{CBO} + 1) + dm$
where $dm$ is the difference between the CBO medians of the Mylyn system and of the system whose data is being transformed.
Example [figure not recovered]: there are fewer outliers and the data spread is more uniform.
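The two transformations can be sketched in a few lines. Several details here are assumptions rather than facts from the slides: `dm` is computed as the median of the reference (Mylyn) project minus the median of the target project, both medians are taken on the log-transformed values, and the function names and CBO values are invented for illustration.

```python
import math
import statistics

def log_transform(values):
    """LCBO-style transform: log10(x + 1) for each raw metric value."""
    return [math.log10(v + 1) for v in values]

def shifted_log_transform(values, reference_values):
    """LTCBO-style transform: log10(x + 1) + dm.

    Assumption: dm = median(log reference) - median(log target), so the
    target's median lines up with the reference project's median.
    """
    logged = log_transform(values)
    dm = statistics.median(log_transform(reference_values)) - statistics.median(logged)
    return [v + dm for v in logged]

# Illustrative CBO values only (not data from the study)
mylyn_cbo = [0, 9, 99]   # reference project used to build the model
other_cbo = [0, 0, 9]    # project whose data is being transformed

print(log_transform(other_cbo))
print(shifted_log_transform(other_cbo, mylyn_cbo))
```

Adding 1 before taking the logarithm keeps classes with a metric value of 0 defined, and the shift by `dm` moves the target project's values toward the range the model was built on.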

Results
Effects of the log data transformations:
  – Elimination of a great number of outliers
  – Better overall goodness of fit of the 3 models
Discrimination (most faulty / least faulty):
  – All models discriminate well between the most faulty and least faulty classes of the Mylyn system
  – What about using different projects?

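Results: Banking System
MF: most faulty, LF: least faulty. Effect: ↑ more classes correctly classified, = no change.

| Group | Model | Correct classification (raw data) | Correct classification (log-transformed data) | Effect |
|---|---|---|---|---|
| MF (6 classes) | CBO | 2 | 5 | ↑ |
| MF (6 classes) | RFC | 5 | 5 | = |
| MF (6 classes) | WMC | 6 | 6 | = |
| LF (5 classes) | CBO | 5 | 5 | = |
| LF (5 classes) | RFC | 3 | 3 | = |
| LF (5 classes) | WMC | 4 | 4 | = |
| Both (11 classes) | CBO | 7 | 10 | ↑ |
| Both (11 classes) | RFC | 8 | 8 | = |
| Both (11 classes) | WMC | 10 | 10 | = |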

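Results: E-commerce System
MF: most faulty, LF: least faulty. Effect: ↑ more classes correctly classified, ↓ fewer, = no change.

| Group | Model | Correct classification (raw data) | Correct classification (log-transformed data) | Effect |
|---|---|---|---|---|
| MF (9 classes) | CBO | 3 | 7 | ↑ |
| MF (9 classes) | RFC | 9 | 8 | ↓ |
| MF (9 classes) | WMC | 7 | 6 | ↓ |
| LF (4 classes) | CBO | 4 | 4 | = |
| LF (4 classes) | RFC | 0 | 3 | ↑ |
| LF (4 classes) | WMC | 0 | 4 | ↑ |
| Both (13 classes) | CBO | 7 | 11 | ↑ |
| Both (13 classes) | RFC | 9 | 11 | ↑ |
| Both (13 classes) | WMC | 7 | 10 | ↑ |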

Conclusions and Future Work
  – CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects.
  – Simple log transformations seem to improve the predictive ability of LR models, especially when the project's measures are not as spread out as those used to build the model.
  – Future work: further data exploration and study of data transformations.

Thank you! Questions, comments … contact: erika.camargo@jaist.ac.jp
