Predict Failures with Developer Networks and Social Network Analysis


Predict Failures with Developer Networks and Social Network Analysis Andrew Meneely et al.

Introduction
Research question: predict failures at the file level.
Importance: dramatically decrease fixing cost.
Research goal: examine human factors in failure prediction by applying social network analysis to code churn information.

Introduction (cont.)
Method: introduce file-based metrics based on social network analysis (SNA) as additional predictors of software failures.
Case study: a mature Nortel product (over 3 million LOC).
Models were built using failure data from two releases and validated against a subsequent release.
Result: the 20% of files ranked highest by one model contained 58% of the failures, against 61% for an optimal prioritization.
Finding: a significant correlation exists between file-based developer network metrics and failures.

Definitions of Network Metrics
Node, connection, path
Geodesic path (social distance): the shortest path between two nodes
Diameter: the longest geodesic path
Connectivity: measures of direct connections
  Degree: the number of connections on a node
  Hub (a "well-known" developer): a node whose degree is above a threshold
  Disconnected: a node that has no edges

Network Metrics (cont.)
Centrality: quantifies how closely a node is indirectly connected to the rest of the network
  Closeness: the average distance from v to any other node in the network that can be reached from v
  Betweenness: the number of geodesic paths that include v divided by the total number of geodesic paths in the network
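The definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tooling used in the study. The example network and developer names are made up, and betweenness is read literally from the slide: shortest paths that pass through v as an intermediate node, divided by all shortest paths in the network.

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS from source: geodesic distance and number of geodesic paths to each reachable node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                count[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                count[neighbor] += count[node]
    return dist, count


def degree(graph, v):
    """Number of direct connections on node v."""
    return len(graph[v])


def closeness(graph, v):
    """Average geodesic distance from v to every node reachable from v."""
    dist, _ = shortest_path_counts(graph, v)
    others = [d for node, d in dist.items() if node != v]
    return sum(others) / len(others) if others else 0.0


def diameter(graph):
    """Longest geodesic path between any two connected nodes."""
    return max(max(shortest_path_counts(graph, v)[0].values()) for v in graph)


def betweenness(graph, v):
    """Geodesic paths with v as an intermediate node, divided by all geodesic paths."""
    nodes = sorted(graph)
    info = {n: shortest_path_counts(graph, n) for n in nodes}
    dist_v, count_v = info[v]
    through = total = 0
    for i, s in enumerate(nodes):
        dist_s, count_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            total += count_s[t]
            if v in (s, t) or v not in dist_s or t not in dist_v:
                continue
            if dist_s[v] + dist_v[t] == dist_s[t]:
                through += count_s[v] * count_v[t]
    return through / total if total else 0.0


# Hypothetical developer network; "frank" is a disconnected node.
graph = {
    "alice": {"bob"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"bob"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
    "frank": set(),
}

# A hub is a node whose degree is above a chosen threshold (2 here, arbitrarily).
hubs = [dev for dev in graph if degree(graph, dev) > 2]
```

In this toy network "bob" is the only hub, and half of all geodesic paths run through him.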

Get Developer Network Metrics
Step 1: Initial code churn information
Step 2: Construct developer social network
Step 3: Compute developer-based metrics
Step 4: Compute file-based metrics

Step 1: code churn information

Step 2: developer social network

Step 3: developer-based metrics

Step 4: file-based metrics
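The four steps can be sketched as follows. The transcript does not say how developers are linked or how developer metrics are rolled up to files, so this sketch assumes that two developers are connected when they changed the same file, and that a file's metrics are simple aggregates (count, maximum, mean) over the developers who touched it. The churn records, file names, and the choice of degree as the developer metric are all illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Step 1: code churn information as (developer, file) records (made-up data).
churn = [
    ("alice", "net.c"),
    ("bob", "net.c"),
    ("bob", "ui.c"),
    ("carol", "ui.c"),
    ("dave", "db.c"),
]


def build_developer_network(churn):
    """Step 2: connect developers who changed the same file (assumed rule)."""
    devs_by_file = defaultdict(set)
    for dev, path in churn:
        devs_by_file[path].add(dev)
    graph = defaultdict(set)
    for devs in devs_by_file.values():
        for dev in devs:
            graph[dev]  # touching the key creates the node, even if isolated
        for a, b in combinations(sorted(devs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph), dict(devs_by_file)


def developer_degrees(graph):
    """Step 3: one developer-based metric, the degree of each developer."""
    return {dev: len(neighbors) for dev, neighbors in graph.items()}


def file_metrics(degrees, devs_by_file):
    """Step 4: aggregate developer metrics over the developers of each file."""
    result = {}
    for path, devs in devs_by_file.items():
        values = [degrees[dev] for dev in devs]
        result[path] = {
            "num_developers": len(devs),
            "max_degree": max(values),
            "mean_degree": sum(values) / len(values),
        }
    return result


graph, devs_by_file = build_developer_network(churn)
metrics = file_metrics(developer_degrees(graph), devs_by_file)
```

Other developer metrics, such as closeness or betweenness, could be aggregated per file in the same way.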

Independent and Dependent Variables
Independent variables
Dependent variables:
  the number of system test failures for a file
  the number of post-release failures for a file

Model selection and validation
Find the best combination of variables and a regression
Training set and validation set
Candidate regressions:
  Number of failures for a given file: negative binomial regression and Poisson regression
  Probability that a file had at least one failure: logistic regression

Step One: Initial model selection
Determine:
  combinations of candidate variables
  transformations of variables
  candidate regressions
  weights of variables
Evaluated by goodness-of-fit statistics (training error), calculated in SAS v9.1 using proc genmod

Step Two: Final model selection
Cross-validation
  Training partition and validation partition
  Catches over-fit models
  Scored by the Spearman rank correlation coefficient
The two models with the highest average correlation coefficient and the lowest standard deviation become our final models to be validated
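The scoring in this step can be sketched as follows: compute a Spearman rank correlation per cross-validation fold, then summarize a candidate model by the mean and standard deviation across folds. This is a standard-library illustration, not the SAS procedure used in the study. The function names and the fold format (a list of predicted/observed pairs) are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, observed):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(observed))


def summarize_folds(folds):
    """Mean and standard deviation of the per-fold Spearman coefficients.

    `folds` is a list of (predicted, observed) pairs, one per validation partition.
    """
    scores = [spearman(pred, obs) for pred, obs in folds]
    return mean(scores), stdev(scores)
```

A candidate with a high mean and a low standard deviation ranks files consistently across partitions, which is the selection criterion stated on the slide.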

Step Three: Model validation
Evaluated against the validation set
Two criteria:
  Spearman rank correlation coefficient between the estimated values and the observed values
  The difference between our predicted prioritization and an optimal prioritization
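The second criterion can be illustrated with a short sketch: rank the files by predicted failures, take the top fraction, and measure what share of the observed failures falls in those files. The optimal prioritization ranks by the observed failures themselves. The function name, the 20% default, and the data are hypothetical.

```python
def share_of_failures_found(scores, observed, fraction=0.2):
    """Share of observed failures in the top `fraction` of files ranked by `scores`."""
    n_top = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = sum(observed[i] for i in ranked[:n_top])
    total = sum(observed)
    return found / total if total else 0.0


# Made-up failure counts and model predictions for ten files.
observed = [5, 0, 3, 0, 0, 1, 0, 0, 1, 0]
predicted = [4.0, 0.5, 1.0, 0.2, 0.1, 3.0, 0.3, 0.0, 0.9, 0.4]

model_share = share_of_failures_found(predicted, observed)
optimal_share = share_of_failures_found(observed, observed)
```

The gap between `model_share` and `optimal_share` shows how far the predicted ordering falls short of the best possible ordering for the same inspection budget.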

Step Four: Further Analysis
Evaluate how well the model works:
  compare it with an SLOC model
  compare it with a model containing only code churn metrics and not network metrics, and vice versa
Assess network metrics as an early indicator
Investigate possible latent factors

Case study
An industrial product at Nortel Networks: 2,500 of 11,000 files (3.17 million LOC)
System testing model (step 1):
  negative binomial regression
  Degree was positively correlated with failures
  Closeness was negatively correlated with failures
  The actual beta-weights are not included
Cross-validation (step 2):
  The Spearman rank correlation coefficient for the system test model was 0.778
  60.5% of the variance was explained
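The two cross-validation figures appear to be related by squaring the coefficient. That reading is an inference from the numbers, not something the slide states:

```latex
r_s = 0.778
\quad\Longrightarrow\quad
r_s^2 = 0.778^2 \approx 0.605 = 60.5\%
```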

Model validation (step 3) Next release

Model validation (cont.) rate of actual discovery of failures by the Nortel system test team

Compared with other models Model selection and validation

Model as an Early Indicator
Use our model early in development
We repeated the ten-fold cross-validation using data from only the first half of the development time during release Rn+1
Average Spearman rank correlation of 0.693 with a standard deviation of 0.02; all correlation coefficients were significant (p<0.01)

Conclusion
Our model performed well in prioritizing files based on predicted failures.
Developer networks are useful for failure prediction early in the development phase and provide a useful abstraction of the code churn data.

Thank you!