1 An Approach to Software Testing of Machine Learning Applications Chris Murphy, Gail Kaiser, Marta Arias Columbia University.

2 Introduction
We are investigating the quality assurance of Machine Learning (ML) applications
Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
Machine Learning applications fall into a class for which it can be said that there is “no reliable oracle”
  – These are also known as “non-testable programs” and could fall into Davis and Weyuker’s class of “programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known.”

3 Introduction
We have developed an approach to creating test cases for Machine Learning applications:
Analyze the problem domain and real-world data sets
Analyze the algorithm as it is defined
Analyze an implementation’s runtime options
Our approach was designed for MartiRank and then generalized to other ranking algorithms such as Support Vector Machines (SVM)

4 Overview
Machine Learning Background
Testing Approach and Framework
Findings and Results
Evaluation and Observations
Future Work

5 Machine Learning Fundamentals
Data sets consist of a number of examples, each of which has attributes and a label
In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label
In the second phase, the model is applied to a previously-unseen data set (“testing” data) with unknown labels to produce a classification (or, in our case, a ranking)
  – This can be used for validation or for prediction
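
The two phases can be sketched in a few lines of Python. This is a toy illustration only, not MartiRank or SVM: the `Example` class, the `train` and `rank` functions, and the "pick one attribute" model are all hypothetical names and choices made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    attributes: List[float]
    label: Optional[int] = None  # None means the label is unknown


def train(examples: List[Example]) -> int:
    """Toy 'training' (an assumption for illustration): the model is the index
    of the attribute whose mean differs most between label 1 and label 0."""
    n_attrs = len(examples[0].attributes)

    def gap(i: int) -> float:
        pos = [e.attributes[i] for e in examples if e.label == 1]
        neg = [e.attributes[i] for e in examples if e.label == 0]
        return sum(pos) / len(pos) - sum(neg) / len(neg)

    return max(range(n_attrs), key=gap)


def rank(model: int, examples: List[Example]) -> List[Example]:
    """Apply the model to unlabeled examples: sort by the chosen attribute,
    highest first."""
    return sorted(examples, key=lambda e: e.attributes[model], reverse=True)


# Phase 1: training data has attributes and labels.
training = [
    Example([0.9, 0.1], 1),
    Example([0.8, 0.7], 1),
    Example([0.2, 0.5], 0),
    Example([0.1, 0.9], 0),
]
model = train(training)

# Phase 2: previously unseen data with unknown labels is ranked.
unseen = [Example([0.3, 0.2]), Example([0.7, 0.4]), Example([0.5, 0.9])]
ranking = rank(model, unseen)
```

When the labels of the second data set are actually known but withheld, the resulting ranking can be compared against them for validation; when they are truly unknown, the ranking is a prediction.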

6 MartiRank and SVM
MartiRank was specifically designed for the device failure application
  – Seeks to find the combination of segmenting and sorting the data that produces the best result
SVM is typically a classification algorithm
  – Seeks to find a hyperplane that separates examples from different classes
  – Different “kernels” use different approaches
  – SVM-Light has a ranking mode based on the distance from the hyperplane
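
To make "ranking based on the distance from the hyperplane" concrete, here is a minimal sketch that assumes a linear decision function with an already-learned weight vector `w` and bias `b`. It is not SVM-Light's code; the function name, the weights, and the points are made up for illustration.

```python
import math
from typing import Dict, List


def signed_distance(w: List[float], b: float, x: List[float]) -> float:
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical learned hyperplane and unlabeled points (assumptions).
w, b = [3.0, 4.0], -5.0
points: Dict[str, List[float]] = {
    "p": [3.0, 4.0],
    "q": [1.0, 0.5],
    "r": [0.0, 0.0],
}

# Rank examples from farthest on the positive side to farthest on the negative side.
ranking = sorted(points, key=lambda name: signed_distance(w, b, points[name]),
                 reverse=True)
```

A classifier would only use the sign of this value; a ranking mode keeps the magnitude as well, so examples on the same side of the hyperplane are still ordered.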

7 Related Work
There has been much research into applying Machine Learning techniques to software testing, but not the other way around
Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness

8 Analyzing the Problem Domain
Consider properties of the real-world data sets
  – Data set size: number of attributes and examples
  – Range of values: attributes and labels
  – Precision of floating-point numbers
  – Categorical data: how alphanumeric attributes are addressed
Also, repeating or missing data values

9 Analyzing the Algorithm
Look for imprecisions in the specification, not necessarily bugs in the implementation
  – How to handle missing attribute values
  – How to handle negative labels
Consider how to construct a data set that could cause a “predictable” ranking
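
One way to read "predictable" is a data set in which a single attribute cleanly separates failures from non-failures, so that a correct ranker should place every failure at the top. The sketch below builds such a data set under that assumption; the function names, value ranges, and sizes are hypothetical and are not taken from the authors' framework.

```python
import random
from typing import List, Tuple

Row = Tuple[List[float], int]  # (attribute values, label)


def make_predictable_dataset(n_examples: int, n_attrs: int, n_failures: int,
                             seed: int = 0) -> List[Row]:
    """Attribute 0 is high for label 1 and low for label 0 (assumed design);
    every other attribute is random noise."""
    rng = random.Random(seed)
    rows: List[Row] = []
    for i in range(n_examples):
        label = 1 if i < n_failures else 0
        signal = rng.uniform(0.6, 1.0) if label == 1 else rng.uniform(0.0, 0.4)
        noise = [rng.random() for _ in range(n_attrs - 1)]
        rows.append(([signal] + noise, label))
    rng.shuffle(rows)  # do not leak the answer through the input order
    return rows


def is_perfect_ranking(ranked_labels: List[int]) -> bool:
    """True when every label-1 example is ranked above every label-0 example."""
    return ranked_labels == sorted(ranked_labels, reverse=True)


rows = make_predictable_dataset(n_examples=50, n_attrs=5, n_failures=10)
# Stand-in for the ranker under test: sort on the informative attribute.
ranked = sorted(rows, key=lambda r: r[0][0], reverse=True)
```

Because the expected outcome is known in advance, `is_perfect_ranking` acts as a partial oracle for this one data set, even though no oracle exists for arbitrary input.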

10 Analyzing the Runtime Options
Determine how the implementation may manipulate the input data
  – Permuting the input order
  – Reading the input in “chunks”
Consider configuration parameters
  – For example, disable anything probabilistic
Need to ensure that results are deterministic and repeatable
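
A check for sensitivity to input order might look like the following sketch. The two rankers are stand-ins written for this example (they are not MartiRank or SVM-Light), and `is_order_invariant` is a hypothetical helper: it reruns a ranker on the reversed input and on several shuffles, then compares each result with the baseline.

```python
import random
from typing import Callable, List, Tuple

Row = Tuple[str, float]  # (example id, score)


def rank_by_score(rows: List[Row]) -> List[str]:
    """Order-sensitive stand-in: tied scores keep their input order."""
    return [rid for rid, _ in sorted(rows, key=lambda r: -r[1])]


def rank_by_score_then_id(rows: List[Row]) -> List[str]:
    """Order-insensitive stand-in: ties are broken on the example id."""
    return [rid for rid, _ in sorted(rows, key=lambda r: (-r[1], r[0]))]


def is_order_invariant(ranker: Callable[[List[Row]], List[str]],
                       rows: List[Row], trials: int = 20, seed: int = 0) -> bool:
    """Return False if any permutation of the input changes the ranking."""
    rng = random.Random(seed)  # fixed seed keeps the check itself repeatable
    baseline = ranker(rows)
    candidates = [rows[::-1]]
    for _ in range(trials):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        candidates.append(shuffled)
    return all(ranker(candidate) == baseline for candidate in candidates)


rows = [("a", 0.9), ("b", 0.5), ("c", 0.5), ("d", 0.1)]
sensitive = is_order_invariant(rank_by_score, rows)            # False
insensitive = is_order_invariant(rank_by_score_then_id, rows)  # True
```

The check needs no knowledge of the correct ranking; it only requires that the same data in a different order yields the same result.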

11 The Testing Framework
Data set generator: # of examples, # of attributes, % failures, % missing, any categorical data, repeat/no-repeat modes
Model comparison: specific to MartiRank
Ranking comparison: includes metrics like normalized equivalence and AUCs
Tracing options: for generating and comparing outputs of debugging statements
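
A data set generator with these parameters could be sketched as below. This is not the authors' tool: the function name, parameter names, and value ranges are assumptions, and categorical attributes are left out to keep the sketch short.

```python
import random
from typing import List, Optional, Tuple

Row = Tuple[List[Optional[int]], int]  # (attribute values, label); None = missing


def generate_dataset(n_examples: int, n_attrs: int, pct_failures: float = 0.1,
                     pct_missing: float = 0.0, allow_repeats: bool = True,
                     seed: int = 0) -> List[Row]:
    """Generate a synthetic data set; a fixed seed makes it reproducible."""
    rng = random.Random(seed)

    # Labels: 1 marks a failure, 0 a non-failure.
    n_failures = round(n_examples * pct_failures)
    labels = [1] * n_failures + [0] * (n_examples - n_failures)
    rng.shuffle(labels)

    # One column per attribute, with or without repeated values.
    columns = []
    for _ in range(n_attrs):
        if allow_repeats:
            column = [rng.randint(0, 9) for _ in range(n_examples)]
        else:
            column = rng.sample(range(10 * n_examples), n_examples)
        columns.append(column)

    # Blank out each value with probability pct_missing.
    rows: List[Row] = []
    for i, label in enumerate(labels):
        attrs = [None if rng.random() < pct_missing else column[i]
                 for column in columns]
        rows.append((attrs, label))
    return rows


data = generate_dataset(100, 4, pct_failures=0.2, allow_repeats=False)
```

Calling the generator twice with the same arguments returns the same data set, which is what allows a generated suite to be reused later.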

12 Equivalence Classes
Data sizes of different orders of magnitude
Repeating vs. non-repeating attribute values
Missing vs. no-missing attribute values
Categorical vs. non-categorical data
0/1 labels vs. non-negative integer labels
Predictable vs. non-predictable data sets
Used data set generator to parameterize test case selection criteria
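
Treating each equivalence class as one dimension, test configurations can be enumerated as a cross product and passed to a generator. The sketch below assumes that every combination is of interest; the dictionary keys and the concrete sizes (10, 100, 1000) are illustrative and do not come from the slides.

```python
from itertools import product
from typing import Dict, Iterator, List

# One entry per equivalence class; the values are illustrative assumptions.
EQUIVALENCE_CLASSES: Dict[str, List[object]] = {
    "n_examples": [10, 100, 1000],
    "repeats": [True, False],
    "missing": [True, False],
    "categorical": [True, False],
    "labels": ["binary", "non_negative_int"],
    "predictable": [True, False],
}


def enumerate_configurations(classes: Dict[str, List[object]]) -> Iterator[dict]:
    """Yield one configuration per combination of class values."""
    names = list(classes)
    for values in product(*(classes[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configurations(EQUIVALENCE_CLASSES))
```

With three sizes and five two-way splits this yields 96 configurations; a real suite might prune combinations that are not meaningful.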

13 Testing MartiRank
Produced a core dump on data sets with a large number of attributes (over 200)
Implementation does not correctly handle negative labels
Does not use a “stable” sorting algorithm
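
Why stability matters for a ranking can be shown with tied values. In the sketch below, Python's built-in `sorted` (which is guaranteed stable) is compared with a swap-based selection sort, a textbook example of an unstable algorithm. The slides do not say which sort the implementation used, so the selection sort is only a stand-in.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]  # (device id, attribute value)


def selection_sort(rows: List[Row], key: Callable[[Row], int]) -> List[Row]:
    """Swap-based selection sort: correct ordering by key, but not stable."""
    rows = list(rows)
    for i in range(len(rows)):
        smallest = i
        for j in range(i + 1, len(rows)):
            if key(rows[j]) < key(rows[smallest]):
                smallest = j
        # The swap can move an element past another one with an equal key.
        rows[i], rows[smallest] = rows[smallest], rows[i]
    return rows


def by_value(row: Row) -> int:
    return row[1]


devices = [("a", 2), ("b", 2), ("c", 1)]  # "a" and "b" are tied

stable = sorted(devices, key=by_value)            # ties keep input order
unstable = selection_sort(devices, key=by_value)  # ties may be reordered
```

Both outputs are sorted by value, yet the tied devices `a` and `b` come out in different orders, so two implementations of the same algorithm can produce different rankings on data with repeated values.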

14 Regression Testing of MartiRank
Creation of a suite of testing data allowed us to use it for regression testing
Discovered that refactoring had introduced a bug into an important calculation

15 Testing Multiple Implementations of MartiRank
We had three implementations developed by three different coders
Can be used as “pseudo-oracles” for each other
Used to discover a bug in the way one implementation was handling missing values
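
The pseudo-oracle idea can be sketched as running every implementation on the same input and grouping them by output. The three rankers below are toy stand-ins, not the MartiRank implementations; the one that treats a missing value as zero is an invented discrepancy used to show how a disagreement surfaces.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, Optional[float]]  # (example id, value); None = missing


def rank_missing_last(rows: List[Row]) -> List[str]:
    """Rank known values high to low, then append examples with missing values."""
    known = sorted((r for r in rows if r[1] is not None),
                   key=lambda r: (-r[1], r[0]))
    missing = sorted((r for r in rows if r[1] is None), key=lambda r: r[0])
    return [rid for rid, _ in known + missing]


def rank_missing_last_v2(rows: List[Row]) -> List[str]:
    """Same intended behavior, written independently with a single sort key."""
    return [rid for rid, _ in
            sorted(rows, key=lambda r: (r[1] is None, -(r[1] or 0.0), r[0]))]


def rank_missing_as_zero(rows: List[Row]) -> List[str]:
    """Divergent stand-in: silently treats a missing value as 0."""
    return [rid for rid, _ in sorted(rows, key=lambda r: (-(r[1] or 0.0), r[0]))]


def group_by_output(implementations: Dict[str, Callable[[List[Row]], List[str]]],
                    rows: List[Row]) -> Dict[tuple, List[str]]:
    """Map each distinct ranking to the implementations that produced it."""
    groups: Dict[tuple, List[str]] = {}
    for name, implementation in implementations.items():
        groups.setdefault(tuple(implementation(rows)), []).append(name)
    return groups


rows = [("a", 0.9), ("b", None), ("c", -0.5), ("d", 0.2)]
groups = group_by_output(
    {"impl_a": rank_missing_last,
     "impl_b": rank_missing_last_v2,
     "impl_c": rank_missing_as_zero},
    rows,
)
```

More than one group signals a discrepancy worth investigating. Agreement among a majority does not prove correctness, and the odd one out is only a suspect until the specification is consulted.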

16 Applying Approach to SVM-Light
Permuting the input data led to different models
  – Caused by “chunking” data for use by an approximating variant of the optimization algorithm
Introduction of noise in a data set in some cases caused it not to find a “predictable” ranking
Different kernels also caused different results with “predictable” rankings

17 Evaluation and Observations
Testing approach revealed bugs and imprecision in the implementations, as well as discrepancies from the stated algorithms
Inspection of the algorithms led to the creation of “predictable” data sets
What is “predictable” for one algorithm may not lead to a “predictable” ranking in another
Algorithm’s failure to address specific data set traits can lead to incorrect results (and/or inconsistent results across implementations)
The approach can be generalized to other Machine Learning ranking algorithms, as well as classification

18 Limitations and Future Work
Test suite adequacy for coverage not addressed
Can also include mutation testing for effectiveness of data sets
Should investigate creating large data sets that correlate to real-world data
Could also consider non-deterministic Machine Learning algorithms

19 Questions?
