How do we know if AI is right? Challenges in the testing of AI systems


1 How do we know if AI is right? Challenges in the testing of AI systems
Jukka K Nurminen
Professor, Data-Intensive Computing in Natural Sciences
AI Testing, Tiedekulma

2 AI research focuses on algorithms, less on testing and other real-use issues
“However, to date very little work has been done on assuring the correctness of the software applications that implement machine learning algorithms.”
Xie, X., Ho, J. W. K., Murphy, C., Kaiser, G., Xu, B., & Chen, T. Y. (2011). Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4), 544–558.

“Software testing is well studied, as is machine learning, but their intersection has been less well explored in the literature.”
Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” NIPS Workshop on Reliable Machine Learning in the Wild.

3 Failures in software systems
- Bad investment decisions: Knight Capital was the largest trader in U.S. equities. Due to a computer trading “glitch” in 2012, it took a $440M loss in less than an hour.
- Fatal cancer treatment: a bug in the code controlling the Therac-25 radiation therapy machine was directly responsible for at least five patient deaths in the 1980s when it administered excessive quantities of beta radiation.
- Biased decisions: AIs from IBM, Microsoft and the Chinese company Megvii could correctly identify a person’s gender from a photograph 99 per cent of the time, but only for white men. For dark-skinned women, accuracy dropped to just 35 per cent.
- Uber self-driving car hitting a pedestrian who was walking a bicycle (2018).

4 Software lifetime costs: development is only a small part
- For classic software, maintenance cost dominates.
- Testing cost is about the same size as development cost.
- How is it for AI software?
(Source: Software Life-Cycle Costs, Schach 2002)

5 AI is still experimental - Lifecycle support problems are not yet visible
- AI is still mainly in research labs (and news headlines), although some companies are very active and advanced.
- When major deployments start to happen, efficient software processes for AI are likely to become of interest.
- BUT we are not there yet.

6 AI does not always give the right answer. How do we deal with statistical results?
(Slide images: classifier outputs Cat, Cat, Cat, Dog)

7 We do not know the ”right answer”

8 We do not agree on the ”right answer” => AI ethics
- Statistical results: the outcome is a level of confidence, not pass or fail. Accuracy is, e.g., 97% in statistical testing. False positives and false negatives have different consequences.
- Oracle problem: we do not know the “right” answer (reinforcement learning, optimization). We use machine learning for exactly those kinds of problems we cannot solve explicitly.
- Is the training/testing material representative? Is it biased?
- Is there a bug, or is it just a feature of the problem? Would another neural net architecture do better?
- Besides, a buggy ML program neither crashes nor produces an error message; it just fails to learn or act properly.
- Thus traditional software testing (“verify that the output for this input is as expected”) does not work for ML; the metamorphic-style check sketched below is one way around the missing oracle.
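A minimal sketch of a metamorphic test, one way to test without a known right answer (my illustration of the technique cited from Xie et al. above; the k-NN model, scikit-learn, and the column-permutation relation are my choices, not from the slides):

```python
# Metamorphic testing sketch: instead of checking outputs against a known
# "right answer", check a relation that must hold between two runs.
# For a Euclidean k-NN classifier, permuting the feature columns
# (consistently for training and query data) must not change any prediction.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def test_feature_permutation_invariance(X_train, y_train, X_query, k=3, seed=0):
    perm = np.random.default_rng(seed).permutation(X_train.shape[1])
    base = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    permuted = KNeighborsClassifier(n_neighbors=k).fit(X_train[:, perm], y_train)
    # The relation must hold even though we never say what the correct label is.
    assert (base.predict(X_query) == permuted.predict(X_query[:, perm])).all()
```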

9 Challenges of testing machine learning models
- Statistical results: the outcome is a level of confidence, not pass or fail (see the sketch below).
- Oracle problem: we do not have the “right answer”.
- A buggy ML program neither crashes nor produces an error message; it just fails to learn or act properly.
- The borderline between bug and feature is vague.
- Is the training/testing material representative? Is it biased?
- Would another neural net architecture do better? Would another model do better?
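A minimal sketch (my illustration, not from the slides) of what a statistical pass/fail criterion can look like in practice: accept the model only when a lower confidence bound on its measured accuracy clears a threshold.

```python
# Statistical acceptance test sketch: the verdict is a confidence statement
# about accuracy, not an exact "output equals expected" assertion.
import math

def accuracy_lower_bound(correct, total, z=1.96):
    """Normal-approximation lower bound of a ~95% confidence interval."""
    p = correct / total
    return p - z * math.sqrt(p * (1 - p) / total)

def test_model_accuracy(predictions, labels, threshold=0.95):
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    lower = accuracy_lower_bound(correct, len(labels))
    # "Pass" means we are fairly confident accuracy exceeds the threshold;
    # it remains a statistical statement, not a guarantee.
    assert lower >= threshold, f"accuracy lower bound {lower:.3f} < {threshold}"
```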

10 Testing of an ML model: the supervised learning case
(Diagram: the original dataset is split into a training set, a dev set and a test set, e.g. 60% / 20% / 20%, or 70% / 30% without a separate dev set. The ratios are not fixed: for very large datasets (1M examples) the dev and test sets should be much smaller, e.g. 1% = 10k.)
- We have a set of feature vectors and a label for each.
- Split the data into training and test sets.
- Select the classifier type, network architecture, and hyper-parameters.
- Train the classifier with the training data only.
- Test with the dev set (and the test set), as in the sketch below.
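A minimal sketch of the workflow on this slide, using scikit-learn and a toy dataset of my choosing (the split ratios follow the slide's 60/20/20 example):

```python
# Supervised-learning test setup sketch: split into train/dev/test, train on
# the training data only, tune against the dev set, report the test score once.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # feature vectors with a label for each

# 60% train, 20% dev, 20% test (the slide notes these ratios are not fixed).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000)      # classifier type and hyper-parameters
clf.fit(X_train, y_train)                    # train with the training data only

print("dev accuracy:", clf.score(X_dev, y_dev))     # used while selecting the model
print("test accuracy:", clf.score(X_test, y_test))  # reported once at the end
```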

11 Adversarial input

12 Evtimov et al. Robust Physical-World Attacks on Deep Learning Models
LISA CNN based on AlexNet
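The attack above uses physical stickers on road signs; as a digital cousin, here is a minimal FGSM sketch (my illustration, assuming a PyTorch classifier that returns logits and images scaled to [0, 1]), showing how a small, targeted perturbation can flip a prediction:

```python
# Fast Gradient Sign Method (Goodfellow et al.) sketch: move the input a tiny
# step in the direction that increases the model's loss the most.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.03):
    """Return an adversarially perturbed copy of `image` (batch of [0,1] tensors)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()   # small, worst-case step
    return adversarial.clamp(0.0, 1.0).detach()
```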

13 The ML model is only a part of a bigger software system
In production systems, ML code is often less than 5% of the total code.
(Source: Google Crash Course on Machine Learning)

14 ML models and other software modules have complex interactions
- Unexpected output can cause problems elsewhere in the system.
- Changes in any module can influence other modules.
- AI modules may not be 100% accurate.
- How do we keep errors from propagating? Is there a way to dampen the errors (see the sketch below)?
- How do updates in an upstream ML model propagate downstream?
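One simple way to dampen errors at a module boundary, sketched below with a hypothetical predict_with_confidence interface (my illustration; the slides do not prescribe a mechanism): wrap the ML component and fall back to a conservative default when its output is low-confidence.

```python
# Error-dampening sketch: do not let a low-confidence ML output flow
# downstream unchecked; route it to a safe fallback instead.
def guarded_predict(model, features, confidence_threshold=0.8, fallback="manual_review"):
    # `predict_with_confidence` is an assumed interface returning (label, confidence).
    label, confidence = model.predict_with_confidence(features)
    if confidence < confidence_threshold:
        return fallback          # conservative default for downstream modules
    return label
```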

15 Autonomous driving
- Today’s car: ~100 control units, ~100 million lines of code.
- Future: multiple AI systems working together.
- Each situation is unique; the AI cannot be 100% sure of its outcome.
- If Uber/Volvo fixes this, can a similar problem exist in other car brands?
- How should authorities test these things?

16 New kinds of tests are needed
Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” NIPS Workshop on Reliable Machine Learning in the Wild.

17 Self-evaluation of ML capabilities
- Four sets of tests: data, model, infrastructure, monitoring.
- Score 0-5 in each area: 0 = more of a research project than a productized system; 5 = exceptional levels of automated testing and monitoring.
- ML Test Score = min over the four areas of that area's test points, so the weakest area determines the overall score (see the sketch below).
Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” NIPS Workshop on Reliable Machine Learning in the Wild.
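A tiny sketch of the scoring rule (the area names follow Breck et al.; the point values below are made-up examples): the overall ML Test Score is the minimum over the four areas, so one weak area caps the whole system.

```python
# ML Test Score sketch: minimum over the four test areas.
area_points = {
    "data": 3.0,            # example values only
    "model": 4.0,
    "infrastructure": 2.0,
    "monitoring": 3.5,
}

ml_test_score = min(area_points.values())
print("ML Test Score:", ml_test_score)   # 2.0, limited by the infrastructure tests
```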

18 Whitebox Testing of Neural Networks
- Code coverage testing provides little value, but with new tools we can see inside the neural net: DeepXplore, DLFuzz, …
- Testing: can we detect poorly trained parts of the network (see the neuron-coverage sketch below)?
- Maintenance: can we detect when a trained network is used differently from its training? Detecting the need for retraining, adversarial attacks, …
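A minimal sketch of neuron coverage, the kind of whitebox metric DeepXplore introduced (this is my simplified version, not the tools' actual implementation): the fraction of neurons whose activation exceeds a threshold for at least one test input. Low coverage points at parts of the network the tests never exercise.

```python
# Neuron coverage sketch: which (layer, neuron) pairs ever fire above a
# threshold on the test inputs?
def neuron_coverage(activations_per_input, threshold=0.25):
    """
    activations_per_input: one dict per test input, mapping a layer name to a
    sequence of that layer's (scaled) activation values.
    """
    covered, total = set(), set()
    for activations in activations_per_input:
        for layer, values in activations.items():
            for index, value in enumerate(values):
                total.add((layer, index))
                if value > threshold:
                    covered.add((layer, index))
    return len(covered) / len(total)
```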

19 Data problem Testbench
- Add artificial errors to data and see how they influence system operation.
(Diagram: an error generator injects errors into the data flowing into the ML system.)
- Plug in new modules easily.
- Built-in error types plus user-defined new error types.
- Allow comparing how results change as a function of data problems (see the sketch below).
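A minimal sketch of the testbench idea (my illustration; it assumes numpy feature matrices and a scikit-learn-style model.score): inject one built-in error type, missing values, at increasing rates and record how accuracy degrades.

```python
# Data-problem testbench sketch: artificial errors in, degradation curve out.
import numpy as np

def inject_missing_values(X, error_rate, rng):
    """Built-in error type: randomly blank out features (here set to 0.0)."""
    X_noisy = X.copy()
    X_noisy[rng.random(X_noisy.shape) < error_rate] = 0.0
    return X_noisy

def degradation_curve(model, X_test, y_test, error_rates=(0.0, 0.1, 0.2, 0.4), seed=0):
    rng = np.random.default_rng(seed)
    return {rate: model.score(inject_missing_values(X_test, rate, rng), y_test)
            for rate in error_rates}
```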

20 IVVES ITEA Project Proposal
- WP1 Case Studies: Automotive, Health, Engineering, Banking, Telecom
- WP2 Validation Techniques for ML: Model Quality, Data Quality, Data Creation, Online Testing & Monitoring, ML-based Testing, Risk-based Testing
- WP3 Testing Techniques for Complex Evolving Systems
- WP4 Data Analytics in Engineering: Data Analytics in Development, Data Analytics in QA, Data Collection
- WP5 Framework & Methodology for DevOps
- WP6 Standardization, Dissemination and Exploitation
University of Helsinki & VTT + 11 industrial partners from Finland + an international consortium (Germany, France, Sweden, Netherlands, Spain, Canada)
Interesting problems and challenges wanted!

21 Thank you!

