Projection of network outputs

Presentation transcript:

Coloring black boxes: visualization of neural network decisions
Włodzisław Duch
School of Computer Engineering, Nanyang Technological University, Singapore, and Department of Informatics, Nicolaus Copernicus University, Torun, Poland.

Problem
Common belief: neural networks (NN) are black boxes performing incomprehensible functions, so they should not be used for safety-critical applications. Understanding of network decisions is possible using logical rules (Duch et al. 2001), but logical rules equivalent to NN functions severely distort the original decision borders. The feature space is partitioned by rules into hypercuboids (for crisp logical rules) or ellipsoids (for typical triangular or Gaussian fuzzy membership functions).
Typical NN software outputs: mean square error (MSE), classification error, sometimes an estimate of the classification probability, sometimes ROC curves. But perhaps the errors are confined to a distant, localized region of the feature space? How can confidence in predictions be estimated, and easy cases distinguished from difficult ones? Well-trained MLPs give estimates of p(C|X) close to 0 and 1: they are overconfident in their predictions. Ten errors with p(C|X) ≈ 1 are more dangerous than twenty with p(C|X) ≈ 0.5. How to compare networks that have identical accuracy but quite different weights and biases? Is the network hiding quirky behavior that may lead to completely wrong results on new data? Is regularization improving the quality of the NN?
If only one could see it ... but how? Feature spaces are highly dimensional. Answer: look at the mapping of the training data samples and their similarities. K classes => scatterogram in K-dimensional space.

Dynamics of learning
Left: with 3 neurons the network will always under-fit the data, unable to separate the small classes. Right: even with 6 hidden neurons, problems with convergence may arise in some runs. Recommendation: stop further learning and restart from another initialization. Perfect solutions may be dangerous; to check for overfitting and to see classification margins, add some noise. RBF networks always provide soft decision borders and are more robust under perturbations (Wine data below).
5-class Satimage data. Left: images cluster around the pentagon vertices; middle: adding noise with 0.01 variance shows which clusters have almost overlapping regions; right: this is even more evident using 5 Gaussian clusters.
Two solutions after 100 iterations, SCG training, 8 hidden nodes, 164 and 197 errors.

Projection of network outputs
Map each N-dimensional training vector X to the output vector o(X) = {o_i(X) ∈ [0,1]}, i = 1..K. Similarity between vectors X => similarity between their images o(X), with respect to class membership and distance. Output classes may form a continuum => smooth transition from o_1(X)=1, o_2(X)=0 to o_1(X)=0, o_2(X)=1. Sometimes a softmax transformation is used to obtain probabilities o_i(X) = p(C_i|X), but then information is lost: since the p(C_i|X) sum to 1, the answers "don't know" (o_1(X)=0, o_2(X)=0) and "member of both classes" (o_1(X)=1, o_2(X)=1) are eliminated. For K > 3 use parallel coordinates, non-linear MDS or linear projections.
Map the outputs o_i(X) to the K corners of a regular polygon in 2D: the (0,0) vertex corresponds to the (1,0,...,0) output, the (0,1) vertex to (0,1,...,0), etc. A sketch of this mapping is given below.
Two very good converged solutions after 19K and 22K iterations, SCG training, 8 hidden nodes, 1 error, same MSE. Left: vectors similar to the green ones are half-way between green and blue; right: high chance of confusing some vectors from the blue and red classes.
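The polygon mapping can be written in a few lines. The following is a minimal sketch, not the original NETLAB code: polygon_vertices and project_outputs are assumed helper names, the pure-class vertices are placed on the unit circle (the poster places the first vertex at (0,0)), and the output vectors are synthetic, for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

def polygon_vertices(k):
    """Vertices of a regular K-gon on the unit circle; vertex i is the image
    of the pure output (0,...,1,...,0) with the 1 in position i."""
    angles = 2 * np.pi * np.arange(k) / k
    return np.column_stack([np.cos(angles), np.sin(angles)])   # shape (K, 2)

def project_outputs(outputs):
    """Linear projection: each 2D image is the output-weighted sum of the
    vertices, so o = e_i lands exactly on vertex i."""
    return outputs @ polygon_vertices(outputs.shape[1])        # shape (N, 2)

# Synthetic K = 3 outputs (assumed data, for illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
outputs = np.clip(np.eye(3)[labels] + 0.15 * rng.normal(size=(300, 3)), 0.0, 1.0)

xy = project_outputs(outputs)
sizes = 10 + 60 * outputs.min(axis=1)   # larger markers near the (1,1,1) corner
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=sizes)
plt.gca().set_aspect("equal")
plt.title("Images of network outputs near the polygon vertices")
plt.show()
```

With this centred placement the image of the [0,1]^3 cube is the hexagon discussed below; note that after a softmax the outputs would sum to 1 and all images would be confined to the simplex spanned by the class vertices, which is the information loss mentioned above.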
These networks are over-confident and not stable, mapping vectors onto single points: the weights are very large, the sigmoids are step-like, and the decision borders are very sharp. Sometimes re-labeling the classes simplifies the picture. For a higher number of classes use parallel coordinates, MDS, several 2D scatterograms or linear projections.

Conclusions
Images of the training data vectors mapped by a NN:
- show the dynamics of learning and problems with convergence;
- show overfitting and underfitting effects;
- allow comparison of different network solutions with the same MSE/accuracy and of different training algorithms, and inspection of classification margins by perturbing the input vectors;
- display regions of the input space where potential problems may arise;
- display the effects of regularization, model selection and early stopping;
- show the stability of network classification under perturbation of the original vectors;
- enable quick identification of errors and outliers;
- estimate confidence in the classification of a given vector by placing it in relation to known data vectors.

[Figure: projection of three outputs onto a hexagon; the corners are labeled with the binary (o_1,o_2,o_3) combinations and lines such as (a,0,0) or (1,1,a) run toward the centre. Vectors near the (1,1,1) corner have larger markers.]
Since the outputs lie in [0,1], the projections lie within a hexagon whose corners correspond to the binary (o_1,o_2,o_3) values; opposite corners have the bits inverted, o_i → 1-o_i. Points corresponding to vectors that weakly excite a single output approach the (0,0,0) point along the (a,0,0), (0,a,0) or (0,0,a) lines, while points in the overlapping region of classes two and three approach the centre along the (a,1,1), (1,a,1) and (1,1,a) lines.
Effect: poor generalization of the network, evident when the training vectors are perturbed by adding noise. Adding a regularization term to the error function makes the sigmoids smoother. Left: solution with 12 hidden neurons and a = 0.01 regularization coefficient, 2 errors; right: small-variance (0.001) noise perturbation, 5 additional vectors added per training vector.
The best network solutions show large clusters of points around the vertices of the polygon, without overlaps between clusters and with no vectors close to the centre of the projection. Scatterograms contain more information than ROC curves, which display detection rates for a given false-alarm rate: images close to the polygon vertices correspond to high probability assigned by the classifier, the types of errors are evident, and potential errors may be noticed even before they occur. Goal of training: not zero error or zero MSE, but stable, separated images!

References
W. Duch, R. Adamczak and K. Grąbczewski, Methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks 12: 277-306, 2001.
I. Nabney and C. Bishop, NETLAB software, Aston University, Birmingham, UK, 1997.

Data
3-class data, from UCI: Hypothyroid, screening tests for thyroid problems. 3772 cases for training (first year), 3428 cases for test; about 92.5% normal (blue), about 5% primary hypothyroid (red), and about 2.5% compensated hypothyroid (green). 21 attributes, 15 binary and 6 continuous (age + levels of five hormones).
5-class data, from UCI: Satimage, six types of soil imaged by the Landsat satellite multi-spectral scanner. The 3x3 neighborhoods of a central pixel in 4 different spectral bands are provided as feature vectors (36 dimensions).
The last, mixed soil class has been removed to make the small figures more legible, leaving only 5 classes and 3397 training samples.
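The margin and stability check described in the conclusions can be sketched as follows. This is a hedged sketch under assumptions: net_outputs is an assumed stand-in for the trained network's forward pass (any model returning K outputs in [0,1] will do), project_outputs is the polygon projection sketched earlier, and the defaults (5 perturbed copies per vector, variance 0.001) follow the numbers quoted in the text.

```python
import numpy as np

def perturbation_images(X, net_outputs, project_outputs,
                        n_copies=5, variance=0.001, seed=0):
    """Project the images of the original and noise-perturbed training vectors.

    Returns the 2D images of X and an (n_copies, N, 2) array of images of
    Gaussian-perturbed copies of X.
    """
    rng = np.random.default_rng(seed)
    base = project_outputs(net_outputs(X))
    noisy = [project_outputs(net_outputs(
                 X + rng.normal(scale=np.sqrt(variance), size=X.shape)))
             for _ in range(n_copies)]
    return base, np.stack(noisy)
```

A stable solution keeps the perturbed images tightly clustered around the originals and away from the centre of the polygon; images that scatter toward other vertices flag vectors with small classification margins.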