Understanding of complex data using Computational Intelligence methods Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland.


1 Understanding of complex data using Computational Intelligence methods Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland http://www.phys.uni.torun.pl/~duch

2 What am I going to say: Data and CI. What we hope for. Forms of understanding. Visualization. Prototypes. Logical rules. Some knowledge discovered. Expert system for psychometry. Conclusions, or why am I saying this?

3 Types of Data. Data was precious! Now it is overwhelming... Statistical data – clean, numerical, controlled experiments, vector space model. Relational data – marketing, finances. Textual data – Web, NLP, search. Complex structures – chemistry, economics. Sequence data – bioinformatics. Multimedia data – images, video. Signals – dynamic data, biosignals. AI data – logical problems, games, behavior...

4 Computational Intelligence: Data => Knowledge. Related fields: Artificial Intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, neural networks, soft computing.

5 Turning data into knowledge. What should CI methods do? Provide descriptive and predictive non-parametric models of data. Allow one to classify, approximate, associate, correlate, and complete patterns. Allow discovery of new categories and interesting patterns. Help to visualize multi-dimensional relationships among data samples. Allow one to understand the data in some way. Help to model brains!

6 Forms of useful knowledge. AI/Machine Learning camp: neural nets are black boxes. Unacceptable! Symbolic rules forever. But... knowledge accessible to humans comes as: symbols; similarity to prototypes; images, visual representations. What type of explanation is satisfactory? An interesting question for cognitive scientists, with different answers in different fields.

7 Data understanding. Types of explanation: visualization-based (maps, diagrams, relations...); exemplar-based (prototypes and similarity); logic-based (symbols and rules). Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbor methods do. Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, and neurofuzzy systems do. Logical rules are the highest form of summarization of knowledge.

8 Visualization: dendrograms. All projections (cuboids) on 2D subspaces are identical; dendrograms do not show the structure. Data: normal and malignant lymphocytes.

9 Visualization: 2D projections All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure. 3-bit parity + all 5-bit combinations.

10 Visualization: MDS mapping. Results of pure MDS mapping, with the centers of hierarchical clusters connected. Data: 3-bit parity + all 5-bit combinations.

11 Visualization: 3D projections. Fine Needle Aspirate of Breast Lesions (red = malignant, green = benign); only age is continuous, the other values are binary. A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354, 1518-1521.

12 Visualization: MDS mappings. Try to preserve all distances in a 2D nonlinear mapping. MDS of large sets using LVQ + relative mapping: Antoine Naud + WD, this conference.

13 Prototype-based rules. C-rules (crisp) are a special case of F-rules (fuzzy rules); F-rules are a special case of P-rules (prototype rules). P-rules have the form: IF P = arg min_R D(X,R) THEN Class(X) = Class(P), where D(X,R) is a dissimilarity (distance) function determining decision borders around prototype P. P-rules are easy to interpret! IF you (X) are most similar to the prototype P = Superman THEN you are in the Super-league. IF you are most similar to P = Weakling THEN you are in the Failed-league. "Similar" may involve different features or different D(X,P).
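A minimal sketch of such a nearest-prototype (P-rule) classifier; the prototypes, labels, and feature values below are invented for illustration, and Euclidean distance stands in for any dissimilarity D(X,R):

```python
import math

def euclidean(x, p):
    # D(X, P): plain Euclidean dissimilarity between feature vectors
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def p_rule_classify(x, prototypes, dist=euclidean):
    """IF P = arg min_R D(X, R) THEN Class(X) = Class(P)."""
    nearest_proto, label = min(prototypes, key=lambda pc: dist(x, pc[0]))
    return label

# Two illustrative prototypes, one per class
prototypes = [((0.0, 0.0), "A"), ((3.0, 3.0), "B")]
print(p_rule_classify((0.5, 0.2), prototypes))  # closer to (0,0) -> A
```

Swapping in a different `dist` (Manhattan, Chebyshev, a data-dependent function for symbolic data) changes the shape of the decision borders without changing the rule's form.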

14 P-rules. Euclidean distance leads to Gaussian fuzzy membership functions with product as the T-norm. Manhattan distance => mu(X;P) = exp{-|X-P|}. Various distance functions lead to different membership functions, e.g. data-dependent distance functions for symbolic data.

15 Crisp P-rules. New distance functions from information theory => interesting membership functions. Membership functions => new distance functions, with local D(X,R) for each cluster. Crisp logic rules: use the L-infinity (Chebyshev) norm: D_inf(X,P) = ||X-P||_inf = max_i W_i |X_i - P_i|. D_inf(X,P) = const => rectangular contours. The Chebyshev distance with threshold theta_P, IF D_inf(X,P) <= theta_P THEN C(X) = C(P), is equivalent to a conjunctive crisp rule: IF X_1 in [P_1 - theta_P/W_1, P_1 + theta_P/W_1] AND ... AND X_N in [P_N - theta_P/W_N, P_N + theta_P/W_N] THEN C(X) = C(P).
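The equivalence between the weighted Chebyshev P-rule and a conjunction of interval conditions can be checked numerically; the prototype, weights, and threshold below are arbitrary illustration values:

```python
# Crisp P-rule: max_i W_i |X_i - P_i| <= theta
# versus the conjunctive rule: X_i in [P_i - theta/W_i, P_i + theta/W_i] for all i.
P, W, theta = (1.0, 2.0), (1.0, 0.5), 1.0

def chebyshev_rule(x):
    return max(w * abs(xi - pi) for xi, pi, w in zip(x, P, W)) <= theta

def interval_rule(x):
    return all(pi - theta / w <= xi <= pi + theta / w
               for xi, pi, w in zip(x, P, W))

# The two rules agree everywhere on a grid of test points
grid = [(-1 + 0.5 * i, 0.5 * j) for i in range(10) for j in range(10)]
assert all(chebyshev_rule(x) == interval_rule(x) for x in grid)
```

This is why constant-distance contours of the Chebyshev norm are rectangles: each coordinate is constrained independently.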

16 Decision borders. Euclidean distance from 3 prototypes, one per class. Minkowski distance (exponent 20) from 3 prototypes. Contours D(P,X) = const and decision borders D(P,X) = D(Q,X).

17 P-rules for Wine. Euclidean distance: 11 prototypes kept, 7 errors. Manhattan distance: 6 prototypes kept, 4 errors, feature f2 removed. L-infinity distance (crisp rules): 15 prototypes kept, 5 errors, features f2, f8, f10 removed. Many other solutions exist.

18 Complex objects. The vector space concept is not sufficient for complex objects; a common set of features is meaningless. AI: complex objects, states, problem descriptions. General approach: it is sufficient to evaluate similarity D(O_i, O_j). To compare O_i and O_j, define a transformation T built from elementary operators, e.g. substring substitutions. Many transformations T connecting a pair of objects O_i and O_j exist. Cost of a transformation = sum of the elementary operator costs; similarity = lowest transformation cost. Bioinformatics uses sophisticated similarity functions for sequences; dynamic programming finds the similarities. Use adaptive costs and a general framework for SBL methods. See Marczak et al. (this conference).
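The lowest-transformation-cost similarity described above is, in its simplest form, the classic dynamic-programming edit distance; unit costs are used here, whereas the adaptive, data-dependent costs the slide mentions would replace the constants:

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Cheapest sequence of insertions, deletions and substitutions
    turning string a into string b (Levenshtein distance)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * dele          # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j * ins           # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub))
    return d[m][n]

print(edit_distance("tactag", "tactgg"))  # one substitution -> 1
```

Bioinformatics alignment methods use the same recursion with position- and symbol-dependent costs instead of the three constants.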

19 Promoters. DNA strings, 57 nucleotides, 53 '+' and 53 '-' samples. Example: tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt. Euclidean distance: symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4. PDF distance: symbolic s = a, c, t, g replaced by p(s|+).
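The two symbolic-to-numeric encodings from the slide can be sketched directly; the p(s|+) frequencies below are invented placeholders, not values estimated from the promoter data:

```python
seq = "tactagcaat"  # fragment of a promoter string

# Encoding 1: fixed arbitrary code a,c,t,g -> 1,2,3,4 (for Euclidean distance)
numeric = {"a": 1, "c": 2, "t": 3, "g": 4}
# Encoding 2: replace each symbol by its class-conditional probability p(s|+)
# (hypothetical numbers; a real system estimates these from the '+' samples)
p_given_plus = {"a": 0.30, "c": 0.20, "t": 0.35, "g": 0.15}

x_numeric = [numeric[s] for s in seq]
x_pdf = [p_given_plus[s] for s in seq]
print(x_numeric[:4])  # [3, 1, 2, 3]
```

The PDF encoding makes distances reflect how typical a symbol is for the positive class, rather than an arbitrary ordering of the four letters.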

20 Logical rules. Crisp logic rules: for continuous x use linguistic variables (predicate functions): s_k(x) = True[X_k <= x <= X'_k], for example: small(x) = True{x | x < 1}, medium(x) = True{x | x in [1,2]}, large(x) = True{x | x > 2}. Linguistic variables are used in crisp (propositional, Boolean) logic rules: IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF... ELSE...
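Linguistic variables and the Brownie rule above translate directly into predicate functions; the thresholds follow the slide's example, and the rule itself is the slide's toy illustration:

```python
# Linguistic variables as crisp predicates over a continuous feature
def small(x):  return x < 1
def medium(x): return 1 <= x <= 2
def large(x):  return x > 2

# A crisp propositional rule is a conjunction of such predicates
def is_brownie(height, has_hat, has_beard):
    return small(height) and has_hat and has_beard

print(is_brownie(0.4, True, True))   # True
print(is_brownie(1.5, True, True))   # False: medium height, not small
```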

21 Crisp logic decisions Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders. Severe limitation on the expressive power of crisp logical rules!

22 DT decision borders. Decision trees lead to specific decision borders. SSV tree on Wine data, proline + flavanoids content.

23 Logical rules - advantages. Logical rules, if simple enough, are preferable. Rules may expose limitations of black-box solutions. Only relevant features are used in rules. Rules may sometimes be more accurate than NN and other CI methods. Overfitting is easy to control; rules usually have a small number of parameters. Rules forever!? A logical rule about logical rules is: IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

24 Logical rules - limitations. Logical rules are preferred but... Only one class is predicted, p(C_i|X,M) = 0 or 1; such a black-and-white picture may be inappropriate in many applications. The discontinuous cost function allows only non-gradient optimization. Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules. Reliable crisp rules may reject some cases as unclassified. Interpretation of crisp rules may be misleading. Fuzzy rules are not so comprehensible.

25 Rules - choices. Simplicity vs. accuracy; confidence vs. rejection rate. Accuracy (overall): A(M) = p++ + p--. Error rate: L(M) = p+- + p-+. Rejection rate: R(M) = p+r + p-r = 1 - L(M) - A(M). Sensitivity: S+(M) = p+|+ = p++/p+. Specificity: S-(M) = p-|- = p--/p-. Here p++ is a hit, p-+ a false alarm, p+- a miss.
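These quality measures follow directly from the joint probabilities p(true class, predicted class), with "r" marking rejected (unclassified) cases; the confusion counts below are invented for illustration:

```python
# Counts of (true class, predicted class); "r" = rejected as unclassified
counts = {("+", "+"): 80, ("+", "-"): 10, ("+", "r"): 10,
          ("-", "+"): 5,  ("-", "-"): 90, ("-", "r"): 5}
total = sum(counts.values())
p = {k: v / total for k, v in counts.items()}

A = p[("+", "+")] + p[("-", "-")]       # accuracy A(M)
L = p[("+", "-")] + p[("-", "+")]       # error rate L(M)
R = p[("+", "r")] + p[("-", "r")]       # rejection rate R(M)
assert abs(A + L + R - 1.0) < 1e-9      # R = 1 - L - A

p_plus = sum(v for (t, _), v in p.items() if t == "+")
p_minus = 1 - p_plus
S_plus = p[("+", "+")] / p_plus         # sensitivity S+(M)
S_minus = p[("-", "-")] / p_minus       # specificity S-(M)
print(round(S_plus, 3), round(S_minus, 3))
```

Raising the rejection threshold trades accuracy on the accepted cases against a larger R(M), which is exactly the confidence-vs-rejection choice the slide names.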

26 Neural networks and rules. Network diagram: inputs (Sex, Age, Smoking, ECG: ST Elevation, Pain Intensity, Pain Duration) connected through input and output weights to the output, Myocardial Infarction ~ p(MI|X).

27 Knowledge from networks. Simplify networks: force most weights to 0, quantize the remaining parameters, be constructive! Regularization: a mathematical technique improving the predictive abilities of the network. Result: MLP2LN neural networks that are equivalent to logical rules.

28 MLP2LN. Converts MLP neural networks into a network performing logical operations (LN). Input layer; linguistic units: windows, filters; rule units: threshold logic; aggregation: better features; output: one node per class.

29 Learning dynamics. Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.

30 Neurofuzzy systems. Feature Space Mapping (FSM) neurofuzzy system: neural adaptation, estimation of the probability density distribution (PDF) using a single-hidden-layer network (RBF-like) with nodes realizing separable functions. Fuzzy: the crisp yes/no membership of x is replaced by a degree of membership mu(x). Triangular, trapezoidal, Gaussian... membership functions. Membership functions in many dimensions: products of one-dimensional factors.
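A separable node of the kind FSM uses can be sketched as a product of one-dimensional Gaussian membership functions; the center and widths below are illustrative values, and the Gaussian is just one of the membership-function shapes the slide lists:

```python
import math

def separable_gaussian(x, center, widths):
    """mu(X; P, s) = prod_i exp(-(X_i - P_i)^2 / (2 s_i^2)):
    a separable multi-dimensional membership function built as a
    product of 1D Gaussian factors."""
    return math.prod(math.exp(-(xi - ci) ** 2 / (2 * si ** 2))
                     for xi, ci, si in zip(x, center, widths))

mu = separable_gaussian((1.0, 2.0), center=(1.0, 2.0), widths=(0.5, 0.5))
print(mu)  # 1.0 at the center, decaying smoothly away from it
```

Separability is what lets each node be read as a conjunction of per-feature fuzzy conditions, the bridge from the network to fuzzy rules.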

31 GhostMiner Philosophy. GhostMiner: data mining tools from our lab. There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees. Provide tools for visualization of data. Support the process of knowledge discovery/model building and evaluation, organizing it into projects. Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.

32 Recurrence of breast cancer. Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. 286 cases: 201 no recurrence (70.3%), 85 recurrence (29.7%). 9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1, 2, 3), breast, breast quadrant, radiation. Sample record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes.

33 Recurrence of breast cancer. Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. Many systems used, 65-78% accuracy reported. Single rule: IF nodes-involved not in [0,2] AND degree-malignant = 3 THEN recurrence, ELSE no-recurrence. 76.2% accuracy; only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.
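The single rule can be written as an executable predicate. The comparison operators are garbled in this transcript, so the reading "more than 2 nodes involved" is an assumption inferred from the slide's own summary sentence:

```python
def recurrence_rule(nodes_involved, degree_malignant):
    """One plausible reading of the slide's single MLP2LN rule:
    IF nodes-involved not in [0, 2] AND degree-malignant = 3
    THEN recurrence ELSE no-recurrence."""
    if nodes_involved > 2 and degree_malignant == 3:
        return "recurrence-events"
    return "no-recurrence-events"

print(recurrence_rule(5, 3))  # recurrence-events
print(recurrence_rule(1, 3))  # no-recurrence-events
```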

34 Recurrence - comparison.
Method                     10xCV accuracy
MLP2LN, 1 rule             76.2
SSV DT, stable rules       75.7 ± 1.0
k-NN, k=10, Canberra       74.1 ± 1.2
MLP + backprop             73.5 ± 9.4 (Zarndt)
CART DT                    71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes        71.7 ± 6.8
Naive Bayes                69.3 ± 10.0 (Zarndt)
Other decision trees       < 70.0

35 Breast cancer diagnosis. Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg. 699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. Task: distinguish benign from malignant cases.

36 Breast cancer rules. Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg. Simplest rule from MLP2LN, large regularization: IF uniformity of cell size < 3 THEN benign ELSE malignant. Sensitivity = 0.97, specificity = 0.85. More complex NN solutions, from 10xCV estimate: sensitivity = 0.98, specificity = 0.94.

37 Breast cancer comparison.
Method                       10xCV accuracy
k-NN, k=3, Manhattan         97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP + backprop               96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5-95.6

38 Melanoma skin cancer. Collected in the Outpatient Center of Dermatology in Rzeszów, Poland. Four types of melanoma: benign, blue, suspicious, or malignant. 250 cases, with almost equal class distribution. Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5). TDS (Total Dermatoscopy Score): a single index. Goal: a hardware scanner for preliminary diagnosis.

39 Melanoma results.
Method                       Rules  Training %   Test %
MLP2LN, crisp rules          4      98.0 (all)   100
SSV Tree, crisp rules        4      97.5 ± 0.3   100
FSM, rectangular f.          7      95.5 ± 1.0   100
kNN + prototype selection    13     97.5 ± 0.0   100
FSM, Gaussian f.             15     93.7 ± 1.0   95 ± 3.6
kNN k=1, Manh, 2 features    --     97.4 ± 0.3   100
LERS, rough rules            21     --           96.2

40 Antibiotic activity of pyrimidine compounds. Pyrimidines: which compound has stronger antibiotic activity? Common template, substitutions added at 3 positions: R3, R4 and R5. 27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect. Pairs of chemicals (54 features) are compared: which one has higher activity? 2788 cases, 5-fold cross-validation tests.

41 Antibiotic activity - results. Pyrimidines: which compound has stronger antibiotic activity? Mean Spearman's rank correlation coefficient r_s used.
Method                       Rank correlation
FSM, 41 Gaussian rules       0.77 ± 0.03
Golem (ILP)                  0.68
Linear regression            0.65
CART (decision tree)         0.50

42 Thyroid screening. Garavan Institute, Sydney, Australia. 15 binary and 6 continuous features (age, sex, clinical findings, TSH, T3, TT4, T4U, TBG ...). Training: 93 + 191 + 3488 cases; validation: 73 + 177 + 3178. Goals: determine important clinical factors and calculate the probability of each diagnosis (normal, hyperthyroid, hypothyroid).

43 Thyroid – some results. Accuracy of diagnoses obtained with different systems.
Method                       Rules/Features  Training %  Test %
MLP2LN optimized             4/6             99.9        99.36
CART/SSV Decision Trees      3/5             99.8        99.33
Best Backprop MLP            -/21            100         98.5
Naïve Bayes                  -/-             97.0        96.1
k-nearest neighbors          -/-             -           93.8

44 Psychometry. MMPI (Minnesota Multiphasic Personality Inventory) psychometric test. Printed forms are scanned, or a computerized version of the test is used. Raw data: 550 questions, e.g.: "I am getting tired quickly: Yes - Don't know - No". Results are combined into 10 clinical scales and 4 validity scales using fixed coefficients. Each scale measures tendencies towards hypochondria, schizophrenia, psychopathic deviations, depression, hysteria, paranoia, etc.

45 Scanned form

46 Computer input

47 Scales

48 Psychometry. There is no simple correlation between single values and the final diagnosis. Results are displayed in the form of a histogram, called a "psychogram". Interpretation depends on the experience and skill of an expert and takes into account correlations between peaks. Goal: an expert system providing evaluation and interpretation of MMPI tests at an expert level. Problem: experts agree only 70% of the time; alternative diagnoses and personality changes over time are important.

49 Psychogram

50 Psychometric data. 1600 cases for women, the same number for men. 27 classes: norm, psychopathic, schizophrenia, paranoia, neurosis, mania, simulation, alcoholism, drug addiction, criminal tendencies, abnormal behavior due to... Extraction of logical rules: 14 scales = features. Define linguistic variables and use FSM, MLP2LN, SSV - giving about 2-3 rules/class.

51 Psychometric data. 10-CV accuracy for FSM is 82-85%, for C4.5 79-84%. Input uncertainty ±Gx around 1.5% (best ROC) improves FSM results to 90-92%.
Method  Data  N. rules  Accuracy  +Gx %
C4.5    ♀     55        93.0      93.7
C4.5    ♂     61        92.5      93.1
FSM     ♀     69        95.4      97.6
FSM     ♂     98        95.9      96.9

52 Psychometric Expert. Probabilities for different classes; for greater uncertainties more classes are predicted. Fitting the rules to the conditions: typically 3-5 conditions per rule; Gaussian distributions around measured values that fall into the rule interval are shown in green. Verbal interpretation of each case, rule- and scale-dependent.

53 MMPI probabilities

54 MMPI rules

55 MMPI verbal comments

56 Visualization. Probability of classes versus input uncertainty. Detailed class probabilities around the measured values vs. change in a single scale; changes over time define the "patient's trajectory". Interactive multidimensional scaling: zooming in on the new case to inspect its similarity to other cases.

57 Class probability/uncertainty

58 Class probability/feature

59 MDS visualization

60 Conclusions. Data understanding is a challenging problem. Classification rules are frequently only the first step and may not be the best solution. Visualization is always helpful. P-rules may be competitive if complex decision borders are required, providing different types of rules. Understanding of complex objects is possible, although difficult, using adaptive costs and distance as the least expensive transformation (action principles in physics). Why am I saying all this? Because we have hopes for great applications!

61 Challenges. Discovery of theories rather than data models. Integration with image/signal analysis. Integration with reasoning in complex domains. Combining expert systems with neural networks... Fully automatic universal data analysis systems: press the button and wait for the truth... We are slowly getting there: more and more computational intelligence tools (including our own) are available.

