What is data mining? Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland ISEP Porto,

1 What is data mining? Włodzisław Duch Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland ISEP Porto, 8-12 July 2002

2 What is it about? Data used to be precious! Now it is overwhelming... In many areas of science, business and commerce people are drowning in data. Ex: astronomy super-telescope – data mining in existing databases. Database technology allows us to store and retrieve large amounts of data of any kind. There is knowledge hidden in data. Data analysis requires intelligence.

3 Ancient history 1960: first databases, collections of data. 1970: RDBMS, relational data model most popular today, large centralized systems. 1980: application-oriented data models, specialized for scientific, geographic, engineering data, time series, text, object-oriented models, distributed databases. 1990: multimedia and Web databases, data warehousing (subject-oriented DB for decision support), and on-line analytical processing (OLAP), deduction and verification of hypothetical patterns. Data mining: first conference in 1989, book 1996, discover something useful!

4 Data Mining History 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro and W. Frawley 1991). Workshops on KDD. 1996 Advances in Knowledge Discovery and Data Mining (Fayyad et al.). International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98). 1997 Journal of Data Mining and Knowledge Discovery. 1998 ACM SIGKDD, SIGKDD’ conferences, and SIGKDD Explorations. Many conferences on data mining: PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

5 References, papers KDD WWW Resources: ResearchIndex: AI & ML aspects NN & Statistics Comparison of results on many datasets:

6 Data Mining and statistics Statisticians deal with data: what’s new in DM? Many DM methods have roots in statistics. Statistics used to deal with small, controlled experiments, while DM deals with large, messy collections of data. Statistics is based on analytical probabilistic models, DM is based on algorithms that find patterns in data. Many DM algorithms came from other sources and slowly get some statistical justification. Key factor for DM is the computer cost/performance. Sometimes DM is more art than science …

7 Types of Data Statistical data – clean, numerical, controlled experiments, vector space model. Relational data – marketing, finances. Textual data – Web, NLP, search. Complex structures – chemistry, economics. Sequence data – bioinformatics. Multimedia data – images, video. Signals – dynamic data, biosignals. AI data – logical problems, games, behavior …

8 What is DM? Discovering interesting patterns, finding useful summaries of large databases. DM is more than database technology and On-Line Analytical Processing (OLAP) tools. DM is more than statistical analysis, although it includes classification, association, clustering, outlier and trend analysis, decision rules, prototype cases, multidimensional visualization etc. Understanding of data has not been an explicit goal of statistics, which focuses on predictive data models.

9 DM applications Many applications, but spectacular new knowledge is rarely discovered. Some examples: – “Diapers and beer” correlation: place them close together and put potato chips in between. – Mining astronomical catalogs (Skycat, Sloan Sky survey): a new subtype of stars has been discovered! – Bioinformatics: more precise characterization of some diseases, many discoveries to be made? – Credit card fraud detection (HNC company). – Discounts of air/hotel for frequent travelers.

10 Important issues in data mining. Use of statistical and CI methods for KDD. What makes an interesting pattern? Handling uncertainty in the data. Handling noise, outliers and missing or unknown data. Finding linguistic variables, discretization of continuous data, presentation and evaluation of knowledge. Knowledge representation for structural data, heterogeneous information, textual databases & NLP. Performance, scalability, distributed data, incremental or “on-line” processing. Best form of explanation depends on the application.

11 DM dangers If too many conclusions are drawn, some inferences will appear true purely by chance because the data samples are too small (Bonferroni’s theorem). Example 1: J. B. Rhine (Duke Univ.) ESP tests. 1 person in 1000 correctly guessed the color (red or black) of 10 cards: is this evidence for ESP? Retesting of these people gave average results. Rhine’s conclusion: telling people that they have ESP interferes with their ability … Example 2: using m letters to form a random sequence of length N, all possible subsequences of length log_m N are found => Bible code!
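
The arithmetic behind Example 1 can be checked with a short sketch. It assumes each card color is an independent fair guess (probability 1/2); the function names are illustrative only and not part of the original slides.

```python
def expected_perfect_guessers(people: int, cards: int) -> float:
    """Expected number of people who guess every card color right by pure chance."""
    p_all_correct = 0.5 ** cards  # each guess is an independent fair coin flip
    return people * p_all_correct


def prob_at_least_one(people: int, cards: int) -> float:
    """Probability that at least one person in the group guesses all cards right."""
    p_all_correct = 0.5 ** cards
    return 1.0 - (1.0 - p_all_correct) ** people


if __name__ == "__main__":
    print(expected_perfect_guessers(1000, 10))
    print(prob_at_least_one(1000, 10))
```

With 1000 people and 10 cards the expected number of perfect guessers is close to one, so a single "success" is what chance alone predicts.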

12 Knowledge discovery in databases (KDD): a search process for understandable and useful patterns in data. [Figure: KDD process. Boxes: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation. Labels: “Data Mining process”, “most effort”.]

13 Stages of DM process Data gathering, data warehousing, Web crawling. Preparation of the data: cleaning, removing outliers and impossible values, removing wrong records, finding missing data. Exploratory data analysis: visualization of different aspects of data. Finding relevant features for questions that are asked, preparing data structures for predictive methods, converting symbolic values to numerical representation. Pattern extraction, discovery, rules, prototypes. Evaluation of knowledge gained, finding useful patterns, consultation with experts.

14 Multidimensional Data Cuboids Data warehouses use a multidimensional data model. Projections (views) of data on different dimensions (attributes) form “data cuboids”. In DB warehousing literature: base cuboid: original data, N-dim.; apex cuboid: 0-D cuboid, highest-level summary; data cube: lattice of cuboids. Ex: sales data cube, viewed in multiple dimensions. – Dimension tables, ex. item (item_name, brand, type), or time (day, week, month, quarter, year). – Fact tables: measures (such as cost) and keys to each of the related dimension tables.

15 Data Cube: A Lattice of Cuboids 0-D (apex) cuboid: none. 1-D cuboids: time; item; location; supplier. 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier. 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier. 4-D (base) cuboid: time, item, location, supplier.
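
The lattice for the four dimensions time, item, location and supplier can be enumerated mechanically: every subset of the dimensions is one cuboid. The following is a minimal sketch; the function name `cuboid_lattice` is an illustrative assumption.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by size: level k holds the k-D cuboids."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(["time", "item", "location", "supplier"])

if __name__ == "__main__":
    for k, cuboids in lattice.items():
        names = [", ".join(c) if c else "none" for c in cuboids]
        print(f"{k}-D cuboids ({len(cuboids)}): {names}")
```

For N dimensions (ignoring concept hierarchies) the lattice contains 2^N cuboids, which is why materializing a full data cube quickly becomes expensive.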

16 Forms of useful knowledge But... knowledge accessible to humans is in: symbols, similarity to prototypes, images, visual representations. What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields. AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.

17 Forms of knowledge Types of explanation: exemplar-based: prototypes and similarity; logic-based: symbols and rules; visualization-based: exploratory data analysis, maps, diagrams, relations... Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do. Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do. Logical rules are the highest form of summarization of knowledge.

18 Computational Intelligence [Diagram: Computational Intelligence (Data => Knowledge) and Artificial Intelligence, with related fields: expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, neural networks, soft computing.]

19 CI methods for data mining Provide non-parametric (“universal”), predictive models of data. Classify new data to pre-defined categories, supporting diagnosis & prognosis. Discover new categories, clusters, patterns. Discover interesting associations, correlations. Allow understanding of the data, creating fuzzy or crisp logical rules, or prototypes. Help to visualize multi-dimensional relationships among data samples.

20 Association rules Classification rules: X => C(X). Association rules: looking for correlation between components of X, i.e. probability $p(X_i | X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. “Market basket” problem: many items selected from an available pool to a basket; what are the correlations? Only frequent items are interesting: itemsets with high support, i.e. appearing together in many baskets. Search for rules above support threshold > 1%.
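
Support of an itemset is simply the fraction of baskets that contain all of its items. The sketch below counts support by brute force for itemsets up to a given size; the toy baskets and the 40% threshold are made up for illustration (the slide suggests thresholds around 1% for real data), and practical miners such as Apriori prune the search instead of enumerating everything.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support exceeds min_support."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates, fix a canonical order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(baskets)
    return {s: c / n for s, c in counts.items() if c / n > min_support}


# Hypothetical toy data, not from the presentation.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips"],
]
supports = frequent_itemsets(baskets, min_support=0.4)

if __name__ == "__main__":
    for itemset, support in sorted(supports.items()):
        print(itemset, support)
```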

21 Association rules - related Related problems to market basket: correlation between documents – high for plagiarism; phrases in documents – high for semantically related documents. Causal relations matter, although they may be difficult to determine: lower the price of diapers, keep the beer price high, or try the reverse – what will happen? More general approach: Bayesian belief networks, causal networks, graphical models.

22 Clustering Given points in multidimensional space, divide them into groups that are “similar”. Ex: if an epidemic breaks out, look for the location of cases on the map (cholera in London). Documents in the space of words cluster according to their topics. How to measure similarity? Hierarchical approaches: start from single cases, join them forming clusters; ex: dendrogram. Centroid approaches: assume a few centers and adapt their position; ex: k-means, LVQ, SOM.
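
A centroid approach can be sketched in a few lines. This is a bare-bones k-means under simplifying assumptions: squared Euclidean distance as the similarity measure, the first k points as initial centers, and a fixed number of iterations; the sample points are invented.

```python
def kmeans(points, k, iterations=20):
    """Minimal k-means: assign each point to the nearest center, then move centers."""
    centers = [list(p) for p in points[:k]]  # naive initialization (an assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, clusters


# Two well-separated hypothetical groups of 2D points.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)

if __name__ == "__main__":
    print(centers)
    print(clusters)
```

The result depends on the initial centers and on the distance function, which is one reason "how to measure similarity?" matters in practice.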

23 Neural networks Inspired by neurobiology: simple elements cooperate changing internal parameters. Large field, dozens of different models, over 500 papers on NN in medicine each year. Supervised networks: heteroassociative mapping X=>Y, symptoms => diseases, universal approximators. Unsupervised networks: clusterization, competitive learning, autoassociation. Reinforcement learning: modeling behavior, playing games, sequential data.

24 Supervised learning Compare the desired with the achieved outputs … you can’t always get what you want. Examples: MLP/RBF NN, kNN, SVM, LDA, DT …

25 Unsupervised learning Find interesting structures in data. SOM, many variants.

26 Reinforcement learning Reward comes after the sequence of actions. Games, survival behavior, planning sequences of actions.

27 Unsupervised NN example Clustering and visualization of the quality of life index (UN data) by SOM map. Poor classification, inaccurate visualization.

28 Real and artificial neurons [Figure: a biological neuron (synapses, axon, dendrites) next to an artificial network (nodes – artificial neurons, synapses as weights, signals).]

29 Neural network for MI diagnosis [Figure: network estimating Myocardial Infarction ~ p(MI|X). Inputs: Sex, Age, Smoking, Pain Intensity, Pain Duration, ECG: ST Elevation. Labels: input weights, output weights.]

30 MI network function Training: setting the values of weights and thresholds; efficient algorithms exist. Effect: non-linear regression function. Such networks are universal approximators: they may learn any mapping X => Y.
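
The non-linear regression function computed by such a network can be sketched as a forward pass through one hidden layer. The logistic transfer function, the layer sizes and all numeric values below are assumptions for illustration; the actual formula on the slide was an image and is not reproduced here.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic transfer function, squashing any activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def mlp_output(x, hidden_weights, hidden_thresholds, output_weights, output_threshold):
    """One-hidden-layer MLP: weighted sums minus thresholds, passed through sigmoids."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) - t)
        for ws, t in zip(hidden_weights, hidden_thresholds)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) - output_threshold)


if __name__ == "__main__":
    # Hypothetical, untrained parameters; training would adjust weights and thresholds.
    x = [1.0, 0.65, 1.0]  # made-up encoded inputs
    p = mlp_output(x, [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7]], [0.1, -0.2], [1.5, -0.8], 0.3)
    print(p)
```

Because the output lies in (0, 1), it can be read as an estimate of p(MI|X) once the network is trained on labeled cases.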

31 Knowledge from networks Simplify networks: force most weights to 0, quantize remaining parameters, be constructive! Regularization: mathematical technique improving predictive abilities of the network. Result: MLP2LN neural networks that are equivalent to logical rules.

32 MLP2LN Converts MLP neural networks into a network performing logical operations (LN). [Figure: network layers: input layer; linguistic units: windows, filters; aggregation: better features; rule units: threshold logic; output: one node per class.]

33 Learning dynamics Decision regions shown every 200 training epochs in $(x_3, x_4)$ coordinates; borders are optimally placed with wide margins.

34 Neurofuzzy systems Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of the probability density function (PDF) using a single hidden layer network (RBF-like) with nodes realizing separable functions. Fuzzy: crisp membership of x (no/yes) is replaced by a degree of membership. Membership functions (MF): triangular, trapezoidal, Gaussian... MFs in many dimensions:

35 GhostMiner Philosophy GhostMiner, data mining tools from our lab. There is no free lunch – provide different types of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, committees. Provide tools for visualization of data. Support the process of knowledge discovery/model building and evaluating, organizing it into projects. Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.

36 Heterogeneous systems Homogeneous systems: one type of “building blocks”, same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs … Committees combine many models together, but lead to complex models that are difficult to understand. Discovering the simplest class structures, with the right inductive bias, requires heterogeneous adaptive systems (HAS). Ockham’s razor: simpler systems are better. HAS examples: NN with many types of neuron transfer functions; k-NN with different distance functions; DT with different types of test criteria.

37 Wine data example Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of wine sample. 13 quantities measured, continuous features: alcohol content; ash content; magnesium content; flavanoids content; proanthocyanins phenols content; OD280/D315 of diluted wines; malic acid content; alkalinity of ash; total phenols content; nonanthocyanins phenols content; color intensity; hue; proline.

38 Exploration and visualization General info about the data

39 Exploration: data Inspect the data

40 Exploration: data statistics Distribution of feature values Proline has very large values, the data should be standardized before further processing.

41 Exploration: data standardized Standardized data: unit standard deviation, about 2/3 of all data should fall within [mean-std,mean+std] Other options: normalize to fit in [-1,+1], or normalize rejecting some extreme values.
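
Standardization of a single feature can be sketched as follows: subtract the mean and divide by the standard deviation, so the result has zero mean and unit standard deviation. The use of the population standard deviation and the sample values are assumptions made for the example.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a feature to zero mean and unit (population) standard deviation."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("constant feature cannot be standardized")
    return [(v - m) / s for v in values]


# Hypothetical feature values, chosen so that mean = 5 and std = 2.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])

if __name__ == "__main__":
    print(z)
```

After this step a feature with very large raw values, such as proline, no longer dominates distance-based methods.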

42 Exploration: 1D histograms Distribution of feature values in classes Some features are more useful than the others.

43 Exploration: 1D/3D histograms Distribution of feature values in classes, 3D

44 Exploration: 2D projections Projections (cuboids) on selected 2D subspaces.

45 Visualize data Relations in more than 3D are hard to imagine. SOM mappings: popular for visualization, but rather inaccurate, no measure of distortions. Measure of topographical distortions: map all $X_i$ points from $R^n$ to $x_i$ points in $R^m$, $m < n$, and ask: how well are the distances $R_{ij} = D(X_i, X_j)$ reproduced by the distances $r_{ij} = d(x_i, x_j)$? Use m = 2 for visualization, use higher m for dimensionality reduction.

46 Visualize data: MDS Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994) … Minimize measure of topographical distortions moving the x coordinates.
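
One way to quantify topographical distortion is a normalized stress: the sum of squared differences between original and mapped pairwise distances, divided by the sum of squared original distances. This particular normalization and the use of Euclidean distance for both D and d are assumptions for the sketch; the slides do not show which stress variant was used.

```python
from itertools import combinations
from math import dist


def stress(high_dim_points, low_dim_points):
    """Normalized squared mismatch between original and mapped pairwise distances."""
    pairs = list(combinations(range(len(high_dim_points)), 2))
    mismatch = sum(
        (dist(high_dim_points[i], high_dim_points[j])
         - dist(low_dim_points[i], low_dim_points[j])) ** 2
        for i, j in pairs
    )
    scale = sum(dist(high_dim_points[i], high_dim_points[j]) ** 2 for i, j in pairs)
    return mismatch / scale


# Hypothetical 3D points that happen to lie in a plane, and two candidate 2D maps.
X = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
perfect_map = [(0, 0), (1, 0), (0, 1)]
collapsed_map = [(0, 0), (0, 0), (0, 0)]

if __name__ == "__main__":
    print(stress(X, perfect_map), stress(X, collapsed_map))
```

MDS then moves the low-dimensional coordinates (for example by gradient descent) to make this number as small as possible.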

47 Visualize data: Wine The green outlier can be identified easily. 3 clusters are clearly distinguished, 2D is fine.

48 Decision trees Test single attribute, find good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program, many good algorithms. 4 attributes used, 10 errors, 168 correct, 94.4% correct. Simplest things first: use decision tree to find logical rules.

49 Decision borders Multivariate trees: test on combinations of attributes, hyperplanes. Result: feature space is divided into cuboids. Wine data: univariate decision tree borders for proline and flavanoids Univariate trees: test the value of a single attribute x < a.

50 Logical rules Crisp logic rules: for continuous x use linguistic variables (predicate functions). $s_k(x) \equiv \mathrm{True}[X_k \le x \le X'_k]$, for example: small(x) = True{x | x < 1}, medium(x) = True{x | x ∈ [1,2]}, large(x) = True{x | x > 2}. Linguistic variables are used in crisp (propositional, Boolean) logic rules: IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
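
Linguistic variables are just Boolean predicates over intervals, and a crisp rule is a conjunction of them. A minimal sketch follows; the dictionary keys `height`, `has_hat` and `has_beard` are a hypothetical encoding chosen for the example.

```python
def small(x: float) -> bool:
    return x < 1


def medium(x: float) -> bool:
    return 1 <= x <= 2


def large(x: float) -> bool:
    return x > 2


def is_brownie(creature: dict) -> bool:
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN X is a Brownie."""
    return small(creature["height"]) and creature["has_hat"] and creature["has_beard"]


if __name__ == "__main__":
    print(is_brownie({"height": 0.4, "has_hat": True, "has_beard": True}))
    print(is_brownie({"height": 1.8, "has_hat": True, "has_beard": True}))
```

Every continuous value falls into exactly one of the three intervals, so the predicates partition the feature axis with sharp borders.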

51 Crisp logic decisions Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders. Severe limitation on the expressive power of crisp logical rules!

52 Logical rules - advantages Logical rules, if simple enough, are preferable. Rules may expose limitations of black box solutions. Only relevant features are used in rules. Rules may sometimes be more accurate than NN and other CI methods. Overfitting is easy to control, rules usually have a small number of parameters. Rules forever!? A logical rule about logical rules is: IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.

53 Logical rules - limitations Logical rules are preferred but... Only one class is predicted, $p(C_i|X,M)$ = 0 or 1; a black-and-white picture may be inappropriate in many applications. Discontinuous cost functions allow only non-gradient optimization. Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules. Reliable crisp rules may reject some cases as unclassified. Interpretation of crisp rules may be misleading. Fuzzy rules are not so comprehensible.

54 Rules - choices Simplicity vs. accuracy. Confidence vs. rejection rate. Accuracy (overall): $A(M) = p_{++} + p_{--}$. Error rate: $L(M) = p_{+-} + p_{-+}$. Rejection rate: $R(M) = p_{+r} + p_{-r} = 1 - L(M) - A(M)$. Sensitivity: $S_+(M) = p_{+|+} = p_{++}/p_+$. Specificity: $S_-(M) = p_{-|-} = p_{--}/p_-$. $p_{++}$ is a hit; $p_{-+}$ a false alarm; $p_{+-}$ a miss.

55 Rules – error functions The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities: $A(M) = p_+ S_+(M) + p_- S_-(M)$. Optimization of rules for the $C_+$ class; large $\gamma$ means no errors but high rejection rate: $E(M;\gamma) = \gamma L(M) - A(M) = \gamma (p_{+-} + p_{-+}) - (p_{++} + p_{--})$, and $\min_M E(M;\gamma) \Leftrightarrow \min_M \{(1+\gamma) L(M) + R(M)\}$. Optimization with different costs of errors: $\min_M E(M;\alpha) = \min_M \{p_{+-} + \alpha p_{-+}\} = \min_M \{p_+ (1 - S_+(M)) - p_{+r}(M) + \alpha [p_- (1 - S_-(M)) - p_{-r}(M)]\}$. ROC (Receiver Operating Characteristic) curve: $p_{++}(p_{-+})$, hit rate as a function of false alarm rate.
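
The quantities on the last two slides can be computed from six joint probabilities. The sketch assumes the index convention "first index = true class, second index = decision, r = rejected", which is a reading of the damaged notation rather than something the transcript states outright; argument names and the numbers in the example are made up.

```python
def rule_quality(p_pp, p_pm, p_pr, p_mp, p_mm, p_mr):
    """Quality measures of a rule set M from joint probabilities.

    p_pp: true +, predicted + (hit)       p_pm: true +, predicted - (miss)
    p_mp: true -, predicted + (false alarm)  p_mm: true -, predicted -
    p_pr, p_mr: true + / true - cases rejected as unclassified
    """
    p_plus = p_pp + p_pm + p_pr   # a priori probability of class +
    p_minus = p_mp + p_mm + p_mr  # a priori probability of class -
    return {
        "accuracy": p_pp + p_mm,
        "error": p_pm + p_mp,
        "rejection": p_pr + p_mr,
        "sensitivity": p_pp / p_plus,
        "specificity": p_mm / p_minus,
        "p_plus": p_plus,
        "p_minus": p_minus,
    }


# Hypothetical probabilities summing to 1.
q = rule_quality(p_pp=0.40, p_pm=0.05, p_pr=0.05, p_mp=0.10, p_mm=0.35, p_mr=0.05)

if __name__ == "__main__":
    print(q)
```

The example also confirms the identity A(M) = p_+ S_+(M) + p_- S_-(M): both sides evaluate to the same accuracy for these numbers.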

56 Wine example – SSV rules Decision trees provide rules of different complexity. Simplest tree: 5 nodes, corresponding to 3 rules; 25 errors, mostly Class2/3 wines mixed.

57 Wine – SSV 5 rules Lower pruning leads to more complex tree. 7 nodes, corresponding to 5 rules; 10 errors, mostly Class2/3 wines mixed.

58 Wine – SSV optimal rules Various solutions may be found, depending on the search: 5 rules with 12 premises, making 6 errors, 6 rules with 16 premises and 3 errors, 8 rules, 25 premises, and 1 error. if OD280/D315 >  proline >  color > then class 1 if OD280/D315 >  proline >  color < then class 2 if OD280/D  malic-acid < 2.82 then class 2 if OD280/D315 >  proline < then class 2 if OD280/D315 <  hue < then class 3 if OD280/D  malic-acid > 2.82 then class 3 What is the optimal complexity of rules? Use crossvalidation to estimate generalization.

59 Wine – FSM rules Complexity of rules depends on desired accuracy. Use rectangular functions for crisp rules. Optimal accuracy may be evaluated using crossvalidation. FSM discovers simpler rules, for example: if proline > then class 1 (48 cases, 45 correct, 2 recovered by other rules). if color < then class 2 (63 cases, 60 correct) SSV: hierarchical rules FSM: density estimation with feature selection.

60 Examples of interesting knowledge discovered! The most famous example of knowledge discovered by data mining: correlation between beer, milk and diapers. Other examples: 2 subtypes of galactic spectra forced astrophysicists to reconsider star evolutionary processes. Several examples of knowledge found by us in medical and other datasets follow.

61 Mushrooms The Mushroom Guide: no simple rule for mushrooms; no rule like ‘leaflets three, let it be’ for Poisonous Oak and Ivy. Of the cases, 51.8% are edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, i.e. 2^118 possible input vectors. Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy. Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow. Safe rule for edible mushrooms: odor = (almond ∨ anise ∨ none) ∧ spore-print-color = ¬green; 48 errors, 99.41% correct. This is why animals have such a good sense of smell! What does it tell us about odor receptors?

62 Mushrooms rules To eat or not to eat, this is the question! Not any more... A mushroom is poisonous if: R1) odor = ¬(almond ∨ anise ∨ none); 120 errors, 98.52%. R2) spore-print-color = green; 48 errors, 99.41%. R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring = ¬brown; 8 errors, 99.90%. R4) habitat = leaves ∧ cap-color = white; no errors! R1 + R2 are quite stable, found even with 10% of data; R3 and R4 may be replaced by other rules, ex: R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly); R'4) gill-size = narrow ∧ population = clustered. Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests - structure of this data is completely clear.
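
Rules R1 to R4 translate directly into a disjunction of four Boolean tests, reading the damaged symbols as negation and conjunction. In this sketch a mushroom is a dictionary keyed by attribute name with spelled-out values; that encoding is an assumption for illustration (the original dataset may use a different coding), and the sample mushrooms are invented.

```python
def is_poisonous(m: dict) -> bool:
    """A mushroom is flagged as poisonous if any of the rules R1..R4 fires."""
    r1 = m["odor"] not in {"almond", "anise", "none"}
    r2 = m["spore-print-color"] == "green"
    r3 = (
        m["odor"] == "none"
        and m["stalk-surface-below-ring"] == "scaly"
        and m["stalk-color-above-ring"] != "brown"
    )
    r4 = m["habitat"] == "leaves" and m["cap-color"] == "white"
    return r1 or r2 or r3 or r4


# Hypothetical examples, not records from the dataset.
harmless = {
    "odor": "almond", "spore-print-color": "brown",
    "stalk-surface-below-ring": "smooth", "stalk-color-above-ring": "white",
    "habitat": "woods", "cap-color": "yellow",
}
foul = dict(harmless, odor="foul")
green_spores = dict(harmless, **{"spore-print-color": "green"})

if __name__ == "__main__":
    print(is_poisonous(harmless), is_poisonous(foul), is_poisonous(green_spores))
```

Dropping R3 and R4 from the disjunction gives the simpler two-rule classifier, which by the counts on the slide trades 48 errors for fewer attributes.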

63 Recurrence of breast cancer Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. 286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%). Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes. 9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.

64 Rules for breast cancer Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. Many systems used, 65-78% accuracy reported. Single rule: IF (nodes-involved ∉ [0,2] ∧ degree-malignant = 3) THEN recurrence, ELSE no-recurrence. 76.2% accuracy, only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.

65 Recurrence - comparison.
Method                      10xCV accuracy
MLP2LN, 1 rule              76.2
SSV DT, stable rules        75.7 ± 1.0
k-NN, k=10, Canberra        74.1 ± 1.2
MLP+backprop                     ± 9.4 (Zarndt)
CART DT                     71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes         71.7 ± 6.8
Naive Bayes                 69.3 ± 10.0 (Zarndt)
Other decision trees        < 70.0

66 Breast cancer diagnosis. Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg. 699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. Task: distinguish benign from malignant cases.

67 Breast cancer rules. Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg. Simplest rule from MLP2LN, large regularization: IF uniformity of cell size ≤ 3 THEN benign ELSE malignant. Sensitivity = 0.97, Specificity = 0.85. More complex NN solutions, from 10CV estimate: Sensitivity = 0.98, Specificity = 0.94.

68 Breast cancer comparison.
Method                      10xCV accuracy
k-NN, k=3, Manh             97.0 ± 2.1 (GM)
FSM, neurofuzzy             96.9 ± 1.4 (GM)
Fisher LDA                  96.8
MLP+backprop                     (Ster, Dobnikar)
LVQ                         96.6 (Ster, Dobnikar)
IncNet (neural)             96.4 ± 2.1 (GM)
Naive Bayes                 96.4
SSV DT, 3 crisp rules       96.0 ± 2.9 (GM)
LDA (linear discriminant)   96.0
Various decision trees

69 Melanoma skin cancer Collected in the Outpatient Center of Dermatology in Rzeszów, Poland. Four types of Melanoma: benign, blue, suspicious, or malignant. 250 cases, with almost equal class distribution. Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5). TDS (Total Dermatoscopy Score): single index. Goal: hardware scanner for preliminary diagnosis.

70 Melanoma rules R1: IF TDS ≤ 4.85 AND C-BLUE IS absent THEN MELANOMA IS Benign-nevus. R2: IF TDS ≤ 4.85 AND C-BLUE IS present THEN MELANOMA IS Blue-nevus. R3: IF TDS > 5.45 THEN MELANOMA IS Malignant. R4: IF TDS > 4.85 AND TDS < 5.45 THEN MELANOMA IS Suspicious. 5 errors (98.0%) on the training set, 0 errors (100%) on the test set. Feature aggregation is important! Without TDS 15 rules are needed.
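
The four rules form a tiny classifier over two inputs. The sketch below encodes them literally; the function name and the Boolean argument `c_blue_present` are illustrative choices. Note that, as written, the rules leave TDS exactly equal to 5.45 unclassified, which the code makes explicit by returning `None`.

```python
from typing import Optional


def classify_melanoma(tds: float, c_blue_present: bool) -> Optional[str]:
    """Apply rules R1..R4 literally; return None if no rule covers the case."""
    if tds <= 4.85:
        # R1 / R2: low TDS, decided by the presence of the blue color
        return "Blue-nevus" if c_blue_present else "Benign-nevus"
    if tds > 5.45:
        return "Malignant"   # R3
    if 4.85 < tds < 5.45:
        return "Suspicious"  # R4
    return None              # tds == 5.45 is not covered by R1..R4 as stated


if __name__ == "__main__":
    for tds, blue in [(4.0, False), (4.0, True), (5.0, False), (6.2, True)]:
        print(tds, blue, classify_melanoma(tds, blue))
```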

71 Melanoma results Columns: Method; Rules; Training %; Test %. MLP2LN, crisp rules: all, 100. SSV Tree, crisp rules: 4, 97.5±. FSM, rectangular f.: ±. knn + prototype selection: ±. FSM, Gaussian f.: ±1.0, 95±3.6. knn k=1, Manh, 2 features: ±. LERS, rough rules.

72 Summary Data mining is a large field; only a few issues have been mentioned here. DM involves many steps; here only those related to pattern recognition were stressed, but in practice scalability and efficiency issues may be most important. Neural networks are still used mostly for building predictive data models, but they may also provide a simplified description in the form of rules. Rules are not the only form of data understanding. Rules may be a beginning for a practical application. Some interesting knowledge has been discovered.

73 Challenges Discovery of theories rather than data models. Integration with image/signal analysis. Integration with reasoning in complex domains. Combining expert systems with neural networks. Fully automatic universal data analysis systems: press the button and wait for the truth … We are slowly getting there. More & more computational intelligence tools (including our own) are available.

74 Disclaimer A few slides/figures were taken from various presentations found on the Internet; unfortunately I cannot identify the original authors at the moment, since these slides went through different iterations. I have to apologize for that.

