Presentation is loading. Please wait.

Presentation is loading. Please wait.

Knowledge discovery & data mining Classification & fraud detection Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa

Similar presentations


Presentation on theme: "Knowledge discovery & data mining Classification & fraud detection Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa"— Presentation transcript:

1 Knowledge discovery & data mining Classification & fraud detection Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa A EDBT2000

2 Konstanz, EDBT2000 tutorial - Class 2 Module outline zThe classification task zMain classification techniques yBayesian classifiers yDecision trees yHints to other methods zApplication to a case-study in fiscal fraud detection: audit planning

3 Konstanz, EDBT2000 tutorial - Class 3 The classification task zInput: a training set of tuples, each labelled with one class label zOutput: a model (classifier) which assigns a class label to each tuple based on the other attributes. zThe model can be used to predict the class of new tuples, for which the class label is missing or unknown zSome natural applications ycredit approval ymedical diagnosis ytreatment effectiveness analysis

4 Konstanz, EDBT2000 tutorial - Class 4 Basic Framework for Inductive Learning Inductive Learning System Environment Training Examples Testing Examples Induced Model of Classifier Output Classification (x, f(x)) (x, h(x)) h(x) = f(x)? A problem of representation and search for the best hypothesis, h(x). ~ Classification systems and inductive learning

5 Konstanz, EDBT2000 tutorial - Class 5 Train & test zThe tuples (observations, samples) are partitioned in training set + test set. zClassification is performed in two steps: 1.training - build the model from training set 2.test - check accuracy of the model using test set

6 Konstanz, EDBT2000 tutorial - Class 6 Train & test zKind of models yIF-THEN rules yOther logical formulae yDecision trees zAccuracy of models yThe known class of test samples is matched against the class predicted by the model. yAccuracy rate = % of test set samples correctly classified by the model.

7 Konstanz, EDBT2000 tutorial - Class 7 Training step Training Data Classification Algorithms IF age = OR income = high THEN credit = good Classifier (Model)

8 Konstanz, EDBT2000 tutorial - Class 8 Test step Test Data Classifier (Model)

9 Konstanz, EDBT2000 tutorial - Class 9 Prediction Unseen Data Classifier (Model)

10 Konstanz, EDBT2000 tutorial - Class 10 Machine learning terminology zClassification = supervised learning yuse training samples with known classes to classify new data zClustering = unsupervised learning ytraining samples have no class information yguess classes or clusters in the data

11 Konstanz, EDBT2000 tutorial - Class 11 Comparing classifiers zAccuracy zSpeed zRobustness yw.r.t. noise and missing values zScalability yefficiency in large databases zInterpretability of the model zSimplicity ydecision tree size yrule compactness zDomain-dependent quality indicators

12 Konstanz, EDBT2000 tutorial - Class 12 Classical example: play tennis? zTraining set from Quinlan’s book

13 Konstanz, EDBT2000 tutorial - Class 13 Module outline zThe classification task zMain classification techniques yBayesian classifiers yDecision trees yHints to other methods zApplication to a case-study in fraud detection: planning of fiscal audits

14 Konstanz, EDBT2000 tutorial - Class 14 Bayesian classification zThe classification problem may be formalized using a-posteriori probabilities: z P(C|X) = prob. that the sample tuple X= is of class C. zE.g. P(class=N | outlook=sunny,windy=true,…) zIdea: assign to sample X the class label C such that P(C|X) is maximal

15 Konstanz, EDBT2000 tutorial - Class 15 Estimating a-posteriori probabilities zBayes theorem: P(C|X) = P(X|C)·P(C) / P(X) zP(X) is constant for all classes zP(C) = relative freq of class C samples zC such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum zProblem: computing P(X|C) is unfeasible!

16 Konstanz, EDBT2000 tutorial - Class 16 Naïve Bayesian Classification zNaïve assumption: attribute independence P(x 1,…,x k |C) = P(x 1 |C)·…·P(x k |C) zIf i-th attribute is categorical: P(x i |C) is estimated as the relative freq of samples having value x i as i-th attribute in class C zIf i-th attribute is continuous: P(x i |C) is estimated thru a Gaussian density function zComputationally easy in both cases

17 Konstanz, EDBT2000 tutorial - Class 17 Play-tennis example: estimating P(x i |C) outlook P(sunny|p) = 2/9P(sunny|n) = 3/5 P(overcast|p) = 4/9P(overcast|n) = 0 P(rain|p) = 3/9P(rain|n) = 2/5 temperature P(hot|p) = 2/9P(hot|n) = 2/5 P(mild|p) = 4/9P(mild|n) = 2/5 P(cool|p) = 3/9P(cool|n) = 1/5 humidity P(high|p) = 3/9P(high|n) = 4/5 P(normal|p) = 6/9P(normal|n) = 2/5 windy P(true|p) = 3/9P(true|n) = 3/5 P(false|p) = 6/9P(false|n) = 2/5 P(p) = 9/14 P(n) = 5/14

18 Konstanz, EDBT2000 tutorial - Class 18 Play-tennis example: classifying X zAn unseen sample X = zP(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = zP(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = zSample X is classified in class n (don’t play)

19 Konstanz, EDBT2000 tutorial - Class 19 The independence hypothesis… z… makes computation possible z… yields optimal classifiers when satisfied z… but is seldom satisfied in practice, as attributes (variables) are often correlated. zAttempts to overcome this limitation: yBayesian networks, that combine Bayesian reasoning with causal relationships between attributes yDecision trees, that reason on one attribute at the time, considering most important attributes first

20 Konstanz, EDBT2000 tutorial - Class 20 Module outline zThe classification task zMain classification techniques yBayesian classifiers yDecision trees yHints to other methods zApplication to a case-study in fraud detection: planning of fiscal audits

21 Konstanz, EDBT2000 tutorial - Class 21 Decision trees zA tree where zinternal node = test on a single attribute zbranch = an outcome of the test zleaf node = class or class distribution A? B?C? D? Yes

22 Konstanz, EDBT2000 tutorial - Class 22 Classical example: play tennis? zTraining set from Quinlan’s book

23 Konstanz, EDBT2000 tutorial - Class 23 Decision tree obtained with ID3 (Quinlan 86) outlook overcast humiditywindy highnormal false true sunny rain NNPP P

24 Konstanz, EDBT2000 tutorial - Class 24 From decision trees to classification rules zOne rule is generated for each path in the tree from the root to a leaf zRules are generally simpler to understand than trees outlook overcast humiditywindy highnormal false true sunny rain N NPP P IF outlook=sunny AND humidity=normal THEN play tennis

25 Konstanz, EDBT2000 tutorial - Class 25 Decision tree induction zBasic algorithm ytop-down recursive ydivide & conquer ygreedy (may get trapped in local maxima) zMany variants: yfrom machine learning: ID3 (Iterative Dichotomizer), C4.5 (Quinlan 86, 93) yfrom statistics: CART (Classification and Regression Trees) (Breiman et al 84) yfrom pattern recognition: CHAID (Chi-squared Automated Interaction Detection) (Magidson 94) zMain difference: divide (split) criterion

26 Konstanz, EDBT2000 tutorial - Class 26 Generate_DT(samples, attribute_list) = 1)Create a new node N; 2)If samples are all of class C then label N with C and exit; 3)If attribute_list is empty then label N with majority_class(N) and exit; 4)Select best_split from attribute_list; 5)For each value v of attribute best_split: zLet S_v = set of samples with best_split=v ; zLet N_v = Generate_DT(S_v, attribute_list \ best_split) ; zCreate a branch from N to N_v labeled with the test best_split=v ;

27 Konstanz, EDBT2000 tutorial - Class 27 Criteria for finding the best split zInformation gain (ID3 – C4.5) yEntropy, an information theoretic concept, measures impurity of a split ySelect attribute that maximize entropy reduction zGini index (CART) yAnother measure of impurity of a split ySelect attribute that minimize impurity z  2 contingency table statistic (CHAID) yMeasures correlation between each attribute and the class label ySelect attribute with maximal correlation

28 Konstanz, EDBT2000 tutorial - Class 28 Information gain (ID3 – C4.5) E.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements. zAmount of information to decide if an arbitrary example belongs to Pos or Neg: zfp = p / (p+n)fn = n / (p+n) I(p,n) = - fp ·log 2 (fp) - fn ·log 2 (fn)

29 Konstanz, EDBT2000 tutorial - Class 29 Information gain (ID3 – C4.5) zEntropy = information needed to classify samples in a split according to an attribute zSplitting S with attribute A results in partition {S 1, S 2, …, S k } zp i (resp. n i ) = # elements in S i from Pos (resp. Neg) E(A) =  i  [1,k] I(p i,n i ) · (p i +n i ) / (p+n) gain(A) = I(p,n) - E(A) zSelect A which maximizes gain(A) zExtensible to continuous attributes

30 Konstanz, EDBT2000 tutorial - Class 30 Information gain - play tennis example outlook overcast humiditywindy highnormal false true sunny rain NNP P P zChoosing best split at root node: zgain(outlook) = zgain(temperature) = zgain(humidity) = zgain(windy) = zCriterion biased towards attributes with many values – corrections proposed (gain ratio)

31 Konstanz, EDBT2000 tutorial - Class 31 Gini index (CART) zE.g., two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements. zfp = p / (p+n)fn = n / (p+n) gini(S) = 1 – fp 2 - fn 2 zIf dataset S is split into S 1, S 2 then gini split (S 1, S 2 ) = gini( S 1 ) · (p 1 +n 1 )/(p+n) + gini( S 2 ) · (p 2 +n 2 )/(p+n)

32 Konstanz, EDBT2000 tutorial - Class 32 Gini index - play tennis example zTwo top best splits at root node: zSplit on outlook: zS 1 : {overcast} (4Pos, 0Neg) S 2 : {sunny, rain} zSplit on humidity: zS 1 : {normal} (6Pos, 1Neg) S 2 : {high} outlook rain, sunny P overcast …………… humidity high P normal …………… 86% 100%

33 Konstanz, EDBT2000 tutorial - Class 33 Entropy vs. Gini (on continuous attributes) zGini tends to isolate the largest class from all other classes zEntropy tends to find groups of classes that add up to 50% of the data Is age < 40 Is age < 65

34 Konstanz, EDBT2000 tutorial - Class 34 Other criteria in decision tree construction zBranching scheme: ybinary vs. k-ary splits ycategorical vs. continuous attributes zStop rule: how to decide that a node is a leaf: yall samples belong to same class yimpurity measure below a given threshold yno more attributes to split on yno samples in partition zLabeling rule: a leaf node is labeled with the class to which most samples at the node belong

35 Konstanz, EDBT2000 tutorial - Class 35 The overfitting problem zIdeal goal of classification: find the simplest decision tree that fits the data and generalizes to unseen data yintractable in general zA decision tree may become too complex if it overfits the training samples, due to ynoise and outliers, or ytoo little training data, or ylocal maxima in the greedy search zTwo heuristics to avoid overfitting: yStop earlier: Stop growing the tree earlier. yPost-prune: Allow overfit, and then simplify the tree.

36 Konstanz, EDBT2000 tutorial - Class 36 Stopping vs. pruning zStopping: Prevent the split on an attribute (predictor variable) if it is below a level of statistical significance - simply make it a leaf (CHAID) zPruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than the more complex tree (CART, C4.5) zIntegration of the two: PUBLIC (Rastogi and Shim 98) – estimate pruning conditions (lower bound to minimum cost subtrees) during construction, and use them to stop.

37 Konstanz, EDBT2000 tutorial - Class 37 If dataset is large Available Examples Training Set Test Set 70% 30% Used to develop one tree check accuracy Divide randomly Generalization = accuracy

38 Konstanz, EDBT2000 tutorial - Class 38 If data set is not so large zCross-validation Available Examples Training Set Test. Set 10% 90% Repeat 10 times Used to develop 10 different tree Tabulate accuracies Generalization = mean and stddev of accuracy

39 Konstanz, EDBT2000 tutorial - Class 39 Categorical vs. continuous attributes zInformation gain criterion may be adapted to continuous attributes using binary splits zGini index may be adapted to categorical. zTypically, discretization is not a pre- processing step, but is performed dynamically during the decision tree construction.

40 Konstanz, EDBT2000 tutorial - Class 40 Summarizing … tool  C4.5CARTCHAID arity of split binary and K-ary binaryK-ary split criterion information gain gini index 22 stop vs. prune prune stop type of attributes categorical +continuous categorical

41 Konstanz, EDBT2000 tutorial - Class 41 Scalability to large databases zWhat if the dataset does not fit main memory? zEarly approaches: yIncremental tree construction (Quinlan 86) yMerge of trees constructed on separate data partitions (Chan & Stolfo 93) yData reduction via sampling (Cattlet 91) zGoal: handle order of 1G samples and 1K attributes zSuccessful contributions from data mining research ySLIQ (Mehta et al. 96) ySPRINT(Shafer et al. 96) yPUBLIC(Rastogi & Shim 98) yRainForest(Gehrke et al. 98)

42 Konstanz, EDBT2000 tutorial - Class 42 Module outline zThe classification task zMain classification techniques yDecision trees yBayesian classifiers yHints to other methods zApplication to a case-study in fraud detection: planning of fiscal audits

43 Konstanz, EDBT2000 tutorial - Class 43 Backpropagation zIs a neural network algorithm, performing on multilayer feed-forward networks (Rumelhart et al. 86). zA network is a set of connected input/output units where each connection has an associated weight. zThe weights are adjusted during the training phase, in order to correctly predict the class label for samples.

44 Konstanz, EDBT2000 tutorial - Class 44 Backpropagation PROS zHigh accuracy zRobustness w.r.t. noise and outliers CONS zLong training time zNetwork topology to be chosen empirically zPoor interpretability of learned weights

45 Konstanz, EDBT2000 tutorial - Class 45 Prediction and (statistical) regression zRegression = construction of models of continuous attributes as functions of other attributes zThe constructed model can be used for prediction. zE.g., a model to predict the sales of a product given its price zMany problems solvable by linear regression, where attribute Y (response variable) is modeled as a linear function of other attribute(s) X (predictor variable(s)): Y = a + b·X zCoefficients a and b are computed from the samples using the least square method. f(x) x

46 Konstanz, EDBT2000 tutorial - Class 46 Other methods (not covered) zK-nearest neighbors algorithms zCase-based reasoning zGenetic algorithms zRough sets zFuzzy logic zAssociation-based classification (Liu et al 98)

47 Konstanz, EDBT2000 tutorial - Class 47 Module outline zThe classification task zMain classification techniques yDecision trees yBayesian classifiers yHints to other methods zApplication to a case-study in fraud detection: planning of fiscal audits

48 Konstanz, EDBT2000 tutorial - Class 48 Fraud detection and audit planning zA major task in fraud detection is constructing models of fraudulent behavior, for: ypreventing future frauds (on-line fraud detection) ydiscovering past frauds (a posteriori fraud detection) zFocus on a posteriori FD: analyze historical audit data to plan effective future audits zAudit planning is a key factor, e.g. in the fiscal and insurance domain: ytax evasion (from incorrect/fraudulent tax declarations) estimated in Italy between 3% and 10% of GNP

49 Konstanz, EDBT2000 tutorial - Class 49 Case study zConducted by our Pisa KDD Lab (Bonchi et al 99) zA data mining project at the Italian Ministry of Finance, with the aim of assessing: ythe potential of a KDD process oriented to planning audit strategies; ya methodology which supports this process; yan integrated logic-based environment which supports its development.

50 Konstanz, EDBT2000 tutorial - Class 50 Audit planning zNeed to face a trade-off between conflicting issues: ymaximize audit benefits: select subjects to be audited to maximize the recovery of evaded tax yminimize audit costs: select subjects to be audited to minimize the resources needed to carry out the audits. zIs there a KDD methodology which may be tuned according to these options? zHow extracted knowledge may be combined with domain knowledge to obtain useful audit models?

51 Konstanz, EDBT2000 tutorial - Class 51 Autofocus data mining policy options, business rules fine parameter tuning of mining tools

52 Konstanz, EDBT2000 tutorial - Class 52 Classification with decision trees zReference technique: yQuinlan’s C4.5, and its evolution C5.0 zAdvanced mechanisms used: ypruning factor ymisclassification weights yboosting factor

53 Konstanz, EDBT2000 tutorial - Class 53 Available data sources zDataset: tax declarations, concerning a targeted class of Italian companies, integrated with other sources: ysocial benefits to employees, official budget documents, electricity and telephone bills. zSize: 80 K tuples, 175 numeric attributes. zA subset of 4 K tuples corresponds to the audited companies: youtcome of audits recorded as the recovery attribute (= amount of evaded tax ascertained )

54 Konstanz, EDBT2000 tutorial - Class 54 Data preparation original dataset 81 K audit outcomes 4 K data consolidation data cleaning attribute selection

55 Konstanz, EDBT2000 tutorial - Class 55 Cost model zA derived attribute audit_cost is defined as a function of other attributes f audit_cost

56 Konstanz, EDBT2000 tutorial - Class 56 Cost model and the target variable zrecovery of an audit after the audit cost actual_recovery = recovery - audit_cost ztarget variable (class label) of our analysis is set as the Class of Actual Recovery (c.a.r.): negative if actual_recovery  0 zc.a.r. = positiveif actual_recovery > 0.

57 Konstanz, EDBT2000 tutorial - Class 57 Training set & test set zAim: build a binary classifier with c.a.r. as target variable, and evaluate it zDataset is partitioned into: ytraining set, to build the classifier ytest set, to evaluate it zRelative size: incremental samples approach zIn our case, the resulting classifiers improve with larger training sets. zAccuracy test with 10-fold cross-validation

58 Konstanz, EDBT2000 tutorial - Class 58 Quality assessment indicators zThe obtained classifiers are evaluated according to several indicators, or metrics zDomain-independent indicators yconfusion matrix ymisclassification rate zDomain-dependent indicators yaudit # yactual recovery yprofitability yrelevance

59 Konstanz, EDBT2000 tutorial - Class 59 Domain-independent quality indicators zconfusion matrix (of a given classifier) yTN (TP): true negative (positive) tuples yFN (FP): false negative (positive) tuples zmisclassification rate = # (FN  FP) / # test-set

60 Konstanz, EDBT2000 tutorial - Class 60 Domain-dependent quality indicators zaudit # (of a given classifier): number of tuples classified as positive = # (FP  TP) zactual recovery: total amount of actual recovery for all tuples classified as positive zprofitability: average actual recovery per audit zrelevance: ratio between profitability and misclassification rate

61 Konstanz, EDBT2000 tutorial - Class 61 The REAL case zClassifiers can be compared with the REAL case, consisting of the whole test-set: zaudit # (REAL) = 366  actual recovery(REAL) = M euro

62 Konstanz, EDBT2000 tutorial - Class 62 Controlling classifier construction zmaximize audit benefits: minimize FN zminimize audit costs: minimize FP zhard to get both! yunbalance tree construction towards eiher negatives or positives zwhich parameters may be tuned? ymisclassification weights, e.g., trade 1 FN for 10 FP yreplication of minority class yboosting and pruning level

63 Konstanz, EDBT2000 tutorial - Class 63 Model evaluation: classifier 1 (min FP) yno replication in training-set (unbalance towards negative) y10-trees adaptive boosting ymisc. rate = 22% yaudit # = 59 (11 FP) yactual rec.= Meuro yprofitability = 2.401

64 Konstanz, EDBT2000 tutorial - Class 64 Model evaluation: classifier 2 (min FN) yreplication in training-set (balanced neg/pos) ymisc. weights (trade 3 FP for 1 FN) y3-trees adaptive boosting ymisc. rate = 34% yaudit # = 188 (98 FP) yactual rec.= Meuro yprofitability = 0.878

65 Konstanz, EDBT2000 tutorial - Class 65 What have we achieved? zIdea of a KDD methodology tailored for a vertical application: audit planning ydefine an audit cost model ymonitor training- and test-set construction yassess the quality of a classifier ytune classifier construction to specific policies zIts formalization requires a flexible KDSE – knowledge discovery support environment, supporting: yintegration of deduction and induction yintegration of domain and induced knowledge yseparation of conceptual and implementation level

66 Konstanz, EDBT2000 tutorial - Class 66 References - classification zC. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, zF. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWak'99, First Int. Conf. on Data Warehousing and Knowledge Discovery, Sept zF. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, Aug zJ. Catlett. Megainduction: machine learning on very large databases. PhD Thesis, Univ. Sydney, zP. K. Chan and S. J. Stolfo. Metalearning for multistrategy and parallel learning. In Proc. 2nd Int. Conf. on Information and Knowledge Management, p , zJ. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, zJ. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, zL. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, zP. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. KDD'95, August zJ. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc Int. Conf. Very Large Data Bases, pages , New York, NY, August zB. Liu, W. Hsu and Y. Ma. Integrating classification and association rule mining. In Proc. KDD’98, New York, 1998.

67 Konstanz, EDBT2000 tutorial - Class 67 References - classification zJ. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages Blackwell Business, Cambridge Massechusetts, zM. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. In Proc Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March zS. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Diciplinary Survey. Data Mining and Knowledge Discovery 2(4): , 1998 zJ. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), , Portland, OR, Aug zR. Rastogi and K. Shim. Public: A decision tree classifer that integrates building and pruning. In Proc Int. Conf. Very Large Data Bases, , New York, NY, August zJ. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. In Proc Int. Conf. Very Large Data Bases, , Bombay, India, Sept zS. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, zD. E. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representation by error propagation. In D. E. Rumelhart and J. L. McClelland (eds.) Parallel Distributed Processing. The MIT Press, 1986


Download ppt "Knowledge discovery & data mining Classification & fraud detection Fosca Giannotti and Dino Pedreschi Pisa KDD Lab CNUCE-CNR & Univ. Pisa"

Similar presentations


Ads by Google