ML410C Projects in health informatics – Project and information management Data Mining.

1 ML410C Projects in health informatics – Project and information management Data Mining

2 Last time… Why do we need data analysis? What is data mining? Examples where data mining has been useful Data mining and other areas of computer science and mathematics Some (basic) data mining tasks

3 The Knowledge Discovery Process Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to Knowledge Discovery in Databases”, AI Magazine 17(3): (1996)

4 CRISP-DM: CRoss Industry Standard Process for Data Mining Shearer C., “The CRISP-DM model: the new blueprint for data mining”, Journal of Data Warehousing 5 (2000) (see also

5 Today
DATE | TIME | ROOM | TOPIC
MONDAY | :00-11:45 | 502 | Introduction to data mining
WEDNESDAY | :00-10:45 | 501 | Decision trees, rules and forests
FRIDAY | :00-11:45 | Sal C | Evaluating predictive models and tools for data mining

6 Today: What is classification; Overview of classification methods; Decision trees; Forests

7 Predictive data mining Our task Input: data representing objects that have been assigned labels Goal: accurately predict labels for new (previously unseen) objects

8 An example: classification. Features (attributes) are the columns and examples (observations) are the rows.
Ex. | All caps | No. excl. marks | Missing date | No. digits in From: | Image fraction | Spam
e1 | yes | 0 | no | 3 | 0 | yes
e2 | yes | 3 | no | 0 | 0.2 | yes
e3 | no 0 0 1
e4 | no | 4 | yes | 4 | 0.5 | yes
e5 | yes 0 2 0 no
e6 | no 0 0 0

9 Decision tree [Figure: a decision tree whose leaves are labeled Spam = yes, Spam = no, Spam = yes]

10 Rules [Figure: a rule set whose conclusions are Spam = no, Spam = yes, Spam = no]

11 Forests

12 Classification What is the class of the following example? – No Caps: Yes – No. excl. marks: 0 – Missing date: Yes – No. digits in From: 4 – Image fraction: 0.3

13 Classification What is classification? Issues regarding classification and prediction Classification by decision tree induction Classification by Naïve Bayes classifier Classification by Nearest Neighbor Classification by Bayesian Belief networks

14 Classification Classification: – predicts categorical class labels – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute – uses the model for classifying new data Typical applications: – credit approval – target marketing – medical diagnosis – treatment effectiveness analysis

15 Why Classification? A motivating application Credit approval – A bank wants to classify its customers based on whether they are expected to pay back their approved loans – The history of past customers is used to train the classifier – The classifier provides rules, which identify potentially reliable future customers

16 Why Classification? A motivating application Credit approval – Classification rule: If age = “ ” and income = high then credit_rating = excellent – Future customers: Paul: age = 35, income = high => excellent credit rating; John: age = 20, income = medium => fair credit rating

17 Classification—A Two-Step Process Model construction: describing a set of predetermined classes – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute – The set of tuples used for model construction: training set – The model is represented as classification rules, decision trees, or mathematical formulas

18 Classification: A Two-Step Process Model usage: for classifying future or unknown objects – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model – Accuracy rate is the percentage of test set samples that are correctly classified by the model – The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and overfitting goes undetected
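The requirement that the test set be independent of the training set can be illustrated with a simple holdout split. This is a minimal sketch, not something the slides prescribe: the function name `holdout_split`, the 30% test fraction and the fixed seed are all assumptions made for illustration.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into disjoint train/test lists.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record identifiers standing in for labeled examples.
records = list(range(10))
train, test = holdout_split(records)
print(len(train), len(test))
```

Because every record lands in exactly one of the two lists, accuracy measured on `test` is not inflated by examples the model has already seen.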

19 Classification Process (1): Model Construction Training Data → Classification Algorithms → Classifier (Model): IF LDL = ‘high’ OR Gluc > 6 mmol/L THEN Heart attack = ‘yes’

20 Classification Process (2): Use the Model in Prediction The classifier is applied to the testing data (Accuracy = ?) and to unseen data: (Jeff, high, 7.5) → Heart attack?
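The two steps can be sketched in a few lines: the rule from the model-construction slide (LDL = 'high' OR glucose > 6 mmol/L) serves as the classifier, its accuracy is estimated on labeled test records, and it is then applied to the unseen record (Jeff, high, 7.5). The test records below are hypothetical and only serve to make the accuracy computation concrete; the function names are likewise assumptions.

```python
def predict_heart_attack(ldl: str, glucose_mmol_per_l: float) -> str:
    """Apply the slide's rule: LDL = 'high' OR glucose > 6 mmol/L -> 'yes'."""
    return "yes" if ldl == "high" or glucose_mmol_per_l > 6 else "no"


def accuracy(records, predict) -> float:
    """Fraction of (ldl, glucose, true_label) records that predict() gets right."""
    correct = sum(1 for ldl, gluc, label in records if predict(ldl, gluc) == label)
    return correct / len(records)


# Hypothetical labeled test set (not from the slides).
test_data = [
    ("high", 5.0, "yes"),
    ("low", 7.2, "yes"),
    ("low", 5.1, "no"),
    ("high", 4.8, "no"),  # the rule misclassifies this record
]

test_accuracy = accuracy(test_data, predict_heart_attack)
jeff_prediction = predict_heart_attack("high", 7.5)  # unseen record (Jeff, high, 7.5)
print(test_accuracy, jeff_prediction)
```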

21 Supervised vs. Unsupervised Learning Supervised learning (classification) – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations – New data are classified based on the training set Unsupervised learning (clustering) – The class labels of the training data are unknown – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

22 Issues regarding classification and prediction: Evaluating Classification Methods Predictive accuracy Speed – time to construct the model – time to use the model Robustness – handling noise and missing values Scalability – efficiency in disk-resident databases Interpretability: – understanding and insight provided by the model Goodness of rules (quality) – decision tree size – compactness of classification rules

23 Classification by Decision Tree Induction Decision tree – A flow-chart-like tree structure – Internal node denotes a test on an attribute – Branch represents an outcome of the test – Leaf nodes represent class labels or class distribution Decision tree generation consists of two phases – Tree construction At start, all the training examples are at the root Partition examples recursively based on selected attributes – Tree pruning Identify and remove branches that reflect noise or outliers Use of decision tree: Classifying an unknown sample – Test the attribute values of the sample against the decision tree

24 Training Dataset Example

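25 Output: A Decision Tree for “buys_computer” [Figure: decision tree with root test age? and branches <=30, overcast and >40; internal nodes student? (branches no, yes) and credit rating? (branches fair, excellent); leaves labeled no and yes]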

26 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) – Tree is constructed in a top-down recursive divide-and-conquer manner – At start, all the training examples are at the root – Attributes are categorical (if continuous-valued, they are discretized in advance) – Samples are partitioned recursively based on selected attributes – Test (split) attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning – All samples for a given node belong to the same class – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf – There are no samples left

27 Algorithm for Decision Tree Induction (pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N
2. If all samples are of the same class C then label N with C; terminate;
3. If A is empty then label N with the most common class C in S (majority voting); terminate;
4. Select a ∈ A with the highest information gain; label N with a;
5. For each value v of a:
   a. Grow a branch from N with condition a = v;
   b. Let S_v be the subset of samples in S with a = v;
   c. If S_v is empty then attach a leaf labeled with the most common class in S;
   d. Else attach the node generated by GenDecTree(S_v, A - {a})
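The GenDecTree pseudocode can be turned into a short runnable sketch. The version below assumes categorical attributes stored in dictionaries and represents the tree as nested dictionaries; the names `gen_dec_tree`, `classify` and `domains`, and the four-example spam data set, are illustrative assumptions rather than anything given in the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(samples, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in {s[attr] for s in samples}:
        subset = [s[target] for s in samples if s[attr] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy([s[target] for s in samples]) - remainder


def gen_dec_tree(samples, attributes, target, domains):
    """Return a class label (leaf) or {'attr': a, 'branches': {value: subtree}}."""
    labels = [s[target] for s in samples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:  # step 2: all samples share one class
        return labels[0]
    if not attributes:  # step 3: no attributes left, majority voting
        return majority
    # Step 4: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(samples, a, target))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:  # step 5: one branch per attribute value
        subset = [s for s in samples if s[best] == value]
        branches[value] = (
            majority if not subset else gen_dec_tree(subset, rest, target, domains)
        )
    return {"attr": best, "branches": branches}


def classify(tree, sample):
    """Follow the branches matching the sample until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["attr"]]]
    return tree


# Hypothetical training data, loosely modeled on the spam example.
data = [
    {"all_caps": "yes", "missing_date": "no", "spam": "yes"},
    {"all_caps": "yes", "missing_date": "yes", "spam": "yes"},
    {"all_caps": "no", "missing_date": "yes", "spam": "yes"},
    {"all_caps": "no", "missing_date": "no", "spam": "no"},
]
domains = {"all_caps": ["yes", "no"], "missing_date": ["yes", "no"]}
tree = gen_dec_tree(data, ["all_caps", "missing_date"], "spam", domains)
print(tree)
```

Passing `domains` explicitly is a design choice: it lets step 5c fire for attribute values that do not occur in the current subset, which iterating over observed values alone would never do.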

28 Attribute Selection Measure: Information Gain Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, where $C_{i,D}$ denotes the set of tuples in D that belong to class $C_i$. Expected information (entropy) needed to classify a tuple in D: $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$, where m is the number of classes

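29 Attribute Selection Measure: Information Gain Information needed (after using A to split D into v partitions $D_1, \dots, D_v$) to classify D: $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$. Information gained by branching on attribute A: $Gain(A) = Info(D) - Info_A(D)$, where $Info(D)$ is the expected information (entropy) of D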

30 Attribute Selection: Information Gain Class P: buys_computer = “yes”; Class N: buys_computer = “no”
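As a worked instance of the entropy formula $Info(D) = -\sum_i p_i \log_2 p_i$, take the class counts that the Gini example in these slides gives for buys_computer: 9 tuples in class P (“yes”) and 5 in class N (“no”), 14 in total. The arithmetic below is a sketch with intermediate values rounded to four decimals.

```latex
\begin{align*}
Info(D) &= -\frac{9}{14}\log_2\frac{9}{14} \;-\; \frac{5}{14}\log_2\frac{5}{14} \\
        &\approx 0.4098 + 0.5305 \\
        &\approx 0.940 \text{ bits}
\end{align*}
```

A value close to 1 bit reflects a fairly even class mix; a pure partition would have entropy 0.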

31 Splitting the samples using age [Figure: samples partitioned by the test age?; visible branch labels: <=, >40; one partition is labeled yes]

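32 Gini index If a data set D contains examples from n classes, the gini index gini(D) is defined as $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in D. If D is split on A into two subsets $D_1$ and $D_2$, the gini index of the split, $gini_{split}(D)$, is defined as $gini_{split}(D) = \frac{|D_1|}{|D|} \, gini(D_1) + \frac{|D_2|}{|D|} \, gini(D_2)$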

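33 Gini index Reduction in impurity: $\Delta gini(A) = gini(D) - gini_{split}(D)$, where $gini_{split}(D)$ is the weighted gini index of the partitions produced by splitting D on A. The attribute that provides the smallest $gini_{split}(D)$ (or, equivalently, the largest reduction in impurity) is chosen to split the node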

34 Gini index (CART, IBM IntelligentMiner) Example: D has 9 tuples in buys_computer = “yes” and 5 in “no”. Suppose that attribute “income” partitions D into 10 records ($D_1$: {low, medium}) and 4 records ($D_2$: {high}).

35 Gini index Then the gini index of this split is 0.45, and $gini_{\{medium,high\}}$ = 0.30. All attributes are assumed continuous-valued. Other tools, e.g., clustering, may be needed to get the possible split values
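The gini computations can be sketched directly from class counts. The overall counts (9 “yes”, 5 “no”) and the partition sizes (10 and 4) come from the example above; the class counts inside each partition are not given in this transcript, so the values used below (7/3 and 2/2) are assumptions chosen only to be consistent with those totals. With these assumed counts the weighted index comes out near 0.44, so it should not be read as reproducing the slide's figure exactly.

```python
def gini(counts):
    """Gini index 1 - sum_j p_j^2, computed from a list of per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(partitions):
    """Size-weighted gini index of a split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)


gini_d = gini([9, 5])  # 9 "yes" and 5 "no", as in the example

# Assumed (yes, no) counts per partition; only the sizes 10 and 4 are given.
split = gini_split([[7, 3], [2, 2]])
reduction = gini_d - split  # reduction in impurity achieved by the split

print(round(gini_d, 3), round(split, 3), round(reduction, 3))
```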

36 Comparing Attribute Selection Measures The two measures, in general, return good results but – Information gain: biased towards multivalued attributes – Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions

37 Overfitting due to noise Decision boundary is distorted by noise point

38 Overfitting due to insufficient samples Why?

39 Overfitting due to insufficient samples Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region - Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task

40 Overfitting and Tree Pruning Overfitting: An induced tree may overfit the training data – Too many branches, some may reflect anomalies due to noise or outliers – Poor accuracy for unseen samples Two approaches to avoid overfitting – Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold – Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees

41 Occam’s Razor Given two models with similar generalization errors, one should prefer the simpler model over the more complex model. Therefore, one should include model complexity when evaluating a model. “Entia non sunt multiplicanda praeter necessitatem”, which translates to: “entities should not be multiplied beyond necessity”.

42 Pros and Cons of decision trees Cons – Cannot handle complicated relationships between features – Simple decision boundaries – Problems with lots of missing data Pros + Reasonable training time + Fast application + Easy to interpret + Easy to implement + Can handle a large number of features

43 Some well-known decision tree learning implementations CART: Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth. ID3: Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106. C4.5: Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann. J48: implementation of C4.5 in WEKA

44 Handling missing values Remove attributes with missing values Remove examples with missing values Assume most frequent value Assume most frequent value given a class Learn the distribution of a given attribute Find correlation between attributes

45 Handling missing values
Example | A1 | … | Class
e1 | yes | | +
e2 | no | | +
e3 | yes | | -
e4 | ? | | -
Split on A1: branch yes receives e1 (w=1), e3 (w=1), e4 (w=2/3); branch no receives e2 (w=1), e4 (w=1/3)
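The weights in this example follow from sending an example with a missing value down every branch, weighted by the share of known values that take that branch (two of the three known A1 values are yes, hence 2/3). The sketch below reproduces that bookkeeping; the function name `split_with_missing` and the tuple layout are assumptions for illustration, and the approach is in the spirit of the fractional instances used by C4.5.

```python
from collections import Counter


def split_with_missing(examples, attr, missing="?"):
    """Partition (name, record, weight) triples on attr.

    Examples whose attr value is missing go to every branch, with their weight
    scaled by that branch's share of the total known weight.
    """
    totals = Counter()
    for _, record, weight in examples:
        if record[attr] != missing:
            totals[record[attr]] += weight
    known_weight = sum(totals.values())

    branches = {value: [] for value in totals}
    for name, record, weight in examples:
        if record[attr] != missing:
            branches[record[attr]].append((name, weight))
        else:
            for value, branch_weight in totals.items():
                branches[value].append((name, weight * branch_weight / known_weight))
    return branches


# The four examples from the slide, each starting with weight 1.
examples = [
    ("e1", {"A1": "yes"}, 1.0),
    ("e2", {"A1": "no"}, 1.0),
    ("e3", {"A1": "yes"}, 1.0),
    ("e4", {"A1": "?"}, 1.0),
]
branches = split_with_missing(examples, "A1")
print(branches)
```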

46 k-nearest neighbor classifiers k-nearest neighbors of a record x are data points that have the k smallest distance to x

47 k-nearest neighbor classification Given a data record x, find its k closest points – Closeness: ? Determine the class of x based on the classes in the neighbor list – Majority vote – Weigh the vote according to distance, e.g., weight factor $w = 1/d^2$
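Both voting schemes can be sketched as follows. The slide leaves the closeness measure open, so Euclidean distance is an assumption here, as are the function name `knn_classify`, the small constant that avoids division by zero, and the toy training points, which were chosen so that the two schemes disagree.

```python
from collections import defaultdict
from math import dist


def knn_classify(train, x, k=3, weighted=False):
    """Classify x from its k nearest (point, label) pairs in train.

    Uses Euclidean distance (an assumption). With weighted=True each neighbor
    votes with weight 1/d^2; otherwise every neighbor has one vote.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, x)
        votes[label] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical two-dimensional training data.
train = [
    ((1.0, 1.0), "no"),
    ((1.2, 0.8), "no"),
    ((2.9, 3.0), "yes"),
    ((5.0, 5.0), "yes"),
]
x = (2.5, 2.5)
majority_label = knn_classify(train, x, k=3)
weighted_label = knn_classify(train, x, k=3, weighted=True)
print(majority_label, weighted_label)
```

Here two of the three nearest neighbors are labeled no, so the plain majority vote returns no, while the single very close yes neighbor dominates once votes are weighted by 1/d^2.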

48 Characteristics of nearest-neighbor classifiers No model building (lazy learners) – Lazy learners: computational time is spent in classification – Eager learners: computational time is spent in model building Decision trees try to find global models, whereas k-NN classifiers take local information into account. k-NN classifiers depend a lot on the choice of proximity measure

49 Condorcet’s jury theorem If each member of a jury is more likely to be right than wrong, then the majority of the jury, too, is more likely to be right than wrong and the probability that the right outcome is supported by a majority of the jury is a (swiftly) increasing function of the size of the jury, converging to 1 as the size of the jury tends to infinity Condorcet, 1785
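The theorem can be checked numerically with the binomial distribution. The sketch below assumes that jurors vote independently and that each is right with the same probability p; the jury sizes and p = 0.6 are illustrative choices, and odd jury sizes are used so that ties cannot occur.

```python
from math import comb


def majority_correct_probability(n: int, p: float) -> float:
    """Probability that more than half of n independent voters are right,
    when each voter is right with probability p."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Illustrative jury sizes; the probability grows with n when p > 0.5.
probabilities = {n: majority_correct_probability(n, 0.6) for n in (1, 3, 11, 101)}
for n, prob in probabilities.items():
    print(n, round(prob, 3))
```

The same argument motivates combining many classifiers by voting, provided their errors are not strongly correlated.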

50 Condorcet’s jury theorem

51 Random forests Random forests (Breiman 2001) are generated by combining two techniques: bagging (Breiman 1996) and the random subspace method (Ho 1998). L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8): , 1998

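52 Bagging A bootstrap replicate E’ of a set of examples E is created by randomly selecting n = |E| examples from E with replacement.
Original examples E:
Ex. | Other | Bar | Fri/Sat | Hungry | Guests | Wait
e1 | yes | no | | yes | some | yes
e2 | yes | no | | yes | full | no
e3 | no | yes | no | | some | yes
e4 | yes | no | yes | | full | yes
e5 | yes | no | yes | no | none | no
e6 | no | yes | no | yes | some | yes
Bootstrap replicate E’:
Ex. | Other | Bar | Fri/Sat | Hungry | Guests | Wait
e2 | yes | no | | yes | full | no
e2 | yes | no | | yes | full | no
e3 | no | yes | no | | some | yes
e4 | yes | no | yes | | full | yes
e4 | yes | no | yes | | full | yes
e6 | no | yes | no | yes | some | yes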
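Drawing a bootstrap replicate takes only a few lines. In this sketch the examples are represented by their identifiers e1 to e6; the function name `bootstrap_replicate` and the fixed seed are assumptions, and the particular replicate drawn depends on the seed, so it will generally differ from the one shown on the slide.

```python
import random


def bootstrap_replicate(examples, rng):
    """Draw n = len(examples) examples uniformly at random, with replacement."""
    n = len(examples)
    return [examples[rng.randrange(n)] for _ in range(n)]


examples = ["e1", "e2", "e3", "e4", "e5", "e6"]
rng = random.Random(0)  # fixed seed only to make the run repeatable
replicate = bootstrap_replicate(examples, rng)

# Examples that were never drawn are "out of bag" for this replicate.
out_of_bag = [e for e in examples if e not in replicate]
print(replicate, out_of_bag)
```

In bagging, one model is trained per replicate and the models' predictions are combined by voting.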

53 Forests

54 Today: What is classification; Overview of classification methods; Decision trees; Forests

55 Next time
DATE | TIME | ROOM | TOPIC
MONDAY | :00-11:45 | 502 | Introduction to data mining
WEDNESDAY | :00-10:45 | 501 | Decision trees, rules and forests
FRIDAY | :00-11:45 | Sal C | Evaluating predictive models and tools for data mining

