
1 CIS 335 CIS 335 Data Mining Classification Part I

2 CIS 335 what is a model? it can be a set of statistics or rules, a tree, a neural net, a linear function, etc. how is the model built? what assumptions does it make?

3 CIS 335 what are applications of classification?

4 CIS 335 Labels: the goal is to predict the class of an unlabeled instance. what are examples of classes? how many labels can each instance have? is it feasible to get labeled instances? the class label is discrete and unordered - why? numeric prediction is done by regression.

5 CIS 335 sets: training set, validation set, test set, cross-validation (n-fold)
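A minimal sketch of how n-fold cross-validation carves the data into folds, each used once for testing (plain Python; the data and fold count here are arbitrary):

import random

def n_fold_splits(instances, n=5, seed=0):
    # shuffle once, then deal indices into n roughly equal folds
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]
    for i in range(n):
        test = [instances[j] for j in folds[i]]
        train = [instances[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# every instance lands in the test set exactly once
for train, test in n_fold_splits(list(range(10)), n=5):
    print(len(train), len(test))  # 8 2, five times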

6 CIS 335 definitions: instances (tuples, records, samples, rows, ...) and attributes (features, variables, ...)

7 CIS 335 two-step process: learning (induction), then predicting. how does this relate to your own decision-making process?

8 CIS 335 data mining: supervised - there is a class and labeled instances are available (classification, anomaly detection); unsupervised - no class (clustering, association analysis)

9 CIS 335 mapping function y = f(X): X is the instance, f is the model learned from the training data, y is the class. sometimes there are several "discriminators" f1, f2, f3 - one for each class.

10 CIS 335 overfitting: the model describes the training data too accurately and doesn't do very well on new instances. imagine a classifier that predicts student success based on g-number; one that generalizes can be better. sometimes post-processing can improve generalization. how do you overfit?

11 CIS 335 accuracy = number of correct predictions / total predictions. for the first confusion matrix below it is 98/115 = .85. what about the second one?

            predicted
actual     0    1
   0      46   10
   1       7   52

            predicted
actual     a    b    c
   a      21    3    1
   b       5   45    2
   c       7    4   27
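A short sketch that computes accuracy as the diagonal of the confusion matrix over its total; the two matrices above are hard-coded as I read them from the slide:

def accuracy(matrix):
    # correct predictions are on the diagonal
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

two_class = [[46, 10],
             [7, 52]]
three_class = [[21, 3, 1],
               [5, 45, 2],
               [7, 4, 27]]
print(accuracy(two_class))   # 98/115 ≈ 0.85
print(accuracy(three_class)) # 93/115 ≈ 0.81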

12 CIS 335 decision trees: the model is the tree itself. each branch is a test and the leaves are labels. to classify an instance, trace the path through the tree. where have you seen decision trees?

[tree: old? y -> male? (y -> uncle, n -> aunt); n -> cousin]
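To make "trace the path through the tree" concrete, here is a minimal sketch; the nested-tuple encoding of the relatives tree is my own, not part of the lecture:

# internal node: (attribute, {answer: subtree}); leaf: a label string
tree = ("old", {
    "y": ("male", {"y": "uncle", "n": "aunt"}),
    "n": "cousin",
})

def classify(node, instance):
    # follow one branch per test until a leaf is reached
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

print(classify(tree, {"old": "y", "male": "n"}))  # aunt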

13 CIS 335 what are they good for? classifying, of course, and giving a description of the data (exploratory). the tree form is intuitive; trees are simple and fast.

14 CIS 335 induction: ID3 -> C4.5 and CART were early classifiers (J48 is Weka's implementation of C4.5). input is instances with attributes and labels; output is a tree.

15 CIS 335 Goal: pure leaves. use splits to isolate classes; each split makes the leaves more pure.

color    size    fruit
orange   small   tang
yellow   small   lemon
yellow   small   lemon
orange   small   tang
orange   small   orange
orange   large   orange
orange   large   orange

[tree: color yellow? y -> lemon; n -> size small? y -> tang, n -> orange]

16 CIS 335 Measuring Purity: split on attr x; the y branch holds +8 -2, the n branch +6 -14. gini is a common metric:
for the left leaf: gini = 1 - (.8^2 + .2^2) = .32
for the right leaf: gini = 1 - (.7^2 + .3^2) = .42
for the entire split, use the weighted sum:
gini(split) = .32*(10/30) + .42*(20/30) ≈ .387
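The slide's gini arithmetic, reproduced as a runnable sketch (counts as above: the y branch holds +8/-2, the n branch +6/-14):

def gini(counts):
    # gini impurity: 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(leaves):
    # weighted sum of leaf impurities; weights = leaf size / total size
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * gini(leaf) for leaf in leaves)

left, right = [8, 2], [6, 14]
print(gini(left))                 # 1 - (.8^2 + .2^2) = 0.32
print(gini(right))                # 1 - (.7^2 + .3^2) = 0.42
print(gini_split([left, right]))  # 0.32*(10/30) + 0.42*(20/30) ≈ 0.387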

17 CIS 335 Expanding the tree: nodes that are not very pure can be further split on another attribute. the process can continue until all nodes are pure or a threshold is met.

[tree: attr x: y -> +8 -2; n -> +6 -14, split again on attr y into +5 -2 and +1 -12]

18 CIS 335 Numeric attributes and other splits: choose a good number - one that produces the lowest gini - by evaluating all possible splits. multiway splits are also possible, e.g. marital status: S, D, M.

[tree: attr z: <10 -> +, >=10 -> -]
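A sketch of the threshold search for one numeric attribute: evaluate a cut midway between each adjacent pair of sorted values and keep the one with the lowest weighted gini (the attribute values and labels are made up):

def gini(labels):
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_threshold(values, labels):
    # candidate cuts lie between adjacent sorted values
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        if not left or not right:
            continue
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        best = min(best, (g, t))
    return best  # (weighted gini, threshold)

# hypothetical attribute z with +/- labels
print(best_threshold([3, 5, 8, 11, 14], ["+", "+", "+", "-", "-"]))  # (0.0, 9.5)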

19 CIS 335 Greedy Algorithms: example - the traveling salesman problem (TSP), where a greedy heuristic always visits the nearest unvisited city.

20 CIS 335 greedy algorithm: look through each attribute, calculate the result of its split using gini or another measure, and select the attribute/split with the best result. the split can be discrete, a continuous value, or binary with splitting sets (careful about ordinal attributes).
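A minimal sketch of one greedy step over discrete attributes: score each attribute by the weighted gini of the split it induces and keep the best (toy rows; continuous and set-valued splits are omitted):

from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, attr):
    # weighted gini after grouping the rows by one attribute's value
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_attribute(rows, labels):
    return min(rows[0], key=lambda a: split_gini(rows, labels, a))

rows = [{"color": "orange", "size": "small"}, {"color": "yellow", "size": "small"},
        {"color": "orange", "size": "large"}, {"color": "yellow", "size": "small"}]
labels = ["tang", "lemon", "orange", "lemon"]
print(best_attribute(rows, labels))  # color separates the labels best here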

21 CIS 335 selection measures based on purity: information gain, gain ratio, gini

22 CIS 335 pruning (postprocessing): subtrees can be removed if the purity is "good enough". sometimes subtrees turn out to be repeated or replicated.

23 CIS 335 Bayes classifier: based on Bayes' theorem; good accuracy and speed. assumes iid data and independence of attributes.

24 CIS 335 Probability

teen   m/f   buy
  y     f     y
  n     m     y
  n     f     y
  n     f     y
  y     m     y
  n     f     y
  n     m     n
  y     m     n
  y     m     n
  n     f     n
  y     f     n
  y     m     n
  y     f     n
  y     m     n

counts: how many total records _______ how many teens _______ how many female _______ how many buy _______
what is the probability of teens -> p(teen=y) ______ of males -> p(gender=male) _____ of buying -> p(buy=y) _______

25 CIS 335 Conditional Probability (same table as slide 24): of those that bought, how many are teens _____ how many are male _____
p(teen=y | buy=y) is the probability of being a teen given that you bought. what is the conditional probability p(teen | buy) ______ p(female | not buy) ______

26 CIS 335 Conditional Probability, cont. (same table as slide 24)
formula: p(x|y) = p(x,y) / p(y)
let x be the event that the customer is a teen and y the event that they buy. what is p(x,y)? _______ what is p(x)? _______ what is p(x|y)? ________

27 CIS 335 Bayes formula derivation: p(x,y) is the same as p(y,x). by the definition of conditional probability, p(x|y) = p(x,y)/p(y), and so p(x,y) = p(x|y) p(y); likewise p(y,x) = p(y|x) p(x). thus p(x|y) p(y) = p(y|x) p(x), and rearranging we have p(x|y) = p(y|x) p(x) / p(y).

28 CIS 335 Bayes theorem: X is an instance, C is the class. we want to know p(C0 | X), the probability that the class is 0 given the evidence X:
p(C0 | X) = p(X | C0) p(C0) / p(X)
p(C0 | X) is the posterior probability, p(C0) is the prior, p(X) is the evidence, and p(X | C0) is the likelihood.

29 CIS 335 Calculating the posterior directly (same table as slide 24): p(buy | teen) = 2/8 and p(buy | male) = 2/7. this can be done easily for one attribute. but for p(buy | not teen, male) there are only two matching instances - think about it for 100 attributes: the data is just not available.
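The slide's two fractions, recomputed by counting over the records (the tuples are my transcription of the table on slide 24):

# (teen, gender, buy) transcribed from the slide's table
records = [
    ("y", "f", "y"), ("n", "m", "y"), ("n", "f", "y"), ("n", "f", "y"),
    ("y", "m", "y"), ("n", "f", "y"), ("n", "m", "n"), ("y", "m", "n"),
    ("y", "m", "n"), ("n", "f", "n"), ("y", "f", "n"), ("y", "m", "n"),
    ("y", "f", "n"), ("y", "m", "n"),
]

def p(cond, given=lambda r: True):
    # conditional probability by counting: |cond and given| / |given|
    pool = [r for r in records if given(r)]
    return sum(1 for r in pool if cond(r)) / len(pool)

print(p(lambda r: r[2] == "y", given=lambda r: r[0] == "y"))  # p(buy | teen) = 2/8
print(p(lambda r: r[2] == "y", given=lambda r: r[1] == "m"))  # p(buy | male) = 2/7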

30 CIS 335 example: we want to predict whether or not you will have a good day based on whether you had breakfast and whether the sun is shining. let X = {x1, x2} be an instance, where x1 is breakfast (Y/N) and x2 is sunshine (Y/N). C is the class: 0 = bad day, 1 = good day.

31 CIS 335 Naive Bayes: p(C0 | x1,x2) = p(x1,x2 | C0) p(C0) / p(X). the problem is that p(x1,x2 | C0) is hard to estimate directly. simplify by assuming the attribute values are independent of each other given the class: p(x1,x2 | C0) = p(x1 | C0) p(x2 | C0).

32 CIS 335 collecting data for discrete attributes:
p(C0) = number of bad days / number of days
p(x1=1 | C0) is the fraction of bad days on which you had breakfast
p(x2=1 | C0) is the fraction of bad days on which the sun was shining
p(x1=0 | C0) is the fraction of bad days on which you didn't have breakfast
p(x2=0 | C0) is the fraction of bad days on which the sun wasn't shining
p(x1=0, x2=0) is the fraction of all days on which you didn't have breakfast and the sun wasn't shining
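A sketch of estimating these quantities by counting; the log of days is invented for illustration, since the slides don't give actual data:

def estimates(days, cls):
    # prior and per-attribute likelihoods for one class, by counting
    in_class = [d for d in days if d["day"] == cls]
    prior = len(in_class) / len(days)
    like = {(a, v): sum(1 for d in in_class if d[a] == v) / len(in_class)
            for a in ("breakfast", "sun") for v in (0, 1)}
    return prior, like

# invented log of days: breakfast? sun shining? good day?
days = [{"breakfast": 1, "sun": 1, "day": 1}, {"breakfast": 0, "sun": 0, "day": 0},
        {"breakfast": 1, "sun": 0, "day": 1}, {"breakfast": 0, "sun": 1, "day": 0},
        {"breakfast": 1, "sun": 1, "day": 1}, {"breakfast": 0, "sun": 0, "day": 0}]

prior0, like0 = estimates(days, 0)
# unnormalized naive-Bayes score for a bad day when x1=0, x2=0
print(prior0 * like0[("breakfast", 0)] * like0[("sun", 0)])  # 0.5 * 1.0 * 2/3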

33 CIS 335 another simplification: we do not have to calculate p(X), since it is the same for all posteriors regardless of class. if p(X | C0) p(C0) > p(X | C1) p(C1), then p(C0 | X) > p(C1 | X).

34 CIS 335 collecting data for continuous attributes: the same general idea as for discrete attributes. separate all values of an attribute for a particular class, calculate their mean and s.d., and use these to calculate the probability of a particular value.
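One common way to turn the mean and s.d. into a probability is a normal (Gaussian) density - an assumption on my part, since the slide doesn't name a distribution:

import math

def gaussian(x, mean, sd):
    # normal density, used as p(x | class) for a continuous attribute
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# values of one attribute observed within one class (made up)
values = [4.1, 5.0, 5.6, 6.3]
mean = sum(values) / len(values)
sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
print(gaussian(5.2, mean, sd))  # likelihood of seeing 5.2 in this class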

35 CIS 335 comparing results: the confusion matrix
precision = TP/(TP+FP)
recall = TP/(TP+FN)
accuracy = (TP+TN)/(TP+FN+FP+TN)
f1 metric = 2*prec*recall / (prec+recall)

            predicted
actual     yes   no
  yes      TP    FN
  no       FP    TN
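The four metrics as a small function over the confusion-matrix cells (same layout as the table above); the call reproduces the numbers on the next slide:

def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# the matrix on the next slide: TP=95, FN=3, FP=14, TN=87
print(metrics(95, 3, 14, 87))  # ≈ (0.87, 0.97, 0.91, 0.92)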

36 CIS 335 example:
precision = 95/109 = 0.87
recall = 95/98 = 0.97
accuracy = 182/199 = 0.91
f1 = 0.92

            predicted
actual     yes   no
  yes      95     3
  no       14    87

