Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012.


1 Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012

2 Information Theory 2

3 Entropy Information-theoretic measure: measures the information in a model; conceptually, a lower bound on the number of bits needed to encode outcomes. Entropy: H(X) = - Σ_x p(x) log2 p(x), where X is a random variable and p is its probability function 3
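Not part of the original slides: a minimal Python sketch of the entropy formula above, for a distribution given as a list of probabilities.

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit for a fair coin
print(entropy([0.25, 0.75]))  # ~0.811 bits
```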

4-9 Cross-Entropy Comparing models: the actual distribution p is unknown, so we use a simplified model m to estimate it. Cross-entropy: H(p, m) = - Σ_x p(x) log2 m(x). A closer match to p will have lower cross-entropy
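Not part of the original slides: a small sketch comparing two candidate models against a "true" distribution with the cross-entropy formula above; the example distributions are made up.

```python
import math

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) log2 m(x); lower means m is a closer match to p."""
    return -sum(px * math.log2(mx) for px, mx in zip(p, m) if px > 0)

p    = [0.5, 0.25, 0.25]   # "true" distribution
good = [0.4, 0.3, 0.3]
bad  = [0.1, 0.1, 0.8]
print(cross_entropy(p, good))  # ~1.53 bits (closer match)
print(cross_entropy(p, bad))   # ~2.57 bits (worse match)
```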

10-13 Relative Entropy Commonly known as Kullback-Leibler divergence; expresses the difference between probability distributions: KL(p||q) = Σ_x p(x) log2 (p(x)/q(x)). Not a proper distance metric: asymmetric, KL(p||q) != KL(q||p)
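Not part of the original slides: a sketch of KL divergence that also shows the asymmetry the slide points out.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2 (p(x) / q(x)); zero only when p == q."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # != kl_divergence(q, p)
print(kl_divergence(q, p))
```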

14-17 Joint & Conditional Entropy Joint entropy: H(X, Y) = - Σ_{x,y} p(x, y) log2 p(x, y). Conditional entropy: H(Y|X) = - Σ_{x,y} p(x, y) log2 p(y|x) = H(X, Y) - H(X)
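Not part of the original slides: a sketch computing joint and conditional entropy from a toy joint distribution (the nested-dict representation and the example probabilities are illustrative).

```python
import math

# joint distribution p(x, y) as a nested dict: {x: {y: prob}}
joint = {
    'rain': {'umbrella': 0.3, 'no_umbrella': 0.1},
    'sun':  {'umbrella': 0.1, 'no_umbrella': 0.5},
}

def joint_entropy(pxy):
    """H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
    return -sum(p * math.log2(p) for row in pxy.values() for p in row.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y | X) = H(X, Y) - H(X)."""
    px = {x: sum(row.values()) for x, row in pxy.items()}
    hx = -sum(p * math.log2(p) for p in px.values() if p > 0)
    return joint_entropy(pxy) - hx

print(joint_entropy(joint), conditional_entropy(joint))
```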

18 Perplexity and Entropy Consider the perplexity equation: PP(W) = P(W)^(-1/N) = 2^(-(1/N) log2 P(W)) = 2^H(L,P), where H(L,P) is the per-word cross-entropy of the language L under the model P 18
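Not part of the original slides: a numeric sketch of the identity above, using a made-up list of per-word model probabilities for a toy test set.

```python
import math

# Toy "test set": model probabilities assigned to N words.
word_probs = [0.1, 0.2, 0.05, 0.1]
N = len(word_probs)

log_prob = sum(math.log2(p) for p in word_probs)   # log2 P(W)
cross_ent = -log_prob / N                          # per-word cross-entropy H
perplexity = 2 ** cross_ent                        # PP(W) = 2^H

# Same number computed directly as P(W)^(-1/N): both print 10.0 here.
print(perplexity, math.prod(word_probs) ** (-1 / N))
```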

19-24 Mutual Information Measure of the information in common between two distributions: I(X;Y) = Σ_{x,y} p(x, y) log2 ( p(x, y) / (p(x) p(y)) ). Symmetric: I(X;Y) = I(Y;X). Equivalently, I(X;Y) = KL(p(x,y) || p(x)p(y))
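Not part of the original slides: a sketch of mutual information over the same kind of nested-dict joint distribution used above; the example numbers are illustrative.

```python
import math

def mutual_information(pxy):
    """I(X; Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) = KL(p(x,y) || p(x)p(y))."""
    px = {x: sum(row.values()) for x, row in pxy.items()}
    py = {}
    for row in pxy.values():
        for y, p in row.items():
            py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x, row in pxy.items() for y, p in row.items() if p > 0)

joint = {'rain': {'umbrella': 0.3, 'no_umbrella': 0.1},
         'sun':  {'umbrella': 0.1, 'no_umbrella': 0.5}}
print(mutual_information(joint))  # same value with X and Y swapped
```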

25 Decision Trees 25

26-29 Classification Task Task: C is a finite set of labels (aka categories, classes); given x, determine its category y in C. Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class. Data: a set of instances; labeled data: y is known; unlabeled data: y is unknown. Training data vs. test data

30-32 Two Stages Training: a learner maps training data to a classifier f(x) = y, where x is the input and y is in C. Testing: a decoder applies the classifier to test data to produce classification output. Also: preprocessing, postprocessing, evaluation


34 Roadmap Decision Trees: Sunburn example; decision tree basics; from trees to rules. Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types? Analysis: Pros & Cons 34

35 Sunburn Example 35

36 Learning about Sunburn Goal: train on labeled examples; predict Burn/None for new instances. Possible solution: exact match (same features, same output). Problem: 2*3^3 = 54 feature combinations, and it could be much worse. Alternative: assign the same label as the 'most similar' example. Problem: what counts as close? Which features matter? Many examples match on two features but differ in the result 36

37 Learning about Sunburn Better Solution: Decision tree Training: Divide examples into subsets based on feature tests Sets of samples at leaves define classification Prediction: Route NEW instance through tree to leaf based on feature tests Assign same value as samples at leaf 37

38 Sunburn Decision Tree Root test: Hair Color. Blonde -> test Lotion Used: No -> Sarah: Burn, Annie: Burn; Yes -> Katie: None, Dana: None. Red -> Emily: Burn. Brown -> Alex: None, John: None, Pete: None 38

39 Decision Tree Structure Internal nodes: each node is a test, generally on a single feature (e.g., Hair == ?); theoretically it could test multiple features. Branches: each branch corresponds to an outcome of the test (e.g., Hair == Red; Hair != Blond). Leaves: each leaf corresponds to a decision. Discrete class: classification/decision tree; real value: regression tree 39

40 From Trees to Rules Tree: each branch from root to leaf is a rule: the tests along the path are the antecedents (if-part) and the leaf label is the consequent (then-part). Every decision tree can be converted to rules, but not every rule set can be expressed as a tree 40

41 From ID Trees to Rules (Same tree as the previous slide.)
(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None)) 41

42 Which Tree? Many possible decision trees for any problem How can we select among them? What would be the ‘best’ tree? Smallest? Shallowest? Most accurate on unseen data? 42

43 Simplicity Occam’s Razor: Simplest explanation that covers the data is best Occam’s Razor for decision trees: Smallest tree consistent with samples will be best predictor for new data Problem: Finding all trees & finding smallest: Expensive! Solution: Greedily build a small tree 43

44 Building Trees: Basic Algorithm Goal: Build a small tree such that all samples at leaves have same class Greedy solution: At each node, pick test using ‘best’ feature Split into subsets based on outcomes of feature test Repeat process until stopping criterion i.e. until leaves have same class 44

45 Key Questions Splitting: How do we select the ‘best’ feature? Stopping: When do we stop splitting to avoid overfitting? Features: How do we split different types of features? Binary? Discrete? Continuous? 45

46 Building Decision Trees: I Goal: Build a small tree such that all samples at leaves have the same class Greedy solution: At each node, pick the test such that branches are closest to having the same class, i.e. split into subsets where most instances are in a uniform class 46

47 Picking a Test Candidate tests on the full training set (B = Burn, N = None):
Hair Color: Blonde -> Sarah:B, Dana:N, Annie:B, Katie:N; Red -> Emily:B; Brown -> Alex:N, Pete:N, John:N
Height: Short -> Alex:N, Annie:B, Katie:N; Average -> Sarah:B, Emily:B, John:N; Tall -> Dana:N, Pete:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Alex:N, Annie:B; Heavy -> Emily:B, Pete:N, John:N
Lotion: No -> Sarah:B, Annie:B, Emily:B, Pete:N, John:N; Yes -> Dana:N, Alex:N, Katie:N 47

48 Picking a Test Candidate tests on the remaining Blonde subset:
Height: Short -> Annie:B, Katie:N; Average -> Sarah:B; Tall -> Dana:N
Weight: Light -> Sarah:B, Katie:N; Average -> Dana:N, Annie:B; Heavy -> (none)
Lotion: No -> Sarah:B, Annie:B; Yes -> Dana:N, Katie:N 48

49 Measuring Disorder Problem: In general, tests on large DB’s don’t yield homogeneous subsets Solution: General information theoretic measure of disorder Desired features: Homogeneous set: least disorder = 0 Even split: most disorder = 1 49

50 Measuring Entropy If we split m objects into 2 bins of sizes m1 & m2, what is the entropy? 50

51 Measuring Disorder: Entropy p_i is the probability of being in bin i; entropy (disorder) of a split: H = - Σ_i p_i log2 p_i (taking 0 log2 0 = 0).
p1 = ½, p2 = ½: -½ log2 ½ - ½ log2 ½ = ½ + ½ = 1
p1 = ¼, p2 = ¾: -¼ log2 ¼ - ¾ log2 ¾ = 0.5 + 0.311 = 0.811
p1 = 1, p2 = 0: -1 log2 1 - 0 log2 0 = 0 - 0 = 0 51
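Not part of the original slides: a sketch reproducing the three rows of the table above for a two-bin split.

```python
import math

def split_entropy(m1, m2):
    """Entropy of splitting m1 + m2 objects into two bins of sizes m1 and m2."""
    m = m1 + m2
    probs = [k / m for k in (m1, m2) if k > 0]
    return -sum(p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

print(split_entropy(2, 2))  # 1.0    (even split: most disorder)
print(split_entropy(1, 3))  # ~0.811
print(split_entropy(0, 4))  # 0.0    (homogeneous bin: least disorder)
```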

52 Information Gain InfoGain(Y|X): how many bits can we save if we know X? InfoGain(Y|X) = H(Y) - H(Y|X) (equivalently written InfoGain(Y,X)) 52

53 Information Gain InfoGain(S,A): expected reduction in entropy due to splitting on A: InfoGain(S,A) = H(S) - Σ_a (|S_a|/|S|) H(S_a). Select the A with max InfoGain, i.e. the one resulting in the lowest average entropy 53

54 Computing Average Entropy With |S| instances split by a test into branches (e.g. Branch 1 with class counts S_{a1,a} and S_{a1,b}, Branch 2 with S_{a2,a} and S_{a2,b}), the average entropy is Σ_i (fraction of samples down branch i) x (disorder of the class distribution on branch i) = Σ_i (|S_i| / |S|) H(S_i) 54
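Not part of the original slides: a sketch of the weighted average entropy of a split, with branches given as lists of class labels (the two-branch example is made up).

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy of the class distribution on one branch."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def average_entropy(branches):
    """Weighted average: sum_i (|S_i| / |S|) * H(branch i)."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * label_entropy(b) for b in branches)

# Two branches over 8 samples (B = Burn, N = None)
print(average_entropy([['B', 'B', 'N', 'N'], ['N', 'N', 'N', 'B']]))  # ~0.906
```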

55 Entropy in Sunburn Example S = [3B, 5N], so H(S) = 0.954.
Hair color: 0.954 - (4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
Height: 0.954 - 0.69 = 0.264
Weight: 0.954 - 0.94 = 0.014
Lotion: 0.954 - 0.61 = 0.344 55
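Not part of the original slides: a few lines checking the hair-color number on this slide; the helper H over class counts is an assumed convenience, not from the deck.

```python
import math

def H(counts):
    """Entropy of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum((k / n) * math.log2(k / n) for k in counts if k > 0)

h_root = H([3, 5])                                    # [3 Burn, 5 None] ~= 0.954
blonde, red, brown = H([2, 2]), H([0, 1]), H([0, 3])  # the three hair-color branches
avg = (4/8) * blonde + (1/8) * red + (3/8) * brown    # weighted by branch size
print(round(h_root - avg, 3))                         # 0.454, as on the slide
```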

56 Entropy in Sunburn Example At the Blonde node, S = [2B, 2N], so H(S) = 1.
Height: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
Weight: 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2)) = 1 - 1 = 0
Lotion: 1 - 0 = 1, so Lotion Used is chosen 56

57 Building Decision Trees with Information Gain Until there are no inhomogeneous leaves Select an inhomogeneous leaf node Replace that leaf node by a test node creating subsets that yield highest information gain Effectively creates set of rectangular regions Repeatedly draws lines in different axes 57
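Not part of the original slides: a minimal recursive sketch of this greedy procedure (recursion in place of the slide's leaf-replacement loop), assuming instances are dicts of discrete feature values plus a 'label' key; all names are illustrative.

```python
import math
from collections import Counter

def entropy(instances):
    n = len(instances)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(x['label'] for x in instances).values())

def info_gain(instances, feature):
    parts = {}
    for x in instances:
        parts.setdefault(x[feature], []).append(x)
    avg = sum(len(p) / len(instances) * entropy(p) for p in parts.values())
    return entropy(instances) - avg

def build_tree(instances, features):
    labels = {x['label'] for x in instances}
    if len(labels) == 1 or not features:                  # stopping criterion
        return Counter(x['label'] for x in instances).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(instances, f))
    node = {'test': best, 'branches': {}}
    for value in {x[best] for x in instances}:            # one branch per value
        subset = [x for x in instances if x[best] == value]
        node['branches'][value] = build_tree(subset, [f for f in features if f != best])
    return node
```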

58 Alternate Measures Issue with information gain: it favors features with more values. Option: Gain Ratio. GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = - Σ_a (|S_a| / |S|) log2 (|S_a| / |S|) and S_a is the set of elements of S with value A = a 58
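Not part of the original slides: a sketch of gain ratio, reusing the info_gain helper from the tree-building sketch above; names and data layout are assumptions.

```python
import math

def split_info(instances, feature):
    """SplitInfo(S, A) = -sum_a (|S_a| / |S|) log2 (|S_a| / |S|)."""
    n = len(instances)
    counts = {}
    for x in instances:
        counts[x[feature]] = counts.get(x[feature], 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(instances, feature):
    """GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A)."""
    si = split_info(instances, feature)
    return info_gain(instances, feature) / si if si > 0 else 0.0
```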

59 Overfitting Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details. Why is this bad? It harms generalization: the model fits the training data too well and fits new data badly. For a model m, consider training_error(m) and D_error(m), where D is all data. If m overfits, then for another model m', training_error(m) < training_error(m') but D_error(m) > D_error(m') 59

60 Avoiding Overfitting Strategies to avoid overfitting: Early stopping: Stop when InfoGain < threshold Stop when number of instances < threshold Stop when tree depth > threshold Post-pruning Grow full tree and remove branches Which is better? Unclear, both used. For some applications, post-pruning better 60

61 Post-Pruning Divide data into a training set (used to build the original tree) and a validation set (used to perform pruning). Build the decision tree on the training data. Then, until pruning no longer helps: compute validation-set performance for pruning each node (and its children), and greedily remove nodes whose removal does not reduce validation-set performance. Yields a smaller tree with the best performance 61

62 Performance Measures Compute accuracy on: Validation set k-fold cross-validation Weighted classification error cost: Weight some types of errors more heavily Minimum description length: Favor good accuracy on compact models MDL = error(tree) + model_size(tree) 62

63 Rule Post-Pruning Convert tree to rules Prune rules independently Sort final rule set Probably most widely used method (toolkits) 63

64 Modeling Features Different types of features need different tests. Binary: test branches on true/false. Discrete: one branch for each discrete value. Continuous? Need to discretize; enumerating all values is not possible or desirable. Instead, pick a value x and branch on value < x vs. value >= x. How can we pick split points? 64

65 Picking Splits Need useful, sufficient split points. What's a good strategy? Approach: sort all values for the feature in the training data; identify adjacent instances of different classes; candidate split points lie between those instances; select the candidate with the highest information gain 65
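Not part of the original slides: a sketch of the candidate-split step above, using midpoints between adjacent differently-labeled instances; the example values are made up.

```python
def candidate_splits(values_and_labels):
    """Candidate thresholds between adjacent, differently-labeled sorted values."""
    pts = sorted(values_and_labels)               # [(value, label), ...]
    candidates = []
    for (v1, y1), (v2, y2) in zip(pts, pts[1:]):
        if y1 != y2 and v1 != v2:
            candidates.append((v1 + v2) / 2)
    return candidates

# e.g. ages of Burn/None examples
print(candidate_splits([(23, 'B'), (25, 'B'), (31, 'N'), (40, 'N'), (45, 'B')]))
# -> [28.0, 42.5]; each candidate would then be scored by information gain
```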

66 Features in Decision Trees: Pros Feature selection: Tests features that yield low disorder E.g. selects features that are important! Ignores irrelevant features Feature type handling: Discrete type: 1 branch per value Continuous type: Branch on >= value Absent features: Distribute uniformly 66

67 Features in Decision Trees: Cons Features Assumed independent If want group effect, must model explicitly E.g. make new feature AorB Feature tests conjunctive 67

68 Decision Trees Train: build the tree by forming subsets of least disorder. Predict: traverse the tree based on feature tests and assign the label of the samples at the leaf node. Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rules can be read off the tree. Cons: poor at handling feature combinations and dependencies; building the optimal tree is intractable 68

