Decision Trees (suggested time: 30 min)
- Definition
- Mechanism
- Splitting Functions
- Issues in Decision-Tree Learning (if time permits)
  - Avoiding overfitting through pruning
  - Numeric and Missing attributes
- Applications to Security
What is machine learning?
Illustration
Example: Learning to identify spam
[Figure: decision tree with two tests, "Is the sender unknown?" (branches No / Yes) and "Number of recipients" (branches < N / ≥ N), and leaves Spam / Not Spam.]
Definition
A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute and every terminal node corresponds to a class.
There are two types of nodes:
- Internal node: splits into different branches according to the different values the corresponding attribute can take. Example: Number of recipients <= N or Number of recipients > N.
- Terminal node: decides the class assigned to the example.
Classifying Examples
X = (Unknown Sender, Number of recipients > N)
[Figure: the spam decision tree, with tests "Is the sender unknown?" (No / Yes) and "Number of recipients" (< N / ≥ N) and leaves Spam / Not Spam; the example X is routed down the tree to its assigned class.]
Example 1 - Basic spam filter: We decide whether a message is spam based on two tests.
- Is the sender unknown? This is a categorical variable with "yes/no" values.
- Number of recipients. This is a numerical variable that is compared against a threshold N.
Of course this does not give the complete picture, but it is enough to filter some messages (and it will probably produce some false positives).
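The classification walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which leaf each branch leads to, so the leaf assignment below (known sender gives Not Spam, unknown sender with at least N recipients gives Spam) is assumed, and the threshold `N = 10`, the dictionary layout, and the names `SPAM_TREE` and `classify` are hypothetical.

```python
# Assumed threshold on the number of recipients (not given in the slides).
N = 10

# Internal nodes are dicts with a test and one subtree per outcome;
# terminal nodes are plain class labels. The leaf assignment is an assumption.
SPAM_TREE = {
    "test": lambda x: "yes" if x["sender_unknown"] else "no",
    "branches": {
        "no": "Not Spam",
        "yes": {
            "test": lambda x: ">=N" if x["num_recipients"] >= N else "<N",
            "branches": {"<N": "Not Spam", ">=N": "Spam"},
        },
    },
}


def classify(tree, example):
    """Follow the branch chosen by each internal node until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        outcome = node["test"](example)
        node = node["branches"][outcome]
    return node


# X = (Unknown Sender, Number of recipients > N)
x = {"sender_unknown": True, "num_recipients": 25}
print(classify(SPAM_TREE, x))
```

Keeping the leaves as plain strings makes the stopping condition of the walk a simple type check.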
Appropriate Problems for Decision Trees
- Attributes can be both numeric and nominal.
- The target function takes on a discrete number of values.
- Data may have errors.
- Some examples may have missing attribute values.
Historical Information
- Ross Quinlan: Induction of Decision Trees. Machine Learning Journal 1: 81-106, 1986 (over 8 thousand citations)
- Leo Breiman: CART (Classification and Regression Trees), 1984.
Mechanism
There are different ways to construct trees from data. We will concentrate on the top-down, greedy search approach.
Basic idea:
1. Choose the best attribute a* to place at the root of the tree.
2. Separate the training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains the examples that have the same value for a*.
3. Recursively apply the algorithm to each new subset until all of its examples have the same class or only a few examples remain.
Illustration
Example 2 - Intrusion detection in networks: The variables chosen are the destination port of the message and the duration of the message. Other features of the attack can be found in: Brugger, S. Terry. "Data mining methods for network intrusion detection." University of California at Davis (2004).
Attributes: Destination Port and Duration
- Destination Port has two values: > P1 or <= P1
- Duration has three values: > D2; <= D2 and > D3; <= D3
[Figure: scatter plot of the examples over Destination Port (threshold P1) and Duration (thresholds D2 and D3). Class A: Attack; Class B: Benign.]
Illustration
Suppose we choose Destination Port as the best attribute:
[Figure: a root node Destination Port with branches > P1 and <= P1; one branch ends in a leaf labeled A and the other is still undecided (?). Shown next to the scatter plot with thresholds P1, D2 and D3. Class A: Attack; Class B: Benign.]
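Illustration
Suppose we choose Duration as the next best attribute:
[Figure: a root node Destination Port with branches <= P1 and > P1; one branch ends in a leaf labeled A, and the other leads to a Duration node with branches > D2, "> D3 and <= D2", and ≤ D3, whose leaves are labeled A or B. Shown next to the scatter plot with thresholds P1, D2 and D3. Class A: Attack; Class B: Benign.]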
Formal Mechanism
1. Create a root for the tree.
2. If all examples are of the same class, or the number of examples is below a threshold, return that class (the majority class if the examples are mixed).
3. If no attributes are available, return the majority class.
4. Let a* be the best attribute.
5. For each possible value v of a*:
   - Add a branch below a* labeled "a* = v".
   - Let Sv be the subset of examples where attribute a* = v.
   - Recursively apply the algorithm to Sv.
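The steps of the formal mechanism can be sketched as a short recursive function. This is a minimal sketch, not a reference implementation: it assumes categorical attributes stored in dictionaries, uses information gain as the "best attribute" criterion, branches only on the values actually observed in the data, and the names `build_tree`, `min_examples` and the six-example toy data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs, min_examples=2):
    """Top-down greedy construction; leaves are class labels, nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on a pure subset, too few examples, or no attributes left.
    if len(set(labels)) == 1 or len(labels) < min_examples or not attrs:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining, min_examples
        )
    return {"attr": best, "branches": branches}


# Hypothetical toy data (not the 14 examples from the slides).
rows = [
    {"port": ">P1", "duration": "high"},
    {"port": ">P1", "duration": "low"},
    {"port": "<=P1", "duration": "high"},
    {"port": "<=P1", "duration": "mid"},
    {"port": "<=P1", "duration": "low"},
    {"port": "<=P1", "duration": "high"},
]
labels = ["A", "A", "B", "A", "B", "B"]
tree = build_tree(rows, labels, ["port", "duration"])
print(tree)
```

On this toy data the root splits on `port`, because that split leaves one pure subset, and the mixed branch is then split on `duration`.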
What attribute is the best to split the data? Let us recall some definitions from information theory. A measure of uncertainty, or entropy, associated with a random variable X is defined as
H(X) = -Σ_i p_i log2(p_i)
where p_i is the probability of the i-th outcome and the logarithm is in base 2. This is the "average amount of information or entropy of a finite complete probability scheme" (An Introduction to Information Theory, by F. Reza).
There are two possible complete events A and B (example: flipping a biased coin).
- P(A) = 1/256, P(B) = 255/256: H(X) = 0.0369 bit
- P(A) = 1/2, P(B) = 1/2: H(X) = 1 bit
- P(A) = 7/16, P(B) = 9/16: H(X) = 0.989 bit
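The three entropy values above can be checked numerically. The snippet below is a small sketch of the entropy definition for a finite probability scheme; the function name `entropy` is chosen here for illustration.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


for p_a, p_b in [(1 / 256, 255 / 256), (1 / 2, 1 / 2), (7 / 16, 9 / 16)]:
    h = entropy([p_a, p_b])
    print(f"P(A) = {p_a:.4f}, P(B) = {p_b:.4f}: H(X) = {h:.4f} bit")
```

The nearly certain coin carries almost no information per flip, while the fair coin reaches the maximum of 1 bit.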
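Entropy is a concave (downward) function.
[Figure: entropy of the two-event scheme plotted against the probability of one event from 0 to 1, peaking at 1 bit at probability 0.5.]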
Illustration
Attributes: Destination Port and Duration
- Destination Port has two values: > P1 or <= P1
- Duration has three values: > D2; <= D2 and > D3; <= D3
[Figure: scatter plot of the examples over Destination Port (threshold P1) and Duration (thresholds D2 and D3). Class A: Attack; Class B: Benign.]
Splitting based on Entropy
Destination Port divides the sample in two:
S1 = {6A, 0B}
S2 = {3A, 5B}
H(S1) = 0
H(S2) = -(3/8)log2(3/8) - (5/8)log2(5/8)
[Figure: the scatter plot split at P1 on the Destination Port axis into regions S1 and S2; the Duration axis shows thresholds D2 and D3.]
Splitting based on Entropy
Duration divides the sample in three:
S1 = {2A, 2B}
S2 = {5A, 0B}
S3 = {2A, 3B}
H(S1) = 1
H(S2) = 0
H(S3) = -(2/5)log2(2/5) - (3/5)log2(3/5)
[Figure: the scatter plot split at D2 and D3 on the Duration axis into regions S1, S2 and S3; the Destination Port axis shows threshold P1.]
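To put numbers on the subset entropies of both candidate splits, the class counts from the slides can be fed to a small helper. This is a sketch; the helper name `entropy_from_counts` and the dictionary layout are assumptions made for illustration, while the (A, B) counts are the ones stated above.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a subset described by its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


# (number of A, number of B) in each subset, as given in the slides.
port_split = {"S1": (6, 0), "S2": (3, 5)}
duration_split = {"S1": (2, 2), "S2": (5, 0), "S3": (2, 3)}

for name, split in [("Destination Port", port_split), ("Duration", duration_split)]:
    for subset, counts in split.items():
        print(f"{name} {subset} {counts}: H = {entropy_from_counts(counts):.3f}")
```

The pure subsets come out at 0 bits, the evenly mixed subset at 1 bit, and the two remaining subsets at roughly 0.954 and 0.971 bits.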
Information Gain
IG(A) = H(S) - Σ_v (|Sv|/|S|) H(Sv)
H(S) is the entropy of all examples.
H(Sv) is the entropy of one subsample after partitioning S based on all possible values of attribute A, and |Sv|/|S| is the fraction of the examples that fall into that subsample.
Components of IG(A)
H(S) = -(9/14)log2(9/14) - (5/14)log2(5/14)
H(S1) = 0
H(S2) = -(3/8)log2(3/8) - (5/8)log2(5/8)
|S1|/|S| = 6/14
|S2|/|S| = 8/14
[Figure: the scatter plot split at P1 on the Destination Port axis into regions S1 and S2; the Duration axis shows thresholds D2 and D3.]
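Plugging these components into the information gain formula gives a concrete comparison of the two attributes. The sketch below uses the class counts stated in the slides (9 A and 5 B overall); the function names are chosen here for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a set described by its per-class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """H(S) minus the size-weighted entropies of the subsets Sv."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - remainder


S = (9, 5)  # 9 examples of class A, 5 of class B
ig_port = information_gain(S, [(6, 0), (3, 5)])
ig_duration = information_gain(S, [(2, 2), (5, 0), (2, 3)])
print(f"IG(Destination Port) = {ig_port:.3f}")
print(f"IG(Duration)         = {ig_duration:.3f}")
```

With these counts Destination Port has the larger gain (about 0.395 against about 0.308 for Duration), which agrees with choosing it as the root in the illustration.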
Gain Ratio
Let us define the entropy of the attribute:
H(A) = -Σ_j p_j log2(p_j)
where p_j is the probability that attribute A takes value Vj. Then
GainRatio(A) = IG(A) / H(A)
Gain Ratio
H(size) = -(6/14)log2(6/14) - (8/14)log2(8/14)
where |S1|/|S| = 6/14 and |S2|/|S| = 8/14. Here H(size) is the attribute entropy H(A) for A = Destination Port.
[Figure: the scatter plot split at P1 on the Destination Port axis into regions S1 and S2; the Duration axis shows thresholds D2 and D3.]
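The gain ratio for both attributes can be computed from the same class counts. This is a sketch under the definitions given in the slides, with the attribute entropy H(A) taken over the subset sizes; the function names are hypothetical.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a distribution given as non-negative counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(s) / total * entropy(s) for s in subsets)


def gain_ratio(parent_counts, subsets):
    """IG(A) divided by H(A), the entropy of the subset sizes |Sv|/|S|."""
    attribute_entropy = entropy([sum(s) for s in subsets])
    return information_gain(parent_counts, subsets) / attribute_entropy


S = (9, 5)
port_subsets = [(6, 0), (3, 5)]              # sizes 6 and 8
duration_subsets = [(2, 2), (5, 0), (2, 3)]  # sizes 4, 5 and 5

print(f"H(size), Destination Port = {entropy([6, 8]):.3f}")
print(f"GainRatio(Destination Port) = {gain_ratio(S, port_subsets):.3f}")
print(f"GainRatio(Duration)         = {gain_ratio(S, duration_subsets):.3f}")
```

Dividing by H(A) penalizes attributes that split the data into many subsets, so the three-way Duration split loses more ground here than the two-way Destination Port split.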
Security Applications
Decision trees have been used in:
- Intrusion detection [> 11 papers]
- Online dynamic security assessment [He et al. ISGT 12]
- Password checking [Bergadano et al. CCS 97]
- Database inference [Chang, Moskowitz NSPW 98]
- Analyzing malware [Ravula et al. KDIR 11]