
1 Advanced Artificial Intelligence
Lecture 3: Decision Tree

2 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

3 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

4 Decision Tree Introduction
The decision tree is one of the most powerful and popular classification and prediction algorithms in current use in data mining and machine learning. Decision trees are attractive because, in contrast to neural networks, they represent explicit rules. Such rules can readily be expressed so that humans can understand them, or even be translated directly into a database query language such as SQL so that the records falling into a particular category can be retrieved.

5 Decision Tree A decision tree consists of:
Nodes: test the value of a certain attribute
Edges: correspond to the outcomes of the test and connect to the next node or leaf
Leaves: terminal nodes that predict the outcome

6 Decision Tree Example To classify an example:
1. Start at the root
2. Perform the test at the current node
3. Follow the edge corresponding to the outcome of the test
4. Go to step 2 unless the current node is a leaf
5. Predict the outcome associated with the leaf

7 Decision Tree Learning

8 Why Decision Tree ?

9 A Sample Task

10 A Sample Task

11 Divide-And-Conquer Algorithms
Family of decision tree learning algorithms: TDIDT, Top-Down Induction of Decision Trees
Learn trees in a top-down fashion:
divide the problem into subproblems
solve each subproblem recursively

12 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

13 Constructing a Decision Tree
Two aspects:
Which attribute to choose? Information gain, based on entropy
Where to stop? Termination criteria

14 Calculation of Entropy
Entropy is a measure of uncertainty in the data.
S = set of examples
Si = subset of S with value vi under the target attribute
l = size of the range (number of distinct values) of the target attribute
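The entropy formula itself appears only as an image on the original slide; written out with the definitions above, its standard form is:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{l} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$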

15 Entropy Let us say I am considering an action like a coin toss. Say I have five coins whose probabilities of heads are 0, 0.25, 0.5, 0.75 and 1. When I toss them, which one has the highest uncertainty and which one has the least? Information gain = entropy of the system before the split - entropy of the system after the split.
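A minimal Python sketch (not part of the original slides) that answers the coin question numerically:

```python
import math

def entropy(p):
    """Binary entropy in bits for a probability p of heads."""
    if p in (0.0, 1.0):          # a certain outcome carries no uncertainty
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"P(heads)={p:<5} entropy={entropy(p):.3f} bits")
# The fair coin (p=0.5) is the most uncertain (1 bit);
# the coins with p=0 or p=1 have zero entropy.
```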

16 Entropy: Measure of Randomness

17 Information Gain When an attribute A splits the set S into subsets Si
The attribute that maximizes the information gain is selected
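The gain formula on this slide was shown as an image; written with the entropy definition above, it is:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$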

18 Information Gain Splitting at a node stops when one of the following holds:
All the records at the node belong to one class
A significant majority fraction of the records belongs to a single class
The segment contains only one or a very small number of records
The improvement is not substantial enough to warrant making the split

19 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

20 Iterative Dichotomiser 3 (ID3)
Quinlan (1986). Each node corresponds to a splitting attribute, and each arc is a possible value of that attribute. At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root. The algorithm uses the criterion of information gain to determine the goodness of a split: the attribute with the greatest information gain is taken as the splitting attribute, and the dataset is split on all distinct values of that attribute.
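A compact Python sketch of the ID3 recursion described above (an illustration, not Quinlan's original implementation; rows are assumed to be dicts mapping attribute names to values):

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain of splitting (rows, labels) on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Return a nested-dict decision tree: {attr: {value: subtree_or_label}}."""
    if len(set(labels)) == 1:              # pure node -> leaf
        return labels[0]
    if not attributes:                     # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(sub_rows, sub_labels, rest)
    return tree
```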

21 Training Dataset In this dataset there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system that will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, humidity and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as input attributes. In this problem we have 14 examples: 9 examples with play = yes and 5 examples with play = no. So the entropy of the full training set is info([9; 5]) = entropy(9/14, 5/14) = 0.940 bits.

22 Training Dataset

23 Consider splitting on Outlook attribute
Outlook = Sunny: info([2; 3]) = entropy(2/5, 3/5) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971 bits
Outlook = Overcast: info([4; 0]) = entropy(4/4, 0/4) = -1 log2 1 - 0 log2 0 = 0 bits
Outlook = Rainy: info([3; 2]) = entropy(3/5, 2/5) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971 bits

24 Consider splitting on Outlook attribute
The expected information needed to classify objects in all subtrees of the outlook attribute is: info_outlook(S) = info([2; 3]; [4; 0]; [3; 2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits. Information gain = info before split - info after split, so gain(Outlook) = info([9; 5]) - info([2; 3]; [4; 0]; [3; 2]) = 0.940 - 0.693 = 0.247 bits.
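A quick numerical check of these gains, assuming the standard 14-example weather data (the same table reproduced on slide 50); illustrative Python, not part of the original slides:

```python
import math
from collections import Counter

data = [  # (outlook, temperature, humidity, windy, play) -- standard weather data
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, name in enumerate(["outlook", "temperature", "humidity", "windy"]):
    split = Counter(row[i] for row in data)
    after = sum(cnt / len(data) * entropy([r[-1] for r in data if r[i] == v])
                for v, cnt in split.items())
    print(f"gain({name}) = {entropy(labels) - after:.3f}")
# gain(outlook)=0.247, gain(temperature)=0.029, gain(humidity)=0.152, gain(windy)=0.048
```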

25 Consider splitting on Temperature attribute
Temperature = Hot: info([2; 2]) = entropy(2/4, 2/4) = -2/4 log2 2/4 - 2/4 log2 2/4 = 1 bit
Temperature = Mild: info([4; 2]) = entropy(4/6, 2/6) = -4/6 log2 4/6 - 2/6 log2 2/6 = 0.918 bits
Temperature = Cool: info([3; 1]) = entropy(3/4, 1/4) = -3/4 log2 3/4 - 1/4 log2 1/4 = 0.811 bits

26 Consider splitting on Temperature attribute
The expected information needed to classify objects in all subtrees of the temperature attribute is: info([2; 2]; [4; 2]; [3; 1]) = 4/14 × 1 + 6/14 × 0.918 + 4/14 × 0.811 = 0.911 bits. Information gain = info before split - info after split, so gain(temperature) = 0.940 - 0.911 = 0.029 bits.

27 Other attributes Computed the same way, gain(humidity) = 0.152 bits and gain(windy) = 0.048 bits, so Outlook has the largest gain and is chosen as the root split.

28 ID3 Process

29 The output decision tree

30 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

31 Intrinsic Information of an Attribute
If an attribute has a distinct value for every example (such as an ID attribute), each subset after the split contains a single sample, so the entropy after the split is zero and the information gain is maximal even though the split is useless for classification. Information gain therefore cannot select valid classification features in this case, so C4.5 uses the gain ratio to improve on ID3. The intrinsic information of a split is the entropy of the distribution of instances into branches, i.e. how much information we need to tell which branch an instance belongs to. Observation: attributes with higher intrinsic information are less useful.
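The intrinsic-information formula was shown only as an image; in the notation used earlier (the Si being the subsets produced by splitting S on attribute A) it is:

$$\mathrm{IntrinsicInfo}(S, A) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$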

32 Gain Ratio A modification of the information gain that reduces its bias towards multi-valued attributes.
Takes the number and size of branches into account when choosing an attribute.
Corrects the information gain by taking the intrinsic information of a split into account.
Definition of Gain Ratio:
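The definition referred to above, reconstructed from the description:

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{IntrinsicInfo}(S, A)}$$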

33 Gain ratios for weather data
Outlook attribute would still win... one has to be careful which attributes to add... Nevertheless: Gain ratio is more reliable than Information Gain

34 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

35 Regression Trees Differences to Decision Trees (Classification Trees)
Leaf nodes: predict the average value of all instances in the leaf
Splitting criterion: minimize the variance of the values in each subset Si (standard deviation reduction)
Termination criteria: very important! (otherwise each leaf ends up with only single points)
lower bound on the standard deviation in a node
lower bound on the number of examples in a node

36 Regression Trees The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction.

37 Standard Deviation A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use the standard deviation to measure the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.

38 Standard Deviation a) Standard deviation for one attribute:

39 Standard Deviation b) Standard deviation for two attributes:
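The two formulas referred to on slides 38 and 39 were images; reconstructed in standard form, a) the standard deviation of the target over one set, and b) the weighted standard deviation after splitting on attribute X (summing over the branches c of X):

$$S(T) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \qquad\qquad S(T, X) = \sum_{c \in X} P(c)\, S(c)$$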

40 Standard Deviation Reduction
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches). a) The standard deviation of the target is calculated: standard deviation (Hours Played) = 9.32.

41 Standard Deviation Reduction
b) The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.
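Written out, the quantity described in b) is:

$$\mathrm{SDR}(T, X) = S(T) - S(T, X)$$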

42 Standard Deviation Reduction
c) The attribute with the largest standard deviation reduction is chosen for the decision node. d) Dataset is divided based on the values of the selected attribute.

43 Standard Deviation Reduction
e) A branch set with standard deviation more than 0 needs further splitting. In practice, we need some termination criteria. For example, when standard deviation for the branch becomes smaller than a certain fraction (e.g., 5%) of standard deviation for the full dataset OR when too few instances remain in the branch 

44 Standard Deviation Reduction
f) The process is run recursively on the non-leaf branches until all data is processed. When more than one instance remains at a leaf node, we take the average as the final value for the target.

45 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

46 CART Classification and regression tree
A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The main elements of CART (and any decision tree algorithm) are: Rules for splitting data at a node based on the value of one variable; Stopping rules for deciding when a branch is terminal and can be split no more; Finally, a prediction for the target variable in each terminal node.

47 CART Although the classification tree and the regression tree both recursively construct a binary decision tree, the criteria they use for selecting the splitting feature are different: The Gini index is used as the feature selection criterion when constructing a binary classification tree; The minimized squared error is used as the feature selection criterion when constructing a binary regression tree.

48 CART Binary Classification Tree
The binary classification tree uses the Gini index as the metric for optimal feature selection. The Gini index is defined as follows: the possible values of the category C in the data set D are c1, c2, ⋯, ck (k is the number of categories), and the probability that a sample belongs to category ci is p(i). Then the Gini index of this probability distribution is expressed as shown below. If all samples belong to the same class, then p1 = 1 and p2 = p3 = ⋯ = pk = 0, so Gini(C) = 0, and the data has the highest purity (lowest impurity) at this point. The physical meaning of Gini(D) is the uncertainty of data set D: the greater the value, the greater the uncertainty.
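The Gini formula described above, reconstructed in the notation of the slide:

$$\mathrm{Gini}(C) = \sum_{i=1}^{k} p(i)\,\big(1 - p(i)\big) = 1 - \sum_{i=1}^{k} p(i)^2$$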

49 CART Binary Classification Tree
If k = 2 (a two-class problem, with class names positive and negative) and the probability that a sample is positive is p, then the corresponding Gini index is given below. If the data set D is split on whether feature f takes a certain possible value f∗, D is divided into two parts D1 and D2, that is, D1 = {(x, y) ∈ D | f(x) = f∗} and D2 = D - D1. Then the Gini index of feature f on dataset D is defined as shown below.
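The two formulas referred to above, reconstructed:

$$\mathrm{Gini}(p) = 2p(1 - p) \qquad\qquad \mathrm{Gini}(D, f) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$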

50 CART Example The weather data (14 examples; Play is the class attribute):
ID  Outlook   Temperature  Humidity  Windy  Play
1   Sunny     Hot          High      FALSE  No
2   Sunny     Hot          High      TRUE   No
3   Overcast  Hot          High      FALSE  Yes
4   Rainy     Mild         High      FALSE  Yes
5   Rainy     Cool         Normal    FALSE  Yes
6   Rainy     Cool         Normal    TRUE   No
7   Overcast  Cool         Normal    TRUE   Yes
8   Sunny     Mild         High      FALSE  No
9   Sunny     Cool         Normal    FALSE  Yes
10  Rainy     Mild         Normal    FALSE  Yes
11  Sunny     Mild         Normal    TRUE   Yes
12  Overcast  Mild         High      TRUE   Yes
13  Overcast  Hot          Normal    FALSE  Yes
14  Rainy     Mild         High      TRUE   No

51 CART Example When Outlook = "sunny", the two parts D1 and D2 have class counts (+2, -3) and (+7, -2) respectively.
D1 = {1, 2, 8, 9, 11}, |D1| = 5
D2 = {3, 4, 5, 6, 7, 10, 12, 13, 14}, |D2| = 9
Gini(D1) = 2 × 3/5 × 2/5 = 12/25
Gini(D2) = 2 × 7/9 × 2/9 = 28/81
Gini(C, Outlook = "sunny") = 5/14 × Gini(D1) + 9/14 × Gini(D2) = 0.394

52 CART Example When Outlook = "overcast", the two parts D1 and D2 have class counts (+4, 0) and (+5, -5) respectively.
D1 = {3, 7, 12, 13}, |D1| = 4
D2 = {1, 2, 4, 5, 6, 8, 9, 10, 11, 14}, |D2| = 10
Gini(D1) = 2 × 1 × 0 = 0
Gini(D2) = 2 × 5/10 × 5/10 = 1/2
Gini(C, Outlook = "overcast") = 4/14 × Gini(D1) + 10/14 × Gini(D2) = 0.357

53 CART Example When Outlook = "rainy", the two parts D1 and D2 have class counts (+3, -2) and (+6, -3) respectively.
D1 = {4, 5, 6, 10, 14}, |D1| = 5
D2 = {1, 2, 3, 7, 8, 9, 11, 12, 13}, |D2| = 9
Gini(D1) = 2 × 3/5 × 2/5 = 12/25
Gini(D2) = 2 × 6/9 × 3/9 = 4/9
Gini(C, Outlook = "rainy") = 5/14 × Gini(D1) + 9/14 × Gini(D2) = 0.457
If the feature "outlook" is chosen as the splitting feature, then the value "overcast", which gives the smallest Gini index (0.357), should be used as the split point.
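A small Python check of the three split values above (illustrative; works directly from the class counts):

```python
def gini(pos, neg):
    """Gini impurity of a two-class set given its class counts."""
    n = pos + neg
    p = pos / n
    return 2 * p * (1 - p)

def gini_split(d1, d2):
    """Weighted Gini index of a binary split; d1, d2 are (pos, neg) counts."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    return n1 / n * gini(*d1) + n2 / n * gini(*d2)

print(gini_split((2, 3), (7, 2)))   # outlook = sunny    -> 0.394
print(gini_split((4, 0), (5, 5)))   # outlook = overcast -> 0.357
print(gini_split((3, 2), (6, 3)))   # outlook = rainy    -> 0.457
```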

54 CART Binary Regression Tree
The CART regression tree predicts continuous data. Assume X and Y are the input and output variables, respectively, and Y is continuous. In the input space containing the training data, recursively divide each region into two sub-regions, determine the output value on each sub-region, and thereby construct a binary decision tree.

55 CART Binary Regression Tree
Select the optimal splitting variable j and splitting point s: divide the data into two sub-regions according to the split point s, and select the (j, s) pair for which the minimum value is attained. Here R denotes a divided input region and c is the fixed output value associated with region R (the mean of Y over that sub-region). Use the selected (j, s) pair to divide the region and determine the corresponding output values. In other words, for any splitting feature and any split point s that partitions the data into D1 and D2, CART looks for the feature and split point that minimize the sum of the squared errors of D1 and D2, where c1 is the mean output of the samples in D1 and c2 is the mean output of the samples in D2, as written out below.
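The criterion described above, written out (the slide showed it only as an image):

$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]$$

with $R_1(j,s) = \{x \mid x^{(j)} \le s\}$, $R_2(j,s) = \{x \mid x^{(j)} > s\}$, and the inner minima attained at $\hat{c}_m = \mathrm{mean}\{y_i \mid x_i \in R_m(j,s)\}$.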

56 CART Binary Regression Tree
Continue to apply the above steps to the two sub-regions, dividing the input space into M regions R1, R2, ..., RM, and generate the decision tree. The binary regression tree uses the mean squared error (MSE) as the metric for optimal feature selection, and the optimal output value on each region is solved for under the MSE criterion. For prediction after the tree is built: the CART classification tree uses the most probable (majority) class of the leaf a sample falls into, whereas the regression tree does not output a class but uses the mean (or median) of the targets in the final leaf as the predicted value.

57 CART Example Consider the cut points 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5 and 9.5 for the given data points. R1, R2, c1, c2 and m(s) are computed in turn for each candidate split point. For example, for the split point s = 1.5 we obtain R1 = {1} and R2 = {2, 3, 4, 5, 6, 7, 8, 9, 10}, and c1, c2 and m(s) follow accordingly.

58 CART Example By evaluating each (j, s) pair in turn, you can obtain m(s) for every candidate split s. The minimum is reached at s = 6.5, with R1 = {1, 2, 3, 4, 5, 6}, R2 = {7, 8, 9, 10}, c1 = 6.24 and c2 = 8.9. The regression tree T1(x) therefore predicts 6.24 for x < 6.5 and 8.9 for x ≥ 6.5.
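The ten data points themselves were only shown as an image; the classic ten-point series used below is an assumption, chosen because it reproduces the slide's c1 = 6.24 and c2 = 8.9 at s = 6.5. An illustrative Python sketch of the split search:

```python
# Assumed data: x = 1..10 and a classic ten-point target series (an assumption,
# consistent with the slide's c1 = 6.24 and c2 = 8.9 at s = 6.5).
x = list(range(1, 11))
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

def squared_error(values):
    """Sum of squared deviations from the mean (the inner minimisation over c)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

best = None
for s in [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]:
    left  = [yi for xi, yi in zip(x, y) if xi < s]
    right = [yi for xi, yi in zip(x, y) if xi >= s]
    m = squared_error(left) + squared_error(right)
    if best is None or m < best[1]:
        best = (s, m)

s, m = best
left  = [yi for xi, yi in zip(x, y) if xi < s]
right = [yi for xi, yi in zip(x, y) if xi >= s]
print(s, round(m, 2), round(sum(left) / len(left), 2), round(sum(right) / len(right), 2))
# -> 6.5  1.93  6.24  8.91   (c1 = 6.24, c2 ≈ 8.9)
```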

59 Outline Introduction Constructing a Decision Tree ID3 C4.5
Regression Trees CART Gradient Boosting

60 Gradient Boosting Fit an additive model (ensemble) in a forward stage-wise manner. In each stage, introduce a weak learner to compensate for the shortcomings of the existing weak learners. In gradient boosting, "shortcomings" are identified by gradients; recall that in AdaBoost, "shortcomings" are identified by high-weight data points. Both high-weight data points and gradients tell us how to improve our model.

61 what is gradient boosting?
To begin with, gradient boosting is an ensemble technique, which means that prediction is done by an ensemble of simpler estimators. While this theoretical framework makes it possible to create an ensemble of various estimators, in practice we almost always use GBDT — gradient boosting over decision trees.  The aim of gradient boosting is to create (or "train") an ensemble of trees, given that we know how to train a single decision tree. This technique is called boosting because we expect an ensemble to work much better than a single estimator.

62 How an Ensemble Is Built?
Gradient boosting builds an ensemble of trees one-by-one, then the predictions of the individual trees are summed: The next decision tree tries to cover the discrepancy between the target function f(x) and the current ensemble prediction by reconstructing the residual. For example, if an ensemble has 3 trees the prediction of that ensemble is: The next tree (tree 4) in the ensemble should complement well the existing trees and minimize the training error of the ensemble.  In the ideal case we'd be happy to have:
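The two formulas that the colons above point to were shown as images; reconstructed, the prediction of a three-tree ensemble is

$$D(x) = d_{\mathrm{tree}\,1}(x) + d_{\mathrm{tree}\,2}(x) + d_{\mathrm{tree}\,3}(x)$$

and in the ideal case the next tree would satisfy

$$D(x) + d_{\mathrm{tree}\,4}(x) = f(x)$$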

63 How an Ensemble Is Built?
To get a bit closer to the target, we train a tree to reconstruct the difference between the target function and the current prediction of the ensemble, which is called the residual: R(x) = f(x) - D(x). Did you notice? If the new decision tree completely reconstructs R(x), the whole ensemble gives predictions without errors (after adding the newly trained tree to the ensemble)! That said, in practice this never happens, so we instead continue the iterative process of ensemble building.

64 GBDT DT-Decision Tree GB-Gradient Boosting
GBDT-Gradient Boosting Decision Tree. GBDT means a decision-tree model trained with gradient boosting. The resulting model is an ensemble of CART trees T_1 ... T_K, where T_j learns the residuals of the predictions of the previous j-1 trees.
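A minimal sketch of this residual-fitting loop (squared loss, learning rate 1.0 for clarity), assuming scikit-learn's DecisionTreeRegressor as the weak learner; illustrative only, not the slides' implementation. The toy data are those of Example 2 below, with sample 4's age taken as 30 (an assumption consistent with the split points listed later).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=3, max_depth=2):
    """Fit an additive ensemble: each tree is trained on the current residuals."""
    f0 = float(np.mean(y))            # initial prediction: the mean of the targets
    trees, pred = [], np.full(len(y), f0)
    for _ in range(n_trees):
        residual = y - pred           # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred += tree.predict(X)       # add the new tree's correction
    return f0, trees

def gbdt_predict(model, X):
    f0, trees = model
    return f0 + sum(tree.predict(X) for tree in trees)

# Toy usage on the age/weight/height data of Example 2 (slide 68):
X = np.array([[5, 20], [7, 30], [21, 70], [30, 60]])
y = np.array([1.1, 1.3, 1.7, 1.8])
model = gbdt_fit(X, y, n_trees=1, max_depth=1)
print(gbdt_predict(model, np.array([[25, 65]])))   # ~1.75, matching the worked example
```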

66 GBDT Example1 A, B, C and D's ages are 15, 19, 23 and 27 (a high-school student, a college student, an athlete and a coder, respectively). Decision tree learning process:

67 GBDT Example1 A, B, C and D's ages are 15, 19, 23 and 27. GBDT learning process: the first tree splits on income and predicts the group means, 17 for the low-income pair (A, B) and 25 for the high-income pair (C, D); the second tree corrects these by -2 or +2 depending on daily computer use.
A: 15-year-old high-school student, low income, no time to play computer every day; predicted age A = 17 - 2 = 15
B: 19-year-old college student, low income, plays computer every day; predicted age B = 17 + 2 = 19
C: 23-year-old athlete, high income, no time to play computer every day; predicted age C = 25 - 2 = 23
D: 27-year-old coder, high income, plays computer every day; predicted age D = 25 + 2 = 27

68 GBDT Example2 As shown in the following table, the data set has the features age and weight and the label height. There are 5 rows in total; the first four are training samples and the last one is the sample to be predicted.
ID              AGE  WEIGHT  HEIGHT (label)
1               5    20      1.1
2               7    30      1.3
3               21   70      1.7
4               30   60      1.8
5 (to predict)  25   65      ?

69 GBDT Example2 1. Initialize the weak learner as the constant that minimizes the squared loss; setting the derivative of Σ(γ - y_i)² to zero gives γ = mean of the labels.
Since at this stage there is only the root node and samples 1, 2, 3 and 4 are all in it, we need the constant r that minimizes the squared loss. The squared loss is convex, so take the derivative, set it to zero and solve for r: r is the mean of all training-sample labels, r = (1.1 + 1.3 + 1.7 + 1.8) / 4 = 1.475, which gives the initial learner f0(x) = 1.475.
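Written out as the slide's missing formula:

$$f_0(x) = \arg\min_{\gamma}\sum_{i=1}^{4}(y_i - \gamma)^2 = \bar{y} = \frac{1.1 + 1.3 + 1.7 + 1.8}{4} = 1.475$$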

70 GBDT Example2 2. Iteration m = 1: compute the negative gradient, i.e. the residual y_i - f0(x_i). For this example the residuals are listed in the table:
ID  True Label  f0(x)   Residual
1   1.1         1.475   -0.375
2   1.3         1.475   -0.175
3   1.7         1.475   0.225
4   1.8         1.475   0.325

71 GBDT Example2 3. Train f1(x) with the residuals as the sample targets, i.e. the data in the table below:
ID  AGE  WEIGHT  Residual (label)
1   5    20      -0.375
2   7    30      -0.175
3   21   70      0.225
4   30   60      0.325
Then search for the optimal split node of the regression tree by traversing every possible value of every feature, from the age value 5 up to the weight value 70, computing the variance of the two resulting groups for each candidate; the split node that minimizes the variance is the optimal one.

72 GBDT Example2 For example, when splitting at age 7, samples with age less than 7 form one group and samples with age greater than or equal to 7 form the other: sample 1 is one group and samples 2, 3, 4 the other. The variances of the two groups are 0 and 0.047 respectively, so the sum of the squared deviations for this split is 0.047. All candidate splits are shown in the table below:
No.  Dividing point  < Dividing point  ≥ Dividing point  Square deviation
1    5 (age)         /                 1, 2, 3, 4        0.082
2    7 (age)         1                 2, 3, 4           0.047
3    21 (age)        1, 2              3, 4              0.0125
4    30 (age)        1, 2, 3           4                 0.062
5    20 (weight)     /                 1, 2, 3, 4        0.082
6    30 (weight)     1                 2, 3, 4           0.047
7    60 (weight)     1, 2              3, 4              0.0125
8    70 (weight)     1, 2, 4           3                 0.0867
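A small Python check of the numbers in the table above (illustrative; each group's variance is its mean squared deviation, and the two variances are summed):

```python
residuals = {1: -0.375, 2: -0.175, 3: 0.225, 4: 0.325}
age    = {1: 5, 2: 7, 3: 21, 4: 30}
weight = {1: 20, 2: 30, 3: 70, 4: 60}

def var(values):
    """Population variance of a group; 0 for an empty group."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

for feature, name in [(age, "age"), (weight, "weight")]:
    for cut in sorted(set(feature.values())):
        left  = [residuals[i] for i in residuals if feature[i] < cut]
        right = [residuals[i] for i in residuals if feature[i] >= cut]
        print(f"{cut:>2} ({name}): total variance = {var(left) + var(right):.4f}")
# The minimum (0.0125) is reached at age 21 and at weight 60.
```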

73 GBDT Example2 4. The minimum total squared deviation above is 0.0125, and it is reached at two candidate split points: age 21 and weight 60; we pick one of them at random, here age 21. As with the initialization, minimizing the squared loss (take the derivative and set it to zero) shows that each leaf node's parameter γ is simply the mean of the label values (residuals) in that leaf. The node is split accordingly: samples 1 and 2 form the left leaf (γ = -0.275) and samples 3 and 4 form the right leaf (γ = 0.275), and the strong learner can now be updated to f1(x) = f0(x) plus the prediction of this tree.

74 GBDT Example2 For simplicity we iterate only once, so this GBDT model consists of a single tree.
5. Predict sample 5: at the root node (the initial learner) sample 5 is predicted as 1.475. Sample 5's age is 25, which is greater than the split point of 21, so it falls into the right leaf and receives the correction 0.275. The final predicted value for sample 5 is therefore 1.475 + 0.275 = 1.75.

