Supervised Machine Learning Algorithms


1 Supervised Machine Learning Algorithms

2 Taxonomy of Machine Learning Methods
The main idea of machine learning (ML): use computers to learn from massive amounts of data.
For tedious or unstructured data, machines can often make better and less biased decisions than a human learner.
ML forms the core of artificial intelligence (AI), especially in the era of big data.
One needs to write a computer program based on a model algorithm.
By learning from given data objects, the program can reveal the categorical class or experience affiliation of future data to be tested.
This essentially defines ML as an operational term.

3 Taxonomy of Machine Learning Methods (cont.)
To implement an ML task, one needs to explore or construct computer algorithms that learn from data and make predictions based on specific features, similarities, or correlations.
ML algorithms operate by building a decision-making model from sample data inputs.
The model defines the relationship between features and labels: a feature is an input variable for the algorithm, and a label is an output variable.
The outputs are data-driven predictions or decisions.
One can handle the ML process subjectively by finding the best fit to solve the decision problem based on the characteristics of the data sets.

4 Classification by Learning Paradigms
ML algorithms can be built with different styles in order to model a problem.
The style is dictated by the interaction with the data environment, expressed as the input to the model.
The data interaction style decides the learning models that an ML algorithm can produce.
The user must understand the roles of the input data and the model's construction process.
The goal is to select the ML model that can solve the problem with the best prediction result.
ML sometimes overlaps with the goals of data mining.

5 Classification by Learning Paradigms (cont.)
Three classes of ML algorithms based on different learning styles: supervised, unsupervised, and semi-supervised.
All three methods are viable in real-life applications.
The style hinges on how training data is used in the learning process.

6 Classification by Learning Paradigms (cont.)
Supervised learning: the input data is called training data and has a known label or result.
A model is constructed through training on the training dataset and improved by feedback on its predictions.
The learning process continues until the model achieves a desired level of accuracy on the training data.
Future incoming data without known labels is then tested on the model with an acceptable level of accuracy.
Unsupervised learning: none of the input data is labeled with a known result.

7 Classification by Learning Paradigms (cont.)
A model is generated by exploring the hidden structures present in the input data: to extract general rules, go through a mathematical process to reduce redundancy, or organize data by similarity testing.
Semi-supervised learning: the input data is a mixture of labeled and unlabeled examples.
The model must learn the structures that organize the data in order to make predictions possible, under different assumptions on how to model the unlabeled data.

8 Supervised Machine Learning Algorithms
In a supervised ML system, the computer learns from a training data set of {input, output} pairs.
The input comes from sample data given in a certain format, e.g., the credit reports of borrowers.
The output may be discrete, e.g., yes or no to a loan application.
The output can also be continuous, e.g., the probability distribution that the loan can be paid off in a timely manner.
The goal is to work out a reliable ML model that can map or produce the correct outputs from new inputs that were unseen before.

9 Supervised Machine Learning Algorithms (cont.)
Four families of supervised ML algorithms: regression, decision trees, Bayesian networks, and support vector machines.
The ML system acts like a finely tuned predictor function g(x).
The learning system is built with a sophisticated algorithm to optimize this function.
E.g., given input data x in the credit report of a borrower, the bank will make a loan decision based on the predicted outcome.
The learning process is iteratively refined using an error criterion to make better predictions: it minimizes the error between the predicted value and the actual experience in the input data.

10 Supervised Machine Learning Algorithms (cont.)
The iterative trial-and-error process suggested for machine learning algorithms to train a model.

11 Regression Analysis
The outputs of regression are continuous rather than discrete.
Regression finds the causal relationship between the input and output variables.
It applies mathematical statistics to establish the dependent and independent variables in learning.
The independent variables are the inputs of the regression process, aka the predictors.
The dependent variable is the output of the process.
Regression essentially performs a sequence of parametric or nonparametric estimations.
One must be careful in making predictions: apparent causality may be an illusion, and false relationships can mislead the users.

12 Regression Analysis (cont.)
The estimation function can be determined by experience, using a priori knowledge or visual observation of the data.
The regression method can be applied to classify data by predicting the category tag of data.
Regression analysis determines the quantitative relation in a learning process: how the value of the dependent variable changes when any independent variable varies while the other independent variables are left unchanged.
Regression analysis estimates the average value of the dependent variable when the independent variables are fixed.

13 Regression Analysis (cont.)
The estimated value is a function of the independent variables known as the regression function, which can be described by a probability distribution.
Most regression methods are naturally parametric: one needs to calculate the undetermined coefficients of the function by using some error criteria, with a finite dimension in the analysis space.
Nonparametric regression may be infinite-dimensional.
Accuracy or performance depends on the quality of the dataset used, related to the data generation process and the underlying assumptions made.

14 Regression Analysis (cont.)
Regression offers estimation of continuous response variables, as opposed to the discrete decision values used in classification that demand higher accuracy.
In the formulation of a regression process, the unknown parameters are often denoted as β, which may appear as a scalar or a vector.
The independent variables are denoted by a vector X and the dependent variable as Y.
When multiple dimensions are involved, these parameters are vectors in form.
A regression model establishes the approximated relation between X, β, and Y as Y ≈ f(X, β).

15 Regression Analysis (cont.)
The function f(X, β) is approximated by the expected value E(Y|X).
The regression function f is based on the knowledge of the relationship between a continuous variable Y and vector X.
If no such knowledge is available, an approximated handy form is chosen for f.
Example: Measuring the Height after Tossing a Small Ball in the Air.
Measure the height of ascent h at various time instants t.
The relationship is modeled as h = β1t + β2t² + ε, where β1 determines the initial velocity of the ball.

16 Regression Analysis (cont.)
β2 is proportional to standard gravity, and ε is due to measurement errors.
Linear regression is used to estimate the values of β1 and β2 from the measured data.
This model is nonlinear with respect to the time variable t, but it is linear with respect to the parameters β1 and β2.
Consider k components in the vector of unknown parameters β.
There are three models to relate the inputs to the outputs, depending on the relative magnitude between the number N of observed data points of the form (X, Y) and the dimension k of the sample space.
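Since the ball-toss model is linear in its parameters even though it is quadratic in t, it can be fitted by ordinary least squares. Below is a minimal sketch in Python with NumPy; the time and height measurements are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical measurements: height h of the ball at time instants t.
t = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
h = np.array([1.9, 3.4, 4.4, 5.2, 5.4, 5.3])

# Design matrix for h = beta1*t + beta2*t^2 (linear in beta1 and beta2).
A = np.column_stack([t, t ** 2])
beta, *_ = np.linalg.lstsq(A, h, rcond=None)
print("beta1 (initial velocity):", beta[0])
print("beta2 (proportional to standard gravity):", beta[1])
```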

17 Regression Analysis (cont.)
When N < k, most classical regression analysis methods can be applied Most classical regression analysis methods can be applied The defining equation is underdetermined No enough data to recover the unknown parameters β When N = k and the function f is linear The equation Y = f (X, β) can be solved exactly without approximation There are N equations to solve N components in β The solution is unique as long as the X components are linearly independent If f is nonlinear, many solutions may exist or no solution at all 1.1 CLOUD COMPUTING IN A NUTSHELL

18 Regression Analysis (cont.)
In general, the situation has N > k data points.
There is enough information in the data to estimate a unique value for β under this overdetermined situation.
The measurement errors εi follow a normal distribution.
There is an excess of information contained in the (N - k) extra measurements, known as the degrees of freedom of the regression.
Example: Regression with a Necessary Set of Independent Measurements.
One needs a sufficient number of independent data points to perform the regression analysis of continuous data measurements.

19 Regression Analysis (cont.)
Consider a regression model with four unknown parameters, β0, β1, β2 and β3.
An experimenter performs 10 measurements, all at exactly the same value of the independent variable vector X = (X1, X2, X3, X4).
Regression analysis fails to give a unique set of estimated values for the four unknown parameters: there is not enough information to perform the prediction.
One can only estimate the average value and the standard deviation of the dependent variable Y.
Measuring at two different values of X only gives enough data for a regression with two unknowns, but not for three or more unknowns.
Only if measurements are performed at four different values of the independent variable vector X

20 Regression Analysis (cont.)
will regression analysis provide a unique set of estimates for the four unknown parameters in β.
Basic assumptions on regression analysis under various error conditions:
The sample is representative of the data space involved.
The error is a random variable with a mean of zero conditioned on the input variables.
The independent variables are measured with no error.
The predictors are linearly independent.
The errors are uncorrelated.
The variance of the error is constant across observations.

21 Linear Regression
Regression analysis includes linear regression and nonlinear regression.
Unitary linear regression analysis: only one independent variable and one dependent variable are included in the analysis, and the approximate representation of the relation between the two is a straight line.
Multivariate linear regression analysis: two or more independent variables are included, with a linear relation between the dependent variable and the independent variables.
The model of a linear regression is y = f(X).

22 Linear Regression (cont.)
X = (x1, x2, ⋯, xn) with n ≥ 1 is a multidimensional vector and y is a scalar variable.
f(X) is a linear predictor function used to estimate the unknown parameters from data.
Linear regression is applied mainly in two areas.
First, as an approximation process for prediction, forecasting, or error reduction: a predictive linear regression model is fitted to an observed data set of y and X values, and the fitted model makes a prediction of the value of y for a future unknown input vector X.
Second, to quantify the strength of the relationship between the output y and each input component Xj.

23 Linear Regression (cont.)
One can assess which Xj is irrelevant to y and which subsets of the Xj contain redundant information about y.
Major steps in linear regression.

24 Unitary Linear Regression
Example: crickets chirp more frequently on hotter days than on cooler days.

25 Unitary Linear Regression (cont.)
Consider a set of data points in a 2D sample space: (x1, y1), (x2, y2), ..., (xn, yn), mapped into a scatter diagram.
Suppose they can be covered approximately by a straight line: y = ax + b + ε.
Here x is an input variable, y is an output variable in the real number range, a and b are coefficients, and ε is a random error that follows a normal distribution with mean E(ε) = 0 and variance Var(ε) = σ².
One needs to work out the expectation by using a linear regression expression: y = ax + b.
The main task is to estimate the coefficients a and b via observations on n groups of input samples.

26 Unitary Linear Regression (cont.)
Fit linear regression models with a least squares approach.
The approximation is shown by a linear line amid the middle or center of all data points in the data space.
The residual error (loss) of the unitary model is eᵢ = yᵢ - (axᵢ + b) for each sample point.

27 Unitary Linear Regression (cont.)
The convex objective function is given by Q(a, b) = Σᵢ (yᵢ - axᵢ - b)².
To minimize this sum of squares, calculate the partial derivatives of Q with respect to a and b, and set them to zero.
Solving gives a = Σᵢ(xᵢ - x̄)(yᵢ - ȳ) / Σᵢ(xᵢ - x̄)² and b = ȳ - a·x̄, where x̄ and ȳ are the mean values of the input variable and the dependent variable, respectively.
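The closed-form solution above translates directly into code. A minimal sketch with NumPy, using hypothetical sample points for illustration:

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least squares for y = a*x + b."""
    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar
    return a, b

# Hypothetical sample points (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = fit_line(x, y)
print(f"y = {a:.2f}x + {b:.2f}")
```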

28 Unitary Linear Regression (cont.)
After working out the specific expression for the model, one needs to know its fitting degree to the dataset, i.e., whether the expression expresses the relation between the two variables well enough to be used in actual predictions.
To do so, figure out the estimated value of the dependent variable, ŷᵢ = axᵢ + b, for each sample in the training data set.

29 Unitary Linear Regression (cont.)
The closer the coefficient of determination R² is to 1, the better the fitting degree is; the further R² is from 1, the worse the fitting degree is.
Linear regression can also be used for classification, but only in a binary classification problem, to decide between the two classes.
For multivariate linear regression, the method can likewise be applied to classify a dataset.
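A common definition of the coefficient of determination is R² = 1 - SS_res / SS_tot. A minimal sketch, again on hypothetical data:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical data points (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = np.polyfit(x, y, 1)  # least squares fit of y = a*x + b
print("R^2 =", r_squared(y, a * x + b))
```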

30 Unitary Linear Regression (cont.)
Example: Healthcare Data Analysis.
Obesity is reflected by the body weight index; obese people are more likely to have high blood pressure or diabetes.
The task is to predict the relationship between obesity and high blood pressure.
The dataset contains the body weight index and blood pressure of some people at a hospital in Wuhan.

31 Unitary Linear Regression (cont.)
Make a preliminary judgment of the blood pressure of a person with a body weight index of 24.
This is a prediction model with two variables, so unitary linear regression may be considered.
First determine the distribution of the data points with a scatter diagram of body weight index versus blood pressure.

32 Unitary Linear Regression (cont.)
All data points are almost on or below a straight line, i.e., they are linearly distributed.
The data space is modeled by a unitary linear regression process.
By the least squares method, we get a = 1.32 and b = 96.58; therefore y = 1.32x + 96.58.
A significance test is needed to verify whether the model fits well with the current data before a prediction is made.
The mean residual and coefficient of determination of the model are: average error 1.17 and R² = 0.90.

33 Unitary Linear Regression (cont.)
The mean residual is much less than the mean value of blood pressure, and the coefficient of determination is close to 1.
This regression equation is therefore significant and fits the dataset well.
Predictions may be conducted for unknown data on this basis: given the body weight index, the blood pressure of a person may be determined with the model.
Substituting 24 for x gives the blood pressure of that person as y = 1.32 × 24 + 96.58 ≈ 128.

34 Multiple Linear Regression
In solving actual problems, one often encounters many variables.
E.g., the scores of a student may be influenced by factors like earnestness in class, preparation before class, and review after class.
E.g., the health of a person is influenced not only by the environment but also by dietary habits.
The model of unitary linear regression is not adapted to these conditions; it is improved with a model of multivariate linear regression analysis.
Consider the case of m input variables.
The output is expressed as a linear combination of the input variables: y = β0 + β1x1 + ⋯ + βmxm + ε.

35 Multiple Linear Regression (cont.)
β0, β1, ⋯, βm and σ² are unknown parameters; ε complies with a normal distribution whose mean value is 0 and variance is σ².
Working out the expectation gives the multivariate linear regression equation E(y) = β0 + β1x1 + ⋯ + βmxm, with y substituted for E(y).
Its matrix form is E(y) = Xβ, where X = [1, x1, ⋯, xm] and β = [β0, β1, ⋯, βm]ᵀ.
Our goal is to compute the coefficients by minimizing the objective function.

36 Multiple Linear Regression (cont.)
The objective function Q(β) = Σᵢ (yᵢ - β0 - β1xᵢ1 - ⋯ - βmxᵢm)² is defined over the n sample data points.
To minimize Q, set the partial derivative of Q with respect to each βi to zero.
Solving the resulting equations yields the multiple linear regression equation; in matrix form, β̂ = (XᵀX)⁻¹Xᵀy.
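The normal-equation solution sketched above can be computed directly. The data values below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical training data: n = 4 samples, m = 2 input variables.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([5.1, 4.2, 7.9, 11.0])

# Prepend a column of ones so beta[0] plays the role of the intercept.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Normal equation: beta = (X^T X)^{-1} X^T y, solved without explicit inversion.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated coefficients:", beta)
```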

37 Multiple Linear Regression (cont.)
Multivariate regression is an expansion and extension of unitary regression; the two are identical in nature, but their ranges of application differ.
Unitary regression has limited applications, while multivariate regression is applicable to many real-life problems.
Example: Estimate the Density of the Pollutant Nitric Oxide in a Spotted Location.
The task is to estimate the density of nitric oxide (NO) gas, an air pollutant, in an urban location.
Vehicles discharge NO gas during their movement.

38 Multiple Linear Regression (cont.)
This creates a pollution problem proven harmful to human health.
The NO density is attributed to four input variables: vehicle traffic, temperature, air humidity, and wind velocity.
16 data points were collected in various observed spotted locations in the city.
Apply the multiple linear regression method to estimate the NO density.
The testing spotted location is measured with a data vector of {1436, 28.0, 68, 2.00} for the four features {x1, x2, x3, x4}, respectively.
Let X = [1, xn1, xn2, xn3, xn4]ᵀ and the weight vector W = [b, β1, β2, β3, β4]ᵀ for n = 1, 2, …, 16.

39 Multiple Linear Regression (cont.)

40 Multiple Linear Regression (cont.)
E.g., for the first row of training data, [1300, 20, 80, 0.45, 0.066], X1 = [1, 1300, 20, 80, 0.45]ᵀ, which gives the output value y1 = 0.066.
One needs to compute W = [b, β1, β2, β3, β4]ᵀ so as to minimize the mean square error.
The 16 × 5 matrix X is directly obtained from the sample data table.
y = [0.066, 0.005, …, 0.039]ᵀ is the given column vector of data labels.

41 Multiple Linear Regression (cont.)
To make the prediction on the testing sample vector x = [1, 1436, 28.0, 68, 2.00]ᵀ, substitute the weight vector obtained.
The final answer is {β1 = 0.029, β2 = 0.015, β3 = 0.002, β4 = −0.029, b = 0.070}.
The NO gas density is predicted as y = 0.065, or 6.5%.

42 Logistic Regression Method
Many problems require a probability estimate as output.
Logistic regression is an extremely efficient mechanism for calculating probabilities, commonly used in fields like data mining, automatic diagnosis of diseases, and economic predictions.
The logistic model may be used to solve problems of binary classification.
In solving a classification problem, the inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one or more of these classes.
Classification is typically tackled in a supervised way.

43 Logistic Regression Method (cont.)
Spam filtering is a good example of classification.
The inputs are e-mails, blogs, or document files; the output classes are spam and non-spam.
For logistic regression classification, the principle is to classify the sample data with a logistic function g(z) = 1 / (1 + e⁻ᶻ), which maps the logistic regression output to probabilities and is known as a sigmoid function.
The input domain of the sigmoid function is (-∞, +∞) and the range is (0, 1).
One can regard the sigmoid function as a probability density function for sample data.

44 Logistic Regression Method (cont.)
The function curve is sensitive near z = 0 and insensitive when z ≫ 0 or z ≪ 0.

45 Logistic Regression Method (cont.)
The basic idea of logistic regression: sample data may be concentrated at both ends of the sigmoid curve by use of an intermediate feature z of the sample, so the data can be divided into two classes.
Consider a vector X = (x1, ⋯, xm) with m independent input variables.
Each dimension of X stands for one attribute (feature) of the sample data (training data).
Multiple features of the sample data are combined into one feature z.
Then figure out the probability of the z feature with the designated data

46 Logistic Regression Method (cont.)
and apply the sigmoid function to act on that feature to obtain the expression for the logistic regression.
For the combining of multiple features into one feature,

47 Logistic Regression Method (cont.)
make use of the linear function z = β0 + β1x1 + ⋯ + βmxm.
The coefficients of the linear function, i.e., the feature weights of the sample data, need to be determined.
Maximum Likelihood Estimation (MLE) is adopted to transform this into an optimization problem: it attempts to find the parameter values that maximize the likelihood function, given the observations.
The coefficients are determined through the optimization method.
The loss function is Log Loss: Log Loss = Σ_{(x,y)∈D} [-y·log(y′) - (1-y)·log(1-y′)].
D is the data set containing many labeled examples, i.e., (x, y) pairs.

48 Logistic Regression Method (cont.)
y is the label in a labeled example, and its value must be either 0 or 1.
y′ is the predicted value, somewhere between 0 and 1, given the set of features in x.
Minimizing this negative logarithm of the likelihood function yields a maximum likelihood estimate.
Logistic regression returns a probability; to map a regression value to a binary category, one must define a classification or decision threshold.
Thresholds are problem-dependent: it is tempting to assume the threshold should always be 0.5, but its value must be tuned.
Part of choosing a threshold is assessing how much one will suffer for making a mistake.
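The sigmoid, the Log Loss, and the decision threshold fit together as in the minimal sketch below; the labels and regression values are hypothetical, for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map a regression value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_prob):
    """Negative log-likelihood averaged over the labeled examples."""
    eps = 1e-12  # guard against log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return np.mean(-y * np.log(y_prob) - (1 - y) * np.log(1 - y_prob))

# Hypothetical labels and regression values (illustrative only).
y = np.array([1, 0, 1, 1, 0])
y_prob = sigmoid(np.array([2.0, -1.5, 0.3, 1.1, -0.2]))
print("log loss:", log_loss(y, y_prob))

# A problem-dependent decision threshold maps probabilities to classes.
threshold = 0.5
print("predicted classes:", (y_prob >= threshold).astype(int))
```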

49 Logistic Regression Method (cont.)
General steps for logistic regression.
Accuracy is one metric for evaluating classification models: the fraction of predictions the model gets right.

50 Logistic Regression Method (cont.)
Four possible statuses for binary classification:
TP (True Positive) refers to an outcome where the model correctly predicts the positive class.
TN (True Negative) means an outcome where the model correctly predicts the negative class.
FP (False Positive) is an outcome where the model incorrectly predicts the positive class.
FN (False Negative) is an outcome where the model incorrectly predicts the negative class.

51 Logistic Regression Method (cont.)
Accuracy alone does not tell the full story for a class-imbalanced data set, where there is a significant disparity between the number of positive and negative labels.
Precision identifies how frequently a model is correct when predicting the positive class: it measures the percentage of cases flagged as positive that are correctly classified, Precision = TP / (TP + FP).

52 Logistic Regression Method (cont.)
Recall measures the percentage of actual positives that are identified correctly: Recall = TP / (TP + FN).
To evaluate the effectiveness of a model, one must examine both precision and recall.
But precision and recall are often in tension: improving precision typically reduces recall, and vice versa.
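The three metrics fall out directly from the four counts; a small sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from the four outcome counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical counts (illustrative only).
print(classification_metrics(tp=8, tn=80, fp=2, fn=10))
```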

53 Supervised Classification Methods
The classification algorithm is often used in supervised machine learning.
The input data is the training data, and every training data item is given a specific label.
E.g., the label of spam or legitimate mail may be given to each sample mail in the training set of a spam filtering system.
Typically, the generation of a classifier model goes through three steps.

54 Supervised Classification Methods (cont.)
For a two-class problem, divide the sample dataset into two subsets: positive versus negative.
The classifier model is built by training, and the model accuracy is determined by using a likelihood probability.
Four families of supervised classification methods: the decision tree, rule-based classifier, nearest neighbor classifier, and support vector machine.

55 Decision Trees for Machine Learning
Decision trees offer a predictive model to solve classification and regression problems.
They map observations about an item to conclusions about the item's target value along various feature nodes in a tree-structured decision process.
Various decision paths fork in the tree structure until a prediction decision is made hierarchically at a leaf node.
The goal is to create a model that predicts the value of an output target variable at the leaf nodes of the tree, based on several input variables or attributes at the root and interior nodes of the tree.

56 Decision Trees for Machine Learning (cont.)
Decision trees for classification use are known as classification trees.
In a classification tree, leaves represent class labels.
The target variable (output) can take two values, such as yes or no, or multiple discrete values, like outcomes 1, 2, 3, or 4 of an event.
Each leaf of the tree is labeled with a class or a probability distribution over the classes.
Branches represent conjunctions of attributes that lead to the class labels.
The arcs from a node are labeled with each of the possible values of the attribute.

57 Decision Trees for Machine Learning (cont.)
Decision trees whose target variable takes continuous values, like real numbers, are called regression trees.
Decision trees follow a multi-level tree structure to make decisions at the leaf nodes of a tree.
E.g., one needs to decide whether or not to go out to play tennis under various weather conditions.
The weather conditions are indicated by three attributes: outlook, humidity, and windy.
The outlook is checked at the root, with three possible outgoing arcs marked as sunny, overcast, or rain.

58 Decision Trees for Machine Learning (cont.)
The humidity fans out to two arcs labeled as more than 70 or not; the windy values are simply true or false.
To traverse this tree, one starts from the root to a leaf along a path of one or two levels.
Inside each tree node, the target counts are given to determine the probability when a leaf node is reached.
E.g., if the outlook value is overcast, one reaches a leaf node with a probability of 4/5 to play tennis.
If the outlook is sunny and the humidity is below 70, one reaches the leaf node at the extreme left with a probability of 5/8 to play tennis.
One can also reach other leaf nodes with different probabilities.

59 Decision Trees for Machine Learning (cont.)

60 Decision Trees for Machine Learning (cont.)
For a simple prediction decision, the target value can be class labels like yes or no.
Example: Bank Loan Approval using a Decision Tree with Training Data.
Use the decision tree to decide whether a bank will approve a loan application from a customer.
The dataset is classified with three attributes: age, annual income, and marital status.
The internal node at each level tests on one attribute.
Each leaf node represents one class decision: yes means approval and no means the opposite.

61 Decision Trees for Machine Learning (cont.)
The decision tree is already built by the bank.
Applicant age is considered first at the root node.
Age partitions the training samples into two categories: those less than 40 versus otherwise.

62 Decision Trees for Machine Learning (cont.)
The annual income and marital status attributes are tested at the second level to make the decision on approving the loan or not.
Use the decision tree to test the acceptability of an applicant younger than 40 and with an annual income lower than $80,000.
By traversing the tree, one obtains the decision to deny the loan application.
The bank acts in favor of young applicants with higher income and of older applicants who are not single or divorced.

63 Decision Tree Learning
The effectiveness of the decision tree depends on the root chosen, i.e., the first attribute for splitting out multiple choices.
The successive attribute order applied may lead to entirely different tree topologies.
The goal is to cover all correct paths for all labeled sample data provided.
The tree must be able to predict accurately for all future testing data.
All features have finite discrete domains.
A tree can be learned by splitting the data set into subsets based on an attribute value test.

64 Decision Tree Learning (cont.)
This process is repeated on each derived subset in a recursive manner.
The process is completed when the subset at a node belongs to the same class.
This greedy algorithm of top-down induction of decision trees (TDIDT) is a common strategy for learning decision trees from input data.
Three greedy top-down methods, ID3, C4.5, and CART, are used for constructing the decision tree.
C4.5 is the improved successor of ID3.
The CART method combines Classification And Regression in the Tree construction.
The regression tree is used if the predicted outcome is continuous, like a real number.

65 ID3 Algorithm Tagging
The ID3 algorithm takes the information gain of an attribute as the measure and splits on the attribute with the largest information gain after splitting.
The aim is to make the output partition on each branch belong to the same class as far as possible, i.e., to select the most information-rich attribute as the root to grow successive nodes of the decision tree.
The measure behind information gain is entropy, which depicts the purity of any example set.
Given a training set S of positive and negative examples, the entropy function of S is Entropy(S) = -p₊log₂p₊ - p₋log₂p₋.

66 ID3 Algorithm Tagging (cont.)
p₊ represents the proportion of positive examples and p₋ the proportion of negative examples.
If the attribute possesses m different values, the entropy of S relative to the classifications of m classes is Entropy(S) = -Σᵢ pᵢ log₂ pᵢ.
The measure standard of the effectiveness of training data is defined as the entropy.
The information gain of an attribute is the decrease of expected entropy caused by the segmented examples.
The gain of an attribute A in set S is Gain(S, A) = Entropy(S) - Σ_{v∈V(A)} (|Sv|/|S|)·Entropy(Sv).

67 ID3 Algorithm Tagging (cont.)
V(A) is the range of A, S is the sample set, and Sv is the sample set with the A value equal to v.
Example: Decision Tree Prediction using the ID3 Algorithm.
Given a training set D with 500 samples.
The class label attribute loan has two different values, i.e., {yes, no}: two different categories with m = 2.

68 ID3 Algorithm Tagging (cont.)
Category C1 corresponds to yes and category C2 to no: 300 tuples in category yes and 200 tuples in category no.
A root node N is created for the tuples in D.
The information gain of each attribute must be calculated to find the split criterion for those tuples.
The entropy value used to classify the tuples in D is Info(D) = -(300/500)log₂(300/500) - (200/500)log₂(200/500) ≈ 0.971.
Then calculate the expected information demand of each attribute.
For the income attribute equal to or greater than 80K, there are 250 yes tuples and 100 no tuples.

69 ID3 Algorithm Tagging (cont.)
For the income attribute of less than 80K, there are 50 yes tuples and 100 no tuples.
If the tuples are partitioned by annual income, the expected entropy for classifying the tuples in D is Info_income(D) = (350/500)·Entropy(250, 100) + (150/500)·Entropy(50, 100) ≈ 0.880.
The information gain of such a partition is Gain(income) = 0.971 - 0.880 ≈ 0.091.
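The entropy and gain numbers above can be checked with a few lines of code; the counts come straight from the example:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# 500 samples: 300 "yes" and 200 "no" overall.
info_d = entropy([300, 200])

# Partition by annual income: >=80K has (250 yes, 100 no); <80K has (50 yes, 100 no).
info_income = (350 / 500) * entropy([250, 100]) + (150 / 500) * entropy([50, 100])

print(f"Info(D)        = {info_d:.3f}")                 # ~0.971
print(f"Info_income(D) = {info_income:.3f}")            # ~0.880
print(f"Gain(income)   = {info_d - info_income:.3f}")   # ~0.091
```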

70 ID3 Algorithm Tagging (cont.)
The information gains of age and marital status can also be calculated.
The attribute with the largest information gain is selected to construct the tree; here it is the age attribute.
Example: Bank Loan Approval Using a Decision Tree.
A credit card enables the cardholder to borrow money or pay for a purchase from the card-issuing bank.
The bank expects a customer to pay it back by a given deadline, and keeps statistics of customer payback records.
Consider three cardholder attributes as the input variables to the decision process: gender, age, and income.

71 ID3 Algorithm Tagging (cont.)
Credit cardholder data come from a card-issuing bank.
Use these sample data points with labeled decisions to construct a decision tree to predict whether a customer will pay in a timely manner, or reveal the probability of doing so.

72 ID3 Algorithm Tagging (cont.)
Apply the decision tree obtained to a testing customer characterized by the data vector [Gender: female, Age: 26~40, Income: middle level].
At the root level, if gender is chosen as the split attribute, its conditional entropy is computed; likewise for choosing age or income at the root.

73 ID3 Algorithm Tagging (cont.)
Select the attribute with the lowest conditional entropy as the root; here the root could be either age or income.
The above procedure is repeated at the next level of the tree, until all attributes are exhausted and the entire sample data set is covered.
Finally we obtain two decision trees.
At each level, each training data item is branched to a different leaf node or waits for the next split.

74 ID3 Algorithm Tagging (cont.)
Both trees have at most three levels before reaching the decision leaves, so both trees have equal search costs.
For an optimal solution, one needs to choose the shortest decision tree with the minimum number of levels.

75 ID3 Algorithm Tagging (cont.)
Considering the testing customer data [Gender: female, Age: 26~40, Income: middle level], neither of the two trees leads to a unique solution.
This implies the tree has an over-fitting situation: it is heavily biased toward the sample data provided.
A solution is to shorten the trees by computing the probability at a shortened leaf node, based on a majority vote of the customers falling in that category.
E.g., stop the tree construction at the root level by splitting the outgoing edges into three paths corresponding to the three possible choices from the root, with different income values, i.e., high, middle, low.

76 ID3 Algorithm Tagging (cont.)
Or produce the shortened tree with four possible leaf nodes pointed to by age values (> 40, 26-40, 15-25, < 15), respectively.
The leaf nodes are marked with probability values.
Using either of the shortened decision trees, the testing customer ends up with a yes prediction: the customer will pay back in a timely manner.
The two trees end up with different probability values predicted for a yes vote.

77 Rule-based Classification
The rule-based classifier is a classification forecasting method: a technique that uses a set of "if…then…" rules to classify records.
Model rules are usually represented in disjunctive normal form as R = (r1 ∨ r2 ∨ ⋯ ∨ rk), where R is the rule set and each ri is a classification rule or disjunct.
E.g., the use of three prediction rules:
r1: (Body temperature = Cold blood) → Non-mammalian
r2: (Body temperature = Constant temperature) ∧ (Viviparity = Yes) → Mammalian

78 Rule-based Classification (cont.)
r3: (Body temperature = Constant temperature) ∧ (Viviparity = No) → Non-mammalian
Each classification rule is represented by ri: (Conditioni) → yi.
The left side is called the premise or rule antecedent; the right side is called the conclusion or rule consequent.
If a record meets a rule, the record is activated or triggered, or it is covered by the rule.
In general, the rule antecedent is represented by Conditioni = (A1 op v1) ∧ (A2 op v2) ∧ ⋯ ∧ (Ak op vk).
Each (Ai op vi) is called a conjunct and consists of an attribute-value pair and a logical operator op.

79 Rule-based Classification (cont.)
Generally, op ∈ {=, ≠, <, >, ≤, ≥}.
For each class, there may be more than one applicable rule.
To determine which rule is superior, define the coverage and precision functions for the quality of classification rules.
For dataset D and classification rule r: A → y:
Rule coverage is the proportion of records in D that trigger rule r: Coverage(r) = |A| / |D|.
Rule precision, or the confidence factor, is the proportion of records triggering r whose class label equals y: Accuracy(r) = |A ∩ y| / |A|.

80 Rule-based Classification (cont.)
|A| is the number of records that meet the rule antecedent, |A ∩ y| is the number of records that meet both the rule antecedent and the rule consequent, and |D| is the total number of records.
One should determine whether some of the rules in a given rule set are ineffective.
Some records can be triggered by more than one rule, which can lead to conflicting classifications; other records may not be covered by any rule.
Two important properties improve the applicability of the rules.
Mutual exclusion property:
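The coverage and precision definitions above are easy to evaluate directly; a minimal sketch on hypothetical animal records, with the antecedent expressed as a predicate:

```python
def rule_coverage_accuracy(records, antecedent, label):
    """Coverage = |A| / |D|; accuracy = |A ∩ y| / |A| for rule A -> y."""
    covered = [r for r in records if antecedent(r)]
    if not covered:
        return 0.0, 0.0
    correct = [r for r in covered if r["class"] == label]
    return len(covered) / len(records), len(correct) / len(covered)

# Hypothetical records (illustrative only).
data = [
    {"body_temp": "cold", "viviparous": False, "class": "non-mammal"},
    {"body_temp": "constant", "viviparous": True, "class": "mammal"},
    {"body_temp": "constant", "viviparous": False, "class": "non-mammal"},
    {"body_temp": "cold", "viviparous": False, "class": "non-mammal"},
]

# r1: (body temperature = cold) -> non-mammal
cov, acc = rule_coverage_accuracy(data, lambda r: r["body_temp"] == "cold", "non-mammal")
print(f"coverage = {cov:.2f}, accuracy = {acc:.2f}")
```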

81 Rule-based Classification (cont.)
No two rules in the rule set R are triggered by the same record; the rules in R are mutually exclusive.
This ensures that every record is covered by at most one rule in R.
The example rule set above is mutually exclusive.
Exhaustive property: for any combination of attribute values, there is a rule in R to cover it; the rule set R then has exhaustive coverage.
This ensures that every record is covered by at least one rule in R.
With both the mutually exclusive and exhaustive properties, any record is covered by one and only one rule.

82 Rule-based Classification (cont.)
Many rule sets cannot meet the above two properties.
If a rule set cannot meet the exhaustive property, a default rule rd: () → yd must be added to cover the uncovered records.
The antecedent of the default rule is empty; it is triggered when all other rules fail, and yd is the default class, usually assigned as the majority class of the records not covered by the other rules.
If a rule set does not meet the mutually exclusive property, a record may be covered by more than one rule.

83 Rule-based Classification (cont.)
The classifications given by these rules may conflict; two approaches determine the classification result of such a record.
Ordered rules: the rule set is ordered from high to low rule priority, defined generally with precision, coverage, and so on.
When classifying, the rules are scanned in sequence until one covering the record is found, and that rule gives the classification result of the record.
General rule-based classifiers adopt this method.
Unordered rules: all rules are equal to each other.

84 Rule-based Classification (cont.)
The rules are scanned successively; when a record arrives, each rule covering it casts a vote, and the class getting the most votes becomes the final classification result of the record.
Rule Extraction from a Decision Tree.
Rule extraction from decision tree modeling is a common indirect method of rule extraction.
Each path of the decision tree from its root node to a leaf node expresses a classification rule.
The conditions along the path constitute the rule antecedent, and the class label of the leaf node constitutes the rule consequent.

85 Rule-based Classification (cont.)
Rule set generated from a decision tree.

86 Rule-based Classification (cont.)
r2, r3, r5 may be replaced by r6, r7.
It is simpler to describe the decision tree model by the rules consisting of r1, r4, r6, r7.
This is the content described by the C4.5 rule algorithm; C4.5 is an extension of the earlier ID3 algorithm.
First use the decision tree to generate the rule set, then simplify the rule set, and finally sort the rules.

87 Rule Extraction with Direct Rule
The sequential covering algorithm is often used to extract rules directly from data.
The growth of rules usually proceeds in a greedy manner based on some kind of evaluation measure.
The algorithm extracts the rules of one class at a time from a record set containing more than one class of training data.
A flow chart illustrates the data flow: E represents the training dataset, A is the set of attribute-value pairs {(Aj, vj)}, and R is the rule set.
Input the collection of training dataset E and attribute-value pairs A.

88 Rule Extraction with Direct Rule (cont.)

89 Rule Extraction with Direct Rule (cont.)
Make Y the ordered set of classes {y1, y2, ⋯, yk}, and let R = {} be the initial rule set.
For each class y in Y:
While training records of class y remain covered by no rule, use the function Rule() to generate a rule r, delete the records covered by rule r from E, and add r to the end of the rule set, namely R = R ∨ r.
Otherwise, end the loop.
Add the default rule () → yd to the end of the rule set.
The Rule() function extracts a classification rule that covers a large number of positive examples in the training set
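A minimal sketch of this loop follows; `learn_one_rule` is an assumed helper standing in for the greedy Rule() function described above, and the record/rule representation is hypothetical.

```python
def sequential_covering(examples, classes, learn_one_rule):
    """Sketch of sequential covering: extract one rule at a time per class.

    learn_one_rule(examples, y) is an assumed helper that greedily grows a
    single rule for class y and returns it as a dict with an "antecedent"
    predicate, or None if no acceptable rule can be found.
    """
    rules = []
    for y in classes:  # ordered set {y1, ..., yk}
        while any(e["class"] == y for e in examples):
            rule = learn_one_rule(examples, y)
            if rule is None:  # end the loop for this class
                break
            # Delete the records covered by the new rule.
            examples = [e for e in examples if not rule["antecedent"](e)]
            rules.append(rule)  # append to the end of R
    # Default rule () -> yd covers anything left over.
    rules.append({"antecedent": lambda e: True, "label": "default"})
    return rules
```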

90 Rule Extraction with Direct Rule (cont.)
and covers none or only a few counter-examples.
To avoid exponential explosion, the function grows rules in a greedy manner.
It first creates an initial rule r and constantly improves the rule until certain conditions are satisfied.
It can also prune the rule to improve its generalization error.
A case of the rule generation strategy from general to specific:
set up an initial rule r: {} → y, where the rule antecedent is an empty set and the rule consequent contains the target class.

91 Rule Extraction with Direct Rule (cont.)
This rule covers all training set records, so its quality is very poor.
Add new conjuncts to improve the quality of the rule.
This continues until the end conditions are satisfied, e.g., when added conjuncts can no longer improve the quality of the rule.

92 Rule Extraction with Direct Rule (cont.)
The strategy from specific to general: randomly select a positive example, then delete a conjunct of the rule so it covers more positive examples, generalizing the rule until the end conditions are satisfied, e.g., when the rule starts covering counter-examples.

93 Rule Extraction with Direct Rule (cont.)
Example: Diabetes Prediction using Rule-Based Classification.
A dataset comes from the physical examination of some people in Wuhan: blood glucose (high, low), weight (overweight, normal), blood lipid content, and diabetes (yes, no).

94 Rule Extraction with Direct Rule (cont.)
The corresponding rule sets may be constituted to classify people into two categories, i.e., the diabetic and the normal.
The rule consequent will be Yes for the diabetic and No for the normal.
Use the sequential covering algorithm to generate the rules.
Determine the classes {Yes, No} and start with the normal people class (the No class).
Use the general-to-specific strategy to generate the rule {} → No.
Add the property of blood glucose (A) and generate the following rule: r1: {A = L} → No.

95 Rule Extraction with Direct Rule (cont.)
Delete the records with id 1, 4 and 7 and add the above rule to the rule set R, i.e., R = {r1}.
Continue by adding the property of weight (B) and generate r2: {A = H, B = Normal} → No.
Delete the records with id 2, 6 and 9, so R = {r1, r2}.
Consider blood lipid (C) and get the rule r3: {A = H, B = Overweight, C < 1.8} → No.
Delete the record with id 3, so R = {r1, r2, r3}.
Inspect the diabetic class (the Yes class); its analysis generates the rule r4: {A = H, B = Overweight, C > 1.8} → Yes.
Delete the records with id 5, 8 and 10, so R = {r1, r2, r3, r4}.

96 Rule Extraction with Direct Rule (cont.)
Now all training records have been deleted, so the loop stops.
Finally, output the rule set R obtained from the above steps.

97 Instance-Based Learning
Instance-based learning models a decision problem with instances, or critical training data.
The data instances are built up into a database of reliable examples.
A similarity test is conducted to find the best match to make a prediction.
This is also known as memory-based learning.
Representative data instances and similarity measures are stored in the database.

98 The Nearest Neighbor Classifier
Decision trees and rule-based classification rely on a training dataset to build a mapping model from input properties to class labels; call this the active learning method.
In the passive learning method, modeling of the training dataset is postponed until the test data become available.
The Rote classifier is a kind of passive learning method: it does not classify test data until the data match a training dataset instance completely.

99 The Nearest Neighbor Classifier (cont.)
The above method has an apparent disadvantage: most test data instances cannot be classified because no training instance matches them exactly.
An improved model, called the nearest neighbor classifier, addresses this.
The idea is to find all the training dataset instances with the most similar properties to the test sample.
The collection of these training instances is called the nearest neighbor set of the test sample, and the class labels are determined according to these instances.
Each sample is considered as an n-dimensional point.

100 The Nearest Neighbor Classifier (cont.)
n is the total number of properties.
The nearest neighbors between two given points are generally determined with the Euclidean distance d(x, y) = √(Σᵢ (xᵢ - yᵢ)²).
An instance of three kinds of nearest neighbors is shown.

101 The Nearest Neighbor Classifier (cont.)
It is very important to choose a proper distance threshold k.
If k is too small, the nearest neighbor classifier tends to be affected by overfitting due to the noise in the training data.
If k is too large, the nearest neighbor classifier may misclassify the test samples because the neighbor set contains data points far from the test sample.
Once the nearest neighbor set of a test sample is decided, its class label can be determined from the class labels in the neighbor set.
If the class labels within the nearest neighbor set are inconsistent,

102 The Nearest Neighbor Classifier (cont.)
the majority class label in the nearest neighbor set should be taken as the class label of the test sample.
If some of the nearest neighbor samples are more important, e.g., the nearest neighbor with the smallest distance, the class label choice can be carried out by assigning weight coefficients.
There are thus two methods of choosing the class label of test samples: majority voting and weighted-distance voting.

103 The Nearest Neighbor Classifier (cont.)
In majority voting, y′ = argmax_v Σ_{(xᵢ,yᵢ)∈DZ} I(v = yᵢ), where v is a class label, DZ is the nearest neighbor set of the test sample, yᵢ is the class label of a nearest neighbor, and I(⋅) is an indicator function.
A flow chart describes the nearest neighbor classification algorithm.
The variable k represents the distance threshold, D is the training dataset, and z is the test instance.
First input k, D, and z.
Then calculate the distance between the test instance and each training dataset sample.
The samples whose distance d(z, D) is lower than k are collected into the set DZ.

104 The Nearest Neighbor Classifier (cont.)
Then, using the statistics of the class labels in DZ, decide the class label of the test instance by the majority voting method.
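A minimal sketch of this flow follows; the training samples are hypothetical points chosen so that four neighbors fall inside the threshold, mirroring the hyperlipidemia example that comes next.

```python
import math
from collections import Counter

def nn_classify(test, train, threshold):
    """Majority vote among training samples within a distance threshold."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    dz = [label for point, label in train if dist(test, point) < threshold]
    if not dz:
        return None  # no neighbor close enough
    return Counter(dz).most_common(1)[0][0]

# Hypothetical (triglyceride, cholesterol) samples (illustrative only).
train = [((1.25, 4.40), "No"), ((1.30, 4.20), "No"),
         ((1.40, 4.30), "No"), ((1.45, 4.45), "Yes")]
print(nn_classify((1.33, 4.32), train, threshold=0.2))  # -> "No"
```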

105 The Nearest Neighbor Classifier (cont.)
Example: Hyperlipidemia Prediction using the Nearest Neighbor Algorithm.
The hyperlipidemia disease is attributed to two medical indexes, namely triglyceride and total cholesterol.
Generally, people who have similar indexes may have similar health problems.
One can use the nearest neighbor classifier to judge whether a patient has acquired hyperlipidemia.
The dataset contains the triglyceride content, total cholesterol content, and whether the person has hyperlipidemia (Yes, No) for nine potential patients.

106 The Nearest Neighbor Classifier (cont.)
The question is whether a person with a triglyceride content of 1.33 and a total cholesterol count of 4.32 has hyperlipidemia.
This can be classified by the nearest neighbor classifier with the test sample (1.33, 4.32).

107 The Nearest Neighbor Classifier (cont.)
Set the threshold as 0.2 for this case.
For the training samples with id = 1 and id = 2, the distance to the test sample is below the threshold, so they are added into DZ.
For the training sample with id = 3, the distance exceeds the threshold, so the sample is discarded.
The remainder are dealt with in the same way.

108 The Nearest Neighbor Classifier (cont.)
This yields the collection of nearest neighbors DZ = {x | id = 1, 2, 6, 9}.
There are both Yes and No labels in the nearest neighbor collection.
Finally, collect statistics of the class labels by the majority voting method: No for id = 1, 2 and 6, and Yes for id = 9.
The voting result is Yes = 1, No = 3.
The person examined is therefore not suffering from hyperlipidemia when the triglyceride content is 1.33 and the total cholesterol is 4.32.

109 Support Vector Machines (SVM)
Support vector machines (SVMs) are often used as supervised learning methods for regression and classification applications.
They decide how to generate a hyperplane that separates the training sample data space into distinct subspaces.
E.g., one can use a straight line to separate points in 2D space, planes to separate points in 3D space, and a hyperplane to separate points in a high-dimensional space.
An SVM builds a model to predict whether a new sample falls into one subspace or another.

110 Support Vector Machines (SVM) (cont.)
SVMs offer another approach to classifying multidimensional data sets: points in the same area are regarded as one class.
SVMs can thus solve classification problems; the samples on the margin are called support vectors.
The original problem may be stated in a finite-dimensional space, but it often happens that the sets to discriminate are not linearly separable in that space.
The original finite-dimensional space can be mapped into a much higher-dimensional space, presumably making the separation easier there.
A hyperplane can then be used to separate these points in the high-dimensional space.

111 Linear Decision Boundary
Consider a 2D plane with two kinds of data that are linearly separable.
Infinitely many straight lines can separate them; the task is to find the best line, i.e., the one with the minimum classification error.

112 Linear Decision Boundary (cont.)
Consider the two-class problem in an n-dimensional space, where the two classes are separated by an (n-1)-dimensional hyperplane.
The data points are (X1, y1), …, (X|D|, y|D|), where Xi is an n-dimensional training sample with class label yi.
Each yi assumes a value of +1 for one class and -1 for the other class.
The (n-1)-dimensional hyperplane is wᵀx + b = 0, where w and b are the parameters; it corresponds to a straight line in the 2D plane.

113 Linear Decision Boundary (cont.)
The hyperplane intends to separate the two kinds of data, i.e., all the yi corresponding to data points on one side of the hyperplane are -1, and +1 on the other side.
Let f(x) = wᵀx + b; then f(x) > 0 for data points with y = 1 and f(x) < 0 for data points with y = -1.
Example: Classification using a Support Vector Machine with Training Samples of 2-D Data.

114 Linear Decision Boundary (cont.)
One straight line, 2x1 + x2 - 3 = 0, can be found to separate the data in the table.
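Checking the sign of f(x) = wᵀx + b against this boundary is a one-liner; the test points below are hypothetical:

```python
import numpy as np

# Decision boundary from the example: 2*x1 + x2 - 3 = 0.
w = np.array([2.0, 1.0])
b = -3.0

def classify(x):
    """Sign of f(x) = w^T x + b decides the class (+1 or -1)."""
    return 1 if w @ x + b > 0 else -1

# Hypothetical 2-D test points (illustrative only).
for point in [np.array([2.0, 2.0]), np.array([0.5, 1.0])]:
    print(point, "->", classify(point))
```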

115 Maximal Margin Hyperplane
Consider the training samples nearest to the decision boundary.
By adjusting the parameters w and b, two parallel hyperplanes H1 and H2 can be obtained.

116 Maximal Margin Hyperplane (cont.)
The margin of the decision boundary is the distance between these two hyperplanes.
To calculate the margin, let x1 be a data point on H1 and x2 a data point on H2.
Inserting x1 and x2 into the hyperplane formulas wᵀx1 + b = 1 and wᵀx2 + b = -1 and subtracting gives wᵀ(x1 - x2) = 2, so the margin is d = 2 / ‖w‖.

117 Formal SVM Model
The training phase of SVM includes estimation of the parameters w and b from the training data.
The selected parameters must satisfy wᵀxi + b ≥ 1 if yi = 1, and wᵀxi + b ≤ -1 if yi = -1.
These two inequalities can be written as yi(wᵀxi + b) ≥ 1.
Maximization of the margin is equivalent to minimization of the convex objective function f(w) = ‖w‖²/2.

118 Formal SVM Model (cont.)
The SVM is obtained by finding the minimum of the objective function.
This is a convex optimization problem: the objective function is quadratic and the constraint conditions are linear.
It can be solved through the standard Lagrange multiplier method.

119 Non-linear Hyperplanes
The model may need adjustment when the samples are not linearly separable.
An outlier or noise may make the sample space linearly inseparable.
Outliers are values distant from most others: weights with high absolute values, predicted values relatively far away from the actual values, or input data whose values are more than roughly 3 standard deviations from the mean.
Outliers often cause problems in model training.
A slack variable is introduced to handle this case.

120 Non-linear Hyperplanes (cont.)
With no restriction on misclassified samples on the boundary, the learning algorithm may find a boundary with a wider margin by allowing many misclassified training samples.
To penalize huge slack variable values, the objective function is modified as follows: f(w) = ‖w‖²/2 + C(Σᵢ ξᵢ)ᵏ, where the ξᵢ are the slack variables.
C and k are parameters designated by the user, giving the punishment for misclassified training instances.

121 Non-linear Hyperplanes (cont.)
I.e., the more outliers there are, the larger the objective function value; C is the weight of the outliers.
The final model is given by minimizing this modified objective.
If no hyperplane can be found to separate the data, i.e., the linear SVM cannot find a feasible solution, one needs to extend the linear SVM to a nonlinear SVM model.

122 Non-linear Hyperplanes (cont.)
Convert the input data to a space of higher dimension through a nonlinear mapping, and search for the separating hyperplane in the new space.
If the low-dimensional linear data is inseparable, it can be mapped to a higher dimension to become separable, e.g., by using a Gaussian function.
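A minimal nonlinear-SVM sketch using scikit-learn (assumed available); the RBF (Gaussian) kernel implicitly performs the higher-dimensional mapping, and the ring-shaped data is synthetic, generated only to give a set that no straight line can separate:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical ring-shaped data: inner class vs. outer class.
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

clf = SVC(kernel="rbf", C=1.0)  # C penalizes misclassified (slack) samples
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```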

123 Bayesian Learning Algorithms
Bayesian methods are based on statistical decision theory, often applied in pattern recognition, feature extraction, and regression applications.
A Bayesian network offers a directed acyclic graph (DAG) model represented by a set of statistically independent random variables.
E.g., a Bayesian network can represent the probabilistic relationships between diseases and symptoms: given symptoms, the system computes the probabilities of having various diseases.
Such networks are used in medical diagnosis to assist doctors, nurses, and patients in the healthcare industry.

124 Naive Bayesian Classifier
Two Bayesian methods, the naive Bayesian classifier and Bayesian belief networks, improve the accuracy of classification used in medical, financial, and many other fields.
Consider a pair of random variables X and Y.
Their joint probability P(X = x, Y = y) is P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y).
One can compute the inverse conditional probability with the well-known Bayesian theorem: P(Y|X) = P(X|Y) × P(Y) / P(X).
During classification, the random variable Y is the class to be decided.

125 Naive Bayesian Classifier (cont.)
X is the attribute set
Need to compute the class probability P(Y|X0), given the attribute vector X0 for a testing data item
The Y with the maximum value of P(Y|X0) corresponds to the predicted class for testing data X0
For an attribute vector X = (X1, X2, ⋯, Xk)
And l possible values (or classes) for the random variable Y = {Y1, Y2, ..., Yl}
P(Y|X) is the posterior probability of Y, and P(Y) is the prior probability of Y
Assume all attributes are statistically independent
Then the class-conditional probability can be computed as P(X|Y) = P(X1|Y) × P(X2|Y) × ⋯ × P(Xk|Y)

126 Naive Bayesian Classifier (cont.)
The naive Bayesian classifier calculates the posterior probability for each class Y by P(Y|X) = P(Y) × P(X1|Y) × ⋯ × P(Xk|Y) / P(X)
The Bayesian classification method predicts X to the class with the highest posterior probability
Compute the posterior probability P(Yi|X), i = 1, 2, ..., l, for each combination of X and Y
Then decide Yr by finding the maximum posterior, r = argmax_i P(Yi|X), and classify X to class Yr
As P(X) is the same for all classes, it is sufficient to find the maximum of the numerator

127 Naive Bayesian Classifier (cont.)
Only need to compute the numerator P(Y) × Πi P(Xi|Y) for each class
The six steps of the naive Bayesian classifier are summarized in the figure; the sketch below walks through them in code
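A minimal sketch of these steps in Python, assuming categorical attributes and a small labeled training list; all names are illustrative:

```python
# Sketch of a naive Bayesian classifier for categorical attributes.
# Train: estimate priors P(Y) and per-class likelihoods P(Xi|Y) by counting.
# Predict: score each class by the numerator P(Y) * prod_i P(Xi|Y).
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}   # P(Y = c)
    class_sizes = Counter(labels)
    counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            counts[(c, i)][v] += 1
    def likelihood(c, i, v):        # P(X_i = v | Y = c)
        return counts[(c, i)][v] / class_sizes[c]
    return priors, likelihood

def predict(priors, likelihood, x):
    # P(X) is identical across classes, so the largest numerator wins.
    scores = dict(priors)
    for c in scores:
        for i, v in enumerate(x):
            scores[c] *= likelihood(c, i, v)
    return max(scores, key=scores.get)
```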

128 Naive Bayesian Classifier (cont.)
Example: Bayesian classifier and analysis of classification results
The training data is a set of animals
Each data item can be labeled as mammal or non-mammal, but not both
Each data item is characterized by four independent attributes
A = <A1, A2, A3, A4> = <gives birth, can fly, lives in water, has legs>
Build a Bayesian classifier model from the training set
The model will then be applied to classify any unlabeled animal as either mammal (M) or non-mammal (N)

129 Naive Bayesian Classifier (cont.)
(Training data table: 20 animals with attribute values A1-A4 and class labels M or N)

130 Naive Bayesian Classifier (cont.)
The attribute A3 (lives in water) means the animal primarily lives in water, not just occasionally swims in it
The value sometimes in water is considered a no entry
Compute the prior probabilities: P(M) = 7/20 and P(N) = 13/20
An unlabeled testing data item is characterized by the attribute vector A* = <A1, A2, A3, A4> = <yes, no, yes, no>
i.e., A creature that gives birth, cannot fly, lives in water, and has no legs
Calculating the testing probability values gives P(M|A*) > P(N|A*)
So the creature with attribute vector A* is classified as a mammal

131 Naive Bayesian Classifier (cont.)
P(M) = 7/20 and P(N) = 13/20
Analyze the accuracy of the Bayesian classifier by testing four creatures with the above method
Obtain the posterior probabilities P(M|A1, A2, A3, A4) and P(N|A1, A2, A3, A4) for each testing animal

132 Naive Bayesian Classifier (cont.)
Choose the class with the highest probability as the predicted class
Comparing the predicted results with the actual classes reveals four possible prediction statuses: true positive (TP), true negative (TN), false positive (FP), and false negative (FN)
Here TP = 2/4 = 0.5, TN = 1/4 = 0.25, FP = 0, and FN = 1/4 = 0.25

133 Naive Bayesian Classifier (cont.)
Use two performance metrics to assess the accuracy of the Bayesian classifier
Prediction accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.75
Prediction error = (FP + FN) / (TP + TN + FP + FN) = 0.25
Part of the error comes from the weak assumption that all attributes are independent
The larger the training set that covers more possible attribute vectors, the higher the prediction accuracy
A problem arises if any individual conditional probability P(Ai|C) = Nic / Nc = 0

134 Naive Bayesian Classifier (cont.)
Due to Nic = 0 in the training dataset
The entire posterior probability for that class becomes zero
Can be avoided by adding an offset (Laplace smoothing): P(Ai|C) = (Nic + 1) / (Nc + c), which equals 1/(Nc + c) when Nic = 0
c is the number of classes being considered; a sketch of the smoothed estimate follows
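Applied to the earlier illustrative sketch, the smoothed likelihood would replace the raw frequency ratio:

```python
# Laplace-smoothed likelihood: avoids zero posteriors when an
# (attribute value, class) pair never occurs in the training data.
# `counts` and `class_sizes` follow the earlier sketch; `num_classes`
# is the slide's offset c.
def smoothed_likelihood(counts, class_sizes, num_classes, c, i, v):
    return (counts[(c, i)][v] + 1) / (class_sizes[c] + num_classes)
```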

135 Bayesian Belief Networks
The naive Bayesian classifier assumes that all attributes are statistically independent
This assumption is too strict in some cases
To relax this assumption, the Bayesian belief network is introduced with class-conditional probabilities
The Bayesian belief network is a graphical representation of the relationships among attributes
Two main components
A directed acyclic graph (DAG) representing the dependencies between the variables
A probability table connecting each node directly to its parent nodes

136 Bayesian Belief Networks (cont.)
Consider three random variables A, B, and C
A and B are independent of each other, and both have a direct impact on C
Now consider five variables A, B, C, D, and E
The pairs A and B, and D and E, are independent of each other
D and E have a direct impact on C, and variable C in turn affects A and B

137 Bayesian Belief Networks (cont.)
Consider the situation where all attributes are conditionally independent given the class
Namely the naive Bayesian network, a special kind of Bayesian belief network
Use Y to denote the target class and {X1, X2, ⋯, Xd} its set of attributes
In the corresponding Bayesian belief network, Y is the single parent of every attribute node Xi

138 Bayesian Belief Networks (cont.)
To indicate the relationships between the variables more vividly
If there is an arc from X to Y, then X is Y's parent and Y is the child of X
If a path exists in the network from X to Z, then X is an ancestor of Z and Z is a descendant of X
e.g., D is C's parent and an ancestor of A and B; C is the child of D, and A and B are descendants of D
Each node is also associated with a probability table
If node X has no parent node, this table contains only the prior probability P(X)

139 Bayesian Belief Networks (cont.)
If node X has exactly one parent node Y, this table contains the conditional probability P(X|Y)
If node X has multiple parent nodes {Y1, Y2, ..., Ym}, this table contains the conditional probability P(X|Y1, Y2, ..., Ym)
Bayesian belief network modeling consists of two steps
First, create the network structure: the topology is encoded from subjective domain knowledge
Then, estimate the probabilities on the arcs: the values are conditional probabilities estimated from the data
The tables multiply together through the chain rule, as the sketch below illustrates
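A minimal sketch of how the node tables combine via the factorization P(X1, ..., Xn) = Πi P(Xi | parents(Xi)); all numbers are made up for illustration, not taken from the slides:

```python
# Sketch: joint probability of a tiny belief network A -> C <- B via
# P(A, B, C) = P(A) * P(B) * P(C | A, B). Illustrative numbers only.
P_A = {True: 0.3, False: 0.7}                 # prior table of A (no parents)
P_B = {True: 0.4, False: 0.6}                 # prior table of B (no parents)
P_C_given_AB = {                              # P(C = True | A, B)
    (True, True): 0.9, (True, False): 0.6,
    (False, True): 0.5, (False, False): 0.1,
}

def joint(a, b, c):
    p_c = P_C_given_AB[(a, b)]
    return P_A[a] * P_B[b] * (p_c if c else 1 - p_c)

print(joint(True, False, True))  # 0.3 * 0.6 * 0.6 = 0.108
```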

140 Bayesian Belief Networks (cont.)
A systematic procedure for using the Bayesian belief network is shown in the figure

141 Bayesian Belief Networks (cont.)
Example: use of a Bayesian belief network in diabetes prediction
In general, diabetes is attributed to many factors
e.g., Obesity, family history, blood sugar (blood glucose), blood lipid, etc.
Obesity and family history are antecedents (causes)
Other symptoms, such as abnormal blood glucose and blood lipid, are induced by the disease, aka consequences
The Bayesian belief network can model the antecedents and consequences at the same time
The naive Bayesian network can only model the antecedents
Part of the physical examination data of patients is given below, where 1 indicates having the symptom and 0 otherwise

142 Bayesian Belief Networks (cont.)
(Physical examination data table: obesity, family history, blood glucose, blood lipid, and diabetes status per patient)

143 Bayesian Belief Networks (cont.)
According to experience and common sense
Obesity and family history are both related to diabetes
Blood glucose and blood lipid are also related to diabetes
The attributes are ordered as T = {obesity, family history, blood sugar, blood lipid, diabetes}
From this ordering, the Bayesian belief network is obtained

144 Bayesian Belief Networks (cont.)
To simplify the conditional probability, P(A|B) = P(A) because obesity (A) and family history (B) are independent of each other
Combine this with the given table to get the probability of diabetes
e.g., From the calculated probability table P(C|A, B)
An obese patient with a family history of diabetes has a probability of diabetes of 2/3

145 Bayesian Belief Networks (cont.)
To find out whether diabetes has effects on blood glucose and blood lipids, compute the probability table P(D, E|C)
Then use Bayesian conditional probability to calculate the probability of diabetes in other cases

146 Bayesian Belief Networks (cont.)
e.g., Suppose some person has high blood glucose and high blood lipids
Namely, compute P(C = yes|D = yes, E = yes)
With the known probability tables, apply the Bayes conditional probability theorem, as written out below
The result indicates that the patient suffers from diabetes
Now consider an obese patient with a family history of diabetes and high blood glucose
The probability that this patient has diabetes is 10/11
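The inversion used in this step is the standard Bayes rule over the two classes; the slide's numeric tables themselves are not reproduced here:

```latex
P(C{=}\text{yes} \mid D{=}\text{yes}, E{=}\text{yes})
 = \frac{P(D{=}\text{yes}, E{=}\text{yes} \mid C{=}\text{yes})\, P(C{=}\text{yes})}
        {\sum_{c \in \{\text{yes},\, \text{no}\}} P(D{=}\text{yes}, E{=}\text{yes} \mid C{=}c)\, P(C{=}c)}
```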


148 Ensemble Methods
General classification techniques use a single classifier obtained from training data to predict unknown class labels
e.g., Bayesian networks, decision trees, and support vector machines
Sometimes a single ML algorithm cannot reach the specified accuracy requirements
The accuracy can be improved through the aggregation of multiple classifiers
This technique is known as an ensemble model or a combination of classifiers
To make reliable decisions, several simple weak learners can be combined to improve accuracy well beyond 50%

149 Ensemble Methods (cont.)
Ensemble methods are composed of multiple weaker models that are independently trained
Much effort is put into choosing what types of weak learners to combine and the ways in which to combine them effectively
An ensemble may consist of mixed learners applying supervised, unsupervised, or semi-supervised algorithms

150 Random Forests
Random forest is one of the combination classification methods
A special kind of combination method designed for decision tree classification
Predictions are made through the combination of multiple decision trees
Each tree is generated from an independent, randomly selected subset of attributes (a random vector)
e.g., To decide whether a certain day is suitable for playing tennis according to weather, temperature, humidity, and wind conditions
Multiple decision trees are used to increase the accuracy
The four attributes can be divided into multiple groups of attributes

151 Random Forests (cont.)
e.g., {weather, humidity, wind}, {temperature, humidity, wind}, {weather, temperature}, and so on

152 Random Forests (cont.)
One decision tree can thus be split into three smaller trees
When making a decision, every single decision tree produces a result
i.e., Playing tennis or not playing tennis
This yields three decision-making results
The final result is the one with the most votes

153 Random Forests (cont.)
e.g., For the case {sunny, greater than 20 degrees, normal air humidity, no wind}
With the above three decision trees
The first and the third decision trees predict not playing tennis
The second decision tree predicts playing tennis
As playing tennis gets 1 vote and not playing tennis gets 2 votes, the final result is not playing tennis
In general, a random vector is obtained from random attributes as above and used to construct each decision tree
Once the trees are constructed, majority voting is used to combine their predictions, as the sketch below shows
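A minimal sketch of this tree-level voting with scikit-learn; the tiny tennis dataset and all parameter values are illustrative, not taken from the slides:

```python
# Sketch: a random forest as an ensemble of decision trees combined by
# majority voting over per-tree predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: weather (0=sunny, 1=rainy), temperature (deg C),
# humidity (0=normal, 1=high), wind (0=no, 1=yes)
X = np.array([[0, 25, 0, 0], [0, 18, 1, 1], [1, 22, 1, 0],
              [0, 30, 0, 1], [1, 15, 0, 1], [0, 21, 0, 0]])
y = np.array([1, 0, 0, 1, 0, 1])  # 1 = play tennis, 0 = do not play

# max_features limits the attributes each split may consider, which is
# how the per-tree randomness described above is injected.
forest = RandomForestClassifier(n_estimators=3, max_features=2, random_state=0)
forest.fit(X, y)

case = np.array([[0, 21, 0, 0]])  # sunny, >20 degrees, normal humidity, no wind
votes = [tree.predict(case)[0] for tree in forest.estimators_]
print("per-tree votes:", votes, "-> majority:", forest.predict(case)[0])
```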

154 Random Forests (cont.)
This scheme is known as Forest-RI, where RI refers to random input selection
The strength of the random forest decision depends on the dimension of the random vectors
Namely F, the number of features obtained by each tree
Usually F = log2(d) + 1, where d is the total number of attributes
If d is too small
It is difficult to select a random independent set of attributes to construct each decision tree
A method to increase the attribute space is to create linear combinations of features

155 Random Forests (cont.)
Use a linear combination of L input attributes to create each new attribute
Use the created new attributes to form a random vector for constructing the multiple decision trees
This method is called Forest-RC
The process of random forest establishment is shown in the figure

156 Random Forests (cont.)
The general process of random forest decisions is shown in the figure

157 Random Forests (cont.)
Example: diabetes prediction using the random forest model
Part of the physical examination data comes from a real-life hospital in central China

158 Random Forests (cont.)
The data includes body weight, blood sugar, lipid content, and whether the patient is suffering from diabetes
1: patient, 0: normal
The physical index data for one person is {weight: 60, blood sugar: 6.8, blood lipids: 1.5}
The task is to find out whether this person is suffering from diabetes
To improve the prediction accuracy, consider making predictions with the random forest method
First determine the dimension of the random vector: F = log2(d) + 1 = 2
The attributes in the example are few, so this keeps the correlation between random vectors low

159 Random Forests (cont.)
Three random vectors: {weight, blood sugar}, {weight, blood lipid}, and {blood sugar, blood lipid}
The ordering of the attributes in each tree is determined by information entropy
The attribute whose information gain is largest is placed at the top of the decision tree, and so on (see the sketch below)
The information-gain values of weight (A), blood sugar (B), and blood lipids (C) are then computed respectively from the data
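A minimal sketch of the entropy and information-gain computation used for this ordering, assuming numeric attributes split at a threshold; the helper names are illustrative:

```python
# Sketch: information gain for ordering attributes in a decision tree.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels, threshold):
    # Gain from splitting a numeric attribute at `threshold`:
    # H(labels) minus the size-weighted entropy of the two branches.
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder
```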

160 Random Forests (cont.)
The blood sugar and blood lipid contents are more important
They should be placed closer to the roots of the decision trees
The resulting order is blood sugar, blood lipid, and weight
For the examination index {weight: 60, blood sugar: 6.8, blood lipid: 1.5}
Decision tree 1 outputs YES, decision tree 2 outputs NO, and decision tree 3 outputs YES
The final tally is 2 votes for yes and 1 vote for no

161 Random Forests (cont.)
By majority vote, the preliminary result indicates that the person is suffering from diabetes
The diabetes random forest representation is shown in the figure

