
1 Chapter 6 Machine Learning Algorithms and Prediction Model Fitting

2 Taxonomy of Machine Learning Methods
- The main idea of machine learning (ML) is to use computers to learn from massive amounts of data
- ML forms the core of artificial intelligence (AI), especially in the era of big data
- The field is highly relevant to statistical decision making and data mining
- In building AI or expert systems for tedious or unstructured data, machines can often make better and less biased decisions than a human learner
- By learning from given data objects, one can reveal the categorical class or experience affiliation of future data to be tested

3 Taxonomy of Machine Learning Methods (cont.)
- This concept essentially defines ML as an operational term
- To implement an ML task, one needs to explore or construct computer algorithms that learn from data and make predictions on data based on their specific features, similarities, or correlations
- ML algorithms operate by building a decision-making model from sample data inputs
- The outputs of the ML model are data-driven predictions or decisions

4 Classification by Learning Paradigms
- ML algorithms can be built in different styles in order to model a problem
- The style is dictated by the interaction with the data environment, expressed as the input to the model
- The data interaction style decides the learning models that an ML algorithm can produce
- The user must understand the roles of the input data and the model's construction process
- The goal is to select the ML model that can solve the problem with the best prediction result
- ML sometimes overlaps with the goal of data mining

5 Classification by Learning Paradigms
- Three classes of ML algorithms are distinguished by learning style: supervised, unsupervised, and semi-supervised
- All three ML methods are viable in real-life applications
- The style hinges on how training data is used in the learning process

6 Classification by Learning Paradigms (cont.)
- Supervised learning
  - The input data, called training data, has a known label or result
  - A model is constructed through training on the training data set and improved by receiving feedback on its predictions
  - The learning process continues until the model achieves a desired level of accuracy on the training data
  - Future incoming data without known labels is then tested on the model with an acceptable level of accuracy
- Unsupervised learning
  - None of the input data is labeled with a known result

7 Classification by Learning Paradigms (cont.)
  - A model is generated by exploring the hidden structures present in the input data
  - The aim is to extract general rules, reduce redundancy through a mathematical process, or organize data by similarity testing
- Semi-supervised learning
  - The input data is a mixture of labeled and unlabeled examples
  - The model must learn the structures that organize the data in order to make predictions possible
  - Such problems are treated by different ML algorithms under different assumptions on how to model the unlabeled data

8 Methodologies for Machine/Deep Learning
- ML algorithms are distinguished by the similarity testing functions they apply in the learning process
  - e.g., Tree-based methods apply decision trees; a neural network is inspired by artificial neurons in a connectionist brain model
- One can handle the ML process subjectively by finding the best fit to solve the decision problem based on the characteristics of the data sets
- Ensemble methods are composed of multiple weaker models whose prediction results are combined

9 Methodologies for Machine/Deep Learning (cont.)
- Combining the weaker models makes the collective prediction more accurate
- These models are independently trained
- Much effort is put into choosing what types of weak learners to combine and the ways in which to combine them effectively
- An ensemble may consist of mixed learners applying supervised, unsupervised, or semi-supervised algorithms
- Deep learning methods extend from artificial neural networks (ANNs) by building much deeper and more complex neural networks
- These networks are built of multiple layers of interconnected artificial neurons

10 Methodologies for Machine/Deep Learning (cont.)
- Deep networks are often used to mimic the human brain's processing of light, sound, and visual signals
- They are often applied to semi-supervised learning problems, where large data sets contain very little labeled data

11 Supervised Machine Learning Algorithms
- In a supervised ML system, the computer learns from a training data set of {input, output} pairs
- The input comes from sample data given in a certain format, e.g., the credit reports of borrowers
- The output may be discrete, e.g., yes or no to a loan application
- The output can also be continuous, e.g., the probability distribution that the loan will be paid off in a timely manner
- The goal is to work out a reliable ML model that can map or produce the correct outputs from new inputs that were unseen before

12 Supervised Machine Learning Algorithms (cont.)
- The ML system acts like a finely tuned predictor function g(x)
- The learning system is built with a sophisticated algorithm to optimize this function
  - e.g., Given input data x in the credit report of a borrower, the bank will make a loan decision based on the predicted outcome
- Four families of important supervised ML algorithms: regression, decision trees, Bayesian networks, and support vector machines
- In solving a classification problem, the inputs are divided into two or more classes

13 Supervised Machine Learning Algorithms (cont.)
- The learner must produce a model that assigns unseen inputs to one or more of these classes
- Classification is typically tackled in a supervised way
- Spam filtering is a good example of classification
  - The inputs are emails, blogs, or document files
  - The output classes are spam and non-spam

14 Supervised Machine Learning Algorithms (cont.)
- Regression is also a supervised problem
  - The outputs are continuous in general but discrete in special cases
  - Regression uses statistical learning to model the relationship between input and output data
  - The regression process is iteratively refined using an error criterion, minimizing the error between the predicted value and the actual experience in the input data
- Decision trees offer a predictive model that solves classification and regression problems
  - They map observations about an item to conclusions about the item's target value

15 Supervised Machine Learning Algorithms (cont.)
  - Decisions proceed along various feature nodes in a tree-structured decision process
  - Various decision paths fork in the tree structure until a prediction decision is made hierarchically at a leaf node
  - Trees are trained on given data for better accuracy
- Bayesian methods are based on statistical decision theory
  - Often applied in pattern recognition, feature extraction, and regression applications
  - A Bayesian network offers a directed acyclic graph (DAG) model represented by a set of statistically independent random variables

16 Supervised Machine Learning Algorithms (cont.)
  - e.g., A Bayesian network can represent the probabilistic relationships between diseases and symptoms: given the symptoms, the system computes the probabilities of having various diseases
  - Many prediction algorithms are used in medical diagnosis to assist doctors, nurses, and patients in the healthcare industry
  - Both prior and posterior probabilities are applied in making predictions
  - Predictions can also be improved with the provisioning of a better training data set
- Support vector machines (SVMs) are often used in supervised learning methods

17 Supervised Machine Learning Algorithms (cont.)
  - Used for regression and classification applications
  - An SVM decides how to generate a hyperplane, e.g., a surface in a 3D space, to separate the training sample data space into distinct subspaces
  - It builds a model to predict whether a new sample falls into one subspace or another

18 Unsupervised Machine Learning Algorithms
- Unsupervised learning is typically used to find special relationships within a data set
- No training examples are used in this process
- The system is given a set of data to find the patterns and correlations therein
- Unsupervised methods attempt to reveal hidden structures or properties in the entire input data set
- Reported ML algorithms that operate without supervision include clustering methods, association analysis, dimension reduction, and artificial neural networks

19 Unsupervised Machine Learning Algorithms (cont.)
- Association rule learning generates inference rules used to discover useful associations in large multidimensional data sets
- These rules best explain observed relationships between variables in the data

20 Unsupervised Machine Learning Algorithms (cont.)
- Association patterns are often exploited by enterprises or large organizations
  - e.g., Association rules are generated from input data to identify close-knit groups of friends in a social network database
- In clustering, a set of inputs is to be divided into groups by grouping similar data objects as clusters
  - Clusters may be modeled with centroid-based clustering and/or hierarchical clustering
  - All clustering methods are based on similarity testing
  - Unlike supervised classification, the groups are not known in advance, making this an unsupervised task

21 Unsupervised Machine Learning Algorithms (cont.)
- Density estimation finds the distribution of inputs in some space
- Dimensionality reduction exploits the inherent structure in the data in an unsupervised manner
  - The purpose is to summarize or describe data using less information
  - This is done by visualizing multidimensional data with principal components or dimensions, then simplifying the inputs by mapping them into a lower-dimensional space
  - The simplified data can then be applied in a supervised learning method

22 Unsupervised Machine Learning Algorithms (cont.)
- Artificial neural networks (ANNs) are cognitive models inspired by the structure and function of biological neurons
- An ANN tries to model the complex relationships between inputs and outputs
- ANNs form a class of pattern matching algorithms used for solving regression and classification problems

23 Regression Analysis
- Regression analysis is widely used in ML for prediction, classification, and forecasting
- It essentially performs a sequence of parametric or nonparametric estimations to find the causal relationship between the input and output variables
- Be careful when making such predictions: apparent causality may be an illusion or a false relationship that misleads the users
- The estimation function can be determined by experience, using a priori knowledge or visual observation of the data
- One then needs to calculate the undetermined coefficients of the function by using some error criteria

24 Regression Analysis (cont.)
- The regression method can be applied to classify data by predicting the category tag of the data
- Regression analysis tries to understand how the value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held unchanged
- The independent variables are the inputs of the regression process, aka the predictors
- The dependent variable is the output of the process
- Regression analysis estimates the average value of the dependent variable when the independent variables are fixed
- The estimated value is a function of the independent variables known as the regression function, which can be described by a probability distribution

25 Regression Analysis (cont.)
- Most regression methods are parametric in nature, with a finite dimension in the analysis space
- Nonparametric regression may be infinite-dimensional
- Accuracy or performance depends on the quality of the data set used, which is related to the data generation process and the underlying assumptions made
- Regression offers estimation of continuous response variables, as opposed to the discrete decision values used in classification, which demand higher accuracy

26 Regression Analysis (cont.)
- In the formulation of a regression process, the unknown parameters are often denoted as β, which may appear as a scalar or a vector
- The independent variables are denoted by a vector X and the dependent variable by Y
- When multiple dimensions are involved, these parameters are vectors in form
- A regression model establishes the approximated relation between X, β, and Y: Y ≈ f(X, β)
- The function f(X, β) is approximated by the expected value E(Y|X)

27 Regression Analysis (cont.)
- The regression function f is based on knowledge of the relationship between a continuous variable Y and a vector X
- If no such knowledge is available, an approximate and handy form is chosen for f
- Example: measuring the height after tossing a small ball in the air
  - Measure its height of ascent h at various time instants t
  - The relationship is modeled as h = β1 t + β2 t² + ε
  - β1 determines the initial velocity of the ball
  - β2 is proportional to standard gravity
  - ε is due to measurement errors

28 Regression Analysis (cont.)
- Linear regression is used to estimate the values of β1 and β2 from the measured data (a least-squares fitting sketch for this example follows this slide)
- This model is nonlinear with respect to the time variable t, but it is linear with respect to the parameters β1 and β2
- Now consider k components in the vector of unknown parameters β
- Three situations relate the inputs to the outputs, depending on the relative magnitude between the number N of observed data points of the form (X, Y) and the dimension k of the sample space
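A minimal Python sketch of this least-squares fit, using hypothetical measurements (the times and heights below are made-up numbers, not data from the text):

    import numpy as np

    t = np.array([0.1, 0.2, 0.3, 0.4, 0.5])        # measurement times (hypothetical)
    h = np.array([0.95, 1.76, 2.48, 3.02, 3.44])    # measured heights (hypothetical)

    A = np.column_stack([t, t**2])                  # design matrix with columns t and t^2
    beta, *_ = np.linalg.lstsq(A, h, rcond=None)    # least-squares estimate of (beta1, beta2)
    print(beta)                                     # beta[0] ~ initial velocity, beta[1] ~ -g/2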

29 Regression Analysis (cont.)
- When N < k, the defining equation is underdetermined: there is not enough data to recover the unknown parameters β
- When N = k and the function f is linear, the equation Y = f(X, β) can be solved exactly without approximation
  - There are N equations to solve for the N components of β
  - The solution is unique as long as the X components are linearly independent
  - If f is nonlinear, many solutions may exist, or no solution at all
- In general, when there are N > k data points, there is enough information in the data to estimate a unique value for β under an overdetermined situation
- The measurement errors εi are assumed to follow a normal distribution

30 Regression Analysis (cont.)
- There is an excess of information contained in the (N - k) extra measurements, known as the degrees of freedom of the regression
- Basic assumptions of regression analysis under various error conditions:
  - The sample is representative of the data space involved
  - The error is a random variable with a mean of zero conditioned on the input variables
  - The independent variables are measured with no error
  - The predictors are linearly independent
  - The errors are uncorrelated

31 Regression Analysis (cont.)
  - The variance of the error is constant across observations; if not, a weighted least squares method is needed
- Regression analysis is a statistical method that determines the quantitative relation, in a machine learning process, among two or more variables that depend on each other
- It includes linear regression and nonlinear regression

32 Linear Regression
- Unitary linear regression analysis
  - Only one independent variable and one dependent variable are included in the analysis
  - The approximate relation between the two can be represented by a straight line
- Multivariate linear regression analysis
  - Two or more independent variables are included in the regression analysis
  - There is a linear relation between the dependent variable and the independent variables
- The model of a linear regression is y = f(X), where X = (x1, x2,⋯, xn) with n ≥ 1 is a multidimensional vector and y is a scalar variable

33 Linear Regression (cont.)
- A linear predictor function is used to estimate the unknown parameters from data
- Linear regression is applied mainly in two areas
  - As an approximation process for prediction, forecasting, or error reduction
    - Predictive linear regression models are fitted to an observed data set of y and X values
    - The fitted model makes a prediction of the value of y for a future unknown input vector X
  - To quantify the strength of the relationship between the output y and each input component Xj
    - Assess which Xj is irrelevant to y and which subsets of the Xj contain redundant information about y

34 Unitary Linear Regression
- Linear regression models can be fitted with a least squares approach
- Consider a set of data points in a 2D sample space, (x1, y1), (x2, y2), ..., (xn, yn), mapped into a scatter diagram
- Suppose they can be covered approximately by a straight line: y = ax + b + ε
  - x is the input variable, y is the output variable in the real number range, and a and b are coefficients
  - ε is a random error that follows a normal distribution
- One needs to work out the expectation by using the linear regression expression y = ax + b

35 Unitary Linear Regression (cont.)
- The residual error of a unitary model is the difference between an observed value yi and the value predicted by the fitted line
- The approximation is shown by a straight line running through the middle or center of all data points in the data space
- Working out this line is the main task of regression analysis

36 Unitary Linear Regression (cont.)
- The coefficients a and b are estimated from observations of n groups of input samples
- The common method is the least squares method
- The objective function is Q(a, b) = Σi (yi - a xi - b)²
- To minimize this sum of squares, calculate the partial derivatives of Q with respect to a and b and set them to zero

37 Unitary Linear Regression (cont.)
- Setting the partial derivatives to zero yields the estimates
  - a = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)² and b = ȳ - a x̄
  - x̄ and ȳ are the mean values of the input variable and the dependent variable, respectively
- After working out the specific expression for the model, one needs to know its fitting degree to the data set
- If the expression can express the relation between the two variables, it can be used in actual predictions
- To assess the fit, figure out the estimated value ŷi = a xi + b of the dependent variable for each sample in the training data set

38 Unitary Linear Regression (cont.)
- A common measure of fit is the coefficient of determination R² = 1 - Σi (yi - ŷi)² / Σi (yi - ȳ)²
- The closer R² is to 1, the better the fitting degree; the further R² is from 1, the worse the fitting degree
- Linear regression can also be used for classification
  - In the unitary case it is only used for a binary classification problem, deciding between the two classes
  - For multivariate linear regression, the method can likewise be applied to classify a data set
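A minimal sketch of the unitary least-squares fit and the R² computation described above, with hypothetical (x, y) samples:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # hypothetical inputs
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # hypothetical outputs

    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # slope estimate
    b = y_bar - a * x_bar                                              # intercept estimate

    y_hat = a * x + b
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)       # coefficient of determination
    print(a, b, r2)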

39 Multiple Linear Regression
- In solving actual problems, one often encounters many variables
  - e.g., The scores of a student may be influenced by factors like earnestness in class, preparation before class, and review after class
  - e.g., The health of a person is influenced not only by the environment but also by dietary habits
- The unitary linear regression model is not suited to many such conditions
- It is improved with a model of multivariate linear regression analysis
- Consider the case of m input variables
- The output is expressed as a linear combination of the input variables: y = β0 + β1 x1 + ⋯ + βm xm + ε

40 Multiple Linear Regression (cont.)
- β0, β1,⋯, βm and σ² are unknown parameters
- ε complies with a normal distribution with mean value 0 and variance equal to σ²
- Taking the expectation of this structure gives the multivariate linear regression equation, with y substituted for E(y)
- Its matrix form is y = Xβ, where X = [1, x1,⋯, xm] and β = [β0, β1,⋯, βm]^T
- The goal is to compute the coefficients by minimizing the scalar objective function Q(β) = Σi (yi - Xi β)²

41 Multiple Linear Regression (cont.)
- The objective function is defined over the n sample data points
- To minimize Q, set the partial derivative of Q with respect to each βi to zero
- Solving these equations gives the multiple linear regression estimate β̂ = (X^T X)⁻¹ X^T y

42 Multiple Linear Regression (cont.)
- Multivariate regression is an expansion and extension of unitary regression; the two are identical in nature
- The range of applications is different
  - Unitary regression has limited applications
  - Multivariate regression is applicable to many real-life problems
- Example: estimate the density of the pollutant nitric oxide (NO) at a spotted location
  - The goal is estimation of the density of NO gas, an air pollutant, at an urban location
  - Vehicles discharge NO gas during their movement

43 Multiple Linear Regression (cont.)
- This creates a pollution problem proven harmful to human health
- The NO density is attributed to four input variables: vehicle traffic, temperature, air humidity, and wind velocity
- 16 data points were collected at various observed spotted locations in the city
- Apply the multiple linear regression method to estimate the NO density
- The testing location is measured with a data vector of {1436, 28.0, 68, 2.00} for the four features {x1, x2, x3, x4}, respectively
- Let Xn = [1, xn1, xn2, xn3, xn4]^T and the weight vector W = [b, β1, β2, β3, β4]^T for n = 1, 2,…, 16

44 Multiple Linear Regression (cont.)
[Table: the 16 observed data points, each with vehicle traffic x1, temperature x2, humidity x3, wind velocity x4, and the measured NO density y]

45 Multiple Linear Regression (cont.)
- e.g., For the first row of training data, [1300, 20, 80, 0.45, 0.066], X1 = [1, 1300, 20, 80, 0.45]^T, which gives the output value y1 = 0.066
- We need to compute the W = [b, β1, β2, β3, β4]^T that minimizes the mean square error
- The 16 × 5 matrix X is obtained directly from the sample data table
- y = [0.066, 0.005,…, 0.039]^T is the given column vector of data labels

46 Multiple Linear Regression (cont.)
- To make the prediction on the testing sample vector x = [1, 1436, 28.0, 68, 2.00]^T, substitute the weight vector obtained
- The final answer is {β1 = 0.029, β2 = 0.015, β3 = 0.002, β4 = −0.029, b = 0.070}
- The NO gas density is predicted as ŷ ≈ 0.065, or 6.5%
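A minimal sketch of this least-squares computation. The full 16-row table is not reproduced in this transcript, so the matrix below holds only two illustrative rows (the first training row from the text plus a made-up placeholder); with the real 16 × 5 matrix the same call would return the weight vector quoted above:

    import numpy as np

    X = np.array([[1, 1300, 20, 80, 0.45],        # first training row from the text
                  [1, 1250, 22, 75, 0.60]])       # placeholder row (hypothetical)
    y = np.array([0.066, 0.070])                  # labels for the rows above (second is hypothetical)

    W, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares solution of y = XW
    x_test = np.array([1, 1436, 28.0, 68, 2.00])  # testing location from the text
    print(x_test @ W)                             # predicted NO density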

47 Logistic Regression Method
- The linear regression analysis model can be extended to broader applications for prediction and classification
- Logistic regression is commonly used in fields like data mining, automatic diagnosis of diseases, and economic prediction
- The logistic model may only be used to solve problems of dichotomy (binary classification)
- The principle is to classify sample data with a logistic function
- The logistic function, known as a sigmoid function, is g(z) = 1 / (1 + e^(-z))

48 Logistic Regression Method (cont.)
- The input domain of the sigmoid function is (-∞, +∞) and the range is (0, 1)
- The sigmoid function serves as a probability function for the sample data
- The basic idea of logistic regression: by using an intermediate feature z of the sample, the sample data are pushed toward the two ends of the sigmoid curve, so they can be divided into two classes

49 Logistic Regression Method (cont.)
- Consider a vector X = (x1,⋯, xm) with m independent input variables
- Each dimension of X stands for one attribute (feature) of the sample data (training data)
- Multiple features of the sample data are combined into one intermediate feature z through a linear combination of the attributes
- One then figures out the probability of the feature for the designated data by applying the sigmoid function to that feature

50 Logistic Regression Method (cont.)
- When combining multiple features into one feature, make use of a linear function
- The coefficients of the linear function, i.e., the feature weights of the sample data, need to be determined
- Maximum likelihood estimation is adopted to transform this into an optimization problem
- The coefficients are determined through the optimization method
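A minimal sketch of this procedure in Python, using plain gradient descent on the log-likelihood instead of a dedicated maximum-likelihood solver (the data points are hypothetical):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0.5, 1.0], [1.5, 1.8], [3.0, 3.2], [3.5, 4.0]])   # hypothetical samples
    y = np.array([0, 0, 1, 1])                                        # binary labels
    Xb = np.hstack([np.ones((len(X), 1)), X])                         # prepend a bias column

    w = np.zeros(Xb.shape[1])
    for _ in range(5000):                      # batch gradient descent
        p = sigmoid(Xb @ w)                    # predicted probabilities
        w -= 0.1 * Xb.T @ (p - y) / len(y)     # gradient of the negative log-likelihood

    print(sigmoid(np.array([1.0, 2.0, 2.5]) @ w))   # probability that a new sample is in class 1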

51 Decision Trees for Machine Learning
- The goal is to create a model that predicts the value of an output target variable at the leaf nodes of the tree, based on several input variables or attributes at the root and interior nodes of the tree
- Decision trees used for classification are known as classification trees
- In a classification tree
  - Leaves represent class labels
  - Branches represent conjunctions of attributes that lead to the class labels
  - The target variable (output) can take two values, such as yes or no

52 Decision Trees for Machine Learning (cont.)
  - Or multiple discrete values, like outcomes 1, 2, 3, or 4 of an event
- The arcs from a node are labeled with each of the possible values of the attribute
- Each leaf of the tree is labeled with a class or a probability distribution over the classes
- Decision trees where the target variable assumes continuous values, like real numbers, are called regression trees
- Decision trees follow a multilevel tree structure to make decisions at the leaf nodes of the tree
  - e.g., Decide whether or not to go out to play tennis under various weather conditions

53 Decision Trees for Machine Learning (cont.)
- The weather conditions are indicated by three attributes: outlook, humidity, and windy
- The outlook is checked at the root, with three possible outgoing arcs marked as sunny, overcast, or rain
- The humidity node fans out to two arcs, labeled > 70 or not
- The windy values are simply true or false
- To traverse this tree, one starts from the root to a leaf along a path of one or two levels
- Inside each tree node, the target counts are given to determine the probability when a leaf node is reached
  - e.g., If the outlook value is overcast, one reaches a leaf node with a probability of 4/5 to play tennis

54 Decision Trees for Machine Learning (cont.)
- If the outlook is sunny and the humidity is below 70, one reaches the leaf node at the extreme left with a probability of 5/8 to play tennis
- One can also reach other leaf nodes with different probabilities; each leaf yields a simple prediction decision
- The target value can be class labels like yes or no

55 Decision Tree Learning
- The effectiveness of the decision tree depends on the root chosen, i.e., the first attribute used for splitting out multiple choices
- The successive attribute order applied may lead to entirely different tree topologies
- The goal is to cover all correct paths for all labeled sample data provided
- The tree must be able to predict accurately for all future testing data
- Assume all features have finite discrete domains
- A tree can be learned by splitting the data set into subsets based on an attribute value test

56 Decision Tree Learning (cont.)
- This process is repeated on each derived subset in a recursive manner
- The process completes when the subset at a node belongs to the same class
- This greedy algorithm of top-down induction of decision trees (TDIDT) is a common strategy for learning decision trees from input data
- Three greedy top-down methods, ID3, C4.5, and CART, are used for constructing decision trees
  - C4.5 is the improved successor of ID3
  - A regression tree is used if the predicted outcome is continuous, like a real number
  - The CART method combines Classification And Regression in the Tree construction

57 ID3 Algorithm Tagging
- The ID3 algorithm takes the information gain of an attribute as the measure and splits on the attribute with the largest information gain
- The aim is to make the output partition on each branch belong to the same class as far as possible
- An entropy function is applied to select the most information-rich attribute as the root to grow successive nodes in the decision tree
- The measure behind information gain is entropy, which depicts the purity of any example set
- Given a training set S of positive and negative examples, the entropy of S is Entropy(S) = -p+ log2 p+ - p- log2 p-

58 ID3 Algorithm Tagging (cont.)
- p+ represents the proportion of positive examples and p- the proportion of negative examples
- If the attribute possesses m different values, the entropy of S relative to the classifications of m classes is Entropy(S) = -Σi pi log2 pi
- This measure standard leads to the information gain
- The information gain of an attribute is the decrease of expected entropy caused by the segmented examples
- The gain of an attribute A in set S is Gain(S, A) = Entropy(S) - Σ_{v∈V(A)} (|Sv| / |S|) Entropy(Sv)
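A minimal sketch of the entropy and information-gain computations defined above, evaluated on a small hypothetical label/attribute pairing:

    import numpy as np
    from collections import Counter

    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))                    # Entropy(S) = -sum pi log2 pi

    def information_gain(labels, attribute_values):
        total, n, cond = entropy(labels), len(labels), 0.0
        for v in set(attribute_values):
            subset = [l for l, a in zip(labels, attribute_values) if a == v]
            cond += len(subset) / n * entropy(subset)     # sum of |Sv|/|S| * Entropy(Sv)
        return total - cond                               # Gain(S, A)

    labels = ['yes', 'yes', 'no', 'no', 'yes']            # hypothetical class labels
    outlook = ['sunny', 'overcast', 'sunny', 'rain', 'overcast']
    print(information_gain(labels, outlook))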

59 ID3 Algorithm Tagging (cont.)
- V(A) is the range of values of A, S is the sample set, and Sv is the subset of samples whose value of A equals v
- Example: bank loan approval using a decision tree with training data
  - A credit card enables the cardholder to borrow money or pay for a purchase from the card-issuing bank
  - The bank expects a customer to pay it back by a given deadline
  - The bank keeps statistics of customer payback records
  - Consider three cardholder attributes as the input variables to the decision process: gender, age, and income

60 ID3 Algorithm Tagging (cont.)
- Credit cardholder data is taken from a card-issuing bank
- Use these sample data points with labeled decisions to construct a decision tree
- The tree predicts if a customer will pay in a timely manner, or reveals the probability of doing so

61 ID3 Algorithm Tagging (cont.)
- Apply the decision tree obtained to a testing customer characterized by the data vector [Gender: female, Age: 26 ~ 40, Income: middle level]
- At the root level, compute the conditional entropy for each candidate split attribute: gender, age, and income

62 ID3 Algorithm Tagging (cont.)
- Select the attribute with the lowest conditional entropy as the root
- Here, the root could be either age or income
- This procedure is repeated at the next level of the tree until all attributes are exhausted and the tree covers the entire sample data set
- Finally we obtain two decision trees
- At each level, each training data item gets branched to a different leaf node or waits for the next attribute to split

63 ID3 Algorithm Tagging (cont.)
- Both trees have at most three levels before reaching the decision leaves
- Both trees have equal search costs
- For an optimal solution, choose the shortest decision tree with the minimum number of levels

64 ID3 Algorithm Tagging (cont.)
- Considering the testing customer data [Gender: female, Age: 26~40, Income: middle level], neither of the two trees leads to a unique solution
- This implies the tree has an over-fitting situation: it is heavily biased toward the sample data provided
- A solution is to shorten the trees by computing the probability at a shortened leaf node, based on a majority vote of the customers falling into that category
  - e.g., Stop the tree construction at the root level by splitting the outgoing edges into three paths corresponding to the three possible income choices from the root, i.e., high, middle, low

65 ID3 Algorithm Tagging (cont.)
- Or produce the shortened tree with four possible leaf nodes pointed to by age values (> 40, 26-40, 15-25, < 15), respectively
- The leaf nodes are marked with probability values
- Using either of the shortened decision trees, the testing customer ends up with a yes prediction: the customer will pay back in a timely manner
- The two trees end up with different probability values predicted for a yes vote

66 Bayesian Classifier with Training Samples
- Naive Bayes and Bayesian networks improve the accuracy of classification used in medical, financial, and many other fields
- Consider a pair of random variables X and Y
  - Their joint probability is P(X, Y) = P(Y|X) × P(X) = P(X|Y) × P(Y)
- From this one can compute the inverse conditional probability, the well-known Bayes theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
- During classification, the random variable Y is the class to be decided

67 Bayesian Classifier with Training Samples (cont.)
- X is the attribute set
- We need to compute the class probability P(Y|X0), given the attribute vector X0 for a testing data item
- The Y with the maximum value of P(Y|X0) corresponds to the class for testing data X0
- Consider an attribute vector X = (X1, X2,⋯, Xk) and l possible values (or classes) for the random variable Y = {Y1, Y2,..., Yl}
- P(Y|X) is the posterior probability of Y and P(Y) is the prior probability of Y
- Assuming all attributes are statistically independent, the conditional probability is computed as P(X|Y) = Πi P(Xi|Y)

68 Bayesian Classifier with Training Samples (cont.)
- The naive Bayesian classifier calculates the posterior probability for each class Y by P(Y|X) = P(Y) Πi P(Xi|Y) / P(X)
- The Bayesian classification method predicts X to belong to the class with the highest posterior probability
- Compute the posterior probability P(Yi|X), i = 1, 2, ..., l, for each combination of X and Y
- Then decide Yr by finding Yr = argmax_i P(Yi|X), and classify X to class Yr
- As P(X) is the same for all classes, it is sufficient to find the maximum of the numerator

69 Bayesian Classifier with Training Samples (cont.)
- So only the numerator P(Y) Πi P(Xi|Y) needs to be computed
- Example: Bayesian classifier and analysis of classification results
  - The training data is a set of animals
  - Each data item can be labeled as mammal or non-mammal, but not both
  - Each data item is characterized by four independent attributes A = <A1, A2, A3, A4> = <gives birth, can fly, lives in water, has legs>
  - The task is to build a Bayesian classifier model from the training set

70 Bayesian Classifier with Training Samples (cont.)
[Table: the training set of 20 animals, each listed with the four attributes (gives birth, can fly, lives in water, has legs) and its class label, mammal or non-mammal]

71 Bayesian Classifier with Training Samples (cont.)
- The model will be applied to classify any unlabeled animal as either a mammal (M) or a non-mammal (N)
- The attribute A3 (lives in water) means the animal primarily lives in water, not that it just occasionally swims in the water
  - The value "sometimes in water" is treated as a no entry
- Compute the prior probabilities: P(M) = 7/20 and P(N) = 13/20
- An unlabeled testing data item is characterized by the attribute vector A* = <A1, A2, A3, A4> = <yes, no, yes, no>
- First, calculate the posterior probability values; here P(M|A*) > P(N|A*)
- This creature with attribute vector A* is therefore detected as a mammal

72 Bayesian Classifier with Training Samples (cont.)
- A creature that gives birth, cannot fly, lives in water, and has no legs is classified as a mammal
- To analyze the accuracy of the Bayesian classifier, test four creatures using the above method
- Obtain the posterior probabilities P(M|A1, A2, A3, A4) and P(N|A1, A2, A3, A4) for each testing animal

73 Bayesian Classifier with Training Samples (cont.)
- Choose the class with the highest probability as the predicted class
- Comparing the predicted results with the actual classes reveals four possible prediction statuses
  - TP (true positive): a positive case correctly predicted as positive
  - TN (true negative): a negative case correctly predicted as negative
  - FP (false positive): a negative case incorrectly predicted as positive
  - FN (false negative): a positive case incorrectly predicted as negative
- For the four test creatures: TP = 2/4 = 0.5, TN = 1/4 = 0.25, FP = 0, and FN = 1/4 = 0.25

74 Bayesian Classifier with Training Samples (cont.)
- Use two performance metrics to assess the accuracy of the Bayesian classifier
  - Prediction accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.75
  - Prediction error = (FP + FN) / (TP + TN + FP + FN) = 0.25
- The residual error comes from the weak assumption that all attributes are independent

75 Bayesian Classifier with Training Samples (cont.)
- The larger the training set, and the better it covers all possible attribute vectors, the higher the prediction accuracy
- If any individual conditional probability P(Ai|C) = Nic / Nc = 0, due to Nic = 0 in the training data set, the entire posterior probability for that class becomes zero
- This can be avoided by assuming an offset value: P(Ai|C) = (Nic + 1) / (Nc + c), which equals 1 / (Nc + c) when Nic = 0
- Here c is the number of classes being considered
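A minimal sketch of a naive Bayesian classifier over categorical attributes with an add-one correction like the one just described; the rows below are hypothetical stand-ins for the animal table, and the smoothing denominator here uses the number of distinct attribute values (a common variant of the offset above):

    from collections import Counter, defaultdict

    def train(rows, labels):
        prior = Counter(labels)                          # class counts Nc
        cond = defaultdict(Counter)                      # (attribute index, class) -> value counts Nic
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                cond[(i, c)][v] += 1
        return prior, cond

    def predict(row, prior, cond, classes, n_values=2):
        total = sum(prior.values())
        scores = {}
        for c in classes:
            p = prior[c] / total                                       # prior P(C)
            for i, v in enumerate(row):
                p *= (cond[(i, c)][v] + 1) / (prior[c] + n_values)     # smoothed P(Ai|C)
            scores[c] = p
        return max(scores, key=scores.get)                             # class with highest posterior

    rows = [('yes', 'no', 'yes', 'no'), ('no', 'yes', 'no', 'yes'), ('yes', 'no', 'no', 'yes')]
    labels = ['M', 'N', 'M']                                           # hypothetical labels
    prior, cond = train(rows, labels)
    print(predict(('yes', 'no', 'yes', 'no'), prior, cond, ['M', 'N']))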

76 Support Vector Machines (SVM)
- Support vectors offer another approach to classifying multidimensional data sets
  - e.g., One can use a straight line to separate points in a 2D space, planes to separate points in a 3D space, and a hyperplane to separate points in a high-dimensional space
- The points in the same region are regarded as one class
- SVMs can be used to solve such classification issues
- The samples on the margin are called the support vectors
- The original problem may be stated in a finite-dimensional space

77 Support Vector Machines (SVM) (cont.)
- It often happens that the sets to discriminate are not linearly separable in that space
- The original finite-dimensional space can be mapped into a much higher-dimensional space, presumably making the separation easier in that space
- A hyperplane can then be used to separate these points in the high-dimensional space

78 Linear Decision Boundary
- Consider a 2D plane with two kinds of data that are linearly separable
- Infinitely many straight lines can be adopted to separate them
- How do you find the best line, i.e., the one with the minimum classification error?

79 Linear Decision Boundary
- Consider two-class problems in an n-dimensional space
- The two classes are separated by an (n-1)-dimensional hyperplane
- The data points are (X1, y1),…, (X|D|, y|D|), where Xi is an n-dimensional training sample with class label yi
- Each yi takes the value +1 for one class or -1 for the other class
- The (n-1)-dimensional hyperplane is w^T x + b = 0, where w and b are the parameters
- It corresponds to a straight line in the 2D plane

80 Linear Decision Boundary (cont.)
- The hyperplane is intended to separate the two kinds of data, i.e., all the yi corresponding to data points on one side of the hyperplane are -1, and +1 on the other side
- Let f(x) = w^T x + b
  - Points with f(x) > 0 correspond to data points with y = 1
  - Points with f(x) < 0 correspond to data points with y = -1
- Example: classification using a support vector machine with training samples, for given 2D data

81 Linear Decision Boundary (cont.)
- One straight line, 2x1 + x2 - 3 = 0, can be found to separate the data in the table
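A minimal sketch of how the quoted line acts as the decision function f(x) = w^T x + b, evaluated on two hypothetical test points:

    import numpy as np

    w = np.array([2.0, 1.0])      # from 2*x1 + x2 - 3 = 0
    b = -3.0

    for x in [np.array([0.5, 0.5]), np.array([2.0, 2.0])]:
        f = w @ x + b
        print(x, '->', 1 if f > 0 else -1)    # sign of f(x) gives the predicted class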

82 Maximal Margin Hyperplane
- Consider the squares and circles nearest to the decision boundary
- By adjusting parameters w and b, two parallel hyperplanes H1: w^T x + b = 1 and H2: w^T x + b = -1 can be obtained

83 Maximal Margin Hyperplane (cont.)
- The margin of the decision boundary is the distance between those two hyperplanes
- To calculate the margin, let x1 be a data point on H1 and x2 a data point on H2
- Inserting x1 and x2 into the hyperplane formulas gives w^T x1 + b = 1 and w^T x2 + b = -1
- Subtracting the formulas yields w^T (x1 - x2) = 2, so the margin is d = 2 / ||w||

84 Formal SVM Model
- The training phase of SVM includes estimation of the parameters w and b from the training data
- The selected parameters must satisfy w^T xi + b ≥ 1 if yi = 1, and w^T xi + b ≤ -1 if yi = -1
- Those two inequalities can be combined as yi (w^T xi + b) ≥ 1 for all i
- Maximization of the margin is equivalent to minimization of the objective function f(w) = ||w||² / 2

85 Formal SVM Model (cont.)
- The SVM is obtained by finding the minimum of the objective function
- This is a convex optimization problem: the objective function is quadratic and the constraint conditions are linear
- It can be solved through the standard Lagrange multiplier method

86 Non-linear Hyperplanes
- The model may need to be adjusted when the samples are not linearly separable
- An outlier (or noise) may make the sample space linearly inseparable
- A slack variable is introduced to handle this case
- With no restriction on misclassified samples near the boundary, the learning algorithm may find a boundary with a wider margin by allowing many misclassified training samples

87 Non-linear Hyperplanes (cont.)
- The objective function can be modified as follows to penalize boundaries with huge slack variable values: f(w) = ||w||² / 2 + C (Σi ξi)^k
- C and k are parameters designated by the user, representing the punishment of misclassified training instances
  - i.e., The more the outliers, the larger the objective function value
  - C is the weight given to the outliers
- The final model is obtained by minimizing this objective subject to yi (w^T xi + b) ≥ 1 - ξi, with ξi ≥ 0

88 Non-linear Hyperplanes (cont.)
- If no hyperplane can be found to separate the data, i.e., no feasible solution exists for the linear SVM, the linear SVM must be extended to a nonlinear SVM model
- Convert the input data to a space of higher dimension through a nonlinear mapping, and search for the separating hyperplane in the new space
  - e.g., When low-dimensional data is linearly inseparable, it can be mapped to a higher dimension where it becomes separable, for instance by using a Gaussian function

89 Non-linear Hyperplanes (cont.)
[Figure: linearly inseparable low-dimensional data becomes separable after the nonlinear mapping to a higher-dimensional space]

90 Clustering Methods without Labels
- Clustering tries to discover the hidden information in unlabeled data and the relationships among the data
- The common analytical method is the clustering method, one of the classical unsupervised learning methods
- It divides data into meaningful or useful groups, called clusters
- For data analysis, a cluster is a potential class, and clustering analysis is the technique for finding such classes automatically

91 Cluster Analysis for Prediction and Forecasting
- Cluster analysis assigns a set of observations to partition a data set into clusters, based on a Euclidean distance or similarity function
- It aims to separate data for classification purposes
- Data elements grouped into the same cluster are similar or have some common properties, according to a predefined similarity metric
- The clusters are separated by dissimilar features or properties
- Other clustering methods are based on estimated density and graph connectivity

92 Cluster Analysis for Prediction and Forecasting (cont.)
- Cluster analysis is a process that divides data objects into clusters
  - X is the set of n data objects, and the clusters are disjoint subsets of X, with each object Xi assigned a cluster label
- Example: cluster analysis of hospital exam records
  - The physical examination groups are divided into a conformity group and a nonconformity group, based on clustering of characteristics
  - The nonconformity group may be divided into subgroups, e.g., with hyperlipidemia or with heart disease

93 Cluster Analysis for Prediction and Forecasting (cont.)
- The group with hyperlipidemia is further divided into high-risk and low-risk subgroups

94 Cluster Analysis for Prediction and Forecasting (cont.)
- The difference between clustering and classification: a clustering-based division is uncertain
- From the perspective of machine learning, clustering is an unsupervised learning process with a constant search for clusters; the cluster labels are not given by the user in advance
- Classification is a supervised learning process that divides existing objects into different groups with various labels
- How to cluster a given data set requires the design of specific algorithms
- K-means clustering is a basic clustering method

95 K-Nearest Neighbor (kNN) Clustering
- The k-nearest neighbor (kNN) method is a simple method that is easy to implement
- A shortcoming of the kNN algorithm is that it is sensitive to the local structure of the data space; this is amended by the more sophisticated k-means clustering method
- kNN is a kind of lazy learning, or a type of instance-based learning
  - The objective function is only approximated locally, with classification deferred
- The input data element is considered against the k closest training examples in the feature space
- The output depends on whether kNN is used for classification or for regression

96 K-Nearest Neighbor (kNN) Clustering (cont.)
- For kNN classification
  - An object is classified by a majority vote of its neighbors
  - The element is assigned to the class most common among its k nearest neighbors
- For kNN regression
  - The output is the property value for the data object: the average of the values of its k nearest neighbors
  - The contribution of each neighbor may be weighted, with nearer neighbors weighted more heavily
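A minimal sketch of kNN classification by majority vote over Euclidean distances, with hypothetical 2D training points:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        d = np.linalg.norm(X_train - x, axis=1)                  # distances to all training samples
        nearest = np.argsort(d)[:k]                              # indices of the k closest neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote

    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.3, 2.9]])
    y_train = np.array(['A', 'A', 'B', 'B'])
    print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))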

97 K-means Clustering for Classification
- K-means groups a large set of unrelated data elements (or vectors) into k clusters
- Assume the data set D contains n objects in Euclidean space
- The objects in D are divided into k clusters C1, C2,⋯, Ck such that, for 1 ≤ i, j ≤ k, Ci ⊂ D and Ci ∩ Cj = ∅
- It is necessary to evaluate the quality of the division by defining an objective function, with the goal of high similarity within a cluster and low inter-cluster similarity
- To represent a cluster more visually, the centroid of a cluster represents the cluster

98 K-means Clustering for Classification (cont.)
- Let ni denote the number of elements in cluster Ci, x the vector coordinate of a cluster element, and ci the centroid of Ci, i.e., ci = (1/ni) Σ_{x∈Ci} x
- Use d(x, y) to denote the Euclidean distance between two vectors
- The objective function is the error sum of squares of all objects in data set D to the centroid of their cluster: E = Σ_{i=1}^{k} Σ_{x∈Ci} d(x, ci)²
- The objective of k-means clustering: for a given data set and a given k, find a group of clusters C1, C2,⋯, Ck that minimizes the objective function E

99 K-means Clustering for Classification (cont.)
- K-means clustering is implemented with an iterative refinement technique, also known as Lloyd's algorithm
- Let S be the set of n data elements and Si the i-th cluster subset
- The clusters Si, for i = 1, 2, … k, are disjoint subsets of S forming a partition of the data set S
- Given an initial set of k means, the algorithm proceeds by alternating between an assignment step and an update step

100 K-means Clustering for Classification (cont.)
- Assignment step: assign each observation to the cluster whose mean (the centroid of set Si at step t, t = 1, 2,…) yields the least within-cluster sum of squares (WCSS), i.e., the smallest squared Euclidean distance
- Update step: calculate the new means at time step t+1 as the centroids of the observations in the new clusters
  - The arithmetic mean minimizes the WCSS
- The algorithm converges when the assignments no longer reduce the WCSS
- Both steps optimize the WCSS objective

101 K-means Clustering for Classification (cont.)
- Only a finite number of iterations is needed to yield the final partitioning
- The algorithm must converge to a (local) optimum
- The idea is to assign data objects to the nearest cluster by distance and to construct the clusters iteratively
- A typical run shows the initial choice of centroids and four steps to build three clusters out of 15 data points

102 K-means Clustering for Classification (cont.)
- k initial means are randomly generated within the data domain (in this case k = 3)
- k clusters are created by associating each data point with the nearest mean; the partitions correspond to the Voronoi diagram generated by the means
- The centroid of each of the k clusters becomes the new mean
- Steps 2 and 3 are repeated until convergence has been reached
- Example: using k-means clustering to classify iris flowers, solving an iris flower classification problem with k-means clustering for k = 3 clusters

103 K-means Clustering for Classification (cont.)
- Given a data set of 150 unlabeled data points on iris flowers
- These flowers are classified into three clusters: Iris setosa, Iris versicolour, and Iris virginica
- There are four features in this data set: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm
- Only the data points with the last two, most important, features are considered

104 K-means Clustering for Classification (cont.)
- Identify the clustering centers (centroids) in successive steps
- The final means are reached after two iterations

105 K-means Clustering for Classification (cont.)
- The k-means clustering was shown in R code on the original slide (the listing is not reproduced here; a sketch follows below)
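Since the R listing is not preserved in this transcript, here is a minimal Python sketch of the same Lloyd iteration, seeded with the three initial means quoted on the next slide (the six data points are hypothetical):

    import numpy as np

    def kmeans(data, means, iters=10):
        for _ in range(iters):
            d = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)   # point-to-mean distances
            assign = d.argmin(axis=1)                                          # assignment step
            new_means = np.array([data[assign == i].mean(axis=0) for i in range(len(means))])
            if np.allclose(new_means, means):                                  # assignments stable: converged
                break
            means = new_means                                                  # update step
        return means, assign

    data = np.array([[1.4, 0.2], [1.5, 0.3], [4.1, 1.3], [4.4, 1.4], [5.6, 2.1], [5.7, 1.9]])
    init = np.array([[2.0, 0.8], [4.0, 1.6], [6.0, 2.5]])
    means, assign = kmeans(data, init)
    print(means, assign)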

106 K-means Clustering for Classification (cont.)
- The kmeans(inputData, 3) program is executed iteratively until convergence is reached
- Initially, choose three means that are separated as far apart as possible
  - The initial means are (2, 0.8), (4, 1.6), and (6, 2.5)
- Compute each point's distances to the current means and form three new clusters based on the minimum-distance criterion
- Then compute the new means (centroids) of the three current cluster memberships: (1.4, 0.24), (4.2, 1.35), and (5.62, 2.04)
- The means are the same after the 2nd iteration, so convergence is reached

107 Dimensionality Reduction
- In high-dimensional spaces, the Euclidean distance between any two vectors becomes nearly the same, and any two vectors tend to be almost orthogonal
- This causes difficulties in classification and regression, and especially in clustering
- The phenomenon is called the curse of dimensionality
- Many dimensionality reduction algorithms have been proposed to solve the problem
- They transfer the points in a high-dimensional space to a low-dimensional one via a mapping function, to relieve the curse of dimensionality

108 Dimensionality Reduction (cont.)
- Dimensionality reduction may not only reduce the correlation of the data, but may also accelerate the operation speed of the algorithm by decreasing the data volume

109 Dimensionality Reduction (cont.)
- The essence of dimensionality reduction is learning a mapping function f: x → y
  - x is the original data point in vector expression form
  - y is the low-dimensional vector expression after the data point mapping; generally, the dimension of y is lower than that of x
  - f may be explicit, implicit, linear, or nonlinear
- Most dimensionality reduction algorithms process vector expression data
- Some dimensionality reduction algorithms process high-order tensor expression data
  - Tensors describe linear relations between geometric vectors, scalars, and other tensors

110 Dimensionality Reduction (cont.)
  - e.g., The dot product and the cross product
- The reason for using dimensionality reduction: redundant information and noise in the original high-dimensional space may cause errors and reduce accuracy in practical applications like image identification
- Through dimensionality reduction, we hope to reduce the errors caused by redundant information, improve accuracy in recognition and other applications, and seek the essential structural characteristics of the data
- Linear dimensionality reduction methods include principal component analysis (PCA)

111 Principal Component Analysis
- In actual situations, an object has many property compositions
  - e.g., A medical examination report of the human body includes many physical examination items
- Each property is a reflection of the object, and there is more or less correlation among these properties
- The correlation causes information overlapping
- The high overlapping and correlation of property information (variables or characteristics) may pose many obstacles to the application of statistical methods and to data analysis
- Property dimensionality reduction is required to resolve the information overlapping

112 Principal Component Analysis (cont.)
- This may greatly reduce the number of variables participating in data modeling, without causing much information loss
- PCA is an analysis method widely used to effectively reduce variable dimension
- It is designed to transfer multiple variables to several principal components by dimensionality reduction
  - Each principal component may reflect most of the information of the original variables, and the information included is not repeated
  - Generally, each principal component is a linear combination of the original variables, and the principal components are mutually unrelated

113 Principal Component Analysis (cont.)
- The method may summarize the complex factors into several principal components, simplifying the problem and obtaining scientific and effective data information even when many variables are introduced
- There is information loss in PCA, due to the decrease in property dimensions
  - This lost information may be amplified during ML algorithm iteration, causing inaccurate conclusions
  - Careful considerations need to be taken into account when using PCA
- Principal components are new variables

114 Principal Component Analysis (cont.)
- They are formed comprehensively from the original variables and are called the first principal component, the second principal component, etc., according to the information volume in each component
- Several relations hold between principal components and original variables
  - The principal components retain most of the information of the original variables
  - The number of principal components is much smaller than the number of original variables
  - The principal components are mutually unrelated
  - Each principal component is a linear combination of the original variables

115 Principal Component Analysis (cont.)
- The purpose of PCA is to recombine the related variables into a group of new, unrelated comprehensive variables to replace the original variables
- The mathematical treatment takes linear combinations of the original variables as the new comprehensive variables
- The first comprehensive variable F1 needs to reflect as much information of the original variables as possible
- In PCA, information is measured by variance, i.e., if Var(F1) is larger, F1 contains more information
- The first comprehensive variable F1 is selected to have the largest variance among all linear combinations, and F1 is called the first principal component

116 Principal Component Analysis (cont.)
- If the first principal component is insufficient to represent the information of the p original variables, a second linear combination is selected to effectively reflect the original information
- The information already in F1 will not appear in F2, i.e., Cov(F1, F2) = 0; F2 is called the second principal component
- The 3rd, 4th, ..., pth principal components can be constructed in the same way
- For n evaluation samples (e.g., the persons undergoing physical examinations) and m evaluation indicators (e.g., height, weight, etc.), one may compose a matrix sized n×m, denoted as x = (xij)n×m

117 Principal Component Analysis (cont.)
- xi, i = 1, 2,⋯, m, is a column vector, and the matrix is called the evaluation matrix
- After obtaining the evaluation matrix, calculate the mean value and variance of the original sample data
  - The mean value and standard deviation are calculated for each column
- Calculate the standardized data: the evaluation matrix changes to the standardized matrix, with each entry replaced by (xij - x̄j) / sj, where x̄j and sj are the mean and standard deviation of column j

118 Principal Component Analysis (cont.)
- Apply the standardized matrix to calculate the correlation matrix (or covariance matrix) C = (cij)m×m of the evaluation indicators
  - C is a symmetric, positive semi-definite matrix with cij = Xi^T Xj ∕ (n-1)
- Calculate the characteristic values λ and characteristic vectors ξ of the correlation matrix
- Arrange the characteristic values in decreasing order: λ1 > λ2 > ⋯ > λm
- Then arrange the characteristic vectors corresponding to the characteristic values

119 Principal Component Analysis (cont.)
- Supposing the jth characteristic vector is ξj = (ξ1j, ξ2j,⋯, ξmj)^T, the jth principal component is Fj = ξ1j X1 + ξ2j X2 + ⋯ + ξmj Xm
- When j = 1, F1 is the first principal component
- According to the characteristic values of the correlation matrix, calculate the contribution rate η and the accumulative contribution rate Q of the principal components, where ηj = λj / (λ1 + ⋯ + λm)
- According to the contribution rate designated by the user, determine the number of principal components and obtain the principal components of the evaluation matrix
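A minimal sketch of the PCA steps above: standardize, form the correlation matrix, eigendecompose, and keep components until the accumulative contribution rate reaches a target (the data matrix and the 0.85 target below are hypothetical):

    import numpy as np

    X = np.array([[1.2, 200, 70], [0.9, 180, 65], [1.8, 240, 80], [1.1, 210, 72]], dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # standardize each column
    C = Xs.T @ Xs / (len(X) - 1)                             # correlation matrix c_ij = Xi^T Xj / (n-1)

    eigvals, eigvecs = np.linalg.eigh(C)                     # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]                        # sort characteristic values in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    contrib = eigvals / eigvals.sum()                        # contribution rate of each component
    n_pc = int(np.searchsorted(np.cumsum(contrib), 0.85)) + 1   # components needed to reach 85%
    F = Xs @ eigvecs[:, :n_pc]                               # principal component scores
    print(contrib, n_pc, F.shape)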

120 Principal Component Analysis (cont.)
- Generally, the contribution rate threshold is 0.85, 0.9, or 0.95; the level is chosen according to the specific scene
- Example: principal component analysis of patient data
  - The data set contains triglyceride, total cholesterol, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), age, weight, total protein, and blood sugar data
  - The data was obtained during physical examinations in a certain grade-A hospital of the second class in Wuhan City
  - Use PCA to determine the principal components of the examined persons and to achieve dimensionality reduction of the data

121 Principal Component Analysis (cont.)
- The data in each column reflects a different aspect of the persons undergoing the physical examination, and the indicator units differ
- The original data therefore needs to be standardized
- e.g., The indicator values of the No. 1 person are given as an example

122 Principal Component Analysis (cont.)
From the evaluation matrix, compute the correlation matrix Calculate the eigenvalues and eigenvectors of the correlation matrix, including the eigenvector corresponding to the first eigenvalue Compute the contribution rate of each principal component The designated contribution rate is 85%

123 Principal Component Analysis (cont.)
Use PCA to determine three principal components, where m = 8 is the number of physical examination indicators

124 Principal Component Analysis (cont.)
Interpretation of the three principal components Three principal components can be used to reflect the eight indicators in the physical examination data The information retention rate is 86.86% This greatly reduces the dimensionality of the data and facilitates data analysis in later stages
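
For reference, the same reduction can be reproduced with scikit-learn; this is a hedged sketch in which the patient matrix is assumed to be stored in a CSV file named physical_exam.csv (a hypothetical file name), and n_components=0.85 asks PCA to keep just enough components to reach the 85% contribution rate.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One row per person, eight physical-examination indicators per row
X = np.loadtxt("physical_exam.csv", delimiter=",")   # hypothetical file name

X_std = StandardScaler().fit_transform(X)             # standardize each indicator
pca = PCA(n_components=0.85)                          # keep >= 85% of the variance
F = pca.fit_transform(X_std)                          # principal component scores

print("components kept:", pca.n_components_)
print("contribution rates:", pca.explained_variance_ratio_)
print("retention rate:", pca.explained_variance_ratio_.sum())
```

With data like that described above, three components and a retention rate near 86.86% would be the expected outcome, matching the result reported in the text.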

125 Semi-Supervised Learning Methods
This approach offers a mixture of supervised and unsupervised learning The trainer is given an incomplete training data set with some of the target outputs missing Transduction is a special case of this principle, where the entire set of problem instances is known at learning time but part of the targets is missing The joint use of unlabeled data with a small amount of labeled data can improve learning accuracy Producing useful labeled data often demands domain expertise or a set of physical experiments

126 Semi-Supervised Learning Methods (cont.)
The cost associated with the labeling process often prevents the use of a fully labeled training set, so the use of partially labeled data is more practicable Semi-supervised learning is closer to human learning due to its ability to handle fuzziness Three basic assumptions underlie semi-supervised learning: to make any use of unlabeled data, some structure in the data distribution must be assumed, and different semi-supervised learning algorithms rely on different assumptions Smoothness assumption: sample data points close to one another are more likely to share a label

127 Semi-Supervised Learning Methods (cont.)
This assumption is also generally made in supervised learning, where it leads to a preference for geometrically simple decision boundaries In semi-supervised learning it yields a preference for decision boundaries in low-density regions Cluster assumption: the data tend to form discrete clusters, and points in the same cluster are more likely to share a label Data sharing a label may still be spread across multiple clusters This assumption is related to feature learning with clustering algorithms Manifold assumption: a manifold is a topological space that locally resembles Euclidean space

128 Semi-Supervised Learning Methods (cont.)
e.g., One-dimensional manifolds include lines and circles, but not figure eights The assumption is that the data lie on a manifold of much lower dimension than the input space Learning the manifold from both the labeled and the unlabeled data avoids the curse of dimensionality Semi-supervised learning can then proceed using distances and densities defined on the manifold The manifold assumption is practical when high-dimensional data sets are encountered e.g., The human voice is controlled by a few vocal cords, and facial expressions are controlled by a few muscles In these cases it is better to use distances and smoothness in the natural low-dimensional data space

129 Semi-Supervised Learning Methods (cont.)
Rather than in the space of all possible acoustic waves or images, respectively Advantage of semi-supervised machine learning: the influence of unlabeled data As an illustration, consider the decision boundary drawn after seeing only one white-circle example and one black-circle example, i.e., supervised learning from labeled data alone

130 Semi-Supervised Learning Methods (cont.)
A different decision boundary results when an extra collection of unlabeled gray circles is also given This can be viewed as performing clustering and then labeling the clusters with the labeled data Unlabeled data are used in unsupervised clustering or in SVM classification by pushing the decision boundary away from high-density regions, or by learning a 1D manifold on which the data reside
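
A small sketch of how unlabeled points can reshape the boundary, using scikit-learn's LabelSpreading (one of several graph-based semi-supervised methods; the two-moons data set here is synthetic and purely illustrative).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)

# Keep one labeled point per class; mark everything else as unlabeled (-1)
y = np.full_like(y_true, -1)
y[np.where(y_true == 0)[0][0]] = 0
y[np.where(y_true == 1)[0][0]] = 1

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)                         # uses both labeled and unlabeled points

# transduction_ holds the labels inferred for every training point
print("accuracy on all points:", (model.transduction_ == y_true).mean())
```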

131 Performance Metrics and Model-Fitting Cases
How to select the right model for ML Each ML algorithm has its own application potential Given a data set, the performance of one algorithm may be excellent while that of another may be the opposite Changing to a different data set may also change the conclusion drastically It is rather difficult to judge which algorithm is more suitable than the others in the general case, so it is critical to introduce common metrics for evaluating ML algorithms Some metrics reveal the relative merits of algorithms, while others help identify similar algorithms that are easier to implement

132 Performance Metrics of ML Algorithms
Three basic metrics are used to assess the performance of ML algorithms Accuracy: the most important criterion for evaluating ML performance, measured by testing on a data set Two failure cases exist: over-fitting and under-fitting The higher the performance exhibited on the tested data set, the better the expected fit of the algorithm Training time: refers to the convergence speed of an algorithm, or the time needed to establish a working model The shorter the training time, the lower the cost of building a well-performing model

133 Performance Metrics of ML Algorithms (cont.)
Linearity: reflects the complexity of the ML algorithm applied Linear behavior implies some form of scalable performance In practice, linear algorithms with lower complexity are more often desired They may lead to shorter training time or even higher accuracy at a reduced cost

134 Machine Learning Performance Scores
To quantify the performance of an ML algorithm, some performance scores can be defined Scores are normalized as percentages, with 100% for a perfect score and small fractions for lower scores A score is often a weighted function of the three performance metrics above Different user groups apply different weighting functions to emphasize their preferred choices Often accuracy is weighted the highest, training time is secondary, and linearity is the least important or simply ignored if implementation costs are not a limit The performance of an ML algorithm

135 Machine Learning Performance Scores (cont.)
Can be represented as a learning performance curve with two competing scores The training score is driven by the training data set applied The cross-validation score is based on progressive testing with held-out data In general, the training score is higher than the validation score because the model is built from the training data set In a well-fitting machine learning model the training score and the cross-validation score match closely The ideal case is one where the two scores converge quickly after sufficient testing
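
Both scores can be produced with scikit-learn's learning_curve utility; a minimal sketch, in which the LinearSVC estimator and the synthetic data set are stand-ins chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

# Synthetic stand-in data set: 400 samples, 8 features
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

sizes, train_scores, valid_scores = learning_curve(
    LinearSVC(C=1, dual=False, max_iter=10000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:4d}  training score={tr:.3f}  cross-validation score={va:.3f}")
```

Plotting the two score columns against the training size gives the learning performance curve discussed here.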

136 Machine Learning Performance Scores (cont.)

137 Model-Fitting Cases in Machine Learning Process
Two model-fitting cases arise in the process of choosing an acceptable ML algorithm, under various performance conditions of the training and testing data sets Over-fitting: in this case the training score is very high while the cross-validation score is very low on the testing data sets applied The two scores are separated far apart from one another The model fits the training set very closely but ignores the noise margins present in the validation data set

138 Model-Fitting Cases in Machine Learning Process (cont.)
The model is heavily biased toward a particular training dataset whose samples stay far away from the common data distribution or characteristics seen in general applications The over-fitted model simply cannot model the testing data accurately Under-fitting: the model produced by a given training set ends up with a very low performance score, far below the user's expectation A poor training set was chosen, so the trained model cannot perform well at all on real testing datasets and is unacceptable

139 Model-Fitting Cases in Machine Learning Process (cont.)
Neither over-fitting nor under-fitting is acceptable One has to choose a proper training dataset that represents the general data An ideal model must work well on both the training dataset and other datasets encountered in general applications Example: using the LinearSVC algorithm with a regularization factor C = 1 on a small data set of up to 160 samples LinearSVC is an implementation of support vector classification for the case of a linear kernel As the sample data size increases

140 Model-Fitting Cases in Machine Learning Process (cont.)
The training score decreases slightly, while a small increase of the cross-validation score is observed

141 Model-Fitting Cases in Machine Learning Process (cont.)
The model in this case tends to fall into over-fitting The gap between the training score and the cross-validation score should be minimized The model makes every effort to follow the behavior of the training set, but following the training set too closely lowers the cross-validation score The lower cross-validation score also reflects the much higher noise in the testing dataset The difficulties caused by both over-fitting and under-fitting must be overcome

142 Methods to Reduce Model Over-Fitting
The main reason for over-fitting is that the model memorizes the distribution properties of the training samples The model being created is overly biased by the behavior of the sample data An over-fitted model scores very high on a particular training set but scores badly on other data sets These big score gaps across data sets must be closed Several methods exist for reducing this adverse effect

143 Increasing the Training Data Size
Increasing the quantity of samples may make the training set more representative, covering more of the variety and rarity in the data With more samples the noise effects are captured better: the mean value of the noise approaches zero, so the influence of noise on the testing data can be greatly reduced The common ways to increase the sample size are to collect more data under the same scenario, or to add manual labeling to generate some artificial training samples

144 Increasing the Training Data Size (cont.)
e.g., For image data one can apply mirror transformation and rotation to enlarge the sample quantity, as sketched below These operations may be labor-intensive, and the generated samples remain dependent on the originals, but the enlarged set improves the model by reducing training bias Example: enlarging the sample data set for the LinearSVC algorithm The sample number is enlarged from 150 to 400 The enlarged data set results in good convergence of the score curves once the data set size increases beyond 200 The two scores become very close to one another as the sample data size increases beyond 300
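
A hedged sketch of the mirror-and-rotate idea, using NumPy only; the images and labels arrays are assumed to hold grayscale image samples and their class labels (they are not defined in the chapter).

```python
import numpy as np

def augment(images, labels):
    """Enlarge an image training set with mirrored and rotated copies."""
    aug_x, aug_y = [], []
    for img, lab in zip(images, labels):
        aug_x.append(img)                 # original sample
        aug_x.append(np.fliplr(img))      # horizontal mirror
        aug_x.append(np.rot90(img))       # 90-degree rotation
        aug_y.extend([lab, lab, lab])     # labels are unchanged by these operations
    return np.array(aug_x), np.array(aug_y)

# Usage: a 150-sample set grows roughly toward the 400+ samples used in the example
# X_big, y_big = augment(X_small, y_small)
```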

145 Increasing the Training Data Size (cont.)
As the number of training samples increases, the difference between the training score and the cross-validation score is reduced

146 Feature Screening and Dimension Reduction Methods
When a small sample data set cannot be increased further, one can reduce the noise effect by transforming the existing sample set, e.g., using wavelet analysis The purpose is to reduce the mean of the noise to zero and to lower the noise variance, thus reducing the influence of noise on all data to be tested For a large training data set characterized by many sample features, one can reveal the correlation between features

147 Feature Screening and Dimension Reduction Methods (cont.)
Based on these correlations, one can cut off some features to reduce the over-fitting effect Features with limited representative power are removed, which is called feature screening or dimension reduction One can traverse all combinations of features and select the more important ones For samples with high dimensionality, association analysis may be adopted to eliminate weak features and reduce the dimensions Occasionally it is rather difficult to determine the relationships between orthogonal features; in this case the PCA algorithm is applied for dimension reduction

148 Feature Screening and Dimension Reduction Methods (cont.)
When the dimensionality of the feature space is not high, feature screening can still be conducted to decrease the complexity of the model Three methods exist to reduce the model complexity: decrease the degree of the polynomial in the ML model, reduce the number of layers in an ANN and the number of nodes in each layer, or increase the bandwidth of the RBF kernel in the SVM algorithm, as sketched below
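
For the third method, note that in scikit-learn the RBF bandwidth is controlled through the gamma parameter (a wider kernel corresponds to a smaller gamma); the sketch below simply compares two settings on a synthetic stand-in data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Narrow kernel (large gamma): very flexible model, prone to over-fitting
narrow = cross_val_score(SVC(kernel="rbf", gamma=1.0), X, y, cv=5).mean()
# Wide kernel (small gamma): smoother, simpler decision boundary
wide = cross_val_score(SVC(kernel="rbf", gamma=0.001), X, y, cv=5).mean()

print("narrow RBF kernel CV score:", narrow)
print("wide RBF kernel CV score:  ", wide)
```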

149 Feature Screening and Dimension Reduction Methods (cont.)
SVM is the best-known kernel method The radial basis function (RBF) kernel is a popular kernel function used in various kernel algorithms Example: using fewer features to reduce over-fitting effects This demonstrates the effect of using fewer features in the LinearSVC algorithm Feature 7 and feature 8 are selected manually after observing their heavier roles in the learning process One can also apply association analysis or the PCA algorithm to assess the impact of the various features
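
A hedged illustration of the manual screening just described; the column indices 6 and 7 stand for "feature 7" and "feature 8" in a zero-indexed feature matrix, and the synthetic data set is only a placeholder for the physical-examination data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for an 8-feature data set (not the chapter's patient data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

X_subset = X[:, [6, 7]]                    # keep only feature 7 and feature 8

clf = LinearSVC(dual=False, max_iter=10000)
full_score = cross_val_score(clf, X, y, cv=5).mean()
subset_score = cross_val_score(clf, X_subset, y, cv=5).mean()

print("all 8 features:     ", full_score)
print("2 screened features:", subset_score)
```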

150 Feature Screening and Dimension Reduction Methods (cont.)

151 Methods to Avoid Model Under-Fitting
Under-fitting occurs in two situations Either the dataset is poorly prepared and simply cannot support good performance in the training and validation processes, or the ML algorithm is wrongly chosen without considering the nature of the problem environment Different datasets may suit the chosen algorithm differently An under-fitting problem is difficult to solve completely; the more feasible approach is to find ways to avoid it

152 Mixed Parameter Changes
Consider the under-fitting problem when using an SVM model to solve a classification problem One way is to utilize an ANN to train the system and yield a better model fit One can also modify the kernel function to cover the case of nonlinear classification e.g., One can replace a stochastic gradient descent (SGD) classifier with a multilayer ANN, or adopt kernel approximation to complete the task Another option is to modify two parameters of the LinearSVC algorithm to yield a better fit

153 Mixed Parameter Changes (cont.)
The change in score is small after an iteration of 50 mini-batches of sample data The scores remain low, reflecting a status of under-fitting, and performance becomes worse when such an under-fitted model is used For the LinearSVC model, one can reduce the regularization factor C from 1 to 0.1 and apply the L1 regularization penalty Both scores then improve to around 0.91 across a wide range of sample data set sizes from 50 to 300 Through this mixed change of model parameters, the training and cross-validation scores end up fairly closely matched
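
A sketch of that parameter change; in scikit-learn the L1 penalty for LinearSVC requires dual=False, and the synthetic data set stands in for the one used in the chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

base  = LinearSVC(C=1.0, penalty="l2", dual=False, max_iter=10000)
tuned = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=10000)  # mixed parameter change

print("C=1.0, L2 penalty:", cross_val_score(base,  X, y, cv=5).mean())
print("C=0.1, L1 penalty:", cross_val_score(tuned, X, y, cv=5).mean())
```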

154 Mixed Parameter Changes (cont.)
In the under-fitting case, the low model score sets a low upper bound on the testing and validation scores

155 Changing the Loss Functions
ML problems can also be viewed as the minimization of some loss function over the training examples A loss function expresses the discrepancy between the trained model's predictions and the actual problem instances e.g., A classification problem requires assigning a label to each instance; the trained model is used to predict the labels of the training samples, and the loss function reflects the difference between the predicted and actual label sets The loss function thus reveals how far the algorithm falls from its expected performance An optimized algorithm should minimize the loss on the training set

156 Changing the Loss Functions (cont.)
Ultimately, ML is concerned with minimizing the loss on unseen samples The selection of the loss function is critical to obtaining a better or even optimal prediction model Five design choices of loss functions Zero-one loss function: this policy offers a very sharp division between success and failure The 0-1 loss simply counts the number of mispredictions in classification problems It is a non-convex function and is not practical in real-life applications Hinge loss function

157 Changing the Loss Functions (cont.)
Often used in SVM applications for its relative strength in handling noise effects, though it is not backed by a probability distribution Log loss function: reflects the probability distribution nicely When multi-class classification needs to know the confidence of the classification, the log loss function is rather suitable Its shortcomings are lower sensitivity to noise and a lack of judgment strength Exponential loss function: this has been applied in AdaBoost AdaBoost (Adaptive Boosting) can be used in conjunction with many other types of learning algorithms to improve performance

158 Changing the Loss Functions (cont.)
Very sensitive to outliers and noise Its prediction style is simple and it is effective in boosting algorithms Boosting is a machine learning ensemble meta-algorithm primarily for reducing bias and variance in supervised learning Perceptron loss function: can be regarded as a variation of the hinge loss The hinge loss poses a heavy penalty on misjudgments near the boundary points, whereas the perceptron loss is satisfied with accurate classification of the sample data and ignores the distance from the decision boundary It is simpler to use than the hinge loss function

159 Changing the Loss Functions (cont.)
But it yields a weaker model for general problems due to the lack of a maximum-margin boundary The effects of using different loss functions on model selection can be compared directly, as in the sketch below
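
scikit-learn's SGDClassifier exposes several of these losses under the same training routine, so their effect can be compared side by side; a hedged sketch on a synthetic data set (in recent scikit-learn releases the log loss is spelled "log_loss"; older releases use "log").

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Compare the hinge, log, and perceptron losses discussed above
for loss in ["hinge", "log_loss", "perceptron"]:
    clf = SGDClassifier(loss=loss, max_iter=1000, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{loss:10s} cross-validation score: {score:.3f}")
```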

160 Model Modifications or the Ensemble Approach
The PCA algorithm offers a dimension-reduction approach for reducing model complexity One can also consider using several principal components to reconstruct the data elements Because the principal components are mutually independent, the internal correlation among the data is greatly reduced The ensemble approach is another solution to the under-fitting problem, applicable when each individual model does not perform well on a given data set

161 Model Modifications or the Ensemble Approach (cont.)
One can run several algorithms concurrently on the same data set and choose the one with the best performance score, or combine them e.g., Applying the AdaBoost and decision tree models jointly to improve the accuracy of the prediction results
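
A minimal sketch of that joint use: AdaBoost boosts shallow decision trees as its base estimators (the tree depth and number of estimators here are illustrative; the keyword is estimator in recent scikit-learn releases and base_estimator in older ones).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)                      # weak base learner
booster = AdaBoostClassifier(estimator=stump, n_estimators=100,  # boosted ensemble
                             random_state=0)

print("single decision stump:", cross_val_score(stump, X, y, cv=5).mean())
print("AdaBoost of stumps:   ", cross_val_score(booster, X, y, cv=5).mean())
```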

162 Dataset Selection
Dataset selection offers some methods for solving under-fitting problems, in particular when one is trying to improve a classification model whose performance is highly sensitive to the data sets applied Three options exist for selecting the data sets, driven by the performance demand Common (hold-out) data sets: the original data set is divided into two parts, a training set and a testing set with similar characteristics and distributions

163 Dataset Selection (cont.)
This subdivision should result in good model performance and avoid both the over-fitting and the under-fitting problems Cross validation: the original data set is divided into k parts, and each part in turn serves as the test set while the remaining parts form the training set This demands k validation runs and yields the average accuracy of the model over the subdivided test sets Bootstrap cycle: involves repeated random sampling with replacement to form different training samples The sampled data form the training set and the remaining data form the test set

164 Dataset Selection (cont.)
These sampling cycles are repeated k times, ending up with a weighted mean performance over all test sets Some important ML algorithm families are implemented with Spark MLlib
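
The three selection schemes can be sketched together with scikit-learn and NumPy; a hedged illustration in which a hold-out split, k-fold cross-validation, and a simple bootstrap cycle are applied to the same synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
clf = LinearSVC(dual=False, max_iter=10000)

# 1. Common (hold-out) data set: one split into training and testing parts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out score:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 2. Cross validation: k parts, each used once as the test set
print("5-fold CV mean:", cross_val_score(clf, X, y, cv=5).mean())

# 3. Bootstrap cycle: sample with replacement, test on the left-out elements
rng = np.random.default_rng(0)
scores = []
for _ in range(10):                                  # k = 10 bootstrap cycles
    idx = rng.integers(0, len(X), len(X))            # sample indices with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)       # out-of-bag elements form the test set
    scores.append(clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
print("bootstrap mean:", np.mean(scores))
```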

