Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 CSI 5388:Topics in Machine Learning Inductive Learning: A Review.

Similar presentations


Presentation on theme: "1 CSI 5388:Topics in Machine Learning Inductive Learning: A Review."— Presentation transcript:

1 1 CSI 5388:Topics in Machine Learning Inductive Learning: A Review

2 2 Course Outline Overview Overview Theory Theory Version Spaces Version Spaces Decision Trees Decision Trees Neural Networks Neural Networks Bagging Bagging Boosting Boosting

3 3 Inductive Learning : Overview Different types of inductive learning: Different types of inductive learning: Supervised Learning: The program attempts to infer an association between attributes and their inferred class.Supervised Learning: The program attempts to infer an association between attributes and their inferred class. Concept Learning Concept Learning Classification Classification Unsupervised Learning: The program attempts to infer an association between attributes but no class is assigned.:Unsupervised Learning: The program attempts to infer an association between attributes but no class is assigned.: Reinforced learning. Reinforced learning. Clustering Clustering Discovery Discovery Online vs. Batch LearningOnline vs. Batch Learning  We will focus on supervised learning in batch mode.  We will focus on supervised learning in batch mode.

4 4 Inductive Inference Theory (1) Given X the set of all examples. Given X the set of all examples. A concept C is a subset of X. A concept C is a subset of X. A training example T is a subset of X such that some examples of T are elements of C (the positive examples) and some examples are not elements of C (the negative examples) A training example T is a subset of X such that some examples of T are elements of C (the positive examples) and some examples are not elements of C (the negative examples)

5 5 Inductive Inference Theory (2) Learning: { }   f: X  Y { }   f: X  Y with i=1..n, with i=1..n, xi T, yi  Y (={0,1}) xi T, yi  Y (={0,1}) yi= 1, if x1 is positive ( C) yi= 1, if x1 is positive ( C) yi= 0, if xi is negative ( C) yi= 0, if xi is negative ( C) Goals of learning: f must be such that for all xj  X (not only  T) - f(xj) =1 si xj  C - f(xj) = 0, si xj  C - f(xj) = 0, si xj  C Learning system

6 6 Inductive Inference Theory (3) Problem: La task or learning is not well formulated because there exist an infinite number of functions that satisfy the goal.  It is necessary to find a way to constrain the search space of f. Problem: La task or learning is not well formulated because there exist an infinite number of functions that satisfy the goal.  It is necessary to find a way to constrain the search space of f. Definitions: Definitions: The set of all fs that satisfy the goal is called hypothesis space.The set of all fs that satisfy the goal is called hypothesis space. The constraints on the hypothesis space is called the inductive bias.The constraints on the hypothesis space is called the inductive bias. There are two types of inductive bias:There are two types of inductive bias: The hypothesis space restriction bias The hypothesis space restriction bias The preference bias The preference bias

7 7 Inductive Inference Theory (4) Hypothesis space restriction bias  We restrain the language of the hypothesis space. Hypothesis space restriction bias  We restrain the language of the hypothesis space.Examples: k-DNF: We restrict f to the set of Disjunctive Normal form formulas having an arbitrary number of disjunctions but at most, k conjunctive in each conjunctions. k-DNF: We restrict f to the set of Disjunctive Normal form formulas having an arbitrary number of disjunctions but at most, k conjunctive in each conjunctions. K-CNF: We restrict f to the set of Conjunctive Normal Form formulas having an arbitrary number of conjunctions but with at most, k disjunctive in each disjunction. K-CNF: We restrict f to the set of Conjunctive Normal Form formulas having an arbitrary number of conjunctions but with at most, k disjunctive in each disjunction. Properties of that type of bias: Properties of that type of bias: Positive: Learning will by simplified (Computationally)Positive: Learning will by simplified (Computationally) Negative: The language can exclude the “good” hypothesis.Negative: The language can exclude the “good” hypothesis.

8 8 Inductive Inference Theory (5) Preference Bias: It is an order or unit of measure that serves as a base to a relation of preference in the hypothesis space. Preference Bias: It is an order or unit of measure that serves as a base to a relation of preference in the hypothesis space. Examples: Examples: Occam’s razor: We prefer a simple formula for f. Occam’s razor: We prefer a simple formula for f. Principle of minimal description length (An extension of Occam’s Razor): The best hypothesis is the one that minimise the total length of the hypothesis and the description of the exceptions to this hypothesis. Principle of minimal description length (An extension of Occam’s Razor): The best hypothesis is the one that minimise the total length of the hypothesis and the description of the exceptions to this hypothesis.

9 9 Inductive Inference Theory (6) How to implement learning with these bias? How to implement learning with these bias? Hypothesis space restriction bias: Hypothesis space restriction bias: Given:Given: A set S of training examples A set S of training examples A set of restricted hypothesis, H A set of restricted hypothesis, H Find: An hypothesis f  H that minimizes the number of incorrectly classified training examples of S.Find: An hypothesis f  H that minimizes the number of incorrectly classified training examples of S.

10 10 Inductive Inference Theory (7) Preference Bias: Preference Bias: Given:Given: A set S of training examples A set S of training examples An order of preference better(f1, f2) for all the hypothesis space (H) functions. An order of preference better(f1, f2) for all the hypothesis space (H) functions. Find: the best hypothesis f  H (using the “better” relation) that minimises the number of training examples S incorrectly classified.Find: the best hypothesis f  H (using the “better” relation) that minimises the number of training examples S incorrectly classified. Search techniques: Search techniques: Heuristic searchHeuristic search Hill ClimbingHill Climbing Simulated Annealing et Genetic AlgorithmSimulated Annealing et Genetic Algorithm

11 11 Inductive Inference Theory (8) When can we trust our learning algorithm? When can we trust our learning algorithm?  Theoretical answer Experimental answerExperimental answer Theoretical answer : PAC-Learning (Valiant 84) Theoretical answer : PAC-Learning (Valiant 84) PAC-Learning provides the limit on the necessary number of example (given a certain bias) that will let us believe with a certain confidence that the results returned by the learning algorithm is approximately correct (similar to the t-test). This number of example is called sample complexity of the bias. PAC-Learning provides the limit on the necessary number of example (given a certain bias) that will let us believe with a certain confidence that the results returned by the learning algorithm is approximately correct (similar to the t-test). This number of example is called sample complexity of the bias. If the number of training examples exceeds the sample complexity, we are confident of our results. If the number of training examples exceeds the sample complexity, we are confident of our results.

12 12 Inductive Inference Theory (9): PAC-Learning Given Pr(X) The probability distribution with which the examples are selected from X Given Pr(X) The probability distribution with which the examples are selected from X Given f, an hypothesis from the hypothesis space. Given f, an hypothesis from the hypothesis space. Given D the set of all examples for which f and C differ. Given D the set of all examples for which f and C differ. The error associated with f and the concept C is: The error associated with f and the concept C is: Error(f) =  xD Pr(x)Error(f) =  xD Pr(x) f is approximately correct with an exactitude of  iff: Error(f)  f is approximately correct with an exactitude of  iff: Error(f)   f is probably approximately correct (PAC) with probability  and exactitude  if Pr(Error(f) > ) ) < 

13 13 Inductive Inference Theory (10): PAC-Learning Theorem: A program that returns any hypothesis consistent with the training examples is PAC if n, the number of training examples is greater than ln(/|H|)/ln(1-) where |H| represents the number of hypothesis in H. Theorem: A program that returns any hypothesis consistent with the training examples is PAC if n, the number of training examples is greater than ln(/|H|)/ln(1-) where |H| represents the number of hypothesis in H. Examples: Examples: For 100 hypothesis, you need 70 examples to reduce the error under 0.1 with a probability of 0.9 For 100 hypothesis, you need 70 examples to reduce the error under 0.1 with a probability of 0.9 For 1000 hypothesis, 90 are required For 1000 hypothesis, 90 are required For 10,000 hypothesis, 110 are required. For 10,000 hypothesis, 110 are required.  ln(/|H|)/ln(1-) grows slowly. That’s good!  ln(/|H|)/ln(1-) grows slowly. That’s good!

14 14 Inductive Inference Theory (11) When can we trust our learning algorithm? When can we trust our learning algorithm? - Theoretical answer  Experimental answer Experimental answer : error estimation Experimental answer : error estimation Suppose you have access to 1000 examples for a concept f. Suppose you have access to 1000 examples for a concept f.  Divide the data in 2 sets:  One training set  One test set  Train the algorithm on the training set only.  Test the resulting hypothesis to have an estimation of that hypothesis on the test set.

15 15 A Taxonomy of Machine Learning Techniques: Highlight on Important Approaches Supervised Unsupervised Linear Nonlinear Single Combined Easy to Interpret Hard to Interpret Linear Regression Logistic Regression Perceptron Bagging Boosting Random Forests Decision Rule Trees Learning Naïve k-Nearest Bayes Neighbours Multi-Layer SVM Perceptron K-Means EM Self-Organizing Maps

16 16 Version Spaces: Definitions Given C1 and C2, two concepts represented by sets of examples. If C1  C2, then C1 is a specialisation of C2 and C2 is a generalisation of C1. Given C1 and C2, two concepts represented by sets of examples. If C1  C2, then C1 is a specialisation of C2 and C2 is a generalisation of C1. C1 is also considered more specific than C2 C1 is also considered more specific than C2 Example: The set off all blue triangles is more specific than the set of all the triangles. Example: The set off all blue triangles is more specific than the set of all the triangles. C1 is an immediate specialisation of C2 if there is no concept that are a specialisation of C2 and a generalisation of C1. C1 is an immediate specialisation of C2 if there is no concept that are a specialisation of C2 and a generalisation of C1. A version space define a graph where the nodes are concepts and the arcs specify that a concept is an immediate specialisation of another one. A version space define a graph where the nodes are concepts and the arcs specify that a concept is an immediate specialisation of another one. (See in class example) (See in class example)

17 17 Version Spaces: Overview (1) A Version Space has two limits: The general limit and the specific limit. A Version Space has two limits: The general limit and the specific limit. The limits are modified after each addition of a training example. The limits are modified after each addition of a training example. The starting general limit is simply (?,?,?); The specific limit has all the leaves of the Version Space tree. The starting general limit is simply (?,?,?); The specific limit has all the leaves of the Version Space tree. When adding a positive example all the examples of the specific limit are generalized until it is compatible with the example. When adding a positive example all the examples of the specific limit are generalized until it is compatible with the example. When a negative example is added, the general limit examples are specialised until they are no longer compatible with the example. When a negative example is added, the general limit examples are specialised until they are no longer compatible with the example.

18 18 Version Spaces: Overview (2) If the specific limits and the general limits are maintained with the previous rules, then a concept is guaranteed to include all the positive examples and exclude all the negative examples if they fall between the limits. If the specific limits and the general limits are maintained with the previous rules, then a concept is guaranteed to include all the positive examples and exclude all the negative examples if they fall between the limits. General Limit more specific More general Specific Limit If f is here, it includes all examples + And exclude all examples - (See in class example)

19 19 Decision Tree: Introduction The simplest form of learning is the memorization of all the training examples. The simplest form of learning is the memorization of all the training examples. Problem: Memorization is not useful for new examples  We need to find ways to generalize beyond the training examples. Problem: Memorization is not useful for new examples  We need to find ways to generalize beyond the training examples. Possible Solution: Instead of memorizing each attributes of each examples, we can memorize only those that distinguish between positive and negative examples. That is what the decision tree does. Possible Solution: Instead of memorizing each attributes of each examples, we can memorize only those that distinguish between positive and negative examples. That is what the decision tree does. Notice: The same set of example can be represented by different trees. Occam’s Razor tells you to take the smallest tree. Notice: The same set of example can be represented by different trees. Occam’s Razor tells you to take the smallest tree.

20 20 Supervised Learning: Example Patient Attributes Class Temperature Cough Sore Throat Sinus Pain 1 37 yes no no no flu 2 39 no yes yes flu 3 38.4 no no no no flu 4 36.8 no yes no no flu 5 38.5 yes no yes flu 6 39.2 no no yes flu Goal: Learn how to predict whether a new patient with a given set of symptoms does or does not have the flu.

21 21 Decision Trees: An example Decision Trees: An example Temperature Cough Sore Throat Medium Low High Yes NoYes No A Decision Tree for the Flu Concept Flu No Flu Flu No Flu

22 22 Decision tree: Construction Step 1: We choose an attribute A (= node 0) and split the example by the value of this attribute. Each of these groups correspond to a child of node 0. Step 1: We choose an attribute A (= node 0) and split the example by the value of this attribute. Each of these groups correspond to a child of node 0. Step 2: For each descendant of node 0, if the examples of this descendant are homogenous (have the same class), we stop. Step 2: For each descendant of node 0, if the examples of this descendant are homogenous (have the same class), we stop. Step 3: If the examples of this descendent are not homogenous, then we call the procedure recursively on that descendent. Step 3: If the examples of this descendent are not homogenous, then we call the procedure recursively on that descendent.

23 23 Construction of Decision Trees I D13 D12 D11 D10 D9 D4 D7 D5 D3 D14 D8D6D2D1 What is the most informative attribute? Assume: Temperature

24 24 D13 D12 D11 D10 D9 D4 D7 D5 D3 D14 D8 D6 D2 D1 Temperature Medium Low High What are the most informative attributes? Cough and Sore Throat Construction of Decision Trees II Assume:

25 25 Construction of Decision Trees III The informativeness of an attribute is an information-theoretic measure that corresponds to the attribute that produces the purest children nodes. The informativeness of an attribute is an information-theoretic measure that corresponds to the attribute that produces the purest children nodes. This is done by minimizing the measure of entropy in the trees that the attribute split generates. This is done by minimizing the measure of entropy in the trees that the attribute split generates. The entropy and information are linked in the following way: The more there is entropy in a set S, the more information is necessary in order to guess correctly an element of this set. The entropy and information are linked in the following way: The more there is entropy in a set S, the more information is necessary in order to guess correctly an element of this set. Info[x,y] = Entropy[x,y] = - x/(x+y) log x/(x+y) Info[x,y] = Entropy[x,y] = - x/(x+y) log x/(x+y) - y/(x+y) log y/(x+y) - y/(x+y) log y/(x+y)

26 26 Construction of Decision Trees IV Medium Low High Temperature Sore Throat No Yes Info[2,3] =.971 bits Info[4,0]= 0 bits Info[3,2]=.971 bits Avg Tree Info = 5/14 *.971 + 4/14 * 0 + 5/14 *.971 =.693 Prior Info[9,5] =.940  Gain =.940 -.693 =.247 Info[2,6]=.811 Info[3,3]= 1 Avg Tree Info = 8/14 +.811 + 6/14 * 1 =.892 Prior Info[9,5] =.940  Gain =.940 -.892 =.048

27 27 Decision Tree: Other questions. We have to find a way to deal with attributes with continuous values or discrete values with a very large set. We have to find a way to deal with attributes with continuous values or discrete values with a very large set. We have to find a way to deal with missing values. We have to find a way to deal with missing values. We have to find a way to deal with noise (errors) in the example’s class and in the attribute values. We have to find a way to deal with noise (errors) in the example’s class and in the attribute values.

28 28 Neural Network: Introduction (I) What is a neural network? What is a neural network? It is a formalism inspired by biological systems and that is composed of units that perform simple mathematical operations in parallel. It is a formalism inspired by biological systems and that is composed of units that perform simple mathematical operations in parallel. Examples of simple mathematical operation units: Examples of simple mathematical operation units: Addition unitAddition unit Multiplication unitMultiplication unit Threshold (Continous (example: the Sigmoïd) or not)Threshold (Continous (example: the Sigmoïd) or not)

29 29 Multi-Layer Perceptrons: An Opaque Approach Examples: Input Units Hidden Units Output units weights Autoassociation Heteroassociation

30 30 Representation in a Multi-Layer Perceptron h j =g(  w ji.x i ) y k =g(  w kj.h j ) where g(x)= 1/(1+e ) x 1 x 2 x 3 x 4 x 5 x 6 h 1 h 2 h 3 y1y1 kjikji w ji ’s w kj ’s g (sigmoid): 0 1/2 0 1 Typically, y 1 =1 for positive example and y 1 =0 for negative example -x i j

31 31 Neural Network: Learning (I) The units are connected in order to create a network capable of computing complicated functions. The units are connected in order to create a network capable of computing complicated functions. Since the network has a sigmoid output, it implements a function f(x1,x2,x3,x4) where the output is in the range [0,1] Since the network has a sigmoid output, it implements a function f(x1,x2,x3,x4) where the output is in the range [0,1] We are interested in neural network capable of learning that function. We are interested in neural network capable of learning that function.

32 32 Learning in a Multi-Layer Perceptron I  Learning consists of searching through the space of all possible matrices of weight values for a combination of weights that satisfies a database of positive and negative examples (multi-class as well as regression problems are possible).  It is an optimization problem which tries to minimize the sum of square error: E= 1/2  n=1  k=1 [y k -f k (x )] where N is the total number of training examples and K, the total number of output units (useful for multiclass problems) and fk is the function implemented by the neural net NK

33 33 Learning in a Multi-Layer Perceptron II The optimization problem is solved by searching the space of possible solutions by gradient. The optimization problem is solved by searching the space of possible solutions by gradient. This consists of taking small steps in the direction that minimize the gradient (or derivative) of the error of the function we are trying to learn. This consists of taking small steps in the direction that minimize the gradient (or derivative) of the error of the function we are trying to learn. When the gradient is zero we have reached a local minimum that we hope is also the global minimum. When the gradient is zero we have reached a local minimum that we hope is also the global minimum.

34 34 Neural Network: Learning (II) Notice that a Neural Network with a set of adjustable weights represent a restricted hypothesis space corresponding to a family of functions. The size of this space can be increased or decreased by changing the number of hidden units in the network. Notice that a Neural Network with a set of adjustable weights represent a restricted hypothesis space corresponding to a family of functions. The size of this space can be increased or decreased by changing the number of hidden units in the network. Learning is done by a hill-climbing approach called backpropagation and is based on the paradigm of search by gradient. Learning is done by a hill-climbing approach called backpropagation and is based on the paradigm of search by gradient.

35 35 Neural Network: Learning (III) The idea of search by gradient is to take small steps in the direction that minimize the gradient (or derivative) of the error of the function we are trying to learn. The idea of search by gradient is to take small steps in the direction that minimize the gradient (or derivative) of the error of the function we are trying to learn. When the gradient is zero we have reached a local minimum that we hope is also the global minimum. When the gradient is zero we have reached a local minimum that we hope is also the global minimum.

36 36 Description of Two classifier combination Schemes: - Bagging - Boosting

37 37 Combining Multiple Models The idea is the following: In order to make the outcome of automated classification more reliable, it may be a good idea to combine the decisions of several single classifiers through some sort of voting scheme The idea is the following: In order to make the outcome of automated classification more reliable, it may be a good idea to combine the decisions of several single classifiers through some sort of voting scheme Bagging and Boosting are the two most used combination schemes and they usually yield much improved results over the results of single classifiers Bagging and Boosting are the two most used combination schemes and they usually yield much improved results over the results of single classifiers One disadvantage of these multiple model combinations is that, as in the case of neural Networks, the learned model is hard, if not impossible, to interpret. One disadvantage of these multiple model combinations is that, as in the case of neural Networks, the learned model is hard, if not impossible, to interpret.

38 38 Bagging: Bootstrap Aggregating Idea: perturb the composition of the data sets from which the classifier is trained. Learn a classifier from each different data set. Let these classifiers vote.  This procedure reduces the portion of the performance error caused by variance in the training set. Bagging Algorithm

39 39Boosting Boosting Algorithm Idea: To build models that complement each other. A first classifier is built. Its errors are given a higher weight than its right answers so that the next classifier being built focuses on these errors, etc… Decrease the weight of the right answers  Increase the weight of errors


Download ppt "1 CSI 5388:Topics in Machine Learning Inductive Learning: A Review."

Similar presentations


Ads by Google