
1 Supervised Learning Approaches Bayesian Learning Neural Network Support Vector Machine Ensemble Methods Adapted from Lecture Notes of V. Kumar and E. Alpaydin

2 Credit Scoring Example: Given inputs of income and savings, label each customer in a loan application as low-risk vs. high-risk. Input: x = [x1, x2]^T; output: C ∈ {0, 1}. Prediction: choose C = 1 if P(C=1 | x1, x2) > 0.5, and C = 0 otherwise. Prediction amounts to finding the C that maximizes the conditional probability P(C | x).

3 Bayes’ Rule: posterior = likelihood × prior / evidence, i.e. P(C | x) = P(x | C) P(C) / P(x)

4 Example of Bayes’ Rule Given: A doctor knows that meningitis causes stiff neck 50% of the time. The prior probability of any patient having meningitis is 1/50,000. The prior probability of any patient having stiff neck is 1/20. If a patient has a stiff neck, what is the probability that he/she has meningitis? (diagnostic inference)
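Plugging the slide’s numbers into Bayes’ rule gives the answer directly; a quick check in Python:

```python
# Diagnostic inference with Bayes' rule, using the numbers from the slide:
# P(M|S) = P(S|M) * P(M) / P(S)
p_s_given_m = 0.5        # meningitis causes stiff neck 50% of the time
p_m = 1 / 50000          # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```

So even with a stiff neck, meningitis remains very unlikely, because its prior is so small.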

5 Naïve Bayes Classifier Assume independence among attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj). We can estimate P(Ai | Cj) for all Ai and Cj. A new point is classified as Cj if P(Cj) ∏ P(Ai | Cj) is maximal.

6 How to Estimate Probabilities from Data? Class prior: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10. For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances that have attribute value Ai and belong to class Ck. Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0.

7 How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins (one ordinal attribute per bin; this violates the independence assumption). Two-way split: (A < v) or (A > v); choose only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution and use the data to estimate the parameters of the distribution (e.g., mean and standard deviation); parametric estimation, as in regression. Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c).

8 How to Estimate Probabilities from Data? Normal distribution: P(Ai | cj) = (1 / sqrt(2π σij²)) exp(−(Ai − μij)² / (2 σij²)), one for each (Ai, cj) pair. For (Income, Class=No): if Class=No, sample mean = 110 and sample variance = 2975, giving P(Income=120 | No) ≈ 0.0072.
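The Gaussian estimate on the slide can be reproduced in a few lines; a minimal sketch using the slide’s sample mean and variance:

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density N(x; mean, var), used as the class-conditional P(Ai | c)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income given Class=No: sample mean 110, sample variance 2975 (from the slide)
p = gaussian_pdf(120, 110, 2975)
print(round(p, 4))  # 0.0072
```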

9 Example of Naïve Bayes Classifier Given a test record X = (Refund=No, Status=Married, Income=120K): P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No) = 4/7 × 4/7 × 0.0072 = 0.0024. P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes) = 1 × 0 × 1.2×10⁻⁹ = 0. Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) => Class = No.
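The whole comparison can be checked end to end; a sketch using the conditional estimates stated on the slides (the Income parameters for Class=Yes, mean 90 and variance 25, are taken from the slide’s 1.2×10⁻⁹ figure):

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class priors from the 10-record training set
p_no, p_yes = 7 / 10, 3 / 10

# P(X|No) = P(Refund=No|No) * P(Married|No) * P(Income=120K|No)
likelihood_no = (4 / 7) * (4 / 7) * gaussian_pdf(120, 110, 2975)
# P(X|Yes): P(Married|Yes) = 0 zeroes out the whole product
likelihood_yes = 1.0 * 0.0 * gaussian_pdf(120, 90, 25)

score_no = likelihood_no * p_no
score_yes = likelihood_yes * p_yes
print("No" if score_no > score_yes else "Yes")  # No
```

Note how a single zero estimate (P(Married | Yes) = 0) annihilates the entire product; this is the behavior that Laplace smoothing is normally used to avoid.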

10 Example of Naïve Bayes Classifier A: attributes, M: mammals, N: non-mammals. P(A | M) P(M) > P(A | N) P(N) => mammals.

11 Bayes’ Rule: K > 2 Classes P(Ci | x) = P(x | Ci) P(Ci) / Σk P(x | Ck) P(Ck); choose the class Ci with the maximal posterior P(Ci | x).

12 Bayesian Networks Aka graphical models or probabilistic networks; they represent the interaction between variables visually. Nodes are hypotheses (random variables), and the probability corresponds to our belief in the truth of the hypothesis. Arcs are direct influences between hypotheses. The structure is represented as a directed acyclic graph (DAG). The parameters are the conditional probabilities on the arcs.

13 Causes and Bayes’ Rule Diagnostic inference: knowing that the grass is wet, what is the probability that rain is the cause? P(R | W) = P(W | R) P(R) / (P(W | R) P(R) + P(W | ~R) P(~R)). The causal direction is P(W | R); the diagnostic direction is P(R | W).

14 Causal vs. Diagnostic Inference Causal inference: if the sprinkler is on, what is the probability that the grass is wet? P(W | S) = P(W | R, S) P(R | S) + P(W | ~R, S) P(~R | S) = P(W | R, S) P(R) + P(W | ~R, S) P(~R) = 0.95 × 0.4 + 0.9 × 0.6 = 0.92. Diagnostic inference: if the grass is wet, what is the probability that the sprinkler is on? P(S | W) = 0.35 > 0.2 = P(S), but P(S | R, W) = 0.21. Explaining away: knowing that it has rained decreases the probability that the sprinkler is on.
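The causal-inference computation above is just a weighted sum over the states of rain; a quick check with the slide’s numbers:

```python
# Causal inference in the rain/sprinkler network:
# P(W|S) = P(W|R,S) P(R) + P(W|~R,S) P(~R),
# using the independence of the causes R and S.
p_w_rs, p_w_nrs = 0.95, 0.90   # P(W | R,S) and P(W | ~R,S)
p_r = 0.4                      # P(R)

p_w_given_s = p_w_rs * p_r + p_w_nrs * (1 - p_r)
print(round(p_w_given_s, 2))   # 0.92
```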

15 Bayesian Networks: Causes Causal inference: P(W | C) = P(W | R, S) P(R, S | C) + P(W | ~R, S) P(~R, S | C) + P(W | R, ~S) P(R, ~S | C) + P(W | ~R, ~S) P(~R, ~S | C), using the fact that P(R, S | C) = P(R | C) P(S | C). Diagnostic: P(C | W) = ?

16 Naïve Bayes (Summary) Robust to isolated noise points. Handles missing values by ignoring the instance during probability estimation. Robust to irrelevant attributes. The independence assumption may not hold for some attributes; in that case, use other techniques such as Bayesian Belief Networks (BBN).

17 Artificial Neural Network (ANN) An interconnected group of artificial neurons that uses a mathematical or computational model for information processing, based on a connectionist approach to computation. Biological neural network: real biological neurons that are connected or functionally related in the nervous system. Connectionist: mental phenomena can be described by interconnected networks of simple units. A data modeling tool to capture complex relationships between inputs and outputs or to find patterns in data: non-linear, statistical, non-parametric. Adapted from Lecture Notes of E. Alpaydın

18 Neural Network Example Output Y is 1 if at least two of the three inputs are equal to 1.

19 ANN: Neuron

20 ANN Model is an assembly of inter-connected nodes and weighted links. An output node sums its input values according to the weights of its links and compares the result against some threshold t (perceptron model), or the output can be a sigmoid function: Y = sigmoid(o) = 1 / (1 + exp(−o)).
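A minimal sketch of a single node, in both variants described above. The weights 0.3 and threshold 0.4 implement the earlier “at least two of three inputs are 1” example; these particular values are one common choice, not the only one:

```python
import math

def perceptron_output(inputs, weights, bias):
    """Sigmoid variant: weighted sum of inputs squashed into (0, 1)."""
    o = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-o))

def majority(x1, x2, x3):
    """Hard-threshold variant: fires iff at least two of three inputs are 1."""
    s = 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4
    return 1 if s > 0 else 0

print(majority(1, 1, 0), majority(1, 0, 0))  # 1 0
```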

21 General Structure of ANN Training ANN means learning the weights of the neurons

22 Algorithm for Learning ANN Initialize the weights (w0, w1, …, wk). Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples (offline or online). Objective function: E = Σ (r − Y)² over the training examples. Find the weights wi that minimize the objective function, e.g., with the backpropagation algorithm. Update rule: Δwi = λ (r − Y) xi, where λ is the learning rate, gradually decreased over time for convergence.
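A minimal sketch of this learning loop for a single perceptron node, using the update rule Δwi = λ (r − y) xi on the truth table of the earlier example (Y = 1 iff at least two of the three inputs are 1); the learning rate and epoch count are illustrative choices:

```python
# Perceptron training with the delta-style update rule from the slide.
def train_perceptron(examples, lam=0.1, epochs=50):
    w = [0.0, 0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, r in examples:
            # Current hard-threshold output of the node
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
            # Weight update: w_i += lambda * (r - y) * x_i
            for i in range(3):
                w[i] += lam * (r - y) * x[i]
            bias += lam * (r - y)
    return w, bias

# Truth table: label is 1 iff at least two inputs are 1 (linearly separable)
examples = [((a, b, c), int(a + b + c >= 2))
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]
w, bias = train_perceptron(examples)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
         for x, _ in examples]
print(preds == [r for _, r in examples])  # True
```

Because the target function is linearly separable, the perceptron convergence theorem guarantees that this loop reaches zero training error.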

23 Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data

24 Support Vector Machines One Possible Solution

25 Support Vector Machines Another possible solution

26 Support Vector Machines Other possible solutions

27 Support Vector Machines Which one is better? B1 or B2? How do you define better?

28 Support Vector Machines Find the hyperplane that maximizes the margin => B1 is better than B2.
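For a separating hyperplane w·x + b = 0 with the data scaled so that the closest points satisfy |w·x + b| = 1, the margin being maximized is 2 / ||w||; a quick sketch with a hypothetical weight vector for illustration:

```python
import math

# Margin of a canonical separating hyperplane: 2 / ||w||
w = [2.0, 1.0]   # hypothetical weight vector, for illustration only
margin = 2 / math.sqrt(sum(wi * wi for wi in w))
print(round(margin, 4))  # 0.8944
```

Maximizing the margin is thus equivalent to minimizing ||w||², which is what the SVM optimization problem does.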

29 Support Vector Machines What if the problem is not linearly separable?

30 Nonlinear Support Vector Machines What if decision boundary is not linear?

31 Nonlinear Support Vector Machines Transform data into higher dimensional space
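A minimal sketch of that transformation idea: 1-D points labeled by whether x² > 1 are not linearly separable on the line, but an explicit quadratic feature map x → (x, x²) makes them separable by a horizontal line in the new space (the data here are hypothetical, chosen only to illustrate the effect):

```python
# Explicit feature map illustrating the kernel idea
points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]            # 1 iff |x| > 1: not separable on the line

def phi(x):
    return (x, x * x)            # map into 2-D: second coordinate is x^2

# In (x, x^2) space the boundary x^2 = 1 is a simple horizontal line
preds = [1 if phi(x)[1] > 1 else 0 for x in points]
print(preds == labels)  # True
```

Kernel functions let an SVM compute inner products in such a higher-dimensional space without ever constructing phi(x) explicitly.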

32 Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

33 Why Ensemble Classifier? Suppose there are 25 base classifiers. Each classifier has error rate ε = 0.35. Assume the classifiers are independent. The probability that the majority-vote ensemble makes a wrong prediction is the probability that 13 or more base classifiers are wrong: P(error) = Σ from i=13 to 25 of C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06.
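The binomial sum above can be evaluated directly:

```python
from math import comb

# Ensemble of 25 independent base classifiers, each with error rate 0.35.
# Majority vote errs when 13 or more of the 25 classifiers are wrong.
eps, n = 0.35, 25
p_err = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
            for i in range(13, n + 1))
print(round(p_err, 2))  # 0.06
```

So independent, better-than-chance base classifiers combine into a far more accurate ensemble, which is the motivation behind bagging and boosting.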

34 Summary Machine Learning Approaches: supervised, unsupervised, reinforcement. Supervised learning for classification: Bayesian rule (diagnostic and causal inference), artificial neural networks, support vector machines, ensemble methods. Adapted from Lecture Notes of E. Alpaydın

35 Assessing and Comparing Classifiers Questions: assessment of the expected error of a learning algorithm; comparison of the expected errors of two algorithms (is algorithm 1 more accurate than algorithm 2?). Use training/validation/test sets. Criteria (application-dependent): misclassification error or risk (loss functions), training time/space complexity, testing time/space complexity, interpretability, easy programmability. Cost-sensitive learning.

36 K-Fold Cross-Validation The need for multiple training/validation sets: {Xi, Vi}i are the training/validation sets of fold i. K-fold cross-validation: divide X into K parts Xi, i = 1, …, K; use Vi = Xi for validation and Ti = X \ Xi for training. Any two training sets Ti share K−2 parts.
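A minimal sketch of the fold construction, assuming a simple round-robin partition of the data:

```python
# K-fold splits: fold i uses part X_i as validation and the other
# K-1 parts as training, so any two training sets share K-2 parts.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]   # round-robin partition
    splits = []
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, val))
    return splits

data = list(range(10))
splits = k_fold_splits(data, 5)
print(len(splits))  # 5
```

In practice the data should be shuffled (and often stratified by class) before partitioning.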

37 Measuring Error Error rate = # of errors / # of instances = (FN + FP) / N. Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate. Precision = # of found positives / # of found = TP / (TP + FP). Specificity = TN / (TN + FP). False alarm rate = FP / (FP + TN) = 1 − specificity.
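The five measures follow directly from the four confusion-matrix counts; a sketch with hypothetical counts for illustration:

```python
# Metrics from the confusion-matrix counts (TP, FN, FP, TN) as defined above.
def metrics(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    return {
        "error_rate":  (fn + fp) / n,
        "recall":      tp / (tp + fn),   # sensitivity, hit rate
        "precision":   tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false_alarm": fp / (fp + tn),   # 1 - specificity
    }

m = metrics(tp=40, fn=10, fp=5, tn=45)   # hypothetical counts
print(m["error_rate"])  # 0.15
```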

38 Receiver-Operating-Characteristics (ROC) Curve

39 Interval Estimation Given X = {x^t}t with x^t ~ N(μ, σ²), the sample average m ~ N(μ, σ²/N). Define the unit normal Z = (m − μ) / (σ/√N) ~ N(0, 1). The 100(1 − α) percent confidence interval for μ is then m ± z(α/2) σ/√N, i.e., P(m − z(α/2) σ/√N < μ < m + z(α/2) σ/√N) = 1 − α.
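A quick sketch of the 95% interval (α = 0.05, z(0.025) = 1.96), with hypothetical values for the sample mean, the known σ, and N:

```python
import math

# 95% confidence interval for the mean: m +/- z_{alpha/2} * sigma / sqrt(N)
m, sigma, n = 10.0, 2.0, 100     # hypothetical sample mean, sigma, and N
half = 1.96 * sigma / math.sqrt(n)
print(round(m - half, 3), round(m + half, 3))  # 9.608 10.392
```

When σ is unknown and estimated from the sample, the z value is replaced by the corresponding t value with N−1 degrees of freedom.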

40 McNemar’s Test for Comparison Given a single training/validation set, build the contingency table of the two classifiers’ errors: e01 is the number of examples misclassified by classifier 1 but not by classifier 2, and e10 the reverse. Under the hypothesis of the same error rate, we expect e01 = e10 = (e01 + e10)/2. Chi-square statistic: χ² = (|e01 − e10| − 1)² / (e01 + e10), which is approximately χ² with 1 degree of freedom. McNemar’s test accepts the hypothesis at significance level α if the statistic is less than χ²(α,1) (for example, χ²(0.05,1) = 3.84).
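The statistic needs only the two disagreement counts; a sketch with hypothetical counts for illustration:

```python
# McNemar's statistic on the disagreement counts e01 and e10,
# compared against the chi-square critical value X^2_{0.05,1} = 3.84.
def mcnemar(e01, e10):
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

stat = mcnemar(e01=15, e10=5)    # hypothetical disagreement counts
print(stat > 3.84)  # True -> reject "same error rate" at the 0.05 level
```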

