
1 Machine Learning: Naïve Bayes, Neural Networks, Clustering Skim 20.5 CMSC 471.



2 The Naïve Bayes Classifier
Some material adapted from slides by Tom Mitchell, CMU.

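3 The Naïve Bayes Classifier
- Recall Bayes rule: $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$
- Which is short for: $P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{P(X = x_j)}$
- We can re-write this as: $P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\,P(Y = y_i)}{\sum_k P(X = x_j \mid Y = y_k)\,P(Y = y_k)}$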

4 Deriving Naïve Bayes
- Idea: use the training data to directly estimate $P(X \mid Y)$ and $P(Y)$.
- Then, we can use these values to estimate $P(Y \mid X_{new})$ using Bayes rule.
- Recall that representing the full joint probability $P(X_1, \ldots, X_n \mid Y)$ is not practical.

5 Deriving Naïve Bayes
- However, if we make the assumption that the attributes are independent, estimation is easy: $P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$
- In other words, we assume all attributes are conditionally independent given Y.
- Often this assumption is violated in practice, but more on that later…

6 Deriving Naïve Bayes
- Let $X = \langle X_1, \ldots, X_n \rangle$ and label Y be discrete.
- Then, we can estimate $P(X_i \mid Y_k)$ and $P(Y_k)$ directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

- P(Sky = sunny | Play = yes) = ?
- P(Humid = high | Play = yes) = ?
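The two questions on this slide can be answered by counting rows, as in the minimal sketch below. The helper names `conditional` and `prior` are hypothetical, the `Play?` column is written as `Play`, and the code assumes at least one training row carries the requested label (otherwise the ratio is undefined).

```python
from fractions import Fraction

# Training data transcribed from the slide's table ("Play?" is stored as "Play").
COLUMNS = ["Sky", "Temp", "Humid", "Wind", "Water", "Forecast", "Play"]
ROWS = [
    ("sunny", "warm", "normal", "strong", "warm", "same", "yes"),
    ("sunny", "warm", "high", "strong", "warm", "same", "yes"),
    ("rainy", "cold", "high", "strong", "warm", "change", "no"),
    ("sunny", "warm", "high", "strong", "cool", "change", "yes"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def conditional(attr, value, label, data=DATA, target="Play"):
    """Estimate P(attr = value | target = label) as a ratio of row counts."""
    with_label = [r for r in data if r[target] == label]
    matching = [r for r in with_label if r[attr] == value]
    return Fraction(len(matching), len(with_label))


def prior(label, data=DATA, target="Play"):
    """Estimate P(target = label) as the fraction of rows with that label."""
    return Fraction(sum(r[target] == label for r in data), len(data))


print(conditional("Sky", "sunny", "yes"))   # 1   (3 of the 3 "yes" rows)
print(conditional("Humid", "high", "yes"))  # 2/3 (2 of the 3 "yes" rows)
print(prior("yes"))                         # 3/4
```

Exact fractions are used here only so the counts stay visible; floating-point division would work equally well.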

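7 The Naïve Bayes Classifier
- Now we have: $P(Y = y_k \mid X_1, \ldots, X_n) \propto P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$, which is just a one-level Bayesian network.
- To classify a new point $X_{new}$: $Y_{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
[Figure: one-level Bayesian network linking the label node Y (hypotheses) to the attribute nodes X_1, ..., X_i, ..., X_n (evidence).]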

8 The Naïve Bayes Algorithm
- For each value y_k:
  - Estimate P(Y = y_k) from the data.
  - For each value x_ij of each attribute X_i:
    - Estimate P(X_i = x_ij | Y = y_k)
- Classify a new point via: $Y_{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$
- In practice, the independence assumption doesn't often hold true, but Naïve Bayes performs very well despite it.
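The algorithm on this slide can be sketched in a few lines. This is an illustrative version under stated assumptions, not the course's implementation: the function names `train` and `classify` are hypothetical, probabilities are plain count ratios with no smoothing (so an unseen attribute value drives a class score to zero), and the training rows are the four examples from the earlier table.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The four training rows from the earlier slide (attributes only) and their labels.
ROWS = [
    ("sunny", "warm", "normal", "strong", "warm", "same"),
    ("sunny", "warm", "high", "strong", "warm", "same"),
    ("rainy", "cold", "high", "strong", "warm", "change"),
    ("sunny", "warm", "high", "strong", "cool", "change"),
]
LABELS = ["yes", "yes", "no", "yes"]


def train(rows, labels):
    """Estimate P(Y = y_k) and the counts behind P(X_i = x_ij | Y = y_k)."""
    label_counts = Counter(labels)
    # cond_counts[(i, y)][x] = number of rows with label y whose attribute i equals x
    cond_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, x in enumerate(row):
            cond_counts[(i, y)][x] += 1
    priors = {y: Fraction(n, len(labels)) for y, n in label_counts.items()}
    return priors, cond_counts, label_counts


def classify(x_new, model):
    """Return the label maximizing P(Y = y_k) * prod_i P(X_i | Y = y_k), plus all scores."""
    priors, cond_counts, label_counts = model
    scores = {}
    for y, p_y in priors.items():
        score = p_y
        for i, x in enumerate(x_new):
            score *= Fraction(cond_counts[(i, y)][x], label_counts[y])
        scores[y] = score
    return max(scores, key=scores.get), scores


model = train(ROWS, LABELS)
label, scores = classify(("sunny", "warm", "high", "strong", "warm", "same"), model)
print(label, scores)
```

The scores are unnormalized: dividing each by their sum would give the posterior, but the arg max is the same either way.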

9 Naïve Bayes Applications
- Text classification
  - Which e-mails are spam?
  - Which e-mails are meeting notices?
  - Which author wrote a document?
- Classifying mental states
  - Learning P(BrainActivity | WordCategory)
  - Pairwise classification accuracy: 85%
[Figure panels: People Words, Animal Words.]

10 Neural Networks
Some material adapted from lecture notes by Lise Getoor and Ron Parr. Adapted from slides by Tim Finin and Marie desJardins.

11 Neural function
- Brain function (thought) occurs as the result of the firing of neurons.
- Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters.
- Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds.
- Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength.
- There are about $10^{11}$ neurons and about $10^{14}$ synapses in the human brain(!)

12 Biology of a neuron

13 Brain structure
- Different areas of the brain have different functions.
  - Some areas seem to have the same function in all humans (e.g., Broca's region for motor speech); the overall layout is generally consistent.
  - Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly.
- We don't know how different functions are "assigned" or acquired.
  - Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors)
  - Partly the result of experience (learning)
- We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought".
- Artificial neural networks are not nearly as complex or intricate as the actual brain structure.

14 Comparison of computing power
- Computers are way faster than neurons…
- But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel.
- Neural networks are designed to be massively parallel.
- The brain is effectively a billion times faster in updates per second (10^14 vs. 10^5).

INFORMATION CIRCA 1995
                   Computer                         Human Brain
Computation units  1 CPU, 10^5 gates                10^11 neurons
Storage units      10^4 bits RAM, 10^10 bits disk   10^11 neurons, 10^14 synapses
Cycle time         10^-8 sec                        10^-3 sec
Bandwidth          10^4 bits/sec                    10^14 bits/sec
Updates / sec      10^5                             10^14

15 Neural networks
- Neural networks are made up of nodes or units, connected by links.
- Each link has an associated weight and activation level.
- Each node has an input function (typically summing over weighted inputs), an activation function, and an output.
[Figure: layered feed-forward network with input units, hidden units, and output units.]

16 Model of a neuron
- Neuron modeled as a unit i
- Weight on the link from input unit j to unit i: $w_{ji}$
- Net input to unit i is: $in_i = \sum_j w_{ji}\, o_j$, where $o_j$ is the output of unit j
- Activation function g() determines the neuron's output
  - g() is typically a sigmoid
  - with a hard threshold instead, output is either 0 or 1 (no partial activation)

17 “Executing” neural networks
- Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level.
- Working forward through the network, the input function of each unit is applied to compute the input value.
  - Usually this is just the weighted sum of the activation on the links feeding into this node.
- The activation function transforms this input value into a final value.
  - Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node.
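The forward pass described on this slide can be sketched as follows. The network shape, the weights, and the names `unit_output` and `feed_forward` are all hypothetical; each unit is stored as a pair of incoming weights and a bias, which is one of several reasonable encodings.

```python
import math


def sigmoid(z):
    """Smooth activation function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Hard-threshold activation: output is 0 or 1, no partial activation."""
    return 1.0 if z > 0 else 0.0


def unit_output(weights, bias, inputs, g=sigmoid):
    """Input function (weighted sum of incoming activations) followed by g."""
    net = sum(w * o for w, o in zip(weights, inputs)) + bias
    return g(net)


def feed_forward(layers, inputs, g=sigmoid):
    """Propagate activations layer by layer.

    layers is a list of layers; each layer is a list of (weights, bias) pairs,
    one pair per unit in that layer.
    """
    activations = list(inputs)
    for layer in layers:
        activations = [unit_output(w, b, activations, g) for w, b in layer]
    return activations


# A made-up network: 2 input units, 2 hidden units, 1 output unit.
layers = [
    [([1.0, 1.0], -0.5), ([1.0, 1.0], -1.5)],  # hidden layer
    [([1.0, -2.0], 0.0)],                      # output layer
]
print(feed_forward(layers, [0.0, 1.0]))
```

Passing `g=step` instead of the default turns every unit into a linear threshold unit, which is the setting the perceptron slides assume.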

18 Learning rules
- Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change weights to learn to produce these outputs using the perceptron learning rule.
  - assumes binary valued input/outputs
  - assumes a single linear threshold unit

19 Perceptron learning rule
- If the target output for unit i is $t_i$: $w_{ji} \leftarrow w_{ji} + \eta\,(t_i - o_i)\,o_j$, where η is the learning rate
- Equivalent to the intuitive rules:
  - If output is correct, don't change the weights
  - If output is low ($o_i = 0$, $t_i = 1$), increment weights for all the inputs which are 1
  - If output is high ($o_i = 1$, $t_i = 0$), decrement weights for all inputs which are 1
- Must also adjust threshold. Or equivalently assume there is a weight $w_{0i}$ for an extra input unit that has an output of 1.

20 Perceptron learning algorithm
- Repeatedly iterate through examples adjusting weights according to the perceptron learning rule until all outputs are correct:
  - Initialize the weights to all zero (or random)
  - Until outputs for all training examples are correct
    - for each training example e do
      - compute the current output $o_i$
      - compare it to the target $t_i$ and update weights
- Each execution of the outer loop is called an epoch.
- For multiple category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold.
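The learning loop above can be sketched for a single linear threshold unit as follows. The names `predict` and `train_perceptron`, the learning rate `eta`, and the `max_epochs` cap are assumptions added for illustration (the cap keeps the loop finite when the data is not linearly separable); the threshold is handled with the extra-input trick, storing it as `weights[0]`.

```python
def predict(weights, x):
    """Linear threshold unit; weights[0] is w_0, the weight on an extra input fixed at 1."""
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if net > 0 else 0


def train_perceptron(examples, eta=1.0, max_epochs=100):
    """Apply the perceptron learning rule until an epoch produces no errors.

    examples is a list of (inputs, target) pairs with binary values.
    Returns (weights, epochs_used); epochs_used is None if max_epochs ran out.
    """
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)  # initialize the weights to all zero
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                errors += 1
                weights[0] += eta * (t - o)          # extra input is always 1
                for j, xj in enumerate(x):
                    weights[j + 1] += eta * (t - o) * xj
        if errors == 0:
            return weights, epoch
    return weights, None


# Logical AND is linearly separable, so the loop is expected to terminate.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, epochs = train_perceptron(AND)
print(weights, epochs)
```

Because `(t - o)` is +1 when the output is low and -1 when it is high, and the update is scaled by the input value, only weights on inputs equal to 1 change, matching the intuitive rules on the previous slide.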

21 Representation limitations of a perceptron
- Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data.
  - i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space
[Figure: a separating hyperplane on which the net input minus the threshold equals 0; the value is > 0 on one side and < 0 on the other.]

22 Perceptron learnability
- Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969).
- Unfortunately, many functions (like parity) cannot be represented by a linear threshold unit (LTU).
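A small search makes the parity limitation concrete for two inputs, where parity is XOR. The sketch below tries every weight triple on a coarse grid; the grid range and the names `ltu` and `representable` are arbitrary choices for illustration. Finding nothing on a grid is only suggestive, not a proof, although for XOR the impossibility does hold for all real weights.

```python
from itertools import product


def ltu(w0, w1, w2, x1, x2):
    """Two-input linear threshold unit with bias weight w0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0


def representable(truth_table, grid):
    """Return the first (w0, w1, w2) on the grid that reproduces the table, else None."""
    for w0, w1, w2 in product(grid, repeat=3):
        if all(ltu(w0, w1, w2, x1, x2) == t for (x1, x2), t in truth_table.items()):
            return (w0, w1, w2)
    return None


GRID = [k / 2 for k in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # two-input parity

print(representable(AND, GRID))  # some weight triple is found
print(representable(XOR, GRID))  # None
```

The reason no weights work for XOR: the table requires w0 <= 0, w0 + w1 > 0, w0 + w2 > 0, and w0 + w1 + w2 <= 0, and adding the middle two inequalities contradicts the outer two.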

23 Learning: Backpropagation
- Similar to the perceptron learning algorithm, we cycle through our examples:
  - if the output of the network is correct, no changes are made
  - if there is an error, the weights are adjusted to reduce the error
- The trick is to assess the blame for the error and divide it among the contributing weights.

24 Output layer
- As in the perceptron learning algorithm, we want to minimize the difference between the target output and the output actually computed: $W_{ji} \leftarrow W_{ji} + \alpha \times a_j \times (T_i - O_i) \times g'(in_i)$
  - $a_j$: activation of hidden unit j
  - $(T_i - O_i)$: difference between target and computed output at output unit i
  - $g'(in_i)$: derivative of the activation function
  - α: learning rate

25 Hidden layers
- Need to define error; we do error backpropagation.
- Intuition: each hidden node j is “responsible” for some fraction of the error $\Delta_i$ in each of the output nodes to which it connects.
- $\Delta_i$ is divided according to the strength of the connection between the hidden node and the output node and propagated back to provide the $\Delta_j$ values for the hidden layer: $\Delta_j = g'(in_j) \sum_i W_{ji}\, \Delta_i$
- Update rule: $W_{kj} \leftarrow W_{kj} + \alpha \times I_k \times \Delta_j$, where $I_k$ is the activation of input unit k and α is the learning rate

26 Backpropagation algorithm
- Compute the Δ values for the output units using the observed error.
- Starting with the output layer, repeat the following for each layer in the network, until the earliest hidden layer is reached:
  - propagate the Δ values back to the previous layer
  - update the weights between the two layers
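One backpropagation step for a network with a single hidden layer might look like the sketch below. It assumes sigmoid units (so the derivative is `a * (1 - a)`), a learning rate `alpha`, and a bias handled as an extra unit fixed at 1; the names `w_hid`, `w_out`, and `backprop_step`, and the starting weights, are hypothetical. It follows the standard delta-rule formulation, which may differ in detail from the equations on the original slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hid, w_out):
    """w_hid[j][k]: weight from input k to hidden j; w_out[i][j]: hidden j to output i.
    The last weight in every row belongs to an extra unit fixed at 1 (bias)."""
    inputs = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in w_hid] + [1.0]
    outputs = [sigmoid(sum(w * a for w, a in zip(row, hidden))) for row in w_out]
    return inputs, hidden, outputs


def backprop_step(x, target, w_hid, w_out, alpha=0.5):
    """Update the weights in place for one example; return the squared error before the update."""
    inputs, hidden, outputs = forward(x, w_hid, w_out)
    # Output deltas: (T_i - O_i) * g'(in_i), with g' = O_i * (1 - O_i) for a sigmoid.
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, outputs)]
    # Hidden deltas: g'(in_j) * sum_i W_ji * delta_i, computed before w_out changes.
    delta_hid = [
        a * (1 - a) * sum(w_out[i][j] * delta_out[i] for i in range(len(w_out)))
        for j, a in enumerate(hidden[:-1])  # the bias unit has no delta
    ]
    for i, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += alpha * hidden[j] * delta_out[i]
    for j, row in enumerate(w_hid):
        for k in range(len(row)):
            row[k] += alpha * inputs[k] * delta_hid[j]
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, outputs))


# Made-up starting weights for a 2-input, 2-hidden, 1-output network.
w_hid = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1]]
w_out = [[0.2, -0.1, 0.05]]
first_error = backprop_step((1.0, 0.0), [1.0], w_hid, w_out)
for _ in range(200):
    last_error = backprop_step((1.0, 0.0), [1.0], w_hid, w_out)
print(first_error, last_error)
```

Repeating the step on one example only shows that the error on that example shrinks; training on a full data set cycles through all examples and can still stall in a local minimum.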

27 Backprop issues
- “Backprop is the cockroach of machine learning. It’s ugly, and annoying, but you just can’t get rid of it.” (Geoff Hinton)
- Problems:
  - black box
  - local minima

28 Unsupervised Learning: Clustering
Some material adapted from slides by Andrew Moore, CMU. Visit http://www.autonlab.org/tutorials/ for Andrew’s repository of Data Mining tutorials.

29 Unsupervised Learning
- Supervised learning uses labeled data pairs (x, y) to learn a function f : X→Y.
- But, what if we don’t have labels?
- No labels = unsupervised learning
- Only some points are labeled = semi-supervised learning
  - Labels may be expensive to obtain, so we only get a few.
- Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.

30 Clustering Data

31-33 K-Means Clustering
K-Means ( k, data )
- Randomly choose k cluster center locations (centroids).
- Loop until convergence:
  - Assign each point to the cluster of the closest centroid.
  - Reestimate the cluster centroids based on the data assigned to each.
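The K-Means pseudocode can be turned into a short runnable sketch. Several details are assumptions made for illustration: squared Euclidean distance, initial centroids sampled from the data points, convergence defined as "assignments stopped changing", an optional `init` argument for reproducible runs, and leaving a centroid where it is if its cluster becomes empty.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, init=None, seed=0, max_iters=100):
    """Return (centroids, assignment), where assignment[n] is the cluster index of data[n]."""
    rng = random.Random(seed)
    # Randomly choose k cluster centers unless the caller supplies them.
    centroids = list(init) if init is not None else rng.sample(data, k)
    assignment = None
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in data
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Re-estimate each centroid from the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(data, 2, init=[(0, 0), (10, 10)])
print(centroids, assignment)
```

With `init=None` the result depends on the seed, which is the sensitivity to initial points discussed on the later "Problems with K-Means" slide.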

34 K-Means Animation
Example generated by Andrew Moore using Dan Pelleg’s super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999.

35 Problems with K-Means
- Very sensitive to the initial points.
  - Do many runs of k-Means, each with different initial centroids.
  - Seed the centroids using a better method than random (e.g., farthest-first sampling).
- Must manually choose k.
  - Learn the optimal k for the clustering. (Note that this requires a performance measure.)
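Farthest-first sampling, mentioned above as a better seeding method, can be sketched as follows. This is one common reading of the idea (pick a random first seed, then repeatedly pick the point farthest from its nearest already-chosen seed); the function name `farthest_first` and the use of squared Euclidean distance are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(data, k, seed=0):
    """Choose k seed centroids from data by farthest-first traversal."""
    rng = random.Random(seed)
    seeds = [rng.choice(data)]  # the first seed is random
    while len(seeds) < k:
        # Pick the point whose distance to its nearest chosen seed is largest.
        next_seed = max(data, key=lambda p: min(sq_dist(p, s) for s in seeds))
        seeds.append(next_seed)
    return seeds


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(farthest_first(data, 2))
```

On two well-separated groups like these, the second seed always lands in the group the first seed missed, so a K-Means run started from these seeds begins with one centroid per group. The method is sensitive to outliers, since an isolated point is by definition far from everything.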

36 Problems with K-Means
- How do you tell it which clustering you want?
  - Constrained clustering techniques
    - Same-cluster constraint (must-link)
    - Different-cluster constraint (cannot-link)

37 Learning Bayes Nets
Some material adapted from lecture notes by Lise Getoor and Ron Parr. Adapted from slides by Tim Finin and Marie desJardins.

38 Learning Bayesian networks
- Given training set D
- Find B that best matches D
  - model selection
  - parameter estimation
[Figure: data D is fed to an inducer, which outputs a Bayesian network over nodes E, B, A, C.]

39 Parameter estimation
- Assume known structure
- Goal: estimate BN parameters θ
  - entries in local probability models, P(X | Parents(X))
- A parameterization θ is good if it is likely to generate the observed data: $L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$ (the product form holds for i.i.d. samples)
- Maximum Likelihood Estimation (MLE) principle: choose θ* so as to maximize L

40 Parameter estimation II
- The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter
- The MLE (maximum likelihood estimate) solution: for each value x of a node X and each instantiation u of Parents(X), $\theta^*_{x \mid u} = \frac{N(x, u)}{N(u)}$, where the counts N are the sufficient statistics
- Just need to collect the counts for every combination of parents and children observed in the data
- MLE is equivalent to an assumption of a uniform prior over parameter values

41 Sufficient statistics: Example
- Why are the counts sufficient?
- $\theta^*_{A \mid E,B} = \frac{N(A, E, B)}{N(E, B)}$
[Figure: Bayesian network over Earthquake, Burglary, Alarm, Moon-phase, and Light-level.]
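The count-based estimate for the Alarm node can be sketched directly. The observations below are invented for illustration, each row being an (Earthquake, Burglary, Alarm) triple of 0/1 values, and the function name `mle_cpt` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical fully observed samples: (E, B, A), each 0 or 1.
SAMPLES = [
    (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (0, 1, 1), (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 0, 0),
    (1, 1, 1),
]


def mle_cpt(samples):
    """Return {(a, e, b): N(a, e, b) / N(e, b)} for every observed parent setting (e, b)."""
    family_counts = Counter(samples)                         # N(E, B, A)
    parent_counts = Counter((e, b) for e, b, _ in samples)   # N(E, B)
    return {
        (a, e, b): Fraction(family_counts[(e, b, a)], parent_counts[(e, b)])
        for (e, b) in parent_counts
        for a in (0, 1)
    }


cpt = mle_cpt(SAMPLES)
print(cpt[(1, 0, 0)])  # P(A=1 | E=0, B=0) = 1/3
print(cpt[(1, 0, 1)])  # P(A=1 | E=0, B=1) = 2/3
```

Only the two count tables are needed to produce the estimates, which is the sense in which the counts are sufficient statistics: the individual rows can be discarded once they are tallied. Parent settings that never occur in the data get no entry at all, which is one practical weakness of pure MLE.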

42 Model selection
- Goal: Select the best network structure, given the data
- Input:
  - Training data
  - Scoring function
- Output:
  - A network that maximizes the score

43 Structure selection: Scoring
- Bayesian: prior over parameters and structure
  - get balance between model complexity and fit to data as a byproduct
- Score(G:D) = log P(G|D) = log [P(D|G) P(G)] + constant, where P(D|G) is the marginal likelihood and P(G) is the prior
- Marginal likelihood just comes from our parameter estimates
- Prior on structure can be any measure we want; typically a function of the network complexity
- Same key property: decomposability. Score(structure) = $\sum_i$ Score(family of $X_i$)

44 Heuristic search
[Figure: candidate networks over nodes B, E, A, C reached from the current network by local moves: add E → C (Δscore(C)), delete E → A (Δscore(A)), reverse E → A (Δscore(A)).]

45 Exploiting decomposability
[Figure: the same local moves over nodes B, E, A, C (add E → C, delete E → A, reverse E → A), followed by a further delete E → A move with its Δscore(A).]
- To recompute scores, only need to re-score families that changed in the last move

46 Variations on a theme
- Known structure, fully observable: only need to do parameter estimation
- Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
- Known structure, missing values: use expectation maximization (EM) to estimate parameters
- Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
- Unknown structure, hidden variables: too hard to solve!

47 Handling missing data
- Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
- Should we throw that data away??
- Idea: Guess the missing values based on the other data
[Figure: Bayesian network over Earthquake, Burglary, Alarm, Moon-phase, and Light-level.]

48 EM (expectation maximization)
- Guess probabilities for nodes with missing values (e.g., based on other observations)
- Compute the probability distribution over the missing values, given our guess
- Update the probabilities based on the guessed values
- Repeat until convergence

49 EM example
- Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
- We estimate the CPTs based on the rest of the data
- We then estimate P(Burglary) for November 27 from those CPTs
- Now we recompute the CPTs as if that estimated value had been observed
- Repeat until convergence!
[Figure: Bayesian network with Earthquake and Burglary as parents of Alarm.]
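The EM example can be sketched for the three-node network in which Earthquake and Burglary are the parents of Alarm. Everything concrete here is an assumption for illustration: the records are invented, the row with `None` for Burglary plays the role of the November 27 observation, the missing value is treated as a fractional (soft) count, and the code assumes every parent setting needed for the missing row also appears among the complete rows.

```python
from collections import defaultdict

# Hypothetical records (E, B, A); None marks the unobserved Burglary value.
RECORDS = (
    [(0, 0, 0)] * 6 + [(0, 1, 1)] * 2
    + [(1, 0, 0), (1, 0, 1), (1, 1, 1)]
    + [(1, None, 1)]
)


def estimate(weighted):
    """M-step: weighted count estimates of P(B=1) and P(A=1 | E, B)."""
    total = sum(w for _, _, _, w in weighted)
    p_b = sum(w for _, b, _, w in weighted if b == 1) / total
    num, den = defaultdict(float), defaultdict(float)
    for e, b, a, w in weighted:
        den[(e, b)] += w
        num[(e, b)] += w * a
    p_a = {eb: num[eb] / den[eb] for eb in den if den[eb] > 0}
    return p_b, p_a


def posterior_b(e, a, p_b, p_a):
    """E-step: P(B=1 | E=e, A=a), proportional to P(B) * P(A=a | E=e, B)."""
    def likelihood(b):
        p = p_a[(e, b)]
        return p if a == 1 else 1.0 - p
    on = p_b * likelihood(1)
    off = (1.0 - p_b) * likelihood(0)
    return on / (on + off)


def em(records, iters=50):
    complete = [(e, b, a, 1.0) for e, b, a in records if b is not None]
    missing = [(e, a) for e, b, a in records if b is None]
    p_b, p_a = estimate(complete)  # initial CPTs from the rest of the data
    history = []
    for _ in range(iters):
        weighted = list(complete)
        guesses = []
        for e, a in missing:
            q = posterior_b(e, a, p_b, p_a)      # current estimate of P(B=1) for this row
            guesses.append(q)
            weighted.append((e, 1, a, q))        # count the row as B=1 with weight q
            weighted.append((e, 0, a, 1.0 - q))  # and as B=0 with weight 1 - q
        p_b, p_a = estimate(weighted)            # recompute the CPTs
        history.append(guesses)
    return p_b, p_a, history


p_b, p_a, history = em(RECORDS)
print(history[0][0], history[-1][0], p_b)
```

A fixed number of iterations stands in for "repeat until convergence"; a practical version would instead stop once the estimates change by less than some tolerance.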

