1 Machine Learning and Review (Reading: Chapter 18)

2 Bayesian Approach
- Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright
- Prior knowledge can be combined with observed data to determine a hypothesis
- Bayesian methods can accommodate hypotheses that make probabilistic predictions
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities

3 Applying Bayes Theorem
- Best hypothesis = most probable hypothesis: the maximum a posteriori (MAP) hypothesis
- Variables:
  - h = hypothesis
  - D = observed training data
- Prior probabilities:
  - P(h) = prior probability of hypothesis h
  - P(D) = prior probability that training data D is observed
- P(D|h) = probability of observing data D given some world where hypothesis h holds
- Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)

4 Defining the MAP hypothesis
- h_MAP = argmax_{h∈H} P(h|D)
- h_MAP = argmax_{h∈H} P(D|h) P(h) / P(D)   (using Bayes theorem)
- h_MAP = argmax_{h∈H} P(D|h) P(h)   (P(D) is a constant independent of h)
- h_MAP = argmax_{h∈H} P(D|h)   (when we can assume each hypothesis h is equally probable)
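
A small sketch of the MAP computation; the coin-bias hypotheses, prior, and likelihood below are made-up placeholders, not from the lecture:

```python
# MAP hypothesis: argmax over h of P(D|h) * P(h).
# P(D) is dropped because it is constant across hypotheses.

def map_hypothesis(hypotheses, prior, likelihood, data):
    """Return the hypothesis h maximizing P(data|h) * P(h)."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior[h])

# Hypothetical example: is a coin fair or biased, given 3 heads in a row?
prior = {"fair": 0.7, "biased": 0.3}

def likelihood(data, h):
    p_heads = 0.5 if h == "fair" else 0.9
    return p_heads ** data  # probability of `data` heads in a row

print(map_hypothesis(prior.keys(), prior, likelihood, 3))
# fair: 0.5**3 * 0.7 = 0.0875; biased: 0.9**3 * 0.3 = 0.2187 -> "biased"
```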

5 Bayes Optimal Classifier
- The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
- Possible classifications: v_j ∈ V
- Classification = argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

6 Example
- V = {p, n}
- P(h_1|D) = .4   P(p|h_1) = 0   P(n|h_1) = 1
- P(h_2|D) = .3   P(p|h_2) = 1   P(n|h_2) = 0
- P(h_3|D) = .3   P(p|h_3) = 1   P(n|h_3) = 0
- Σ_{h_i∈H} P(n|h_i) P(h_i|D) = .4
- Σ_{h_i∈H} P(p|h_i) P(h_i|D) = .6
- argmax_{v_j∈{p,n}} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = p
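
The example can be checked directly in code; this sketch just recomputes the two sums above:

```python
# Bayes optimal classification: argmax over v of sum_i P(v|h_i) * P(h_i|D).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # P(h_i|D)
prediction = {                                  # P(v|h_i)
    "h1": {"p": 0.0, "n": 1.0},
    "h2": {"p": 1.0, "n": 0.0},
    "h3": {"p": 1.0, "n": 0.0},
}

def bayes_optimal(values, posterior, prediction):
    score = {v: sum(prediction[h][v] * posterior[h] for h in posterior)
             for v in values}
    return max(score, key=score.get), score

print(bayes_optimal(["p", "n"], posterior, prediction))
# ('p', {'p': 0.6, 'n': 0.4})
```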

7 Properties of Bayesian Approach
- Bayesian learning is optimal
- It is easy to estimate P(h) by counting in the training data
- Estimating P(D|h) is not feasible. Why?

8 P(D|h)

#  S-length  S-width  P-length  Class
1  high      high     high      Versicolour
2  low       high     low       Setosa
3  low       high     low       Virginica
4  low       low      med       Virginica
5  high      high     high      Versicolour
6  high      high     med       Setosa
7  high      high     low       Setosa
8  high      high     high      Versicolour
9  high      high     high      Versicolour

9 Naïve Bayes
- Assume independence of the attributes: D = a_1, a_2, …, a_n
  P(a_1, a_2, …, a_n | v_j) = Π_i P(a_i|v_j)
- Substitute into the v_MAP formula:
  v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

10 v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

#  S-length  S-width  P-length  Class
1  high      high     high      Versicolour
2  low       high     low       Setosa
3  low       high     low       Virginica
4  low       low      med       Virginica
5  high      high     high      Versicolour
6  high      high     med       Setosa
7  high      high     low       Setosa
8  high      high     high      Versicolour
9  high      high     high      Versicolour
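
A minimal Naïve Bayes sketch over this table, estimating every probability by counting; the query instance at the end is hypothetical, and the table is read as reconstructed above:

```python
from collections import Counter

# Rows from the slide's table: (S-length, S-width, P-length, Class).
data = [
    ("high", "high", "high", "Versicolour"),
    ("low",  "high", "low",  "Setosa"),
    ("low",  "high", "low",  "Virginica"),
    ("low",  "low",  "med",  "Virginica"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "med",  "Setosa"),
    ("high", "high", "low",  "Setosa"),
    ("high", "high", "high", "Versicolour"),
    ("high", "high", "high", "Versicolour"),
]

def naive_bayes(query):
    """Return argmax_v P(v) * prod_i P(a_i|v), estimated from `data`."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for v, n_v in class_counts.items():
        score = n_v / len(data)  # P(v)
        for i, a in enumerate(query):
            n_c = sum(1 for row in data if row[-1] == v and row[i] == a)
            score *= n_c / n_v   # P(a_i|v)
        scores[v] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(("high", "high", "high")))  # hypothetical query instance
```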

11 Estimating Probabilities
- What happens when the number of data elements is small?
- Suppose the true P(S-length=high | Virginica) = .05
- There are only 2 instances with Class = Virginica
- We estimate the probability by n_c/n, i.e., #(S-length=high and Virginica) / #Virginica
- In this data, #(S-length=high and Virginica) = 0
- Then, instead of .05, we use an estimated probability of 0
- Two problems:
  - Biased underestimate of the probability
  - This probability term will dominate: a single zero wipes out the entire Naïve Bayes product

12 Instead
- Use priors as well: estimate the probability as (n_c + m·p) / (n + m)
  - p = prior estimate of the probability
  - m = a constant called the equivalent sample size
- m determines how heavily to weight p relative to the observed data
- Typical method: assume a uniform prior (p = 1/k for k possible attribute values)
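
The m-estimate is one line of arithmetic; the values of p and m below are illustrative (p = 1/2 assumes a uniform prior over two attribute values):

```python
def m_estimate(n_c, n, p, m):
    """(n_c + m*p) / (n + m): blend the observed frequency with prior p.

    p: prior estimate (uniform prior: 1/k for k attribute values).
    m: equivalent sample size; a larger m weights p more heavily.
    """
    return (n_c + m * p) / (n + m)

# The slide's scenario: 0 of 2 Virginica instances have S-length=high.
print(0 / 2)                         # raw estimate: 0.0, dominates the product
print(m_estimate(0, 2, p=0.5, m=2))  # (0 + 2*0.5) / (2 + 2) = 0.25, nonzero
```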

13 Benefits of Naïve Bayes
- Practical
- As effective as, and in some cases more effective than, other machine learners

14 Review for Midterm
- Concepts you should know:
  - Search algorithms: depth-first, breadth-first, iterative deepening, A*, greedy, hill-climbing, beam
  - Constraint propagation
  - Game playing
  - Bayesian nets
  - A little on machine learning

15 Midterm format
- Multiple choice
- Short answer questions
- Problem solving
- Essay
- An example midterm will be posted under links

16 Concepts
- Any words in yellow, light blue, or pink on the slides

17 Uninformed Search
- Depth-first
- Breadth-first
- Iterative deepening

18 Formulating Problems as Search
Given an initial state and a goal, find the sequence of actions leading through a sequence of states to the final goal state.
Terms:
- Successor function: given a state, returns a set of {action, successor} pairs
- State space: the set of all states reachable from the initial state
- Path: a sequence of states connected by actions
- Goal test: is a given state the goal state?
- Path cost: a function assigning a numeric cost to each path
- Solution: a path from the initial state to a goal state

19 Breadth-first
- OPEN = start node; CLOSED = empty
- While OPEN is not empty do:
  - Remove the leftmost state from OPEN, call it X
  - If X = goal state, return success
  - Put X on CLOSED
  - SUCCESSORS = successor function(X)
  - Remove any successors already on OPEN or CLOSED
  - Put the remaining successors on the right end of OPEN
- End while

20 Depth-first
- OPEN = start node; CLOSED = empty
- While OPEN is not empty do:
  - Remove the leftmost state from OPEN, call it X
  - If X = goal state, return success
  - Put X on CLOSED
  - SUCCESSORS = successor function(X)
  - Remove any successors already on OPEN or CLOSED
  - Put the remaining successors on the left end of OPEN
- End while
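
Breadth-first and depth-first can share one implementation, since they differ only in which end of OPEN receives new states. A sketch following the slides' OPEN/CLOSED formulation (the successors and is_goal interfaces are assumptions):

```python
from collections import deque

def graph_search(start, is_goal, successors, mode="bfs"):
    """BFS and DFS differ only in where successors are placed on OPEN."""
    open_list = deque([start])
    closed = set()
    while open_list:
        x = open_list.popleft()      # remove the leftmost state
        if is_goal(x):
            return x
        closed.add(x)
        for s in successors(x):
            if s in closed or s in open_list:
                continue             # already seen: drop it
            if mode == "bfs":
                open_list.append(s)      # right end: queue, breadth-first
            else:
                open_list.appendleft(s)  # left end: stack, depth-first
    return None
```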

21 Can we combine the benefits of both?
- Depth-limited search:
  - Select some depth limit and explore the problem using DFS down to that limit
  - How do we select the limit?
- Iterative deepening (see the sketch below):
  - DFS with depth limit 1
  - DFS with depth limit 2, and so on, up to depth d
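
A minimal iterative-deepening sketch under the same assumed interfaces (plain tree search, no CLOSED list):

```python
def depth_limited(state, is_goal, successors, limit):
    """DFS that stops descending below `limit`; returns a goal state or None."""
    if is_goal(state):
        return state
    if limit == 0:
        return None
    for s in successors(state):
        result = depth_limited(s, is_goal, successors, limit - 1)
        if result is not None:
            return result
    return None

def iterative_deepening(start, is_goal, successors, max_depth):
    """Run depth-limited DFS with limit 1, 2, ... up to max_depth."""
    for d in range(1, max_depth + 1):
        result = depth_limited(start, is_goal, successors, d)
        if result is not None:
            return result
    return None
```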

22 Complexity Analysis
- Completeness: is the algorithm guaranteed to find a solution when there is one?
- Optimality: does the strategy find the optimal solution?
- Time: how long does it take to find a solution?
- Space: how much memory is needed to perform the search?
- Is this notion of completeness the same as completeness in logic?

23 Cost variables
- Time: number of nodes generated
- Space: maximum number of nodes stored in memory
- Branching factor b: maximum number of successors of any node
- Depth d: depth of the shallowest goal node
- Path length m: maximum length of any path in the state space

24 Informed Search
- Best-first
- A*
- Greedy
- Hill climbing
- Variants: randomness, simulated annealing, local beam search
- Online search will not be on the midterm

25 Greedy Search
- OPEN = start node; CLOSED = empty
- While OPEN is not empty do:
  - Remove the leftmost state from OPEN, call it X
  - If X = goal state, return success
  - Put X on CLOSED
  - SUCCESSORS = successor function(X)
  - Remove any successors already on OPEN or CLOSED
  - Compute the heuristic function for each remaining successor
  - Put the remaining successors on OPEN (either end, since OPEN is re-sorted)
  - Sort the nodes on OPEN by the value of the heuristic function
- End while
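
The sort step is usually realized with a priority queue keyed on h(n); a sketch under the same assumed interfaces:

```python
import heapq, itertools

def greedy_search(start, is_goal, successors, h):
    """Always expand the open node with the smallest heuristic value h(n)."""
    tie = itertools.count()  # breaks heap ties without comparing states
    open_heap = [(h(start), next(tie), start)]
    closed = set()
    while open_heap:
        _, _, x = heapq.heappop(open_heap)
        if is_goal(x):
            return x
        if x in closed:
            continue
        closed.add(x)
        for s in successors(x):
            if s not in closed:
                heapq.heappush(open_heap, (h(s), next(tie), s))
    return None
```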

26 A* Search
- Try to expand the node that is on the least-cost path to the goal
- Evaluation function: f(n) = g(n) + h(n)
  - h(n) is the heuristic function: the estimated cost from node n to the goal
  - g(n) is the cost from the initial state to node n
  - f(n) is the estimated cost of the cheapest solution that passes through n
- If h(n) is an underestimate of the true cost to the goal:
  - A* is complete
  - A* is optimal
  - A* is optimally efficient: no other algorithm using h(n) is guaranteed to expand fewer states
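
A compact A* sketch; here successors is assumed to yield (neighbor, step_cost) pairs:

```python
import heapq, itertools

def a_star(start, is_goal, successors, h):
    """Expand nodes in order of f(n) = g(n) + h(n)."""
    tie = itertools.count()
    open_heap = [(h(start), next(tie), 0, start)]  # (f, tie, g, state)
    best_g = {start: 0}
    while open_heap:
        _, _, g, x = heapq.heappop(open_heap)
        if is_goal(x):
            return x, g                    # goal state and its path cost
        for s, cost in successors(x):
            g2 = g + cost
            if g2 < best_g.get(s, float("inf")):  # cheaper path to s found
                best_g[s] = g2
                heapq.heappush(open_heap, (g2 + h(s), next(tie), g2, s))
    return None
```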

27 Admissible heuristics
- An admissible heuristic never overestimates the cost to the goal
- h_1 and h_2 are admissible heuristics
- Consistency: the estimated cost of reaching the goal from n is no greater than the step cost of getting to a successor n' plus the estimated cost to the goal from n':
  h(n) ≤ c(n, a, n') + h(n')

28 Local Search Algorithms
- Operate using a single current state
- Move only to neighbors of that state
- Paths followed by the search are not retained
- Iterative improvement: keep a single current state and try to improve it

29 Steepest Ascent
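
A minimal steepest-ascent sketch; the neighbors and value interfaces are assumptions:

```python
def hill_climb(state, neighbors, value):
    """Steepest ascent: move to the best neighbor until none improves."""
    while True:
        best = max(neighbors(state), key=value, default=state)
        if value(best) <= value(state):
            return state     # local maximum (or the edge of a plateau)
        state = best
```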

31 Problems for hill climbing
When the higher the heuristic value the better, we are seeking maxima (objective functions); when the lower the value the better, minima (cost functions).
- Local maxima: a local maximum is a peak that is higher than each of its neighboring states, but lower than the global maximum
- Ridges: a sequence of local maxima
- Plateaux: an area of the state-space landscape where the evaluation function is flat

32 Some solutions
- Stochastic hill climbing: choose at random from among the uphill moves
- First-choice hill climbing: generate successors randomly until one is generated that is better than the current state
- Random-restart hill climbing: keep restarting from randomly generated initial states, stopping when a goal is found
- Simulated annealing (sketched below): generate a random move; accept it if it is an improvement, otherwise accept it with a continually decreasing probability
- Local beam search: keep track of k states rather than just 1
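
A simulated-annealing sketch; the geometric cooling schedule and the parameter values are illustrative choices, not from the slides:

```python
import math
import random

def simulated_annealing(state, neighbors, value,
                        t0=1.0, cooling=0.995, steps=10_000):
    """Accept improvements; accept a worsening move with probability
    exp(delta/T), where the temperature T keeps decreasing."""
    t = t0
    for _ in range(steps):
        nxt = random.choice(neighbors(state))  # neighbors() returns a list
        delta = value(nxt) - value(state)
        if delta > 0 or random.random() < math.exp(delta / t):
            state = nxt
        t *= cooling   # continually decreasing acceptance probability
    return state
```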

33 Example midterm problem on search

34 Constraint Propagation

35 CSP algorithm
Depth-first search is often used:
- Initial state: the empty assignment {}; all variables are unassigned
- Successor function: assign a value to any variable, provided there is no conflict with the constraints
- All CSP search algorithms generate successors by considering possible assignments for only a single variable at each node in the search tree
- Goal test: the current assignment is complete
- Path cost: a constant cost for every step
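
A minimal backtracking sketch of this formulation; the consistent predicate, which checks the constraints against a partial assignment, is an assumed interface:

```python
def backtrack(assignment, variables, domains, consistent):
    """Depth-first search over assignments, one variable per tree level."""
    if len(assignment) == len(variables):
        return assignment                    # goal test: assignment complete
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):           # no conflicts with constraints
            result = backtrack(assignment, variables, domains, consistent)
            if result is not None:
                return result
        del assignment[var]                  # undo and try the next value
    return None

# Usage: backtrack({}, variables, domains, consistent)
```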

36 Local search
- Complete-state formulation: every state is a complete assignment that might or might not satisfy the constraints
- Hill-climbing methods are appropriate

38 General-purpose methods for efficient implementation
- Which variable should be assigned next?
- In what order should its values be tried?
- Can we detect inevitable failure early?
- Can we take advantage of problem structure?

39 Order
- Choose the most constrained variable first: the variable with the fewest remaining values
  - The Minimum Remaining Values (MRV) heuristic
- What if there is more than one?
  - Tie breaker: the most constraining variable, i.e., the variable with the most constraints on the remaining variables

40 Order on value choice
- Given a variable, choose the least constraining value: the value that rules out the fewest values in the remaining variables

41 Forward Checking
- Keep track of the remaining legal values for unassigned variables
- Terminate search when any variable has no legal values
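
A sketch of the pruning step; the binary-constraint interface is an assumption:

```python
def forward_check(domains, var, value, compatible):
    """Prune values inconsistent with var=value from every other domain.

    compatible(v1, x1, v2, x2) -> True if v1=x1 and v2=x2 can coexist.
    Returns pruned copies of the domains, or None if a domain empties.
    """
    new_domains = {v: list(vals) for v, vals in domains.items()}
    new_domains[var] = [value]
    for other in domains:
        if other == var:
            continue
        new_domains[other] = [x for x in new_domains[other]
                              if compatible(var, value, other, x)]
        if not new_domains[other]:
            return None   # a variable has no legal values left: terminate
    return new_domains
```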

42 Example midterm problem on constraint satisfaction

43 Game Playing
- Minimax
- Alpha-beta pruning
- Evaluation function (what is the difference between a cost function, a utility function, a heuristic function, and an evaluation function?)
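
A minimax-with-alpha-beta sketch; successors and evaluate are assumed interfaces (evaluate scores a position from MAX's point of view):

```python
def alphabeta(state, depth, alpha, beta, maximizing, successors, evaluate):
    """Minimax with alpha-beta pruning to the given depth."""
    children = successors(state)
    if depth == 0 or not children:
        return evaluate(state)      # leaf: apply the evaluation function
    if maximizing:
        best = float("-inf")
        for child in children:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, successors, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:
                break               # beta cutoff: MIN will avoid this line
        return best
    else:
        best = float("inf")
        for child in children:
            best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                       True, successors, evaluate))
            beta = min(beta, best)
            if alpha >= beta:
                break               # alpha cutoff: MAX will avoid this line
        return best
```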

44 Bayesian nets
- Example problem

