
Slide 1: Optimal Learning
Informs TutORials, October 2008
Warren Powell and Peter Frazier, Princeton University
© 2008 Warren B. Powell, Princeton University

Slide 2: Outline
Introduction

Slide 3: Applications
Sports
» Who should be in the batting lineup for a baseball team?
» What is the best group of five basketball players, out of a team of 12, to be your starting lineup?
» Who are the best four people to crew the four-person boat in a rowing race?
» Who will perform the best in competition for your gymnastics team?

Slide 4: Applications
Figure out Manhattan:
» Walking
» Subway/walking
» Taxi
» Street bus
» Driving

Slide 5: Applications
Biomedical research
» How do we find the best drug to cure cancer?
» There are millions of combinations, with laboratory budgets that cannot test everything.
» We need a method for sequencing experiments.

Slide 6: Applications
Biosurveillance
» What is the prevalence of drug-resistant TB, MRSA, HIV/AIDS, malaria, and other diseases in the population?
» How do we efficiently collect information about the state of disease around the world?
» What are the best strategies for minimizing transmission?
(Figure: deaths from vector-borne diseases.)

Slide 7: Applications
High technology
» What is the best sensor to use to evaluate the status of optics for the National Ignition Facility?
» When should lenses be inspected?
» How often should an experiment be run to test a new hypothesis on the physics of fusion?
(Image: the National Ignition Facility.)

Slide 8: Applications
Stochastic optimization
» Stochastic search over surfaces that can only be measured with uncertainty.
» Simulation-optimization – what is the best set of parameters to produce the best manufacturing configuration?
» Active learning – how do we choose which samples to collect for machine learning applications?
» Exploration vs. exploitation in approximate dynamic programming – how do we decide which states to visit, balancing our need to estimate the value of being in a state against the reward from visiting a state?

Slide 9: Introduction
Deterministic optimization
» Find the choice with the highest reward (assumed known).
(Figure: five choices with known rewards; the winner is the one with the highest reward.)

Slide 10: Introduction
Stochastic optimization
» Now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made.
(Figure: five choices with random rewards; the winner is revealed only after the choice.)

Slide 11: Introduction
Optimal learning
» Now you have a budget of 10 measurements to determine which of the 5 choices is best. You have an initial probability distribution for the reward each will return, but you are willing to change your belief as you make choices. How should you sequence your measurements to produce the best answer in the end?
» We might keep trying the option we think is best … but what if the third or fourth choice is actually the best?

Slide 12: Introduction
Now assume we have five choices, with uncertainty in our belief about how well each one will perform. Imagine you can make a single measurement, after which you have to make a choice about which one is best. What would you do?
(Figure: belief distributions for choices 1–5.)

Slide 13: Introduction
Same setup: five choices with uncertain beliefs, and a single measurement before you must choose.
(Figure: the measurement shows no improvement, so the decision is unchanged.)

Slide 14: Introduction
Same setup: five choices with uncertain beliefs, and a single measurement before you must choose.
(Figure: the measurement produces a new solution.)
The value of learning is that it may change your decision.

Slide 15: Outline
Types of learning problems

Slide 16: Elements of a learning problem
Things we have to think about:
» How do we make measurements? What is the nature of the measurement decision?
» What is the effect of a measurement? How does it change our state of knowledge?
» What do we do with the results of what we learn from a measurement?
» How do we evaluate how well we have done with the results of our measurements?
» Do we learn as we go, or are we able to make a series of measurements before solving a problem?

Slide 17: Elements of a learning problem
Types of measurement decisions
» Stopping problems – observe until you have to make a decision, such as selling an asset.
» A finite (and not too big) set of choices.
» Subset selection:
  – What is the best group of people for a sports team?
  – What is the best subset of energy-saving technologies for a building?
» Continuous parameters – what is the best price, density, temperature, or speed?
» Linear, nonlinear and integer programming.

Slide 18: Elements of a learning problem
Optimal learning
» Now assume that you do not know the distribution of the reward, although you have an estimate (a "prior").
» After you make your choice, you observe the actual reward, which changes your belief about the distribution of rewards.
(Figure: an observation updating the belief distribution.)

Slide 19: Elements of a learning problem
Updating the distribution
» Frequentist view. Assume we start with observations $\hat{\mu}^1, \dots, \hat{\mu}^n$ of the value $\mu$ of a choice. Statistics:
$$\bar{\theta}^n = \frac{1}{n} \sum_{m=1}^n \hat{\mu}^m, \qquad \hat{\sigma}^{2,n} = \frac{1}{n-1} \sum_{m=1}^n \left(\hat{\mu}^m - \bar{\theta}^n\right)^2.$$
» Frequentist interpretation: $\bar{\theta}^n$ and $\hat{\sigma}^{2,n}$ are random variables reflecting the randomness in the observations of $\mu$.
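The frequentist statistics above translate directly into code. A minimal sketch (the function name is mine, not from the talk):

```python
def frequentist_stats(observations):
    """Sample mean and unbiased sample variance of repeated noisy
    measurements of a single choice; under repeated sampling, both
    are themselves random variables."""
    n = len(observations)
    theta_bar = sum(observations) / n                       # sample mean
    sigma2_hat = sum((w - theta_bar) ** 2 for w in observations) / (n - 1)
    return theta_bar, sigma2_hat

# Four noisy travel times for one path:
mean, var = frequentist_stats([22.0, 18.0, 20.0, 24.0])     # mean = 21.0
```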

Slide 20: Elements of a learning problem
Updating the distribution
» Bayesian view. We assume we start with a distribution of belief about the true mean: $\mu \sim N(\theta^n, \sigma^{2,n})$.
» Next we observe $\hat{\mu}^{n+1}$, which we assume comes from a distribution with variance $\sigma_\epsilon^2$ (we assume the variance is known).
» Using Bayes' theorem, we can show that our new distribution of belief about the true mean is normally distributed. We first define the precision of a distribution as the inverse variance, $\beta = 1/\sigma^2$. The updating formulas are
$$\theta^{n+1} = \frac{\beta^n \theta^n + \beta_\epsilon \hat{\mu}^{n+1}}{\beta^n + \beta_\epsilon}, \qquad \beta^{n+1} = \beta^n + \beta_\epsilon.$$
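The Bayesian updating formulas above can be sketched as a one-step update (the function name is mine; normal prior, normal observation with known variance):

```python
def bayes_update(theta_n, sigma2_n, w, sigma2_eps):
    """One Bayesian update of the belief about a true mean, using
    precisions (inverse variances): precisions add, and the posterior
    mean is the precision-weighted average of prior and observation."""
    beta_n = 1.0 / sigma2_n        # prior precision
    beta_eps = 1.0 / sigma2_eps    # measurement precision
    beta_next = beta_n + beta_eps  # precisions add
    theta_next = (beta_n * theta_n + beta_eps * w) / beta_next
    return theta_next, 1.0 / beta_next   # posterior mean and variance

# Equal prior and measurement variance -> posterior mean is halfway:
theta, sigma2 = bayes_update(theta_n=20.0, sigma2_n=25.0, w=26.0, sigma2_eps=25.0)
# theta = 23.0, sigma2 = 12.5
```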

Slide 21: Elements of a learning problem
Frequentist vs. Bayesian
» For optimal learning applications, we are generally in the situation where we have some knowledge about our choices, and we have to decide which one to measure to improve our final decision.
» The state of knowledge: in the frequentist view, the estimates $(\bar{\theta}^n_x, \hat{\sigma}^{2,n}_x)$ and observation counts; in the Bayesian view, the posterior $(\theta^n_x, \beta^n_x)$ for each choice $x$.
» For the remainder of the talk we adopt a Bayesian view, since it allows us to introduce prior knowledge, a common property of learning problems.

Slide 22: Elements of a learning problem
Relationships between beliefs and measurements
» Beliefs:
  – Uncorrelated – what we know about one choice tells us nothing about what we know about another choice.
  – Correlated – if our belief about one choice is high, our belief about another choice might also be high.
» Measurement noise:
  – Uncorrelated – if we were to make two measurements at the same time, the measurements are independent.
  – Correlated at a point in time – simultaneous measurements are correlated.
  – Correlated over time – measurements of different choices may or may not be correlated, but measurements of the same choice at different points in time are correlated.

Slide 23: Elements of a learning problem
Types of learning problems
» On-line learning ("learn as you earn"). Example problems:
  – Finding the best path to work.
  – What is the best set of energy-saving technologies to use for your building?
  – What is the best medication to control your diabetes?
» Off-line learning. There is a phase of information collection with a finite (sometimes small) budget: you are allowed to make a series of measurements, after which you make an implementation decision. Examples:
  – Finding the best drug compound through laboratory experiments.
  – Finding the best manufacturing configuration or engineering design, evaluated using an expensive simulation.
  – What is the best combination of designs for hydrogen production, storage and conversion?

Slide 24: Elements of a learning problem
Measuring the benefits of knowledge
» Minimizing/maximizing a cost or reward:
  – Minimizing expected cost / maximizing expected reward or utility.
  – Minimizing expected opportunity cost (minimizing the gap from the best possible).
  – Collecting information to produce a better solution to an optimization problem.
» Making the right choice:
  – Maximizing the probability of making the correct selection.
  – Indifference-zone selection – maximizing the probability of selecting a choice whose performance is within $\delta$ of the optimal.
» Statistical measures:
  – Minimizing a measure (squared or absolute) of the distance between observations and a predictive function (classical estimation).
  – Minimizing a metric (e.g. Kullback-Leibler divergence) measuring the distance between actual and predicted probability distributions.
  – Minimizing entropy (or entropic loss).

Slide 25: Outline
Measurement policies

Slide 26: Measurement policies
What do we know?
» The real average path times (errors are +/- 10 minutes):
  Path 1: 20 minutes, Path 2: 22 minutes, Path 3: 24 minutes, Path 4: 26 minutes.
» What we think:
  Path 1: 25 minutes, Path 2: 24 minutes, Path 3: 22 minutes, Path 4: 20 minutes.
» We act by choosing the path that we "think" is the best. The only way we learn anything new is by choosing a path.
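To see how a pure "act on what we think" rule behaves on this path problem, here is a small simulation. It is an illustrative sketch: the uniform ±10 noise and the running-average belief update are my assumptions for the demo, not part of the slide.

```python
import random

def greedy_paths(true_means, prior, n_trips, noise=10.0, seed=0):
    """Always take the path we currently believe is fastest; update that
    path's estimate with the running average of observed travel times.
    Paths we never try keep their prior estimate forever."""
    rng = random.Random(seed)
    est = list(prior)
    counts = [0] * len(prior)
    for _ in range(n_trips):
        x = min(range(len(est)), key=lambda i: est[i])     # believed fastest
        obs = true_means[x] + rng.uniform(-noise, noise)   # noisy trip time
        counts[x] += 1
        est[x] += (obs - est[x]) / counts[x]  # first observation replaces prior
    return est, counts

est, counts = greedy_paths(true_means=[20, 22, 24, 26],
                           prior=[25, 24, 22, 20], n_trips=100)
# Path 1 (truly fastest, but believed slowest at 25 minutes) may never be
# tried: as long as some other path's average stays below 25, the greedy
# rule never samples it, so we never learn it is best.
```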

Slides 27–31: Measurement policies
Illustration of calculations.
(Worked-example figures only; not recoverable from the transcript.)

Slide 32: Measurement policies
For problems with a finite number of alternatives:
» On-line learning (learn as you earn). This is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff. It is necessary to trade off what you think you will earn with each decision against the value of the information you will gain, which might improve decisions in the future.
» Off-line learning. You have a budget for taking measurements; after your budget is exhausted, you have to make a final choice. This is known as the ranking and selection problem.

Slide 33: Measurement policies
Elements of a measurement policy:
» Deterministic or sequential:
  – Deterministic policy – you decide what you are going to measure in advance.
  – Sequential policy – future measurements depend on past observations.
» Designing a measurement policy:
  – We have to strike a balance between the value of a good measurement policy and the cost of computing it.
  – If we are drilling oil exploration holes, we might be willing to spend a day on the computer deciding what to do next.
  – We may need a trivial calculation if we are guiding an algorithm that will perform thousands of iterations.
» Evaluating a policy:
  – The goal is to find a policy that gets us close enough to the truth that we make optimal (or near-optimal) decisions.
  – To do this, we have to assume a truth, and then use a policy to try to guess at the truth.

Slide 34: Measurement policies
Finding an optimal policy
» Dynamic programming formulation. Let $S^n = (\theta^n_x, \beta^n_x)_{x=1,\dots,M}$ be the "state of knowledge" – e.g. if we have 10 choices, each with a mean and a precision, our state has 20 dimensions. An optimal learning policy is characterized by Bellman's equation:
$$V(S^n) = \max_x \, \mathbb{E}\left[ C(S^n, x) + \gamma V(S^{n+1}) \mid S^n \right].$$
» Computational challenges: the state variable has 20 dimensions, each continuous. Solving this is impossible (and this is a simple problem!).

Slide 35: Measurement policies
Special case: on-line learning with independent beliefs
» Multi-armed bandit problem – which slot machine should I try next to maximize total expected rewards?
» Breakthrough (Gittins and Jones, 1974):
  – No need to solve the high-dimensional dynamic program.
  – Compute a single index (the "Gittins index") for each slot machine, and try the slot machine with the largest index.
  – For normally distributed rewards, the index looks like
$$\nu^{Gitt}_x = \theta^n_x + \sigma_\epsilon \, \Gamma\!\left(\frac{\sigma^n_x}{\sigma_\epsilon}, \gamma\right),$$
  where $\theta^n_x$ is the current estimate of the reward from machine $x$, $\sigma_\epsilon$ is the standard deviation of a measurement, and $\Gamma(\cdot)$ is the Gittins index for a mean-zero, variance-1 problem with discount factor $\gamma$.
» Notes:
  – Yao (2006) and Brezzi and Lai (2002) provide analytical approximations for $\Gamma$.
  – Despite an extensive literature on index policies, the range of applications is fairly limited.

Slide 36: Measurement policies
Heuristic measurement policies
» Pure exploitation – always make the choice that appears to be the best.
» Pure exploration – make choices at random so that you are always learning more, but without regard to the cost of the decision.
» Hybrid – explore with probability $\rho$ and exploit with probability $1 - \rho$. Epsilon-greedy exploration explores with probability $\epsilon_n$, which goes to zero as $n \to \infty$, but not too quickly.
» Boltzmann exploration – explore choice $x$ with probability
$$p_x = \frac{e^{\theta^n_x / T}}{\sum_{x'} e^{\theta^n_{x'} / T}}.$$
» Interval estimation (upper confidence bounding) – choose the $x$ which maximizes $\theta^n_x + z_\alpha \sigma^n_x$.
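The heuristics above are easy to state in code. A minimal sketch, where the function names, the temperature $T$, the value $z_\alpha = 1.96$, and the schedule $\epsilon_n = \min(1, c/n)$ are illustrative choices of mine:

```python
import math
import random

def boltzmann_choice(theta, T, rng):
    """Pick x with probability proportional to exp(theta_x / T)."""
    weights = [math.exp(t / T) for t in theta]
    r = rng.random() * sum(weights)
    for x, w in enumerate(weights):
        r -= w
        if r <= 0:
            return x
    return len(theta) - 1

def interval_estimation_choice(theta, sigma, z_alpha=1.96):
    """Choose the x maximizing theta_x + z_alpha * sigma_x."""
    return max(range(len(theta)), key=lambda x: theta[x] + z_alpha * sigma[x])

def epsilon_greedy_choice(theta, n, rng, c=1.0):
    """Explore with probability eps_n = min(1, c/n); eps_n goes to zero
    as n grows, but slowly enough that every choice keeps being sampled."""
    if rng.random() < min(1.0, c / max(n, 1)):
        return rng.randrange(len(theta))                       # explore
    return max(range(len(theta)), key=lambda x: theta[x])      # exploit

# A large sigma can make an apparently worse choice worth measuring:
x = interval_estimation_choice(theta=[20.0, 22.0], sigma=[4.0, 0.5])
# 20 + 1.96*4 = 27.84 beats 22 + 1.96*0.5 = 22.98, so x = 0
```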

Slide 37: Measurement policies
Approximate policies for off-line learning
» Optimal computing budget allocation (OCBA).
» LL(S) – batch linear loss (Chick et al.).
» Maximizing the expected value of a single measurement (R1, R1, …, R1):
  – Gupta and Miescke (1996)
  – EVI (Chick, Branke and Schmidt, under review)
  – "Knowledge gradient" (Frazier and Powell, 2008)

Slide 38: Measurement policies
Evaluating measurement policies
» How do we compare one measurement policy to another?
» One possibility is illustrated on the next two slides … but we would be wrong!

Slide 39: Measurement policies
Illustration
» Setup: option 1 is worth 15; the remaining 999 options are worth 10; the standard deviation of a measurement is 5.
» Policy 1: measure each option 10 times.
» Policy 2: measure each of the remaining 999 options once, and measure option 1 9,001 times.
» Which measurement policy produces the best result?
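This illustration is easy to replicate. A sketch under the slide's setup (normal measurement noise is my assumption; `run_experiment` and the policy labels are mine):

```python
import random

def run_experiment(policy, seed=0):
    """Option 0 is worth 15, the other 999 are worth 10, measurement
    noise sd = 5, total budget 10,000 measurements. Returns the index
    of the option with the best estimate at the end."""
    rng = random.Random(seed)
    truth = [15.0] + [10.0] * 999
    if policy == "equal":                  # Policy 1: 10 measurements each
        counts = [10] * 1000
    else:                                  # Policy 2: focus on option 0
        counts = [9001] + [1] * 999
    est = [sum(rng.gauss(mu, 5.0) for _ in range(n)) / n
           for mu, n in zip(truth, counts)]
    return est.index(max(est))

winners = [run_experiment("equal", seed=s) for s in range(20)]
# With so few samples per option, the maximum over 999 noisy averages of
# the worth-10 options frequently exceeds option 0's average, so the
# apparent winner is often not the truly best option.
```

Under either policy the option with the best *estimate* is often one of the worth-10 options that got lucky, which is the point of the next slides: comparing policies by the apparent value of their winner rewards collecting too little information.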

Slide 40: Measurement policies
Measuring each alternative 10 times.
(Figure: estimated values; the apparent best choice is marked.)

Slide 41: Measurement policies
Measuring option 1 9,001 times, and everything else once.
(Figure: estimated values; the marked winner is a lucky choice.)

Slide 42: Measurement policies
What did we find?
» Although option 1 is best, we will almost always identify some other option as being better, just through randomness. This method rewards collecting too little information.
A better way:
» Assume a truth $\mu_x$ for each $x$, chosen as a sample realization from a prior probability distribution on the mean.
» Given this truth, apply policy $\pi$ to produce statistical estimates $\theta^N_x$. Let $x^\pi = \arg\max_x \theta^N_x$ be the best solution based on these estimates.
» Repeat this $n$ times and evaluate the policy using the average true value $\mu_{x^\pi}$ of the implemented solutions.
» Note: this must be done with realistic (but not real) data.
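The evaluation recipe above can be sketched as a Monte Carlo loop. The policy signature, priors, and parameters below are my assumptions for illustration:

```python
import random

def evaluate_policy(policy, n_truths=50, n_choices=5, budget=10,
                    prior_sd=1.0, noise_sd=1.0, seed=0):
    """Sample a truth mu_x from the prior, run the measurement policy for
    the budget, implement the apparent best, and score the policy with
    the TRUE value of the implemented choice, averaged over truths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_truths):
        truth = [rng.gauss(0.0, prior_sd) for _ in range(n_choices)]
        est = [0.0] * n_choices          # prior means
        counts = [0] * n_choices
        for _ in range(budget):
            x = policy(est, counts, rng)
            counts[x] += 1
            est[x] += (rng.gauss(truth[x], noise_sd) - est[x]) / counts[x]
        implemented = max(range(n_choices), key=lambda x: est[x])
        total += truth[implemented]      # score with the truth, not the estimate
    return total / n_truths

def pure_exploration(est, counts, rng):  # baseline policy: measure at random
    return rng.randrange(len(est))

score = evaluate_policy(pure_exploration)
```

Because the score uses the sampled truth rather than the policy's own estimates, a policy cannot look good just by under-sampling and getting a lucky estimate.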

Slide 43: Outline
The knowledge gradient policy

Slide 44: The knowledge gradient
Basic principle:
» Assume you can make only one measurement, after which you have to make a final choice (the implementation decision).
» What choice would you make now to maximize the expected value of the implementation decision?
(Figure: a measurement changes the estimated value of option 5, which in turn changes the decision.)

Slide 45: The knowledge gradient
General model
» Off-line learning – we have a measurement budget of N observations. After we finish our measurements, we have to make an implementation decision.
» Notation: (figure only).

Slide 46: The knowledge gradient
» The knowledge gradient is the expected value of a single measurement $x$:
$$\nu^{KG}_x = \mathbb{E}\left[ \max_{x'} \theta^{n+1}_{x'} \,\middle|\, S^n, x \right] - \max_{x'} \theta^n_{x'},$$
  where $S^n$ is the knowledge state, $\theta^{n+1}$ is the updated knowledge given measurement $x$, the expectation is over the different measurement outcomes, and the maxima reflect the implementation decision. $\nu^{KG}_x$ is the marginal value of measuring $x$.
» The challenge is a computational one: how do we compute the expectation?

Slide 47: The knowledge gradient
Derivation
» Notation: let $\beta^n_x = 1/\sigma^{2,n}_x$ be the precision of our belief about choice $x$, and $\beta_\epsilon = 1/\sigma^2_\epsilon$ the measurement precision.
» We update the precision using $\beta^{n+1}_x = \beta^n_x + \beta_\epsilon$.
» In terms of the variance, this is the same as
$$\sigma^{2,n+1}_x = \left( \left(\sigma^{2,n}_x\right)^{-1} + \sigma_\epsilon^{-2} \right)^{-1}.$$

Slide 48: The knowledge gradient
Derivation (continued)
» The change in variance from measuring $x$ can be found to be
$$\tilde{\sigma}^{2,n}_x = \sigma^{2,n}_x - \sigma^{2,n+1}_x.$$
» Next compute the normalized influence:
$$\zeta^n_x = -\left| \frac{\theta^n_x - \max_{x' \neq x} \theta^n_{x'}}{\tilde{\sigma}^n_x} \right|.$$
» Let $f(\zeta) = \zeta \Phi(\zeta) + \phi(\zeta)$, where $\Phi$ and $\phi$ are the standard normal cdf and density.
» The knowledge gradient is computed using
$$\nu^{KG}_x = \tilde{\sigma}^n_x \, f(\zeta^n_x).$$
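The formulas above translate directly into code. A sketch for independent normal beliefs (the function name is mine; `math.erf` supplies the standard normal cdf):

```python
import math

def knowledge_gradient(theta, sigma2, sigma2_eps):
    """Knowledge gradient nu_x = sigma_tilde_x * f(zeta_x) for each
    choice x, with f(z) = z*Phi(z) + phi(z): independent normal
    beliefs, known measurement variance sigma2_eps."""
    def Phi(z):   # standard normal cdf
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    def phi(z):   # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    nu = []
    for x in range(len(theta)):
        s2_next = 1.0 / (1.0 / sigma2[x] + 1.0 / sigma2_eps)  # updated variance
        s_tilde = math.sqrt(sigma2[x] - s2_next)              # change in sd
        best_other = max(t for i, t in enumerate(theta) if i != x)
        zeta = -abs((theta[x] - best_other) / s_tilde)
        nu.append(s_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return nu

# A choice with a mediocre estimate but a wide belief can have the
# largest knowledge gradient:
nu = knowledge_gradient(theta=[20.0, 22.0, 18.0],
                        sigma2=[4.0, 4.0, 25.0], sigma2_eps=4.0)
x_star = nu.index(max(nu))   # -> 2: measure the most uncertain choice
```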

Slide 49: The knowledge gradient
(Figure: the knowledge gradient computed for choices 1–5.)

Slide 50: The knowledge gradient
Properties of the knowledge gradient policy:
» Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming.
» The best single measurement you can make (by construction).
» Asymptotically optimal (a more difficult proof): as the measurement budget grows, we get the optimal solution.
» The knowledge gradient policy is the only stationary policy with both properties. Many policies are asymptotically optimal (e.g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but are not myopically optimal.

Slide 51: The knowledge gradient
(Figure: current estimates of the value of each decision, current estimates of the standard deviation, and the resulting knowledge gradient.)

Slides 52–53: The knowledge gradient
(Figures only.)

Slide 54: The knowledge gradient
Experimental comparisons – KG vs.:
» Boltzmann
» Interval estimation
» Equal allocation
» OCBA
» Pure exploitation
» Linear loss, LL(S)
(Figure: pairwise performance comparisons of KG against each competing policy.)

Slide 55: The knowledge gradient
Notes:
» KG slightly outperforms interval estimation (IE), OCBA and LL(S), and is easier to compute than OCBA and LL(S).
» KG is fairly easy to compute for independent, normally distributed rewards.
» But KG is a general concept which generalizes to other important problem classes:
  – Correlated beliefs
  – Correlated measurements (e.g. common random numbers)
  – On-line applications
  – … more general optimization problems

Slide 56: Outline
The knowledge gradient with correlated beliefs

Slide 57: Correlated beliefs
Applications
» Measurements of continuous functions
» Subset selection
» Multiattribute problems

Slide 58: CKG technique
» Animations on a line
» Subset selection illustration (diabetes?)
» EGO technique? Contrast with CKG
» KG for online

Slide 59: KG for more general applications
» On a graph
» LPs???
» KG with a physical state


Slide 61: Solution methods
Dynamic programming for pure learning (a knowledge state without a physical state)
» On-line learning – Gittins indices and the uncertainty bonus:
$$\nu^{Gitt}_x = \theta^n_x + \sigma_\epsilon \, \Gamma\!\left(\frac{\sigma^n_x}{\sigma_\epsilon}, \gamma\right),$$
  where $\theta^n_x$ is the estimated value of a decision and the second term is the "uncertainty bonus", which grows with the standard deviation of our belief.

Slide 62: Optimal measuring – uncorrelated
Knowledge gradient policy: a measurement decision (collecting information) produces a measurement, which updates our knowledge, which drives the economic decision (using the information).

Slide 63: Solution methods
Generalizations
» Measurements may be correlated:
  – We may measure an object with a multidimensional attribute vector $a$. Measuring $a$ tells us about an object with attribute $a'$ if the two share common attributes.
  – Events on a line – a sensor at location $x$ may provide a rough measurement at nearby locations. We may assume that measurements at $x$ and $y$ are correlated inversely with their distance.

Slide 64: Solution methods
Knowledge gradient adapted to on-line learning
» Finite horizon problems
» Infinite horizon problems

Slide 65: Examples of learning
Transportation
» You just took a new job, and there are different paths you can take to get to work. You have an idea how long each path is, but you do not know anything about traffic delays, waiting for subways or commuter trains, missed connections, or late service.

Slide 66: Examples of learning
Figure out Manhattan:
» Walking
» Subway/walking
» Taxi
» Street bus
» Driving

Slide 67: Information acquisition
Finding the best path to work
» Four paths, but every time I drive on one, I sample a new time.
» I want to choose the path that is best on average.

Slide 68: Information acquisition
What do we know?
» What we think:
  Path 1: 25 minutes, Path 2: 24 minutes, Path 3: 22 minutes, Path 4: 20 minutes.
» We act by choosing the path that we "think" is the best. The only way we learn anything new is by choosing a path.

Slide 69: Information acquisition
The shortest path game (game 1)
» Starting with the estimates at the top, choose paths so that you discover the best path.

