
1 Application of Dynamic Programming to Optimal Learning Problems. Peter Frazier, Warren Powell, Savas Dayanik. Department of Operations Research and Financial Engineering, Princeton University.

2 Learning Problems
On-line problems:
–During a pandemic, a doctor may wish to find the best treatment for the disease while minimizing patient loss.
–Software in a telecommunications network may wish to find the network path with the smallest latency.
Off-line problems:
–A polling agency may wish to determine which candidate is most preferred by the electorate.
–A factory manager may wish to choose one from among several new machines during a pre-purchase evaluation period.
–A scientist running a Monte Carlo simulation may wish to find the set of input parameters that best fit observations.

3 Part I: On-line Learning

4 On-line Learning, Model #1. [Diagram: sick patients are assigned either to the test treatment (unknown success rate) or to the standard treatment (known success rate); each treated patient yields a reward of +1 for a cure and 0 otherwise.]

5 On-line Model #1. The test treatment has success rate either q or r; the standard treatment has a known success rate s. Assume r < s < q. Define the outcome of patient n to be 1 if the patient is cured, and 0 otherwise. Our goal is to maximize the expected number of patients cured, with discount factor γ ∈ (0,1).
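
Written out (a reconstruction of the objective the slide describes; the symbols y_n for patient n's outcome and γ for the discount factor are assumed here, since the slide's own notation is not legible in the transcript):

```latex
% Objective for on-line Model #1 (reconstruction; y_n and gamma are assumed symbols)
\max_{\pi}\; \mathbb{E}^{\pi}\!\left[\sum_{n=0}^{\infty} \gamma^{n}\, y_n\right],
\qquad
y_n = \begin{cases} 1 & \text{if patient } n \text{ is cured},\\ 0 & \text{otherwise},\end{cases}
\qquad \gamma \in (0,1).
```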

6 Bayesian Learning. Assume we have a Bayesian prior on the success rate of the test treatment. We update this prior based on each observation.

7 Bayesian Learning. If we use the test treatment and observe a success, the posterior probability that the treatment is good is revised upward; if we observe a failure, it is revised downward. If we use the standard treatment, we learn nothing.
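
The update equations themselves are not reproduced in the transcript. Writing p_n for the posterior probability that the test treatment has the good success rate q (a notation assumed here, not taken from the slides), Bayes' rule gives:

```latex
% Posterior update for the two-point prior on the test treatment's success rate.
% p_n = P(rate = q | first n outcomes); this symbol is assumed, not taken from the slides.
p_{n+1} =
\begin{cases}
\dfrac{p_n\, q}{p_n\, q + (1-p_n)\, r} & \text{after a success on the test treatment,}\\[1.5ex]
\dfrac{p_n (1-q)}{p_n (1-q) + (1-p_n)(1-r)} & \text{after a failure on the test treatment,}\\[1.5ex]
p_n & \text{after using the standard treatment.}
\end{cases}
```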

8 Exploration vs. Exploitation. Exploration: use the test treatment. Exploitation: use the treatment that appears to be the best. –The value of the standard treatment is its known success rate s. –The expected value of the test treatment is p_n q + (1 − p_n) r, its success probability under the current posterior.

9 Odds Ratio. Define the odds ratio ρ_n = p_n / (1 − p_n). If we use the test treatment and have a failure, ρ_{n+1} = ρ_n (1 − q)/(1 − r). If we use the test treatment and have a success, ρ_{n+1} = ρ_n q/r. Recover the probability from the odds ratio by p_n = ρ_n / (1 + ρ_n).

10 Structure of the Optimal Policy. When ρ_n is large, we believe the test treatment is good. Explore when ρ_n is large; exploit when ρ_n is small. The optimal policy is a threshold policy: x_n = "test" if ρ_n > ρ*, "standard" if ρ_n < ρ*.

11 Typical Sample Paths

12 Dynamic Programming. Recall the objective function. Define the value function V(ρ) to be the maximum achievable objective value given a time-0 odds ratio of ρ. A lower bound for V is s/(1 − γ), the value of always using the standard treatment; an upper bound is q/(1 − γ), since no policy can achieve an expected per-patient reward greater than q.

13 Dynamic Programming. The value function satisfies Bellman's equation, which takes the larger of two terms: the value of using the standard treatment from time n forward, and the expected value of using the test treatment at time n and then following the optimal policy.
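
The equation itself is not legible in the transcript; for this model it takes the standard one-armed-bandit form (a reconstruction whose term labels follow the slide, with p(ρ) = ρ/(1 + ρ)):

```latex
% Bellman equation for on-line Model #1 (reconstruction; term labels follow the slide)
V(\rho) = \max\Big\{
\underbrace{\tfrac{s}{1-\gamma}}_{\text{standard treatment from time } n \text{ forward}},\;
\underbrace{p(\rho)\,q + \big(1-p(\rho)\big)r + \gamma\,\mathbb{E}\big[V(\rho')\big]}_{\text{test treatment now, optimal policy afterward}}
\Big\},
\qquad p(\rho) = \frac{\rho}{1+\rho},
```

where ρ′ is the updated odds ratio: ρ·q/r after a success and ρ·(1 − q)/(1 − r) after a failure.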

14 Value Iteration
We will compute a sequence of approximations v_n converging to the value function V.
–Limit the state space to a finite set of allowed ρ.
–Begin with v_0(ρ) = 0; this is a lower bound for V.
–Compute v_{n+1} from v_n by applying the Bellman recursion above.
–In the limit, v_n → V.

15 Value Iteration Example. q = 2/3, s = 1/2, r = 1/3, γ = 0.9. Restrict ρ to lie in {2^n : n ∈ {−10, ..., 0, ..., 10}}.
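
As a concrete illustration, here is a minimal Python sketch (not the authors' code) of value iteration on this grid, following the Bellman equation reconstructed above. The function name `bellman` and the clamping of transitions at the grid edges are choices made here; the grid only localizes the threshold to the nearest power of two, and the more precise ρ* reported on the next slide presumably comes from solving with the computed value function.

```python
import numpy as np

# Parameters from the slide's example.
q, s, r, gamma = 2.0 / 3.0, 0.5, 1.0 / 3.0, 0.9

# Odds-ratio grid rho in {2^n : n = -10, ..., 10}.  A success multiplies the odds
# by q/r = 2 and a failure by (1-q)/(1-r) = 1/2, so transitions move one grid
# index up or down; we clamp at the grid edges, which is an approximation.
rho = 2.0 ** np.arange(-10, 11)
p_good = rho / (1.0 + rho)                    # P(test treatment has rate q)
p_success = p_good * q + (1.0 - p_good) * r   # predictive probability of a cure

def bellman(v):
    """One application of the Bellman operator on the odds-ratio grid."""
    v_up = np.append(v[1:], v[-1])        # value after a success (odds doubled, clamped)
    v_down = np.insert(v[:-1], 0, v[0])   # value after a failure (odds halved, clamped)
    value_standard = s / (1.0 - gamma)    # use the standard treatment forever
    value_test = p_success * (1.0 + gamma * v_up) + (1.0 - p_success) * gamma * v_down
    return np.maximum(value_standard, value_test), value_test

v = np.zeros_like(rho)                    # v_0 = 0, a lower bound on V
for _ in range(2000):                     # value iteration
    v_new, _ = bellman(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# Testing is optimal wherever its value matches the value of always using the
# standard treatment or better; the first such grid point approximates rho*.
_, value_test = bellman(v)
threshold = rho[np.argmax(value_test >= s / (1.0 - gamma))]
print("approximate threshold rho* on this grid:", threshold)
```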

16 Value Iteration Example. To find the optimal policy, use the computed value function to solve for ρ*. The optimal policy is to use the test treatment when ρ > ρ* and the standard treatment otherwise. In our case, ρ* ≈ 2^{−1.6} = 0.3299, which corresponds to P{success rate = q} = 0.2481.

17 On-line Learning, Model #2. [Diagram: sick patients are assigned to one of treatments 1 through M, and noisy rewards (e.g., +1, +0.5, +0.2, −0.2) accrue from the patients treated with each treatment.]

18 On-line Learning, Model #2. x^n is the treatment used for patient n. At time n we receive a noisy reward from the chosen treatment; the measurement error ε is independent N(0, (σ_ε)²). The objective is to maximize the expected total discounted reward.
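
In symbols (a reconstruction; the slide's own equations are not legible in the transcript, and the notation Y_x for the unknown value of treatment x is assumed here):

```latex
% On-line Model #2 (reconstructed notation)
\hat{y}^{\,n+1} = Y_{x^n} + \varepsilon^{n+1},
\qquad \varepsilon^{n+1} \sim \mathcal{N}\big(0,\sigma_\varepsilon^2\big)\ \text{independent},
\qquad
\sup_{\pi}\; \mathbb{E}^{\pi}\!\left[\sum_{n=0}^{\infty} \gamma^{n}\, Y_{x^n}\right].
```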

19 Exploration vs. Exploitation. Exploration: "measure the alternative about which we know the least." Exploitation: "measure the alternative that appears to be the best."

20 Time 0. Number of treatments is M = 5. Discount factor γ is 0.9.

21 Time 1. We test treatment x = 1 and measure its quality with noise. We update our estimate of the value of treatment 1. Total Reward: −2.

22 Time 2. Total Reward: −2 + γ(0.5) = −1.55.

23 Time 5. Total Reward: −0.044.

24 Time 10. Total Reward: 0.284.

25 Bayes' Rule. [Figure: the prior and a new observation combine, via Bayes' rule, to give the posterior.]

26 How Measurements Affect Estimates. At time n we measure alternative x^n. We update our estimate of Y_{x^n} based on the measurement. Estimates of the other Y_x do not change.

27 How Measurements Affect Estimates. σ_x declines deterministically. At time n, μ_x^{n+1} is a normal random variable with mean μ_x^n and a variance equal to the uncertainty about Y_x before the measurement minus the uncertainty about Y_x after the measurement, i.e., the variance of the change in the best estimate of Y_x due to the measurement.
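
The updating equations are not legible in the transcript; for independent normal priors and normal measurement noise they are the standard conjugate updates (a reconstruction):

```latex
% Conjugate normal updating after measuring alternative x at time n (reconstruction).
\big(\sigma_x^{n+1}\big)^2 = \Big( \big(\sigma_x^{n}\big)^{-2} + \sigma_\varepsilon^{-2} \Big)^{-1},
\qquad
\mu_x^{n+1} = \big(\sigma_x^{n+1}\big)^2 \Big( \big(\sigma_x^{n}\big)^{-2} \mu_x^{n} + \sigma_\varepsilon^{-2}\, \hat{y}^{\,n+1} \Big),
```

so that σ_x declines deterministically and, given time-n information, μ_x^{n+1} ∼ N(μ_x^n, σ̃_x²) with σ̃_x² = (σ_x^n)² − (σ_x^{n+1})², exactly the reduction in uncertainty.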

28 μ_x^n is a Random Walk

29 Dynamic Programming. Recall the objective function. Define the value function V(μ^0, σ^0) to be the maximum achievable objective value for a time-0 prior of Y ∼ N(μ^0, (σ^0)²).

30 Dynamic Programming
Again, the value function satisfies Bellman's equation. Curse of dimensionality:
–the μ_x and σ_x are continuous.
–μ and σ each have M dimensions.
–If we discretize μ_x and σ_x each to K levels, there are K^{2M} states.
–For K = 10, M = 5, this is 10 billion states.

31 Solution to Infinite Horizon Problem
Gittins (1989) solves the problem by decomposing it into single-alternative problems, one for each alternative:
–At each time n, we may either continue and receive a reward from alternative x, or retire and receive a one-time retirement reward M.
–Define M(μ_x^n, σ_x^n) to be the retirement reward needed to make retiring optimal when the state is (μ_x^n, σ_x^n).

32 Solution to Infinite Horizon Problem. The composite problem has an index-form solution, with an exploration term that is proportional to [n(1 − γ)]^{−1/2} in the limit as n → 0.

33 Exploration vs. Exploitation in the On-line Problem. The solution to the on-line problem balances exploration against exploitation; exploration is provided by the additional exploration term in the index.

34 Exploration vs. Exploitation in the On-line Problem. [Figure: two alternatives with estimated values and uncertainties μ_1, σ_1 and μ_2, σ_2.] Even if μ_2 = 0, we can always choose σ_2 big enough to make alternative 2 the optimal measurement.

35 Finite Horizon On-line Variant. Fix a finite horizon N and discard the discount factor; the new objective is the expected total reward over the N periods. This problem is unsolved.

36 Part II: Off-line Learning

37 Off-line Learning, Measurement Phase. [Diagram: a budget of N experiments is allocated among treatments 1 through M, with noisy test results (e.g., +1, +0.5, +0.2, −0.2) observed for each tested treatment.]

38 Off-line Learning, Implementation Phase. Choose one treatment and receive rewards from the chosen treatment only. [Diagram: all remaining patients are treated with the chosen treatment, generating rewards such as +1, +0.8, +0.5, +0.1.]

39 Off-line Learning Model. x^n is the treatment tested at time n. Measuring treatment x^n gives a noisy observation of its value; the error ε is independent N(0, (σ_ε)²). At time N we choose a treatment, and the goal is to maximize the expected value of the chosen treatment.
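
Spelled out (a reconstruction consistent with the measurement and implementation phases above; μ_x^N denotes the posterior mean of treatment x after the N measurements):

```latex
% Off-line objective (reconstruction): after the N measurements, implement the apparent best.
\sup_{\pi}\; \mathbb{E}^{\pi}\big[\, Y_{x^N} \big]
= \sup_{\pi}\; \mathbb{E}^{\pi}\Big[ \max_{x} \mu_x^{N} \Big],
\qquad x^N \in \arg\max_{x}\, \mu_x^{N}.
```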

40 Time 0. Number of alternatives is M = 5. Number of tests allowed is N = 10.

41 Time 1. We test alternative x = 1 and measure its quality with noise. We update our estimate of its value.

42 Time 2

43 Time 5

44 (Final) Time 10. After our experimental budget of N = 10 tests is exhausted, we choose the alternative that is the best according to our current estimate, and we receive its true value.

45 (Final) Time 10. Reward = 0.2.

46 Dynamic Program. The objective function is the expected value of the alternative chosen at time N. The value function V^n(μ^n, σ^n) is the value of the objective attained by the optimal policy starting from a time-n prior of (μ^n, σ^n).

47 Curse of Dimensionality
The sequence of value functions V^N, V^{N−1}, ..., V^0 satisfies the Bellman equation. The curse of dimensionality prevents direct computation using the Bellman equation:
–the μ_x and σ_x are continuous
–μ and σ each have M dimensions
–N value functions must be computed (V^N is known)
–If we discretize μ_x and σ_x to K levels each, there are N·K^{2M} states.
–For K = 10, M = 5, N = 10, this is 100 billion states.

48 Example 1. One common experimental design is to spread measurements equally across the alternatives.

49 Example 1: Round-Robin Exploration

50 Example 2. How might we improve round-robin exploration for use with this prior?

51 Example 2: Largest-Variance Exploration

52 Example 3. Exploitation: measure the alternative that currently appears to be the best, i.e., the one with the largest estimated mean.

53 "There is no need to measure what you already know." If alternative x has zero variance and some other alternative has strictly positive variance, then the optimal policy never measures alternative x.

54 Contrasting Exploitation's Role. Suppose alternative 2 is perfectly known. [Figure: two alternatives with estimated values and error bars, shown for both cases.] Off-line: it is always optimal to measure alternative 1, no matter how large we make μ_2. On-line: we can always make measuring alternative 2 optimal by choosing μ_2 large enough.

55 Optimal Policy for Off-line M=2. If we have two alternatives (M = 2), then the exploration policy, which measures the alternative with the larger variance, is optimal: in off-line learning with M = 2 there is no advantage to exploitation.

56 Optimal Policy for Off-line M=2. Knowing whether Y_1 > Y_2, Y_1 < Y_2, or Y_1 = Y_2 is the same as knowing the sign of Y_1 − Y_2. [Figure: treatment quality of the two alternatives.]

57 The Benefit of Off-line Exploitation. Though alternative 3 has the larger variance, our main task is to distinguish between alternatives 1 and 2. [Figure: three alternatives, labeled 1, 2, and 3.]

58 Utility of Information. Consider our "utility of information" — the estimated value of the alternative that currently looks best — and consider the random change in this utility due to a measurement at time n.

59 Knowledge Gradient Definition. The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility.
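
Writing the utility at time n as max_x μ_x^n (the standard choice for this policy; the slide's own formula is not reproduced in the transcript), the definition can be written as:

```latex
% Knowledge-gradient policy (reconstruction of the definition the slide refers to).
x^{\mathrm{KG},n} \in \arg\max_{x}\;
\mathbb{E}\Big[ \max_{x'} \mu_{x'}^{n+1} - \max_{x'} \mu_{x'}^{n} \;\Big|\; \mu^{n}, \sigma^{n},\, x^{n} = x \Big].
```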

60 Knowledge Gradient. We may compute the knowledge gradient policy by evaluating, for each alternative, the expectation of the maximum of a normal random variable and a constant.

61 Computation of the Knowledge Gradient Policy. The computation reduces to a closed form in which Φ is the normal cdf and φ is the normal pdf.
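
The closed form itself is not legible in the transcript. For independent normal beliefs it is commonly computed as in the sketch below, with σ̃_x the standard deviation of the change in μ_x described earlier; the function name and the example numbers are made up here, and the exact expression on the slide should be treated as authoritative.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma, sigma_eps):
    """Return the KG value of measuring each alternative once (independent normal beliefs).

    mu, sigma: arrays of posterior means and standard deviations.
    sigma_eps: measurement noise standard deviation.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # sigma_tilde: std. dev. of the change in mu_x caused by one measurement of x.
    var_next = 1.0 / (1.0 / sigma**2 + 1.0 / sigma_eps**2)
    sigma_tilde = np.sqrt(sigma**2 - var_next)

    kg = np.zeros_like(mu)
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))        # best competing mean
        zeta = -abs(mu[x] - best_other) / sigma_tilde[x]
        # f(z) = z * Phi(z) + phi(z): normalized expectation of max(normal, constant).
        kg[x] = sigma_tilde[x] * (zeta * norm.cdf(zeta) + norm.pdf(zeta))
    return kg

# The KG policy measures the alternative with the largest knowledge gradient.
mu = [0.5, 0.3, 0.0]        # hypothetical posterior means
sigma = [0.2, 0.6, 1.0]     # hypothetical posterior standard deviations
x_kg = int(np.argmax(knowledge_gradient(mu, sigma, sigma_eps=1.0)))
print("KG measures alternative", x_kg)
```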

62 Numerical Example 1

63 Numerical Example 2

64 Numerical Example 3

65 Knowledge Gradient Example

66 Optimality Results. If our measurement budget allows only one measurement (N = 1), the knowledge gradient policy is optimal.

67 Optimality Results. If there are exactly 2 alternatives (M = 2), the knowledge gradient policy measures the alternative with the largest variance, and hence is optimal.

68 Optimality Results. The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.

69 Optimality Results. If there is no measurement noise and the alternatives may be reordered so that a certain ordering condition on the prior parameters holds, then the knowledge gradient policy is optimal.

70 Optimality Results. The knowledge gradient policy's suboptimality, V^n − V^{KG,n}, admits an explicit bound, where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.

71 Other Theoretical Results
–V^{N−n}(μ, σ) is increasing in n, μ, and σ.
–The value of perfect information bounds V from above.
–The value of no additional information bounds V from below.
–The value of measuring something is at least as large as that of measuring nothing at all.

72 Boltzmann Exploration. Parameterized by a declining sequence of temperatures (T_0, ..., T_{N−1}).
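
The slide's formula is not reproduced here; the usual form of Boltzmann (softmax) exploration over the current estimates, with a temperature that declines over time, looks like the sketch below. The temperature schedule and example numbers are illustrative choices, not taken from the talk.

```python
import numpy as np

def boltzmann_choice(mu, temperature, rng=None):
    """Pick an alternative to measure: softmax over the current estimates mu.

    A high temperature explores almost uniformly; as the temperature declines
    toward zero the rule concentrates on the apparent best (pure exploitation).
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(mu, dtype=float) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))

# A declining temperature schedule (T_0, ..., T_{N-1}); the schedule itself is a tuning choice.
temperatures = [1.0 / (n + 1) for n in range(10)]
choice = boltzmann_choice(mu=[0.2, 0.5, -0.1], temperature=temperatures[0])
```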

73 Interval Estimation. Compare alternatives via a linear combination of mean and standard deviation; the parameter z_{α/2} controls the tradeoff between exploration and exploitation.
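
In code, the interval-estimation rule described here (measure the alternative with the largest upper bound μ + z_{α/2}·σ) is a one-liner; a minimal sketch, with the function name and default z chosen for illustration:

```python
import numpy as np

def interval_estimation_choice(mu, sigma, z_alpha_2=1.96):
    """Measure the alternative with the largest index mu + z_{alpha/2} * sigma.

    z_alpha_2 = 0 is pure exploitation; larger values weight the standard
    deviation more heavily and therefore explore more.
    """
    index = np.asarray(mu, dtype=float) + z_alpha_2 * np.asarray(sigma, dtype=float)
    return int(np.argmax(index))
```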

74 Interval Estimation. [Figure: random confidence intervals constructed from repeated measurements around the true mean Y_x; source: http://en.wikibooks.org/wiki/Image:TempC_sampleci.jpg] The following is a (1 − α)-confidence region for Y_x: μ_x^n ± z_{α/2} σ_x^n. That is, if Y_x were a fixed but unknown value and we repeatedly sampled μ_x^n from a Normal(Y_x, (σ_x^n)²) distribution, the interval would contain Y_x a fraction 1 − α of the time.

75 Numerical Experiments
100 randomly generated problems:
–M ∼ Uniform{1, ..., 100}
–N ∼ Uniform{M, 3M, 10M}
–μ_x^0 ∼ Uniform[−1, 1]
–(σ_x^0)² = 1 with probability 0.9, 10^{−3} with probability 0.1
–σ_ε = 1

76 Numerical Experiments

77 KG / IE Comparison

78 IE Example

79 IE and "Sticking". Alternative 1 is known perfectly.

80 IE and "Sticking"

81 KG / IE Comparison

82 Generalizations to the Model
–Correlated normal priors
–Non-normal priors
–Continuous measurements
–Composite measurements
–Choosing the best k alternatives
–Risk aversion
–Min-max formulation
–Measurements with different costs
–Choosing when to stop measuring

