
1 Pondering Probabilistic Play Policies for Pig Todd W. Neller Gettysburg College

2 Sow What's This All About?
The Dice Game "Pig"
Odds and Ends
Playing to Win
– "Piglet"
– Value Iteration
Machine Learning

3 Pig: The Game
Object: First to score 100 points
On your turn, roll until:
– You roll 1, and score NOTHING.
– You hold, and KEEP the sum.
Simple game → simple strategy?
Let's play…

4 Playing to Score
Simple odds argument:
– Roll until you risk more than you stand to gain.
– "Hold at 20"
1/6 of the time: -20 → -20/6
5/6 of the time: +4 (avg. of 2,3,4,5,6) → +20/6
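A quick Python sketch of that break-even arithmetic (the function name and the sample turn totals are illustrative, not from the talk):

```python
# Expected change from one more roll when the turn total is t:
# 1/6 of the time you lose t; 5/6 of the time you gain 4 on average.
def expected_gain_from_rolling(t):
    return (5 / 6) * 4 - (1 / 6) * t

for t in (16, 20, 24):
    print(t, round(expected_gain_from_rolling(t), 2))
# 16 -> 0.67, 20 -> 0.0, 24 -> -0.67: rolling stops paying off past a turn total of 20.
```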

5 Hold at 20?
Is there a situation in which you wouldn't want to hold at 20?
– Your score: 99; you roll 2
– Another scenario: you: 79, opponent: 99, and your turn total stands at 20

6 What's Wrong With Playing to Score?
It's mathematically optimal! But what are we optimizing?
Playing to score ≠ playing to win
Optimizing score per turn ≠ optimizing probability of a win

7 Piglet
Simpler version of Pig with a coin
Object: First to score 10 points
On your turn, flip until:
– You flip tails, and score NOTHING.
– You hold, and KEEP the # of heads.
Even simpler: play to 2 points

8 Essential Information
What is the information I need to make a fully informed decision?
– My score
– The opponent's score
– My "turn score"

9 A Little Notation
P_{i,j,k} – probability of a win if
i = my score, j = the opponent's score, k = my turn score
Hold: P_{i,j,k} = 1 - P_{j,i+k,0}
Flip: P_{i,j,k} = ½(1 - P_{j,i,0}) + ½ P_{i,j,k+1}
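In code, the two options read as follows; a minimal sketch assuming a hypothetical dict P that maps (i, j, k) to win probabilities:

```python
# Piglet options from the notation above, against a hypothetical dict
# P mapping (i, j, k) -> probability of winning from that state.
def hold_value(P, i, j, k):
    # Bank the k heads; the opponent then moves from (j, i + k, 0).
    return 1 - P[(j, i + k, 0)]

def flip_value(P, i, j, k):
    # Tails (prob 1/2): turn total lost, opponent moves from (j, i, 0).
    # Heads (prob 1/2): turn total grows to k + 1.
    return 0.5 * (1 - P[(j, i, 0)]) + 0.5 * P[(i, j, k + 1)]
```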

10 Assume Rationality
To make a smart player, assume a smart opponent.
(To make a smarter player, know your opponent.)
P_{i,j,k} = max(1 - P_{j,i+k,0}, ½(1 - P_{j,i,0} + P_{i,j,k+1}))
Probability of a win based on the best decisions in any state

11 The Whole Story
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(1 - P_{0,0,0} + P_{0,0,2}))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(1 - P_{1,0,0} + P_{0,1,2}))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(1 - P_{0,1,0} + P_{1,0,1}))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(1 - P_{1,1,0} + P_{1,1,1}))

12 The Whole Story
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(1 - P_{0,0,0} + P_{0,0,2}))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(1 - P_{1,0,0} + P_{0,1,2}))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(1 - P_{0,1,0} + P_{1,0,1}))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(1 - P_{1,1,0} + P_{1,1,1}))
P_{0,0,2}, P_{0,1,2}, P_{1,0,1}, and P_{1,1,1} are winning states!

13 The Whole Story
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(1 - P_{0,0,0} + 1))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(1 - P_{1,0,0} + 1))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(1 - P_{0,1,0} + 1))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(1 - P_{1,1,0} + 1))
Simplified…

14 The Whole Story
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(2 - P_{0,0,0}))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(2 - P_{1,0,0}))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(2 - P_{0,1,0}))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(2 - P_{1,1,0}))
And simplified more into a handsome set of equations…

15 How to Solve It?
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(2 - P_{0,0,0}))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(2 - P_{1,0,0}))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(2 - P_{0,1,0}))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(2 - P_{1,1,0}))
P_{0,1,0} depends on P_{0,1,1}, which depends on P_{1,0,0}, which depends on P_{0,1,0}, which depends on…

16 A System of Pigquations
[Figure: dependency graph among P_{0,0,0}, P_{0,0,1}, P_{0,1,0}, P_{0,1,1}, P_{1,0,0}, P_{1,1,0}]
Dependencies between non-winning states

17 How Bad Is It?
The intersection of a set of bent hyperplanes in a hypercube
In the general case, no known method (read: PhD research)
Is there a method that works (without being guaranteed to work in general)?
– Yes! Value Iteration!

18 Value Iteration
Start out with some values (0's, 1's, random #'s)
Do the following until the values converge (stop changing):
– Plug the values into the RHS's
– Recompute the LHS values
That's easy. Let's do it!

19 Value Iteration
P_{0,0,0} = max(1 - P_{0,0,0}, ½(1 - P_{0,0,0} + P_{0,0,1}))
P_{0,0,1} = max(1 - P_{0,1,0}, ½(2 - P_{0,0,0}))
P_{0,1,0} = max(1 - P_{1,0,0}, ½(1 - P_{1,0,0} + P_{0,1,1}))
P_{0,1,1} = max(1 - P_{1,1,0}, ½(2 - P_{1,0,0}))
P_{1,0,0} = max(1 - P_{0,1,0}, ½(2 - P_{0,1,0}))
P_{1,1,0} = max(1 - P_{1,1,0}, ½(2 - P_{1,1,0}))
Assume P_{i,j,k} is 0 unless it's a win
Repeat: Compute RHS's, assign to LHS's
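A minimal Python sketch of this procedure for Piglet, treating any state with i + k at or past the goal as a won state with probability 1 (GOAL = 2 reproduces the six equations above; GOAL = 10 is the full game). The names, the convergence tolerance, and the in-place sweep order are illustrative choices:

```python
# A sketch of value iteration for Piglet: start every non-winning state at 0,
# then sweep the equations until the values stop changing.
GOAL = 2
EPS = 1e-9

def p(P, i, j, k):
    """Look up a win probability, treating any state with i + k >= GOAL as a sure win."""
    return 1.0 if i + k >= GOAL else P[(i, j, k)]

def solve_piglet():
    # Non-winning states: my score i, opponent's score j, my turn score k.
    states = [(i, j, k) for i in range(GOAL) for j in range(GOAL)
              for k in range(GOAL - i)]
    P = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for (i, j, k) in states:
            hold = 1 - p(P, j, i + k, 0)
            flip = 0.5 * (1 - p(P, j, i, 0)) + 0.5 * p(P, i, j, k + 1)
            new = max(hold, flip)
            delta = max(delta, abs(new - P[(i, j, k)]))
            P[(i, j, k)] = new
        if delta < EPS:
            return P

P = solve_piglet()
print(P[(0, 0, 0)])   # the first player's chance of winning with optimal play
```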

20 But That's GRUNT Work!
So have a computer do it, slacker!
Not difficult – end of CS1 level
Fast! Don't blink – you'll miss it
Optimal play:
– Compute the probabilities
– Determine flip/hold from the RHS max's
– (For our equations, always FLIP)
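Reading the decision out of the solved values is just a matter of checking which side of the max is larger; a small sketch reusing the hypothetical p() helper and the P table from the previous sketch:

```python
# Read the optimal decision out of the solved probabilities by checking
# which side of the max is larger (reuses p() and P from the sketch above).
def should_flip(P, i, j, k):
    hold = 1 - p(P, j, i + k, 0)
    flip = 0.5 * (1 - p(P, j, i, 0)) + 0.5 * p(P, i, j, k + 1)
    return flip >= hold

print(should_flip(P, 0, 0, 0))   # True: in the game to 2, flipping is always at least as good
```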

21 Piglet Solved (Game to 10)
Play to Score: "Hold at 1"
Play to Win: [chart of the optimal policy; axes: You, Opponent]

22 Pig Probabilities
Just like Piglet, but more possible outcomes
P_{i,j,k} = max(1 - P_{j,i+k,0}, (1/6)(1 - P_{j,i,0} + P_{i,j,k+2} + P_{i,j,k+3} + P_{i,j,k+4} + P_{i,j,k+5} + P_{i,j,k+6}))

23 Solving Pig
505,000 such equations
Same simple solution method (value iteration)
Speedup: Solve groups of interdependent probabilities
Watch and see!
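A sketch of how that grouping might look in Python, under the assumption (consistent with the recurrence on the previous slide) that holding or rolling a 1 never decreases the score sum i + j, so states can be solved in groups of equal i + j from 198 down to 0. Everything else here (names, tolerance, the pure-Python loop, which will be slow) is an illustrative choice:

```python
# Value iteration for full Pig (goal 100), with the grouping speedup:
# each group of states with the same score sum i + j only depends on
# itself and on groups with larger sums, so iterate within one group at a time.
GOAL = 100
EPS = 1e-9
P = {}

def prob(i, j, k):
    """Win probability, treating any state with i + k >= GOAL as a sure win."""
    return 1.0 if i + k >= GOAL else P[(i, j, k)]

def backup(i, j, k):
    hold = 1 - prob(j, i + k, 0)
    roll = (1 - prob(j, i, 0)                         # rolled a 1: turn total lost
            + sum(prob(i, j, k + r) for r in range(2, 7))) / 6
    return max(hold, roll)

for total in range(2 * GOAL - 2, -1, -1):             # score sums 198 down to 0
    group = [(i, j, k)
             for i in range(GOAL) for j in range(GOAL) if i + j == total
             for k in range(GOAL - i)]
    for s in group:
        P[s] = 0.0
    while True:                                        # value-iterate within the group
        delta = 0.0
        for (i, j, k) in group:
            new = backup(i, j, k)
            delta = max(delta, abs(new - P[(i, j, k)]))
            P[(i, j, k)] = new
        if delta < EPS:
            break

print(P[(0, 0, 0)])   # the first player's chance of winning with optimal play
```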

24 Pig Sow-lution


26 Reachable States (Player 2 Score j = 30)

27 Reachable States

28 Sow-lution for Reachable States

29 Probability Contours

30 Summary
Playing to score is not playing to win.
A simple game is not always simple to play.
The computer is an exciting power tool for the mind!

31 When Value Iteration Isn't Enough
Value Iteration assumes a model of the problem:
– Probabilities of state transitions
– Expected rewards for transitions
Loaded die? Optimal play vs. a suboptimal player? Game rules unknown?

32 No Model? Then Learn!
Can't write equations → can't solve
Must learn from experience!
Reinforcement Learning:
– Learn optimal sequences of actions
– From experience
– Given positive/negative feedback
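A sketch of one model-free approach (not the method described in the talk): tabular Q-learning for Piglet, learning from played games against a fixed "hold at 1" opponent. All names and parameter values below are illustrative choices:

```python
import random
from collections import defaultdict

GOAL, ALPHA, EPSILON, EPISODES = 10, 0.1, 0.1, 200_000
ACTIONS = ("flip", "hold")
Q = defaultdict(float)                      # Q[(state, action)] -> estimated chance of winning

def opponent_turn(score):
    """Fixed opponent: flips once and holds on heads ('hold at 1')."""
    return score + 1 if random.random() < 0.5 else score

def greedy(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def learn_episode():
    i, j, k = 0, 0, 0                       # my score, opponent's score, my turn total
    while True:
        state = (i, j, k)
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(state)
        if action == "hold":
            i, k = i + k, 0
        elif random.random() < 0.5:         # flip came up heads
            k += 1
        else:                               # flip came up tails: turn total lost
            k = 0
        reward, done = 0.0, False
        if i + k >= GOAL:                   # reaching the goal is a win
            reward, done = 1.0, True
        elif k == 0:                        # my turn ended; the opponent plays
            j = opponent_turn(j)
            if j >= GOAL:
                done = True                 # loss: reward stays 0
        target = reward if done else max(Q[((i, j, k), a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        if done:
            return

for _ in range(EPISODES):
    learn_episode()
print(greedy((9, 0, 0)))                    # should come out "flip": one head wins from 9
```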

33 Clever Mabel the Cat

34 Mabel claws the new La-Z-Boy → BAD!
Cats hate water → spray bottle = negative reinforcement
Mabel claws the La-Z-Boy → Todd gets up → Todd sprays Mabel → Mabel gets negative feedback
Mabel learns…

35 Clever Mabel the Cat
Mabel learns to run when Todd gets up.
Mabel first learns local causality:
– Todd gets up → Todd sprays Mabel
Mabel eventually sees no correlation there and learns the indirect cause.
Mabel happily claws the carpet. The End.

36 Backgammon
Tesauro's TD-Gammon:
– Reinforcement Learning + Neural Network (memory for learning)
– Learned backgammon through self-play
– Got better than all but a handful of people in the world!
– Downside: Took 1.5 million games to learn

37 Greased Pig
My continuous variant of Pig
Object: First to score 100 points
On your turn, generate a random number from 0.5 to 6.5 until:
– Your rounded number is 1, and you score NOTHING.
– You hold, and KEEP the sum.
How does this change things?
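A tiny sketch of one Greased Pig turn under one reading of these rules (the unrounded values go into the kept sum; the hold threshold is an arbitrary illustrative choice):

```python
import random

def greased_pig_turn(hold_at=20.0):
    """One Greased Pig turn: draw from 0.5-6.5; a draw that rounds to 1 loses everything."""
    total = 0.0
    while total < hold_at:
        roll = random.uniform(0.5, 6.5)
        if round(roll) == 1:
            return 0.0          # greased: the whole turn total is lost
        total += roll           # keep the unrounded value in the running sum
    return total

print(greased_pig_turn())
```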

38 Greased Pig Challenges
Infinite possible game states
Infinite possible games
Limited experience
Limited memory
Learning and approximation challenge

39 Summary
Solving equations can only take you so far (but much farther than we can fathom).
Machine learning is an exciting area of research that can take us farther.
The power of computing will increasingly aid in our learning in the future.

