# Reinforcement Learning in Games (Colin Cherry, Oct 29/01)


Reinforcement Learning in Games. Colin Cherry (colinc@cs), Oct 29/01.

### Outline

- Reinforcement Learning & TD Learning
- TD-Gammon
- TDLeaf
- Chinook
- Conclusion

### The ideas behind Reinforcement Learning

- Two broad categories for learning:
  - Supervised
  - Unsupervised (our concern)
- Problem with unsupervised learning:
  - Delayed rewards (temporal credit assignment)
- Goal:
  - Create a good control policy based on delayed rewards

### Evaluation Function: Developing a Control Policy

- Evaluation function:
  - A function that estimates the total reward the agent will receive if it acts according to the function from this point onward
- We will assume the function evaluates states (good for deterministic games)
- The evaluation function could be:
  - A look-up table, a linear function, a neural network, or any other function approximator
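To make the "any function approximator" point concrete, here is a minimal Python sketch of two of the listed options behind the same interface. The names `make_table_evaluator` and `make_linear_evaluator`, and the choice of strings as states, are illustrative assumptions and do not come from the slides.

```python
from typing import Callable, Dict, Sequence

State = str  # assumption: states are identified by a name
Evaluator = Callable[[State], float]


def make_table_evaluator(table: Dict[State, float], default: float = 0.0) -> Evaluator:
    """Look-up table: one stored value per state, `default` for unseen states."""
    return lambda state: table.get(state, default)


def make_linear_evaluator(
    weights: Sequence[float],
    features: Callable[[State], Sequence[float]],
) -> Evaluator:
    """Linear function: weighted sum of numeric features extracted from the state."""
    def evaluate(state: State) -> float:
        return sum(w * f for w, f in zip(weights, features(state)))
    return evaluate


# Both evaluators are called the same way, so a learner or a search
# routine does not need to know which representation it was given.
table_eval = make_table_evaluator({"P": 0.5})
linear_eval = make_linear_evaluator([2.0, -1.0], lambda s: (float(len(s)), 1.0))
```

Keeping the evaluator behind a single callable is one possible design; it lets the representation be swapped without touching the code that uses it.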

### Temporal Difference Learning: TD(λ)

- Set initial weights to 0 or to random values
- Assume our evaluation function evaluates the state at time t with the value Y_t, according to some weight vector w
- Modify the weights at the end of each game, applying one update for each time step t+1

[Update equation from the original slide not preserved in this transcript.]
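The slide's own update rule is missing here. For reference, the standard TD(λ) update from the literature is $\Delta w_t = \alpha\,(Y_{t+1} - Y_t)\sum_{k=1}^{t}\lambda^{t-k}\,\nabla_w Y_k$, with the final game outcome standing in for $Y_{t+1}$ at the last step. The sketch below applies that rule once per game, assuming a linear evaluation function (so the gradient of $Y_k$ is simply the feature vector of state k). The names `td_lambda_update`, `alpha`, and `lam` are illustrative and are not taken from the slides.

```python
from typing import List, Sequence


def td_lambda_update(
    weights: Sequence[float],
    episode_features: Sequence[Sequence[float]],
    outcome: float,
    alpha: float = 0.1,
    lam: float = 0.7,
) -> List[float]:
    """Return the weights after one end-of-game TD(lambda) update.

    Assumes a linear evaluator Y_t = w . x_t, where x_t = episode_features[t].
    `outcome` is the final reward and plays the role of Y after the last state.
    """
    # Predictions for every state, computed with the pre-game weights.
    preds = [sum(w * x for w, x in zip(weights, xs)) for xs in episode_features]
    preds.append(outcome)

    trace = [0.0] * len(weights)    # running sum of lam^(t-k) * grad Y_k
    delta_w = [0.0] * len(weights)  # accumulated weight change for the game
    for t, xs in enumerate(episode_features):
        trace = [lam * e + x for e, x in zip(trace, xs)]
        td_error = preds[t + 1] - preds[t]
        delta_w = [d + alpha * td_error * e for d, e in zip(delta_w, trace)]
    return [w + d for w, d in zip(weights, delta_w)]


# Two-state game ending in a win (outcome 1.0). With lam=1.0 both states
# share the credit; with lam=0.0 only the last state is adjusted.
shared = td_lambda_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 1.0, alpha=0.5, lam=1.0)
last_only = td_lambda_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 1.0, alpha=0.5, lam=0.0)
```

The two calls at the end illustrate what λ controls: how far back along the game a surprise at time t+1 is allowed to reach.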

### A quick example: Printer Robot

- Objective:
  - Dock to the printer, collect a document
- Assume 3 states:
  - C: next to coffee machine, no documents
  - P: next to printer, no documents
  - D: next to printer, carrying documents
- Assume 2 actions are seen:
  - a: dock to printer (available only from P or D)
  - b: go to printer (available only from C)

[State diagram: nodes C, P, P (continue), and D (end); edge labels "a: reward" and "b: no reward"; annotation "(some time later)".]
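The printer robot can be used to watch a delayed reward propagate backwards. The sketch below is a tabular TD(0) learner under several assumptions that are not on the slide: action b moves the robot from C to P with no reward, action a either leaves it at P or ends the episode at D, the final reward is 1, there is no discounting, and every episode follows the same fixed path C, P, P, D. The function name `td0_episode` and the step size `alpha` are likewise illustrative.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, float, str]  # (state, reward, next_state)


def td0_episode(values: Dict[str, float], episode: List[Transition], alpha: float = 0.5) -> Dict[str, float]:
    """Apply one pass of tabular TD(0) updates along a recorded episode."""
    for state, reward, next_state in episode:
        # Target is the immediate reward plus the current estimate of what follows.
        target = reward + values[next_state]
        values[state] += alpha * (target - values[state])
    return values


# Assumed fixed episode: C -b-> P -a-> P -a-> D, reward only on the last step.
EPISODE: List[Transition] = [("C", 0.0, "P"), ("P", 0.0, "P"), ("P", 1.0, "D")]

values = {"C": 0.0, "P": 0.0, "D": 0.0}  # D is terminal and is never updated
for _ in range(50):
    td0_episode(values, EPISODE)
```

After the first episode only P has moved (to 0.5), because C was updated before P had learned anything. On later episodes the value of C catches up, and both estimates approach the final reward of 1.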

### TD-Gammon

- Self-taught backgammon player
- Good enough to make the best human players sweat
- Huge success for reinforcement learning
- Far surpassed its supervised-learning cousin, Neurogammon

### How does it work?

- Used an artificial neural network as its evaluation function approximator
- Excellent neural network design
- Used expert features developed for Neurogammon along with a basic board representation
- Hundreds of thousands of training games against itself
- Hard-coded doubling algorithm

### Why did it work so well?

- Stochastic domain: forces exploration
- Linear (basic) concepts are learned first
- Shallow search is “good enough” against humans

### Backgammon vs. Other Games: Shallow Search

- TD-Gammon followed a greedy approach
  - 1-ply look-ahead (later increased to 3-ply)
  - It's hard to predict your opponent's move without his or her dice roll. What about your move after that?
- Doesn't work so well for other games:
  - What features will tell me which move to take by looking only at the immediate results of the moves available to me?
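The following Python sketch illustrates the two points above for a generic dice game: a greedy 1-ply choice only needs the evaluation of each immediate result, while looking one step further means averaging over all 21 distinct rolls the opponent might get. It is not TD-Gammon's code. The callbacks `legal_moves`, `apply_move`, and `evaluate` are hypothetical placeholders, and the evaluation is assumed to be from our own point of view, so the opponent is modelled as minimising it.

```python
from itertools import combinations_with_replacement
from typing import Callable, Iterator, Tuple

Roll = Tuple[int, int]


def dice_rolls() -> Iterator[Tuple[Roll, float]]:
    """Yield each of the 21 distinct two-dice rolls with its probability."""
    for a, b in combinations_with_replacement(range(1, 7), 2):
        yield (a, b), (1 / 36 if a == b else 2 / 36)


def one_ply_move(state, roll: Roll, legal_moves: Callable, apply_move: Callable, evaluate: Callable):
    """Greedy 1-ply choice: the move whose immediate result evaluates best."""
    return max(legal_moves(state, roll), key=lambda m: evaluate(apply_move(state, m)))


def expected_reply_value(state, legal_moves: Callable, apply_move: Callable, evaluate: Callable) -> float:
    """Value of `state` with the opponent to roll.

    Averages, over every possible roll, the result of the opponent's best
    (for them) reply. If a roll allows no move, the state is scored as is.
    """
    total = 0.0
    for roll, prob in dice_rolls():
        results = [evaluate(apply_move(state, m)) for m in legal_moves(state, roll)]
        total += prob * (min(results) if results else evaluate(state))
    return total
```

Each extra ply multiplies the work by the number of rolls times the number of legal moves per roll, which is one way to see why a shallow search is the practical choice in a dice game.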

### TDLeaf(λ)

- TD learning applied to the minimax algorithm
- For each state, search to a constant depth
- Evaluate a state according to a heuristic evaluation of the leaf of its principal variation
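The key ingredient is a fixed-depth minimax search that reports not only the backed-up value but also which leaf produced it. A minimal sketch follows, assuming a game described by two hypothetical callbacks, `children` and `evaluate`, and a flag saying whether the side to move maximises. The TD(λ) weight update is then applied to the evaluations of these leaves rather than to the evaluations of the positions actually played.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def minimax_leaf(
    state: Hashable,
    depth: int,
    maximizing: bool,
    children: Callable[[Hashable], Sequence[Hashable]],
    evaluate: Callable[[Hashable], float],
) -> Tuple[float, Hashable]:
    """Return (minimax value, leaf at the end of the principal variation)."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state), state
    results = [minimax_leaf(k, depth - 1, not maximizing, children, evaluate) for k in kids]
    pick = max if maximizing else min
    # Compare on the value only; the leaf travels along with the chosen value.
    return pick(results, key=lambda r: r[0])


def principal_leaves(
    positions: Sequence[Tuple[Hashable, bool]],
    depth: int,
    children: Callable[[Hashable], Sequence[Hashable]],
    evaluate: Callable[[Hashable], float],
) -> List[Hashable]:
    """For each (position, maximizing) pair of a game, find its principal leaf."""
    return [minimax_leaf(s, depth, mx, children, evaluate)[1] for s, mx in positions]


# Toy game tree, made up purely for illustration.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
SCORES = {"root": 0, "a": 0, "b": 1, "a1": 3, "a2": 5, "b1": 2, "b2": 9}
toy_children = lambda s: TREE.get(s, [])
toy_evaluate = lambda s: SCORES[s]
value, leaf = minimax_leaf("root", 2, True, toy_children, toy_evaluate)
```

In the toy tree the maximising root prefers branch a, because the minimising reply there is worth 3 against 2 under b, so the principal leaf is `a1`.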

### Chinook

- This program, at this school, in this class, should need no introduction
- 84 features (4 sets of 21), each with a tunable weight
- Each feature consists of many hand-picked parameters
- Question: can we learn the 84 weights as well as a human can set them?

### The Test

- Trained using TDLeaf
- All weight values set to 0
- Variations introduced by using a book of opening moves (144 3-ply openings)
- Played no more than 10,000 games against itself before hitting a plateau
- Both programs are to use the same depth

### The results were very positive

- Chinook with all weights set to 1 vs. Tournament Chinook: 94.5-193.5
- Chinook after self-play training vs. Tournament Chinook: Even Steven
- Some lessons learned:
  - You have to train at the same depth you plan to play at
  - You have to play against real people too

### Conclusions

- TD(λ) can be a powerful tool in the creation of game-playing evaluation functions
- Training must be of a type that introduces variation
- Features need to be hand-picked (for now)
- TD and TDLeaf allow quick weight tuning
  - Takes a lot of the tedium out of player design
  - Allows designers to experiment more with features