1 CHECKERS: TD(Λ) LEARNING APPLIED FOR DETERMINISTIC GAME. Presented By: Amna Khan. Presented To: Miss Saleha Raza

2 Introduction The research paper was written by Halina Kwasnicka and Artur Spirydowicz, Department of Computer Science, Wroclaw University of Technology.

3 Discussion Topics: Aim of the paper, Introduction, About the game, TD(λ) learning, Modifications, Checkers, Q-learning / Q algorithm, TD(λ) learning method applied for checkers, Results obtained, Summary.

4 Introduction The aim of the paper is to explore the possibility of using reinforcement learning, of the kind made famous by TD-Gammon, for a game without random factors. The program discussed in the paper consists of a neural network that is trained, on the basis of obtained rewards, to be an evaluation function for the game of checkers by playing against itself. The program also makes it possible to compare TD learning with the alternative approach of supervised learning. This method of training multilayered neural networks is called TD(λ).

5 Introduction The article presents a game-learning program called CHECKERS, which uses the TD(λ) learning method with a feed-forward neural network. The network learns and plays checkers by experience, receiving positive and negative rewards. The obtained results show that the approach gives satisfactory results: TD(λ), as used in TD-Gammon for backgammon, can be successfully applied to deterministic games.

6 Reinforcement Learning An agent observes an input state and produces an action. After this it receives some "reward" from the environment. The reward indicates how good or bad the output produced by the agent was. The goal of such learning is to produce the optimal action leading to the maximal reward. Often the reward is delayed: it is known only at the end of a long sequence of inputs and actions.
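
A minimal, self-contained sketch of this loop in Python (the one-state toy environment and the random "agent" below are illustrative placeholders, not anything from the paper):

    import random

    # Toy agent/environment loop: the agent observes a state, produces an action,
    # and the environment answers with a reward; here the reward arrives only at
    # the end of the (one-step) episode, mimicking a delayed reward.
    def act(state):
        return random.choice([0, 1])          # the agent's action for the observed state

    def step(action):
        return (1.0 if action == 1 else -1.0), True   # (reward, episode_done)

    state, done, total_reward = 0, False, 0.0
    while not done:
        action = act(state)
        reward, done = step(action)
        total_reward += reward                # indicates how good the agent's output was
    print("episode reward:", total_reward)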

7 Backgammon The backgammon game has some features that are absent in checkers, and they are probably why TD-Gammon plays surprisingly well. 1. One of them is the stochastic nature of the task: it comes from the random dice rolls, which assures a wide variability in the positions visited during the whole training process. 2. The agent can explore more of the state space and discover new strategies.

8 Checkers Checkers is a deterministic game, so self-play training can stall, exploring only a small part of the state space, because only a narrow range of different positions is produced. The game is played on a 10 x 10 board.

9 Feature Comparison Another feature of backgammon is that, for all playing strategies, the sequence of moves will terminate (win or loss). In deterministic games we can obtain cycles, and in such a case the trained network is not able to learn because the final reward is never produced.

10 The Problem The problem that occurs for an agent is known as "temporal credit assignment", because the agent performs its task from its own experience. It is necessary to overcome this problem if we want to use the TD(λ) learning method for deterministic games. TD learning seems to be a promising general-purpose technique for learning with delayed reward: not only for prediction learning but also for a combined prediction and control task where control decisions are made by optimizing predicted output.

11 Temporal Difference Learning Temporal difference, or TD, learning methods are a class of methods for approaching the temporal credit assignment problem. The learning is based on the difference between temporally successive predictions. The most recent of these TD methods is an algorithm proposed for training multilayer neural networks, called TD(λ). The TD approach to deterministic games requires a kind of external noise to produce the variability and exploration, such as that obtained from the random dice rolls in backgammon.

12 TD Following the opinion of G. Tesauro and other researchers, we conclude: 1. During the training process, all possible states of the game should be taken into account. 2. Training episodes should end with receiving a reward, and the reward must be positive or negative. To relax the above constraints on using TD learning, we propose four modifications in our program CHECKERS.

13 Modifications 1. During the learning process, a number of steps without captures is detected; this indicates that the program has probably generated a cycle. In such a case the game is finished and the reward is established based on the difference between the number of the learner's and the opponent's checkers on the board (see the sketch below). 2. The observation phase of the learning process shows that learning goes properly at the beginning, but when only a few checkers are left, further moves do not assist in learning. We therefore adopt the hypothesis that a program which has learned to eliminate the opponent's checkers efficiently will be able to stop the game, with the reward depending on the piece difference.
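
A rough sketch of modification 1, assuming a simple piece-count reward and a hypothetical move-counter threshold (MAX_QUIET_MOVES is not a value taken from the paper):

    # If too many moves pass without a capture, the game has probably entered a
    # cycle: finish it and derive the reward from the difference in checkers left.
    MAX_QUIET_MOVES = 40   # assumed threshold, not the paper's value

    def cycle_reward(moves_without_capture, learner_pieces, opponent_pieces):
        """Return a terminal reward when a probable cycle is detected, else None."""
        if moves_without_capture < MAX_QUIET_MOVES:
            return None                        # no cycle suspected, game continues
        diff = learner_pieces - opponent_pieces
        if diff > 0:
            return 1.0                         # learner kept more checkers: positive reward
        if diff < 0:
            return -1.0                        # opponent kept more checkers: negative reward
        return 0.0                             # equal material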

14 Modifications 3. The method requires testing a great number of possible game states during the learning process. A variety of states can be obtained by random selection of the initial state; this allows quite good diversity of tested states. 4. To enlarge the diversity of a game, we introduce random noise into the output of the trained neural network; it eliminates full determinism in the selection of a move at a given state (see the sketch below).
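
Modification 4 can be sketched as a small perturbation of the network's evaluation; the noise magnitude below is an assumption:

    import random

    NOISE_SCALE = 0.05   # assumed magnitude of the random noise

    def noisy_evaluation(network_output):
        # Adding a small random value breaks full determinism in move selection,
        # so self-play visits a wider range of positions.
        return network_output + random.uniform(-NOISE_SCALE, NOISE_SCALE)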

15 TD learning used in the "CHECKERS" program Reinforcement learning: In CHECKERS, a finished game produces the relevant reward. We can conclude that the goal of the learning algorithm is to find the optimal strategy π* that maximizes the discounted cumulative reward:
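
The formula itself did not survive the transcript; presumably the slide shows the standard discounted cumulative reward and the corresponding optimal strategy:

    V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}, \qquad 0 \le \gamma < 1

    \pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all states } s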

16 TD learning used in the "CHECKERS" program Q-learning: We have only the three sequences: actions a1, a2, …, an, states s1, s2, …, sn, and rewards r1, r2, …, rn. What is looked for is π. Let us define an optimal action for a state s as:
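
The definition is missing from the transcript; presumably it is the standard one, where \delta(s,a) denotes the state reached from s by action a and V^{*} is the value of the optimal strategy:

    \pi^{*}(s) = \arg\max_{a} \bigl[ r(s,a) + \gamma V^{*}(\delta(s,a)) \bigr]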

17 TD learning used in the "CHECKERS" program But in practice, an agent can see the newly reached state, as well as the obtained reward, only after making the selected action. Therefore, instead of requiring knowledge of future rewards, a heuristic function Q for assessing the quality of a given action is developed:
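
Again the formula is not reproduced here; the standard Q-learning definition it presumably refers to is:

    Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a'), \qquad \pi^{*}(s) = \arg\max_{a} Q(s,a)

so the optimal action can be chosen from Q alone, without knowing the reward and transition functions in advance.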

18 TD learning used in the "CHECKERS" program Temporal Difference Learning: Q-learning takes into account only the next state, but we can regard two further states (Q(2)), or three (Q(3)), or, in general, n states ahead (Q(n)). Richard Sutton proposes to take into account all states:
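
The missing formulas are presumably the standard n-step return and Sutton's λ-weighted combination of all of them:

    Q^{(n)}(s_t, a_t) = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a} Q(s_{t+n}, a)

    Q^{\lambda}(s_t, a_t) = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} Q^{(n)}(s_t, a_t)

For λ = 0 this reduces to ordinary one-step Q-learning; for λ = 1 it uses the full future reward sequence.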

19 TD learning used in the "CHECKERS" program Neural Network and Temporal Difference Learning: In G. Tesauro's program, the learned network plays against itself instead of using a learning set. The neural network receives an input vector (a state of the board), considers all combinations of moves (one or two moves ahead), and each possible resulting state is treated as an input vector and assessed taking the discounted reward into account. The action that produces the maximal discounted reward is selected.
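
A sketch of this move-selection rule; the callables passed in (legal_moves, apply_move, encode, network) are hypothetical placeholders rather than the paper's actual interface:

    def select_move(board, legal_moves, apply_move, encode, network):
        """Score every legal successor position with the network and play the best one."""
        best_move, best_value = None, float("-inf")
        for move in legal_moves(board):
            successor = apply_move(board, move)    # position after the candidate move
            value = network(encode(successor))     # predicted (discounted) reward
            if value > best_value:
                best_move, best_value = move, value
        return best_move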

20 CHECKERS – TD(λ) learning method applied for checkers The developed computer program works in two modes: learning and playing a game. CHECKERS builds a search tree and uses the mini-max algorithm with alpha-beta pruning. We trained a three-layered neural network, with 32 input neurons (they code the 10 considered features) and one output neuron that plays the role of an evaluation function. A sigmoid is used as the transfer function. The output signal indicates the chance of reward.
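
A rough sketch of such an evaluation network, assuming an arbitrary hidden-layer size (the paper's hidden-layer size is not given on this slide):

    import numpy as np

    rng = np.random.default_rng(0)
    HIDDEN = 40                                       # assumed hidden-layer size
    W1 = rng.uniform(-0.1, 0.1, size=(HIDDEN, 32))    # 32 input neurons -> hidden layer
    W2 = rng.uniform(-0.1, 0.1, size=HIDDEN)          # hidden layer -> single output neuron

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def evaluate(features):
        """features: length-32 vector coding the board; returns the predicted chance of reward."""
        hidden = sigmoid(W1 @ features)
        return float(sigmoid(W2 @ hidden))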

21 CHECKERS – TD(λ) learning method applied for checkers The set of features used as the input values is selected as below:

22 CHECKERS – TD(λ) learning method applied for checkers In CHECKERS, every final position is checked for capture possibilities. Both networks, the pupil and the trainer, are initially identical: they have the same random weights and, in consequence, they produce random outputs. After each move of the pupil, its weights are changed so that the output of the trained network in the former state (Yt) becomes closer to the output in the current state (Yt+1).
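
A sketch of this update as a TD(λ) step with an eligibility trace; alpha, lam, and the gradient argument are assumptions, not the paper's exact formulation:

    # Shift the weights so that the output for the former state (y_t) moves toward
    # the output for the current state (y_t_plus_1); the trace accumulates past
    # gradients, decayed by lambda, so earlier predictions are also corrected.
    def td_lambda_update(weights, trace, grad_y_t, y_t, y_t_plus_1, alpha=0.1, lam=0.7):
        trace = lam * trace + grad_y_t
        weights = weights + alpha * (y_t_plus_1 - y_t) * trace
        return weights, trace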

23 CHECKERS – TD(λ) learning method applied for checkers The control games between the pupil and the trainer are performed without any change of weights. When the pupil wins by the assumed margin, all weights of the pupil are copied to the trainer, so the trained network receives an opponent strong enough for it to keep making progress in learning.

24 CHECKERS – obtained results Is the trained network able to learn to play checkers?

25 CHECKERS – obtained results It reaches about 80% of games won. When the level of search was changed to 3, the network was able to win all games.

26 CHECKERS – obtained results How good a player is the trained network?

27 Summary The developed program and the briefly presented tests demonstrate that TD learning can be applied to learn games other than backgammon. The introduced modifications allow the weaknesses of checkers as a training domain to be eliminated. The developed program plays checkers at least at an intermediate level. It is easy to find better checkers programs, but they do not have learning capabilities.

28 Summary They work on the basis of coded knowledge acquired from experts. Most of them have a library of opening moves and endgame databases. It seems that the program may be improved by introducing methods that allow processing a greater number of nodes of the game tree in a shorter time.

