CHECKERS: TD(λ) LEARNING APPLIED FOR DETERMINISTIC GAME
Presented By: Amna Khan
Presented To: Miss Saleha Raza

Introduction The research paper was written by Halina Kwasnicka and Artur Spirydowicz, Department of Computer Science, Wroclaw University of Technology.

Discussion Topics
- Aim of the paper
- Introduction
- About the game
- TD(λ) learning
- Modifications
- Checkers
- Q learning / Q algorithm
- TD(λ) learning method applied for checkers
- Results obtained
- Summary

Introduction The aim of the paper is to explore the possibility of using the reinforcement learning approach, commonly known from TD-Gammon, for a game without random factors. The program discussed in the paper consists of a neural network that is trained, on the basis of obtained rewards, to be an evaluation function for the game of checkers by playing against itself. The program also makes it possible to compare TD learning with the other approach, supervised learning. This method of training multilayered neural networks is called TD(λ).

Introduction The article presents a game-learning program called CHECKERS, which uses the TD(λ) learning method for a feed-forward neural network. The network learns and plays checkers by experience, receiving positive and negative rewards. The obtained results show that the approach gives satisfactory results: TD(λ), as used in TD-Gammon for backgammon, can be successfully applied to deterministic games.

Reinforcement Learning An agent observes an input state and produces an action. After this it receives some "reward" from the environment. The reward indicates how good or bad the output produced by the agent was. The goal of such learning is to produce the optimal action leading to the maximal reward. Often the reward is delayed: it is known only at the end of a long sequence of inputs and actions.
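A minimal Python sketch of this agent-environment loop, assuming a hypothetical environment object with reset() and step() methods and an agent with act() and learn() methods (none of these names come from the paper):

def run_episode(env, agent):
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                    # agent produces an action
        next_state, reward, done = env.step(action)  # environment returns the next state and a reward
        agent.learn(state, action, reward, next_state, done)
        total_reward += reward                       # often zero until the final, delayed reward
        state = next_state
    return total_reward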

Backgammon Backgammon has some features that are absent in checkers, and they probably explain why TD-Gammon plays surprisingly well.
1. One of them is the stochastic nature of the task. It comes from the random dice rolls, which assure a wide variability in the positions visited during the whole training process.
2. As a result, the agent can explore more of the state space and discover new strategies.

Checkers Checkers is a deterministic game, where self-play training can get stuck exploring only a small part of the state space, because only a narrow range of different positions is produced. The game is played on a 10 x 10 board.

Feature Comparison Another feature of backgammon is that, for all playing strategies, the sequence of moves will terminate (win or loss). In deterministic games we can obtain cycles, and in such a case the trained network is not able to learn, because the final reward is never produced.

The Problem The problem that occurs for an agent is known as "temporal credit assignment", because the agent has to learn its task from its own experience. It is necessary to overcome this problem if we want to use the TD(λ) learning method for deterministic games. TD learning seems to be a promising general-purpose technique for learning with delayed rewards, not only for prediction learning but also for combined prediction and control tasks, where control decisions are made by optimizing predicted output.

Temporal Difference Learning Temporal difference (TD) learning methods are a class of methods for approaching the temporal credit assignment problem. The learning is based on the difference between temporally successive predictions. The most recent of these TD methods is an algorithm for training multilayer neural networks called TD(λ). The TD approach to deterministic games requires some kind of external noise to produce variability and exploration, such as that obtained from the random dice rolls in backgammon.
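The general form of Sutton's TD(λ) weight update, as used in TD-Gammon, can be written as
\[ \Delta w_t = \alpha \,(P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_{w} P_k , \]
where P_t is the network's prediction at time t, α is the learning rate, and λ controls how far back in time credit for the prediction difference is propagated.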

TD Following the opinion of G. Tesauro and other researchers, two constraints can be formulated:
1. During the training process, all possible states of the game should be taken into account.
2. Training episodes should end with receiving a reward; the reward must be positive or negative.
To relax the above-mentioned constraints on using TD learning, four modifications are proposed in the program CHECKERS.

Modifications
1. During the learning process, a number of consecutive moves without captures is detected; this indicates that the program has probably generated a cycle. In such a case the game is finished and the reward is established on the basis of the difference between the number of the learner's and the opponent's checkers on the board, as sketched below.
2. Observation of the learning process shows that at the beginning learning proceeds properly, but when only a few checkers are left, further moves do not assist in learning. Therefore the hypothesis is assumed that a program which has learned to eliminate the opponent's checkers efficiently can stop the game early, with the reward depending on the piece difference.
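A sketch of these two modifications, assuming an illustrative cutoff of 30 moves without a capture and a simple piece-difference reward (both the cutoff and the scaling are hypothetical, not values from the paper):

MAX_MOVES_WITHOUT_CAPTURE = 30   # illustrative cutoff, not the paper's value

def early_termination_reward(moves_without_capture, learner_pieces, opponent_pieces):
    """Return a terminal reward if the game is probably cycling, otherwise None."""
    if moves_without_capture >= MAX_MOVES_WITHOUT_CAPTURE:
        diff = learner_pieces - opponent_pieces
        return max(-1.0, min(1.0, diff / 10.0))  # positive when the learner has more pieces
    return None  # game continues normally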

Modifications
3. The method requires testing a great number of possible game states during the learning process. Variety of states can be obtained by a random selection of the initial state; this allows quite good diversity of tested states.
4. To enlarge the diversity of a game, random noise is introduced into the output of the trained neural network; it eliminates full determinism in the selection of a move at a given state.
Both ideas are sketched below.
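A sketch of modifications 3 and 4: pick the initial position of a training game at random and perturb the network's evaluation with small random noise so that move selection is not fully deterministic (the noise scale is an assumption, not a value from the paper):

import random

NOISE_SCALE = 0.05   # assumed magnitude of the noise

def pick_initial_state(state_pool):
    return random.choice(state_pool)          # modification 3: random initial state

def noisy_evaluation(network_output):
    return network_output + random.uniform(-NOISE_SCALE, NOISE_SCALE)  # modification 4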

TD learning used in the CHECKERS program Reinforcement learning: In CHECKERS, only a finished game produces a relevant reward. We can conclude that the goal of the learning algorithm is to find the optimal strategy π* that maximizes the discounted cumulative reward:
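A standard form of this objective (with γ denoting the discount factor, 0 ≤ γ < 1) is
\[ \pi^{*} = \arg\max_{\pi} V^{\pi}(s) \ \text{for every state } s, \qquad V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^{i}\, r_{t+i} . \]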

TD learning used in the CHECKERS program Q-learning: We have only three sequences: actions a1, a2, ..., an, states s1, s2, ..., sn, and rewards r1, r2, ..., rn. What is looked for is the strategy π. Let us define the optimal action for a state s as:
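In the usual formulation (writing δ(s, a) for the state reached from s by action a, and V* for the value of the optimal strategy), this can be written as
\[ \pi^{*}(s) = \arg\max_{a}\bigl[\, r(s,a) + \gamma\, V^{*}(\delta(s,a)) \,\bigr]. \]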

TD learning used in the CHECKERS program But in practice, an agent sees the newly reached state, as well as the obtained reward, only after making the selected action. Therefore, instead of relying on knowledge of future rewards, a heuristic function Q for assessing the quality of a given action is introduced:
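The standard Q-learning definition, consistent with the notation above, is
\[ Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a'), \qquad \pi^{*}(s) = \arg\max_{a} Q(s,a). \]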

TD learning used in the CHECKERS program Temporal difference learning: Q learning takes into account only the next state, but we can regard two further states (Q(2)), or three (Q(3)), or, in general, n states ahead (Q(n)). Richard Sutton proposes to take all states into account:
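In the standard formulation, the n-step estimate and its λ-weighted combination over all look-ahead depths are
\[ Q^{(n)}(s_t,a_t) = r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} \max_{a} Q(s_{t+n},a), \]
\[ Q^{\lambda}(s_t,a_t) = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, Q^{(n)}(s_t,a_t). \]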

TD learning used in the CHECKERS program Neural network and temporal difference learning: In G. Tesauro's program, the trained network plays against itself instead of using a learning set. The neural network receives an input vector, i.e. a state of the board; it considers all combinations of moves (one or two moves ahead), and each possible resulting state is treated as an input vector and assessed with respect to the discounted reward. The action that produces the maximal discounted reward is selected.
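A sketch of this self-play move selection, assuming hypothetical helper functions apply_move (returns the position after a move) and encode (turns a position into the network's input vector):

def select_move(network, state, legal_moves, apply_move, encode):
    best_move, best_value = None, float("-inf")
    for move in legal_moves:
        next_state = apply_move(state, move)          # candidate position after the move
        value = network.predict(encode(next_state))   # predicted chance of reward
        if value > best_value:
            best_move, best_value = move, value
    return best_move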

CHECKERS – TD(λ) learning method applied for checkers The developed computer program works in two modes: learning and playing. CHECKERS builds a search tree and uses the minimax algorithm with alpha-beta pruning. A three-layered neural network was trained, with 32 input neurons, which code the 10 considered features, and one output neuron, which plays the role of the evaluation function. A sigmoid is used as the transfer function. The output signal indicates the chance of reward.
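A minimal sketch of such an evaluation network: 32 inputs, one hidden layer (its size, 40, is an assumption, since the transcript does not give it), and a single sigmoid output interpreted as the chance of reward:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EvalNet:
    def __init__(self, n_in=32, n_hidden=40, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # input -> hidden weights
        self.w2 = rng.normal(0.0, 0.1, (1, n_hidden))      # hidden -> output weights

    def predict(self, x):
        h = sigmoid(self.w1 @ x)        # hidden-layer activations
        out = sigmoid(self.w2 @ h)      # single output neuron
        return out.item()               # evaluation of the position in (0, 1)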

CHECKERS – TD(λ) learning method applied for checkers The set of features used as the input values is selected as below:

CHECKERS – TD(λ) learning method applied for checkers In CHECKERS, every final position is checked for capture possibilities. Both networks, the pupil and the trainer, are initially identical: they have the same random weights and consequently produce random outputs. After each move of the pupil, its weights are changed so that the output of the trained network in the former state (Yt) becomes closer to the output in the current state (Yt+1).
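In its simplest TD(0) form (the paper uses TD(λ), which additionally weights earlier gradients with eligibility traces, as in the update shown earlier), this per-move change can be written as
\[ \Delta w = \alpha\,(Y_{t+1} - Y_t)\,\nabla_{w} Y_t . \]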

CHECKERS – TD(λ) learning method applied for checkers The control games between the pupil and the trainer are performed without any change of weights. When the pupil wins by an assumed margin, all weights of the pupil are copied to the trainer, so the trained network receives an opponent strong enough to allow further progress in learning.

CHECKERS – obtained results Is the trained network able to learn to play checkers?

CHECKERS – obtained results It reaches a plateau at about 80% of games won. When the search depth was changed to 3, the network was able to win all games.

CHECKERS – obtained results How good a player is the trained network?

Summary The developed program and the briefly presented tests demonstrate that TD learning can be applied to learn games other than backgammon. The introduced modifications allow the weaknesses of checkers as a domain for TD learning to be eliminated. The developed program plays checkers at least at an intermediate level. It is easy to find better checkers programs, but they do not have learning capabilities.

Summary They work on the basis of coded knowledge acquired from experts. Most of them have a library of opening moves and endgame databases. It seems that the program may be improved by introducing methods that allow a greater number of nodes of the game tree to be processed in a shorter time.