Download presentation

Presentation is loading. Please wait.

Published byKenna Lumley Modified over 2 years ago

1
Monte Carlo Tree Search A Tutorial Peter Cowling University of York, UK Simon Lucas University of Essex, UK Supported by the Engineering and Physical Sciences Research Council (EPSRC)

2
Further Reading Browne C., Powley E., Whitehouse D., Lucas S., Cowling P.I., Rohlfshagen P., Tavener S., Perez D. and Samothrakis S., Colton S. (2012): "A Survey of Monte Carlo Tree Search Methods" IEEE Transactions on Computational Intelligence and AI in Games, IEEE, 4 (1): 1-43.

3
Monte Carlo Tree Search (MCTS) Why know more? Go: Strong/World Champion MCTS players in: General Game Playing, Hex, Amazons, Arimaa, Khet, Shogi, Mancala, Morpion, Blokus, Focus, Chinese Checkers, Yavalath, Connect Four, Gomoku Over 250 papers since about one per week on average. Promise for planning, real-time games, nondeterministic games, single-player puzzles, optimisation, simulation, PCG, …

4
Tutorial Structure Monte Carlo search – The multi-armed bandit problem Monte Carlo Tree Search (MCTS) – Selection, expansion, simulation, backpropagation Enhancements to the basic algorithm Handling uncertainty in MCTS MCTS for real-time games Conclusions, future research directions

5
Monte Carlo Search for Decision Making in Games No trees yet …

6
The Multi-Armed Bandit Problem At each step pull one arm Noisy/random reward signal In order to: Find the best arm Minimise regret Maximise expected return

7
Which Arm to Pull? Flat Monte Carlo Share trials uniformly between arms ε-Greedy P(1- ε) – Best arm so far P(ε) – Random arm UCB1 (Auer et al (2002)). Choose arm j so as to maximise: Mean so far Upper bound on variance

8
Game Decisions Current position Position after move A Position after move B Position after move C Move A Move B Move C Simulation Result (+1 = win, 0 = loss) Multi-Armed Bandit Arm AArm CArm B

9
Discussion Anytime – stop whenever you like UCB1 formula minimises regret – Grows like log(n) Needs only game rules: – Move generation – Terminal state evaluation Surprisingly effective, but… … doesnt look ahead

10
MCTS / UCT Basics

11
Attractive Features Anytime Scalable – Tackle complex games better than before – May be logarithmically better with increased CPU No need for heuristic function – Though usually better with one Next well look at: – General MCTS – UCT in particular

12
MCTS: the main idea Tree policy: choose which node to expand (not necessarily leaf of tree) Default (simulation) policy: random playout until end of game

13
MCTS Algorithm Decompose into 6 parts: MCTS main algorithm – Tree policy Expand Best Child (UCT Formula) – Default Policy – Back-propagate Well run through these then show demos

14
MCTS Main Algorithm BestChild simply picks best child node of root according to some criteria: e.g. best mean value In our pseudo-code BestChild is called from TreePolicy and from MctsSearch, but different versions can be used – E.g. final selection can be the max value child or the most frequently visited one

15
TreePolicy Note that node selected for expansion does not need to be a leaf of the tree – The nonterminal test refers to the game state

16
Expand Note: need to choose an appropriate data structure for the children (array, ArrayList, Map)

17
Best Child (UCT) This is the standard UCT equation – Used in the tree Higher values of c lead to more exploration Other terms can be added, and usually are – More on this later

18
DefaultPolicy Each time a new node is added to the tree, the default policy randomly rolls out from the current state until a terminal state of the game is reached The standard is to do this uniformly randomly – But better performance may be obtained by biasing with knowledge

19
Backup Note that v is the new node added to the tree by the tree policy Back up the values from the added node up the tree to the root

20
MCTS Builds Asymmetric Trees (demo)

21
Connect 4 Demo

22
Enhancements to the Basic MCTS Algorithm Tree Policy Default Policy Domain: independent or specific

23
Selection/Expansion Enhancements (just a sample …) AMAF / RAVE First Play Urgency Learning – E.g. Temporal Difference Learning to bias tree policy or default policy Bandit enhancements – E.g. Bayesian selection Parameter Tuning

24
All Moves As First (AMAF), Rapid Value Action Estimates (RAVE) Additional term in UCT equation: – Treat actions / moves the same independently of where they occur in the move sequence

25
Simulation Enhancements Nudge the simulation to be more realistic – Without introducing bias – At very low CPU cost Examples: – Move-Average Sampling Technique (MAST) – Last Good Reply (LGR) – Patterns e.g. Bridge Completion: – Heuristic value functions Paradoxically better simulation policy sometimes yields weaker play

26
Hex Demo

27
Parallelisation

28
Handling Uncertainty and Incomplete Information in MCTS

29
Probabilistic (stochastic) nodes MAXMAX MINMIN MAXMAX CHANCECHANCE Probabilistic (stochastic) planning Decision-making/optimisation under uncertainty

30
Hidden Information E.g. Poker, Bridge, Security Security Actual state: Observation:Observation: Information set: {,,,...}

31
Effects of Uncertainty and Hidden Information on the Game Tree 4 possible plays by me 50 possible random card draws 40 C 3 = 9880 different opponent plays... choose subset

32
Reduced Branching Through Determinization 4 possible plays by me 1 possible random card draws 4 C 3 = 4 different opponent plays... [Yoon, Fern & Givan 2007] Use average results over many determinizations

33
DeterminizationDeterminization Sample states from the information set Sample states from the information set Analyse the individual perfect information games Analyse the individual perfect information games Combine the results at the end Combine the results at the end Successes Successes – Bridge (Ginsberg), Scrabble (Sheppard) – Klondike solitaire (Bjarnason, Fern and Tadepalli) – Probabilistic planning (Yoon, Fern and Givan) Problems Problems – Never tries to gather or hide information – Suffers from strategy fusion and non-locality (Frank and Basin) M.L. Ginsberg, GIB: Imperfect information in a computationally challenging game, Journal of Artificial Intelligence Research, vol. 14, 2002, p. 313–368. R. Bjarnason, A. Fern, and P. Tadepalli, Lower bounding Klondike solitaire with Monte-Carlo planning, Proc. ICAPS-2009, p. 26–33. S. Yoon, A. Fern, and R. Givan, FF-Replan: A Baseline for Probabilistic Planning, 1Proc. ICAPS-2007, p. 352–359. I. Frank and D. Basin, Search in games with incomplete information: a case study using Bridge card play, Artificial Intelligence, vol. 100, 1998, pp

34
CheatingCheating Easiest approach to AI for games with imperfect information: cheat and look at the hidden information and outcomes of future chance events Easiest approach to AI for games with imperfect information: cheat and look at the hidden information and outcomes of future chance events This gives a deterministic game of perfect information, which can be searched with standard techniques (minimax, UCT, …) This gives a deterministic game of perfect information, which can be searched with standard techniques (minimax, UCT, …) Obviously not a reasonable approach in practice, but gives a useful bound on performance Obviously not a reasonable approach in practice, but gives a useful bound on performance Much used in commercial games Much used in commercial games

35
Information Set Monte Carlo Tree Search (ISMCTS) Collect information about the value of a decision in the corresponding information set. Maintain a single tree Use each determinization only once Determinizations restrict search to a subtree of the information set tree.

36
Board game designed by Reiner Knizia Hidden information: cant see the identities of opponents pieces Simultaneous moves: combat is resolved by both players simultaneously choosing a card Lord of the Rings: The Confrontation Also Phantom 4,4,4, Dou Di Zhu and several other card games…

37
MCTS for Real-Time Games Limited roll-out budget – Heuristic knowledge becomes important Action space may be too fine-grained – Take Macro-actions – Otherwise planning will be very short-term May be no terminal node in sight – Again, use heuristic – Tune simulation depth Next-state function may be expensive – Consider making a simpler abstraction

38
PTSP Demo

39
Summary MCTS: exciting area of research Many impressive achievements already – With many more to come Some interesting directions: – Applications beyond games – Comparisons with other search methods – Heuristic knowledge is important – more work needed on learning it – Efficient parameter tuning – POMDPs – Uncertainty – Macro-actions – Hybridisation with other CI methods e.g. Evolution

40
Questions? Browne C., Powley E., Whitehouse D., Lucas S., Cowling P.I., Rohlfshagen P., Tavener S., Perez D. and Samothrakis S., Colton S. (2012): "A Survey of Monte Carlo Tree Search Methods" IEEE Transactions on Computational Intelligence and AI in Games, IEEE, 4 (1): 1-43.

41
Advertising Break Sep 28: Essex workshop on AI and Games – Strong MCTS Theme – Organised by Essex Game Intelligence Group Next week: AIGAMEDEV Vienna AI summit – Find out what games industry people think – Resistance competition and MCTS tutorial (!)

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google