Presentation transcript: "Texas Holdem Poker With Q-Learning"

1 Texas Holdem Poker With Q-Learning

2 First Round (pre-flop): [Figure: Player and Opponent hole cards]

3 Second Round (flop): [Figure: Player and Opponent hole cards plus the community cards]

4 Third Round (turn): [Figure: Player and Opponent hole cards plus the community cards]

5 Final Round (river): [Figure: Player and Opponent hole cards plus the community cards]

6 End (We Win): [Figure: Player and Opponent hole cards plus the community cards]

7 End Round: Note how initially low hands can win later when more community cards are added. [Figure: Player and Opponent hole cards plus the community cards]

8 The Problem: The State Space Is Too Big. Over 2 million states would be needed just to represent the possible combinations for a single 5-card poker hand.
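
For reference (a worked count, not from the slides), the number of distinct 5-card hands drawn from a 52-card deck is:

```latex
\binom{52}{5} = \frac{52 \cdot 51 \cdot 50 \cdot 49 \cdot 48}{5!} = 2{,}598{,}960
```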

9 Our Solution: It does not matter what your exact cards are, only how they compare to your opponent's. The most important piece of information the agent needs is how many possible two-card combinations could make a better hand.

10 Our State Representation: [Round] [Opponent's Last Bet] [# Possible Better Hands] [Best Obtainable Hand], for example [4] [3] [10] [3].
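
A minimal sketch of how such a state tuple might be encoded in Python. The class name, field names, and the integer codings are illustrative assumptions, not taken from the original implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PokerState:
    """Abstract game state used as the Q-table key (field names are illustrative)."""
    betting_round: int      # 1 = pre-flop, 2 = flop, 3 = turn, 4 = river
    opponent_last_bet: int  # e.g. 0 = no bet yet, 1 = check, 2 = call, 3 = raise
    n_better_hands: int     # (bucketed) count of two-card holdings that beat ours
    best_hand_rank: int     # (bucketed) rank of the best hand we can currently make

# The example state from the slide, [4] [3] [10] [3]:
example = PokerState(betting_round=4, opponent_last_bet=3,
                     n_better_hands=10, best_hand_rank=3)
```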

11 To Calculate the # Better Hands: from the player's hole cards and the community cards, enumerate all other possible two-card holdings, evaluate each one, and count how many of them make a better hand.
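
A minimal Python sketch of this counting procedure. The `hand_strength` evaluator is an assumed helper (higher score = stronger hand); the actual evaluation code is not shown on the slides:

```python
from itertools import combinations

def count_better_hands(my_hole, community, unseen_cards, hand_strength):
    """Count how many opponent two-card holdings currently beat our hand.

    `hand_strength(hole, community)` is an assumed evaluator returning a
    comparable score (higher = stronger); `unseen_cards` are all cards not
    visible to the player.
    """
    my_score = hand_strength(my_hole, community)
    better = 0
    for opp_hole in combinations(unseen_cards, 2):  # every possible opponent holding
        if hand_strength(list(opp_hole), community) > my_score:
            better += 1
    return better
```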

12 Q-lambda Implementation (I): The current state of the game is stored in a variable. Each time the community cards are updated or the opponent places a bet, we update the current state. For every state, the Q-value of each betting action is stored in an array. Example for some state: Fold = -0.9954, Check = 2.014, Call = 1.745, Raise = -3.457.
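
One plausible way to store the per-state Q-values described here. The variable names and the greedy-selection helper are illustrative assumptions; the example values are the ones shown on the slide:

```python
from collections import defaultdict

ACTIONS = ["fold", "check", "call", "raise"]

# One Q-value per (state, action); states not seen yet start at zero.
Q = defaultdict(lambda: [0.0] * len(ACTIONS))

some_state = (4, 3, 10, 3)                        # (round, opp_bet, #better, best_hand)
Q[some_state] = [-0.9954, 2.014, 1.745, -3.457]   # values from the slide

def greedy_action(state):
    """Return the betting action with the highest Q-value in this state."""
    values = Q[state]
    return ACTIONS[values.index(max(values))]
```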

13 Q-lambda Implementation (II): Eligibility Trace: we keep a vector of the state-action pairs that are responsible for us being in the current state (in state s1 we did action a1, in state s2 we did action a2, and now we are in the current state). Each time we make a betting decision, we add the current state and the action we chose to the eligibility trace.
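
A small sketch of the trace bookkeeping, assuming the tuple encoding from the earlier sketch; the example state values are made up for illustration:

```python
# Eligibility trace: the (state, action) pairs responsible for reaching the current state.
trace = []

def record_decision(state, action):
    """Called on every betting decision so the pair can later share in the final reward."""
    trace.append((state, action))

# Illustrative usage (state tuples follow the encoding sketched earlier):
record_decision((1, 0, 42, 2), "call")
record_decision((2, 3, 7, 4), "raise")
```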

14 Q-lambda Implementation (III): At the end of each game, the money won or lost is used to reward or punish the state-action pairs in the eligibility trace (in state s1 did action a1, in state s2 did action a2, in state s3 did action a3, got reward R, so update Q[sn, an] for each entry).
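
The slides do not spell out the exact backup formula, so the following is only a plausible sketch of the idea: the end-of-game reward is propagated backwards along the trace with credit that decays by LAMBDA. The ALPHA and LAMBDA values and the specific update rule are assumptions:

```python
ALPHA = 0.1    # learning rate (assumed value; the experiments varied it)
LAMBDA = 0.9   # trace-decay factor (assumed value)

def end_of_game_update(Q, trace, reward,
                       actions=("fold", "check", "call", "raise")):
    """Spread the end-of-game reward (money won or lost) over every recorded
    (state, action) pair, with credit decaying by LAMBDA for earlier decisions."""
    credit = 1.0
    for state, action in reversed(trace):
        i = actions.index(action)
        Q[state][i] += ALPHA * credit * (reward - Q[state][i])
        credit *= LAMBDA
    trace.clear()
```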

15 Testing Our Q-lambda Player: Play Against the Random Player, Play Against the Bluffer, Play Against Itself.

16 Play Against the Random Player: Q-lambda learns very quickly how to beat the random player. Why does it learn so fast?

17 Play Against the Random Player (II): Same graph, with up to 9000 games.

18 Play Against the Bluffer: The bluffer always raises; when raising is not possible, it calls. Defeating the bluffer is not trivial, because you need to fold on weaker hands and keep raising on better hands. Our Q-lambda player does very poorly against the bluffer!
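
For concreteness, the bluffer described on this slide amounts to a one-line policy; the function signature and action names are assumptions for illustration:

```python
def bluffer_policy(legal_actions):
    """The bluffer opponent: always raise; if raising is not allowed, call instead."""
    return "raise" if "raise" in legal_actions else "call"
```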

19 Play Against the Bluffer (II): In our many trials with different alpha and lambda values, the Q-lambda player always lost with a linear slope.

20 Play Against the Bluffer (III): Why is Q-lambda losing to the bluffer? To answer this, we looked at the Q-value tables. With good hands, Q-lambda has learned to Raise or Call. [Table: Q-values from Round = 3, OpptBet = Raise]

21 Play Against the Bluffer (IV): The problem is that even with a very poor hand in the second round, the player still does not learn to fold and continues to either raise, call, or check. The same problem exists with poor hands in other rounds. [Table: Q-values from Round = 1, OpptBet = Not_Yet, BestHandPossible = 'ok']

22 Play Against Itself: We played the Q-lambda player against itself, hoping that it would eventually converge on some strategy.

23 Play Against Itself (II): We also graphed the Q-values of a few particular states over time, to see if they converge to meaningful values. The results were mixed: for some states the Q-values completely converge, while for others they are almost random.

24 Play Against Itself (III): With a good hand in the last round, the Q-values have converged: Calling is best, Raising comes next, and Folding is very bad.

25 Play Against Itself (IV): With a medium hand in the last round, the Q-values do not clearly converge. Folding still looks very bad, but there is no preference between Calling and Raising.

26 Play Against Itself (V): With very bad hands in the last round, the Q-values do not converge at all. This is clearly wrong, since under an optimal policy Folding would have a higher value.

27 Why Do the Q-values Not Converge? Possible explanations: (1) poker cannot be represented with our state representation (our states are too broad or are missing some critical aspects of the game); (2) the ALPHA and LAMBDA factors are incorrect; (3) we have not run the game for a long enough time.

28 Conclusion: Our state representation and Q-lambda implementation are able to learn some aspects of poker (for instance, to Raise or Call with a good hand in the last round). However, in our tests it does not converge to an optimal policy. More experimentation with the Alpha and Lambda parameters and with the state representation may result in better convergence.

