1 Reinforcement Learning for the game of Tetris using Cross Entropy
Roee Zinaty and Sarai Duek Supervisor: Sofia Berkovich

2 Tetris Game The Tetris game is played on a 10x20 board with 7 types of blocks (tetrominoes) that can spawn. Each block can be rotated and translated to the desired placement, and points are awarded upon completion of rows.

3 Our Tetris Implementation
We used a version of the Tetris game that is common in many computer applications (various machine-learning competitions and the like). It differs from the standard game in several ways: We rotate the pieces at the top and then drop them straight down, simplifying play and removing some possible moves. We don't award extra points for combos; we simply record how many rows were completed in each game.

4 Reinforcement Learning
A form of machine learning in which each action is evaluated and awarded a certain grade: good actions are rewarded with points, while bad actions are penalized. Mathematically, it is defined as follows:

$$V(s) = \max_a \left[ R(s, a) + V(s') \right]$$

where $V(s)$ is the value given to the state $s$, based on the reward function $R(s, a)$, which depends on the state $s$ and action $a$, and on the value $V(s')$ of the next resulting state. (Slide diagram: the Agent receives input from the World and responds with an Action that acts on the World.)
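As a minimal illustration of this definition (our own sketch, not the project's code), greedy action selection under a learned value function could look as follows; `R`, `V`, and `successor` are hypothetical placeholders for the reward function, value function, and transition model:

```python
# Minimal sketch: pick the action maximizing immediate reward plus the value
# of the resulting state, matching V(s) = max_a [ R(s, a) + V(s') ].
# R, V and successor are hypothetical placeholders, not the project's code.
def choose_action(state, actions, R, V, successor):
    return max(actions, key=lambda a: R(state, a) + V(successor(state, a)))
```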

5 Cross Entropy A method for reaching a rare-occurrence result under a given distribution in a minimal number of steps/iterations. We use it to find the optimal weights of the given features in the Tetris game (our value function), because the chance of success in Tetris is much smaller than the chance of failure – a rare occurrence. The method is iterative, taking the previous iteration's results and improving on them. We add noise to the CE update to prevent premature convergence to a wrong result.

6 CE Algorithm
For iteration t, with a distribution $\mathcal{N}(\mu_t, \sigma_t^2)$ over the weight vectors:
Draw $N$ sample vectors $w_1, \dots, w_N$ and evaluate their values.
Select the best $\rho \cdot N$ samples and denote their indices by $I$.
Compute the parameters of the next iteration's distribution by:
$$\mu_{t+1} = \frac{1}{|I|} \sum_{i \in I} w_i, \qquad \sigma_{t+1}^2 = \frac{1}{|I|} \sum_{i \in I} \left( w_i - \mu_{t+1} \right)^2 + Z_{t+1}$$
where $Z_{t+1}$ is a constant noise vector (dependent on the iteration). We tried different kinds of noise, eventually settling on a noise term that decreases with the iteration number.
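A compact sketch of this update, assuming independent Normal distributions per weight; the names `evaluate`, `rho`, and `noise_t` are our own illustrative choices:

```python
import numpy as np

def ce_update(mu, sigma2, evaluate, n_samples=100, rho=0.1, noise_t=0.0, rng=None):
    """One cross-entropy iteration: sample, rank, refit the distribution."""
    rng = rng or np.random.default_rng()
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    # Draw sample weight vectors w_1..w_N from N(mu, sigma2)
    samples = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, mu.size))
    # Evaluate each sample (e.g., rows cleared in one Tetris game)
    scores = np.array([evaluate(w) for w in samples])
    # Keep the best rho*N samples (the elite set I)
    elite = samples[np.argsort(scores)[-int(rho * n_samples):]]
    # Next distribution: mean and variance of the elite, plus the noise term
    return elite.mean(axis=0), elite.var(axis=0) + noise_t
```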

7 RL and CE in the Tetris Case
We use a certain number of in-game parameters (features) and generate, using the CE method, a corresponding weight for each, starting from a base Normal distribution $\mathcal{N}(\mu_0, \sigma_0^2)$ per weight. Our value function is derived from the weights and features:

$$V(s) = \sum_i w_i \, \phi_i(s)$$

where $w_i$ is the weight of the matching feature $\phi_i$. Afterwards, we run games using the above weights, sort the weight vectors according to the number of rows completed in each game, and compute the next iteration's distributions from the best results.
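A small sketch of how a weight vector scores board states and selects a placement; `extract_features` and `legal_placements` are hypothetical helpers standing in for the project's Tetris code:

```python
# V(s) is the weighted sum of the feature values phi_i(s).
def state_value(board, weights, extract_features):
    features = extract_features(board)          # phi_1(s) .. phi_n(s)
    return sum(w * f for w, f in zip(weights, features))

def best_placement(board, piece, weights, legal_placements, extract_features):
    # Try every rotation/column of the current piece and keep the placement
    # whose resulting board has the highest value V(s').
    return max(legal_placements(board, piece),
               key=lambda next_board: state_value(next_board, weights, extract_features))
```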

8 Our Parameters – First Try
We initially used a set of parameters covering the following features: max pile height, number of holes, individual column heights, and the differences of heights between the columns. Results from using these features were poor and did not match the original paper they were taken from.

9 Our Parameters – Second Try
Afterwards, we tried the following features, whose results are shown next (a sketch of computing a few of them appears below):
Pile height
Max well depth (well of width one)
Sum of wells
Number of holes
Row transitions (occupied to unoccupied, summed over all rows)
Altitude difference between reachable points
Number of vertically connected holes
Column transitions
Weighted sum of filled cells (a higher row counts more than a lower one)
Removed lines (in the last move)
Landing height (of the last move)
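As an illustration, here is one way a few of these features could be computed on a binary board (list of rows, top row first, 1 = occupied); these are our own interpretations of the feature names, not the project's exact definitions:

```python
def column_heights(board):
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r        # height measured from the bottom
                break
        heights.append(h)
    return heights

def pile_height(board):
    return max(column_heights(board))

def number_of_holes(board):
    # A hole is an empty cell with at least one occupied cell above it.
    rows, cols = len(board), len(board[0])
    holes = 0
    for c in range(cols):
        seen_block = False
        for r in range(rows):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes
```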

10 2-Piece Strategy However, we also tried a 2-piece strategy (look at the next piece and plan accordingly) with the first set of parameters. This achieved superb results: after ~20 iterations of the algorithm, we scored 4.8 million rows on average! The downside was running time, approximately 1/10 the speed of our normal algorithm; combined with the better (longer) games, this led to very long runs. Only two games were run using the 2-piece strategy, and they ran for about 3-4 weeks before ending abruptly (the computer restarted).

11 The Tetris Algorithm (flowchart; a code sketch of the loop follows below)
New Tetris block.
Use two blocks for strategy?
Yes: compute the best action using both blocks and the feature weights.
No: compute the best action using the current block and the feature weights.
Move the block according to the best action.
Update the board if necessary (collapse full rows, handle a loss).
Upon loss, return the number of completed rows.
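A sketch of this decision loop, including the optional two-piece lookahead; `legal_placements(board, piece)` (all resulting boards) and `clear_rows(board)` (returns the new board plus rows cleared) are hypothetical helpers, and `value(board)` is the weighted feature sum from the earlier sketch:

```python
import random

PIECES = "IOTSZJL"   # the 7 block types

def play_game(value, legal_placements, clear_rows, use_two_pieces=False):
    board = [[0] * 10 for _ in range(20)]        # empty 10x20 board
    current, upcoming = random.choice(PIECES), random.choice(PIECES)
    rows_cleared = 0
    while legal_placements(board, current):      # lose when nothing fits
        if use_two_pieces:
            # Score each placement of the current piece by the best follow-up
            # placement of the next piece (roughly 10x slower, as noted above).
            board = max(legal_placements(board, current),
                        key=lambda b1: max((value(b2) for b2 in legal_placements(b1, upcoming)),
                                           default=value(b1)))
        else:
            board = max(legal_placements(board, current), key=value)
        board, cleared = clear_rows(board)
        rows_cleared += cleared
        current, upcoming = upcoming, random.choice(PIECES)
    return rows_cleared
```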

12 Past Results

# of Games Played During Learning | Mean Score | Method / Reference
Non-reinforcement learning:
n.a. | 631,167 | Hand-coded (P. Dellacherie) [Fahey, 2003]
3000 | 586,103 | GA [Böhm et al., 2004]
Reinforcement learning:
120 | ~50 | RRL-KBR [Ramon and Driessens, 2004]
1500 | 3,183 | Policy iteration [Bertsekas and Tsitsiklis, 1996]
~17 | < 3,000 | LSPI [Lagoudakis et al., 2002]
 | 4,274 | LP+Bootstrap [Farias and van Roy, 2006]
~10,000 | ~6,800 | Natural policy gradient [Kakade, 2001]
10,000 | 21,252 | CE+RL [Szita and Lorincz, 2006]
 | 72,705 | CE+RL, constant noise
5,000 | 348,895 | CE+RL, decreasing noise

13 Results Following are some results from running our algorithm with the aforementioned features (second try). Each run takes approximately two days, with an arbitrary 50 iterations of the CE algorithm. Each iteration includes 100 randomly generated weight vectors, with one game played per vector to evaluate it, followed by 30 games with the "best" weight vector (most rows completed) for statistics.
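A sketch of this experimental protocol, reusing the ideas from the earlier sketches; `play_one_game(weights)` is a hypothetical helper returning the rows cleared in one game, and the elite fraction and noise schedule are our own illustrative defaults:

```python
import numpy as np

def run_experiment(mu, sigma2, play_one_game, n_iterations=50, n_samples=100,
                   elite_frac=0.1, n_eval_games=30, noise=lambda t: 0.0, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    history = []
    for t in range(n_iterations):
        # 100 randomly generated weight vectors, one game each
        samples = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, mu.size))
        scores = np.array([play_one_game(w) for w in samples])
        order = np.argsort(scores)
        # Refit the distribution on the elite samples, plus noise
        elite = samples[order[-int(elite_frac * n_samples):]]
        mu, sigma2 = elite.mean(axis=0), elite.var(axis=0) + noise(t)
        # Statistics: replay the iteration's best weight vector 30 times
        best_w = samples[order[-1]]
        eval_scores = [play_one_game(best_w) for _ in range(n_eval_games)]
        history.append((t, scores.mean(), float(np.mean(eval_scores))))
    return mu, sigma2, history
```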

14 Results – Sample Output
Below is a sample output as printed by our program.

Performance & Weight Overview (min, max, avg weight values):
Iteration 46, average is … rows, best is … rows; min: …, average: …, max: …
Iteration 47, average is … rows, best is … rows; min: …, average: …, max: …
Iteration 48, average is … rows, best is … rows; min: …, average: …, max: …
Iteration 49, average is … rows, best is … rows; min: …, average: …, max: …

Feature Weights & Matching STD (of the Normal distribution):

15 Results Weights vs. Iteration, Game 0

16 Results STD vs. Iteration, Game 0

17 Results Each graph is a feature’s weight, averaged over the different simulations, versus the iterations

18 Results Each graph is the STD of a feature’s weight (derived from the CE method), averaged over the different simulations, versus the iterations

19 Results Final weight vectors per simulation, reduced to 2D space, with the average number of rows achieved by each weight vector (averaged over 30 games each)

20 Conclusions We currently see a lack of progress in the games. Games quickly reach a good result, but then swing back and forth in subsequent iterations (e.g. 100K rows in one iteration, 200K in the next, then 50K). We can see from the graphs that the STD of the weights did not drop to near zero, meaning we might gain more from further iterations. Another observation is that the weight vectors are all different, meaning there was no convergence to a similar weight vector – there is room for improvement.

21 Possible Directions We might try different approaches to the noise. We currently use a noise model that depends on the iteration t, so we can try either smaller or larger noise and check for changes. We can try updating the distributions only partly with the new parameters and partly with the last iteration's parameters [ ] (see the sketch below). We can use a certain threshold for the weights' variance: once a weight's variance passes it, we "lock" that weight from changing, and thus randomize fewer weight vectors (fewer possible combinations).
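A sketch of the partial-update idea: blend the newly fitted CE parameters with the previous iteration's parameters using a smoothing factor alpha, and add an iteration-dependent noise term. The value of alpha and the decreasing noise schedule shown here are our own illustrative choices, not values from the project:

```python
import numpy as np

def smoothed_update(mu_old, sigma2_old, mu_new, sigma2_new, t, alpha=0.7,
                    noise=lambda t: max(5.0 - t / 10.0, 0.0)):
    """Blend old and new CE parameters (numpy arrays), then add noise(t)."""
    mu = alpha * np.asarray(mu_new) + (1 - alpha) * np.asarray(mu_old)
    sigma2 = alpha * np.asarray(sigma2_new) + (1 - alpha) * np.asarray(sigma2_old) + noise(t)
    return mu, sigma2
```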

