Reinforcement Learning for the game of Tetris using Cross Entropy


Reinforcement Learning for the game of Tetris using Cross Entropy
Roee Zinaty and Sarai Duek
Supervisor: Sofia Berkovich

Tetris Game
The Tetris game is composed of a 10x20 board and 7 types of blocks (tetrominoes) that can spawn. Each block can be rotated and translated to the desired placement. Points are awarded upon completion of rows.
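For illustration only (not the authors' implementation), the board and the seven block types could be represented roughly as follows, assuming a 0/1 NumPy array with rows counted from the top; clear_full_rows is a hypothetical helper for collapsing completed rows.

```python
import numpy as np

BOARD_WIDTH, BOARD_HEIGHT = 10, 20

# The seven tetrominoes, each as a small 0/1 matrix in one orientation;
# the other orientations can be obtained with np.rot90.
PIECES = {
    "I": np.array([[1, 1, 1, 1]]),
    "O": np.array([[1, 1], [1, 1]]),
    "T": np.array([[1, 1, 1], [0, 1, 0]]),
    "S": np.array([[0, 1, 1], [1, 1, 0]]),
    "Z": np.array([[1, 1, 0], [0, 1, 1]]),
    "J": np.array([[1, 0, 0], [1, 1, 1]]),
    "L": np.array([[0, 0, 1], [1, 1, 1]]),
}

# An empty board: rows counted from the top, a cell is 1 when occupied.
board = np.zeros((BOARD_HEIGHT, BOARD_WIDTH), dtype=np.int8)

def clear_full_rows(board):
    """Remove completed rows and return (new_board, number_of_cleared_rows)."""
    keep = ~board.all(axis=1)                     # rows that are not full
    cleared = int(BOARD_HEIGHT - keep.sum())
    new_board = np.vstack([np.zeros((cleared, BOARD_WIDTH), dtype=board.dtype),
                           board[keep]])          # empty rows fall in from the top
    return new_board, cleared
```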

Our Tetris Implementation
We used a version of the Tetris game which is common in many computer applications (various machine learning competitions and the like). We differ from the known game in several ways:
- We rotate the pieces at the top and then drop them straight down, simplifying the game and removing some possible moves.
- We don't award extra points for combos; we simply record how many rows were completed in each game.

Reinforcement Learning
A form of machine learning in which each action is evaluated and then awarded a certain grade: good actions are rewarded with points, while bad actions are penalized. Mathematically, it is defined as follows:
$V(s) = R(s,a) + V(s')$,
where $V(s)$ is the value given to the state $s$, based on the reward function $R(s,a)$, which depends on the state $s$ and the action $a$, and on the value $V(s')$ of the next resulting state $s'$.
(Diagram: the agent receives input from the world and responds with an action.)

Cross Entropy
A method for reaching a rare-occurrence result from a given distribution in minimal steps/iterations. We need it to find the optimal weights of the given features in the Tetris game (our value function), because the chance of success in Tetris is much smaller than the chance of failure: a rare occurrence. This is an iterative method, using the last iteration's results and improving on them. We add noise to the CE result to prevent early convergence to a wrong result.

CE Algorithm
For iteration $t$, with distribution $\mathcal{N}(\mu_t, \sigma_t^2)$:
- Draw sample weight vectors $w^{(1)}, \dots, w^{(N)}$ and evaluate their values.
- Select the best samples, and denote their indices by $I$.
- Compute the parameters of the next iteration's distribution by
  $\mu_{t+1} = \frac{1}{|I|} \sum_{i \in I} w^{(i)}, \qquad \sigma_{t+1}^2 = \frac{1}{|I|} \sum_{i \in I} \left( w^{(i)} - \mu_{t+1} \right)^2 + Z_{t+1},$
  where $Z_{t+1}$ is a constant vector (dependent on the iteration) of noise.
We tried different kinds of noise, eventually settling on a noise term that depends on the iteration $t$.
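As a concrete illustration (a sketch under the assumption of an independent Gaussian per weight, not the project's actual code), one CE iteration could look like this; evaluate is a hypothetical function that plays a game with a given weight vector and returns the number of completed rows, and noise_t stands in for whichever noise schedule is used.

```python
import numpy as np

def ce_iteration(mu, sigma, evaluate, n_samples=100, rho=0.1, noise_t=0.0, rng=None):
    """One cross-entropy iteration over feature-weight vectors.

    mu, sigma -- per-feature mean and standard deviation of the current Gaussian
    evaluate  -- plays a game with a weight vector and returns rows completed
    rho       -- fraction of samples kept as the elite set
    noise_t   -- extra variance added to avoid premature convergence
    """
    rng = rng or np.random.default_rng()
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)

    # Draw sample weight vectors from N(mu, sigma^2) and evaluate each one.
    samples = rng.normal(mu, sigma, size=(n_samples, mu.size))
    scores = np.array([evaluate(w) for w in samples])

    # Keep the best-scoring samples (the elite set).
    n_elite = max(1, int(np.ceil(rho * n_samples)))
    elite = samples[np.argsort(scores)[-n_elite:]]

    # Refit the Gaussian to the elite samples, adding noise to the variance.
    new_mu = elite.mean(axis=0)
    new_sigma = np.sqrt(elite.var(axis=0) + noise_t)
    return new_mu, new_sigma
```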

RL and CE in the Tetris Case
We use a certain number of in-game parameters (features) and, using the CE method, generate a corresponding weight for each, starting from a base distribution $\mathcal{N}(\mu_0, \sigma_0^2)$. Our reward function is derived from the weights and features:
$R = \sum_i w_i \phi_i$,
where $w_i$ is the weight of the matching feature $\phi_i$.
Afterwards, we run games using the above weights, sort the weight vectors according to the number of rows completed in each game, and compute the next iteration's distributions from the best results.
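One way to picture this evaluation (a sketch only, not the project's code): the value of a candidate board is the dot product of the CE weights with that board's feature vector, and the piece is placed wherever this value is highest. The helpers compute_features and enumerate_placements are hypothetical stand-ins for the actual feature extraction and legal-placement enumeration.

```python
import numpy as np

def board_value(board, weights, compute_features):
    """Linear evaluation: dot product of the CE weights with the board's features."""
    return float(np.dot(weights, compute_features(board)))

def best_placement(board, piece, weights, enumerate_placements, compute_features):
    """Pick the rotation/column whose resulting board has the highest value.

    enumerate_placements is assumed to yield (action, resulting_board) pairs for
    every legal rotation and drop column of the current piece.
    """
    return max(enumerate_placements(board, piece),
               key=lambda ab: board_value(ab[1], weights, compute_features),
               default=(None, None))[0]
```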

Our Parameters – First Try
We initially used a set of parameters detailing the following features:
- Max pile height.
- Number of holes.
- Individual column heights.
- Difference of heights between the columns.
Results from using these features were poor and did not match the original paper they were taken from.

Our Parameters – Second Try
Afterwards, we tried the following features, whose results are displayed next:
- Pile height
- Max well depth (a well has width one)
- Sum of wells
- Number of holes
- Row transitions (occupied/unoccupied changes, summed over all rows)
- Altitude difference between reachable points
- Number of vertically connected holes
- Column transitions
- Weighted sum of filled cells (higher rows count more than lower ones)
- Removed lines (in the last move)
- Landing height (of the last move)
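To make a few of these features concrete, here is a sketch of how pile height, number of holes, and row transitions could be computed on the 0/1 board representation assumed earlier; the exact definitions used in the project (e.g. whether transitions at the walls are counted) may differ.

```python
import numpy as np

def column_heights(board):
    """Height of each column: distance from its top-most filled cell to the floor."""
    rows, _ = board.shape
    filled = board != 0
    # First filled row index per column (rows counted from the top), or `rows` if empty.
    first = np.where(filled.any(axis=0), filled.argmax(axis=0), rows)
    return rows - first

def pile_height(board):
    """Feature: height of the tallest column."""
    return int(column_heights(board).max())

def number_of_holes(board):
    """Feature: empty cells that have at least one filled cell above them."""
    filled = board != 0
    covered = np.cumsum(filled, axis=0) > 0   # True once the column has been capped
    return int(np.sum(covered & ~filled))

def row_transitions(board):
    """Feature: occupied/unoccupied changes along each row, summed over all rows."""
    filled = board != 0
    return int(np.sum(filled[:, 1:] != filled[:, :-1]))
```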

2-Piece Strategy
However, we tried using a 2-piece strategy (look at the next piece as well and plan accordingly) with the first set of parameters. We thus achieved superb results: after ~20 iterations of the algorithm, we scored 4.8 million rows on average! The downside was running time, approximately 1/10 of the speed of our normal algorithm; coupled with the better results (and therefore much longer games), this led to very long running times. Only two games were run using the 2-piece strategy, and they ran for about 3-4 weeks before ending abruptly (the computer restarted).
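A sketch of what such a 2-piece lookahead could look like under the earlier assumptions (hypothetical enumerate_placements and compute_features helpers): for each placement of the current piece, score it by the best value reachable with the known next piece.

```python
import numpy as np

def best_action_two_piece(board, piece, next_piece, weights,
                          enumerate_placements, compute_features):
    """Choose the current piece's placement that maximizes the value reachable
    after also placing the (known) next piece."""
    def value(b):
        return float(np.dot(weights, compute_features(b)))

    best_action, best_score = None, -np.inf
    for action, board_after in enumerate_placements(board, piece):
        # Look one piece ahead: best value over all placements of the next piece.
        lookahead = max((value(b2) for _, b2 in
                         enumerate_placements(board_after, next_piece)),
                        default=-np.inf)
        if lookahead > best_score:
            best_action, best_score = action, lookahead
    return best_action
```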

The Tetris Algorithm
- A new Tetris block arrives.
- Use two blocks for strategy?
  - Yes: compute the best action using both blocks and the feature weights.
  - No: compute the best action using the current block and the feature weights.
- Move the block according to the best action.
- Update the board if necessary (collapse full rows, or lose).
- Upon loss, return the number of completed rows.
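Putting the flow above together, a single game could be driven roughly as follows; this reuses the hypothetical helpers sketched earlier (best_placement, best_action_two_piece, clear_full_rows) plus assumed apply_action and spawn_piece functions.

```python
import numpy as np

def play_game(weights, spawn_piece, enumerate_placements, compute_features,
              use_two_piece=False):
    """One game following the flow above; returns the number of completed rows."""
    board = np.zeros((20, 10), dtype=np.int8)
    total_rows = 0
    current, upcoming = spawn_piece(), spawn_piece()
    while True:
        if use_two_piece:
            action = best_action_two_piece(board, current, upcoming, weights,
                                           enumerate_placements, compute_features)
        else:
            action = best_placement(board, current, weights,
                                    enumerate_placements, compute_features)
        if action is None:                            # no legal move left: game over
            return total_rows
        board = apply_action(board, current, action)  # drop the block (hypothetical helper)
        board, cleared = clear_full_rows(board)       # collapse any full rows
        total_rows += cleared
        current, upcoming = upcoming, spawn_piece()   # the next block spawns
```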

Past Results

Method / Reference                                    Mean Score   # of Games Played During Learning

Non-reinforcement learning:
Hand-coded (P. Dellacherie) [Fahey, 2003]             631,167      n.a.
GA [Böhm et al., 2004]                                586,103      3,000

Reinforcement learning:
RRL-KBR [Ramon and Driessens, 2004]                   ~50          120
Policy iteration [Bertsekas and Tsitsiklis, 1996]     3,183        1,500
LSPI [Lagoudakis et al., 2002]                        < 3,000      ~17
LP+Bootstrap [Farias and van Roy, 2006]               4,274
Natural policy gradient [Kakade, 2001]                ~6,800       ~10,000
CE+RL [Szita and Lorincz, 2006]                       21,252       10,000
CE+RL, constant noise                                 72,705
CE+RL, decreasing noise                               348,895      5,000

Results
Following are some results from running our algorithm with the aforementioned features (second try). Each run takes approximately two days, with an arbitrary limit of 50 iterations of the CE algorithm. Each iteration includes 100 randomly generated weight vectors, with one game played for each to evaluate it, and then 30 games with the best weight vector (most rows completed) for statistics.

Results – Sample Output
Below is a sample output as printed by our program.

Performance & Weight Overview (min, max, avg weight values):
Iteration 46, average is 163667.57 rows, best is 499363 rows
  min: -41.56542, average: -13.61646, max: 5.54374
Iteration 47, average is 138849.43 rows, best is 387129 rows
  min: -38.91538, average: -12.93479, max: 4.42429
Iteration 48, average is 251081.03 rows, best is 806488 rows
  min: -38.60941, average: -11.88640, max: 11.97776
Iteration 49, average is 251740.57 rows, best is 648248 rows
  min: -38.41177, average: -11.81831, max: 7.05757

Feature Weights & Matching STD (of the Normal distribution):
-10.748  -20.345  -7.5491  -11.033  7.0576  -8.9337  -12.211  -0.063724  -11.804  -15.959  -38.412
 2.9659   1.4162   1.3864   1.6074  1.4932  0.93831   1.0166   0.34907    1.1931   0.7918    2.536

Results: Weights vs. Iteration, Game 0

Results: STD vs. Iteration, Game 0

Results: Each graph is a feature's weight, averaged over the different simulations, versus the iterations.

Results: Each graph is the STD of a feature's weight (derived from the CE method), averaged over the different simulations, versus the iterations.

Results: Final weight vectors per simulation, reduced to 2D space, together with the average row performance of each weight vector (averaged over 30 games each).

Conclusions
We currently see a lack of progress in the games. Games quickly reach a good result, but swing back and forth in subsequent iterations (e.g. 100K rows in one iteration, 200K in the next, then 50K). We can see from the graphs that the STD of the weights did not go down to near-zero, meaning we might gain more from further iterations. Also, the final weight vectors are all different, meaning there was no convergence to a similar weight vector; there is room for improvement.

Possible Directions
- We might try different approaches to noise. We currently use a noise model that depends on the iteration t, so we can try either smaller or larger noise and check for changes.
- We can try updating the distributions only partly with the new parameters, and partly from the last iteration's parameters.
- We can use a certain threshold for the weights' variance: once a weight passes it, we "lock" that weight from changing, and thus randomize fewer weight vectors (fewer possible combinations).
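The second idea (partial updates) amounts to blending the freshly fitted CE parameters with the previous iteration's parameters; a sketch follows, with a hypothetical step size alpha, where alpha = 1 recovers the plain update used so far.

```python
def smoothed_update(old_mu, old_sigma, new_mu, new_sigma, alpha=0.7):
    """Blend the freshly fitted CE parameters with the previous iteration's
    parameters instead of replacing them outright (alpha = 1 gives the old behavior)."""
    mu = alpha * new_mu + (1.0 - alpha) * old_mu
    sigma = alpha * new_sigma + (1.0 - alpha) * old_sigma
    return mu, sigma
```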