1 ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 19: Case Studies
November 10, 2010
Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee
Fall 2010

2 Final Project Recap
Requirements:
Presentation – in-class 15-minute presentation + 5 minutes for questions; presentation assignment slots have been posted on the website
Project report – due Friday, Dec 3rd; comprehensive documentation of your work
Recall that the Final Project is 30% of the course grade!

3 Introduction
We'll discuss several case studies of reinforcement learning. The intention is to illustrate some of the trade-offs and issues that arise in real applications. For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem. We also highlight the representation issues that are so often critical to successful applications. Applications of reinforcement learning are still far from routine and typically require as much art as science. Making applications easier and more straightforward is one of the goals of current research in reinforcement learning.

4 TD-Gammon (Tesauro 1992, 1994, 1995, …)
One of the most impressive applications of RL to date is Gerry Tesauro's (IBM) backgammon player. TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters. The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation, using a feed-forward neural network (FFNN) trained by backpropagating TD errors. There are probably more professional backgammon players than there are professional chess players. Backgammon is in part a game of chance, and it can be viewed as a large MDP.

5 TD-Gammon (cont.)
The game is played with 15 white and 15 black pieces on a board of 24 locations, called points. Here's a typical position early in the game, seen from the perspective of the white player.

6 TD-Gammon (cont.)
White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps.
The objective is to advance all pieces to points 19-24, and then off the board.
Hitting – removal of a lone opposing piece (it is sent to the bar).
30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20).
Effective branching factor of about 400, since there are ~21 distinct dice rolls and each roll admits roughly 20 legal moves.
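As a back-of-the-envelope check of that branching factor (the ~20 legal moves per roll is the slide's own figure; the dice combinatorics are standard):

```python
# Rough branching-factor estimate for backgammon.
# 6 * 6 = 36 ordered dice outcomes collapse to 21 distinct rolls
# (15 unordered non-double pairs + 6 doubles); roughly 20 legal
# moves per roll is taken as given from the slide.
distinct_rolls = 15 + 6
moves_per_roll = 20
print(distinct_rolls * moves_per_roll)  # 420 -> an effective factor of ~400
```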

7 TD-Gammon – details
Although the game is highly stochastic, a complete description of the game's state is available at all times.
The estimated value of any state was meant to predict the probability of winning starting from that state.
Reward: 0 at all times except when the game is won, when it is 1.
Episodic (game = episode), undiscounted.
Nonlinear form of TD(λ) using a feed-forward neural network; weights initialized to small random numbers; backpropagation of the TD error.
Four input units for each point; unary encoding of the number of white pieces, plus other features.
Use of afterstates.
Learning during self-play, fully incrementally (a sketch of the update follows below).
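A minimal sketch of this learning setup, assuming a single hidden layer; the board encoding is treated as a given feature vector, and `encode_board` (mentioned in the comments) is a hypothetical stand-in, not Tesauro's actual code:

```python
import numpy as np

# TD(lambda) with a one-hidden-layer sigmoid network, in the spirit of
# TD-Gammon: the scalar output estimates the probability of winning from
# the current afterstate. The 198-dim input is assumed to come from a
# hypothetical encode_board(position) helper.

class TDGammonNet:
    def __init__(self, n_in=198, n_hid=40, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.1, 0.1, (n_hid, n_in))  # small random init
        self.W2 = rng.uniform(-0.1, 0.1, (n_hid,))
        self.alpha, self.lam = alpha, lam
        self.reset_traces()

    def reset_traces(self):                  # call at the start of each game
        self.e1 = np.zeros_like(self.W1)
        self.e2 = np.zeros_like(self.W2)

    def value(self, x):
        self.x = x
        self.h = 1.0 / (1.0 + np.exp(-self.W1 @ x))       # hidden activations
        self.v = 1.0 / (1.0 + np.exp(-self.W2 @ self.h))  # P(win) estimate
        return self.v

    def td_step(self, x, target):
        """Move V(x) toward target = reward + V(next afterstate)."""
        v = self.value(x)
        delta = target - v                   # TD error
        # Gradient of the *output* w.r.t. the weights (plain backprop),
        # folded into decaying eligibility traces, one per weight.
        gv = v * (1.0 - v)
        gh = gv * self.W2 * self.h * (1.0 - self.h)
        self.e2 = self.lam * self.e2 + gv * self.h
        self.e1 = self.lam * self.e1 + np.outer(gh, self.x)
        self.W2 += self.alpha * delta * self.e2
        self.W1 += self.alpha * delta * self.e1
```

During self-play, each move is chosen by evaluating `value(encode_board(afterstate))` for every legal afterstate and picking the best one for the side to move; on non-terminal steps the target is the value of the next afterstate (reward 0), and at the end of a game it is the final reward (1 for a win, 0 otherwise).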

8 TD-Gammon – Neural Network Employed

9 Summary of TD-Gammon Results
Two players played against each other; each had no prior knowledge of the game. Only the rules of the game were prescribed. Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players.

10 Rebuttal on TD-Gammon
For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS 9 (1997).
Claim: it was the "co-evolutionary training strategy, playing games against itself, which led to the success."
Any such approach would work with backgammon, but the success does not extend to other problems, e.g. Tetris and maze-type problems, where the exploration issue comes up.

11 The Acrobot
A robotic application of RL, roughly analogous to a gymnast swinging on a high bar. The first joint (corresponding to the hands on the bar) cannot exert torque; the second joint (corresponding to the gymnast bending at the waist) can. This system has been widely studied by control engineers and machine learning researchers.

12 The Acrobot (cont.)
One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint, by an amount equal to one of the links, in minimum time. In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque. A reward of –1 is given on all time steps until the goal is reached, which ends the episode; no discounting is used. Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps). Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context.
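A minimal linear Sarsa(λ) sketch for this task. The environment interface and the feature function are hypothetical stand-ins (Sutton, 1996, used tile coding with replacing traces; the placeholder below uses a hashed binary encoding and accumulating traces for brevity):

```python
import numpy as np

ACTIONS = (-1.0, 0.0, +1.0)        # fixed-magnitude torque on the second joint
N_FEATURES = 4096                  # size of the (illustrative) feature vector
alpha, lam, eps = 0.1, 0.9, 0.05   # step size, trace decay, exploration rate
w = np.zeros(N_FEATURES)
rng = np.random.default_rng(0)

def features(s, a):
    # Stand-in for a real tile coder: hash the (state, action) pair into a
    # few active binary features. Purely illustrative, not Sutton's coder.
    idx = [hash((round(x, 1), a, i)) % N_FEATURES for i, x in enumerate(s)]
    vec = np.zeros(N_FEATURES)
    vec[idx] = 1.0
    return vec

def q(s, a):
    return w @ features(s, a)

def epsilon_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax([q(s, a) for a in range(len(ACTIONS))]))

def run_episode(env):
    """One swing-up episode; 'env' is any object with reset()/step()."""
    global w
    z = np.zeros(N_FEATURES)       # eligibility traces, reset every episode
    s = env.reset()
    a = epsilon_greedy(s)
    steps = 0
    while True:
        s2, done = env.step(ACTIONS[a])
        steps += 1
        delta = -1.0 - q(s, a)               # reward is -1 on every step
        z = lam * z + features(s, a)         # accumulating traces, gamma = 1
        if done:                             # goal reached: no bootstrap term
            w = w + alpha * delta * z
            return steps                     # minimum-time objective
        a2 = epsilon_greedy(s2)
        delta += q(s2, a2)                   # undiscounted Sarsa bootstrap
        w = w + alpha * delta * z
        s, a = s2, a2
```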

13 Acrobot Learning Curves for Sarsa(λ)

14 Typical Acrobot Learned Behavior

15 RL in Robotics
Robot motor capabilities have been investigated using RL:
Walking, grabbing and delivering (MIT Media Lab)
RoboCup competitions – soccer games, where Sony AIBOs are commonly employed
Maze-type problems
Balancing on an unstable platform
Multi-dimensional input streams
Hopefully some new applications soon

16 Introduction to Wireless Sensor Networks (WSN)
A sensor network is composed of a large number of sensor nodes, densely deployed either inside the phenomenon of interest or very close to it.
Random deployment; cooperative capabilities.
May be wireless or wired; however, most modern applications require wireless communication.
May be mobile or static.
Main challenge: maximize the life of the network under battery constraints!

17 Communication Topology of Sensor Networks

18 Fire detection and monitoring

19 Nodes we have here at the lab
Intel Mote, UCB TelosB

20 Energy Consumption in WSN
Sources of energy consumption: sensing, computation, and communication (dominant).
Energy wastes in communication:
Collisions (packet retransmissions increase energy consumption)
Idle listening (listening to the channel when the node is not intending to transmit)
Communication overhead (the communication cost of the MAC protocol itself)
Overhearing (receiving packets that are destined for other nodes)
A back-of-the-envelope illustration follows below.
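To see why idle listening matters so much, here is a rough node-lifetime estimate; all current and capacity figures are illustrative assumptions, not measurements from any particular mote:

```python
# Node lifetime as a function of radio duty cycle (illustrative numbers).
battery_mAh = 2500.0    # roughly two AA cells
i_radio_mA  = 20.0      # radio on (listening, rx, or tx) -- the dominant cost
i_sleep_mA  = 0.02      # deep sleep

def lifetime_days(radio_duty_cycle):
    """Average current draw -> lifetime, for a given fraction of radio-on time."""
    i_avg = i_radio_mA * radio_duty_cycle + i_sleep_mA * (1 - radio_duty_cycle)
    return battery_mAh / i_avg / 24.0

print(lifetime_days(1.00))   # always-on radio: ~5 days
print(lifetime_days(0.01))   # 1% duty cycle:  ~1.3 years
```

This roughly two-orders-of-magnitude gap is what duty-cycled MAC protocols such as S-MAC (next slides) try to exploit.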

21 MAC-related problems in WSN
Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency.
Hidden terminal problem: node 5 and node 3 both want to transmit data to node 1. Since node 3 is out of the communication range of node 5, if the two transmissions occur simultaneously, node 1 will experience a collision.
Exposed terminal problem: node 1 sends data to node 3; since node 5 also overhears it, the transmission from node 6 to node 5 is needlessly constrained.

22 S-MAC – Example of a WSN MAC Protocol
S-MAC, by Ye, Heidemann and Estrin (2003).
Major components in S-MAC: periodic listen and sleep, collision avoidance, overhearing avoidance, message passing.
Tradeoffs: latency vs. fairness vs. energy.
A sketch of the periodic listen/sleep mechanism is given below.
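A small sketch of the periodic listen/sleep idea and the latency it introduces; the frame and listen-window lengths are illustrative, and real S-MAC additionally synchronizes neighbors' schedules with SYNC packets:

```python
FRAME_S  = 1.0    # frame length in seconds (illustrative)
LISTEN_S = 0.1    # listen window at the start of each frame (10% duty cycle)

def radio_state(t):
    """Whether a node's radio is up at time t under the periodic schedule."""
    return "listen" if (t % FRAME_S) < LISTEN_S else "sleep"

def tx_wait(t_ready):
    """How long a sender sits on a packet until the receiver next listens.
    This waiting is the latency cost S-MAC trades for its energy savings."""
    phase = t_ready % FRAME_S
    return 0.0 if phase < LISTEN_S else FRAME_S - phase

print(radio_state(0.05), tx_wait(0.05))  # listen, 0.0 -> send immediately
print(radio_state(0.50), tx_wait(0.50))  # sleep, 0.5  -> wait half a frame
```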

23 RL-MAC (Z. Liu, I. Arel, 2005)
Formulates the MAC problem as an RL problem, with a frame-based structure similar to S-MAC/T-MAC.
Each node infers the state of other nodes as part of its decision-making process.
Active time and duty cycle are both a function of the traffic load; Q-learning was used.
The main effort involved crafting the reward signal, which reflects:
nb – the number of packets queued
tr – the action (active time)
the ratio of successful rx vs. tx
the number of failed attempts
delay
A hedged sketch of such an agent is given below.
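In this sketch, each node picks its active time for the next frame with one-step Q-learning over a coarsely binned queue-length state. The discrete action set, state binning, and reward weights are illustrative guesses, not the actual formulation from the RL-MAC paper:

```python
import numpy as np

ACTIVE_TIMES = [1, 2, 4, 8, 16]   # candidate active slots per frame (assumed)
N_QUEUE_BINS = 8                  # state: binned number of queued packets (nb)
Q = np.zeros((N_QUEUE_BINS, len(ACTIVE_TIMES)))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def reward(tx_ok, rx_ok, failed, active_time):
    # Illustrative shaping over the components the slide lists: reward
    # successful tx/rx, penalize failed attempts (collisions/retries) and
    # long active periods (energy). Weights are made up for the sketch.
    return (tx_ok + rx_ok) - 0.5 * failed - 0.1 * active_time

def choose_action(s):
    # Epsilon-greedy choice of next frame's active time.
    if rng.random() < eps:
        return int(rng.integers(len(ACTIVE_TIMES)))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s2):
    # Standard one-step Q-learning backup, run once per frame.
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```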

24 RL-MAC Results

25 RL-MAC Results (cont.)

26 Summary
RL is a powerful tool that can support a wide range of applications. There is an art to defining the observations, states, rewards and actions. Main goal: formulate an "as simple as possible" representation; this depends on the application and can impact results significantly. RL fits in both high-resource and low-resource systems. Next class, we'll talk about a particular class of RL techniques called neuro-dynamic programming.

