1
ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 19: Case Studies
November 10, 2010
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
2
Final Project Recap
Requirements:
Presentation: in-class 15-minute presentation + 5 minutes for questions
Presentation assignment slots have been posted on the website
Project report – due Friday, Dec 3rd
Comprehensive documentation of your work
Recall that the Final Project is 30% of the course grade!
3
Introduction
We'll discuss several case studies of reinforcement learning
The intention is to illustrate some of the trade-offs and issues that arise in real applications
For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem
We also highlight the representation issues that are so often critical to successful applications
Applications of reinforcement learning are still far from routine and typically require as much art as science
Making applications easier and more straightforward is one of the goals of current research in reinforcement learning
4
TD-Gammon (Tesauro, 1992, 1994, 1995, …)
One of the most impressive applications of RL to date is Gerry Tesauro's (IBM) backgammon player, TD-Gammon
TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters
The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation: a feedforward neural network (FFNN) trained by backpropagating TD errors
There are probably more professional backgammon players than there are professional chess players
Backgammon is in part a game of chance, and can be viewed as a large MDP
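To make the update concrete, here is a minimal sketch of the TD(λ) weight update with eligibility traces on a one-hidden-layer network, in the spirit of TD-Gammon; the layer sizes, learning rate, and trace-decay constant are illustrative assumptions rather than Tesauro's actual settings.

```python
import numpy as np

# Minimal TD(lambda) sketch with a one-hidden-layer value network.
# Sizes and constants are illustrative, not Tesauro's configuration
# (though TD-Gammon's input encoding was 198 units).
N_INPUT, N_HIDDEN = 198, 40
ALPHA, LAM = 0.1, 0.7

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUT))  # small random init
w2 = rng.normal(scale=0.1, size=N_HIDDEN)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def value(x):
    """Estimated win probability for encoded position x, plus hidden activations."""
    h = sigmoid(W1 @ x)
    return sigmoid(w2 @ h), h

def td_step(x, r, v_next, e1, e2):
    """One incremental TD(lambda) update; returns the updated traces."""
    global W1, w2
    v, h = value(x)
    delta = r + v_next - v                 # undiscounted TD error (gamma = 1)
    # Gradient of v with respect to the weights (backprop through both layers)
    dv = v * (1.0 - v)
    grad_w2 = dv * h
    grad_W1 = np.outer(dv * w2 * h * (1.0 - h), x)
    # Accumulate eligibility traces, then move weights along delta * trace
    e2 = LAM * e2 + grad_w2
    e1 = LAM * e1 + grad_W1
    w2 += ALPHA * delta * e2
    W1 += ALPHA * delta * e1
    return e1, e2
```

Traces are reset to zero at the start of each game; on a winning terminal step, r = 1 and v_next = 0.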
5
TD-Gammon (cont.)
The game is played with 15 white and 15 black pieces on a board of 24 locations, called points
Here's a typical position early in the game, seen from the perspective of the white player
6
TD-Gammon (cont.)
White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps
The objective is to advance all pieces to points 19–24, and then off the board
Hitting – removal of a single opposing piece
30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20)
Effective branching factor of about 400: each of the ~20 distinct dice rolls allows roughly 20 ways to play
7
TD-Gammon – details
Although the game is highly stochastic, a complete description of the game's state is available at all times
The estimated value of any state was meant to predict the probability of winning starting from that state
Reward: 0 at all times except those in which the game is won, when it is 1
Episodic (game = episode), undiscounted
Nonlinear form of TD(λ) using a feedforward neural network
Weights initialized to small random numbers
Backpropagation of the TD error
Four input units for each point: unary encoding of the number of white pieces, plus other features
Use of afterstates
Learning during self-play – fully incremental
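As a concrete illustration of the afterstate idea during self-play, here is a hedged sketch of greedy move selection: every legal move is scored by the value network (as in the sketch above) on the position it produces, and the best-scoring move is played. The legal_moves and encode helpers are hypothetical game-engine functions, not part of the original program.

```python
WHITE, BLACK = 0, 1

def choose_move(board, dice, player, legal_moves, encode):
    """Greedy afterstate selection: evaluate the position *after* each
    candidate move and play the move with the best estimated value.
    `legal_moves` and `encode` are hypothetical helpers."""
    best_move, best_v = None, -1.0
    for move, next_board in legal_moves(board, dice, player):
        v, _ = value(encode(next_board, player))   # estimated P(white wins)
        if player == BLACK:
            v = 1.0 - v                            # black prefers low white-win odds
        if v > best_v:
            best_move, best_v = move, v
    return best_move
```

Because the dice are rolled before the move is chosen, evaluating afterstates sidesteps modeling the chance transition inside the learner.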
8
TD-Gammon – Neural Network Employed
9
Summary of TD-Gammon Results
Two players played against each other
Each had no prior knowledge of the game; only the rules of the game were prescribed
Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
10
Rebuttal on TD-Gammon
For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS 9 (1997)
Claim: it was the "co-evolutionary training strategy, playing games against itself, which led to the success"
Any such approach would work with backgammon
Success does not extend to other problems, e.g. Tetris and maze-type problems, where the exploration issue comes up
11
The Acrobot
Robotic application of RL
Roughly analogous to a gymnast swinging on a high bar
The first joint (corresponding to the hands on the bar) cannot exert torque
The second joint (corresponding to the gymnast bending at the waist) can
This system has been widely studied by control engineers and machine learning researchers
12
The Acrobot (cont.)
One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint, by an amount equal to one of the links, in minimum time
In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque
A reward of –1 is given on all time steps until the goal is reached, which ends the episode; no discounting is used
Thus, the optimal value of any state is the negative of the minimum time to reach the goal (an integer number of steps)
Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context
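As a concrete illustration, below is a minimal sketch of undiscounted Sarsa(λ) with replacing traces and linear function approximation over binary features, roughly in the spirit of Sutton's (1996) setup; the feature count, step size, trace decay, and the env/features helpers (e.g., tile coding over the four state variables) are assumptions, not the published configuration.

```python
import numpy as np

N_FEATURES, N_ACTIONS = 4096, 3   # actions: +torque, -torque, no torque
ALPHA, LAM = 0.1, 0.9             # illustrative step size and trace decay

theta = np.zeros((N_ACTIONS, N_FEATURES))  # zero init is optimistic here,
                                           # since all true returns are negative

def q(phi, a):
    """Action value: sum of weights over the active (binary) feature indices."""
    return theta[a, phi].sum()

def run_episode(env, features, max_steps=100000):
    """One undiscounted Sarsa(lambda) episode with replacing traces.
    `env` and `features` are assumed helpers, not given on the slide."""
    global theta
    e = np.zeros_like(theta)                # eligibility traces
    s = env.reset()
    phi = features(s)                       # indices of active features
    a = int(np.argmax([q(phi, b) for b in range(N_ACTIONS)]))
    for t in range(max_steps):
        s, r, done = env.step(a)            # r = -1 on every step
        delta = r - q(phi, a)
        e[a, phi] = 1.0                     # replacing traces
        if done:                            # goal reached: terminal value is 0
            theta += ALPHA * delta * e
            return t + 1
        phi = features(s)
        a = int(np.argmax([q(phi, b) for b in range(N_ACTIONS)]))
        delta += q(phi, a)                  # gamma = 1 (undiscounted)
        theta += ALPHA * delta * e
        e *= LAM                            # decay traces by gamma * lambda
    return max_steps
```

With the –1-per-step reward and zero-initialized weights, the initial values are optimistic, so even greedy action selection explores early on.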
13
Acrobot Learning Curves for Sarsa(λ)
14
Typical Acrobot Learned Behavior
15
RL in Robotics
Robot motor capabilities have been investigated using RL:
Walking, grabbing and delivering (MIT Media Lab)
RoboCup competitions – soccer games; Sony AIBOs are commonly employed
Maze-type problems
Balancing on unstable platforms
Multi-dimensional input streams
Hopefully some new applications soon
16
Introduction to Wireless Sensor Networks (WSN)
A sensor network is composed of a large number of sensor nodes, which are densely deployed either inside the phenomenon or very close to it
Random deployment
Cooperative capabilities
May be wireless or wired; however, most modern applications require wireless communications
May be mobile or static
Main challenge: maximize the life of the network under battery constraints!
17
Communication Topology of Sensor Networks
18
Fire detection and monitoring
19
Nodes we have here at the lab
Intel Mote, UCB TelosB
20
Energy Consumption in WSN
Sources of energy consumption:
Sensing
Computation
Communication (dominant)
Energy wastes in communications:
Collisions (packet retransmission increases energy consumption)
Idle listening (listening to the channel when the node is not intending to transmit)
Communication overhead (the communication cost of the MAC protocol)
Overhearing (receiving packets that are destined for other nodes)
21
MAC-related problems in WSN
Goal: to schedule or coordinate the communications among multiple nodes sharing the same wireless radio frequency
Hidden terminal problem: nodes 5 and 3 both want to transmit data to node 1; since node 3 is out of the communication range of node 5, the two senders cannot sense each other, and if they transmit simultaneously, node 1 will experience a collision
Exposed terminal problem: when node 1 sends data to node 3, node 5 also overhears it, so the transmission from node 6 to node 5 is needlessly constrained
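A tiny sketch makes the hidden-terminal case concrete; the link table below is an assumed topology chosen to match the slide's numbering (nodes 3 and 5 can both reach node 1 but cannot hear each other), not a specification from the slide.

```python
# Assumed symmetric links: which nodes each node can hear.
in_range = {
    1: {3, 5},
    3: {1},
    5: {1, 6},
    6: {5},
}

def collides_at(receiver, transmitters):
    """A collision occurs when the receiver hears two or more
    simultaneous transmitters."""
    heard = [tx for tx in transmitters if tx in in_range[receiver]]
    return len(heard) >= 2

# Node 3 cannot hear node 5, so carrier sensing at node 3 misses node 5's
# transmission; if both transmit at once, they collide at node 1.
print(collides_at(1, {3, 5}))   # True
```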
22
S-MAC – Example of WSN MAC Protocol
S-MAC — by Ye, Heidemann and Estrin (2003)
Tradeoffs: latency, fairness, energy
Major components in S-MAC:
Periodic listen and sleep
Collision avoidance
Overhearing avoidance
Message passing
23
RL-MAC (Z. Liu, I. Arel, 2005)
Formulate the MAC problem as an RL problem
Similar frame-based structure as in S-MAC/T-MAC
Each node infers the state of other nodes as part of its decision-making process
Active time and duty cycle are both a function of the traffic load; Q-learning was used
The main effort involved crafting the reward signal, which reflects:
n_b – # of packets queued
t_r – action (active time)
Ratio of successful rx vs. tx
# of failed attempts
Delay
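A hedged sketch of the idea: frame-based Q-learning in which the state is the queued-packet count n_b and the action is the active time t_r, with a reward that credits successful transmissions and charges failed attempts and time spent awake. The action set, constants, and exact reward shaping below are illustrative stand-ins, not the formula from the RL-MAC paper.

```python
import random
from collections import defaultdict

ACTIONS = [1, 2, 4, 8]             # candidate active times in slots (assumed)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1  # illustrative learning constants

Q = defaultdict(float)             # Q[(n_b, t_r)] -> value

def choose_active_time(n_b):
    """Epsilon-greedy choice of the next frame's active time."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda t_r: Q[(n_b, t_r)])

def update(n_b, t_r, sent, failed, n_b_next, energy_cost=0.1):
    """Credit successful transmissions, penalize failures and time awake,
    then apply the standard Q-learning backup after the frame ends."""
    r = sent - failed - energy_cost * t_r
    best_next = max(Q[(n_b_next, a)] for a in ACTIONS)
    Q[(n_b, t_r)] += ALPHA * (r + GAMMA * best_next - Q[(n_b, t_r)])
```

Each node runs this loop independently, so the duty cycle adapts per node to its local traffic load.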
24
RL-MAC Results
25
RL-MAC Results (cont.)
26
Summary
RL is a powerful tool which can support a wide range of applications
There is an art to defining the observations, states, rewards and actions
Main goal: formulate an "as simple as possible" representation
Depends on the application
Can impact results significantly
Fits in both high-resource and low-resource systems
Next class, we'll talk about a particular class of RL techniques called Neuro-Dynamic Programming