Published by Jasper Mills. Modified over 9 years ago.
RL via Practice and Critique Advice
Kshitij Judah, Saikat Roy, Alan Fern and Tom Dietterich

PROBLEM: RL takes a long time to learn a good policy.

[Diagram labels: Teacher (behavior, advice); Environment (state, action, reward).]

RESEARCH QUESTION: Can we make RL perform better with outside help, such as critique or advice from a teacher, and if so, how?

DESIDERATA:
- Non-technical users as teachers
- Natural interaction methods

High-level rules as advice for RL: Advice has been given in the form of programming-language constructs (Maclin and Shavlik 1996), rules about action and utility preferences (Maclin et al. 2005), and logical rules derived from a constrained natural language (Kuhlmann et al. 2004).

Learning by Demonstration (LBD): The user provides full demonstrations of a task that the agent can learn from (Billard et al. 2008). Recent work (Coates, Abbeel, and Ng 2008) includes model learning to improve on demonstrations but does not allow users to provide feedback. Argall, Browning, and Veloso (2007; 2008) combine LBD with human critiques of behavior (similar to our work here), but there is no autonomous practice.

Real-time feedback from the user: The TAMER framework (Knox and Stone 2009) uses a type of supervised learning to predict the user’s reward signal and then selects actions to maximize the predicted reward. Thomaz and Breazeal (2008) instead combine the end-user reward signal with the environmental reward and use Q-learning.

[Diagram labels: Get Critique, Get Experience, Advice Interface, Simulator, Learn, Act, Choose, Estimate Utility. Critique data: C. Trajectory data: T.]

Features:
- Allows feedback and guidance advice
- Allows practice
- Novel approach to learning from critique advice and practice

Problem: Given data sets T and C, how can we update the agent’s policy so as to maximize its reward?
- How to pick the value of λ?
- What are the forms of U and L?
- Optimization using gradient ascent
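The questions about λ, U, and L suggest a single objective that trades practice against critique. The sketch below assumes one plausible form, J(θ) = λ·U(θ) − (1 − λ)·L(θ), where U is a utility estimated from the trajectory data T and L is a loss on the critique data C. That form, the toy stand-ins for U and L, the finite-difference gradient, and every function name are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def combined_objective(theta: Vector,
                       utility: Callable[[Vector], float],
                       critique_loss: Callable[[Vector], float],
                       lam: float) -> float:
    """Assumed form: J = lam * U(theta) - (1 - lam) * L(theta)."""
    return lam * utility(theta) - (1.0 - lam) * critique_loss(theta)


def numerical_gradient(f: Callable[[Vector], float],
                       theta: Vector, eps: float = 1e-6) -> Vector:
    """Central finite differences; a stand-in for an analytic gradient."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def gradient_ascent(f: Callable[[Vector], float], theta0: Vector,
                    step: float = 0.1, iters: int = 200) -> Vector:
    """Plain fixed-step gradient ascent on f."""
    theta = list(theta0)
    for _ in range(iters):
        grad = numerical_gradient(f, theta)
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy terms: practice prefers theta = 1, critique prefers theta = -1.
def toy_utility(theta: Vector) -> float:
    return -(theta[0] - 1.0) ** 2


def toy_critique_loss(theta: Vector) -> float:
    return (theta[0] + 1.0) ** 2


def fit(lam: float) -> Vector:
    """Maximize the combined objective for a given trade-off lam."""
    def objective(theta: Vector) -> float:
        return combined_objective(theta, toy_utility, toy_critique_loss, lam)
    return gradient_ascent(objective, [0.0])
```

With these toy terms, `fit(1.0)` follows practice only, `fit(0.0)` follows critique only, and intermediate values of λ land between the two, which is the trade-off the question about picking λ is asking about.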
Ideal teacher: For each state s, an ideal teacher provides O(s), the set of all optimal actions.
- Any action not in O(s) is suboptimal
- All actions in O(s) are equally good

Learning Goal: Find a probabilistic policy that has a high probability of returning an action in O(s) when applied to s. It is not important which action is selected as long as the probability of selecting an action in O(s) is high. We call this problem Any Label Learning (ALL).

ALL likelihood: the product, over the labeled states, of the probability that the policy assigns to the set O(s).

The Multi-Label Learning problem (Tsoumakas and Katakis 2007) differs in that the goal is to learn a classifier that outputs all of the labels in each set and no others.

Reality: An ideal teacher does not exist.

Key idea: Define a user model that induces a distribution over ALL problems.

User model: A distribution over the sets O(s) given the critique data, assuming independence among different states. We introduce two noise parameters and one bias parameter (the probability that an unlabeled action is in O(s)). The learning objective is the expected ALL likelihood under this distribution, which has a closed form.

Our Domain: RTS tactical micro-management, 5 friendly footmen versus 5 enemy footmen (Wargus AI).

Difficulty: Fast pace and multiple units acting in parallel.

Our setup: We provide end-users with an interface that allows them to watch a battle and pause at any moment. The user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad. The available actions for each military unit are to attack any of the other units on the map (enemy or friendly), giving a total of 9 actions per unit. We used two battle maps, which differed only in the initial placement of the units.

Evaluated 3 Learning Systems:
- Pure RL: only practice
- Pure Supervised: only critique
- Combined System: critique + practice

Goal: Test learning capability with varying amounts of critique and practice data. There were 2 users per map. For each user, we divided the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.
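For the Any Label Learning goal stated earlier, a small sketch may help make the ideal-teacher likelihood concrete. It assumes the log-likelihood is the sum, over labeled states, of the log of the total policy probability placed on O(s); the softmax policy, the function names, and the data layout are hypothetical choices for illustration, and the noisy user model with its noise and bias parameters is deliberately left out.

```python
import math
from typing import List, Sequence, Set


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn per-action scores into a probabilistic policy for one state."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def all_log_likelihood(action_probs: Sequence[Sequence[float]],
                       optimal_sets: Sequence[Set[int]]) -> float:
    """Sum over states of log P(policy picks some action in O(s)).

    action_probs[i][a] is the policy probability of action a in state i;
    optimal_sets[i] is the (assumed known) set O(s_i) of optimal action indices.
    """
    total = 0.0
    for probs, optimal in zip(action_probs, optimal_sets):
        mass_on_optimal = sum(probs[a] for a in optimal)
        total += math.log(mass_on_optimal)
    return total


# Usage: one state, three actions, actions 0 and 1 are both optimal.
uniform = softmax([0.0, 0.0, 0.0])
skewed = softmax([2.0, 0.0, -2.0])
ll_uniform = all_log_likelihood([uniform], [{0, 1}])
ll_skewed = all_log_likelihood([skewed], [{0, 1}])
```

The objective only looks at the total mass on O(s), so a policy that puts all its probability on a single optimal action scores as well as one that spreads it across all of them. This is the difference from multi-label learning noted in the text.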
We provided the combined system with each of the per-user critique data sets and allowed it to practice for 100 episodes.

[Figure panels: Map 1, Map 2. Screenshot label: Advice Interface.]

The user study involved 10 end-users: 6 with CS backgrounds and 4 without. For each user, the study consisted of teaching both the pure supervised and the combined systems, each on a different map, for a fixed amount of time (Supervised: 30 mins, Combined: 60 mins).

These results show that the users were able to significantly outperform pure RL using both the supervised and combined systems. The end-users had slightly greater success with the pure supervised system than with the combined system, which users found frustrating for two reasons:
- Large delay experienced while waiting for the practice stages to end
- Policy returned by practice was sometimes poor and ignored advice

Lesson Learned: Such behavior is detrimental to the user experience and overall performance.

Future Work:
- Better behaving combined system
- Studies where users are not captive during practice stages

[Figure: Fraction of positive, negative and mixed advice. Panels: Supervised, Combined.]

Positive (or negative) advice is where the user only gives feedback on the action taken by the agent. Mixed advice is where the user not only gives feedback on the agent's action but also suggests alternative actions to the agent.

We use likelihood weighting to estimate the utility U(θ, T) of a policy with parameters θ using off-policy trajectories T (Peshkin and Shelton 2002). Let p_θ(τ) be the probability of generating trajectory τ under θ, and let θ_τ be the parameters of the policy that generated τ. An unbiased utility estimate is obtained by weighting the reward of each trajectory τ by the ratio p_θ(τ) / p_θτ(τ). The gradient of this estimate has a compact closed form.
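The likelihood-weighting estimate can be sketched as follows. This is a minimal illustration, not the poster's exact estimator: it assumes the plain (unnormalized) importance-sampling average, in which each stored trajectory's total reward is weighted by the ratio of its probability under the policy being evaluated to its probability under the policy that generated it. Because the environment dynamics cancel in that ratio, only the action log-probabilities are needed. The `Trajectory` record and the function names are hypothetical.

```python
import math
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    total_reward: float    # summed reward along the trajectory
    logp_behavior: float   # sum of log action-probs under the policy that generated it
    logp_target: float     # sum of log action-probs under the policy being evaluated


def importance_weight(traj: Trajectory) -> float:
    """Likelihood ratio p_target(traj) / p_behavior(traj); dynamics terms cancel."""
    return math.exp(traj.logp_target - traj.logp_behavior)


def estimate_utility(trajectories: List[Trajectory]) -> float:
    """Unnormalized likelihood-weighted average of trajectory rewards."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    weighted = sum(importance_weight(t) * t.total_reward for t in trajectories)
    return weighted / len(trajectories)


# Usage: two stored trajectories, one of which the evaluated policy
# is twice as likely to produce as the policy that generated it.
stored = [
    Trajectory(total_reward=10.0, logp_behavior=math.log(0.25), logp_target=math.log(0.5)),
    Trajectory(total_reward=4.0, logp_behavior=math.log(0.5), logp_target=math.log(0.5)),
]
utility_estimate = estimate_utility(stored)
```

Dividing by the sum of the weights instead of the number of trajectories gives the normalized variant, which usually has lower variance but is no longer unbiased. Which variant the poster's missing equation used cannot be recovered from the text.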