Learning and Evolution in Hierarchical Behavior-based Systems

1 Learning and Evolution in Hierarchical Behavior-based Systems
Amir massoud Farahmand
Advisor: Majid Nili Ahmadabadi
Co-advisors: Caro Lucas, Babak N. Araabi

2 Motivation
Machines (e.g. robots) are moving from labs to homes, factories, and beyond. Machines face:
- An unknown environment/body: no exact model of the environment or body is available.
- A non-stationary environment/body: the environment changes (offices, houses, streets, almost everywhere) and the body ages.
- A designer who may not know how to benefit from every aspect of her agent/environment.

3 Motivation
The design process is difficult:
- Machines see different things than we do.
- Machines interact with the world differently.
- The designer is not a machine! "I know what I want!"
Our goal: automatic design of intelligent machines.

4 Research Specification
Goal: automatic design of intelligent robots.
Architecture: hierarchical behavior-based architectures.
An objective performance measure is available (the reinforcement signal):
[Agent] Did I perform it correctly?!
[Tutor] Yes/No! (or 0.3)

5 Behavior-based Approach to AI
The behavior-based approach is a successful alternative to the classical AI approach:
- No {abstraction, planning, deduction, ...}.
- Behavioral (activity) decomposition instead of functional decomposition.
- Behavior: Sensor -> Action (a direct link between perception and action).

6 Behavioral Decomposition
[Diagram: sensors feed a stack of parallel behaviors (avoid obstacles, locomote, explore, build maps, manipulate the world) whose outputs drive the actuators.]

7 Behavior-based Design
- Robust: not sensitive to the failure of a particular part of the system, and no need for precise perception, since there is no modelling.
- Reactive: fast response, since there is no long route from perception to action.
- No explicit representation.

8 How should we DESIGN a behavior-based system?!

9 Behavior-based System Design Methodologies
Hand design:
- Common almost everywhere.
- Complicated: may even be infeasible for complex problems.
- Even if a working system can be found, it is probably not optimal.
Evolution:
- Good solutions can be found.
- Biologically plausible.
- Time consuming: not fast at producing new solutions.
Learning:
- Essential for the lifetime survival of the agent.

10 Taxonomy of Design Methods
[Figure: taxonomy of design methods.]

11 Problem Formulation: Behaviors

12 Problem Formulation: Purely Parallel Subsumption Architecture (PPSSA)
Different behaviors are excited by the sensory input, and higher behaviors can suppress lower ones; the behavior whose action actually drives the actuators is the controlling behavior (see the sketch below).
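To make the arbitration rule concrete, here is a minimal Python sketch of one purely parallel subsumption step; the Behavior interface and the excited/action split are illustrative assumptions, not the thesis' implementation.

```python
from typing import Callable

class Behavior:
    """A behavior maps sensor readings directly to an action."""
    def __init__(self, name: str, excited: Callable, action: Callable):
        self.name = name
        self.excited = excited   # sensors -> bool: does this behavior want control?
        self.action = action     # sensors -> the action to send to the actuators

def ppssa_step(layers, sensors):
    """All behaviors run in parallel; the highest excited layer suppresses
    everything below it and becomes the controlling behavior."""
    for behavior in layers:      # ordered from the highest layer to the lowest
        if behavior.excited(sensors):
            return behavior, behavior.action(sensors)
    return None, None            # nothing excited: no actuator command

# Example: "avoid obstacles" sits above "explore" and suppresses it near walls.
layers = [
    Behavior("avoid obstacles", lambda s: s["obstacle"], lambda s: "turn away"),
    Behavior("explore",         lambda s: True,          lambda s: "go forward"),
]
print(ppssa_step(layers, {"obstacle": False})[1])  # -> "go forward"
```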

13 Problem Formulation: Reinforcement Signal and the Agent's Value Function
This function states the value of using a set of behaviors in a specific structure. We want to maximize the agent's value function.
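The slide's equations did not survive the transcript; a standard formulation consistent with the text, and only an assumption about the thesis' exact notation, is the expected discounted return of running behavior set $B$ arranged in structure $S$:

$$ V(B, S) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, B, S\right], \qquad (B^{*}, S^{*}) = \arg\max_{B, S} V(B, S). $$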

14 Problem Formulation: Design as an Optimization
- Structure learning: finding the best structure, given a set of behaviors, using learning.
- Behavior learning: finding the best behaviors, given the structure, using learning.
- Concurrent behavior and structure learning.
- Behavior evolution: finding the best behaviors, given the structure, using evolution.
- Behavior evolution and structure learning.

15 Where?!

16 Learning in Behavior-based Systems
There is some research on behavior-based learning (Mataric, Mahadevan, Maes, ...), but there has been no deep investigation of it (especially no mathematical formulation), and most of it uses flat architectures.

17 Learning in Behavior-based Systems
We design: the structure (hierarchy) and the behaviors.
We learn:
- Structure learning: organizing behaviors in the architecture using a behavior toolbox.
- Behavior learning: the correct mapping of each behavior.

18 Where?!

19 Structure Learning
[Figure: a behavior toolbox containing build maps, explore, manipulate the world, locomote, and avoid obstacles.] The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).

20 Structure Learning
[Figure: the behaviors from the toolbox arranged in an initial structure.]

21 Structure Learning
[Figure: the current structure, with the behavior toolbox alongside.] 1. "explore" becomes the controlling behavior and suppresses "avoid obstacles". 2. The agent hits a wall!

22 Structure Learning
[Figure: the current structure, with the behavior toolbox alongside.] The tutor (environment) punishes "explore" for occupying that place in the structure.

23 Structure Learning
[Figure: the current structure, with the behavior toolbox alongside.] "explore" is not a very good behavior for the highest position in the structure, so it is replaced by "avoid obstacles".

24 Structure Learning: Challenging Issues
- Representation: how should the agent represent the knowledge gathered during learning? The representation should be sufficient (the concept space should be covered by the hypothesis space), generalize well, be tractable (a small hypothesis space), and allow well-defined credit assignment.
- Hierarchical credit assignment: how should the agent assign credit to the different behaviors and layers in its architecture? If the agent receives a reward/punishment, how should we reward/punish its structure?
- Learning: how should the agent update its knowledge when it receives the reinforcement signal?

25 Structure Learning: Overcoming the Challenging Issues
Our approach is to define a representation that allows decomposing the agent's value function into simpler components. Decomposing the behavior of a multi-agent system into simpler components can sharpen our view of the problem under investigation, and the structure itself provides many clues.

26 Structure Learning

27 Structure Learning: Zero Order Representation
The ZO value table in the agent's mind:
- Higher layer: avoid obstacles (0.8), explore (0.7), locomote (0.4)
- Lower layer: avoid obstacles (0.6), explore (0.9), locomote (0.4)
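In code, the Zero Order representation is just one value per (layer, behavior) pair; the greedy read-out below, which picks each layer's highest-valued behavior, is an illustrative assumption about how a structure is extracted from the table.

```python
# The numbers mirror the slide's ZO value table.
zo_table = {
    "higher": {"avoid obstacles": 0.8, "explore": 0.7, "locomote": 0.4},
    "lower":  {"avoid obstacles": 0.6, "explore": 0.9, "locomote": 0.4},
}

def greedy_structure(table):
    """For each layer, pick the behavior with the highest ZO value."""
    return {layer: max(values, key=values.get) for layer, values in table.items()}

print(greedy_structure(zo_table))
# -> {'higher': 'avoid obstacles', 'lower': 'explore'}
```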

28 Structure Learning: Zero Order Representation - Value Function Decomposition

29 Structure Learning: Zero Order Representation - Value Function Decomposition
The decomposition builds each layer's value from its ZO components, and the agent's value function from the layer values.
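The equations themselves are lost in the transcript; a plausible reconstruction consistent with the slide's labels, and only an assumption about the thesis' exact form, is

$$ V_{\text{agent}} = \sum_{l=1}^{L} V_{\text{layer}}(l), \qquad V_{\text{layer}}(l) = \sum_{b \in \mathcal{B}_l} \Pr(b \text{ controls}) \, V_{0}(b, l), $$

where $V_{0}(b, l)$ is the ZO component: the value of behavior $b$ occupying layer $l$, independent of which behaviors occupy the other layers.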

30 Structure Learning: Zero Order Representation - Value Function Decomposition

31 Structure Learning: Zero Order Representation - Credit Assignment and Value Updating
The controlling behavior is the only behavior responsible for the current reinforcement signal.
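A minimal sketch of what this credit-assignment rule implies, assuming a standard exponential-moving-average update (the learning rate and exact rule are assumptions):

```python
def zo_update(zo_table, layer, controlling_behavior, reinforcement, alpha=0.1):
    """Only the controlling behavior's (layer, behavior) entry is updated;
    all other entries are untouched by the current reinforcement signal."""
    v = zo_table[layer][controlling_behavior]
    zo_table[layer][controlling_behavior] = v + alpha * (reinforcement - v)
```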

32 Structure Learning: First Order Representation

33 Structure Learning: First Order Representation

34 Structure Learning: First Order Representation

35 Structure Learning: First Order Representation - Credit Assignment
If only one behavior becomes active, we update V0(i). If two or more behaviors become active, we update V(i>j), where i is the index of the controlling behavior and j is the index of the next active behavior (sketched below).
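A minimal sketch of this rule in Python; the table layout and learning rate are illustrative assumptions, and `active` is assumed to list the excited behaviors from the highest layer to the lowest:

```python
def fo_update(V0, V_pair, active, reinforcement, alpha=0.1):
    """First Order credit assignment: active[0] is the controlling behavior."""
    i = active[0]                        # controlling behavior
    if len(active) == 1:
        V0[i] += alpha * (reinforcement - V0[i])                     # update V0(i)
    else:
        j = active[1]                    # the next active behavior
        V_pair[(i, j)] += alpha * (reinforcement - V_pair[(i, j)])   # update V(i>j)
```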

36 A Break!

37 Introduction to Experiments
- An abstract problem.
- The multi-robot object-lifting problem: a group of robots lifts a bulky object. I will only discuss this problem here.

38 Experiments: Structure Learning
[Figure: comparison of the average gained reward of two structure learning methods, Zero Order (ZO) and First Order (FO), against a hand-designed structure and a random structure on the object-lifting problem.]

39 Where?!

40 Behavior Learning
We now drop the assumption of a given behavior repertoire. All we know are the sensor/actuator dimensions and the reinforcement signal.

41 Behavior Learning: Challenging Issues
- How should behaviors cooperate with each other to maximize the performance of the agent?
- How should we assign credit to the behaviors of the architecture?
- How should each behavior update its knowledge?

42 Behavior Learning
B2, B3, and B4 are excited; B4 takes control; the agent is punished. Which behavior is to blame?!

43 Behavior Learning
We augment the action space with a pseudo-action named NoAction (NA): NA does nothing and lets lower behaviors take control. Now B2, B3, and B4 are excited; B4 proposes NA; B3 proposes an action and takes control. Reward!
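A minimal sketch of arbitration with the NA pseudo-action, reusing the Behavior interface from the earlier PPSSA sketch; the fall-through rule is as the slide describes, while the code itself is an illustrative assumption.

```python
NA = "NoAction"   # pseudo-action: do nothing and defer to lower behaviors

def arbitrate_with_na(layers, sensors):
    """Control falls through every excited behavior that proposes NA, so a
    higher behavior can explicitly let a lower one take control."""
    for behavior in layers:              # highest layer first
        if behavior.excited(sensors):
            action = behavior.action(sensors)
            if action != NA:
                return behavior, action  # the controlling behavior
    return None, None                    # everyone deferred or nothing excited
```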

44 Behavior Learning
NA lets behaviors cooperate, but how should we force them to cooperate correctly?! This is the hierarchical credit assignment problem, which we treat with a Boolean-like algebra for logically expressible multi-agent systems.

45 Behavior Learning

46 Behavior Learning: Optimality
Different behaviors are excited in different regions of their internal state spaces.

47 Behavior Learning: Optimality

48 Behavior Learning: Value Updating
For the case of immediate reward.

49 Behavior Learning: Value Updating
For the general return case, we should use Monte Carlo estimation; bootstrapping methods are not applicable.
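A minimal sketch of the Monte Carlo alternative (the data layout, learning rate, and every-visit variant are illustrative assumptions): each behavior's value is updated from the full observed return of the steps at which it controlled the agent, with no bootstrapping from a successor value.

```python
def monte_carlo_update(V, trajectory, alpha=0.1, gamma=0.95):
    """`trajectory` is a time-ordered list of (controlling_behavior, reward)
    pairs from one episode; returns are accumulated backwards through time."""
    G = 0.0
    for behavior, reward in reversed(trajectory):
        G = reward + gamma * G                   # discounted return from this step
        V[behavior] += alpha * (G - V[behavior])
    return V
```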

50 Concurrent Behavior and Structure Learning
Applying both simultaneously:
- Behavior learning -> the state-action mappings.
- Structure learning -> the hierarchy.

51 Experiments: Behavior Learning
[Figure: reward comparison between structure learning, behavior learning, and concurrent behavior/structure learning on the object-lifting task.]

52 Experiments: Behavior Learning
[Figure: two panels, the learning phase and the testing phase.]

53 Experiments: Behavior Learning
[Figure: two panels, the testing phase and the learning phase.]

54 Experiments: Behavior Learning
[Figure: a sample trajectory showing the positions of the robot-object contact points, the tilt angle of the object during lifting, and the controlling behavior of each robot at each time step, after sufficient structure/behavior learning. In the lowest diagram the numbers correspond to behaviors as follows: 0 (No Behavior), 1 (Push More), 2 (Don't Go Fast), 3 (Stop), 4 (Hurry Up), 5 (Slow Down).]

55 Where?!

56 Behavior Co-evolution: Motivations
For (+):
- Learning can get trapped in local maxima of the objective function.
- Learning is sensitive (POMDPs, non-Markovness, ...).
- Evolutionary methods have a better chance of finding the global maximum of the objective function.
- The objective function may not be well defined in robotics.
Against (-):
- Evolutionary robotics methods are usually slow, e.g. under fast changes of the environment.
- Non-modular controllers: monolithic, with no reusability.

57 Behavior Co-evolution: Motivations
- Use evolution to search the difficult, large part of the parameter space: the behaviors' parameter space is usually the bigger one.
- Use learning for fast responses: the structure's parameter space is usually the smaller one, and a change in the structure yields a different agent behavior.
- Evolve behaviors separately (modularity and reusability).

58 Behavior Co-evolution
[Figure: an agent assembled from Behavior Pool 1, Behavior Pool 2, ..., Behavior Pool n.] Evolve each kind of behavior in its own genetic pool.

59 Behavior Co-evolution: Fitness Sharing
How do we turn the fitness of the agent into a fitness for each behavior?! Through fitness sharing, either uniform or value-based (sketched after the next two slides).

60 Behavior Co-evolution: Uniform Fitness Sharing

61 Behavior Co-evolution: Value-based Fitness Sharing
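The two slides above carried the fitness-sharing equations, which are lost in the transcript; the sketch below is a hedged reconstruction of the two schemes as the text describes them, with the value-proportional weighting being an assumption rather than the thesis' exact rule.

```python
def uniform_sharing(agent_fitness, behaviors):
    """Uniform: every participating behavior receives the agent's fitness as-is."""
    return {b: agent_fitness for b in behaviors}

def value_based_sharing(agent_fitness, behavior_values):
    """Value-based: share the agent's fitness in proportion to each behavior's
    learned value (assumed normalized weights)."""
    total = sum(behavior_values.values()) or 1.0   # guard against a zero total
    return {b: agent_fitness * v / total for b, v in behavior_values.items()}
```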

62 Behavior Co-evolution
Each behavior's genetic pool undergoes selection and genetic operators:
- Crossover.
- Mutation: hard (replacement) or soft (perturbation).

63 Where?!

64 Memetic Algorithm
We waste the learned knowledge after each agent's lifetime. A meme is a unit of information that reproduces itself as people exchange ideas. Traditional memetic algorithms combine an evolutionary method (meme exchange) with local search (meme refinement); they are sometimes called hybrid evolutionary algorithms.

65 Memetic Algorithm
Two different interpretations of the meme:
- Our current hybridization of behavior co-evolution and structure learning is similar to a traditional MA, except that different parameter spaces are being searched.
- The meme as a cultural bias.

66 Memetic Algorithm
Experienced individuals store their experience in the culture in the form of memes, and newborn individuals get a meme from the culture. Here the structure serves as the meme.

67 Memetic Algorithm
[Figure: an agent assembled from Behavior Pool 1, Behavior Pool 2, ..., Behavior Pool n, with a Meme Pool (Culture) supplying the structure.]

68 Memetic Algorithm
Each meme has its own value. A meme's value is updated using the fitness of the agent that carried it, and valuable memes have a higher chance of being selected for newborn individuals.
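A minimal sketch of the meme pool described on these slides; the moving-average update and the value-proportional sampling are illustrative assumptions.

```python
import random

class MemePool:
    """Culture: a pool of structure-memes with values learned across lifetimes."""
    def __init__(self, memes, alpha=0.1):
        self.values = {meme: 0.0 for meme in memes}
        self.alpha = alpha

    def update(self, meme, agent_fitness):
        """Move the meme's value toward the fitness of the agent that carried it."""
        v = self.values[meme]
        self.values[meme] = v + self.alpha * (agent_fitness - v)

    def sample(self):
        """Valuable memes have a higher chance of biasing a newborn individual."""
        memes = list(self.values)
        weights = [max(self.values[m], 1e-6) for m in memes]  # keep weights positive
        return random.choices(memes, weights=weights, k=1)[0]
```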

69 Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
[Figure: fitness averaged over the last five episodes for different design methods: 1) evolved behaviors (uniform fitness sharing) with learned structure (blue); 2) evolved behaviors (value-based fitness sharing) with learned structure (black); 3) hand-designed behaviors with learned structure (green); 4) hand-designed behaviors and structure (red). Dotted lines around the hand-designed cases (3 and 4) show a one-standard-deviation band around the mean performance.]

70 Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
[Figure: last-five-episode and lifetime fitness, averaged, for the uniform fitness-sharing co-evolutionary mechanism: 1) evolved behaviors with learned structure (blue); 2) evolved behaviors with learned structure biased by the meme pool (black); 3) evolved behaviors with a hand-designed structure (magenta); 4) hand-designed behaviors with learned structure (green); 5) hand-designed behaviors and structure (red). Solid lines show the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic design is much higher.]

71 Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
[Figure: probability-distribution comparison for uniform fitness sharing, between agents that use the meme pool as the initial bias for structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distributions of lifetime fitness. A distribution shifted further to the right indicates a higher chance of generating very good agents.]

72 Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm (Object Lifting)
[Figure: last-five-episode and lifetime fitness, averaged, for the value-based fitness-sharing co-evolutionary mechanism: 1) evolved behaviors with learned structure (blue); 2) evolved behaviors with learned structure biased by the meme pool (black); 3) evolved behaviors with a hand-designed structure (magenta); 4) hand-designed behaviors with learned structure (green); 5) hand-designed behaviors and structure (red). Solid lines show the last five episodes of the agent's lifetime; dotted lines show lifetime fitness. Although the final performance of all cases is about the same, the lifetime fitness of the memetic design is higher.]

73 Experiments: Behavior Co-evolution - Structure Learning - Memetic Algorithm
[Figure 13 (Object Lifting): probability-distribution comparison for value-based fitness sharing, between agents that use the meme pool as the initial bias for structure learning (black), agents that learn the structure from a random initial setting (blue), and agents with a hand-designed structure (magenta). Dotted lines show the distributions of lifetime fitness. A distribution shifted further to the right indicates a higher chance of generating very good agents.]

74 Other Topics: Probabilistic Analysis of PPSSA
- A change in the excitation probability yields a change in the controlling probability of each layer.
- Some estimates of the learning time.
- The effect of reinforcement-signal uncertainty on the value function and on the policy of the agent.

75 Conclusions

76 Contributions
- A deep, mathematical investigation of behavior-based systems.
- Tackling the design process from different approaches: learning, evolution, and culture-based methods.
- Structure learning, which is quite new in hierarchical reinforcement learning.

77 Suggestions for Future Work
- Extending the proposed methods to more complex architectures.
- Automatic extraction of the behaviors' state spaces (traditional clustering methods are not suitable).
- Convergence proofs for the learning methods.
- Automatic abstraction of knowledge: simultaneous low-level and high-level decision making.
- Investigating the design of the reinforcement signal.

78 Thanks!

79 The Effect of Reinforcement Signal Uncertainty on the Value Function
The uncertainty model.

80 The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Boltzmann action selection.
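The slide's formula is lost in the transcript, but the standard Boltzmann (softmax) action-selection rule it names is

$$ \pi(a \mid s) = \frac{\exp\!\big(Q(s, a)/\tau\big)}{\sum_{a'} \exp\!\big(Q(s, a')/\tau\big)}, $$

where $\tau$ is the temperature. Under this rule, bounds on the value-function error induced by reinforcement-signal uncertainty translate into bounds on the ratio of action probabilities, which is what the simulation figures below report.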

81 The Effect of Reinforcement Signal Uncertainty on the Agent's Policy

82 The Effect of Reinforcement Signal Uncertainty on the Agent's Policy
Results for the effect of the error on the value function.

83 Reinforcement Uncertainty Simulations
[Figure 1: the error for different values of γ. Figure 2: comparison between the observed error and the derived bound for γ = 0.1.]

84 Reinforcement Uncertainty Simulations
[Figure 3: comparison between the observed error and the derived bound for γ = 0.5. Figure 4: the same comparison for γ = 0.9.]

85 Reinforcement Uncertainty Simulations
[Figure 5: upper and lower bounds on the ratio of the action probabilities of an agent with an inexact reinforcement signal to those of an agent with the original reinforcement signal, for different values of γ (blue: γ = 0.1, black: γ = 0.5, red: γ = 0.9). Figure 6: comparison between the observed probability ratio and the derived bounds for γ = 0.1.]

86 Reinforcement Uncertainty Simulations
[Figure 7: comparison between the observed probability ratio and the derived bounds for γ = 0.5. Figure 8: the same comparison for γ = 0.9.]

