
Slide 1: Today's Lecture (CS-424, Gregory Dudek)
– Reinforcement learning: further thoughts
– Planning

Slide 2: Transition networks
How do we determine strategies (a policy) in a problem defined by a transition network? The setting was:
– Deterministic or stochastic.
– Markovian (exhibits the Markov property).
– Fully observable (RN: accessible): we can directly observe (determine) exactly what state we are in during the update process.
Computing the optimal policy is a Markov Decision Problem (MDP). If we don't know the current state for sure, but can only infer it (probabilistically), then we have a partially observable system: a Partially Observable Markov Decision Problem (POMDP).
– How hard is it to compute the optimal policy?

Slide 3: Specific details on reinforcement
Simplest model: given that we know all the transition probabilities and the immediate (short-term) reward R(i) associated with each state i, we can compute the value function U() by solving a linear system:
U(i) = R(i) + Σ_j M(i,j) U(j)
This approach is referred to as adaptive dynamic programming. In contrast, sampling and TD methods update this system intermittently based on partial information. (Note we have omitted the less effective LMS algorithm in the textbook.)
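As a minimal illustration of solving that linear system (not from the slides), here is a sketch in Python/NumPy. The 3-state transition matrix M and reward vector R are made up, and a discount factor gamma is added as an assumption, since (I − M) is singular when M is stochastic.

```python
import numpy as np

# Hypothetical 3-state chain: M[i, j] is the probability of moving from
# state i to state j under the fixed policy; R holds immediate rewards.
M = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

# U = R + gamma * M @ U  =>  (I - gamma * M) U = R.
gamma = 0.9
U = np.linalg.solve(np.eye(3) - gamma * M, R)
print(U)  # value of each state under this policy
```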

Slide 4: Types of learners
Two classes with respect to reinforcement learning:
– Passive learners: you just update the state transition/reward info for the states you are taken to, but do not control the sequence of states visited. E.g. a backgammon learner that merely observes another player playing, or a kid watching its parents.
– Active learners: the learner actively modifies the sequence of states visited in order to (presumably) acquire information.

Slide 5: Exploration versus Exploitation
A fundamental tradeoff. We want to maximize return:
– Should we do what we know is best, based on incomplete information?
– Or should we seek information about unknown things, although this may not lead to rewards?
Plenty of intuitive relevance. How do we combine these two processes?
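One standard way to combine the two (not named on the slide) is an ε-greedy rule: exploit the best-known action most of the time, explore uniformly at random otherwise. A minimal sketch; the action-value estimates Q are a made-up example.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon explore (pick a random action);
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(list(Q))
    return max(Q, key=Q.get)

# Hypothetical value estimates for three actions.
Q = {"left": 0.4, "right": 0.7, "wait": 0.1}
print(epsilon_greedy(Q))  # usually "right", occasionally a random choice
```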

Slide 6: Planning: general approach
– Use a (restrictive) formal language to describe problems and goals. Why restrictive? More precision and fewer states to search.
– Have a goal state specification and an initial state.
– Use a special-purpose planner to search for a solution.

Slide 7: Basic formalism
The basic logical formalism is derived from STRIPS. State variables determine what actions can or should be taken; in this context they are conditions:
– Shoe_untied()
– Door_open(MC)
An operator (remember those?) is now a triple:
– Preconditions
– Additions
– Deletions
The additions and deletions together are called the effects of an operator. Seen in the context of search.
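One possible rendering of this triple in Python (the field names and the example operator are invented, not a fixed representation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator: applicable when all preconditions hold;
    applying it asserts the additions and retracts the deletions."""
    name: str
    preconditions: frozenset
    additions: frozenset
    deletions: frozenset

# Hypothetical operator built from the slide's example conditions.
tie_shoe = Operator(
    name="Tie_shoe",
    preconditions=frozenset({"Shoe_untied()"}),
    additions=frozenset({"Shoe_tied()"}),
    deletions=frozenset({"Shoe_untied()"}),
)
```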

Slide 8: A plan has 4 components
– A set of steps, defined by a sequence of operator applications.
– A set of constraints on the ordering of these steps (not necessarily a total ordering).
– A set of variable binding constraints: the set of things the various operators can apply to.
– A set of causal links that specify which effects one action achieves that are needed by another.
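The same four components as a plain data structure, purely as an illustrative sketch (the container choices are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """The four components of a plan, as plain Python containers."""
    steps: dict = field(default_factory=dict)        # step id -> operator application
    ordering: set = field(default_factory=set)       # pairs (a, b): step a precedes step b
    bindings: dict = field(default_factory=dict)     # variable -> object constraint
    causal_links: set = field(default_factory=set)   # (producer, condition, consumer)
```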

Slide 9: Going forwards
All state variables are true or false, but some may not be defined at a certain point in our state progression. A planner based on this is a progression planner.
Idea: in a state S, we can apply an operator X = (P, A, D), leading to a new state T:
T = f_X(S) = (S − D) ∪ A
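A direct transcription of f_X as a function over sets of conditions (a sketch; the operator triple and example state are invented):

```python
def progress(state, op):
    """Apply op = (P, A, D) to state S: T = f_X(S) = (S - D) | A.
    Only applicable when the preconditions P all hold in S."""
    P, A, D = op
    assert P <= state, "preconditions not satisfied"
    return (state - D) | A

# Hypothetical Tie_shoe operator as a (preconditions, additions, deletions) triple.
tie_shoe = (frozenset({"Shoe_untied()"}),
            frozenset({"Shoe_tied()"}),
            frozenset({"Shoe_untied()"}))
state = frozenset({"Shoe_untied()", "Door_open(MC)"})
print(progress(state, tie_shoe))  # frozenset({'Shoe_tied()', 'Door_open(MC)'})
```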

Slide 10: Constancy
An important caveat: when we go from one state to another, we assume that the only changes were those that resulted explicitly from the additions and deletions. Given this assumption, the operator X computes the strongest provable postconditions. In reality, even more might be deleted.

Slide 11: When time is involved
When we want to reason in the presence of temporal change, special issues arise. This is related to the frame problem, a historical bugaboo in AI: how do we deal with things that change? We want to avoid saying, "I tied my shoe, but the light stayed on, the door stayed open, the room stayed full, my other shoe stayed tied, it was still daytime, I stayed in the same place, nobody else came into the room, my batteries stayed at the same recharge state (roughly), …"

Slide 12: Aside: FOL with time
One approach is a variation of first-order logic called the situation calculus [McCarthy].
– Events take place at specific times.
– Some predicates are fluents and only apply over certain ranges in time.
– A situation is a temporal interval over which all the predicates remain fixed.
– Reference: read RN Sec 7.6 or DAA Ch. 6.
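As a small illustrative example (mine, not from the slides), here is one fluent and one effect axiom in situation-calculus style, where Result(a, s) names the situation reached by doing action a in situation s:

```latex
% Fluent: Tied(Shoe, s) -- the shoe is tied in situation s.
% Effect axiom: tying an untied shoe makes it tied in the resulting situation.
\forall s \;\bigl( \mathit{Untied}(\mathit{Shoe}, s) \rightarrow
    \mathit{Tied}(\mathit{Shoe}, \mathit{Result}(\mathit{TieShoe}, s)) \bigr)
```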

Slide 13: Going backwards
Remember backwards chaining? Start at the goal G, and assume the deletions of some operator X aren't there.
– Why? We can chain backwards by adding what would have been deleted and removing what would have been added:
S = f⁻¹_X(G) = (G − A) ∪ D
Maybe we added too much (with D), or deleted too little?
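The slide's inverse map written out (a sketch, using the same invented (P, A, D) triple representation as above; the comment restates the slide's caveat):

```python
def regress(goal, op):
    """Invert op = (P, A, D) on goal G: S = (G - A) | D.
    As the slide warns, this may add too much (all of D) or delete
    too little, so S only approximates the true predecessor state."""
    P, A, D = op
    return (goal - A) | D

tie_shoe = (frozenset({"Shoe_untied()"}),   # P
            frozenset({"Shoe_tied()"}),     # A
            frozenset({"Shoe_untied()"}))   # D
goal = frozenset({"Shoe_tied()", "Door_open(MC)"})
print(regress(goal, tie_shoe))  # frozenset({'Shoe_untied()', 'Door_open(MC)'})
```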

Slide 14: Means/ends analysis
How can we get from the initial state to the final one?
– Assume the states and operators are given.
– What's the right path? How do we measure distance?
Means/ends analysis assumes we simply reduce the number of things that make our current state different from our goal.
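That difference measure written out as a sketch (the example conditions are borrowed from the STRIPS slide below):

```python
def difference(state, goal):
    """Count the goal conditions not yet satisfied in the current state."""
    return len(goal - state)

state = frozenset({"At(office)", "Have(Cash)"})
goal = frozenset({"At(Home)", "Have(Video)", "Have(Cash)"})
print(difference(state, goal))  # 2: At(Home) and Have(Video) are missing
```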

Slide 15: STRIPS
STRIPS is an old planning language: the STanford Research Institute Problem Solver.
– Less expressive than the situation calculus.
– Initial state: At(office) & NOT(Have(Video)) & Have(Cash) & Have(Uncooked-kernels)
– Goal state: At(Home) & Have(Video) & Have(Cooked-Popcorn)

Slide 16: Schemas
Basic operators assume a complete specification of the state in which they are applied. This can be tedious.
– An operator schema is a "generic" operator that has variables in it.
– Related to axiom schemas.
– Related to unification in logic (e.g. Prolog).
E.g. Tie_shoes(h), Tie_necktie(h), Tie_boat_rope(h), Tie_straightjacket(h) might all be abstracted by Tie_object(X, h).
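A toy instantiation of such a schema (the string-substitution trick is an invented simplification of real unification, not how a planner would do it):

```python
def instantiate(schema, binding):
    """Substitute a concrete object for the schema variable 'X'
    in every precondition, addition, and deletion."""
    sub = lambda conds: frozenset(c.replace("X", binding) for c in conds)
    P, A, D = schema
    return (sub(P), sub(A), sub(D))

# Generic Tie_object(X, h) schema as a (P, A, D) triple of string conditions.
tie_object = (frozenset({"Untied(X)"}),
              frozenset({"Tied(X)"}),
              frozenset({"Untied(X)"}))
print(instantiate(tie_object, "shoes"))    # behaves like Tie_shoes
print(instantiate(tie_object, "necktie"))  # behaves like Tie_necktie
```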

Slide 17: Least Commitment Planning
When we formulate a plan intuitively, we often think of doing things in a specific sequence even when the sequencing is arbitrary.
– This may not be wise: it can lead to re-shuffling actions, which is undesirable.
Instead, generate plans such that we have sets of applicable actions, but don't order the actions unless something (the conditions) demands it.

Slide 18: Partially ordered plan
[Diagram: a partially ordered plan over steps A through G, drawn as a graph of ordering constraints.]

Slide 19: Terminology
Constraints on sequencing, requirements for operators, links relating operators, conflicts between operators in a given plan. For a plan:
– Sound: the plan steps obey the constraints on sequencing; the plan is successful.
– Systematic: doesn't "waste" effort.
– Complete: generates a plan if one exists. It still may not terminate (cf. the halting problem).
– Plan refinement: improvement of an existing plan to make it better meet the constraints.

Slide 20: Links & Conflicts
[Diagram: causal links running from producer steps to consumer steps, with a clobberer step threatening one of the links.]
A conflict involves a link and a step that messes it up.

Slide 21: Refinement
Fix conflicts by creating a new plan from an old one.
– Keep the old structures (links, producers, consumers, constraints) but add new constraints.
If there are conflicts, resolve them by adding ordering constraints: move a clobberer before or after the link it's hitting (if you can). If there are no conflicts, satisfy an unfulfilled requirement.
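A sketch of that conflict-resolution step, usually called demotion and promotion (the tuple representation is an assumption): given a causal link (producer, condition, consumer) and a clobberer, propose the two orderings that move the clobberer out of the way.

```python
def resolve_conflict(link, clobberer, ordering):
    """Return the two candidate fixes for a threatened causal link:
    demote the clobberer before the producer, or promote it after
    the consumer. Each fix is the old ordering plus one constraint."""
    producer, cond, consumer = link
    demotion = ordering | {(clobberer, producer)}   # clobberer < producer
    promotion = ordering | {(consumer, clobberer)}  # consumer < clobberer
    return demotion, promotion

link = ("Tie_shoe", "Shoe_tied()", "Run")
ordering = {("Tie_shoe", "Run")}
for candidate in resolve_conflict(link, "Untie_shoe", ordering):
    print(candidate)
```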

Slide 22: Applications of planning
Planning for Shakey the robot:
– Climb boxes
– Push things
– Move around
The blocks world:
– Moving blocks
– Piling them onto one another
– Clearing the tops of chosen blocks
Really doing this suggested we need vision!

Slide 23: Configuration Space Planning

Slide 24: Issues

