
1 9/16 Scan by Kalpesh Shah

2

3

4

5 What is needed:
--A neighborhood function. The larger the neighborhood you consider, the less myopic the search (but the more costly each iteration).
--A “goodness” function, which needs to give a value to non-solution configurations too. For 8-queens: the negative of the number of pair-wise conflicts.
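
A minimal Python sketch of these two ingredients for 8-queens, assuming a state is represented as a tuple of 8 row indices (one queen per column); this representation and the function names are illustrative choices, not prescribed by the slide.

    def goodness(state):
        """Negative of the number of pair-wise conflicts (0 means a solution)."""
        conflicts = 0
        n = len(state)
        for i in range(n):
            for j in range(i + 1, n):
                same_row = state[i] == state[j]
                same_diag = abs(state[i] - state[j]) == abs(i - j)
                if same_row or same_diag:
                    conflicts += 1
        return -conflicts

    def neighborhood(state):
        """All states reachable by moving one queen to a different row in its column."""
        n = len(state)
        return [state[:col] + (row,) + state[col + 1:]
                for col in range(n) for row in range(n) if row != state[col]]

    # Example: a state with every queen on the main diagonal has 28 conflicts,
    # and its neighborhood contains 8 * 7 = 56 states.
    start = (0, 1, 2, 3, 4, 5, 6, 7)
    print(goodness(start), len(neighborhood(start)))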

6 Applying min-conflicts-based hill-climbing to the 8-puzzle (figure: local minima)

7 Problematic scenarios for hill-climbing: local minima and ridges
--When the state-space landscape has local minima, any search that moves only in the greedy direction cannot be (asymptotically) complete.
--Random walk, on the other hand, is asymptotically complete. Idea: put random walk into greedy hill-climbing.
Solution(s):
--Random restart hill-climbing
--Do the non-greedy thing with some probability p > 0
--Use simulated annealing

8 Making Hill-Climbing Asymptotically Complete
Random restart hill-climbing
--Keep some bound B. When you have made more than B moves, reset the search with a new random initial seed and start again.
--Getting a random new seed in an implicit search space is non-trivial! In the 8-puzzle, if you generate a “random” state by making random moves from the current state, you are still not truly random (you will remain in one of the two components of the state space).
“Biased random walk”: avoid being greedy when choosing the seed for the next iteration
--With probability p, choose the best child; with probability (1-p), choose one of the children randomly.
Use simulated annealing
--Similar to the previous idea, except the probability p itself is increased asymptotically to one (so you are more likely to tolerate a non-greedy move in the beginning than towards the end).
With the random restart or biased random walk strategies, we can solve very large problems (e.g., million-queen problems) in minutes!
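
A minimal sketch of the biased random walk idea, reusing the illustrative goodness() and neighborhood() functions sketched above; the parameter names, the default p, and the termination test (goodness reaching 0, as in 8-queens) are assumptions made for illustration.

    import random

    def biased_random_walk(start, goodness, neighborhood, p=0.9, max_moves=10000):
        state, best = start, start
        for _ in range(max_moves):
            if goodness(state) == 0:                   # e.g., a conflict-free 8-queens board
                return state
            neighbors = neighborhood(state)
            if random.random() < p:
                state = max(neighbors, key=goodness)   # greedy move
            else:
                state = random.choice(neighbors)       # non-greedy (random) move
            if goodness(state) > goodness(best):
                best = state
        return best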

9

10 9/18 Bye Bye, Galileo…

11 Announcements: Sreelakshmi says: Recitation in Rm 409, 10:30-11:30 AM (note the changed timing). The Homework 2 socket will be closed by Friday, and it will be due next Thursday.

12 “Beam search” for hill-climbing
Hill-climbing, as described, uses one seed solution that is continually updated.
--Why not use multiple seeds?
Stochastic hill-climbing uses multiple seeds (k seeds, k > 1). In each iteration, the neighborhoods of all k seeds are evaluated, and from this combined neighborhood k new seeds are selected probabilistically.
--The probability that a seed is selected is proportional to how good it is.
--This is not the same as running k hill-climbing searches in parallel.
Stochastic hill-climbing is “almost” close to the way evolution seems to work, with one difference:
--Define the neighborhood in terms of the combination of pairs of current seeds (sexual reproduction; crossover). The probability that a seed from the current generation gets to “mate” to produce offspring in the next generation is proportional to the seed’s goodness.
--To introduce “randomness”, do mutation over the offspring.
--Genetic algorithms limit the number of matings to keep the number of seeds the same.
--Stochastic beam-search hill-climbing algorithms of this type are called genetic algorithms.
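
A compact sketch of the genetic-algorithm reading of stochastic beam search, using the illustrative 8-queens goodness() function above as the fitness measure; the population size, mutation rate, and number of generations are arbitrary illustrative choices.

    import random

    def crossover(p1, p2):
        cut = random.randrange(1, len(p1))
        return p1[:cut] + p2[cut:]

    def mutate(state, rate=0.1):
        state = list(state)
        for col in range(len(state)):
            if random.random() < rate:
                state[col] = random.randrange(len(state))
        return tuple(state)

    def genetic_algorithm(goodness, pop_size=50, generations=500, n=8):
        population = [tuple(random.randrange(n) for _ in range(n))
                      for _ in range(pop_size)]
        for _ in range(generations):
            best = max(population, key=goodness)
            if goodness(best) == 0:
                return best
            # Mating probability is proportional to a seed's goodness
            # (shifted so all weights are positive).
            floor = min(goodness(s) for s in population)
            weights = [goodness(s) - floor + 1 for s in population]
            parents = random.choices(population, weights=weights, k=2 * pop_size)
            population = [mutate(crossover(parents[2 * i], parents[2 * i + 1]))
                          for i in range(pop_size)]
        return max(population, key=goodness)

    # Usage (illustrative): solution = genetic_algorithm(goodness)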

13 Illustration of Genetic Algorithms in Action
Very careful modeling is needed so that the things emerging from crossover and mutation are still potential seeds (and not monkeys typing Hamlet).
Is the “genetic” metaphor really buying anything?

14 Hill-climbing in “continuous” search spaces
Gradient descent (which you study in the calculus of variations) is a special case of hill-climbing search applied to continuous search spaces.
--The local neighborhood is defined in terms of the “gradient” (derivative) of the error function: x <- x - alpha * dErr/dx. Since the gradient of the error function will be zero near the minimum and larger farther from it, you tend to take smaller steps near the minimum and larger steps farther away from it [just as you would want]. There are tons of variations based on how alpha is set.
--Gradient descent is guaranteed to converge to the global minimum if alpha is small and the error function is “uni-modal” (i.e., has only one minimum).
--Versions of gradient-descent algorithms will be used in neural network learning. Unfortunately, the error function is NOT unimodal for multi-layer neural networks, so you will have to augment gradient descent with ideas such as “simulated annealing” to increase the chance of reaching the global minimum.
(Figure: finding the cube root a^(1/3) from a starting point x_o by minimizing Err = |x^3 - a|; compare with the Newton-Raphson approximation on the next slide.)
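
A small sketch of gradient descent on the cube-root example; it minimizes the squared error (x^3 - a)^2 rather than the slide's |x^3 - a| so the gradient vanishes smoothly at the minimum (an illustrative substitution), and alpha is a hand-picked small constant.

    def cube_root_by_gradient_descent(a, alpha=0.001, iterations=20000, x=1.0):
        for _ in range(iterations):
            gradient = 2 * (x**3 - a) * 3 * x**2   # d/dx of (x^3 - a)^2
            x = x - alpha * gradient               # x <- x - alpha * dErr/dx
        return x

    print(cube_root_by_gradient_descent(8.0))      # approaches 2.0 for small enough alpha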

15 Origins of gradient descent: Newton-Raphson applied to function minimization
The Newton-Raphson method is used for finding roots of a polynomial.
--To find roots of g(x), we start with some value of x and repeatedly do x <- x - g(x)/g'(x).
--To minimize a function f(x), we need to find the roots of the equation f'(x) = 0: x <- x - f'(x)/f''(x).
--If x is a vector, then x <- x - H_f(x)^(-1) * grad f(x). Because the Hessian H_f is costly to compute (it has n^2 second-derivative entries for an n-dimensional vector), we try approximations.
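
A sketch of the scalar Newton-Raphson update applied to the same cube-root example (g(x) = x^3 - a); the function names are illustrative. Note how few iterations it needs compared to fixed-step gradient descent, which is why Hessian (or quasi-Newton) information is attractive whenever it is affordable.

    def newton_root(g, g_prime, x=1.0, iterations=20):
        for _ in range(iterations):
            x = x - g(x) / g_prime(x)              # x <- x - g(x)/g'(x)
        return x

    a = 8.0
    print(newton_root(lambda x: x**3 - a, lambda x: 3 * x**2))   # ~2.0 in a handful of steps
    # Minimizing f(x) works the same way with g = f' and g_prime = f''.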

16 The middle ground between hill-climbing and systematic search
Hill-climbing has a lot of freedom in deciding which node to expand next, but it is incomplete even for finite search spaces.
--Good for problems which have solutions, but where the solutions are non-uniformly clustered.
Systematic search is complete (because its search tree keeps track of the parts of the space that have been visited).
--Good for problems where solutions may not exist, or where the whole point is to show that there are no solutions (e.g., the propositional entailment problem to be discussed later), or where the state space is densely connected (making repeated exploration of states a big issue).
Smart idea: try the middle ground between the two?

17 Between hill-climbing and systematic search
You can reduce the freedom of hill-climbing search to make it more complete
--Tabu search
You can increase the freedom of systematic search to make it more flexible in following local gradients
--Random restart search

18 Tabu Search
A variant of hill-climbing search that attempts to reduce the chance of revisiting the same states.
--Idea: keep a “tabu” list of states that have been visited in the past. Whenever a node in the local neighborhood is found in the tabu list, remove it from consideration (even if it happens to have the best “heuristic” value among all neighbors).
--Properties: as the size of the tabu list grows, hill-climbing asymptotically becomes “non-redundant” (it won’t look at the same state twice). In practice, a reasonably sized tabu list (say 100 or so) improves the performance of hill-climbing on many problems.
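
A sketch of tabu search layered on the earlier illustrative goodness()/neighborhood() interface; the tabu list is a fixed-length queue of recently visited states, with the length 100 taken from the slide's suggestion.

    from collections import deque

    def tabu_search(start, goodness, neighborhood, tabu_size=100, max_moves=10000):
        state, best = start, start
        tabu = deque([state], maxlen=tabu_size)    # oldest entries fall off the end
        for _ in range(max_moves):
            candidates = [n for n in neighborhood(state) if n not in tabu]
            if not candidates:
                break
            state = max(candidates, key=goodness)  # best non-tabu neighbor
            tabu.append(state)
            if goodness(state) > goodness(best):
                best = state
        return best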

19 Random restart search
A variant of depth-first search where:
--When a node is expanded, its children are first randomly permuted before being introduced into the open list. The permutation may well be a “biased” random permutation.
--The search is “restarted” from scratch any time a “cutoff” parameter is exceeded. The cutoff may be in terms of the number of backtracks, the number of nodes expanded, or the amount of time elapsed.
Because of the random permutation, every time the search is restarted you are likely to follow different paths through the search tree. This allows you to recover from bad initial moves.
The higher the cutoff value, the lower the number of restarts (and thus the lower the “freedom” to explore different paths). When the cutoff is infinity, random restart search is just normal depth-first search: it is systematic and complete. For smaller cutoff values, the search has more freedom, but no guarantee of completeness.
A strategy to guarantee asymptotic completeness: start with a low cutoff value, but keep increasing it as time goes on.
Random restart search has been shown to be very good for problems that have a reasonable percentage of “easy to find” solutions (such problems are said to exhibit the “heavy-tail” phenomenon). Many real-world problems have this property.
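
A generic sketch of randomized restart depth-first search; successors() and is_goal() are assumed problem-specific callbacks (successors returning a list of child states), and doubling the cutoff after each restart is one illustrative way to realize the "keep increasing it" strategy.

    import random

    def random_restart_dfs(initial, successors, is_goal, cutoff=100, max_restarts=50):
        for _ in range(max_restarts):
            expansions, stack = 0, [initial]
            while stack and expansions < cutoff:
                state = stack.pop()
                expansions += 1
                if is_goal(state):
                    return state
                children = successors(state)
                random.shuffle(children)           # random permutation before pushing
                stack.extend(children)
            cutoff *= 2                            # increase the cutoff as time goes on
        return None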

20 Leaving goal-based search…
Looked at:
--Systematic search: blind search (BFS, DFS, uniform cost search, IDDFS); informed search (A*, IDA*; how heuristics are made)
--Local search: greedy (hill-climbing); asymptotically complete (hill-climbing with random restart; biased random walk or simulated annealing); multi-seed hill-climbing (genetic algorithms…)

21 MDPs as utility-based problem-solving agents

22 [Can generalize to have action costs C(a,s).] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.

23 (Figure annotations: think of these as related to h*() values; this is called the value function U*.)

24 9/23

25 What does a solution to an MDP look like?
The solution should tell us the optimal action to do in each state (called a “policy”).
--A policy is a function from states to actions, not a sequence of actions anymore. This is needed because of the non-deterministic actions.
--If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies.
How do we get the best policy?
--Pick the policy that gives the maximal expected reward.
--For each policy: simulate the policy (take the actions suggested by the policy) to get behavior traces, evaluate the behavior traces, and take the average value of the behavior traces (see the sketch below).
Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
--Yes…
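
A sketch of the simulate-and-average evaluation described above. The MDP interface (T[s][a] as a list of (probability, next_state) pairs, R[s] as a per-state reward, and a discount factor gamma) is an assumed representation for illustration, not something fixed by the slides.

    import random

    def sample_next(T, s, a):
        """Sample a successor state from the transition distribution T[s][a]."""
        r, cum = random.random(), 0.0
        for prob, s_next in T[s][a]:
            cum += prob
            if r <= cum:
                return s_next
        return T[s][a][-1][1]

    def evaluate_policy(policy, T, R, start, gamma=0.95, trials=1000, horizon=100):
        total = 0.0
        for _ in range(trials):
            s, discount, value = start, 1.0, 0.0
            for _ in range(horizon):
                value += discount * R[s]
                s = sample_next(T, s, policy[s])
                discount *= gamma
            total += value
        return total / trials                      # average value of the behavior traces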

26 (Figure annotations: think of these as related to h*() values; this is called the value function U*.)

27 Policies change with rewards.

28 (Value)

29

30 (Figure: transition probabilities 0.8 and 0.1)

31 Why are values coming down first? Why are some states reaching their optimal value faster?
Updates can be done synchronously OR asynchronously.
--Convergence is guaranteed as long as each state is updated infinitely often.
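
A sketch of synchronous value iteration with the Bellman update U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s'), using the same assumed T/R interface as the earlier MDP sketch (actions(s) lists the actions applicable in s). An asynchronous version would simply overwrite U in place, one state at a time, in any order that touches every state infinitely often.

    def value_iteration(states, actions, T, R, gamma=0.95, epsilon=1e-6):
        U = {s: 0.0 for s in states}
        while True:
            new_U, delta = {}, 0.0
            for s in states:
                best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in actions(s))
                new_U[s] = R[s] + gamma * best
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < epsilon:
                return U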

32 Policies converge earlier than values
Given a utility vector U_i we can compute the greedy policy pi_{U_i}.
The policy loss of a policy pi is ||U^pi - U*|| (the max norm difference of two vectors is the maximum amount by which they differ on any dimension).
So search in the space of policies.

33 We can either solve the linear equations exactly, or solve them approximately by running the value iteration updates a few times (in which case the update won't have the max factor).
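
A sketch of the approximate alternative: a few value-iteration-style sweeps for a fixed policy, so the update has no max over actions; the T/R representation is the same assumed interface as in the earlier sketches.

    def approximate_policy_evaluation(policy, states, T, R, U, gamma=0.95, sweeps=20):
        for _ in range(sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in T[s][policy[s]])
                 for s in states}
        return U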

34 Incomplete observability (the dreaded POMDPs)
To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not).
--The policy maps belief states to actions.
In practice, this causes (humongous) problems:
--The space of belief states is “continuous” (even if the underlying world is discrete and finite).
--Even approximate policies are hard to find (PSPACE-hard). Problems with a few dozen world states are currently hard to solve.
--“Depth-limited” exploration (such as that done in adversarial games) is the only option…

35 LALA Land—Don’t look below.

36 The Big Computational Issues in MDPs
MDP models are quite easy to specify and understand conceptually. The big issues are “compactness” and “efficiency”.
Policy construction is polynomial in the size of the state space (which is bad news…!)
--For POMDPs, the state space is the belief space (infinite!).
Compact representations are needed for actions, the reward function, the policy, and the value function.
Efficient methods are needed for policy/value updates.
Representations that have been tried include decision trees, neural nets, Bayesian nets, and ADDs (algebraic decision diagrams, a generalization of BDDs where the leaf nodes can have real-valued valuations instead of T/F).

37 SPUDD: Using ADDs to Represent Actions, Rewards and Policies

38 MDPs and Planning Problems
FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions.
POMDPs (partially observable MDPs), a generalization of the MDP framework in which the current state can only be partially observed, will be needed to handle planning problems with partial observability.
--POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (which form a continuous space).

39 SSPP (Stochastic Shortest Path Problem): an MDP with init and goal states
MDPs don't have a notion of an “initial” and a “goal” state (a process orientation instead of a “task” orientation).
--Goals are sort of modeled by reward functions. This allows pretty expressive goals (in theory).
--Normal MDP algorithms don't use initial state information (since the policy is supposed to cover the entire state space anyway). Could consider “envelope extension” methods: compute a “deterministic” plan (which gives the policy for some of the states), then extend the policy to other states that are likely to happen during execution; or RTDP methods.
SSPPs are a special case of MDPs where (a) an initial state is given, (b) there are absorbing goal states, and (c) actions have costs and goal states have zero cost.
A proper policy for an SSPP is a policy that is guaranteed to ultimately put the agent in one of the absorbing states.
For SSPPs, it is worth finding a partial policy that only covers the “relevant” states (states that are reachable from the initial and goal states on any optimal policy).
--Value/policy iteration don't consider the notion of relevance.
--Consider “heuristic state search” algorithms, where the heuristic can be seen as an “estimate” of the value of a state: (L)AO*, RTDP algorithms, or envelope extension methods.

40 AO* search for solving SSPP problems
Main issues:
--The cost of a node is the expected cost of its children.
--The AND tree can have LOOPS, so cost backup is complicated.
Intermediate nodes are given admissible heuristic estimates; these can be just the shortest paths (or their estimates).

41 LAO*: turning bottom-up labeling into a full dynamic programming (DP) computation

42 RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it {or simulate it}. Loop back.
Leaf nodes are evaluated by using their “cached” values: if a node has been evaluated using RTDP analysis in the past, use its remembered value; if not, use heuristics to estimate (a) immediate reward values and (b) reachability heuristics.
This is sort of like depth-limited game playing (expectimax). --Who is the game against?
Can also do “reinforcement learning” this way: the M_ij are not known correctly in RL.

43 Greedy “On-Policy” RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
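
A sketch of one greedy on-policy trial as described above, reusing the assumed T/R/actions interface and the sample_next() helper from the earlier sketches; terminals is an assumed set of terminating states, and the backup along the visited path uses the Bellman update.

    def greedy_trial(U, start, terminals, actions, T, R, gamma=0.95, max_steps=200):
        path, s = [], start
        for _ in range(max_steps):
            path.append(s)
            if s in terminals:
                break
            # Greedy action under the current utility estimates.
            a = max(actions(s), key=lambda act: sum(p * U[s2] for p, s2 in T[s][act]))
            s = sample_next(T, s, a)
        for s in reversed(path):                   # back up values along the path
            if s in terminals:
                U[s] = R[s]
            else:
                U[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T[s][a]) for a in actions(s))
        return U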

44

45 Envelope Extension Methods
For each action, take the most likely outcome and discard the rest. Find a plan (a deterministic path) from the init state to the goal state. This is a (very partial) policy covering just the states that fall on the maximum-probability state sequence.
Consider the states that are most likely to be encountered while traveling this path, and find a policy for those states too.
The tricky part is to show that we can converge to the optimal policy.

46

47

