
1 CS344: Introduction to Artificial Intelligence Pushpak Bhattacharyya, CSE Dept., IIT Bombay Lecture 28: PAC and Reinforcement Learning

2 [Figure: universe U under a probability distribution P, with concept c and hypothesis h; the shaded region c Δ h is the error region] c Δ h = error region; we require P(c Δ h) ≤ ε, where ε is the accuracy parameter.

3 Learning means the following should happen: Pr(P(c Δ h) ≤ ε) ≥ 1 - δ. This is the PAC (Probably Approximately Correct) model of learning.

4 [Figure: x-y plane with positive (+) examples inside an axis-parallel rectangle ABCD and negative (-) examples outside]

5 Algo: 1. Ignore the negative examples. 2. Find the tightest-fitting axis-parallel rectangle around the positive examples.
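A minimal sketch of this learner in Python (the helper names and the 2-D point representation are illustrative assumptions, not part of the lecture):

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)

def fit_rectangle(positives: List[Point]) -> Rect:
    """Return the tightest axis-parallel rectangle containing all positive
    examples; negative examples are simply ignored, as the algorithm says."""
    xs = [p[0] for p in positives]
    ys = [p[1] for p in positives]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect: Rect, p: Point) -> bool:
    """Label a point positive iff it falls inside the learned rectangle."""
    x_min, x_max, y_min, y_max = rect
    return x_min <= p[0] <= x_max and y_min <= p[1] <= y_max

# Example: learn from three positive points and classify a new one.
rect = fit_rectangle([(1.0, 1.0), (2.0, 3.0), (4.0, 2.0)])
print(predict(rect, (3.0, 2.0)))  # True: the point lies inside the rectangle
```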

6 Case 1: If P(□ABCD) < ε, the algorithm is PAC: the learned rectangle h lies inside the true rectangle c = □ABCD, so the error region c Δ h is contained in □ABCD and P(c Δ h) < ε, i.e., Pr(P(c Δ h) ≤ ε) ≥ 1 - δ holds. [Figure: hypothesis h nested inside concept c = □ABCD, with the + and - examples as before]

7 Case 2: P(□ABCD) > ε. Partition □ABCD into four strips (Top, Bottom, Left, Right), each chosen so that P(Top) = P(Bottom) = P(Left) = P(Right) = ε/4. [Figure: rectangle ABCD divided into the four strips, with the - examples outside]

8 Let the number of examples be m. The probability that a point falls in the Top strip is ε/4, so the probability that none of the m examples falls in the Top strip is (1 - ε/4)^m.

9 By the union bound, the probability that at least one of Top/Bottom/Left/Right receives none of the m examples is at most 4(1 - ε/4)^m. Hence the probability that every one of the 4 strips receives at least one example is at least 1 - 4(1 - ε/4)^m.

10 We need this to hold with probability at least 1 - δ: 1 - 4(1 - ε/4)^m ≥ 1 - δ, i.e., 4(1 - ε/4)^m ≤ δ.

11 Since (1 - ε/4)^m ≤ e^(-εm/4), it suffices that 4 e^(-εm/4) ≤ δ, which gives m ≥ (4/ε) ln(4/δ).

12 Say we want 10% error with 90% confidence (ε = 0.1, δ = 0.1): m ≥ (4/0.1) ln(4/0.1) = 40 ln 40, which is approximately 148.
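As a quick check, a small Python helper (a sketch; the function name is ours) evaluates this bound:

```python
import math

def pac_rectangle_samples(eps: float, delta: float) -> int:
    """Sample size m >= (4/eps) * ln(4/delta) from the argument above."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

print(pac_rectangle_samples(0.1, 0.1))  # 148
```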

13 VC-dimension Finiteness of the VC-dimension gives a necessary and sufficient condition for PAC learnability.

14 Def: Let C be a concept class, i.e., a set whose members c1, c2, c3, … are concepts. [Figure: class C containing concepts c1, c2, c3]

15 Let S be a subset of U (the universe). If every subset of S can be produced by intersecting S with some concept c_i of C, then we say C shatters S.

16 The highest-cardinality set S that can be shattered gives the VC-dimension of C: VC-dim(C) = |S|. VC-dim: Vapnik-Chervonenkis dimension.
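A brute-force shattering check for one concrete concept class, axis-parallel rectangles (chosen because the membership test is simple; the helper names and the example points are our illustrative assumptions):

```python
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]

def rect_realizes(subset: List[Point], others: List[Point]) -> bool:
    """True if some axis-parallel rectangle contains `subset` and excludes `others`.
    It suffices to test the tightest rectangle around `subset`."""
    if not subset:
        return True  # the empty subset is realized by a degenerate rectangle
    x_lo = min(p[0] for p in subset); x_hi = max(p[0] for p in subset)
    y_lo = min(p[1] for p in subset); y_hi = max(p[1] for p in subset)
    return not any(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi for p in others)

def shattered_by_rectangles(S: List[Point]) -> bool:
    """Check whether axis-parallel rectangles shatter the point set S."""
    for r in range(len(S) + 1):
        for sub in combinations(S, r):
            rest = [p for p in S if p not in sub]
            if not rect_realizes(list(sub), rest):
                return False
    return True

# Four points in a "diamond" are shattered; adding a centre point breaks shattering.
print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2)]))          # True
print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))  # False
```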

17 A 2-dimensional surface; concept class C = {half-planes}. [Figure: x-y plane]

18 |S| = 1 can be shattered: S1 = {a}; the realizable subsets are {a} and Ø. [Figure: a single point a in the x-y plane]

19 |S| = 2 can be shattered: S2 = {a, b}; the realizable subsets are {a, b}, {a}, {b}, Ø. [Figure: points a and b in the x-y plane]

20 |S| = 3 can be shattered: S3 = {a, b, c}. [Figure: points a, b and c in the x-y plane]

21 [Figure-only slide]

22 |S| = 4 cannot be shattered: S4 = {a, b, c, d}. (For example, if the four points are in convex position, no half-plane can contain exactly the two diagonally opposite points.) [Figure: four points a, b, c, d in the x-y plane]

23 Fundamental Theorem of PAC learning (Ehrenfeucht et al., 1989): A concept class C is learnable for all probability distributions and all concepts in C if and only if the VC dimension of C is finite. If the VC dimension of C is d, then… (next slide)

24 Fundamental theorem (contd.): (a) For 0 < ε < 1 and sample size at least max[(4/ε) log(2/δ), (8d/ε) log(13/ε)], any consistent function A: S_c → C is a learning function for C. (b) For 0 < ε < 1/2 and sample size less than max[((1 - ε)/ε) ln(1/δ), d(1 - 2(ε(1 - δ) + δ))], no function A: S_c → H, for any hypothesis space H, is a learning function for C.
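A small helper (our sketch, not from the lecture; base-2 logarithms are assumed in part (a), following the usual statement of this bound) to evaluate the part (a) sample size:

```python
import math

def sufficient_samples(eps: float, delta: float, d: int) -> int:
    """Part (a) bound: max[(4/eps) * log2(2/delta), (8d/eps) * log2(13/eps)]."""
    m1 = (4.0 / eps) * math.log2(2.0 / delta)
    m2 = (8.0 * d / eps) * math.log2(13.0 / eps)
    return math.ceil(max(m1, m2))

# Axis-parallel rectangles have VC dimension d = 4.
print(sufficient_samples(0.1, 0.1, 4))  # a few thousand examples suffice
```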

25 Book 1. Computational Learning Theory, M. H. G. Anthony and N. Biggs, Cambridge Tracts in Theoretical Computer Science, 1997. Papers 1. A theory of the learnable, L. G. Valiant (1984), Communications of the ACM 27(11):1134-1142. 2. Learnability and the Vapnik-Chervonenkis dimension, A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Journal of the ACM, 1989.

26 Introducing Reinforcement Learning

27 Introduction  Reinforcement Learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward.

28 Constituents  In RL, no correct/incorrect input/output pairs are given.  The feedback for the learning process is called 'reward' or 'reinforcement'.  In RL we examine how an agent can learn from success and failure, reward and punishment.

29 The RL framework  The environment is depicted as a finite-state Markov Decision Process (MDP).  The utility of a state, U(i), gives the usefulness of the state.  The agent can begin with knowledge of the environment and the effects of its actions, or it may have to learn this model as well as the utility information.

30 The RL problem  Rewards can be received either in intermediate states or in a terminal state.  Rewards can be a component of the actual utility (e.g., points in a TT match) or they can be hints to the actual utility (e.g., verbal reinforcements).  The agent can be a passive or an active learner.

31 Passive Learning in a Known Environment In passive learning, the environment generates state transitions and the agent perceives them. Consider an agent trying to learn the utilities of the states shown below: [Figure: 4×3 grid world of states]

32 Passive Learning in a Known Environment  The agent can move {North, East, South, West}.  It terminates on reaching [4,2] or [4,3].

33 Passive Learning in a Known Environment The agent is provided with: M_ij = a model giving the probability of a transition from state i to state j.

34 Passive Learning in a Known Environment  The objective is to use this information about rewards to learn the expected utility U(i) associated with each nonterminal state i.  Utilities can be learned using 3 approaches: 1) LMS (least mean squares) 2) ADP (adaptive dynamic programming) 3) TD (temporal difference learning)

35 Passive Learning in a Known Environment LMS (Least Mean Squares) The agent makes random runs (sequences of random moves) through the environment: [1,1]->[1,2]->[1,3]->[2,3]->[3,3]->[4,3] = +1 [1,1]->[2,1]->[3,1]->[3,2]->[4,2] = -1

36 Passive Learning in a Known Environment LMS  Collect statistics on the final payoff for each state (e.g., when at [2,3], how often was +1 reached versus -1?).  The learner computes the average for each state, which converges in the limit to the true expected value (utility).
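A minimal sketch of this LMS-style (direct) utility estimation, assuming each run is represented as a list of visited states plus the terminal payoff (a data format chosen here for illustration):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

State = Tuple[int, int]

def lms_utilities(runs: List[Tuple[List[State], float]]) -> Dict[State, float]:
    """Direct utility estimation: average the observed final payoff over all
    runs that pass through each state."""
    totals: Dict[State, float] = defaultdict(float)
    counts: Dict[State, int] = defaultdict(int)
    for states, payoff in runs:
        for s in states:
            totals[s] += payoff
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

runs = [
    ([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)], +1.0),
    ([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)], -1.0),
]
print(lms_utilities(runs)[(1, 1)])  # 0.0: [1,1] appears in one +1 run and one -1 run
```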

37 Passive Learning in a Known Environment LMS Main drawback: slow convergence; it takes the agent well over 1000 training sequences to get close to the correct values.

38 Passive Learning in a Known Environment ADP (Adaptive Dynamic Programming) Uses the value iteration or policy iteration algorithm to calculate exact utilities of states given an estimated model.

39 Passive Learning in a Known Environment ADP In general: U_{n+1}(i) = R(i) + Σ_j M_ij U_n(j) - U_n(i) is the utility of state i after the nth iteration - Initially set to R(i) - R(i) is the reward of being in state i (often nonzero for only a few end states) - M_ij is the probability of a transition from state i to j
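A sketch of iterating this update in Python (the dictionary-based representation of M and R is our assumption; the toy example feeds the neighbours' slide-40 utilities in as fixed values just to reproduce that arithmetic):

```python
from typing import Dict, Tuple

State = Tuple[int, int]

def adp_utilities(M: Dict[State, Dict[State, float]],
                  R: Dict[State, float],
                  iterations: int = 100) -> Dict[State, float]:
    """Iterate U_{n+1}(i) = R(i) + sum_j M_ij * U_n(j), starting from U_0(i) = R(i)."""
    U = dict(R)
    for _ in range(iterations):
        U = {i: R[i] + sum(p * U[j] for j, p in M[i].items()) for i in M}
    return U

# Toy model around state (3,3): three equally likely successors, no other transitions.
M = {(3, 3): {(4, 3): 1/3, (2, 3): 1/3, (3, 2): 1/3},
     (4, 3): {}, (2, 3): {}, (3, 2): {}}
R = {(3, 3): 0.0, (4, 3): 1.0, (2, 3): 0.0886, (3, 2): -0.4430}
print(round(adp_utilities(M, R)[(3, 3)], 4))  # 0.2152, matching slide 40
```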

40 Passive Learning in a Known Environment ADP Consider U(3,3): U(3,3) = 1/3 × U(4,3) + 1/3 × U(2,3) + 1/3 × U(3,2) = 1/3 × (1.0 + 0.0886 - 0.4430) = 0.2152

41 Passive Learning in a Known Environment ADP  makes optimal use of the local constraints on utilities of states imposed by the neighborhood structure of the environment  somewhat intractable for large state spaces

42 Passive Learning in a Known Environment TD (Temporal Difference Learning) The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations.

43 Passive Learning in a Known Environment TD Learning  Suppose we observe a transition from state i to state j, with U(i) = -0.5 and U(j) = +0.5.  This suggests that we should increase U(i) to make it agree better with its successor.  This can be achieved using the following update rule: U_{n+1}(i) = U_n(i) + α(R(i) + U_n(j) - U_n(i))
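A minimal sketch of applying this TD update along observed runs (same illustrative data format as before; α is the learning rate):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

State = Tuple[int, int]

def td_utilities(runs: List[List[State]],
                 R: Dict[State, float],
                 alpha: float = 0.1) -> Dict[State, float]:
    """Apply U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)) for every observed
    transition i -> j; utilities start at R(i) (0 for unrewarded states)."""
    U: Dict[State, float] = defaultdict(float)
    U.update(R)
    for run in runs:
        for i, j in zip(run, run[1:]):
            U[i] += alpha * (R.get(i, 0.0) + U[j] - U[i])
    return dict(U)

runs = [[(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]]
R = {(4, 3): 1.0, (4, 2): -1.0}
print(td_utilities(runs, R)[(3, 3)])  # 0.1: nudged toward its successor's utility of +1
```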

44 Passive Learning in a Known Environment TD Learning Performance:  Runs are "noisier" than LMS but have smaller error.  Deals only with states observed during sample runs (not all states, unlike ADP).

45 Passive Learning in an Unknown Environment LMS approach and TD approach operate unchanged in an initially unknown environment. ADP approach adds a step that updates an estimated model of the environment.

46 Passive Learning in an Unknown Environment ADP Approach  The environment model is learned by direct observation of transitions  The environment model M can be updated by keeping track of the percentage of times each state transitions to each of its neighbours
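A sketch of this frequency-based model estimate (the list-of-transitions input format is our choice for illustration):

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

State = Tuple[int, int]

def estimate_model(transitions: List[Tuple[State, State]]) -> Dict[State, Dict[State, float]]:
    """Estimate M_ij as the fraction of observed transitions out of i that led to j."""
    counts: Dict[State, Counter] = defaultdict(Counter)
    for i, j in transitions:
        counts[i][j] += 1
    return {i: {j: c / sum(cnt.values()) for j, c in cnt.items()}
            for i, cnt in counts.items()}

observed = [((1, 1), (1, 2)), ((1, 1), (2, 1)), ((1, 1), (1, 2))]
print(estimate_model(observed)[(1, 1)])  # {(1, 2): ~0.667, (2, 1): ~0.333}
```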

47 Passive Learning in an Unknown Environment ADP & TD Approaches  The ADP approach and the TD approach are closely related  Both try to make local adjustments to the utility estimates in order to make each state “agree” with its successors

48 Passive Learning in an Unknown Environment Minor differences :  TD adjusts a state to agree with its observed successor  ADP adjusts the state to agree with all of the successors Important differences :  TD makes a single adjustment per observed transition  ADP makes as many adjustments as it needs to restore consistency between the utility estimates U and the environment model M

49 Passive Learning in an Unknown Environment To make ADP more efficient:  directly approximate the algorithm for value iteration or policy iteration  the prioritized-sweeping heuristic makes adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates Advantages of the approximate ADP:  efficient in terms of computation  eliminates the long value iterations that occur in the early stages

50 Active Learning in an Unknown Environment An active agent must consider :  what actions to take  what their outcomes may be  how they will affect the rewards received

51 Active Learning in an Unknown Environment Minor changes to the passive learning agent:  the environment model now incorporates the probabilities of transitions to other states given a particular action  the agent must choose actions so as to maximize its expected utility  the agent needs a performance element to choose an action at each step

52 The framework [Figure-only slide]

53 Learning An Action Value-Function The TD Q-learning update equation: - requires no model - is calculated after each transition from state i to state j Thus, Q-values can be learned directly from reward feedback.
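The standard TD Q-learning update (stated here as our assumption about the equation the slide refers to) is Q(a, i) ← Q(a, i) + α(R(i) + max_a' Q(a', j) - Q(a, i)). A minimal sketch under that assumption:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

State = Tuple[int, int]
Action = str

def q_update(Q: Dict[Tuple[Action, State], float],
             i: State, a: Action, r: float, j: State,
             actions: List[Action], alpha: float = 0.1) -> None:
    """One TD Q-learning step:
    Q(a, i) <- Q(a, i) + alpha * (R(i) + max_a' Q(a', j) - Q(a, i))."""
    best_next = max(Q[(a2, j)] for a2 in actions)
    Q[(a, i)] += alpha * (r + best_next - Q[(a, i)])

# Usage: one observed transition (1,1) --North--> (1,2) with reward 0.
actions = ["North", "East", "South", "West"]
Q: Dict[Tuple[Action, State], float] = defaultdict(float)
q_update(Q, (1, 1), "North", 0.0, (1, 2), actions)
print(Q[("North", (1, 1))])  # 0.0 so far, since no reward has been observed yet
```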

54 Generalization In Reinforcement Learning Explicit Representation  We have assumed that all the functions learned by the agent (U, M, R, Q) are represented in tabular form.  Explicit representation involves one output value for each input tuple.

