Published by Tyrese Blasdell. Modified over 4 years ago.

1 Testing Stochastic Processes Through Reinforcement Learning
François Laviolette, Sami Zhioua, Josée Desharnais
NIPS Workshop, December 9th, 2006

2 Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion

3 Stochastic Program Verification
Specification (LMP, labelled Markov process): an MDP without rewards.
[Figure: an LMP with states s0 to s6 and transitions labelled a[0.5], a[0.3], b[0.9], c, c, shown next to the Implementation.]
How far is the Implementation from the Specification? (Distance or divergence)
- The Specification model is available.
- The Implementation is available only for interaction (no model).

4 Trace Equivalence
1. Non-deterministic trace equivalence
Two systems are trace equivalent iff they accept the same set of traces.
[Figure: transition systems P and Q over the actions a, b, c.]
T(P) = {a, aa, aac, ac, b, ba, bab, c, cb, cc}
T(Q) = {a, ab, ac, abc, abca, ba, bab, c, ca}
2. Probabilistic trace equivalence
Two systems are trace equivalent iff they accept the same set of traces with the same probabilities.
[Figure: probabilistic systems P and Q; transition labels include a[2/3], a[1/3], b[2/3], a[1/4], a[3/4], b[1/2], c[1/2].]
Trace probabilities in P: a: 7/12, aa: 5/12, aac: 1/6, bc: 2/3, …
Trace probabilities in Q: a: 1, aa: 1/2, aac: 0, bc: 0, …
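The non-deterministic case on this slide reduces to comparing two sets. A minimal sketch in Python, using the trace sets exactly as they appear in the transcript; the names `t_p`, `t_q` and `trace_equivalent` are illustrative and not from the talk.

```python
# Trace sets copied from the slide; variable names are illustrative.
t_p = {"a", "aa", "aac", "ac", "b", "ba", "bab", "c", "cb", "cc"}
t_q = {"a", "ab", "ac", "abc", "abca", "ba", "bab", "c", "ca"}

def trace_equivalent(traces_1, traces_2):
    """Non-deterministic trace equivalence: identical sets of accepted traces."""
    return traces_1 == traces_2

only_p = sorted(t_p - t_q)   # traces accepted by P but not by Q
only_q = sorted(t_q - t_p)   # traces accepted by Q but not by P
```

Any trace in `only_p` or `only_q` is a witness that P and Q are not trace equivalent.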

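5 Testing (Trace Equivalence)
The system is a black box.
When a button is pushed (action execution), either the button goes down (transition) or it does not go down (no transition).
Grammar (trace equivalence): t ::= ω | a.t
Observations: when a test t is executed, several observations are possible; O_t denotes the set of them.
[Figure: a black box with buttons a, b, z, and an LMP fragment with states s0 and s3, the intervals [2,4) and [7,10], and transitions a[0.2], a[0.5], b[0.7].]
Example: for t = a.b, O_t = {a×, a✓.b×, a✓.b✓}, with probabilities 0.3, 0.56 and 0.14 respectively (✓: the button went down, ×: it did not).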
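The example probabilities can be checked by walking a small LMP. The sketch below is hedged: the wiring of the LMP (which state each a-transition reaches, and the state names s1 and s2) is an assumption chosen to match the labels a[0.2], a[0.5] and b[0.7]; the function name `observations` is hypothetical. Success is written `a+` and failure `a-` to keep the code ASCII.

```python
# Hypothetical LMP consistent with the labels a[0.2], a[0.5], b[0.7] on the slide.
# State names other than s0 and s3, and the exact wiring, are assumptions.
LMP = {
    "s0": {"a": [(0.2, "s1"), (0.5, "s2")]},
    "s1": {"b": [(0.7, "s3")]},
}

def observations(lmp, state, test):
    """Return {observation: probability} for a linear test given as a string of actions.

    "a+" means the button went down, "a-" means it did not (the run stops there).
    """
    if not test:
        return {"": 1.0}
    action, rest = test[0], test[1:]
    successors = lmp.get(state, {}).get(action, [])
    result = {}
    refuse = 1.0 - sum(p for p, _ in successors)   # probability of no transition
    if refuse > 1e-12:
        result[action + "-"] = refuse
    for p, nxt in successors:
        for obs, q in observations(lmp, nxt, rest).items():
            key = action + "+" + ("." + obs if obs else "")
            result[key] = result.get(key, 0.0) + p * q
    return result

dist = observations(LMP, "s0", "ab")   # the test t = a.b
```

Under these assumptions `dist` assigns about 0.3 to `a-`, 0.56 to `a+.b-` and 0.14 to `a+.b+`, matching the three values on the slide.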

6 Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion

7 Why Reinforcement Learning?
- Reinforcement Learning is particularly efficient in the absence of the full model.
- Reinforcement Learning can deal with bigger systems.
Analogy:
- LMP ↔ MDP
- Trace ↔ Policy
- Divergence ↔ Optimal value (V*)
[Figure: an LMP with states s0 to s8 and transitions such as a[0.2], a[0.5], b[0.7], a[0.3], b[0.9], a[0.7], next to the corresponding MDP with the same states and transition probabilities.]

8 A Stochastic Game towards RL
The same action is run on three systems: the Implementation, the Specification, and a clone of the Specification.
- Reward (+1) when Impl ≠ Spec.
- Reward (−1) when Spec ≠ Clone.
[Figure: the Implementation LMP, the Specification LMP and its clone (states s0 to s10, transitions over a, b, c), with a table of the eight Success (S) / Failure (F) outcome combinations and their rewards +1, 0, −1.]
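The reward rule can be tabulated by enumerating the eight outcome combinations. This is a sketch under one assumption: when both conditions hold at once, the two contributions are summed, which yields only the values +1, 0 and −1 shown on the slide. The function name `reward` is illustrative.

```python
from itertools import product

def reward(impl, spec, clone):
    """+1 if Impl and Spec disagree, -1 if Spec and its clone disagree (summed)."""
    return int(impl != spec) - int(spec != clone)

# Enumerate all (Impl, Spec, Clone) outcomes: S = success, F = failure.
table = {"".join(outcome): reward(*outcome) for outcome in product("SF", repeat=3)}
```

Here `table["FSS"]` and `table["SFF"]` are +1, `table["FFS"]` and `table["SSF"]` are −1, and the remaining four combinations give 0.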

9 MDP Definition
The MDP is built from the Specification LMP: its states, its actions, and its next-state probability distribution.
[Figure: the Implementation and Specification LMPs (states s0 to s10, transitions over a, b, c) and the induced MDP, which includes a Dead state.]

10 Divergence Computation
The divergence is the optimal value V*(s0) of the induced MDP: 0 means equivalent, 1 means different.
[Figure: the Implementation and Specification LMPs, the induced MDP with its Dead state, and the table of S/F outcome combinations with rewards +1, 0, −1.]

11 Symmetry Problem
[Figure: Implementation with a[1] from s0 to s1; Specification and its clone with a[0.5] from s0 to s1. Outcome triples (Implementation, Specification, Clone): FSS and SFF give +1; FFS and SSF give −1.]
Example: Prob = 0 × 0.5 × 0.5 + 1 × 0.5 × 0.5 = 0.25
Create two variants of each action a:
- Success variant (a✓)
- Failure variant (a×)
Procedure:
- Select an action and make a prediction (✓ or ×).
- Execute the action.
- If pred = obs: compute and give the reward.
- If pred ≠ obs: give reward 0.
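The probability 0.25 on this slide can be reproduced by enumerating outcomes. The sketch below assumes the summed reward rule (+1 when Impl ≠ Spec, −1 when Spec ≠ Clone) and that the clone succeeds with the same probability as the Specification; the function names are illustrative.

```python
from itertools import product

def outcome_prob(outcome, p_success):
    """Probability of S (success) or F (failure) given a success probability."""
    return p_success if outcome == "S" else 1.0 - p_success

def reward_distribution(p_impl, p_spec):
    """Probabilities of the rewards +1, 0, -1 for one action (clone behaves like Spec)."""
    dist = {1: 0.0, 0: 0.0, -1: 0.0}
    for impl, spec, clone in product("SF", repeat=3):
        prob = (outcome_prob(impl, p_impl) * outcome_prob(spec, p_spec)
                * outcome_prob(clone, p_spec))
        r = int(impl != spec) - int(spec != clone)
        dist[r] += prob
    return dist

dist = reward_distribution(p_impl=1.0, p_spec=0.5)   # a[1] versus a[0.5]
expected = sum(r * p for r, p in dist.items())
```

With these numbers the reward +1 has probability 0.25, and so does the reward −1, so the expected reward is 0 although the two systems differ. This appears to be the symmetry that the success and failure variants are meant to break.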

12 The Divergence (with the symmetry problem fixed)
Theorem. Let "Spec" and "Impl" be two LMPs, and M their induced MDP. Then V*(s0) ≥ 0, and V*(s0) = 0 iff "Spec" and "Impl" are trace-equivalent.

13 Implementation and PAC Guarantee
PAC guarantee:
- A PAC guarantee exists for the Q-Learning algorithm, but Fiechter's algorithm has a simpler one.
- In addition, a lower bound can be obtained from the Hoeffding inequality. [The "If … then …" statement of the bound is not recoverable from this transcript.]
Implementation:
- RL algorithm: Q-Learning.
- [Symbol lost] = 0.8.
- Learning rate decreasing according to the function 1/x.
- Action selection: softmax, with the temperature decreasing from 0.8 to 0.01.
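The listed settings correspond to a standard tabular Q-Learning loop with softmax (Boltzmann) action selection. The following is only a sketch, not the authors' implementation: the two-state MDP in `STEP` is invented for illustration, reading the "= 0.8" parameter as the discount factor `GAMMA` is an assumption, x in 1/x is taken to be the visit count of the state-action pair, and the linear temperature schedule is also assumed.

```python
import math
import random

# Hypothetical two-state episodic MDP: (state, action) -> (reward, next_state or None).
STEP = {
    (0, 0): (0.0, 1), (0, 1): (0.2, None),
    (1, 0): (1.0, None), (1, 1): (0.0, None),
}
GAMMA = 0.8          # assumed discount factor (the slide's "= 0.8" has lost its symbol)
EPISODES = 2000

def softmax_choice(q_values, temperature, rng):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    top = max(q_values)                      # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights)[0]

def q_learning(seed=0):
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in (0, 1)}
    visits = {key: 0 for key in STEP}
    for episode in range(EPISODES):
        # Temperature decreases linearly from 0.8 to 0.01 (the schedule shape is assumed).
        temperature = 0.8 + (0.01 - 0.8) * episode / (EPISODES - 1)
        state = 0
        while state is not None:
            action = softmax_choice(q[state], temperature, rng)
            reward, nxt = STEP[(state, action)]
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]          # learning rate 1/x
            target = reward + (GAMMA * max(q[nxt]) if nxt is not None else 0.0)
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = q_learning()
```

In the setting of the talk, the environment step would instead run the chosen action on the Implementation, the Specification and its clone, and the learned value at the initial state would estimate the divergence.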

14 Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion

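15 Testing (Bisimulation)
The system is a black box.
Grammar (bisimulation): t ::= ω | a.t | (t1, …, tn), where (t1, …, tn) is replication.
P_{t,s0} denotes the probability distribution over the observations of test t started in state s0.
[Figure: a black box with buttons a, b, z, and an LMP fragment with states s0 and s3, the intervals [2,4) and [7,10], and transitions a[0.2], a[0.5], b[0.7].]
Example: for t = a.(b,b), O_t = {a×, a✓.(b×,b×), a✓.(b×,b✓), a✓.(b✓,b×), a✓.(b✓,b✓)}, with probabilities 0.3, 0.518, 0.042, 0.042 and 0.098 respectively.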
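Replication can be simulated by running every copy of the subtest independently from the same state. The sketch below is hedged: the LMP wiring and the state names s1 and s2 are assumptions chosen to match the labels a[0.2], a[0.5] and b[0.7], and the test encoding (None, pairs, lists) and the name `observe` are illustrative. Success is written `a+` and failure `a-`.

```python
# Hypothetical LMP consistent with the slide's labels a[0.2], a[0.5], b[0.7].
LMP = {
    "s0": {"a": [(0.2, "s1"), (0.5, "s2")]},
    "s1": {"b": [(0.7, "s3")]},
}

def observe(lmp, state, test):
    """Return {observation: probability}.

    A test is None (empty test), a pair (action, subtest) for a.t, or a list
    [t1, ..., tn] for replication, where every copy is run from the same state.
    """
    if test is None:
        return {"": 1.0}
    if isinstance(test, list):
        combos = {(): 1.0}
        for sub in test:
            sub_obs = observe(lmp, state, sub)
            combos = {k + (o,): p * q
                      for k, p in combos.items() for o, q in sub_obs.items()}
        return {"(" + ",".join(k) + ")": p for k, p in combos.items()}
    action, sub = test
    successors = lmp.get(state, {}).get(action, [])
    result = {}
    refuse = 1.0 - sum(p for p, _ in successors)   # probability of no transition
    if refuse > 1e-12:
        result[action + "-"] = refuse
    for p, nxt in successors:
        for obs, q in observe(lmp, nxt, sub).items():
            key = action + "+" + ("." + obs if obs else "")
            result[key] = result.get(key, 0.0) + p * q
    return result

t = ("a", [("b", None), ("b", None)])   # the test a.(b,b)
dist = observe(LMP, "s0", t)
```

Under these assumptions the five observations receive about 0.3, 0.518, 0.042, 0.042 and 0.098, in agreement with the values on the slide.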

16 New Equivalence Notion: "By-Level Equivalence"
[Figure: processes P and Q; transition labels include a, b[1/3], c[2/3], c, a[1/3], a[2/3], b, c.]

17 K-Moment Equivalence
Test grammars:
- 1-moment (trace): t ::= ω | a.t
- 2-moment: t ::= ω | a^k.t, k ≤ 2
- 3-moment: t ::= ω | a^k.t, k ≤ 3
[Symbol lost] is a random variable such that [symbol lost] is the probability to perform the trace and make a transition to a state that accepts action a with probability p_i.
Recall: the k-th moment of X is E(X^k) = Σ_i x_i^k · Pr(X = x_i).
[The remaining statement, which relates these moments to "by-level" equivalence of two systems, is not recoverable from this transcript.]
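The recalled formula for the k-th moment is easy to evaluate exactly. A minimal sketch with two made-up distributions (they are not taken from the slides) that share the first moment but differ in the second; the function name `moment` is illustrative.

```python
from fractions import Fraction

def moment(dist, k):
    """k-th moment E(X^k) of a discrete distribution given as {value: probability}."""
    return sum(x ** k * p for x, p in dist.items())

# Hypothetical distributions over acceptance probabilities (not from the slides).
x = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
y = {Fraction(1, 2): Fraction(1)}
```

Both have first moment 1/2, but the second moments are 1/2 and 1/4, which is presumably the kind of difference a 2-moment test is meant to expose and a 1-moment (trace) test is not.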

18 Ready Equivalence and Failure Equivalence
1. Ready equivalence
Two systems are ready equivalent iff, for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process that accepts all actions from A.
Test grammar: t ::= ω | a.t | {a1, …, an}
[Figure: processes P and Q with labelled probabilistic transitions. For the action set {b,c}, the example observation has probability 2/3 in P and 1/2 in Q.]
2. Failure equivalence
Two systems are failure equivalent iff, for any trace tr and any set of actions A, they have the same probability of running tr successfully and reaching a process that refuses all actions from A.
Test grammar: t ::= ω | a.t | {a1, …, an}, with {a1, …, an} read as a set of refused actions
[Figure: the same processes P and Q. For the refused set {b,c}, the example observation has probability 1/3 in P and 1/2 in Q.]

19 Barb Equivalence
1. Barb acceptance
Test grammar: t ::= ω | a.t | {a1, …, an}a.t
[Figure: processes P and Q with labelled probabilistic transitions; the example observation has probability 2/3.]
2. Barb refusal
Test grammar: t ::= ω | a.t | {a1, …, an}a.t, with {a1, …, an} read as a set of refused actions
[Figure: the same processes P and Q; the example observation has probability 1/3.]

20 Outline
- Program Verification Problem
- The Approach for trace-equivalence
- Other equivalences
- Application on MDPs
- Conclusion

21 Application on MDPs
[Figure: two MDPs, MDP 1 and MDP 2, with actions a, b, c, transition probabilities, and rewards r1 to r8 attached to the transitions.]
- Case 1: The reward space contains 2 values (binary): 0 and 1.
- Case 2: The reward space is small (discrete): {r1, r2, r3, r4, r5}.
- Case 3: The reward space is very large (continuous): w.l.o.g. [0,1].

22 Application on MDPs
Case 1: The reward space contains 2 values (binary).
- r1 = 0 corresponds to F (failure).
- r2 = 1 corresponds to S (success).
Case 2: The reward space is small (discrete): {r1, r2, r3, r4, r5}.
- Each action is paired with each reward value (a r1, …, a r5, b r1, …, b r5), and each pair is observed as S or F.
Case 3: The reward space is very large (continuous).
- Intuition: a reward r = 3/4 is treated as reward 1 with probability 3/4 and reward 0 with probability 1/4.
- For an action a with reward r, pick a reward value (ranVal) at random: if ranVal < r the observation is S; if ranVal ≥ r it is F.
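Case 3 can be sketched as a Bernoulli draw. The code assumes that ranVal is drawn uniformly from [0, 1), which is what makes the success probability equal to r; the function name `binarize` is illustrative.

```python
import random

def binarize(r, rng):
    """Turn a reward r in [0, 1] into S (1) with probability r, else F (0)."""
    ran_val = rng.random()          # assumed uniform in [0, 1)
    return 1 if ran_val < r else 0

rng = random.Random(0)
samples = [binarize(0.75, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # empirical success rate, close to 3/4
```

The empirical success rate approaches r as the number of samples grows, so the binary observations carry the same expected reward as the continuous value.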

23 Current and Future Work
Application to different equivalence notions:
- Failure equivalence
- Ready equivalence
- Barb equivalence, etc.
Experimental analysis on realistic systems
Applying the approach to compute the divergence between:
- HMMs
- POMDPs
- Probabilistic automata
Studying the properties of the divergence
