Presentation on theme: "Keep the Adversary Guessing: Agent Security by Policy Randomization"— Presentation transcript:

1 Keep the Adversary Guessing: Agent Security by Policy Randomization
Praveen Paruchuri, University of Southern California

2 Motivation: The Prediction Game
A police vehicle patrols 4 regions (Regions 1-4). Can you predict the patrol pattern?
[Figure: two example patrol patterns, Pattern 1 and Pattern 2, over the four regions.]
Randomization decreases predictability and increases security.

3 Domains
Police patrolling groups of houses.
Scheduled activities at airports, such as security checks and refueling.
The adversary monitors these activities, so randomized policies are needed.

4 Problem Definition
Problem: security for agents in uncertain, adversarial domains.
Assumptions for the agent/agent team (it has variable information about the adversary):
Part 1: the adversary cannot be modeled; its action/payoff structure is unavailable.
Part 2: the adversary is partially modeled; a probability distribution over adversaries is known.
Assumptions for the adversary: it knows the agents' plan/policy and exploits the action predictability.

5 Outline
Security via randomization, first with no adversary model and then with a partial adversary model.
Contributions: new, efficient algorithms.
No adversary model: randomization + quality constraints in MDPs/Dec-POMDPs.
Partial adversary model: mixed strategies in Bayesian Stackelberg games.

6 No Adversary Model: Solution Technique
Intentional policy randomization for security: an information-minimization game.
MDP/POMDP: sequential decision making under uncertainty (POMDP = Partially Observable Markov Decision Process).
Maintain quality constraints: resource constraints (time, fuel, etc.) and frequency constraints (likelihood of crime, property value).

7 Randomization with quality constraints
Example quality constraint: fuel used < threshold.

8 No Adversary Model: Contributions
Two main contributions.
Single-agent case: a nonlinear program with an entropy-based metric is hard to solve (exponential); converting it to a linear program gives BRLP (Binary Search for Randomization LP).
Multi-agent case: RDR (Rolling Down Randomization), which produces randomized policies for decentralized POMDPs.

9 MDP-based single agent case
An MDP is a tuple <S, A, P, R>: S is the set of states, A the set of actions, P the transition function, and R the reward function.
Basic term: x(s,a) is the expected number of times action a is taken in state s (the flow).
Policy as a function of the MDP flows: pi(a|s) = x(s,a) / sum_a' x(s,a').
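
A minimal sketch of the flow-to-policy relation above; the flow values and state/action names are made up for illustration.

```python
# Derive a stochastic policy from MDP flow variables x(s, a):
#   pi(a | s) = x(s, a) / sum_a' x(s, a')
x = {
    ("s1", "patrol_r1"): 0.6, ("s1", "patrol_r2"): 0.4,
    ("s2", "patrol_r3"): 0.2, ("s2", "patrol_r4"): 0.8,
}

def policy_from_flows(x):
    """Normalize the flows in each state into action probabilities."""
    totals = {}
    for (s, a), v in x.items():
        totals[s] = totals.get(s, 0.0) + v
    return {(s, a): v / totals[s] for (s, a), v in x.items()}

print(policy_from_flows(x))
```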

10 Entropy: Measure of randomness
Randomness or information content is quantified using entropy (Shannon 1948): H = -sum_p p log p.
Entropy for an MDP can be defined two ways: additive entropy adds the entropies of the individual states; weighted entropy weighs each state's entropy by its contribution to the total flow.
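
A small sketch of the two MDP entropy variants described here, composing with the flow-to-policy helper above. The exact weighting formula is not in the transcript, so the weighted version simply follows the slide's description: each state's entropy is weighted by that state's share of the total flow.

```python
import math

def state_entropy(pi, s, actions):
    """H(pi(. | s)) = -sum_a pi(a|s) log pi(a|s)."""
    return -sum(pi[(s, a)] * math.log(pi[(s, a)])
                for a in actions if pi.get((s, a), 0.0) > 0)

def additive_entropy(pi, states, actions):
    """Additive entropy: add up the per-state entropies."""
    return sum(state_entropy(pi, s, actions) for s in states)

def weighted_entropy(pi, x, states, actions):
    """Weighted entropy: weigh each state's entropy by its share of the flow."""
    flow = {s: sum(x.get((s, a), 0.0) for a in actions) for s in states}
    total = sum(flow.values())
    return sum(flow[s] / total * state_entropy(pi, s, actions) for s in states)
```

With the flows and policy from the previous sketch (and the full list of actions), weighted_entropy gives the randomness measure that the reward threshold is traded against in the next slides.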

11 Randomized Policy Generation
Non-linear program: maximize entropy (written as a function of the flows) subject to reward above a threshold; this is an exponential algorithm.
Linearize to obtain a poly-time algorithm: BRLP (Binary Search for Randomization LP).

12 BRLP: Efficient Randomized Policy
Inputs: a reference flow x_hat and a target reward; x_hat can be any high-entropy policy (e.g., the uniform policy).
LP for BRLP: maximize reward while keeping each flow x(s,a) at least beta * x_hat(s,a).
Entropy is controlled with beta: a binary search on beta finds the most random policy that still meets the target reward.
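
The LP itself did not survive the transcript, so the following is only an illustrative sketch of the binary-search idea on a toy 2-state MDP (numpy assumed, all numbers made up): instead of solving an LP, it mixes a high-reward deterministic policy with the uniform one and binary-searches on beta for the most random mix that still meets the target reward.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[5.0, 1.0],                  # R[s, a]
              [2.0, 4.0]])
gamma, start = 0.9, np.array([1.0, 0.0])

def evaluate(pi):
    """Expected discounted reward of a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    R_pi = np.einsum("sa,sa->s", pi, R)
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    return start @ V

greedy = np.array([[1.0, 0.0], [0.0, 1.0]])   # a high-reward deterministic policy
uniform = np.full((2, 2), 0.5)                # maximum-entropy reference policy
target = 0.9 * evaluate(greedy)               # accept a 10% reward loss

# Binary search on beta: larger beta pulls the policy toward the uniform
# (high-entropy) reference, trading reward for randomness.
lo, hi = 0.0, 1.0
for _ in range(30):
    beta = (lo + hi) / 2
    pi = (1 - beta) * greedy + beta * uniform
    lo, hi = (beta, hi) if evaluate(pi) >= target else (lo, beta)
print("beta:", lo, "reward:", evaluate((1 - lo) * greedy + lo * uniform))
```

In BRLP itself, beta instead bounds the flows from below and each probe of the search solves an LP; the binary-search control logic is the same idea.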

13 BRLP in Action
[Figure: policies on an increasing scale of beta, from beta = 0 (deterministic, maximum reward) through beta = 0.5 to beta = 1 (maximum entropy); the binary search selects the beta whose reward matches the target reward.]

14 Results (Averaged over 10 MDPs)
For a given reward threshold:
Highest entropy: weighted entropy, with a 10% average gain over BRLP.
Fastest: BRLP, with a 7-fold average speedup over expected entropy.

15 Multi Agent Case: Problem
Maximize entropy for agent teams subject to a reward threshold.
For the agent team: a decentralized POMDP framework with no communication between agents.
For the adversary: it knows the agents' policy and exploits the action predictability.

16 Policy trees: Deterministic vs Randomized
[Figure: a deterministic policy tree assigns a single action (A1 or A2) after each observation (O1 or O2); a randomized policy tree assigns a probability distribution over actions at each node.]

17 RDR: Rolling Down Randomization
Input: the best (local or global) deterministic policy and the allowed percentage of reward loss.
The d parameter sets the number of turns each agent gets; e.g., d = 0.5 gives 1/d = 2 steps, so each agent gets one turn in the 2-agent case.
Each step is a single-agent MDP problem.

18 RDR: d = 0.5 (M = maximum joint reward)
Agent 1's turn: fix Agent 2's policy and maximize joint entropy subject to joint reward > 90% of M.
Agent 2's turn: fix Agent 1's policy and maximize joint entropy subject to joint reward > 80% of M.
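
A tiny helper reproducing the reward thresholds implied by slides 17-18 (a sketch of the bookkeeping only, not of the entropy-maximization step): with d = 0.5 and a 20% allowed loss, the turns use 90% and then 80% of the maximum joint reward M.

```python
def rdr_thresholds(max_reward, total_loss=0.2, d=0.5):
    """Reward threshold after each RDR turn: the allowed loss is rolled
    down in equal steps across the 1/d turns."""
    steps = int(round(1 / d))
    return [max_reward * (1 - total_loss * (k + 1) / steps) for k in range(steps)]

print(rdr_thresholds(100.0))   # [90.0, 80.0]
```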

19 RDR Details
To derive the single-agent MDP at each step, new transition, observation, and belief-update rules are needed.
Original belief update rule: b'(s') proportional to O(o | s', a) * sum_s P(s' | s, a) * b(s).
New belief update rule: the same update with the fixed teammate's policy folded into the transition and observation terms, so the remaining agent plans against a single-agent model.
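
The formulas on this slide did not survive the transcript; as a reference point, here is the standard single-agent POMDP belief update (the "original" rule), with made-up numbers and an observation model that, for simplicity, depends only on the next state. The RDR-specific rule, which additionally folds the fixed teammate's policy into these terms, is not reproduced here.

```python
import numpy as np

# Standard POMDP belief update:
#   b'(s') proportional to O(o | s') * sum_s P(s' | s, a) * b(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative)
              [[0.3, 0.7], [0.6, 0.4]]])
O = np.array([[0.8, 0.2],                  # O[s', o] (illustrative)
              [0.1, 0.9]])

def belief_update(b, a, o):
    predicted = b @ P[:, a, :]     # sum_s P(s' | s, a) b(s)
    updated = O[:, o] * predicted  # weight by observation likelihood
    return updated / updated.sum() # normalize to a probability distribution

print(belief_update(np.array([0.5, 0.5]), a=0, o=1))
```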

20 Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

21 Security with Partial Adversary Modeled
A police agent patrols a region with many adversaries (robbers) who have different motivations, times, and places.
The model (actions and payoffs) of each adversary is known, along with a probability distribution over adversaries.
This is modeled as a Bayesian Stackelberg game.

22 Bayesian Game
It contains: a set of agents N (police and robbers); a set of types theta_m (police and robber types); a set of strategies sigma_i for each agent i; a probability distribution over types Pi_j : theta_j -> [0,1]; and a utility function U_i : theta_1 x theta_2 x sigma_1 x sigma_2 -> R.

23 Stackelberg Game
The agent is the leader: it commits to a strategy first (the patrol policy).
The adversaries are followers: they observe the patrol patterns and optimize against the leader's fixed strategy.
Payoff matrix (rows: agent/leader actions; columns: adversary/follower actions; entries: agent payoff, adversary payoff):

            Adversary
             a      b
  Agent  a  2,1    4,0
         b  1,0    3,2

The Nash equilibrium is <a,a> with payoffs [2,1]. If the leader commits to the uniform random strategy {.5, .5}, the follower plays b and the payoffs become [3.5, 1].
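
A quick check of the numbers on this slide, as a minimal sketch (numpy is the only assumed dependency): given the committed uniform mix, the follower's best response and the resulting payoffs fall out directly.

```python
import numpy as np

# Payoffs from the slide's 2x2 example: rows = agent (leader) actions a, b;
# columns = adversary (follower) actions a, b.
R = np.array([[2.0, 4.0],   # leader payoffs
              [1.0, 3.0]])
C = np.array([[1.0, 0.0],   # follower payoffs
              [0.0, 2.0]])

leader_mix = np.array([0.5, 0.5])          # the committed uniform strategy
follower_values = leader_mix @ C           # follower's expected payoff per action
best_response = int(np.argmax(follower_values))
print("follower plays column", best_response)           # column 1, i.e. b
print("leader gets", leader_mix @ R[:, best_response])  # 3.5
print("follower gets", follower_values[best_response])  # 1.0
```

Playing b gives the follower 1.0 versus 0.5 for a, so the leader collects 0.5*4 + 0.5*3 = 3.5, matching the [3.5, 1] on the slide.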

24 Previous work: Conitzer, Sandholm AAAI'05, EC'06
MIP-Nash (AAAI'05): an efficient procedure for finding the best Nash equilibrium.
Multiple LPs method (EC'06): given a normal-form game, finds the optimal leader strategy to commit to.
Going from a Bayesian to a normal-form game requires the Harsanyi transformation, which yields exponentially many adversary strategies; the problem is NP-hard.
For every joint pure strategy j of the adversary, solve an LP (R, C: agent and adversary payoff matrices): maximize sum_i R_ij x_i subject to sum_i C_ij x_i >= sum_i C_ij' x_i for all j', sum_i x_i = 1, x >= 0.
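
A sketch of the multiple-LPs idea for a single adversary type, reusing the 2x2 payoffs from the previous slide; scipy is an assumed dependency, and this illustrates the method's structure rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

R = np.array([[2.0, 4.0],    # agent (leader) payoffs
              [1.0, 3.0]])
C = np.array([[1.0, 0.0],    # adversary (follower) payoffs
              [0.0, 2.0]])
n, m = R.shape

best_value, best_x = -np.inf, None
for j in range(m):                       # one LP per follower pure strategy j
    c = -R[:, j]                         # linprog minimizes, so negate
    # Best-response constraints: C[:, j'] @ x <= C[:, j] @ x for every j'
    A_ub = np.array([C[:, jp] - C[:, j] for jp in range(m) if jp != j])
    b_ub = np.zeros(m - 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, 1)] * n)
    if res.status == 0 and -res.fun > best_value:
        best_value, best_x = -res.fun, res.x

print("optimal commitment:", best_x, "leader value:", best_value)
```

On this example the winning LP is the one for follower strategy b, giving the commitment x = (2/3, 1/3) with leader value about 3.67, which improves on the 3.5 obtained from the uniform commitment.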

25 Bayesian Stackelberg Game: Approach
Two approaches:
Heuristic solution: ASAP (Agent Security via Approximate Policies).
Exact solution: DOBSS (Decomposed Optimal Bayesian Stackelberg Solver).
Both give exponential savings: no Harsanyi transformation, no exponential number of LPs, just one MILP (Mixed Integer Linear Program).

26 ASAP vs DOBSS
ASAP (heuristic): controls the probability of each strategy over a discrete probability space, generating k-uniform policies; e.g., k = 3 gives probabilities in {0, 1/3, 2/3, 1}. Simple and easy to implement.
DOBSS (exact): modifies the ASAP algorithm, moving from a discrete to a continuous probability space. Focus of the rest of the talk.

27 DOBSS Details
Previous work: fix an adversary (joint) pure strategy and solve an LP to find the best agent strategy.
My approach: for each agent mixed strategy, find the adversary's best response.
Advantages: a decomposition technique; given the agent's strategy, each adversary can find its best response independently; a mathematical technique then obtains a single MILP.

28 Obtaining the MILP
Decompose the problem by adversary type, then substitute new variables for the products of the agent's strategy and each adversary's response, so the best-response conditions and the objective become linear and the whole problem is a single MILP.
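
A hedged sketch of the single-MILP idea for one adversary type, using PuLP (an assumed dependency) and the 2x2 payoffs from slide 23. Binary variables q_j select the adversary's best response, and z_ij stands in for the product x_i * q_j so the objective stays linear; this mirrors the decompose-and-substitute idea but is not the exact DOBSS formulation.

```python
import pulp

R = [[2.0, 4.0], [1.0, 3.0]]   # agent (leader) payoffs
C = [[1.0, 0.0], [0.0, 2.0]]   # adversary (follower) payoffs
n, m, M = 2, 2, 100.0          # M: big-M constant for the best-response constraints

prob = pulp.LpProblem("stackelberg_commitment", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", 0, 1) for i in range(n)]          # leader mixed strategy
q = [pulp.LpVariable(f"q{j}", cat="Binary") for j in range(m)]  # follower best response
z = [[pulp.LpVariable(f"z{i}{j}", 0, 1) for j in range(m)] for i in range(n)]
a = pulp.LpVariable("a")                                        # follower's value

prob += pulp.lpSum(R[i][j] * z[i][j] for i in range(n) for j in range(m))
prob += pulp.lpSum(x) == 1
prob += pulp.lpSum(q) == 1
for j in range(m):             # q_j = 1 only if column j is a best response
    prob += a - pulp.lpSum(C[i][j] * x[i] for i in range(n)) >= 0
    prob += a - pulp.lpSum(C[i][j] * x[i] for i in range(n)) <= (1 - q[j]) * M
for i in range(n):             # z_ij = x_i * q_j via a standard linearization
    for j in range(m):
        prob += z[i][j] <= x[i]
        prob += z[i][j] <= q[j]
        prob += z[i][j] >= x[i] + q[j] - 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print([pulp.value(v) for v in x], pulp.value(prob.objective))
```

On the slide-23 payoffs this recovers the same commitment as the multiple-LPs sketch, but from a single optimization rather than one LP per adversary pure strategy.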

29 Experiments: Patrolling Domain (security agent and robbers)
The security agent patrols houses; e.g., it visits house a and observes that house and its neighbor.
Plans cover patrols of length 2, giving 6 or 12 agent strategies for 3 or 4 houses.
Robbers can attack any house (3 possible choices for 3 houses); rewards depend on the house and the agent's position.
The joint strategy space of the robbers is exponential, e.g., 3 houses and 10 robbers.
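
The strategy counts on this slide can be reproduced with a small enumeration, assuming the 6 and 12 figures come from ordered length-2 patrols over distinct houses (an assumption consistent with the numbers, not stated explicitly in the transcript).

```python
from itertools import permutations

# Agent strategies: ordered visits to 2 distinct houses.
for houses in (["a", "b", "c"], ["a", "b", "c", "d"]):
    patrols = list(permutations(houses, 2))
    print(len(houses), "houses ->", len(patrols), "patrol strategies")

# Each robber independently picks one of the 3 houses to attack, so the joint
# pure-strategy space for 10 robbers over 3 houses is 3**10.
print("joint robber strategies:", 3 ** 10)
```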

30 Sample Patrolling Domain: 3 & 4 houses
3 houses — multiple LPs: 7 followers; DOBSS: 20 followers.
4 houses — multiple LPs: 6 followers; DOBSS: 12 followers.

31 Conclusion
When the agent cannot model the adversary: intentional randomization algorithms for MDPs/Dec-POMDPs.
When the agent has a partial model of the adversary: an efficient MILP solution for Bayesian Stackelberg games.

32 Vision
Incorporating machine learning; dynamic environments.
Resource-constrained agents, where constraints might be unknown in advance.
Developing real-world applications: police patrolling, airport security.

33 Thank You. Any comments/questions?
