Slide 1: Keep the Adversary Guessing: Agent Security by Policy Randomization
Praveen Paruchuri, University of Southern California

Slide 2: Motivation: The Prediction Game
 A police vehicle patrols 4 regions (Region 1, Region 2, Region 3, Region 4)
 Can you predict the patrol pattern? (Pattern 1 vs. Pattern 2)
 Randomization decreases predictability and increases security

Slide 3: Domains
 Police patrolling groups of houses
 Scheduled activities at airports, such as security checks and refueling
 The adversary monitors these activities, motivating randomized policies

Slide 4: Problem Definition
Problem: security for agents in uncertain adversarial domains.
Assumptions for the agent/agent team:
 Variable information about the adversary
– Adversary cannot be modeled (Part 1): action/payoff structure unavailable
– Adversary is partially modeled (Part 2): probability distribution over adversaries
Assumptions for the adversary:
 Knows the agent's plan/policy
 Exploits the agent's action predictability

Slide 5: Outline
Security via randomization:
 No adversary model: randomization + quality constraints via MDPs/Dec-POMDPs
 Partial adversary model: mixed strategies via Bayesian Stackelberg games
Contributions: new, efficient algorithms

Slide 6: No Adversary Model: Solution Technique
 Intentional policy randomization for security
– An information-minimization game
– MDP/POMDP: sequential decision making under uncertainty (POMDP = Partially Observable Markov Decision Process)
 Maintain quality constraints
– Resource constraints (time, fuel, etc.)
– Frequency constraints (likelihood of crime, property value)

Slide 7: Randomization with Quality Constraints
(Figure: example patrol subject to the quality constraint "fuel used < threshold".)

Slide 8: No Adversary Model: Contributions
Two main contributions:
 Single-agent case:
– Nonlinear program with an entropy-based metric: hard to solve (exponential)
– Converted to a linear program: BRLP (Binary search for Randomization via LP)
 Multi-agent case: RDR (Rolling Down Randomization)
– Randomized policies for decentralized POMDPs

Slide 9: MDP-Based Single-Agent Case
An MDP is a tuple <S, A, P, R>:
 S – set of states
 A – set of actions
 P – transition function
 R – reward function
Basic terms used:
 x(s,a): expected number of times action a is taken in state s
 Policy (as a function of MDP flows): π(a|s) = x(s,a) / Σ_a' x(s,a')
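For context, a minimal sketch of the standard flow formulation behind the x(s,a) variables; the exact constraint set used in the talk is not reproduced in the transcript, so this is the textbook form, assuming an initial state distribution α and discount factor γ:

\begin{align}
\sum_{a} x(s,a) \;-\; \gamma \sum_{s'} \sum_{a'} P(s \mid s', a')\, x(s', a') &= \alpha(s) && \forall s \in S, \\
x(s,a) &\ge 0 && \forall s \in S,\ a \in A,
\end{align}

with expected reward \sum_{s,a} R(s,a)\, x(s,a).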

Slide 10: Entropy: A Measure of Randomness
 Randomness (information content) is quantified using entropy (Shannon 1948)
 Entropy for an MDP:
– Additive entropy: add the entropies of each state
– Weighted entropy: weigh each state's entropy by its contribution to the total flow
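The slide's formulas are not in the transcript; a plausible reconstruction consistent with the definitions above, written in terms of the policy π(a|s) from Slide 9, is:

\begin{align}
H(s) &= -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s), \\
H_A(x) &= \sum_{s} H(s), \qquad
H_W(x) = \sum_{s} \frac{\sum_{a} x(s,a)}{\sum_{s',a'} x(s',a')}\, H(s).
\end{align}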

Slide 11: Randomized Policy Generation
 Nonlinear program: maximize entropy subject to reward above a threshold
– Exponential algorithm
 Linearize to obtain a poly-time algorithm
– BRLP (Binary Search for Randomization via LP)
– Entropy expressed as a function of flows
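A hedged sketch of the nonlinear program this slide refers to (the exact program is not in the transcript): maximize the weighted entropy of Slide 10 over the MDP flows, subject to the flow constraints of Slide 9 and a minimum expected reward E_min:

\begin{align}
\max_{x} \;& H_W(x) \\
\text{s.t. } & x \text{ satisfies the flow constraints of Slide 9}, \\
& \sum_{s,a} R(s,a)\, x(s,a) \;\ge\; E_{\min}.
\end{align}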

Slide 12: BRLP: Efficient Randomized Policy
 Inputs: a base flow x̄ from any high-entropy policy (e.g., the uniform policy) and a target reward
 An LP is solved for each candidate value of β; the amount of entropy is controlled with β (see the sketch below)
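As a rough illustration only (not the authors' implementation), the sketch below assumes the LP(β) form suggested by Slides 12–13: each LP maximizes expected reward over the MDP flows subject to the flow constraints of Slide 9 plus lower bounds x(s,a) ≥ β·x̄(s,a), and a binary search on β finds the most randomized policy that still meets the target reward. All names (solve_lp, brlp, xbar, etc.) are made up for illustration; it uses scipy's linprog.

import numpy as np
from scipy.optimize import linprog

def solve_lp(beta, R, P, alpha, xbar, gamma=0.95):
    """Solve LP(beta): maximize expected reward subject to flow conservation
    and x(s, a) >= beta * xbar(s, a). Returns (reward, flows) or (None, None)."""
    S, A = R.shape                            # R[s, a]; P[s, a, s'] transition probs
    n = S * A
    c = -R.reshape(n)                         # linprog minimizes, so negate rewards
    # Flow conservation: sum_a x(s,a) - gamma * sum_{s',a'} P[s',a',s] x(s',a') = alpha[s]
    A_eq = np.zeros((S, n))
    for s in range(S):
        for a in range(A):
            A_eq[s, s * A + a] += 1.0
        for sp in range(S):
            for ap in range(A):
                A_eq[s, sp * A + ap] -= gamma * P[sp, ap, s]
    # Lower bounds pull the flows toward the high-entropy base flows xbar.
    bounds = [(beta * xb, None) for xb in xbar.reshape(n)]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds, method="highs")
    if not res.success:
        return None, None
    return -res.fun, res.x.reshape(S, A)

def brlp(R, P, alpha, xbar, target_reward, tol=1e-4):
    """Binary search on beta in [0, 1] for the most randomized feasible policy."""
    lo, hi, best = 0.0, 1.0, None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        reward, x = solve_lp(mid, R, P, alpha, xbar)
        if reward is not None and reward >= target_reward:
            best, lo = x, mid                 # target met: push toward more entropy
        else:
            hi = mid                          # target missed: reduce beta
    return best                               # flows; policy = row-normalized flows (Slide 9)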

Slide 13: BRLP in Action
(Figure: a scale of increasing β. At β = 1 the LP returns the maximum-entropy policy; at β = 0 it returns the deterministic, maximum-reward policy; in the example the target reward is reached at β = 0.5.)

Slide 14: Results (Averaged over 10 MDPs)
For a given reward threshold:
 Highest entropy: weighted entropy, with a 10% average gain over BRLP
 Fastest: BRLP, with a 7-fold average speedup over expected entropy

Slide 15: Multi-Agent Case: Problem
Maximize entropy for agent teams subject to a reward threshold.
For the agent team:
 Decentralized POMDP framework
 No communication between agents
For the adversary:
 Knows the agents' policy
 Exploits the action predictability

Slide 16: Policy Trees: Deterministic vs. Randomized
(Figure: two policy trees over actions A1, A2 and observations O1, O2. In the deterministic tree, each observation history maps to a single action; in the randomized tree, each node mixes over A1 and A2.)

Slide 17: RDR: Rolling Down Randomization
Input:
 Best (local or global) deterministic policy
 Allowed percentage of reward loss
 d parameter – number of turns each agent gets
– Example: d = 0.5 => number of steps = 1/d = 2, so each agent gets one turn (in the 2-agent case)
– A single-agent MDP problem is solved at each step

Slide 18: RDR with d = 0.5
Let M be the maximum joint reward; here a 20% reward loss is allowed, rolled down over two turns (a sketch of the loop follows):
 Agent 1's turn: fix Agent 2's policy and maximize joint entropy subject to joint reward > 90% of M
 Agent 2's turn: fix Agent 1's policy and maximize joint entropy subject to joint reward > 80% of M
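A minimal, hypothetical sketch of the alternation depicted on this slide (not the authors' code). The caller supplies optimize_agent(policy, agent, threshold), assumed to fix the other agent's policy and maximize joint entropy for the given agent subject to the joint reward staying above the threshold (Slide 19 outlines how that single-agent problem is derived):

def rdr(joint_policy, max_reward, optimize_agent,
        loss_fraction=0.20, d=0.5, agents=(1, 2)):
    """Roll down from the best deterministic joint policy, spreading the
    allowed reward loss over 1/d turns, one agent per turn."""
    steps = int(round(1.0 / d))                  # d = 0.5 -> 2 steps
    per_step_loss = loss_fraction / steps        # e.g., 10% of max_reward per turn
    threshold = max_reward
    for step in range(steps):
        agent = agents[step % len(agents)]
        threshold -= per_step_loss * max_reward  # M -> 0.9 M -> 0.8 M
        joint_policy = optimize_agent(joint_policy, agent, threshold)
    return joint_policy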

Slide 19: RDR Details
To derive the single-agent MDP at each step:
 New transition, observation, and belief-update rules are needed
 The slide contrasts the original belief-update rule with the new one (the formulas are images on the original slide; a rough sketch follows)
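The belief-update formulas are not recoverable from the transcript, so the following is only a rough sketch under assumed notation. The standard single-agent POMDP update is

b'(s') \;\propto\; O(s', a, o) \sum_{s} P(s' \mid s, a)\, b(s),

and an RDR-style update for agent 1, folding in teammate agent 2's fixed stochastic policy π₂, would plausibly marginalize over the teammate's actions:

b'(s') \;\propto\; \sum_{a_2} \pi_2(a_2)\, O_1\big(s', \langle a_1, a_2 \rangle, o_1\big) \sum_{s} P\big(s' \mid s, \langle a_1, a_2 \rangle\big)\, b(s).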

Slide 20: Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

Slide 21: Security with a Partially Modeled Adversary
 A police agent patrols a region; there are many adversaries (robbers)
– Different motivations, different times and places
 The model (actions and payoffs) of each adversary is known
 A probability distribution over the adversaries is known
 Modeled as a Bayesian Stackelberg game

Slide 22: Bayesian Game
A Bayesian game contains:
 A set of agents N (police and robbers)
 A set of types θ_m (police and robber types)
 A set of strategies σ_i for each agent i
 A probability distribution over types, Π_j: θ_j → [0,1]
 A utility function U_i: θ_1 × θ_2 × σ_1 × σ_2 → R

Slide 23: Stackelberg Game
 The agent is the leader: it commits to a strategy (the patrol policy) first
 The adversaries are followers: they optimize against the leader's fixed strategy
– They observe patrol patterns to leverage that information
Example payoff matrix (rows: Agent, columns: Adversary; entries are agent, adversary payoffs):
         a       b
   a    2,1     4,0
   b    1,0     3,2
 Nash equilibrium: (a, a) with payoffs [2, 1]
 If the leader commits to the uniform random strategy {0.5, 0.5}, the follower plays b, yielding payoffs [3.5, 1]
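A quick worked check of the numbers on this slide (illustrative Python, not part of the talk):

# Payoffs are (agent, adversary); the agent is the row player, the adversary the column player.
agent_payoff = {("a", "a"): 2, ("a", "b"): 4, ("b", "a"): 1, ("b", "b"): 3}
adv_payoff   = {("a", "a"): 1, ("a", "b"): 0, ("b", "a"): 0, ("b", "b"): 2}

leader_mix = {"a": 0.5, "b": 0.5}        # committed uniform random strategy

def follower_value(col):
    """Adversary's expected payoff for playing `col` against the committed mix."""
    return sum(p * adv_payoff[(row, col)] for row, p in leader_mix.items())

best_col = max(("a", "b"), key=follower_value)
leader_value = sum(p * agent_payoff[(row, best_col)] for row, p in leader_mix.items())
print(best_col, leader_value, follower_value(best_col))   # -> b 3.5 1.0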

Slide 24: Previous Work: Conitzer & Sandholm (AAAI'05, EC'06)
 MIP-Nash (AAAI'05): efficient procedure for finding the best Nash equilibrium
 Multiple-LPs method (EC'06): given a normal-form game, finds the optimal leader strategy to commit to
 Bayesian to normal-form game: the Harsanyi transformation creates exponentially many adversary strategies; the problem is NP-hard
 One LP is solved for every joint pure strategy j of the adversary (R, C: agent and adversary payoff matrices)
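For reference, the LP solved for each follower pure strategy j in the multiple-LPs method takes the standard form below, where x is the leader's mixed strategy and R, C are the leader's and follower's payoff matrices:

\begin{align}
\max_{x} \;& \sum_{i} R_{ij}\, x_i \\
\text{s.t. } & \sum_{i} C_{ij}\, x_i \;\ge\; \sum_{i} C_{ij'}\, x_i && \forall j', \\
& \sum_{i} x_i = 1, \qquad x_i \ge 0,
\end{align}

and the leader keeps the best solution over all j for which the LP is feasible.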

Slide 25: Bayesian Stackelberg Game: Approach
Two approaches:
 1. Heuristic solution – ASAP: Agent Security via Approximate Policies
 2. Exact solution – DOBSS: Decomposed Optimal Bayesian Stackelberg Solver
Exponential savings:
– No Harsanyi transformation
– No exponential number of LPs: a single MILP (Mixed Integer Linear Program)

Slide 26: ASAP vs. DOBSS
ASAP (heuristic):
 Controls the probability of each strategy over a discrete probability space
 Generates k-uniform policies, e.g., k = 3 => probabilities in {0, 1/3, 2/3, 1} (see the enumeration sketch below)
 Simple and easy to implement
DOBSS (exact):
 Modifies the ASAP algorithm, moving from a discrete to a continuous probability space
 Focus of the rest of the talk
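An illustrative sketch (not from the talk) of the discrete space ASAP searches: mixed strategies whose probabilities are multiples of 1/k. For two pure strategies and k = 3 this yields (0, 1), (1/3, 2/3), (2/3, 1/3), (1, 0):

from itertools import product
from fractions import Fraction

def k_uniform_strategies(num_pure, k):
    """Yield all probability vectors over `num_pure` strategies with entries i/k."""
    for counts in product(range(k + 1), repeat=num_pure):
        if sum(counts) == k:
            yield tuple(Fraction(c, k) for c in counts)

print(list(k_uniform_strategies(2, 3)))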

Slide 27: DOBSS Details
Previous work:
 Fix an adversary (joint) pure strategy, then solve an LP to find the best agent strategy
My approach:
 For each agent mixed strategy, find the adversary's best response
Advantages:
 Decomposition: given the agent's strategy, each adversary type can find its best response independently
 A mathematical reformulation yields a single MILP

Slide 28: Obtaining the MILP
The slide derives the MILP in two steps, decomposing over adversary types and then substituting to linearize; the formulas themselves are not in the transcript (a hedged reconstruction follows).
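The derivation is not reproduced in the transcript; the following is a hedged reconstruction of the decomposed program that DOBSS starts from, under assumed notation (p^l: prior over adversary types, x: agent's mixed strategy, q^l: pure best response of type l, a^l: that type's best-response value, M: a large constant):

\begin{align}
\max_{x,\, q,\, a} \;& \sum_{l} p^l \sum_{i} \sum_{j} R^l_{ij}\, x_i\, q^l_j \\
\text{s.t. } & \sum_{i} x_i = 1, \qquad x_i \ge 0, \\
& \sum_{j} q^l_j = 1, \qquad q^l_j \in \{0, 1\} && \forall l, \\
& 0 \;\le\; a^l - \sum_{i} C^l_{ij}\, x_i \;\le\; (1 - q^l_j)\, M && \forall l, j.
\end{align}

Substituting z^l_{ij} = x_i q^l_j then removes the products of variables and yields the single MILP referred to on Slide 25.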

Slide 29: Experiments: Domain
Patrolling domain with a security agent and robbers:
 The security agent patrols houses, e.g., visits house a and observes that house and its neighbor
– Plans cover patrols of length 2: 6 strategies for 3 houses, 12 for 4 houses
 Robbers can attack any house
– 3 possible choices each with 3 houses
– Rewards depend on the house and the agent's position
– The joint strategy space of the robbers is exponential, e.g., 3^10 joint strategies for 3 houses and 10 robbers

Slide 30: Sample Patrolling Domain: 3 and 4 Houses
(Figure: results for the two methods. 3 houses – multiple LPs: 7 followers, DOBSS: 20 followers; 4 houses – multiple LPs: 6 followers, DOBSS: 12 followers.)

Slide 31: Conclusion
 When the agent cannot model the adversary: intentional randomization algorithms for MDPs/Dec-POMDPs
 When the agent has a partial model of the adversary: an efficient MILP solution for Bayesian Stackelberg games

Slide 32: Vision
 Incorporating machine learning for dynamic environments
 Resource-constrained agents, where the constraints might be unknown in advance
 Developing real-world applications: police patrolling, airport security

Slide 33: Thank You
Any comments/questions?
