Keep the Adversary Guessing: Agent Security by Policy Randomization

Presentation transcript:

Keep the Adversary Guessing: Agent Security by Policy Randomization
Praveen Paruchuri, University of Southern California
paruchur@usc.edu

Motivation: The Prediction Game
A police vehicle patrols 4 regions (Region 1 through Region 4). Can you predict the patrol pattern? [Slide shows two example patrol patterns.] Randomization decreases predictability and increases security.

Domains
- Police patrolling groups of houses
- Scheduled activities at airports, such as security checks and refueling
- The adversary monitors these activities
- Solution: randomized policies

Problem Definition
Problem: security for agents in uncertain adversarial domains.
Assumptions for the agent/agent team: variable information about the adversary.
- The adversary cannot be modeled (Part 1): action/payoff structure unavailable.
- The adversary is partially modeled (Part 2): probability distribution over adversaries.
Assumptions for the adversary:
- Knows the agents' plan/policy.
- Exploits the predictability of the agents' actions.

Outline
Security via randomization:
- No adversary model
- Partial adversary model
Contributions: new, efficient algorithms
- Randomization + quality constraints in MDPs/Dec-POMDPs
- Mixed strategies for Bayesian Stackelberg games

No Adversary Model: Solution Technique
- Intentional policy randomization for security: an information-minimization game.
- MDP/POMDP: sequential decision making under uncertainty (POMDP = Partially Observable Markov Decision Process).
- Maintain quality constraints:
  - Resource constraints (time, fuel, etc.)
  - Frequency constraints (likelihood of crime, property value)

Randomization with quality constraints: for example, fuel used < threshold.

No Adversary Model: Contributions
Two main contributions:
- Single-agent case:
  - Nonlinear program with an entropy-based objective: hard to solve (exponential).
  - Convert to a linear program: BRLP (binary search for randomization).
- Multi-agent case:
  - RDR (Rolling Down Randomization): randomized policies for decentralized POMDPs.

MDP-Based Single-Agent Case
An MDP is a tuple <S, A, P, R>:
- S: set of states
- A: set of actions
- P: transition function
- R: reward function
Basic terms used:
- x(s,a): expected number of times action a is taken in state s
- Policy (as a function of the MDP flows): pi(a|s) = x(s,a) / sum_a' x(s,a')
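As a small illustration (hypothetical code, not from the talk), the flow variables x(s,a) determine the randomized policy by per-state normalization:

```python
# Minimal sketch: recovering a randomized policy from MDP flow variables
# x(s, a), assumed here to be a NumPy array of shape (num_states, num_actions).
import numpy as np

def policy_from_flows(x):
    """pi(a|s) = x(s,a) / sum_a' x(s,a'); uniform where a state has zero flow."""
    totals = x.sum(axis=1, keepdims=True)
    return np.divide(x, totals, out=np.full_like(x, 1.0 / x.shape[1]), where=totals > 0)

# Example: two states, two actions.
x = np.array([[0.6, 0.2],
              [0.0, 0.0]])
print(policy_from_flows(x))   # [[0.75, 0.25], [0.5, 0.5]]
```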

Entropy: A Measure of Randomness
- Randomness (information content) is quantified using entropy (Shannon, 1948).
- Entropy for an MDP:
  - Additive entropy: add the entropies of each state.
  - Weighted entropy: weight each state's entropy by its contribution to the total flow.
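For concreteness, one way to write the two entropy measures above in terms of the flow variables x(s,a) (a sketch; the exact normalization used in the talk may differ):

```latex
% Per-state policy entropy, with \pi(a \mid s) = x(s,a) / \sum_{a'} x(s,a'):
H(s) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)

% Additive entropy: sum over states
H_A = \sum_{s} H(s)

% Weighted entropy: each state weighted by its share of the total flow
H_W = \sum_{s} \frac{\sum_{a} x(s,a)}{\sum_{s',a'} x(s',a')} \, H(s)
```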

Randomized Policy Generation
- Non-linear program: maximize entropy (expressed as a function of the flows), keeping reward above a threshold.
- Exponential-time in general.
- Linearize to obtain a poly-time algorithm: BRLP (Binary Search for Randomization LP).
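One way to write the kind of non-linear program described here (a sketch; alpha denotes the initial state distribution and gamma the discount factor, both names assumed for illustration):

```latex
\max_{x \ge 0} \;\; H_W(x)
\quad \text{s.t.} \quad
\sum_{a} x(s,a) - \gamma \sum_{s',a'} P(s \mid s',a')\, x(s',a') = \alpha(s) \;\; \forall s,
\qquad
\sum_{s,a} R(s,a)\, x(s,a) \;\ge\; E_{\min}.
```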

BRLP: Efficient Randomized Policy
- Inputs: a baseline high-entropy policy (e.g., the uniform policy) with flows x-hat, and a target reward.
- LP for BRLP: maximize expected reward subject to the MDP flow constraints and x(s,a) >= beta * x-hat(s,a).
- Entropy is controlled with the parameter beta.

BRLP in Action
As beta increases from 0 to 1, the policy moves from deterministic (maximum reward) toward maximum entropy; binary search on beta finds the policy that meets the target reward. [Slide illustrates the beta scale: beta = 0 gives the deterministic max-reward policy, beta = 1 gives the max-entropy policy, and intermediate values such as beta = 0.5 trade reward for entropy.]
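A minimal runnable sketch of the BRLP idea, assuming a small discounted MDP given as a transition tensor P[s, a, s'], rewards R[s, a], start distribution alpha, and discount gamma; the use of scipy's linprog and all function names are illustrative choices, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def uniform_policy_flows(P, alpha, gamma):
    """Occupancy measure x_hat(s, a) of the uniform (maximum-entropy) policy."""
    S, A, _ = P.shape
    P_u = P.mean(axis=1)                                    # state-to-state transitions under the uniform policy
    d = np.linalg.solve(np.eye(S) - gamma * P_u.T, alpha)   # discounted visitation: d = alpha + gamma * P_u^T d
    return np.outer(d, np.ones(A)) / A                      # spread each state's flow evenly over actions

def solve_lp(P, R, alpha, gamma, x_hat, beta):
    """LP(beta): maximize expected reward with flows bounded below by beta * x_hat."""
    S, A, _ = P.shape
    # Flow conservation: sum_a x(s,a) - gamma * sum_{s',a} P(s | s', a) x(s', a) = alpha(s)
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for sp in range(S):
            for a in range(A):
                A_eq[s, sp * A + a] = (1.0 if sp == s else 0.0) - gamma * P[sp, a, s]
    bounds = [(beta * x_hat[s, a], None) for s in range(S) for a in range(A)]
    res = linprog(-R.flatten(), A_eq=A_eq, b_eq=alpha, bounds=bounds, method="highs")
    return (-res.fun, res.x.reshape(S, A)) if res.success else (None, None)

def brlp(P, R, alpha, gamma, target_reward, iters=20):
    """Binary search on beta: keep the largest beta whose LP still meets the reward target."""
    x_hat = uniform_policy_flows(P, alpha, gamma)
    _, best = solve_lp(P, R, alpha, gamma, x_hat, 0.0)      # fallback: pure reward maximization
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        beta = (lo + hi) / 2
        reward, x = solve_lp(P, R, alpha, gamma, x_hat, beta)
        if reward is not None and reward >= target_reward:
            lo, best = beta, x                              # feasible: push entropy (beta) higher
        else:
            hi = beta
    return best     # flows; the policy is best / best.sum(axis=1, keepdims=True)
```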

Results (Averaged over 10 MDPs)
For a given reward threshold:
- Highest entropy: Weighted Entropy, a 10% average gain over BRLP.
- Fastest: BRLP, a 7-fold average speedup over Expected Entropy.

Multi-Agent Case: Problem
Maximize entropy for agent teams subject to a reward threshold.
For the agent team:
- Decentralized POMDP framework
- No communication between agents
For the adversary:
- Knows the agents' policy
- Exploits the predictability of their actions

Policy Trees: Deterministic vs. Randomized
[Slide shows two policy trees over observations O1, O2 and actions A1, A2: a deterministic policy tree selects a single action at each node, while a randomized policy tree assigns a probability distribution over actions at each node.]
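To make the distinction concrete, a hypothetical data-structure sketch (names and types are illustrative, not from the talk):

```python
# Hypothetical sketch: policy-tree nodes for a finite-horizon (Dec-)POMDP agent.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DeterministicNode:
    action: str                                   # single action taken at this node
    children: Dict[str, "DeterministicNode"] = field(default_factory=dict)   # next node per observation

@dataclass
class RandomizedNode:
    action_dist: Dict[str, float]                 # probability of each action at this node
    children: Dict[str, "RandomizedNode"] = field(default_factory=dict)      # next node per observation

# Example: act A1 or A2 with equal probability, then lean toward A1 after observing O1.
root = RandomizedNode(action_dist={"A1": 0.5, "A2": 0.5})
root.children["O1"] = RandomizedNode(action_dist={"A1": 0.9, "A2": 0.1})
```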

RDR: Rolling Down Randomization
- Input: the best (local or global) deterministic policy and the percent of reward loss allowed.
- Parameter d determines the number of turns each agent gets. Example: d = 0.5 => number of steps = 1/d = 2, so each agent gets one turn (in the 2-agent case).
- A single-agent MDP problem is solved at each step.

RDR with d = 0.5 (M = maximum joint reward):
- Agent 1's turn: fix Agent 2's policy, maximize joint entropy subject to joint reward > 90% of M.
- Agent 2's turn: fix Agent 1's policy, maximize joint entropy subject to joint reward > 80% of M.
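The alternating control flow of RDR might be sketched as below; solve_single_agent_step is a hypothetical placeholder for the augmented single-agent optimization the talk describes, and the reward schedule simply reproduces the 90%/80% example above:

```python
# Hypothetical skeleton of RDR's control flow (a sketch, not the paper's code).
def rdr(joint_policy, max_reward, reward_loss, d, solve_single_agent_step):
    """
    joint_policy: dict agent id -> policy, starting from the best deterministic joint policy
    max_reward:   M, the joint reward of the input deterministic policy
    reward_loss:  total fraction of M we are willing to give up for entropy (e.g., 0.2)
    d:            fraction "rolled down" per turn; 1/d turns overall
    solve_single_agent_step(fixed_policies, free_agent, reward_floor) -> new policy for free_agent
    """
    num_steps = int(round(1.0 / d))
    agents = list(joint_policy.keys())
    for step in range(1, num_steps + 1):
        free_agent = agents[(step - 1) % len(agents)]          # agents take turns
        floor = max_reward * (1.0 - reward_loss * d * step)    # e.g., 90% of M, then 80% of M
        fixed = {a: p for a, p in joint_policy.items() if a != free_agent}
        # Maximize joint entropy for the free agent, holding the others fixed,
        # subject to the joint expected reward staying above `floor`.
        joint_policy[free_agent] = solve_single_agent_step(fixed, free_agent, floor)
    return joint_policy
```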

RDR Details
To derive the single-agent MDP, new transition, observation, and belief-update rules are needed: the original (single-agent POMDP) belief update is replaced by a multiagent belief update that accounts for the teammate's fixed randomized policy.
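For reference, a sketch of the two update rules (the first is the standard single-agent POMDP Bayes update; the second shows the general shape of a multiagent belief over state and teammate observation history e_2 when the teammate's randomized policy pi_2 is fixed; the exact notation in the talk may differ):

```latex
% Standard POMDP belief update after taking action a and observing o:
b'(s') \;\propto\; O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)

% Multiagent belief over (state, teammate observation history) when agent 2's
% randomized policy \pi_2 is fixed and agent 1 takes a_1 and observes o_1:
b'(s', e_2 \!\cdot\! o_2) \;\propto\; \sum_{s} \sum_{a_2} b(s, e_2)\, \pi_2(a_2 \mid e_2)\,
P(s' \mid s, a_1, a_2)\, O(o_1, o_2 \mid s', a_1, a_2)
```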

Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

Security with a Partially Modeled Adversary
- A police agent patrols a region; there are many adversaries (robbers) with different motivations, times, and places.
- The model (actions and payoffs) of each adversary is known, along with a probability distribution over adversaries.
- Modeled as a Bayesian Stackelberg game.

Bayesian Game
It contains:
- A set of agents N (police and robbers)
- A set of types θ_m (police and robber types)
- A set of strategies σ_i for each agent i
- A probability distribution over types, Π_j : θ_j -> [0, 1]
- A utility function U_i : θ_1 × θ_2 × σ_1 × σ_2 -> R
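An illustrative container for such a game (hypothetical types and field names, chosen to match the R, C payoff notation used later in the talk):

```python
# Illustrative sketch (not from the talk): a two-player Bayesian game between a
# security agent (leader) and an adversary that may be one of several types.
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class BayesianSecurityGame:
    leader_strategies: List[str]            # sigma_1: leader pure strategies
    follower_strategies: List[str]          # sigma_2: follower pure strategies
    type_prior: Dict[str, float]            # Pi_j : theta_j -> [0, 1]
    leader_payoff: Dict[str, np.ndarray]    # per type l: R^l[i, j], leader payoffs
    follower_payoff: Dict[str, np.ndarray]  # per type l: C^l[i, j], follower payoffs
```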

Stackelberg Game
The agent is the leader: it commits to a strategy first (the patrol policy). The adversaries are followers: they observe the patrol patterns and optimize against the leader's fixed strategy.

Payoff matrix (rows: agent, columns: adversary):
         a      b
  a     2,1    4,0
  b     1,0    3,2

Nash equilibrium: <a,a> with payoffs [2,1]. If the leader commits to the uniform random strategy {0.5, 0.5}, the follower plays b, yielding payoffs [3.5, 1].
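A quick numeric check of the example, using the payoff matrix above (code and variable names are illustrative):

```python
# Verify the slide's example: committing to the uniform mixed strategy beats the Nash outcome.
import numpy as np

R = np.array([[2, 4], [1, 3]])   # leader payoffs: rows = leader action (a, b), cols = follower action (a, b)
C = np.array([[1, 0], [0, 2]])   # follower payoffs

x = np.array([0.5, 0.5])         # leader commits to uniform randomization over a, b
follower_payoffs = x @ C         # follower's expected payoff for each pure response
j = int(np.argmax(follower_payoffs))          # best response: action b (index 1)
print(j, x @ R[:, j], follower_payoffs[j])    # 1, 3.5, 1.0  -- vs. Nash outcome (2, 1)
```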

Previous Work: Conitzer & Sandholm (AAAI'05, EC'06)
- MIP-Nash (AAAI'05): efficient procedure for finding the best Nash equilibrium.
- Multiple-LPs method (EC'06): given a normal-form game, finds the optimal leader strategy to commit to. For every joint pure strategy j of the adversary, solve an LP that maximizes the leader's expected payoff subject to j being the adversary's best response (R, C: agent and adversary payoff matrices).
- Going from a Bayesian to a normal-form game requires the Harsanyi transformation, which yields exponentially many adversary strategies; the problem is NP-hard.
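A sketch of the Multiple-LPs idea for a single adversary type, using scipy's linprog and the same R, C matrices as in the previous sketch (illustrative; the original method operates on the Harsanyi-transformed game with joint adversary strategies):

```python
# Sketch of the Multiple-LPs method: for each follower pure strategy j, find the
# best leader mixed strategy x that makes j a best response; keep the best j.
import numpy as np
from scipy.optimize import linprog

def optimal_commitment(R, C):
    """R[i, j]: leader payoff, C[i, j]: follower payoff. Returns (value, x)."""
    n_lead, n_fol = R.shape
    best_value, best_x = -np.inf, None
    for j in range(n_fol):
        # Incentive constraints: x @ C[:, j] >= x @ C[:, jp] for all jp != j
        A_ub = np.array([C[:, jp] - C[:, j] for jp in range(n_fol) if jp != j])
        b_ub = np.zeros(len(A_ub))
        A_eq, b_eq = np.ones((1, n_lead)), np.array([1.0])   # x is a probability distribution
        res = linprog(-R[:, j], A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, 1)] * n_lead, method="highs")
        if res.success and -res.fun > best_value:
            best_value, best_x = -res.fun, res.x
    return best_value, best_x

R = np.array([[2, 4], [1, 3]])
C = np.array([[1, 0], [0, 2]])
print(optimal_commitment(R, C))   # value ~3.67 at x = (2/3, 1/3), beating the uniform strategy's 3.5
```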

Bayesian Stackelberg Game: Approach
Two approaches:
- Heuristic solution: ASAP (Agent Security via Approximate Policies).
- Exact solution: DOBSS (Decomposed Optimal Bayesian Stackelberg Solver).
Exponential savings: no Harsanyi transformation, no exponential number of LPs; one MILP (mixed-integer linear program).

ASAP vs. DOBSS
ASAP (heuristic):
- Controls the probability of each strategy over a discrete probability space.
- Generates k-uniform policies; e.g., k = 3 => probabilities in {0, 1/3, 2/3, 1}.
- Simple and easy to implement.
DOBSS (exact):
- Modifies the ASAP algorithm from a discrete to a continuous probability space.
- Focus of the rest of the talk.

DOBSS Details
Previous work: fix an adversary (joint) pure strategy, then solve an LP to find the best agent strategy.
My approach: for each agent mixed strategy, find the adversary's best response.
Advantages:
- Decomposition technique: given the agent's strategy, each adversary type can find its best response independently.
- A mathematical technique yields a single MILP.

Obtaining the MILP
Decompose the problem by adversary type, then substitute the products of variables (leader strategy times follower response) with new variables to linearize the program.
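A sketch of the decomposed formulation and the substitution step, for leader mixed strategy x, follower type l with prior p^l, binary response variables q^l_j, follower value a^l, and a large constant M (notation assumed here; the full DOBSS MILP in the paper has additional linking constraints):

```latex
% Decomposed MIQP (one block per follower type l, weighted by the prior p^l):
\max_{x,\, q,\, a} \;\; \sum_{l} p^l \sum_{i} \sum_{j} R^l_{ij}\, x_i\, q^l_j
\quad \text{s.t.} \quad
\sum_{i} x_i = 1, \qquad \sum_{j} q^l_j = 1 \;\; \forall l,
\qquad
0 \;\le\; a^l - \sum_{i} C^l_{ij}\, x_i \;\le\; (1 - q^l_j)\, M \;\; \forall j, l .

% The substitution z^l_{ij} = x_i q^l_j replaces the bilinear terms in the
% objective with linear ones, yielding a single mixed-integer linear program.
```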

Experiments: Domain
Patrolling domain with a security agent and robbers:
- The security agent patrols houses; e.g., visit house a, then observe that house and its neighbor.
- Plans have patrol length 2, giving 6 or 12 agent strategies for 3 or 4 houses (3x2 = 6 ordered pairs, 4x3 = 12).
- Robbers can attack any house: 3 possible choices with 3 houses.
- Rewards depend on the house and the agent's position.
- The joint strategy space of the robbers is exponential: with 3 houses and 10 robbers, 3^10 = 59,049 joint pure strategies.

Runtime Results: Sample Patrolling Domain, 3 and 4 Houses
[Chart comparing the Multiple-LPs method and DOBSS as the number of robber (follower) types grows. 3 houses: Multiple LPs, 7 followers; DOBSS, 20. 4 houses: Multiple LPs, 6 followers; DOBSS, 12.]

Conclusion
- When the agent cannot model the adversary: intentional randomization algorithms for MDPs/Dec-POMDPs.
- When the agent has a partial model of the adversary: an efficient MILP solution for Bayesian Stackelberg games.

Vision
- Incorporating machine learning for dynamic environments.
- Resource-constrained agents, where the constraints might be unknown in advance.
- Developing real-world applications: police patrolling, airport security.

Thank You. Any comments/questions?