Lihan He, Shihao Ji and Lawrence Carin ECE, Duke University Combining Exploration and Exploitation in Landmine Detection

Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

Landmine detection (1)
- Landmine detection:
  - by robots instead of human beings;
  - the underlying model controlling the robot is a POMDP.
- Multiple sensors:
  - a single sensor is sensitive to only certain types of objects (EMI sensor: conductivity; GPR sensor: dielectric properties; seismic sensor: mechanical properties);
  - multiple complementary sensors improve detection performance.

Landmine detection (2)
Statement of the problem:
- given a minefield where landmines and clutter are buried underground;
- two types of sensors are available: an EMI sensor and a GPR sensor;
- sensing has a cost;
- correct / incorrect declarations carry a corresponding reward / penalty.
How can we develop a strategy that finds the landmines in this minefield at minimal cost?
Questions within this:
- How do we optimally choose sensing positions in the field, so as to use as few sensing points as possible to find the landmines?
- How do we optimally choose the sensor at each sensing position?
- When should we sense and when should we declare?

Landmine detection (3)
Solution sketch:
- A partially observable Markov decision process (POMDP) model is built to solve this problem, since it provides a way to select actions (sensor deployment, sensing positions and declarations) optimally, based on maximizing reward / minimizing cost.
- Lifelong learning:
  - the robot learns the model while it moves and senses in the minefield (combining exploration and exploitation);
  - the model is updated based on the exploration process.

Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

POMDP (1)
POMDP = HMM + controllable actions + rewards.
A POMDP is a model of an agent interacting synchronously with its environment. The agent takes as input observations of the environment, estimates its state from the observed information, and then outputs actions according to its policy. Over the repeated observation-action loop, the agent seeks maximal reward or, equivalently, minimal cost.
[Diagram: agent-environment loop - the environment emits an observation, the agent performs state estimation to maintain a belief b, and the policy maps b to an action sent back to the environment.]

POMDP (2)
A POMDP model is defined by the tuple <S, A, Ω, T, O, R>:
- S is a finite set of discrete states of the environment.
- A is a finite set of discrete actions.
- Ω is a finite set of discrete observations providing noisy state information.
- T: S × A → Π(S) is the state-transition function: T(s, a, s') is the probability of transitioning from state s to s' when taking action a.
- O: S × A → Π(Ω) is the observation function: O(s', a, o) is the probability of receiving observation o after taking action a and landing in state s'.
- R: S × A → ℝ is the reward function: R(s, a) is the expected reward the agent receives by taking action a in state s.
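As a concrete, hedged illustration (not the authors' code), the tuple can be stored as plain arrays. The state and action counts below come from later slides (29 states, 15 actions); the observation count and discount factor are placeholders.

```python
import numpy as np

class POMDP:
    """Minimal container for a discrete POMDP <S, A, Omega, T, O, R>; sizes are illustrative."""
    def __init__(self, n_states, n_actions, n_obs, gamma=0.95):
        # T[a, s, s'] = P(s' | s, a): state-transition probabilities
        self.T = np.full((n_actions, n_states, n_states), 1.0 / n_states)
        # O[a, s', o] = P(o | s', a): observation probabilities
        self.O = np.full((n_actions, n_states, n_obs), 1.0 / n_obs)
        # R[s, a]: expected immediate reward for taking action a in state s
        self.R = np.zeros((n_states, n_actions))
        self.gamma = gamma  # discount factor (assumed value)

# 29 states and 15 actions are quoted later in the talk; n_obs here is a guess
model = POMDP(n_states=29, n_actions=15, n_obs=8)
```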

POMDP (3)
Belief state b:
- the agent's belief about which state it is currently in;
- a probability distribution over the states S;
- a summary of past information;
- updated at each step by Bayes' rule, based on the latest action a and observation o and the previous belief state:
  b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s)
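A minimal sketch of this Bayes-rule update, using the array layout from the container sketch above (T[a, s, s'], O[a, s', o]):

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Bayes-rule belief update: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[a].T @ b            # sum over s of P(s' | s, a) b(s), shape (n_states,)
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()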

POMDP (4)
Policy:
- a mapping from belief states to actions;
- tells the agent which action to take given its current belief state.
Optimal policy:
- maximizes the expected discounted reward over the horizon, i.e. the immediate reward plus the discounted future reward:
  V*(b) = max_a [ R(b, a) + γ Σ_o P(o | b, a) V*(b'_{a,o}) ],  with R(b, a) = Σ_s b(s) R(s, a)
- V*(b) is piecewise linear and convex in the belief state (Sondik, 1971);
- V*(b) is represented by a set of |S|-dimensional vectors {α_1*, …, α_m*}, so that V*(b) = max_i α_i*·b.
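A small sketch of how the alpha-vector representation is used (illustrative code, not the authors'): the value of a belief is the largest dot product with the alpha vectors, and the maximizing vector's associated action gives the greedy policy choice.

```python
import numpy as np

def value_and_action(b, alphas, actions):
    """V*(b) = max_i alpha_i . b; `alphas` has shape (m, n_states), `actions` maps each alpha to its action."""
    scores = alphas @ b            # one dot product per alpha vector
    best = int(np.argmax(scores))
    return scores[best], actions[best]
```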

POMDP (5)
Policy learning:
- solve for the vectors {α_1*, …, α_m*};
- point-based value iteration (PBVI) algorithm;
- iteratively updates the vectors α and the value V for a set of sampled belief points: the solution one step from the horizon is computed first, and the vectors (n+1) steps from the horizon are computed from the results of the n-step vectors.
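A compact sketch of one point-based backup in the spirit of PBVI, using the array shapes from the container sketch above. This is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def pbvi_backup(beliefs, alphas, T, O, R, gamma):
    """One PBVI iteration: back up the alpha-vector set at each sampled belief point.
    alphas: (m, n_states); T[a, s, s']; O[a, s', o]; R[s, a]."""
    n_actions, n_states, _ = T.shape
    n_obs = O.shape[2]
    new_alphas = []
    for b in beliefs:
        best_val, best_vec = -np.inf, None
        for a in range(n_actions):
            vec = R[:, a].copy()
            for o in range(n_obs):
                # g[s, i] = gamma * sum_{s'} O(o | s', a) T(s' | s, a) alpha_i(s')
                g = gamma * (T[a] * O[a, :, o]) @ alphas.T
                vec += g[:, np.argmax(b @ g)]   # keep the alpha maximizing b . g for this observation
            val = b @ vec
            if val > best_val:
                best_val, best_vec = val, vec
        new_alphas.append(best_vec)
    return np.unique(np.array(new_alphas), axis=0)   # drop duplicate vectors
```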

Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

Model definition (1)
Feature extraction – EMI sensor:
- a parametric EMI model is fit to the sensor measurements;
- the model parameters, extracted by a nonlinear fitting method, form the EMI feature vector.
[The slide shows the EMI model equation, example sensor measurements, and the extracted model parameters.]

Model definition (2)
Feature extraction – GPR sensor:
- raw moments of the waveform – energy features;
- central moments – variance and asymmetry of the wave.
[The slide shows a GPR image with axes time vs. down-track position.]
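As an illustration only (the talk's exact feature definitions are not reproduced here), moment features of a 1-D GPR waveform might be computed as follows:

```python
import numpy as np

def gpr_moment_features(waveform):
    """Toy sketch of moment-based features from a 1-D GPR waveform (assumed form)."""
    x = np.abs(np.asarray(waveform, dtype=float))
    t = np.arange(len(x))
    energy = x.sum() + 1e-12                               # raw moment ~ energy feature
    mean_t = (t * x).sum() / energy                        # first raw moment (centroid)
    var_t = ((t - mean_t) ** 2 * x).sum() / energy         # central moment: spread of the wave
    skew_t = ((t - mean_t) ** 3 * x).sum() / (energy * var_t ** 1.5)   # asymmetry
    return np.array([energy, mean_t, var_t, skew_t])
```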

Model definition (3)
Definition: observation set Ω
- EMI and GPR feature vectors are vector-quantized into an EMI codebook and a GPR codebook; the union of the two codebooks forms Ω.
Definition: state set S
[The slide shows a buried target below the ground surface, with the surrounding region partitioned into a 3×3 grid of states s1–s9.]
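A sketch of the codebook construction, using k-means as a stand-in vector quantizer (the talk does not specify the quantization algorithm); the codebook size is chosen by the model selection described on the next slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(feature_vectors, codebook_size):
    """Quantize continuous EMI or GPR feature vectors into a discrete codebook (sketch)."""
    return KMeans(n_clusters=codebook_size, n_init=10).fit(feature_vectors)

def quantize(codebook, feature_vector):
    """Map one feature vector to its discrete observation symbol."""
    return int(codebook.predict(np.atleast_2d(feature_vector))[0])
```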

Model definition (4)
Estimate |S| and |Ω|:
- variational Bayesian (VB) expectation-maximization (EM) method for model selection;
- Bayesian learning; the criterion is to compare the model evidence (marginal likelihood):
  posterior = (likelihood × prior) / evidence

Model definition (5)
Estimate |S| and |Ω| – candidate models:
- HMMs with two sets of observations (EMI and GPR measurements at each sensing point), built from vertical and horizontal sensing sequences over the underlying states;
- candidate sizes |S| = 1, 5, 9, 13, … and |Ω| = 2, 3, 4, …
[The slide illustrates the candidate state-space structures for |S| = 1, 5, 9 and 13.]

Model definition (6)
Estimate |S| and |Ω|:
[The slide shows the model-selection results: plots of the model evidence used to estimate |S| and |Ω|.]

Model definition (7)
Specification of the action set A (15 actions).
10 sensing actions (stay or move in one of 4 directions, N/S/E/W, then sense):
  1: Stay, GPR sensing   2: South, GPR sensing   3: North, GPR sensing   4: West, GPR sensing   5: East, GPR sensing
  6: Stay, EMI sensing   7: South, EMI sensing   8: North, EMI sensing   9: West, EMI sensing   10: East, EMI sensing
5 declaration actions (declare the current target as one type):
  11: Declare as 'metal mine'   12: Declare as 'plastic mine'   13: Declare as 'Type-1 clutter'   14: Declare as 'Type-2 clutter'   15: Declare as 'clean'
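For reference, the same 15 actions written out as an enumeration; the identifier names are mine, the numbering follows the slide.

```python
from enum import IntEnum

class Action(IntEnum):
    """The 10 sensing actions and 5 declaration actions listed on the slide."""
    STAY_GPR = 1;  SOUTH_GPR = 2;  NORTH_GPR = 3;  WEST_GPR = 4;  EAST_GPR = 5
    STAY_EMI = 6;  SOUTH_EMI = 7;  NORTH_EMI = 8;  WEST_EMI = 9;  EAST_EMI = 10
    DECLARE_METAL_MINE = 11
    DECLARE_PLASTIC_MINE = 12
    DECLARE_TYPE1_CLUTTER = 13
    DECLARE_TYPE2_CLUTTER = 14
    DECLARE_CLEAN = 15
```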

Model definition (8)
Estimate T: across all 5 target types (metal mine, plastic mine, type-1 clutter, type-2 clutter, and clean), a total of 29 states are defined.

Model definition (9)
Estimate T:
- "Stay" actions do not cause a state transition – identity matrix;
- other sensing actions cause state transitions – computed by elementary geometric probability;
- "Declaration" actions reset the problem – uniform distribution over states.
[The slide shows the geometric construction for a = "walk south and then sense with EMI" or "walk south and then sense with GPR", where δ is the distance traveled by the robot in a single step and σ1, σ2, σ3, σ4 denote the 4 borders of state 5 as well as their respective area measures.]

Model definition (10)
Estimate T:
- assume each mine or clutter item is buried separately;
- as the robot moves, state transitions happen only within the states of one target;
- the "clean" state is the bridge between targets.
[The slide illustrates transitions from the metal-mine states through "clean" to another target.]

Model definition (11)
Estimate T:
- the state-transition matrix is block-diagonal, with one block per target type (metal mine, plastic mine, type-1 clutter, type-2 clutter) plus the "clean" state;
- model expansion: add more diagonal blocks, each one a new target type.
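A sketch of the block-diagonal expansion using SciPy's block_diag; the placeholder matrices are illustrative only, and the coupling through the "clean" state is not shown.

```python
import numpy as np
from scipy.linalg import block_diag

def expand_transition(T_old, T_new_block):
    """Append a new target type as an extra diagonal block of one action's transition matrix (sketch)."""
    return block_diag(T_old, T_new_block)

# e.g. adding a hypothetical 4-state target block to an existing 9-state block
T_expanded = expand_transition(np.eye(9), np.eye(4))   # placeholder (row-stochastic) matrices
print(T_expanded.shape)   # (13, 13)
```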

Model definition (12)
Specification of the reward R:
- Sensing: -1 – each sensing action (either EMI or GPR) has a cost of 1;
- Correct declaration: +10 – the target is declared correctly;
- Partially correct declaration: +5 – confusion between different types of landmine, or between different types of clutter;
- Incorrect declaration: large penalty;
  - Miss: declaring "clean" or clutter when it is a landmine: -100;
  - False alarm: declaring a landmine when it is clean or clutter: -50.
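A toy encoding of these reward values, assuming a simple label taxonomy; cases the slide does not spell out are flagged in the comments.

```python
SENSING_COST = -1          # every EMI or GPR sensing action
CORRECT_DECLARATION = +10  # correct target type declared
PARTIAL_CREDIT = +5        # right class (mine vs. clutter), wrong sub-type
MISS_PENALTY = -100        # declared clean or clutter when it is a landmine
FALSE_ALARM_PENALTY = -50  # declared a landmine on clean ground or clutter

def declaration_reward(true_label, declared_label):
    """Toy mapping from (truth, declaration) to reward; label names are assumptions."""
    mines = {"metal mine", "plastic mine"}
    if declared_label == true_label:
        return CORRECT_DECLARATION
    if true_label in mines and declared_label not in mines:
        return MISS_PENALTY
    if true_label not in mines and declared_label in mines:
        return FALSE_ALARM_PENALTY
    # the slide only specifies mine-mine and clutter-clutter confusion; other
    # confusions (e.g. clutter vs. clean) are lumped in here for simplicity
    return PARTIAL_CREDIT
```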

Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

Lifelong learning (1)
- Model-based algorithm.
- No training data is available in advance: the POMDP model is learned by a Bayesian approach during the exploration & exploitation processes.
- A rough model is assumed to be given, but some model parameters are uncertain.
- An oracle is available that can provide exact information about the target label, size and position, but using the oracle is expensive.
- Criteria for using the oracle:
  1. the policy selects the "oracle query" action;
  2. the agent finds new observations – new knowledge;
  3. after a lot of sensing, the agent still cannot make a decision – the target is too difficult.

Lifelong learning (2)
- An "oracle query" includes three steps:
  1. measure data with both sensors on a grid;
  2. the true target label is revealed;
  3. build the target model from the measured data.
- Two learning approaches:
  1. model expansion (more target types are considered);
  2. model hyper-parameter update.

Lifelong learning (3)
Dirichlet distribution:
- a distribution over the parameters of a multinomial distribution;
- the conjugate prior of the multinomial distribution;
- we can place a Dirichlet prior on each state-action pair of the transition-probability and observation-function parameters:
  Dir(θ_1, …, θ_K | u_1, …, u_K) ∝ Π_k θ_k^{u_k - 1},  with θ_k ≥ 0 and Σ_k θ_k = 1.
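A minimal sketch of the conjugate update this implies, assuming the hyper-parameters are stored as pseudo-counts; the learning-rate scaling used in the talk is omitted here.

```python
import numpy as np

def dirichlet_posterior(prior_counts, observed_counts):
    """Conjugate update for one (state, action) row of T or O:
    add observed transition/observation counts to the Dirichlet hyper-parameters."""
    return np.asarray(prior_counts, float) + np.asarray(observed_counts, float)

def expected_row(counts):
    """Posterior mean of the multinomial parameters under the Dirichlet."""
    counts = np.asarray(counts, float)
    return counts / counts.sum()
```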

Lifelong learning (4)
Algorithm:
1. Start from an imperfect model M0, containing "clean" and some mine or clutter types, with the corresponding S and Ω; S and Ω can be expanded during the learning process.
2. "Oracle query" is one possible action.
3. Set the learning rate λ.
4. Set the Dirichlet priors according to the imperfect model M0: for any unknown transition probability and any unknown observation probability, place a Dirichlet prior on the corresponding row of T or O.

Lifelong learning (4), continued
Algorithm (continued):
5. Sample N models and solve their policies π_1, …, π_N.
6. Initialize the weights w_i = 1/N.
7. Initialize the history h = {}.
8. Initialize the belief state b_0 for each model.
9. Run the experiment. At each time step:
   a. Compute the optimal action for each model: a_i = π_i(b_i) for i = 1, …, N.
   b. Pick an action a according to the weights w_i: p(a_i) = w_i.
   c. If one of the three query conditions is met (exploration):
      (1) sense the current local area on a grid;
      (2) the current target label is revealed;
      (3) build the sub-model for the current target and compute the hyper-parameters:
          if the target is a new target type, expand the model by including the new target type as a diagonal block;
          else (the target is an existing target type), update the Dirichlet parameters of the current target type (next page).

Lifelong learning (4), continued
Algorithm (continued):
   d. If a query is not required (exploitation):
      (1) take action a;
      (2) receive observation o;
      (3) update the belief state for each sampled model;
      (4) update the history h.
   e. Update the weights w_i by the forward-backward algorithm.
   f. Pruning, at regular intervals:
      (1) remove the model samples with the lowest weights and redraw new ones;
      (2) solve the new models' policies;
      (3) update the beliefs according to the history h up to the current time;
      (4) recompute the weights according to the history h up to the current time.
(A compressed code sketch of one time step of this loop follows below.)
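A highly compressed sketch of one time step of the loop above (steps a-d). All names are illustrative: `models`, `policies` and `env` (with `query_condition_met`, `oracle_query` and `step`) are assumed objects, `weights` is a NumPy array, the weight update and pruning steps are omitted, and `update_belief` is the belief-update sketch from earlier.

```python
import numpy as np

def lifelong_learning_step(models, weights, beliefs, policies, env,
                           rng=np.random.default_rng()):
    """One time step of the weighted-ensemble loop (illustrative only)."""
    # a. each sampled model proposes an action through its own policy
    proposals = [policies[i](beliefs[i]) for i in range(len(models))]
    # b. pick one proposal with probability proportional to the model weights
    i_star = rng.choice(len(models), p=weights / weights.sum())
    action = proposals[i_star]
    if env.query_condition_met(action):
        # c. exploration: dense sensing, the oracle reveals the label, then the model
        #    is expanded or its Dirichlet hyper-parameters are updated (not shown)
        env.oracle_query()
    else:
        # d. exploitation: act, observe, and update every model's belief state
        obs = env.step(action)
        for i, m in enumerate(models):
            beliefs[i] = update_belief(beliefs[i], action, obs, m.T, m.O)
    return action
```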

Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

Results (1)
Data description:
- mine fields: 1.6 × 1.6 m²;
- sensing on a spatial grid of 2 cm by 2 cm;
- two sensors: EMI and GPR.
Robot navigation: "basic path" + "lanes"
- search almost everywhere to avoid missing landmines;
- active sensing to minimize the cost;
- the "basic path" restrains the robot from moving across the lanes;
- the robot takes actions to determine its sensing positions within the lanes.

Results (2)
Offline-learning approach: performance summary

                                          Mine field 1   Mine field 2   Mine field 3
Ground truth
  Number of mines (metal + plastic)       5 (3+2)        7 (4+3)
  Number of clutter (metal + nonmetal)    21 (18+3)      57 (34+23)     29 (23+6)
Detection result
  Number of mines missed                  1              1              2
  Number of false alarms                  2              2              2

Metal clutter: soda can, shell, nail, quarter, penny, screw, lead, rod, ball bearing
Nonmetal clutter: rock, bag of wet sand, bag of dry sand, CD

Results (3)
Offline-learning approach: Minefield 1
[The slide shows the ground truth map (P: plastic mine; M: metal mine; other marks: clutter) next to the detection result, with declaration marks for "clean", metal mine, plastic mine, type-1 clutter, type-2 clutter and "unknown". Result: 1 miss; 2 false alarms.]

Results (4)
Sensor deployment:
- plastic mine: GPR sensor;
- metal mine: EMI sensor;
- "clean" areas and the center of a mine: few sensing actions (2-3 in general);
- the interface between a mine and "clean": many sensing actions.
[The slide shows the sensing marks (EMI vs. GPR) and the declaration marks ("clean", metal mine, plastic mine, "unknown") over the field.]

Results (5)
Lifelong-learning approach: Minefield 1 (initial learning from Minefield 1)
[The slide shows the ground truth (M: metal mine, P: plastic mine) and the detection map: red rectangular regions mark oracle queries; the other marks are declarations (red: metal mine, pink: plastic mine, yellow: type-1 clutter, cyan: type-2 clutter, blue "c": clean).]

Results (6)
Lifelong-learning approach compared with offline learning:
[The slide plots the difference between the parameters of the model learned by lifelong learning and those of the model learned offline (with training data given in advance). The three large error drops correspond to new targets being added to the model.]

Results (7)
Lifelong-learning approach: Minefield 2 (sensing Minefield 2 after the model was learned from Minefield 1)
[The slide shows the ground truth (M: metal mine, P: plastic mine, plus rocks as clutter) and the detection map: red rectangular regions mark oracle queries; the other marks are declarations (red: metal mine, pink: plastic mine, yellow: type-1 clutter, cyan: type-2 clutter, blue "c": clean).]

Results (8)
Lifelong-learning approach: Minefield 3
[The slide shows the ground truth (M: metal mine, P: plastic mine) and the detection map: red rectangular regions mark oracle queries; the other marks are declarations (red: metal mine, pink: plastic mine, yellow: type-1 clutter, cyan: type-2 clutter, blue "c": clean).]