1 Combining Exploration and Exploitation in Landmine Detection. Lihan He, Shihao Ji and Lawrence Carin, ECE, Duke University

2 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

3 Landmine detection (1)
- Landmine detection:
  - By robots instead of human beings;
  - Underlying model controlling the robot: POMDP.
- Multiple sensors:
  - A single sensor is sensitive to only certain types of objects (EMI sensor: conductivity; GPR sensor: dielectric properties; seismic sensor: mechanical properties);
  - Multiple complementary sensors improve detection performance.

4 Landmine detection (2)
- Statement of the problem:
  - Given a minefield where some landmines and clutter are buried underground;
  - Two types of sensors are available: an EMI sensor and a GPR sensor;
  - Sensing has a cost;
  - Correct / incorrect declarations have corresponding rewards / penalties.
  How can we develop a strategy to find the landmines in this minefield effectively, with minimal cost?
- Questions involved:
  - How to optimally choose sensing positions in the field, so as to use as few sensing points as possible to find the landmines?
  - How to optimally choose sensors at each sensing position?
  - When to sense and when to declare?

5 Landmine detection (3)
- Solution sketch: a partially observable Markov decision process (POMDP) model is built to solve this problem, since it provides an approach to selecting actions (sensor deployment, sensing positions and declarations) optimally, based on maximal reward / minimal cost.
- Lifelong learning:
  - The robot learns the model at the same time as it moves and senses in the minefield (combining exploration and exploitation);
  - The model is updated based on the exploration process.

6 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

7 POMDP (1)
POMDP = HMM + controllable actions + rewards
A POMDP is a model of an agent interacting synchronously with its environment. The agent takes as input observations of the environment, estimates the state from the observed information, and then generates actions as output according to its policy. Over the repeated observation-action loops, the agent seeks maximal reward, or equivalently, minimal cost.
[Diagram: agent-environment loop; the agent performs state estimation to maintain a belief b, its policy maps b to an action sent to the environment, and the environment returns an observation.]

8 POMDP (2)
A POMDP model is defined by the tuple <S, A, Ω, T, O, R>:
- S is a finite set of discrete states of the environment.
- A is a finite set of discrete actions.
- Ω is a finite set of discrete observations providing noisy state information.
- T: S × A → Π(S) is the state transition probability: the probability of transitioning from state s to s' when taking action a.
- O: S × A → Π(Ω) is the observation function: the probability of receiving observation o after taking action a and landing in state s'.
- R: S × A → R; R(s, a) is the expected reward the agent receives by taking action a in state s.
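
A minimal sketch of how such a tuple might be encoded, using a hypothetical two-state toy model (not the landmine model from the talk); the array layout T[a, s, s'], O[a, s', o] is an assumption of this sketch.

```python
# Toy POMDP tuple <S, A, Omega, T, O, R> encoded as NumPy arrays (hypothetical model).
import numpy as np

n_states, n_actions, n_obs = 2, 2, 2

# T[a, s, s'] = Pr(s' | s, a): state transition probabilities
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])

# O[a, s', o] = Pr(o | s', a): observation probabilities
O = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # action 0
    [[0.6, 0.4], [0.4, 0.6]],   # action 1
])

# R[s, a]: expected immediate reward for taking action a in state s
R = np.array([
    [ 1.0, -1.0],
    [-1.0,  1.0],
])

# Sanity checks: every transition and observation row is a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
assert np.allclose(O.sum(axis=2), 1.0)
```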

9 POMDP (3)
Belief state b:
- The agent's belief about which state it is currently in;
- A probability distribution over the state set S;
- A summary of all past information;
- Updated at each step by Bayes' rule, based on the latest action and observation and the previous belief state:
  b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b)
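
A minimal sketch (not the authors' code) of this Bayes-rule update, reusing the T and O arrays and the indexing convention assumed in the toy example above.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One Bayes-rule belief update: b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)."""
    predicted = b @ T[a]               # sum_s b(s) T(s, a, s'), a vector over s'
    unnormalized = O[a, :, o] * predicted
    norm = unnormalized.sum()          # this normalizer equals Pr(o | a, b)
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the current belief.")
    return unnormalized / norm

# Example: uniform initial belief, take action 0, observe o = 1.
# b1 = belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, O=O)
```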

10 POMDP (4)
Policy:
- A mapping from belief states to actions;
- Tells the agent which action to take given the current belief state.
Optimal policy:
- Maximizes the expected discounted reward over the horizon (immediate reward plus discounted future reward):
  V*(b) = max_a [ R(b, a) + γ Σ_o Pr(o | a, b) V*(b^a_o) ]
- V*(b) is piecewise linear and convex in the belief state (Sondik, 1971);
- V*(b) is represented by a set of |S|-dimensional vectors {α_1*, …, α_m*}: V*(b) = max_i α_i* · b.
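
A minimal sketch of how the α-vector representation is used at run time: the value of a belief is the maximum of its dot products with the α-vectors, and the greedy action is the one attached to the maximizing vector (storing an action per α-vector is the usual convention, assumed here).

```python
import numpy as np

def value_and_action(b, alphas, actions):
    """Return V*(b) = max_i alpha_i . b and the action of the maximizing alpha-vector.

    alphas  : array of shape (m, |S|), one alpha-vector per row
    actions : length-m list; actions[i] is the action associated with alphas[i]
    """
    scores = alphas @ b
    best = int(np.argmax(scores))
    return scores[best], actions[best]

# Example with two hypothetical alpha-vectors over a 2-state belief:
# alphas = np.array([[2.0, 0.0], [0.5, 1.5]]); actions = [0, 1]
# v, a = value_and_action(np.array([0.3, 0.7]), alphas, actions)
```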

11 POMDP (5)
Policy learning:
- Solve for the vectors {α_1*, …, α_m*};
- Point-based value iteration (PBVI) algorithm;
- Iteratively updates the vectors α and the value V for a fixed set of sampled belief points:
  V_{n+1}(b) = max_a [ R(b, a) + γ Σ_o Pr(o | a, b) V_n(b^a_o) ],
  where V_n is the value function n steps from the horizon and b^a_o is the updated belief after taking action a and observing o. The value function (n+1) steps from the horizon is computed from the n-step result, starting from the one-step (immediate-reward) value function.
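
A sketch of one point-based backup under the same array conventions as the earlier toy example; this is the generic PBVI backup over a given belief set, not the authors' implementation.

```python
import numpy as np

def pbvi_backup(beliefs, alphas, T, O, R, gamma=0.95):
    """One point-based value-iteration backup.

    beliefs : (n_beliefs, |S|) array of sampled belief points
    alphas  : (m, |S|) array, the n-step value function {alpha_1, ..., alpha_m}
    Returns the (n+1)-step alpha-vectors (one per belief point) and their actions.
    """
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    new_alphas, new_actions = [], []
    for b in beliefs:
        best_val, best_alpha, best_a = -np.inf, None, None
        for a in range(n_actions):
            alpha_b = R[:, a].astype(float)            # immediate-reward term
            for o in range(n_obs):
                # gao[s, i] = gamma * sum_{s'} T(s,a,s') O(s',a,o) alpha_i(s')
                gao = gamma * (T[a] * O[a, :, o][None, :]) @ alphas.T
                alpha_b = alpha_b + gao[:, np.argmax(b @ gao)]   # best alpha for this (a, o) at b
            if b @ alpha_b > best_val:
                best_val, best_alpha, best_a = b @ alpha_b, alpha_b, a
        new_alphas.append(best_alpha)
        new_actions.append(best_a)
    return np.array(new_alphas), new_actions
```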

12 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

13 Model definition (1)
- Feature extraction: EMI sensor
  - EMI model and sensor measurements [equations and plots not recoverable from the transcript];
  - Model parameters are extracted from the measurements by a nonlinear fitting method.
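
The EMI response model itself is not recoverable from this transcript, so the sketch below uses a generic parametric curve (a hypothetical decaying exponential) only to illustrate the nonlinear-fitting step that turns raw measurements into a small parameter vector.

```python
# Sketch of the nonlinear-fitting step for EMI features. The actual EMI model from
# the slide is not reproduced here; y = a * exp(-b * x) + c is a placeholder form.
import numpy as np
from scipy.optimize import curve_fit

def emi_model(x, a, b, c):
    return a * np.exp(-b * x) + c

# Hypothetical sensor measurements (sampling variable vs. response, plus noise).
x = np.linspace(0.0, 5.0, 50)
rng = np.random.default_rng(0)
y = emi_model(x, 2.0, 1.3, 0.2) + 0.05 * rng.standard_normal(x.size)

# The fitted parameters (a, b, c) serve as the EMI feature vector for this sensing point.
params, _ = curve_fit(emi_model, x, y, p0=[1.0, 1.0, 0.0])
print("EMI feature vector:", params)
```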

14 Model definition (2)
- Feature extraction: GPR sensor
  - Raw moments: energy features;
  - Central moments: variance and asymmetry of the waveform.
  [Figure: GPR data shown over time vs. down-track position.]
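
A sketch of moment-based features for a single GPR trace; the exact feature definitions used in the talk are not in the transcript, so ordinary raw and central moments of the normalized energy profile are used as stand-ins.

```python
# Raw moments capture signal energy location; central moments capture the variance
# and asymmetry (skewness) of the waveform. Definitions here are generic stand-ins.
import numpy as np

def gpr_moment_features(trace, n_raw=2, n_central=3):
    w = trace.astype(float) ** 2                     # energy profile along the time axis
    w = w / w.sum()                                  # normalize to a distribution over time
    t = np.arange(len(trace))
    raw = [np.sum(w * t**k) for k in range(1, n_raw + 1)]               # raw moments
    mean = raw[0]
    central = [np.sum(w * (t - mean)**k) for k in range(2, n_central + 1)]  # variance, asymmetry
    return np.array(raw + central)

# Example: a synthetic GPR trace with a reflection centered around sample 40.
# trace = np.exp(-0.5 * ((np.arange(128) - 40) / 5.0) ** 2)
# print(gpr_moment_features(trace))
```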

15 Model definition (3)
- Definition of the observation set Ω: EMI feature vectors and GPR feature vectors are vector-quantized into an EMI codebook and a GPR codebook; Ω is the union of the two codebooks.
- Definition of the state set S: the region around a buried target, below the ground surface, is divided into spatial states s1-s9 centered on the target.
  [Figure: the arrangement of states s1-s9 relative to the target and the ground surface.]
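
A sketch of the vector-quantization step using k-means; the codebook sizes are placeholders (the talk selects |Ω| by Bayesian model selection on the following slides), and the offset used to form the union of the two codebooks is an assumption of this sketch.

```python
# Cluster continuous EMI / GPR feature vectors into codebooks, then map each new
# feature vector to the index of its nearest codeword to obtain a discrete observation.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(features, n_codewords):
    """features: (n_samples, n_dims) array of feature vectors."""
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(features)

def quantize(codebook, feature_vector):
    """Return the index of the nearest codeword for one feature vector."""
    return int(codebook.predict(np.asarray(feature_vector).reshape(1, -1))[0])

# Hypothetical usage: separate codebooks for EMI and GPR features; the discrete set
# Omega is their union, e.g. EMI symbols 0..3 and GPR symbols 4..7 when both
# codebooks have 4 codewords (sizes are placeholders).
# emi_cb = build_codebook(emi_features, n_codewords=4)
# gpr_cb = build_codebook(gpr_features, n_codewords=4)
# o_emi = quantize(emi_cb, emi_feature)
# o_gpr = 4 + quantize(gpr_cb, gpr_feature)
```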

16 Model definition (4)
- Estimate |S| and |Ω|:
  - Variational Bayesian (VB) expectation-maximization (EM) method for model selection;
  - Bayesian learning;
  - Criterion: compare the model evidence (marginal likelihood) of candidate models,
    posterior = likelihood × prior / evidence:  p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M),
    where the evidence p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ is used to rank the candidate models M.

17 Model definition (5)
- Estimate |S| and |Ω|:
  - Candidate models: HMMs with two sets of observations (an EMI observation and a GPR observation at each sensing point), defined over horizontal and vertical sensing sequences of the underlying states;
  - Candidate sizes: |S| = 1, 5, 9, 13, … and |Ω| = 2, 3, 4, …
  [Figure: graphical models of the candidate HMMs for |S| = 1, 5, 9 and 13.]

18 Model definition (6)
- Estimate |S| and |Ω|
  [Figure: model-selection results used to estimate |S| (left) and |Ω| (right).]

19 Model definition (7)
- Specification of the action set A
  - 10 sensing actions (stay, or move in one of 4 directions N/S/E/W, then sense):
    1: Stay, GPR sensing; 2: South, GPR sensing; 3: North, GPR sensing; 4: West, GPR sensing; 5: East, GPR sensing;
    6: Stay, EMI sensing; 7: South, EMI sensing; 8: North, EMI sensing; 9: West, EMI sensing; 10: East, EMI sensing.
  - 5 declaration actions (declare the target as one type):
    11: Declare as 'metal mine'; 12: Declare as 'plastic mine'; 13: Declare as 'Type-1 clutter'; 14: Declare as 'Type-2 clutter'; 15: Declare as 'clean'.

20 Model definition (8)
- Estimate T
  Across all 5 target types (metal mine, plastic mine, type-1 clutter, type-2 clutter, and "clean"), a total of 29 states are defined.
  [Figure: layout of the 29 states by target type: metal-mine states 1-9, plastic-mine states 10-18, the two clutter types covering states 19-28, and "clean" as state 29.]

21 Model definition (9)
- Estimate T
  - "Stay" actions do not cause a state transition: identity matrix;
  - Other sensing actions cause state transitions, computed by elementary geometric probability, e.g. for a = "walk south and then sense with EMI" or "walk south and then sense with GPR", where δ is the distance traveled in a single step by the robot, and σ1, σ2, σ3 and σ4 denote the 4 borders of state 5 together with their respective area measures;
  - "Declaration" actions reset the problem: uniform distribution over the states.
  [Figure: the spatial state grid with state 5 in the center, its borders σ1-σ4, and the step length δ.]
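
The exact geometric formula on the slide is not recoverable here; the sketch below illustrates the idea under stated assumptions: the robot's position is uniform within its current cell of side L, a "south" move shifts that position by δ, and any probability mass shifted outside the target's state block goes to the "clean" state. The cell size and step length are hypothetical values.

```python
# Geometric-probability sketch for a "move south" action on a 3x3 spatial state
# grid (states 1..9, state 5 in the center). Assumptions: uniform position within
# a cell of side L, a shift of delta per step, and off-grid mass mapped to "clean".
import numpy as np

def south_move_transitions(L=0.10, delta=0.02):
    """Return a dict: state (1..9) -> {next state or 'clean': probability}."""
    trans = {}
    for s in range(1, 10):
        row = (s - 1) // 3                 # row 0 is the northern row, row 2 the southern row
        frac_out = delta / L               # fraction of the cell's area pushed one cell south
        if row < 2:
            trans[s] = {s + 3: frac_out, s: 1.0 - frac_out}
        else:                              # southern row: mass leaves the target block
            trans[s] = {"clean": frac_out, s: 1.0 - frac_out}
    return trans

# Example: with 10 cm cells and a 2 cm step, 20% of the mass moves one cell south.
# print(south_move_transitions()[5])   # hypothetical values, not the paper's numbers
```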

22 Model definition (10)
- Estimate T
  - Assume each mine or clutter item is buried separately;
  - State transitions happen only within the states of a single target when the robot moves;
  - "Clean" acts as a bridge between targets.
  [Figure: transitions among the metal-mine states 1-9, the "clean" state 29, and the states 1-9 of another target.]

23 Model definition (11)
- Estimate T
  - The state transition matrix is block-diagonal: one block each for the metal mine, plastic mine, Type-1 clutter and Type-2 clutter, plus the "clean" state;
  - Model expansion: add more diagonal blocks, each one a new target.

24 Model definition (12)
- Specification of the reward R
  - Sensing: -1 (each sensing action, either EMI or GPR, has cost -1)
  - Correct declaration: +10 (the target type is declared correctly)
  - Partially correct declaration: +5 (confusion between different types of landmines, or between different types of clutter)
  - Incorrect declaration: large penalty
    - Miss: declare as "clean" or clutter when it is a landmine: -100
    - False alarm: declare as a landmine when it is clean or clutter: -50
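
A sketch of this reward structure written as a function, using the action numbering from slide 19; the mapping from a state to its true target class, and the value used for confusing clutter with "clean", are assumptions made for illustration only.

```python
# Reward sketch R(true_class, action). Actions 1-10 are sensing, 11-15 are
# declarations (slide 19). The clutter/"clean" confusion case is not specified
# on the slide; the value below is a placeholder.
MINE_CLASSES = {"metal mine", "plastic mine"}
CLUTTER_CLASSES = {"type-1 clutter", "type-2 clutter"}
DECLARATION = {11: "metal mine", 12: "plastic mine",
               13: "type-1 clutter", 14: "type-2 clutter", 15: "clean"}

def reward(true_class, action):
    if action <= 10:                                   # any sensing action (EMI or GPR)
        return -1
    declared = DECLARATION[action]
    if declared == true_class:                         # correct declaration
        return +10
    if declared in MINE_CLASSES and true_class in MINE_CLASSES:
        return +5                                      # confused between the two mine types
    if declared in CLUTTER_CLASSES and true_class in CLUTTER_CLASSES:
        return +5                                      # confused between the two clutter types
    if true_class in MINE_CLASSES:                     # miss: a mine declared clean/clutter
        return -100
    if declared in MINE_CLASSES:                       # false alarm: clean/clutter declared a mine
        return -50
    return -50                                         # clutter vs. "clean": placeholder value
```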

25 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

26 Lifelong learning (1)
- Model-based algorithm.
- No training data available in advance: learn the POMDP model by a Bayesian approach during the exploration & exploitation process.
- Assume a rough model is given, but some model parameters are uncertain.
- An oracle is available that can provide exact information about the target label, size and position, but using the oracle is expensive.
- Criteria for using the oracle:
  1. The policy selects the "oracle query" action;
  2. The agent finds new observations (new knowledge);
  3. After much sensing, the agent still cannot make a decision (the target is too difficult).

27 Lifelong learning (2)
- An "oracle query" includes three steps:
  1. Measure data with both sensors on a grid;
  2. The true target label is revealed;
  3. Build the target model from the measured data.
- Two learning approaches:
  1. Model expansion (more target types are considered);
  2. Model hyper-parameter update.

28 Lifelong learning (3)
Dirichlet distribution:
- A distribution over the parameters of a multinomial distribution;
- The conjugate prior of the multinomial distribution;
- We can place a Dirichlet prior on each state-action pair of the transition probability and of the observation function:
  p(θ_1, …, θ_K | u_1, …, u_K) ∝ Π_k θ_k^{u_k - 1}, with θ_k ≥ 0 and Σ_k θ_k = 1,
  where u_1, …, u_K are the hyper-parameters (pseudo-counts).
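
A sketch of this Dirichlet bookkeeping for the transition model: one Dirichlet per (state, action) row, updated with (learning-rate-weighted) counts after an oracle query, and sampled to obtain concrete models; the class interface is hypothetical, not the authors' code.

```python
import numpy as np

class DirichletTransitionPrior:
    def __init__(self, n_states, n_actions, prior_counts=1.0):
        # u[a, s, :] are the hyper-parameters (pseudo-counts) of Pr(. | s, a)
        self.u = np.full((n_actions, n_states, n_states), prior_counts)

    def update(self, counts, lam=1.0):
        """Add learning-rate-weighted observed transition counts to the pseudo-counts."""
        self.u += lam * counts

    def sample_model(self, rng=None):
        """Draw one concrete transition model T[a, s, s'] from the posterior."""
        if rng is None:
            rng = np.random.default_rng()
        T = np.empty_like(self.u)
        for a in range(self.u.shape[0]):
            for s in range(self.u.shape[1]):
                T[a, s] = rng.dirichlet(self.u[a, s])
        return T

    def mean_model(self):
        """Posterior-mean transition model (normalized pseudo-counts)."""
        return self.u / self.u.sum(axis=2, keepdims=True)
```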

29 Lifelong learning (4)
Algorithm:
1. Start from an imperfect model M0, containing "clean" and some mine or clutter types, with the corresponding S and Ω; S and Ω can be expanded during learning;
2. "Oracle query" is one of the possible actions;
3. Set the learning rate λ;
4. Set the Dirichlet priors according to the imperfect model M0: for any unknown transition probability and any unknown observation probability, initialize the corresponding Dirichlet hyper-parameters from M0.
  [Figure: arrays of Dirichlet hyper-parameters for the transition and observation models.]

30 Lifelong learning (4)
Algorithm (continued):
5. Sample N models and solve their policies;
6. Initialize the weights w_i = 1/N;
7. Initialize the history h = {};
8. Initialize the belief state b_0 for each model;
9. Run the experiment. At each time step:
   a. Compute the optimal action for each model: a_i = π_i(b_i) for i = 1, …, N;
   b. Pick an action a according to the weights w_i: p(a_i) = w_i;
   c. If one of the three query conditions is met (exploration):
      (1) Sense the current local area on a grid;
      (2) The current target label is revealed;
      (3) Build the sub-model for the current target and compute the hyper-parameters.
      If the target is a new target type, expand the model by including the new target type as a diagonal block;
      else (the target is an existing target type), update the Dirichlet parameters of the current target type (next slide).

31 Lifelong learning (4)
Algorithm (continued):
   d. If a query is not required (exploitation):
      (1) Take action a;
      (2) Receive observation o;
      (3) Update the belief state for each sampled model;
      (4) Update the history h.
   e. Update the weights w_i by the forward-backward algorithm.
   f. Pruning. At regular intervals:
      (1) Remove the model samples with the lowest weights and redraw new ones;
      (2) Solve the new models' policies;
      (3) Update the beliefs according to the history h up to the current time;
      (4) Recompute the weights according to the history h up to the current time.
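
A skeleton of this exploration/exploitation loop, with every component (policy solver, query test, oracle, forward-backward re-weighting, pruning) passed in as a user-supplied callable; all of those helper names are hypothetical placeholders standing for the steps listed on slides 29-31, not the authors' code.

```python
import numpy as np

def lifelong_learning_loop(prior, env, helpers, n_models=5, n_steps=1000,
                           prune_every=50, rng=None):
    """Skeleton of the lifelong-learning loop (slides 29-31).

    `helpers` is a dict of user-supplied callables (all placeholders):
      solve_policy, initial_belief, belief_update, query_condition_met,
      oracle_query, is_new_target_type, expand_model, count_transitions,
      reweight (forward-backward), prune_and_resample.
    """
    if rng is None:
        rng = np.random.default_rng()
    h = helpers
    models = [prior.sample_model(rng) for _ in range(n_models)]
    policies = [h["solve_policy"](m) for m in models]
    beliefs = [h["initial_belief"](m) for m in models]
    weights = np.full(n_models, 1.0 / n_models)
    history = []

    for t in range(1, n_steps + 1):
        # a. optimal action under each sampled model; b. pick one by model weight
        candidates = [pi(b) for pi, b in zip(policies, beliefs)]
        a = candidates[rng.choice(n_models, p=weights)]

        if h["query_condition_met"](a, history, beliefs):
            # c. exploration: oracle query, then expand the model or update the prior
            data, label = h["oracle_query"](env)
            if h["is_new_target_type"](label, prior):
                h["expand_model"](prior, label, data)
            else:
                prior.update(h["count_transitions"](data, label))
        else:
            # d. exploitation: act, observe, update beliefs and history
            obs = env.step(a)
            beliefs = [h["belief_update"](b, a, obs, m) for b, m in zip(beliefs, models)]
            history.append((a, obs))
            # e. re-weight the sampled models by the forward-backward algorithm
            weights = h["reweight"](models, history)

        # f. pruning at regular intervals: drop low-weight models, redraw, re-solve
        if t % prune_every == 0:
            models, policies, beliefs, weights = h["prune_and_resample"](
                models, policies, weights, prior, history, rng)

    return prior, models, weights
```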

32 Outline
- Introduction
- Partially observable Markov decision processes (POMDPs)
- Model definition & offline learning
- Lifelong-learning algorithm
- Experimental results

33 Results (1)
- Data description:
  - Minefields: 1.6 × 1.6 m²;
  - Sensing on a spatial grid of 2 cm by 2 cm;
  - Two sensors: EMI and GPR.
- Robot navigation: "basic path" + "lanes"
  - Search almost everywhere to avoid missing landmines; active sensing to minimize the cost;
  - The "basic path" restrains the robot from moving across the lanes;
  - The robot takes actions to determine its sensing positions within the lanes.

34 Results (2)
- Offline-learning approach: performance summary

                                            Minefield 1   Minefield 2   Minefield 3
  Ground truth
    Number of mines (metal + plastic)       5 (3+2)       7 (4+3)       7 (4+3)
    Number of clutter (metal + nonmetal)    21 (18+3)     57 (34+23)    29 (23+6)
  Detection result
    Number of mines missed                  1             1             2
    Number of false alarms                  2             2             2

  Metal clutter: soda can, shell, nail, quarter, penny, screw, lead, rod, ball bearing.
  Nonmetal clutter: rock, bag of wet sand, bag of dry sand, CD.

35 Results (3)
- Offline-learning approach: Minefield 1
  [Figure: ground truth (P: plastic mine; M: metal mine; other: clutter) and detection result, with declaration marks for "clean", metal mine, plastic mine, type-1 clutter, type-2 clutter and "unknown". 1 mine missed; 2 false alarms.]

36 Results (4)
- Sensor deployment
  - Plastic mines: GPR sensing;
  - Metal mines: EMI sensing;
  - "Clean" areas and the center of a mine: few sensings (2-3 in general);
  - The interface between a mine and "clean": many sensings.
  [Figure: sensing marks (EMI sensor vs. GPR sensor) and declaration marks ("clean", metal mine, plastic mine, "unknown") across the field.]

37 Results (5)
- Lifelong-learning approach: Minefield 1 (initial learning from Minefield 1)
  [Figure: ground truth (M: metal mine, P: plastic mine) and detection result. Red rectangular regions: oracle queries. Other marks: declarations (red: metal mine; pink: plastic mine; yellow: clutter 1; cyan: clutter 2; blue "c": clean).]

38 Results (6)
- Lifelong-learning approach: comparison with offline learning
  [Figure: difference between the parameters of the model learned by lifelong learning and the model learned offline (training data given in advance). The three large error drops correspond to new targets being added to the model.]

39 Results (7)
- Lifelong-learning approach: Minefield 2 (sensing Minefield 2 after the model was learned from Minefield 1)
  [Figure: ground truth (M: metal mine, P: plastic mine, plus 19 rocks as clutter) and detection result. Red rectangular regions: oracle queries. Other marks: declarations (red: metal mine; pink: plastic mine; yellow: clutter 1; cyan: clutter 2; blue "c": clean).]

40 Results (8)
- Lifelong-learning approach: Minefield 3
  [Figure: ground truth (M: metal mine, P: plastic mine) and detection result. Red rectangular regions: oracle queries. Other marks: declarations (red: metal mine; pink: plastic mine; yellow: clutter 1; cyan: clutter 2; blue "c": clean).]

