Near-optimal Nonmyopic Value of Information in Graphical Models. Andreas Krause, Carlos Guestrin. Computer Science Department, Carnegie Mellon University.

Applications for sensor selection: medical domain (select among potential examinations), sensor networks (observations drain power and require storage), feature selection (select the most informative attributes for classification, regression, etc.).

An example: temperature prediction. Estimating the temperature in a building using wireless sensors with limited battery.

Probabilistic model: hidden variables of interest U = {T1,...,T5}, taking values (C)old, (N)ormal, (H)ot, and observable variables O = {S1,...,S5}. Task: select a subset of observations to become most certain about U. What does "become most certain" mean?

Making observations: after observing S1 = hot, the reward is 0.2.

Making observations: after additionally observing S3 = hot, the reward is 0.4.

A different outcome: if instead S3 = cold, the reward is only 0.1. We need to compute the expected reduction of uncertainty for any sensor selection! How should uncertainty be defined?

Selection criterion: Entropy [Cressie '91]. Consider myopically selecting O1 as the most uncertain sensor, O2 as the most uncertain given O1, ..., Ok as the most uncertain given O1,...,Ok-1. This can be seen as an attempt to nonmyopically maximize H(O1) + H(O2 | O1) + ... + H(Ok | O1,...,Ok-1), which is exactly the joint entropy H(O) = H({O1,...,Ok}). Effect: selects sensors which are most uncertain about each other.

Selection criterion: Information Gain. Nonmyopically select sensors O ⊆ S to maximize IG(O;U) = H(U) - H(U | O), i.e., the prior uncertainty about U minus the expected posterior uncertainty about U. Effect: selects sensors which most effectively reduce uncertainty about the variables of interest.
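
As a concrete illustration (not from the talk), here is a minimal brute-force Python sketch of IG(O;U) for a small discrete model; the joint-table representation, the index arguments, and the toy noisy-sensor example at the end are all invented for the example.

from math import log2

def entropy(dist):
    # Shannon entropy (in bits) of a dict mapping outcomes to probabilities
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def information_gain(joint, u_idx, o_idx):
    # IG(O;U) = H(U) - sum_o P(O=o) * H(U | O=o), by brute-force marginalization.
    # joint: dict {assignment tuple: probability}; u_idx / o_idx: coordinates of U and O.
    def marginal(idx):
        m = {}
        for x, p in joint.items():
            key = tuple(x[i] for i in idx)
            m[key] = m.get(key, 0.0) + p
        return m

    h_u = entropy(marginal(u_idx))
    expected_posterior = 0.0
    for o_val, p_o in marginal(o_idx).items():
        cond = {}
        for x, p in joint.items():
            if tuple(x[i] for i in o_idx) == o_val:
                u_val = tuple(x[i] for i in u_idx)
                cond[u_val] = cond.get(u_val, 0.0) + p / p_o
        expected_posterior += p_o * entropy(cond)
    return h_u - expected_posterior

# Toy example: U is a fair coin, O is a noisy copy that agrees with U with probability 0.9.
joint = {(u, o): 0.5 * (0.9 if u == o else 0.1) for u in (0, 1) for o in (0, 1)}
print(information_gain(joint, u_idx=(0,), o_idx=(1,)))  # roughly 0.53 bits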

Observations can have different costs: each variable S_i has cost c(S_i). Sensor networks: power consumption. Medical domain: cost of examinations. Feature selection: computational complexity. Example energy per sample (mJ): Humidity and Temperature 0.5; Voltage

Inference in graphical models: inference P(X = x | O = o) is needed to compute entropy or information gain, and efficient inference is possible for many graphical models (e.g., chains and polytrees). What about nonmyopically optimizing sensor selections?

Results for optimal nonmyopic algorithms (presented at IJCAI '05): the problem is efficiently and optimally solvable for chains, but even on discrete polytree graphical models, subset selection is NP^PP-complete. If we cannot solve exactly, can we approximate?

An important observation: observing S1 tells us something about T1, T2 and T5; observing S3 tells us something about T3, T2 and T4. Now adding S2 would not help much. In many cases, new information is worth less if we already know more (diminishing returns)!

Submodular set functions are a natural formalism for this idea of diminishing returns: f(A ∪ {X}) - f(A) ≥ f(B ∪ {X}) - f(B) for all A ⊆ B. Maximization of submodular functions is NP-hard, so let's look at a heuristic!
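
To make the diminishing-returns inequality concrete, here is a small Python sketch (not from the slides) that checks it exhaustively for a toy coverage function; the sets and values are invented for the example.

from itertools import combinations

def coverage(selected, covers):
    # f(A) = number of items covered by the sets indexed by A (a classic submodular function)
    return len(set().union(*(covers[i] for i in selected)) if selected else set())

def is_submodular(ground, f):
    # Brute-force check of f(A + x) - f(A) >= f(B + x) - f(B) for all A ⊆ B and x not in B
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1) for c in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                for x in ground:
                    if x not in B and f(A | {x}) - f(A) < f(B | {x}) - f(B) - 1e-12:
                        return False
    return True

covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
f = lambda A: coverage(A, covers)
print(is_submodular(covers.keys(), f))  # True: coverage exhibits diminishing returns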

The greedy algorithm: compute the gain R obtained by adding each candidate sensor (in the example the initial gains are 0.3, 0.5, 0.4, 0.2, 0.1, so the sensor with gain 0.5 is picked first), then recompute the gains of the remaining sensors and repeat.
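
A minimal Python sketch of this greedy loop for a generic monotone set function f with unit costs; the names are illustrative and this is not the authors' implementation.

def greedy_select(candidates, f, k):
    # Greedily pick k elements, each time adding the one with the largest marginal gain f(A + x) - f(A).
    A = set()
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for x in candidates - A:
            gain = f(A | {x}) - f(A)
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:
            break
        A.add(best)
    return A

# e.g., with the toy coverage function sketched above: greedy_select(set(covers), f, k=2)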

How can we leverage submodularity? Theorem [Nemhauser et al.]: the greedy algorithm guarantees a (1-1/e) OPT approximation (about 63%) for monotone submodular functions. The same guarantee holds for the budgeted case [Sviridenko / Krause, Guestrin], where OPT = max { f(A) : Σ_{X ∈ A} c(X) ≤ B }.

Are our objective functions submodular and monotonic? (Discrete) entropy is! [Fujishige '78] However, entropy can waste information: the joint entropy H(O1) + H(O2 | O1) + ... + H(Ok | O1,...,Ok-1) rewards uncertainty among the sensors themselves, even when that uncertainty tells us nothing about the variables of interest U; that part of the sum is "wasted" information.

Information gain in general is not submodular: let A, B ~ Bernoulli(0.5) and C = A XOR B. Then C | A and C | B are Bernoulli(0.5) (entropy 1), but C | A,B is deterministic (entropy 0). Hence IG(C;{A,B}) - IG(C;{A}) = 1 while IG(C;{B}) - IG(C;{}) = 0, violating diminishing returns. Hence we cannot get the (1-1/e) approximation guarantee! Or can we?
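
To verify the counterexample numerically, here is a small Python sketch (not from the slides) that enumerates the uniform joint distribution over (A, B) and computes the two marginal gains.

from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Uniform joint over (A, B); C = A XOR B is deterministic given both parents.
joint = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

def H_C_given(obs_vars):
    # H(C | observations), where obs_vars indexes into the (A, B) assignment
    h = 0.0
    groups = {}
    for (a, b), p in joint.items():
        key = tuple((a, b)[i] for i in obs_vars)
        groups.setdefault(key, []).append(((a, b), p))
    for items in groups.values():
        p_key = sum(p for _, p in items)
        dist = {}
        for (a, b), p in items:
            c = a ^ b
            dist[c] = dist.get(c, 0.0) + p / p_key
        h += p_key * entropy(dist)
    return h

H_C = 1.0  # C is a fair coin
print(H_C - H_C_given((0,)))                 # IG(C;{A}) = 0
print(H_C_given((0,)) - H_C_given((0, 1)))   # IG(C;{A,B}) - IG(C;{A}) = 1: the gain grows, so IG is not submodular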

Conflict between maximizing Entropy and Information Gain: results on temperature data from a real sensor network. Can we optimize information gain directly?

Submodularity of information gain Theorem: Under certain conditional independence assumptions, information gain is submodular and nondecreasing!

Example with fulfilled conditions: feature selection in Naive Bayes models (class variable T with features S1,...,S5), fundamentally relevant for many classification tasks.

Example with fulfilled conditions: the general sensor selection problem, with noisy sensors that are conditionally independent given the hidden variables. This holds for many practical problems.

Example with fulfilled conditions: sometimes the hidden variables can also be queried directly (at potentially higher cost). We also address this case.

Algorithms and complexity (measured in evaluations of the greedy rule; k = number of selected sensors, n = number of sensors to select from). Unit-cost case: greedy algorithm, complexity O(kn). Budgeted case: partial enumeration + greedy, complexity O(n^5); for a guarantee of 1/2 (1-1/e) OPT, O(n^2) is possible. Caveat: often, evaluating the greedy rule is itself a hard problem!
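
A rough Python sketch of the cost-benefit greedy rule often used in the budgeted case; the partial-enumeration step mentioned on the slide is omitted here, and instead the result is compared against the best single affordable element, which is one standard way to obtain the weaker 1/2 (1-1/e) guarantee. Names, costs, and the budget are illustrative.

def budgeted_greedy(candidates, f, cost, budget):
    # Repeatedly add the affordable element with the best marginal-gain-per-cost ratio.
    A, spent = set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for x in candidates - A:
            if spent + cost[x] <= budget:
                ratio = (f(A | {x}) - f(A)) / cost[x]
                if ratio > best_ratio:
                    best, best_ratio = x, ratio
        if best is None:
            break
        A.add(best)
        spent += cost[best]
    # Compare against the best single affordable element (needed for the approximation guarantee).
    singles = [x for x in candidates if cost[x] <= budget]
    best_single = max(singles, key=lambda x: f({x}), default=None)
    if best_single is not None and f({best_single}) > f(A):
        return {best_single}
    return A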

Greedy rule: Xk+1 = argmax { H(X | Ak) - H(X | U) : X ∈ S \ Ak }. The term H(X | Ak) prefers sensors which are different from those already selected; the term -H(X | U) prefers sensors which are relevant to U. How do we compute these conditional entropies?
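
A minimal Python sketch of this selection rule; cond_entropy is an assumed helper (e.g., exact inference, or the sampling estimator sketched below), not an existing library call.

def greedy_rule(S, A, U, cond_entropy):
    # Pick the next sensor X in S \ A maximizing H(X | A) - H(X | U).
    # cond_entropy(X, given) is assumed to return an estimate of H(X | given).
    return max(
        (X for X in S if X not in A),
        key=lambda X: cond_entropy(X, A) - cond_entropy(X, U),
    )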

Hardness of computing conditional entropies: the entropy decomposes along the graphical model, but conditional entropies do not decompose along the graphical model structure, because summing out T makes all the observed variables dependent.

But how to compute the information gain? Randomized approximation by sampling: estimate H(X | A) by the average of H(X | A = aj) over samples, where each aj is sampled from the graphical model and H(X | A = aj) is computed using exact inference for that particular instantiation.
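
A minimal Python sketch of this sampling scheme, assuming the estimate is the average of H(X | A = aj) over the samples; sample_assignment and exact_cond_entropy are placeholder hooks into the graphical model, not real library calls.

def sampled_cond_entropy(X, A, n_samples, sample_assignment, exact_cond_entropy):
    # Estimate H(X | A) as the average of H(X | A = a_j) over samples a_j drawn from the model.
    total = 0.0
    for _ in range(n_samples):
        a_j = sample_assignment(A)           # draw an instantiation of A from the graphical model
        total += exact_cond_entropy(X, a_j)  # exact inference for this particular instantiation
    return total / n_samples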

How many samples are needed? H(X | A) can be approximated with absolute error ε and confidence 1-δ using a number of samples given by Hoeffding's inequality. Empirically, many fewer samples suffice!
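
The sample-size expression was an image in the original slide; the sketch below uses the standard Hoeffding bound for averages of quantities bounded in [0, log |dom(X)|], which may differ from the paper's exact constants.

from math import ceil, log, log2

def hoeffding_samples(dom_size, eps, delta):
    # Samples sufficient so that the empirical average of values in [0, log2 |dom(X)|] bits
    # is within eps (bits) of its expectation with probability at least 1 - delta.
    R = log2(dom_size)
    return ceil((R ** 2) / (2 * eps ** 2) * log(2 / delta))

print(hoeffding_samples(dom_size=3, eps=0.1, delta=0.05))  # a few hundred samples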

Theoretical Guarantee. Theorem: For any graphical model satisfying the conditional independence assumption and allowing efficient inference, one can nonmyopically select a subset of variables O such that IG(O;U) ≥ (1-1/e) OPT - ε with confidence 1-δ, using a number of samples polynomial in 1/ε, log 1/δ, log |dom(X)| and |V|. But 1-1/e is only about 63%... can we do better?

Hardness of Approximation. Theorem: If maximization of information gain can be approximated by a constant factor better than 1-1/e, then P = NP. Proof by reduction from MAX-COVER. How to interpret our results? Positive: we give a (1-1/e) approximation. Negative: no efficient algorithm can provide better guarantees. Positive: our result provides a baseline for any algorithm maximizing information gain.

Baseline: In general, no efficient algorithm can guarantee better results than the greedy method unless P = NP. But in special cases, we may get lucky. Suppose an algorithm TUAFMIG gives results which are 10% better than those obtained from the greedy algorithm. Since greedy is within (1-1/e) ≈ 63% of optimum, we immediately know that TUAFMIG is within 1.1 × 63% ≈ 70% of optimum.

Evaluation on two real-world data sets: temperature data from a sensor network deployment, and traffic data from the California Bay Area.

Temperature prediction: 52-node sensor network deployed at a research lab. Predict mean temperature in building areas. Training data: 5 days; testing: 2 days.

Temperature monitoring

[Placement maps comparing the sensors selected by Entropy vs. Information Gain]

Temperature monitoring: information gain provides significantly higher prediction accuracy.

Do fewer samples suffice? The sample-size bounds are very loose; the quality of the selection stays nearly constant as the number of samples decreases.

Traffic monitoring: 77 detector stations on Bay Area highways. Predict minimum speed in different areas. Training data: 18 days; testing data: 2 days.

Hierarchical model: zones represent highway segments.

Traffic monitoring, Entropy: entropy selects the most variable nodes.

Traffic monitoring, Information Gain: information gain selects nodes relevant to the aggregate nodes.

Traffic monitoring, Prediction: information gain provides significantly higher prediction accuracy.

Summary of Results: efficient randomized algorithms for information gain with a strong approximation guarantee, (1-1/e) OPT, for a large class of graphical models. This is (more or less) the best possible guarantee unless P = NP. The methods lead to improved prediction accuracy.