Adaptive Importance Sampling for Estimation in Structured Domains. L.E. Ortiz and L.P. Kaelbling.


Adaptive Importance Sampling for Estimation in Structured Domains. L.E. Ortiz and L.P. Kaelbling

2 Contents
- Notations
- Importance Sampling
- Adaptive Importance Sampling
- Empirical Results

3 Notations
- Bayesian network (BN) and influence diagram (ID) (A: decision node, U: utility node)

4
- Probabilities of interest (O: variables of interest, Z: the remaining variables)
- Best strategy: the strategy with the highest expected utility, i.e. the action 'a' that maximizes the value associated with the evidence 'o' (the parents of 'a').
- Importance sampling is needed to compute the summations above.
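The summations referred to here appeared as images in the original slides; the following is a hedged reconstruction in standard BN/ID notation (O the observed variables of interest, Z the remaining variables, U the utility node), which may differ in detail from the slide:

```latex
P(o) = \sum_{z} P(o, z)
\qquad
V(a \mid o) = \sum_{z} U(a, o, z)\, P(o, z \mid a)
```

Both sums range over all configurations of Z, which makes exact evaluation expensive and motivates importance sampling.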

5 Importance Sampling
- Quantity of interest: G
- Z ~ importance sampling distribution f(z)
- Estimation of G: sample z's from f and average the weights w
- Cf. estimation of
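The equations on this slide were likewise lost to images; the standard importance-sampling identities it refers to, under the same notation (G the quantity of interest, g the summand, f the sampling distribution, w the weight), are:

```latex
G = \sum_{z} g(z)
  = \sum_{z} \frac{g(z)}{f(z)}\, f(z)
  = \mathbb{E}_{Z \sim f}\!\left[\frac{g(Z)}{f(Z)}\right],
\qquad
w(z) := \frac{g(z)}{f(z)}

\hat{G}_n = \frac{1}{n} \sum_{i=1}^{n} w(z_i),
\qquad z_1, \dots, z_n \sim f
```

The estimator is unbiased for any f that is strictly positive wherever g is nonzero.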

6
- BN: likelihood weighting (sample the non-evidence variables from the prior; the weight is the likelihood of the evidence)
- ID: the analogous weighting for influence diagrams
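As a concrete illustration of likelihood weighting, here is a minimal sketch on a hypothetical two-node network A -> B; the network and its CPTs are invented for illustration and are not the ones used in the paper.

```python
import random

# Hypothetical CPTs for a toy network A -> B.
P_A = {True: 0.3, False: 0.7}                      # prior P(A)
P_B_given_A = {True: {True: 0.9, False: 0.1},      # P(B | A)
               False: {True: 0.2, False: 0.8}}

def likelihood_weighting(evidence_b, n_samples=10_000):
    """Estimate P(A = True | B = evidence_b) by likelihood weighting."""
    weighted_true = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        # Sample the non-evidence variable from the prior (the proposal f).
        a = random.random() < P_A[True]
        # The importance weight is the likelihood of the evidence.
        w = P_B_given_A[a][evidence_b]
        total_weight += w
        if a:
            weighted_true += w
    return weighted_true / total_weight

print(likelihood_weighting(evidence_b=True))
```

The same pattern extends to larger networks: evidence nodes are clamped, every other node is sampled from its conditional distribution given its sampled parents, and the weight is the product of the evidence nodes' conditional probabilities.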

7
- E.g.:
- G can be estimated by sampling the weights w and averaging them.
- Cf.

8
- Variance of the weights:
- Minimum-variance importance sampling distribution (obtained by taking a derivative of the above):
- The weights have zero variance in this case (w = G).
- f(z) must have a "fat tail": otherwise the weight w(z) = g(z)/f(z) becomes unbounded as f(z) approaches 0 for at least one value of Z.
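The derivation itself was an image; stated in the same notation, the standard result the slide summarizes (for nonnegative g) is:

```latex
\operatorname{Var}_{f}[w]
  = \mathbb{E}_{f}\!\left[w^{2}\right] - G^{2}
  = \sum_{z} \frac{g(z)^{2}}{f(z)} - G^{2},
\qquad
f^{*}(z) = \frac{g(z)}{G}
\ \Longrightarrow\
w(z) = G \ \text{for every } z
```

so the optimal sampling distribution is proportional to g itself, and every sample then carries the same weight G.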

9 Adaptive Importance Sampling
- Parameterize the importance sampling distribution (in tabular form).
- Update rules based on gradient descent.

10
- Three different forms of the gradient:
  - minimize the variance directly
  - minimize the distance between the current sampling distribution and an approximate optimal sampling distribution
  - minimize the distance between the current sampling distribution and the empirical optimal distribution
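To make the adaptive scheme concrete, here is a minimal sketch of an adaptive importance-sampling loop for a single discrete variable: a tabular (softmax) proposal is updated by a stochastic-gradient step that reduces the variance of the weights, i.e. the first of the three gradient forms above. The toy target g, the learning rate, and the softmax parameterization are illustrative assumptions; the paper's actual parameterization and update rules differ in detail.

```python
import math
import random

# Toy target: estimate G = sum_z g(z) over a small discrete domain.
# (g is a made-up table; in the paper it would come from a BN / influence diagram.)
g = [0.05, 0.10, 0.60, 0.20, 0.05]
K = len(g)

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def adaptive_is(n_iters=20_000, lr=0.01):
    theta = [0.0] * K                 # tabular parameterization of the proposal f
    total = 0.0
    for _ in range(n_iters):
        f = softmax(theta)
        # Draw z ~ f and form the importance weight w = g(z) / f(z).
        z = random.choices(range(K), weights=f)[0]
        w = g[z] / f[z]
        total += w
        # Stochastic estimate of d Var_f[w] / d theta_j = -E_f[ w^2 * d log f / d theta_j ],
        # followed by a gradient-descent step (drives f toward g / G).
        for j in range(K):
            score = (1.0 if j == z else 0.0) - f[j]   # d log f(z) / d theta_j for a softmax
            theta[j] -= lr * (-(w ** 2) * score)
    return total / n_iters            # unbiased estimate of G

print(adaptive_is(), "vs exact", sum(g))
```

As the proposal approaches g/G, every weight approaches G and the estimator's variance shrinks, which is the behavior the empirical results below compare against plain likelihood weighting.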

11
- Minimizing the variance directly:
- Via the approximate optimal distribution:
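Only the headings of this slide survived; a hedged reconstruction of the variance-minimizing gradient for a parameterized proposal f_theta, using the standard score-function identity (the paper's exact expressions may differ), is:

```latex
\frac{\partial}{\partial \theta} \operatorname{Var}_{f_\theta}[w]
  = \frac{\partial}{\partial \theta} \sum_{z} \frac{g(z)^{2}}{f_\theta(z)}
  = -\,\mathbb{E}_{Z \sim f_\theta}\!\left[
        w(Z)^{2}\, \frac{\partial \log f_\theta(Z)}{\partial \theta}
      \right]
```

which is consistent with the remark on slide 13 that this update is proportional to the square of the weights.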

12
- Via the parameterized empirical distribution: (with a separate case when the RHS = 0)

13 Remarks
- The three Δ's are, respectively, proportional to the square of the weights, linear in the weights, and logarithmic in the weights.
- Δ_L2 is positive if w/G > 1 (i.e., when g is underestimated).
- The size and sign of Δ track under- or over-estimation of g.

14 (figure-only slide)

15 Empirical Results
- Problem: compute V_MP(t)(A) for A = 2, MP(t) = 1 in the computer-mouse problem.
- Evaluation: MSE between the true value and the estimate from each sampling method.
- Var and L2 are better than LW (the traditional likelihood-weighting method).
- L2 is more stable than the other methods.

16 (figure-only slide)