Bayesian Networks in Educational Assessment


Bayesian Networks in Educational Assessment
Session III: Refining Bayes Nets with Data
Estimating Parameters with MCMC
Duanli Yan, ETS; Roy Levy, ASU © 2019

Bayesian Inference: Expanding Our Context

Posterior Distribution
The posterior distribution for the unknowns given the knowns is

    p(θ | x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ)

Inference about examinee latent variables (θ) given observables (x)
Example: ACED Bayes net fragment for Common Ratio
θ = Common Ratio
x = observables from tasks that measure Common Ratio

Bayes Net Fragment
[Figure: Bayes net fragment. The latent variable θ (Common Ratio), with prior p(θ), is the parent of observables x1, ..., x6 from tasks that measure Common Ratio.]

Probability Distribution for the Latent Variable θ
θ = Common Ratio
θ ~ Categorical(λ)
ACED example: 2 levels of θ (Low, High); λ = (λ1, λ2) contains the probabilities for Low and High

    θ (Common Ratio):   1    2
    Prob.:              λ1   λ2

Probability Distribution for the Observables
xs = observables from tasks that measure Common Ratio
(xj | θ = c) ~ Bernoulli(πcj)
ACED example: πcj is the probability of a correct response on task j given θ = c

    θ    p(xj = 0 | θ)    p(xj = 1 | θ)
    1    1 − π1j          π1j
    2    1 − π2j          π2j

Bayesian Inference

    θ (Common Ratio):   1    2
    Prob.:              λ1   λ2

    θ    p(xj = 0 | θ)    p(xj = 1 | θ)
    1    1 − π1j          π1j
    2    1 − π2j          π2j

If the λs and πs are unknown, they become subject to posterior inference too

Bayesian Inference
A convenient choice for the prior distribution on the πs is the beta distribution
ACED example: π1j ~ Beta(1, 1), π2j ~ Beta(1, 1)
For the first task, constrain π21 > π11 to resolve the indeterminacy in the latent variable and avoid label switching

Bayesian Inference
A convenient choice for the prior distribution on λ is the Dirichlet distribution, which generalizes the beta distribution to the case of multiple categories
λ ~ Dirichlet(αλ)
ACED example: λ = (λ1, λ2) ~ Dirichlet(1, 1)

Model Summary
θi ~ Categorical(λ)
λ ~ Dirichlet(1, 1)
(xij | θi = c) ~ Bernoulli(πcj)
π11 ~ Beta(1, 1)
π21 ~ Beta(1, 1) I(π21 > π11)
πcj ~ Beta(1, 1) for the other observables
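To make the generative structure concrete, here is a minimal R sketch that simulates data from this model. All numeric values (n, J, λ, and the πs) are illustrative assumptions, not ACED estimates:

    # Simulate data from the model summary above (all values illustrative)
    set.seed(42)
    n <- 200                                  # examinees
    J <- 6                                    # observables (tasks)
    lambda <- c(0.5, 0.5)                     # P(theta = 1), P(theta = 2)
    pi_mat <- rbind(c(0.2, 0.2, 0.1, 0.1, 0.2, 0.2),  # P(x_ij = 1 | theta_i = 1)
                    c(0.8, 0.9, 0.3, 0.2, 0.7, 0.8))  # P(x_ij = 1 | theta_i = 2)
    theta <- sample(1:2, n, replace = TRUE, prob = lambda)  # theta_i ~ Categorical(lambda)
    x <- matrix(rbinom(n * J, 1, pi_mat[theta, ]), n, J)    # x_ij ~ Bernoulli(pi[theta_i, j])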

JAGS Code

JAGS Code

    for (i in 1:n){
      for (j in 1:J){
        x[i,j] ~ dbern(pi[theta[i], j])
      }
    }

(xij | θi = c) ~ Bernoulli(πcj)
The index pi[theta[i], j] references the table of πjs in terms of θ = 1 or 2

JAGS Code

    pi[1,1] ~ dbeta(1,1)
    pi[2,1] ~ dbeta(1,1) T(pi[1,1], )
    for (c in 1:C){
      for (j in 2:J){
        pi[c,j] ~ dbeta(1,1)
      }
    }

π11 ~ Beta(1, 1)
π21 ~ Beta(1, 1) I(π21 > π11)
πcj ~ Beta(1, 1) for the remaining observables

JAGS Code

    for (i in 1:n){
      theta[i] ~ dcat(lambda[])
    }

θi ~ Categorical(λ)

    lambda[1:C] ~ ddirch(alpha_lambda[])
    for (c in 1:C){
      alpha_lambda[c] <- 1
    }

λ ~ Dirichlet(1, 1)
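Putting the three fragments together, here is a sketch of how the full model might be assembled and run from R. The rjags interface, data list names, and run lengths are assumptions for illustration; the course's 'ACED Analysis.R' may differ. It uses the simulated x, n, and J from the earlier sketch (or real response data in the same shape):

    library(rjags)

    model_string <- "
    model {
      for (i in 1:n){
        theta[i] ~ dcat(lambda[])
        for (j in 1:J){
          x[i,j] ~ dbern(pi[theta[i], j])
        }
      }
      lambda[1:C] ~ ddirch(alpha_lambda[])
      for (c in 1:C){
        alpha_lambda[c] <- 1
      }
      pi[1,1] ~ dbeta(1,1)
      pi[2,1] ~ dbeta(1,1) T(pi[1,1], )
      for (c in 1:C){
        for (j in 2:J){
          pi[c,j] ~ dbeta(1,1)
        }
      }
    }"

    jags_data <- list(x = x, n = n, J = J, C = 2)
    model <- jags.model(textConnection(model_string),
                        data = jags_data, n.chains = 3)   # multiple chains
    update(model, 1000)                                   # adaptation/burn-in
    samples <- coda.samples(model,
                            variable.names = c("lambda", "pi", "theta"),
                            n.iter = 5000)                # retained draws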

Markov Chain Monte Carlo

Estimation in Bayesian Modeling
Our “answer” is a posterior distribution
All parameters are treated as random, not fixed
Contrasts with frequentist approaches to inference and estimation, where parameters are fixed, so estimation comes down to finding the single best value
“Best” here is in terms of a criterion (ML, LS, etc.)
The peak of a mountain vs. mapping the entire terrain of peaks, valleys, and plateaus of the landscape

What’s In a Name? Markov chain Monte Carlo
Construct a sampling algorithm to simulate or draw from the posterior
Collect many such draws, which serve to empirically approximate the posterior distribution and can be used to empirically approximate summary statistics
Monte Carlo Principle: Anything we want to know about a random variable θ can be learned by sampling many times from f(θ), the density of θ. -- Jackman (2009)
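A tiny R illustration of the Monte Carlo principle. The Beta(7, 3) target is an arbitrary stand-in for a posterior:

    draws <- rbeta(10000, 7, 3)           # many draws from f(theta)
    mean(draws)                           # approximates E(theta) = 0.7
    quantile(draws, c(0.025, 0.975))      # approximates a 95% credible interval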

What’s In a Name? Markov chain Monte Carlo
Values are really generated as a sequence or chain
t denotes the step in the chain: θ(0), θ(1), θ(2), …, θ(t), …, θ(T)
t can also be thought of as a time indicator
Follows the Markov property…

The Markov Property
The current state depends on the previous position
Examples: weather, checkers, baseball counts and scoring
The next state is conditionally independent of the past, given the present
Akin to a full mediation model:

    p(θ(t+1) | θ(t), θ(t−1), θ(t−2), …) = p(θ(t+1) | θ(t))

    θ(0) → θ(1) → θ(2) → θ(3) → …

Visualizing the Chain: Trace Plot
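A toy R example of what a trace plot shows, using a simple AR(1) chain as a stand-in for a real posterior sampler (all values illustrative):

    set.seed(1)
    n_iter <- 1000
    chain <- numeric(n_iter)
    chain[1] <- 10                                # deliberately poor starting value
    for (t in 2:n_iter) {
      chain[t] <- 0.9 * chain[t - 1] + rnorm(1)   # next state depends only on current
    }
    plot(chain, type = "l", xlab = "iteration t", ylab = "theta(t)")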

Markov Chain Monte Carlo
Markov chains are sequences of numbers that have the Markov property: draws in cycle t+1 depend on values from cycle t but, given those, not on previous cycles
Under certain assumptions, Markov chains reach stationarity: the collection of values converges to a distribution, referred to as a stationary distribution
Memoryless: the chain will “forget” where it started; start anywhere, and it will reach stationarity if the regularity conditions hold
For Bayes, set the chain up so that the stationary distribution is the posterior distribution
Upon convergence, samples from the chain approximate the stationary (posterior) distribution

Assessing Convergence

Diagnosing Convergence
With MCMC, convergence is to a distribution, not a point
ML: convergence is when we’ve reached the highest point in the likelihood, the highest peak of the mountain
MCMC: convergence is when we’re sampling values from the correct distribution, mapping the entire terrain accurately

Diagnosing Convergence
A properly constructed Markov chain is guaranteed to converge to the stationary (posterior) distribution… eventually
Upon convergence, it will sample over the full support of the stationary (posterior) distribution… over an ∞ number of draws
In a finite chain, there is no guarantee that the chain has converged or is sampling through the full support of the stationary (posterior) distribution
There are many ways to diagnose convergence; whole software packages are dedicated to just assessing convergence of chains (e.g., the R packages ‘coda’ and ‘boa’)

Gelman & Rubin’s (1992) Potential Scale Reduction Factor (PSRF)
Run multiple chains from dispersed starting points
Suggest convergence when the chains come together
If they all go to the same place, it’s probably the stationary distribution

Gelman & Rubin’s (1992) Potential Scale Reduction Factor (PSRF)
An analysis-of-variance type argument
The PSRF, or R̂, compares between-chain and within-chain variability
If there is substantial between-chain variance, R̂ will be >> 1

Gelman & Rubin’s (1992) Potential Scale Reduction Factor (PSRF)
Run multiple chains from dispersed starting points; suggest convergence when the chains come together
Operationalized in terms of partitioning variability:
Run multiple chains for 2T iterations, discard the first half
Examine between- and within-chain variability
Various versions and modifications have been suggested over time

Potential Scale Reduction Factor (PSRF)
For any θ, for any chain c, the within-chain variance over the T retained iterations is

    s²c = (1 / (T − 1)) Σt (θc(t) − θ̄c)²

For all C chains, the pooled within-chain variance is

    W = (1 / C) Σc s²c

Potential Scale Reduction Factor (PSRF)
The between-chain variance is

    B = (T / (C − 1)) Σc (θ̄c − θ̄)²

The estimated variance is

    V̂ = ((T − 1) / T) W + (1 / T) B

Potential Scale Reduction Factor (PSRF)
The potential scale reduction factor is

    R̂ = sqrt(V̂ / W)

If R̂ is close to 1 (e.g., < 1.1) for all parameters, can conclude convergence

Potential Scale Reduction Factor (PSRF)
Examine R̂ over “time”: look for values shrinking toward 1 and for stability of B and W
If close to 1 (e.g., < 1.2, or the stricter < 1.1) can conclude convergence
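The coda package computes the PSRF directly. A sketch assuming 'samples' is the mcmc.list of multiple chains from the rjags run above; the discrete theta[] nodes are excluded because chains with (near-)zero variance can make the diagnostic degenerate:

    library(coda)
    keep <- grep("^theta", varnames(samples), invert = TRUE)
    gelman.diag(samples[, keep], multivariate = FALSE)  # R-hat point est. and upper CI
    gelman.plot(samples[, keep])                        # R-hat over "time": look for
                                                        # shrinkage toward 1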

Assessing Convergence: No Guarantees
Multiple chains coming together does not guarantee that they have converged

Assessing Convergence
Recommend running multiple chains from dispersed starting points and determining when they reach the same “place”
The PSRF criterion is an approximation to this
Akin to starting ML from different start values and seeing if they reach the same maximum; here, though, convergence is to a distribution, not a point
A chain hasn’t converged until all parameters have converged; see Brooks & Gelman’s multivariate PSRF

Serial Dependence

Serial Dependence
Serial dependence between draws arises from the dependent nature of the draws (i.e., the Markov structure):

    p(θ(t+1) | θ(t), θ(t−1), θ(t−2), …) = p(θ(t+1) | θ(t))

    θ(0) → θ(1) → θ(2) → θ(3) → …

However, there is marginal dependence across multiple lags
Can examine the autocorrelation across different lags

Autocorrelation
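A sketch of how autocorrelation can be examined with coda, again assuming the 'samples' mcmc.list from above:

    library(coda)
    autocorr.diag(samples, lags = c(1, 5, 10, 50))  # autocorrelation at selected lags
    autocorr.plot(samples[, "lambda[1]"])           # ACF plot for one parameter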

Thinning
Can “thin” the chain by dropping certain iterations
Thin = 1 → keep every iteration
Thin = 2 → keep every other iteration (1, 3, 5, …)
Thin = 5 → keep every 5th iteration (1, 6, 11, …)
Thin = 10 → keep every 10th iteration (1, 11, 21, …)
Thin = 100 → keep every 100th iteration (1, 101, 201, …)

Thinning
[Figure: trace plots of the same chain with thin = 1, thin = 5, and thin = 10]

Thinning
Can “thin” the chain by dropping certain iterations (thin = 1 → keep every iteration; thin = 2, 5, 10, 100 → keep every 2nd, 5th, 10th, 100th)
Thinning does not provide a better portrait of the posterior; it is a loss of information
May prefer to keep all draws and account for the time-series dependence
Useful when data storage or other computations are an issue: if I am going to keep 1000 iterations, I would rather have 1000 approximately independent iterations
There is dependence within chains, but none between chains
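With coda, thinning already-collected draws can be sketched as:

    library(coda)
    thinned <- window(samples, thin = 10)  # keep every 10th retained draw
    niter(thinned)                         # about 1/10 the iterations of 'samples'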

Mixing

Mixing
We don’t want the sampler to get “stuck” in some region of the posterior, or to ignore a certain area of the posterior
Mixing refers to the chain “moving” throughout the support of the distribution in a reasonable way
[Figure: trace plots showing relatively good mixing vs. relatively poor mixing]

Mixing
Mixing ≠ convergence, but better mixing usually leads to faster convergence
Mixing ≠ autocorrelation, but better mixing usually goes with lower autocorrelation (and lower cross-correlations between parameters)
With better mixing, a given number of MCMC iterations yields more information about the posterior; the ideal scenario is independent draws from the posterior
With worse mixing, more iterations are needed to (a) achieve convergence and (b) achieve a desired level of precision for the summary statistics of the posterior

Mixing
Chains may mix differently at different times
Often indicative of an adaptive MCMC algorithm
[Figure: trace plot showing relatively poor mixing early in the chain and relatively good mixing later]

Mixing
Slow mixing can also be caused by high dependence between parameters (example: multicollinearity)
Reparameterizing the model can improve mixing (example: centering predictors in regression; see the sketch below)
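A minimal sketch of that reparameterization in R; the predictor here is hypothetical:

    set.seed(3)
    x_raw <- rnorm(100, mean = 50, sd = 10)  # hypothetical raw predictor
    x_centered <- x_raw - mean(x_raw)        # pass the centered version to the sampler;
                                             # centering reduces the posterior correlation
                                             # between intercept and slope, improving mixing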

Stopping the Chain(s)

When to Stop the Chain(s)
Discard the iterations prior to convergence as burn-in
How many more iterations to run? As many as you want; as many as time provides
Autocorrelation complicates things
Software may provide the “MC error”: an estimate of the sampling variability of the sample mean, where the “sample” here is the sample of iterations; it accounts for the dependence between iterations
Guideline: go at least until the MC error is less than 5% of the posterior standard deviation
Effective sample size: an approximation of how many independent samples we have
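Both quantities are reported by coda; a sketch with the 'samples' object from above:

    library(coda)
    summary(samples)        # the "Time-series SE" column is the MC error estimate
    effectiveSize(samples)  # approximate number of independent draws per parameter
    # Guideline from the slide: continue until Time-series SE < 5% of the posterior SD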

Steps in MCMC in Practice

Steps in MCMC (1)
Set up MCMC using any of a number of algorithms
Program it yourself (have fun!) or use existing software (BUGS, JAGS)
Diagnose convergence: monitor trace plots and PSRF criteria
Discard iterations prior to convergence as burn-in
Software may indicate a minimum number of iterations needed; treat it as a lower bound

Adapting MCMC → Automatic Discard
[Figure: trace plot showing relatively poor mixing during the adaptive phase and relatively good mixing after the adaptive phase]

Steps in MCMC (2)
Run the chain for a desired number of iterations, understanding serial dependence/autocorrelation and mixing
Summarize results via the Monte Carlo principle: densities and summary statistics
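In coda these summaries are one-liners; a sketch, again assuming the 'samples' mcmc.list from the rjags run:

    library(coda)
    summary(samples)                   # posterior means, SDs, and quantiles
    densplot(samples[, "lambda[1]"])   # density estimate for one parameter
    traceplot(samples[, "pi[2,1]"])    # trace plot of the retained draws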

ACED Example

Model Summary
θi ~ Categorical(λ)
λ ~ Dirichlet(1, 1)
(xij | θi = c) ~ Bernoulli(πcj)
π11 ~ Beta(1, 1)
π21 ~ Beta(1, 1) I(π21 > π11)
πcj ~ Beta(1, 1) for the other observables

ACED Example
See ‘ACED Analysis.R’ for running the analysis in R
See the following slides for selected results

Convergence Assessment (1)

Convergence Assessment (2)

Posterior Summary (1)

Posterior Summary (2)

Posterior Summary (3)
[Table: posterior summaries for the λ, π, and selected θ parameters: mean, SD, naive SE, time-series SE, the 0.025/0.25/0.5/0.75/0.975 quantiles, median, and 95% HPD interval bounds. The flattened export is not fully recoverable; for example, the posterior mean of lambda[1] is 0.51 and of pi[2,2] is 0.98.]

Summary and Conclusion

Summary
Dependence on initial values is “forgotten” after a sufficiently long run of the chain (memorylessness); convergence is to a distribution
Recommend monitoring multiple chains, with the PSRF as an approximation
Let the chain “burn in”: discard draws prior to convergence and retain the remaining draws as draws from the posterior
Dependence across draws induces autocorrelation; can thin if desired
Dependence across draws, within and between parameters, can slow mixing; reparameterizing may help

Wise Words of Caution
“Beware: MCMC sampling can be dangerous!”
-- Spiegelhalter, Thomas, Best, & Lunn (2007) (WinBUGS User Manual)