Navigating the parameter space of Bayesian Knowledge Tracing models
Visualizations of the convergence of the Expectation Maximization algorithm
Zachary A. Pardos, Neil T. Heffernan
Worcester Polytechnic Institute, Department of Computer Science

Outline
- Introduction
  - Knowledge Tracing / EM
  - Past work
  - Research Overview
- Analysis Procedure
- Results (pretty pictures)
- Contributions
Presentation available: wpi.edu/~zpardos

Introduction to BKT
Bayesian Knowledge Tracing (BKT) is a hidden Markov model that estimates the probability a student knows a particular skill based on:
- the student's past history of correct and incorrect responses to problems of that skill
- the four parameters of the skill:
  1. Prior: the probability the skill was known before use of the tutor
  2. Learn rate: the probability of learning the skill between each opportunity
  3. Guess: the probability of answering correctly when the skill is not known
  4. Slip: the probability of answering incorrectly when the skill is known
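To make these definitions concrete, here is a minimal runnable Python sketch of the standard BKT update equations (the function name and example sequence are mine; the parameter values match those used later in this deck):

def bkt_update(p_known, correct, learn, guess, slip):
    """One BKT step: condition P(known) on the observed response, then apply learning."""
    if correct:
        # Bayes rule: P(known | correct response)
        posterior = p_known * (1 - slip) / (p_known * (1 - slip) + (1 - p_known) * guess)
    else:
        # Bayes rule: P(known | incorrect response)
        posterior = p_known * slip / (p_known * slip + (1 - p_known) * (1 - guess))
    # Learning transition between opportunities (standard BKT assumes no forgetting)
    return posterior + (1 - posterior) * learn

# Example: prior 0.49, learn 0.09, guess 0.14, slip 0.09; responses: correct, incorrect, correct, correct
p = 0.49
for response in [1, 0, 1, 1]:
    p = bkt_update(p, response, learn=0.09, guess=0.14, slip=0.09)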

Introduction to EM
The Expectation Maximization (EM) algorithm is commonly used to learn parameters by maximum likelihood estimation. EM is especially well suited to learning the four BKT parameters because it supports models with unobserved (latent) variables, such as BKT's hidden knowledge state.
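Schematically (this is standard EM, not specific to these slides), each iteration alternates two steps, where X is the observed response data, Z the latent knowledge states, and θ the BKT parameters:

E-step: Q(θ | θ_k) = E_{Z | X, θ_k} [ log P(X, Z | θ) ]
M-step: θ_{k+1} = argmax_θ Q(θ | θ_k)

Each M-step can only increase the data likelihood, so EM climbs to a local maximum near its starting point; this is why the choice of initial parameter values matters throughout this work.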

Motivation
Results of past and emerging work by the authors rely on interpretation of parameters learned with BKT and EM:
- Pardos, Z. A., Heffernan, N. T. (2009). Determining the Significance of Item Order in Randomized Problem Sets. In Barnes, Desmarais, Romero, & Ventura (Eds.), Proceedings of the 2nd International Conference on Educational Data Mining. Cordoba, Spain. *Best Student Paper
- Pardos, Z. A., Dailey, M. D., Heffernan, N. T. (in press, 2010). Learning what works in ITS from non-traditional randomized controlled trial data. In Proceedings of the 10th International Conference on Intelligent Tutoring Systems. Pittsburgh, PA. Springer-Verlag: Berlin. *Nominated for Best Student Paper

Motivation
Learned parameter values dictate when a student should advance in the curriculum in the Cognitive Tutors.
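As an illustration of how such an advancement decision uses the learned parameters, here is a sketch that reuses bkt_update from above (the function name is mine, and 0.95 is the commonly cited Cognitive Tutor mastery threshold, not a value from these slides):

def opportunities_to_mastery(responses, prior, learn, guess, slip, threshold=0.95):
    """Count opportunities until the estimated P(known) crosses the mastery threshold."""
    p = prior
    for n, response in enumerate(responses, start=1):
        p = bkt_update(p, response, learn, guess, slip)
        if p >= threshold:
            return n  # advance the student after this many opportunities
    return None  # mastery not yet reached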

Past work and relevance
Beck et al. (2007) expressed caution about using Knowledge Tracing, giving an example of how KT could fit data equally well with two separate sets of learned parameters: one set being the plausible set, the other being the degenerate set.
- They proposed using Dirichlet priors to keep parameters close to reasonable values
- A better fit was not accomplished with this method when learning the parameters from data

Past work and relevance
Baker (2009) argued that using brute force to fit the parameters of KT results in a better fit than using Expectation Maximization (personal communication)
- Gong et al. are challenging this at EDM2010
Work by Baker & Corbett has addressed the degenerate parameter problem by bounding the learned parameter values

Past work and relevance
Ritter et al. (2009) used visualization of the KT parameters to show that many of the Cognitive Tutor skills were being fit with similar parameters. The authors used that information to cluster the learning of groups of skills, saving compute time with negligible impact on accuracy.

Research Overview
Bayesian Knowledge Tracing: a method for estimating whether a student knows a skill or not, based on the student's past responses and the parameter values of the skill.
Expectation Maximization (EM): a method for estimating the skill parameters for Bayesian Knowledge Tracing.
EM needs starting values for the parameters to begin its search.
[Diagram: initial EM parameters lead either to a bad fit (ineffective learning, bad pedagogical decisions) or to a good fit (effective learning, many publications, "you're a hero")]
Research Questions:
- Are the starting locations that lead to good fit scattered randomly?
- Do they exist within boundaries?
- Can good convergence always be achieved?

Past work and relevance
Past work lacked the benefit of knowing the ground truth parameters. This makes it difficult to study the behavior of EM and to measure the accuracy of learned parameters.

Our approach: Simulation
The approach of this work is to:
- construct a BKT model with known parameters
- simulate student responses by sampling from that model
- explore how EM converges, or does not converge, to the ground truth parameters based on a grid-search of initial parameter starting positions
- since we know the true parameters, we can now study the accuracy of parameter learning in depth

Research Overview (revisited)
[Diagram repeated: with the ground truth parameters known from simulation, the outcomes can now be labeled "inaccurate fit" vs. "accurate fit" rather than merely bad fit vs. good fit]

Simulation Procedure

KTmodel.lrate = 0.09
KTmodel.guess = 0.14
KTmodel.slip = 0.09
KTmodel.num_questions = 4

For user = 1 to 100
    prior(user) = rand()
    KTmodel.prior = prior(user)
    sim_responses(user) = sample(KTmodel)
End For

Simulation Procedure
- Simulation produces a vector of responses for each student, probabilistically based on the underlying parameter values
- EM can now try to learn back the true parameters from the simulated student data
- EM allows the user to specify which initialization values of the KT parameters should be fixed and which should be learned
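A runnable Python version of the simulation pseudocode above (a sketch under the slide's settings; names such as simulate_student are mine):

import random

def simulate_student(prior, learn, guess, slip, num_questions):
    """Sample one student's correct/incorrect sequence from a BKT model with known parameters."""
    responses = []
    known = random.random() < prior              # sample the initial knowledge state
    for _ in range(num_questions):
        if known:
            responses.append(0 if random.random() < slip else 1)   # slips with probability slip
        else:
            responses.append(1 if random.random() < guess else 0)  # guesses with probability guess
        if not known:
            known = random.random() < learn      # chance to learn between opportunities
    return responses

# As in the pseudocode: 100 students, each with a uniformly random prior
sim_responses = [simulate_student(prior=random.random(), learn=0.09,
                                  guess=0.14, slip=0.09, num_questions=4)
                 for _ in range(100)]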

Simulation Procedure
We can start to build intuition about EM by fixing the prior and learn rate and having only two free parameters to learn (Guess and Slip):
- Prior: 0.49 (fixed)
- Learn rate: 0.09 (fixed)
- Guess: learned
- Slip: learned
We can see how well EM does with two free parameters and then later step up to the more complex four-free-parameter case.

Grid-search Procedure
Learning the Guess and Slip parameters from data; Prior and Learn rate are already known (fixed).
- GuessT, SlipT: true parameters
- GuessI, SlipI: EM initial parameters
- GuessL, SlipL: EM learned parameters
Error = (|GuessT - GuessL| + |SlipT - SlipL|) / 2

Grid-search Procedure GuessTSlipT GuessISlipI GuessLSlipLErrorLLstartLLend ……………………… These parameters are iterated in intervals of / = 51, 51*51 = 2601 total iterations EM log likelihood Higher = better fit to data Resulting data file after all iterations are completed

Grid-search Procedure GuessTSlipT GuessISlipI GuessLSlipLErrorLLstartLLend ……………………… These parameters are iterated in intervals of / = 51, 51*51 = 2601 total iterations EM log likelihood Higher = better fit to data Initial parameters of 0 or 1 will stay at 0 or 1

Grid-search Procedure GuessTSlipT GuessISlipI GuessLSlipLErrorLLstartLLend ……………………… These parameters are iterated in intervals of / = 51, 51*51 = 2601 total iterations EM log likelihood Higher = better fit to data Grid-search run in intervals of 0.02

Visualizations
- What does the parameter space look like?
- Which starting locations lead to the ground truth parameter values?

Analyzing the 3 & 4 parameter case
Similar results were found in the 3 parameter case, with learn, guess and slip as free parameters: the starting position of the learn parameter wasn't important as long as guess + slip <= 1.
In the four parameter case, a grid-search was run at 0.05 resolution and histograms were generated showing the frequency of parameter occurrences. We found that when the initial guess and slip were set to sum to less than 1, the resulting histograms (the bottom row in the figure) minimized degenerate parameter occurrences.
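A tiny illustrative snippet (the function name is mine) of the initialization constraint that worked best, rejecting random initial values whose guess and slip sum to 1 or more:

import random

def constrained_init():
    """Draw initial guess/slip uniformly at random, keeping only pairs with guess + slip < 1."""
    while True:
        guess, slip = random.random(), random.random()
        if guess + slip < 1:
            return guess, slip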

Pardos, Z. A., Heffernan, N. T. (in press, 2010). Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation and Personalization. Hawaii. *Nominated for Best Student Paper

KT vs. PPS visualizations
[Figure: EM convergence maps for Knowledge Tracing (left) vs. Prior Per Student (right); ground truth parameters: guess/slip = 0.14/0.09]

KT vs. PPS visualizations
[Figure: EM convergence maps for Knowledge Tracing (left) vs. Prior Per Student (right); ground truth parameters: guess/slip = 0.30/0.30]

KT vs. PPS visualizations
[Figure: EM convergence maps for Knowledge Tracing (left) vs. Prior Per Student (right); ground truth parameters: guess/slip = 0.50/0.50]

KT vs. PPS visualizations
[Figure: EM convergence maps for Knowledge Tracing (left) vs. Prior Per Student (right); ground truth parameters: guess/slip = 0.60/0.10]

PPS in the KDD Cup
- The Prior Per Student model was used in our KDD Cup competition submission
- PPS was the most accurate Bayesian predictor on all 5 of the Cognitive Tutor datasets
- Preliminary leaderboard RMSE: one place behind the Netflix Prize winners, BigChaos
- This suggests that the positive simulation results are real, substantiated empirically

Contributions
- EM starting parameter values that lead to degenerate learned parameters exist within large boundaries; they are not scattered randomly throughout the parameter space
- Using a novel simulation approach and visualizations, we were able to clearly depict the multiple-maxima characteristics of Knowledge Tracing
- Using this analysis of algorithm behavior, we were able to explain the positive performance of the Prior Per Student model by showing its convergence near the ground truth parameters regardless of starting position
- Initial values of Guess and Slip are very significant

Unknowns / Future Work
How does PPS compare to KT when priors are not drawn from a uniform random distribution?
- Normal distribution
- All students having the same prior
- Bi-modal (high / low knowledge students)
How do the length of the response sequence and the number of students affect algorithm behavior?

Thank you
Please find a copy of our paper on the Prior Per Student model: "Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing"
Acknowledgement: This material is based in part upon work supported by the National Science Foundation under the GK-12 PIMPSE Grant. Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Limitations of past work
- The bounding approach has shown instances where the learned parameters all hit the bounding ceiling, indicating that the best-fit parameters may be higher than the arbitrarily set bound
- The plausible-parameter approach relies in part on domain knowledge to identify what is plausible and what is not
  - Reading tutors may have plausible guess/slip values > 0.70
  - Cognitive Tutors' plausible guess/slip values are < 0.40