
Educational Data Mining March 3, 2010

Today’s Class EDM Assignment#5 Mega-Survey

Educational Data Mining “Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.” –

Classes of EDM Method (Romero & Ventura, 2007) Information Visualization Web mining – Clustering, Classification, Outlier Detection – Association Rule Mining/Sequential Pattern Mining – Text Mining

Classes of EDM Method (Baker & Yacef, 2009) Prediction Clustering Relationship Mining Discovery with Models Distillation of Data For Human Judgment

Prediction Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables) Which students are using CVS? Which students will fail the class?

Clustering Find points that naturally group together, splitting full data set into set of clusters Usually used when nothing is known about the structure of the data – What behaviors are prominent in domain? – What are the main groups of students?

Relationship Mining Discover relationships between variables in a data set with many variables – Association rule mining – Correlation mining – Sequential pattern mining – Causal data mining Beck & Mostow (2008) article is a great example of this

Discovery with Models Pre-existing model (developed with EDM prediction methods… or clustering… or knowledge engineering) Applied to data and used as a component in another analysis

Distillation of Data for Human Judgment Making complex data understandable by humans to leverage their judgment Text replays are a simple example of this

Focus of today’s class Prediction Clustering Relationship Mining Discovery with Models Distillation of Data For Human Judgment There will be a term-long class on this, taught by Joe Beck, in coordination with Carolina Ruiz’s Data Mining class, in a future year – Strongly recommended

Prediction Pretty much what it says A student is using a tutor right now. Is he gaming the system or not? A student has used the tutor for the last half hour. How likely is it that she knows the knowledge component in the next step? A student has completed three years of high school. What will be her score on the SAT-Math exam?

Two Key Types of Prediction This slide adapted from slide by Andrew W. Moore, Google

Classification General Idea Canonical Methods Assessment Ways to do assessment wrong

Classification There is something you want to predict (“the label”) The thing you want to predict is categorical – The answer is one of a set of categories, not a number – CORRECT/WRONG (sometimes expressed as 0,1) – HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE – WILL DROP OUT/WON’T DROP OUT – WILL SELECT PROBLEM A,B,C,D,E,F, or G

Classification Associated with each label are a set of “features”, which maybe you can use to predict the label

Skill           pknow  time  totalactions  right
ENTERINGGIVEN     …      …        …        WRONG
ENTERINGGIVEN     …      …        …        RIGHT
USEDIFFNUM        …      …        …        WRONG
ENTERINGGIVEN     …      …        …        RIGHT
REMOVECOEFF       …      …        …        WRONG
REMOVECOEFF       …      …        …        RIGHT
USEDIFFNUM        …      …        …        RIGHT
…
(numeric feature values were lost in transcription)

Classification The basic idea of a classifier is to determine which features, in which combination, can predict the label (same feature table as on the previous slide)

Classification Of course, usually there are more than 4 features And more than 7 actions/data points I’ve recently done analyses with 800,000 student actions, and 26 features

Classification Of course, usually there are more than 4 features And more than 7 actions/data points I’ve recently done analyses with 800,000 student actions, and 26 features 5 years ago that would’ve been a lot of data These days, in the EDM world, it’s just a medium-sized data set

Classification One way to classify is with a Decision Tree (like J48)

[Decision tree diagram: root splits on PKNOW (<0.5 vs. >=0.5), with further splits on TIME (<6 s vs. >=6 s) and TOTALACTIONS (<4 vs. >=4); leaves are RIGHT and WRONG]

Classification One way to classify is with a Decision Tree (like J48)

[Decision tree diagram, as on the previous slide: splits on PKNOW (<0.5 vs. >=0.5), TIME (<6 s vs. >=6 s), and TOTALACTIONS (<4 vs. >=4); leaves RIGHT and WRONG]

Skill          pknow  time  totalactions  right
COMPUTESLOPE     …      …        …          ?
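A tree like this can be written as nested ifs. The thresholds below are the ones visible in the slide’s diagram; the tree’s shape and leaf labels are assumptions made for illustration, since the diagram itself did not survive transcription:

```python
def predict(pknow, time_s, total_actions):
    """Hand-coded version of a small J48-style decision tree.

    The split thresholds (PKNOW 0.5, TIME 6 s, TOTALACTIONS 4) come from
    the slide; the tree's shape and leaf labels are assumed, not the
    actual J48 output.
    """
    if pknow >= 0.5:
        return "RIGHT"   # high prior knowledge -> predict correct
    if time_s < 6 and total_actions < 4:
        return "RIGHT"   # fast response, few actions -> predict correct
    return "WRONG"
```

Classifying the COMPUTESLOPE action is then just a matter of walking from the root down to a leaf with its feature values.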

Classification Another way to classify is with step regression Linear regression (discussed later), with a cut-off

And of course… There are lots of other classification algorithms you can use... SMO (support vector machine) In your favorite Machine Learning package – WEKA – RapidMiner – KEEL

Comments? Questions?

How can you tell if a classifier is any good?

What about accuracy? # correct classifications / total number of classifications 9200 actions were classified correctly, out of 10,000 actions = 92% accuracy, and we declare victory.

What are some limitations of accuracy?

Biased training set What if the underlying distribution that you were trying to predict was: 9200 correct actions, 800 wrong actions And your model predicts that every action is correct Your model will have an accuracy of 92% Is the model actually any good?

What are some alternate metrics you could use?

Kappa = (Accuracy – Expected Accuracy) / (1 – Expected Accuracy)
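The formula can be computed directly from labels and predictions; a minimal sketch, with expected accuracy estimated from each label’s marginal frequency in the data and in the predictions:

```python
def kappa(y_true, y_pred):
    # Cohen's kappa = (accuracy - expected accuracy) / (1 - expected accuracy).
    # Expected accuracy is the agreement you'd get by chance, given how
    # often each label occurs among the true labels and the predictions.
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n)
                   for c in labels)
    return (accuracy - expected) / (1 - expected)
```

A model that predicts the majority class everywhere gets an accuracy equal to the expected accuracy, and hence kappa of 0.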

What are some alternate metrics you could use? A’ The probability that if the model is given an example from each category, it will accurately identify which is which
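A′ can be computed exactly as described, by comparing scores across pairs of examples (ties counted half); for two categories this equals the area under the ROC curve. The scores below are hypothetical model confidences:

```python
def a_prime(pos_scores, neg_scores):
    # A': probability the model assigns a higher score to a randomly
    # chosen example of one category than to one of the other category;
    # ties count half.
    wins = ties = 0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
```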

Comparison Kappa – easier to compute – works for an unlimited number of categories – wacky behavior when things are worse than chance – difficult to compare two kappas in different data sets (K=0.6 is not always better than K=0.5)

Comparison A’ – more difficult to compute – only works for two categories (without complicated extensions) – meaning is invariant across data sets (A’=0.6 is always better than A’=0.55) – very easy to interpret statistically

Comments? Questions?

What data set should you generally test on? A vote… – Raise your hands as many times as you like

What data set should you generally test on? The data set you trained your classifier on A data set from a different tutor Split your data set in half (by students), train on one half, test on the other half Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times. Votes?

What data set should you generally test on? The data set you trained your classifier on A data set from a different tutor Split your data set in half (by students), train on one half, test on the other half Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times. What are the benefits and drawbacks of each?

The dangerous one (though still sometimes OK) The data set you trained your classifier on If you do this, there is serious danger of over-fitting

The dangerous one (though still sometimes OK) You have ten thousand data points. You fit a parameter for each data point. “If data point 1, RIGHT. If data point 78, WRONG…” Your accuracy is 100% Your kappa is 1 Your model will neither work on new data, nor will it tell you anything.

The dangerous one (though still sometimes OK) The data set you trained your classifier on When might this one still be OK?

K-fold cross validation (standard) Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times. What can you infer from this?

K-fold cross validation (standard) Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times. What can you infer from this? – Your detector will work with new data from the same students

K-fold cross validation (student-level) Split your data set in half (by student), train on one half, test on the other half What can you infer from this?

K-fold cross validation (student-level) Split your data set in half (by student), train on one half, test on the other half What can you infer from this? – Your detector will work with data from new students from the same population (whatever it was)
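Student-level splitting can be sketched as follows: assign each student, not each action, to a fold, so no student’s actions appear on both sides of any split. The row format (dicts with a "student" key) and function name are assumptions for illustration:

```python
import random

def student_level_folds(rows, n_folds=10, seed=0):
    # Shuffle the distinct students, deal them round-robin into folds,
    # then route every action to its student's fold.
    students = sorted({r["student"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(students)
    fold_of = {s: i % n_folds for i, s in enumerate(students)}
    folds = [[] for _ in range(n_folds)]
    for r in rows:
        folds[fold_of[r["student"]]].append(r)
    return folds
```

Training on all folds but one and testing on the held-out fold then estimates performance on new students from the same population.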

A data set from a different tutor The most stringent test When your model succeeds at this test, you know you have a good/general model When it fails, it’s sometimes hard to know why

An interesting alternative Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006) – Train on data from 3 or more tutors – Test on data from a different tutor – (Repeat for all possible combinations) – Good for giving a picture of how well your model will perform in new lessons

Comments? Questions?

Regression

There is something you want to predict (“the label”) The thing you want to predict is numerical – Number of hints student requests – How long student takes to answer – What will the student’s test score be

Regression Associated with each label are a set of “features”, which maybe you can use to predict the label

Skill           pknow  time  totalactions  numhints
ENTERINGGIVEN     …      …        …           …
ENTERINGGIVEN     …      …        …           …
USEDIFFNUM        …      …        …           …
ENTERINGGIVEN     …      …        …           …
REMOVECOEFF       …      …        …           …
REMOVECOEFF       …      …        …           …
USEDIFFNUM        …      …        …           …
…
(numeric values were lost in transcription)

Regression The basic idea of regression is to determine which features, in which combination, can predict the label’s value (same feature table as on the previous slide)

Linear Regression The most classic form of regression is linear regression – Alternatives include Poisson regression, Neural Networks...

Linear Regression The most classic form of regression is linear regression

Numhints = 0.12*Pknow + …*Time – 0.11*Totalactions (the Time coefficient was lost in transcription)

Skill          pknow  time  totalactions  numhints
COMPUTESLOPE     …      …        …           ?

Linear Regression Linear regression only fits linear functions (except when you apply transforms to the input variables, which RapidMiner can do for you…)

Linear Regression However… It is blazing fast It is often more accurate than more complex models, particularly once you cross-validate – Data Mining’s “Dirty Little Secret” It is feasible to understand your model (with the caveat that the second feature in your model is in the context of the first feature, and so on)
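Fitting a linear function like the one above is a one-line least-squares solve. The feature rows (pknow, time, totalactions) and hint counts below are made up for illustration; none of these numbers come from the slides:

```python
import numpy as np

# Hypothetical feature rows: (pknow, time in seconds, totalactions).
X = np.array([[0.3, 5.0, 2.0],
              [0.7, 2.0, 1.0],
              [0.1, 9.0, 4.0],
              [0.9, 1.0, 1.0],
              [0.5, 4.0, 3.0]])
y = np.array([2.0, 0.0, 4.0, 0.0, 1.0])          # made-up Numhints labels

X1 = np.column_stack([X, np.ones(len(X))])       # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # least-squares fit
pred = X1 @ coef                                 # predicted Numhints
```

The fitted coefficients are directly readable, which is exactly the interpretability advantage (and the caveat) the slide describes.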

Example of Caveat Let’s study a classic example

Example of Caveat Let’s study a classic example Drinking too much prune nog at a party, and having an emergency trip to the Little Researcher’s Room

Data

Some people are resistant to the deleterious effects of prunes and can safely enjoy high quantities of prune nog!

Learned Function Probability of “emergency” = 0.25 * (# drinks of nog last 3 hours) – … * (drinks of nog last 3 hours)² (the squared term’s coefficient was lost in transcription) But does that actually mean that (drinks of nog last 3 hours)² is associated with fewer “emergencies”?

Learned Function Probability of “emergency” = 0.25 * (# drinks of nog last 3 hours) – … * (drinks of nog last 3 hours)² (the squared term’s coefficient was lost in transcription) But does that actually mean that (drinks of nog last 3 hours)² is associated with fewer “emergencies”? No!

Example of Caveat (Drinks of nog last 3 hours)² is actually positively correlated with emergencies! – r = 0.59

Example of Caveat The relationship is only in the negative direction when (Drinks of nog last 3 hours) is already in the model…

Example of Caveat So be careful when interpreting linear regression models (or almost any other type of model)
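The caveat can be reproduced with toy data: construct an outcome that rises with drinks but flattens out, so drinks² correlates positively with the outcome on its own, yet gets a negative weight once drinks is already in the model. All numbers here are invented for illustration, not the slide’s data:

```python
import numpy as np

drinks = np.arange(1.0, 11.0)
outcome = 0.25 * drinks - 0.01 * drinks**2   # rises, but flattens out
sq = drinks**2

corr = np.corrcoef(sq, outcome)[0, 1]        # drinks^2 alone: positive corr
A = np.column_stack([drinks, sq, np.ones_like(drinks)])
b, *_ = np.linalg.lstsq(A, outcome, rcond=None)
# b[1], the weight on drinks^2, is negative once drinks is in the model
```

Same variable, opposite signs, depending on what else is in the model: exactly why regression coefficients must be read in context.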

Comments? Questions?

Discovery with Models

Why do Discovery with Models? Let’s say you have a model of some construct of interest or importance – Knowledge – Meta-Cognition – Motivation – Affect – Inquiry Skill – Collaborative Behavior – Etc.

Why do Discovery with Models? You can use that model to – Find outliers of interest by finding out where the model makes extreme predictions – Inspect the model to learn what factors are involved in predicting the construct – Find out the construct’s relationship to other constructs of interest, by studying its correlations/associations/causal relationships with data/models on the other constructs – Study the construct across contexts or students, by applying the model within data from those contexts or students – And more…

Most frequently Done using prediction models Though other types of models (in particular knowledge engineering models) are amenable to this as well!

Boosting

Let’s say that you have 300 labeled actions randomly sampled from 600,000 overall actions – Not a terribly unusual case, in these days of massive data sets, like those in the PSLC DataShop You can train the model on the 300, cross-validate it, and then apply it to all 600,000 And then analyze the model across all actions – Makes it possible to study larger-scale problems than a human could do without computer assistance – Especially nice if you have some unlabeled data set with nice properties For example, additional data such as questionnaire data (cf. Baker, Walonoski, Heffernan, Roll, Corbett, & Koedinger, 2008)

However… To do this and trust the result, You should validate that the model can transfer across students, populations, and to the learning software you’re using – As discussed earlier

A few examples…

Middle School Gaming Detector

                  HARDEST SKILLS (pknow < 20%)   EASIEST SKILLS (pknow > 90%)
GAMED, HURT       12% of the time                2% of the time
GAMED, NOT HURT   2% of the time                 4% of the time

Skills from the Algebra Tutor (L0 = initial probability of knowing the skill; T = probability of learning the skill at each opportunity)

Skill                                       L0     T
AddSubtractTypeinSkillIsolatepositiveIso    0.01   …
ApplyExponentExpandExponentsevalradicalE    …      …
CalculateEliminateParensTypeinSkillElimi    …      …
CalculatenegativecoefficientTypeinSkillM    …      …
Changingaxisbounds                          0.01   …
Changingaxisintervals                       0.01   …
ChooseGraphicala                            …      …
combineliketermssp                          …      …
(most parameter values were lost in transcription)

Which skills could probably be removed from the tutor? (same skill table as on the previous slide)

Which skills could use better instruction? (same skill table as on the previous slides)

Comments? Questions?

A lengthier example (if there’s time) Applying Baker et al’s (2008) gaming detector across contexts

Research Question Do students game the system because of state or trait factors? If trait factors are the main explanation, differences between students will explain much of the variance in gaming If state factors are the main explanation, differences between lessons could account for many (but not all) state factors, and explain much of the variance in gaming So: is the student or the lesson a better predictor of gaming?

Application of Detector After validating its transfer We applied the gaming detector across 35 lessons, used by 240 students, from a single Cognitive Tutor Giving us, for each student in each lesson, a gaming frequency

Model Linear Regression models – Gaming frequency = Lesson + ε0 – Gaming frequency = Student + ε0

Model Categorical variables transformed to a set of binaries i.e. Lesson = Scatterplot becomes 3DGeometry = 0 Percents = 0 Probability = 0 Scatterplot = 1 Boxplot = 0 Etc…
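The binarization described above can be sketched as follows; the function name and row format are hypothetical, and the lesson names come from the slide:

```python
def dummy_code(values, categories=None):
    # Expand one categorical variable into a set of 0/1 indicator columns,
    # e.g. Lesson = "Scatterplot" -> Scatterplot = 1, all other lessons = 0.
    categories = categories or sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]
```

Each lesson (or student) then becomes one binary predictor in the regression.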

Metrics

r² The correlation, squared The proportion of variability in the data set that is accounted for by a statistical model


r² However, a limitation The more variables you have, the more variance you should expect to predict, just by chance

r² We should expect 240 students to predict gaming better than 35 lessons, just by overfitting

So what can we do?

BiC Bayesian Information Criterion (Raftery, 1995) Makes trade-off between goodness of fit and flexibility of fit (number of parameters)
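The trade-off can be illustrated with one common least-squares form of BIC; note that the slides use BiC′ (Raftery’s variant, computed relative to a null model), which this sketch does not reproduce:

```python
import math

def bic(n, rss, k):
    # A common least-squares form of the Bayesian Information Criterion:
    # n*ln(RSS/n) + k*ln(n). Lower is better; the k*ln(n) term penalizes
    # each extra parameter, so a bigger model must earn its keep by
    # reducing the residual sum of squares (RSS) enough to compensate.
    return n * math.log(rss / n) + k * math.log(n)
```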

Predictors

The Lesson Gaming frequency = Lesson + ε0 35 parameters r² = 0.55 BiC’ = –… (digits lost in transcription) – Model is significantly better than chance would predict given model size & data set size

The Student Gaming frequency = Student + ε0 240 parameters r² = 0.16 BiC’ = 1382 – Model is worse than chance would predict given model size & data set size!

Standard deviation bars, not standard error bars

Comments? Questions?

EDM – where? Holistic Entitative Existentialist Essentialist

Today’s Class EDM Assignment#5 Mega-Survey

Any questions?

Today’s Class EDM Assignment#5 Mega-Survey

I need a volunteer to bring these surveys to Jim Doyle after class *NOT THE REGISTRAR*

Mega-Survey Additional Questions (See back)
#1: In future years, should this class be given
– 1: In half a semester, as part of a unified semester class, along with Professor Skorinko’s Research Methods class
– 3: Unsure/neutral
– 5: As a full-semester class, with Professor Skorinko’s class as a prerequisite
#2: Are there any topics you think should be dropped from this class? [write your answer in the space to the right]
#3: Are there any topics you think should be added to this class? [write your answer in the space to the right]