Special Topics in Educational Data Mining HUDK5199 Spring term, 2013 February 18, 2013.

Today’s Class Classification And then some discussion of features in Excel between end of class and 5pm – We will start today, and continue in future classes as needed

Types of EDM method (Baker & Siemens, under review) Prediction – Classification – Regression – Latent Knowledge Estimation Structure Discovery – Clustering – Factor Analysis – Domain Structure Discovery – Network Analysis Relationship mining – Association rule mining – Correlation mining – Sequential pattern mining – Causal data mining Distillation of data for human judgment Discovery with models

We have already studied Prediction – Classification – Regression – Latent Knowledge Estimation Structure Discovery – Clustering – Factor Analysis – Domain Structure Discovery – Network Analysis Relationship mining – Association rule mining – Correlation mining – Sequential pattern mining – Causal data mining Distillation of data for human judgment Discovery with models

Today’s Class Prediction – Classification – Regression – Latent Knowledge Estimation Structure Discovery – Clustering – Factor Analysis – Domain Structure Discovery – Network Analysis Relationship mining – Association rule mining – Correlation mining – Sequential pattern mining – Causal data mining Distillation of data for human judgment Discovery with models

Prediction Pretty much what it says A student is using a tutor right now. Is he gaming the system or not? A student has used the tutor for the last half hour. How likely is it that she knows the skill in the next step? A student has completed three years of high school. What will be her score on the college entrance exam?

Classification There is something you want to predict (“the label”) The thing you want to predict is categorical – the answer is one of a set of categories, not a number – CORRECT/WRONG (sometimes expressed as 0,1) – this is what we used in Latent Knowledge Estimation – HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE – WILL DROP OUT/WON’T DROP OUT – WILL SELECT PROBLEM A, B, C, D, E, F, or G

Where do those labels come from? Field observations Text replays Post-test data Tutor performance Survey data School records Where else?

Classification Associated with each label are a set of “features”, which maybe you can use to predict the label [Table: example data set – one row per student action, with features skill (ENTERINGGIVEN, USEDIFFNUM, REMOVECOEFF, …), pknow, time, and totalactions, plus the label right (RIGHT/WRONG)]

Classification The basic idea of a classifier is to determine which features, in which combination, can predict the label [Same example table: which combination of skill, pknow, time, and totalactions predicts RIGHT vs. WRONG?]

Classification Of course, usually there are more than 4 features, and more than 7 actions/data points. These days, 800,000 student actions and 26 features would be a medium-sized data set
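As a minimal sketch of that setup (pandas and scikit-learn here are illustrative stand-ins for the RapidMiner/Weka workflow used in class, and every value is made up):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows echoing the example table: each student action has a skill,
# an estimated P(know), time taken, total actions so far, and the label.
data = pd.DataFrame({
    "skill": ["ENTERINGGIVEN", "ENTERINGGIVEN", "USEDIFFNUM", "REMOVECOEFF"],
    "pknow": [0.3, 0.6, 0.2, 0.8],
    "time": [9.0, 4.5, 12.0, 3.0],
    "totalactions": [1, 2, 1, 3],
    "right": ["WRONG", "RIGHT", "WRONG", "RIGHT"],
})

X = pd.get_dummies(data.drop(columns="right"))  # one-hot encode the skill
y = data["right"]                                # the label to predict

clf = DecisionTreeClassifier().fit(X, y)         # any classifier fits here
print(clf.predict(X))
```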

Classifiers There are literally hundreds of classification algorithms that have been proposed/published/tried out A good data mining package will have many implementations – RapidMiner – SAS Enterprise Miner – Weka – KEEL

Domain-Specificity Specific algorithms work better for specific domains and problems We often have hunches for why that is But it’s more in the realm of “lore” than really “engineering”

Some algorithms you probably don’t want to use Support Vector Machines – Maps the data into a transformed (often higher-dimensional) space and then fits a hyperplane which splits the classes – Creates very sophisticated models – Great for text mining – Great for sensor data – Usually pretty lousy for educational log data

Some algorithms you probably don’t want to use Genetic Algorithms – Uses mutation, combination, and natural selection to search space of possible models – Obtains a different answer every time (usually) – Seems really awesome – Usually doesn’t produce the best answer

Some algorithms you probably don’t want to use Neural Networks – Composes extremely complex relationships through combining “perceptrons” – Usually over-fits for educational log data

Note Support Vector Machines and Neural Networks are great for some problems I just haven’t seen them be the best solution for educational log data

Some algorithms you might find useful
Step Regression
Logistic Regression
J48/C4.5 Decision Trees
JRip Decision Rules
K* Instance-Based Classifier
There are many others!

Logistic Regression

Already discussed in class Fits a logistic function to the data to find the probability/odds of a specific value of the dependent variable, given a specific set of values of the predictor variables

Logistic Regression m = a0 + a1v1 + a2v2 + a3v3 + a4v4…

Logistic Regression p(m) = 1 / (1 + e^(-m))
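In code, the model is just the linear term m pushed through that logistic function. A minimal sketch with made-up coefficients (the a and v names follow the slide’s equation):

```python
import numpy as np

def logistic(m):
    # p(m) = 1 / (1 + e^(-m)): squashes the linear term into a probability
    return 1.0 / (1.0 + np.exp(-m))

a = np.array([0.2, 1.5, -0.7])   # made-up coefficients a0 (intercept), a1, a2
v = np.array([1.0, 0.4, 2.0])    # leading 1 pairs with the intercept
m = a @ v                        # m = a0 + a1*v1 + a2*v2
print(logistic(m))               # predicted probability of the class
```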

Parameters fit Through maximum likelihood estimation (e.g., iteratively reweighted least squares)

Relatively conservative Thanks to simple functional form, is a relatively conservative algorithm – Less tendency to over-fit

Good for Cases where changes in the value of predictor variables have predictable effects on the probability of the predicted class

Good when multi-level interactions are not particularly common Can be given interaction effects through automated feature distillation – We’ll look at this later But is not particularly optimal for this
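One simple form of that distillation is multiplying features together so the interaction becomes an explicit column; a sketch (scikit-learn shown as an illustrative stand-in, with made-up values):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0.3, 9.0], [0.6, 4.5]])   # e.g., pknow and time
distiller = PolynomialFeatures(degree=2, interaction_only=True,
                               include_bias=False)
X_plus = distiller.fit_transform(X)       # appends a pknow*time column
print(X_plus)
```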

Note RapidMiner and Weka do not actually choose features for you – you have to select features by hand, or in Java code that calls RapidMiner or Weka. It is easy to implement step-wise regression by hand, but painful to implement other feature selection algorithms

Step Regression

Fits a linear regression function – (discussed in detail in a later class) – with an arbitrary cut-off: selects parameters, assigns a weight to each parameter, and computes a numerical value. Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1

Example Y = 0.5a + 0.7b – 0.2c + 0.4d, with cut-off 0.5 [table of example a, b, c, d values]
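A minimal sketch of the same computation, with hypothetical rows standing in for that table:

```python
def step_regression(a, b, c, d, cutoff=0.5):
    # Linear prediction, then threshold: >= cutoff -> 1, else 0
    y = 0.5 * a + 0.7 * b - 0.2 * c + 0.4 * d
    return 1 if y >= cutoff else 0

# Hypothetical (a, b, c, d) rows
for row in [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0)]:
    print(row, "->", step_regression(*row))   # Y = 0.7, -0.2, 1.2
```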

Parameters fit Through Iterative Gradient Descent This is a simple enough model that this approach actually works…

Most conservative This is the most conservative classifier except for the fabled “0R” – More on that in a minute

Good for Cases where relationships between predictor and predicted variables are relatively linear

Good when multi-level interactions are not particularly common Can be given interaction effects through automated feature distillation – We’ll look at this later But is not particularly optimal for this

Feature Selection
Greedy – simplest model
M5’ – in between
None – most complex model

Greedy Also called Forward Selection – even simpler than Stepwise Regression
1. Start with an empty model
2. Find which remaining feature best predicts the data when added to the current model
3. If the improvement to the model is over threshold (in terms of SSR or statistical significance)
4. Then add that feature to the model, and go to step 2
5. Else quit
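A minimal sketch of this loop (cross-validated R^2 stands in for the slide’s SSR/significance criterion, and the data is randomly generated):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, threshold=0.01):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # Step 2: which remaining feature helps most when added?
        score, j = max(
            (cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean(), j)
            for j in remaining
        )
        # Steps 3-5: keep it only if the improvement clears the threshold
        if score - best_score < threshold:
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2 * X[:, 0] + rng.normal(size=60)   # only feature 0 actually matters
print(forward_select(X, y))              # likely [0]
```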

M5’ Will be discussed in detail in regression lecture

0R

Always say 0
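In Weka’s terms this is ZeroR: ignore the features and always predict the most frequent class. A one-line analogue in scikit-learn:

```python
from sklearn.dummy import DummyClassifier

# 0R baseline: ignores the features entirely, always predicts the majority label
zero_r = DummyClassifier(strategy="most_frequent")
zero_r.fit([[0], [0], [0]], ["WRONG", "WRONG", "RIGHT"])
print(zero_r.predict([[0]]))   # -> ['WRONG']
```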

Decision Trees

Decision Tree [Figure: example decision tree – the root splits on PKNOW (<0.5 vs. >=0.5), with further splits on TIME (<6s vs. >=6s) and TOTALACTIONS (<4 vs. >=4) leading to RIGHT/WRONG leaves; the exercise is to classify a new COMPUTESLOPE action from its skill, pknow, time, and totalactions values]

Decision Tree Algorithms There are several. I usually use J48, which is an open-source re-implementation of C4.5 (Quinlan, 1993)

J48/C4.5 Can handle both numerical and categorical predictor variables – Tries to find the optimal split in numerical variables Repeatedly looks for the variable which best splits the data in terms of predictive power (using information gain) Later prunes out branches that turn out to have low predictive power Note that different branches can have different features!

Can be adjusted… To split based on more or less evidence To prune based on more or less predictive power
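A rough analogue of those knobs in scikit-learn (its DecisionTreeClassifier implements CART rather than C4.5 – Weka’s J48 is the faithful choice – but the controls correspond; data is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0.3, 9.0], [0.6, 4.5], [0.2, 12.0],
              [0.8, 3.0], [0.7, 5.0], [0.1, 15.0]])   # pknow, time
y = np.array([0, 1, 0, 1, 1, 0])                       # WRONG=0, RIGHT=1

tree = DecisionTreeClassifier(
    min_samples_leaf=2,   # split only with more evidence
    ccp_alpha=0.01,       # prune harder as this grows
).fit(X, y)
print(export_text(tree, feature_names=["pknow", "time"]))
```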

Relatively conservative Thanks to pruning step, is a relatively conservative algorithm – Less tendency to over-fit

Good when data has natural splits

Good when multi-level interactions are common

Good when same construct can be arrived at in multiple ways A student is likely to drop out of college when he – Starts assignments early but lacks prerequisites OR when he – Starts assignments the day they’re due

Decision Rules

Many Algorithms Differences are in terms of what metric is used and how rules are generated Most popular subcategory (including JRip and PART) repeatedly creates decision trees and distills best rules

Generating Rules from Decision Tree
1. Create Decision Tree
2. If there is at least one path that is worth keeping, go to 3, else go to 6
3. Take the “best” single path from root to leaf and make that path a rule
4. Remove all data points classified by that rule from the data set
5. Go to step 1
6. Take all remaining data points
7. Find the most common value for those data points
8. Make an “otherwise” rule using that
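A minimal sketch of that loop (scikit-learn trees stand in for the trees JRip/PART would grow; labels are assumed to be 0/1 integers, and the depth, leaf-size, and confidence thresholds are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def best_rule(X, y, min_conf=0.7):
    # Step 3: grow a shallow tree and return its most confident
    # root-to-leaf path as (tests, predicted_class, confidence)
    t = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y).tree_
    best = None

    def walk(node, tests):
        nonlocal best
        if t.children_left[node] == -1:          # reached a leaf
            counts = t.value[node][0]
            conf = counts.max() / counts.sum()
            if conf >= min_conf and (best is None or conf > best[2]):
                best = (tests, int(counts.argmax()), conf)
            return
        f, thr = t.feature[node], t.threshold[node]
        walk(t.children_left[node], tests + [(f, thr, "<=")])
        walk(t.children_right[node], tests + [(f, thr, ">")])

    walk(0, [])
    return best

def matches(X, tests):
    # Rows of X satisfying every test in a rule
    mask = np.ones(len(X), dtype=bool)
    for f, thr, op in tests:
        mask &= (X[:, f] <= thr) if op == "<=" else (X[:, f] > thr)
    return mask

def distill_rules(X, y):
    # Steps 1-8: repeatedly extract the best rule, drop the points it
    # covers, and finish with an "otherwise" rule
    rules = []
    while len(X) >= 10:                          # enough data left for a tree
        rule = best_rule(X, y)
        if rule is None:                         # no path worth keeping
            break
        rules.append(rule)
        keep = ~matches(X, rule[0])              # step 4: remove covered points
        X, y = X[keep], y[keep]
    otherwise = int(np.bincount(y).argmax()) if len(y) else None
    return rules, otherwise

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] > 0).astype(int)                    # made-up labels
print(distill_rules(X, y))
```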

Relatively conservative Leads to simpler models than most decision trees – Less tendency to over-fit

Very interpretable model Unlike most other approaches

Example (Baker & Clarke-Midura, under review)
1. IF the student spent at least 66 seconds reading the parasite information page, THEN the student will obtain the correct final conclusion (confidence = 81.5%)
2. IF the student spent at least 12 seconds reading the parasite information page AND the student read the parasite information page at least twice AND the student spent no more than 51 seconds reading the pesticides information page, THEN the student will obtain the correct final conclusion (confidence = 75.0%)
3. IF the student spent at least 44 seconds reading the parasite information page AND the student spent under 56 seconds reading the pollution information page, THEN the student will obtain the correct final conclusion (confidence = 68.8%)
4. OTHERWISE the student will not obtain the correct final conclusion (confidence = 89.0%)

Good when multi-level interactions are common

Good when same construct can be arrived at in multiple ways A student is likely to drop out of college when he – Starts assignments early but lacks prerequisites OR when he – Starts assignments the day they’re due

K*

Instance-Based Classifier Takes a data point to predict Finds the K closest points to that data point, by Euclidean distance – K often equals 3 Gives the data point whichever class is most common among those K points
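As described here, this is essentially k-nearest neighbors (K* proper uses an entropy-based distance rather than Euclidean); a minimal scikit-learn sketch with made-up data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.2, 10.0], [0.3, 9.0], [0.7, 4.0], [0.8, 3.0]])
y = np.array(["WRONG", "WRONG", "RIGHT", "RIGHT"])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, Euclidean by default
knn.fit(X, y)                               # "training" = storing the whole data set
print(knn.predict([[0.75, 3.5]]))           # majority vote among the 3 closest points
```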

Good when data is very divergent Lots of different processes can lead to the same result Impossible to find general rules But data points that are similar tend to be from the same class

Big Drawback To use the model, you need to have the whole data set

Big Advantage Sometimes works when nothing else works Has been useful for my group in affect detection

Comments? Questions?

Generating Confidences Each of these approaches gives not just a final answer, but a confidence (or pseudo-confidence) Many applications of confidences – we’ll discuss in detail in next lecture

Generating Confidences Step Regression – raw value of regression Logistic Regression – p(m) JRip/J48 – ratio of correct classifications to incorrect classifications in each leaf – I will show you an example of this in a minute
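The same three recipes sketched in scikit-learn terms (an illustrative stand-in for the RapidMiner output discussed below; data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.2, 10.0], [0.3, 9.0], [0.7, 4.0], [0.8, 3.0]])
y = np.array([0, 0, 1, 1])            # WRONG=0, RIGHT=1
X_new = np.array([[0.5, 6.0]])

# Step regression: the raw regression value, before thresholding at 0.5
print(LinearRegression().fit(X, y).predict(X_new))

# Logistic regression: p(m) itself
print(LogisticRegression().fit(X, y).predict_proba(X_new)[:, 1])

# J48/JRip-style: share of training points in the matching leaf that
# carry the predicted class
print(DecisionTreeClassifier().fit(X, y).predict_proba(X_new).max(axis=1))
```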

Hands-On Activity Running an algorithm in RapidMiner 4.6 Using some made-up data, just to see how things work

Open RapidMiner And open classifier.xml Let’s go through this script

Use RapidMiner Run JRip without cross-validation Run JRip with action-level cross-validation Run JRip with batch-level cross-validation – Set to implement student-level cross-validation – We’ll discuss this more in the next lecture
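The distinction between those validation levels, sketched in scikit-learn terms (an analogue to the RapidMiner batch setup; the data is randomly generated):

```python
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # made-up action-level features
y = rng.integers(0, 2, size=100)         # made-up labels
students = np.repeat(np.arange(10), 10)  # 10 actions per student

clf = DecisionTreeClassifier()

# Action-level CV: one student's actions can land in both train and test
print(cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean())

# Student-level CV: each student's actions stay on one side of the split
print(cross_val_score(clf, X, y, groups=students, cv=GroupKFold(5)).mean())
```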

Confidences How to interpret model confidence from RapidMiner output

Let’s look at the data Are there any features we shouldn’t be using?

Remove those features Run the model again

Run other algorithms J48

Run other algorithms J48 K*

Run other algorithms J48 K* Step Regression

Run other algorithms J48 K* Step Regression – Why didn’t Step Regression work? – How can we fix it?

Comments? Questions?

Comments or Questions About Assignment 3?

Next Class Wednesday, February 20 Behavior Detection Assignment due: Assignment 3 – Behavior Detection

Excel Plan is to go as far as we can by 5pm We will continue after next class session Vote on which topics you most want to hear about

Topics
Using average, count, sum, stdev (asgn. 4 data set)
Relative and absolute referencing (made up data)
Copy and paste values only (made up data)
Using sort, filter (asgn. 4 data set)
Making pivot table (asgn. 4 data set)
Using vlookup (Jan. 28 class data set)
Using countif (asgn. 4 data set)
Making scatterplot (Jan. 28 class data set)
Making histogram (asgn. 4 data set)
Equation Solver (Jan. 28 class data set)
Z-test (made up data)
2-sample t-test (made up data)
Other topics?

The End