Targeted student support: A machine learning approach
Rohan Posthumus T: +27(0)
Introduction: some background
- I am a researcher with a Natural Sciences background, hired as a data analyst at CTL UFS, and I inherited lots of data.
- My context changed: previous research followed a deductive approach (start with a question); the new research follows an inductive approach (start with the data).
- I need different tools to dig through the data. Are machine learning tools useful in Higher Education?
- Test this by building an early warning system.
Introduction: Early Warning System definitions
- "See something, say something" system: anyone concerned about a student can raise an alert (online form, etc.).
- Flagging system: the institution uses student data and flags a student when certain criteria are met.
- Artificial Intelligence system: algorithms are trained on data and decisions are made based on prediction and probability.
Let's explore…
Early warning system: find risk factors that predict first-year marks
[Diagram: historical data from past students and their university performance (0 – 25%, 25 – 50%, 50 – 75%, 75 – 100%) feed the early warning system, which predicts categories for new students from new data and triggers intervention.]
Risk factors: population = first-time entering undergraduate students from 2014 – 2018 (N = 35521), BFN and QQ campuses.
- Logistic regression to determine significance (a rough sketch of this check follows below).
- Dependent variable = students' average final mark in year 1, divided into a binary "Pass_Fail" classification.
- Independent variables = 11 variables (institutional data), chosen to be available early and actionable (e.g. gender is not actionable), so that they can feed an intervention.
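The deck does not show the analysis code; purely as an illustration, the significance check could be run as a logistic regression with statsmodels and repeated per faculty and pathway. The column names (pass_fail, faculty, pathway) and the predictor list are hypothetical, not the presenter's actual institutional fields.

```python
# Hypothetical sketch: logistic regression to flag significant predictors of a
# binary Pass_Fail outcome. Column names and predictors are assumptions.
from collections import Counter

import pandas as pd
import statsmodels.api as sm

def significant_predictors(df: pd.DataFrame, predictors: list[str], alpha: float = 0.05) -> dict:
    """Fit Pass_Fail ~ predictors and return the p-values below alpha."""
    X = sm.add_constant(df[predictors].astype(float))
    y = df["pass_fail"]                      # 1 = average final mark >= 50%, 0 = fail
    model = sm.Logit(y, X).fit(disp=0)
    return {var: p for var, p in model.pvalues.items() if var != "const" and p < alpha}

# Counting how often each variable comes up significant across faculties/pathways:
# counts = Counter()
# for (faculty, pathway), group in students.groupby(["faculty", "pathway"]):
#     counts.update(significant_predictors(group, predictors))
```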
Risk factors: faculties split by main/extended pathways.
[Chart: variables used and the number of students per variable; the most frequent counts of significance when looking across all faculties and pathways.]
Risk factors included in the predictive model (Yes = significant).
Risk factors are faculty-sensitive and vary between faculties.
Processing challenges
1. Missing values: NBT not taken; international students' marks are measured differently; not all students have the same subjects in Gr12. Modal imputation changed the statistical interpretation and generated too many data points for some variables, so we did not use imputation.
2. Skewness: some models assume normality and the data were skewed. Transformations did not change the interpretation, so we did not use them.
3. Outliers: the interquartile range (k = 1.5) method removed outliers but did not change the interpretations, so it was used as data validation (a sketch of the rule follows below).
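As a minimal sketch of the IQR rule mentioned in point 3 (the column name used in the example is hypothetical):

```python
# Minimal sketch of the IQR (k = 1.5) outlier rule used as a data-validation check.
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Example: flag suspicious Grade 12 averages for manual inspection (hypothetical column).
# mask = iqr_outliers(students["gr12_average"])
# print(students.loc[mask])
```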
Predictive model: selected the most frequently significant variables.
- Created a success metric for classification, called "Risk categories", based on average final marks: Category 1 = 0 – 25%, Category 2 = 25 – 50%, Category 3 = 50 – 75%, Category 4 = 75 – 100%.
- Tested a few models; Random Forest gave the best results and was the model of choice.
- Partitioned the data into two groups (train and test) of 50% each, stratified by the four categories (a sketch of this step follows below).
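The exact modelling pipeline is not shown in the deck; the sketch below assumes scikit-learn and illustrates a 50/50 split stratified by the four risk categories followed by a Random Forest classifier. Column names and hyperparameters are assumptions.

```python
# Sketch: 50/50 train/test split stratified by the four risk categories, then a Random Forest.
# Column names and settings below are assumptions, not the presenter's actual pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_ews_model(df: pd.DataFrame, feature_cols: list[str]):
    X, y = df[feature_cols], df["risk_category"]   # categories 1-4 derived from average final mark
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=42
    )
    model = RandomForestClassifier(n_estimators=500, random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```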
Predictive model: Total = 30052 students, Train = 15026, Test = 15026.
Two questions: Is it useful, i.e. actionable? Are the predictions correct/successful?
Actionable?
- Tiered intervention: mentors, skills development, tutorials, advising. Larger groups get scalable interventions; smaller groups get personalised interventions.
- Diagnostics: why were some predictions incorrect? Some students were predicted to pass but did not; this becomes a research question.
Actionable? Borderline cases, tiered intervention, anomaly detection.
Correct? Predictions were either too high or too low.
[Chart legend: green = correct prediction; black = wrong prediction (smaller group); red = wrong prediction (larger group).]
Correct? Some predictions were out by more than 50%!!
Sensitivity & specificity problem (true positive rate & true negative rate, respectively)
Algorithm challenges: the algorithm is bad at detecting categories 4, 2 and 1, and bad at rejecting category 3. Accuracy is inflated by many false positives: most students' average marks fall in category 3, so the algorithm is correct "on average" simply by favouring that class. Cohen's kappa measures reliability beyond chance (0 = random, 1 = reliable); ours fell in the none/slight range, which motivated the next approach (see the evaluation sketch below).
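To make the inflation visible, a quick check with scikit-learn metrics (assuming the true and predicted categories from the model above) can report per-class sensitivity and specificity alongside overall accuracy and Cohen's kappa. This is an illustrative evaluation, not the presenter's actual reporting code.

```python
# Sketch: why accuracy can look good while per-class detection and kappa are poor.
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

def evaluate(y_true, y_pred, labels=(1, 2, 3, 4)):
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    for i, label in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i].sum() - tp
        fp = cm[:, i].sum() - tp
        tn = cm.sum() - tp - fn - fp
        sens = tp / (tp + fn) if (tp + fn) else float("nan")   # true positive rate
        spec = tn / (tn + fp) if (tn + fp) else float("nan")   # true negative rate
        print(f"category {label}: sensitivity={sens:.2f}, specificity={spec:.2f}")
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))  # 0 = chance, 1 = perfect
```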
Sampling: decided to go with undersampling.
- We did not want to create copies of the data (as oversampling would).
- A smaller dataset carries the risk of losing information.
A minimal undersampling sketch follows below.
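The deck does not say which undersampling routine was used; a simple version, assuming pandas and the hypothetical risk_category label, samples each class down to the size of the smallest one.

```python
# Sketch: random undersampling so every risk category has the size of the smallest class.
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "risk_category", seed: int = 42) -> pd.DataFrame:
    """Return a balanced copy of df with each class reduced to the minority-class size."""
    n_min = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=n_min, random_state=seed))
          .reset_index(drop=True)
    )
```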
Algorithms and predictions: 10 ML algorithms, 100 runs (cross-validation); an illustrative comparison sketch follows below.
Accuracy decreased by 20%! Is this due to undersampling, or is this the true predictive power?
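The ten algorithms are not named on the slide; purely as an illustration, repeated stratified cross-validation (10 folds x 10 repeats = 100 runs) over a few stand-in classifiers could be set up like this with scikit-learn.

```python
# Illustration only: repeated cross-validation over a few candidate classifiers.
# The slide's ten algorithms are not named, so these models are stand-ins.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def compare_models(X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)  # 100 runs
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "knn": KNeighborsClassifier(),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.3f}")
```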
Challenge accepted! And so I went down the rabbit hole…
Where we are now…
- Added unsupervised machine learning (PCA) as a pre-processing step (a sketch follows below).
- Accuracy increased by 12.7%.
- Cohen's kappa increased six-fold: from none/slight to moderate/substantial.
- The ML-based EWS is to be piloted in 2020.
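The slide only states that PCA was added as a pre-processing step; one way to express that idea is a scikit-learn pipeline. The scaling choice and component count here are guesses, not the presenter's settings.

```python
# Sketch: unsupervised PCA as a pre-processing step before the classifier.
# The scaler and the number of components retained are assumptions.
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),            # PCA is sensitive to feature scale
    ("pca", PCA(n_components=0.95)),        # keep components explaining 95% of variance
    ("forest", RandomForestClassifier(n_estimators=500, random_state=42)),
])
# pipeline.fit(X_train, y_train); then evaluate accuracy and Cohen's kappa as before.
```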
Lessons learned
- Your data determine your limits (you need depth, breadth and quality).
- Bar graphs are easier to use (complex ≠ better).
- Algorithms cheat sometimes (mine had inflated accuracy).
- Everything is a trade-off (ML has many promises, but…).
- ML requires multidisciplinary teams (silos are bad).
- Ethical considerations: labelling a student as "high risk" can be a self-fulfilling prophecy.
- Drink more coffee (my first try was fun!).
Early Warning System considerations
- Are we looking at the correct risk factors? Should we include qualitative data: surveys, campus access data, etc.?
- Are we looking at the correct success metric? How much does the average final mark in first year correlate with graduation?
- Are we looking at a homogeneous population? E.g. should we create an algorithm for each faculty / race / campus? Does the population change?
- Does the data support our assumptions about the algorithms?
By exploring machine learning, we could help students earlier.
Thank you!