Learning Analytics Research

Presentation on theme: "Learning Analytics Research"— Presentation transcript:

1 Learning Analytics Research
Ryan Baker Teachers College, Columbia University

2 Thanks For inviting me to visit today
Always a pleasure and honor to visit the London Knowledge Lab

3 Please interrupt anytime…

4 Developments I’ll be discussing today
Big Data in Education
Learning Analytics/Educational Data Mining

5 Big Data

6 Data Used to Be: Dispersed, Hard to Collect, Small-Scale
Collecting sizable amounts of data required heroic efforts

7 Tycho Brahe Spent 24 years observing the sky from a custom-built castle on the island of Hven

8-11 Johannes Kepler Had to take a job with Brahe to get Brahe’s data
Only got unrestricted access to data… when Brahe died and Kepler stole the data and fled to Germany

12-13 Alex Bowers Teachers College, Columbia University
“For my dissertation I wanted to collect all of the data for all of the assessments (tests and grades and discipline reports, and attendance, etc.) for all of the students in entire cohorts from a school district for all grade levels, K-12. To get the data, the schools had it as the students' "permanent record", stored in the vault of the high school next to the boiler, ignored and unused. The districts would set me up in the nurse's office with my laptop and I'd trudge up and down the stairs into the basement to pull 3-5 files at a time and I'd hand enter the data into SPSS. Eventually I got fast enough to do about 10 a day, max.”

14 Data Today Every day, every one of us generates lots of data, with almost everything we do.

18 Data Today As more learning takes place within educational software and online learning environments of various types, it becomes much easier to gather very rich data on individual students’ learning and engagement within specific subjects. For example, a student might use a science simulation like Inq-ITS to learn science content and inquiry skills. Or they might learn scientific inquiry skills and content within a virtual environment like EcoMUVE. They might learn math skills in an action game like Zombie Division – the student has a set of weapons with numbers associated with them, a 2 for a sword, or a 5 for a gauntlet, and they can divide a skeleton if the weapon's number divides the number on the skeleton’s chest. Or they might learn math in a conceptual story-based learning environment like Reasoning Mind… or by doing math problems in a workbook-like environment like ASSISTments. All of these environments generate rich data streams that have been used in EDM analyses. And this kind of software is becoming more widespread every day. Systems like the Cognitive Tutor, or ASSISTments, or Reasoning Mind, are used by tens or hundreds of thousands of students, one or two days a week.

19 Student Log Data *000:22:297 READY . *000:25:875 APPLY-ACTION WINDOW; LISP-TRANSLATOR::AUTHORINGTOOL-TRANSLATOR, CONTEXT; 3FACTOR-CROSS-XPL-4, SELECTIONS; (GROUP3_CLASS_UNDER_XPL), ACTION; UPDATECOMBOBOX, INPUT; "Two crossover events are very rare.", *000:25:890 GOOD-PATH *000:25:890 HISTORY P-1; (COMBOBOX-XPL-TRACE SIMBIOSYS), *000:25:890 READY *000:29:281 APPLY-ACTION SELECTIONS; (GROUP4_CLASS_UNDER_XPL), INPUT; "The largest group is parental since crossovers are uncommon.", *000:29:281 GOOD-PATH *000:29:281 HISTORY *000:29:281 READY *001:20:733 APPLY-ACTION SELECTIONS; (ORDER_GENES_OBS_XPL), INPUT; "The Q and q alleles have interchanged between the parental and SCO genotypes.", *001:20:733 SWITCHED-TO-EDITOR *001:20:748 NO-CONFLICT-SET *001:20:748 READY *001:32:498 APPLY-ACTION INPUT; "The Q and q alleles have interchanged between the parental and DCO genotypes.", *001:32:498 GOOD-PATH *001:32:498 HISTORY *001:32:498 READY *001:37:857 APPLY-ACTION SELECTIONS; (ORDER_GENES_UNDER_XPL), INPUT; "In the DCO group BOTH outer genes cross over so the interchanged gene is the middle one.", *001:37:857 GOOD-PATH

For example, as a student uses one of these interactive learning environments, the student will make hundreds of meaningful actions each hour – pausing and thinking before making an incorrect answer, asking for help, rapidly changing settings on a simulation, running away from a skeleton. When the data is logged, these behaviors provide us with incredibly rich detail about learning and engagement that we can analyze.
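As a minimal sketch of how log data like this becomes analyzable, the snippet below parses a simplified, hypothetical log format (elapsed timestamps plus event types, loosely inspired by the excerpt above, not the actual tutor schema) and distills one typical low-level feature: the pause between consecutive student actions.

```python
# Hypothetical, simplified log format: each line has an elapsed
# timestamp (MMM:SS:mmm) followed by an event type.
LOG = """\
000:22:297 READY
000:25:875 APPLY-ACTION
000:25:890 GOOD-PATH
001:20:733 APPLY-ACTION
001:32:498 APPLY-ACTION
"""

def to_ms(stamp):
    """Convert an 'MMM:SS:mmm' elapsed-time stamp to milliseconds."""
    m, s, ms = (int(part) for part in stamp.split(":"))
    return (m * 60 + s) * 1000 + ms

def action_latencies(log):
    """Seconds between consecutive APPLY-ACTION events -- the kind of
    feature (e.g. a long pause before an answer) that detectors use."""
    times = [to_ms(t) for t, event in
             (line.split(maxsplit=1) for line in log.splitlines())
             if event.startswith("APPLY-ACTION")]
    return [(b - a) / 1000 for a, b in zip(times, times[1:])]

print(action_latencies(LOG))  # pauses, in seconds, between actions
```

Real log schemas vary by system; the point is only that timestamped events can be mechanically distilled into per-action features.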

20 PSLC DataShop/LearnSphere (Koedinger et al, 2008, 2010)
>250,000 hours of students using educational software within LearnLabs and other settings >30 million student actions, responses & annotations

21 This amount of data is supporting a revolution in the science of learning
Whereas beforehand, education research had to be conducted using small amounts of data (single studies in schools, small numbers of researchers), now we have large amounts of data, enabling us to use the methods of data mining to find unexpected patterns

22 To quote the Society for Learning Analytics Research, learning analytics is “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs.” EDM and learning analytics methods have some similarities with traditional data mining methods, but as in the other areas where data mining methods have become common (bioinformatics, medical informatics, business analytics, data analysis methods in physics, and so on), the unique features of the domain of education lead to the development of unique methods.

23 Goals Joint goal of exploring the “big data” now available on learners and learning, to promote:
New scientific discoveries and advances in the science of learning
Better assessment of learners along multiple dimensions (social, cognitive, emotional, meta-cognitive, etc.; individual, group, institutional, etc.)
Better real-time support for learners

24 The explosion in data is supporting a revolution in the science of learning
Large-scale studies have always been possible… But it was hard to be large-scale and fine-grained And it was expensive

25 Types of EDM/LA Method (Baker & Siemens, 2014; building off of Baker & Yacef, 2009)
Prediction: classification, regression, latent knowledge estimation
Structure discovery: clustering, factor analysis, domain structure discovery, network analysis
Relationship mining: association rule mining, correlation mining, sequential pattern mining, causal data mining
Distillation of data for human judgment
Discovery with models

26 Many applications
Dropout/success prediction
Automated intervention for better individualization
Better reporting for teachers
Basic discovery in education

27 I’ll give examples in the context of…
Automated intervention for better individualization

28 Individualization Folks have been talking about individualizing education for a long time (Rousseau, 1762; Parkhurst, 1922)

29 We’re starting to get there…

30 With successful examples like
Mastery learning in Cognitive Tutors Support for student effort in Wayang Outpost Affect-sensitivity in AutoTutor

31 Individualization requires
Determining something about the student Knowing what matters Doing the right thing about it

32 Determining something about the student
Knowing what matters Doing the right thing about it

33 Stuff We Can Infer: Complex Cognitive Skill
Programming (Corbett & Anderson, 1995) Physics (Martin & VanLehn, 1995) Mathematics (Feng et al., 1999) Databases (Mitrovic et al., 2001) Science Inquiry Skill (Sao Pedro et al., 2013; Baker & Clarke-Midura, 2013)

34 Stuff We Can Infer: Deep Learning
Retention (Jastrzembski et al., 2006; Pavlik et al., 2008; Wang & Beck, 2012) Transfer/Shallow Learning (Baker et al., 2011, 2012) Preparation for Future Learning (Baker et al., 2011; Hershkovitz et al., 2013)

35 Stuff We Can Infer: Meta-Cognition
Self-Efficacy/Uncertainty/Confidence (Litman et al., 2006; McQuiggan, Mott, & Lester, 2008; Arroyo et al., 2009) Unscaffolded Self-Explanation (Shih et al., 2008; Baker, Gowda, & Corbett, 2011) Help Avoidance (Aleven et al., 2004, 2006) Help-Seeking Strategies (Aleven et al., 2004; Harris, Bonnett, Luckin, Yuill, & Avramides, 2009) Conscientiousness and Persistence (Ventura et al., 2012)

36 Stuff We Can Infer: Disengaged Behaviors
Gaming the System (Baker et al., 2004, 2008, 2010; Walonoski & Heffernan, 2006; Beal, Qu, & Lee, 2007) Off-Task Behavior (Baker, 2007; Cetintas et al., 2010) Inexplicable “WTF” Behavior (Rowe et al., 2009; Wixon et al., 2012) Carelessness (San Pedro et al., 2011; Hershkovitz et al., 2011)

37 Stuff We Can Infer: Teacher Strategic Behaviors
Curriculum Planning Behaviors (Maull et al., 2010) Teacher Interventions for Students (Miller et al., 2015)

38 Stuff We Can Infer: Affect
Boredom Frustration Confusion Engaged Concentration/Flow Curiosity Excitement Situational Interest Joy/Delight (D’Mello et al., 2008; Mavrikis, 2008; Arroyo et al., 2009; Conati & Maclaren, 2009; Lee et al., 2011; Sabourin et al., 2011; Baker et al., 2012, 2014; Paquette et al., 2014; Pardos et al., 2014; Kai et al., in press)

39 Sensor-free detection possible
Recent systems have been able to infer these constructs solely from student interaction with the learning system

40

41 Example Automated detectors of student engagement and affect in ASSISTments (Pardos et al., 2013; Ocumpaugh et al., 2014)

42 ASSISTments Web-based mathematics tutor
Primarily for middle school math. Gives students mathematics questions. Offers multi-step hints to struggling students. If a student makes an error, they are given scaffolding that breaks the original question down into sub-steps.

43 Over 50,000 kids in

44 Efficacy Leads to better learning than traditional homework (Mendicino et al., 2009; Singh et al., 2011) Leads to better learning than traditional classroom practice (Koedinger, McLaughlin, & Heffernan, 2011)

45 Process Obtain human judgments on student engagement and affect
Leverage these human judgments to develop models using data mining That can replicate the judgments solely from interactions between the student and the software

46 The Goal

47-49 Measures That are:
Automated: able to make assessments about students in real-time, with no human in the loop
Fine-grained: able to make assessments about students second-by-second
Validated: demonstrated to apply to new students and new contexts

50 Many options for getting human judgments of engagement and affect
(Porayska-Pomsta, Mavrikis, et al., 2011)

51 We use Expert Field Observations of Student Engagement and Affect
Using BROMP observation protocol (Ocumpaugh et al., 2012, 2015) Synchronized to log files with Android app HART (Ocumpaugh et al., in press)

52 BROMP protocol Protocol designed to reduce disruption to students
Some features of the protocol: observe with peripheral vision or side glances; hover over a student who is not being observed; 20-second “round-robin” observations of several students; bored-looking people are boring
Inter-rater reliability around 0.8 for behavior, 0.65 for affect
Over 150 coders now trained in the USA, the Philippines, and India

53-56 BROMP Used in contexts from:
Science education classes in India
Kindergarten activities in the USA
Online ecology learning in the Philippines, using a system developed by Luckin and colleagues
ASSISTments middle school classrooms in the USA

57-60 Use data mining to find behaviors that co-occur with human observations
Distill features of interaction hypothesized to correlate with the desired construct
Better to use theoretical understanding and automated discovery together than to just throw spaghetti at the wall and see what sticks (Sao Pedro et al., 2012; Paquette et al., 2014)

61 Use data mining to find behaviors that co-occur with human observations
Try a small set of data mining/prediction/classification algorithms that fit different kinds of patterns:
Decision trees
Decision rules
Step regression
Naïve Bayes
K*
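As a sketch of this step, one might loop over a handful of varied classifiers and compare cross-validated AUC. The data below is synthetic, and the model set is an approximation: K* has no scikit-learn implementation, so k-nearest-neighbors is a rough stand-in, and logistic regression stands in for step regression.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for distilled interaction features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# A small set of algorithms that fit different kinds of patterns.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(),
    "k-NN (K* stand-in)": KNeighborsClassifier(),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:22s} mean AUC = {auc:.3f}")
```

In practice each algorithm family captures different pattern shapes (axis-aligned splits, probabilistic independence, linear boundaries, local similarity), which is why a small varied set is tried rather than a single method.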

62 Test model generalizability on new students and new populations
In this case, students in rural, urban, and suburban schools in Northeastern USA Diverse in terms of SES, race, ethnicity

63 Detector Creation Data Set
505 students, 6 schools, 3621 observations
Diverse population: urban, suburban, and rural; from very low per-capita income to average

64 Model Assessment Models assessed using A’
The model’s ability to distinguish when an affective state is present (e.g., is the student bored or not). Chance = 0.5, perfect = 1.0; first-level medical diagnostics > 0.8. Similar to the area under the ROC curve (AUC ROC).
A more principled method for choosing cut-offs and boundaries between labels: it integrates all the data to assess what the cut-offs should be, rather than needing to choose ad-hoc cut-offs. Example: if faster changes are associated with haphazard inquiry, how fast does an action have to be? 5 seconds? 10 seconds? 40 seconds?
At a high level, this approach leverages machine learning to “discover” what it means to design controlled experiments and test stated hypotheses in our learning environment. Unlike knowledge engineering, in which rules to describe behaviors are authored by a human (cf. Koedinger & MacLaren, 2002), this machine-learning approach attempts to derive rules based, in part, on student data. More specifically, we employed “text replay tagging” of log files (Sao Pedro et al., 2010; Montalvo et al., 2010; Sao Pedro et al., in press), an extension of the text replay approach developed in Baker, Corbett, and Wagner (2006), to build and validate behavior detectors. Text replay tagging, a form of protocol analysis (Ericsson & Simon, 1980, 1984), leverages human judgment to identify whether students’ log files demonstrate inquiry skill. Regardless of exploration during inquiry, the resulting detectors can distinguish whether or not a student knows the skill, rather than prescribing rules as in knowledge engineering.
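A' can be computed directly from its definition: the probability that a randomly chosen positive example receives a higher detector confidence than a randomly chosen negative one, counting ties as half. A small self-contained sketch, with made-up confidences:

```python
def a_prime(labels, confidences):
    """A': probability the detector assigns a higher confidence to a
    randomly chosen positive example than to a randomly chosen negative
    one (ties count half). Equivalent to the area under the ROC curve."""
    pos = [c for y, c in zip(labels, confidences) if y == 1]
    neg = [c for y, c in zip(labels, confidences) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: detector confidences for 6 observed clips.
labels      = [1, 1, 1, 0, 0, 0]
confidences = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(a_prime(labels, confidences))  # 0.888...: 0.5 = chance, 1.0 = perfect
```

Because A' depends only on the ranking of confidences, no ad-hoc threshold has to be chosen before evaluating the detector.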

65 Model Assessment Models assessed using Cohen’s Kappa
The degree to which the model is better than the base rate. Base rate = 0, perfect = 1.0.
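Cohen's Kappa can be computed from the observed agreement and the agreement expected from the base rates alone. A minimal sketch with toy labels:

```python
def cohens_kappa(labels, predictions):
    """Cohen's Kappa: agreement beyond what the base rates alone would
    produce by chance. 0 = chance-level, 1 = perfect."""
    n = len(labels)
    observed = sum(y == p for y, p in zip(labels, predictions)) / n
    classes = set(labels) | set(predictions)
    # Expected agreement if labels and predictions were independent.
    expected = sum((labels.count(c) / n) * (predictions.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)

labels      = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
print(cohens_kappa(labels, predictions))  # 0.5: halfway from chance to perfect
```

Here the observed agreement is 6/8 = 0.75 and the chance agreement is 0.5, so the detector is 50% better than chance.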

66 Cross-Validation for Generalizability
Validating detector generalizability Student Level Validation: Train on data from (N-1) groups of students, test on Nth group
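Student-level validation maps naturally onto grouped cross-validation: every clip from a given student stays entirely on one side of the train/test split. A sketch using scikit-learn's GroupKFold on synthetic data (the features, labels, and student IDs here are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))          # synthetic interaction features
y = (X[:, 0] > 0).astype(int)          # synthetic labels
students = np.repeat(np.arange(12), 10)  # 12 students, 10 clips each

# GroupKFold keeps each student's clips entirely in train OR test,
# so the score estimates generalization to *new students*.
cv = GroupKFold(n_splits=6)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, groups=students, cv=cv)
print(scores.mean())
```

Content-level or population-level validation follows the same pattern; only the `groups` array changes (content IDs or population IDs instead of student IDs).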

67 Sometimes also… Content Level Validation:
Validating detector generalizability Content Level Validation: Train on data from (N-1) groups of content, test on Nth group

68 Sometimes also… Population Level Validation:
Validating detector generalizability. Population-Level Validation: train on data from (N-1) populations, test on the Nth population

69 Model Goodness (Pardos et al., 2013)
Construct                Algorithm     A'     Kappa
Boredom                  JRip          0.632  0.229
Frustration              Naïve Bayes   0.681  0.301
Engaged Concentration    K*            0.678  0.358
Confusion                J48           0.736  0.274
Off-Task                 REPTree       0.819  0.506
Gaming                   K*            0.802  0.370

The affect detectors’ predictive performance was evaluated using A' [28] and Cohen’s Kappa [18]. An A' value (approximately the same as the area under the ROC curve [28]) of 0.5 indicates chance-level performance at determining the presence or absence of an affective state in a clip, and 1.0 perfect performance. Cohen’s Kappa assesses the degree to which the model is better than chance at identifying the affective state in a clip: a Kappa of 0 indicates chance-level performance, while a Kappa of 1 indicates perfect performance. A Kappa of 0.45 is equivalent to a detector that is 45% better than chance at identifying affect. As discussed in [37], all of the affect and behavior detectors performed better than chance. Detector goodness was somewhat lower than had previously been seen for Cognitive Tutor Algebra [cf. 6], but better than had been seen in other published models inferring student affect in an intelligent tutoring system solely from log files (where average Kappa ranged from below zero to 0.19 when fully stringent validation was used) [19, 22, 44]. The best detector of engaged concentration involved the K* algorithm, achieving an A' of 0.678 and a Kappa of 0.358. The best boredom detector was found using the JRip algorithm, achieving an A' of 0.632 and a Kappa of 0.229. The best confusion detector used the J48 algorithm, with an A' of 0.736 and a Kappa of 0.274. The best detector of off-task behavior was found using the REPTree algorithm, with an A' of 0.819 and a Kappa of 0.506. The best gaming detector involved the K* algorithm, with an A' of 0.802 and a Kappa of 0.370. These levels of detector goodness indicate models that are clearly informative, though there is still considerable room for improvement.
The detectors emerging from the data mining process had some systematic error in prediction due to the use of re-sampling in the training sets (models were validated on the original, non-resampled data): the average confidence of the resultant models was systematically higher or lower than the proportion of the affective states in the original data set. This type of bias does not affect correlation to other variables, since the relative order of predictions is unaffected, but it can reduce model interpretability. To increase interpretability, model confidences were rescaled to have the same mean as the original distribution, using linear interpolation. Rescaling the confidences this way does not impact model goodness, as it does not change the relative ordering of model assessments.

Once the detectors of student affect and behavior were developed, they were applied to the broader data set: 2,107,108 actions in 494,150 problems completed by 3,747 students in three school districts. The result was a sequence of predictions of student affect and behavior across the history of each student’s use of the ASSISTments system.
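The rescaling step can be sketched as a simple order-preserving linear transformation. This is illustrative only; the exact interpolation used in the published work may differ.

```python
def rescale_to_mean(confidences, target_mean):
    """Order-preserving linear rescale so the mean confidence matches
    the base rate observed in the original (non-resampled) data.
    One simple scheme (pure scaling, clipped at 1.0); the published
    work's exact interpolation may differ."""
    current = sum(confidences) / len(confidences)
    scale = target_mean / current
    return [min(1.0, c * scale) for c in confidences]

# A detector trained on resampled data over-predicts boredom:
conf = [0.6, 0.4, 0.8, 0.2]             # mean confidence 0.5
rescaled = rescale_to_mean(conf, 0.25)  # true base rate is 25%
print(rescaled)
```

Because every confidence is multiplied by the same positive factor, the relative ordering (and hence rank-based metrics like A' and correlations with other variables) is unchanged; only the calibration of the mean shifts.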

70 Technical Detail (Ocumpaugh et al., 2014)
Models trained only on students from a single population (urban, suburban, or rural) work well on that population, but are inappropriate for different populations, where they perform just barely better than chance. Models trained on students from all three populations work just as well as single-population models for urban and suburban students, but still don’t work very well for rural students.

71 Result Models can make inference in real-time (20 second delay)
Models can be applied at scale to retrospective log files

72 Determining something about the student
Knowing what matters Doing the right thing about it

73-74 Off-Task Behavior We can detect off-task behavior (cf. Baker, 2007; Cetintas et al., 2009; Pardos et al., 2013)
Off-task behavior is continually a major focus of classroom management practice

75 But… Off-Task Behavior:
Is more weakly correlated with learning and other outcomes than many other constructs
Can foster positive collaborative relationships (cf. Goldman, 1996; Barron, 2003; Kreijns, 2008), e.g. as a collaboration strategy
Can disrupt boredom (Baker et al., 2011), e.g. as an emotional regulation strategy

76 So… Reduction of off-task behavior should probably be less of a focus of classroom management practice than it currently is, though carefully leveraging and managing it may be beneficial and useful…

77 By contrast Gaming the system and boredom are associated with substantial differences in learning outcomes (Baker et al., 2004; Craig et al., 2004; Cocea et al., 2009; Rodrigo et al., 2009; Pardos et al., 2014). The correlation between gaming the system and learning is similar in magnitude to the correlation between cigarette smoking and lifespan.

78 Predicting the Future Can we go beyond inferring short-term outcomes…
To predicting the future? Not a trivial task…

79 Predicting the Future “Prediction is very difficult, especially about the future.” – Niels Bohr

80 Which matters more? The Present vs. The Future
Does the student know the current skill? vs. Will the student remember that skill next week?
Has the student learned the current concept? vs. Will the student learn the next concept?
Is the student doing well in the course? vs. Will the student pass the course?
Does the student have high (and accurate) self-efficacy? vs. Will the student maintain their self-efficacy when the challenges grow?
Is the student engaged right now? vs. Will the student be engaged enough to persist with difficult material?
Is the student interested in the material? vs. Will the student choose to take the next course?

81 Some early work in this area
Predicting student retention of knowledge (Jastrzembski et al., 2006; Pavlik & Anderson, 2008; Wang & Heffernan, 2012) Predicting future student transfer of knowledge and preparation for future learning (Baker, Gowda, & Corbett, 2011a, 2011b; Hershkovitz et al., 2013) Predicting final course grade early in the course (Superby et al., 2006; Arnold, 2010) Predicting future college dropout (Dekker et al., 2009) Predicting future participation in communities of practice (Wang et al., 2015)

82 Example (San Pedro, Baker, Bowers, & Heffernan, 2013)
Taking automated detectors of engagement, affect, and learning in ASSISTments
Applied to several years of entire-year student data: 3,747 students, 2,107,108 actions within the software

83 Predict College Attendance (San Pedro et al., 2013)
Student knowledge, engaged concentration, carelessness associated with going to college Gaming the system, boredom, confusion associated with not going to college Overall model A’ = 0.69

84 Note Carelessness is positively associated with college attendance until you control for student knowledge; then it is associated with not going to college. Carelessness is the disengaged behavior of generally successful students (cf. Clements, 1982).

85 Predict Selective College Attendance (San Pedro et al., 2013)
Student knowledge, engaged concentration, carelessness associated with going to selective college Gaming the system, boredom associated with not going to selective college Overall model A’ = 0.76

86 Predict STEM Major in college (San Pedro et al., 2014)
Student knowledge, carelessness associated with STEM major Gaming the system associated with non-STEM major (D= 0.573) Overall model A’ = 0.68

87 Future Work Go further forward still – post-college choice of job and success in career

88 Determining something about the student
Knowing what matters Doing the right thing about it

89 How do we use this information?
Inform and Empower School Personnel Automated Intervention Determine Least Engaging/Most Engaging Content

90 Reports to Guidance Counselors (Ocumpaugh et al., in preparation)

91 Reports to Regional Coordinators (Mulqueeny et al., in preparation)
Another online curriculum we work with, Reasoning Mind, deploys reports on student engagement to regional coordinators Allowing them to target teachers for additional support and professional development

92 Automated Intervention (Baker et al., 2006)

93 Curricular Refinement (Baker et al., 2009)

94 Other recent work…
Large-scale use of BROMP field observations in Chennai, India
Modeling science inquiry skill in online learning
Studying student negativity in MOOCs
Unified assessment across intelligent tutors and games
Predicting failure in undergraduate courses from interactions with an e-textbook before the course starts

95 Learn More Big Data and Education
Grad school first semester course level Details on broad range of methods and experience conducting them Next run on EdX starting June 29, 2015

96 Learn More Data, Analytics, and Learning Introductory level
Discussion of principles, some experience with Tableau, Social Network Analysis, Text Mining Next run on EdX starting Fall 2015

97 Learn More Masters in Learning Analytics, Teachers College Columbia University

98 Thank You
twitter.com/BakerEDMLab Baker EDM Lab
All lab publications available online – Google “Ryan Baker”

