
1 Learning Analytics at Marist College: From a single node prototype to a cluster computing platform. Eitel Lauría (*), Peggy Kuck (**), Edward Presutti (**), Sandeep Jayaprakash (**), Matthew Sokoloff (**). (*) School of Computer Science & Mathematics, (**) Enterprise Solutions for Data Science & Analytics, Marist College. Enterprise Computing Conference (ECC 2016), June 12-14, 2016

2 Alarming Stats
– <40% 4-year completion rate across all four-year institutions in the US (21% for Black students, 25% for Hispanic students)
– <60% 6-year completion rate for four-year institutions (40% for Black students, 49% for Hispanic students)
– <45% of 25-to-34 year-olds with an Associate Degree or higher (US ranked 14th among 36 developed nations)
Sources: U.S. Dept. of Education, Postsecondary Education Data System (2009); CollegeBoard, Advocacy & Policy Center, The Completion Agenda 2011/2012 Progress Report
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

3 Open Academic Analytics Initiative @ Marist
EDUCAUSE Next Generation Learning Challenges (NGLC) grant, funded by the Bill and Melinda Gates Foundation.
Create an "early alert" framework:
– Use machine learning on large datasets to predict academically at-risk students in the initial weeks of a course
– Deploy an intervention to improve their chances of success
Based on an open ecosystem for academic analytics:
– Sakai Collaboration and Learning Environment
– Weka + Kettle (Pentaho BI)
– Collaboration with IBM (SPSS Modeler for prototyping)
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

4 Learning Analytics Processor @ Marist: Early Alert. How does it actually work? (binary classification problem)
Hardware platform: IBM zEnterprise 114 with BladeCenter Extension (zBX); virtualized servers (64-bit, 16/32 GB RAM) running Red Hat Linux. Single node architecture with relational storage.
Input data:
– Student Academic Data: SATs, GPA, HS ranking, course size, course grade (target feature)
– Student Demographic Data: age, gender, ethnicity, income level
– LMS Event Log Data: sessions, resources, lessons, assignments, forums, tests
– LMS Gradebook Data: partial contributions to final grade
Pipeline: Extraction, Transformation & Loading → Predictive Model Building (classifiers learnt from data: Logistic Regression, SVMs, Naïve Bayes, J48 Decision Trees) → Scoring (predictions on new student data, early in the semester, using a library of persisted learnt classifiers) → Prediction of at-risk students → Intervention.
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

5 Promising Outcomes
Phase I: Single Node (OAAI)                      Accuracy  Recall  FP Rate
Marist (3 semesters, 25K records each)             86%      87%     13%
Pilots (first-year courses only, 2 semesters, avg; results followed the quality of input data):
– Savannah State                                   71%      76%     25%
– College of the Redwoods                          80%      81%     16%
– Cerritos CC                                      70%      72%     25%
– NCAT                                             68%      50%     30%
The project was well received:
– Computerworld Honors Laureate and Finalist in the Emerging Technology category (over 700 nominations)
– Campus Technology Innovator Award (one of only 9 recipients, over 230 applied)
The OAAI became a de facto open source reference architecture for learning analytics.
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

6 Learning Analytics Processor @ Marist 2.0: Early Alert. Cluster Computing Architecture
Hardware platform (Dev): Linux VMs (32GB RAM) running on an IBM PureFlex System.
Distributed storage (HDFS) & processing, with job scheduling: Extraction, Transformation & Loading → Predictive Model Building (classifiers learnt from data: Logistic Regression, Random Forests, Naïve Bayes) → Scoring (predictions on new student data, early in the semester, using a library of persisted learnt classifiers) → Prediction of at-risk students → Intervention.
Input data (CURRENT): Student Academic Data, Student Demographic Data, LMS Event Log Data, LMS Gradebook Data. (FUTURE): Library Data, Student Engagement Data, Social Network Data, and more.
Scales well for Big Data use cases (more volume & variety).
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

7 ETL Logical Flow
LMS (Moodle/Sakai) → Pre-ETL → MySQL flattened tables → store in HDFS → files moved into Hive (externally linked files were also considered) → HQL: multiple HQL files execute in order (some in parallel) → final output: a table with predictors for consumption by the training, testing and scoring flows.
The ETL process expects 5 source tables:
– Activity, Course, Enrollment, Grade, Personal
– LMS Extract: initial extracts from the LMS were stored in a MySQL database
– Exposed views: views were provided to "hide" database complexity and ease the import operation: vw_activity, course_vw, enrollment_vw, grade_vw, personal_vw
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
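As an illustration of the HQL portion of this flow, here is a minimal PySpark sketch that runs Hive-style queries through a HiveContext. All table and column names (activity, enrollment, grade, personal, tmp_activity_counts, final_output, student_id, and so on) are hypothetical placeholders; the actual ETL executed multiple HQL files in order, orchestrated by Oozie.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="lap-etl-sketch")
hc = HiveContext(sc)

# Stage per-student activity counts (one of several intermediate steps).
hc.sql("""
    CREATE TABLE IF NOT EXISTS tmp_activity_counts AS
    SELECT student_id, course_id, COUNT(*) AS event_count
    FROM activity
    GROUP BY student_id, course_id
""")

# Final output: one row per enrollment with the predictors used downstream
# by the training, testing and scoring flows.
hc.sql("""
    CREATE TABLE IF NOT EXISTS final_output AS
    SELECT e.student_id,
           e.course_id,
           p.gender,
           p.age,
           g.partial_grade,
           COALESCE(a.event_count, 0) AS event_count
    FROM enrollment e
    JOIN personal p ON p.student_id = e.student_id
    LEFT JOIN grade g ON g.student_id = e.student_id AND g.course_id = e.course_id
    LEFT JOIN tmp_activity_counts a
           ON a.student_id = e.student_id AND a.course_id = e.course_id
""")
```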

8 Oozie Flow
Clear transient directory → fork: load Course, Activity and Enrollment into HDFS → fork: load Grade and Personal into HDFS → create DB and lookup table → copy files into all tables → create final output table → populate final output table → change transient permissions.
* The five Sqoop load steps could run in parallel; cluster resources dictate how many simultaneous actions are possible.
** Configuration issues required separating the Sqoop load from the Hive store step.
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

9 ETL Lessons Learned
– HQL is not SQL (you are in a file system, not a database)
– Implementation techniques like the creation of views, and in particular cascading views, caused significant performance issues when processing data
– Transient data was handled with temporary tables for cascading data, which also allowed for debugging ETL stages
– Inbound data from MySQL was sourced from views, so keys and indexes were unavailable for Sqoop optimization
– Linux tmp space for logs and transient files was undersized and caused cluster failures that presented as "waiting on resources" heartbeats on YARN jobs
Data volumes (NCSU, one semester): distinct students ~32k, course count ~7k, enrollment ~157k, events ~30m
Data sizes (NCSU, one semester): Activity ~4 GB, Course ~506 KB, Enrollment ~14 MB, Grade ~216 MB, Personal ~4 MB
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
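A minimal sketch of the temporary-table lesson above, again with hypothetical table and column names: instead of cascading views that are re-derived on every query, each stage is materialized once so that later stages (and debugging) read stored results.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="lap-staging-sketch")
hc = HiveContext(sc)

# Slow pattern (what caused trouble): cascading views re-derive every stage
# beneath them each time the top view is queried.
# hc.sql("CREATE VIEW stage1_vw AS SELECT ... FROM activity ...")
# hc.sql("CREATE VIEW stage2_vw AS SELECT ... FROM stage1_vw ...")

# Faster pattern: materialize each stage once as a staging table, so later
# stages read the stored result and each stage can be inspected for debugging.
hc.sql("CREATE TABLE IF NOT EXISTS stage1 AS "
       "SELECT student_id, course_id, event_type FROM activity")
hc.sql("CREATE TABLE IF NOT EXISTS stage2 AS "
       "SELECT student_id, course_id, COUNT(*) AS events "
       "FROM stage1 GROUP BY student_id, course_id")
```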

10 Predictive Modeling (overview)
Software:
– PySpark (Spark + Python API)
– Spark ML
– Spark MLlib
Machine learning algorithms:
– Logistic Regression with LBFGS
– Decision Tree Ensembles (Random Forest)
– Naïve Bayes
Architecture: 2 flows
– Training/Testing
– Prediction (scoring)
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
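A minimal sketch of the prediction (scoring) flow, assuming a classifier was previously persisted with MLlib; the HDFS path, feature layout and label coding are illustrative assumptions, not project specifics.

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionModel

sc = SparkContext(appName="lap-scoring-sketch")

# Load the classifier persisted by the training flow (hypothetical HDFS path).
model = LogisticRegressionModel.load(sc, "hdfs:///lap/models/logreg_week4")

# Feature vectors for new students, normally built from the ETL output table;
# faked here with two rows for illustration.
new_students = sc.parallelize([
    [0.0, 12.0, 3.2],   # e.g. gender index, LMS sessions, partial grade
    [1.0, 2.0, 1.1],
])

# 1 = predicted at-risk, 0 = not at-risk (label coding assumed).
predictions = model.predict(new_students)
print(predictions.collect())
```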

11 Predictive Modeling (process)
Steps:
– Import from HDFS
– Impute missing values: categorical variables are assigned the mode, continuous variables the mean
– Stratified sampling (training flow only): even distribution of positive and negative outcomes
– ML pipeline (transformations, detailed on the next slide)
– Load saved MLlib model and make predictions (prediction flow only)
– Save output to HDFS
The training flow produces a model as output, with predictions on data with known results for validation purposes. The testing flow produces predictions and labels as output; this output must be used to confirm predictive performance: Recall (TP rate), FP rate, Accuracy (not a good metric for unbalanced datasets).
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
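A minimal sketch of the imputation and stratified-sampling steps listed above, using a toy DataFrame; the column names and sampling fractions are illustrative assumptions.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, functions as F

sc = SparkContext(appName="lap-prep-sketch")
sqlc = SQLContext(sc)

# Toy stand-in for the table of predictors imported from HDFS.
df = sqlc.createDataFrame(
    [("White", 3.2, 0.0), (None, None, 1.0),
     ("Hispanic", 2.1, 1.0), ("White", 3.8, 0.0)],
    ["ethnicity", "gpa", "label"])

# Categorical variables: impute with the mode (most frequent value).
mode_ethnicity = (df.groupBy("ethnicity").count()
                    .orderBy(F.desc("count"))
                    .first()["ethnicity"])

# Continuous variables: impute with the mean.
mean_gpa = df.agg(F.avg("gpa")).first()[0]

df = df.na.fill({"ethnicity": mode_ethnicity, "gpa": mean_gpa})

# Stratified sampling (training flow only): even out positive and negative
# outcomes by down-sampling the majority class. Fractions are illustrative.
fractions = {0.0: 0.3, 1.0: 1.0}   # keep all at-risk rows, 30% of the rest
balanced = df.sampleBy("label", fractions=fractions, seed=42)
balanced.show()
```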

12 Predictive Modeling (ML pipeline)
Takes the preprocessed data and applies transformations and machine learning algorithms.
Our process:
– Apply the StringIndexer to categorical data: the numerical value assigned to a categorical data point is a function of its relative frequency; the most frequent category is assigned a value of zero.
– Indicate the target variable and predictors using the VectorAssembler.
– Convert to MLlib and apply the algorithm of choice (detailed on the next slide). Logistic regression was chosen for production since it had the best results.
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
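A minimal sketch of this transformation pipeline using Spark ML's StringIndexer and VectorAssembler; the toy data and column names are assumptions.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

sc = SparkContext(appName="lap-pipeline-sketch")
sqlc = SQLContext(sc)

# Toy stand-in for the imputed, sampled training data.
df = sqlc.createDataFrame(
    [("F", 12.0, 3.2, 0.0), ("M", 2.0, 1.1, 1.0), ("F", 7.0, 2.5, 0.0)],
    ["gender", "lms_sessions", "partial_grade", "label"])

# StringIndexer: the most frequent category gets index 0, the next 1, etc.
gender_idx = StringIndexer(inputCol="gender", outputCol="gender_idx")

# VectorAssembler: collect the predictors into a single feature vector column.
assembler = VectorAssembler(
    inputCols=["gender_idx", "lms_sessions", "partial_grade"],
    outputCol="features")

pipeline = Pipeline(stages=[gender_idx, assembler])
prepared = pipeline.fit(df).transform(df)
prepared.select("label", "features").show()
```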

13 Predictive Modeling (model persistence)
Spark ML does not support model persistence in Python (the library was less than a month old when we began using it). The solution was to use Spark MLlib, which supports saving models: we converted the output of the ML pipeline into an RDD, a datatype that MLlib is compatible with.
Steps:
– Convert the pipeline output into a Pandas DataFrame
– Turn the Pandas DataFrame into a matrix
– Create a PySpark RDD from the matrix
– Construct LabeledPoint objects from the RDD
– Pass the LabeledPoints to the MLlib train function to generate a model
– At this point we were able to call the save function offered by MLlib to save the model to HDFS
June 12-14, 2016 Enterprise Computing Conference (ECC2016)
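A minimal sketch of this workaround, continuing from the pipeline sketch on the previous slide (it reuses the `sc` SparkContext and the `prepared` DataFrame with its label and features columns); the HDFS path is a hypothetical placeholder.

```python
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# 1. Pipeline output -> Pandas DataFrame -> plain numeric matrix.
pdf = prepared.select("label", "features").toPandas()
matrix = np.array([[row["label"]] + row["features"].toArray().tolist()
                   for _, row in pdf.iterrows()])

# 2. Matrix -> RDD -> LabeledPoint (label first, remaining entries are features).
points = sc.parallelize(matrix.tolist()).map(lambda r: LabeledPoint(r[0], r[1:]))

# 3. Train with MLlib, which supports saving models from Python, and persist to HDFS.
model = LogisticRegressionWithLBFGS.train(points, iterations=100)
model.save(sc, "hdfs:///lap/models/logreg_week4")   # hypothetical path
```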

14 Promising Outcomes (II)
Phase II: Cluster Computing                              Accuracy  Recall  FP Rate
Marist (3 semesters, 25K records each)                     86%      87%     14%
North Carolina State University (first set of results, May 2016):
– 3 semesters, 160K records each                           81%      77%     18%
– 3 semesters, online, 85K records each                    80%      82%     19%
NCSU is currently implementing the Marist LAP framework.
The Marist LAP was chosen as a key component of the UK's national analytics infrastructure provided by Jisc.
June 12-14, 2016 Enterprise Computing Conference (ECC2016)

15 Questions ? June 12-14, 2016 Enterprise Computing Conference (ECC2016)

16 Thank You !! June 12-14, 2016 Enterprise Computing Conference (ECC2016)

