Dr. Jigar Jadav, Pace University

An Educational Data Mining Study on Web Filter Logs of Secondary School Students
Dr. Jigar Jadav, Pace University
July 18, 2018

An Educational Data Mining Study on Web Filter Logs of Secondary School Students
Outline:
- Introduction
- Background
- Dataset
- Problem Statement
- Methodology
- Results
- Contributions
- Future Work

Introduction (Ch.1)
- Mobile devices are increasingly being used in K-12 education
- Educational technology is moving towards one-to-one mobile device or bring your own device (BYOD) initiatives
- Requires major investment in equipment and infrastructure
- Support staff required to maintain the equipment
- Professional development needed for teachers
- Children's Internet Protection Act (CIPA) and web filters
- Web filters log students' online activity
- Are school-issued mobile devices effectively used for learning purposes?

Background (Ch.2)
Data Analysis in Other Fields:
- Marketing
- Finance
- Sports
- Advertising
- Medicine
- Customer Service
Data Analysis in Education:
- Educational Data Mining (EDM) is a growing field
- Mostly qualitative (observational) data analysis performed in mobile learning
- Minimal research done on quantitative data using machine learning and Big Data algorithms in a K-12 setting

Dataset Attribute and Description

Dataset Example Web Queries

Data Analytics Lifecycle (Ch.5)
Reference: EMC Education Services, 2015

Methodology
- Raw data were extracted and anonymized from web filter logs by an authorized school administrator
- The Family Educational Rights and Privacy Act (FERPA)
- Painstakingly long process to acquire this data, requiring multiple approvals
- Data were sorted based on the attribute Rule Set:
  - Staff
  - Administrator
  - Teacher
  - Student

Results of Exploratory Analysis
Figures: Term Frequency Histogram; Word Cloud

Analysis of Student Web Queries (Ch.4)
Figures: Frequency of terms per query; % of unique queries to # of terms/query
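A minimal sketch of how the share of unique queries by number of terms could be computed. The function name and the sample queries are hypothetical, and uniqueness is assumed to mean equality after lower-casing and whitespace normalization; the study's own procedure is not described in this transcript.

```python
from collections import Counter


def terms_per_query_distribution(queries):
    """Map number of terms -> percentage of unique queries with that many terms."""
    # Normalize case and whitespace so trivially different spellings count once.
    unique = {" ".join(q.lower().split()) for q in queries if q.strip()}
    counts = Counter(len(q.split()) for q in unique)
    total = len(unique)
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Hypothetical sample queries, not taken from the dataset.
sample = [
    "ww1 propaganda",
    "WW1  propaganda",
    "woolly mammoth",
    "meme",
    "plastic surgery training",
]
print(terms_per_query_distribution(sample))
```

Here the two spellings of "ww1 propaganda" collapse into one unique query, so the percentages are taken over four queries.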

Results of Web Query Analysis

Problem Statement
Binary classification of student web queries as school-related or non-school related is difficult to perform:
- Web queries tend to be short
- They are often quite noisy
- No training data is available
Objective: By solving the aforementioned problem, this research aims to determine whether student web queries performed on school-issued mobile devices have an impact on student GPA

Published Work
- J. Jadav, C. Tappert, M. Kollmer, A. Burke, and P. Dhiman, “Using text analysis on web filter data to explore K-12 student learning behavior,” in UEMCON, IEEE Annual, 2016, pp. 1–5.
- J. Jadav, A. Burke, P. Dhiman, M. Kollmer, and C. Tappert, “Analysis of student web queries,” in Proceedings of the EDSIG Conference ISSN, 2016, p
- J. Jadav, A. Burke, P. Dhiman, M. Kollmer, and C. Tappert, “Classification of Student Web Queries,” in Proceedings of the CCWC, IEEE, 2016.
- J. Jadav, A. Burke, G. Goldberg, D. Lindelin, A. Preciado, C. Tappert, and M. Kollmer, “Correlation Discovery Between High School Student Web Queries and their Grade Point Average,” in Proceedings of the CCWC, IEEE, 2016.

Related Work (Ch.2)
- Classification of web queries is extensively studied in Information Retrieval
- Knowledge Discovery and Data Mining (KDD) Cup challenge in 2005
  - Classify 800,000 search queries into 67 categories
  - Teams were provided a sample of 111 manually classified web queries, each labeled with up to 5 categories
  - No training or testing data provided
  - 800 random queries from the 800,000 were used to evaluate submitted solutions (Li, 2005)

Student Web Query Classifier (SWQC) Architecture (Ch.5)

Student Web Query Classifier Algorithm (Implemented in Java)

Preliminary Results of Classification of Student Web Queries
Supervised learning algorithms accuracy:
- SVM: 1.29% at a threshold of 80%
- Naïve Bayes: 46%
- Lack of training data yielded low accuracy
Unsupervised algorithm accuracy:
- SWQC: 90.68%
- Correctly classified 1052 out of 1160 web queries
- Used to generate training data for SVM classifier
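The SWQC accuracy is the ratio of correctly classified queries to all queries, 1052 / 1160. The sketch below shows that calculation next to a toy corpus-overlap rule for binary classification. The rule, the two small term sets, and the function names are assumptions made purely for illustration; they are not the SWQC algorithm or its corpuses, whose details are not given in this transcript.

```python
# Toy illustration only: these term sets and this scoring rule are assumptions,
# not the SWQC corpuses or the SWQC algorithm.
SCHOOL_TERMS = {"ww1", "propaganda", "battle", "argonne", "wheat"}
NON_SCHOOL_TERMS = {"meme", "game", "movie", "song"}


def classify(query):
    """Label a query by which term set it overlaps more; ties go to non-school."""
    tokens = set(query.lower().split())
    school_hits = len(tokens & SCHOOL_TERMS)
    other_hits = len(tokens & NON_SCHOOL_TERMS)
    return "school" if school_hits > other_hits else "non-school"


def accuracy_percent(correct, total):
    """Percentage of queries that were classified correctly."""
    return 100.0 * correct / total


print(classify("WW1 propaganda posters"))
print(classify("funny cat meme"))
print(accuracy_percent(1052, 1160))
```

The ratio 1052 / 1160 is about 0.9069, which matches the reported 90.68% when truncated to two decimals.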

Preliminary Results of SWQC

Methodology
- Only student web queries were extracted
- Approximately 150,000 from 887 students
- 40,404 unique web queries remained after preprocessing

Correlation Discovery, K-means Clustering
Selected k = 5 after within-cluster sum of squares (WSS) analysis
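A minimal sketch of a WSS analysis: run k-means for several values of k and look for the point where the within-cluster sum of squares stops dropping sharply. The transcript does not say which features were clustered, so the example uses synthetic one-dimensional values; the function name and the deterministic initialization are assumptions chosen to keep the example reproducible.

```python
def kmeans_wss(values, k, max_iters=100):
    """Run Lloyd's k-means on 1-D values and return the within-cluster sum of squares."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: centers taken at evenly spaced positions of the sorted data.
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sum(min((v - c) ** 2 for c in centers) for v in ordered)


# Synthetic values with three obvious groups (not data from the study).
values = [0.8, 1.0, 1.2, 4.9, 5.0, 5.3, 8.7, 9.0, 9.2]
wss_by_k = {k: kmeans_wss(values, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 3))
```

On this toy data the WSS falls steeply up to k = 3 and only slightly afterwards, which is the "elbow" such an analysis looks for; on the study's data the corresponding elbow was judged to be at k = 5.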

Correlation Discovery, Regression with k-means clustering (T=70)

Preliminary Results of Correlation Discovery
Rejected H0: the percentage of school-related SQ originating from school-issued iPads does not affect student GPA
p-value: 4.1 × 10^-5, R-Square:
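A minimal sketch of a single-variable least-squares fit of GPA against the percentage of school-related queries, returning the slope, intercept, and R-square. The data points are synthetic and the function name is hypothetical; the p-value is not computed here because it needs the t distribution, which a statistics library would normally supply.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor model, R-square is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Synthetic example values (not data from the study).
pct_school_related = [20, 35, 50, 65, 80]
gpa = [2.1, 2.6, 2.9, 3.4, 3.7]
intercept, slope, r_squared = linear_fit(pct_school_related, gpa)
print(intercept, slope, r_squared)
```

A positive slope with a small p-value is what would lead to rejecting a null hypothesis of no effect.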

Recommendation from Proposal Defense
- Rerun the entire correlation analysis on a larger dataset
- Refactor the SWQC algorithm to work with the new Bing search API with JSON
- Incorporate the dataset attributes Domain and Time into the SWQC algorithm
- Automate data clean-up
- Create a heatmap of web queries

Analysis on Larger Dataset (Ch. 7)
- Data from the first quarter of the school year were collected
- Approximately 1,140,000 for final analysis from 917 students
- 984,427 student web queries after preprocessing
- 316,499 unique web queries remained after preprocessing
- SWQC code was refactored to work with Bing Search API 5.0

Analysis on Larger Dataset
- After data clean-up, Amazon Web Services (AWS) was used to store all the search queries in a database
- Levenshtein data clean-up was added to the SWQC architecture
- SWQC algorithm was modified to include the attributes Domain and Time

Data Clean-up
- Levenshtein algorithm was added to the SWQC architecture for data clean-up
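A minimal sketch of the Levenshtein (edit) distance and one way it could support clean-up: snapping a misspelled term to the closest known term. How SWQC actually applies the distance is not described in this transcript, so the `closest_term` helper, its vocabulary, and the distance threshold are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_term(term, vocabulary, max_distance=1):
    """Hypothetical helper: return the nearest vocabulary term within max_distance, else the term."""
    best = min(vocabulary, key=lambda v: levenshtein(term, v))
    return best if levenshtein(term, best) <= max_distance else term


print(levenshtein("kitten", "sitting"))
print(closest_term("propoganda", ["propaganda", "battle"]))
```

A small threshold keeps the correction conservative, so unrelated terms are left unchanged rather than forced onto a vocabulary entry.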

Revised Student Web Query Classifier Architecture
Levenshtein algorithm

Revised Student Web Query Classifier Algorithm (Ch.7)

Search Queries/Hour

Percent of School-Related SQ

Limitations of SWQC
- Static corpuses
  - School-related corpus is fixed
  - Non-school related corpus is fixed
  - Need for both school-related and non-school related corpuses to be updated automatically
- Employs Bing Search API 5.0
  - Enrichment of original search query is both time and monetarily expensive
- User intent still difficult to decipher
- Only performs binary classification

Future Work (Computer Science)
- Collect web queries from an entire school year
- Use SWQC to classify the web queries
- Create a large training dataset
- Retrain SVM from the large dataset
- Create a heatmap of percentage of web queries:
  - Find the local min and local max for each quarter
  - Find the global min and the global max for the entire year
- Perform a multivariable regression analysis
- Explore other data such as amount of time spent on iPads
- Based on student web activities, can students in academic peril be predicted with a certain probability?

Future Work (Education)
- Where in the curriculum and at what age should students be taught the best practices for searching the Internet?
- Can teachers be provided with a real-time breakdown of students' schoolwork-related searches to improve time-on-task?
- Can data analysis of student web activities help teachers and guidance counselors identify students who need Response to Intervention (RTI) sooner?

Contributions
- Conducted groundbreaking research in Educational Data Mining (EDM) of secondary school students using web filter logs
- Created corpuses of school-related and non-school related terms
- Proposed and implemented a new algorithm, Student Web Query Classifier (SWQC), for the binary classification of student web queries
- Applied SWQC to conclude that a correlation exists between the percentage of school-related web queries and GPA
- This research can now be used by stakeholders in making better decisions surrounding one-to-one initiatives

External References
- EMC Education Services. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons, 2015.
- Williams, Graham. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Springer Science & Business Media, 2011.
- Y. Li, Z. Zheng, and H. K. Dai, “KDD cup-2005 report: Facing a great challenge,” ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 91–99, 2005.
- Images: courtesy of Google images

Questions?

Exploratory Analysis (Ch.3)
STEPS (Williams, 2011):
- Loading data
  - Snapshot of a 2-hour window during school hours
  - No student identifiers
  - Only search queries (SQ) were extracted
  - Only unique SQ were loaded
- Preprocessing
  - Data cleanup
  - Stage the data
- Explore the data
  - Term frequency
  - Plot term frequencies
  - Relationships between terms
  - Word clouds

Preprocessing: cleanup (R code)
- Case folding: docs <- tm_map(docs, content_transformer(tolower))
- Stem terms: docs <- tm_map(docs, stemDocument, language="english")
- Remove stopwords: docs <- tm_map(docs, removeWords, stopwords("english"))
- Other clean-ups performed:
  - Renaming particular terms: wwi changed to ww1
  - Removing unnecessary whitespace

Methodology (prelim)
- Only student web queries were extracted
- Approximately 10,000 for preliminary results from 315 students
- 6,477 student web queries before preprocessing
- 1,160 unique web queries remained after preprocessing

Term Frequency Table
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
word        freq
ww1         1750
propaganda  1249
plastic      315
surgeri
train        306
face         244
disfigur     201
compar       182
show         164
cartoon      152
battl        145
york         143
mammoth      129
woolli       125
wheat        120
meme         117
border       104
forest
argonn       100

Results of Exploratory Text Analysis
1,687 unique terms with a total frequency of [value not given] were obtained from the Document-Term Matrix. The sum of the term frequencies associated with World War 1 listed in the table below is 4,973.

Methodology continued…
- Research was conducted in three stages: data collection, model specification and model evaluation
- Binary classification with supervised learning algorithms
  - Naïve Bayes (NB)
  - Support Vector Machines (SVM)
- Proposed a new model architecture and developed a new algorithm (SWQC) outperforming NB and SVM
- Single-variable regression analysis
- K-means clustering
