Copyright © 2015 KDDI R&D Labs. Inc. All Rights Reserved


0 Approach to Generate a Vast Variety of Features for Predicting Dropout in MOOC
Takuya Akiyama, Kei Yonekawa, Aakansh Gupta, Nuo Zhang, Shigeki Muramatsu, Rui Kimura, Nobuyuki Maita, Yujin Tang, Takafumi Watanabe, Akihiro Kobayashi, Kazunori Matsumoto, and Keiichi Kuroyanagi

1 The Final Results

2 “KDDILABS&Keiku” Members
Account Name    Full Name             Affiliations
t.MF            Akiyama, Takuya       KDDI R&D Laboratories, Inc.
Aakansh         Gupta, Aakansh        Uhuru Corporation
NoahZh          Zhang, Nuo
kyone           Yonekawa, Kei
mz-matsumoto    Matsumoto, Kazunori
mura            Muramatsu, Shigeki
ruik            Kimura, Rui
no6est          Maita, Nobuyuki
Yujin           Tang, Yujin
Keiku           Kuroyanagi, Keiichi   Financial Engineering Group, Inc.
TakWat          Watanabe, Takafumi
apf-koba        Kobayashi, Akihiro
Note on the slide: Working at KDDILABS office

3 What is “KDDI”?
Knowledge Discovery and Data mining Institute …? NO!
A Japanese telecommunication company.
“KDDI” is an acronym standing for Japanese words.
No relation between KDD2015 and KDDI.
We did NOT cheat.

4 System Overview
[Diagram: Original Data → Feature ×2000 → XGBoost†, Regularized Greedy Forest, and Deep Neural Network (bagging of 200 models) → Blend → Submit Data]
Our special twist is “strategic” feature engineering.

5 At First...
Each member started KDD Cup separately.
Each member created features separately.
In late June, we were merged as a team.
The number of features: over 1500 “Basic Features”.

6 Examples of Basic 1500 Features
[Figure: access logs of one eID plotted on a time axis ahead of the target prediction interval, annotated with example durations from seconds to hours.]
Features illustrated: the number of logs of the eID (counting up), the number of lags, and the time to the target prediction interval.
All features have variations with respect to labels such as time window, category, event, source, or their combination.
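A minimal sketch of count-style features like those illustrated above, for one eID. The function name `basic_features`, the input format (a list of `datetime` timestamps), the window lengths, and the reading of "lags" as gaps between consecutive logs are assumptions for illustration; the slides do not specify them.

```python
from datetime import datetime, timedelta

def basic_features(log_times, interval_start, windows_days=(1, 7, 30)):
    """Count-style features for one enrollment (eID).

    log_times: access-log timestamps (hypothetical input format).
    interval_start: start of the target prediction interval.
    windows_days: look-back windows in days (illustrative values).
    """
    # Only logs before the target prediction interval are usable.
    times = sorted(t for t in log_times if t < interval_start)
    feats = {"n_logs": len(times)}

    # Number of logs inside each look-back time window.
    for d in windows_days:
        cutoff = interval_start - timedelta(days=d)
        feats[f"n_logs_last_{d}d"] = sum(1 for t in times if t >= cutoff)

    # Lags, read here as gaps between consecutive logs, in seconds.
    lags = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    feats["n_lags"] = len(lags)
    feats["mean_lag_s"] = sum(lags) / len(lags) if lags else None

    # Time from the last log to the target prediction interval.
    feats["time_to_interval_s"] = (
        (interval_start - times[-1]).total_seconds() if times else None
    )
    return feats

logs = [
    datetime(2014, 6, 1, 0, 0, 0),
    datetime(2014, 6, 1, 0, 0, 30),
    datetime(2014, 6, 10, 0, 0, 0),
]
print(basic_features(logs, datetime(2014, 6, 11)))
```

Repeating the same counts per category, event, or source, as the slide describes, would multiply the number of features quickly.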

7 ROC Curve & Predicted Value Distribution
Prediction of training data by XGBoost with 10-fold cross-validation.
[Figure: ROC curve (True Positive Rate vs. False Positive Rate) and density of predicted values for dropout eIDs and non-dropout eIDs.]
Why can’t we predict the “lower right” eIDs accurately?
The “lower right” eIDs do not have enough logs (in some cases only 1 log), yet they did not drop out of the course.
Because there are so few logs, it is hard to predict their dropout probability from the basic features.

8 Our Strategy & Features
Strategy: create features which do NOT depend on the number of logs.
We created the features by 3 kinds of methods:
① Aggregating “Cross-Course” logs
② Using Idea of Recommendation System
③ Using Time-Series Prediction

9 ①Aggregating “Cross-Course” logs
Idea: About 1/3 of users attend multiple courses.
All users :
Users attending multiple courses : 38939
active course count   1      2      3     4     5     ...  28  29  30  31  39
the number of users   73509  20251  8237  4118  2277  ...
It is effective to create features from the logs of not only the target course but also the user's other active courses.

10 ①Aggregating “Cross-Course” logs
How to create features: A
Some users may have enrolled in multiple courses at once and then attended the courses one by one.
[Figure: timeline of User a's logs in Course A, Course B, and Course C, with the target prediction interval of Course A marked. Annotations: “Only 1 log at Course A, although there are many logs at Course B & Course C”; “There are few logs in the period”; “There is a high probability of attending Course A in this period.”]
Feature: count the number of logs, unique days, or unique courses in which logs exist, using a moving time window (window size: 5 days).
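The moving-window counts can be sketched as follows. This is an illustration under assumptions, not the team's code: the log format (a list of `(date, course_id)` pairs covering all of one user's courses), the function names, and the 1-day step between windows are hypothetical; only the 5-day window size comes from the slide.

```python
from datetime import date, timedelta

def window_features(logs, start, window_days=5):
    """Counts over all of a user's courses inside [start, start + window_days)."""
    end = start + timedelta(days=window_days)
    in_win = [(d, c) for d, c in logs if start <= d < end]
    return {
        "n_logs": len(in_win),
        "n_unique_days": len({d for d, _ in in_win}),
        "n_unique_courses": len({c for _, c in in_win}),
    }

def moving_window_features(logs, first_start, last_start,
                           window_days=5, step_days=1):
    """Slide the window from first_start to last_start (step is an assumption)."""
    out = []
    start = first_start
    while start <= last_start:
        out.append((start, window_features(logs, start, window_days)))
        start += timedelta(days=step_days)
    return out

# Cross-course logs of one user: (day, course) pairs, toy data.
user_logs = [
    (date(2014, 6, 1), "A"),
    (date(2014, 6, 2), "B"),
    (date(2014, 6, 2), "B"),
    (date(2014, 6, 4), "C"),
    (date(2014, 6, 10), "B"),
]
for start, feats in moving_window_features(
        user_logs, date(2014, 6, 1), date(2014, 6, 3)):
    print(start, feats)
```

Because the counts pool logs from every course the user is active in, an eID with a single log in Course A can still receive informative feature values.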

11 ①Aggregating “Cross-Course” logs
How to create features: B
If there is some relationship between a target course and another course, logs of the target course may exist near logs of the other course.
[Figure: timeline of User a's logs in Course A and Course B (related courses), with the target prediction interval of Course A marked. Annotation: there is a high probability that logs of Course A exist near logs of Course B.]
Two steps:
Step 1: Make a matrix of interrelationships between all courses, where each entry is the transition probability from a log of one course to a log of another course.
Step 2: For the prediction interval, calculate the sum of products of the number of logs of each course and that course's interrelationship with the target course.
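The two steps might look like the sketch below. The details are assumptions: transitions are counted between consecutive logs in each user's time-ordered sequence of course IDs, rows are normalized to probabilities, and the target course itself is left out of the sum because its logs in the prediction interval are what is being predicted. The names `transition_matrix` and `relation_feature` are hypothetical.

```python
from collections import Counter, defaultdict

def transition_matrix(user_sequences):
    """Step 1: P[src][dst] = probability that a log of course src is
    followed by a log of course dst (estimated over all users)."""
    counts = defaultdict(Counter)
    for seq in user_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs

def relation_feature(interval_log_counts, probs, target):
    """Step 2: sum over other courses of
    (number of logs in the prediction interval) x P[course][target]."""
    return sum(
        n * probs.get(course, {}).get(target, 0.0)
        for course, n in interval_log_counts.items()
        if course != target  # assumption: target-course logs are excluded
    )

# Toy data: each list is one user's time-ordered sequence of course IDs.
sequences = [["B", "A", "B", "A"], ["B", "C"]]
P = transition_matrix(sequences)

# The user has 3 logs of Course B and 2 of Course C in the interval.
print(relation_feature({"B": 3, "C": 2}, P, target="A"))
```

A large value means the user is active in courses whose logs tend to be followed by logs of the target course.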

12 ②Using Idea of Recommendation System
We want to create features WITHOUT using logs.
→ Other users whose course enrollment patterns are similar to the user's are useful.
How to create features: Collaborative Filtering, which is often used in recommendation systems on e-commerce sites and search engines.

13 ②Using Idea of Recommendation System
Collaborative Filtering:
Step 1: Calculate similarities between the user and other users by comparing the active course patterns of each user.
Step 2: Estimate a reasonable value as the weighted average of the values of other users whose similarities are higher than a threshold.
[Table: users A to D by courses 1 to 5; ○ means “enrolled”, × means “not enrolled”. Similarity to User A: B 0.7, C 0.2, D 0.8.]
[Table: feature value (in this case, the number of logs) per user and course. Cell values in extraction order: A × 20, B 130 50, C 30, D 40 70.]
Worked example: (130×0.7+50×0.8)/( )=87
The user may continue to attend this course.
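A sketch of the two collaborative filtering steps, under stated assumptions. The slide does not name the similarity measure, so Jaccard similarity over sets of enrolled courses is used here as a stand-in. The threshold of 0.5 is hypothetical, and the denominator of the worked example is missing from the transcript; using the sum of the retained similarities, as a weighted average normally would, gives about 87.3, which is consistent with the slide's 87.

```python
def jaccard(courses_a, courses_b):
    """Similarity of two users' active course patterns (stand-in measure)."""
    a, b = set(courses_a), set(courses_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cf_estimate(similarities, values, threshold=0.5):
    """Weighted average of other users' feature values, keeping only
    users whose similarity exceeds the threshold (threshold is assumed)."""
    kept = [
        (similarities[user], value)
        for user, value in values.items()
        if similarities.get(user, 0.0) > threshold
    ]
    weight_sum = sum(w for w, _ in kept)
    if weight_sum == 0:
        return None  # no sufficiently similar user
    return sum(w * v for w, v in kept) / weight_sum

# Similarities to User A as given on the slide.
sims = {"B": 0.7, "C": 0.2, "D": 0.8}
# Values paired with similarities as in the slide's formula (130 with 0.7,
# 50 with 0.8); C's value is illustrative and is filtered by the threshold.
vals = {"B": 130, "C": 30, "D": 50}

print(cf_estimate(sims, vals))
```

The estimate fills in a feature value, such as the number of logs, for a user and course where the user's own logs are too sparse to be informative.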

14 ③Using Time-Series Prediction
Idea: Is there a consistent trend in the number of unique users who attend the courses on specific days?
If we know the number of unique users who attend the courses in the dropout judgment period, and an ordering of users by how likely they are to attend in that period, we can see the boundary between dropout users and non-dropout users.
How to create features:
Use ARIMA, which is often used in financial prediction and telecommunication traffic prediction.
Predict the number of unique users in the judgment period from the transition of unique users over each specific time window (10 days).
Rank users according to the most useful feature values in the previous dropout prediction system.

15 ③Using Time-Series Prediction
[Figure: transitions of unique users (time window: 10 days), actual values and ARIMA prediction. The prediction gives the number of unique users who attend the course in days 31 to 40.]
Users are ranked according to specific feature values (for example, the number of logs), and the rank is normalized by the predicted number of unique users.
Username   Ranking of the feature value   Normalized value
a          1                              0.001
b          1000
c          2000                           2
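The normalization step can be sketched as rank divided by the predicted number of unique users. This reading is an assumption, though it matches the surviving table rows (rank 1 gives 0.001 and rank 2000 gives 2 if the prediction is 1000). The forecast itself is taken as an input here; in practice it could come from an ARIMA implementation such as the one in statsmodels. The function names are hypothetical.

```python
def normalize_rank(rank, predicted_unique_users):
    """Rank divided by the forecast number of attending users."""
    return rank / predicted_unique_users

def normalized_ranks(feature_values, predicted_unique_users):
    """Rank users by a feature (largest value = rank 1), then normalize.

    feature_values: {username: value}, e.g. the number of logs.
    predicted_unique_users: forecast for the judgment period, assumed to
    be produced separately (for example by an ARIMA model).
    """
    ordered = sorted(feature_values, key=feature_values.get, reverse=True)
    return {
        user: normalize_rank(i + 1, predicted_unique_users)
        for i, user in enumerate(ordered)
    }

# Ranks from the slide with an assumed forecast of 1000 unique users.
print(normalize_rank(1, 1000), normalize_rank(2000, 1000))

# Toy example: 3 users, forecast of 2 attending users.
print(normalized_ranks({"a": 50, "b": 10, "c": 3}, 2))
```

Under this reading, a normalized value at or below 1 places the user inside the predicted attending group, and a value above 1 places the user beyond the boundary described on the previous slide.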

16 Results
[Figure: predicted value distributions for dropout eIDs and non-dropout eIDs, shown in two panels.]
Predictions for the “lower right” eIDs are improved.

17 Results
Final AUC becomes
Final private score is

18 Miscellaneous
In this competition, we did not use some features created from the truth data of the training set because we were afraid of over-fitting to the training set. Perhaps this restricted more flexible ideas and was why we ranked no higher than 6th.
Creating a wide variety of useful features was important. Of course, the choice of three kinds of models (XGBoost, Regularized Greedy Forest, and bagged Deep Learning) was also important, so we really appreciate the authors of the models and libraries we used.

19 Thanks for your attention.

