PAKDD'15 DATA MINING COMPETITION: GENDER PREDICTION BASED ON E-COMMERCE DATA Team members: Maria Brbic, Dragan Gamberger, Jan Kralj, Matej Mihelcic, Matija Piskorec, Tomislav Smuc

OVERVIEW
Team name: gambi
Final score: (12th position)
Crucial part of the work: how to construct good features?
Feature construction process:
1. Features containing general information
2. Construction from the time slot information
3. Construction from the information about the viewed items
4. Construction from the dataset enriched using recommendation algorithms
5. Construction from the error analysis

SOLUTION WORKFLOW

GENERAL INFORMATION FEATURES
- Number of items viewed during the session
- Duration of the session
  - Some sessions had an unreasonably long duration
  - Assumption: some users forgot to log out
  - Overly long sessions were assigned a missing value
- Number of items divided by duration
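The session-level features above can be sketched as follows. This is a minimal illustration, not the team's code: the column names and the 240-minute cut-off for "overly long" sessions are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical session log: one row per session with start/end times and item count
sessions = pd.DataFrame({
    "session_id": [1, 2, 3],
    "n_items": [5, 2, 40],
    "start": pd.to_datetime(["2014-12-01 10:00", "2014-12-01 11:00", "2014-12-01 09:00"]),
    "end":   pd.to_datetime(["2014-12-01 10:10", "2014-12-01 11:01", "2014-12-02 09:00"]),
})

sessions["duration_min"] = (sessions["end"] - sessions["start"]).dt.total_seconds() / 60

# Implausibly long sessions (user likely forgot to log out) get a missing value;
# the actual cut-off is an assumption, the slides do not state one.
MAX_MINUTES = 240
sessions.loc[sessions["duration_min"] > MAX_MINUTES, "duration_min"] = np.nan

sessions["items_per_min"] = sessions["n_items"] / sessions["duration_min"]
```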

SOLUTION WORKFLOW

TEMPORAL FEATURES
- Day of the week, month, year
  - 7 binary features for the day of the week
  - Extra binary features for Saturday and Sunday
  - Extra binary features for the beginning, middle and end of the month
- Month
- Christmas time
  - Five or fewer days before Christmas
- Hour
  - Exact hour (24 binary features)
  - Binary features: working hours, morning, evening, night
- The percentage of males in a given time slot
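A hedged sketch of a few of these temporal features. The column names ("start", "is_male") and the choice of hour-of-day as the time slot are assumptions for illustration; only the feature definitions come from the slide.

```python
import pandas as pd

# Toy data: one timestamp per session plus the (training-set) gender label
df = pd.DataFrame({
    "start": pd.to_datetime(["2014-12-20 09:30", "2014-12-21 22:00", "2014-12-22 14:00"]),
    "is_male": [1, 0, 1],
})

dow = df["start"].dt.dayofweek  # Monday=0 .. Sunday=6
# 7 binary day-of-week features (the Categorical forces all 7 columns to appear)
df = df.join(pd.get_dummies(pd.Categorical(dow, categories=range(7)), prefix="dow"))
df["is_saturday"] = (dow == 5).astype(int)
df["is_sunday"] = (dow == 6).astype(int)

hour = df["start"].dt.hour
df["working_hours"] = hour.between(9, 17).astype(int)  # assumed 9-17 window
df["night"] = ((hour >= 22) | (hour < 6)).astype(int)

# Christmas flag: five or fewer days before Christmas (Dec 20-24)
df["pre_christmas"] = ((df["start"].dt.month == 12)
                       & df["start"].dt.day.between(20, 24)).astype(int)

# Percentage of males per time slot, estimated from the training labels
male_pct = df.groupby(hour)["is_male"].mean()
df["slot_male_pct"] = hour.map(male_pct)
```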

SOLUTION WORKFLOW

VIEWED ITEMS FEATURES
- Single categories
  - One feature for each category ID appearing in at least 3 distinct sessions
- Category pairs and triplets
  - One feature for each pair/triplet of categories appearing in at least 3 distinct sessions, calculated as the number of joint appearances
- Ordered pairs of categories
  - Number of times category i precedes category j at a distance of k or less (for k = 1, 2, 3)
- First category
  - The category viewed first in the session, for the A, B and C hierarchy levels
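The ordered-pair counting above can be sketched in a few lines. This is an illustrative implementation of the stated definition, not the team's code; the function name is invented.

```python
from collections import Counter

def ordered_pair_counts(sequence, k):
    """Count, for each ordered category pair (i, j), how often i precedes j
    at a distance of at most k positions in the session's viewing sequence."""
    counts = Counter()
    for pos, cat_i in enumerate(sequence):
        # look at most k positions ahead of cat_i
        for cat_j in sequence[pos + 1: pos + 1 + k]:
            counts[(cat_i, cat_j)] += 1
    return counts

# Toy session: sequence of viewed category IDs
views = ["A", "B", "A", "C"]
pairs = ordered_pair_counts(views, k=2)
```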

VIEWED ITEMS FEATURES
- Rare categories
  - Categories appearing in fewer than three instances
  - One feature for each hierarchy level belonging to the specific parent category, calculated as the sum of all rare categories
- Due to the sparsity of the generated features (0.15% of the elements were non-zero), we performed non-negative matrix factorization (NMF) on the merged train and test set
- The 100 factors from NMF were then used as features
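The NMF step could look roughly like the sketch below, using scikit-learn. The random matrix stands in for the real session-by-category counts, and a small factor count is used to keep the toy fast; the slides used 100 factors.

```python
from scipy.sparse import rand
from sklearn.decomposition import NMF

# Stand-in for the very sparse session-by-category count matrix
# (in the described solution, train and test sets were stacked before factorizing)
X = rand(300, 500, density=0.01, random_state=0)

# 10 factors here for speed; the slides report 100
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)  # one factor vector per session, used as features
```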

SOLUTION WORKFLOW

DATA ENRICHMENT
- Idea: enrich the data using recommendation algorithms
- Weighted item k-NN with 10 nearest neighbours was used to predict new items
  - For each session we recommended the 4 new items with the top-ranked scores
  - Recommendations for sessions containing only items unique to those sessions were ignored
- New features were constructed from this more abundant set of items
  - For each category, all appearances of distinct items that were viewed only by a predefined percentage of females/males were summed
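A minimal sketch of item-based recommendation in the spirit of the slide: cosine similarity between item columns, then scoring unseen items by summed similarity to the session's items. The matrix, neighbourhood handling and function name are simplifications, not the actual weighted item k-NN setup.

```python
import numpy as np

# Toy session-by-item binary view matrix (rows: sessions, cols: items)
R = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Cosine similarity between item columns
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

def recommend(session_row, sim, n=2):
    """Score unseen items by summed similarity to the session's items."""
    scores = sim @ session_row
    scores[session_row > 0] = -np.inf  # do not re-recommend viewed items
    return np.argsort(scores)[::-1][:n]
```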

SOLUTION WORKFLOW

ERROR ANALYSIS
- We extracted instances labelled as males that were misclassified in cross-validation
- The main source of errors: products in the A00002 category and its subcategories
- New feature for sessions with the A00002 category, calculated as the percentage of males that viewed the same B product from the A00002 category in the same time slot
  - If there were multiple B products from the A00002 category, only the first one was taken into account
  - The number of sessions in which the B product appeared in a given time slot had to be greater than one
  - Sessions without the A00002 category got a missing value for this feature
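The per-(B product, time slot) male percentage, with the minimum-support condition, can be sketched with a pandas groupby. Column names and the coarse time slots are assumptions; in practice this statistic would be estimated on training labels only.

```python
import numpy as np
import pandas as pd

# Hypothetical training rows: sessions whose first B-level product is from A00002
df = pd.DataFrame({
    "b_product": ["B01", "B01", "B01", "B02", "B02"],
    "time_slot": ["morning", "morning", "evening", "morning", "morning"],
    "is_male":   [1, 0, 1, 1, 1],
})

grp = df.groupby(["b_product", "time_slot"])["is_male"]
pct = grp.transform("mean")  # percentage of males per (B product, time slot)
cnt = grp.transform("size")  # how many sessions share that pair
# keep the feature only where the pair occurred in more than one session
df["a00002_male_pct"] = pct.where(cnt > 1)
```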

CLASSIFICATION
- We used 642 features in total
- Classification algorithm: Random Forest with 1000 trees
- After obtaining predictions, we optimized the threshold to predict males more accurately at the cost of misclassifying some females
  - Using 10-fold cross-validation results, we checked where to put the threshold on the classifier confidence
  - The chosen threshold was then used on the test set predictions
- Cross-validation score: 0.822, after optimization
- Preliminary evaluation score:
- Final evaluation score:
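The threshold-tuning step above can be sketched as follows. The dataset is a synthetic stand-in, the tree count is reduced for speed (the slides report 1000 trees on 642 features), and balanced accuracy is assumed as the evaluation metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_predict

# Toy stand-in for the competition data
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Out-of-fold probabilities of the "male" class from 10-fold cross-validation
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]

# Choose the confidence threshold that scores best on the CV predictions...
thresholds = np.linspace(0.2, 0.8, 25)
scores = [balanced_accuracy_score(y, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
# ...and then apply best_t to the test-set probabilities
```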

CLASSIFICATION
- We also analyzed the importance of each generated feature
- The most important features were the temporal and NMF features

THANK YOU FOR YOUR ATTENTION!