Copyright 2015 Fujitsu R&D Center Co.,LTD FRDC’s approach at PAKDD’s Data Mining Competition Ruiyu Fang, Qingliang Miao, Cuiqin Hou, Yao Meng, Lu Fang,


1 FRDC’s approach at PAKDD’s Data Mining Competition
Ruiyu Fang, Qingliang Miao, Cuiqin Hou, Yao Meng, Lu Fang, Dajun Chen
Fujitsu Research and Development Center, Beijing, China
Copyright 2015 Fujitsu R&D Center Co.,LTD

2 Background
Gender prediction: the task in this competition is to predict a user’s gender from product viewing logs.
Our solution:
- Use the product viewing information within a single session.
- Use information across different sessions by exploring their potential associations.
We adopt a two-step strategy for gender prediction, which consists of “gender classification” and the “continuous session alignment model”.

3 Features for gender classification [1]
Product and category features
We treat the products and product categories viewed in each session as words in a document and apply the “bag of words” model.
Example session:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328
Resulting features:
A00001, A00001/B00001, A00001/B00001/C00075, A00001/B00001/C00075/D33237, A00001/B00001/C00075/D34328
Product and category features with timestamp
The timestamp is taken from the start time of each session (year, month, and day only).
Example sessions:
u10171 2014/12/20 20:31 2014/12/20 20:31 A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328/ male
u10174 2014/11/14 0:37 2014/11/14 0:37 A00001/B00001/C00019/D00044/ male
Resulting features:
2014/12/20/A00001, 2014/12/20/A00001/B00001, 2014/12/20/A00001/B00001/C00075, 2014/12/20/A00001/B00001/C00075/D33237;
2014/11/14/A00001, 2014/11/14/A00001/B00001, 2014/11/14/A00001/B00001/C00019, 2014/11/14/A00001/B00001/C00019/D00044;
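The path-style features on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `session_date` and `hierarchy_features` are hypothetical, and the sketch keeps each distinct feature once, which is an assumption because the slides do not preserve how feature values are computed.

```python
def session_date(start_time):
    """Keep only the year/month/day part of a 'YYYY/MM/DD HH:MM' start time."""
    return start_time.split()[0]


def hierarchy_features(session_items, date=None):
    """Return the distinct cumulative category paths for one session.

    session_items: viewed items separated by ';', each a '/'-separated path
                   such as 'A00001/B00001/C00075/D33237/'.
    date:          optional 'YYYY/MM/DD' string prepended to every feature.
    """
    features = []
    for item in session_items.split(";"):
        levels = [part for part in item.split("/") if part]
        for depth in range(1, len(levels) + 1):
            feature = "/".join(levels[:depth])
            if date is not None:
                feature = f"{date}/{feature}"
            if feature not in features:  # assumption: keep each feature once
                features.append(feature)
    return features


items = "A00001/B00001/C00075/D33237/;A00001/B00001/C00075/D34328"
plain = hierarchy_features(items)
stamped = hierarchy_features(items, session_date("2014/12/20 20:31"))
```

Here `plain` holds the five features listed for session u10171, and `stamped` holds the same paths prefixed with `2014/12/20/`.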

4 Features for gender classification [2]
Same-level product and category features with timestamp
Different products target different customers, so individual customers naturally hold a few fixed preferences for particular products and categories.
Example session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female
Resulting features:
2014/12/19/A00003, 2014/12/19/B00036, 2014/12/19/{C00175,C00190}, 2014/12/19/{D33072,D33078}
Product ID prefix with timestamp
We noticed that many products in the training data share the same product ID prefix. In the example below, both products share the prefix “D3307”. The prefix length is set to 4, which together with the leading “D” gives “D3307”.
Example session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078 female
Resulting feature:
2014/12/19/D3307
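A minimal Python sketch of the two feature types on this slide follows. The function names are hypothetical. Two assumptions are made: the first character of an ID identifies its level (A, B, C, D), and a prefix feature is emitted for every viewed product rather than only for prefixes shared by several products.

```python
def same_level_features(session_items, date):
    """Group the IDs seen in a session by level and prepend the date."""
    by_level = {}
    for item in session_items.split(";"):
        for part in item.split("/"):
            if part:
                # Assumption: the leading letter (A/B/C/D) marks the level.
                by_level.setdefault(part[0], set()).add(part)
    features = []
    for level in sorted(by_level):
        ids = sorted(by_level[level])
        body = ids[0] if len(ids) == 1 else "{" + ",".join(ids) + "}"
        features.append(f"{date}/{body}")
    return features


def prefix_features(session_items, date, prefix_digits=4):
    """Emit date-stamped product ID prefixes ('D' plus prefix_digits digits)."""
    prefixes = set()
    for item in session_items.split(";"):
        for part in item.split("/"):
            if part.startswith("D"):
                prefixes.add(part[: 1 + prefix_digits])
    return [f"{date}/{p}" for p in sorted(prefixes)]


items = "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/"
same_level = same_level_features(items, "2014/12/19")
prefixes = prefix_features(items, "2014/12/19")
```

For session u10204, `same_level` contains one feature per level, with the two categories and the two products grouped in braces, and `prefixes` contains the single feature `2014/12/19/D3307`.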

5 Features for gender classification [3]
Transferring features of sequential products
The transitions between sequentially viewed products may reflect the click habits of users of different genders.
Example session:
u10204 2014/12/19 14:40 2014/12/19 14:42 A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/ female
Resulting feature:
2014/12/19/D33072/C00175
Counts of different kinds of features:
Features  PF      PFT     SLT     PIP    TFT     Total
Count     22,464  35,828  11,403  5,157  17,231  92,083
Table 1. Counts of different kinds of features. PF denotes product and category features, PFT is PF with timestamp, SLT denotes same-level features with timestamp, PIP denotes product ID prefix features, and TFT denotes transferring features.
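The slide gives only one example of a transferring feature, so the sketch below is a guess at its construction rather than the authors' definition. It assumes the feature joins the session date, the product viewed first, and the category of the product viewed next. The function name `transfer_features` is hypothetical.

```python
def transfer_features(session_items, date):
    """Build 'date/previous product/next category' features for one session."""
    items = [[part for part in item.split("/") if part]
             for item in session_items.split(";")]
    items = [levels for levels in items if levels]
    features = []
    for prev, nxt in zip(items, items[1:]):
        prev_product = prev[-1]  # assumption: last level is the product
        # Assumption: second-to-last level is the product's category.
        next_category = nxt[-2] if len(nxt) >= 2 else nxt[-1]
        features.append(f"{date}/{prev_product}/{next_category}")
    return features


items = "A00003/B00036/C00190/D33072/;A00003/B00036/C00175/D33078/"
transfers = transfer_features(items, "2014/12/19")
```

A session with n viewed items yields n - 1 such features, so a single-item session yields none.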

6 Features for gender classification [4]
Feature value is calculated by: [formula not preserved in the transcript]
Classification model
We use libsvm [2], a well-established SVM library, with a linear kernel. Because the gender ratio in the training data is unbalanced, we set the weight of “male” sessions to 1.3 and the weight of “female” sessions to 0.25 during training.
Summary
- The final feature set is sparse and high-dimensional.
- Timestamp-based features greatly increase the feature dimension, but turn out to be useful.
- A linear classifier is efficient and works well on this data set.
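The training setup described above could be reproduced roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the label encoding (+1 for male, -1 for female), the binary feature values, and the file names are all hypothetical, and the helper names are made up for illustration. The `svm-train` options `-t 0` (linear kernel) and `-wi` (per-class weight) are standard libsvm command-line options.

```python
def build_vocab(sessions):
    """Map each distinct feature string to a 1-based index (LIBSVM convention)."""
    vocab = {}
    for features in sessions:
        for feature in features:
            if feature not in vocab:
                vocab[feature] = len(vocab) + 1
    return vocab


def to_libsvm_line(label, features, vocab):
    """Format one session as a sparse LIBSVM line with ascending indices.

    Assumption: every present feature gets the value 1; the slides do not
    preserve the actual feature value formula.
    """
    indices = sorted({vocab[f] for f in features if f in vocab})
    return " ".join([f"{label:+d}"] + [f"{i}:1" for i in indices])


# Assumed label encoding: +1 = male, -1 = female.
train_command = [
    "svm-train",
    "-t", "0",       # linear kernel
    "-w1", "1.3",    # weight for class +1 (male)
    "-w-1", "0.25",  # weight for class -1 (female)
    "train.libsvm",  # hypothetical training file
    "gender.model",  # hypothetical model file
]

vocab = build_vocab([["a", "b"], ["b", "c"]])
line = to_libsvm_line(1, ["b", "a"], vocab)
```

Writing one such line per session to the training file and running `train_command` (for example with `subprocess.run`) would train a weighted linear SVM, provided libsvm's `svm-train` binary is installed.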

7 Continuous Session Alignment Model

8 References
1. Zellig S. Harris. Distributional structure. Word, 10:146-162, 1954.
2. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

