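Introduction to Dating Competition COMP621U. The First National College Student Data Mining Invitational Competition (第一届全国大学生数据挖掘邀请赛), March 22, 2011 ~ April 27, 2011. Sponsor: 上海花千树信息科技有限公司 (Shanghai Huaqianshu Information Technology Co., Ltd.).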


1 Introduction to Dating Competition COMP621U

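2 The First National College Student Data Mining Invitational Competition (第一届全国大学生数据挖掘邀请赛) http://www.statmodelingcompetition.com
March 22, 2011 ~ April 27, 2011
Sponsor
– 上海花千树信息科技有限公司 (Shanghai Huaqianshu Information Technology Co., Ltd.)
– 世纪佳缘 (Jiayuan) http://www.jiayuan.com/
Co-organizers
– School of Management, University of Science and Technology of China (中国科学技术大学管理学院)
– School of Statistics, Renmin University of China (中国人民大学统计学院)
– 统计之都 (Capital of Statistics, COS) website
Goal
– Provide an intelligent member-recommendation algorithm for a large dating website aimed at finding marriage partners, improving the accuracy of member recommendations and increasing site stickiness
Required at the defense: paper and source code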

3 Workflow (User A, User B)
Step 1: the system recommends (“rec”) user B to user A
Step 2: user A clicks (“click”) the photo of user B (or ignores it)
Step 3: user A sends a message (“msg”) to user B (or ignores them)
Relevance score: 2 = “msg”, 1 = “click”, 0 = “rec”
Impact: the recommendations can make a difference to people's whole lives

4 train.txt: 8,599,012 lines
– 15,000 unique USER_ID_A
– 55,871 unique USER_ID_B
– 59,921 unique users (10,950 overlapped, i.e. appear as both A and B)
test.txt: 3,311,076 lines
– 10,433 unique USER_ID_A
– 54,409 unique USER_ID_B
– 57,352 unique users (7,490 overlapped)
Actions (the counts sum to the 8,599,012 lines of train.txt)
– “rec”: 8,366,058 (97.29%)
– “click”: 184,291 (2.14%)
– “msg”: 48,663 (0.57%)
How to make use of “ROUND”?
-> sequential information/constraint (?)
-> only take the highest relevance (?)
How to make use of “REC_TIMES” (in the last three months)?
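The "only take the highest relevance" option raised above could be sketched roughly as follows. The slides do not give the column layout of train.txt, so the sketch assumes each line has already been parsed into a (USER_ID_A, USER_ID_B, ROUND, action) tuple; that tuple shape and the function name are illustrative assumptions, not the competition's documented format.

```python
# Relevance scores as stated on the workflow slide: "msg" = 2, "click" = 1, "rec" = 0.
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}


def highest_relevance(records):
    """Collapse repeated (A, B) records to the highest relevance observed.

    `records` is an iterable of (user_a, user_b, round_no, action) tuples.
    This tuple shape is an assumption made for illustration; ROUND is
    ignored here, which is exactly what this option trades away.
    """
    best = {}
    for user_a, user_b, _round_no, action in records:
        key = (user_a, user_b)
        score = RELEVANCE[action]
        if score > best.get(key, -1):
            best[key] = score
    return best


# Toy records (made up): b1 is recommended to a1, then clicked, then messaged.
sample = [
    ("a1", "b1", 1, "rec"),
    ("a1", "b1", 2, "click"),
    ("a1", "b1", 3, "msg"),
    ("a1", "b2", 1, "rec"),
]
labels = highest_relevance(sample)
```

Keeping only the maximum discards the sequential information in ROUND, so it is the simpler of the two options listed, not necessarily the better one.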

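5 (Venn diagrams: overlap between the training and test sets)
TRAIN-A: 15,000 users; TEST-A: 7,546 users also in TRAIN-A, 2,887 only in the test set
ALL USER-B (57,133): 53,147 common to both sets, 2,724 only in the training set, 1,262 only in the test set
(1) Pure CF can help those 7,546 TEST-A users
(2) User profiles (?)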
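The split implied by points (1) and (2) can be sketched as a simple set partition: TEST-A users with history in TRAIN-A are candidates for pure CF, and the rest would need profile-based features. The function name and the toy IDs below are hypothetical.

```python
def split_test_users(train_a_ids, test_a_ids):
    """Partition TEST-A users by whether they also appear in TRAIN-A.

    Returns (warm, cold): `warm` users have training history that pure CF
    can use; `cold` users have none and would need profile information.
    """
    train_a = set(train_a_ids)
    warm = [u for u in test_a_ids if u in train_a]
    cold = [u for u in test_a_ids if u not in train_a]
    return warm, cold


# Toy IDs (made up) standing in for the real USER_ID_A values.
warm, cold = split_test_users(["a1", "a2", "a3"], ["a2", "a3", "a4"])
```

On the real data this partition should reproduce the 7,546 / 2,887 split shown in the diagram.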

6 User Profile
Files: profile_m.txt, profile_f.txt
ALL: Male: 344,552; Female: 203,843
– All users have profile information
– Male # vs. Female # is quite balanced
– We can learn more about the data distribution
Reduce to the problem of “learning to rank”
(1) Extract a feature vector from each (user A, user B) pair
(2) Extract the relevance score from the action (“msg”, “click”, “rec”)
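A minimal sketch of the two-step reduction above, turning each logged (user A, user B, action) triple into a learning-to-rank example. The slides do not list the profile fields, so the `age` and `city` keys, the two features, and the function names are purely hypothetical placeholders.

```python
# Relevance scores from the workflow slide.
RELEVANCE = {"rec": 0, "click": 1, "msg": 2}


def pair_features(profile_a, profile_b):
    """Step (1): build a feature vector for one (user A, user B) pair.

    The `age` and `city` fields are hypothetical; the real profile files
    may contain different attributes.
    """
    return [
        float(profile_a["age"] - profile_b["age"]),               # signed age gap
        1.0 if profile_a["city"] == profile_b["city"] else 0.0,   # same-city flag
    ]


def build_examples(records, profiles):
    """Step (2): attach the relevance label derived from the action.

    `records` holds (user_a, user_b, action) tuples. Each output row is
    (query_id, features, relevance), with user A acting as the query.
    """
    return [
        (user_a, pair_features(profiles[user_a], profiles[user_b]), RELEVANCE[action])
        for user_a, user_b, action in records
    ]


# Toy profiles and log records (made up).
profiles = {
    "a1": {"age": 30, "city": "Shanghai"},
    "b1": {"age": 27, "city": "Shanghai"},
    "b2": {"age": 35, "city": "Beijing"},
}
examples = build_examples([("a1", "b1", "msg"), ("a1", "b2", "rec")], profiles)
```

Grouping rows by user A mirrors the evaluation, which ranks the USER-Bs recommended to each USER-A.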

7 Evaluation and Submission
What to submit?
– Each line (USER-A): an ordered list of the corresponding USER-Bs
Performance evaluation: NDCG@10
– Average NDCG@10 over the 10,433 TEST User-As
– If NDCG@10 is comparable, NDCG@20 is also considered
– The committee will also consider other issues for real deployment if the NDCG performance is very similar
(Figure labels: gain, position discount, cumulating)
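The metric above can be sketched as follows. The slide names the three ingredients (gain, position discount, cumulating) but the exact formula did not survive in the transcript, so this sketch assumes one widely used variant: gain 2^rel − 1 and discount log2(position + 1). The competition's official scorer may differ, including in how it treats users whose items all have relevance 0.

```python
import math


def dcg_at_k(relevances, k=10):
    """Cumulate gain / position discount over the top-k ranked items.

    Assumed variant: gain = 2**rel - 1, discount = log2(position + 1),
    with positions starting at 1 (so `pos + 2` for a 0-based index).
    """
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """Normalize DCG by the DCG of the ideal (descending) ordering.

    `relevances` lists the true relevance (0, 1 or 2) of each USER-B in
    the submitted order. Returning 0.0 when no item is relevant is an
    assumption of this sketch.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


def mean_ndcg(rankings, k=10):
    """Average NDCG@k over all USER-As (one relevance list per user)."""
    return sum(ndcg_at_k(r, k) for r in rankings) / len(rankings)


# Toy submission (made up): the first user is ranked perfectly, the second is not.
score = mean_ndcg([[2, 1, 0], [0, 2]])
```

Calling `mean_ndcg(rankings, k=20)` gives the NDCG@20 tie-breaker under the same assumptions.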

8 Discussion
– Learning to rank, CF (+content), association rule mining (since lots of features are categorical)
– Transductive (semi-supervised) learning
– More study of the data distributions of the training and test sets is needed (whether there is a significant mismatch)
– Temporal information/constraints
– One very important piece of information is missing: USER-B’s photo
– A latent factorization approach may help alleviate this a bit
– Is there some information we can crawl from http://www.jiayuan.com/ ?
– Shall we incorporate some prior knowledge as constraints (e.g. “门当户对”, the idea that partners should come from families of similar social and economic standing)?

9 User-Product recommendation (e.g. KDDCUP2011, Netflix) vs. people-to-people recommendation (e.g. Dating competition)
(1) Recommend people to people (much higher social impact)
(2) The “like-minded” assumption in CF may not hold
(3) Proximity: asymmetric vs. symmetric (new recommendation model needed)
(4) The content information (e.g. user profile) is definitely very important

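10 Others
-------------------------
Q: Can students from Hong Kong, Macau, and Taiwan participate?
A: Yes, they are welcome.
-------------------------
Q: How do I obtain the modeling dataset? May I pass the dataset on to others?
A: The dataset may be used only for analysis and modeling in this competition, and only by registered online users. It may not be used for any other commercial purpose. For academic research or publication, please contact 上海花千树信息科技有限公司 (Shanghai Huaqianshu Information Technology Co., Ltd.) to obtain authorization. The competition committee does not have the authority to grant it.
-------------------------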

