1 Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4
1 Nanjing University of Aeronautics and Astronautics; 2 Tsinghua University; 3 IBM T.J. Watson Research Center, USA; 4 Peking University

2 Introduction
The web is becoming more and more heterogeneous.
Ranking is the fundamental problem on the web:
–unsupervised vs. supervised
–homogeneous vs. heterogeneous

3 Motivation
Heterogeneous cross domain ranking
Main challenges:
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?

4 Outline
Related Work
Heterogeneous cross domain ranking
Experiments
Conclusion

5 Related Work
Learning to rank
–Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]
–Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
–Ranking adaptation: [Chen, 08]
Transfer learning
–Instance-based: [Dai, 07] [Gao, 08]
–Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]
–Model-based: [Bonilla, 08]

6 Outline
Related Work
Heterogeneous cross domain ranking
–Basic idea
–Proposed algorithm: HCDRank
Experiments
Conclusion

7 [Figure for the query “data mining”. Labels: Conference, Expert, Latent Space, Source Domain, Target Domain, mis-ranked pairs. Note on the figure: might be empty! (no labelled data in target domain)]

8 Learning Task
In the HCD ranking problem, the transfer ranking task can be defined as follows: given a limited amount of labeled data L_T and a large amount of unlabeled data S from the target domain, together with sufficient labeled data L_S from the source domain, the goal is to learn a ranking function f_T^* that predicts the rank levels of the unlabeled data in the target domain.
Key issues:
–Different feature distributions or different feature spaces
–Different numbers of rank levels
–Very unbalanced numbers of labeled training examples (thousands vs. a few)

9 The Proposed Algorithm: HCDRank
How to define? How to optimize?
Objective terms: loss function in source domain, loss function in target domain, penalty
Loss function: number of mis-ranked pairs
C: cost-sensitive parameter that deals with the imbalance of labeled data between domains
\lambda: balances the empirical loss and the penalty
Optimization: non-convex; dual problem

10 The Proposed Algorithm: HCDRank
Same content as the previous slide, with one added annotation: unsolvable
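
The formula itself did not survive in the transcript, so the following is only a minimal sketch of how the listed pieces could fit together. It assumes the objective has the form (source loss) + C × (target loss) + \lambda × (penalty), that a score tie on a preference pair counts as mis-ranked, and that the penalty is supplied as a precomputed number. The function names are hypothetical and this is not the authors' implementation.

```python
from itertools import combinations


def misranked_pairs(scores, labels):
    """Count preference pairs whose predicted order contradicts the labels.

    Assumption: a tie in scores on a pair with different labels counts
    as mis-ranked.
    """
    count = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equal rank levels express no preference
        better, worse = (i, j) if labels[i] > labels[j] else (j, i)
        if scores[better] <= scores[worse]:
            count += 1
    return count


def objective(source, target, C, lam, penalty):
    """Assumed form: source loss + C * target loss + lam * penalty.

    `source` and `target` are (scores, labels) tuples; `penalty` is a
    precomputed regularization value (hypothetical placeholder).
    """
    source_loss = misranked_pairs(*source)
    target_loss = misranked_pairs(*target)
    return source_loss + C * target_loss + lam * penalty
```

With an empty target set, as in the "might be empty" case from the basic-idea slide, the target term simply contributes zero and the objective reduces to the source loss plus the penalty.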

11 Alternately optimize matrix M and D: O(2T*sN log N)
Construct transformation matrix: O(d^3)
Learning in latent space: O(sN log N)
Total: O((2T+1)*sN log N + d^3)
Learn weight vector of target domain
Apply learnt weight vector to predict
d: number of features; N: number of instance pairs for training; s: number of non-zero features
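
As a quick check of the arithmetic, the three per-step costs add up to the stated total. The sketch below treats the big-O expressions as plain operation counts with constant factor 1, which is an assumption made only for illustration. T is not defined on the slide; it is presumably the number of alternating iterations. The function names are hypothetical.

```python
import math


def hcdrank_cost_terms(T, s, N, d):
    """Per-step operation-count estimates taken from the slide's O(.) terms."""
    pair_pass = s * N * math.log(N)  # one sN log N pass over the training pairs
    return {
        "alternate_M_D": 2 * T * pair_pass,  # O(2T * sN log N)
        "transformation": d ** 3,            # O(d^3)
        "latent_learning": pair_pass,        # O(sN log N)
    }


def total_cost(T, s, N, d):
    """Sum of the three terms, i.e. (2T + 1) * sN log N + d^3."""
    return sum(hcdrank_cost_terms(T, s, N, d).values())
```

The d^3 term depends only on the number of features, so for a large number of training pairs N the (2T+1)*sN log N part dominates.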

12 Outline
Related Work
Heterogeneous cross domain ranking
Experiments
–Ranking on Homogeneous data
–Ranking on Heterogeneous data
–Ranking on Heterogeneous tasks
Conclusion

13 Experiments
Data sets
–Homogeneous data set: LETOR_TR, with 50/75/106 queries and 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR
–Heterogeneous academic data set: ArnetMiner.org, with 14,134 authors, 10,716 papers, and 1,434 conferences
–Heterogeneous task data set: 9 queries, 900 experts, 450 best supervisor candidates
Evaluation measures
–P@n: Precision@n
–MAP: mean average precision
–NDCG: normalized discounted cumulative gain
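
The slides do not spell out the metric definitions, so here is a minimal sketch of the three measures under common conventions. The assumptions are: each ranked list is given as relevance grades in ranked order, any grade above 0 counts as relevant, average precision is normalized by the number of relevant items that appear in the list, and NDCG uses the gain 2^rel - 1 with a log2(rank + 1) discount. Other variants exist, and the authors' exact choices may differ.

```python
import math


def precision_at_n(rels, n):
    """Fraction of the top-n results that are relevant (grade > 0)."""
    return sum(1 for r in rels[:n] if r > 0) / n


def average_precision(rels):
    """Mean of precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rel_lists):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in rel_lists) / len(rel_lists)


def dcg_at_n(rels, n):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:n], start=1))


def ndcg_at_n(rels, n):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(rels, reverse=True), n)
    return dcg_at_n(rels, n) / ideal if ideal > 0 else 0.0
```

For example, a list `[1, 0, 1, 0]` has a relevant result at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2.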

14 Ranking on Homogeneous data
LETOR_TR
–We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
–Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
Baselines

15 [Figure: results on OHSUMED_TR, TREC2004_TR and TREC2003_TR. Cosine-similarity labels shown on the panels: 0.01, 0.23, 0.18.]

16 Observations
Ranking accuracy: HCDRank is 5.6% to 6.1% better in terms of MAP.
Effect of difference: when cosine similarity is high (TREC2004), simply combining the two domains would result in better ranking performance.
Training time: next slide
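
The transcript does not say which vectors the cosine similarity is computed over, for instance some per-domain summary of the feature vectors. As a minimal sketch, assuming each domain is summarized by one numeric vector of equal length, the measure itself can be computed as follows. The function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns 0.0 when either vector is all zeros (an assumption made here
    to avoid division by zero).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A value close to 1 means the two summaries point in almost the same direction. A value close to 0 means they are nearly orthogonal.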

17 Training Time
But HCDRank can easily be parallelized, and the training process only needs to be run once on a data set.

18 Ranking on Heterogeneous data
ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
Training and test data set:
–44 most frequently queried keywords from the log file
Author collection: Libra, Rexa and ArnetMiner
Conference collection: Libra, ArnetMiner
Ground truth:
–Conference: online resources
–Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking

19 Feature Definition
Feature: Description
L1-L10: Low-level language model features
H1-H3: High-level language model features
S1: How many years the conference has been held
S2: Total number of citations of the conference during the recent 5 years
S3: Total number of citations of the conference during the recent 10 years
S4: How many years have passed since his/her first paper
S5: Total number of citations of all the publications of one expert
S6: How many papers have been cited more than 5 times
S7: How many papers have been cited more than 10 times
16 features for a conference, 17 features for an expert
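
The expert-side features S4 to S7 are simple aggregates over an expert's publication list. The sketch below assumes a hypothetical `Paper` record holding only a publication year and a citation count, and a non-empty paper list. It illustrates the definitions and is not the authors' extraction code.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    year: int        # publication year
    citations: int   # number of citations received


def expert_features(papers, current_year):
    """Compute S4-S7 for one expert from a non-empty list of papers."""
    return {
        "S4": current_year - min(p.year for p in papers),  # years since first paper
        "S5": sum(p.citations for p in papers),            # total citations
        "S6": sum(1 for p in papers if p.citations > 5),   # papers cited > 5 times
        "S7": sum(1 for p in papers if p.citations > 10),  # papers cited > 10 times
    }
```

Note that the thresholds are strict: a paper with exactly 5 citations does not count towards S6.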

20 Expert Finding Results

21 Observations
Ranking accuracy: HCDRank outperforms the baselines, especially the two unsupervised systems.
Feature analysis (next slide): the final weight vector exploits the data information from the two domains and adjusts the weights learned from single-domain data.
Training time: next slide

22 Feature Correlation Analysis

23 Ranking on Heterogeneous tasks
Expert finding task vs. best supervisor finding task
Training and test data set:
–Expert finding task: ranking lists from ArnetMiner or annotated lists
–Best supervisor finding task: 9 most frequent queries from the log file of ArnetMiner. For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
Ground truth:
–Collection of feedback about the candidates (yes/no/not sure)

24 Best supervisor finding
Training/test set and ground truth
–724 mails sent
–Fragment of mail
–Effective feedback received: > 82 (increasing)
–Rate each candidate by the definite feedback (yes/no)
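
The slide says that candidates are rated from the definite answers only, but it does not give the formula. One plausible reading, shown here purely as an assumption, is the share of "yes" answers among the "yes" and "no" answers, with "not sure" ignored. The function name is hypothetical.

```python
from collections import Counter


def rate_candidate(responses):
    """Share of 'yes' among definite ('yes'/'no') responses.

    Returns None when a candidate has no definite feedback at all.
    This scoring rule is an assumption, not taken from the slides.
    """
    counts = Counter(responses)
    definite = counts["yes"] + counts["no"]
    if definite == 0:
        return None
    return counts["yes"] / definite
```

For instance, `rate_candidate(["yes", "no", "not sure", "yes"])` uses only the three definite answers.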

25 Feature Definition
Feature: Description
L1-L10: Low-level language model features
H1-H3: High-level language model features
B1: The year he/she published his/her first paper
B2: The number of papers of an expert
B3: The number of papers in recent 2 years
B4: The number of papers in recent 5 years
B5: The number of citations of all his/her papers
B6: The number of papers cited more than 5 times
B7: The number of papers cited more than 10 times
B8: PageRank score
SumCo1-SumCo8: The sum of coauthors' B1-B8 scores
AvgCo1-AvgCo8: The average of coauthors' B1-B8 scores
SumStu1-SumStu8: The sum of his/her advisees' B1-B8 scores
AvgStu1-AvgStu8: The average of his/her advisees' B1-B8 scores
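
The Sum/Avg features aggregate the B1-B8 scores of a person's coauthors or advisees. A minimal sketch, assuming the base scores are already available as a mapping from a person id to a list of eight numbers, and using 0.0 as the average when there are no neighbours (an assumption). The function and argument names are hypothetical.

```python
def aggregate_scores(base, neighbours, prefix):
    """Build Sum<prefix>1..8 and Avg<prefix>1..8 from neighbours' B1-B8.

    base:       dict mapping person id -> list of 8 scores (B1..B8)
    neighbours: ids of the coauthors (prefix "Co") or advisees (prefix "Stu")
    """
    features = {}
    for k in range(8):
        values = [base[p][k] for p in neighbours]
        total = sum(values)
        features[f"Sum{prefix}{k + 1}"] = total
        # Assumption: average defaults to 0.0 when there are no neighbours.
        features[f"Avg{prefix}{k + 1}"] = total / len(values) if values else 0.0
    return features
```

Calling it once with prefix `"Co"` and once with prefix `"Stu"` yields the 32 aggregated features listed in the table.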

26 Best supervisor finding results

27 Experimental Results

28 Outline
Related Work
Heterogeneous cross domain ranking
Experiments
Conclusion

29 Conclusion
We formally define the problem of heterogeneous cross domain ranking and propose a general framework.
We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in two domains.
The experimental results on three different genres of data sets verify the effectiveness of the proposed algorithm.

30 Data Set

31 Ranking on Heterogeneous data
A subset of ArnetMiner (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
44 most frequently queried keywords from the log file
Author collection:
–For each query, we gathered the top 30 experts from Libra, Rexa and ArnetMiner
Conference collection:
–For each query, we gathered the top 30 conferences from Libra and ArnetMiner
Ground truth:
–Three online resources:
http://www.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html
http://www3.ntu.edu.sg/home/ASSourav/crank.htm
http://www.cs-conference-ranking.org/conferencerankings/alltopics.html
–Two faculty members and five graduate students from CS provided human judgments

32 Ranking on Heterogeneous tasks
For the expert finding task, we can use results from ArnetMiner or annotated lists as training data
For the best supervisor task, the 9 most frequent queries from the log file of ArnetMiner are used
–For each query, we sent emails to 100 researchers:
Top 50 researchers by ArnetMiner
Top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduates or postdoctoral researchers)
–Collection of feedback:
50 best supervisor candidates (yes/no/not sure)
Also add other candidates
–Ground truth
