1 Formal Models for Expert Finding on DBLP Bibliography Data
Presented by: Hongbo Deng
Co-worked with: Irwin King and Michael R. Lyu
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Dec. 16, 2008, ICDM 2008

2 Introduction
- Traditional information retrieval
- Expert finding task
  - Data mining

3 Outline
- Introduction
- Related work
- Methodology
  - Modeling expertise
  - Statistical language model
  - Topic-based model
  - Hybrid model
- Experiments
- Conclusions

4 Introduction
- Expert finding has received increased interest
  - W3C collection in 2005 and 2006 (introduced and used by TREC)
  - CSIRO collection in 2007
- Nearly all of the work has been evaluated on the W3C collection
- We address the expert finding task in a real-world academic field
  - An important practical problem
  - Some special problems and difficulties
Section: II. Introduction

5 Problems
- How to represent the expertise of a researcher?
  - The publications of a researcher
- How to identify experts for a given query?
  - Relevance between a query and publications
  - Publications act as the “bridge” between query and experts
- What dataset can be used?
  - DBLP bibliography (limited information)
  - Use Google Scholar as a data supplement
- How to measure the relevance between a query and documents?
  - Language model, vector space model, etc.
- Should we treat each publication equally?

6 Our Work
- Our setting: DBLP bibliography and Google Scholar
  - More than 955,000 articles with over 574,000 authors
  - About 20 GB of metadata crawled from Google Scholar
- Differs from the W3C setting
  - Covers a wider range of topics
  - Contains many more expert candidates
- Applications
  - Find experts for consultation on a new research field
  - Assign papers to reviewers automatically
  - Recommend panels of reviewers for grant applications

7 Related Work
- Document model & candidate model (Balog et al., SIGIR’06 & SIGIR’07)
- Hierarchical language models (Petkova and Croft, ICTAI’06)
- Voting model (Macdonald and Ounis, CIKM’06)
- Author-Persona-Topic model (Mimno and McCallum, KDD’07)
- ……
- These approaches do not consider the importance of documents and are hard to apply to large-scale expert finding.

8 Expertise Modeling
- Expert finding
  - p(ca|q): what is the probability of a candidate ca being an expert given the query topic q?
  - Rank candidates ca according to this probability.
- Approach: using Bayes’ theorem,
  p(ca|q) = p(ca, q) / p(q),
  where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of a query.
  - p(q) is the same for every candidate, so ranking by p(ca, q) gives the same order as ranking by p(ca|q).
Section: III. Methodology

9 Expertise Modeling
- Problem: how to estimate p(ca, q)?
- Model 1: Statistical language model
  - Document-based approach
  - Find the experts from the associated publications
- Model 2: Topic-based model
  - Association between the query and several similar topics
- Model 3: Hybrid model
  - Combination of Model 1 and Model 2

10 Basic Language Model
- The probability p_l(ca, q): (formula not preserved in the transcript; its annotations read "Language Model" and "Conditionally independent")
- Find documents relevant to the query
- Model the knowledge of an expert from the associated documents
Fig1. Baseline model
Section: III. Model 1: Statistical language model
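The document-based idea on this slide (find documents relevant to the query, then credit their authors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score has the form sum over d of p(d) * p(q|d) * p(ca|d), which is one reading of the "conditionally independent" annotation, and it assumes Jelinek-Mercer smoothing, a uniform document prior, and equal credit per co-author. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def query_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """p(q|d) under a unigram language model.

    Jelinek-Mercer smoothing with weight `lam` is an assumption of this sketch.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    prob = 1.0
    for term in query_terms:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_col = collection_tf.get(term, 0) / collection_len
        prob *= (1 - lam) * p_doc + lam * p_col
    return prob


def rank_candidates(query_terms, docs, authors, prior=None):
    """Rank candidates by sum_d p(d) * p(q|d) * p(ca|d).

    docs:    {doc_id: list of terms}
    authors: {doc_id: list of author names}
    prior:   optional {doc_id: p(d)}; uniform when omitted (assumption).
    """
    collection_tf = Counter(t for terms in docs.values() for t in terms)
    collection_len = sum(collection_tf.values())
    scores = defaultdict(float)
    for doc_id, terms in docs.items():
        p_d = prior[doc_id] if prior is not None else 1.0 / len(docs)
        p_q_d = query_likelihood(query_terms, terms, collection_tf, collection_len)
        # Assumption: each co-author receives an equal share of the document.
        p_ca_d = 1.0 / len(authors[doc_id])
        for ca in authors[doc_id]:
            scores[ca] += p_d * p_q_d * p_ca_d
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": ["expert", "finding", "language", "model"],
        "d2": ["image", "retrieval"],
        "d3": ["expert", "search"],
    }
    authors = {"d1": ["alice", "bob"], "d2": ["carol"], "d3": ["alice"]}
    for name, score in rank_candidates(["expert", "finding"], docs, authors):
        print(name, round(score, 5))
```

Passing a non-uniform `prior` is the hook through which a weighted variant could favor some documents over others.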

11 Weighted Language Model
Fig2. A query example
Fig3. Weighted model
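The transcript does not preserve how the weighted model assigns document weights. Since the deck elsewhere mentions collecting the citation count of each publication and stresses the prior probability, one plausible reading is a citation-based document prior p(d). The sketch below illustrates that idea only; the log damping and the name `citation_prior` are assumptions, not the authors' formula.

```python
import math


def citation_prior(citations):
    """Turn citation counts into a document prior p(d) that sums to 1.

    Assumption: weights are log-damped, w(d) = ln(e + citations), so that
    uncited documents still get weight 1 and heavily cited documents do not
    dominate completely.
    """
    weights = {doc_id: math.log(math.e + count) for doc_id, count in citations.items()}
    total = sum(weights.values())
    return {doc_id: w / total for doc_id, w in weights.items()}


if __name__ == "__main__":
    prior = citation_prior({"d1": 0, "d2": 10, "d3": 100})
    for doc_id, p in sorted(prior.items()):
        print(doc_id, round(p, 3))
```

A prior like this would replace the uniform p(d) of a baseline model, so that a highly cited publication contributes more to its authors' scores.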

12 Topic-based Model
- Observation: researchers usually describe their expertise as a combination of several topics
- Each candidate is represented as a weighted sum of multiple topics Z
  - Similarity between the query and the topics
  - z -> as a query, estimate p [...]
Fig4. Topic-based model
Section: III. Model 2: Topic-based model
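The "weighted sum of multiple topics" can be illustrated with a small sketch. It assumes two inputs that the transcript only hints at: a query-to-topic similarity for each selected topic z, and a per-topic candidate score obtained by using z itself as a query. Both input names are hypothetical, and the exact probabilities used on the slide are not preserved.

```python
def topic_based_scores(query_topic_sim, candidate_given_topic):
    """Score candidates as a similarity-weighted sum over topics.

    query_topic_sim:       {topic: similarity between the query and the topic}
    candidate_given_topic: {topic: {candidate: score when the topic is the query}}
    Both structures are assumptions made for illustration.
    """
    scores = {}
    for topic, weight in query_topic_sim.items():
        for candidate, p in candidate_given_topic.get(topic, {}).items():
            scores[candidate] = scores.get(candidate, 0.0) + weight * p
    return scores


if __name__ == "__main__":
    sim = {"information retrieval": 0.75, "data mining": 0.25}
    per_topic = {
        "information retrieval": {"alice": 0.5, "bob": 0.25},
        "data mining": {"bob": 0.5, "carol": 1.0},
    }
    print(topic_based_scores(sim, per_topic))
```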

13 Topic-based Model
Topic z (Information retrieval) -> Google Scholar -> returned records, which represent θ_z:
1. Introduction to Modern Information retrieval
2. Information retrieval
3. Modern Information retrieval
5. A language modeling approach to information retrieval
7. Information filtering and information retrieval
……
99. Cross-language information retrieval
100. On modeling information retrieval with probabilistic inference

14 Topic-based Model
- Challenge: which similar topics should be selected?
- T1: calculate p(q|θ_z) and select the top K ranked topics
  - Assumes topics are independent
- Ideal similar topics:
  - Include topics from many different subtopics
  - Do not include topics with high redundancy
  - Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic
- T2: (formula not preserved in the transcript)
- T3: (formula not preserved in the transcript)

15 Topic Selection Algorithm
- T2: (formula not preserved in the transcript)
- T3: (formula not preserved in the transcript)
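The T2 and T3 formulas are missing from the transcript, so they cannot be reproduced here. The goal stated in the deck (reward novelty, penalize redundancy) can still be illustrated with a generic greedy selection in the style of maximal marginal relevance. This is a swapped-in technique, not the authors' T2 or T3; the word-overlap similarity, the `penalty` parameter, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity between two topic names (illustrative only)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def select_topics(relevance, k, similarity=jaccard, penalty=0.5):
    """Greedily pick up to k topics, trading relevance against redundancy.

    relevance: {topic: relevance of the topic to the query}
    At each step the topic with the largest
        relevance - penalty * (max similarity to an already selected topic)
    is chosen. This is a generic MMR-style rule, not the slide's formula.
    """
    selected = []
    remaining = set(relevance)

    def gain(topic):
        redundancy = max((similarity(topic, s) for s in selected), default=0.0)
        return relevance[topic] - penalty * redundancy

    while remaining and len(selected) < k:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    relevance = {
        "information retrieval": 0.9,
        "modern information retrieval": 0.85,
        "data mining": 0.6,
    }
    print(select_topics(relevance, k=2))
```

With `penalty=0` the rule reduces to plain top-K selection by relevance, which corresponds to the independence assumption of T1.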

16 Hybrid Model
- Aggregates the advantages of p_l and p_t
- Defined as: (formula not preserved in the transcript)
Section: III. Model 3: Hybrid model
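The hybrid definition itself is not preserved in the transcript. As a purely illustrative stand-in, the sketch below combines two candidate score tables by linear interpolation after normalizing each to sum to 1. The interpolation form, the weight `alpha`, and the function name are assumptions and should not be read as the authors' definition.

```python
def hybrid_scores(p_l, p_t, alpha=0.5):
    """Combine language-model scores p_l and topic-based scores p_t.

    Assumption: each score table is normalized to sum to 1, then mixed as
    alpha * p_l + (1 - alpha) * p_t. Candidates missing from one table get 0
    from that table.
    """
    def normalize(scores):
        total = sum(scores.values())
        return {k: v / total for k, v in scores.items()} if total else dict(scores)

    norm_l, norm_t = normalize(p_l), normalize(p_t)
    candidates = set(norm_l) | set(norm_t)
    return {
        ca: alpha * norm_l.get(ca, 0.0) + (1 - alpha) * norm_t.get(ca, 0.0)
        for ca in candidates
    }


if __name__ == "__main__":
    p_l = {"alice": 0.2, "bob": 0.2}
    p_t = {"alice": 0.1, "carol": 0.3}
    combined = hybrid_scores(p_l, p_t)
    for ca in sorted(combined, key=combined.get, reverse=True):
        print(ca, round(combined[ca], 3))
```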

17 Experiments
- DBLP collection
  - Limitation
    - No abstracts or index terms
    - Hard to represent the document
- Representation for documents
  - Use Google Scholar for data supplementation
    - Title as query; crawl the top 10 returned records
    - Up to 20 GB of metadata (HTML pages)
    - The citation count of the publication
Section: IV. Experiments

18 Topic Collection
- 2,498 well-defined topics from eventseer
- Crawl the top 100 returned records from Google Scholar

19 Benchmark Dataset
- A benchmark dataset with 7 topics and expert lists

20 Evaluation Metrics
- Precision at rank n (P@n)
- Mean Average Precision (MAP)
- Bpref: the score is a function of the number of non-relevant candidates
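The slide's formulas are not preserved, but the three metrics have standard definitions, sketched below. The bpref variant shown is the original formulation, in which each relevant candidate is penalized by the number of judged non-relevant candidates ranked above it, capped at R (the number of relevant candidates); evaluation tools differ slightly in the denominator, so treat this as one common variant rather than the exact version used in the experiments.

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked candidates that are relevant."""
    return sum(1 for c in ranking[:n] if c in relevant) / n


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant candidate."""
    hits, total = 0, 0.0
    for rank, candidate in enumerate(ranking, start=1):
        if candidate in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over queries; `runs` is a list of (ranking, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def bpref(ranking, relevant, nonrelevant):
    """bpref = (1/R) * sum over retrieved relevant r of
    (1 - min(#judged non-relevant ranked above r, R) / R).

    Unjudged candidates (in neither set) are ignored.
    """
    r_count = len(relevant)
    if r_count == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for candidate in ranking:
        if candidate in nonrelevant:
            nonrel_seen += 1
        elif candidate in relevant:
            score += 1.0 - min(nonrel_seen, r_count) / r_count
    return score / r_count


if __name__ == "__main__":
    ranking = ["a", "x", "b", "y", "c"]
    relevant, nonrelevant = {"a", "b", "c"}, {"x", "y"}
    print(precision_at_n(ranking, relevant, 3))
    print(average_precision(ranking, relevant))
    print(bpref(ranking, relevant, nonrelevant))
```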

21 Preliminary Experiments
- Performed on two corpora using the basic language model (B1)
  - “Title” corpus: uses only the title
  - “GS” corpus: uses the Google Scholar representation
- Evaluation results on two corpora (%)
  - Representing d using Google Scholar is more effective

22 Model 1: Statistical Language Models
- Evaluation results of language models
  - Weighted language models B3 and B2 outperform B1
  - It is important to consider the prior probability

23 Model 2: Topic-based Models
- Vary the number of topics (K) from 5 to 100
- Results using different values of K
  - The number of topics is cut off automatically for T2 & T3

24 Model 2: Topic-based Models
- Comparison of the three topic-based models

25 Model 3: Hybrid Models
- Evaluation results of the hybrid model
  - The hybrid model outperforms the pure language model and the topic-based model on most of the metrics

26 Conclusions and Future Work
- Conclusions
  - Address the expert finding task in a real-world academic field
  - Propose a weighted language model
  - Investigate a topic-based model to interpret the expert finding task
  - Integrate the language model with the topic-based model
  - Demonstrate that the hybrid model achieves the best performance in the evaluation results
- Future work
  - Take into account other types of information
  - Refine the results by utilizing social network analysis

27 Q&A
Thanks!

28 Comparison to Other Systems
- Evaluation results of our language models and the method TS

29 Example results
