
1 The Fudan-UIUC participation in the BioASQ Challenge Task2a: The Antinomyra System
Ke Liu¹, Junqiu Wu², Shengwen Peng¹, Chengxiang Zhai³, Shanfeng Zhu¹
¹ Fudan University, ² Central South University, ³ University of Illinois at Urbana-Champaign

2 Outline
Introduction
Related Work
Our Methods
Experimental Result
Conclusion

3 Introduction: MeSH Terms
Each year, around 0.8 million biomedical documents are added to MEDLINE.
Reference: The NLM Indexing Initiative: Current Status and Role in Improving Access to Biomedical Information

4 MeSH is Important
Indexing all documents in MEDLINE
Indexing many books and collections in NLM
Improving retrieval performance by query expansion using MeSH
Improving clustering performance by integrating MeSH information [Zhu et al IP&M] [Zhe et al Bioinformatics] [Gu et al IEEE TSMCB]
Improving biomedical text mining performance

5 Automatic MeSH Annotation is a challenging problem
More than 26,000 MeSH headings organized in a hierarchical structure
Quickly approaching 1,000,000 articles indexed per year
~$9.40 to index an article

6 The number of distinct MeSH headings is large (almost 27,000)
MeSH heading frequencies in MEDLINE vary widely
The number of MeSH terms per document varies widely

7 BioASQ (Large Scale Biomedical Semantic Indexing Competition)
[Table of weekly test sets in Batch 3; the week numbers and document counts were not preserved]

8 Label Based Micro F1-measure (MiF)
L represents the label set and |L| the number of labels. Because the counts are pooled over all labels, frequent labels are weighted more heavily in the evaluation.
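The formula on this slide did not survive in the transcript. A minimal sketch of the standard micro-averaged definition, which pools true positives, false positives, and false negatives over all labels in L before computing precision (MiP), recall (MiR), and F1 (MiF). The function name and input layout are assumptions made for illustration, not part of the Antinomyra system.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Return (MiP, MiR, MiF) for parallel sequences of gold and predicted label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # predicted and correct
        fp += len(p - g)  # predicted but not in the gold annotation
        fn += len(g - p)  # gold labels that were missed
    mip = tp / (tp + fp) if tp + fp else 0.0
    mir = tp / (tp + fn) if tp + fn else 0.0
    mif = 2 * mip * mir / (mip + mir) if mip + mir else 0.0
    return mip, mir, mif
```

Summing the counts document by document gives the same totals as summing them label by label, so a label that occurs in many documents contributes many counts and dominates the score.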

9 Batch 3, week 5 (4533 docs)
We achieved around 10% improvement over the current NLM MTI solution (result of June 2014).
[Results chart comparing Fudan University with the NLM current solution; values not preserved]

10 Outline
Introduction
Related Work
Our Methods
Experimental Result
Conclusion

11 NLM approach: MTI
Two sources:
MetaMap Indexing: maps text to UMLS concepts, restricted to MeSH
PubMed Related Citations
reference:
Advanced machine learning algorithms are not utilized.

12 MetaLabeler (Tsoumakas et al. 2013)
First, a binary classification model is trained for each MeSH heading using a linear SVM.
Second, a regression model is trained to predict the number of MeSH headings for each citation.
Finally, given a target citation, the MeSH headings are ranked by the SVM prediction score of their classifiers, and the top K are returned as the suggested headings, where K is the number of headings predicted by the regression model.
Problem: it uses only global information, and the scores from different classifiers are not comparable.
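The final MetaLabeler step can be sketched as below, assuming the per-heading classifier scores and the predicted count K have already been computed. The function name and the dictionary layout are hypothetical; ties are broken alphabetically only to keep the output deterministic.

```python
from typing import Dict, List


def top_k_headings(scores: Dict[str, float], k: int) -> List[str]:
    """Rank headings by classifier score (highest first) and keep the first k."""
    ranked = sorted(scores, key=lambda heading: (-scores[heading], heading))
    return ranked[:max(k, 0)]
```

The sort compares raw scores across different classifiers directly, which is exactly the weakness the slide points out.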

13 NCBI’s learning to rank (LTR) (Huang et al., 2011; Mao et al., 2013)
Each citation is treated as a query and each MeSH heading as a document.
An LTR method ranks the candidate MeSH headings with respect to the target citation.
The candidate MeSH headings come from similar citations (nearest neighbors).
Problem: it uses only local information, and similar citations might be rare.
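A minimal sketch of candidate generation from nearest neighbors, assuming each neighbor is given simply as the set of MeSH headings it was annotated with. The function name and the frequency-based ordering are illustrative assumptions, not the NCBI implementation.

```python
from collections import Counter
from typing import List, Set


def candidate_headings(neighbor_annotations: List[Set[str]]) -> List[str]:
    """Pool the headings of all neighbors, most frequent first (ties alphabetical)."""
    counts = Counter(h for annotation in neighbor_annotations for h in annotation)
    return [h for h, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
```

A heading that none of the neighbors carries never enters the candidate list, so it cannot be ranked at all.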

14 Outline
Introduction
Related Work
Our Methods
Experimental Result
Conclusion

15 Our solution: Learning to Rank (LTR) Framework
* Retrieve similar documents
* Obtain an initial list of main headings
* Generate features of main headings
* Rank the main headings
[Diagram labels: Target Doc, Initial List (MH-0 ... MH-n), Features, Logistic Regression, PRA, Ranking model, LambdaMart, Ranked List (MH-0 ... MH-m), Evaluation]

16 Main idea: various information (features) integrated in the Learning to Rank (LTR) framework
Given a target document, for each candidate MeSH heading we obtain prediction scores from several sources:
(1) Logistic Regression (global information)
(2) KNN (local information)
(3) Pattern matching
(4) MTI result (KNN + pattern + rule)
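One way to picture the integration is a fixed-length feature vector per (document, candidate heading) pair, with one entry per source listed above. This is a sketch under assumed inputs; the function name, argument names, and the choice of 0/1 flags for pattern matching and MTI are illustrative and not taken from the slides.

```python
from typing import Dict, List, Set


def feature_vector(
    heading: str,
    lr_scores: Dict[str, float],   # (1) per-heading logistic regression scores
    knn_scores: Dict[str, float],  # (2) per-heading scores from similar citations
    matched: Set[str],             # (3) headings found in the text by pattern matching
    mti_headings: Set[str],        # (4) headings recommended by MTI
) -> List[float]:
    """Build the ranking features for one candidate heading of one target document."""
    return [
        lr_scores.get(heading, 0.0),
        knn_scores.get(heading, 0.0),
        1.0 if heading in matched else 0.0,
        1.0 if heading in mti_headings else 0.0,
    ]
```

Vectors like this, one per candidate, are what a ranking model such as LambdaMART would consume, with the gold annotation of the document supplying the relevance labels during training.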

17 Logistic Regression
Train a binary logistic regression model for each label. In the end we have 25,000+ binary models.
Question: the prediction scores come from different classifiers. How can these scores be made comparable?

18 Key idea: we have a huge validation set, the whole of MEDLINE
Use the precision at prediction score K as the normalized score [Liu et al., In preparation]
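The slide gives only the idea, so the following is one possible reading rather than the authors' procedure: for a single label's classifier, replace a raw score s by the precision observed on validation data among all documents that the classifier scored at least s. The function names and the fallback value are assumptions.

```python
from bisect import bisect_left
from typing import Callable, Sequence, Tuple


def build_normalizer(validation: Sequence[Tuple[float, bool]]) -> Callable[[float], float]:
    """validation: (raw score, label truly assigned?) pairs for ONE classifier."""
    ordered = sorted(validation, key=lambda pair: pair[0])
    scores = [s for s, _ in ordered]
    # positives_from[i] = number of truly labelled documents among ordered[i:]
    positives_from = [0] * (len(ordered) + 1)
    for i in range(len(ordered) - 1, -1, -1):
        positives_from[i] = positives_from[i + 1] + (1 if ordered[i][1] else 0)

    def normalize(raw: float) -> float:
        i = bisect_left(scores, raw)  # index of the first validation score >= raw
        predicted = len(scores) - i   # documents that would be labelled at this threshold
        # No validation document scores this high: 0.0 is an arbitrary fallback.
        return positives_from[i] / predicted if predicted else 0.0

    return normalize
```

Each classifier gets its own normalizer, and the outputs all live on the same scale (an estimated precision), which is what makes scores from different classifiers comparable.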

19 Performance comparison of LR with default prediction scores and with our normalized scores
Method             MiP     MiR     MiF
Default scores     0.5576  0.5614  0.5595
Normalized scores  0.5734  0.5774  0.5754
[Liu et al., In preparation]

20 KNN
Given a target citation, we used NCBI efetch to find its similar (neighbor) citations.
For a candidate MeSH heading, we compute a score from the neighbors to represent its confidence.
Specifically, over the top 25 documents most similar to the target citation, we use a formula in which S_i is the similarity score of a document in the top 25 and S_k is the similarity score of a document that is in the top 25 and is also annotated with the candidate MeSH heading.
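The formula itself is not preserved in the transcript. A natural reading of the definitions of S_i and S_k, offered here as an assumption, is the similarity mass of the neighbors that carry the heading divided by the similarity mass of all top-25 neighbors. The function name and input layout are hypothetical.

```python
from typing import List, Set, Tuple


def knn_score(neighbors: List[Tuple[float, Set[str]]], heading: str, top_n: int = 25) -> float:
    """neighbors: (similarity score, MeSH annotation) pairs for the similar citations."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:top_n]
    total = sum(score for score, _ in top)  # sum of S_i
    with_heading = sum(score for score, annotation in top if heading in annotation)  # sum of S_k
    return with_heading / total if total else 0.0
```

Under this reading the score lies between 0 and 1 when similarities are non-negative, and it reaches 1 only when every one of the top neighbors is annotated with the heading.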

21 Pattern matching
Use direct string pattern matching to find each MeSH term, its synonyms, and its entry terms in the text.
MTI
Whether the candidate MeSH heading appears in the default results of MTI.
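A minimal sketch of direct string matching, assuming a dictionary that maps each MeSH heading to its name, synonyms, and entry terms. The case-insensitive, whole-word matching rule is an assumption; the slide does not say how the authors handled case or word boundaries.

```python
import re
from typing import Dict, Iterable, Set


def match_headings(text: str, terms: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the headings for which any listed name occurs in the text as a whole word or phrase."""
    found = set()
    for heading, names in terms.items():
        for name in names:
            # Lookarounds keep "Cancer" from matching inside "Cancerous".
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(heading)
                break
    return found
```

For example, with `terms = {"Neoplasms": ["Neoplasms", "Tumors", "Cancer"]}`, the text "Cancer incidence rose" yields the heading Neoplasms through its entry term.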

22 The number of MeSH Labels
A Support Vector Regression model predicts the number of labels from a number of features, such as:
Journal information
The number of labels in the nearest neighbors
The number of labels predicted by MTI
The number of labels predicted by MetaLabeler

23 Outline
Introduction
Related Work
Our Methods
Experimental Result
Conclusion

24 Evaluation & Experiment
Server: 4 × Intel Xeon CPUs (model and clock speed not preserved), 128 GB RAM.
Training of the LR classifiers took days.
All other training tasks took 1 day.
Annotating 10,000 citations took 2 hours.

25 Evaluation & Experiment

26 Outline
Introduction
Related Work
Our Methods
Experimental Result
Conclusion

27 Conclusion & Future Work
The superior performance of our method comes from integrating several kinds of information in the LTR framework: MTI, KNN, LR, and direct matching.
The large volume of MEDLINE data makes prediction score normalization possible, and the normalization improves performance significantly.
More information could be used, such as full text and indexing rules.
How can the gap between a good competition system and real applications be minimized?

28 Acknowledgement
Dr. Hongning Wang, UIUC
Mr. Mingjie Qian, UIUC
Mr. Jieyao Deng, Fudan
Mr. Tianyi Peng, Tsinghua

