KDD-2008 Anticipating Annotations and Emerging Trends in Biomedical Literature Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann.

Presentation transcript:

KDD-2008
Anticipating Annotations and Emerging Trends in Biomedical Literature
Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann: Integrated Data Systems, Siemens Corporate Research, Princeton, NJ
Markus Bundschus: Department of Computer Science, Ludwig-Maximilians-University

Page 2, 8/25/2008, Las Vegas, NV, KDD-2008

From data to knowledge

Data sources for biomedical literature research:
- PubMed: >15M abstracts
- UniProt: >200k proteins
- GeneOntology: >22k processes and functions
- MeSH: >23k medical terms, >170k synonyms
- FDA Clinical Trials: >50k reports
- …
- Proprietary data sources, such as patent information and news articles

The large volume and complexity of information require automation of important tasks:
- Detect biomedical concepts
- Detect topics
- Satisfy search queries
- Track historical trends
- Predict new trends

BioJournalMonitor

Query: PubMed and other sources by keyword, by MeSH concept, by date, …
Topics: group large search results into topics to facilitate analysis and drill-down.
Trends: show emerging trends related to the query.

- Named-entity recognition
- Annotation
- Trend analysis
- Clustering
- Ranking

Use cases:
- Screening of biomedical literature
- Early detection of biomarkers and technologies related to a disease
- Tracking relevance of biomarkers over time
- Prediction of research trends

KDD-2008 Annotation

Medical Subject Headings

MeSH annotation:
- Each document in PubMed is manually indexed with a set of MeSH terms
- Semi-automatic approaches assist indexers

Our approach:
- Model the generative process of document writing and document indexing
- Document writing: the author chooses relevant topics, then writes the paper based on the topic distribution
- Document indexing: the indexer reads the paper and extracts the hidden topic structure, then assigns index terms based on the topics

LDA Framework
- Represent a document as a mixture of topics, where each topic is expressed as a mixture of words
- Model the generation of a document d as a three-step process:
  1. Sample a distribution over topics θ
  2. Sample a topic z based on θ
  3. Sample a word w based on Φ, the word distribution specific to topic z

Notation:
- w: word chosen from a vocabulary of size N
- z: topic responsible for generating a word
- α, β: Dirichlet prior parameters
- θ, Φ: model parameters (to be learned)
- T: number of topics
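The three-step process above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_document`, the toy sizes (T = 3 topics, N = 8 vocabulary entries), and the prior values are all hypothetical, and words are represented as integer ids.

```python
import numpy as np

def generate_document(n_words, alpha, phi, rng):
    """Sample one document from the LDA generative process.

    phi:   array of shape (T, N); row t is the word distribution of topic t.
    alpha: symmetric Dirichlet prior parameter for the topic mixture.
    """
    n_topics, vocab_size = phi.shape
    # Step 1: sample the document's distribution over topics, theta.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    # Step 2: sample a topic z for every word position based on theta.
    z = rng.choice(n_topics, size=n_words, p=theta)
    # Step 3: sample each word w from the word distribution of its topic.
    w = np.array([rng.choice(vocab_size, p=phi[t]) for t in z])
    return theta, z, w

rng = np.random.default_rng(0)
T, N, beta = 3, 8, 0.5  # toy values, chosen only for illustration
# Each topic's word distribution is itself drawn from a Dirichlet(beta) prior.
phi = rng.dirichlet(np.full(N, beta), size=T)
theta, z, w = generate_document(20, alpha=0.5, phi=phi, rng=rng)
```

In the talk's setting θ and Φ are learned from data rather than sampled; the sketch only runs the model forward to show how the pieces fit together.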

Topic-Concept Model
- Given a set of annotated documents D = {(w_1, c_1), …, (w_D, c_D)}, simultaneously model the process of document writing and document indexing
- Use a hierarchical Bayesian framework to model this generative process
- For each of the M_d concepts in document d, draw a topic according to the topic assignments of each word
- θ, Φ, Γ provide information about topic, word, and concept distributions

Notation:
- w: word chosen from a vocabulary of size N
- c: concept chosen from a set of MeSH concepts
- z: topic responsible for generating a word
- z_tilde: topic responsible for generating a concept
- α, β, γ: Dirichlet prior parameters
- θ, Φ, Γ: model parameters (to be learned)
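One possible reading of the concept-generation step is sketched below: for each concept slot, the topic z_tilde is copied from a uniformly chosen word in the document, and the concept is then drawn from that topic's concept distribution. The uniform choice is an assumption about what "according to the topic assignments of each word" means; the function name `generate_concepts` and the toy sizes are hypothetical.

```python
import numpy as np

def generate_concepts(z, n_concepts, concept_dist, rng):
    """Draw concept ids for one document given its word-topic assignments z.

    concept_dist: array of shape (T, C); row t is the concept distribution
                  of topic t (the role played by Gamma on the slide).
    """
    n_concept_ids = concept_dist.shape[1]
    # z_tilde: for each concept slot, reuse the topic of a uniformly chosen word
    # (assumed interpretation of the slide).
    z_tilde = rng.choice(z, size=n_concepts)
    # c: draw each concept from the concept distribution of its topic.
    c = np.array([rng.choice(n_concept_ids, p=concept_dist[t]) for t in z_tilde])
    return z_tilde, c

rng = np.random.default_rng(1)
z = np.array([0, 0, 2, 2, 2])                     # topic assignments of 5 words
concept_dist = rng.dirichlet(np.full(4, 0.5), size=3)  # 3 topics, 4 toy concepts
z_tilde, c = generate_concepts(z, n_concepts=3, concept_dist=concept_dist, rng=rng)
```

Because z_tilde can only take values that already occur in z, concepts are tied to topics that actually generated words in the document.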

Learning the Topic-Concept Model
- Given a set of documents D = {(w_1, c_1), …, (w_D, c_D)}, infer θ, Φ, Γ for each document d
- Computing the posterior p(z | w) is intractable
- Approximate it by sampling from the joint p(z, w) using a Markov chain Monte Carlo approach
- Set α, β, γ constant
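The slide does not say which MCMC sampler was used. As a point of reference, and assuming a collapsed Gibbs sampler of the kind commonly used for LDA, the update for the topic of word position i would take the form below. The count notation is an assumption and does not come from the slides: n^{(w_i)}_{t,-i} is the number of times word w_i is assigned to topic t, n^{(·)}_{t,-i} is the total number of words assigned to topic t, and n^{(d_i)}_{t,-i} is the number of words in document d_i assigned to topic t, all excluding position i. N is the vocabulary size, and α, β are held constant.

```latex
p(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \frac{n^{(w_i)}_{t,-i} + \beta}{n^{(\cdot)}_{t,-i} + N\beta}
  \,\bigl(n^{(d_i)}_{t,-i} + \alpha\bigr)
```

The Topic-Concept Model would need an analogous update for z_tilde involving γ and the concept counts, which is not reconstructed here.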

Annotation Task
- Train the Topic-Concept Model to predict MeSH concepts of a previously unseen document d
- Use Bayes' rule to estimate the distribution over concepts given document d
- Estimate p(t | d) by re-sampling z based on the new document d
- The result is a ranked list of MeSH concepts
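Once p(t | d) has been estimated for a new document, the ranked list can be produced by marginalizing over topics, p(c | d) = Σ_t p(c | t) p(t | d). The sketch below assumes this marginalization is what the slide intends; the function name, the probabilities, and the three concept names are made up for illustration.

```python
import numpy as np

def rank_concepts(topic_given_doc, concept_given_topic, concept_names):
    """Rank concepts by p(c | d) = sum_t p(c | t) * p(t | d)."""
    p_c_given_d = topic_given_doc @ concept_given_topic
    order = np.argsort(-p_c_given_d)  # highest probability first
    return [(concept_names[i], float(p_c_given_d[i])) for i in order]

# Toy inputs: 2 topics, 3 concepts (all values hypothetical).
p_t_d = np.array([0.7, 0.3])
concept_dist = np.array([[0.6, 0.3, 0.1],
                         [0.1, 0.2, 0.7]])
names = ["Neoplasms", "Genetics", "Therapeutics"]
ranked = rank_concepts(p_t_d, concept_dist, names)
# ranked order: Neoplasms (0.45), Therapeutics (0.28), Genetics (0.27)
```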

Experiment
- Use 2 benchmark datasets provided by NLM
- Compare results with the NLM approach and Naïve Bayes (multi-label)
- Prune MeSH concepts to the top layer (109 MeSH concepts)
- Better overall performance compared to Naïve Bayes and NLM
- Reasons: modelling of dependency, exploiting word features of unindexed documents
- Also shows advantages as a descriptive model

Experiment (results figure; panels: Random 50K, Genetics)

KDD-2008 Emerging Trend Detection

Emerging Trend Detection

Problem:
- New MeSH terms are selected by experts; this can happen long after the term becomes important and widely used
- Early identification of potential MeSH terms would be very useful for technology scouting teams and biomedical researchers

Challenges:
- Automatically identify newly emerging important concepts
- Prepare a collection that can be used for evaluation of emerging trend detection methods
  - 1.5M PubMed abstracts from 01/1975 through 10/2007 with keywords: cancer, carcinoma, tumor, neopla, malignant
  - 81 interesting cancer-related MeSH terms introduced during this period

Collection Preparation
- PubMed abstracts from 01/1975 through 10/2007 with the following cancer-related keywords (substrings): cancer, carcinoma, tumor, neopla, malignant.
- About 1.5M documents were found and processed: word-level parsing, stop word removal, word stemming. The stop word list included some very common medical terms such as result, patient, study, method.
- The MeSH annotations of the abstracts were not used and no named-entity recognition was performed, to ensure that no information is used that would not have been available at the time the abstracts were published.

Figure: number of cancer-related documents per month in PubMed.
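The preparation steps above (substring filter, word-level parsing, stop word removal, stemming) can be sketched as follows. This is only an illustration: the slides do not name the stemmer or give the full stop word list, so `crude_stem` is a deliberately simple stand-in (not a Porter stemmer), `STOP_WORDS` is a tiny hypothetical list that includes the four medical terms mentioned, and filtering after stemming is an assumed order of operations.

```python
import re

KEYWORDS = ("cancer", "carcinoma", "tumor", "neopla", "malignant")
# Hypothetical stop word list; only the last four entries come from the slide.
STOP_WORDS = {"the", "of", "in", "and", "a", "with",
              "result", "patient", "study", "method"}

def is_cancer_related(abstract):
    """Substring match against the keyword list, case-insensitive."""
    text = abstract.lower()
    return any(keyword in text for keyword in KEYWORDS)

def crude_stem(token):
    """Placeholder stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(abstract):
    """Word-level parsing, stemming, then stop word removal on the stems."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    stems = [crude_stem(token) for token in tokens]
    return [stem for stem in stems if stem not in STOP_WORDS]

example = "Patients with malignant tumors: results of screening methods"
stems = preprocess(example)  # ['malignant', 'tumor', 'screen']
```

Matching on substrings is what lets the truncated keyword neopla cover neoplasm, neoplasia, and neoplastic with one entry.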

Positive Examples
- 22,169 MeSH terms were observed in at least one of the cancer-related documents.
- Kept terms listed in a tree that has one of the cancer keywords in its path name, leaving 759 relevant terms in the relevant trees.
- Removed terms with early or suspect creation dates.
- Kept terms in one of the top-level trees listed in the table on the right.
- Out of the remaining terms, only 81 match stems occurring in the abstracts.

Representation and Scoring

Representation:
- Term frequency in a sliding 12-month window
- Divided by the total number of documents in that period

Scoring function (better than the one in the paper!):
- Consider a 24-month period ending with the current month t
- Count the number of times the normalized frequency f reaches a new maximum in that period

Excluded terms that:
- have not yet occurred (impossible)
- have already been added to MeSH (truth known)
- are added to MeSH within the next year (too late)

- TP: terms that will be added after at least 1 year
- FP: terms that will never be added to MeSH
- FN: terms that are added to MeSH at time [t+1y, t+5y]
- TN: all other terms

Experimental setup:
- 140K word stems, 81 true positives
- Top-ranked 300 terms per month
- Time horizon: 1 year / 5 years
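The scoring formula itself did not survive in the transcript, so the sketch below is one plausible reading of the description rather than the authors' definition. It assumes monthly counts as input and interprets "a new maximum" as a value of f that exceeds every earlier value in the series; the function names and the toy numbers are hypothetical.

```python
def normalized_frequency(term_counts, doc_counts, window=12):
    """f[t] = term occurrences in the last `window` months / documents in those months."""
    f = []
    for t in range(len(term_counts)):
        lo = max(0, t - window + 1)
        docs = sum(doc_counts[lo:t + 1])
        f.append(sum(term_counts[lo:t + 1]) / docs if docs else 0.0)
    return f

def new_maximum_score(f, t, period=24):
    """Count months in the `period` ending at t where f exceeds all earlier values."""
    start = max(1, t - period + 1)
    running_max = max(f[:start])  # best value seen before the period begins
    score = 0
    for month in range(start, t + 1):
        if f[month] > running_max:
            score += 1
            running_max = f[month]
    return score

# Toy series with a 2-month window and a 4-month scoring period.
term_counts = [0, 2, 2, 1, 1, 6]
doc_counts = [10] * 6
f = normalized_frequency(term_counts, doc_counts, window=2)
score = new_maximum_score(f, t=5, period=4)  # new maxima at months 2 and 5 -> 2
```

Ranking all candidate stems by this score each month and keeping the top 300 would then correspond to the experimental setup listed above.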

Results
- Figure: time difference between inclusion in MeSH and earliest detection in the top-ranked terms.
- … out of 81 positive terms are detected.
- Is top 300 too much? 300 * 12 months * 25 years = 90,000 terms. However, only 6,290 unique terms occurred in the top 300 in this period.
- Figure: addition of new MeSH terms describing cancer-related biomarkers.
- Since the 1st term is added in 1985, it only makes sense to start evaluation in 1980 (given our horizon parameters).

Precision and Recall Measures
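The figure for this slide is not in the transcript. For reference, precision and recall follow from the TP, FP, and FN counts in the usual way; the sketch below uses the standard definitions, and the counts in the example are invented for illustration and are not results from the talk.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical month: 300 terms flagged, 3 of them true positives, 1 missed.
precision, recall = precision_recall(tp=3, fp=297, fn=1)
# precision = 0.01, recall = 0.75
```

With only 81 positives among 140K stems and 300 terms flagged per month, precision is bounded to be low by construction, which is why recall and detection lead time are the more informative measures here.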

BioJournalMonitor
- Trends of concepts and topics
- Group abstracts into topics
- Summarize topics with keywords

Conclusion
- Described the BioJournalMonitor system for automated analysis of biomedical literature and other data sources.
- Discussed in detail:
  - automated categorization of articles using LDA models
  - detection of important emerging trends

Future work
- Extend the LDA approach to cover the entire MeSH hierarchy
- Examine supervised approaches for identifying emerging trends, and evaluate on different data, e.g., heart disease instead of cancer-related biomarkers