KDD-2008 Anticipating Annotations and Emerging Trends in Biomedical Literature Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann.

Fabian Mörchen, Mathäus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann
Integrated Data Systems, Siemens Corporate Research, Princeton, NJ
Markus Bundschus
Department of Computer Science, Ludwig-Maximilians-University

/25/2008 Las Vegas, NV KDD-2008
From data to knowledge
Data sources for biomedical literature research:
- PubMed: >15M abstracts from 1975-2007.
- UniProt: >200k proteins.
- GeneOntology: >22k processes and functions.
- MeSH: >23k medical terms, >170k synonyms.
- FDA Clinical Trials: >50k reports
- …
- Proprietary data sources, such as patent information and news articles
The large volume and complexity of information require automation of important tasks:
- Detect biomedical concepts
- Detect topics
- Satisfy search queries
- Track historical trends
- Predict new trends

Bio Journal Monitor
Query (PubMed and other sources):
- By keyword
- By MeSH concept
- By date
- …
BioJournalMonitor:
- Named-entity recognition
- Annotation
- Trend analysis
- Clustering
- Ranking
Topics: Group large search result into topics to facilitate analysis and drill down.
Trends: Show emerging trends related to query.
Use cases:
- Screening of biomedical literature
- Early detection of biomarkers and technologies related to a disease
- Tracking relevance of biomarkers over time
- Prediction of research trends

KDD-2008 Annotation

Medical Subject Headings
MeSH annotation
- Each document in PubMed is manually indexed with a set of MeSH terms
- Semi-automatic approaches assist indexers
Our approach
- Model the generative process of document writing and document indexing
  - Author chooses relevant topics
  - Based on topic distribution author writes the paper
  - Indexer reads the paper and extracts hidden topic structure
  - Indexer assigns index terms based on topics
[Figure panels: Document writing, Document indexing]

LDA Framework
- Represent a document as a mixture of topics, where each topic is expressed as a mixture of words
- Model the generation of a document d as a three-step process
  1. Sample distribution over topics θ
  2. Sample a topic z based on θ
  3. Sample a word w based on Φ, the word distribution specific to topic z
Notation:
- w: word chosen from a vocabulary of size N
- z: topic responsible for generating a word
- α, β: Dirichlet prior parameters
- θ, Φ: model parameters (to be learned)
- T: number of topics
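
The three-step process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names (`sample_dirichlet`, `generate_document`), the toy sizes, and the prior values are all hypothetical, and the Dirichlet draw is done with normalized Gamma variates from the standard library.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(phi, alpha, num_words, rng):
    """Generate one document with the three steps listed on the slide."""
    num_topics = len(phi)
    vocab_size = len(phi[0])
    # Step 1: sample the document's distribution over topics, theta.
    theta = sample_dirichlet(alpha, num_topics, rng)
    topics, words = [], []
    for _ in range(num_words):
        # Step 2: sample a topic z from theta.
        z = rng.choices(range(num_topics), weights=theta)[0]
        # Step 3: sample a word w from phi[z], the word distribution of topic z.
        w = rng.choices(range(vocab_size), weights=phi[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
T, N = 3, 8             # toy sizes (assumed): 3 topics, vocabulary of 8 word ids
phi = [sample_dirichlet(0.5, N, rng) for _ in range(T)]  # beta = 0.5 (assumed)
theta, topics, words = generate_document(phi, alpha=0.5, num_words=20, rng=rng)
print(words)
```

Words are represented as integer ids here; a real vocabulary would map these ids back to word stems.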

Topic-Concept Model
- Given a set of annotated documents D = {(w_1, c_1), …, (w_D, c_D)}, simultaneously model the process of document writing and document indexing
- Use hierarchical Bayesian framework to model this generative process
- For each of the M_d concepts in document d draw a topic according to the topic assignments of each word
- θ, Φ, Γ provide information about topic, word, and concept distributions
Notation:
- w: word chosen from a vocabulary of size N
- c: concept chosen from a set of MeSH concepts
- z: topic responsible for generating a word
- z_tilde: topic responsible for generating a concept
- α, β, γ: Dirichlet prior parameters
- θ, Φ, Γ: model parameters (to be learned)
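
A minimal sketch of the concept-generation step, assuming one particular reading of the slide: each concept's topic z_tilde is drawn uniformly from the topics already assigned to the document's words, and the concept is then drawn from that topic's concept distribution Γ. The function name `generate_concepts`, the toy matrix `gamma_dist`, and the example assignments are hypothetical.

```python
import random


def generate_concepts(word_topics, gamma_dist, num_concepts, rng):
    """Draw M_d concepts for one document, given its word-topic assignments.

    word_topics: list of topic ids z, one per word in the document.
    gamma_dist:  gamma_dist[t][c] = probability of concept c under topic t.
    """
    concept_topics, concepts = [], []
    for _ in range(num_concepts):
        # Assumption: z_tilde is picked uniformly among the word-level topics,
        # so topics used by many words are more likely to produce a concept.
        z_tilde = rng.choice(word_topics)
        num_choices = len(gamma_dist[z_tilde])
        c = rng.choices(range(num_choices), weights=gamma_dist[z_tilde])[0]
        concept_topics.append(z_tilde)
        concepts.append(c)
    return concept_topics, concepts


rng = random.Random(1)
word_topics = [0, 0, 0, 1]            # toy document: three words from topic 0, one from topic 1
gamma_dist = [[1.0, 0.0], [0.0, 1.0]]  # toy case: topic t always emits concept t
concept_topics, concepts = generate_concepts(word_topics, gamma_dist, 5, rng)
print(concept_topics, concepts)
```

With the deterministic toy Γ above, every drawn concept id equals its topic id, which makes the coupling between word topics and concepts easy to see.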

Learning the Topic-Concept Model
- Given a set of documents D = {(w_1, c_1), …, (w_D, c_D)} infer θ, Φ, Γ for each document d
- Computing the posterior p(z | w) is intractable
- Approximation by sampling from joint p(z, w) using Markov chain Monte Carlo approach
- Set α, β, γ constant
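
The slide does not say which Markov chain Monte Carlo scheme was used. As an illustration only, the sketch below shows collapsed Gibbs sampling, a common MCMC choice for plain LDA, with fixed α and β. It covers words only (θ and Φ) and omits the concept side (Γ); the function name `gibbs_lda` and the toy corpus are hypothetical.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iterations, rng):
    """Collapsed Gibbs sampling for plain LDA over documents of word ids."""
    doc_topic = [[0] * num_topics for _ in docs]              # n(d, t)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(t, w)
    topic_total = [0] * num_topics                            # n(t)
    assignments = []
    # Random initialization of the topic assignment of every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Point estimates of theta and phi from the final counts.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi, assignments


docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4]]  # toy corpus of word ids
theta, phi, assignments = gibbs_lda(
    docs, num_topics=2, vocab_size=6, alpha=0.5, beta=0.1,
    iterations=50, rng=random.Random(0),
)
print(theta)
```

Extending this to the Topic-Concept Model would add a second set of counts for concepts and a sampling step for z_tilde; that part is left out here because the slides do not give the update equations.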

Annotation Task
- Train Topic-Concept Model to predict MeSH concepts of a previously unseen document d
- Use Bayes rule to estimate the distribution over concepts given document d
- Estimate p(t | d) by re-sampling z based on new document d
- Result is a ranked list of MeSH concepts
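
One way to read the ranking step is that the concept distribution for a new document is obtained by summing over topics, p(c | d) = Σ_t p(c | t) p(t | d), and concepts are then sorted by that score. The sketch below assumes this reading; the function name `rank_concepts`, the placeholder concept names, and all probabilities are made up for illustration.

```python
def rank_concepts(concept_given_topic, topic_given_doc, concept_names):
    """Rank concepts by p(c | d) = sum over t of p(c | t) * p(t | d)."""
    scores = []
    for c, name in enumerate(concept_names):
        score = sum(
            concept_given_topic[t][c] * topic_given_doc[t]
            for t in range(len(topic_given_doc))
        )
        scores.append((name, score))
    # Highest probability first: this is the ranked list of concepts.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy values (assumed): 2 topics, 3 placeholder concepts.
concept_given_topic = [
    [0.7, 0.2, 0.1],  # p(c | t = 0)
    [0.1, 0.3, 0.6],  # p(c | t = 1)
]
topic_given_doc = [0.25, 0.75]  # p(t | d), e.g. estimated by re-sampling z
ranked = rank_concepts(
    concept_given_topic, topic_given_doc,
    ["concept_A", "concept_B", "concept_C"],
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy case the document leans toward topic 1, so the concept favored by topic 1 (`concept_C`) ends up at the top of the list.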

Experiment
- Use 2 benchmark datasets provided by NLM
- Compare results with NLM approach and Naïve Bayes (multi-label)
- Prune MeSH concepts to top layer (109 MeSH concepts)
- Better overall performance compared to Naïve Bayes and NLM!
- Reasons: modelling of dependency, exploiting word features of unindexed documents
- Also shows advantages as descriptive model

Experiment
[Figure panels: Random 50K, Genetics]

KDD-2008 Emerging Trend Detection

Emerging Trend Detection
Problem
- New MeSH terms are selected by experts. This can happen long after the term becomes important and widely used!
- An early identification of potential MeSH terms would be very useful for technology scouting teams and biomedical researchers.
Challenges
- Automatically identify newly emerging important concepts
- Prepare a collection that can be used for evaluation of emerging trend detection methods
  - 1.5M PubMed abstracts from 01/1975 through 10/2007 with keywords: cancer, carcinoma, tumor, neopla, malignant.
  - 81 interesting cancer-related MeSH terms introduced during this period.

Collection Preparation
- PubMed abstracts from 01/1975 through 10/2007 with the following cancer-related keywords (substrings): cancer, carcinoma, tumor, neopla, malignant.
- About 1.5M documents were found and processed: word-level parsing, stop word removal, word stemming. The stop word list included some very common medical terms such as result, patient, study, method.
- The MeSH annotations of the abstracts were not used and no named-entity recognition was performed, to ensure that no information is used that would not have been available at the time the abstracts were published.
[Figure: Number of cancer-related documents per month in PubMed.]
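
The selection and parsing steps can be illustrated with a short sketch. It assumes case-insensitive substring matching on the five keywords from the slide and a simple regular-expression tokenizer; the general stop words in `STOP_WORDS` are placeholders (only result, patient, study, and method come from the slide), and word stemming is left out because it would need a stemmer that the slides do not name.

```python
import re

# Keyword substrings taken from the slide.
KEYWORDS = ("cancer", "carcinoma", "tumor", "neopla", "malignant")

# Assumed stop word list: a few general words plus the medical terms named on the slide.
STOP_WORDS = {
    "the", "of", "in", "a", "an", "and", "with", "was", "were",
    "result", "patient", "study", "method",
}


def is_cancer_related(abstract):
    """True if any keyword occurs as a substring of the lower-cased abstract."""
    text = abstract.lower()
    return any(keyword in text for keyword in KEYWORDS)


def tokenize(abstract):
    """Word-level parsing followed by stop word removal (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    return [token for token in tokens if token not in STOP_WORDS]


abstracts = [
    "Neoplastic lesions of the liver",
    "Cardiac arrhythmia in a study of athletes",
]
selected = [a for a in abstracts if is_cancer_related(a)]
print(selected)
print(tokenize("The study of tumor growth"))
```

Substring matching is what lets the truncated keyword "neopla" catch variants such as "neoplasm" and "neoplastic".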

Positive Examples
- 22,169 MeSH terms observed in at least one of the cancer-related documents.
- Kept terms listed in a tree that has one of the cancer keywords in the path name: 223 relevant trees with 759 relevant terms.
- Removed terms with early or suspect creation dates.
- Kept terms in one of the top level trees listed in the table on the right.
- Out of the remaining terms, only 81 match stems occurring in the abstracts.

Representation and Scoring
Representation:
- Term frequency in sliding 12 month window.
- Divide by the total number of documents in that period.
Scoring function (Better than the one in the paper!)
- Consider a 24 month period ending with the current month t
- Count the number of times normalized frequency f reaches a new maximum in that period
Excluded terms that:
- have not yet occurred (impossible)
- have already been added to MeSH (truth known)
- are added to MeSH within the next year (too late)
Evaluation:
- TP: terms that will be added after at least 1 year
- FP: terms that will never be added to MeSH
- FN: terms that are added to MeSH at time [t+1y, t+5y]
- TN: All other terms
Experimental setup
- 140K word stems, 81 true positives
- Top ranked 300 terms / month
- Time horizon 1 year / 5 years
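
The representation and the scoring function can be sketched as follows. The exact formula is not preserved in this transcript, so the sketch rests on an assumption: a month counts as a "new maximum" when the normalized frequency exceeds every earlier value in the available history, not just the values inside the 24 month period. The function names and the toy monthly counts are hypothetical.

```python
def normalized_frequency(term_counts, doc_counts, window=12):
    """Sliding-window term frequency divided by the documents in the same window.

    term_counts[m]: occurrences of the term in month m.
    doc_counts[m]:  number of documents published in month m.
    """
    freqs = []
    for t in range(len(term_counts)):
        start = max(0, t - window + 1)
        docs = sum(doc_counts[start:t + 1])
        hits = sum(term_counts[start:t + 1])
        freqs.append(hits / docs if docs else 0.0)
    return freqs


def new_maximum_score(freqs, t, period=24):
    """Count how often f reaches a new maximum in the `period` months ending at t."""
    start = max(0, t - period + 1)
    # Assumption: the running maximum starts from the history before the period.
    best = max(freqs[:start], default=0.0)
    score = 0
    for f in freqs[start:t + 1]:
        if f > best:
            best = f
            score += 1
    return score


# Toy series with shortened windows so the arithmetic is easy to follow.
doc_counts = [10, 10, 10, 10, 10, 10]
term_counts = [0, 1, 1, 3, 0, 5]
freqs = normalized_frequency(term_counts, doc_counts, window=2)
print(freqs)
print(new_maximum_score(freqs, t=5, period=4))
```

In a monthly ranking, a score like this would be computed for every word stem and the highest-scoring stems kept, for instance the top 300 mentioned in the experimental setup.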

Results
[Figure: Time difference between inclusion in MeSH and earliest detection in top 300.]
- 48 out of 81 positive terms are detected.
- Is top 300 too much? 300 * 12 months * 25 years = 90,000 terms. However, only 6,290 unique terms occurred in top 300 in this period.
[Figure: Addition of new MeSH terms describing cancer-related biomarkers.]
- Since the 1st term is added in 1985, it only makes sense to start evaluation in 1980 (given our horizon parameters).

Precision and Recall Measures

BioJournalMonitor
- Trends of concepts and topics
- Group abstracts into topics.
- Summarize topics with keywords

Conclusion
- Described BioJournalMonitor system for automated analysis of biomedical literature and other data sources.
- Discussed in detail:
  - automated categorization of articles using LDA models
  - detection of important emerging trends
Future Work
- Extend LDA approach to cover entire MeSH hierarchy
- Examine supervised approaches for identifying emerging trends, and evaluate on different data, e.g. heart disease instead of cancer-related biomarkers
