Biological literature mining

- Information retrieval (IR): retrieve papers relevant to specific keywords
- Entity recognition (ER): identify specific biological entities (e.g., genes) in papers
- Information extraction (IE): automatically pull specific facts out of papers

Example sentence
“Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation”
- Its context is the cell cycle of the yeast Saccharomyces cerevisiae
- It lets us demonstrate the powers and pitfalls of current literature-mining approaches

Information Retrieval: finding the papers
- Aim: identify text segments pertaining to a particular topic (here, “yeast cell cycle”)
- If the topic is a user-provided query, the task is ad hoc IR
- If the topic is given as a set of papers, the task is text categorization

Ad hoc IR
- PubMed is an example
- Supports the “Boolean model” as well as the “vector model”
- Boolean model: combines terms using logical operations (OR, AND)
- Vector model: we’ll see more of this later
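As a rough illustration of the Boolean model, the sketch below builds an inverted index over a toy collection and answers AND/OR queries with set operations. The documents and the helper names (`build_index`, `lookup`) are made up for this example; it is not how PubMed is implemented.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, term):
    """Documents containing a single term (empty set if unseen)."""
    return index.get(term.lower(), set())

# Toy collection (hypothetical abstracts, not real PubMed records).
docs = {
    1: "Cdc28 phosphorylates Swe1 in budding yeast",
    2: "Cell cycle control in yeast",
    3: "Cell cycle arrest in human fibroblasts",
}
index = build_index(docs)

# "yeast AND cycle" is a set intersection.
yeast_and_cycle = lookup(index, "yeast") & lookup(index, "cycle")
# "yeast OR human" is a set union.
yeast_or_human = lookup(index, "yeast") | lookup(index, "human")
```

Note that document 1 is about the yeast cell cycle but is missed by "yeast AND cycle" because it never uses the word "cycle".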

Ad hoc IR: tricks
- Lessons learned from regular IR also apply to biomedical literature
- Remove “stop words” such as the, it, etc.
- Truncate common word endings such as -ing, -s
- Use a thesaurus to automatically “expand” the query, e.g., “yeast AND cell cycle” => “(yeast OR Saccharomyces cerevisiae) AND cell cycle”
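The three tricks above can be sketched in a few lines. The stop-word list, the suffix rule and the one-entry thesaurus are assumptions chosen for illustration; real systems use much larger resources and a proper stemmer.

```python
import re

# Assumed, deliberately tiny resources.
STOP_WORDS = {"the", "it", "a", "an", "of", "in", "and", "are", "is"}
THESAURUS = {"yeast": ["saccharomyces cerevisiae"]}

def strip_suffix(word):
    """Crudely truncate the endings -ing and -s (not a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, truncate common endings."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]

def expand_query(terms):
    """AND the terms together, replacing each by an OR-group of its synonyms."""
    groups = []
    for term in terms:
        variants = [term] + THESAURUS.get(term, [])
        if len(variants) > 1:
            groups.append("(" + " OR ".join(variants) + ")")
        else:
            groups.append(term)
    return " AND ".join(groups)

tokens = preprocess("The cells are dividing in the yeast")
query = expand_query(["yeast", "cell cycle"])
```

Here `query` reproduces the expansion from the slide (in lowercase), and `tokens` shows that crude truncation yields stems such as "divid" rather than dictionary words.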

Ad hoc IR
“Even with these improvements, current ad hoc IR systems are not able to retrieve our example sentence when they are given the query ‘yeast cell cycle’. Instead, this could be achieved by realizing that ‘yeast’ is a synonym for S. cerevisiae, that ‘cell cycle’ is a Gene Ontology term, that the word ‘Cdc28’ refers to an S. cerevisiae protein and finally, by looking up the Gene Ontology terms that relate to Cdc28 to connect it to the yeast cell cycle.”

Entity recognition (ER)
- Goal: identify biological entities (e.g., genes, proteins) in text
- Two sub-goals:
  - recognize the words in text that represent these entities
  - uniquely identify these entities (the synonym problem)

ER goals
- In our example, Clb2, Cdc28, Cdk1, Swe1 and Cdc5 should be recognized as gene or protein names
- Additionally, they should be identified by their respective “Saccharomyces Genome Database” accession numbers
- Perhaps the most difficult task in biomedical text mining

ER approaches: rule based
- Manually built rules that look for typical features of names, e.g., names followed by numbers, the ending “-ase”, or words such as “gene” or “receptor” in proximity
- Automatically built rules, using machine learning techniques
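A minimal sketch of manually built rules, written as regular expressions. The three patterns are illustrative assumptions, not rules from any published system, and they are far from complete.

```python
import re

# Hand-written rules (assumed for illustration):
#   1. letters followed by digits, optionally a trailing "p" (Cdc28, Cdc28p)
#   2. words ending in "-ase" (kinase, phosphatase)
#   3. a capitalized word right before a cue word ("gene", "receptor")
RULES = [
    ("letters+digits", re.compile(r"\b[A-Z][A-Za-z]{1,5}-?\d+p?\b")),
    ("-ase ending", re.compile(r"\b[A-Za-z]+ase\b")),
    ("cue word", re.compile(r"\b[A-Z][A-Za-z0-9]+(?= (?:gene|receptor)\b)")),
]

def find_candidates(text):
    """Return (matched text, rule name) pairs, grouped by rule order."""
    hits = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((match.group(0), name))
    return hits

sentence = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
            "phosphorylated Swe1 and this modification served as a priming "
            "step to promote subsequent Cdc5-dependent Swe1 "
            "hyperphosphorylation and degradation")
names = {text for text, _ in find_candidates(sentence)}
```

On the example sentence the first rule picks up all five names. The "-ase" rule shows the usual weakness of such rules: it also fires on ordinary words such as "phase" or "case".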

ER approaches: dictionary based
- Comprehensive list of gene names and their synonyms
- Matching algorithms that allow variations in those names, e.g., ‘CDC28’, ‘Cdc28’, ‘Cdc28p’ or ‘cdc-28’
- Advantage: they can also associate the recognized entity with its unique identifier
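The idea of variation-tolerant dictionary matching can be sketched as follows. The identifiers are hypothetical placeholders rather than real Saccharomyces Genome Database accession numbers, and the normalization rules are assumptions chosen to cover the four spellings listed above.

```python
import re

# Toy dictionary: identifier -> known names and synonyms.
# The identifiers are hypothetical placeholders, not real SGD accessions.
DICTIONARY = {
    "GENE:0001": ["CDC28", "CDK1"],
    "GENE:0002": ["SWE1"],
}

def normalize(name):
    """Lowercase, drop hyphens and spaces, strip a trailing protein 'p'."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    # In yeast nomenclature 'Cdc28p' denotes the protein product of CDC28.
    if re.fullmatch(r"[a-z]+\d+p", key):
        key = key[:-1]
    return key

LOOKUP = {
    normalize(name): gene_id
    for gene_id, names in DICTIONARY.items()
    for name in names
}

def recognize(text):
    """Return (surface form, identifier) for each dictionary hit in text."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9-]+", text):
        gene_id = LOOKUP.get(normalize(token))
        if gene_id is not None:
            hits.append((token, gene_id))
    return hits

hits = recognize("CDC28, Cdc28, Cdc28p and cdc-28 all name one gene; Swe1 is another.")
```

All four spellings resolve to the same identifier. A limit of this token-level sketch is that hyphenated compounds such as "Cdc5-dependent" are treated as one token and would be missed.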

Why ER is difficult
- Each gene has several names and abbreviations, e.g., ‘Cdc28’ is also called ‘Cyclin-dependent kinase 1’ or ‘Cdk1’
- Gene names may also be:
  - common English words, e.g., hairy
  - biological terms, e.g., SDS
  - names of other genes, e.g., ‘Cdc2’ refers to two different genes in budding yeast and in fission yeast

Information Extraction (IE)
- IR extracts texts on particular topics
- IE extracts facts about relationships between biological entities
- E.g., from the example sentence, deduce that:
  - Cdc28 binds Clb2
  - Swe1 is phosphorylated by the Cdc28–Clb2 complex
  - Cdc5 is involved in Swe1 phosphorylation

IE approaches: co-occurrence
- Identify entities that co-occur in a sentence, abstract, etc.
- Two co-occurring entities may be unrelated, but if they co-occur repeatedly they are likely related
- Therefore, some statistical analysis is used
- Finds related entities, but not necessarily the type of relationship
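A small sketch of sentence-level co-occurrence counting. The slides do not say which statistic is used; pointwise mutual information (PMI) is an assumption here, picked as one common choice. Entity spotting is done by plain substring search, which is also a simplification.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, entities):
    """Count sentences mentioning each entity and each unordered pair."""
    single, pair = Counter(), Counter()
    for sentence in sentences:
        present = sorted(e for e in entities if e in sentence)
        single.update(present)
        pair.update(combinations(present, 2))
    return single, pair

def pmi(a, b, single, pair, n_sentences):
    """Pointwise mutual information; higher means more strongly associated."""
    key = tuple(sorted((a, b)))
    if pair[key] == 0:
        return float("-inf")
    p_ab = pair[key] / n_sentences
    p_a = single[a] / n_sentences
    p_b = single[b] / n_sentences
    return math.log2(p_ab / (p_a * p_b))

# Toy sentences, invented for the example.
sentences = [
    "Cdc28 binds Clb2",
    "Clb2 activates Cdc28",
    "Swe1 is degraded",
    "Cdc5 acts on Swe1",
]
entities = ["Cdc28", "Clb2", "Swe1", "Cdc5"]
single, pair = cooccurrence_counts(sentences, entities)
score = pmi("Cdc28", "Clb2", single, pair, len(sentences))
```

The score says that Cdc28 and Clb2 are associated, but nothing about whether one binds, activates or phosphorylates the other.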

IE approaches: Natural Language Processing (NLP)
- Tokenize text and identify word and sentence boundaries
- Assign a part-of-speech tag (e.g., noun/verb) to each word
- Build a syntax tree for each sentence, delineating noun phrases and their interrelationships
- Use ER to assign semantic tags to biological entities (e.g., gene/protein names)
- Apply rules to the syntax tree and semantic labels to extract relationships between entities
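The sketch below is a heavily simplified stand-in for this pipeline: it tokenizes, attaches coarse labels from two assumed word lists (instead of a real part-of-speech tagger and ER system), and applies one flat pattern rule over the token sequence instead of rules over a syntax tree. It only shows the shape of rule-based relation extraction.

```python
import re

ENTITIES = {"Cdc28", "Clb2", "Swe1", "Cdc5"}               # would come from ER
RELATION_VERBS = {"phosphorylated", "binds", "degrades"}   # assumed verb list

def tag(sentence):
    """Tokenize and attach a coarse label to each token."""
    tagged = []
    for token in re.findall(r"[A-Za-z0-9]+", sentence):
        if token in ENTITIES:
            label = "ENTITY"
        elif token in RELATION_VERBS:
            label = "VERB"
        else:
            label = "OTHER"
        tagged.append((token, label))
    return tagged

def extract_relations(sentence, max_gap=2):
    """Rule: ENTITY, a VERB at most max_gap tokens later, then an ENTITY
    at most max_gap tokens after the verb."""
    tagged = tag(sentence)
    relations = []
    for i, (subject, label) in enumerate(tagged):
        if label != "ENTITY":
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
            if tagged[j][1] != "VERB":
                continue
            for k in range(j + 1, min(j + 2 + max_gap, len(tagged))):
                if tagged[k][1] == "ENTITY":
                    relations.append((subject, tagged[j][0], tagged[k][0]))
                    break
            break  # only consider the first verb after the subject
    return relations

relations = extract_relations("Cdc28 directly phosphorylated Swe1")
```

A flat window like this is brittle: in the full example sentence the parenthetical "(Cdk1 homolog)" pushes the verb outside the default window, which is one reason real systems work on syntax trees.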

Summary
- Information retrieval: getting the texts
- Entity recognition: identifying genes, proteins, etc.
- Information extraction: recovering reported relationships between entities

Automatically Generating Gene Summaries from Biomedical Literature
(Ling et al. PSB 2006) CS 466

Outline
- Motivation
- System
  - Keyword Retrieval Module
  - Information Extraction Module
- Experiments and Evaluations
- Conclusion and Future Work

Motivation
- Finding all the information we know about a gene from the literature is a critical task in biology research
- Reading all the relevant articles about a gene is time-consuming
- A summary of what we know about a gene would help biologists access the already-discovered knowledge

An Ideal Gene Summary
(Figure: example gene summary with passages marked GP, EL, SI, GI, MP, WFPI. The summary shown is from ca. 2006.)

Problem with Manual Procedure
- Labor-intensive
- Hard to keep up to date with the rapid growth of the literature
- How can we generate such summaries automatically?

The solution
- Structured summary on 6 aspects:
  - Gene products (GP)
  - Expression location (EL)
  - Sequence information (SI)
  - Wild-type function and phenotypic information (WFPI)
  - Mutant phenotype (MP)
  - Genetic interaction (GI)
- 2-stage summarization:
  - Retrieve relevant articles by keyword match
  - Extract the most informative and relevant sentences for the 6 aspects


System Overview: 2-stage
IE = Information Extraction; KR = Keyword Retrieval

Keyword Retrieval Module (IR)
- Dictionary-based keyword retrieval: retrieve all documents containing any synonym of the target gene
- Input: gene name
- Output: relevant documents for that gene
- Two steps: gene SynSet construction, then keyword-based retrieval

KR module

Gene SynSet Construction & Keyword Retrieval
- Gene SynSet: a set of synonyms of the target gene
- Issues in constructing the SynSet:
  - Variation in gene name spelling: for the gene cAMP dependent protein kinase 2, the variants PKA C2, Pka C2, Pka-C2, … are normalized to “pka c 2”
  - Short names are sometimes ambiguous, e.g., the gene name “PKA” is also a chemical term, so each retrieved document is required to contain at least one synonym that is >= 5 characters long
- Retrieving documents based on keywords: enforce an exact match of the token sequence
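A minimal sketch of the normalization, the length filter and the exact token-sequence match described above. The function names are invented for this example, and counting the 5 characters on the raw synonym string is an assumption; the slides do not say how length is measured.

```python
import re

def normalize(name):
    """Lowercase, turn punctuation into spaces, split letters from digits."""
    s = re.sub(r"[^a-z0-9]+", " ", name.lower())
    s = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", s)
    return " ".join(s.split())

def contains_sequence(doc_tokens, syn_tokens):
    """True if syn_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(syn_tokens)
    return any(
        doc_tokens[i:i + n] == syn_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

def is_relevant(document, synset, min_len=5):
    """Keep a document only if it matches a sufficiently long synonym."""
    doc_tokens = normalize(document).split()
    for synonym in synset:
        # Assumption: length is counted on the raw synonym string.
        if len(synonym) < min_len:
            continue
        if contains_sequence(doc_tokens, normalize(synonym).split()):
            return True
    return False

synset = ["PKA", "PKA C2", "Pka-C2"]
keep = is_relevant("Expression of Pka-C2 in embryos", synset)
drop = is_relevant("PKA was measured by titration", synset)
```

The second document mentions only the short, ambiguous name "PKA", so it is not retrieved.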

Information Extraction Module
- Takes the set of documents returned by the KR module and extracts sentences that contain useful factual information about the target gene
- Input: relevant documents
- Output: gene summary
- Two steps: training data generation, then sentence extraction

IE module

Training Data Generation
- Construct a training data set consisting of “typical” sentences for describing a category (e.g., sequence information)
- Training data is not about the gene to be summarized; it is about a “type” of information in general
- These sentences come from a manually curated database, e.g., FlyBase has separate sections for each category

Sentence Extraction
- Extract sentences from the documents related to our gene
- Then try to identify key sentences talking about a certain aspect of the gene (“category”)
- In determining the importance of a sentence, consider 3 factors:
  - Relevance to the specified category (aspect)
  - Relevance to its source document
  - Sentence location in its source abstract

Scoring strategies: category relevance score ($S_c$)
- Uses the “vector space model”
- Construct a “category term vector” $V_c$ for each category $c$
- The weight of term $t_i$ in the vector for category $j$ is $w_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_i$
- $\mathrm{TF}_{ij}$ is the frequency of $t_i$ in all training sentences of category $j$
- $\mathrm{IDF}_i$ is the “inverse document frequency”: $\mathrm{IDF}_i = 1 + \log(N / n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing $t_i$
- TF measures how relevant the term is; IDF measures how rare it is
- Similarly, construct a vector $V_s$ for each sentence $s$
- Category relevance score: $S_c = \cos(V_c, V_s)$
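The TF*IDF weighting and cosine score can be sketched with sparse dictionaries. The toy documents and category sentences are invented; giving a term that occurs in no document an IDF of 0 is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF = 1 + log(N / n_i), n_i = number of documents containing term."""
    n_i = sum(1 for doc in documents if term in doc)
    if n_i == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return 1 + math.log(len(documents) / n_i)

def tfidf_vector(tokens, documents):
    """Sparse TF*IDF vector (term -> weight) for a bag of tokens."""
    tf = Counter(tokens)
    return {term: count * idf(term, documents) for term, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy document collection, each document given as a set of tokens.
documents = [
    {"gene", "expressed", "embryo"},
    {"gene", "sequence", "exon"},
    {"mutant", "phenotype", "gene"},
]
# Tokens pooled from hypothetical training sentences of one category (EL).
v_c = tfidf_vector(["expressed", "embryo", "expressed", "larva"], documents)
v_a = tfidf_vector(["gene", "expressed", "embryo"], documents)
v_b = tfidf_vector(["gene", "sequence", "exon"], documents)
s_c_a = cosine(v_c, v_a)
s_c_b = cosine(v_c, v_b)
```

Sentence A shares terms with the expression-location category and gets a positive score; sentence B shares none and scores zero.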

Scoring strategies: document relevance score ($S_d$) and location score ($S_l$)
- Document relevance score ($S_d$): the sentence should also be related to its source document. Build a vector $V_d$ for each document; $S_d = \cos(V_d, V_s)$
- Location score ($S_l$):
  - News: early sentences are more useful for summarization
  - Scientific literature: the last sentence of the abstract
  - $S_l = 1$ for the last sentence of an abstract, 0 otherwise
- Sentence ranking: $S = 0.5 S_c + 0.3 S_d + 0.2 S_l$
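As a quick worked example of the ranking formula, the sketch below combines the three scores for a three-sentence abstract. The $S_c$ and $S_d$ values are hypothetical numbers, not results from the paper.

```python
def location_score(index, n_sentences):
    """S_l = 1 for the last sentence of an abstract, 0 otherwise."""
    return 1.0 if index == n_sentences - 1 else 0.0

def combined_score(s_c, s_d, s_l):
    """S = 0.5*S_c + 0.3*S_d + 0.2*S_l, with the weights given above."""
    return 0.5 * s_c + 0.3 * s_d + 0.2 * s_l

# Hypothetical (S_c, S_d) pairs for the sentences of one abstract, in order.
scores = [(0.60, 0.40), (0.70, 0.50), (0.50, 0.60)]

# Sentence indices sorted from highest to lowest combined score.
ranked = sorted(
    range(len(scores)),
    key=lambda i: combined_score(*scores[i], location_score(i, len(scores))),
    reverse=True,
)
```

The combined scores are 0.42, 0.50 and 0.63, so the last sentence wins despite having the lowest category relevance; the location bonus makes the difference.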

Summary generation
- Keep only the 2 top-ranked categories for each sentence
- Generate a paragraph-long summary by combining the top sentence of each category
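One way to read these two steps is sketched below. The input layout (a score per sentence and category) and the choice not to repeat a sentence that wins several categories are assumptions; the slides do not specify either.

```python
def build_summary(scores, categories):
    """scores: {sentence: {category: score}} -> one summary paragraph."""
    best = {}  # category -> (score, sentence)
    for sentence, by_category in scores.items():
        # Keep only the two categories on which this sentence scores highest.
        top_two = sorted(by_category, key=by_category.get, reverse=True)[:2]
        for category in top_two:
            score = by_category[category]
            if category not in best or score > best[category][0]:
                best[category] = (score, sentence)
    # Combine the top sentence of each category, in the given category order.
    chosen = []
    for category in categories:
        if category in best and best[category][1] not in chosen:
            chosen.append(best[category][1])
    return " ".join(chosen)

# Placeholder sentences and hypothetical scores.
scores = {
    "Sentence on gene products.": {"GP": 0.9, "EL": 0.5, "SI": 0.2},
    "Sentence on expression.": {"GP": 0.1, "EL": 0.8, "SI": 0.3},
    "Sentence on sequence.": {"GP": 0.2, "EL": 0.1, "SI": 0.7},
}
summary = build_summary(scores, ["GP", "EL", "SI"])
```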


Experiments
- 22,092 PubMed abstracts on “Drosophila”
- Implemented on top of the Lemur Toolkit, which provides a variety of information retrieval functions
- 10 genes randomly selected from FlyBase for evaluation

Evaluation
- Precision of the top k sentences for a category is evaluated
- Three methods are compared:
  - Baseline run (BL): randomly select k sentences
  - CatRel: use the category relevance score to rank sentences and select the top k
  - Comb: combine the three scores to rank sentences
- Two annotators with domain knowledge judge the relevance for each category
- Criterion: a sentence is considered relevant to a category if and only if it contains information on this aspect, regardless of its extra information, if any
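Precision of the top k sentences is simply the fraction of those k sentences that the annotators judged relevant. A minimal sketch, with invented sentence ids and judgments:

```python
def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_sentences[:k]
    return sum(1 for s in top_k if s in relevant) / k

# Hypothetical ranking for one category and the annotators' relevant set.
ranked = ["s3", "s1", "s4", "s2"]
relevant = {"s1", "s3"}
p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_4 = precision_at_k(ranked, relevant, 4)
```

Dividing by k even when fewer than k sentences are available is the usual convention and is assumed here.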

Precision of the top-k sentences

Discussion
- Improvements over the baseline are most pronounced for the EL, SI, MP and GI categories
- These four categories are more specific, and thus easier to detect, than the other two (GP, WFPI)
- Problem of predefined categories: not all genes fit into this framework
  - E.g., the gene Amy-d, as an enzyme involved in carbohydrate metabolism, is not typically studied by genetic means, hence the low precision for MP and GI
  - Not a major problem: low precision on some occasions is probably caused by the fact that there is little research on this aspect

Summary example (Abl)

Summary example (Camo|Sod)


Conclusion and future work
- Proposed a novel problem in biomedical text mining: automatic structured gene summarization
- Developed a system using IR techniques to automatically summarize information about genes from PubMed abstracts
- Limitation: depends on the high-quality training data in FlyBase
- Future work: incorporate more training data from other model organism databases and from resources such as GeneRIF in Entrez Gene
- A mixture of data from different resources will reduce the domain bias and help to build a general tool for gene summarization


Vector Space Model
- Term vector: reflects the use of different words
- $w_{i,j}$: weight of term $t_i$ in vector $j$
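For reference, the cosine similarity used to compare two such term vectors can be written out with these weights. This is the standard definition; the subscripts $c$ and $s$ (for a category vector and a sentence vector) follow the scoring slides.

```latex
\[
\cos(V_c, V_s) \;=\;
\frac{\sum_i w_{i,c}\, w_{i,s}}
     {\sqrt{\sum_i w_{i,c}^{2}}\;\sqrt{\sum_i w_{i,s}^{2}}}
\]
```

The value is 1 when the two vectors point in the same direction and 0 when they share no terms.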
