Presentation on theme: "Biological literature mining"— Presentation transcript:
1 Biological literature mining. Information retrieval (IR): retrieve papers relevant to specific keywords. Entity recognition (ER): identify specific biological entities (e.g., genes) in papers. Information extraction (IE): enable specific facts to be automatically pulled out of papers.
2 Example sentence. “Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation.” Its context is the cell cycle of the yeast Saccharomyces cerevisiae, and it allows us to demonstrate the powers and pitfalls of current literature-mining approaches.
3 Information Retrieval: finding the papers. The aim is to identify text segments pertaining to a particular topic (here, “yeast cell cycle”). The topic may be a user-provided query (ad hoc IR) or a set of papers (text categorization).
4 Ad hoc IR. PubMed is an example. It supports the “Boolean model” as well as the “vector model”. Boolean model: combination of terms using logical operations (OR, AND). Vector model: we’ll see more of this later.
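As a rough illustration of the Boolean model, the sketch below builds a toy inverted index and evaluates AND/OR queries as set operations. The three documents and the function names are made up for this example; it does not describe how PubMed itself is implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def boolean_and(index, terms):
    """Documents that contain every query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, terms):
    """Documents that contain at least one query term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.union(*sets) if sets else set()

# Toy collection (hypothetical titles, not real PubMed records).
docs = {
    1: "yeast cell cycle regulation",
    2: "human cell cycle checkpoints",
    3: "yeast metabolism",
}
index = build_index(docs)
yeast_and_cycle = boolean_and(index, ["yeast", "cycle"])
yeast_or_human = boolean_or(index, ["yeast", "human"])
```

Here `yeast_and_cycle` contains only document 1, while `yeast_or_human` contains all three documents.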
5 Ad hoc IR: tricks. Lessons learned from regular IR are also applicable to biomedical literature. Removal of “stop words” such as the, it, etc. Truncating common word endings such as -ing, -s. Use of a thesaurus to automatically “expand” the query, e.g., “yeast AND cell cycle” => “(yeast OR Saccharomyces cerevisiae) AND cell cycle”.
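The three tricks on this slide can be sketched in a few lines. The stop-word list, the suffix rule and the one-entry thesaurus below are deliberately tiny and hypothetical; real systems use much larger resources and proper stemmers.

```python
STOP_WORDS = {"the", "it", "of", "a", "an", "in"}      # tiny illustrative list
THESAURUS = {"yeast": ["Saccharomyces cerevisiae"]}    # hypothetical single entry

def strip_suffix(word):
    """Crude truncation of the endings -ing and -s (not a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, drop stop words, truncate common endings."""
    return [strip_suffix(w) for w in text.lower().split() if w not in STOP_WORDS]

def expand_query(terms):
    """Join terms with AND, replacing each term by an OR-group of its synonyms."""
    clauses = []
    for term in terms:
        synonyms = THESAURUS.get(term.lower(), [])
        if synonyms:
            clauses.append("(" + " OR ".join([term] + synonyms) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)

tokens = normalize("the cells dividing in yeast")
query = expand_query(["yeast", "cell cycle"])
```

With these toy resources, `query` is the expanded string from the slide, `(yeast OR Saccharomyces cerevisiae) AND cell cycle`.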
6 Ad hoc IR. “Even with these improvements, current ad hoc IR systems are not able to retrieve our example sentence when they are given the query ‘yeast cell cycle’. Instead, this could be achieved by realizing that ‘yeast’ is a synonym for S. cerevisiae, that ‘cell cycle’ is a Gene Ontology term, that the word ‘Cdc28’ refers to an S. cerevisiae protein and finally, by looking up the Gene Ontology terms that relate to Cdc28 to connect it to the yeast cell cycle.”
7 Entity recognition (ER). Goal: to identify biological entities (e.g., genes, proteins) in text. Two sub-goals: (1) recognition of the words in text that represent these entities; (2) unique identification of these entities (the synonym problem).
8 ER goals. In our example, Clb2, Cdc28, Cdk1, Swe1 and Cdc5 should be recognized as gene or protein names. Additionally, they should be identified by their respective “Saccharomyces Genome Database” accession numbers. ER is perhaps the most difficult task in biomedical text mining.
9 ER approaches: rule based. Manually built rules that look for typical features of names, e.g., names followed by numbers, the ending “-ase”, or occurrences of words such as “gene” or “receptor” in proximity. Automatically built rules using machine learning techniques.
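A manually built rule set of the kind described here might look like the following sketch. The two regular expressions (short letter run followed by digits, and words ending in “-ase”) are illustrative guesses at such rules, not the rules of any particular published system, and they will over- and under-generate on real text.

```python
import re

# Rule 1: a short run of letters followed by digits, e.g. Cdc28, Swe1.
# Rule 2: a word ending in "ase", e.g. kinase.
CANDIDATE = re.compile(r"\b(?:[A-Za-z]{2,5}\d+|[A-Za-z]+ase)\b")

def candidate_names(sentence):
    """Return substrings that look like gene/protein names, in text order."""
    return CANDIDATE.findall(sentence)

example = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) "
           "directly phosphorylated Swe1")
found = candidate_names(example)
```

On the first half of the example sentence this picks out Clb2, Cdc28, Cdk1 and Swe1. It only recognizes candidate names; it does not map them to database identifiers.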
10 ER approaches: dictionary based. Comprehensive list of gene names and their synonyms. Matching algorithms that allow variations in those names, e.g., ‘CDC28’, ‘Cdc28’, ‘Cdc28p’ or ‘cdc-28’. Advantage: they can also associate the recognized entity with its unique identifier.
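A minimal sketch of dictionary-based matching with tolerance for spelling variants follows. The dictionary entries and the identifiers `GENE_0001` and `GENE_0002` are placeholders invented for this example, not real database accession numbers, and the normalization rules (ignore case and punctuation, strip a trailing protein suffix “p”) are assumptions.

```python
import re

# Hypothetical dictionary: normalized name -> placeholder identifier.
DICTIONARY = {
    "cdc28": "GENE_0001",
    "cdk1": "GENE_0001",   # synonym of the same gene
    "swe1": "GENE_0002",
}

def normalize_name(token):
    """Lowercase, drop non-alphanumerics, and strip a trailing protein 'p'."""
    key = re.sub(r"[^a-z0-9]", "", token.lower())
    if key not in DICTIONARY and key.endswith("p"):
        key = key[:-1]     # e.g. 'Cdc28p' -> 'cdc28'
    return key

def lookup(token):
    """Return the identifier for a token, or None if it is not in the dictionary."""
    return DICTIONARY.get(normalize_name(token))

ids = [lookup(v) for v in ("CDC28", "Cdc28", "Cdc28p", "cdc-28")]
```

All four spelling variants from the slide resolve to the same placeholder identifier, which is the advantage the slide points out over purely rule-based recognition.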
11 Why ER is difficult. Each gene has several names and abbreviations, e.g., ‘Cdc28’ is also called ‘Cyclin-dependent kinase 1’ or ‘Cdk1’. Gene names may also be: common English words, e.g., hairy; biological terms, e.g., SDS; names of other genes, e.g., ‘Cdc2’ refers to two different genes in budding yeast and in fission yeast.
12 Information Extraction (IE). IR extracts texts on particular topics. IE extracts facts about relationships between biological entities, e.g., deduce that Cdc28 binds Clb2, that Swe1 is phosphorylated by the Cdc28–Clb2 complex, and that Cdc5 is involved in Swe1 phosphorylation.
13 IE approaches: co-occurrence. Identify entities that co-occur in a sentence, abstract, etc. Two co-occurring entities may be unrelated, but if they co-occur repeatedly, they are likely related; therefore, some statistical analysis is used. This finds related entities but not necessarily the type of relationship.
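The slide does not say which statistic is used, so the sketch below assumes one common choice, pointwise mutual information (PMI), computed over sentence-level co-occurrence counts. The input format (one set of recognized entity names per sentence) and the toy data are also assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_scores(sentences):
    """sentences: list of sets of entity names found in each sentence.

    Returns {(a, b): (count, pmi)} for every pair seen together at least once.
    """
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), c in pair.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), with probabilities as sentence fractions.
        pmi = math.log((c * n) / (single[a] * single[b]))
        scores[(a, b)] = (c, pmi)
    return scores

toy = [{"Cdc28", "Clb2"}, {"Cdc28", "Clb2"}, {"Cdc28", "Swe1"}, {"Cdc5"}]
scores = cooccurrence_scores(toy)
```

As the slide notes, a high score only suggests that two entities are related; it says nothing about whether one binds, phosphorylates or regulates the other.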
14 IE approaches: NLP. Natural Language Processing (NLP) steps: Tokenize text and identify word and sentence boundaries. Assign a part-of-speech tag (e.g., noun/verb) to each word. Build a syntax tree for each sentence, delineating noun phrases and their interrelationships. Use ER to assign semantic tags for biological entities (e.g., gene/protein names). Apply rules to the syntax tree and semantic labels to extract relationships between entities.
15 Summary. Information retrieval: getting the texts. Entity recognition: identifying genes, proteins, etc. Information extraction: recovering reported relationships between entities.
16 Automatically Generating Gene Summaries from Biomedical Literature (Ling et al., PSB 2006). CS 466
17 Outline. Introduction: Motivation. System: Keyword Retrieval Module, Information Extraction Module. Experiments and Evaluations. Conclusion and Future Work.
18 Motivation. Finding all the information we know about a gene from the literature is a critical task in biology research. Reading all the relevant articles about a gene is time consuming. A summary of what we know about a gene would help biologists to access the already-discovered knowledge.
19 An Ideal Gene Summary. [Figure: example gene summary with sections labeled GP, EL, SI, GI, MP, WFPI.] Above summary is from ca. 2006.
20 Problem with Manual Procedure. Labor-intensive. Hard to keep updated with the rapid growth of the literature. How can we generate such summaries automatically?
21 The solution. Structured summary on 6 aspects: Gene products (GP); Expression location (EL); Sequence information (SI); Wild-type function and phenotypic information (WFPI); Mutant phenotype (MP); Genetic interaction (GI). 2-stage summarization: (1) retrieve relevant articles by keyword match; (2) extract the most informative and relevant sentences for the 6 aspects.
22 Outline. Introduction: Motivation. System: Keyword Retrieval Module, Information Extraction Module. Experiments and Evaluations. Conclusion and Future Work.
23 System Overview: 2-stage. IE = Information Extraction; KR = Keyword Retrieval.
24 Keyword Retrieval Module (IR). Dictionary-based keyword retrieval: retrieve all documents containing any synonym of the target gene. Input: gene name. Output: relevant documents for that gene. Steps: Gene SynSet construction; keyword-based retrieval.
26 Gene SynSet Construction & Keyword Retrieval. Gene SynSet: a set of synonyms of the target gene. Issues in constructing the SynSet: (1) Variation in gene name spelling, e.g., for the gene cAMP dependent protein kinase 2, the variants PKA C2, Pka C2, Pka-C2, … are normalized to “pka c 2”. (2) Short names are sometimes ambiguous, e.g., the gene name “PKA” is also a chemical term; therefore, require a retrieved document to have at least one synonym that is >= 5 characters long. Retrieving documents based on keywords: enforce an exact match of the token sequence.
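The normalization and matching rules on this slide can be sketched as follows. The exact normalization used by the authors is not given, so the rules here (lowercase, treat punctuation as spaces, split letters from digits) are an assumption chosen to reproduce the “pka c 2” example; measuring the 5-character threshold on the synonym as written in the SynSet is likewise an assumption.

```python
import re

def normalize_name(name):
    """Lowercase, replace punctuation with spaces, split letters from digits."""
    s = re.sub(r"[^a-z0-9]+", " ", name.lower())
    s = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", s)
    return " ".join(s.split())

def contains_sequence(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def matches_gene(document, synset, min_len=5):
    """Keep a document only if it contains a synonym of at least min_len characters."""
    doc_tokens = normalize_name(document).split()
    found = [s for s in synset
             if contains_sequence(doc_tokens, normalize_name(s).split())]
    return any(len(s) >= min_len for s in found)

synset = ["PKA", "Pka-C2", "cAMP dependent protein kinase 2"]
ambiguous_doc = matches_gene("PKA was measured in solution", synset)
specific_doc = matches_gene("Mutations in Pka C2 reduce viability", synset)
```

The first document mentions only the short, ambiguous name “PKA” and is rejected; the second contains the longer variant and is kept.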
27 Information Extraction Module. Takes a set of documents returned from the KR module, and extracts sentences that contain useful factual information about the target gene. Input: relevant documents. Output: gene summary. Steps: training data generation; sentence extraction.
29 Training Data Generation. Construct a training data set consisting of “typical” sentences for describing a category (e.g., sequence information). Training data is not about the gene to be summarized; it is about a “type” of information in general. These sentences come from a manually curated database, e.g., FlyBase has separate sections for each category.
30 Sentence Extraction. Extract sentences from the documents related to our gene. Then try to identify key sentences talking about a certain aspect of the gene (“category”). In determining the importance of a sentence, consider 3 factors: relevance to the specified category (aspect); relevance to its source document; sentence location in its source abstract.
31 Scoring strategies. Category relevance score (Sc): “vector space model”. Construct a “category term vector” Vc for each category c. The weight of term t_i in the vector for category j is w_ij = TF_ij * IDF_i. TF_ij is the frequency of t_i in all training sentences of category j. IDF_i is the “inverse document frequency” = 1 + log(N/n_i), where N = total number of documents and n_i = number of documents containing t_i. TF measures how relevant the term is; IDF measures how rare it is. Similarly, construct a vector Vs for each sentence s. Category relevance score: Sc = cosine(Vc, Vs).
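A minimal sketch of the category relevance score, following the formulas on this slide. The natural logarithm, the whitespace-style token lists, the toy corpus and the choice to give unseen terms an IDF of 0 are assumptions; the original system was built on the Lemur Toolkit, which is not used here.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF_i = 1 + log(N / n_i); documents is a list of token lists."""
    n_i = sum(1 for doc in documents if term in doc)
    return 1 + math.log(len(documents) / n_i) if n_i else 0.0  # unseen term: 0 (assumption)

def tfidf_vector(tokens, documents):
    """Sparse vector {term: TF * IDF} for a bag of tokens."""
    tf = Counter(tokens)
    return {t: f * idf(t, documents) for t, f in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus used only for document frequencies.
documents = [["gene", "sequence", "encodes"],
             ["gene", "expressed", "embryo"],
             ["protein", "sequence", "domain"]]
# Vc: all training sentences of one category pooled together; Vs: one candidate sentence.
Vc = tfidf_vector(["sequence", "encodes", "sequence", "domain"], documents)
Vs = tfidf_vector(["sequence", "domain"], documents)
Sc = cosine(Vc, Vs)
```

Because both vectors share the terms “sequence” and “domain”, `Sc` is positive; a sentence sharing no terms with the category vector would score 0.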
32 Scoring strategies. Document relevance score (Sd): the sentence should also be related to its source document. Construct a vector Vd for each document; Sd = cosine(Vd, Vs). Location score (Sl): in news, early sentences are more useful for summarization; in scientific literature, the last sentence of the abstract is. Sl = 1 for the last sentence of an abstract, 0 otherwise. Sentence ranking: S = 0.5*Sc + 0.3*Sd + 0.2*Sl.
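The location score and the weighted combination are simple enough to write down directly. The function names and the zero-based sentence index are choices made for this sketch; the weights are the ones given on the slide.

```python
def location_score(sentence_index, n_sentences):
    """Sl = 1 for the last sentence of an abstract, 0 otherwise (zero-based index)."""
    return 1.0 if sentence_index == n_sentences - 1 else 0.0

def combined_score(sc, sd, sl):
    """S = 0.5*Sc + 0.3*Sd + 0.2*Sl."""
    return 0.5 * sc + 0.3 * sd + 0.2 * sl

# Example: the last of four sentences, with Sc = 0.8 and Sd = 0.5.
s = combined_score(0.8, 0.5, location_score(3, 4))
```

In this example the score works out to 0.4 + 0.15 + 0.2 = 0.75; the same sentence in any other position would score 0.55.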
33 Summary generation. Keep only the 2 top-ranked categories for each sentence. Generate a paragraph-long summary by combining the top sentence of each category.
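One possible reading of these two steps is sketched below. The input format (a score per sentence per category), the fixed category order and the decision to skip a sentence that has already been chosen for an earlier category are assumptions not stated on the slide.

```python
def generate_summary(scores, categories):
    """scores: {sentence: {category: score}}. Returns the summary paragraph."""
    # Step 1: each sentence stays a candidate only for its two best categories.
    kept = {}
    for sentence, by_cat in scores.items():
        top2 = sorted(by_cat, key=by_cat.get, reverse=True)[:2]
        kept[sentence] = {c: by_cat[c] for c in top2}
    # Step 2: take the best remaining sentence for each category.
    summary = []
    for c in categories:
        candidates = [(cat_scores[c], s)
                      for s, cat_scores in kept.items() if c in cat_scores]
        if candidates:
            best = max(candidates)[1]
            if best not in summary:   # avoid repeating a sentence (assumption)
                summary.append(best)
    return " ".join(summary)

toy_scores = {
    "A.": {"GP": 0.9, "EL": 0.2, "SI": 0.1},
    "B.": {"GP": 0.3, "EL": 0.8, "SI": 0.7},
    "C.": {"GP": 0.1, "EL": 0.3, "SI": 0.2},
}
paragraph = generate_summary(toy_scores, ["GP", "EL", "SI"])
```

In the toy data, sentence A wins GP and sentence B wins both EL and SI, so the paragraph contains A followed by B once.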
34 Outline. Introduction: Motivation, Related Work. System: Keyword Retrieval Module, Information Extraction Module. Experiments and Evaluations. Conclusion and Future Work.
35 Experiments. 22,092 PubMed abstracts on “Drosophila”. Implementation on top of the Lemur Toolkit, which provides a variety of information retrieval functions. 10 genes randomly selected from FlyBase for evaluation.
36 Evaluation. Precision of the top k sentences for a category is evaluated. Three different methods are compared: Baseline run (BL): randomly select k sentences. CatRel: use the Category Relevance Score to rank sentences and select the top k. Comb: combine the three scores to rank sentences. Two annotators with domain knowledge were asked to judge the relevance for each category. Criterion: a sentence is considered relevant to a category if and only if it contains information on this aspect, regardless of any extra information it contains.
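Precision at k and the random baseline can be sketched as below. The sentence ids, the relevance judgments and the fixed random seed are made up for illustration; they are not data from the paper.

```python
import random

def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences judged relevant by the annotators."""
    top = ranked_sentences[:k]
    return sum(1 for s in top if s in relevant) / k

def baseline_run(sentences, k, seed=0):
    """BL: pick k sentences at random (seed fixed only to make the sketch repeatable)."""
    return random.Random(seed).sample(sentences, k)

ranked = ["s1", "s2", "s3", "s4"]      # hypothetical ranking from CatRel or Comb
relevant = {"s1", "s3"}                # hypothetical annotator judgments
p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_3 = precision_at_k(ranked, relevant, 3)
bl = baseline_run(ranked, 2)
```

Here one of the top two sentences is relevant, giving a precision of 0.5 at k = 2, and two of the top three, giving 2/3 at k = 3.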
39 Discussion. Improvements over the baseline are most pronounced for the EL, SI, MP and GI categories. These four categories are more specific and thus easier to detect than the other two, GP and WFPI. Problem of predefined categories: not all genes fit into this framework. E.g., the gene Amy-d, as an enzyme involved in carbohydrate metabolism, is not typically studied by genetic means, hence the low precision for MP and GI. Not a major problem: low precision on some occasions is probably caused by the fact that there is little research on this aspect.
43 Outline. Introduction: Motivation, Related work. System: Keyword Retrieval Module, Information Extraction Module. Experiments and evaluations. Conclusion and future work.
44 Conclusion and future work. Proposed a novel problem in biomedical text mining: automatic structured gene summarization. Developed a system using IR techniques to automatically summarize information about genes from PubMed abstracts. Limitation: dependency on the high-quality training data in FlyBase. Future work: incorporate more training data from other model organism databases and resources such as GeneRIF in Entrez Gene. A mixture of data from different resources will reduce the domain bias and help to build a general tool for gene summarization.
45 References. L. Hirschman, J. C. Park, J. Tsujii, L. Wong, C. H. Wu (2002) Accomplishments and challenges in literature data mining for biology. Bioinformatics 18(12). H. Shatkay, R. Feldman (2003) Mining the Biomedical Literature in the Genomic Era: An Overview. JCB 10(6). D. Marcu (2003) Automatic Abstracting. Encyclopedia of Library and Information Science.
46 Vector Space Model. Term vector: reflects the use of different words. w_ij: weight of term t_i in vector j.