Presentation on theme: "Biological literature mining"— Presentation transcript:
1 Biological literature mining
  Information retrieval (IR): retrieve papers relevant to specific keywords
  Entity recognition (ER): identify specific biological entities (e.g., genes) in papers
  Information extraction (IE): enable specific facts to be automatically pulled out of papers
2 Example sentence
  “Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation”
  Its context is the cell cycle of the yeast Saccharomyces cerevisiae, and it allows us to demonstrate the powers and pitfalls of current literature-mining approaches.
3 Information Retrieval: finding the papers
  Aim is to identify text segments pertaining to a particular topic (here, “yeast cell cycle”)
  Topic may be a user-provided query: ad hoc IR
  Topic may be a set of papers: text categorization
4 Ad hoc IR
  PubMed is an example
  Supports the “Boolean model” as well as the “vector model”
  Boolean model: combination of terms using logical operations (OR, AND)
  Vector model: We’ll see more of this later
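The Boolean model can be illustrated with a toy inverted index. This is only a minimal sketch: the three documents and the helper names `build_index` and `lookup` are invented for illustration and say nothing about how PubMed is implemented.

```python
# Toy document collection (illustrative, not real abstracts).
docs = {
    1: "yeast cell cycle regulation by Cdc28",
    2: "cell cycle checkpoints in human cells",
    3: "yeast metabolism and fermentation",
}

def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def lookup(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term.lower(), set())

# Boolean AND is set intersection, Boolean OR is set union.
and_hits = lookup("yeast") & lookup("cycle")
or_hits = lookup("yeast") | lookup("cycle")
```

Representing each term's postings as a set keeps the logical operators one-to-one with Python's set operators.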
5 Ad hoc IR: tricks
  Lessons learned from regular IR are also applicable to biomedical literature
  Removal of “stop words” such as the, it, etc.
  Truncating common word endings such as -ing, -s
  Use of a thesaurus to automatically “expand” the query
    e.g., “yeast AND cell cycle” => “(yeast OR Saccharomyces cerevisiae) AND cell cycle”
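The three tricks above can be sketched in a few lines. The stop-word list, the suffix list, and the one-entry thesaurus are assumptions made up for this example; real systems use curated resources and proper stemmers.

```python
# Illustrative resources only; real systems use much larger curated lists.
STOP_WORDS = {"the", "it", "a", "an", "of", "in"}
SUFFIXES = ("ing", "s")
THESAURUS = {"yeast": ["Saccharomyces cerevisiae"]}

def truncate(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and truncate common endings."""
    tokens = text.lower().split()
    return [truncate(t) for t in tokens if t not in STOP_WORDS]

def expand(term):
    """Replace a term by an OR-group of the term and its thesaurus synonyms."""
    alternatives = [term] + THESAURUS.get(term, [])
    if len(alternatives) == 1:
        return term
    return "(" + " OR ".join(alternatives) + ")"

def expand_query(terms):
    """Join the (possibly expanded) terms with AND."""
    return " AND ".join(expand(t) for t in terms)
```

For example, `expand_query(["yeast", "cell cycle"])` reproduces the expanded query shown on the slide.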
6 Ad hoc IR
  “Even with these improvements, current ad hoc IR systems are not able to retrieve our example sentence when they are given the query ‘yeast cell cycle’. Instead, this could be achieved by realizing that ‘yeast’ is a synonym for S. cerevisiae, that ‘cell cycle’ is a Gene Ontology term, that the word ‘Cdc28’ refers to an S. cerevisiae protein and finally, by looking up the Gene Ontology terms that relate to Cdc28 to connect it to the yeast cell cycle.”
7 Entity recognition (ER)
  Goal: to identify biological entities (e.g., genes, proteins) in text
  Two sub-goals:
    recognition of the words in text that represent these entities
    unique identification of these entities (the synonym problem)
8 ER goals
  In our example, Clb2, Cdc28, Cdk1, Swe1, Cdc5 should be recognized as gene or protein names
  Additionally, they should be identified by their respective “Saccharomyces Genome Database” accession numbers
  Perhaps the most difficult task in biomedical text mining
9 ER approaches: rule based
  Manually built rules that look for typical features of names, e.g., names followed by numbers, the ending “-ase”, occurrences of words such as “gene” or “receptor” in proximity
  Automatically built rules using machine learning techniques
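Two of the manual rules above (letters followed by digits, and the “-ase” ending) can be written as regular expressions. The patterns below are a hypothetical sketch, not rules from any published system, and they over-generate: a generic word like “kinase” is flagged just like a specific name.

```python
import re

# Rule 1 (assumed pattern): 2-5 letters followed by digits, e.g. Cdc28, Swe1.
NAME_WITH_NUMBER = re.compile(r"\b[A-Za-z]{2,5}\d+\b")
# Rule 2 (assumed pattern): any word ending in "ase", e.g. kinase.
ASE_ENDING = re.compile(r"\b[A-Za-z]+ase\b")

def candidate_names(sentence):
    """Return the set of tokens matched by either rule."""
    hits = set(NAME_WITH_NUMBER.findall(sentence))
    hits.update(ASE_ENDING.findall(sentence))
    return hits
```

Rules like these only recognize candidate names; they do not link a name to a database identifier.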
10 ER approaches: dictionary based
  Comprehensive list of gene names and their synonyms
  Matching algorithms that allow variations in those names, e.g., ‘CDC28’, ‘Cdc28’, ‘Cdc28p’ or ‘cdc-28’
  Advantage: they can also associate the recognized entity with its unique identifier
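A dictionary matcher that tolerates the spelling variants listed above might look like the following. The normalization rules and the identifier string `ID_PLACEHOLDER_CDC28` are assumptions for illustration; a real dictionary would carry the actual database accession numbers.

```python
import re

def normalize(name):
    """Collapse case, punctuation, and the trailing protein suffix 'p'."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    # Yeast protein names are often written with a trailing 'p' (Cdc28p);
    # strip it only when it directly follows a digit.
    return re.sub(r"(?<=\d)p$", "", key)

# Hypothetical dictionary: normalized name -> placeholder identifier.
DICTIONARY = {
    normalize("Cdc28"): "ID_PLACEHOLDER_CDC28",
    normalize("Cdk1"): "ID_PLACEHOLDER_CDC28",  # synonym, same entity
}

def identify(token):
    """Return the identifier for a token, or None if it is not in the dictionary."""
    return DICTIONARY.get(normalize(token))
```

Because synonyms map to the same identifier, recognition and unique identification happen in one lookup.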
11 Why ER is difficult
  Each gene has several names and abbreviations, e.g., ‘Cdc28’ is also called ‘Cyclin-dependent kinase 1’ or ‘Cdk1’
  Gene names may also be
    common English words, e.g., hairy
    biological terms, e.g., SDS
    names of other genes, e.g., ‘Cdc2’ refers to two different genes in budding yeast and in fission yeast
12 Information Extraction (IE)
  IR extracts texts on particular topics
  IE extracts facts about relationships between biological entities
  e.g., deduce that
    Cdc28 binds Clb2
    Swe1 is phosphorylated by the Cdc28–Clb2 complex
    Cdc5 is involved in Swe1 phosphorylation
13 IE approaches: co-occurrence
  Identify entities that co-occur in a sentence, abstract, etc.
  Two co-occurring entities may be unrelated, but if they co-occur repeatedly, they are likely related. Therefore, some statistical analysis is used
  Finds related entities but not necessarily the type of relationship
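A minimal sketch of sentence-level co-occurrence counting follows. The entity sets are invented toy data, and raw counts stand in for the statistical analysis a real system would apply on top of them.

```python
from collections import Counter
from itertools import combinations

# Toy input: the set of recognized entities in each sentence (illustrative).
sentences = [
    {"Cdc28", "Clb2"},
    {"Cdc28", "Clb2", "Swe1"},
    {"Cdc5", "Swe1"},
    {"Cdc28", "Swe1"},
]

def cooccurrence_counts(sentences):
    """Count how many sentences mention each unordered pair of entities."""
    counts = Counter()
    for entities in sentences:
        # Sorting makes (A, B) and (B, A) the same key.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(sentences)
```

Note that the counts say which pairs are associated, not whether the relation is binding, phosphorylation, or something else.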
14 IE approaches: NLP
  Natural Language Processing (NLP)
  Tokenize text and identify word and sentence boundaries
  Part-of-speech tag (e.g., noun/verb) for each word
  Syntax tree for each sentence, delineating noun phrases and their interrelationships
  ER used to assign semantic tags for biological entities (e.g., gene/protein names)
  Rules applied to the syntax tree and semantic labels to extract relationships between entities
15 Summary
  Information retrieval: getting the texts
  Entity recognition: identifying genes, proteins, etc.
  Information extraction: recovering reported relationships between entities
16 Automatically Generating Gene Summaries from Biomedical Literature (Ling et al., PSB 2006)
  CS 466
17 Outline
  Introduction
    Motivation
  System
    Keyword Retrieval Module
    Information Extraction Module
  Experiments and Evaluations
  Conclusion and Future Work
18 Motivation
  Finding all the information we know about a gene from the literature is a critical task in biology research
  Reading all the relevant articles about a gene is time-consuming
  A summary of what we know about a gene would help biologists to access the already-discovered knowledge
19 An Ideal Gene Summary
  [Figure: example gene summary with sections labeled GP, EL, SI, GI, MP, WFPI]
  Above summary is from ca. 2006
20 Problem with Manual Procedure
  Labor-intensive
  Hard to keep updated with the rapid growth of the literature
  How can we generate such summaries automatically?
21 The solution
  Structured summary on 6 aspects
    Gene products (GP)
    Expression location (EL)
    Sequence information (SI)
    Wild-type function and phenotypic information (WFPI)
    Mutant phenotype (MP)
    Genetic interaction (GI)
  2-stage summarization
    Retrieve relevant articles by keyword match
    Extract the most informative and relevant sentences for the 6 aspects
22 Outline
  Introduction
    Motivation
  System
    Keyword Retrieval Module
    Information Extraction Module
  Experiments and Evaluations
  Conclusion and Future Work
24 Keyword Retrieval Module (IR)
  Dictionary-based keyword retrieval: retrieve all documents containing any synonym of the target gene
  Input: gene name
  Output: relevant documents for that gene
  Gene SynSet construction
  Keyword-based retrieval
26 Gene SynSet Construction & Keyword Retrieval
  Gene SynSet: a set of synonyms of the target gene
  Issues in constructing the SynSet
    Variation in gene name spelling
      e.g., gene cAMP dependent protein kinase 2: PKA C2, Pka C2, Pka-C2, … are all normalized to “pka c 2”
    Short names are sometimes ambiguous, e.g., gene name “PKA” is also a chemical term
      Require each retrieved document to contain at least one synonym that is >= 5 characters long
  Retrieving documents based on keywords: enforce an exact match of the token sequence
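The normalization and retrieval constraints above can be sketched as follows. The exact normalization rules are assumptions chosen so that the slide's variants all map to “pka c 2”; whether the 5-character threshold applies to the raw or the normalized synonym is not stated, and this sketch uses the raw string.

```python
import re

def normalize_gene_name(name):
    """Lowercase, turn punctuation into spaces, and split letters from digits."""
    s = re.sub(r"[^a-z0-9]+", " ", name.lower())
    # Insert a space at every letter/digit boundary: "c2" -> "c 2".
    s = re.sub(r"(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])", " ", s)
    return " ".join(s.split())

def contains_sequence(doc_tokens, syn_tokens):
    """True if syn_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(syn_tokens)
    return any(doc_tokens[i:i + n] == syn_tokens
               for i in range(len(doc_tokens) - n + 1))

def is_retrieved(document, synset, min_len=5):
    """Keep a document only if it contains a synonym of at least min_len characters."""
    doc_tokens = normalize_gene_name(document).split()
    for syn in synset:
        syn_tokens = normalize_gene_name(syn).split()
        if len(syn) >= min_len and contains_sequence(doc_tokens, syn_tokens):
            return True
    return False
```

With the SynSet `["PKA", "Pka-C2"]`, a document that mentions only “PKA” is rejected, while one mentioning “Pka-C2” is retrieved.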
27 Information Extraction Module
  Takes a set of documents returned from the KR module, and extracts sentences that contain useful factual information about the target gene
  Input: relevant documents
  Output: gene summary
  Training data generation
  Sentence extraction
29 Training Data Generation
  Construct a training data set consisting of “typical” sentences for describing a category (e.g., sequence information)
  Training data is not about the gene to be summarized. It is about a “type” of information in general.
  These sentences come from a manually curated database
    e.g., FlyBase has separate sections for each category
30 Sentence Extraction
  Extract sentences from the documents related to our gene
  Then try to identify key sentences talking about a certain aspect of the gene (“category”)
  In determining the importance of a sentence, consider 3 factors
    Relevance to the specified category (aspect)
    Relevance to its source document
    Sentence location in its source abstract
31 Scoring strategies
  Category relevance score ($S_c$): “vector space model”
  Construct a “category term vector” $V_c$ for each category $c$
  Weight of term $t_i$ in the vector for category $j$: $w_{ij} = TF_{ij} \cdot IDF_i$
  $TF_{ij}$ is the frequency of $t_i$ in all training sentences of category $j$
  $IDF_i$ is the “inverse document frequency”: $IDF_i = 1 + \log(N / n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing $t_i$
  TF measures how relevant the term is; IDF measures how rare it is
  Similarly, construct a vector $V_s$ for each sentence $s$
  Category relevance score: $S_c = \cos(V_c, V_s)$
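The category relevance score can be sketched with sparse dictionaries as term vectors. This is an illustration under assumptions: the natural logarithm is used because the slide does not state the base, and the function names and toy tokens are invented here rather than taken from the original system.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF = 1 + log(N / n_i); returns 0.0 for a term seen in no document."""
    n_i = sum(1 for doc in documents if term in doc)
    return (1 + math.log(len(documents) / n_i)) if n_i else 0.0

def tfidf_vector(tokens, documents):
    """Sparse vector {term: TF * IDF} for a bag of tokens."""
    tf = Counter(tokens)
    return {t: tf[t] * idf(t, documents) for t in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection: each document is a set of tokens (illustrative only).
documents = [
    {"expressed", "in", "embryo"},
    {"gene", "expressed", "embryo"},
    {"mutant", "lethal"},
]
v_category = tfidf_vector(["expressed", "in", "embryo", "expressed"], documents)
v_on_topic = tfidf_vector(["gene", "expressed", "embryo"], documents)
v_off_topic = tfidf_vector(["mutant", "lethal"], documents)
```

A sentence sharing terms with the category's training sentences gets a positive score; one sharing none gets zero.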
32 Scoring strategies
  Document relevance score ($S_d$):
    The sentence should also be related to its source document
    Construct a vector $V_d$ for each document; $S_d = \cos(V_d, V_s)$
  Location score ($S_l$):
    News: early sentences are more useful for summarization
    Scientific literature: the last sentence of the abstract is more useful
    $S_l = 1$ for the last sentence of an abstract, 0 otherwise
  Sentence ranking: $S = 0.5 S_c + 0.3 S_d + 0.2 S_l$
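The location score and the weighted combination can be written directly from the formulas above. The function names are hypothetical; the weights 0.5, 0.3, and 0.2 are the ones given on the slide.

```python
def location_score(sentence_index, n_sentences):
    """S_l: 1.0 for the last sentence of an abstract, 0.0 otherwise."""
    return 1.0 if sentence_index == n_sentences - 1 else 0.0

def sentence_score(s_c, s_d, s_l):
    """Combined ranking score S = 0.5*S_c + 0.3*S_d + 0.2*S_l."""
    return 0.5 * s_c + 0.3 * s_d + 0.2 * s_l

# Worked example: the last sentence of a 6-sentence abstract,
# with assumed relevance scores S_c = 0.8 and S_d = 0.5.
example = sentence_score(0.8, 0.5, location_score(5, 6))
```

Here the combined score is 0.4 + 0.15 + 0.2 = 0.75.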
33 Summary generation
  Keep only the 2 top-ranked categories for each sentence
  Generate a paragraph-long summary by combining the top sentence of each category
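One possible reading of these two steps is sketched below. The data layout (a dictionary of per-category scores for each sentence) and the function name are assumptions; the slide does not say how ties or sentences that win several categories are handled, and this sketch does nothing special for them.

```python
def build_summary(scores, categories):
    """scores: {sentence: {category: score}} -> {category: best sentence}."""
    # Step 1: for each sentence, keep only its two highest-scoring categories.
    kept = {}
    for sentence, by_cat in scores.items():
        top2 = sorted(by_cat, key=by_cat.get, reverse=True)[:2]
        kept[sentence] = {c: by_cat[c] for c in top2}
    # Step 2: for each category, pick the best sentence still assigned to it.
    summary = {}
    for c in categories:
        candidates = [(s, cats[c]) for s, cats in kept.items() if c in cats]
        if candidates:
            summary[c] = max(candidates, key=lambda pair: pair[1])[0]
    return summary

# Toy scores for three sentences over three of the six aspects.
scores = {
    "s1": {"GP": 0.9, "EL": 0.2, "SI": 0.1},
    "s2": {"GP": 0.3, "EL": 0.8, "SI": 0.7},
    "s3": {"GP": 0.1, "EL": 0.4, "SI": 0.9},
}
summary = build_summary(scores, ["GP", "EL", "SI"])
```

Concatenating the selected sentences in category order would give the paragraph-long summary.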
34 Outline
  Introduction
    Motivation
    Related Work
  System
    Keyword Retrieval Module
    Information Extraction Module
  Experiments and Evaluations
  Conclusion and Future Work
35 Experiments
  22,092 PubMed abstracts on “Drosophila”
  Implementation on top of the Lemur Toolkit
    Variety of information retrieval functions
  10 genes are randomly selected from FlyBase for evaluation
36 Evaluation
  Precision of the top k sentences for a category is evaluated
  Three different methods evaluated:
    Baseline run (BL): randomly select k sentences
    CatRel: use the Category Relevance Score to rank sentences and select the top k
    Comb: combine the three scores to rank sentences
  Ask two annotators with domain knowledge to judge the relevance for each category
  Criterion: A sentence is considered to be relevant to a category if and only if it contains information on this aspect, regardless of its extra information, if any.
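Precision of the top k sentences is the fraction of those k that the annotators judged relevant. A minimal sketch, with a hypothetical function name and toy judgments:

```python
def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences that are judged relevant."""
    top_k = ranked_sentences[:k]
    return sum(1 for s in top_k if s in relevant) / k

# Toy example: a ranking of four sentences, two of which are relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p3 = precision_at_k(ranked, relevant, 3)
```

In this example two of the top three sentences are relevant, so the precision at 3 is 2/3.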
39 Discussion
  Improvements over the baseline are most pronounced for the EL, SI, MP, and GI categories
    These four categories are more specific and thus easier to detect than the other two, GP and WFPI
  Problem of predefined categories
    Not all genes fit into this framework. E.g., gene Amy-d, as an enzyme involved in carbohydrate metabolism, is not typically studied by genetic means, hence the low precision for MP and GI.
    Not a major problem: low precision on some occasions is probably caused by the fact that there is little research on this aspect.
43 Outline
  Introduction
    Motivation
    Related work
  System
    Keyword Retrieval Module
    Information Extraction Module
  Experiments and evaluations
  Conclusion and future work
44 Conclusion and future work
  Proposed a novel problem in biomedical text mining: automatic structured gene summarization
  Developed a system using IR techniques to automatically summarize information about genes from PubMed abstracts
  Dependency on the high-quality training data in FlyBase
    Incorporate more training data from other model organism databases and resources such as GeneRIF in Entrez Gene
    A mixture of data from different resources will reduce the domain bias and help to build a general tool for gene summarization
46 Vector Space Model
  Term vector: reflects the use of different words
  $w_{i,j}$: weight of term $t_i$ in vector $j$