
1 Towards an Automated Analysis of Biomedical Abstracts Barbara Gawronska, Björn Erlendsson, Björn Olsson School of Humanities and Informatics, University of Skövde, Sweden

2 The goal of the project: text analysis for candidate path extraction

3 The characteristics of the language of biomedical texts A typical PubMed abstract (PMID: 16301995): The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and zinc-finger family of transcription factors and acts by repressing target gene expression. It has been shown that enforced p53 expression leads to increased HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have analysed the HIC1 promoter region for p53-dependent induction of gene expression. (…) Other members of the p53 family, notably TAp73beta and DeltaNp63alpha, can also act through this HIC1.PRE to induce transcription of HIC1, and finally, hypermethylation of the HIC1 promoter attenuates inducibility by p53.

4 Results of POS-tagging of two large corpora (30 million words each): 1) texts on stem cell research, and 2) general English prose. Chart legend: light = stem cell texts, dark = general prose.

5 Results of POS-tagging of a smaller sample corpus of biomedical abstracts

6 The general architecture of the Information Extraction system

7 Patterns for domain-specific Named Entity Recognition
- Pattern 1: n lower-case characters (n >= 1) + m digits (m >= 2) + optionally any one character (p53, cdc25C, bcl2)
- Pattern 2: n lower-case characters (n >= 1) + m upper-case characters (m >= 1) + k digits (k >= 0) (mRNA)
- Pattern 3: digit + lower-case characters + n digits (n >= 0) (1alpha)
- Pattern 4: n digits (n >= 1) + m upper-case characters (m >= 1) (7BL)
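The four patterns map naturally onto regular expressions. The following is a minimal sketch, not the authors' implementation: the names `PATTERNS` and `classify_token` are hypothetical, Pattern 1 is read with at least one digit (an assumption made because the slide's own example bcl2 contains a single digit), and "lower case" in Pattern 3 is read as one or more lower-case letters.

```python
import re

# Hypothetical regex reading of the four slide patterns.
# Assumption: Pattern 1 accepts one or more digits so that "bcl2" matches.
PATTERNS = [
    ("pattern1", re.compile(r"[a-z]+[0-9]+.?")),      # p53, cdc25C, bcl2
    ("pattern2", re.compile(r"[a-z]+[A-Z]+[0-9]*")),  # mRNA
    ("pattern3", re.compile(r"[0-9][a-z]+[0-9]*")),   # 1alpha
    ("pattern4", re.compile(r"[0-9]+[A-Z]+")),        # 7BL
]


def classify_token(token):
    """Return the name of the first pattern matching the whole token, or None."""
    for name, regex in PATTERNS:
        if regex.fullmatch(token):
            return name
    return None


if __name__ == "__main__":
    for tok in ["p53", "cdc25C", "bcl2", "mRNA", "1alpha", "7BL", "cell"]:
        print(tok, classify_token(tok))
```

Using `fullmatch` keeps ordinary words such as "cell" from being flagged just because they contain a matching substring.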

8 Linking acronyms to full names of biological objects
Flowchart, in reading order:
1. From the previous procedure: place the pointer at the first word in the sentence.
2. Find the next acronym A. Found? If no, go to the next procedure (other parts of the NER module).
3. If yes: L1 := first letter of A; N := number of letters in A.
4. Is A within (…)? If yes, find the N:th word beginning in L1 to the left of the '(' and link that word and its right context to A.
5. If no: is A followed by '(' and L1*? If yes, mark the words inside the (…) and link them to A.
Examples:
There are also tumor-related genes like NF2 (neurofibromatose of type 2).
p16INK4a belongs to a group of cell cycle regulators called cyclin dependent kinase inhibitors (CDKI).
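The two linking cases in the flowchart can be sketched as follows, using the slide's two example sentences with their parentheses restored. This is an illustrative reading rather than the authors' code: `link_acronyms` is a hypothetical name, "the N:th word beginning in L1" is interpreted as "the word N positions to the left of the parenthesis, provided it begins with L1", and a token is treated as an acronym candidate only if it contains an upper-case letter (an added heuristic).

```python
import re


def link_acronyms(sentence):
    """Map acronyms to full names, following the two flowchart cases."""
    links = {}

    # Case 1: the acronym stands inside parentheses, full name to its left.
    for match in re.finditer(r"\(([A-Za-z0-9]+)\)", sentence):
        acronym = match.group(1)
        n = len(acronym)
        left_words = sentence[:match.start()].split()
        # Assumed reading: the word N positions back must begin with L1.
        if len(left_words) >= n and left_words[-n][0].lower() == acronym[0].lower():
            links[acronym] = " ".join(left_words[-n:])

    # Case 2: the acronym is followed by '(' and a word beginning with L1.
    for match in re.finditer(r"(\S+)\s*\(([^()]+)\)", sentence):
        acronym, inside = match.group(1), match.group(2).strip()
        looks_like_acronym = any(ch.isupper() for ch in acronym)  # heuristic
        if looks_like_acronym and inside and inside[0].lower() == acronym[0].lower():
            links[acronym] = inside

    return links


if __name__ == "__main__":
    print(link_acronyms(
        "There are also tumor-related genes like NF2 (neurofibromatose of type 2)."))
    print(link_acronyms(
        "p16INK4a belongs to a group of cell cycle regulators called "
        "cyclin dependent kinase inhibitors (CDKI)."))
```

The first-letter check is deliberately weak; a fuller version would probably compare every letter of the acronym against the initials of the candidate words.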

9 Sample semantico-syntactic tags
Our finding implicates that TNF-alpha released from the mesangium after IgA deposition activates renal tubular cells.
[semcat('Our',our,[[],poss([])]),
 semcat(finding,find,[wnn,[]]),
 semcat(implicates,implicate,[[],[speech_act_verb([1])]),
 semcat(that,that,[[],rel([])]),
 semcat('TNF',[propername]),
 semcat(alpha,alpha,[wnn,[]]),
 semcat(released,release,[[],bioverb([[],production])]),
 semcat(from,from,[[],prep([])]),
 semcat(the,the,[[],det([])]),
 semcat(mesangium,mesangium,[[],[]]),
 semcat(after,after,[[],prep([])]),
 semcat('IgA',[propername]),
 semcat(deposition,deposition,[wnn,[]]),
 semcat(activates,activate,[[],bioverb([[],activation])]),
 semcat(renal,renal,[adj,[]]),
 semcat(tubular,tubular,[adj,[]]),
 semcat(cells,cell,[[],cell([])]),
 semcat('.',[[],[]])]

10 Tags (occurrences) in the test set in relation to knowledge sources

11 The next step: finding background and foreground in abstracts

12 Background/foreground in abstracts ID: 16284406. The transcription factors dehydration-responsive element-binding protein 1s (DREB1s)/C-repeat-binding factors (CBFs) specifically interact with the DRE/CRT cis-acting element and control the expression of many stress-inducible genes in Arabidopsis. The genes for DREB1 orthologs, OsDREB1A and OsDREB1B from rice, are induced by cold stress, and overexpression of DREB1 or OsDREB1 induced strong expression of stress-responsive genes in transgenic Arabidopsis plants, resulting in increased tolerance to high-salt and freezing stresses. In this study, we generated transgenic rice plants overexpressing the OsDREB1 or DREB1 genes. These transgenic rice plants showed not only growth retardation under normal growth conditions but also improved tolerance to drought, high-salt and low-temperature stresses like the transgenic Arabidopsis plants overexpressing OsDREB1 or DREB1. We also detected elevated contents of osmoprotectants such as free proline and various soluble sugars in the transgenic rice as in the transgenic Arabidopsis plants. (…)

13 Retrieval of Relevant Text Parts
- Presence of the string this study/current study/present study/our study, or of a synonym of study in the same context (work, research, investigation)
- Presence of the pronoun we preceded or followed by a verb denoting an event in the world of the researcher (i.e., a cognition, communication, or manipulation verb) and not combined with a time adverb referring to past time, such as previously or earlier
- Presence of the string our goal/our aim
- Presence of a cognition/communication verb combined with the adverb now, presently or here
- Tense shift from present to past
Success rate: 92.5%
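The lexical cues listed above can be approximated with a few string checks. The sketch below is an assumption-laden illustration, not the system described in the slides: the function name `foreground_cues`, the tiny verb list, and the two-word window around "we" are all hypothetical, and the tense-shift cue is left out because it needs POS-tagged input.

```python
import re

# Cue 1: "this/current/present/our" + study or a synonym.
STUDY_CUE = re.compile(
    r"\b(?:this|current|present|our)\s+(?:study|work|research|investigation)\b",
    re.IGNORECASE,
)
# Cue 3: "our goal" / "our aim".
GOAL_CUE = re.compile(r"\bour\s+(?:goal|aim)\b", re.IGNORECASE)

# Hypothetical, deliberately small list of researcher-world verbs.
RESEARCHER_VERBS = {
    "hypothesize", "hypothesise", "show", "showed", "report",
    "analysed", "generated", "detected",
}
PAST_ADVERBS = {"previously", "earlier"}
NOW_ADVERBS = {"now", "presently", "here"}


def foreground_cues(sentence):
    """Return the names of the foreground cues found in one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    cues = []
    if STUDY_CUE.search(sentence):
        cues.append("study_phrase")
    # Cue 2: "we" near a researcher verb, with no past-time adverb.
    for i, word in enumerate(words):
        if word == "we":
            window = words[max(i - 2, 0):i] + words[i + 1:i + 3]  # assumed window
            if RESEARCHER_VERBS & set(window) and not PAST_ADVERBS & set(words):
                cues.append("we_verb")
                break
    if GOAL_CUE.search(sentence):
        cues.append("goal_phrase")
    # Cue 4: researcher verb together with now/presently/here.
    if NOW_ADVERBS & set(words) and RESEARCHER_VERBS & set(words):
        cues.append("now_verb")
    return cues


if __name__ == "__main__":
    print(foreground_cues("In this study, we generated transgenic rice plants."))
    print(foreground_cues("We previously showed that p53 induces HIC1."))
```

A sentence with an empty cue list would be treated as background under this reading.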

14 Retrieval of Relevant Text Parts (2)
if Foreground < 6 and word is in [study, work, research, investigation] and word-1 is in [this, current, present, our] -> Foreground = 6
else if Foreground (…) -> Foreground = 5
else if Foreground (…) -> Foreground = 4
else if Foreground (…) -> Foreground = 3
else if Foreground (…)
    if set_found{"now", "presently", "here"} = 1 -> Foreground = 2
    else foundCCverb = 1
if Foreground (…)
    if foundCCverb = 1 -> Foreground = 2
    else set_found{"now", "presently", "here"} = 1
if word indicates tense shift from present to past -> Foreground = 1

15 Extracting relations from syntactic trees
Example sentence: We hypothesise that mediators released from human mesangial cells (HMC) triggered by IgA deposition may lead to activation of proximal tubular epithelial cells (PTEC).
[Figure: syntactic tree of the sentence, with node labels S, subj, pred, obj, Sdsent, advl, Relcl, NUX and PNP]
Relations extracted from the tree:
- IgA deposition, HMC: KEGG relation: activation (v: trigger)
- HMC, mediators: relation type: production (v: release)
- mediators, PTEC: KEGG relation: activation (v: activate), marked as a hypothesis
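One way to hold the relations read off this tree, and to chain them into a candidate path, is sketched below. The `Relation` record and `candidate_path` function are hypothetical names, and the direction of each relation (IgA deposition acting on HMC, HMC producing mediators, mediators acting on PTEC) is an interpretation of the example sentence rather than something the slide states explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str       # entity acting
    target: str       # entity acted upon or produced
    relation: str     # relation type, e.g. "activation"
    verb: str         # verb lemma that signalled the relation
    hypothesis: bool = False  # True if the authors only hypothesise it


# Relations from the example sentence (directions are an assumed reading).
RELATIONS = [
    Relation("mediators", "PTEC", "activation", "activate", hypothesis=True),
    Relation("IgA deposition", "HMC", "activation", "trigger"),
    Relation("HMC", "mediators", "production", "release"),
]


def candidate_path(relations):
    """Chain relations end to end, starting from the entity that is never a target."""
    by_source = {r.source: r for r in relations}
    targets = {r.target for r in relations}
    starts = [r.source for r in relations if r.source not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting entity")
    path = [starts[0]]
    # The length guard stops the walk even if the input contains a cycle.
    while path[-1] in by_source and len(path) <= len(relations):
        path.append(by_source[path[-1]].target)
    return path


if __name__ == "__main__":
    print(" -> ".join(candidate_path(RELATIONS)))
```

Keeping the `hypothesis` flag on each relation lets a later stage separate asserted steps from speculative ones.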

16 Allelic loss at TP53 seems to arise independently of LOH at the RB1 gene in carcinomas of the uterine corpus in humans

17 The syntactic tree after application of the tree search algorithm

18 A possible graphical representation of the compressed tree

19 Results
- Test corpus: about 15 000 words selected from PubMed using p53 as keyword
- Tagging: 95.2% recall
- Retrieval of relevant text parts: success rate 92.5%
- Syntactic parsing: 79% recall, 86% precision
- Relation retrieval: tested only manually, success rate about 94%
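The slides report recall and precision for parsing separately. If a single figure is wanted, the two can be combined into the balanced F-measure (harmonic mean); this is a standard derived quantity and not a result reported by the authors. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Parsing figures from the results slide: 86% precision, 79% recall.
    parsing_f1 = f_measure(precision=0.86, recall=0.79)
    print(f"Parsing F-measure: {parsing_f1:.3f}")
```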

20 Current and Future Work
- A revised tagging procedure; tagging using a smaller lexicon and a domain-specific prefix list
- Parsing improvements
- Implementation of the tree search algorithm
- The question of the final output format
