Extracting Disease-Gene Associations from MEDLINE abstracts
Tsujii laboratory, University of Tokyo

Outline
NLP tools
–Part-of-speech tagger, HPSG parser
Machine learning based approach for extracting Disease-Gene Association
Evaluation
–Precision / recall / f-score
–Effectiveness of predicate-argument structures
DGA explorer
Annotation tool

Part-of-speech tagger
Trained on a corpus containing newspaper articles and biology texts.

Training corpus    Accuracy on WSJ    Accuracy on GENIA
WSJ                97.0%              84.3%
GENIA              75.2%              98.1%
WSJ+GENIA          96.9%              98.1%

HPSG parser
Output
–Phrase structures (e.g. np, vp, pp)
–Predicate-argument structures
Example: We demonstrate that E2F-1 activates the promoter.
demonstrate    ARG1: we       ARG2: [1]
[1] activates  ARG1: E2F-1    ARG2: promoter
…

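The nesting in the example, where ARG2 of "demonstrate" points at the "activates" structure, can be sketched as a small recursive data structure. This is only an illustration: the `Predicate` class and the `flatten` helper are hypothetical names and do not reflect the parser's actual output format.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Predicate:
    """A predicate with up to two arguments (hypothetical container)."""
    word: str
    arg1: Union[str, "Predicate"]
    arg2: Optional[Union[str, "Predicate"]] = None


def flatten(pred):
    """Return (predicate, ARG1 head, ARG2 head) triples, outermost first."""
    def head(arg):
        # A nested predicate is represented by its own head word.
        return arg.word if isinstance(arg, Predicate) else arg

    triples = [(pred.word, head(pred.arg1), head(pred.arg2))]
    for arg in (pred.arg1, pred.arg2):
        if isinstance(arg, Predicate):
            triples.extend(flatten(arg))
    return triples


# "We demonstrate that E2F-1 activates the promoter."
activates = Predicate("activates", "E2F-1", "promoter")
demonstrate = Predicate("demonstrate", "we", activates)
triples = flatten(demonstrate)
```

Flattening to triples is one convenient way to make the structures usable as pattern inputs or classifier features; other encodings would work equally well.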
Parsing MEDLINE
Corpus
–1,500,000 MEDLINE abstracts
Parsing speed
–5 secs / sentence
Server
–PC cluster (100 processors)
Time
–10 days

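As a rough consistency check on these figures, the arithmetic below estimates how many sentences per abstract the numbers imply. It assumes perfect parallelism and that all 100 processors were busy for the full 10 days, neither of which the slide states.

```python
# Figures taken from the slide.
abstracts = 1_500_000
seconds_per_sentence = 5
processors = 100
days = 10

# Assumption: every processor parses continuously for the whole period.
processor_seconds = processors * days * 24 * 60 * 60
sentences_parsed = processor_seconds // seconds_per_sentence
sentences_per_abstract = sentences_parsed / abstracts

print(f"{sentences_parsed:,} sentences, "
      f"{sentences_per_abstract:.2f} per abstract")
```

Under those assumptions the budget covers about 17 million sentences, or roughly 11.5 sentences per abstract, so the quoted speed, cluster size, and running time fit together.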
Extracting Disease-Gene Association
Preliminary experiments
–Patterns on predicate-argument structures
accelerates    ARG1: GENE      ARG2: DISEASE
demonstrates   ARG1: DISEASE   ARG2: GENE
…
Low recall and precision

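A minimal sketch of this kind of pattern matching, assuming each sentence has already been reduced to (predicate, ARG1, ARG2) triples and that gene and disease names are recognized by dictionary lookup. The function names and the placeholder entities `geneA` and `diseaseB` are hypothetical, used only for illustration.

```python
# The two patterns shown on the slide: (predicate, ARG1 type, ARG2 type).
PATTERNS = {
    ("accelerates", "GENE", "DISEASE"),
    ("demonstrates", "DISEASE", "GENE"),
}


def entity_type(word, genes, diseases):
    """Map an argument to GENE, DISEASE, or None via dictionary lookup."""
    if word in genes:
        return "GENE"
    if word in diseases:
        return "DISEASE"
    return None


def match_patterns(triples, genes, diseases):
    """Return the triples whose predicate and argument types match a pattern."""
    hits = []
    for predicate, arg1, arg2 in triples:
        key = (predicate,
               entity_type(arg1, genes, diseases),
               entity_type(arg2, genes, diseases))
        if key in PATTERNS:
            hits.append((predicate, arg1, arg2))
    return hits


# Placeholder entities, not real gene or disease names.
genes = {"geneA"}
diseases = {"diseaseB"}
triples = [
    ("accelerates", "geneA", "diseaseB"),   # matches the first pattern
    ("accelerates", "diseaseB", "geneA"),   # argument types reversed
    ("activates", "geneA", "diseaseB"),     # predicate not in any pattern
]
hits = match_patterns(triples, genes, diseases)
```

Exact matching of this sort rejects any sentence whose predicate or argument order is not listed, which is one plausible reason hand-written patterns alone would give low recall.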
Machine learning based approach
Sentence selection
Extracted association
Using the patterns on predicate-argument structures as the features for machine learning

Training data
–The latter is also implied by fibroblast-associated alterations in tumor cell morphology and ECM distribution in the system.
–Lung fibrosis is a fatal condition of excess extracellular matrix (ECM) deposition associated with increased transforming growth factor beta (TGF-beta) activity.
–All foals with OLWS were homozygous for the Ile118Lys EDNRB mutation, and adults that were homozygous were not found.
–Dominant radial drusen and Arg345Trp EFEMP1 mutation.
–The 5 year overall survival (OS) and event-free survival (EFS) were 94 and 90 +/- 8%, respectively, with a median follow-up of 48 months.
–These data may indicate that formation of parathyroid adenoma in young patients is related to a mechanism involving EGFR.

Maximum entropy learning
Log-linear model
–Binary-valued feature function
–Weight of the feature
Features
–Bag-of-words
–Local context
–Gene/disease name
–Predicate-argument structures
–…
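The slide names a log-linear model with binary-valued feature functions and one weight per feature. In its standard form such a model scores a label y for input x as p(y|x) = exp(Σ_i λ_i f_i(x, y)) / Z(x), where Z(x) sums the numerator over all labels. The sketch below implements that standard form; the specific features and weight values are hypothetical and are not taken from the presentation.

```python
import math

LABELS = (0, 1)  # 1 = sentence expresses a disease-gene association

# Hypothetical binary-valued feature functions f_i(x, y) -> 0 or 1.
FEATURES = [
    # Bag-of-words feature.
    lambda x, y: int(y == 1 and "mutation" in x["words"]),
    # Predicate-argument structure feature.
    lambda x, y: int(y == 1 and ("accelerates", "GENE", "DISEASE") in x["pas"]),
    # Bag-of-words feature voting for the negative label.
    lambda x, y: int(y == 0 and "survival" in x["words"]),
]

# Hypothetical weights lambda_i; in practice they are learned from training data.
WEIGHTS = [1.0, 2.0, 1.5]


def score(x, y):
    """Weighted sum of the features that fire for (x, y)."""
    return sum(w * f(x, y) for f, w in zip(FEATURES, WEIGHTS))


def prob(x, y):
    """p(y | x) under the log-linear model, normalized over LABELS."""
    z = sum(math.exp(score(x, label)) for label in LABELS)
    return math.exp(score(x, y)) / z


# A sentence containing "mutation" and no matching predicate-argument pattern.
x = {"words": {"mutation", "homozygous"}, "pas": set()}
p_association = prob(x, 1)
```

Because every feature is just a 0/1 test on the sentence and a label, bag-of-words, local context, entity names, and predicate-argument patterns can all be combined in the same model without special handling.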