1 Information Extraction in Biology Junichi Tsujii GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) Computer Science University of Tokyo

2 Overview of GENIA Project
① A researcher with a question ② query ③ ④ information extracted ⑤ answer to the question
[Diagram labels: GENIA; Information Extraction (Pre-processing, Named entity, Template element, Scenario template); Information Retrieval; Learning; Terminology; Databases; Corpora; Ontology; Thesaurus; WWW; Links]

3 Overview of GENIA System
Information Extraction Module: Identify & classify terms; Identify events
Retrieval Module: Request enhancement; Spawn request; Classify documents
Interface Module: GUI; HTML conversion; System integration
Concept Module
Corpus Module: Markup generation / compilation; Annotated corpus construction
Database Module: DB design / access / management; DB construction
BK design / construction / compilation
[Other diagram labels: Raw (OCR) Text; Structure; Annotated Corpus; Document; Named-Entity; Event; Database; Ontology; Markup language; Data model; Background Knowledge; MEDLINE; Security; User; IR Request; Abstract; Full Paper]

4 Objectives
What should be extracted?
–Ontology for Fact Data bases and Ontology for NLP
–Linking texts with Fact Data Bases
Information extraction from texts
–Named Entity Recognition
–Event Recognition
Resource building
–Knowledge of the Domain
–Representation Language: Lattice-based Types

6 Target Definition of Information Extraction
Examples of Existing Data Base
–Lack of Explicit Ontology
–Flat, non-structured collection of Data
Ontology for Data Bases and Knowledge Bases

7 CSNDB (National Institute of Health Sciences)
A data- and knowledge-base for signaling pathways of human cells.
–It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
–Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
–CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
–The final goal is a computerized model of various biological phenomena.

8 Example 1: A Standard Reaction (excerpted from [Takai98])
Signal_Reaction: “EGF receptor → Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]

9 Example 3: A Polymerization Reaction (excerpted from [Takai98])
Signal_Reaction: “Ah receptor + HSP90  ”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction “PAS domain of Ah receptor”
Activity “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
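
The CSNDB entries above are attribute-value records, and slide 7 says pathways are compiled as binary relationships between biomolecules. A minimal Python sketch of that idea follows. The class name `SignalReaction`, its field names, and the helper `to_edges` are hypothetical illustrations, not CSNDB's actual schema (CSNDB itself is built on ACEDB).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalReaction:
    """One binary signaling reaction, mirroring the fields of Example 1."""
    from_molecule: str
    to_molecule: str
    effect: str
    interaction: Optional[str] = None
    tissue: Optional[str] = None
    reference: Optional[str] = None


def to_edges(reactions):
    """Build an adjacency map: source molecule -> list of (target, effect)."""
    graph = {}
    for r in reactions:
        graph.setdefault(r.from_molecule, []).append((r.to_molecule, r.effect))
    return graph


# Field values copied from Example 1.
egf_grb2 = SignalReaction(
    from_molecule="EGF receptor",
    to_molecule="Grb2",
    effect="activation",
    interaction="SH2+phosphorylated Tyr",
    tissue="liver",
    reference="Yamauchi_1997",
)

pathway = to_edges([egf_grb2])
print(pathway)
```

A record like Example 3, which lists components rather than a from/to pair, would need a different shape; that is one reason a flat binary schema does not cover every reaction type.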

10 Characteristics of Signal Pathway (1)
Granularity of Knowledge Units
Different types of entities which are interrelated with each other
–Cells, Sub-locations of cells
–Proteins, substructures of proteins, Subclasses of proteins
–Ions, other chemical substances
–Genes, RNA, DNA
[Figure: G-protein coupled receptor pathway model, from TRANSPATH]

11 Characteristics of Signal Pathways (2)
http://www.mips.biochem.mpg.de/proj/yeast/pathways/pheromone.html
Incomplete Knowledge of Interactions
Interpretations depend on background knowledge and contexts

12 Structured Representation (Compound Graph)
interaction graph G = (V, E_G)
decomposition tree T = (V, E_T)
[Figure: compound graph with nodes A to J; decomposition tree labels: root, EIJH, BFG, ACD]
Fukuda (CBRC, AIST) and Takagi (IMS, University of Tokyo)
CBRC: Computational Biology Research Center
AIST: National Institute of Advanced Industrial Science and Technology
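
A compound graph pairs an interaction graph G = (V, E_G) with a decomposition tree T = (V, E_T) over the same nodes. The Python sketch below shows one way the two structures could be held together. The cluster labels come from the slide, but reading "EIJH", "BFG" and "ACD" as clusters of the single-letter nodes they name is an assumption, and the interaction edges are invented purely for illustration.

```python
# Decomposition tree T as a child -> parent map.
# Assumption: each cluster label lists the leaf nodes it contains.
PARENT = {"EIJH": "root", "BFG": "root", "ACD": "root"}
for cluster in ("EIJH", "BFG", "ACD"):
    for leaf in cluster:
        PARENT[leaf] = cluster

# Interaction graph G as a list of leaf-level edges (hypothetical edges).
INTERACTIONS = [("A", "C"), ("C", "B"), ("F", "E"), ("I", "J")]


def cluster_of(node):
    """Return the top-level cluster (a child of root) that contains node."""
    while PARENT[node] != "root":
        node = PARENT[node]
    return node


def lift(edges):
    """Collapse leaf-level interactions into edges between distinct clusters."""
    lifted = set()
    for u, v in edges:
        cu, cv = cluster_of(u), cluster_of(v)
        if cu != cv:
            lifted.add((cu, cv))
    return sorted(lifted)


print(lift(INTERACTIONS))
```

Lifting edges through the tree is what lets the same pathway be viewed at different granularities, which is the point made on slide 10.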

13 Compound Graph for Pheromone Signal Transduction Pathway
[Figure: compound graph. Node labels: GPCR ligand; G-protein coupled receptor complex; G protein (Alpha GPA1, Beta STE4, Gamma STE18); STE24; STE20/MKKKK; BEM1; CDC42 GDP; CDC42 GTP; STE5; MAPK scaffold structure; STE11/MKKK; STE7/MKK; MAPK; Non phos and Phos forms of STE11, STE7, FUS3 and KSS1; STE12; FAR1; Cell polarization; Cell cycle arrest; Transcription of Mating-specific genes]

16 Signal Diagram: membrane, exterior, cytosol; ANP, ANP receptor, guanylyl cyclase, GTP, cGMP, G-kinase, Ras, S/T, P
Signal Ontology
Molecular Function
–Receptor: G-protein coupled receptor, Receptor S/T kinase, …
–Enzyme
–…
Cellular Function
–Stress Response: Heat shock response, Oxidative stress response
Signal DB: covalent_modification, S/T phosphorylation, G-kinase, Ras, 3324, …
[Other diagram labels: Controlled Presentation; GEST / Signal XML; Controlled Vocabulary; Gene/Gene Product XML Database; Compound Graph (nodes A to J); Inference; Template of the Entries; Schema Definition]

17 SIGNAL-ONTOLOGY: ontology for cell signaling
SIGNAL MODULE: a unit of signal processing common to the model species
MOLECULAR FUNCTION: biochemical properties of a molecule
CELLULAR FUNCTION: a biological response performed by a set of molecules
REACTION: biochemical properties of a signaling reaction
MOLECULE, TISSUE, CELL, SPECIES
general in genome ontologies
~500 Terms
Terms are linked to Gene Ontology

18 [Diagram: Texts and Data and Knowledge bases connected through an Interface Representation. Labels: Defined by; Extracted from; Linguistic interface; Knowledge Interface; Ontology for Knowledge; Thesaurus]

19 Objectives
What should be extracted?
–Ontology for Fact Data bases and Ontology for NLP
–Linking texts with Fact Data Bases
Information extraction from texts
–Named Entity Recognition
–Event Recognition
Resource building
–Knowledge of the Domain
–Representation Language: Lattice-based Types

20 Difficulties in IE in Biology From the linguistic processing point of view

21 (1) Problem: Syntactic Variations
RAF6 activates NF-kappaB.
Lck is activated by autophosphorylation at Tyr 394.
Anandamide induces vasodilation by activating vanilloid receptors.
the activation of Rap1 by C3G
the GTPase-activating protein rhoGAP
the stress-activated group of MAP kinases
ACTIVATOR activate ACTIVATEE
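
To make the normalization target concrete, here is a toy Python sketch that maps three of the surface forms above onto the single frame ACTIVATOR activate ACTIVATEE. It uses plain regular expressions, which is a deliberate simplification and not the method the deck goes on to describe (a full HPSG parser plus patterns over argument structures). The pattern list and the function name `extract_activation` are hypothetical.

```python
import re

# Toy surface patterns. Group "a" is the ACTIVATOR, group "b" the ACTIVATEE.
PATTERNS = [
    re.compile(r"^the activation of (?P<b>.+?) by (?P<a>.+?)\.?$"),  # nominalization
    re.compile(r"^(?P<b>.+?) is activated by (?P<a>.+?)\.?$"),       # passive
    re.compile(r"^(?P<a>.+?) activates (?P<b>.+?)\.?$"),             # active
]


def extract_activation(text):
    """Return (activator, activatee) for the first matching pattern, else None."""
    cleaned = text.strip()
    for pattern in PATTERNS:
        m = pattern.match(cleaned)
        if m:
            return m.group("a"), m.group("b")
    return None


for sentence in [
    "RAF6 activates NF-kappaB.",
    "Lck is activated by autophosphorylation at Tyr 394.",
    "the activation of Rap1 by C3G",
    "the GTPase-activating protein rhoGAP",
]:
    print(sentence, "->", extract_activation(sentence))
```

The last sentence returns `None`: compound forms such as "GTPase-activating protein" need structural analysis rather than another string pattern, which is the motivation for parsing.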

22 (2) Embedded Relations between Events
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole

23 (3) Uncertain and Negative Information: Example ①
Example: “IFN alpha activated STAT 1, STAT 2 and STAT 3 in T cells, but no detectable activation of these STATs was induced by IL-2.”
Activation_Event:
EID := 95080245.2
Protein_1 := “IFN alpha”
Domain_1 := 
Protein_2 := “STAT1”, “STAT2”, “STAT3”
Domain_2 := 
Location := “T cells”
Definiteness := definite
Finding := new
Reaction_Type := 
Reaction_Path := direct
Mode := affirmative
SynSet := <“T cells”, “human peripheral blood-derived T cells”>

24 (3) Uncertain and Negative Information: Example ②
Example: “IFN alpha activated STAT 1, STAT 2 and STAT 3 in T cells, but no detectable activation of these STATs was induced by IL-2.”
Activation_Event:
EID := 95080245.4
Protein_1 := “IL-2”
Domain_1 := 
Protein_2 := “these STATs”
Domain_2 := 
Location := “T cells”
Definiteness := tentative
Finding := new
Reaction_Type := 
Reaction_Path := direct
Mode := negative
SynSet := <“T cells”, “human peripheral blood-derived T cells”>, <“these STATs”, “STAT1, STAT2, STAT3”>
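
The two templates filled from this one sentence can be sketched as records, with the SynSet used to resolve the anaphoric mention "these STATs". This is a minimal Python illustration under stated assumptions: the class `ActivationEvent`, its reduced field set, and the helper `resolve` are hypothetical, and fields whose values did not survive in the slide (Domain_1, Domain_2, Reaction_Type) are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ActivationEvent:
    """A reduced form of the Activation_Event template from the slides."""
    eid: str
    protein_1: str
    protein_2: Tuple[str, ...]
    location: str
    definiteness: str
    mode: str


# SynSet from Example 2: an anaphoric mention and what it stands for.
SYNSET: Dict[str, Tuple[str, ...]] = {
    "these STATs": ("STAT1", "STAT2", "STAT3"),
}

event_1 = ActivationEvent("95080245.2", "IFN alpha",
                          ("STAT1", "STAT2", "STAT3"),
                          "T cells", "definite", "affirmative")
event_2 = ActivationEvent("95080245.4", "IL-2",
                          ("these STATs",),
                          "T cells", "tentative", "negative")


def resolve(event, synset):
    """Expand anaphoric mentions in protein_2 using the SynSet."""
    resolved = []
    for mention in event.protein_2:
        resolved.extend(synset.get(mention, (mention,)))
    return tuple(resolved)


print(event_2.mode, resolve(event_2, SYNSET))
```

Keeping `mode` and `definiteness` as explicit fields is what lets a downstream database distinguish "IL-2 did not detectably activate these STATs" from a positive finding.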

25 IE System Using a Full Parser
Extraction of argument structures by applying a domain-independent (HPSG) parser plus a small number of domain-dependent patterns on the structures
[Diagram labels: Document; HPSG Parser; Pattern on Argument Structure; Information]

26 Result of Experiments
Argument Frame Extractor (PSB2001, A. Yakushiji et al.: GENIA Project)
133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences
Extracted Uniquely Extracted with ambiguity Parsing Failures Extractable from pp’s 31 32 26 Not extractable 27 Memory limitation, etc. 17 68%

27 Actual System Configuration
document
A) Term recognizer
B) Shallow parser
a. Chunking of domain-dependent terms
b. Reducing the number of lexical entries from information given by the shallow parser
Lexical Entry Generator
Lexical Entries (for parser)
STRING : “proteolytic enzymes”
WORD : “ENZYMEs”
POS : “N”
TEMPLATES : [ “3pl” ]
LEX_INFOS :

28 Difficulties in NE

29 Task difficulties in molecular-biology
–Inconsistent naming conventions, e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
–Wide-spread synonymy: many synonyms in wide usage, e.g. PKB and Akt
–Open, growing vocabulary for many classes
–Cross-over of names between classes depending on context
–Frequent use of coordination inside term formations
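
The first two difficulties call for different remedies: spelling variants can be collapsed by rule, while true synonyms need a lookup table. A small Python sketch, assuming a hypothetical `normalize` function and a one-entry synonym table built from the PKB/Akt pair on the slide:

```python
import re

# Hypothetical synonym table; the slide gives PKB and Akt as one synonymous pair.
SYNONYMS = {"pkb": "akt"}


def normalize(name):
    """Map spelling variants of a term to one lookup key."""
    key = name.lower()
    key = re.sub(r"[\s\-]+", "", key)          # drop spaces and hyphens
    key = key.replace("interleukin", "il")     # expand one known long form
    return SYNONYMS.get(key, key)


variants = ["IL-2", "IL2", "Interleukin 2", "Interleukin-2", "Il-2"]
print({normalize(v) for v in variants})
print(normalize("PKB") == normalize("Akt"))
```

Lowercasing is a blunt choice: in this domain case can carry meaning, so a real system would likely keep the original string alongside the key. Neither step addresses the open vocabulary or the class cross-over problems, which is why the deck turns to statistical taggers.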

30 Coordination in term formations
In this report we present evidence that the cell line NK3.3 derived from human NK cells, responds to both IL-2 and IL-12, as measured by increases in IFN-gamma and granulocyte-macrophage colony stimulating factor (GM-CSF) cytoplasmic mRNA and protein expression.
–IFN-gamma cytoplasmic mRNA expression
–Granulocyte-macrophage colony stimulating factor (GM-CSF) cytoplasmic mRNA expression
–IFN-gamma protein expression
–Granulocyte-macrophage colony stimulating factor (GM-CSF) protein expression
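
Once the coordinated parts have been identified, the four terms are simply the cross product of the coordinated heads and modifiers. A short Python sketch of that expansion step follows; it assumes the hard part (finding the coordination boundaries in the sentence) has already been done, and the variable names are illustrative.

```python
from itertools import product

# Coordinated parts of the phrase, as already segmented by hand.
heads = ["IFN-gamma",
         "granulocyte-macrophage colony stimulating factor (GM-CSF)"]
modifiers = ["cytoplasmic mRNA", "protein"]
tail = "expression"

# Outer loop over modifiers so the order matches the list on the slide.
expanded = [f"{head} {modifier} {tail}"
            for modifier, head in product(modifiers, heads)]

for term in expanded:
    print(term)
```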

31 Task difficulties in molecular-biology
Differences between traditional NE and term extraction
–Orthographic form: e.g. a mixture of upper-case letters and numerals is a strong indication of a protein or DNA entity.
–Terms have internal structure: “IL-2” is a protein; “IL-2 receptor alpha chain promoter” is a DNA.
–Term meaning is more contextually dependent: “IL-2” is a protein, but in some contexts, “..IL-2 promoter elements IL-2A ”..
–Role names like “Ah Receptor” can be used as protein names.

32 Model’s intuition
Class states: Start of sentence, protein, DNA, Source.ct, UNK, End of sentence
Example: Activation of JAK kinases and STAT proteins in human T lymphocytes.
Underlying process: UNK UNK PROTEIN PROTEIN UNK PROTEIN PROTEIN UNK SOURCE.ct SOURCE.ct SOURCE.ct UNK
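
The tagger assigns one class state per token, so named entities are recovered by grouping runs of identical non-UNK states. The Python sketch below does only that grouping step on the slide's example; it does not implement the HMM itself, and the function name `spans` is hypothetical.

```python
from itertools import groupby

tokens = "Activation of JAK kinases and STAT proteins in human T lymphocytes .".split()
tags = ["UNK", "UNK", "PROTEIN", "PROTEIN", "UNK", "PROTEIN", "PROTEIN", "UNK",
        "SOURCE.ct", "SOURCE.ct", "SOURCE.ct", "UNK"]


def spans(tokens, tags):
    """Group consecutive tokens that share a class into (text, class) spans."""
    out = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag != "UNK":
            out.append((" ".join(words), tag))
    return out


print(spans(tokens, tags))
```

One limitation of a plain per-token class sequence is visible here: two adjacent entities of the same class would be merged into one span, since nothing marks the boundary between them.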

33 Interpolating HMM model specification Character features:

34 Overcoming data sparseness with interpolation
[Figure: several copies of the state diagram (Start of sentence, Class states, End of sentence) added together into one Model]

35 Results for HMM (Coling 2000, N. Collier et al.: GENIA Project)
Table 1: F-score values for 5-fold cross-validation
F-score = (2 x Precision x Recall) / (Precision + Recall)

Class         #     Base   Base (no features)
Protein       2125  0.759  0.670 (-11.7%)
DNA           358   0.472  0.376 (-20.3%)
RNA           30    0.025  0.000 (-100.0%)
Source (all)  799   0.685  0.697 (+1.8%)
Source.cl     93    0.478  0.503 (+5.2%)
Source.ct     417   0.708  0.752 (+6.2%)
Source.mo     21    0.200  0.311 (+55.5%)
Source.mu     64    0.396  0.402 (+1.5%)
Source.vi     90    0.676  0.713 (+5.5%)
Source.sl     77    0.540  0.549 (+1.7%)
Source.ti     37    0.206  0.216 (+4.9%)
All classes   3312  0.728  0.651 (-10.6%)
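
The F-score formula above, and the bracketed percentages, can be checked with a few lines of Python. The function names are illustrative; the percentages appear to be the change of the no-features score relative to the base score, which is an inference from the numbers rather than something the slide states.

```python
def f_score(precision, recall):
    """F-score = (2 x Precision x Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_change(base, other):
    """Percentage change of `other` relative to `base`."""
    return (other - base) / base * 100


# Protein row: Base 0.759, Base (no features) 0.670.
print(round(relative_change(0.759, 0.670), 1))
# Source.mo row: Base 0.200, Base (no features) 0.311.
print(round(relative_change(0.200, 0.311), 1))
```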

36 Results for Decision tree (NLPRS 1999, C. Nobata et al.: GENIA Project)
Table 1: F-score values for 5-fold cross-validation
F-score = (2 x Precision x Recall) / (Precision + Recall)

Class    NE task  Classification only  Identification only
All      69.07    89.56                64.56
SOURCE   60.10    87.27                -
PROTEIN  73.16    92.26                -
DNA      42.65    77.03                -
RNA      21.62    63.64                -

37 Objectives
What should be extracted?
–Ontology for Fact Data bases and Ontology for NLP
–Linking texts with Fact Data Bases
Information extraction from texts
–Named Entity Recognition
–Event Recognition
Resource building
–Knowledge of the Domain
–Representation Language: Lattice-based Types

38 Resource Building Annotated Corpus Linguistic ontology (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)

39 GENIA ontology (Ontology of Substances)
+-name-+-source-+-natural-+-organism-+-multi-cell organism
       |        |         |          +-mono-cell organism
       |        |         |          +-virus
       |        |         +-tissue
       |        |         +-cell type
       |        |         +-sub-location of cells
       |        +-artificial-+-cell line
       +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group
                                        |       |         +-protein complex
                                        |       |         +-individual protein molecule
                                        |       |         +-subunit of protein complex
                                        |       |         +-substructure of protein
                                        |       |         +-domain or region of protein
                                        |       +-peptide
                                        |       +-amino acid monomer
                                        +-nucleic-+-DNA-+-DNA family or group
                                                  |     +-individual DNA molecule
                                                  |     +-domain or region of DNA
                                                  +-RNA-+-RNA family or group
                                                        +-individual RNA molecule
                                                        +-domain or region of RNA

40 Extension of Substance Ontology
Many terms in MEDLINE abstracts that constitute biological knowledge are not in the substance ontology.
We need to extend the ontology to cover broader ranges of terms.
[Method] Language-based ontology building: finding frequent verbs and examining what classes of arguments they take

41 Expansion of GENIA Ontology
Chemical class of substance and their substructures
Sources
Reaction
–Biological reaction
–Pathway
–Disease
Structure themselves
Experiment, experimental results, and researchers
Measure
Biological role, or function, of substances

42 Example of Entities in Expanded Ontology
Biological role, or function of substances
–receptor, inhibitor, …
Biological reaction
–activation, binding, inhibition, apoptosis, G2 arrest
–pathway, signal
–immune dysfunction, Ataxia telangiectasia (AT)
Structure themselves
–alpha-helix,
Experiment, experimental results, researchers
–our results, these studies, we

43 Verbs Related to Biological Events Frequent Verbs in 100 MEDLINE Abstracts

44 Verbs Related to Biological Events
Verbs that take biological entities as arguments
induce
–noun BE INDUCED BY noun: activation of these PROTEIN was induced by PROTEIN
–noun INDUCE noun: PROTEIN induced the tyrosine phosphorylation
bind
–noun BIND TO noun: the drugs bind to two different PROTEIN
–noun BIND noun: motifs previously found to bind the cellular factors
–noun BINDING noun: the TATA-box binding protein
–the BINDING of noun: the binding of PROTEIN
semantic class: substance, structure, source, experiment, fact, reaction

45 Verbs Related to Biological Events
Verbs whose arguments depend on syntactic patterns
show
–noun BE SHOWN to-infinitive: PROTEIN has been shown to trigger cellular PROTEIN activity
–noun SHOW that-clause: the data show that PROTEIN stimulation is also not sufficient
–noun SHOW noun: SOURCE showed a dose-dependent inhibition of PROTEIN activity
semantic class: substance, source, experiment, fact

46 Verbs Related to Biological Events
Verbs that take both entities
indicate
–noun INDICATE that-clause: the data indicate that PROTEIN is required in CELL proliferation
–noun INDICATE noun: these findings indicate an unexpected role of DNA
–noun INDICATE that-clause: the structure indicates that it represents a unique class of PROTEIN
–noun INDICATE noun: the structure indicates mechanisms for allosteric effector action
semantic class: substance, structure, source, experiment, fact, reaction, role

47 Example of Annotated Texts
UI - 85146267
TI - Characterization of aldosterone binding sites in circulating human mononuclear leukocytes.
AB - Aldosterone binding sites in human mononuclear leukocytes were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in RPMI-1640 medium, cells were incubated at 37 degrees C for 1 h with different concentrations of [3H]aldosterone plus a 100-fold concentration of RU-26988 ( 11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one ), with or without an excess of unlabeled aldosterone. Aldosterone binds to a single class of receptors with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of desoxycorticosterone = corticosterone = aldosterone greater than hydrocortisone greater than dexamethasone. The results indicate that mononuclear leukocytes could be useful for studying the physiological significance of these mineralocorticoid receptors and their regulation in humans.

48 Frequency Distribution of Classes: Distribution of Semantic Classes of NEs

49 Subclass distributions in major classes

50
TAG NAME                   sub class            Count
organism                   multi-cell organism  477
organism                   mono-cell organism   20
organism                   virus                153
tissue                     -                    213
cell type                  -                    1478
sub-location of cells      -                    79
other (natural source)     -                    1
cell line                  -                    695
other (artificial source)  -                    7
protein                    family or group      1172
protein                    complex              170
protein                    molecule             1181
protein                    subunit              65
protein                    substructure         29
protein                    domain or region     77
protein                    N/A                  98
peptide                    -                    40
amino acid monomer         -                    27
DNA                        family or group      29
DNA                        complex              0
DNA                        molecule             81
DNA                        subunit              0
DNA                        substructure         41
DNA                        domain or region     770
DNA                        N/A                  24
RNA                        family or group      13
RNA                        complex              0
RNA                        molecule             80
RNA                        subunit              0
RNA                        substructure         1
RNA                        domain or region     2
RNA                        N/A                  4
other polymer              -                    43
nucleic acid monomer       -                    47
lipid                      -                    1113
carbohydrate               -                    10
other organic compounds    -                    829
inorganic                  -                    29
atom                       -                    29
other name                 -                    2850
