NLP for Biomedicine - Ontology building and Text Mining - Junichi Tsujii GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/) Computer Science Graduate.


1 NLP for Biomedicine: Ontology Building and Text Mining. Junichi Tsujii, GENIA Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/), Computer Science, Graduate School of Information Science and Technology, University of Tokyo, Japan

2 My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks

4 Why NLP in Biomedicine?
– From Biology and Medical Sciences
– From Natural Language Processing

6 Genome sequencing (figure by D. Devos)

7 Sequence, structure and function (diagram labels: Sequence, Structure, Function, Information Exploitation)

8 Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or implication of proteins with previously characterized functions in a separate process. The use of available information (published papers, etc.) is a key step for the discovery process, since in many cases weak or indirect evidences about possible relations hidden in the literature are used to substantiate working hypothesis that are experimentally explored. [C.Blaschke, A.Valencia: 2001]

11 Why NLP in Biomedicine?
– From Biology and Medical Sciences
– From Natural Language Processing

12 Revolution in LT in the last decade Information Knowledge Language Texts Grammar Syntax-Semantic Mapping Interpretation based on Knowledge Machine Learning Knowledge Acquisition Statistical Biases Huge Ontology: Next Revolution ? Bio-Medical Application: UMLS, Gene Ontology, etc.

14 My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks

15 What can NLP do in biomedical domains? Examples

16 Protein-Protein Interaction extracted from texts by C. Blaschke

17 Organized Knowledge through terms by C. Blaschke

18 From Data to Understanding : Interpretation by Language Oliveros, Blaschke et al., GIW 2000

19 Information Extraction from Texts; Question Answering (QA) Systems

20 Characteristics of Signal Pathway (1)
Granularity of Knowledge Units
Different types of entities which are interrelated with each other:
– Cells, sub-locations of cells
– Proteins, substructures of proteins, subclasses of proteins
– Ions, other chemical substances
– Genes, RNA, DNA
Figure: G-protein coupled receptor pathway model, from TRANSPATH

21 CSNDB (National Institute of Health Sciences)
A data- and knowledge-base for signaling pathways of human cells.
– It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
– Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.
– CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
– The final goal is to make a computerized model of various biological phenomena.

22 Example 1: A Standard Reaction (excerpted from [Takai98])
Signal_Reaction: “EGF receptor → Grb2”
  From_molecule “EGF receptor”
  To_molecule “Grb2”
  Tissue “liver”
  Effect “activation”
  Interaction “SH2+phosphorylated Tyr”
  Reference [Yamauchi_1997]

23 Example 3: A Polymerization Reaction (excerpted from [Takai98])
Signal_Reaction: “Ah receptor + HSP90  ”
  Component “Ah receptor” “HSP90”
  Effect “activation dissociation”
  Interaction “PAS domain of Ah receptor”
  Activity “inactivation of Ah receptor”
  Reference [Powell-Coffman_1998]
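The CSNDB slide says pathways are compiled as binary relationships between biomolecules and drawn as graphs. A minimal sketch of that idea follows; the class name, the field names and the adjacency-map layout are illustrative assumptions, not the actual CSNDB schema or its ACEDB representation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalReaction:
    """One binary relationship between two biomolecules (hypothetical field names)."""
    from_molecule: str
    to_molecule: str
    effect: str
    tissue: str = ""
    interaction: str = ""
    reference: str = ""


def build_pathway_graph(reactions):
    """Return an adjacency map: source molecule -> list of (target molecule, effect)."""
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction.from_molecule].append((reaction.to_molecule, reaction.effect))
    return dict(graph)


# The standard reaction from the slide, re-expressed as a record.
reactions = [
    SignalReaction(
        from_molecule="EGF receptor",
        to_molecule="Grb2",
        effect="activation",
        tissue="liver",
        interaction="SH2+phosphorylated Tyr",
        reference="Yamauchi_1997",
    ),
]
pathway = build_pathway_graph(reactions)
print(pathway)
```

Storing each reaction as an edge keeps the graph drawable by any generic layout tool; multi-component reactions such as the Ah receptor example would need a richer record than this two-endpoint sketch.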

24 My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks

25 Theories in Science Observed Data Observable Non-Observable Data Mining

26 Objects of Science Knowledge In Mind Non-Observable Descriptions Of Knowledge Observable Observed Data Quantitative Data Mathematical Formula Qualitative, Structures, Classification Ontology Texts

27 Objects Of Science Knowledge In Mind Non-Observable Descriptions Of Knowledge Observable Natural Language Incomplete System Diversity Ambiguity

28 Theories in Science Observed Data Observable Non-Observable Data Mining

29 Objects of Science Knowledge In Mind Non-Observable Observable Observed Data Quantitative Data Mathematical Formula Qualitative, Structures, Classification Ontology Texts Descriptions Of Knowledge Data Mining + Text Mining

30 Knowledge in Mind Descriptions of Knowledge Observable Non-Observable Characteristics Of Language Text Mining Objects of science Data Mining Characteristics Of Knowledge

31 Objects Of Science Knowledge In Mind Non-Observable Descriptions Of Knowledge Observable Natural Language Incomplete System Diversity Ambiguity

33 My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks

34 Terms are the basic units of knowledge Classification, Features NE recognition Event Recognition Semantic Disambiguation

35 Task difficulties in molecular biology
Inconsistent naming conventions
– e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
– NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
Wide-spread synonymy
– Many synonyms in wide usage, e.g. PKB and Akt
– cyclin-dependent kinase inhibitor p27, p27kip1
Open, growing vocabulary for many classes
Cross-over of names between classes depending on context
– Protein vs DNA
Frequent use of coordination inside term formations
Diagram labels: Linking Problem, Diversity, Lexicon, Static Processing, Term Recognition, Ambiguity, Context Dependent, Dynamic Processing

36 Ambiguity
Abbreviation Extraction (Schwartz 2003)
– Extracts short and long form pairs
Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             amino acid
             amino acids
             anaemia
             anemia
             :
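A simplified sketch of short-form/long-form pairing in the spirit of the cited abbreviation extractor: characters of the short form are matched right to left against the text before a parenthesis, and the first character must start a word. This is a reduced illustration, not the published implementation; the function names and the parenthesis-only trigger are assumptions.

```python
import re


def find_long_form(short_form, candidate):
    """Return the shortest suffix of `candidate` matching `short_form`, or None."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Scan left until ch matches; the first short-form character
        # must additionally sit at the beginning of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    return candidate[l_idx + 1:]


def extract_pairs(sentence):
    """Find 'long form (SHORT)' patterns and return (short, long) pairs."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short = match.group(1).strip()
        long_form = find_long_form(short, sentence[:match.start()].rstrip())
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Release of arachidonic acid (AA) was measured."))
print(extract_pairs("Each amino acid (AA) was labelled."))
```

The same short form resolving to different long forms in different sentences is exactly the ambiguity the slide's table illustrates.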

37 Experiment [Tsuruoka et al., SIGIR 03]
Corpus
– MEDLINE: the largest collection of abstracts in the biomedical domain
Rule learning
– 83,142 abstracts
– Obtained rules: 14,158
Evaluation
– 18,930 abstracts
– Count the occurrences of each generated variant.

38 Results: “NF-kappa B”
Generation Probability   Generated Variants   Frequency
1.0 (Input)              NF-kappa B           857
0.417                    NF-kappaB            692
0.417                    nF-kappa B           0
0.337                    Nf-kappa B           0
0.275                    NF kappa B           25
0.226                    NF-kappa b           0
:                        :                    :

39 Results: “antiinflammatory effect”
Generation Probability   Generated Variants          Frequency
1.0 (Input)              antiinflammatory effect     7
0.462                    anti-inflammatory effect    33
0.393                    antiinflammatory effects    6
0.356                    Antiinflammatory effect     0
0.286                    antiinflammatory-effect     0
0.181                    anti-inflammatory effects   23
:                        :                           :

40 Results: “tumour necrosis factor alpha”
Generation Probability   Generated Variants             Frequency
1.0 (Input)              tumour necrosis factor alpha   15
0.492                    tumor necrosis factor alpha    126
0.356                    tumour necrosis factor-alpha   30
0.235                    Tumour necrosis factor alpha   2
0.175                    tumor necrosis factor alpha    182
0.115                    Tumor necrosis factor alpha    8
:                        :                              :
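The result tables pair each generated variant with a generation probability. One way to picture this is a set of substitution rules, each with a probability, where a variant's score is the product of the rules applied to reach it. The sketch below is only that picture: the three rules and their probabilities are invented for illustration and are not the rules learned in the cited experiment.

```python
def generate_variants(term, rules, min_prob=0.1):
    """Expand `term` by repeatedly applying (old, new, probability) rules.

    Returns {variant: generation probability}; the input itself gets 1.0.
    """
    variants = {term: 1.0}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for old, new, prob in rules:
            if old not in current:
                continue
            candidate = current.replace(old, new, 1)
            p = variants[current] * prob
            # Keep a candidate only if it clears the threshold and
            # improves on any probability already recorded for it.
            if p >= min_prob and p > variants.get(candidate, 0.0):
                variants[candidate] = p
                frontier.append(candidate)
    return variants


# Hypothetical rules: (substring to replace, replacement, probability).
RULES = [
    ("-", " ", 0.3),
    (" B", "B", 0.4),
    ("tumour", "tumor", 0.5),
]

variants = generate_variants("NF-kappa B", RULES)
for variant, prob in sorted(variants.items(), key=lambda item: -item[1]):
    print(f"{prob:.3f}  {variant}")
```

Counting how often each generated string occurs in a held-out corpus, as the evaluation slide describes, would then show which high-probability variants are actually used.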

42 Task difficulties in molecular biology
Inconsistent naming conventions
– e.g. IL-2, IL2, Interleukin 2, Interleukin-2, Il-2
– NF kappa B, NF-kappa B, (NF)-kappa B, NF-Kappa B, …
Wide-spread synonymy
– Many synonyms in wide usage, e.g. PKB and Akt
– cyclin-dependent kinase inhibitor p27, p27kip1
Open, growing vocabulary for many classes
Cross-over of names between classes depending on context
– Protein vs DNA
Frequent use of coordination inside term formations
Diagram labels: Linking Problem, Diversity, Lexicon, Static Processing, Term Recognition, Ambiguity, Context Dependent, Dynamic Processing

43 Genia Ontology: Substance
+-substance-+-compound-+-organic-+-nucleic_acid-+-poly_nucleotides
|           |          |         |              +-nucleotide
|           |          |         |              +-DNA
|           |          |         |              +-RNA
|           |          |         +-amino_acid-+-peptide
|           |          |         |            +-amino_acid_monomer
|           |          |         |            +-protein
|           |          |         +-lipid
|           |          |         +-carbohydrate
|           |          |         +-other_organic_compounds
|           |          +-inorganic
|           +-atom

44 Genia Ontology: Source
+-source-+-natural-+-organism-+-multi_cell
|        |         |          +-mono_cell
|        |         |          +-virus
|        |         +-body_part
|        |         +-tissue
|        |         +-cell_type
|        +-artificial-+-cell_line
|                     +-other_artificial_sources
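The substance and source trees can be held in a nested dictionary so that a tagger or a lexicon can ask whether one class falls under another. The sketch below copies the class names from the two slides; the dictionary layout and the helper names `path_to` and `is_a` are assumptions made for illustration.

```python
ONTOLOGY = {
    "substance": {
        "compound": {
            "organic": {
                "nucleic_acid": {
                    "poly_nucleotides": {}, "nucleotide": {}, "DNA": {}, "RNA": {},
                },
                "amino_acid": {
                    "peptide": {}, "amino_acid_monomer": {}, "protein": {},
                },
                "lipid": {},
                "carbohydrate": {},
                "other_organic_compounds": {},
            },
            "inorganic": {},
        },
        "atom": {},
    },
    "source": {
        "natural": {
            "organism": {"multi_cell": {}, "mono_cell": {}, "virus": {}},
            "body_part": {},
            "tissue": {},
            "cell_type": {},
        },
        "artificial": {"cell_line": {}, "other_artificial_sources": {}},
    },
}


def path_to(node, tree=ONTOLOGY, prefix=()):
    """Return the tuple of class names from a root down to `node`, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == node:
            return path
        found = path_to(node, children, path)
        if found:
            return found
    return None


def is_a(node, ancestor):
    """True if `ancestor` lies on the path from a root to `node` (inclusive)."""
    path = path_to(node)
    return path is not None and ancestor in path


print(path_to("protein"))
print(is_a("cell_line", "source"), is_a("DNA", "source"))
```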

45 Number of Tagged Objects
Texts: 2,500 MEDLINE abstracts
– Papers on transcription factors in human blood cells
– 550,000 words, 20,000 sentences
Tagged objects: 147,000
– Protein: ~77,000
– DNA: ~24,000
– RNA: ~2,400
– Source: ~27,000
– Other: ~37,000

46 Distributions of Semantic Classes

47 Extension of GENIA Ontology
Small classes (to be embedded in UMLS)
– 5242 terms labelled with ‘other_names’ class
Events, Biological reactions: 3800
Disease: 636
– Names of Diseases: 501
– Treatments: 61
– Diagnoses: 52
– Pathology: 3
– Others: 39
Experiments: 578
– Methods: 493
– Materials: 25
– Others: 60
Others: 228

48 Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)
Recognize “names” in the text: identify and classify
– Technical terms expressing proteins, genes, cells, etc.
Example: Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
Class labels shown in the figure: DNA, PROTEIN, DNA, CELLTYPE

49 NE Task as Classification
Classify each word to a class (tag) representing the semantic class and the position in the term
– The task is reduced to a tagging task
– We can use methods developed for tagging
– The structure is encoded in a tag
BIO (Begin, Inside, and Other) tagging
Figure: the words of a term of class X are tagged B-X I-X, a term of class Y starts with B-Y, and all remaining words are tagged O (OTHER)
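The reduction from term recognition to tagging can be sketched as a pair of conversions between term spans and per-word BIO tags. The span format `(start, end_exclusive, class)` and the function names are assumptions; the example phrase and its DNA label come from the tagging illustration in the slides.

```python
def to_bio(tokens, spans):
    """Encode term spans (start, end_exclusive, label) as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def from_bio(tags):
    """Decode BIO tags back into (start, end_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing term
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["activity", "of", "class", "II", "promoters", "in"]
tags = to_bio(tokens, [(2, 5, "DNA")])
print(list(zip(tokens, tags)))
print(from_bio(tags))
```

Because the term boundaries are recoverable from the tags alone, any word-level classifier can be used for the recognition task.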

50 NE Tagging Illustrated
Classify a word depending on the context: context, conversion to features, classifier
Words:    activity of class II promoters in
POS tags: N P N Sym Ns P
BIO tags: O O B-DNA I-DNA (tags shown in the figure)
Deterministic tagging:
– Only the most probable tag at each word (SVM)
The Viterbi tagging:
– The most probable sequence among all (probabilistic models)
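The difference between the two decoding strategies can be seen in a small sketch. Per-word tag scores and tag-to-tag transition scores are given as log values; all numbers below are made up for illustration, and the greedy function stands in for per-word deterministic tagging in general rather than for an SVM specifically.

```python
import math


def greedy(emissions, tags):
    """Pick the best-scoring tag at each word independently."""
    return [max(tags, key=lambda t: e.get(t, -math.inf)) for e in emissions]


def viterbi(emissions, transitions, tags):
    """Return the best-scoring tag sequence.

    emissions: list of {tag: log score}, one dict per word.
    transitions: {(previous tag, tag): log score}; missing pairs score 0.0.
    """
    best = [{t: emissions[0].get(t, -math.inf) for t in tags}]
    back = []
    for scores in emissions[1:]:
        column, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, t), 0.0))
            column[t] = (best[-1][prev] + transitions.get((prev, t), 0.0)
                         + scores.get(t, -math.inf))
            pointers[t] = prev
        best.append(column)
        back.append(pointers)
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B-DNA", "I-DNA"]
# An I- tag may not directly follow O.
TRANSITIONS = {("O", "I-DNA"): -math.inf}
EMISSIONS = [
    {"O": -0.1, "B-DNA": -3.0, "I-DNA": -3.0},
    {"O": -0.9, "B-DNA": -1.0, "I-DNA": -3.0},
    {"O": -2.0, "B-DNA": -2.5, "I-DNA": -0.2},
]

print(greedy(EMISSIONS, TAGS))
print(viterbi(EMISSIONS, TRANSITIONS, TAGS))
```

In this toy input the per-word choice yields an I-DNA tag directly after O, which is not a well-formed term, while the sequence-level search trades a slightly worse score on the second word for a consistent B-DNA I-DNA term.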

51 The GENIA Corpus [Tateishi HLT02, Ohta PSB00, ISMB02]
Annotated MEDLINE abstracts: a gold standard for biomedical NLP tasks
# of abstracts:          670
# of sentences:          5,109
# of tokens (words):     152,216
# of named entities:     23,793
# of semantic classes:   24
– 2,000-abstract version soon
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
Big enough to: make SVM usage nontrivial
Small enough to: make sparseness serious

52 The ME Method
Maximum Entropy model (the equation on the slide is not preserved in this transcript)
– Feature functions, each with a weight
– Feature function: defined over Context, Tag and the target term; same as the features in SVMs
The Viterbi algorithm is used for tagging
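For reference, the usual textbook form of a maximum entropy tagging model is given below. The slide's own equation did not survive, so the notation is an assumption: $c$ is the context, $t$ the tag, $f_i$ a feature function, $\lambda_i$ its weight and $Z(c)$ the normalising constant.

```latex
\[
p(t \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_i \lambda_i f_i(c, t) \Big),
\qquad
Z(c) = \sum_{t'} \exp\Big( \sum_i \lambda_i f_i(c, t') \Big)
\]
```

Because each $p(t \mid c)$ is a proper probability, the per-word scores can be combined over a sentence and searched with the Viterbi algorithm, as the slide notes.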

53 SOHMM modeling (J. Kim et al., ACL03)
SOHMM modeling
– No assumption is made arbitrarily.
– Instead, a context classification function is induced from a corpus.
SOHMM learning
– Inducing the context classification function
– Estimating parameters
Figure notes:
– A set of contextual feature values which are visible at the moment of predicting.
– A classification function from sets of contextual feature values to context patterns grouped appropriately.

54 Experimental Results
Biological source recognition; Biological substance recognition (tables in the order they appear on the slide)
Matching method        precision   recall   F-score
hard matching          59.72       68.92    63.99
soft matching left     63.23       72.97    67.75
soft matching right    61.36       70.81    65.75
soft matching either   64.87       74.86    69.51
Matching method        precision   recall   F-score
hard matching          73.76       66.92    70.17
soft matching left     77.64       70.67    73.99
soft matching right    75.19       68.22    71.54
soft matching either   79.07       71.98    75.36
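The four matching methods can be illustrated with a small scorer. The slides do not define soft matching, so the reading used here is an assumption: "left" requires only the left boundary to agree, "right" only the right boundary, "either" at least one of them, and "hard" both. Spans are `(start, end_exclusive, class)` tuples and the example data is invented.

```python
def span_matches(pred, gold, mode):
    """Compare two (start, end, label) spans under a matching mode."""
    if pred[2] != gold[2]:
        return False
    left = pred[0] == gold[0]
    right = pred[1] == gold[1]
    if mode == "hard":
        return left and right
    if mode == "left":
        return left
    if mode == "right":
        return right
    if mode == "either":
        return left or right
    raise ValueError(f"unknown mode: {mode}")


def evaluate(predicted, gold, mode="hard"):
    """Return (precision, recall, F-score) for predicted vs gold spans."""
    hit_pred = sum(any(span_matches(p, g, mode) for g in gold) for p in predicted)
    hit_gold = sum(any(span_matches(p, g, mode) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = [(2, 5, "DNA"), (7, 8, "protein")]
predicted = [(3, 5, "DNA"), (7, 8, "protein")]  # first term misses its left boundary
for mode in ("hard", "left", "right", "either"):
    print(mode, evaluate(predicted, gold, mode))
```

A prediction that clips one word off a long term is counted as wrong under hard matching but recovered under the appropriate soft mode, which is why the soft rows in the tables score higher.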

55 Event Recognition Identity of events in our mind Disambiguation of different events by context

56 Problem: Syntactic Variations
– RAF6 activates NF-kappaB.
– Lck is activated by autophosphorylation at Tyr 394.
– Anandamide induces vasodilation by activating vanilloid receptors.
– the activation of Rap1 by C3G
– the GTPase-activating protein rhoGAP
– the stress-activated group of MAP kinases
Common frame: ACTIVATOR activate ACTIVATEE
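To make the target representation concrete, the sketch below maps three of the surface forms onto one ACTIVATOR/ACTIVATEE frame using plain regular expressions. This is a deliberately naive stand-in: the talk's own approach relies on parsing, and the patterns, the name character class and the function name here are all assumptions that cover only the simplest variations.

```python
import re

NAME = r"[\w-]+"  # crude approximation of a one-word entity name

PATTERNS = [
    # active voice: X activates Y
    re.compile(rf"(?P<activator>{NAME}) activates (?P<activatee>{NAME})"),
    # passive voice: Y is activated by X
    re.compile(rf"(?P<activatee>{NAME}) is activated by (?P<activator>{NAME})"),
    # nominalisation: the activation of Y by X
    re.compile(rf"activation of (?P<activatee>{NAME}) by (?P<activator>{NAME})"),
]


def extract_activation(text):
    """Return {'activator': ..., 'activatee': ...} for the first matching pattern."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return {
                "activator": match.group("activator"),
                "activatee": match.group("activatee"),
            }
    return None


for sentence in (
    "RAF6 activates NF-kappaB.",
    "Lck is activated by autophosphorylation at Tyr 394.",
    "the activation of Rap1 by C3G",
):
    print(extract_activation(sentence))
```

Forms such as "by activating vanilloid receptors" or "the stress-activated group of MAP kinases" fall outside these patterns, which illustrates why surface matching alone does not scale to the full range of variations listed.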

57 Verbs Related to Biological Events Frequent Verbs in 100 MEDLINE Abstracts

58 Argument Frame Extractor 133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences Extracted Uniquely Extracted with ambiguity Parsing Failures Extractable from pp’s 31 32 26 Not extractable 27 Memory limitation, etc. 17 68%

59 My Talk
1. Background: Why NLP in Biomedicine
2. Examples of NLP in Biomedicine
3. Text Mining and NLP
4. Our Current Work
   4.1 Terms and NE
   4.2 Resource Building
   4.3 Event Recognition
5. Concluding Remarks

60 Revolution in LT in the last decade Information Knowledge Language Texts Grammar Syntax-Semantic Mapping Interpretation based on Knowledge Machine Learning Knowledge Acquisition Statistical Biases Huge Ontology: Next Revolution ? Bio-Medical Application: UMLS, Gene Ontology, etc.

62 Genome sequencing (figure by D. Devos). Actual demands in the real world, with more homogeneous user groups and more concrete criteria for evaluating results

63 Resources available (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
– Medline abstracts (4000, about 1 million words)
– GENIA ontology
– POS tags
– Semantic tags
– Structural tags
– Co-reference annotations with a Singaporean team
– Lexical resources mapped to existing ontology

