Literature Mining and Ontology BMI/IBGP 730 Autumn, 2011 Yang Xiang, Ph.D. in Computer Science Department of Biomedical.

1 Literature Mining and Ontology BMI/IBGP 730 Autumn, 2011 Yang Xiang, Ph.D. in Computer Science yxiang@bmi.osu.edu Department of Biomedical Informatics The Ohio State University

2 Outline What is Literature Mining? – Popular Tools for Literature Mining – Basic Techniques – Information Retrieval (Indexing): Expediting searching – Linguistic Processing – Other Processing What is Ontology? – Simple ontology examples – Gene ontology – Unified Medical Language System – Use and index ontology Applications of Literature Mining and Ontology

4 What is Literature (Text) Mining? The purposes of Literature Mining – Find relevant documents – Discover knowledge (what is knowledge?) e.g. opinion mining (sentiment analysis) e.g. document similarity The advantages of computer-based Literature Mining – Simply, computers can search many more documents! – Computers can ‘think’ and discover knowledge. We will focus on biomedical literature mining in what follows.

5 Why Is Literature Mining So Popular in Biomedical Science? Biomedical science studies natural subjects. – Species – Genes – Phenotypes – Diseases ….

7 Popular Tools for Biomedical Literature Mining – Document search Google – Google Scholar: http://scholar.google.com ISI Web of Knowledge – www.isiknowledge.com PubMed – www.ncbi.nlm.nih.gov/pubmed Scopus – www.scopus.com

8 Tools for Biomedical Literature Mining – Knowledge discovery The Gene Ontology – http://www.geneontology.org/ Gene answer – www.geneanswers.com

10 Techniques Behind Literature Mining Interdisciplinary – Computer Science Information retrieval Data mining Natural Language Processing Machine learning – Library Science – Biomedical Science – Linguistics Computational linguistics – Statistics – And more! Two main research areas (some overlaps) – Information Retrieval – Natural Language Processing

11 Basic Text Search Algorithm Assume the text size is n and the search string size is m. How do we design an efficient algorithm to find all matches in the text? – Brute-force algorithm, O(mn). – Boyer-Moore heuristics, O(mn) in the worst case, but fast in most cases for English text. – KMP (Knuth-Morris-Pratt) algorithm, O(m+n). [Figure: the string to match, "world", aligned against the text "Hello,world"]
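The KMP approach listed above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function names `build_failure` and `kmp_search` are chosen here for the example.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find later matches
    return matches


print(kmp_search("Hello,world", "world"))
```

The text pointer `i` never moves backwards; only `k` falls back through the failure table, which is where the O(m+n) bound comes from.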

13 Information Retrieval (Indexing) Archiving (preprocessing) documents for fast search – Preprocessing time – Query time – Index size
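One common way to preprocess documents for fast search is an inverted index, which maps each word to the documents containing it. The sketch below is an illustration under simple assumptions (whitespace tokenization, lowercase matching, AND queries); the sample documents and the names `build_index` and `search` are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Preprocessing step: map each word to the set of ids of the
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Query step: return ids of documents containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-collection, for illustration only.
docs = {
    1: "gene interacts with protein",
    2: "protein binds DNA",
    3: "gene regulates DNA repair",
}
index = build_index(docs)
print(search(index, "gene DNA"))
```

The trade-off named on the slide shows up directly: building the index costs time and space up front, while each query only touches the posting sets of its own words.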

15 Programming language processing (C++, Java, etc.) Lexical analysis of the statement y=x+10;
lexeme: token type
y: identifier
=: assignment operator
x: identifier
+: addition operator
10: number
;: end of statement
Syntax analysis [Figure: syntax tree for y=x+10; the assignment operator = joins the identifier y to an expression, which is built from the identifier x, the addition operator + and the number 10]
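The lexical analysis step can be sketched with regular expressions. This is a toy tokenizer written for this example; it only knows the token types that appear on the slide, and the name `tokenize` is an assumption.

```python
import re

# Token types from the slide; "skip" (whitespace) is added here.
TOKEN_SPEC = [
    ("number", re.compile(r"\d+")),
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("assignment operator", re.compile(r"=")),
    ("addition operator", re.compile(r"\+")),
    ("end of statement", re.compile(r";")),
    ("skip", re.compile(r"\s+")),
]


def tokenize(source):
    """Split source into (lexeme, token type) pairs, left to right."""
    tokens = []
    pos = 0
    while pos < len(source):
        for token_type, pattern in TOKEN_SPEC:
            match = pattern.match(source, pos)
            if match:
                if token_type != "skip":
                    tokens.append((match.group(), token_type))
                pos = match.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
    return tokens


for lexeme, token_type in tokenize("y=x+10;"):
    print(lexeme, token_type)
```

Syntax analysis would then take this token stream and build the tree; that step is not shown here.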

16 Natural Language Processing Lexical level – Stemming (including lemmatizing): find the root of a word, e.g. swimming, swam, swim, swimmer → swim – Stemming rules may vary (balance between overstemming and understemming) – Typical algorithm: Porter stemming algorithm – Alias, Synonym Grammatical level – Parsing “…We find Gene1 interacts with Gene2…” [Figure: parse tree in which the Sentence splits into a Noun phrase (Gene1) and a Verb phrase, and the Verb phrase into a Verb (interact) and a Noun phrase (Gene2)]
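To make the stemming idea concrete, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm; the suffix list, the irregular-form table and the name `toy_stem` are all assumptions made for this example.

```python
# Irregular forms cannot be handled by suffix rules, so a lookup table
# (closer to lemmatizing) covers them. Entries are illustrative only.
IRREGULAR = {"swam": "swim", "swum": "swim"}
SUFFIXES = ("ing", "er", "s")


def toy_stem(word):
    """Strip one known suffix, then undo a doubled final consonant."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # "swimm" becomes "swim"
            return stem
    return word


for w in ["swimming", "swam", "swim", "swimmer"]:
    print(w, toy_stem(w))
```

The rules are crude on purpose: `toy_stem("winter")` strips the `er` and returns `wint`, a small instance of the overstemming that real stemmers have to balance against understemming.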

18 Statistical and Data Mining Processing Statistical – Count the word frequency – Count the expression frequency Data Mining – Mining the set of frequent words – Association Rule Mining
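The counting and mining steps above can be sketched on a toy corpus. The sentences are invented for illustration, and the helper names (`word_frequency`, `frequent_pairs`, `confidence`) are assumptions; real association rule mining would use an algorithm such as Apriori rather than enumerating all pairs.

```python
from collections import Counter
from itertools import combinations


def word_frequency(docs):
    """Statistical step: count how often each word occurs overall."""
    return Counter(w for d in docs for w in d.lower().split())


def frequent_pairs(docs, min_support):
    """Data mining step: word pairs that co-occur in at least
    min_support documents."""
    pair_counts = Counter()
    for d in docs:
        words = sorted(set(d.lower().split()))
        pair_counts.update(combinations(words, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


def confidence(docs, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: the fraction of
    documents containing the antecedent that also contain the consequent."""
    word_sets = [set(d.lower().split()) for d in docs]
    with_antecedent = [s for s in word_sets if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Hypothetical corpus, for illustration only.
docs = [
    "coffee lowers risk",
    "coffee raises anxiety",
    "coffee lowers cholesterol risk",
]
print(word_frequency(docs).most_common(3))
print(frequent_pairs(docs, min_support=2))
print(confidence(docs, "lowers", "risk"))
```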

19 Document Classification E.g., classify all documents related to coffee and health. Various machine learning algorithms can be applied here. [Figure: coffee-and-health documents split into documents showing benefits (Cardioprotective, Laxative, …) and documents showing risks (Cholesterol, …, Anxiety)]
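As one example of a machine learning algorithm that fits this task, the sketch below uses a small multinomial Naive Bayes classifier with add-one smoothing. The slide does not name a specific algorithm, so this choice is an assumption, and the training sentences and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts and the
    vocabulary from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_doc_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior score."""
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0).
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
training = [
    ("coffee is cardioprotective", "benefit"),
    ("coffee acts as a laxative", "benefit"),
    ("coffee raises cholesterol", "risk"),
    ("coffee increases anxiety", "risk"),
]
model = train(training)
print(classify(model, "cardioprotective coffee"))
print(classify(model, "anxiety cholesterol"))
```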

20 Accuracy vs Relevancy in Pattern Recognition/Machine Learning
Precision = |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|
Recall = |{relevant docs} ∩ {retrieved docs}| / |{relevant docs}|
Fall-out = |{nonrelevant docs} ∩ {retrieved docs}| / |{nonrelevant docs}|
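These three measures translate directly into set operations. The sketch below assumes documents are identified by integers and that the whole collection is known; the function name `retrieval_metrics` and the sample sets are invented for the example.

```python
def retrieval_metrics(relevant, retrieved, collection):
    """Return (precision, recall, fall_out) for one query."""
    relevant = set(relevant)
    retrieved = set(retrieved)
    nonrelevant = set(collection) - relevant
    hits = relevant & retrieved
    # Each ratio is defined as 0.0 when its denominator set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    fall_out = (
        len(nonrelevant & retrieved) / len(nonrelevant) if nonrelevant else 0.0
    )
    return precision, recall, fall_out


# Hypothetical collection of ten documents, four of them relevant.
collection = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}
precision, recall, fall_out = retrieval_metrics(relevant, retrieved, collection)
print(precision, recall, fall_out)
```

Here two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents were found (recall 1/2), and one of the six nonrelevant documents was retrieved (fall-out 1/6).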

22 Ontology In philosophy, ontology is a systematic account of existence. In information science, an ontology is a representation of concepts and their relationships, often as a directed graph.

23 Ontology Example (Informal) [Figure: informal ontology rooted at fish, with nodes including fresh water, salt water, North American, Asian, Europe, Common Carp, Mirror Carp, Crappie, invasive and native]

24 Ontology Example: Scientific classification
Kingdom: Animalia
Phylum: Chordata, Hemichordata, …
Class: Actinopterygii, Sarcopterygii, …
Subclass: Neopterygii, Chondrostei, …
Infraclass: Teleostei, …
Order: Cypriniformes, …
Family: Cyprinidae, …

26 Gene Ontology (GO) Consortium [Figure: fragment of the GO graph with the nodes Molecular function, Nucleic acid binding, enzyme, helicase, DNA binding, DNA helicase, ATP-dependent DNA helicase, DNA metabolis, cell, …] Reference: Gene Ontology: tool for the unification of biology, Nature Genetics, 2000. http://dx.doi.org/10.1038/75556

28 Unified Medical Language System (UMLS) A compendium of controlled vocabularies in the biomedical sciences (since 1986). It contains: – Metathesaurus – Semantic Network – SPECIALIST Lexicon UMLS contains more than ontologies. Maintained by the US National Library of Medicine. Website: http://www.nlm.nih.gov/research/umls/

29 UMLS - Metathesaurus Number of biomedical concepts > 1 million, drawn from over 100 incorporated controlled source vocabularies: – ICD (International Statistical Classification of Diseases and Related Health Problems) – MeSH (Medical Subject Headings) – SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) – LOINC (Logical Observation Identifiers Names and Codes) – Gene Ontology – OMIM (Online Mendelian Inheritance in Man) … http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

30 UMLS - Semantic Network Semantic types (categories) – Entity > Physical Object > Organism … – Event > Activity > Behavior … Semantic relationships (connecting two concepts) – isa – associated_with > physically_related_to > part_of … – associated_with > spatially_related_to > location_of … http://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html http://www.clres.com/semrels/umls_relation_list.html [Figure: example relationships among Drug A, Disease B and Gene A, with edges labeled treats, treated_by and disease_is_marked_by_gene]

32 Use of ontology systems Statistical – Gene ontology enrichment test Indexing – Reachability – Distance – Path
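A Gene Ontology enrichment test is often formulated as a one-sided hypergeometric test: given a gene list, how surprising is the number of genes annotated with a particular GO term? The slide does not specify the test, so the formulation below is an assumption, and the parameter names are chosen for this example.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric.

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     size of the gene list under study
    hits:       genes in the list annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of seeing k annotated genes for every k >= hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 4 annotated, a list of 3 with 2 hits.
print(enrichment_p_value(population=10, annotated=4, sample=3, hits=2))
```

With these toy numbers the tail is (6·6 + 4·1) / 120 = 1/3, so two hits out of three would not be considered enriched. In practice the test is repeated for many GO terms, which calls for a multiple-testing correction.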

33 Represent Ontology by Graphs – Directed Graph – Directed Acyclic Graph (DAG): most ontologies fall into this type – Directed Tree [Figure: example drawings of a directed graph, a DAG and a tree]

34 Reachability The problem: Given two vertices u and v in a directed graph G, is there a path from u to v? [Figure: example directed graph on vertices 1 to 15] ?Query(1,11) Yes ?Query(3,9) No
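Without any index, a reachability query can be answered by a graph search from u. The sketch below uses an iterative depth-first search; the edges of the slide's 15-vertex figure are not recoverable from the transcript, so the small graph here is hypothetical.

```python
def reachable(graph, u, v):
    """Return True if there is a directed path from u to v.

    graph maps each vertex to a list of its out-neighbours.
    A vertex is considered reachable from itself.
    """
    stack, seen = [u], {u}
    while stack:
        node = stack.pop()
        if node == v:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Hypothetical DAG, not the graph from the slide.
graph = {1: [2, 3], 2: [4], 3: [5], 4: [6], 5: [], 6: []}
print(reachable(graph, 1, 6))
print(reachable(graph, 3, 6))
```

Each query costs time linear in the size of the graph in the worst case, which is what indexing schemes try to avoid on very large graphs.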

35 Distance The problem: Given two vertices u and v in a (directed) graph G, what is the distance from u to v? [Figure: example directed graph on vertices 1 to 15] ?Query d_G(1, 11) = 3

36 Path The problem: Given two vertices u and v in a (directed) graph G, what is a path (or what are the paths) connecting u to v? [Figure: example directed graph on vertices 1 to 15] Example: find a path from 1 to 11.
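For unweighted graphs, both the distance query and the path query can be answered with one breadth-first search that records parent pointers. This is an index-free baseline written for illustration; the graph is hypothetical, since the edges of the slide's figure are not available in the transcript.

```python
from collections import deque


def shortest_path(graph, u, v):
    """Return one shortest path from u to v as a list of vertices,
    or None if v is not reachable from u."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            # Walk the parent pointers back to u.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def distance(graph, u, v):
    """Number of edges on a shortest path from u to v, or None."""
    path = shortest_path(graph, u, v)
    return None if path is None else len(path) - 1


# Hypothetical DAG, not the graph from the slide.
graph = {1: [2, 3], 2: [4], 3: [5], 4: [6], 5: [], 6: []}
print(shortest_path(graph, 1, 6))
print(distance(graph, 1, 6))
```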

37 The estimated difficulty of building very efficient graph indexing schemes (based on current research)
Directed Tree: Reachability easy, Distance easy, Path easy
Directed Acyclic Graph: Reachability medium, Distance hard, Path hard
Directed Graph: Reachability medium, Distance hard, Path hard
References:
R. Jin, Y. Xiang, N. Ruan, H. Wang, "Efficiently Answering Reachability Queries on Very Large Directed Graphs", Proc. of ACM SIGMOD Conference, Vancouver, June 9-12, 2008, pp. 595-608.
R. Jin, Y. Xiang, N. Ruan, D. Fuhry, "3-HOP: A High-Compression Indexing Scheme for Reachability Query", Proc. of ACM SIGMOD Conference, Providence, Rhode Island, June 29-July 2, 2009, pp. 813-826.
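One reason reachability on a directed tree is rated easy is the classic interval-labeling trick: a single depth-first traversal gives every node an interval, and u reaches v exactly when v's start number falls inside u's interval. The sketch below shows that textbook technique; it is not the indexing scheme of either cited paper, and the tree and function names are invented for the example.

```python
def label_tree(children, root):
    """Assign each node a half-open interval (start, end) such that the
    interval of a node contains the start of every descendant."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in children.get(node, ()):
            visit(child)
        intervals[node] = (start, counter)

    visit(root)
    return intervals


def tree_reachable(intervals, u, v):
    """Constant-time query: is v in the subtree rooted at u?"""
    u_start, u_end = intervals[u]
    v_start, _ = intervals[v]
    return u_start <= v_start < u_end


# Hypothetical tree: a -> b, a -> c, b -> d.
children = {"a": ["b", "c"], "b": ["d"]}
intervals = label_tree(children, "a")
print(intervals)
print(tree_reachable(intervals, "a", "d"))
print(tree_reachable(intervals, "b", "c"))
```

The index takes linear time and space to build and answers each query with two comparisons. The recursive traversal is fine for a sketch but would need an explicit stack for very deep trees.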

39 Applications of Literature Mining and Ontology - I Build confirmed gene-phenotype relations – Human Phenotype Ontology (HPO) – Built from the Online Mendelian Inheritance in Man (OMIM) database. – http://human-phenotype-ontology.org/ Reference: Robinson PN, Mundlos S. The Human Phenotype Ontology. Clinical Genetics 77(6) 2010: 525–534. http://dx.doi.org/10.1111/j.1399-0004.2010.01436.x

40 Applications of Literature Mining and Ontology - II MetaMap program and CKC Mining – MetaMap: mapping biomedical text to the UMLS Metathesaurus. – A CKC (Conceptual Knowledge Construct) represents a path connecting several concepts in the UMLS. – Knowledge discovery using MetaMap and CKC mining. [Figure: pipeline from Literature through MetaMap to CKCs linking phenotypes and bio-molecular concepts]
References:
Aronson, A.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: AMIA Symposium, p. 17 (2001)
Payne, P., Borlawsky, T., Kwok, A., Greaves, A.: Supporting the design of translational clinical studies through the generation and verification of conceptual knowledge-anchored hypotheses. In: AMIA Annual Symposium Proceedings, p. 566 (2008)

41 Thanks! Questions?

