ECO R European Centre for Ontological Research Basic Introduction to Ontology-based Language Technology (LT) for the Biomedical Sciences (1st year Biomedicine, UG, Belgium) Werner Ceusters European Centre for Ontological Research Universität des Saarlandes Saarbrücken, Germany
Purpose of this lecture
– Introduce some keywords
– Give just a taste of ontology-based LT in biomedicine
– Stimulate interest in further research
Biomedicine: A Great Area for LT
– Educated users
– High utility of NLP
– Doesn't require a solution to the general problem
– Complex and interesting (not just IE)
– Recent surge in data
– Knowledge bases available
Hinrich Schütze, Novation Biosciences; Russ Altman, Stanford University
Biomedical Data Mining and DNA Analysis
– DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T)
– Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
– Humans have around 100,000 genes (the estimate at the time; current estimates are roughly 20,000 protein-coding genes)
– Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
– Semantic integration of heterogeneous, distributed genome databases
  – Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  – Data cleaning and data integration methods developed in data mining will help
Jiawei Han and Micheline Kamber
DNA Analysis: Examples
– Similarity search and comparison among DNA sequences
  – Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  – Identify gene sequence patterns that play roles in various diseases
– Association analysis: identification of co-occurring gene sequences
  – Most diseases are not triggered by a single gene but by a combination of genes acting together
  – Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
– Path analysis: linking genes to different disease development stages
  – Different genes may become active at different stages of the disease
  – Develop pharmaceutical interventions that target the different stages separately
– Visualization tools and genetic data analysis
Jiawei Han and Micheline Kamber
Task descriptions
Task                                              Count
Sequence similarity searching
  – Nucleic acid vs nucleic acid                     28
  – Protein vs protein                               39
  – Translated nucleic acid vs protein                6
  – Unspecified sequence type                        29
  – Search for non-coding DNA                         9
Functional motif searching                           35
Sequence retrieval                                   27
Multiple sequence alignment                          21
Restriction mapping                                  19
Secondary and tertiary structure prediction          14
Other DNA analysis including translation             14
Primer design                                        12
ORF analysis                                         11
Literature searching                                 10
Phylogenetic analysis                                 9
Protein analysis                                     10
Sequence assembly                                     8
Location of expression                                7
Miscellaneous                                         7
Total                                               315
Stevens R, Goble C, Baker P, and Brass A. A Classification of Tasks in Bioinformatics. Bioinformatics 2001; 17(2):180-188.
Three major challenges
– Analyse massive amounts of data
  – E.g., high-throughput technologies based on cDNA or oligonucleotide microarrays for analysis of gene expression, analysis of sequence polymorphisms and mutations, and sequencing
– Appropriately link clinical histories to molecular or other biomarker data generated by genomic and proteomic technologies
– Develop user-friendly computer-based platforms
  – that can be accessed and used by the average researcher for searching, retrieval, manipulation, and analysis of information from large-scale datasets
BUT !!!
Majority of data buried in
– huge amounts of text
– incompatibly annotated databases
Text overload
– According to a conservative estimate, the number of digital libraries is more than 10^5. [Norbert Fuhr 03]
– Google indexed over 4.28 billion web pages (from a Google press release).
– But any single engine is prevented from indexing more than one-third of the “indexable web” (from Science, Vol. 285, Nr. 5426).
Objectives of LT in Biomedical Informatics
– Make large volumes of scientific texts more accessible
– Assist annotation of genome and phenome to allow better linking of the data
  – CSB: Computational Systems Biology
– Link biomedical data with patient record data
Knowledge discovery and use
Text Mining Technologies for Biomedicine
[Figure: technologies placed on two axes, cost effectiveness and utility, each running from low to high: Artificial Intelligence (Cyc), Information Extraction (Fastus), Primary Literature Reading, Keyword-based Retrieval (PubMed), Structure Mining, Manual Knowledge Representation (Riboweb)]
Hinrich Schütze, Novation Biosciences; Russ Altman, Stanford University
Scientists in areas such as molecular biology and biochemistry aim to discover new biological entities and their functions. Typical cases could be discoveries of the implications of new proteins and genes in an already known process, or the implication of proteins with previously characterized functions in a separate process. The use of available information (published papers, etc.) is a key step in the discovery process, since in many cases weak or indirect evidence about possible relations hidden in the literature is used to substantiate working hypotheses that are then explored experimentally. [C. Blaschke, A. Valencia: 2001]
Text-based knowledge discovery
– Goal: finding “new” biomedical scientific knowledge through the combination of existing knowledge as represented in the medical literature
– Motivation: avoiding re-invention of the wheel; reuse of specific knowledge outside the original domain of discovery
Swanson
– Substance A: fish oil
– Effects B: high blood viscosity, platelet aggregation
– Disease C: Raynaud’s disease
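The A-B-C pattern on this slide can be sketched in a few lines of Python. This is only an illustrative toy, not Swanson's actual procedure: the two link tables, the empty set of known direct links, and the function name `abc_hypotheses` are assumptions made for the example, and only the fish oil terms come from the slide.

```python
from collections import defaultdict

# Toy link tables (assumed for illustration): A -> B links and B -> C links
# that would normally be mined from two separate bodies of literature.
substance_affects = {
    "fish oil": {"high blood viscosity", "platelet aggregation"},
}
effect_linked_to_disease = {
    "high blood viscosity": {"Raynaud's disease"},
    "platelet aggregation": {"Raynaud's disease"},
}
# (A, C) pairs already reported together; empty in this toy example.
known_direct_links = set()


def abc_hypotheses(a_to_b, b_to_c, known):
    """Return {(A, C): sorted bridging B terms} for A-C pairs not already known."""
    bridges = defaultdict(set)
    for a, effects in a_to_b.items():
        for b in effects:
            for c in b_to_c.get(b, ()):
                if (a, c) not in known:
                    bridges[(a, c)].add(b)
    return {pair: sorted(terms) for pair, terms in bridges.items()}


hypotheses = abc_hypotheses(substance_affects, effect_linked_to_disease, known_direct_links)
for (a, c), via in hypotheses.items():
    print(f"{a} -> {c} via {', '.join(via)}")
```

An A-C pair supported by several independent B terms is a stronger candidate for experimental follow-up than one supported by a single bridge.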
Protein-protein interactions extracted from texts (by C. Blaschke)
Steps of Knowledge Discovery
– Training data gathering
– Feature generation
  – k-grams, domain know-how, ...
– Feature selection
  – Entropy, χ², CFS, t-test, domain know-how, ...
– Feature integration (some classifiers/learning methods)
  – SVM, ANN, PCL, CART, C4.5, kNN, ...
Limsoon Wong
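As a rough illustration of the feature generation and feature selection steps, the sketch below generates k-grams from sequences and ranks them with a χ² score. The function names, the 2x2 contingency layout, and the toy sequences are assumptions made for this example; the slide does not prescribe a particular implementation.

```python
from collections import Counter


def kgrams(seq, k):
    """Feature generation: all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table.

    n11: feature present, positive class   n10: feature present, negative class
    n01: feature absent, positive class    n00: feature absent, negative class
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_features(positives, negatives, k):
    """Feature selection: score each k-gram by how well its presence separates the classes."""
    pos_df = Counter(g for s in positives for g in set(kgrams(s, k)))
    neg_df = Counter(g for s in negatives for g in set(kgrams(s, k)))
    scores = {}
    for g in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[g], neg_df[g]
        scores[g] = chi_square(n11, n10, len(positives) - n11, len(negatives) - n10)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy sequences, invented for the example.
positives = ["ACGTGA", "TTGTGC"]
negatives = ["ACCCAA", "TTCCAC"]
ranked = rank_features(positives, negatives, k=3)
print(ranked[:2])
```

The top-ranked k-grams would then be passed to one of the classifiers listed under feature integration.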
Functional components for a text-based feature generation system
– Basic use components (end user):
  – Corpus management tool
  – Parser
  – Export module
– Management components:
  – Corpus editor (super user)
  – Grammar building workbench (super user)
  – Domain ontology editor (super user)
  – Parser generator (exporter)
  – Linguistic ontology, multi-lingual use (exporter)
What does it take to build such a system?
– Short term: single domain
  – Corpus collection & analysis
  – Domain model design & implementation
  – Grammar development
  – Corpus manipulation engine
  – Integration in Biomining package
– Long term: generic system
  – Grammar building workbench
  – Parser generator
  – Documentation
A “statistics only system”
[Figure panels: 22-page full paper; abstract only]
Relative concept/node identification
[Figure labels: (real) concepts; nodes]
Statistical analysis is powerful, but not enough
Clean separation of knowledge for deep understanding
– The Galen view:
  – linguistic knowledge
  – conceptual knowledge
  – pragmatic knowledge
  – criteria knowledge
  – terminological knowledge
– The LT view:
  – phonologic knowledge
  – morphologic knowledge
  – syntactic knowledge
  – semantic knowledge
  – pragmatic knowledge
  – world knowledge
One word – multiple meanings
Abbreviation Extraction (Schwartz 2003)
– Extracts short and long form pairs
Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             amino acid
             amino acids
             anaemia
             anemia
             :
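A simplified sketch of how short and long form pairs of this kind can be pulled from text, loosely following the right-to-left character matching idea of the cited abbreviation extraction work. It is an assumption-laden reduction, not the published algorithm: it only handles the "long form (SF)" pattern, omits the validity filters, and the function names and example sentence are invented here.

```python
import re


def best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    Returns the shortest trailing part of `candidate` that contains the
    characters of `short` in order, with the first character of `short`
    at the start of a word, or None if there is no such match.
    """
    s_idx = len(short) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the beginning of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_pairs(text):
    """Find 'long form (SF)' patterns and return (short form, long form) pairs."""
    pairs = []
    for match in re.finditer(r"\(([^()]{2,10})\)", text):
        short = match.group(1).strip()
        if not short or not short[0].isalnum():
            continue
        words = text[:match.start()].split()
        # Only look at a bounded window of words before the parenthesis.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs


text = "Release of arachidonic acid (AA) was measured. Each amino acid (AA) was counted."
print(extract_pairs(text))
```

Run over a large corpus, the pairs for one short form accumulate into a list like the one for AA above, which is why the same abbreviation needs disambiguation in context.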
Syntactic variant detection
– Corpus
  – MEDLINE: the largest collection of abstracts in the biomedical domain
– Rule learning
  – 83,142 abstracts
  – Obtained rules: 14,158
– Evaluation
  – 18,930 abstracts
  – Count the occurrences of each generated variant
[Tsuruoka et al., SIGIR 03]
Results: “antiinflammatory effect”
Generation Probability   Generated Variants          Frequency
1.0 (input)              antiinflammatory effect             7
0.462                    anti-inflammatory effect           33
0.393                    antiinflammatory effects            6
0.356                    Antiinflammatory effect             0
0.286                    antiinflammatory-effect             0
0.181                    anti-inflammatory effects          23
:                        :                                   :
Results: “tumour necrosis factor alpha”
Generation Probability   Generated Variants             Frequency
1.0 (input)              tumour necrosis factor alpha          15
0.492                    tumor necrosis factor alpha          126
0.356                    tumour necrosis factor-alpha          30
0.235                    Tumour necrosis factor alpha           2
0.175                    tumor necrosis factor alpha          182
0.115                    Tumor necrosis factor alpha            8
:                        :                                      :
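The evaluation step (generate variants of a term, then count how often each occurs in a corpus) can be illustrated with the sketch below. It departs from the cited work in important ways: the three rules are hand-written for the example rather than learned from MEDLINE, no generation probabilities are computed, and the two-sentence corpus is invented.

```python
import re

# Hand-written illustrative rules (assumption); the cited work learns its
# rules and their probabilities from a corpus.
RULES = [
    ("British to American spelling", lambda t: t.replace("tumour", "tumor")),
    ("hyphenate before last word", lambda t: "-".join(t.rsplit(" ", 1))),
    ("capitalise first letter", lambda t: t[:1].upper() + t[1:]),
]


def generate_variants(term):
    """Apply each rule to every variant produced so far, keeping first-seen order."""
    variants = [term]
    for _name, rule in RULES:
        for variant in list(variants):
            new = rule(variant)
            if new not in variants:
                variants.append(new)
    return variants


def count_occurrences(variant, corpus):
    """Case-sensitive count of the variant as a whole term across all documents."""
    pattern = re.compile(r"(?<![\w-])" + re.escape(variant) + r"(?![\w-])")
    return sum(len(pattern.findall(doc)) for doc in corpus)


# Toy corpus, invented for the example.
corpus = [
    "Tumor necrosis factor alpha was raised.",
    "We measured tumor necrosis factor-alpha and tumour necrosis factor alpha.",
]
variants = generate_variants("tumour necrosis factor alpha")
for variant in variants:
    print(variant, count_occurrences(variant, corpus))
```

Variants with a nonzero corpus frequency are the ones worth adding to a term dictionary; generated strings that never occur can be discarded.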
Biomedical NE Task (Collier Coling00, Kazama ACL02, Kim ISMB02)
– Recognize “names” in the text
  – Technical terms expressing proteins, genes, cells, etc.
– Identify and classify
Example: Thus, CIITA not only activates the expression of class II genes but recruits another B cell-specific coactivator to increase transcriptional activity of class II promoters in B cells.
[Annotation labels shown on the example: DNA, PROTEIN, DNA, CELLTYPE]
Junichi Tsujii
Text mining and classification
[Figure: concept graph used to classify the sentence “Mr. Smith has a pulmonary carcinoma”. Node labels: Generalised Possession, Having a healthcare phenomenon, Human, Patient, Cancer patient, Healthcare phenomenon, Malignant neoplasm, lung carcinoma. Link labels: IS-A, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon. Numbered steps 1 to 3.]
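The kind of inference this slide points at, from “Mr. Smith has a pulmonary carcinoma” to “Mr. Smith is a cancer patient”, can be sketched with a tiny IS-A hierarchy. The exact edges of the original figure are not recoverable from the transcript, so the edges, the synonym table, and the classification rule below are assumptions chosen to be consistent with the node labels.

```python
# Hypothetical IS-A edges (child -> parent), assumed for illustration.
IS_A = {
    "lung carcinoma": "malignant neoplasm",
    "malignant neoplasm": "healthcare phenomenon",
    "cancer patient": "patient",
    "patient": "human",
}

# Surface terms mapped to concepts (assumed synonym table).
SYNONYMS = {"pulmonary carcinoma": "lung carcinoma"}


def ancestors(concept):
    """All concepts reachable by following IS-A links upwards."""
    found = []
    while concept in IS_A:
        concept = IS_A[concept]
        found.append(concept)
    return found


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or is subsumed by it."""
    return concept == ancestor or ancestor in ancestors(concept)


def classify_possessor(phenomenon_term):
    """Most specific class of a human who 'has' the given phenomenon."""
    concept = SYNONYMS.get(phenomenon_term, phenomenon_term)
    if is_a(concept, "malignant neoplasm"):
        return "cancer patient"
    if is_a(concept, "healthcare phenomenon"):
        return "patient"
    return "human"


print(classify_possessor("pulmonary carcinoma"))
```

The point of the design is that the sentence never mentions cancer or patients; the classification follows from the term-to-concept mapping plus the subsumption links.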
Data integration approaches
– Protein interaction databases
– Small molecule databases
– Genome databases
– Pathway databases
– Protein databases
– Enzyme databases
Gene Ontology: at least, the beginnings of...
Data integration approaches / system integration approaches
1. Data warehousing: data from various data sources are converted, merged and stored in a centralized DBMS. Example: Integrated Genomic Database.
2. Hyperlinking approaches: links are set up between related information and data sources. Examples: SRS, Entrez (NCBI).
3. Standardization: efforts that address the need for a common metadata model for various application domains.
4. Integration systems: systems that can gather and integrate information from multiple sources. Some of these systems have a mediator-wrapper architecture; others are language-based systems like Bio-Kleisli.
5. Federated database: cooperating, yet autonomous, databases map their individual schemas to a single global schema. Operations are performed against the federated schema.
Steve Brady
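To make the mediator-wrapper idea from item 4 concrete, here is a minimal sketch. The class names, the shared field name `gene`, and the two in-memory sources are hypothetical; real systems wrap remote databases and handle far richer queries.

```python
class SourceWrapper:
    """Wraps one data source and translates its local field names to a shared schema."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records
        self.field_map = field_map  # local field name -> shared field name

    def query(self, gene):
        for record in self.records:
            row = {shared: record[local] for local, shared in self.field_map.items()}
            if row["gene"] == gene:
                row["source"] = self.name
                yield row


class Mediator:
    """Sends one query to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, gene):
        return [row for wrapper in self.wrappers for row in wrapper.query(gene)]


# Two toy sources with incompatible field names (invented for the example).
protein_db = SourceWrapper(
    "protein_db",
    [{"gene_symbol": "TP53", "protein": "p53"}],
    {"gene_symbol": "gene", "protein": "protein"},
)
pathway_db = SourceWrapper(
    "pathway_db",
    [{"locus": "TP53", "path": "apoptosis"}, {"locus": "BRCA1", "path": "DNA repair"}],
    {"locus": "gene", "path": "pathway"},
)

mediator = Mediator([protein_db, pathway_db])
results = mediator.query("TP53")
print(results)
```

Unlike a data warehouse, nothing is copied into a central store here: each source stays autonomous and only its wrapper knows the local schema.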
CoMeDIAS (France)
GenesTrace™: Biological Knowledge Discovery via Structured Terminology
The XML misconception
Example radiology report (translated from French):
– Groupe hospitalier Léonard Devintscie, Central Radiology
– Dr. Bouaud
– Phlebography of the lower limbs
– Department of Prof. Charlet; Dr. Brunie
– 29-10-99
– Donald Duck
– Suspected phlebitis of the left leg
– Bilateral puncture of a vein on the dorsum of the foot and injection of 180 cc of contrast medium
– Endoluminal filling defect visible in the left peroneal veins. No opacification of the left anterior and posterior tibial veins. The iliac veins and the inferior vena cava are clear.
– Peroneal, and probably anterior and posterior tibial, thrombophlebitis on the left
Towards Machine Readable Semantics
– Form: Style Type Definition; data about layout; cases: Bold, Centred, Align Left, Blink
– Structure: Document Type Definition; data about outline; cases: Title, Paragraph, Heading1, Play
– Meaning: Information Type Definition; data about content; cases: Subject, isPartOf, Date, After_value
– Function: Knowledge Type Definition; data about behaviour; cases: Utility, affectedBy, Receive, Protect
– Usage: Workflow Type Definition; data about process; cases: Actor, Receival, Maintenance, Archival
[Other labels in the figure: Formalism, Static, Dynamic, Standard]
Hao Ding, Ingeborg T. Sølvberg
Triadic models of meaning: the semiotic/semantic triangle
– Sign: Language / Term / Symbol
– Referent: Reality / Object
– Reference: Concept / Sense / Model / View
There is ontology and “ontology”
– Ontology in Information Science:
  – “An ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents.”
– Ontology in Philosophy:
  – “Ontology is the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.”
Why are concepts not enough?
Why must our theory also address the referents in reality?
– Because referents are observable fixed points in relation to which we can work out how the concepts used by different communities relate to each other;
– Because only by looking at referents can we establish the degree to which concepts are good for their purpose.
Or you get nonsense: Definition of “cancer gene”
Take home message: Language Technology requires a clean separation of knowledge AND (the right sort of) ontology
– Conceptual knowledge: the knowledge of sensible domain concepts
– Knowledge of definitions and criteria: how to determine if a concept applies to a particular instance
– Surface linguistic knowledge: how to express the concepts in any given language
– Knowledge of classification and coding systems: how an expression has been classified by such a system
– Pragmatic knowledge: what users usually say or think, what they consider important, how to integrate in software
– Ontology: what exists and how the things that exist relate to each other