ISMB Demo; June 27, 2005
Integrating Text Mining into Bio-Informatics Workflows
Neil Davis, George Demetriou, Robert Gaizauskas, Yikun Guo, Ian Roberts, Henk Harkema
Natural Language Processing Group, Department of Computer Science, University of Sheffield, Sheffield, UK

2 Overview
- Demonstration scenario: to show the use of text mining techniques to support biomedical researchers investigating the genetic basis of human disorders
- Case study: Williams-Beuren syndrome
- Overview of presentation & demonstration:
  - Background on Williams-Beuren syndrome
  - Architecture of system
  - Text mining
  - Workflows
  - User interface
  - Demonstration of system

3 Context
- MyGrid: University of Manchester, University of Newcastle, University of Nottingham, University of Sheffield, University of Southampton, IT Innovation Centre, European Bioinformatics Institute
- CLEF: University of Manchester, University College London, Royal Marsden Hospital, University of Cambridge, University of Sheffield, Open University

4 WBS: Clinician’s View
- WBS was first clinically described in 1961, before genetic screening was available
- WBS presents multiple but highly variable symptoms, including:
  - Congenital heart disorders
  - Elfin face
  - Mental retardation with relatively spared language skills
  - Growth retardation
  - Dental malformations
  - Infantile hypercalcemia
- Because the underlying genetic basis of WBS varies, the symptoms that WBS patients present are notoriously variable

5 WBS: Geneticist’s View
- WBS is caused by a variable (typically Mb) deletion from 7q11.23
- The deleted region (termed the Williams-Beuren Critical Region or WBSCR) spans multiple genes
- The complex genotype (multiple genes may be deleted) leads to a complex phenotype (multiple symptoms may be present)

6 Research Process
- Multi-step and iterative:
  - Step 1: Sequence the section of the genome of interest
  - Step 2: Scan the sequence for putative genes
  - Step 3: BLAST the putative gene sequence against a database of known genes to identify homologues whose function may be known
  - Step 4: Annotate the putative gene sequence with the data associated with homologous sequences
  - Repeat as new data and sequences become available
- Text mining techniques can facilitate Step 4
  - Navigating biomedical literature to find papers containing information about homologous genes etc.

7 Workflows …

8 Text Mining
- Uncovering the information content of unstructured or semi-structured textual data sources in an automatic way
- Includes research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
- Relevance to biomedical informatics:
  - Textual biomedical data sources contain valuable information, but the volume is so large and growing so fast that it is difficult for researchers to find relevant information
  - Some information is available in textual form only, e.g., clinical records

9 Text Mining Workflow
- Workflow: computational model for processes that require repeated execution of a series of analytical tasks
- BLAST reports provide links to abstracts in the literature
- Use MeSH terms to find related papers
- Show retrieved papers to end user
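The first step of this workflow, pulling literature links out of a sequence record, can be sketched as follows. This is a minimal illustration, not the demonstrator's code: it assumes the record is a Swiss-Prot style flat file whose reference cross-reference lines start with `RX` and carry entries of the form `PubMed=<digits>;`. The function name `extract_pubmed_ids` and the sample record (including its ids) are invented for the example.

```python
import re

# Assumed pattern: Swiss-Prot style cross-references such as "PubMed=9145905;"
PUBMED_RE = re.compile(r"PubMed=(\d+)")

def extract_pubmed_ids(record_text):
    """Return PubMed ids found on RX lines, in order of first appearance."""
    seen = []
    for line in record_text.splitlines():
        if not line.startswith("RX"):
            continue  # only reference cross-reference lines are of interest
        for pmid in PUBMED_RE.findall(line):
            if pmid not in seen:
                seen.append(pmid)
    return seen

# Invented sample record; the ids are placeholders, not real citations.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN
RN   [1]
RX   MEDLINE=00000001; PubMed=11111111;
RN   [2]
RX   PubMed=22222222; DOI=10.0000/example;
RN   [3]
RX   PubMed=11111111;
"""

if __name__ == "__main__":
    print(extract_pubmed_ids(SAMPLE_RECORD))
```

The ids returned here would then be passed on to the abstract retrieval and MeSH-based expansion steps.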

10 Architecture of System
(Diagram; labels recovered from the slide.)
- Components: User Client, Workflow Server (Workflow Enactment), Medline Server
- Workflow labels: Extract PubMed Id; Get Medline Abstract; Initial Workflow; Cluster Abstracts; Get Related Abstracts
- Medline Server: Medline pre-processed offline to extract biomedical terms + indexed
- Data labels: Swissprot/Blast record; Workflow definition + parameters; Clustered PubMed Ids + titles; PubMed Ids; Term-annotated Medline abstracts; Medline Abstracts
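As a rough sketch of how the labelled steps might chain together, the snippet below wires a "Get Medline Abstract" step into a "Cluster Abstracts" step. Everything here is assumed for illustration: the in-memory `FAKE_MEDLINE` table stands in for the Medline server, clustering is simplified to grouping by MeSH term, and the step order is inferred from the diagram labels rather than stated on the slide.

```python
from collections import defaultdict

# Hypothetical stand-in for the Medline server: PubMed id -> (title, MeSH terms).
FAKE_MEDLINE = {
    "101": ("Elastin deletions in WBS", ["Williams Syndrome", "Elastin"]),
    "102": ("Hypercalcemia in infants", ["Hypercalcemia"]),
    "103": ("Elastin and arterial stenosis", ["Elastin"]),
}

def get_abstracts(pubmed_ids):
    """'Get Medline Abstract' step: look up each id, skipping unknown ones."""
    return {p: FAKE_MEDLINE[p] for p in pubmed_ids if p in FAKE_MEDLINE}

def cluster_by_mesh(abstracts):
    """'Cluster Abstracts' step, simplified here to grouping by MeSH term."""
    clusters = defaultdict(list)
    for pmid, (title, mesh_terms) in sorted(abstracts.items()):
        for term in mesh_terms:
            clusters[term].append((pmid, title))
    return dict(clusters)

def run_workflow(pubmed_ids):
    """Chain the steps in the assumed order: retrieve, then cluster."""
    return cluster_by_mesh(get_abstracts(pubmed_ids))

if __name__ == "__main__":
    for term, docs in run_workflow(["101", "103", "999"]).items():
        print(term, docs)
```

The result mirrors the "Clustered PubMed Ids + titles" label: each cluster holds id and title pairs that a client could display.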

11 Text Collection Server
(Architecture diagram repeated from slide 10.)

12 Text Collection Server
- Text collection is MEDLINE
  - More than 14 million abstracts since the 1950s
  - Largest repository of biomedical abstracts
  - Copies made available for research, updated continually
- Records contain semi-structured information annotated in XML:
  - Unique id: PubMed id
  - Citation information: author(s), journal, year, etc.
  - Manually assigned controlled vocabulary keywords (MeSH terms)
  - Text of abstract
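To make the record structure concrete, here is a small sketch that reads the fields listed above from one XML record. The sample is a heavily simplified, invented record that borrows element names from the NLM MEDLINE XML format (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`, `DescriptorName`); real records carry many more fields, and `parse_citation` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented, simplified record; content is placeholder text.
SAMPLE_XML = """
<MedlineCitation>
  <PMID>12345</PMID>
  <Article>
    <ArticleTitle>An invented title</ArticleTitle>
    <Abstract><AbstractText>An invented abstract.</AbstractText></Abstract>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Williams Syndrome</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Elastin</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""

def parse_citation(xml_text):
    """Extract id, title, abstract text and MeSH descriptors from one record."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "abstract": root.findtext("Article/Abstract/AbstractText"),
        "mesh": [d.text for d in root.iter("DescriptorName")],
    }

if __name__ == "__main__":
    print(parse_citation(SAMPLE_XML))
```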

13 Text Collection Server
- Local copy:
  - Loaded in MySQL, indexed on various fields, e.g. MeSH terms
  - Text portion indexed for search engines (Lucene, Madcow)
  - Text preprocessed with text mining tools (AMBIT & Termino)
  - Indexes built for term classes (proteins, genes, diseases, etc.)
- Server accepts web service calls to, e.g.:
  - Return text of abstract given a PubMed id
  - Return MeSH terms of abstracts given PubMed ids
  - Return PubMed ids of abstracts with given MeSH terms
  - Return PubMed ids of abstracts matching a free text query
  - Return PubMed ids of abstracts containing a specific term
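The lookups offered by the server can be pictured with a toy in-memory index. This sketch is only an analogy for the MySQL and Lucene setup described above: the class `AbstractIndex` and its method names are invented, free text matching is plain whitespace tokenisation, and multi-term queries are assumed to mean "all terms must match". It covers four of the listed calls; the term-class lookup is left out.

```python
from collections import defaultdict

class AbstractIndex:
    """Toy stand-in for the text collection server (names are hypothetical)."""

    def __init__(self):
        self.text = {}                    # PubMed id -> abstract text
        self.mesh = {}                    # PubMed id -> list of MeSH terms
        self.by_mesh = defaultdict(set)   # MeSH term -> PubMed ids
        self.by_word = defaultdict(set)   # lowercased word -> PubMed ids

    def add(self, pmid, text, mesh_terms):
        self.text[pmid] = text
        self.mesh[pmid] = list(mesh_terms)
        for term in mesh_terms:
            self.by_mesh[term].add(pmid)
        for word in text.lower().split():
            self.by_word[word.strip(".,;:()")].add(pmid)

    def get_text(self, pmid):
        """Return text of abstract given a PubMed id."""
        return self.text[pmid]

    def get_mesh(self, pmids):
        """Return MeSH terms of abstracts given PubMed ids."""
        return {p: self.mesh[p] for p in pmids}

    def ids_with_mesh(self, terms):
        """Return ids of abstracts carrying all given MeSH terms (assumed AND)."""
        sets = [self.by_mesh.get(t, set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def search(self, query):
        """Return ids of abstracts containing every query word (assumed AND)."""
        sets = [self.by_word.get(w, set()) for w in query.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []

if __name__ == "__main__":
    idx = AbstractIndex()
    idx.add("1", "Elastin deletion in Williams syndrome.", ["Elastin", "Williams Syndrome"])
    idx.add("2", "Elastin expression.", ["Elastin"])
    print(idx.ids_with_mesh(["Elastin"]), idx.search("elastin deletion"))
```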

14 Preprocessing / Text Mining
- AMBIT:
  - Lexical & terminological processing
  - Syntactic & semantic processing
  - Pattern recognition & discourse processing
- Termino:
  - Large-scale terminological resource to support term processing for (biomedical) text processing applications
  - Efficient recognition and classification of terms in text through use of finite state recognizers compiled from a terminological database
  - Terms are associated with links to outside ontologies and other terminological knowledge sources
- Text mining results saved as annotations on text
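The idea of compiling a term list into a finite state recognizer can be illustrated with a token trie and longest-match scanning. This is a generic sketch of the technique, not Termino's implementation: the function names, the sample terms and their classes are all invented, and tokenisation is a plain whitespace split with punctuation stripped.

```python
def build_trie(term_classes):
    """Compile {term: class} into a token trie; the key None marks acceptance."""
    root = {}
    for term, term_class in term_classes.items():
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[None] = term_class  # None cannot collide with a string token
    return root

def recognize(text, trie):
    """Scan left to right, emitting the longest known term at each position."""
    norm = [t.lower().strip(".,;:()") for t in text.split()]
    found, i = [], 0
    while i < len(norm):
        node, best, j = trie, None, i
        while j < len(norm) and norm[j] in node:
            node = node[norm[j]]
            j += 1
            if None in node:
                best = (j, node[None])  # remember the longest match so far
        if best:
            end, term_class = best
            found.append((" ".join(norm[i:end]), term_class))
            i = end
        else:
            i += 1
    return found

if __name__ == "__main__":
    # Invented mini-lexicon; classes are illustrative labels only.
    trie = build_trie({
        "elastin": "protein",
        "williams-beuren": "name",
        "williams-beuren syndrome": "disease",
    })
    print(recognize("Deletion of elastin in Williams-Beuren syndrome.", trie))
```

Because the scan keeps the longest accepting state, "williams-beuren syndrome" is preferred over the shorter entry "williams-beuren" when both match.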

15 Workflow Server
(Architecture diagram repeated from slide 10.)

16 Workflow Server Workflow server runs the Freefluo enactment engine to execute XScufl workflow (designed using Taverna) WBS workflow:

17 Interface/Browsing Client
(Architecture diagram repeated from slide 10.)

18 Interface/Browsing Client
- Two components:
  - Submit workflows for enactment
  - Explore results, find related documents, free text search
- Explore results:
  - Documents organized in tree derived from MeSH hierarchy (or chromosome locations)
  - Links to outside databases containing more information about terms
- Find related documents:
  - Terms hyperlinked to same terms in other documents
- Finding similar documents:
  - Similarity measure based on MeSH terms
  - Similarity measure based on words in document
- Free text search:
  - Based on Lucene search engine
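The slide does not say which similarity measures are used, so the sketch below picks two common ones purely as an illustration: Jaccard overlap for MeSH term sets and cosine similarity over raw word counts for document text. Both function names are invented.

```python
import math
from collections import Counter

def mesh_similarity(mesh_a, mesh_b):
    """Jaccard overlap of two MeSH term lists (an assumed choice of measure)."""
    a, b = set(mesh_a), set(mesh_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def word_similarity(text_a, text_b):
    """Cosine similarity over lowercased word counts (also an assumed choice)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(mesh_similarity(["Elastin", "Williams Syndrome"], ["Elastin"]))
    print(word_similarity("elastin gene deletion", "elastin gene expression"))
```

A MeSH-based measure depends on the manually assigned keywords, while a word-based measure works on any text, which may be why the client offers both.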

19 Interface/Browsing Client
- The Gridsphere Portal Framework is used for relaying workflow requests to the Freefluo enactment engine
- The Text Mining Results viewer is implemented as a Java Swing applet for enhanced functionality and easy inclusion in portals
- The applet can re-enact workflow requests via the portal, so the user can further process document sets without explicitly having to enact a new workflow

20 Interface/Browsing Client
(Screenshot; labels: Abstract body; MeSH Tree; Abstract Titles; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts.)
Up-to-date screen shot needed

21 Chromosome location
- Extracting relationships between terms
- Viewer can be used to show data organized according to other trees, e.g., chromosome location, GO tree, etc.

22 Further Information
Papers:
- N. Davis, G. Demetriou, R. Gaizauskas, Y. Guo, I. Roberts. In press. Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. In: International Journal of Web Services Research.
- R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, I. Roberts. Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. In: Proceedings of the IEEE International Conference on Services Computing (SCC 2004).
Contact:
- Neil Davis:
- Sheffield NLP website:

23 More slides

24 Context: MyGrid
- Objective:
  - To develop a comprehensive suite of middleware components specifically to support data intensive in silico experiments in biology
  - Workflows and query specifications link together third party and local resources using web service protocols
- Sheffield’s contribution:
  - Provision of text mining capabilities to link experimental results to the biomedical literature
- Duration, funding, participants:
  - 4 years, ending in June 2005
  - EPSRC-funded e-Science pilot project
  - Five UK universities, European Bioinformatics Institute, several industrial partners (GSK, IBM)

25 Common WBS Deletions SVAS = SupraValvular Aortic Stenosis

26 Why Research WBS?
- Without an understanding of the underlying causes of the disease, only palliative care can be offered
- Before any type of therapy can be developed, the pertinent genes, interactions and expression pathways must all be elucidated

27 Williams-Beuren Syndrome
- Congenital disorder resulting in mental retardation, caused by deletion of genetic material on the 7th chromosome
- Area in which deletions occur is not well characterised; better sequence information is becoming available
- As new sequence information becomes available, gene finding software is run against it
- BLAST is run against new putative genes to identify homologues whose function may be known
- BLAST reports provide links to abstracts in the literature

28 Why Automate
- The process of searching for associated papers is tedious and time-consuming
- The gene annotation pipeline is iterative, and automating its time-consuming elements will free up the researchers' time for more specialist research
- Automation allows easy collection of provenance and replication of the research process

29 Architecture of System (2)
- A 3-way division of labour is a sensible way to deliver distributed text mining services:
  - Providers of e-archives, such as Medline, will make archives available via a web-services interface
    - Cannot offer tailored services for every application
    - Will provide core, common services
  - Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
  - Users will prefer to execute predefined workflows via standard light clients such as a browser
- Architecture appropriate for many research areas, not just bioinformatics

30 Text Mining Service Architecture Data pre-processing and merging architecture: AMBIT & Termino MEDLINE abstracts

31 Text Mining: Termino
- Large-scale terminological resource to support term processing for (biomedical) text processing applications
- Uniform access to terminological information aggregated across many sources, without the need for multiple, source-specific terminological components
- Immediate entry points into a variety of outside ontologies and other knowledge sources, making this information available to processing steps subsequent to term recognition
- Efficient recognition of terms in text through use of finite state recognizers compiled from contents of Termino
- Lexical look-up service accessible via web service

32 Workflow Server Workflow server runs the Freefluo enactment engine to execute XScufl workflow (designed using Taverna) Graves’ disease workflow:

33 Example Project: CLEF
Clinical e-Science Framework
- Objective:
  - To develop a high quality, secure and interoperable information repository, derived from operational electronic patient records, to enable ethical and user-friendly access to patient information in support of clinical care and biomedical research
- Sheffield’s contribution:
  - Analyzing clinical narratives to extract medically relevant entities and events, and their properties and relationships
- Duration, funding, participants:
  - 2003 – 2005 (CLEF), 2005 – 2007 (CLEF-S)
  - Funded by Medical Research Council (MRC)
  - Six universities, Royal Marsden Hospital, industrial partners engaged through CLEF Industrial Forum Meetings