ISMB Demo; June 27, 2005 Integrating Text Mining into Bio-Informatics Workflows Neil Davis George Demetriou Robert Gaizauskas Yikun Guo Ian Roberts Henk.


1 ISMB Demo; June 27, 2005
Integrating Text Mining into Bio-Informatics Workflows
Neil Davis, George Demetriou, Robert Gaizauskas, Yikun Guo, Ian Roberts, Henk Harkema
Natural Language Processing Group, Department of Computer Science, University of Sheffield, Sheffield, UK

2 Overview
- Demonstration scenario: using text mining techniques to support biomedical researchers investigating the genetic basis of human disorders
- Case study: Williams-Beuren syndrome
- Overview of presentation and demonstration:
  - Background on Williams-Beuren syndrome
  - Architecture of system
  - Text mining
  - Workflows
  - User interface
  - Demonstration of system

3 Context
- MyGrid: University of Manchester, University of Newcastle, University of Nottingham, University of Sheffield, University of Southampton, IT Innovation Centre, European Bioinformatics Institute
- CLEF: University of Manchester, University College London, Royal Marsden Hospital, University of Cambridge, University of Sheffield, Open University

4 WBS: Clinician’s View
- WBS was first clinically described in 1961, before genetic screening was available
- WBS presents multiple but highly variable symptoms, including:
  - Congenital heart disorders
  - Elfin face
  - Mental retardation with relatively spared language skills
  - Growth retardation
  - Dental malformations
  - Infantile hypercalcemia
- Because the underlying genetic basis of WBS varies, the symptoms that WBS patients present are notoriously variable

5 WBS: Geneticist’s View
- WBS is caused by a variable (typically Mb) deletion from 7q11.23
- The deleted region (termed the Williams-Beuren Critical Region or WBSCR) spans multiple genes
- The complex genotype (multiple genes may be deleted) leads to a complex phenotype (multiple symptoms may be present)

6 Research Process
- Multi-step and iterative:
  - Step 1: Sequence the section of the genome of interest
  - Step 2: Scan the sequence for putative genes
  - Step 3: BLAST the putative gene sequence against a database of known genes to identify homologues whose function may be known
  - Step 4: Annotate the putative gene sequence with the data associated with homologous sequences
  - Repeat as new data and sequences become available
- Text mining techniques can facilitate Step 4
  - Navigating the biomedical literature to find papers containing information about homologous genes, etc.

7 Workflows …

8 Text Mining
- Uncovering the information content of unstructured or semi-structured textual data sources in an automatic way
- Includes research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
- Relevance to biomedical informatics
  - Textual biomedical data sources contain valuable information, but the volume is so large and growing so fast that it is difficult for researchers to find relevant information
  - Some information is available in textual form only, e.g., clinical records

9 Text Mining Workflow
- Workflow: a computational model for processes that require repeated execution of a series of analytical tasks
- BLAST reports provide links to abstracts in the literature
- Use MeSH terms to find related papers
- Show retrieved papers to end user
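
The first step of this workflow, pulling literature links out of a sequence database record, can be sketched as follows. This is only an illustration, not the demonstrator's code: it assumes the record is a Swiss-Prot style flat file whose reference lines carry ids in the form `PubMed=<digits>;`, and the function name and the sample record are invented for the example.

```python
import re

# Assumption: reference lines look like "RX   PubMed=1111111;".
# Other report formats would need a different pattern.
PUBMED_REF = re.compile(r"PubMed=(\d+)")


def extract_pubmed_ids(record_text):
    """Return the PubMed ids cited in a record, in order, without duplicates."""
    seen = set()
    ids = []
    for match in PUBMED_REF.finditer(record_text):
        pmid = match.group(1)
        if pmid not in seen:
            seen.add(pmid)
            ids.append(pmid)
    return ids


# Made-up record fragment, used only for illustration.
SAMPLE_RECORD = """\
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222; DOI=10.0000/example;
RN   [3]
RX   PubMed=1111111;
"""

print(extract_pubmed_ids(SAMPLE_RECORD))
```

The ids returned here would then be passed to the abstract retrieval and MeSH lookup steps.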

10 Architecture of System
[Diagram] Components: User Client, Workflow Server (Workflow Enactment), Medline Server
- Workflow steps shown (the diagram labels an "Initial Workflow"): Extract PubMed Id, Get Medline Abstract, Cluster Abstracts, Get Related Abstracts
- Medline: pre-processed offline to extract biomedical terms + indexed
- Data labels in the diagram: Swissprot/Blast record; Workflow definition + parameters; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles

11 Text Collection Server
[Architecture diagram repeated from slide 10]

12 Text Collection Server
- Text collection is MEDLINE (www.ncbi.nlm.nih.gov)
  - More than 14 million abstracts since the 1950s
  - Largest repository of biomedical abstracts
  - Copies made available for research, updated continually
- Records contain semi-structured information annotated in XML
  - Unique id: PubMed id
  - Citation information: author(s), journal, year, etc.
  - Manually assigned controlled vocabulary keywords (MeSH terms)
  - Text of abstract
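
To make the record structure concrete, here is a minimal sketch of reading one citation with Python's standard XML parser. The sample record is made up and heavily simplified. Its element names follow the general shape of MEDLINE citation XML, but the exact layout of real records should be checked against the NLM documentation. `parse_citation` is a hypothetical helper, not part of the system described here.

```python
import xml.etree.ElementTree as ET

# Simplified, invented record. Real MEDLINE records carry many more fields.
SAMPLE_XML = """
<MedlineCitation>
  <PMID>1111111</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>An illustrative title</ArticleTitle>
    <Abstract><AbstractText>Illustrative abstract text.</AbstractText></Abstract>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Williams Syndrome</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Chromosomes, Human, Pair 7</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""


def parse_citation(xml_text):
    """Pull the id, title, abstract text and MeSH terms out of one record."""
    root = ET.fromstring(xml_text)
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        # Some records have no abstract, so fall back to an empty string.
        "abstract": root.findtext("Article/Abstract/AbstractText", default=""),
        "mesh_terms": [d.text for d in root.iter("DescriptorName")],
    }


record = parse_citation(SAMPLE_XML)
print(record["pmid"], record["mesh_terms"])
```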

13 Text Collection Server
- Local copy
  - Loaded in MySQL, indexed on various fields, e.g. MeSH terms
  - Text portion indexed for search engines (Lucene, Madcow)
  - Text preprocessed with text mining tools (AMBIT & Termino)
  - Indexes built for term classes (proteins, genes, diseases, etc.)
- Server accepts web service calls to, e.g.
  - Return text of abstract given a PubMed id
  - Return MeSH terms of abstracts given PubMed ids
  - Return PubMed ids of abstracts with given MeSH terms
  - Return PubMed ids of abstracts matching a free text query
  - Return PubMed ids of abstracts containing a specific term
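
The lookups listed above can be mimicked with a small in-memory stand-in, which may help when reading the rest of the architecture. This is a sketch under loud assumptions: the class and method names are invented, "with given MeSH terms" is interpreted as "carrying all of the given terms", and a naive substring match stands in for the Lucene index. The real server is backed by MySQL and exposed as web services.

```python
from collections import defaultdict


class AbstractStore:
    """Hypothetical in-memory stand-in for the text collection server."""

    def __init__(self):
        self._text = {}                    # PubMed id -> abstract text
        self._mesh = {}                    # PubMed id -> list of MeSH terms
        self._by_mesh = defaultdict(set)   # MeSH term -> set of PubMed ids

    def add(self, pmid, text, mesh_terms):
        self._text[pmid] = text
        self._mesh[pmid] = list(mesh_terms)
        for term in mesh_terms:
            self._by_mesh[term].add(pmid)

    def get_abstract(self, pmid):
        """Return text of abstract given a PubMed id."""
        return self._text[pmid]

    def get_mesh_terms(self, pmids):
        """Return MeSH terms of abstracts given PubMed ids."""
        return {pmid: self._mesh[pmid] for pmid in pmids}

    def find_by_mesh(self, terms):
        """Return ids of abstracts carrying all given MeSH terms (assumed semantics)."""
        id_sets = [self._by_mesh.get(term, set()) for term in terms]
        return sorted(set.intersection(*id_sets)) if id_sets else []

    def search_text(self, query):
        """Return ids of abstracts containing every query word (naive substring match)."""
        words = query.lower().split()
        return sorted(
            pmid for pmid, text in self._text.items()
            if all(word in text.lower() for word in words)
        )


store = AbstractStore()
store.add("1111111", "Elastin deletion in Williams syndrome.", ["Williams Syndrome", "Elastin"])
store.add("2222222", "Elastin structure and function.", ["Elastin"])
print(store.find_by_mesh(["Elastin"]))
```

Keeping an inverted index from MeSH term to ids is what makes the "abstracts with given MeSH terms" call cheap. The same idea extends to the per-class term indexes mentioned on the slide.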

14 Preprocessing / Text Mining
- AMBIT
  - Lexical & terminological processing
  - Syntactic & semantic processing
  - Pattern recognition & discourse processing
- Termino
  - Large-scale terminological resource to support term processing for (biomedical) text processing applications
  - Efficient recognition and classification of terms in text through use of finite state recognizers compiled from terminological database
  - Terms are associated with links to outside ontologies and other terminological knowledge sources
- Text mining results saved as annotations on text
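
As a rough illustration of dictionary-driven term recognition, the sketch below compiles a tiny term list into a token trie (a simple finite-state structure) and scans text for the longest matching term at each position. It is not Termino's implementation: the term list, class labels and function names are all invented for the example.

```python
# Sentinel key marking "a term ends here". None cannot clash with a string token.
TERM_CLASS = None


def build_trie(terms):
    """Compile {phrase: term_class} into a nested-dict trie keyed by lowercase tokens."""
    root = {}
    for phrase, term_class in terms.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[TERM_CLASS] = term_class
    return root


def recognize(text, trie):
    """Return (term, class) pairs, preferring the longest match at each position."""
    tokens = [t.lower().strip(".,;:()") for t in text.split()]
    found = []
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Walk the trie as far as the tokens allow, remembering the last full term.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if TERM_CLASS in node:
                best = (j, node[TERM_CLASS])
        if best:
            end, term_class = best
            found.append((" ".join(tokens[i:end]), term_class))
            i = end
        else:
            i += 1
    return found


# Invented mini-dictionary for illustration only.
TERMS = {
    "elastin": "protein",
    "supravalvular aortic stenosis": "disease",
    "williams-beuren syndrome": "disease",
}
SENTENCE = "Elastin deletions are linked to supravalvular aortic stenosis in Williams-Beuren syndrome."

print(recognize(SENTENCE, build_trie(TERMS)))
```

Compiling the dictionary once and then scanning each text in a single pass is what keeps this style of recognition efficient on a large collection.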

15 Workflow Server
[Architecture diagram repeated from slide 10]

16 Workflow Server
- Workflow server runs the Freefluo enactment engine to execute XScufl workflows (designed using Taverna)
- WBS workflow: [workflow diagram]

17 Interface/Browsing Client
[Architecture diagram repeated from slide 10]

18 Interface/Browsing Client
- Two components
  - Submit workflows for enactment
  - Explore results, find related documents, free text search
- Explore results
  - Documents organized in tree derived from MeSH hierarchy (or chromosome locations)
  - Links to outside databases containing more information about terms
- Find related documents
  - Terms hyperlinked to same terms in other documents
  - Finding similar documents
    - Similarity measure based on MeSH terms
    - Similarity measure based on words in document
- Free text search
  - Based on Lucene search engine
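
The slides do not say which similarity measure is used. As one plausible illustration, the sketch below scores documents by Jaccard overlap of their MeSH term sets and ranks candidates against a target abstract. The function names and the toy data are assumptions made for the example.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term collections: |A & B| / |A | B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_related(target_pmid, mesh_by_pmid, top_n=5):
    """Rank other abstracts by MeSH overlap with the target, best first."""
    target_terms = mesh_by_pmid[target_pmid]
    scored = [
        (jaccard(target_terms, terms), pmid)
        for pmid, terms in mesh_by_pmid.items()
        if pmid != target_pmid
    ]
    # Highest score first. Ties are broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(pmid, score) for score, pmid in scored[:top_n]]


# Toy data: ids and terms are placeholders, not real MEDLINE content.
MESH = {
    "1": ["A", "B", "C"],
    "2": ["A", "B"],
    "3": ["C", "D"],
    "4": ["X"],
}

print(rank_related("1", MESH))
```

The same ranking function works for a word-based measure if the term lists are replaced by the sets of words in each abstract.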

19 Interface/Browsing Client
- Gridsphere Portal Framework is used for relaying workflow requests to the Freefluo enactment engine
- Text Mining Results viewer is implemented as a Java Swing applet for enhanced functionality and easy inclusion in portals
- The applet can re-enact workflow requests via the portal, so the user can further process document sets without explicitly having to enact a new workflow

20 Interface/Browsing Client
[Screenshot] Labelled areas: Abstract body, MeSH Tree, Abstract Titles, Free text search, Search scope restrictors, Linked terms, Get Related Abstracts
(Slide note: Up-to-date screen shot needed)

21 Chromosome location
- Extracting relationships between terms
- Viewer can be used to show data organized according to other trees, e.g., chromosome location, GO tree, etc.
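
One way to picture a chromosome-location tree is to expand a cytogenetic location such as 7q11.23 into a path (chromosome, arm, band, sub-band) and file documents under it. The sketch below assumes a simple `<chromosome><arm><band>[.<sub-band>]` notation and uses invented document ids. How the viewer actually builds its trees is not described in the slides.

```python
import re

# Assumed notation: chromosome (1-22, X or Y), arm (p or q), band, optional sub-band.
BAND = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d+)(?:\.(\d+))?$")


def location_path(location):
    """Expand e.g. '7q11.23' into ['7', '7q', '7q11', '7q11.23']."""
    match = BAND.match(location)
    if match is None:
        raise ValueError(f"unrecognized location: {location!r}")
    chrom, arm, band, sub = match.groups()
    path = [chrom, chrom + arm, chrom + arm + band]
    if sub:
        path.append(f"{chrom}{arm}{band}.{sub}")
    return path


def build_tree(docs):
    """Group (doc_id, location) pairs into a nested dict keyed by path labels."""
    tree = {}
    for doc_id, location in docs:
        node = tree
        for label in location_path(location):
            node = node.setdefault(label, {})
        # Documents are stored under a reserved key at the deepest node.
        node.setdefault("_docs", []).append(doc_id)
    return tree


# Placeholder document ids, for illustration only.
DOCS = [("1111111", "7q11.23"), ("2222222", "7q11.23"), ("3333333", "7q21")]
print(build_tree(DOCS))
```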

22 Further Information
- Papers
  - N. Davis, G. Demetriou, R. Gaizauskas, Y. Guo, I. Roberts. In press. Web Service Architectures for Text Mining: An Exploration of the Issues via an E-Science Demonstrator. In: International Journal of Web Services Research.
  - R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, I. Roberts. Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. In: Proceedings of the IEEE International Conference on Services Computing (SCC 2004).
- Contact
  - Neil Davis:
  - Sheffield NLP website:

23 More slides

24 Context: MyGrid
- Objective: To develop a comprehensive suite of middleware components specifically to support data-intensive in silico experiments in biology
  - Workflows and query specifications link together third party and local resources using web service protocols
- Sheffield’s contribution: Provision of text mining capabilities to link experimental results to the biomedical literature
- Duration, funding, participants:
  - 4 years, ending in June 2005
  - EPSRC-funded e-Science pilot project
  - Five UK universities, European Bioinformatics Institute, several industrial partners (GSK, IBM)

25 Common WBS Deletions
[Figure] SVAS = SupraValvular Aortic Stenosis

26 Why Research WBS?
- Without an understanding of the underlying causes of the disease, only palliative care can be offered
- Before any type of therapy can be developed, the pertinent genes, interactions and expression pathways must all be elucidated

27 Williams-Beuren Syndrome
- Congenital disorder resulting in mental retardation, caused by deletion of genetic material on chromosome 7
- Area in which deletions occur is not well characterised; better sequence information is becoming available
- As new sequence information becomes available, gene finding software is run against it
- BLAST is run against new putative genes to identify homologues whose function may be known
- BLAST reports provide links to abstracts in the literature

28 Why Automate
- The process of searching for associated papers is tedious and time-consuming
- The gene annotation pipeline is iterative, and automating time-consuming elements will free up the researchers’ time for more specialist research
- Automation allows easy collection of provenance and replication of the research process

29 Architecture of System (2)
- A 3-way division of labour is a sensible way to deliver distributed text mining services
  - Providers of e-archives, such as Medline, will make archives available via a web-services interface
    - Cannot offer tailored services for every application
    - Will provide core, common services
  - Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
  - Users will prefer to execute predefined workflows via standard light clients such as a browser
- Architecture appropriate for many research areas, not just bioinformatics

30 Text Mining Service Architecture
[Diagram] Data pre-processing and merging architecture: AMBIT & Termino, MEDLINE abstracts

31 Text Mining: Termino
- Large-scale terminological resource to support term processing for (biomedical) text processing applications
- Uniform access to terminological information aggregated across many sources, without the need for multiple, source-specific terminological components
- Immediate entry points into a variety of outside ontologies and other knowledge sources, making this information available to processing steps subsequent to term recognition
- Efficient recognition of terms in text through use of finite state recognizers compiled from contents of Termino
- Lexical look-up service accessible via web service (http://don.dcs.shef.ac.uk/termino)

32 Workflow Server
- Workflow server runs the Freefluo enactment engine to execute XScufl workflows (designed using Taverna)
- Graves’ disease workflow: [workflow diagram]

33 Example Project: CLEF
- Clinical e-Science Framework
- Objective: To develop a high quality, secure and interoperable information repository, derived from operational electronic patient records, to enable ethical and user-friendly access to patient information in support of clinical care and biomedical research
- Sheffield’s contribution: Analyzing clinical narratives to extract medically relevant entities and events, and their properties and relationships
- Duration, funding, participants:
  - 2003 – 2005 (CLEF), 2005 – 2007 (CLEF-S)
  - Funded by Medical Research Council (MRC)
  - Six universities, Royal Marsden Hospital, industrial partners engaged through CLEF Industrial Forum Meetings

