BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.

Slides:



Advertisements
Similar presentations
CONCEPTUAL WEB-BASED FRAMEWORK IN AN INTERACTIVE VIRTUAL ENVIRONMENT FOR DISTANCE LEARNING Amal Oraifige, Graham Oakes, Anthony Felton, David Heesom, Kevin.
Advertisements

1 OOA-HR Workshop, 11 October 2006 Semantic Metadata Extraction using GATE Diana Maynard Natural Language Processing Group University of Sheffield, UK.
Chapter 5: Introduction to Information Retrieval
Mining External Resources for Biomedical IE Why, How, What Malvina Nissim
Case Tools Trisha Cummings. Our Definition of CASE  CASE is the use of computer-based support in the software development process.  A CASE tool is a.
Reasoning Methodologies in Information Technology R. Weber College of Information Science & Technology Drexel University.
Dialogue – Driven Intranet Search Suma Adindla School of Computer Science & Electronic Engineering 8th LANGUAGE & COMPUTATION DAY 2009.
Information Retrieval in Practice
Fungal Semantic Web Stephen Scott, Scott Henninger, Leen-Kiat Soh (CSE) Etsuko Moriyama, Ken Nickerson, Audrey Atkin (Biological Sciences) Steve Harris.
1 CBioC: Collaborative Bio- Curation Chitta Baral Department of Computer Science and Engineering Arizona State University.
KnowItNow: Fast, Scalable Information Extraction from the Web Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni.
CBioC: Massive Collaborative Curation of Biomedical Literature Future Directions.
Chapter 12: Intelligent Systems in Business
Surface Mine Truck Safety Training Design And Implementation of a Multi-user VR Driving Simulator Yan W. Ha, Jeremy Murray, and Dr. Frederick C. Harris,
Overview of Search Engines
B IOMEDICAL T EXT M INING AND ITS A PPLICATION IN C ANCER R ESEARCH Henry Ikediego
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
PubMed/How to Search, Display, Download & (module 4.1)
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
1 LOMGen: A Learning Object Metadata Generator Applied to Computer Science Terminology A. Singh, H. Boley, V.C. Bhavsar National Research Council and University.
1 Introduction to Web Development. Web Basics The Web consists of computers on the Internet connected to each other in a specific way Used in all levels.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Information Need Question Understanding Selecting Sources Information Retrieval and Extraction Answer Determina tion Answer Presentation This work is supported.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Department of Mechanical Engineering, LSUSession VII MATLAB Tutorials Session VIII Graphical User Interface using MATLAB Rajeev Madazhy
Information Extraction From Medical Records by Alexander Barsky.
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
Using a Template to Create a Resume and Sharing a Finished Document
Outline Quick review of GS Current problems with GS Our solutions Future work Discussion …
A Survey for Interspeech Xavier Anguera Information Retrieval-based Dynamic TimeWarping.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman & Mario Latendresse Bioinformatics Research Group SRI, International.
SRI International Bioinformatics 1 The Structured Advanced Query Page Tomer Altman & Mario Latendresse Bioinformatics Research Group SRI, International.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Automatically Generating Gene Summaries from Biomedical Literature (To appear in Proceedings of PSB 2006) X. LING, J. JIANG, X. He, Q.~Z. MEI, C.~X. ZHAI,
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Visualizing Ontology Components through Self-Organizing.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Presenter: Shanshan Lu 03/04/2010
ITGS Databases.
Artificial Intelligence Research Center Pereslavl-Zalessky, Russia Program Systems Institute, RAS.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
Medical Information Retrieval: eEvidence System By Zhao Jin Mar
Design a full-text search engine for a website based on Lucene
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
SQL Based Knowledge Representation And Knowledge Editor UMAIR ABDULLAH AFTAB AHMED MOHAMMAD JAMIL SAWAR (Presented by Lei Jiang)
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
Information Retrieval
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Content-Based Image Retrieval Using Color Space Transformation and Wavelet Transform Presented by Tienwei Tsai Department of Information Management Chihlee.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
STEWARD: A Spatio-Textual Document Search Engine for HUDUSER.ORG Prof. Hanan Samet Department of Computer Science, University of Maryland, College Park,
General Architecture of Retrieval Systems 1Adrienn Skrop.
An Ontology-based Automatic Semantic Annotation Approach for Patent Document Retrieval in Product Innovation Design Feng Wang, Lanfen Lin, Zhou Yang College.
Information Retrieval in Practice
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Applications of Text Mining
Multimedia Information Retrieval
Data Warehousing and Data Mining
Text Categorization Document classification categorizes documents into one or more classes which is useful in Information Retrieval (IR). IR is the task.
CSE 635 Multimedia Information Retrieval
Batyr Charyyev.
Web Mining Department of Computer Science and Engg.
Introduction to Information Retrieval
Presentation transcript:

BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics Unit, Department of Computer Science, University College London, UK (Bioinformatics, Vol. 20, no. 17, p )

2/18 Abstract BioRAT Biological Research Assistant for Text mining A new IE tool, specifically designed to perform biomedical IE. It is able to locate and analyze both abstracts and full-length papers. Less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.06% recall with 51.25% precision on full-length papers.

3/18 1. Introduction (1/2) IR helps researchers to find papers, but it still leaves a large amount of reading to be done. IE goes one stage further, and analyzes the papers on behalf of the researcher. BioRAT is given a query and, autonomously, finds a set of papers, reads them and highlights the most relevant facts in each.

4/18 1. Introduction (2/2) BioRAT uses NLP techniques and domain- specific knowledge to search for patterns in documents, with the aim of identifying interesting facts. These facts can then be extracted to produce a database of information, which has a higher ‘information density’ than a pile of papers.

5/18 2. System Outline The user enters a query into BioRAT, which is then passed on to PubMed. The user is presented with a list of papers, from which they can choose to download abstracts or full- length papers. The user can apply some pre-existing templates or create their own. In either case, the templates match patterns in the text that contains ‘useful’ information, which is extracted for display to the user and for possible incorporation into a database.

6/18

7/18

8/18

9/ Web Spidering BioRAT automatically locates and acquires full- length paper wherever paossible, instead of just using abstracts. It does this via the Internet, by following a series of hyperlinks to find each target paper. To ensure that the correct paper has been identified, and that the text conversion process has succeeded, the first part of the plain text file is compared with the corresponding abstract obtained directly from PubMed, using a fuzzy string matching routine. BioRAT only attempts to locate and download PDF papers.

10/ IE Engine IE engine is based on the GATE toolbox (General Architecture for Text Engineering, Gate is a general purpose text engineering system, whose modular and flexible design allows us to use it to create a more specialized biological IE system. Two components of GATE that must be modified for the domain-specific application are gazetteers and templates.

11/ Gazetteers One task in IE is ‘named entity recognition’, which aims to identify key items within text. BioRAT incorporates gazetteers from three sources, namely MeSH, Swiss-Prot and hand- made lists. Two gazetteers were created by hand. One consisted of 30 words describing the interaction of proteins (e.g. ‘bind’, ‘down-regulate’, ‘interact’ etc). The other consisted of a few further synonyms of proteins not already covered by the other gazetteers.

12/ Templates A template is a representation of a text pattern that allows us to extract information automatically. It consists of a number of predefined slots to be filled by the system from information contained in the text. ‘Genetic evidence for the interaction of Pex7p and Pexd14p is provided …’ and extracted from it the interaction (Pex7p  Pex13p) ‘interaction of’ (PROTEIN_1) ‘and’ (PROTEIN_2)

13/ Template Design Tool A template design tool with a graphical user interface, which allows non-expert users to develop templates without having to learn a complex new language. The properties used are: POS tag, gazetteer headings, the word stem, and the word itself.

14/18 3. Experiments (1/3)

15/18 3. Experiments (2/3)

16/18 3. Experiments (3/3)

17/18 4. Discussion The density of ‘interesting’ facts found in the abstract is much higher than the corresponding density in the full text.

18/18 5. Conclusion BioRAT: an information extraction system specially designed to process biological research papers. Feature: it uses full-length papers, rather than being limited to abstracts as previous studies have been. Recall: 20% on the abstract alone. 43% recall and over 50% precision on full- length papers.