BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.

1 BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics Unit, Department of Computer Science, University College London, UK (Bioinformatics, Vol. 20, no. 17, pp. 3206-3213)

2 2/18 Abstract BioRAT (Biological Research Assistant for Text mining) is a new information extraction (IE) tool designed specifically for biomedical IE. It can locate and analyze both abstracts and full-length papers. Less than half of the available information is extracted from the abstracts; the majority comes from the body of each paper. BioRAT recalled 20.31% of the target facts from abstracts with 55.07% precision, and achieved 43.06% recall with 51.25% precision on full-length papers.
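
The slide reports precision and recall separately. As a rough way to compare the two settings with a single number, the balanced F-measure (harmonic mean of precision and recall) can be computed from the reported figures. The F-measure values are derived here and are not stated in the transcript; the function and variable names are illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported on the slide, converted from percentages to fractions.
abstracts = {"recall": 0.2031, "precision": 0.5507}
full_text = {"recall": 0.4306, "precision": 0.5125}

# Derived values (not reported in the transcript).
f_abstracts = f_measure(abstracts["precision"], abstracts["recall"])
f_full_text = f_measure(full_text["precision"], full_text["recall"])

print(f"abstracts: F = {f_abstracts:.3f}")
print(f"full text: F = {f_full_text:.3f}")
```

Because precision is similar in both settings, the difference in F-measure is driven almost entirely by the higher recall on full-length papers.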

3 3/18 1. Introduction (1/2) Information retrieval (IR) helps researchers find papers, but it still leaves a large amount of reading to be done. IE goes one stage further and analyzes the papers on behalf of the researcher. Given a query, BioRAT autonomously finds a set of papers, reads them and highlights the most relevant facts in each.

4 4/18 1. Introduction (2/2) BioRAT uses natural language processing (NLP) techniques and domain-specific knowledge to search for patterns in documents, with the aim of identifying interesting facts. These facts can then be extracted to produce a database of information, which has a higher ‘information density’ than a pile of papers.

5 5/18 2. System Outline The user enters a query into BioRAT, which passes it on to PubMed. The user is presented with a list of papers, from which they can choose to download abstracts or full-length papers. The user can apply pre-existing templates or create their own. In either case, the templates match patterns in the text that contain ‘useful’ information, which is extracted for display to the user and for possible incorporation into a database.
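
The first step of the outline, passing a user query to PubMed, can be sketched as follows. The slides do not say how BioRAT talks to PubMed; this sketch assumes NCBI's public E-utilities ESearch endpoint, and the function name and example query are hypothetical. It only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption: the slides do not
# say which PubMed interface BioRAT itself uses).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query: str, max_results: int = 20) -> str:
    """Return an ESearch URL that would list PubMed IDs matching the query."""
    params = {"db": "pubmed", "term": query, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical user query.
url = build_pubmed_search_url("protein interaction peroxisome")
print(url)
```

Fetching this URL would return a list of PubMed identifiers, from which the user could then pick the papers to download.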

9 9/18 2.1 Web Spidering BioRAT automatically locates and acquires full-length papers wherever possible, instead of just using abstracts. It does this via the Internet, by following a series of hyperlinks to find each target paper. To ensure that the correct paper has been identified, and that the text conversion process has succeeded, the first part of the plain text file is compared with the corresponding abstract obtained directly from PubMed, using a fuzzy string matching routine. BioRAT only attempts to locate and download PDF papers.
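
The verification step described above can be sketched with Python's standard `difflib` module. The slides do not specify which fuzzy matching routine BioRAT uses, so this is a stand-in: it measures what fraction of the abstract's words appear, in order, near the start of the converted text. The window factor, the threshold and all sample texts are made up for illustration.

```python
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def abstract_coverage(converted_text: str, abstract: str, window_factor: int = 3) -> float:
    """Fraction of the abstract's words found, in order, near the start
    of the converted text. The window allows for a title and author
    list preceding the abstract in the PDF."""
    abstract_words = words(abstract)
    if not abstract_words:
        return 0.0
    head = words(converted_text)[: window_factor * len(abstract_words)]
    matcher = SequenceMatcher(None, head, abstract_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(abstract_words)


def looks_like_same_paper(converted_text: str, abstract: str, threshold: float = 0.8) -> bool:
    """Accept the download if enough of the abstract is recovered.
    The threshold is an illustrative guess, not a value from the slides."""
    return abstract_coverage(converted_text, abstract) >= threshold


# Made-up sample data: the converted text contains a hyphenation error.
abstract = ("BioRAT is a new information extraction tool designed to perform "
            "biomedical information extraction from abstracts and full-length papers")
converted = ("BioRAT: Extracting Biological Information D. Corney et al. Abstract "
             "BioRAT is a new information extraction tool designed to per- form "
             "biomedical information extraction from abstracts and full-length papers "
             "1 Introduction")
unrelated = "Peroxisome biogenesis in yeast requires a set of proteins known as peroxins"

print(looks_like_same_paper(converted, abstract))
print(looks_like_same_paper(unrelated, abstract))
```

Matching on words rather than characters keeps unrelated English text from scoring well by accident, at the cost of being less tolerant of character-level conversion errors.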

10 10/18 2.2 IE Engine The IE engine is based on the GATE toolbox (General Architecture for Text Engineering, http://gate.ac.uk/). GATE is a general-purpose text engineering system, whose modular and flexible design allows us to use it to create a more specialized biological IE system. Two components of GATE that must be modified for the domain-specific application are gazetteers and templates.

11 11/18 2.2.1 Gazetteers One task in IE is ‘named entity recognition’, which aims to identify key items within text. BioRAT incorporates gazetteers from three sources, namely MeSH, Swiss-Prot and hand-made lists. Two gazetteers were created by hand. One consisted of 30 words describing the interaction of proteins (e.g. ‘bind’, ‘down-regulate’, ‘interact’). The other consisted of a few further synonyms of proteins not already covered by the other gazetteers.
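
A gazetteer lookup can be sketched as a dictionary from terms to headings. The three interaction words come from the slide; the heading names, the two protein entries and the example sentences are placeholders chosen for illustration, and real gazetteers built from MeSH and Swiss-Prot would be far larger. This sketch matches whole tokens only and does not reproduce how GATE implements gazetteers.

```python
import re

# Tiny illustrative gazetteers keyed by heading (heading names are assumed).
GAZETTEERS = {
    "INTERACTION": ["bind", "down-regulate", "interact"],  # from the slide
    "PROTEIN": ["Pex7p", "Pex13p"],                        # placeholder entries
}

# Case-insensitive lookup table: term -> heading.
LOOKUP = {
    entry.lower(): heading
    for heading, entries in GAZETTEERS.items()
    for entry in entries
}

# Tokens are runs of letters and digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (token, heading) pairs for tokens found in a gazetteer."""
    tagged = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        heading = LOOKUP.get(token.lower())
        if heading is not None:
            tagged.append((token, heading))
    return tagged


# Made-up example sentence.
print(tag_entities("We tested whether Pex7p and Pex13p interact."))
```

Exact token lookup misses inflected forms such as ‘binds’, which is one reason a system may also match on word stems.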

12 12/18 2.2.2 Templates A template is a representation of a text pattern that allows information to be extracted automatically. It consists of a number of predefined slots that the system fills from information contained in the text. For example, the template ‘interaction of’ (PROTEIN_1) ‘and’ (PROTEIN_2) matches the text ‘Genetic evidence for the interaction of Pex7p and Pex13p is provided …’ and extracts from it the interaction (Pex7p, Pex13p).
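
The slot-filling idea can be sketched with a regular expression whose named groups play the role of the PROTEIN_1 and PROTEIN_2 slots. This is a simplified stand-in and not BioRAT's template language; the protein list is a hypothetical gazetteer, and the example uses the protein pair from the extracted interaction (Pex7p, Pex13p).

```python
import re

# Hypothetical protein gazetteer used to restrict what can fill a slot.
PROTEINS = ["Pex7p", "Pex13p", "Pex14p"]
protein_alt = "|".join(re.escape(name) for name in PROTEINS)

# Template: 'interaction of' (PROTEIN_1) 'and' (PROTEIN_2)
TEMPLATE = re.compile(
    rf"\binteraction of (?P<PROTEIN_1>{protein_alt}) and (?P<PROTEIN_2>{protein_alt})\b"
)


def extract_interactions(text: str) -> list[tuple[str, str]]:
    """Return one (PROTEIN_1, PROTEIN_2) pair per template match."""
    return [
        (match.group("PROTEIN_1"), match.group("PROTEIN_2"))
        for match in TEMPLATE.finditer(text)
    ]


sentence = "Genetic evidence for the interaction of Pex7p and Pex13p is provided"
print(extract_interactions(sentence))
```

Restricting each slot to gazetteer entries keeps the template from firing on phrases such as ‘interaction of light and matter’.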

13 13/18 2.3 Template Design Tool BioRAT includes a template design tool with a graphical user interface, which allows non-expert users to develop templates without having to learn a complex new language. The properties a template can use are: the POS tag, the gazetteer heading, the word stem and the word itself.
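
One way to picture templates built from these four properties is as a sequence of per-token constraints, where each constraint fixes any subset of the properties. The class names, field names, tag values and example tokens below are all assumptions made for illustration; the tokens are annotated by hand rather than by a real tagger or stemmer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    word: str                        # the word itself
    stem: str                        # the word stem
    pos: str                         # part-of-speech tag
    gazetteer: Optional[str] = None  # gazetteer heading, if any


@dataclass(frozen=True)
class Constraint:
    """One template element; a field left as None is unconstrained."""
    word: Optional[str] = None
    stem: Optional[str] = None
    pos: Optional[str] = None
    gazetteer: Optional[str] = None

    def matches(self, token: Token) -> bool:
        for name in ("word", "stem", "pos", "gazetteer"):
            wanted = getattr(self, name)
            if wanted is not None and getattr(token, name) != wanted:
                return False
        return True


def match_at(template: list[Constraint], tokens: list[Token], start: int) -> bool:
    """True if the template matches the tokens beginning at index start."""
    window = tokens[start:start + len(template)]
    return len(window) == len(template) and all(
        constraint.matches(token) for constraint, token in zip(template, window)
    )


# Hand-annotated example tokens (illustrative values only).
tokens = [
    Token("Pex7p", "pex7p", "NNP", "PROTEIN"),
    Token("binds", "bind", "VBZ", "INTERACTION"),
    Token("Pex13p", "pex13p", "NNP", "PROTEIN"),
]

# Template: any protein, any form of 'bind', any protein.
template = [
    Constraint(gazetteer="PROTEIN"),
    Constraint(stem="bind"),
    Constraint(gazetteer="PROTEIN"),
]

print(match_at(template, tokens, 0))
```

Constraining the middle element by stem rather than by word lets the same template cover ‘bind’, ‘binds’ and ‘binding’, provided the stemmer maps them to the same stem.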

14 14/18 3. Experiments (1/3)

15 15/18 3. Experiments (2/3)

16 16/18 3. Experiments (3/3)

17 17/18 4. Discussion The density of ‘interesting’ facts found in the abstract is much higher than the corresponding density in the full text.

18 18/18 5. Conclusion BioRAT is an information extraction system specially designed to process biological research papers. Its distinguishing feature is that it uses full-length papers, rather than being limited to abstracts as previous studies have been. It achieves about 20% recall on abstracts alone, and 43% recall with over 50% precision on full-length papers.

