Publishing Interactive Articles: Integrating Journals and Biological Databases Karen Yook Genetics Society of America, Dartmouth Journal Services, Textpresso.

Slides:



Advertisements
Similar presentations
Electronic theses and copyright Janet Aucock Head of Repository services March 2014.
Advertisements

Open Stirling: Open Access Publishing and Research Data Management at Stirling Monday 25 th March 2013 Michael White, Information Services STORRE Co-Manager/RMS.
EndNote. What is EndNote:  EndNote is referencing software that enables you to create a database of references from your readings. Your database of references.
Compiled by Helene van der Sandt. Is a search engine that searches for scholarly literature Can search across many disciplines Searches for articles,
ABSTRACT WormBase is a freely available information resource primarily for the nematode Caenorhabditis elegans but which progressively includes data from.
Software Delivery. Software Delivery Management  Managing Requirements and Changes  Managing Resources  Managing Configuration  Managing Defects 
Collaboration with IntAct and InterMine: SGD Rama Balakrishnan Saccharomyces Genome Database Gene Ontology Consortium Stanford University, CA USA.
A Systematic approach to the Large-Scale Analysis of Genotype- Phenotype correlations Paul Fisher Dr. Robert Stevens Prof. Andrew Brass.
NATIONAL LIBRARY OF MEDICINE PubMed Central Martha Fishel National Library of Medicine CENDI Meeting September 15, 2004.
GMOD Meeting, May 2005 Patent Pending, Caltech Proprietary Textpresso Search engine for Biomedical Literature ~Eimear Kenny~
Software to Manage EEP Vegetation Plot Data A design proposal Michael Lee January 31, 2011.
CACAO - Remote training Gene Function and Gene Ontology Fall 2011
CACAO - Remote training Gene Function and Gene Ontology Fall 2011
1 Computing for Todays Lecture 22 Yumei Huo Fall 2006.
CACAO - Penn State Gene Function and Gene Ontology January 2011
Reference Manager Making your life easier! Updated September 2007.
Modeling Functional Genomics Datasets CVM Lessons 4&5 10 July 2007Bindu Nanduri.
Cis-Regulatory/ Text Mining Interface Discussion.
Genome database & information system for Daphnia Don Gilbert, October 2002 Talk doc at
Comparative Genomics of the Eukaryotes
Moving beyond free text. Authors Scientist does research Scientist publishes research results in journal article Old Paradigm:
SWIS Digital Inspections Project (SWIS DIP) Chris Allen, Information Management Branch California Integrated Waste Management Board November 5, 2008 The.
 You can use google docs to construct tests or surveys that can be given online.  Multiple choice, matching, or fill in one word answers work well with.
What is Sure BDCs? BDC stands for Batch Data Communication and is also known as Batch Input. It is a technique for mass input of data into SAP by simulating.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
GO and OBO: an introduction. Jane Lomax EMBL-EBI What is the Gene Ontology? What is OBO? OBO-Edit demo & practical What is the Gene Ontology? What is.
The Horizontal Gene Transfer Database Jeramy Brewster, Edward Simpson and Spandana Kommalapati Mentor: Mark Goebl, PhD.
How To: Add HYPERLINKS and IMAGES with HYPERLINKS to your Outlook Signature. By: Tom Jackson
Curation Editor Flexible web based editor for non gene model data. FlyBase – Harvard University Frank Smutniak.
9 Chapter Nine Compiled Web Server Programs. 9 Chapter Objectives Learn about Common Gateway Interface (CGI) Create CGI programs that generate dynamic.
1 Online Textbook Adooptions
How will we efficiently understand the interactions of ~20,000 genes, with ~200 million potential pairwise interactions? Minimally, we need to use the.
TAIR Workshop Model Organism Databases and Community Annotation Plant and Animal Genome XVI Conference, San Diego January 13, 2008.
Community Curation Enabling the research community to contribute annotations directly to WormBase Mary Ann Tuli.
Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation SAB 2008 WormBase Literature Curators Textpresso.
The Gene Ontology: a real-life ontology, progress and future. Jane Lomax EMBL-EBI.
©SHRM 2011 Revising Bylaws Dorothy Knapp, SPHR ▪ Volunteer Leaders’ Webcast Series ▪ April 28, 2011.
Connie Rogers LIS 764 Fall is a free bibliography composer helps you generate, edit and publish a works cited list guides you through punctuation.
SRI International Bioinformatics 1 Submitting pathway to MetaCyc Ron Caspi.
© 2007 IBM Corporation Information Server User Technology IBM Silicon Valley Lab Documentation Infrastructure Write, build, test, deliver, maintain.
Data provenance in biomedical discovery Donald Dunbar Queen’s Medical Research Institute University of Edinburgh Workshop on Principles of Provenance in.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Introduction to JavaScript CS101 Introduction to Computing.
Building WormBase database(s). SAB 2008 Wellcome Trust Sanger Insitute Cold Spring Harbor Laboratory California Institute of Technology ● RNAi ● Microarray.
Community Curation of Gene Descriptions Ranjana Kishore Pasadena, California.
To Boldly GO… Amelia Ireland GO Curator EBI, Hinxton, UK.
Data, data, data In-depth session on data integration.
Chapter 29 World Wide Web & Browsing World Wide Web (WWW) is a distributed hypermedia (hypertext & graphics) on-line repository of information that users.
SWIS Digital Inspections Project Chris Allen, Information Management Branch California Integrated Waste Management Board August 22, 2008.
Introduction to the Gene Ontology GO Workshop 3-6 August 2010.
Worldwide Protein Data Bank Common D&A Project Sequence Processing Modular Demo May 6, 2010 Project Deliverable.
Word 2007® Business and Personal Communication How can Microsoft Word 2007 help you work with others?
Creating Databases applications for the Web: week 2 Basic HTML review, forms HW: Identify unique source for asp, php, Open Source, MySql, Access.
Genetic Literature Curation at FlyBase-Cambridge Steven Marygold ABC meeting, December 2007 A Database of.
WP 17.1 Civil Engineering Building Permit / SharePoint Info Status on M. Manfredi.
Department of Computer Science, Florida State University CGS 3066: Web Programming and Design Spring
Dr. Abdullah Almutairi Spring PHP is a server scripting language, and a powerful tool for making dynamic and interactive Web pages. PHP is a widely-used,
Validation db status and plans (what happened since the Collaboration meeting) Hans Wenzel 10th Physics Lists and Validation Tools working group meeting.
How to Turnitin Dr Stephen Rankin Lecturer in Academic Writing and Literacy Murdoch University A 6 step guide for submitting your assignments to Turnitin.
Text2PTO: Modernizing Patent Application Filing A Proposal for Submitting Text Applications to the USPTO.
12 things that you need to know about Open Access, the REF and the CRIS Rowena Rouse Scholarly Communications Manager June 2016.
XP Creating Web Pages with Microsoft Office
California Institute of Technology
Internet Made Easy! Make sure all your information is always up to date and instantly available to all your clients.
Genomics research paper presentation
From one to many: expanding the Saccharomyces cerevisiae reference genome panel Stacia R. Engel Stanford University.
Department of Genetics • Stanford University School of Medicine
Exploring Microsoft® Access® 2016 Series Editor Mary Anne Poatsy
WebDAV Design Overview
1. C. briggsae sequence curation 2. SNP data handling
Presentation transcript:

Publishing Interactive Articles: Integrating Journals and Biological Databases Karen Yook Genetics Society of America, Dartmouth Journal Services, Textpresso and WormBase Karen Yook Genetics Society of America, Dartmouth Journal Services, Textpresso and WormBase ISB 2010 Session 6 : Future Directions and Challenges for Biocuration

Goal: hyperlink entities in a published paper to their database resource page

GSA-Database markup pipeline Paper accepted by GENETICS GENETICS/ DJS produces source XML file Source XML file deposited on Database server TEXT LINKING SCRIPT Database entity repository Linked XML returned to GENETICS/ DJS Linked XML returned to GENETICS/ DJS Web service Web service Corrections to Proof Interactive journal published ftp Database Curator QC Database Curator QC

What is needed for this to work Textpresso -> Arun Source file from the journal (XML format) List of entities that exist in the database Stable database URL constructor Database curator for manual quality control (QC) Method for ensuring URL stability Textpresso -> Arun Source file from the journal (XML format) List of entities that exist in the database Stable database URL constructor Database curator for manual quality control (QC) Method for ensuring URL stability

What is needed: List of entities that exist in the database, these are mined directly from the database (WormBase) Entity ClassExamples # of WormBase entities CloneA1D10, C02H12, GAP1_10204,605 Genemec-15, F25B3.5, y110a7a.10537,027 Variation (alleles)ad820sd, cb ,649 TransgeneadEx1290, cmIs6, jeIn26,546 RearrangementhDp23, stDf47, sT1709 Protein (to gene pages)MEC-15, ABF-1 StrainBC954, IE ,826 Anatomy AWC, body wall muscle 6,563 PhenotypeMuv, Lon, Dpy122 Total 1,059,047

What is needed: Stable database URL constructor: For WormBase- NAME is the entity string (lin-11, e189, gonad) CLASS is the class the entity belongs to (gene, variation, anatomy) For Saccharomyces Genome Database (SGD)- SGDID is the SGD identification for the entity Stable database URL constructor: For WormBase- NAME is the entity string (lin-11, e189, gonad) CLASS is the class the entity belongs to (gene, variation, anatomy) For Saccharomyces Genome Database (SGD)- SGDID is the SGD identification for the entity

What is needed: Database curator for manual quality control: Checks all links made by the linking script Checks for ambiguities - adds them to an exclusion list - provides the right entity NAME/ID Checks for entities that did not get linked - author writing shortcuts - science jargon - new entities not in the database yet Database curator for manual quality control: Checks all links made by the linking script Checks for ambiguities - adds them to an exclusion list - provides the right entity NAME/ID Checks for entities that did not get linked - author writing shortcuts - science jargon - new entities not in the database yet

Manual QC resolves ambiguities that an automated script cannot resolve X X

Manual QC is needed to decipher authors’ writing shortcuts All genes are hyperlinked to their respective web page (zig-2, zig-3, zig-5, zig-6, zig-7,

Manual QC maps informal terminology (jargon) to formal science terms SGPs = somatic gonad precursor cells By adding acronyms and science-field jargon as synonyms to the canonical science term in the database, we capture the current language of our community while making the science more understandable to the wider science community.

What is needed: Method for ensuring URL stability : A list of entity classes, names, URLs, and webpage status is automatically generated after the final manual QC stage has been done. This list provides an easy way to test the links for viability in the future.

WormBase creates silent links to entities not yet in the database GENETICS helps WormBase identify new entities by sending the author an entity declaration form Manual QC identifies new entities not declared by the author Silent links are made using the URL constructors, which will become live once the entities are added to the database GENETICS helps WormBase identify new entities by sending the author an entity declaration form Manual QC identifies new entities not declared by the author Silent links are made using the URL constructors, which will become live once the entities are added to the database

GSA-WormBase markup pipeline Paper accepted by GENETICS GENETICS/ DJS produces source XML file Source XML file deposited on WormBase server WormBase LINKING SCRIPT WormBase entity repository Linked XML returned to GENETICS/ DJS Web service Web service Corrections to Proof Interactive journal published ftp MOD Curator QC MOD Curator QC

GSA-WB markup pipeline includes author participation for new entities Paper accepted by GENETICS GENETICS/ DJS produces source XML file Source XML file deposited on WormBase server WormBase LINKING SCRIPT WormBase entity repository Linked XML returned to GENETICS/ DJS GENETICS s authors to fill out new entities form Web service New entities entered into form Corrections to Proof Interactive journal published ftp MOD Curator QC MOD Curator QC

Genetics sends authors an entity declaration form for WormBase

50% (11/22) author response Genetics sends authors an entity declaration form for WormBase

Caenorhabditis elegans papers marked up published starting in August waiting to be published 5 more papers accepted, still to mark up 10 entity classes marked up All have needed manual QC (ambiguities, new objects, copy edit errors) Saccharomyces cerevisiae papers marked up entity classes marked up Drosophila melanogaster papers will be ready to go live soon Caenorhabditis elegans papers marked up published starting in August waiting to be published 5 more papers accepted, still to mark up 10 entity classes marked up All have needed manual QC (ambiguities, new objects, copy edit errors) Saccharomyces cerevisiae papers marked up entity classes marked up Drosophila melanogaster papers will be ready to go live soon Where we stand at the moment

Linking to multiple MODs in a single paper Current challenges

Many fly gene names are difficult to distinguish from standard everyday words, e.g., we, for, a Information-rich XML-tagged files help identify these genes “Therefore, we tested members of major pathways that control DC such as Jun-related antigen ( Jra ), puckered ( puc ), and Src homology 2 ankyrin repeat tyrosine kinase ( shark ) in the DJNK signaling pathway and thickveins ( tkv ), schnurri ( shn ), and zipper ( zip ) in the TGF-β signaling pathway. Furthermore, we tested additional genes involved in DC such as u-shaped ( ush ), Epidermal growth factor receptor ( Egfr ), Protein kinase related to protein kinase N ( Pkn ), scab ( scb ), and Rho1 ….” Linking fly genes to FlyBase Current challenges

Benefits of published text markup For the community: Less time to get to the correct database page Transparency of science jargon Term ambiguities are dealt with before publication For the journal: Becomes a live portal to science information For the author: Extra pair of eyes for proofing their document Data enter the database faster For the database: Increase depth and accuracy of data in the database Get information before publication Resolve conflicts in term use before publication

Future challenges Extending this pipeline to other MODs / Databases Extending this pipeline to other journals - standardized XML format can help Increasing the number of Databases and entity classes linked in a multi-organism paper Fixing broken links Re-marking up publications Extending this pipeline to other MODs / Databases Extending this pipeline to other journals - standardized XML format can help Increasing the number of Databases and entity classes linked in a multi-organism paper Fixing broken links Re-marking up publications

William M. Gelbart (Harvard University) Thom Kaufman, Kathy Matthews (Indiana University) Raymund Stefancsik, Steven Marygold (Cambridge University, UK) Acknowledgements Arun Rangarajan Karen Yook Juancarlos Chan Hans-Michael Muller Paul W. Sternberg (HHMI) Tracey DePellegrin-Connelly Ruth Isaacson Tim Schedl (Wash.U. School of Medicine) Stephen Haenel Lolly Otis Sharon Faelten Saccharomyces Genome Database Marek S. Skrzypek, J. Michael Cherry (Stanford University School of Medicine)