Data Integration and Management A PDB Perspective.

Presentation transcript:

Data Integration and Management: A PDB Perspective

What is PDB?
Single international repository of three-dimensional data for biological macromolecules
Public community resource
Established at Brookhaven in 1971 (7 structures)
Moved to RCSB in 1998
wwPDB established in 2004
More than 25,000 structures in PDB

Community
Scientific Community - at all levels
  – Structural biologists (crystallography, NMR, cryo-EM)
  – Biologists
  – Computational biologists
Journals
General Community
  – Secondary school
  – General public
Internal
  – RCSB PDB staff
  – wwPDB members

Data Representation
Macromolecular Crystallographic Information Framework
XML DTD/Schema Mapping
SQL Schema Mapping
CORBA IDL Mapping
Supporting emerging ontology representations - OWL

Elements of Dictionary Metadata
Data Attributes
  – Definition
  – Examples
  – Data type (primitive type/regular expression patterns)
  – Range or allowed values
Classes
  – Categories
  – Subcategories
  – Category groups
Associations
  – Parent-child relationships
  – Interdependencies/exclusivity
  – Methods
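
The attribute-level metadata listed on this slide (definition, example, data type pattern, range or allowed values) can be sketched in a few lines of Python. This is an illustration only, not the PDB dictionary software: the `AttributeDef` class, its field names, the two sample attributes and their bounds and value lists are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class AttributeDef:
    """Hypothetical, simplified record for one dictionary data attribute."""
    name: str
    definition: str
    example: str
    pattern: str                                        # data type as a regular expression
    value_range: Optional[Tuple[float, float]] = None   # inclusive numeric bounds
    allowed: Optional[FrozenSet[str]] = None            # enumerated allowed values

    def check(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value is acceptable."""
        if re.fullmatch(self.pattern, value) is None:
            return [f"{self.name}: {value!r} does not match the data type pattern"]
        problems = []
        if self.value_range is not None:
            low, high = self.value_range
            if not low <= float(value) <= high:
                problems.append(f"{self.name}: {value} is outside [{low}, {high}]")
        if self.allowed is not None and value not in self.allowed:
            problems.append(f"{self.name}: {value!r} is not an allowed value")
        return problems


FLOAT = r"-?\d+(\.\d+)?"  # assumed primitive type pattern for a decimal number

# Illustrative attribute definitions; names and limits are assumptions.
ANGLE_ALPHA = AttributeDef(
    name="cell.angle_alpha",
    definition="Unit-cell angle alpha in degrees.",
    example="90.0",
    pattern=FLOAT,
    value_range=(0.0, 180.0),
)
METHOD = AttributeDef(
    name="experiment.method",
    definition="Technique used to determine the structure.",
    example="NMR",
    pattern=r"[A-Za-z\- ]+",
    allowed=frozenset({"X-RAY DIFFRACTION", "NMR", "ELECTRON MICROSCOPY"}),
)

for attribute, value in [(ANGLE_ALPHA, "90.0"), (ANGLE_ALPHA, "270"), (METHOD, "GUESSWORK")]:
    print(attribute.name, value, attribute.check(value) or "ok")
```

Because the checks are driven entirely by the attribute records, adding or changing an attribute requires no change to the checking code, which is the property the later "Features of System" slide describes.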

Difficult Issues
Resolving semantic ambiguities – encoding meaning
Integrating controlled vocabularies
Separation of primary and derived information
Supporting rapid evolution of science

What’s Driving Data Definition
IUCr-sponsored community effort
Automated data acquisition
Data management and data exchange for PDB
New technologies (e.g. cryo-electron microscopy)
High-throughput structure determination and structural genomics

Typical Project Deposition Data Flow
[Diagram labels: Target Selection, Protein Production, Crystal Production, Structure Determination, PDB Deposition, Project Database, Merged Project Data, Exchange Dictionary]

Data Sharing Nightmare

Incremental Data Pipeline

Current Integration Strategy
Provide software tools to collect bits of data from the output of each program step
Convert data in log and output files to a common representation
Merge the data corresponding to the successful outcome
Provide an editor tool to enter remaining data and check consistency of results
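
The collect, convert, and merge steps of this strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the RCSB tools: the "key = value" / "key: value" log layout, the `status` line marking a successful run, and all function and item names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Assumed log layout: one "key = value" or "key: value" pair per line.
LINE = re.compile(r"^\s*([A-Za-z][\w ]*?)\s*[:=]\s*(.+?)\s*$")


def parse_log(text: str) -> Dict[str, str]:
    """Convert one program's log text to a common key/value representation."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            key = match.group(1).lower().replace(" ", "_")
            record[key] = match.group(2)
    return record


def merge_successful(records: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge only the records of successful steps; later steps override earlier ones."""
    merged: Dict[str, str] = {}
    for record in records:
        if record.get("status") == "success":
            merged.update({k: v for k, v in record.items() if k != "status"})
    return merged


def missing_items(merged: Dict[str, str], required: List[str]) -> List[str]:
    """List required items that still have to be entered by hand in an editor."""
    return [name for name in required if name not in merged]


# Made-up log fragments for three program runs, one of which failed.
scaling_log = "status: success\nResolution = 1.8\nSpace group: P 21 21 21"
failed_refinement_log = "status: failed\nR factor = 0.41"
refinement_log = "status: success\nR factor = 0.19"

merged = merge_successful(
    parse_log(text) for text in (scaling_log, failed_refinement_log, refinement_log)
)
todo = missing_items(merged, ["resolution", "r_factor", "crystal_growth_method"])
print(merged)
print(todo)
```

The list returned by `missing_items` corresponds to the last step on the slide: whatever the programs did not produce is left for the editor tool to collect and check.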

Data Deposition and Annotation
[Diagram labels: Depositor, ADIT, Validate, Annotate, Validation Report, Corrections, Depositor Approval, Functional Annotation, Core DB, PDB ID, PDB Entry, Archival Data, Distribution Site; Steps 1 to 5]

Integrated Data Processing System
[Diagram labels: Data Assembled by Depositor, Client Input Tool, ADIT, ADITsrv, MAXIT, Validation, Database Loader, Metadata Dictionaries, Data Views, Reports, Final Files]

Features of System
Different dictionaries without software changes
Metadata customization of both functionality and content
Automatically scales with changes in content
Can be distributed to multiple deposition sites
Reference data and standard nomenclature (ERFs)
Self-monitoring

Data Distribution
[Diagram labels: Applications, API, Servers, Relational Database, mmCIF Parsers, XML Files, mmCIF Data Files (Data Reference Standard)]

Automatic Production of Macromolecular Structure API Components
[Diagram labels: PDB Exchange Dictionary + API Specific Data Dictionaries, Metamodel Framework, CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders, Database Access Classes]
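
One of the generated components named on this slide is an SQL schema. A minimal sketch of deriving a table definition from dictionary-style metadata might look like the following. The miniature category description, the type mapping, the function name and the entry ID `1ABC` are assumptions for illustration; the actual generator and dictionary format are not shown in the slides.

```python
import sqlite3
from typing import List, Tuple

# (attribute name, primitive type, is part of the category key)
Attribute = Tuple[str, str, bool]

# Assumed mapping from dictionary primitive types to SQL column types.
SQL_TYPES = {"char": "TEXT", "int": "INTEGER", "float": "REAL"}

# Hypothetical, cut-down description of one dictionary category.
CELL_CATEGORY: List[Attribute] = [
    ("entry_id", "char", True),
    ("length_a", "float", False),
    ("angle_alpha", "float", False),
]


def category_to_ddl(category: str, attributes: List[Attribute]) -> str:
    """Build a CREATE TABLE statement with one column per data attribute."""
    columns = ['"{}" {}'.format(name, SQL_TYPES[ptype]) for name, ptype, _ in attributes]
    keys = ['"{}"'.format(name) for name, _, is_key in attributes if is_key]
    if keys:
        columns.append("PRIMARY KEY ({})".format(", ".join(keys)))
    return 'CREATE TABLE "{}" ({});'.format(category, ", ".join(columns))


ddl = category_to_ddl("cell", CELL_CATEGORY)

# Load the generated schema into an in-memory database to confirm it is usable.
connection = sqlite3.connect(":memory:")
connection.execute(ddl)
connection.execute('INSERT INTO "cell" VALUES (?, ?, ?)', ("1ABC", 52.1, 90.0))
row = connection.execute('SELECT "length_a" FROM "cell"').fetchone()
print(ddl)
print(row)
```

The same category description could drive an XML schema or an IDL generator, so the dictionary stays the single place where the data model is defined.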

Management
Complex challenges in technology and sociology
Communicate and work with diverse community
Help create and enforce community policies and standards
Must take advantage of the most current innovations in new technologies
New technologies must be introduced so as to enable and not disrupt the users of the resource
Beyond all else is the need for good data and a robust data representation

Access
RCSB Protein Data Bank Site
OpenMMS site (Java implementation)
RCSB PDB Software Download Site (C++ and Python implementation, NDB server)
RCSB PDB Dictionary Resource Site
RCSB PDB Beta Data Site ftp://beta.rcsb.org/pub/pdb/uniformity/data/

Operated by three members of the RCSB: Rutgers, The State University of New Jersey; San Diego Supercomputer Center at the University of California, San Diego; Center for Advanced Research in Biotechnology/UMBI/NIST.
The RCSB PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Office of Science, Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).
The RCSB PDB is a member of the wwPDB (