Xianfeng Jeff Chen Ph.D. Research Investigator/Project Manager Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database.


1 Overview and Implementation Strategy of the NIAID-Funded Bio-defense Proteomics Database System. Xianfeng Jeff Chen, Ph.D., Research Investigator/Project Manager

2 Agenda Today
(1) Introduction
- VBI responsibility in Admin Center
- PRCs data types and organisms
- Proteomics data submission and storage workflow
(2) Database Development
- VBI computing system architecture (CPU and storage)
- VBI database system prototype and functionality
- VBI existing database schema and status
- Example Y2H schema for design logic and case study
(3) Strategy on Knowledgebase Development
- Proposed data integration and knowledgebase construction

3 Introduction

4 Tasks of Proteomics Data Management
Raw data feeds into Proteomics Data Management (processed data):
- Data Storage & Visualization Tools (VBI)
- Analysis, Annotation, & Curation (GU)
- Data QA/QC, Interoperability (VBI/GU)
- SOP, LIMS, & Admin DB (SSS)

5 PRCs Major Data Type
Organization: Major Data Type
University of Michigan: Microarray and mass spectrometry
Caprion: Mass spectrometry
Harvard Proteomics Institute: Genomics and protein expression array
Albert Einstein College of Medicine: Mass spectrometry
PNNL: Mass spectrometry
Scripps: NMR structural and X-ray crystal diffraction data
Myriad Genetics: Yeast two-hybrid system

6 PRCs Organisms
Einstein: Toxoplasma gondii, Cryptosporidium parvum
Caprion: Brucella abortus
Harvard: Bacillus anthracis (protein array), Vibrio cholerae
Myriad: Bacillus anthracis (Y2H), Yersinia pestis, Francisella tularensis, vaccinia
PNNL: Orthopox (vaccinia and monkeypox), Salmonella typhimurium, Salmonella typhi
Scripps: SARS CoV
Michigan: Bacillus anthracis (TXP, MS) + host (human)

7 Proteomics Data Flow
Sources: PRCs, VBI, public data sources
Data types: 2D gels, protein array, LC, immunoaffinity purification, Y2H, MS, MS/MS, NMR, X-ray, cryoelectron microscopy, X-ray diffraction, etc.
Flow: QA & QC (quality assurance and quality control), converting to standard format (a standard format for each data type), QA & QC again, data modeling with decomposition, relational database
MIAME and MIAPE-like standards/SOP for data submission
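The conversion and QA/QC steps in this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the list of accepted data-type labels, and the function names are hypothetical and do not come from the slides, and the two QA/QC passes are collapsed into a single check after conversion.

```python
# Hypothetical field names and data-type labels; the slides do not
# specify the actual submission standard.
REQUIRED_FIELDS = ("prc", "data_type", "organism")
KNOWN_DATA_TYPES = {"2d gel", "protein array", "lc", "y2h", "ms", "ms/ms", "nmr", "x-ray"}


def to_standard_format(record):
    """Lower-case the keys and strip whitespace so later checks see one layout."""
    return {str(k).strip().lower(): str(v).strip() for k, v in record.items()}


def qa_qc(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = ["missing field: " + f for f in REQUIRED_FIELDS if not record.get(f)]
    data_type = record.get("data_type", "").lower()
    if data_type and data_type not in KNOWN_DATA_TYPES:
        problems.append("unknown data type: " + data_type)
    return problems


def process_submission(record):
    """Convert first, then check; only clean records would move on to data modeling."""
    standard = to_standard_format(record)
    return standard, qa_qc(standard)
```

A record that returns an empty problem list would be passed to the data-modeling step; anything else would go back to the submitting PRC.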

8 Database Development

9 VBI Computing System (diagram labels)
Data storage: Binary, Software, Project, Proteomics, Genomics
PC users: Jeff, Wei, Chaitanya, Chendong, Ranjan, Oswald, Bruno
LINUX, SUN (Solaris): Gimli, Elenwe
7 PRCs
Networked file server: TUOR
Relational database server: Proteomics (Chendong, Jeff, Wei, Ranjan, Chaitanya)
Web server, application server

10 Database System Development in Q3 of 2005: Development, Test/Stage, Production; Web Interface

11 Production: Test: Development: Proteomics Database Project Websites

12 Production Website Instance Functionalities (dynamically generated webpages):
(1) Account management
(2) File and document management
(3) News group and news update
(4) Textual data display
(5) 2D gel image data display
(6) Table and record query
(7) Data uploading and simple submission
(8) HTTP data downloading
(9) SFTP file transfer

13 Database Query
Search by Experiment: select an experiment; retrieve the list of bait protein and nucleotide and prey protein and nucleotide; follow links to details of bait and prey (example: Drosophila melanogaster)
Search by Organism: Escherichia coli, Saccharomyces cerevisiae, Homo sapiens, Drosophila melanogaster, Helicobacter pylori, Caenorhabditis elegans
Search by Data Type: Proteomics, Genomics, Microarray
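A search by experiment of this kind amounts to a join between an experiment table and an interaction table. The sketch below is an assumption-laden illustration, not the VBI schema: the table names, column names, and the placeholder rows are hypothetical, and Python's built-in sqlite3 stands in for the Oracle database so the example is self-contained.

```python
import sqlite3

# In-memory database with a hypothetical two-table Y2H layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    organism      TEXT NOT NULL
);
CREATE TABLE interaction (
    interaction_id INTEGER PRIMARY KEY,
    experiment_id  INTEGER NOT NULL REFERENCES experiment(experiment_id),
    bait_protein   TEXT NOT NULL,
    prey_protein   TEXT NOT NULL
);
""")

# Placeholder rows for illustration only; these are not real results.
conn.execute("INSERT INTO experiment VALUES (1, 'demo_y2h', 'Drosophila melanogaster')")
conn.executemany(
    "INSERT INTO interaction (experiment_id, bait_protein, prey_protein) VALUES (?, ?, ?)",
    [(1, "baitA", "preyX"), (1, "baitA", "preyY")],
)


def pairs_for_experiment(connection, experiment_name):
    """Return (bait, prey) pairs recorded for the named experiment."""
    query = """
        SELECT i.bait_protein, i.prey_protein
        FROM interaction AS i
        JOIN experiment AS e ON e.experiment_id = i.experiment_id
        WHERE e.name = ?
        ORDER BY i.bait_protein, i.prey_protein
    """
    return connection.execute(query, (experiment_name,)).fetchall()
```

Searching by organism or by data type would follow the same pattern with a different WHERE clause.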

14 Query for Scripps Sample Data
Search by Project/Experiment: Scripps MS testing project; available peptide hit list; retrieve peak information and m/z and intensity list

15 Query for 2D Gel Data: Search by Experiment/Sample

16 Proteomics Database Architecture: Process-Oriented Production Design
Data types: 2D Gel, Y2H, MS, NMR, Protein Array, LC, X-Ray, Cryoelectron Microscopy, Immunoaffinity Purification, X-Ray Diffraction
Three phases of database design: (1) multiple schemas of disparate data; (2) consolidate to one schema to remove redundancy, normalized with key-value pairs; (3) stored procedures for the analysis pipeline
Layers: physical layer; logical layer (views, materialized views); application layer (final views)

17 Proteomics Database Architecture: Three Database Instances
Phase 1 (Version 1, 0.5-1 year): disparate data with multiple schemas; individual dataset modeling
Phase 2 (Version 2, 1-1.5 years): consolidation into a few schemas; a normalized, highly decomposed data model implemented as key-value pairs
Phase 3 (Version 3, 2 years): analysis pipeline procedures; logical layer with views for the user
Physical layer: 1. Partially processed data 2. Data enhanced with knowledge 3. Interface less changeable 4. Curated/annotated data
Instances: Development, Test/stage, Production
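A key-value data model with a user-facing view on top might look like the following sketch. The table, view, and attribute names are hypothetical, the sample rows are placeholders, and sqlite3 is used only to keep the example runnable; SQLite has no materialized views, so a plain view stands in for the logical layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical layer: one narrow key-value table (hypothetical names).
CREATE TABLE sample_attribute (
    sample_id  INTEGER NOT NULL,
    attr_key   TEXT    NOT NULL,
    attr_value TEXT,
    PRIMARY KEY (sample_id, attr_key)
);

-- Logical layer: a view that pivots the pairs back into one row per sample.
CREATE VIEW sample_wide AS
SELECT sample_id,
       MAX(CASE WHEN attr_key = 'organism'  THEN attr_value END) AS organism,
       MAX(CASE WHEN attr_key = 'data_type' THEN attr_value END) AS data_type
FROM sample_attribute
GROUP BY sample_id;
""")

conn.executemany(
    "INSERT INTO sample_attribute VALUES (?, ?, ?)",
    [
        (1, "organism", "Bacillus anthracis"),
        (1, "data_type", "Y2H"),
        (2, "organism", "Yersinia pestis"),  # no data_type recorded yet
    ],
)

rows = conn.execute(
    "SELECT sample_id, organism, data_type FROM sample_wide ORDER BY sample_id"
).fetchall()
```

The trade-off is the usual one for key-value designs: new attributes need no schema change, but every user-facing query goes through a pivot like the one in the view.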

18 Status of VBI Database Development
Schema            Development (maturity)   Test/stage   Production
Adm               + (10/10)                +            +
2D Gel            + (10/10)                +            +
MS                + (10/10)                +            +
Interaction       + (9/10)                 +            -
Pathway           + (7/10)                 +            -
Data Repository   + (8/10)                 +            +
Y2H               + (10/10)                +            +
Genomics          + (10/10) (GUS)          +            +
Microarray        + (10/10) (AE)           +            +
Default tablespaces: Admin_data, Genomics_TBLS, Pathway_TBLS, Microarray_TBLS, Proteomics_TBLS.

19 Generic Experiment Data Components (Example of Database Design Logic)
Who (People)
Where (Organization)
Project (Goal)
Materials and Methods (Metadata)
Results (Raw Data)
Conclusion and Hypothesis (Processed and Analyzed Data)

20 Y2H Data Component Modeling: People, Experiment, Project, Sample, Results, Conclusion, Hypothesis, DNA/Protein Detail

21 Experiment Component Object Model
Classes: Experiment Design, Experiment Factor, Factor Value, Design Description, Ontology Entry
Ontology entries handle two annotation cases: (1) there are diverse choices, and existing ontologies capture the information better; (2) the values are essentially controlled vocabularies that are limited in number of choices but might grow in the future or vary by technology type.
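One way to picture this object model is as a handful of small classes in which every annotated value is an ontology entry rather than free text. The attribute names and the way the classes reference each other are assumptions made for illustration; the slide names the classes but does not spell out their fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class OntologyEntry:
    """A term drawn from an ontology or controlled vocabulary."""
    category: str  # e.g. the vocabulary the term belongs to (assumed field)
    value: str     # the term itself (assumed field)


@dataclass
class FactorValue:
    """One level of an experimental factor, annotated by an ontology entry."""
    value: OntologyEntry


@dataclass
class ExperimentFactor:
    """A variable in the experiment, with the values it takes."""
    name: str
    category: OntologyEntry
    values: List[FactorValue] = field(default_factory=list)


@dataclass
class ExperimentDesign:
    """Design description plus the factors that were varied."""
    description: str
    design_types: List[OntologyEntry] = field(default_factory=list)
    factors: List[ExperimentFactor] = field(default_factory=list)
```

Because the vocabulary lives in OntologyEntry instances, adding a new allowed term later means adding data, not changing a class or a table definition.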

22 Y2H Partial Database Schema

23 Proteomics DB System Architecture
Storage: Public File Server, Private File Server, Oracle Relational Database
Database access: JDBC, Perl DBI/DBD, ODBC
Batch processing: (1) data uploading; (2) data validation; (3) data analysis; (4) data processing
Languages and interfaces: JSP, CGI, Java; Perl, Java

24 System Architecture of Putative VBI Proteomics Knowledgebase: Data, Tool, Project, and Team Interoperability
Application layer: Web Display and Data Visualization
Virtual Database/Warehouse; Security; Temporary data
Service-Oriented Middleware with Process Control
Data sources: Array Express, Mass Spectrometry, Two Component System, 2D Gel, Structure Data, Genomics Data

25 Strategy on Data Integration and Construction of Knowledge Warehouse

26 Biological Information Workflow
Steps: Information Storage, Queries & DB Management; Cleaning, Processing Algorithms; Curation and Annotation of Data; Knowledge Generation; Biological Research; Target Discovery; Diagnostics, Therapeutics & Vaccines
Spans: Data Management, Knowledge Management

27 VBI PDC Project Phases
Bio-IT scope: Data Integration, Knowledge generation, Knowledge management, Knowledge presentation
Phase I: first 2 years. Phase II: 3rd-4th years. Phase III: 5th year.
Activities, in slide order: Raw data management; Schema development; Data visualization; Data standardization; Integration at interface level; Integration of data at DB level; Interoperability of datasets; Normalization and warehousing; Predefined query; Materialized view; Comparative analysis; Statistical analysis

28 Mapping the Proteome
(1) Yeast two-hybrid system: measures association between two proteins. Allows very high throughput.
(2) Mass spectrometry: allows identification of proteins within large complexes (2-100 proteins). Lower throughput.

29 Complex Interaction Model
Y2H Analysis: binary interactions
MS Analysis: n-ary interactions
PO4 Proteins
Infer Complex Interaction Topology
Knowledgebase
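A minimal sketch of how the two data types might be combined: mass spectrometry says which proteins are in a complex, and the binary yeast two-hybrid pairs that fall inside that member list suggest which members touch each other. The function name and inputs are hypothetical, and this is one simple reading of "infer complex interaction topology", not the method used in the project.

```python
from itertools import combinations


def complex_topology(complex_members, binary_edges):
    """Return the binary interactions whose two partners are both in the complex.

    complex_members: iterable of protein names found together (n-ary, e.g. from MS)
    binary_edges: iterable of (protein, protein) pairs (binary, e.g. from Y2H)
    """
    members = sorted(set(complex_members))
    # Store edges as unordered pairs so (a, b) and (b, a) are the same edge.
    edges = {frozenset(edge) for edge in binary_edges}
    return [pair for pair in combinations(members, 2) if frozenset(pair) in edges]
```

For example, a complex of A, B, and C with Y2H pairs A-B and C-B would yield the chain A-B, B-C, with no direct A-C contact supported by the binary data.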

30 Data Organization: Bacillus anthracis
(1) Completed genomes (Ames, Ames Ancestor, A2012): NCBI, TIGR
(2) Yeast two-hybrid interaction data: Myriad Genetics
(3) Mass spectrometry: Scripps and Caprion
(4) Microarray expression profiling: Univ. of Michigan
(5) Intraspecies and interspecies clustering: NCBI (COG) and TIGR
(6) Functional category assignment: GU (PIR)

31 Strategy for Knowledgebase Construction
(1) Annotation improvement
  (a) Non-homology-based methods: phylogenetic profiling, Rosetta stone pattern, operon analysis, co-expression profiling, gene neighboring, etc.
  (b) Comparative genomics with two reference genomes: E. coli and yeast
(2) Identifying anchor points for data integration
  (a) Known metabolic pathways (E. coli and yeast)
  (b) Known signal transduction pathways
  (c) Known gene regulation machinery
  (d) Known protein-protein interaction maps
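Of the non-homology-based methods listed, phylogenetic profiling is the easiest to show in code: each gene gets a presence/absence vector across a set of genomes, and genes with matching vectors are candidate functional partners. The sketch below assumes profiles are already computed and uses hypothetical gene names; it is an illustration of the general idea, not the project's pipeline.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of genomes in which the two presence/absence profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def linked_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes.

    profiles: dict mapping gene name -> tuple of 0/1 flags, one per genome.
    """
    pairs = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        if hamming(profiles[gene_a], profiles[gene_b]) <= max_distance:
            pairs.append((gene_a, gene_b))
    return pairs
```

A real analysis would also drop uninformative profiles (genes present in every genome or in none), which this sketch leaves out.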

32 Data Integration
Genomics data: improved annotation, comparative genomics
Anchor on the knowledge network of the reference genomes (E. coli and yeast)
Lay down Y2H interaction data and expand the network
Lay down MS multiple-interaction data to expand the network
Lay down microarray data to add co-expression patterns to the gene network
Putative Knowledgebase: No thing
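The layering described here can be pictured as one network whose edges remember which data sources support them. The sketch below is a hypothetical illustration: the function names and source labels are made up, and expanding an MS complex into all pairwise links is an assumed convention, not something the slides specify.

```python
from collections import defaultdict
from itertools import combinations


def complex_to_pairs(members):
    """Expand an n-ary complex into all pairwise links (an assumed convention)."""
    return list(combinations(sorted(set(members)), 2))


def build_network(layers):
    """Merge evidence layers into one network.

    layers: iterable of (source_name, iterable of (protein, protein) pairs),
            given in the order they should be laid down.
    Returns a dict mapping each unordered protein pair to the set of sources
    that support it.
    """
    network = defaultdict(set)
    for source, pairs in layers:
        for a, b in pairs:
            if a != b:  # ignore self-links in this sketch
                network[frozenset((a, b))].add(source)
    return dict(network)
```

Edges supported by several sources (for instance the reference network plus Y2H plus MS) could then be ranked above edges seen in only one layer.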

33 Data Mining and Knowledge Augmentation: Literature, Y2H analysis, MS analysis, Microarray

34 Acknowledgement
Name: Role (Organization)
Dr. Jeff Chen: Project Manager/Investigator (VBI)
Dr. Chendong Zhang: Senior Software Engineer (VBI)
Dr. Steve Cammer: Bioinformatics Scientist (VBI)
Dr. Oswald Crasta: Scientist and CI-Co-director (VBI)
Susan Baker: DBA (VBI)
Jiang Lu: DBA (VBI)
Ranjan Jha: Software Engineer (VBI)
Qiang Yu: Software Engineer (VBI)
Jian Li: Software Engineer (VBI)
Wei Sun: Software Engineer (VBI)
Chaitanya Kommidi: Software Engineer (VBI)
Dr. Bruno Sobral: Co-PI (VBI)
Dr. Peter MacGarvey: Senior Bioinformatics Scientist (GU)
Dr. Cathy Wu: Co-PI (GU)
Paula Yadvish: Web Coordinator (SSS)
Margaret Moore: PI (SSS)
