Proteomics databases for comparative studies: Transactional and Data Warehouse approaches
Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall
GeneProt, Switzerland

What is Proteomics?
[Figure: workflow diagram. Labels: Sample; Separation (CEX, RP); MS; BioInformatics processes; DB; Manual analysis. Sequence labels: Protein, EST, Genomic, Peptide; P1 to P5, E1 to E3, G1.]

About DBs for Proteomics at GeneProt
- Needs
- Data
- Transactional DB
- Data Warehouse
- Data Mining

Data Management Challenges
- A high-throughput environment requires near-real-time processing
- Respond quickly to evolving laboratory procedures and evolving user needs
- Accommodate heterogeneous data types
- Manage a constantly rising flood of data
- Provide convenient data access at all levels of granularity via analysis software and web front ends
- Adapt to the demand for global queries across all proteomics studies
- Adapt and innovate to offer new tools:
  - Statistics
  - Data mining

Data Flow
[Figure: data flow diagram. Labels: experimental data (LIMS); identification of peptides and proteins; external data sources; annotation; DB; XML; data export.]

Data details
- Experimental data:
  - Store MS and MS/MS peak lists
  - Store all metadata
- Identification:
  - Load peptide matches, identified proteins, scores
- Automatic annotation and analysis:
  - Give access to data, store results
- Expert annotation:
  - Give interactive access to data through a Web interface, store manual validation and annotation
- External data sources:
  - Import information from external data sources: taxonomy, ontologies, bibliography…
- Export data:
  - Export all or a subset of the data
  - Flat file
  - Database dump
- Misc:
  - Access control, security and confidentiality
  - Data consistency/integrity checks
  - Error checks and corrections
  - Run statistics
  - Backup and archive

Data production per project
- Raw data (spectra): 330 000 -> 1 500 000
- Identified peptides: 45 000 -> 145 000
- Identified sequences: 10 000 -> 120 000
- Database size: 15 GB -> 140 GB
- Number of projects: 16
- Total: 1 TB of database files

Implementation: transactional
- Intended to capture all relevant information from proteomics experiments, protein identification, and automatic and manual annotation and validation.
- Each proteome is isolated in its own ProtDB (16 at present).
- Complex and generic data model for efficient data storage.
- Built-in data consistency and error checks.
- A layer of "views" provides fast query access.
- Web front end: an interactive means to visualize, update and validate data.
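
As a rough illustration of the "consistency checks" and "layer of views" points, here is a minimal sketch using Python's built-in sqlite3 module in place of the Oracle system named in the talk. The table names, column names, accessions and scores are all hypothetical and are not taken from the ProtDB schema. Note that a plain view like this one mainly simplifies queries; whether it also speeds them up depends on the RDBMS and on how the view is implemented there.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a constraint and a summary view.

    All table and column names are hypothetical, not the ProtDB schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE protein (
            protein_id INTEGER PRIMARY KEY,
            accession  TEXT NOT NULL UNIQUE
        );
        CREATE TABLE peptide_match (
            match_id   INTEGER PRIMARY KEY,
            protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
            sequence   TEXT NOT NULL,
            score      REAL NOT NULL CHECK (score >= 0)  -- built-in error check
        );
        -- The view hides the join and aggregation from front-end queries.
        CREATE VIEW protein_summary AS
        SELECT p.accession,
               COUNT(m.match_id) AS n_peptides,
               MAX(m.score)      AS best_score
        FROM protein p
        LEFT JOIN peptide_match m ON m.protein_id = p.protein_id
        GROUP BY p.accession;
        """
    )
    conn.executemany(
        "INSERT INTO protein (protein_id, accession) VALUES (?, ?)",
        [(1, "PROT_A"), (2, "PROT_B")],
    )
    conn.executemany(
        "INSERT INTO peptide_match (protein_id, sequence, score) VALUES (?, ?, ?)",
        [(1, "AAK", 12.5), (1, "GGR", 30.0), (2, "LLK", 8.0)],
    )
    return conn


def summarize(conn: sqlite3.Connection) -> list:
    """Query the view instead of the underlying tables."""
    return conn.execute(
        "SELECT accession, n_peptides, best_score "
        "FROM protein_summary ORDER BY accession"
    ).fetchall()
```

A front end would call `summarize(build_demo_db())` and never see the join; an insert with a negative score is rejected by the CHECK constraint.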

Limitations
- We have 16 projects online:
  - High maintenance cost to keep all database schemas compatible.
  - Space: could we archive some of the projects? New spectrometers produce more data.
- Inter-database queries:
  - The technique "exists", but the implementation is often awkward and there is no efficient solution in our case.

What about overcoming these limitations and taking advantage of this wealth of data?
- Decide which data are actually important in the long term.
- Merge the data from all the projects.
- Clean and consolidate the data.
- Implement an update procedure to keep this "merged data system" up to date.
- (Archive old projects.)

Data Warehouse?
- This looks very much like the definition of a data warehouse!
- Data consolidation and integration
- Non-instantaneous accuracy, non-volatility
- Comprehensive data structure
- Query throughput

ProtWare: proteomics data warehouse
1. Stores consolidated and final analysis results, and centralises data common to proteins in all proteome studies.
2. Is read-only and not real time; asynchronous updates are run weekly.
3. Data model is focused on proteome-to-proteome comparisons.
4. Comprehensive data structure that enhances the performance of analysis queries.
5. Ideally suited for statistical analysis and data mining tools.
6. Provides a decision support system.
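
To make "proteome-to-proteome comparisons" concrete, the following sketch runs two set-style queries against a single consolidated table. It uses sqlite3 as a stand-in for the actual RDBMS, and the table `identification`, its columns, and the proteome names are assumptions made for this example, not the ProtWare data model.

```python
import sqlite3


def make_warehouse(rows) -> sqlite3.Connection:
    """Load (proteome, accession) pairs into one consolidated table.

    The table layout is hypothetical, chosen only for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE identification ("
        " proteome TEXT NOT NULL,"
        " accession TEXT NOT NULL,"
        " PRIMARY KEY (proteome, accession))"
    )
    conn.executemany("INSERT INTO identification VALUES (?, ?)", rows)
    return conn


def shared_proteins(conn: sqlite3.Connection, a: str, b: str) -> list:
    """Accessions identified in both proteomes."""
    sql = (
        "SELECT accession FROM identification WHERE proteome = ? "
        "INTERSECT "
        "SELECT accession FROM identification WHERE proteome = ? "
        "ORDER BY accession"
    )
    return [row[0] for row in conn.execute(sql, (a, b))]


def only_in(conn: sqlite3.Connection, a: str, b: str) -> list:
    """Accessions identified in proteome a but not in proteome b.

    EXCEPT is the standard SQL spelling; some systems spell it MINUS.
    """
    sql = (
        "SELECT accession FROM identification WHERE proteome = ? "
        "EXCEPT "
        "SELECT accession FROM identification WHERE proteome = ? "
        "ORDER BY accession"
    )
    return [row[0] for row in conn.execute(sql, (a, b))]
```

Because all proteomes live in one table, the comparison needs no cross-database link, which is the awkward part in the per-project layout.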

ProtDB and ProtWare data flow
[Figure: data flow diagram. Labels: ProtDB projects P1, P2, … Pn; identification; automatic annotation; annotation; website; classification, taxonomy…; XML; flat file; DB dump; export; flat file export; Extraction, Transformation, Loading (ETL); ProtWare; analyses & statistical queries. Data volumes shown on the diagram: 10^11 bytes, 10^8 bytes, 10^5 bytes.]
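
The Extraction, Transformation, Loading step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the row shape `(accession, validated)`, the cleaning rule (trim and upper-case accessions, keep validated rows only) and the function names are invented for the example and do not describe GeneProt's actual procedure.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical shape of rows extracted from one per-project database:
# (accession, validated)
ProjectRows = Iterable[Tuple[str, bool]]
# Warehouse facts: (project, accession)
Facts = Set[Tuple[str, str]]


def extract_transform(project: str, rows: ProjectRows) -> Facts:
    """Keep validated identifications, normalise accessions, drop duplicates.

    Upper-casing and trimming are assumed cleaning rules for illustration.
    """
    return {
        (project, accession.strip().upper())
        for accession, validated in rows
        if validated and accession.strip()
    }


def load(warehouse: Facts, project: str, facts: Facts) -> Facts:
    """Replace the project's slice of the warehouse, so reruns are idempotent."""
    kept = {fact for fact in warehouse if fact[0] != project}
    return kept | facts


def refresh(warehouse: Facts, projects: Dict[str, ProjectRows]) -> Facts:
    """Run one asynchronous update over the given projects."""
    for name, rows in projects.items():
        warehouse = load(warehouse, name, extract_transform(name, rows))
    return warehouse
```

Replacing a project's slice wholesale, rather than patching individual rows, fits a read-only warehouse that is refreshed periodically and keeps no history.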

ProtDB vs ProtWare

ProtDB: transactional system
- Data input, real-time access to data
- Data updates, annotation, validation
- Error and consistency checks
- Stores experimental data
- Stores all steps of data annotation and validation (keeps history)
- In-depth queries on a given proteome

ProtWare: data warehouse
- Read-only, asynchronous updates from ProtDB
- Consolidated data and final results of annotation and validation (no history)
- No experimental data
- Queries oriented to proteome comparisons, statistics, data mining
- Decision support system

The needle in a haystack
- Of course we are looking for the Holy Grail!
- Find the interesting proteins in all our data that:
  - can be used for diagnostics,
  - can explain a disease,
  - can be used to cure a disease.

KDD and Data Mining
- Knowledge Discovery in Databases (KDD) is "the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data".
- Data mining is the discovery stage of KDD.
- Data mining tools provide additional possibilities to explore a database.

Data Mining tools
- ProtWare: the data warehouse model is protein-query oriented.
- R package: statistics and clustering tools
- Oracle 10g: new data mining functions
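
As a small illustration of the kind of input a clustering tool works from, the sketch below computes a pairwise similarity between proteomes from their sets of identified proteins. It is written in Python rather than R, and the choice of the Jaccard index is an assumption made for this example; the talk does not say which statistics or distance measures were used.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(
    proteomes: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], float]:
    """Similarity for every unordered pair of proteomes, keyed by sorted names."""
    return {
        (x, y): jaccard(proteomes[x], proteomes[y])
        for x, y in combinations(sorted(proteomes), 2)
    }
```

The resulting dictionary can be turned into a distance matrix (one minus the similarity) and handed to any hierarchical clustering routine.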

Database infrastructure
- Data input files use XML.
- RDBMS: Oracle 9i, moving to Oracle 10g on Linux.
- ProtWare uses ANSI SQL and is portable to other ANSI SQL compliant systems (PostgreSQL).
- Web interface built using standard technologies:
  - Perl, CGI, DBI, HTML, JavaScript, SVG.
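
Since the input files are XML, a loader has to flatten them into rows before inserting into the database. The sketch below shows that step with Python's standard xml.etree.ElementTree module. The element and attribute names (`identifications`, `protein`, `peptide`, `accession`, `sequence`, `score`) are invented for the example; the real GeneProt input format is not described in the talk.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical input document; the real file format is not shown in the slides.
SAMPLE = """<identifications project="demo">
  <protein accession="PROT_A">
    <peptide sequence="AAK" score="12.5"/>
    <peptide sequence="GGR" score="30.0"/>
  </protein>
  <protein accession="PROT_B">
    <peptide sequence="LLK" score="8.0"/>
  </protein>
</identifications>"""


def parse_identifications(xml_text: str) -> List[Tuple[str, str, float]]:
    """Flatten the XML into (accession, peptide sequence, score) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for protein in root.iter("protein"):
        accession = protein.get("accession")
        for peptide in protein.iter("peptide"):
            rows.append(
                (accession, peptide.get("sequence"), float(peptide.get("score")))
            )
    return rows
```

The returned tuples are in a shape that a DB-API `executemany` call can insert directly.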
