The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective Ugis Sarkans European Bioinformatics Institute.

1 The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective Ugis Sarkans European Bioinformatics Institute

2 Outline
- Microarray data and standards overview
- ArrayExpress overall principles
- ArrayExpress architecture
- AE repository
- AE data warehouse
- Future plans and conclusions

3 Gene expression data and annotation
Figure: a gene expression matrix (genes × samples). Labels: Gene expression levels – problem 2; Sample annotations; problem 1; Gene annotations.
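The matrix-plus-annotations structure on this slide can be sketched in a few lines. This is only an illustration, not ArrayExpress code: the class name `ExpressionMatrix`, its methods, and all identifiers and values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionMatrix:
    """Genes x samples matrix with annotations on both axes (illustrative only)."""
    gene_annotations: dict      # gene id -> {annotation key: value}
    sample_annotations: dict    # sample id -> {annotation key: value}
    levels: dict = field(default_factory=dict)   # (gene id, sample id) -> level

    def set_level(self, gene, sample, value):
        # Reject values for genes or samples that were never annotated.
        if gene not in self.gene_annotations or sample not in self.sample_annotations:
            raise KeyError((gene, sample))
        self.levels[(gene, sample)] = value

    def samples_where(self, key, value):
        """Select sample columns by one annotation value."""
        return [s for s, ann in self.sample_annotations.items() if ann.get(key) == value]

    def row(self, gene, samples):
        """Levels of one gene across the chosen samples; None where missing."""
        return [self.levels.get((gene, s)) for s in samples]


# Hypothetical data: two genes, three samples.
matrix = ExpressionMatrix(
    gene_annotations={"g1": {"name": "gene one"}, "g2": {"name": "gene two"}},
    sample_annotations={
        "s1": {"tissue": "liver"},
        "s2": {"tissue": "brain"},
        "s3": {"tissue": "liver"},
    },
)
matrix.set_level("g1", "s1", 2.5)
matrix.set_level("g1", "s3", 1.5)
liver = matrix.samples_where("tissue", "liver")
print(liver, matrix.row("g1", liver))
```

Keeping the annotations next to the values is what makes a query such as "this gene in all liver samples" possible at all.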

4 Platform comparison (Tan et al., PNAS, 2003)
"Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression" (Margareth Cam, NIH)

5 Figure: microarray experiment workflow. For each sample: Sample, RNA extract, labelled nucleic acid, hybridisation to an array (Microarray). Other labels: Array design; Protocol; Experiment; normalization; integration; Gene expression data matrix; genes.

6 Different processing levels of MA data
Figure labels: Array scans; Spots; Quantitations; Genes; Samples; panels A, B, C, D.

7 MGED standards
- MIAME – minimum information about a microarray experiment
- MAGE-OM and MAGE-ML – microarray gene expression object model and markup language
- MO – microarray ontology
- Data normalisation and transformations (and quality control)

8 UML Packages of MAGE
Packages: BioEvent, Experiment, ArrayDesign, BioMaterial, BioAssayData, BioAssay, DesignElement, HigherLevelAnalysis, BioSequence, Array, QuantitationType, Description, Protocol, Measurement, AuditAndSecurity, BQS
Groups shown in the diagram: what was used; what was done; results; miscellaneous

9 MAGE – an example diagram

10 ArrayExpress aims
- An archive for microarray data supporting scientific publications
- Providing easy access to public gene expression and other microarray data in a structured format
- Facilitating the sharing of microarray designs and protocols
- Facilitating the establishment of infrastructure for microarray data sharing

11 AE users
- Experimentalists
- Single-gene biologists
- Bioinformaticians – genome-wide studies
- Bioinformaticians – algorithm developers
- Software developers

12 ArrayExpress infrastructure
Figure labels: ArrayExpress repository; Warehouse (BioMart); Submission tracking/curation tool; MIAMExpress; External MIAMExpress installations (Camb. U., EMBL); Other Microarray Databases (SMD, TIGR, Utrecht, RZPD); Array Manufacturers (Affymetrix, Agilent); Data Analysis Software (R/Bioconductor, J-Express, Resolver); EBI Expression Profiler; External Databases (EMBL, UniProt, Ensembl); www.
Arrow labels: Submissions; MAGE-ML; Queries, analysis; Data analysis.

13 AE: overall principles
- Adherence to community standards
- Data captured in a granular, formalized manner
- Modern but proven software technologies
- Incremental development

14 AE design considerations
- Separate data archiving from the query-optimized data warehouse
- Generate default implementation, then refine
  – ~2 full-time developers
  – pressure to bring system online quickly
- Use object abstraction layer
  – deal with performance overhead on case-by-case basis

15 Repository architecture overview
Figure labels: MAGE-ML DTD; MAGE-OM; MAGE-ML (doc); MAGE-ML document; MAGE validator; MAGE loader; MAGE unloader; error.log; Castor object/relational mapping; Oracle DB; Java servlets; Velocity; Web page template; Tomcat; Curation environment.
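The validate-then-load step in this architecture can be illustrated with a small sketch. Everything here is an assumption made for illustration: the XML fragment is a simplified stand-in and not a valid MAGE-ML document, the rule checked (every `*_ref` element must point at a defined object) is just one plausible validation, and the in-memory `store` dictionary stands in for the database. The returned error list plays the role of error.log.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment in the spirit of MAGE-ML; NOT a valid MAGE-ML document.
DOC = """
<Doc>
  <BioAssay identifier="ba:1"/>
  <Experiment identifier="exp:1">
    <BioAssay_ref identifier="ba:1"/>
    <BioAssay_ref identifier="ba:2"/>
  </Experiment>
</Doc>
"""


def validate(root):
    """Return one message per reference that points at an undefined object."""
    defined = {
        (el.tag, el.get("identifier"))
        for el in root.iter()
        if not el.tag.endswith("_ref") and el.get("identifier")
    }
    errors = []
    for el in root.iter():
        if el.tag.endswith("_ref"):
            target = (el.tag[: -len("_ref")], el.get("identifier"))
            if target not in defined:
                errors.append(f"unresolved {el.tag}: {el.get('identifier')}")
    return errors


def load(xml_text, store):
    """Validate first; write objects to the store only if no errors were found."""
    root = ET.fromstring(xml_text.strip())
    errors = validate(root)
    if not errors:
        for el in root.iter():
            if not el.tag.endswith("_ref") and el.get("identifier"):
                store[el.get("identifier")] = el.tag
    return errors


store = {}
errors = load(DOC, store)
print(errors)   # the dangling reference to ba:2 is reported
print(store)    # nothing was loaded because validation failed
```

Rejecting the whole document on any error keeps the archive free of half-loaded submissions; whether the real loader behaves this way is not stated on the slide.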

16 AE schema
- Why auto-generated?
  – AE must be able to import any valid MAGE-ML and not lose information
  – good for navigating through data in terms of object model
  – if some queries don't work well, add something to the schema: Experiment-Biomaterial, Experiment-Protocol links
  – so far works for 400 GB of data
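A rough sketch of the idea on this slide: derive one table per object-model class mechanically, then add a shortcut link table by hand where a query needs it. The two classes are heavily reduced, hypothetical stand-ins for object-model classes, and SQLite is used only so the sketch runs anywhere; the slides name Oracle and Castor for the real system.

```python
import sqlite3
from dataclasses import dataclass, fields

SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


@dataclass
class Experiment:      # hypothetical, reduced stand-in for a model class
    identifier: str
    name: str


@dataclass
class BioMaterial:     # hypothetical, reduced stand-in for a model class
    identifier: str
    name: str


def table_ddl(cls):
    """Generate a CREATE TABLE statement from the class attributes."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls))
    return f"CREATE TABLE {cls.__name__} ({cols}, PRIMARY KEY (identifier))"


def link_ddl(a, b):
    """Hand-added shortcut table for a frequently queried association."""
    return (
        f"CREATE TABLE {a.__name__}_{b.__name__} ("
        f"{a.__name__}_id TEXT REFERENCES {a.__name__}(identifier), "
        f"{b.__name__}_id TEXT REFERENCES {b.__name__}(identifier))"
    )


conn = sqlite3.connect(":memory:")
for ddl in (table_ddl(Experiment), table_ddl(BioMaterial), link_ddl(Experiment, BioMaterial)):
    conn.execute(ddl)

tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
]
print(tables)
```

Because the generator reads the class definitions, a new attribute in the model shows up in the schema without hand-written DDL, which is the "not lose information" argument in miniature.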

17 Auto-generated web pages

18 To ontologize or not to ontologize
At the beginning: At the end:

19 To ontologize or not to ontologize
At the beginning: At the end:

20 Model vs. ontology
- Model – stable; ontologies – flexible
- Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard
- Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model
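One way to read this split, as a minimal sketch: the class and its associations are fixed in code, while descriptive attributes are stored as category/value pairs checked against an ontology that can grow independently. The `ONTOLOGY` dictionary, the reduced `BioMaterial` class, and all terms below are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OntologyEntry:
    category: str   # which attribute this is, e.g. "Organism"
    value: str      # the term chosen for it


# Stand-in for an ontology: it can gain categories and terms without any class change.
ONTOLOGY = {
    "Organism": {"Homo sapiens", "Mus musculus"},
    "Sex": {"male", "female"},
}


@dataclass
class BioMaterial:
    identifier: str
    characteristics: list = field(default_factory=list)  # ontology-typed attributes
    derived_from: list = field(default_factory=list)     # association fixed by the model

    def annotate(self, category, value):
        # Only terms known to the ontology are accepted.
        if value not in ONTOLOGY.get(category, set()):
            raise ValueError(f"{category}={value!r} is not an ontology term")
        self.characteristics.append(OntologyEntry(category, value))


sample = BioMaterial("bm:sample1")
extract = BioMaterial("bm:extract1", derived_from=[sample])
sample.annotate("Organism", "Mus musculus")

# A new kind of attribute needs only an ontology update, not a new class field.
ONTOLOGY["DevelopmentalStage"] = {"adult"}
sample.annotate("DevelopmentalStage", "adult")
print(sample.characteristics)
```

Adding `DevelopmentalStage` touched only the ontology data, whereas a new association (say, between `BioMaterial` and some other class) would require changing the class definitions and everything generated from them.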

21 Figure labels: Experiment 1 (type, performer, ….); Hybridization data 1; Experimental factors; Quantitation type definitions; …; >15 000 000 000 data points; NetCDF

22 Data warehouse schema

23 What BioMart gives to AEDW
- Query language abstraction
  – Joins automatically generated
- Schema optimized for performance
- Clear database integration roadmap
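To make "joins automatically generated" concrete, here is a small sketch of a query builder over a star-like schema: the caller names attributes and filters, and the builder works out which tables to join. This is not BioMart's API; the table names, the `DIMENSIONS` map, and the `build_query` function are all hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

# Hypothetical star-like schema: one main table plus dimension tables keyed on it.
MAIN = "experiment"
DIMENSIONS = {"sample": "experiment_id", "factor": "experiment_id"}  # table -> foreign key


def build_query(attributes, filters):
    """attributes: ["table.column", ...]; filters: {"table.column": value}.

    Column names are assumed to come from trusted configuration, not user input.
    """
    tables = {name.split(".")[0] for name in list(attributes) + list(filters)}
    sql = f"SELECT DISTINCT {', '.join(attributes)} FROM {MAIN}"
    # Join every dimension table that an attribute or filter mentions.
    for table in sorted(tables - {MAIN}):
        sql += f" JOIN {table} ON {table}.{DIMENSIONS[table]} = {MAIN}.id"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return sql, list(filters.values())


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sample (experiment_id INTEGER, organism TEXT);
CREATE TABLE factor (experiment_id INTEGER, name TEXT);
INSERT INTO experiment VALUES (1, 'E1'), (2, 'E2');
INSERT INTO sample VALUES (1, 'mouse'), (2, 'human');
INSERT INTO factor VALUES (1, 'time'), (2, 'dose');
""")

sql, params = build_query(["experiment.name", "factor.name"], {"sample.organism": "mouse"})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The caller never writes a JOIN: asking for `factor.name` and filtering on `sample.organism` is enough for both dimension tables to be pulled in.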

24 ArrayExpress environment

25 Future plans
- Data management environment automation
- Flexible data warehouse interface
- Programmatic interface (HTTP/XML based)
- Distributed infrastructure??

26 Distributed data infrastructure
Figure labels: Users; Query broker; ArrayExpress; A local database (×3). Arrow labels: query; find resource; deliver data.
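The query / find resource / deliver data flow in this diagram could look roughly like the sketch below. The slides present the distributed setup as a future possibility, so nothing here reflects an existing interface: `LocalDatabase`, `QueryBroker`, their methods, and the accession strings are all hypothetical.

```python
class LocalDatabase:
    """Stand-in for one participating database (hypothetical interface)."""

    def __init__(self, name, experiments):
        self.name = name
        self._experiments = experiments   # accession -> data

    def has(self, accession):
        return accession in self._experiments

    def fetch(self, accession):
        return self._experiments[accession]


class QueryBroker:
    """Routes a user's query to whichever registered resource holds the data."""

    def __init__(self):
        self._resources = []

    def register(self, db):
        self._resources.append(db)

    def find(self, accession):
        """'find resource': names of all databases that hold the accession."""
        return [db.name for db in self._resources if db.has(accession)]

    def query(self, accession):
        """'deliver data': data from the first resource that holds the accession."""
        for db in self._resources:
            if db.has(accession):
                return db.name, db.fetch(accession)
        return None


broker = QueryBroker()
broker.register(LocalDatabase("ArrayExpress", {"EXP-1": [1.0, 2.0]}))
broker.register(LocalDatabase("local-A", {"EXP-2": [3.0]}))
broker.register(LocalDatabase("local-B", {"EXP-2": [3.0], "EXP-3": [4.0]}))

print(broker.find("EXP-2"))
print(broker.query("EXP-2"))
```

Users talk only to the broker, so databases can join or leave without any change on the client side.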

27 Conclusions
- Conceptual object modeling works well for complex life sciences domains
- Many software infrastructure components can be auto-generated from object models
- A range of approaches can be used for modeling, e.g., UML framework + ontologies
- Repository and data warehouse – different aims and different implementation principles

28 Acknowledgements
- Gonzalo Garcia Lara – web interface
- Ahmet Oezcimen – DBA
- Anjan Sharma – curation tool
- Sergio Contrino, Richard Coulson – data warehouse
- Niran Abeygunawardena – webmaster
- Mohammadreza Shojatalab – MIAMExpress
- Misha Kapushesky – Expression Profiler
- Curation team:
  – Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
- Domain-specific projects:
  – Susanna Sansone, Philippe Rocca-Serra
- Alvis Brazma
- MGED collaborators
  – Stanford, TIGR, Affymetrix, EMBL, ….
- BioMart team

