VectorBase
Gene expression data in VectorBase
Fotis Kafatos, George Christophides, Bob MacCallum & Seth Redmond
Imperial College London
(thanks also to EBI, Sanger and ND)

Outline
1. Project goals
2. What’s currently available
3. Current challenges and future plans

Project goals
For vector biologists:
  – Easy access to gene expression data
  – Consistent data processing
For array specialists:
  – ArrayExpress submission
  – Advanced analysis tools
  – Array annotation

[Diagram: BULK LOADER; EXPRESSION DATA; STORAGE & ANALYSIS]
BASE: BioArray Software Environment
Open source, active development and user community
LIMS, data storage, export and analysis
Web-based, user/group access control
BASE 2.x adoption will bring Affymetrix support

Data submission
Community submission guidelines available
First batch of experiments loaded by us
Bulk data loader
Sample/experiment annotation requires intervention from curators

ArrayExpress: ‘public’ storage
Data held in BASE is largely MIAME compliant
Script for semi-automated export in TAB2MAGE format
One experiment submitted so far

Data summaries
BASE web interface offers a powerful and extensible analysis environment
Can be used for multi-site collaborations on pre-publication data
Steep learning curve/not 100% intuitive
Not easily linked to
We provide simpler views so the casual user can quickly draw biological inferences

Standardised data
All displayed data is processed in the same way:
1. Poor quality spots removed
  – Currently using submitted spot flags
2. Normalisation
  – “lowess” for two-colour experiments

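The two processing steps above can be sketched in a few lines. This is a minimal illustration, not the VectorBase pipeline: it assumes the common intensity-dependent form of lowess for two-colour arrays (log ratio M fitted against mean log intensity A), a single pass with no robustness iterations, and hypothetical field names (`id`, `red`, `green`, `flagged`) and window fraction `frac`.

```python
import math

def tricube(u):
    """Tricube kernel: weight for a scaled distance u (zero at or beyond 1)."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_fit(xs, ys, x0, frac):
    """Weighted straight-line fit of ys on xs, evaluated at x0."""
    k = min(len(xs), max(2, math.ceil(frac * len(xs))))   # points in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12   # window half-width
    w = [tricube(abs(x - x0) / h) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def normalise(spots, frac=0.5):
    """Drop flagged spots, then subtract the local trend of M on A.

    Each spot is a dict with hypothetical keys: id, red, green, flagged.
    Returns {spot id: normalised log2 ratio}.
    """
    # Step 1: remove poor quality spots using the submitted flags.
    good = [s for s in spots
            if not s["flagged"] and s["red"] > 0 and s["green"] > 0]
    # Step 2: lowess-style normalisation on the M-A scale.
    a = [0.5 * (math.log2(s["red"]) + math.log2(s["green"])) for s in good]
    m = [math.log2(s["red"]) - math.log2(s["green"]) for s in good]
    return {s["id"]: mi - local_fit(a, m, ai, frac)
            for s, ai, mi in zip(good, a, m)}
```

With a constant dye bias (every red signal twice the green one), the fitted trend equals the bias and the normalised ratios come out at zero, which is the behaviour the normalisation step is meant to provide.
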
Probe mapping
3 probe types, 6 array designs
Mapping handled via Ensembl pipeline:
  – Oligo: exonerate
  – PCR: e-PCR
  – cDNA: exonerate2genes

[Diagram: VectorBase; GENOMIC DATA; AUTOMATIC ANNOTATION; GENOME BROWSER; BULK LOADER; EXPRESSION DATA; STORAGE & ANALYSIS; ArrayExpress; ‘PUBLIC’ STORAGE; DATA SUMMARIES; PROBE MAPPING; GFF3]

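The diagram labels suggest that probe mappings reach the genome browser as GFF3. As a rough illustration of what one such record looks like, the sketch below formats a single probe alignment as a nine-column GFF3 line. The feature type `oligo`, the probe identifiers and the helper names are assumptions made for the example; the slides do not say which values the actual export uses.

```python
# Characters with reserved meaning in GFF3 column 9, plus whitespace controls.
_ESCAPES = {"%": "%25", ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C",
            "\t": "%09", "\n": "%0A", "\r": "%0D"}

def escape(value):
    """Percent-encode reserved characters in an attribute value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in str(value))

def gff3_line(seqid, source, ftype, start, end, strand, attributes, score=None):
    """Build one tab-separated GFF3 feature line (hypothetical helper)."""
    if start < 1 or end < start:
        raise ValueError("GFF3 coordinates are 1-based with start <= end")
    if strand not in ("+", "-", ".", "?"):
        raise ValueError("strand must be one of + - . ?")
    attr_text = ";".join(f"{key}={escape(val)}" for key, val in attributes.items())
    columns = [
        seqid, source, ftype, str(start), str(end),
        "." if score is None else str(score),   # column 6: score
        strand,
        ".",                                    # column 8: phase (CDS only)
        attr_text,
    ]
    return "\t".join(columns)

if __name__ == "__main__":
    print("##gff-version 3")
    print(gff3_line("2L", "exonerate", "oligo", 100, 159, "+",
                    {"ID": "probe_0001", "Name": "demo probe"}))
```
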
[Figure: contigview]
[Figure: featureview]

[Diagram: VECTOR BIOLOGISTS; ARRAY BIOLOGISTS; GENOME BIOLOGISTS; BULK LOADER; EXPRESSION DATA; STORAGE & ANALYSIS; ArrayExpress; ‘PUBLIC’ STORAGE; VectorBase; GENOMIC DATA; AUTOMATIC ANNOTATION; GENOME BROWSER; DATA SUMMARIES; PROBE MAPPING; DATA MINING]

BioMart
Beta version currently available
Improvements still needed:
  – Experiment annotations
  – Alignments (i.e. handle split alignments)
Federation with current marts
Integration with new data?

Current challenges and future plans
How do you want to query?
CVs & ontologies
APIs
Community submission
Manual annotation

Querying strategy
What do you want to query on?
  – Fetch all genes upregulated under condition X
  – Fetch all experiments with gene X and condition Y
  – Fetch all probes with expression similar to probe X
All essentially boil down to:
  – Define probe (genes etc.)
  – Define significant expression
      ANOVA?
      Up/down-regulation with respect to what?
  – Define experimental conditions
      Sample annotation
      Experimental design

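To make the three ingredients concrete, here is a small sketch of the first two example queries over an in-memory list of records. Everything in it is illustrative: the `Measurement` fields, the function names and the sample values are hypothetical, and a fixed log2-ratio cut-off stands in for "significant expression", which the slide deliberately leaves open (ANOVA or otherwise).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Measurement:
    experiment: str          # experiment identifier
    condition: str           # experimental condition (sample annotation)
    probe: str               # probe / reporter identifier
    gene: Optional[str]      # None when the probe could not be mapped
    log2_ratio: float        # normalised expression value

def genes_upregulated(measurements, condition, min_log2_ratio=1.0):
    """Genes with at least one mapped probe above the cut-off under a condition."""
    return sorted({m.gene for m in measurements
                   if m.gene is not None
                   and m.condition == condition
                   and m.log2_ratio >= min_log2_ratio})

def experiments_with(measurements, gene, condition):
    """Experiments that measured the given gene under the given condition."""
    return sorted({m.experiment for m in measurements
                   if m.gene == gene and m.condition == condition})

# Made-up records, for demonstration only.
DATA = [
    Measurement("exp1", "blood-fed", "p1", "geneA", 1.8),
    Measurement("exp1", "blood-fed", "p2", "geneB", 0.2),
    Measurement("exp1", "blood-fed", "p3", None, 2.5),
    Measurement("exp2", "sugar-fed", "p1", "geneA", -0.4),
    Measurement("exp3", "blood-fed", "p4", "geneC", 1.0),
]

if __name__ == "__main__":
    print(genes_upregulated(DATA, "blood-fed"))
    print(experiments_with(DATA, "geneA", "blood-fed"))
```

Note that the unmapped probe `p3` is silently excluded from the gene-level answer even though its ratio is the highest, which is one reason the choice between probe-level and gene-level queries matters.
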
[Diagram: VECTOR BIOLOGISTS; ARRAY BIOLOGISTS; GENOME BIOLOGISTS; BULK LOADER; EXPRESSION DATA; STORAGE & ANALYSIS; CV / ONTOLOGY; ArrayExpress; ‘PUBLIC’ STORAGE; GENOMIC DATA; AUTOMATIC ANNOTATION; GENOME BROWSER; DATA SUMMARIES; PROBE MAPPING; DATA MINING]

[Diagram: STORAGE & ANALYSIS; ‘PUBLIC’ STORAGE; GENOME BROWSER; DATA SUMMARIES; DATA MINING; BULK LOADER; EXPRESSION DATA; GENOMIC DATA; AUTOMATIC ANNOTATION; CV / ONTOLOGY; ArrayExpress; Array API ?; AE API ?; e! API; MartJ / MQL; PROBE MAPPING]

Array API
Perl / Java objects for retrieval / handling of array data
  – Dual purpose:
      Consistency & efficiency of VB expression website
      Computational access to VB data for all
  – Objects must be:
      General, DB-independent
      Compatible with pre-existing Bio API (BioPerl / BioJava)
  – NB: there may be a pre-existing solution:
      ArrayExpress API?
      BioPerl-Expression?
      MAGE-OM-stk

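One way to read the "general, DB-independent" requirement is an abstract interface that both the website and external scripts program against, with storage back-ends plugged in behind it. The sketch below shows that shape only. It is written in Python for consistency with the other examples here, whereas the slide proposes Perl / Java objects, and every class and method name is hypothetical rather than part of any existing VectorBase, BASE or ArrayExpress API.

```python
from abc import ABC, abstractmethod
from statistics import mean

class ExpressionSource(ABC):
    """Storage-independent view of array data (hypothetical interface)."""

    @abstractmethod
    def reporters(self, array_design):
        """Return the reporter IDs present on an array design."""

    @abstractmethod
    def values(self, reporter_id):
        """Return {experiment: log2 ratio} for one reporter."""

class InMemorySource(ExpressionSource):
    """Toy back-end; a database-backed class would expose the same methods."""

    def __init__(self, designs, data):
        self._designs = designs   # {array design: [reporter IDs]}
        self._data = data         # {reporter ID: {experiment: log2 ratio}}

    def reporters(self, array_design):
        return list(self._designs.get(array_design, []))

    def values(self, reporter_id):
        return dict(self._data.get(reporter_id, {}))

def mean_log2_ratio(source, reporter_id):
    """Client code depends only on the interface, not on the storage layer."""
    vals = list(source.values(reporter_id).values())
    return mean(vals) if vals else None
```

Because `mean_log2_ratio` only calls interface methods, the same function would serve the expression website and a user's own script, which is the dual purpose the slide describes.
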
Community data submission
Carrot?
  – Help with ArrayExpress submission
  – Analysis tools
  – Dissemination
Stick?
  – Outreach (courses, conferences)
  – Networking

GE data manual annotators
Gene-build designed arrays
  – Negative evidence less compelling
EST clone-based arrays
  –

Longer term plans
Host-parasite GE data integration & analysis
GE-clusters “upstream” regions regulatory elements, upstream TFs
RNAi phenotypes
Images

CVs & ontologies
Integrate MGED and specialist ontologies for
  – Body parts
  – Developmental stages
  – Disease processes
  – …
Allows comparison across experiments with similar experimental conditions

BioMart
Most biomarts:
  Gene-based
  Mostly ‘binary’ data
    – e.g. a gene either has a signal domain or doesn’t
  Easily linked with other (gene-based) biomarts
VB Biomart:
  Probe-based
    – Many probes not aligned
  Expression data less clear-cut
    – e.g. define ‘differential expression’
  Exports gene/transcript IDs for linking to other Marts

Clustering
A priority?
Easy to do on reporter level within experiments
Harder to do at gene level across all experiments
  – Binary gene profile: “yes/no differentially expressed in experiment”?
Amazon-style links to “genes which may have similar expression profiles”?

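The binary-profile idea can be tried out with very little code. In the sketch below each gene is represented by the set of experiments in which it was called differentially expressed, and candidate "similar" genes are ranked by Jaccard similarity between those sets. The Jaccard measure is one possible choice assumed for the example, not something the slides specify, and the gene and experiment names are made up.

```python
def jaccard(a, b):
    """Shared experiments divided by all experiments in either profile."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_genes(profiles, gene, top_n=5):
    """Rank other genes by similarity of their binary profiles.

    profiles maps gene ID -> set of experiments where the gene was
    differentially expressed. Genes sharing no experiment are omitted.
    """
    target = profiles[gene]
    scored = [(other, jaccard(target, experiments))
              for other, experiments in profiles.items() if other != gene]
    scored = [(name, score) for name, score in scored if score > 0]
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Made-up profiles, for demonstration only.
PROFILES = {
    "geneA": {"exp1", "exp2", "exp3"},
    "geneB": {"exp1", "exp2"},
    "geneC": {"exp3"},
    "geneD": {"exp4"},
}

if __name__ == "__main__":
    for name, score in similar_genes(PROFILES, "geneA"):
        print(f"{name}\t{score:.2f}")
```

A ranking of this kind would be enough to drive "genes which may have similar expression profiles" links without committing to a full clustering across all experiments.
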
BASE 2.x
Adoption delayed, now in progress
Brings Affymetrix support
Cleaner/modern interface
Better API (Java)