ArrayExpress and Expression Atlas: Mining Functional Genomics data Dr Sarah Morgan Training team

Slides:



Advertisements
Similar presentations
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Advertisements

Gramene Meeting, PAG 2015 Expression Atlas - a New Resource for Baseline and Differential Gene Expression for Plants Robert Petryszak Gene Expression Team.
Oncomine Database Lauren Smalls-Mantey Georgia Institute of Technology June 19, 2006 Note: This presentation contains animation.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
EBI Proteomics Services Team – Standards, Data, and Tools for Proteomics Henning Hermjakob European Bioinformatics Institute SME forum 2009 Vienna.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Time line and procedures for datasets BCBC Pre-retreat Workshop Tyson’s Corner, VA May 11, 2011.
Gene Ontology John Pinney
Welcome to mini-symposium on ontologies for biological sample description EMBL-EBI Wellcome Trust Genome Campus Deceber 5, 2001.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
NCBI resources III: GEO and expression data analysis Yanbin Yin Fall
Using ArrayExpress. ArrayExpress is an international public repository for well-annotated microarray data, including gene expression, comparative genomic.
GCB/CIS 535 Microarray Topics John Tobias November 15 th, 2004.
Midterm project Course: Statistics in Bioinformatics Date: 指導教授 : 陳光琦 學生 : 吳昱賢.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Introduction The goal of translational bioinformatics is to enable the transformation of increasingly voluminous genomic and biological data into diagnostics.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Gene expression services: ArrayExpress and the Gene Expression Atlas Contact: Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Viewing & Getting GO COST Functional Modeling Workshop April, Helsinki.
Support for MAGE-TAB in caArray 2.0 Overview and feedback MAGE-TAB Workshop January 24, 2008.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Emma Hastings Functional Genomics Team EBI-EMBL
Gene Set Enrichment Analysis (GSEA)
Gene Expression Omnibus (GEO)
The MGED Society Facilitating Data Sharing and Integration with Standards CTSA Omics Data Standards Working Group Chris Stoeckert Dept. of Genetics and.
EBI is an Outstation of the European Molecular Biology Laboratory. EBI Bioinformatics Roadshow ILRI/BecA Nairobi Campus 2 nd - 3 rd March 2011.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Copyright OpenHelix. No use or reproduction without express written consent1.
ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
PLEXdb Plant Expression database Ethalinda Cannon Iowa State University January 15th, 2007.
1 Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
The European Bioinformatics Institute MGED ontology for consistent annotation of microarray experiments Manchester Bioinformatics Week Ontologies Workshop1.
Making Sense of Public Domain Expression Data- GeneVestigator
1 MIAME The MIAME website: © 2002 Norman Morrison for Manchester Bioinformatics.
Managing Data Modeling GO Workshop 3-6 August 2010.
Copyright OpenHelix. No use or reproduction without express written consent1.
Introduction to caArray caBIG ® Molecular Analysis Tools Knowledge Center April 3, 2011.
UBio Training Courses Micro-RNA web tools Gonzalo
MIAMExpress and the development of annotation ontologies for gene expression experiments Ele Holloway Microarray Informatics European Bioinformatics Institute.
The Functional Genomics Experiment Object Model (FuGE) Andrew Jones, School of Computer Science, University of Manchester MGED Society.
The Stanley Neuropathology Consortium Integrative Database: A novel web-based tool for exploring neuropathological traits, gene expression and associated.
Protein Data Bank: An Introduction Learning to Use the RCSB PDB Portal.
Analysis of GEO datasets using GEO2R Parthav Jailwala CCR Collaborative Bioinformatics Resource CCR/NCI/NIH.
Gene Expression Omnibus (GEO)
1 Outline Standardization - necessary components –what information should be exchanged –how the information should be exchanged –common terms (ontologies)
Master headline RDFizing the EBI Gene Expression Atlas James Malone, Electra Tapanari
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
Mining Functional Genomics Data ArrayExpress and Gene Expression Atlas: Amy Tang, PhD ArrayExpress Production Team Functional Genomics.
Applied Bioinformatics Week 9 Jens Allmer. Theory I Gene Expression Microarray.
Introduction and Applications of Microarray Databases Chen-hsiung Chan Department of Computer Science and Information Engineering National Taiwan University.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Describing Bioinformatic Metadata at EBI James Malone
First of all: “Darnit Jim, I’m a doctor not a bioinformatician!”
Copyright OpenHelix. No use or reproduction without express written consent1.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
Welcome to the GrameneMart Tutorial A tool for batch data sequence retrieval 1.Select a Gramene dataset to search against. 2.Add filters to the dataset.
Bioinformatics Shared Resource Introduction to Gene Expression Omnibus (GEO) bsrweb.sanfordburnham.org
ArrayExpress Ugis Sarkans EMBL - EBI
Pathway Informatics 30 th March, 2016 Ansuman Chattopadhyay, PhD Head, Molecular Biology Information Services Health Sciences Library System University.
GEO (Gene Expression Omnibus) Deepak Sambhara Georgia Institute of Technology 21 June, 2006.
ArrayExpress and Gene Expression Atlas:
Using ArrayExpress.
ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL
How to store and visualize RNA-seq data
Gene Expression Omnibus (GEO)
Welcome - webinar instructions
Presentation transcript:

ArrayExpress and Expression Atlas: Mining Functional Genomics data Dr Sarah Morgan Training team

What is functional genomics (FG)? The aim of FG is to understand the function of genes and other parts of the genome FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome 2ArrayExpress

What biological questions is FG addressing? When and where are genes expressed? How do gene expression levels differ in various cell types and states? What are the functional roles of different genes and in what cellular processes do they participate? How are genes regulated? How do genes and gene products interact? How is gene expression changed in various diseases or following a treatment? 3ArrayExpress

Components of a FG experiment ArrayExpress4

ArrayExpress  Is a public repository for FG data, which provides easy access to well annotated data in a structured and standardized format  Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ  Facilitates the sharing of experimental information associated with the data such as microarray designs, experimental protocols,……  Based on community standards: MIAME guidelines & MAGE-TAB format for microarray, MINSEQE guidelines for HTS data ( ArrayExpress5

Community standards for data requirement  MIAME = Minimal Information About a Microarray Experiment  MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment  The checklist: ArrayExpress6 RequirementsMIAMEMINSEQE 1. Experiment design / background description 2. Sample annotation and experimental factor 3. Array design annotation (e.g. probe sequence) 4. All protocols (wet-lab bench and data processing) 5. Raw data files (from scanner or sequencing machine) 6. Processed data files (normalised and/or transformed)

What is an experimental factor?  The main variable(s) studied, often related to the hypothesis of the experiment and is the independent variable.  Values of the factor (“factor values”) should vary. ArrayExpress7 Experimental designFactor Factor ValuesNot factor  beef vs horse meat Diet beef, horse meat Organism (human) smoker vs non-smoker compound cigarette smoke (tobacco), no tobacco Organism (human), sex (male) face cream A vs control X compound Active ingredient A, “sham” control Cell type A X

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or sequencing experiments IDFInvestigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols. SDRFSample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter. ADF Array Design Format file that describes probes on an array, e.g. sequence, genomic mapping location (for array data ony) Data filesRaw and processed data files. The ‘raw’ data files are the trace data files (.srf or.sff). Fastq format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.European Nucleotide Archive Standards for microarray & sequencing MAGE-TAB format ArrayExpress8

The two databases: how are they related? ArrayExpress Expression Atlas Direct submission Import from external databases (mainly NCBI Gene Expr. Omnibus) Curation Statistical analysis Links to analysis software, e.g. Links to other databases, e.g. ArrayExpress9

What is the difference between them? ArrayExpress10 ArrayExpress Archive Central object: experiment Query to retrieve experimental information and associated data Expression Atlas Central object: gene/condition Query for gene expression changes across experiments and across platforms

The two databases: how are they related? ArrayExpress Expression Atlas Direct submissio n Import from external databases (mainly NCBI Gene Expr. Omnibus) Curation Statistical analysis Links to analysis software, e.g. Links to other databases, e.g. ArrayExpress11

ArrayExpress Archive – when to use it? Find FG experiments that might be relevant to your research Download data and re-analyze it. Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments. Submit microarray or HTS data that you want to publish. Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process. ArrayExpress 12

How much data in AE Archive? (as of late August 2013) ArrayExpress13 (up to Aug)

Breakdown of microarray vs seq. data (as of late August 2013) Microarray vs HTS RNA-, DNA-, ChIP- seq breakdown ArrayExpress14

Browsing ArrayExpress ArrayExpress15

Browsing ArrayExpress experiments ArrayExpress16

File download on the Browse page ArrayExpress17 Direct download link (e.g. here it’s for a single raw data archive [i.e. *.zip] file) This is specifically for HTS experiments. Direct link to European Nucleotide Archive (ENA)’s page which lists all the sequencing assays (which are called “runs” at the ENA). A link to a page which lists all the archive files available for download. (No direct link because there are >1 archives)

ArrayExpress18 ArrayExpress single-experiment view Sample characteristics, factors and factor values MIAME or MINSEQE scores ( * = compliant) All files related to this experiment ( e.g. IDF, SDRF, array design, raw data, R object ) Send data to GenomeSpace and analyse it yourself The microarray design used

ArrayExpress19 Samples view – microarray experiment Scroll left and right to see all sample characteristics and factor values Sample characteristics Factor values Direct link to data files for one sample

ArrayExpress20 Samples view – RNA-seq experiment Direct link to fastq files at European Nucleotide Archive (ENA) Direct link to European Nucleotide Archive (ENA) record about this sequencing assay

ArrayExpress21 Files and links available for each experiment Links to: original NCBI GEO record (GSExxxxx) Sequence Read Archive study (SRPxxxxx)

Searching for experiments in ArrayExpress ArrayExpress22

Experimental factor ontology (EFO)  Ontology: a way to systematically organise experimental factor terms. controlled vocabulary + hierarchy (relationship)  Used in EBI databases: and external projects (e.g. NHGRI GWAS Catalogue)  Combine terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms) NCBI Taxonomy ArrayExpress23

ArrayExpress24 Building EFO - an example sarcoma cancer neoplasm disease Kaposi’s sarcoma Take all experimental factors sarcoma cancer neoplasm Kaposi’s sarcoma disease is the parent term is a type of disease is synonym of neoplasm is a type of cancer is a type of sarcoma Find the logical connection between them disease neoplasm cancer sarcoma Kaposi’s sarcoma [-] Organize them in an ontology

ArrayExpress25 Exploring EFO - an example

Experimental factor ontology (EFO) EFO developed to:  increase the richness of annotations in databases  expand on search terms when querying ArrayExpress and Expression Atlas using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) using child terms (e.g. “bone”  “rib” and “vertebra”)  promote consistency (e.g. F/female/, 1day/24hours)  facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically) ArrayExpress26

Example search: “leukemia” ArrayExpress27 Exact match to search term Matched EFO synonyms to search term Matched EFO child term of search term

Hands-on exercise 1 Find RNA-seq assays studying human prostate adenocarcinoma Hands-on exercise 2 Find experiments studying the effect of sodium dodecyl sulphate on human skin ArrayExpress 28

The two databases: how are they related? ArrayExpress Expression Atlas Direct submissio n Import from external databases (mainly NCBI Gene Expr. Omnibus) Curation Statistical analysis Links to analysis software, e.g. Links to other databases, e.g. ArrayExpress29

Expression Atlas – when to use it? Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas; Discover which genes are differentially expressed in a particular biological condition that you are interested in. ArrayExpress 30

Expression Atlas construction Analysis pipeline ArrayExpress31 genes Cond.1Cond.2Cond.3 Linear model* (Bio/C Limma ) Cond.1 Cond.2 Cond.3 Input data (Affy CEL, non-Affy processed) 1= differentially expressed 0 = not differentially expressed A dummy example: Output: 2-D matrix * More information about the statistical methodology:

ArrayExpress32 genes Cond.1Cond.2Cond.3 Exp.1 genes Cond.4Cond.5Cond.6 Exp. 2 genes Cond.XCond.YCond.Z Exp. n Statistical test Statistical test Statistical test Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition Expression Atlas construction

ArrayExpress33 Expression Atlas construction

Expression Atlas home page ArrayExpress34 Query for genes Query for conditions The ‘advanced query’ option allows building more complex queries Restrict query by direction of differential expression (up, down, both, neither)

Mapping microarray probes to genes Ensembl genes Probe identifiers Expression data per probe Every (~monthly) Atlas release takes the latest Ensembl gene – probe identifier mapping data. From Ensembl genes, we also get: Compara genes External references (xrefs) to other databases E.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, gene ontology terms, InterPro terms

ArrayExpress36 Expression Atlas home page The ‘Genes’ and ‘Conditions’ search boxes Genes Conditions

ArrayExpress37 Expression Atlas - single gene query

ArrayExpress38 Expression Atlas - single gene query 2 Affymetrix probe sets mapping to Saa4 gene The array design used (Affymetrix GeneChip Mouse Genome ) Experiment design (similar to ArrayExpress samples view) Analytics = statistics

ArrayExpress39 Atlas ‘condition-only’ query

Atlas ‘condition-only’ query (cont’d) heatmap view ArrayExpress40

ArrayExpress41 Atlas gene + condition query

ArrayExpress42 Atlas query refining (method 1) What if there are no terms in the “REFINE YOUR QUERY” box which fit my biological question?

ArrayExpress43 Atlas query refining (method 2)

ArrayExpress44 Atlas query refining (method 2) AND

ArrayExpress45 Atlas query refining (method 2) AND

Hands-on exercise 3 Find genes in the “androgen receptor signaling pathway” which are (i) expressed in prostate carcinoma and (ii) involved in regulation of transcription from RNA Pol II Hands-on exercise 4 Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome

which genes are expressed in a normal human kidney? “baseline”“differential” which genes are up- regulated in pancreatic islets of pregnant mice?

RNA-Seq data in new Atlas FASTQ files metadata Photo © Google/Connie Zhou Atlas results

New Atlas RNA-Seq pipeline quality control painting by shardcore ( low-quality reads NNACGANNNNN mapping Baseline Differential quantification summarization HTSeq cufflinks DESeq Fonseca, N.A. et al (2013) iRAP – an integrated RNA-seq Analysis Pipeline, Bioinformatics, submitted

which genes are expressed in a normal human kidney? “baseline”“differential” which genes are up- regulated in pancreatic islets of pregnant mice?

www-test.ebi.ac.uk/gxa

which genes are expressed in a normal human kidney? which genes are up- regulated in pancreatic islets of pregnant mice? “baseline”“differential”

Kim et al. (2010) Nature Medicine 16:804—808

Future plans for new Atlas Search across datasets for genes for conditions Gene sets and pathways Proteomics

Programmatic access It is possible to download data via an API for both ArrayExpress and the Atlas For array express: l For expression atlas:

Baseline Expression on page example for human BRCA1 gene:

Differential expression on summary page for human BRCA1 gene

A mock-up of baseline Expression Atlas experiment page, including protein expression data from two external resources, e.g. PRIDE and Peptide Atlas

Data submission to ArrayExpress Archive ArrayExpress 68

Data submission to Arrayexpress ArrayExpress69

Data submission to ArrayExpress ArrayExpress70 MAGE-TAB route recommended for large/complicated experiments. MAGE-TAB template spreadsheet (IDF and SDRF) tailor-made for your experiment if you follow the MAGE-TAB submission tool

Submission of HTS data ArrayExpress71 ArrayExpress acts as a “broker” for submitter. Meta-data and processed data: ArrayExpress Raw sequence reads* (e.g. fastq, bam): ENA *See for accepted read file formathttp://

What happens after submission? ArrayExpress72 confirmation Submission ‘closed’ so no more editing on your end Curation: We will you with any questions May ‘re-open’ submission for you to make changes Can keep data private until publication. Will provide login account details to you and reviewer for private data access Get your submission in the best possible shape to shorten curation and processing time!

Find out more Training co-ordinator for functional genomics: Visit our eLearning portal, Train online, at for courses on ArrayExpress and Atlas Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: us at: Atlas mailing list: Programmatic access is available for both resources. ArrayExpress73