How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk
Talk summary
- How do we archive RNA-seq data in ArrayExpress?
- How do we process RNA-seq data?
- How do we display RNA-seq data in the Expression Atlas?
26/08/2011 HTS data in ArrayExpress and Atlas
Components of a functional genomics experiment
ArrayExpress
www.ebi.ac.uk/arrayexpress/
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated data in a structured and standardized format
- Facilitates the sharing of microarray designs, experimental protocols, …
- Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)
Standards for sequencing: MINSEQE guidelines
MINSEQE: Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed MINSEQE guidelines (still a work in progress) are:
- General information about the experiment
- Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
- Experimental design, including sample data relationships (e.g. which raw data file relates to which sample, …)
- Essential experimental and data processing protocols
- Sequence read data with quality scores, raw intensities and processing parameters for the instrument
- Final processed data for the set of assays in the experiment
Standards for microarray & sequencing: MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
- IDF: the Investigation Description Format file contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: the Sample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- Data files: raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
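The sample-to-file relationships that an SDRF captures can be illustrated with a short Python sketch. The headers below follow the MAGE-TAB naming pattern, but the exact columns (in particular `Raw Data File`), the sample names and the file names are assumptions made up for illustration and are not taken from a real submission.

```python
import csv
import io

# A tiny, made-up SDRF: tab-delimited, one row per sample.
# Column names and values are illustrative assumptions.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tRaw Data File\n"
    "sample_1\tHomo sapiens\tcontrol\tsample_1.fastq\n"
    "sample_2\tHomo sapiens\tdrug_A\tsample_2.fastq\n"
)


def read_sdrf(text):
    """Parse SDRF text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column="Raw Data File"):
    """Group raw data file names by the value of one experimental factor."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[factor_column], []).append(row[file_column])
    return grouped


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[compound]")
print(groups)
```

Keeping the factor values and the file names in the same row is what lets a downstream tool answer "which raw data file relates to which sample" without any extra bookkeeping.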
Types of data that can be submitted
ArrayExpress – two databases
What is the difference between Archive and Atlas?
Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Atlas
- Gene and/or condition queries
- Query across experiments and across platforms
ArrayExpress – two databases
How much data in AE Archive?
Browsing the AE Archive
Browsing the AE Archive
- AE unique experiment ID
- Curated title of experiment
- Number of assays
- Species investigated
- The date when the data were loaded in the Archive
- ‘Loaded in Atlas’ flag
- Raw sequencing data available in ENA
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- The total number of experiments and assays retrieved
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
Browsing the AE Archive
RNA-seq data in AE Archive
HTS data in AE Archive
AE Archive – experiment view
Link to raw data in ENA
RNA-seq processing pipeline
[Diagram. Labels: direct data submissions and GEO import; ArrayExpress Archive; data acquisition; ENA and EGA (short reads, FASTQ files); RNA-seq processing pipeline (inputs: SDRF, FASTQ; outputs: RPKMs, BAMs); summary level data; Expression Atlas; Ensembl]
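For readers unfamiliar with the RPKM values named in the diagram, the sketch below applies the standard definition (reads per kilobase of transcript per million mapped reads). It is a plain Python illustration with made-up numbers, not the code the pipeline uses; the function and argument names are hypothetical.

```python
def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = 1e9 * reads mapped to the gene
           / (gene length in bp * total mapped reads in the sample)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return 1e9 * gene_reads / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2 kb gene in a library of 10 million mapped reads.
example = rpkm(gene_reads=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(example)  # 25.0
```

Dividing by both gene length and library size is what makes values comparable between long and short genes and between samples sequenced to different depths.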
RNA-seq processing pipeline: ArrayExpressHTS
ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets.
The pipeline can be used for analysing:
- private data held on one's own computer
- public data, available through ArrayExpress and ENA
It can be run:
- on a local computer
- remotely on the EBI R Cloud, www.ebi.ac.uk/tools/rcloud
The interface for running the pipeline is in R. With default options it takes a single argument: the name of the project to analyse. If the pipeline finds a local directory with this name, it looks in it for the raw data and the metadata files. If no such directory exists, it searches for a public ArrayExpress experiment with this accession, downloads the data from ENA and the metadata from ArrayExpress, and then analyses them.
Goncalves et al., Bioinformatics 2011
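The "local directory first, public accession otherwise" rule described above can be sketched in a few lines of Python. This only illustrates the decision, under the assumption that a simple directory check is enough; `resolve_project` is a hypothetical helper, not part of ArrayExpressHTS, and no download is attempted.

```python
import os
import tempfile


def resolve_project(name, base_dir="."):
    """Decide where the data for project `name` would come from.

    Returns ("local", path) if a directory called `name` exists under
    `base_dir`, otherwise ("arrayexpress", name), meaning the name would
    be treated as a public ArrayExpress accession.
    """
    path = os.path.join(base_dir, name)
    if os.path.isdir(path):
        return ("local", path)
    return ("arrayexpress", name)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "my_private_project"))
    local_choice = resolve_project("my_private_project", base_dir=tmp)
    public_choice = resolve_project("E-GEOD-16190", base_dir=tmp)

print(local_choice[0], public_choice)
```

One consequence of this rule is that a local directory named like a public accession takes precedence over the public experiment.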
ArrayExpressHTS in Bioconductor
ArrayExpressHTS pipeline
For each step there are several options that can be easily configured:
- Reference: a genome or a transcriptome
- Aligner: any of the supported ones (Bowtie, BWA or TopHat); if the aligner is not supported, the pipeline can be started from already aligned BAM files
- Filtering: several options, such as cut-offs based on average base quality or read complexity, can be combined at will
- Expression estimation: Cufflinks or a new method called MMSEQ
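As an illustration of a filter based on average base quality, the sketch below keeps reads whose mean Phred score reaches a cut-off. It assumes Phred+33 quality strings and an arbitrary cut-off of 20; both are assumptions for the example, and the code is a stand-alone Python sketch rather than the filter implemented in the pipeline.

```python
def mean_base_quality(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding by default."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_reads(reads, min_mean_quality=20.0):
    """Keep (read_id, sequence, quality) tuples whose mean quality passes the cut-off.

    Reads with an empty quality string are dropped.
    """
    return [
        read for read in reads
        if read[2] and mean_base_quality(read[2]) >= min_mean_quality
    ]


# Made-up reads: 'I' encodes Phred 40, '5' encodes Phred 20, '!' encodes Phred 0.
reads = [
    ("read_1", "ACGT", "IIII"),
    ("read_2", "ACGT", "!!!!"),
    ("read_3", "ACGT", "5555"),
]
kept = filter_reads(reads)
print([read[0] for read in kept])  # ['read_1', 'read_3']
```

Because each filter is just a predicate on a read, several of them (quality, complexity, and so on) can be chained in any order.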
Using ArrayExpressHTS
library("ArrayExpressHTS")
# "E-GEOD-16190" is the project name: a local directory or a public ArrayExpress accession
# usercloud = FALSE: do not use the R Cloud cluster
aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)
ArrayExpressHTS on the R cloud
[Diagram. Labels: R-servers running the ArrayExpressHTS R package; ArrayExpress (SDRF, IDF); ENA (raw data, experiment metadata); references, index and annotation; pipeline tools (tophat, bowtie, bwa, cufflinks, samtools); user project storage (ExpressionSet, quality reports)]
RNA-seq processing pipeline
ArrayExpress – two databases
Expression Atlas: experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME/MINSEQE scores
- Experiment must have 6 or more assays
- Sufficient replication and large sample size
- Experimental factors (EF) and experimental factor values (EFV) must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available
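The criteria that can be checked mechanically could be expressed as a simple filter, sketched below in Python. The `Experiment` fields and the example IDs are hypothetical, and the less quantifiable criteria (MIAME/MINSEQE score, sufficient replication) are deliberately left out; only the threshold of 6 assays comes from the slide.

```python
from dataclasses import dataclass

MIN_ASSAYS = 6  # "6 or more assays", as stated on the slide


@dataclass
class Experiment:
    accession: str               # placeholder IDs below, not real accessions
    technology: str              # "microarray" or "hts"
    n_assays: int
    array_design_provided: bool  # only checked for microarray experiments
    factors_annotated: bool      # EF and EFV well annotated
    samples_annotated: bool
    has_usable_data: bool        # processed data, or raw data that can be renormalized


def failed_criteria(exp):
    """Return the list of criteria the experiment fails; empty means eligible."""
    failed = []
    if exp.technology == "microarray" and not exp.array_design_provided:
        failed.append("array design missing")
    if exp.n_assays < MIN_ASSAYS:
        failed.append("fewer than 6 assays")
    if not exp.factors_annotated:
        failed.append("EF/EFV not well annotated")
    if not exp.samples_annotated:
        failed.append("inadequate sample annotation")
    if not exp.has_usable_data:
        failed.append("no processed or renormalizable raw data")
    return failed


good = Experiment("EXAMPLE-1", "hts", 8, False, True, True, True)
small = Experiment("EXAMPLE-2", "microarray", 4, False, True, True, True)
print(failed_criteria(good))
print(failed_criteria(small))
```

Returning the failed criteria rather than a bare yes/no makes it easy to tell a submitter what is missing.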
Expression Atlas: Atlas construction
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
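The gene-by-condition matrix of signed p-values can be pictured with the small Python sketch below. The t-statistics and p-values are made-up numbers standing in for limma output, the gene and condition names are placeholders, and the 0.05 threshold is an assumption; this is not how the Atlas itself stores or queries its data.

```python
# (gene, condition) -> (t_statistic, p_value); made-up values standing in for limma output.
results = {
    ("geneA", "liver"): (4.2, 0.001),
    ("geneA", "brain"): (-3.1, 0.01),
    ("geneB", "liver"): (0.4, 0.70),
    ("geneB", "brain"): (2.9, 0.02),
}


def build_matrix(results):
    """Build matrix[gene][condition] = (p_value, direction).

    The direction ("up", "down" or "none") is the sign of the t-statistic.
    """
    matrix = {}
    for (gene, condition), (t_stat, p_value) in results.items():
        if t_stat > 0:
            direction = "up"
        elif t_stat < 0:
            direction = "down"
        else:
            direction = "none"
        matrix.setdefault(gene, {})[condition] = (p_value, direction)
    return matrix


def query(matrix, condition, direction, alpha=0.05):
    """Genes significantly changed in `direction` for `condition` at level `alpha`."""
    return sorted(
        gene for gene, row in matrix.items()
        if condition in row
        and row[condition][0] <= alpha
        and row[condition][1] == direction
    )


matrix = build_matrix(results)
print(query(matrix, "brain", "up"))    # ['geneB']
print(query(matrix, "brain", "down"))  # ['geneA']
```

Storing the direction alongside the p-value is what allows a query to be restricted to up- or down-regulated genes only.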
Expression Atlas Atlas construction
Expression Atlas
Atlas home page
http://www.ebi.ac.uk/gxa/
- Query for genes
- Query for conditions
- Restrict query by direction of differential expression
- The ‘advanced query’ option allows building more complex queries
Atlas gene summary page
Atlas heatmap view
Atlas experiment page
View of RNA-seq data in Ensembl
Atlas gene-condition query
Data submission to AE
Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA). If you have human identifiable sequencing data, you need to submit it to the European Genome-phenome Archive (EGA) and not to ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public
To find out more
Email questions regarding ArrayExpressHTS to:
- Angela Goncalves, filimon@ebi.ac.uk
- Andrew Tikhonov, andrew@ebi.ac.uk
Read more at:
- Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166
- http://www.bioconductor.org/packages/2.9/bioc/html/ArrayExpressHTS.html
- R-cloud: http://www.ebi.ac.uk/Tools/rcloud/
- eLearning courses: http://www.ebi.ac.uk/training/online/