How to store and visualize RNA-seq data

1 How to store and visualize RNA-seq data
Gabriella Rustici, Functional Genomics Group

2 Talk summary
- How we archive RNA-seq data in ArrayExpress
- How we process RNA-seq data
- How we display RNA-seq data in the Expression Atlas
26/08/2011 HTS data in ArrayExpress and Atlas

3 Components of a functional genomics experiment

4 ArrayExpress www.ebi.ac.uk/arrayexpress/
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated data in a structured and standardized format
- Facilitates the sharing of microarray designs, experimental protocols, …
- Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data

5 Standards for sequencing MINSEQE guidelines
Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed MINSEQE guidelines (still a work in progress) are:
- General information about the experiment
- Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
- Experimental design, including sample-data relationships (e.g. which raw data file relates to which sample, …)
- Essential experimental and data processing protocols
- Sequence read data with quality scores, raw intensities and processing parameters for the instrument
- Final processed data for the set of assays in the experiment

6 Standards for microarray & sequencing MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
- IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- Data files: raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
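To make the role of the SDRF more concrete, here is a minimal Python sketch that reads a tab-delimited SDRF excerpt and recovers which raw data file and which experimental factor values belong to each sample. The excerpt, the sample names and the file paths are invented for illustration, and the column used for the raw file (`Comment[FASTQ_URI]`) is an assumption; real SDRF files typically carry many more columns.

```python
import csv
import io

# Hypothetical SDRF excerpt (tab-delimited); names and paths are made up.
SDRF_TEXT = (
    "Source Name\tFactor Value[compound]\tFactor Value[dose]\tComment[FASTQ_URI]\n"
    "sample_1\tnone\t0\treads/sample_1.fastq.gz\n"
    "sample_2\tdrugA\t10\treads/sample_2.fastq.gz\n"
)


def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_sample(rows, file_column="Comment[FASTQ_URI]"):
    """Map each sample name to the raw data file listed for it."""
    return {row["Source Name"]: row[file_column] for row in rows}


def factor_values(rows):
    """Collect the 'Factor Value[...]' columns of each sample."""
    prefix = "Factor Value["
    result = {}
    for row in rows:
        factors = {}
        for header, value in row.items():
            if header.startswith(prefix) and header.endswith("]"):
                # Strip the prefix and the closing bracket to get the factor name.
                factors[header[len(prefix):-1]] = value
        result[row["Source Name"]] = factors
    return result


rows = read_sdrf(SDRF_TEXT)
print(files_by_sample(rows))
print(factor_values(rows))
```

The two lookups correspond to the two things the SDRF is described as holding: sample-to-data relationships and experimental factor annotation.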

7 Types of data that can be submitted

8 ArrayExpress – two databases

9 What is the difference between Archive and Atlas?
Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Atlas
- Gene and/or condition queries
- Query across experiments and across platforms

10 ArrayExpress – two databases

11 How much data in AE Archive?
11 ArrayExpress

12 Browsing the AE Archive

13 Browsing the AE Archive
- The date when the data were loaded in the Archive
- Number of assays
- AE unique experiment ID
- Curated title of experiment
- Species investigated
- ‘Loaded in Atlas’ flag
- Raw sequencing data available in ENA
- The total number of experiments and assays retrieved
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

14 Browsing the AE Archive

15 RNA-seq data in AE Archive

16 HTS data in AE Archive

17 AE Archive – experiment view

18 Link to raw data in ENA

19 RNA-seq processing pipeline
[Pipeline diagram. Labels: data acquisition (direct data submissions and GEO import); ArrayExpress Archive; ENA; EGA; short reads (FASTQ files); SDRF; RNA-seq processing pipeline; BAMs; RPKMs; summary-level data; Ensembl; Expression Atlas.]
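The pipeline slide lists RPKMs among its outputs without spelling out the measure. As a reminder, the sketch below computes RPKM (reads per kilobase of transcript per million mapped reads) using its standard definition. It is an illustration in Python, not the pipeline's own implementation, and the read counts and gene length are made-up numbers.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
example = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(example)  # 25.0
```

Dividing by gene length and library size is what makes values comparable between genes of different lengths and between samples sequenced to different depths.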

20 RNA-seq processing pipeline: ArrayExpressHTS
ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets.
The pipeline can be used for analyzing:
- private data
- public data, available through ArrayExpress and ENA
It can be used:
- on a local computer
- remotely on the EBI R Cloud
The interface for running this pipeline is in R, and here is an example of how the pipeline could be run with default options. The argument is the name of the project to analyse: if the pipeline finds a local directory with this name, it looks there for the raw data and the metadata files. If the directory does not exist, it searches for a public ArrayExpress experiment with this accession, downloads the data from ENA and the metadata from ArrayExpress, and then analyses them.
Goncalves et al., Bioinformatics 2011
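The lookup rule described above (local directory first, public accession otherwise) can be sketched as follows. This is a hypothetical Python illustration of the decision only, not code from the ArrayExpressHTS package: the function name `resolve_project` and the accession pattern (`E-`, four capital letters, a number) are assumptions.

```python
import os
import re
import tempfile

# Assumed shape of an ArrayExpress accession, e.g. E-GEOD-16190.
ACCESSION_PATTERN = re.compile(r"^E-[A-Z]{4}-\d+$")


def resolve_project(name, base_dir="."):
    """Decide where the data for a project name would come from.

    Returns ("local", path) if a directory with that name exists under
    base_dir, ("public", accession) if the name looks like an accession,
    and raises ValueError otherwise.
    """
    path = os.path.join(base_dir, name)
    if os.path.isdir(path):
        return ("local", path)
    if ACCESSION_PATTERN.match(name):
        return ("public", name)
    raise ValueError("no local directory or public accession for " + repr(name))


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "my_private_project"))
    local_result = resolve_project("my_private_project", base)
    public_result = resolve_project("E-GEOD-16190", base)

print(local_result[0], public_result)
```

Checking the local directory first means a private project always takes precedence over a public experiment that happens to share its name.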

21 ArrayExpressHTS in Bioconductor

22 ArrayExpressHTS pipeline
For each of these steps there are several options that can be easily configured, for example:
- the reference can be set to a genome or a transcriptome
- the aligner can be any of the supported ones: Bowtie, BWA or TopHat
- if the aligner is not supported, the pipeline can be started from already aligned BAM files
- several filtering options, such as cut-offs based on average base quality or read complexity, can be combined at will
- for expression estimation, cufflinks or a new method called MMSEQ can be used
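To illustrate what combining filters "at will" can look like, here is a small Python sketch that applies an average-base-quality cut-off and a crude complexity check to FASTQ reads. It is not the pipeline's own filtering code: the helper names are hypothetical, the quality encoding is assumed to be Phred+33, and the number of distinct bases is only a simple stand-in for a real read-complexity measure.

```python
def parse_fastq(lines):
    """Return (name, sequence, quality) tuples from well-formed 4-line records."""
    lines = [line.rstrip("\n") for line in lines]
    return [(lines[i][1:], lines[i + 1], lines[i + 3])
            for i in range(0, len(lines) - 3, 4)]


def mean_base_quality(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def min_mean_quality(threshold):
    """Filter: keep reads whose average base quality reaches the threshold."""
    return lambda read: mean_base_quality(read[2]) >= threshold


def min_distinct_bases(count):
    """Filter: keep reads with at least `count` different bases (crude complexity proxy)."""
    return lambda read: len(set(read[1])) >= count


def apply_filters(reads, filters):
    """Keep only the reads that pass every filter in the list."""
    return [read for read in reads if all(f(read) for f in filters)]


# Made-up reads: read2 has low-quality bases, read3 is a homopolymer.
FASTQ = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTACGT", "+", "!!!!IIII",
    "@read3", "AAAAAAAA", "+", "IIIIIIII",
]

reads = parse_fastq(FASTQ)
kept = apply_filters(reads, [min_mean_quality(30), min_distinct_bases(2)])
print([name for name, _, _ in kept])  # ['read1']
```

Because each filter is an independent predicate, adding or removing a criterion only changes the list passed to `apply_filters`.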

23 Using ArrayExpressHTS
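# Load the pipeline package
library("ArrayExpressHTS")
# Run the pipeline with default options on the public experiment E-GEOD-16190;
# usercloud = FALSE means the EBI R Cloud is not used
aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)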

24 ArrayExpressHTS on the R cloud
[Architecture diagram. Labels: R-server (x3); ArrayExpress (SDRF, IDF); ArrayExpressHTS R package; references, index and annotation; ENA (raw data, experiment metadata); pipeline tools (tophat, bowtie, bwa, cufflinks, samtools); ExpressionSet; quality reports; user project storage.]

25 RNA-seq processing pipeline
[Pipeline diagram. Labels: data acquisition (direct data submissions and GEO import); ArrayExpress Archive; ENA; EGA; short reads (FASTQ files); SDRF; RNA-seq processing pipeline; BAMs; RPKMs; summary-level data; Ensembl; Expression Atlas.]

26 ArrayExpress – two databases

27 Expression Atlas Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME/MINSEQE scores
- Experiment must have 6 or more assays
- Sufficient replication and large sample size
- Experimental factors (EF) and experimental factor values (EFV) must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data which can be renormalized must be available

28 Expression Atlas Atlas construction
- Data are taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
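The gene-by-condition matrix described above can be pictured with a small Python sketch. The gene names, conditions and p-values are invented, the 0.05 threshold is an arbitrary choice, and each entry is stored as a (p-value, sign) pair as an assumption about representation; the slides do not say how the Atlas stores the matrix internally.

```python
# Rows are genes, columns are conditions.
# Each entry is (p_value, sign): sign +1 means up-regulated, -1 down-regulated.
ATLAS_MATRIX = {
    "geneA": {"liver": (0.001, +1), "kidney": (0.20, -1)},
    "geneB": {"liver": (0.03, -1), "kidney": (0.04, +1)},
    "geneC": {"liver": (0.50, +1), "kidney": (0.60, -1)},
}


def query_condition(matrix, condition, direction="any", alpha=0.05):
    """Return genes significantly differentially expressed in a condition.

    direction is "up", "down" or "any"; alpha is the p-value cut-off.
    """
    wanted = {"up": (+1,), "down": (-1,), "any": (+1, -1)}[direction]
    hits = []
    for gene, row in matrix.items():
        p_value, sign = row[condition]
        if p_value < alpha and sign in wanted:
            hits.append(gene)
    return sorted(hits)


print(query_condition(ATLAS_MATRIX, "liver", "up"))    # ['geneA']
print(query_condition(ATLAS_MATRIX, "liver", "down"))  # ['geneB']
print(query_condition(ATLAS_MATRIX, "kidney"))         # ['geneB']
```

Keeping the sign next to the p-value is what allows a query to be restricted by direction of differential expression as well as by significance.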

29 Expression Atlas Atlas construction

30 Expression Atlas

31 Atlas home page http://www.ebi.ac.uk/gxa/
- Restrict query by direction of differential expression
- Query for genes
- Query for conditions
- The ‘advanced query’ option allows building more complex queries

32 Atlas gene summary page

33 Atlas heatmap view

34 Atlas experiment page

35 View of RNA-seq data in Ensembl

36 Atlas gene-condition query

37 Data submission to AE

38 Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA). If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

39 What happens after submission?
- Confirmation
- Curation: the curation team will review your submission and will contact you with any questions.
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided.
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

40 To find out more Email questions regarding ArrayExpressHTS to:
Angela Goncalves, Andrew Tikhonov
Read more at: Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment.
R-cloud:
eLearning courses:

