How to store and visualize RNA-seq data


How to store and visualize RNA-seq data
Gabriella Rustici, Functional Genomics Group, gabry@ebi.ac.uk

Talk summary
- How we archive RNA-seq data in ArrayExpress
- How we process RNA-seq data
- How we display RNA-seq data in the Expression Atlas

26/08/2011 HTS data in ArrayExpress and Atlas

Components of a functional genomics experiment

ArrayExpress
www.ebi.ac.uk/arrayexpress/
- Is a public repository for functional genomics data, mostly generated using microarray or high-throughput sequencing (HTS) assays
- Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
- Provides easy access to well-annotated data in a structured and standardized format
- Facilitates the sharing of microarray designs, experimental protocols, …
- Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Standards for sequencing
MINSEQE guidelines
MINSEQE: Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed MINSEQE guidelines (still a work in progress) are:
- General information about the experiment
- Essential sample annotation, including experimental factors and their values (e.g. compound and dose)
- Experimental design, including sample-data relationships (e.g. which raw data file relates to which sample, …)
- Essential experimental and data processing protocols
- Sequence read data with quality scores, raw intensities and processing parameters for the instrument
- Final processed data for the set of assays in the experiment

Standards for microarray & sequencing
MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
- IDF: the Investigation Description Format file contains top-level information about the experiment, including title, description, submitter contact details and protocols.
- SDRF: the Sample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
- Data files: raw and processed data files. The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
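To make the role of the SDRF more concrete, here is a minimal Python sketch that reads a tab-delimited, SDRF-like table and groups raw data files by the value of one experimental factor. The column names and the three rows are illustrative assumptions, not taken from a real submission; real SDRF files carry many more columns.

```python
import csv
import io

# Hypothetical SDRF excerpt (tab-delimited). The column names follow the
# general MAGE-TAB style, but this exact set of columns is an assumption.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tRaw Data File\n"
    "sample_1\tMus musculus\tcontrol\tsample_1.fastq\n"
    "sample_2\tMus musculus\tdrug_A\tsample_2.fastq\n"
    "sample_3\tMus musculus\tdrug_A\tsample_3.fastq\n"
)


def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group raw data files by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[compound]", "Raw Data File")
print(groups)
```

The grouping step is the part a downstream analysis relies on: it is the sample-to-file relationship that tells a pipeline which reads belong to which condition.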

Types of data that can be submitted

ArrayExpress – two databases

What is the difference between Archive and Atlas?
Archive
- Query by experiment, sample and experimental factor annotations
- Filter on species, array platform, molecule assayed and technology used
Atlas
- Gene and/or condition queries
- Query across experiments and across platforms

ArrayExpress – two databases

How much data in AE Archive?

Browsing the AE Archive

Browsing the AE Archive
- The date when the data were loaded in the Archive
- Number of assays
- AE unique experiment ID
- Curated title of experiment
- Species investigated
- ‘Loaded in Atlas’ flag
- Raw sequencing data available in ENA
- The total number of experiments and assays retrieved
- The direct link to raw and processed data. An icon indicates that this type of data is available.
- The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

Browsing the AE Archive

RNA-seq data in AE Archive

HTS data in AE Archive

AE Archive – experiment view

Link to raw data in ENA

RNA-seq processing pipeline
[Diagram. Labels: Direct data submissions and GEO import; Data acquisition; ArrayExpress Archive; SDRF; ENA; EGA; Short reads (FASTQ files); RNA-seq processing pipeline; BAMs; RPKMs; Summary level data; Expression Atlas; Ensembl.]
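RPKM stands for reads per kilobase of transcript per million mapped reads. As a reminder of the arithmetic behind that unit, here is a small Python sketch. The read count, gene length and library size are made-up numbers, and the function is a generic illustration rather than the pipeline's own code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene,
# in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Dividing by both gene length and library size is what makes values comparable between genes of different lengths and between runs of different depth.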

RNA-seq processing pipeline: ArrayExpressHTS
ArrayExpressHTS is an R-based pipeline for pre-processing, expression estimation and data quality assessment of RNA-seq datasets.
The pipeline can be used for analysing:
- private data held on one's own computer
- public data, available through ArrayExpress and ENA
It can be run:
- on a local computer
- remotely on the EBI R Cloud, www.ebi.ac.uk/tools/rcloud
The interface for running the pipeline is in R, and it can be run with default options. Its argument is the name of the project to analyse. If the pipeline finds a local directory with this name, it looks there for the raw data and the metadata files. If the directory does not exist, it searches for a public ArrayExpress experiment with this accession, downloads the data from ENA and the metadata from ArrayExpress, and then analyses it.
Goncalves et al., Bioinformatics 2011
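The lookup rule described above (local directory first, public accession second) can be sketched in a few lines of Python. This illustrates the decision only and is not the pipeline's implementation. The function name, the accession pattern (generalised from accessions such as E-GEOD-16190) and the error raised for unrecognised names are all assumptions.

```python
import os
import re
import tempfile

# Assumed shape of an ArrayExpress accession, generalised from E-GEOD-16190.
ACCESSION_PATTERN = re.compile(r"^E-[A-Z]{4}-\d+$")


def resolve_project(name, base_dir="."):
    """Decide where the data for a project name should come from.

    Returns "local" if base_dir/name is an existing directory, "public" if
    the name looks like an accession, and raises ValueError otherwise.
    """
    if os.path.isdir(os.path.join(base_dir, name)):
        return "local"
    if ACCESSION_PATTERN.match(name):
        return "public"
    raise ValueError(f"{name!r} is neither a local directory nor an accession")


with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "my_private_run"))
    local_result = resolve_project("my_private_run", base)
    public_result = resolve_project("E-GEOD-16190", base)

print(local_result, public_result)
```

Checking the local directory first means a private copy of a public experiment takes precedence over a fresh download, which matches the order given in the description.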

ArrayExpressHTS in Bioconductor

ArrayExpressHTS pipeline
For each step there are several options that can be easily configured, for example:
- the reference can be a genome or a transcriptome
- the aligner can be any of the supported ones: Bowtie, BWA or TopHat
- if the aligner is not supported, the pipeline can be started from already aligned BAM files
- several filtering options, such as cut-offs based on average base quality or read complexity, can be combined at will
- for expression estimation, Cufflinks or a new method called MMSEQ can be used
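To illustrate how filters of this kind can be combined, here is a small Python sketch that keeps a read only if it passes both an average base quality cut-off and a complexity cut-off. The thresholds, the Phred+33 encoding and especially the complexity measure (one minus the share of the most common base) are assumptions chosen for the example; the pipeline's own definitions may differ.

```python
def mean_base_quality(quality_string, offset=33):
    """Average Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def complexity(sequence):
    """Crude complexity proxy: 1 minus the share of the most common base."""
    most_common = max(sequence.count(base) for base in set(sequence))
    return 1.0 - most_common / len(sequence)


def keep_read(sequence, quality_string, min_quality=20.0, min_complexity=0.3):
    """Combine both filters; a read must pass each threshold to be kept."""
    return (mean_base_quality(quality_string) >= min_quality
            and complexity(sequence) >= min_complexity)


# Three made-up reads as (sequence, quality string) pairs.
reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),  # high quality, mixed bases
    ("AAAAAAAAAA", "IIIIIIIIII"),  # high quality, low complexity
    ("ACGTACGTAC", "##########"),  # mixed bases, low quality
]
kept = [seq for seq, qual in reads if keep_read(seq, qual)]
print(kept)
```

Writing each filter as its own function is what lets them be combined freely: adding a further criterion means adding one more condition to `keep_read`.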

Using ArrayExpressHTS
    library("ArrayExpressHTS")
    aehts <- ArrayExpressHTS("E-GEOD-16190", usercloud = FALSE)

ArrayExpressHTS on the R cloud
[Diagram. Labels: R-server (three instances); SDRF, IDF; ArrayExpress; ArrayExpressHTS R package; References, Index & Annotation; Raw data, Experiment metadata; ENA; Pipeline tools: tophat, bowtie, bwa, cufflinks, samtools; ExpressionSet, Quality reports; User; Project Storage.]

RNA-seq processing pipeline
[Diagram repeated from the earlier slide of the same title.]

ArrayExpress – two databases

Expression Atlas
Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
- For microarray-based experiments, array designs must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
- High MIAME/MINSEQE scores
- The experiment must have 6 or more assays
- Sufficient replication and large sample size
- Experimental factors (EF) and experimental factor values (EFV) must be well annotated
- Adequate sample annotation must be provided
- Processed data must be provided, or raw data that can be renormalized must be available
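Some of these criteria are simple enough to check mechanically. The Python sketch below covers only those: the array design requirement, the minimum of 6 assays, and the availability of usable data. Annotation quality and MIAME/MINSEQE scores need curator judgement or scoring rules not given here, so they are left out. All field names and the placeholder accession are hypothetical.

```python
from dataclasses import dataclass

MIN_ASSAYS = 6  # "Experiment must have 6 or more assays"


@dataclass
class Experiment:
    # All field names are hypothetical, chosen for this illustration.
    accession: str
    technology: str  # "microarray" or "sequencing"
    n_assays: int
    has_array_design: bool
    has_processed_data: bool
    has_renormalizable_raw_data: bool


def failed_criteria(exp):
    """Return the machine-checkable criteria that the experiment fails."""
    failures = []
    if exp.technology == "microarray" and not exp.has_array_design:
        failures.append("array design missing")
    if exp.n_assays < MIN_ASSAYS:
        failures.append("fewer than 6 assays")
    if not (exp.has_processed_data or exp.has_renormalizable_raw_data):
        failures.append("no usable processed or raw data")
    return failures


# Placeholder record: a sequencing experiment with only 4 assays.
candidate = Experiment("E-XXXX-1", "sequencing", 4, False, True, False)
print(failed_criteria(candidate))
```

Returning the list of failed criteria, rather than a single yes or no, makes it easier to tell a submitter what is missing.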

Expression Atlas
Atlas construction
- Data is taken as normalized by the submitter
- Gene-wise linear models (limma) and t-statistics are applied to identify the differentially expressed genes across all biological conditions, in all the experiments
- The result is a two-dimensional matrix where rows correspond to genes and columns correspond to biological conditions
- The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
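The structure of that matrix can be sketched in Python. The example does not fit any linear models: it starts from made-up (t-statistic, p-value) pairs, which in the Atlas would come from the limma fits, and shows only one possible encoding of "p-value together with a sign" as a single signed number. Gene names, conditions and the 0.05 cut-off are assumptions.

```python
import math

# Hypothetical per-gene, per-condition test results: (t-statistic, p-value).
results = {
    "geneA": {"liver": (4.2, 0.001), "brain": (-0.3, 0.77)},
    "geneB": {"liver": (-5.1, 0.0004), "brain": (2.9, 0.01)},
}
conditions = ["liver", "brain"]


def signed_p_matrix(results, conditions):
    """Rows are genes, columns are conditions, entries are signed p-values.

    The magnitude is the p-value; the sign is the sign of the t-statistic
    (positive = over-expressed, negative = under-expressed).
    """
    return {
        gene: [math.copysign(per_cond[c][1], per_cond[c][0]) for c in conditions]
        for gene, per_cond in results.items()
    }


def differentially_expressed(matrix, conditions, condition, alpha=0.05):
    """List (gene, direction) pairs that are significant in one condition."""
    col = conditions.index(condition)
    hits = []
    for gene, row in matrix.items():
        if abs(row[col]) <= alpha:
            direction = "up" if math.copysign(1.0, row[col]) > 0 else "down"
            hits.append((gene, direction))
    return hits


matrix = signed_p_matrix(results, conditions)
print(differentially_expressed(matrix, conditions, "liver"))
```

Once the matrix exists, a query for one condition reads a column and a query for one gene reads a row.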

Expression Atlas
Atlas construction

Expression Atlas

Atlas home page
http://www.ebi.ac.uk/gxa/
- Restrict query by direction of differential expression
- Query for genes
- Query for conditions
- The ‘advanced query’ option allows building more complex queries

Atlas gene summary page

Atlas heatmap view

Atlas experiment page

View of RNA-seq data in Ensembl

Atlas gene-condition query

Data submission to AE

Submission of HTS gene expression data
Submit via the MAGE-TAB submission route. Submit:
- A MAGE-TAB spreadsheet containing details of the samples and protocols used
- Trace data files for each sample (in SRF, FASTQ or SFF format)
- Processed data files
For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
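As a pre-submission sanity check, the two rules above (accepted trace formats, and EGA for human identifiable data) can be sketched in Python. The function names and file names are hypothetical, and matching on file extension is a simplifying assumption: compressed files such as .fastq.gz would be rejected by this check as written.

```python
import os

# Trace formats named in the submission instructions.
ACCEPTED_TRACE_EXTENSIONS = {".srf", ".fastq", ".sff"}


def is_accepted_trace_file(filename):
    """Return True if the file extension is one of the accepted formats."""
    return os.path.splitext(filename.lower())[1] in ACCEPTED_TRACE_EXTENSIONS


def raw_data_destination(is_human_identifiable):
    """Pick the archive that should receive the raw sequence data."""
    if is_human_identifiable:
        return "EGA"  # submitted to EGA directly, not through ArrayExpress
    return "ENA via ArrayExpress"


# Made-up file list for one submission.
files = ["liver_rep1.fastq", "liver_rep2.SRF", "notes.docx"]
rejected = [f for f in files if not is_accepted_trace_file(f)]
print(rejected, raw_data_destination(is_human_identifiable=False))
```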

What happens after submission?
- Email confirmation
- Curation: the curation team will review your submission and will email you with any questions
- Possible reopening for editing
- We will send you an accession number when all the required information has been provided
- We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public

To find out more
Email questions regarding ArrayExpressHTS to:
- Angela Goncalves, filimon@ebi.ac.uk
- Andrew Tikhonov, andrew@ebi.ac.uk
Read more at:
- Goncalves et al. (2011). A pipeline for RNA-seq data processing and quality assessment. http://www.ncbi.nlm.nih.gov/pubmed/21233166
- http://www.bioconductor.org/packages/2.9/bioc/html/ArrayExpressHTS.html
R-cloud: http://www.ebi.ac.uk/Tools/rcloud/
eLearning courses: http://www.ebi.ac.uk/training/online/