NCI’s Genomics Data Commons (GDC) & NCI Cloud Pilots


1 NCI’s Genomics Data Commons (GDC) & NCI Cloud Pilots
9/15/2018
Tanja Davidsen, PhD, NCI Center for Biomedical Informatics and IT, National Cancer Institute
March 2017

2

3 The NCI Genomic Data Commons
Provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
One of the NCI resources supporting this vision is the NCI Genomic Data Commons, which will provide….

4 The NCI Genomic Data Commons
Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies.
Available data: NCI-funded cancer genomics datasets; user submissions.
Data searching and retrieval/downloading.
Harmonization of raw sequence data (alignment and variant calling) for all GDC data.
Application of state-of-the-art methods for generating derived data.
Developed, supported, and hosted by the University of Chicago.
The GDC achieves this goal as a knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.

5 NCI Genomic Data Commons
A unified data repository for the research community.
Diagram: Researchers and the NCI Genomic Data Commons (data storage; retrieval, submission, and harmonization).

6 NCI Genomic Data Commons
The GDC went live on June 6, 2016 with approximately 4.1 PB of data. This includes:
2.6 PB of legacy data
1.5 PB of “harmonized” data
577,878 files about cases (patients), in 42 cancer types, across 29 primary sites
10 major data types, ranging from Raw Sequencing Data and Raw Microarray Data to Copy Number Variation, Simple Nucleotide Variation, and Gene Expression
Data derived from 17 different experimental strategies, the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array, and Expression Array
Foundation Medicine announced the release of 18,000 genomic profiles to the GDC at the Cancer Moonshot Summit.

7 GDC: Data Submission & Harmonization
Data Harmonization

8 GDC: Data Retrieval
Retrieval routes: GDC Website, Data Transfer Tool, Data Portal, Visualization Tools, Legacy Archive, GDC API.
GDC API request anatomy: API URL, endpoint, URL parameters, query parameters.
Example API response (excerpt, truncated on the slide):
{ "data": { "hits": [
{"project_id": "TCGA-SKCM", "primary_site": "Skin"},
{"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
{"project_id": "TCGA-LAML", "primary_site": "Blood"},
{"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
{"project_id": "TCGA-UVM", "primary_site": "Eye"}
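As a rough sketch of how the request parts named on this slide (API URL, endpoint, query parameters) fit together, the Python below builds a request URL and reads a response shaped like the slide's excerpt. The base URL `https://api.gdc.cancer.gov`, the choice of the `fields` and `size` query parameters, and the helper names `build_query_url` and `extract_hits` are assumptions not stated on the slide. The sample response is the slide's excerpt with closing brackets added so that it parses; no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumption: public GDC API base URL; the slide does not spell it out.
GDC_API_BASE = "https://api.gdc.cancer.gov"


def build_query_url(endpoint, fields, size=5, base=GDC_API_BASE):
    """Join API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "size": size}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{base}/{endpoint}?{urlencode(params, safe=',')}"


def extract_hits(response_text):
    """Return the hit records from a response shaped like the slide's excerpt."""
    return json.loads(response_text)["data"]["hits"]


# The slide's response excerpt, closed off here so it is valid JSON.
SAMPLE_RESPONSE = """
{ "data": { "hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"}
] } }
"""

url = build_query_url("projects", ["project_id", "primary_site"])
sites = {hit["project_id"]: hit["primary_site"]
         for hit in extract_hits(SAMPLE_RESPONSE)}

print(url)
for project_id, primary_site in sites.items():
    print(project_id, "->", primary_site)
```

To run this against the live service, the URL could be fetched with any HTTP client and the body passed to `extract_hits`; the live response may carry additional fields beyond those shown on the slide.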

9 Content in the Genomic Data Commons
TCGA: ,353 cases
TARGET: ,178 cases
Current: ~58,000 cases
Foundation Medicine: 18,000 cases
Cancer studies in dbGaP: ~4,000 cases
MMRF: ~1,000 cases
Coming soon
NCI-MATCH: ~3,000 cases
Clinical Trial Sequencing Program: ~3,000 cases
Planned (1-3 years)
Cancer Driver Discovery Program: ~5,000 cases
Human Cancer Model Initiative: ~1,000 cases
APOLLO – VA and DoD: ~8,000 cases
GDC launched with two of the major NCI genomic data sets, TCGA and TARGET.

10 The NCI Cancer Genomics Cloud Pilots
Understanding how to meet the research community’s need to analyze large-scale cancer genomic and clinical data

11 NCI Cancer Genomics Cloud Pilots
Cloud Pilots provide:
Access to large genomic data sets without the need to download them
Access to popular pipelines and visualization tools
Ability for researchers to bring their own tools and pipelines to the data
Ability for researchers to bring their own data and analyze it in combination with NCI genomic data
Workspaces for researchers to save and share their data and the results of analyses
Goal: democratize access to NCI-generated genomic and related data, and create a cost-effective way to provide scalable computational capacity to the cancer research community.
These pilots were initiated two years ago because the traditional model, in which every research group downloads and manages its own copy of the data, was no longer scalable, and the NCI wanted to explore the effectiveness of co-locating data and compute in a cloud environment for access and analysis. The overall goal is to…..

12 NCI Genomic Data Commons
GDC/Cloud Pilot Ecosystem
Diagram: Researchers work with cancer genomic data through the NCI Genomic Data Commons (data storage; retrieval, submission, and harmonization) and the NCI Cloud Pilots: Broad FireCloud, ISB CGC, and SBG CGC (data compute; analysis, workflows, and pipelines).

13 Three NCI Genomics Cloud Pilots
Broad Institute. PI: Gad Getz. Google Cloud. Firehose in the cloud, including Broad best practices workflows.
Institute for Systems Biology. PI: Ilya Shmulevich. Leverage Google infrastructure. Novel query and visualization.
Seven Bridges Genomics. PI: Deniz Kural. Amazon Web Services. Interactive data exploration; > 30 public pipelines.

14 Broad Institute Cloud Pilot
Targeted at users performing analyses at scale. Modeled after their Firehose analysis infrastructure developed for the TCGA program. Users can upload their own data and tools and/or run the Broad’s best practice tools and pipelines on pre-loaded data.

15 Institute for Systems Biology Cloud Pilot
Closely tied to Google Cloud Platform tools, including BigQuery, App Engine, Cloud Datalab, Google Genomics, and Compute Engine.
Level-3 TCGA data in BigQuery allows fast SQL-like queries across the entire dataset.
Web interface allows scientists to interactively compare and define cohorts.
User roles: PI / Biologist (web access); Computational Research Scientist (Python, R, SQL); Algorithm Developer (ssh, programmatic access).
Diagram labels: ISB-CGC Web App, Google Cloud Console, Google APIs, ISB-CGC APIs, Compute Engine VMs, Cloud Storage, BigQuery, Genomics, Local Storage, ISB-CGC Hosted Data, Controlled-Access Data, Open-Access Data, User Data.
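The SQL-style cohort queries mentioned on this slide can be illustrated locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for BigQuery; the `clinical` table, its columns, and every row in it are hypothetical toy values invented for illustration, not the ISB-CGC schema or real TCGA data.

```python
import sqlite3

# In-memory database standing in for a BigQuery dataset (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clinical ("
    "case_id TEXT, project_id TEXT, age_at_diagnosis INTEGER)"
)

# Hypothetical toy rows; these are NOT real patient records.
toy_rows = [
    ("case-1", "TCGA-SKCM", 45),
    ("case-2", "TCGA-SKCM", 67),
    ("case-3", "TCGA-LAML", 71),
    ("case-4", "TCGA-UVM", 58),
    ("case-5", "TCGA-LAML", 63),
]
conn.executemany("INSERT INTO clinical VALUES (?, ?, ?)", toy_rows)

# Define a cohort (diagnosed at age 60 or older) and count cases per project.
cohort = conn.execute(
    "SELECT project_id, COUNT(*) AS n_cases "
    "FROM clinical "
    "WHERE age_at_diagnosis >= 60 "
    "GROUP BY project_id "
    "ORDER BY project_id"
).fetchall()

for project_id, n_cases in cohort:
    print(project_id, n_cases)
```

The same SELECT / WHERE / GROUP BY pattern is the kind of query the slide describes running across the whole dataset; table names and access details on the actual platform would differ.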

16 Seven Bridges Genomics Cloud Pilot
Built upon the SBG commercial cloud-based genomics platform.
Graphical query interface to identify hosted data of interest.
Includes a native implementation of the Common Workflow Language specification for creating user-defined workflows.

17 Timeline & Extension
Phases: Selection, Design/Build I, Design/Build II, Evaluation, Extension.
Timeline dates: Jan 2014, Sept 2014, April 2015, Jan 2016, Sept 2016.
One-year contract extension for all three NCI Cloud Pilots:
Continue to make all current tools and data available for an additional year
Build on the tools/analyses available
Continue to make the platform, pipelines, and tools portable (Dockerization; workflow languages CWL/WDL)
New datasets added (including pediatric cancer data)
New data types added: proteomics, imaging data, multiple genome builds
Overall ~2.5 PB of data in the extension

18 Community Evaluation of the Cloud Pilots
Cloud Credits: Storage and compute credits are available to researchers through a tiered system to use the Cloud Pilots.
Grant supplements: Support NCI grantees to serve as beta-testers and conduct genomic analysis relevant to their research on one or more Cloud Pilots.
DREAM Challenge: Crowd-based competition to identify the optimal methods for detecting and quantifying mRNA fusions and isoforms from RNA-Seq data.

19

