Cloud based NGS data analysis of KM12 cell line

Ettore Rizzo¹,² Ph.D., Roberta Bosotti², Giovanni Carapezza², Sebastiano Di Bella², Antonella Isacchi², Riccardo Bellazzi¹ Ph.D.
¹Laboratory of Bioinformatics, Mathematical Modelling and Synthetic Biology, University of Pavia, via Ferrata 1, Pavia, Italy
²Nerviano Medical Sciences S.r.l., Via Pasteur Louis, 10, Nerviano (Milan), Italy

BACKGROUND. To detect DNA variants and investigate gene expression data, we developed two scalable and parallelizable workflows that run on the cloud (e.g., Amazon Web Services, AWS), reducing analysis time while remaining cost-effective. The pipelines integrate state-of-the-art bioinformatics tools and are built on top of COSMOS [1], a workflow management system that lowers the cost of genomic data analysis in two ways: 1) it implements a highly parallelizable workflow that can be run quickly and efficiently on a large compute cluster, and 2) it takes advantage of AWS spot-instance pricing to reduce the cost per hour. As a test case, the pipelines were applied to Next-Generation Sequencing data from DNA and RNA extracted from the KM12 human colorectal cancer cell line (whole-exome and whole-transcriptome sequencing). The analysis highlighted the characteristic TPM3-NTRK1 genomic rearrangement harbored by this cell line [2].

MATERIALS AND METHODS. This study focuses on the implementation of two COSMOS workflows: one performs variant discovery in DNAseq data, the other evaluates differential gene expression in RNAseq data. The DNAseq workflow implements the GATK [3] best-practice protocol (Broad Institute), a widely accepted analysis standard. It involves the following steps: mapping and marking duplicates; local realignment around indels; base quality score recalibration; variant calling by HaplotypeCaller; and variant quality score recalibration.
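The ordering of the DNAseq steps can be pictured as a small dependency graph. The sketch below is only an illustration built on Python's standard-library graphlib module; it does not use the COSMOS API, and the stage names are hypothetical labels chosen here for the steps listed above.

```python
from graphlib import TopologicalSorter

# Hypothetical stage labels for the DNAseq steps listed above.
# Each key maps to the set of stages that must finish before it can start.
DNASEQ_STAGES = {
    "map_reads": set(),
    "mark_duplicates": {"map_reads"},
    "indel_realignment": {"mark_duplicates"},
    "base_recalibration": {"indel_realignment"},
    "haplotype_caller": {"base_recalibration"},
    "variant_recalibration": {"haplotype_caller"},
}


def execution_order(stages):
    """Return one valid run order in which every stage follows its prerequisites."""
    return list(TopologicalSorter(stages).static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(DNASEQ_STAGES), start=1):
        print(step, name)
```

Because the stages form a single chain, only one order is valid; a workflow with independent branches would admit several.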
This pipeline also includes annotation through Annovar [4] and structural variation screening using DELLY [5], a Structural Variant (SV) discovery method suitable for detecting copy-number alterations, duplication events and balanced rearrangements (inversions, translocations). The RNAseq workflow implements the TCGA RNAseq pipeline. Reads are aligned to the reference genome with MapSplice [6], which also detects splice junctions. Isoform-level and gene-level abundances are then estimated with RSEM [7]. Finally, differential expression analysis is performed using the Bioconductor package edgeR [8].

To automate and simplify building, configuring and managing the AWS EC2 cluster used to run the pipelines, we rely on the StarCluster toolkit. It launches and shuts down cluster nodes without user intervention and automatically installs both a job manager and a file-sharing system on all cluster nodes. After the workflow management system loads a workflow, a workflow parser breaks up each stage into multiple jobs that are then executed in parallel. Jobs are distributed from a master node to worker nodes using a standard job manager such as Grid Engine. Through a dynamic web interface provided by COSMOS, users can monitor in real time their workflows, the state and dependencies of jobs, and the resources used by each job.

RESULTS. The pipelines were executed in a cloud computing environment of 5 nodes (1 master and 4 workers), each a "cc2.8xlarge" AWS instance with 32 cores and 60 GB of RAM. The DNAseq analysis took less than 3 hours of AWS wall-clock time from raw data processing to the annotation step and, more importantly, cost less than 50€. The RNAseq analysis took less than 2 hours and cost less than 40€.
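The way a stage is broken into independent jobs, as described under MATERIALS AND METHODS, can be illustrated with a small scatter/gather sketch. This is not COSMOS or Grid Engine code: the per-chromosome split, the function names, and the thread pool standing in for the worker nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: the stage is split by chromosome; the split unit is not stated above.
CHROMOSOMES = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]


def expand_stage(stage_name, regions):
    """Turn one stage into a list of independent job descriptions, one per region."""
    return [{"stage": stage_name, "region": region} for region in regions]


def run_job(job):
    """Placeholder for a real tool invocation; returns a label for the finished job."""
    return f"{job['stage']}:{job['region']}:done"


def run_stage(stage_name, regions, workers=4):
    """Run all jobs of one stage in parallel and gather results in submission order."""
    jobs = expand_stage(stage_name, regions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, jobs))


if __name__ == "__main__":
    results = run_stage("haplotype_caller", CHROMOSOMES)
    print(len(results), "jobs finished")
```

The default of four workers mirrors the four worker nodes of the cluster; in a real deployment the job manager, not a local thread pool, would place jobs on nodes.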
To evaluate experiment quality and the accuracy of our DNAseq pipeline, we compared our SNV calls against publicly available NGS data for KM12 (CCLE, Cancer Cell Line Encyclopedia [9]), obtaining a 95% overlap between call sets. Finally, the known TPM3-NTRK1 rearrangement in KM12 was detected by both exome and transcriptome analysis (see below).

REFERENCES
[1] Gafni, Erik, et al. "COSMOS: Python library for massively parallel workflows." Bioinformatics (2014): btu385.
[2] Ardini, Elena, et al. "The TPM3-NTRK1 rearrangement is a recurring event in colorectal carcinoma and is associated with tumor sensitivity to TRKA kinase inhibition." Molecular Oncology 8.8 (2014).
[3] McKenna, Aaron, et al. "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome Research 20.9 (2010).
[4] Wang, Kai, Mingyao Li, and Hakon Hakonarson. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data." Nucleic Acids Research (2010): e164.
[5] Rausch, Tobias, et al. "DELLY: structural variant discovery by integrated paired-end and split-read analysis." Bioinformatics (2012): i333-i339.
[6] Wang, Kai, et al. "MapSplice: accurate mapping of RNA-seq reads for splice junction discovery." Nucleic Acids Research (2010): e178.
[7] Li, Bo, and Colin N. Dewey. "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12.1 (2011): 323.
[8] Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26.1 (2010).
[9] Barretina, Jordi, et al. "The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity." Nature (2012).
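NOTE ON THE CALL-SET COMPARISON. The overlap reported in RESULTS is given without a formula, so the sketch below assumes one plausible definition: the fraction of our SNV calls that also appear in the reference set, matched on chromosome, position, reference allele and alternate allele. The function names and the toy records are hypothetical and are not KM12 data.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a call to a hashable key; strips an optional 'chr' prefix."""
    chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return (chrom, int(pos), ref.upper(), alt.upper())


def overlap_fraction(our_calls, reference_calls):
    """Fraction of our calls that are also present in the reference call set."""
    ours = {variant_key(*call) for call in our_calls}
    reference = {variant_key(*call) for call in reference_calls}
    if not ours:
        return 0.0
    return len(ours & reference) / len(ours)


if __name__ == "__main__":
    # Toy records for illustration only.
    ours = [
        ("chr1", 100, "A", "G"),
        ("chr2", 200, "C", "T"),
        ("chr3", 300, "G", "A"),
        ("chr4", 400, "T", "C"),
    ]
    reference = [("1", 100, "a", "g"), ("2", 200, "C", "T"), ("3", 300, "G", "A")]
    print(overlap_fraction(ours, reference))
```

Other definitions, such as intersection over union, would give different figures for the same two call sets.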