The St. Jude Children’s Research Hospital/Washington University Pediatric Cancer Genome Project: A CIO’s Perspective
Clayton W. Naeve, Ph.D.
Endowed Chair in Bioinformatics, SVP & CIO
St. Jude Children’s Research Hospital
The Data Deluge
St. Jude Data: The First 50 Years
First 48 years: ~800 TB
Last 2 1/2 years: ~1,000 TB, of which PCGP accounts for 917 TB in 148 million files
St. Jude/WashU Pediatric Cancer Genome Project
Launched Feb. 2010 as a St. Jude/WashU collaboration
Whole-genome sequencing (WGS) of 600 patients (leukemia, brain tumors, solid tumors), with matched germline and tumor samples
1,200 genomes (~90 billion bp sequenced per genome) in 36 months, generating ~2 petabytes of data
For perspective: the first human genome cost $3.5B and took 13 years; one genome, if printed, would fill 86 four-drawer file cabinets; and the adult human brain is estimated to store at most ~2.5 PB
Challenges to Information Sciences
Moving data
Data workflow
Data analysis
Computational horsepower
Data storage
Data sharing
Moving Data
Multi-terabyte data transit across networks is not trivial.
DNA sequence raw reads, contig assemblies, alignments to the reference, variant calls, etc. are shipped to SJCRH as binary BAM files of ~100 GB each.
Sending a file of that size over the commodity internet takes anywhere from 24 hours to effectively forever, so we use Internet2 connectivity (10 Gb/s via the MRC) to transfer files from WashU to SJCRH.
We evaluated 5 different fast data transfer tools and selected FDT (Fast Data Transfer), developed by the physics community at Caltech to move Large Hadron Collider data from CERN.
We developed a pipeline to facilitate the transfer; today the transit time is ~5 hours per file.
The pipeline manages the WashU-to-SJ transfer; upon receipt, BAM files are automatically detected, Pallas is launched, files are distributed to the appropriate locations, and initial processing begins (see the sketch below).
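The following is a minimal, illustrative watcher of the kind this hand-off implies, not the actual SJCRH/Pallas pipeline. The directory paths, the "stable file size" heuristic for detecting a finished transfer, and the downstream command are all assumptions made for the sketch.

```python
#!/usr/bin/env python3
"""Illustrative watcher sketch only (not the SJCRH/Pallas pipeline): detect
newly arrived BAM files in an FDT landing directory and hand them off for
initial processing. Paths and the hand-off command are hypothetical."""

import subprocess
import time
from pathlib import Path

LANDING_DIR = Path("/incoming/washu")            # hypothetical FDT destination
PROCESSED_LOG = Path("/incoming/processed.list")


def already_processed():
    """Names of BAM files that have already been handed off."""
    return set(PROCESSED_LOG.read_text().split()) if PROCESSED_LOG.exists() else set()


def transfer_complete(bam, settle_seconds=300):
    """Crude heuristic: treat the transfer as complete once the file size
    has been stable for a few minutes."""
    size_before = bam.stat().st_size
    time.sleep(settle_seconds)
    return bam.exists() and bam.stat().st_size == size_before


def main():
    seen = already_processed()
    for bam in sorted(LANDING_DIR.glob("*.bam")):
        if bam.name in seen or not transfer_complete(bam):
            continue
        # Hypothetical hand-off: in the real system, this is where Pallas
        # would be launched and the file routed to its storage location.
        subprocess.run(["echo", "launch-initial-processing", str(bam)], check=True)
        with PROCESSED_LOG.open("a") as log:
            log.write(bam.name + "\n")


if __name__ == "__main__":
    main()
```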
Moving data between data centers required investment in our networking infrastructure:
Thick red lines represent 40 Gb/s InfiniBand. There are four IB links between the two data centers, providing an aggregate 160 Gb/s of bandwidth for HPC. The IB network allows a centralized parallel scratch file system accessible to all HPC resources. The IB protocol allows maximum saturation of the bandwidth between resources, eliminating the need to stage large data sets (e.g., assembly runs). The upgrade also keeps HPC traffic off the campus backbone (except for failover purposes) and vice versa. Less contention -> better performance.
Moving data around campus required investment in networking infrastructure:
Overall SJ network architecture: PCGP, PACS, and the proton beam program have driven the need for a 10 Gb/s backbone, and PCGP has driven the need for 40 Gb/s InfiniBand connectivity between HPC systems.
Data Workflow
We began work on the PCGP 9 months prior to launch.
Developed a LIMS for the Validation Lab (part of our SRM2 development project)
Developed a PCGP SharePoint site to facilitate internal collaboration
Developed a bioinformatics workflow engine, PALLAS, which provides:
Security management
Data provenance management (which should allow elimination of intermediate files, minimizing disk storage requirements)
Intermediate and final result tracking
Flexible workflow design
Rapid configuration of new analytical algorithms and tools
Web-based LSF job submission and monitoring (it handles LSF for you; see the sketch below)
Support for a range of protocols to connect to other web applications, databases, file systems, etc.
Integration with applications such as SRM and the Genome Browser, and data integration with tissue sample, clinical, and research data
Vision: route each algorithm to the appropriate computing environment
Pallas is currently used primarily in the early stages of data analysis; most algorithms are still run one by one.
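As a rough illustration of the job-submission step that Pallas automates, here is a minimal Python wrapper around LSF's bsub. The queue name, resource string, memory units, and the example command are assumptions; the real Pallas interface is not shown here.

```python
"""Minimal sketch of programmatic LSF job submission of the kind a workflow
engine such as Pallas wraps. Not Pallas code; queue names, resource strings,
and the example command are assumptions for illustration."""

import subprocess


def submit_lsf_job(command, job_name, queue="normal", cores=4, mem_mb=8000):
    """Submit a command via bsub and return the LSF job ID."""
    bsub_cmd = [
        "bsub",
        "-J", job_name,                  # job name
        "-q", queue,                     # target queue (site-specific)
        "-n", str(cores),                # number of slots
        "-R", f"rusage[mem={mem_mb}]",   # memory reservation (units are site-configured)
        "-o", f"{job_name}.%J.out",      # stdout log (%J expands to the job ID)
        command,
    ]
    out = subprocess.run(bsub_cmd, capture_output=True, text=True, check=True).stdout
    # bsub prints e.g. "Job <12345> is submitted to queue <normal>."
    return out.split("<")[1].split(">")[0]


# Example (hypothetical tool and inputs):
# job_id = submit_lsf_job("snv_caller --bam SJ001_tumor.bam", "snv_SJ001")
```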
Jinghui Zhang and CompBio Team
~25 algorithms are in use to fully analyze WGS data:
BAM quality assurance: tumor purity algorithm (SJCRH); not-disease/genomic-swap checks (SNP checks); xenograft filter (removes contaminating mouse reads); gene, exon, and genome coverage algorithms (Gang Wu; a minimal coverage sketch appears below)
BAM file work: BAM file extraction and visualization; SAMtools and C++/BioPerl APIs; Bambino; IGV
Single nucleotide variation: FreeBayes; in-house PCGP caller
Copy number variation: Stan's copy number algorithm; regression tree algorithm
Structural variation: one-end-anchored inference; CREST; viral topology
Fusion detection: in-house (Michael Rusch)
RNA-seq: RNA-seq MySQL/Cufflinks
ChIP-seq: ChIP-seq MySQL/in-house (John Obenauer); viralScan, in-house (McGoldrick)
Integration: GFF intersect; gff2fasta; gffBuilders; cancer warehouse
Visualization: Circos maker; BED/GFF track maker
Collecting the data for one genome takes a week and costs ~$8,000. Analyzing the data for one genome can be done at SJ in a couple of days and costs ? The $1,000 genome is near (12-18 months); analysis will remain much more expensive for quite some time.
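To make one of the smaller QC steps concrete, here is a minimal sketch that estimates mean per-base coverage from a BAM file with samtools. It is illustrative only and stands in for, rather than reproduces, the in-house coverage algorithms; the BAM file name in the example is a placeholder.

```python
"""Illustrative QC sketch: estimate mean per-base coverage from a BAM file
using `samtools depth`. This is not the in-house coverage algorithm."""

import subprocess


def mean_coverage(bam_path):
    """Average depth over all positions reported by `samtools depth -a`."""
    proc = subprocess.Popen(["samtools", "depth", "-a", bam_path],
                            stdout=subprocess.PIPE, text=True)
    total_depth = 0
    positions = 0
    for line in proc.stdout:                  # columns: chrom, pos, depth
        total_depth += int(line.rsplit("\t", 1)[1])
        positions += 1
    proc.wait()
    return total_depth / positions if positions else 0.0


# Example (hypothetical file name):
# print(f"mean coverage: {mean_coverage('SJ001_tumor.bam'):.1f}x")
```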
Computational Horsepower (HPCF)
Computational horsepower was needed to support the PCGP program:
IBM BladeCenter (810 cores / 3 TB RAM; ~1.3 TFLOPS)
IBM iDataplex (1,008 cores / 4 TB RAM; ~4.5 TFLOPS) – April 2010
SGI Altix UV1000 (640 cores / 5 TB RAM / 60 TB storage using Lustre v2.2; ~5.8 TFLOPS) – December 2011
IBM SoNAS (780 TB) – March 2011
Data Transfer Node (10 Gb/s Internet2 connection) – April 2011
Internal Data Transfer Node (10 Gb/s x2) – June 2011
QDR InfiniBand (40 Gb/s for all HPC equipment) – January 2012
Software: Platform LSF, Intel Parallel Studio
Total: 2,366 cores, 13 TB RAM, and an estimated 11.6 TFLOPS (trillion floating-point operations per second) combined
PCGP consumed 365,000 CPU-hours in 2010 and 712,000 CPU-hours in 2011.
Data Storage
Data storage is our greatest IT challenge; we have invested significantly in upgrading the storage and backup infrastructure.
IBM SoNAS (780 TB) – March 2011: scales to 21 PB, 1 billion files per file system, and 7,200 drives; 240 SAS (15k) drives and 480 SAS-NL (7.2k) drives; 734 TB usable under one file system; high-speed/low-latency backend interconnect (QDR InfiniBand, 20 Gb per port, ~100 ns latency)
Current total on campus: 3.8 petabytes (3,800,000 GB); PCGP uses 917 TB (plus ~500 TB on tape) in 148 million data files
IBM TSM systems for tiered backup/archive: current capacity of 7,900 tapes at up to 1.6 TB/tape; the EI library has 1,989 slots minus 3 (two cleaning tapes, one system tape) and the RI library (which serves SoNAS) has 1,989 slots minus 3, roughly 4,000 slots x ~1.6 TB each with compression = 6,400 TB = 6.4 PB; tapes hold 800 GB uncompressed, and compression varies but can approach 1.6 TB (sometimes more) per tape; retrieving data from tape takes 60 seconds to 4 minutes
We have three storage pools today: system (15k disk), nearline (7.2k disk), and "hsm", which is tape. SoNAS uses capacity thresholds to trigger migrations between storage pools. The GPFS policy engine typically weighs a file's age and size heavily when it scans a file system (in parallel, across all nine interface nodes) and assembles work lists that spread the migration across many (up to all) interface nodes. It seeks the most bang for the buck in order to keep the file system from getting full: if ten old 100 GB files can drop a storage pool by 1 TB, and that is enough to reach the low watermark for that pool, it will likely move those ten old files. We typically migrate from the system pool to the nearline pool and from the nearline pool to the "hsm" pool. We can also trigger manual migrations, for example when we migrate BucketIntermediate data from either disk pool to the "hsm" pool; in that case we use SQL-like statements to deliberately move files "over 1 GB and 90 days old" from the nearline pool to the "hsm" pool (a conceptual sketch of this selection appears below). The policy engine is flexible enough to let us target file types, names, directories, and file sets. File sets are logical file systems within a single file system that allow us to establish placement rules, quotas, snapshots, etc. on focused regions instead of the whole file system.
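To make the migration rule concrete, the sketch below expresses the "over 1 GB and 90 days old" selection in Python. The real system implements this with SoNAS/GPFS policy-engine rules, not Python, and the mount point shown is a hypothetical placeholder.

```python
"""Conceptual illustration only: the selection logic behind a rule such as
"migrate files over 1 GB and 90 days old from the nearline pool to tape".
The real system uses SoNAS/GPFS policy-engine rules."""

import time
from pathlib import Path

ONE_GIB = 1 << 30
NINETY_DAYS = 90 * 24 * 3600


def migration_candidates(root):
    """Yield (path, size) for files >1 GiB not accessed in the last 90 days."""
    now = time.time()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        if st.st_size > ONE_GIB and (now - st.st_atime) > NINETY_DAYS:
            yield path, st.st_size


# Example (hypothetical nearline mount point):
# for f, size in migration_candidates("/gpfs/nearline/pcgp"):
#     print(f, size // ONE_GIB, "GiB")
```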
>356 Patients/712 Complete Genomes
Gene sequencing project identifies potential drug targets in common childhood brain tumor (Nature, June 20, 2012). Researchers studying the genetic roots of the most common malignant childhood brain tumor have discovered missteps in three of the four subtypes of the cancer that involve genes already targeted for drug development. The most significant gene alterations are linked to subtypes of medulloblastoma that currently have the best and worst prognosis. They were among 41 genes associated for the first time with medulloblastoma by the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project.

World's largest release of comprehensive human cancer genome data helps researchers everywhere speed discoveries (Nature Genetics, May 29, 2012). To speed progress against cancer and other diseases, the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project announced the largest-ever release of comprehensive human cancer genome data for free access by the global scientific community. The amount of information released more than doubles the volume of high-coverage, whole genome data currently available from all human genome sources combined. This information is valuable not just to cancer researchers, but also to scientists studying almost any disease.

Genome sequencing initiative links altered gene to age-related neuroblastoma risk (Journal of the American Medical Association, March 13, 2012). The St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project and Memorial Sloan-Kettering Cancer Center discovered the first gene alteration associated with patient age and neuroblastoma outcome. Researchers identified the first gene mutation associated with a chronic and often fatal form of neuroblastoma that typically strikes adolescents and young adults. The finding provides the first clue about the genetic basis of the long-recognized but poorly understood link between treatment outcome and age at diagnosis.

Cancer sequencing initiative discovers mutations tied to aggressive childhood brain tumors (Nature Genetics, January 29, 2012). Findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) offer important insight into a poorly understood tumor that kills more than 90 percent of patients within two years. The tumor, diffuse intrinsic pontine glioma (DIPG), is found almost exclusively in children and accounts for 10 to 15 percent of pediatric tumors of the brain and central nervous system.

Cancer sequencing project identifies potential approaches to combat aggressive leukemia (Nature, January 11, 2012). Researchers with the PCGP discovered that a subtype of leukemia characterized by a poor prognosis is fueled by mutations in pathways distinctly different from a seemingly similar leukemia associated with a much better outcome. The work provides the first details of the genetic alterations fueling a subtype of acute lymphoblastic leukemia (ALL) known as early T-cell precursor ALL (ETP-ALL). The results suggest ETP-ALL has more in common with acute myeloid leukemia (AML) than with other subtypes of ALL.

Gene identified as a new target for treatment of aggressive childhood eye tumor (Nature, January 11, 2012). New findings from the PCGP helped identify the mechanism that makes the childhood eye tumor retinoblastoma so aggressive. The discovery explains why the tumor develops so rapidly while other cancers can take years or even decades to form. The finding also led investigators to a new treatment target and possible therapy for the rare childhood tumor of the retina, the light-sensing tissue at the back of the eye.

Fundamental insights in medulloblastomas, retinoblastomas, ETP-ALL, pontine gliomas, and neuroblastomas. Public release of all available data (prior to SJ analyses; 9-month quarantine on publication). Our release doubled the amount of human whole-genome sequence data available globally.
Data Sharing http://www.pediatriccancergenomeproject.org
23 people and ~5,000 person-hours to build the Explore data portal.
Integrating genomics with clinical data is a new frontier
We have a prototype integrated data warehouse (IDW) in place covering hematological malignancies, and we will be adding other diseases and other analysis data types over time.
We are currently displaying actionable genotype data in our EMR and will continue to supplement the EMR with additional genetic data.
Data integration is critical: platform data (expression, WGS, methylation, etc.) and processed "genomics" data must be combined with phenotype data (clinical care, clinical research), as illustrated in the sketch below.
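As a toy illustration of the kind of join an integrated data warehouse performs (not the actual IDW schema or EMR interface), the sketch below merges hypothetical actionable variant calls with clinical records on a patient identifier.

```python
"""Toy illustration of genomic/clinical data integration: join actionable
variant calls to clinical records on a patient identifier. File and column
names are hypothetical, not the SJ IDW schema."""

import pandas as pd

# Hypothetical inputs: one row per (patient, variant) and one row per patient.
variants = pd.read_csv("variant_calls.csv")   # patient_id, gene, variant, classification
clinical = pd.read_csv("clinical.csv")        # patient_id, diagnosis, age_at_dx, outcome

# Keep variants flagged as clinically actionable, then attach phenotype data.
actionable = variants[variants["classification"] == "actionable"]
merged = actionable.merge(clinical, on="patient_id", how="left")

# Example downstream view: actionable findings per diagnosis and gene.
summary = merged.groupby(["diagnosis", "gene"]).size().rename("n_patients")
print(summary.sort_values(ascending=False).head())
```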
Key: Staff
Total: >150 FTEs with “research informatics” skills
[Org chart: informatics staffing across the 19 academic departments, Computational Biology, Information Sciences (enterprise informatics with clinical and research groups, HPC, shared resources), the PCGP team, and offshore developers, with FTE counts per group]
Having the right staff has been critical to this success. All St. Jude academic departments have embedded informatics professionals (bioinformaticians, data managers, CRAs), and all interact and collaborate with foci of IT expertise: Computational Biology, Information Sciences, and shared resources. There are ~150 informatics staff on campus and ~200 SJ staff working on this project.
$ummary
Project total cost: $65M (WashU and SJCRH; sequencing costs, staffing, IT, etc.)
New "IT" at SJCRH: 10 FTEs in Computational Biology, 0 FTEs in IS
Capital IT investment: ~$7.2M at SJCRH, ~$9M at WashU
IT is ~25% of overall project costs (roughly $16M of $65M, not including the cost of other participating SJ FTEs)
Information Sciences PCGP Team
27 IS staff contribute to the PCGP (in no particular order):
Ashish Pagare David Zhao Dan Alford Stephen Espy Kiran Chand Bobba Scott Malone Dr. Antonio Ferreira Bill Pappas James McMurry Dr. Jianmin Wang Dr. John Obenauer Jared Becksfort Pankaj Gupta Dr. Suraj Mukatira Simon Hagstrom Sundeep Shakya Asmita Vaidya Swetha Mandava Bhagavathy Krishna Manohar Gorthi Sandhya Rani Kolli Sivaram Chintalapudi Roshan Shrestha Irina McGuire PJ Stevens Thanh Le John Penrod Pat Eddy Dr. Dan McGoldrick
Questions?
PALLAS Data Workflow
[Diagram: PALLAS routes workflow steps (contig assembly, SV, CNV, INDELs, SNV, Circos plotting) to the cluster, the large-memory system, or GPUs]
This routing is conceptual at this stage; hardware changes will also result in most processing being done on the Altix. The SGI Altix UV1000 (large memory) supports current bioinformatics codes more robustly than the cluster or GPU-only environments, and it can be expanded with GPU modules as well. A conceptual routing sketch follows below.
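The following is a conceptual routing sketch only, assuming hypothetical step names, resource classes, and queue targets; it is not the actual Pallas configuration.

```python
"""Conceptual routing sketch: send each analysis step to the compute
environment that suits it (cluster, large-memory Altix, or GPU nodes).
Step names, resource classes, and queue names are assumptions."""

from dataclasses import dataclass


@dataclass
class Step:
    name: str
    needs: str  # "cluster", "large_memory", or "gpu"


# Hypothetical mapping from resource class to an LSF queue / host group.
TARGETS = {
    "cluster": "normal",         # general-purpose blade / iDataplex nodes
    "large_memory": "altix_uv",  # SGI Altix UV1000 shared-memory system
    "gpu": "gpu",                # GPU-equipped nodes, if and when added
}

PIPELINE = [
    Step("contig_assembly", "large_memory"),
    Step("snv_calling", "cluster"),
    Step("indel_calling", "cluster"),
    Step("structural_variation", "cluster"),
    Step("copy_number", "cluster"),
    Step("circos_plots", "cluster"),
]


def route(step):
    """Return the submission target for a pipeline step."""
    return TARGETS[step.needs]


for step in PIPELINE:
    print(f"{step.name:22s} -> queue '{route(step)}'")
```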