Galaxy Community Conference July 27, 2012 The National Center for Genome Analysis Support and Galaxy William K. Barnett, Ph.D. (Director) Richard LeDuc,

Presentation transcript:

Galaxy Community Conference, July 27, 2012
The National Center for Genome Analysis Support and Galaxy
William K. Barnett, Ph.D. (Director)
Richard LeDuc, Ph.D. (Manager)
National Center for Genome Analysis Support

GCC. July 27, 2012 National Center for Genome Analysis Support: Summary
- NCGAS and its mission
- NCGAS cyberinfrastructure
- The 100 Gigabit demonstration
- Scaling genomics analysis
- Trinity optimization

Changing genomics analytical needs
- Next Gen sequencers are generating more data and getting cheaper
- Sequencing is:
  - Becoming commoditized at large centers and
  - Multiplying at individual labs
- Analytical capacity has not kept up
  - Bioinformatics support
  - Computational support (thousand points solution)
  - Storage support

NCGAS widens the analytical bottleneck
- Funded by National Science Foundation
- Large memory clusters for assembly
- Bioinformatics consulting for biologists
- Optimized software for better efficiency
- Open for business at:

Making it easier for Biologists
- Galaxy interface provides a "user friendly" window to NCGAS resources
- Supports many bioinformatics tools
- Available for both research and instruction
[Figure labels: Common, Rare; Computational Skills, LOW to HIGH]

Bio-IT World. April 25, 2012 National Center for Genome Analysis Support: NCGAS Cyberinfrastructure at IU
- Mason large memory cluster (512 GB/node)
- Quarry cluster (16 GB/node)
- Data Capacitor (1 PB at 20 Gbps throughput)
- Research File System (RFS) for data storage
- Research Database Cluster for managing data sets
- All interconnected with a high speed internal network (40 Gbps)

GALAXY.IU.EDU Model
- Virtual box hosting Galaxy.IU.edu
- The host for each tool is configured to meet IU needs
- UITS/NCGAS establishes tools, hardens them, and moves them into production.
- A custom Galaxy tool can be made to import data from the RFS to the DC.
- Individual labs can get duplicate boxes, provided they support it themselves.
- Policies on the DC guarantee that untouched data is removed with time.
[Diagram labels: Quarry, Mason, Data Capacitor, RFS]

NCGAS Sandbox Demo at SC 11
- STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
- STEP 2: sequence alignment to a known reference genome
- STEP 3: SNP detection to scan the alignment result for new polymorphisms
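The three steps of the sandbox demo can be illustrated with a toy sketch. This is not the software used in the SC 11 demo, which the slides do not name; every function name, threshold, and the ungapped matching strategy below are assumptions chosen only to show how quality trimming, alignment to a reference, and SNP detection feed into one another.

```python
from collections import Counter, defaultdict


def trim_low_quality(read, quals, min_q=20):
    """Step 1 (toy): cut the read at the first base whose Phred score is below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return read[:i]
    return read


def best_ungapped_hit(read, reference, max_mismatches=1):
    """Step 2 (toy): slide the read along the reference without gaps and return
    the first offset with the fewest mismatches, or None if none is good enough."""
    best = None  # (offset, mismatches)
    for off in range(len(reference) - len(read) + 1):
        window = reference[off:off + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (off, mm)
    return None if best is None else best[0]


def call_snps(reference, placed_reads, min_depth=2):
    """Step 3 (toy): pile up placed reads and report positions where the most
    common base differs from the reference and is seen at least min_depth times."""
    pile = defaultdict(Counter)
    for off, read in placed_reads:
        for i, base in enumerate(read):
            pile[off + i][base] += 1
    snps = {}
    for pos in sorted(pile):
        base, count = pile[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps[pos] = (reference[pos], base)
    return snps


def run_pipeline(reference, reads_with_quals, min_len=8):
    """Chain the three steps; reads shorter than min_len after trimming are dropped."""
    placed = []
    for read, quals in reads_with_quals:
        trimmed = trim_low_quality(read, quals)
        if len(trimmed) < min_len:
            continue
        off = best_ungapped_hit(trimmed, reference)
        if off is not None:
            placed.append((off, trimmed))
    return call_snps(reference, placed)
```

Calling run_pipeline with a reference string and a list of (read, quality scores) pairs returns a dictionary that maps each variant position to a (reference base, observed base) pair. Production pipelines use indexed, gap-aware aligners and statistical variant callers; the sketch only shows the data flow between the steps.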

Two Options for Computation and Storage
- Computation: NCGAS Mason (free for NSF users); IU POD (12 cents per core hour); Amazon EC2 (20 cents per core hour)
- Storage: Data Capacitor (no data storage charges); Amazon Cloud Storage ($80–120 per TB per month)
- Network: Lustre WAN File System; 10 Gbps and 100 Gbps links
- Data sources: Your Friendly Neighborhood Sequencing Center (shown three times)
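The prices quoted on this slide can be turned into a rough cost comparison. The sketch below uses only the rates shown above; the constant and function names are made up for illustration, and the workload size in the usage note is a hypothetical figure, not one from the talk.

```python
# Rates as quoted on the slide (US dollars).
MASON_USD_PER_CORE_HOUR = 0.0        # free for NSF users
IU_POD_USD_PER_CORE_HOUR = 0.12
EC2_USD_PER_CORE_HOUR = 0.20
DATA_CAPACITOR_USD_PER_TB_MONTH = (0.0, 0.0)      # no data storage charges
AMAZON_STORAGE_USD_PER_TB_MONTH = (80.0, 120.0)   # quoted as a range


def compute_cost(core_hours, usd_per_core_hour):
    """Cost of a job that consumes core_hours at a flat hourly rate."""
    return core_hours * usd_per_core_hour


def storage_cost_range(terabytes, months, rate_range=AMAZON_STORAGE_USD_PER_TB_MONTH):
    """Low and high storage cost for keeping terabytes of data for months."""
    low, high = rate_range
    return terabytes * months * low, terabytes * months * high
```

For a hypothetical job of 10,000 core hours, compute_cost gives about $1,200 on IU POD and $2,000 on Amazon EC2, against no charge on Mason for NSF users. Keeping 5 TB for 3 months at the quoted Amazon rates comes to between $1,200 and $1,800, against no charge on the Data Capacitor.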

This Architecture Scales!
- Commodity Internet (1 Gbps but highly variable)
- Internet2 (100 Gbps)
- NLR to Sequencing Centers (10 Gbps/link)
- IU Data Capacitor (20 Gbps throughput)
- Ultra SCSI 160 Disk (1.2 Gbps, 160 MBps)
- DDR3 SDRAM (51.2 Gbps, 6.4 GBps)
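To make the bandwidth figures on this slide concrete, the following sketch estimates how long a data set takes to move over each network link. It assumes decimal units (1 TB = 8e12 bits) and the full quoted rate with no protocol overhead, so real transfers will be slower; the dictionary and function names are illustrative.

```python
# Link rates in gigabits per second, as quoted on the slide.
LINKS_GBPS = {
    "Commodity Internet": 1.0,
    "NLR to Sequencing Centers (per link)": 10.0,
    "IU Data Capacitor": 20.0,
    "Internet2": 100.0,
}


def transfer_seconds(terabytes, gbps):
    """Ideal time to move terabytes of data over a link running at gbps."""
    bits = terabytes * 8e12          # decimal terabytes to bits
    return bits / (gbps * 1e9)       # bits divided by bits per second


def transfer_table(terabytes):
    """Ideal transfer time in seconds for every link in LINKS_GBPS."""
    return {name: transfer_seconds(terabytes, rate) for name, rate in LINKS_GBPS.items()}
```

Under these assumptions, 1 TB needs 8,000 seconds (a little over two hours) at a steady 1 Gbps and 80 seconds at 100 Gbps.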

How would this work at scale?
1. Biologists use Galaxy to execute workflows
2. Sequence data mounted via Lustre WAN or automatically transferred using Internet2
3. Data Capacitor flows data into Mason or other computational clusters
4. Data Capacitor mounts or mirrors reference data from NCBI or other sources
5. Results delivered through web interfaces and to visualization or other science tools

Performance Improvements (Richard LeDuc, GCC)
[Figure: chart over the Trinity stages Inchworm, GraphFromFastA, ReadsToTranscripts, QuantifyGraph, Butterfly]

Final Results

Trinity Results
- Significantly reduced runtime, while maintaining correctness of results
- Results are published
- Source code is committed to the official SourceForge repository
- Continued support for HPC optimization for Trinity
- Brian Haas at Broad is developing Trinity workflows for Galaxy

In Sum…
- NG Sequencing is creating an analytical problem that cannot be solved at sequencing centers
- NCGAS can provide a global scale infrastructure to better serve the needs of biologists who cannot become bioinformaticians to accomplish their research
- Trinity is no longer a resource hog

Thank You
Questions?
Bill Barnett
Rich LeDuc