Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS Ravi K Madduri University of Chicago and ANL.

Slides:



Advertisements
Similar presentations
DSpace: the MIT Libraries Institutional Repository MacKenzie Smith, MIT EDUCAUSE 2003, November 5 th Copyright MacKenzie Smith, This work is the.
Advertisements

© 2012 IBM Corporation 1 IBM Cognos 10 family Analytics in the hands of everyone Address all your analytic needs Report, Analyze, Model, Plan and Collaborate.
DTI Image Processing Pipeline and Cloud Computing Environment Kyle Chard Computation Institute University of Chicago.
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S.Deogun University of Nebraska-Lincoln Evaluating Distributed.
HP Quality Center Overview.
Collaboration on Large Datasets using Globus Rachana Ananthakrishnan University of Chicago.
XSEDE 13 July 24, Galaxy Team: PSC Team:
Globus.org/genomics Optimizations in running large-scale Genomics workloads in Globus Genomics using HTCondor Ravi Madduri Joint work with.
Office of Science U.S. Department of Energy Grids and Portals at NERSC Presented by Steve Chan.
Rutgers University Libraries What is RUcore? o An institutional repository, to preserve, manage and make accessible the research and publications of the.
Accelerate Business Success With CRM CRM Interoperability.
Vivien Bonazzi Ph.D. Program Director: Computational Biology (NHGRI) Co Chair Software Methods & Systems (BD2K) Biomedical Big Data Initiative (BD2K)
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
Building Data-intensive Pipelines Ravi K Madduri Argonne National Lab University of Chicago.
Annual SERC Research Review - Student Presentation, October 5-6, Extending Model Based System Engineering to Utilize 3D Virtual Environments Peter.
DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments O. Valladares 1,2, C.-F. Lin 1,2, D. M. Childress 1,2, E. Klevak.
EUROPEAN UNION Polish Infrastructure for Supporting Computational Science in the European Research Space The Capabilities of the GridSpace2 Experiment.
1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.
Sage Bionetworks Mission Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by.
Globus Genomics – Science as a Service for large scale NGS analysis
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
Configuration Management (CM)
Taverna and my Grid Open Workflow for Life Sciences Tom Oinn
A framework to support collaborative Velo: Knowledge Management for Collaborative (Science | Biology) Projects A framework to support collaborative 1.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
-- Don Preuss NCBI/NLM/NIH
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
SCAP E SCAPE Project EU project aimed at building a scalable platform for planning and execution of computation intensive processes for ingestion or migration.
Using Biological Cyberinfrastructure Scaling Science and People: Applications in Data Storage, HPC, Cloud Analysis, and Bioinformatics Training Scaling.
National Center for Supercomputing Applications Barbara S. Minsker, Ph.D. Associate Professor National Center for Supercomputing Applications and Department.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
GRID Overview Internet2 Member Meeting Spring 2003 Sandra Redman Information Technology and Systems Center and Information Technology Research Center National.
Nature Reviews/2012. Next-Generation Sequencing (NGS): Data Generation NGS will generate more broadly applicable data for various novel functional assays.
The Astronomy challenge: How can workflow preservation help? Susana Sánchez, Jose Enrique Ruíz, Lourdes Verdes-Montenegro, Julian Garrido, Juan de Dios.
The Global Land Cover Facility is sponsored by NASA and the University of Maryland.The GLCF is a founding member of the Federation of Earth Science Information.
NEES Cyberinfrastructure Center at the San Diego Supercomputer Center, UCSD George E. Brown, Jr. Network for Earthquake Engineering Simulation NEES TeraGrid.
Cooperative experiments in VL-e: from scientific workflows to knowledge sharing Z.Zhao (1) V. Guevara( 1) A. Wibisono(1) A. Belloum(1) M. Bubak(1,2) B.
Transforming video & photo collections into valuable resources John Waugaman President - Tygart Technology, Inc.
26/05/2005 Research Infrastructures - 'eInfrastructure: Grid initiatives‘ FP INFRASTRUCTURES-71 DIMMI Project a DI gital M ulti M edia I nfrastructure.
Globus online Software-as-a-Service for Research Data Management Steve Tuecke Deputy Director, Computation Institute University of Chicago & Argonne National.
Development of e-Science Application Portal on GAP WeiLong Ueng Academia Sinica Grid Computing
Cyberinfrastructure: Many Things to Many People Russ Hobby Program Manager Internet2.
AHM04: Sep 2004 Nottingham CCLRC e-Science Centre eMinerals: Environment from the Molecular Level Managing simulation data Lisa Blanshard e- Science Data.
Globus.org/genomics Globus Galaxies Science Gateways as a Service Ravi K Madduri, University of Chicago and Argonne National Laboratory
© Copyright AARNet Pty Ltd PRAGMA Update & some personal observations James Sankar Network Engineer - Middleware.
EUROPEAN UNION Polish Infrastructure for Supporting Computational Science in the European Research Space The Capabilities of the GridSpace2 Experiment.
Lars Ailo Bongo NBS meeting Tromsø, Jan 23, 2016 NeLS Norwegian e-Infrastructure for Life Sciences Overview and recent developments
Globus online Delivering a scalable service Steve Tuecke Computation Institute University of Chicago and Argonne National Laboratory.
Northwest Indiana Computational Grid Preston Smith Rosen Center for Advanced Computing Purdue University - West Lafayette West Lafayette Calumet.
Fedora Commons Overview and Background Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
Canadian Bioinformatics Workshops
InSilicoLab – Grid Environment for Supporting Numerical Experiments in Chemistry Joanna Kocot, Daniel Harężlak, Klemens Noga, Mariusz Sterzel, Tomasz Szepieniec.
CyVerse Workshop Discovery Environment Overview. Welcome to the Discovery Environment A Simple Interface to Hundreds of Bioinformatics Apps, Powerful.
INTRODUCTION TO XSEDE. INTRODUCTION  Extreme Science and Engineering Discovery Environment (XSEDE)  “most advanced, powerful, and robust collection.
EGI-InSPIRE RI EGI Compute and Data Services for Open Access in H2020 Tiziana Ferrari Technical Director, EGI.eu
Globus online Cloud Based Services for Science Steve Tuecke Computation Institute University of Chicago and Argonne National Laboratory.
EGI-InSPIRE RI An Introduction to European Grid Infrastructure (EGI) March An Introduction to the European Grid Infrastructure.
Enhancements to Galaxy for delivering on NIH Commons
Accessing the VI-SEEM infrastructure
Tools and Services Workshop
University of Chicago and ANL
Joslynn Lee – Data Science Educator
CyVerse Discovery Environment
Joseph JaJa, Mike Smorul, and Sangchul Song
Storing and Accessing G-OnRamp’s Assembly Hubs outside of Galaxy
Overview of Workflows: Why Use Them?
What is UiPATH? For more details visit this link online-training.
Presentation transcript:

Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS Ravi K Madduri University of Chicago and ANL

Challenges in Sequencing Analysis Proposed Approach Using Globus Genomics Example Collaborations Relevance to XSEDE Q&A Outline

Challenges in Sequencing Analysis Sequencing Centers Data Movement and Access Challenges Manual Data Analysis Public Data Public Data Storage Local Cluster/ Cloud Seq Center Research Lab Data is distributed in different locations Research labs need access to the data for analysis Be able to Share data with other researchers/collaborators Inefficient ways of data movement Data needs to be available on the local and Distributed Compute Resources Local Clusters, Cloud, Grid How do we analyze this Sequence Data Once we have the Sequence Data Picard GATK Fastq Ref Genome Alignment Variant Calling Manually move the data to the Compute node (Re)Run Script Install Modify Install all the tools required for the Analysis BWA, Picard, GATK, Filtering Scripts, etc. Shell scripts to sequentially execute the tools Manually modify the scripts for any change Error Prone, difficult to keep track, messy.. Difficult to maintain and transfer the knowledge FTP, SCP, HTTP SCP FTP, SCP, HTTP

Globus Genomics Sequencing Centers Public Data Public Data Storage Local Cluster/ Cloud Seq Center Research Lab Globus Online Provides a High-performance Fault-tolerant Secure file transfer Service between all data-endpoints Data ManagementData Analysis Picard GATK Fastq Ref Genome Alignment Variant Calling Galaxy Data Libraries Globus Online Integrated within Galaxy Web-based UI Drag-Drop workflow creations Easily modify Workflows with new tools Globus Genomics on Amazon EC2 Analytical tools are automatically run on the scalable compute resources when possible Galaxy Based Workflow Management System FTP, SCP, others FTP, SCP SCP Globus Genomics FTP, SCP, HTTP

Workflows can be easily defined and automated with integrated Galaxy Platform capabilities Data movement is streamlined with integrated Globus file- transfer functionality Resources can be provisioned on-demand with Amazon Web Services cloud based infrastructure Globus Genomics

Professionally managed and supported platform Best practice pipelines Enhanced workbench with breadth of analytic tools Technical support and bioinformatics consulting Access to pre-integrated end-points for reliable and high- performance data transfer (e.g. Broad Institute, Perkin Elmer, etc.) Cost-effective solution with subscription-based pricing Additional Capabilities

Globus Genomics – A flexible, scalable, simplified analysis platform Accessibility Unified Web-interface for obtaining genomic data and applying computational tools to analyze the data Easily integrate your own tools and scripts for analysis (CLI based tools) Collection of tools (Tools Panel) that reflect good practices and community insights Access every step of analysis and intermediate results:  View, Download, Visualize, Reuse (History Panel) Reproducibility Track provenance and ensure repeatability of each analysis step:  input datasets, tools used, parameter values, and output datasets Annotate each step or collection of steps to track and reproduce results Intuitive Workflow Editor to create or modify complex workflows and use them as templates – Reusable and Reproducible Transparency Publish and share metadata, histories, and workflows at multiple levels Store public and generated datasets as Data Libraries – e.g: hg19 Ref Genome Shared datasets and workflows can be imported by other users for reuse Publish Templates Data and Tools Globus Online Integration Access GO Endpoints and transfer data from within Galaxy UI and into Galaxy workspace Leverage local cluster or cloud based scalable computational resources for parallelizing the tools

Example Collaborations Dobyns Lab Backround: Investigate the nature and causes of a wide range of human developmental brain disorders Approach: Replaced manual analysis with Globus Genomics Results: Achieved greater than 10X speed-up in analysis of exome data Future Plans: Leverage scale-out capability of Globus Genomics by running increasingly larger data sets

XSEDE’s Mission Statement accelerat[ing] open scientific discovery by enhancing the productivity of researchers, engineers, and scholars and making advanced digital resources easier to use.” Relevance to XSEDE Key XSEDE Goals That Globus Genomics Addresses “Deepen and extend the impact of eScience infrastructure on research and education; in particular, to reach communities that have not previously made use of it; and Expand the environment through the integration of new capabilities and resources such as instruments and data repositories based on the identified needs of the community.”

Globus Genomics leverages an XSEDE service –Globus Transfer for data movement –Globus Nexus for identity management –Globus Groups for group-based access management Integrates advanced digital resources –sequencing centers, a commercial cloud provider, and NGS analysis pipelines Reduces the cost and complexity of scientific discovery for a new community (NGS researchers) who have not historically made much use of advanced eScience infrastructures. Relevance to XSEDE (Cont..)

Globus Genomics achieves these goals without making use of XSEDE supercomputers Choice to use Amazon cloud services rather than XSEDE systems for Globus Genomics computations is deliberate –scales at which our target users operate today, the costs associated with the use of Amazon cloud computers are modest, and Amazon’s on-demand, pay-as-you-go storage and computing capabilities match user needs better than the proposal- and queue-based access policies provided by XSEDE computers. We plan to explore using XSEDE resources to execute Globus Genomics pipelines XSEDE Vs AWS

This work was supported in part by the NIH through the NHLBI grant: The Cardiovascular Research Grid (R24HL085343) and by the U.S. Department of Energy under contract DE-AC02-06CH We are grateful to Amazon, Inc., for an award of Amazon Web Services time that facilitated early experiments. The Globus Genomics and Globus Online teams at University of Chicago and Argonne National Laboratory Acknowledgments

More information on Globus Genomics and to sign up: More information on Globus Online: Questions? Thank you! For more information