USF Health Informatics Institute (HII)

Presentation transcript:

USF Health Informatics Institute (HII) Data Infrastructure

Data Management: TEDDY 'omics data
- Quantity of data: ~550 TB
- Diversity of data sources: 10 labs, >28 analytes
- Number of analytical partners: 9 EAP groups, 53 HPC users, 80 data sharing platform users
- Number of data releases: >70

Data Management
Total raw data storage as of APR 2018: 2.0 PB
Total expected Case-Control data: ~550 TB, broken down by assay:
- Dietary Biomarkers: 4.1 MB
- Exome: 100 GB
- Gene Expression: 12 to 14 TB
- Metabolomics: 16 to 24 TB
- Microbiome & Metagenomics: 86 TB
- Proteomics: 2 to 3 TB
- RNA Sequencing: 150 TB
- SNPs: 60 GB
- Whole genome sequencing: 250 TB
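As a quick sanity check, the per-assay estimates above can be summed and compared against the ~550 TB Case-Control total; a minimal sketch, with the figures taken directly from the list above:

```python
# Sum the per-assay Case-Control estimates (in TB) and compare
# against the ~550 TB total quoted above.
assays_tb = {
    "Dietary Biomarkers": (4.1e-6, 4.1e-6),      # 4.1 MB
    "Exome": (0.1, 0.1),                         # 100 GB
    "Gene Expression": (12, 14),
    "Metabolomics": (16, 24),
    "Microbiome & Metagenomics": (86, 86),
    "Proteomics": (2, 3),
    "RNA Sequencing": (150, 150),
    "SNPs": (0.06, 0.06),                        # 60 GB
    "Whole genome sequencing": (250, 250),
}

low = sum(lo for lo, _ in assays_tb.values())
high = sum(hi for _, hi in assays_tb.values())
print(f"Expected Case-Control total: {low:.1f} to {high:.1f} TB")
# Prints roughly 516.2 to 527.2 TB, consistent with the ~550 TB estimate.
```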

Data Infrastructure: hdExchange
Objective:
- Comprehensively store, manage, and share HII Big Data assets in support of 'omics analysis
- Allow analytical partners to bring their analyses to the data
Components:
- Data Infrastructure: Clinical Data Warehouse, Big Data Repository, Controlled Vocabulary Repository, Laboratory Transfer CLI, Data Exchange API
- Analytical Infrastructure: High Performance Computing (HPC) Cluster, Analysis Software Library

Data Infrastructure (architecture diagram)

Data Infrastructure: hdExchange API
- Primary mechanism for programmatically accessing TEDDY clinical and 'omics data from the HPC environment
- Hides the complexities of backend data management, providing a single point of contact for straightforward access to data assets
https://exchange.hiidata.org/documentation.htm

Data Infrastructure: hdExchange API
Specifications:
- RESTful Web API following W3C standards for REST architecture
- Token-based API authentication
- Synchronous delivery of tabular data (clinical metadata and data dictionaries)
- Asynchronous processing of data file requests (background process and message queue for scalability and Big Data)
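To illustrate the access pattern these specifications describe (token-based authentication, synchronous tabular responses, queued asynchronous file requests), here is a minimal Python sketch. The endpoint paths, header scheme, request parameters, and response fields are hypothetical placeholders; the real interface is defined in the documentation linked above.

```python
import time
import requests

BASE = "https://exchange.hiidata.org"                 # base URL from the slide
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}  # token scheme is an assumption

# Synchronous: tabular clinical metadata is returned in the response body.
# The /metadata/subjects path is a hypothetical placeholder.
resp = requests.get(f"{BASE}/metadata/subjects", headers=HEADERS)
resp.raise_for_status()
subjects = resp.json()

# Asynchronous: big data file requests are queued for a background
# process, so the client polls for completion. The /files/requests path
# and the "id", "status", and "download_url" fields are hypothetical.
job = requests.post(f"{BASE}/files/requests", headers=HEADERS,
                    json={"dataset": "rna_seq", "release": "latest"}).json()

while True:
    status = requests.get(f"{BASE}/files/requests/{job['id']}",
                          headers=HEADERS).json()
    if status["status"] in ("complete", "failed"):
        break
    time.sleep(30)  # queue-backed jobs on large files can take a while

if status["status"] == "complete":
    print("Download from:", status["download_url"])
```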

Data Infrastructure: Lab Transfer CLI
hdExchange Laboratory Transfer CLI:
- Command line interface (CLI) utility for transferring 'omics files from the laboratory to the data repository
- Manages credentials and transfer mechanism interaction
- Supports implementation of custom rules and validation logic for file details and structure
- Automates checksum generation/verification, indexing, cataloging, and notification
- Includes end-to-end logging of transfer process activity
https://exchange.hiidata.org/download/hdxlabcli.0.0.4.tar.gz
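The checksum generation/verification step the CLI automates amounts to hashing each file in a streaming fashion and comparing local and repository-side values. A minimal sketch of that idea (not the CLI's actual code; the hash algorithm and function names are assumptions):

```python
import hashlib

def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so multi-GB 'omics files
    never need to fit in memory."""
    digest = hashlib.sha256()  # algorithm choice is an assumption
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_transfer(path: str, remote_checksum: str) -> bool:
    """Compare the local checksum with the value reported by the
    repository after transfer (how that value is retrieved is
    CLI-specific and not shown here)."""
    return file_checksum(path) == remote_checksum
```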

Data Infrastructure: Lab Transfer CLI
Specifications:
- Developed in Python
- Wraps hdExchange Web API requests
- Leverages the hdExchange standard security interface
- Utilizes several transfer mechanisms, including sFTP and Expedat, and is designed to be extensible to others
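Given that the CLI is Python and is described as extensible to new transfer mechanisms, one plausible shape for that design is a small backend interface with one class per mechanism. The sketch below is an assumption about the structure, not the tool's actual code; it uses the third-party paramiko library for the SFTP case, which may not be the CLI's real dependency.

```python
from abc import ABC, abstractmethod

import paramiko  # widely used SSH/SFTP library; an assumption here

class TransferMechanism(ABC):
    """Hypothetical plug-in interface: each backend (sFTP, Expedat,
    or a future mechanism) implements the same upload contract."""

    @abstractmethod
    def upload(self, local_path: str, remote_path: str) -> None: ...

class SftpTransfer(TransferMechanism):
    def __init__(self, host: str, username: str, key_path: str):
        self.host, self.username, self.key_path = host, username, key_path

    def upload(self, local_path: str, remote_path: str) -> None:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(self.host, username=self.username,
                       key_filename=self.key_path)
        try:
            sftp = client.open_sftp()
            sftp.put(local_path, remote_path)  # verifies file size by default
        finally:
            client.close()

def get_mechanism(name: str, **config) -> TransferMechanism:
    """Adding a new mechanism means adding one class and one registry
    entry, which is what 'extensible to others' suggests."""
    backends = {"sftp": SftpTransfer}
    return backends[name](**config)
```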

Data Infrastructure: hdExplore
Data Sharing Platform: web application providing an interactive user interface for accessing TEDDY clinical metadata and associated documentation, along with a suite of data manipulation and visualization tools
https://explore.teddystudy.org/project/TEDDY/begin.view?

Data Infrastructure: hdExplore (screenshot)

Analytical Infrastructure: HPC Cluster
High Performance Computing (HPC) Cluster:
- Advanced computing resource comprising hundreds of compute nodes, thousands of cores, and over a dozen TB of memory
- Allows the remote, high-throughput, parallelized processing required by the complex and custom analytical software pipelines used to analyze 'omics Big Data
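The per-sample fan-out this kind of parallelized processing implies looks roughly like the sketch below. The pipeline function and file layout are hypothetical, and on the real cluster work would normally be dispatched across nodes through a job scheduler rather than a single-node worker pool.

```python
from multiprocessing import Pool
from pathlib import Path

def process_sample(path: Path) -> str:
    """Hypothetical per-sample pipeline stage (QC, alignment, etc.)."""
    return f"processed {path.name}"

if __name__ == "__main__":
    # Hypothetical layout: one compressed FASTQ file per sample.
    samples = sorted(Path("samples").glob("*.fastq.gz"))
    with Pool() as pool:  # defaults to one worker per available core
        for result in pool.imap_unordered(process_sample, samples):
            print(result)
```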

Analytical Infrastructure: HPC Cluster Hardware
The HPC platform consists of two clusters:
- HII (hii): 90+ nodes with ~1,600 cores / 8 TB memory
- RC (circe): 400+ nodes with ~5,000 cores / 12 TB memory
Nodes in the clusters are upgraded and expanded on a continual basis. Specs of the latest 40 nodes provisioned:
- Processor: 20-core E5-2650 v3 @ 2.30 GHz (Haswell microarchitecture)
- Memory: 128 GB @ 2133 MHz
- MPI/storage interconnect: QDR InfiniBand @ 32 Gb/s
All nodes have access to the following storage arrays:
- 1.7 PB DDN GPFS (home directories, group shares, and scratch)
- 300 TB Oracle ZFS (genetic sequence files and curated results)
https://usf-hii.github.io/pages/hii-hpc.html
