Peter Clapham Informatics Support Group. About the Institute ● Funded by Wellcome Trust. ● 2 nd largest research charity in the world. ● ~700 employees.

Presentation transcript:

Peter Clapham Informatics Support Group

About the Institute
● Funded by Wellcome Trust.
● 2nd largest research charity in the world.
● ~700 employees.
● Large scale genomic research.
● Sequenced 1/3 of the human genome (largest single contributor).
● We have active cancer, malaria, pathogen and genomic variation studies.
● All data is made publicly available.
● Websites, ftp, direct database access, programmatic APIs.

The Sanger Institute: a little background
● Founded 1992 as a UK sequencing centre with an initial 5 year plan to sequence 2 yeast, the nematode worm and 1/6th of the human genome.
● First draft of human genome. Sanger upped contribution to 1/3.
● 1997: yeast genome completed.
● 2003: first mouse genome draft; malarial parasite sequence completed.
● 2005: WTGCCC established.
● 2008: start of 1000 genome project.
● 2010: completion of 1000 genomes; start of UK10K study.

Sequence till 2011

5 Research Programmes

Beginnings
Sanger started with a single zone to accept bam and bai files produced from the central sequencing pipeline. This is THE starting point for all our user groups who make use of locally produced sequence data, so the service needs to be:
● Solid at its core. 2 am support calls are bad(tm).
● Vendor agnostic.
● Sensibly maintainable.
● Scalable in terms of capacity, while remaining relatively performant.
● Extensible.

iRODS layout
● Data lands by preference onto iRES servers in the green datacenter room.
● Data is then replicated to the red room datacenter via a resource group rule, with checksums added along the way.
● Both iRES servers are used for read-only access, and replication works in either direction if bad stuff happens.
● Various data and metadata integrity checks are made.
● Simple, scalable and reliable (so far).
Diagram labels: Oracle RAC cluster; iRODS server; iRES servers; SAN-attached LUNs from various vendors.
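The replicate-then-verify idea on this slide can be illustrated with a minimal Python sketch. It is not the iRODS resource group rule itself: it uses plain local directories to stand in for the two datacenter rooms, and the function names `md5sum` and `replicate_with_checksum` are hypothetical, chosen only for this illustration.

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_with_checksum(source: Path, replica_dir: Path) -> str:
    """Copy source into replica_dir and confirm both copies have the same MD5.

    Assumption: local directories stand in for the two storage resources.
    """
    expected = md5sum(source)
    replica_dir.mkdir(parents=True, exist_ok=True)
    replica = replica_dir / source.name
    shutil.copy2(source, replica)
    actual = md5sum(replica)
    if actual != expected:
        # Drop the bad replica so it is never served to readers.
        replica.unlink()
        raise IOError(f"checksum mismatch for {replica}: {actual} != {expected}")
    return expected
```

Returning the checksum lets a caller record it alongside the file, much as the slides list `md5` among the metadata attributes.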

Metadata Rich
● Users query and access data largely from local compute clusters.
● Users access iRODS locally via the CLI.
Example attribute fields:
attribute: library
attribute: total_reads
attribute: type
attribute: lane
attribute: is_paired_read
attribute: study_accession_number
attribute: library_id
attribute: sample_accession_number
attribute: sample_public_name
attribute: manual_qc
attribute: tag
attribute: sample_common_name
attribute: md5
attribute: tag_index
attribute: study_title
attribute: study_id
attribute: reference
attribute: sample
attribute: target
attribute: sample_id
attribute: id_run
attribute: study
attribute: alignment
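To show how attribute fields like these let users find files, here is a small in-memory Python sketch of attribute-value matching. It does not talk to iRODS: the catalogue, paths and values are invented for illustration, the `query` function is hypothetical, and each attribute is simplified to a single value per file.

```python
from typing import Dict, List

# Hypothetical catalogue: logical path -> attribute/value pairs.
# Attribute names come from the slide; paths and values are made up.
EXAMPLE_CATALOGUE: Dict[str, Dict[str, str]] = {
    "/seq/1234/1234_1.bam": {"id_run": "1234", "lane": "1", "type": "bam", "manual_qc": "1"},
    "/seq/1234/1234_2.bam": {"id_run": "1234", "lane": "2", "type": "bam", "manual_qc": "0"},
    "/seq/5678/5678_1.bam": {"id_run": "5678", "lane": "1", "type": "bam", "manual_qc": "1"},
}


def query(catalogue: Dict[str, Dict[str, str]], **conditions: str) -> List[str]:
    """Return the sorted paths whose metadata matches every attribute=value condition."""
    return sorted(
        path
        for path, avus in catalogue.items()
        if all(avus.get(attribute) == value for attribute, value in conditions.items())
    )
```

For example, `query(EXAMPLE_CATALOGUE, id_run="1234", manual_qc="1")` selects only the lane from run 1234 that passed manual QC in this made-up catalogue.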

Sysadmin Perspective
● Keep It Simple works, as reflected by very limited downtime aside from upgrades.
● The core has remained nicely solid.
● Upgrades can be twitchy (2.4 → over the past few years has not been without surprises...).
● Some queries need some optimisation. Fortunately we have some very helpful DBAs.

End User Perspective
● Users are particularly happy with the metadata rich environment. Now they can find their files and gain access in a reliable fashion.
● So far so good. Satisfied users.
● So happy that they've requested iRODS areas for their own specific purposes.

Federating Zones
● Top level zone (sanger) acts as a Kerberos enabled portal.
● Users log in here and receive a consistent view of the world.
● Allows separation of impact between user groups: zone server load and different access control requirements.
● Clear separation as groups consider implementing their own rules within their zone.
● Each zone has its own group oversight, which is responsible for managing its disk utilisation.
● Separation reduces horse trading and makes the process much less involved...

Sanger Zone Arrangement
Diagram labels: /seq, /uk10k/humgen, /Archive, Sanger
● 1 portal zone (provides Kerberised access).
● Federation using head zone accounts.
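Because an iRODS logical path begins with its zone name, a client that sees one federated view can tell which zone owns a given object from the path alone. The Python sketch below illustrates that; the helper names `zone_of` and `group_by_zone` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def zone_of(logical_path: str) -> str:
    """Return the zone name, i.e. the first component of an absolute logical path."""
    parts = PurePosixPath(logical_path).parts
    if len(parts) < 2 or parts[0] != "/":
        raise ValueError(f"not an absolute logical path with a zone: {logical_path!r}")
    return parts[1]


def group_by_zone(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group logical paths by owning zone, e.g. to report usage per zone."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        grouped[zone_of(path)].append(path)
    return dict(grouped)
```

Grouping like this is one way each zone's oversight group could see only its own share of the data.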

Pipeline Team Perspective
In general stuff is fine, BUT some particular pain points have been found.
The good news is that some have been addressed:
● Improved client icommand exit codes (svn 3.3 tree).
● The ability to now create groups and populate them as an igroupadmin.
Other pain points:
● Data entry into iRODS is not atomic.
● No re-use of connections.
● Local use of JSON formatting, which is not natively supported by iRODS clients.
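The non-atomic data entry point can be made concrete with a short Python sketch of one possible workaround: upload first, then attach metadata, and remove the upload if any metadata step fails. This is an assumption about how a pipeline might cope, not something the slides describe. The `put`, `add_avu` and `remove` callables are hypothetical hooks that a caller would wire to whatever client is in use.

```python
from typing import Callable, Dict


def load_with_rollback(
    path: str,
    avus: Dict[str, str],
    put: Callable[[str], None],
    add_avu: Callable[[str, str, str], None],
    remove: Callable[[str], None],
) -> None:
    """Upload a file, then its metadata; undo the upload if metadata fails.

    This only approximates atomicity: a crash between steps can still
    leave a file without metadata, so periodic integrity checks remain useful.
    """
    put(path)
    try:
        for attribute, value in avus.items():
            add_avu(path, attribute, value)
    except Exception:
        # Best-effort cleanup so users never find a file lacking metadata.
        remove(path)
        raise
```

Passing the operations in as callables keeps the sketch independent of any particular client and makes the failure path easy to exercise with fakes.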

But iRODS is Extensible
● Java API
● Python API
● C API

Baton
Thin layer over parts of the iRODS C API
● JSON support
● Connection friendly
● Comprehensive logging
● autoconf build on Linux and OSX
Current state
● Metadata listing
● Metadata queries
● Metadata addition
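As a rough illustration of why JSON suits a pipeline, the Python sketch below serialises a data object and its metadata to a JSON document and reads the metadata back. The field names (`collection`, `data_object`, `avus`, `attribute`, `value`) are assumptions made for this example and are not confirmed by the slides as Baton's actual format. The function names are also hypothetical.

```python
import json
from typing import Dict


def describe_data_object(collection: str, data_object: str, avus: Dict[str, str]) -> str:
    """Serialise a data object and its attribute/value pairs as one JSON document."""
    document = {
        "collection": collection,
        "data_object": data_object,
        # Sorted so the output is stable and easy to diff in logs.
        "avus": [
            {"attribute": attribute, "value": value}
            for attribute, value in sorted(avus.items())
        ],
    }
    return json.dumps(document, sort_keys=True)


def read_avus(text: str) -> Dict[str, str]:
    """Recover the attribute/value pairs from a document made by describe_data_object."""
    document = json.loads(text)
    return {avu["attribute"]: avu["value"] for avu in document["avus"]}
```

One self-describing document per object means pipeline stages can pass results along without parsing free-form command-line output.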