ELIXIR: a sustainable infrastructure for biological information in Europe Workshop on the future of Big Data Management The Blackett Laboratory, Imperial.

Slides:



Advertisements
Similar presentations
Delivery of industrial-strength Grid middleware: Establishing an effective European approach Professor Yike Guo Imperial College London, UK & InforSense.
Advertisements

The benefits of using cloud computing for Stem Cell Imaging Nick Trigg CEO, Constellation 24 th June 2009.
Identity management – life sciences perspective Ugis Sarkans European Bioinformatics Institute.
Wrapup. NHGRI strategic plan What does the NIH think genomics should be for the next 10 years? [Nature, Feb. 2011]
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
DEVELOPMENT OF A EUROPEAN NETWORK OF LIBRARIES Hans Geleijnse Director of Library and IT Services & CIO Tilburg University, The Netherlands.
Steven Newhouse, Head of Technical Services Virtualisation and Cloud Computing at EBI.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
EMBL-EBI and Bioinformatics Steven Newhouse, Head of Technical Services, EMBL-EBI.
Research and Innovation Research and Innovation Research and Innovation Research and Innovation Research Infrastructures and Horizon 2020 The EU Framework.
Welcome to EMBL-EBI Dr Laura Emery. Before we start… Stand up How experienced are you in bioinformatics? Get to know each other by arranging yourselves.
Small Molecules EBI Bioinformatics Roadshow Gareth Owen, ChEBI group
From T. MADHAVAN, & K.Chandrasekaran Lecturers in Zoology.. EXIT.
Bioinformatics Jan Taylor. A bit about me Biochemistry and Molecular Biology Computer Science, Computational Biology Multivariate statistics Machine learning.
1 Building National Cyberinfrastructure Alan Blatecky Office of Cyberinfrastructure EPSCoR Meeting May 21,
1. 2 ISMB/ECCB 2004  31 July – 5 August 2004  SIGs 29 & 30 July 2004  Glasgow, Scotland,UK  Scottish Exhibition and Conference Centre.
European Life Sciences Infrastructure for Biological Information ELIXIR: Safeguarding the results of life sciences research in Europe.
1 Large-scale Data Processing Challenges David Wallom.
Steven Newhouse, Head of Technical Services European Bioinformatics Institute: ICT Challenges.
Bioinformatics and it’s methods Prepared by: Petro Rogutskyi
The Preparatory Phase Proposal a first draft to be discussed.
ISBE An infrastructure for European (systems) biology Martijn J. Moné Seqahead meeting “ICT needs and challenges for Big Data in the Life Sciences” Pula,
European Life Sciences Infrastructure for Biological Information ELIXIR
EGI-Engage EGI-Engage Engaging the EGI Community towards an Open Science Commons Project Overview 9/14/2015 EGI-Engage: a project.
Learning and exploring Life science through the EBI reosurces and tools BIOQUEST workshop_2011 Vicky Schneider, EMBL-EBI Training Programme Project leader.
1 Common Challenges Across Scientific Disciplines Laurence Field CERN 18 th November 2013.
Network Services for Biologists in the Genome Era The Work of the European Bioinformatics Institute.
ELIXIR UK - Industry Engagement sector Gabriella Rustici School of Biological Sciences.
Transforming Community Services Commissioning Information for Community Services Stakeholder Workshop 14 October 2009 Coleen Milligan – Project Manager.
ESTELA Summer Workshop, 26 June 2013 The EU-SOLARIS project.
EMBL-EBI EMBL-EBI EMBL-EBI What is the EBI's particular niche? Provides Core Biomolecular Resources in Europe –Nucleotide; genome, protein sequences,
ASCAC-BERAC Joint Panel on Accelerating Progress Toward GTL Goals Some concerns that were expressed by ASCAC members.
Bioinformatics Core Facility Guglielmo Roma January 2011.
1 Computing Challenges for the Square Kilometre Array Mathai Joseph & Harrick Vin Tata Research Development & Design Centre Pune, India CHEP Mumbai 16.
ESFRI & e-Infrastructure Collaborations, EGEE’09 Krzysztof Wrona September 21 st, 2009 European XFEL.
A Data Centre for Science and Industry Roadmap. INNOVATION NETWORKING DATA PROCESSING DATA REPOSITORY.
An approach to carry out research and teaching in Bioinformatics in remote areas Alok Bhattacharya Centre for Computational Biology & Bioinformatics JAWAHARLAL.
B i o i n f o r m a t i c s / B i o m e d i c a l A p p l i c a t i o n s i n E E L A Mexico, D.F., october 22 – 26, e – s c i e n c e M e x i c.
Learning and exploring Life science through the EBI reosurces and tools BIOQUEST workshop_2011 Vicky Schneider, EMBL-EBI Training Programme Project leader.
European Life Sciences Infrastructure for Biological Information ELIXIR and Identity Management 2 nd Workshop on Federated Identity.
EMBL-EBI Data Archives – An Overview. The EMBL-EBI mission Provide freely available data and bioinformatics services to all facets of the scientific community.
1 The European Open Science Cloud: Open Day Event EMBL, Heidelberg, 20 January 2016 Joint Research Centre (JRC) The European Commission’s in-house science.
Describing Bioinformatic Metadata at EBI James Malone
A European Open Science Cloud
INFSO-RI Enabling Grids for E-sciencE The EGEE Project Owen Appleton EGEE Dissemination Officer CERN, Switzerland Danish Grid Forum.
An Introduction to NCBI & BLAST National Center for Biotechnology Information Richard Johnston Pasadena City College.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI strategy and Grand Vision Ludek Matyska EGI Council Chair EGI InSPIRE.
Cultural Heritage in Tomorrow ’s Knowledge Society Cultural Heritage in Tomorrow ’s Knowledge Society Claude Poliart Project Officer Cultural Heritage.
For EGI/EUDAT EMBL/ELIXIR use-cases Tony Wildish
European Life Sciences Infrastructure for Biological Information Safeguarding the results of life science research in Europe Niklas.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI Business Engagement Program for SMEs Javier Jiménez Business Development.
European Life Sciences Infrastructure for Biological Information EGI 2015, Lisbon, 18 May 2015 Rafael C Jimenez, ELIXIR CTO ELIXIR.
Distributed Computing Infrastructures for e-Science: Future Perspectives EGI Technical Forum 2012 The Clarion Congress Hotel, Freyova 945/33, Prague 18.
SciencePAD Incubation Laboratory Alberto Di Meglio – CERN.
Barbara Alessandrini. Istituto Zooprofilattico Sperimentale dell’Abruzzo e del Molise is a public health institution engaged in research. Our mission.
European Life Sciences Infrastructure for Biological Information European Life Sciences Infrastructure for Biological Information.
EBI is an Outstation of the European Molecular Biology Laboratory. Rodrigo Lopez Head of EMBL-EBI/ES Andrew Lyall ELIXIR PM. ELIXIR and the integration.
EGI-InSPIRE RI EGI Compute and Data Services for Open Access in H2020 Tiziana Ferrari Technical Director, EGI.eu
Rafael Jimenez ELIXIR CTO BioMedBridges Life science requirements from e-infrastructure: initial results from a joint BioMedBridges workshop Stephanie.
For EGI/EUDAT EMBL/ELIXIR use-cases Tony Wildish
EGI-InSPIRE RI An Introduction to European Grid Infrastructure (EGI) March An Introduction to the European Grid Infrastructure.
ICT22 – 2016: Technologies for Learning and Skills ICT24 – 2016: Gaming and gamification Francesca Borrelli DG CONNECT, European Commission BRUXELLES.
ELIXIR Core Data Resources and Deposition Databases
EMBL’s European Bioinformatics Institute
ELIXIR: Potential areas for collaboration with e-Infrastructures
ELIXIR: Authentication and Authorization Infrastructure Requirements
ELIXIR Safeguarding the results of life science research in Europe
3rd Annual Forum for SMEs: Meeting Overview
Florian Gräf Software Developer of the McEntyre group at EMBL-EBI
EGI Webinar - Introduction -
Presentation transcript:

ELIXIR: a sustainable infrastructure for biological information in Europe Workshop on the future of Big Data Management The Blackett Laboratory, Imperial College, London June 27 & 28, 2013 Andrew Lyall PhD, ELIXIR Project Manager

European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory International organisation created by treaty (cf CERN, ESA) 20 year history of service provision and scientific excellence EMBL-EBI has 500+ Staff, €50 Million Budget, at least a million users, 20 petabytes of data, 10,000 cpus Data are doubling in less than a year Bandwidth between disk and memory is at least as big an issue as obtaining sufficient CPU-cycles Disk space t 2

EMBL-EBI Mission Statement 3 To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress To contribute to the advancement of biology through basic investigator-driven research in bioinformatics To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators To help disseminate cutting-edge technologies to industry

Human Genome project 1989 – 2000 – sequencing the human genome -Just 1 “individual” – actually a mosaic of about 24 individuals but as if it was one -Old school technologies -A bit epic Now -Same data volume generated in ~3mins in a current large scale centre -It’s all about the analysis 4

Cost of data generation decreasing exponentially 5

Rate of data generation is increasing 6

EBI’s technical infrastructure 20 PB of “raw” disk -Big archives on two systems, no tape backup (analysis is recovery would be very hard; disaster recovery by institutional replication in US) ~20,000 cores in 2 major farms A Vmware Cloud (“Embassy Cloud”) allowing remote users to directly mount large datasets (in pilot mode) 4 machine rooms; 2 in London, 2 in Cambridge Janet uplink at 10 Gb/sec 7

Machine room architecture Internet 8

Biology already has a data infrastructure For the human genome -(…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl For the function of genes and proteins -For all genes, in text and computational – UniProt and GO For all 3D structures -To understand how proteins work – PDBe For where things are expressed -The differences and functionality of cells - Atlas 9

And, it is growing fast … We have to scale across all of (interesting) life -There are a lot of species out there! We have to handle new areas, in particular medicine -A set of European haplotypes for good imputation -A set of actionable variants in germline and cancers We have to improve our chemical understanding -Of biological chemicals -Of chemicals which interfere with Biology 10

Core data collections at EMBL-EBI 11 Genomes & Genes 1.Ensembl: Joint project with Sanger Institute - high-quality annotation of vertebrate genomes 2.Ensembl Genomes: Environment for genome data from other taxons Genomes: Catalogue of human variation from major World populations 4.EGA*: European Genotype Archive* – genotype, phenotype and sequences from individual subjects and controls 5.ENA: European Nucleotide Archive – all DNA & RNA, nextgen reads and traces Transcription 6.ArrayExpress: Archive of transcriptomics and other functional genomics data 7.Expression Atlas: Differentially expressed genes in tissues, cells, disease states & treatments Protein 8.UniProt: Archive of protein sequences and functional annotation 9.InterPro: Integrated resource for protein families, motifs and domains 10.PRIDE: Public data repository for proteomics data 11.PDBe: Protein and other macromolecular structure and function Small molecules 12.ChEBI: Chemical entities of biological interest 13.ChEMBL: Bioactive compounds, drugs and drug-like molecules, properties and activities Processes 14.IntAct: Public repository for molecular interaction data 15.Reactome: Biochemical pathways and reactions in human biology 16.Biomodels: Mathematical models of cellular processes Ontologies 17.GO: Gene Ontology, consistent descriptions of gene products Scientific literature 18.CiteXplor: Bibliographic query system

Universal, comprehensive, integrated 12 Life sciences Medicine Agriculture Pharmaceuticals Biotechnology Environment Bio-fuels Cosmaceuticals Neutraceuticals Consumer products Personal genomes Etc…

Biology is a big data science Not perhaps as big as high energy physics, but “just” one order of magnitude less (~35 PB plays ~300PB) Hetreogenity/diversity of data far larger -Lots and lots of details -Always have “dirty” data even in best case scenarios Big Data => scaleable algorithms -CS ‘stringology’ – eg, Burrows-Wheeler transform Biological inference => statistics -“just” need to collapse a 3e9 dimensional problem into something more tractable We are often I/O not CPU bound 13

Big Data Example: Personalised medicine Personalised medicine will require sequencing of the genomes of large numbers of patients and volunteers It will be necessary to compare at least some of these genomes with the reference data collections Most hospitals and clinical research institutes will not wish to maintain up-to-date copies of the reference data collections It will be therefore be necessary to send these genomes to the institutes that hold the reference data collections It seems likely that this will be achieved using secure VMs and secure clouds holding the reference data collections EMBL-EBI is engaging with stakeholders to evaluate opportunities in this area. 14

Collaborator “Embassy” Clouds Pharmaceutical companies put significant effort into creating secure “EBI-like” services on their own infrastructure Many other users with high computational requirement do not wish to recreate our infrastructure on their own site A secure cloud environment providing “Cloud-Embassies” at EMBL- EBI would obviate this Embassy owners would have complete control over their virtual infrastructure Embassy owners could bring their own data and software to compute against EMBL-EBIs data and services Such services would be managed with legally acceptable collaboration agreements. 15

The Helix-Nebula Science-Cloud Three members of EIROforum (CERN, EMBL & ESA) Thirteen European IT providers (more are joining) A pan-European partnership of academia and industry to create cloud solutions and foster innovation in science Stimulate the creation of a cloud computing market in Europe (cf USA) Two year pilot phase after which it will be made more widely available to commercial and public domain EMBL will use it for the analysis of large genomes 16

Many different modes of data generation & utilisation Internet Thousands of data producers Millions of data users A few international collaborators A few very large data producers gigabytes megabytes petabytes terabytes A significant number of large data users terabytes 17

GenomicsHigh Energy Physics Astronomy Data distribution patterns are discipline dependent 18

Fully Centralised Fully Distributed How to coordinate biological data provision? Pros: Easier to benefit from re-use Easier to standardise Cons: Impossible to centralise all necessary expertise Risk of bottlenecks Lack of diversity Pros: More responsive Better support for Human languages Cons: Communication & coordination overheads Harder to support end users Harder to provide multi-decade stability 19

Robustly connected nodes coordinated by a strong hub We need the best of both models… Diversity of Nodes: Domain Specific Regional National Training Etc… Hub Node 20

ELIXIR: A sustainable infrastructure for biological information in Europe… medicine bioindustries society ESFRI BMS RI Preparatory Phase coordinated by EMBL-EBI Currently entering Construction Phase Interim board is meeting regularly New data centres commissioned in London Technical hub building construction is under way Fifteen countries have joined already Many more have expressed interest Over €100 Million already invested More than 60 proposals for nodes environment 21

ELIXIR: Distributed nodes with a hub 22

23

Conclusions Data management is becoming a significance challenge in biology: size, complexity, ELSI… High-throughput data-generators and users will be situated all round Europe The environment will be very heterogeneous with complex data and many different modalities of use Organising I/O from disk to memory is as big a challenge as obtaining sufficient CPU-cycles 24

Biology: The big challenges of big data Vivien Marx Nature, Volume: 498, Pages: 255–260, Published:13 June 2013 Biologists are joining the big-data club. With the advent of high-throughput genomics, life scientists are starting to grapple with massive data sets, encountering challenges with handling, processing and moving information that were once the domain of astronomers and high-energy physicists. (& thank you for your attention) 25