Presentation is loading. Please wait.

Presentation is loading. Please wait.

APPLICATIONS OF GRIDS TO BIOMEDICINE Ignacio Blanquer Vicente Hernández.

Similar presentations


Presentation on theme: "APPLICATIONS OF GRIDS TO BIOMEDICINE Ignacio Blanquer Vicente Hernández."— Presentation transcript:

1 APPLICATIONS OF GRIDS TO BIOMEDICINE Ignacio Blanquer Vicente Hernández

2 OBJECTIVES Describe some challenges faced in the adaptation of biomedical applications to Grids and the proposed solutions. Discuss on open research lines and needs from the area of Healthgrids. Share requirements that are common to many other research areas. Introduction and motivation

3 CONTENIDOS Introduction and motivation. Two significant areas – Medical Imaging. – Bioinformatics. Conclusions. 3

4 THE GRID AND HIGH PERFORMANCE COMPUTING RESEARCH GROUP A Research Group on the area of ICT and Computational Sciences, created in 1986 by Vicente Hernández and currently comprising 29 members (http://www.grycap.upv.es).http://www.grycap.upv.es The group started its activity in the area of basic research in Numeric Computing. This research was evolving to the area of Parallel and Distributed Computing. Further, the group focus on Grid Technologies. Currently, the main target of the research activity of the group is e- Science. Introduction and motivation e-Science Grid Technologies Parallel Computing Distributed Computing Numeric Computing Middleware e-Infrastructur. Fotonics Medical Imaging e-Gov. Bioinf. Enginee ring Proteo mics Epidem.

5 THE GRID AND HIGH PERFORMANCE COMPUTING RESEARCH GROUP The GRyCAP has participated in 19 European Projects, acting as a coordinator in 10 of them (7 of them in the biomedical sector). The GRyCAP is integrated in the Institute for Applications of Advanced Information and Communication Technologies (ITACA) and in the Biomedical Research Network Centre (CRIB). Vicente Hernández is currently theScientific Coordinator of the Spanish Network for e-Science. Introduction and motivation

6 WHAT IS E-SCIENCE?

7 E-SCIENCE AS “ENHANCED” SCIENCE "e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.“ "e-Science will change the dynamic of the way science is undertaken.“ Introduction and motivation John Taylor Director General of Research Councils Office of Science and Technology

8 E-SCIENCE AS A NEW VIEW OF SCIENCE Large challenges with large social impact The famous “Data deluge” Virtual laboratories as alternative to experimental ones. Leading role of simulation Interdisciplinarity. Virtual Research Communities. 8 Introduction and motivation networking grids instrumentation computing data curation… Technology push value added of distributed collaborative research (virtual communities) Application pull Source: Mario Campolargo, Former Head of Unit on e-Infrastructures, European Commission

9 … AND WHAT IS BIOMEDICINE?

10 THE BIOMEDICAL RESEARCH COMMUNITY Biomedicine integrates many different disciplines related with health, life sciences and biochemistry. It comprises the storage, management and processing of data related with the physiology and structure of living beings. So the Biomed Community is wide, heterogeneous and has many different challenges. 10

11 TECHNOLOGICAL NEEDS IN BIOMEDICINE Introduction and motivation Bioinformatics Biomedical Simulation HT / HPComputing HT / HPComputing Data storage and Integration Data storage and Integration CollaborativeWork CollaborativeWork Medical Imaging Epidemiology ServiceIntegration ServiceIntegration Authorizationand Security Authorizationand Security Grids & e-Science 200915-19/6/2009Santander

12 CHALLENGES IN BIOMEDICINE Increasing computing requirements Genetics, Genomics, Proteomics, Metagenomics, etc. Medical Imaging postprocessing. Biomedical simulation (VPH). Data storage and Integration Integrate distributed storage resources. Efficient storage for read-only bulk data. Authorization and Security Virtual Organizations. Encrypted Storage. Collaborative Work Natural way to develop distributed applications on common and distributed data. Introduction and motivation

13 HEALTHGRID ASSOCIATION International non-profit organisation http://wwww.healthgrid.org http://wwww.healthgrid.org – With the objective of contributing to structuring the European Research Area on Health. – To create partnerships to benefit high education and research centres in the field of Grid and distributed computing for health, in a broad sense and in an International context. Coordinators of the White Paper of Health Grids. http://www.healthgrid.org/download.php. http://www.healthgrid.org/download.php Organisers of a yearly conference – Last edition (7th) was in Paris in 2010. 13

14 SHARE PROJECT SHARE: “Supporting and structuring Healthgrid Activities & Research in Europe: Developing a Roadmap” An European Project (www.eu-share.org) aiming at developing a Roadmap forwww.eu-share.org Objectives, Ciurrent Situation, Deficiencies, Barriers, Opportunities, Milestones and stakeholders fro the adoption of Grids to Health. The Roadmap has covered the study of Technological, Legal. Ethical and Economic Issues. Available at www.healthgrid.org Introduction and motivation 14

15 DO WE HAVE TOOLS TO TACKLE THESE CHALLENGES?

16 E-INFRASTRUCTURES E-Infrastructures are supporting e-Science. Build-up on top of the scientific networks with the aim of sharing resources to enable sharing data, applications and scientific results. It is not only a matter of technology, and require the definition of policies and agreements. Introduction and motivation Linking at the speed of the light Sharing computers, instruments and applications Sharing and federating scientific data.......

17 EGEE Bioinformatics Activity, Dr. C. Blanchet (Resp.) 17 240 sites 45 countries 80,000 CPUs 5 PetaBytes >5000 users >100 VOs >100,000 jobs./day Arqueology Astronomy Astrophysics Civil Protection Computational Chemistry Earth Sciences Finances Fusion Geophysics High Energy Physics Life Sciences Biomedicine Multimedia Material Sciences … 32% Source: R. Jones, EGEE Dir.

18 THE BIOMED VO IN EGEE 253 Registered users. Authorised for and important share of the resources available (around 25%). 8 Million jobs and 25 million CPU hours since the starting of the accounting. Introduction and Motivation 18

19 E-INFRASTRUCTURES: ES-NGI National (Spanish) Grid Infrastructure Coordinated by the CSIC – IFCA (Isabel Campos). Evolving to become an ICTS. Currently comprising 5725 cores shared by 10 spanish centres. Supporting 7 thematic virtual organisations. In close cooperation with Portugal. Introduction and Motivation

20 MEDICAL IMAGING Significant Use Cases

21 CHALLENGES ON MEDICAL IMAGING Large content provider At the end of 2010 it will use the 30% of world global storage 1. Intensive computing processing. Data critical – Privacy Privacy management has been a major concern on highly distributed systems such as Grids 2. Evidence-base medicine Need for storing and organising knowledge. Need for computer aided workflows of post-processing. 1 Fact sheet: Information Storage Trends, IBM. 2 SHARE HealthGrid Roadmap 21 Medical Imaging use case

22 Challenges in Medical Imaging Medical Images are large and thus post-processing is computationally intensive, exceeding in many cases the resources of hospitals. Key information in medical images can be difficult to observe, even for trained specialists. Training is mainly based on evidence. The data produced yearly in a medium-sized hospital, is on the order of terabytes, so efficient organisation of the data for research is difficult. Data is stored distributed, but consolidated access is difficult or inexistent. Privacy is a key issue dealing with patient data, and even more with medical images. Grids & e-Science 2009 15-19/6/2009 Santander Medical Imaging use case

23 CIBERINFRAESTRUCTURA VALENCIANA PARA IMAGEN MÉDICA ONCOLÓGICA Platform developed in the frame of the project “Creación de una Ciberinfraestructura para el Aprendizaje, Investigación y Estudio Epidemiológico del Cáncer Mediante Imágenes Médicas”. Coordinated by the UPV with the participation of 5 Hospitals from the Comunidad Valenciana and British Telecom. The platform uses TRENCADIS, developed by the UPV, to organise medical data by virtual communities. It enables the creation and search of structured reports and associated images. 23 Spain Medical Imaging use case

24 Technical Design Principles TOWARDS A GRID ENVIRONMENT FOR PROCESSING AND SHARING DICOM OBJECTS (TRENCADIS) Open and Standard Software Architecture, based on the Web Service Resource Framework (WSRF) and implemented using Globus 4. Integrates different local storages of DICOM objects from several centres. Different storage resources are virtualized providing a common interface. It organizes data by communities. Indexing is performed locally and central services simply keep references to the storages where information relevant to a each community is stored.

25 TRENCADIS Middleware: Data Model Semantic Organisation – Users organise themselves on Virtual Communities. – From the studies available, only those matching the selection criteria of the Virtual Community profile are accessible. – From the images available to a Virtual Community, a user can create an experiment with the studies matching a set of restrictions. – From this experiment, more detailed views can be obtained. – The criteria for the selection of the relevant information relies on the DICOM tags of the images and the Structured Reports. View: (e.g. Patients Between 1 and 2 Years with hetrogeneous Supratentorial Findings) Experiment: (e.g Neuroblastoma) User Comunity (e.g. Paediatric Neuroimaging) Global Database: All the Images and Reports Shared 25 Medical Imaging use case

26 Applications Virtual Storage DICOM-SR Storage Processing Plug-in TRENCADIS SW ARCHITECTURE Server Services Informat. Server Ontology Server VOMS Server Keys Server Code Server Mw Components DICOM Storage. Grid Ontolog. Image Downl. Image Downl. Volume Rendering DCM Package CORE Mw DICOM Gateway SQL Gateway Execution Service Fabric DICOM Workstation SQL PACS Computing Farm Communic. https GridFtp SSL

27 Conventional Report vs. Structured Report Observation of an irregular espiculated mass of 7mm. of maximum diameter, Located on the inter- quadrant inferior line of the left breast. The lesion comprises heterogeneous, linear, grained and branched microcalcifications Apparently a malignant lesion Maglinancy – Based on: Mass – Size: 7mm. – Shape: Irregular – Margins: Espiculated Associated – Calcification » Tipus: Heterogeneous » Tipus: Branched Distribution: Grouped. 27 Free text. There is no structure defined and agreed and not associated to images. Structure and lexicon defined and agreed. Associated to images. Medical Imaging use case

28 Increase of Expressivity 28 Problem: Retrieve all the ID Reports from the Patients that have a Staging Value of II with an Heterogeneous Secondary Lesion. SELECT Patients.SRID FROM (Patients INNER JOIN Stages ON Patients.PID = Stages.PID) INNER JOIN Lesions ON (Stages.PID = Lesions.PID) AND (Stages.IDS = Lesions.IDS) WHERE ((Stages.IDS="StageII") AND (Lesions.IDL="2") AND (Lesions.CONTENT = “Heterogeneous")) Find /root/STAGE_II/Findings/Lesion_2 ‘like (CONTENT, ”Heterogeneous”)’ Medical Imaging use case

29 Fetching DICOM-SR Objects Medical Imaging use case

30 SECURITY MODEL Authorisation and Authentication – Users require X509 certificates and use proxies for a Single Sign-on. – Roles and groups in VOMS credentials are defining access permissions. Privacy – Every transaction uses Secure Protocols. – Data is encrypted in Grid stores to prevent being accessed by users with administrative privileges. – Encryption keys are fragmented and stored replicated in Key stores. Medical Imaging use case

31 PATIENT-BASED STORES Commercial Providers AccelaRAD : seemyradiology.comseemyradiology.com eMiX: www.eMix.comwww.eMix.com Candelis: www.candelis.comwww.candelis.com Medical Imaging use case 31

32 BIOINFORMATICS Significant Use Cases

33 CURRENT USERS OF E- INFRASTRUCTURES Used to work on the Internet GeneBank, PDB, SWISSPROT, KEGG, NCBI portal. Users of a wide set of common tools BLAST, ClustalW, ePCR, SAGE, EMBOSS, Phylip,... Open-Source and Open-Data community. Currently, it is a significant computing Intensive community >5-10% of the European HPC resources. >5% of the EGEE infrastructure (largest non LHC community). Bioinformatics Use Case 33

34 METAGENOMIC ANALYSIS INTEREST IS INCREASING Metagenomic analysis is the study of samples with the genomes of different living beings. The Genomic Standards Consortium has set-up a working group called the M5 (metagenomics, metadata, meta- analysis, multiscale models, and meta-infrastructure). It aims to improve data-sharing and interoperability in metagenomics studies. This imply standardisation of annotation and tools to enable setting up workflows on computational infrastructures at large. Bioinformatics Use Case 34

35 ONE RELEVANT AREA IS METAGENOMICS Bioinformatics Use Case in two years. At current database sizes all-versus-all com- parisons are already impossible without a supercomputer. Development of more efficient algorithms will help, but this will not solve the basic problem of too little comput- ing power. Individual access to supercomputers or cloud computing would help, at least temporarily. 35

36 COMPUTING NEEDS Annotation of large samples of metagenomes require a large amount of computational resources that exceeds the resources used in sequential approaches. Computing the homologous search using BLAST of the Metagenome of the Sargassos’ sea with respect to the Gene Bank Non Redundant Database involves: 811.372 Sequences of the Sargassos’ sea sample. 7.166.228 Sequences in the NCBI NR. 1.4 CPU*years. Around 60 Gbytes of secondary storage. However, this process is highly parallel, which could reduce significantly the response time required. Bioinformatics Use Case 36

37 DESIGN OF EXPERIMENTS Selection of resources. Replication of reference databases close to the computing elements to be used. Execution of a single job per block. Collection of results. Bioinformatics Use Case nr

38 DATA PARTITIONING Predictability is a fundamental factor to take decissions for an efficient scheduling and resubmission of jobs Otherwise, it is difficult to know if a job is stacked or simply performing slowly. Resources with a limitation on the wall time that cannot be guaranteed could be excluded from an experiment. The choice of partitioning strategy and its size has a strong impact on performance The execution time depend on the size of the input data and the size of the reference database. The benchmarking information published by the resources is however not reliable enough. 38 Bioinformatics Use Case

39 DATA PARTITIONING 39 Bioinformatics Use Case Results obtained with a partitioning schema of 610 blocks of approximately equal size % Completed jobs — Blocks per size — Blocks per sequences Time (minutes)

40 OBTAINED RESULTS Bioinformatics Use Case Alternative: Send redundant jobs

41 OBTAINED RESULTS The execution of the experiment in a single computer would have been useless It will have required 512 days, and the results would have been fairly out- to-date. The experiment on the Grid took 13 days Crunching factor of 40. But much more larger if considering part of the experiment (90% in 7 days) Crunching factor of 80. If every job would have successfully ended, speed- up could have been incremented in a factor 140 at 90%. Important focus on reducing failure rates. Bioinformatics Use Case

42 QUALITY OF THE PHYLOGENETIC ANNOTATION OF BACTERIES Comparative phylogenetic experiment on a soil sample with respect to different releases of the NR Gene Bank Database. Many of the associations of sample fragments to biological families have changed, even recently. The changing rate does not decreases as time goes by, being increased in many cases. This reveals that the complete diversity of such communities is not sufficiently well described on current data bases. 42 Bioinformatics Use Case

43 CROSS COMPARISON AMONG BIOLOGICAL FAMILIES Using simultaneously different infrastructures

44 OBJECTIVE Taxonomical and functional characterisation of a bacterial population from the digestive tube of Ostrinina nubilalis. Comparison between individuals breed in captivity and those collected directly in the countryside. Bioinformatics Use Case

45 MATERIAL AND METHODS Execution blastx (blastall 2.2.22), expected value: 0,01, blast hits: 5000. Reference database nr_nEuk_Pstipitis_Scerevisiae_Dhansenii.fna. (8,1 GB formatted with formatdb). It consists on the nr database plus the genomes of Pichia stipitis, Saccharomyces cerevisiae S288c and Debaryomyces hansenii CBS767.. Input sequences 538480 sequences with a size of 229 MB and an average length of 363,45 nucleotides. Obtained through pyrosequencing 454 (Titanium). Bioinformatics Use Case

46 RESULTS Submission using the three available e-Infrastructures EGEE (VO: biomed), EELA (VO: prod.vo.eela-eu.eu), NGI (VO: vo.blast.es- ngi.eu) 600 Jobs split by job size (390KB) with an average duration of : 13,11 hours. Distributed accordingly to the availability of resources: EGEE(240), EELA(120), NGI(240). Results Estimated sequential time: 6113 hours (8 months). Makespan: 5 days and 10 hours. Approximated speed-up: 47X. Percentage of resubmission: 18%..Total output: 26 GB. Bioinformatics Use Case

47 EXPERIMENT PROGRESSION Bioinformatics Use Case

48 EXPERIMENT THROUGHPUT Bioinformatics Use Case

49 BUT THINGS ARE GETTING WORSE… Bioinformatics Use Case 49

50 OR EVEN WORSER! Bioinformatics Use Case 50

51 CONCLUSIONS Many tools for bioinformatics and medical imaging require an enormous processing time. Data storage and management is becoming a problem that cannot be faced separately. The use of e-Science infrastructures constitute an effort that can only be tackled from a multidisciplinar group. Those data challenges are not for the future but present nowadays.

52 DATA DELUGE? Do you think that it was hard? Animals entered the Noa’s Ark two by two… Edwards Hicks (1780-1849)

53 CONTACT Ignacio Blanquer Vicente Hernández Universidad Politécnica de Valencia Camino de Vera s/n 46022 Valencia, Spain Tel: +34-963879743 Fax. +34-963877274 E-mail:iblanque@dsic.upv.es vhernand@dsic.upv.esiblanque@dsic.upv.es vhernand@dsic.upv.es 53


Download ppt "APPLICATIONS OF GRIDS TO BIOMEDICINE Ignacio Blanquer Vicente Hernández."

Similar presentations


Ads by Google