
1 Bioinformatics Applications in the Spanish Network for e-Science
Ignacio Blanquer, Vicente Hernández
EGEE-III INFSO-RI-222667. Enabling Grids for E-sciencE. www.eu-egee.org
EGEE and gLite are registered trademarks.

2 Outline
The Spanish Network for e-Science
  – Structure and link with the Spanish NGI.
Bioinformatics applications in the Spanish Network for e-Science.
Challenges for Bioinformatics on the Grid.
Bioinformatics Session - EGEE’09 - Barcelona

3 The Creation of the Spanish Network for e-Science
As a consequence of the interest raised by the research centres and groups participating in national and international projects on Grids and Supercomputing, the White Book on e-Science was produced (http://www.fecyt.es/e-ciencia/libroblanco.htm).
To address the need for global coordination and for common tools that ease access to resources, the Ministry of Science and Innovation created the Spanish Network for e-Science (CAC-2007-52).
  – Officially approved in December 2007 and coordinated by Vicente Hernández García (Universidad Politécnica de Valencia).
One of the mandates of the Network was to set up the Spanish NGI, which was officially created in July 2009.
  – The ministry nominated Isabel Campos (IFCA) as the coordinator of the Spanish NGI.

4 Participant Groups
More than 50 different institutions and 97 research groups.
More than 1000 researchers.
Dynamic structure
  – 28 groups have joined since the activity started.
Structured in four activity areas
  – EGEE Booth Number 6.

5 Infrastructure

Site     Cores   TB
CESGA      339   1
UPV         36   1
UNIZAR      54   0.8
CIEMAT     220   2.7
PIC       1296   10
IFCA       867   1

gLite-based.
Own BDII (EGEE-compatible).
Supporting IBERGRID (ES+PT).
3 different WMs (Xbroker, gLite-WMS, GridWay).
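As a quick check on the scale of the infrastructure, the per-site figures from the slide can be added up. This is only a sketch: the dictionary and function names are illustrative, and the slide does not say what the TB figure measures, so treating it as storage capacity is an assumption.

```python
# Per-site figures transcribed from the Infrastructure slide: (cores, TB).
# Assumption: the TB value is storage capacity; the slide does not say.
SITES = {
    "CESGA": (339, 1.0),
    "UPV": (36, 1.0),
    "UNIZAR": (54, 0.8),
    "CIEMAT": (220, 2.7),
    "PIC": (1296, 10.0),
    "IFCA": (867, 1.0),
}


def totals(sites):
    """Return (total cores, total TB rounded to one decimal)."""
    cores = sum(c for c, _ in sites.values())
    tb = sum(t for _, t in sites.values())
    return cores, round(tb, 1)


def core_share(sites):
    """Return each site's fraction of the total core count."""
    total = sum(c for c, _ in sites.values())
    return {name: c / total for name, (c, _) in sites.items()}


if __name__ == "__main__":
    cores, tb = totals(SITES)
    print(f"{cores} cores, {tb} TB")
    for name, share in sorted(core_share(SITES).items(), key=lambda kv: -kv[1]):
        print(f"{name:8s} {share:.1%}")
```

With the figures above, the six sites add up to 2812 cores and 16.5 TB, and PIC alone contributes a little under half of the cores.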

6 Applications
Three roles are identified
  – Mature applications aiming at a challenging experiment.
  – Pilots that require intensive porting and a feasibility study.
  – Support groups with experience in porting applications.
Pilots, Applications and Support Groups are certified by an expert board.
An internal call for projects was set up.
[Diagram: selection and migration workflow. Labels: Pilots; Applications; Pilot Selection; Expert Panel; Analysis and Selection; Resource Allocation; Pilot migration; Support Groups; Deploym. and test; Report; Applications proposal; Autonom. migration; Assisted Migration; Production NGI infrastructure.]

7 Overview of the Bioinformatics Applications
Consolidated Use
  – Work on current databases to analyse quality, improve annotation or increase usability
      • CD-HIT.
      • GBLAST.
      • BiG - Metagenomics.
Emerging Use
  – Port new applications to the Grid to provide new services
      • GFrodock.
      • G-MIRA.
      • Filogen.

8 CD-HIT
http://www.e-ciencia.es/wiki/index.php/CD-HIT
Identification of Representative Sequences of Protein Families using CD-HIT
  – Proposed by the National Centre of Oncological Research (CNIO).
  – It proposes using the resources available through the Spanish Network for e-Science and the CD-HIT algorithm to create non-redundant versions of the available databases more regularly.
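The idea behind building a non-redundant database can be illustrated with a toy greedy clustering in the spirit of CD-HIT: sequences are processed from longest to shortest, and each one either joins the first representative it resembles closely enough or becomes a new representative. This is a sketch under stated assumptions, not the CD-HIT implementation: the `similarity` function uses Python's `difflib` as a stand-in for CD-HIT's word filtering and alignment, and the function names and the 0.9 threshold are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the identity measure CD-HIT uses."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering: longest sequences become representatives first.

    Returns a dict mapping each representative to the list of its members
    (the representative itself is the first member).
    """
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    proteins = [
        "MKTAYIAKQRQISFVKSHFSRQ",
        "MKTAYIAKQRQISFVKSHFSR",
        "GGGGGGGGGGGGGGGGGGGG",
    ]
    for rep, members in greedy_cluster(proteins).items():
        print(rep, len(members))
```

The keys of the returned dictionary form the non-redundant set. Each new sequence is compared against the current representatives, so the cost grows with the size of the database, which is why regenerating such sets regularly calls for substantial computing resources.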

9 GBLAST
http://www.e-ciencia.es/wiki/index.php/BLAST
Analysis of horizontal gene transfer through a BLAST Processing Service
  – Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the GRyCAP, from the Universidad Politécnica de Valencia.
  – This experiment aims at identifying horizontal gene transfer between prokaryotes and plants, using the UniProt database, by comparing all known prokaryotic sequences (~4M) against all known sequences of plants (~0.5M), animals (~1.5M) and fungi (~0.4M).
[Table not reproduced: output size using the columns as input and the rows as reference database.]
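To give a feel for the size of this experiment, the approximate sequence counts on the slide can be turned into a number of query-reference pairs and into a number of Grid jobs. This is a back-of-the-envelope sketch: BLAST is a heuristic search and does not evaluate every pair in full, so the pair count is only an upper bound on the search space, and the chunk size of 10,000 queries per job is a hypothetical value that does not come from the slides.

```python
import math

# Approximate sequence counts from the slide.
QUERIES = 4_000_000  # prokaryotic sequences used as input
REFERENCE_DBS = {
    "plants": 500_000,
    "animals": 1_500_000,
    "fungi": 400_000,
}


def pair_counts(n_queries, dbs):
    """Number of query-reference sequence pairs for each reference database."""
    return {name: n_queries * size for name, size in dbs.items()}


def job_count(n_queries, n_dbs, chunk_size):
    """Jobs needed if the queries are split into chunks and each chunk
    is searched against each reference database in a separate job."""
    return math.ceil(n_queries / chunk_size) * n_dbs


if __name__ == "__main__":
    pairs = pair_counts(QUERIES, REFERENCE_DBS)
    for name, n in pairs.items():
        print(f"{name:8s} {n:.1e} pairs")
    print(f"total    {sum(pairs.values()):.1e} pairs")
    # Hypothetical chunk size, chosen only for illustration.
    print(job_count(QUERIES, len(REFERENCE_DBS), chunk_size=10_000), "jobs")
```

Under these assumptions the three comparisons cover about 9.6 × 10^12 pairs in total and split into 1200 independent jobs.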

10 GFrodock
http://www.e-ciencia.es/wiki/index.php/GFrodock
Grid-Fast ROtational DOCKing
  – Proposed by the Centro de Investigaciones Biológicas – CSIC.
  – The objective is to determine the interaction between two proteins by analysing their atomic structure.
  – Aims at solving one of the CAPRI (Critical Assessment of Predicted Interactions) scientific challenges.

11 BiG
Metagenomic Analysis on the Grid
Quality of the phylogenetic annotation of bacteria
  – Comparative phylogenetic experiment on a soil sample with respect to different releases of the NR GenBank database.
  – Many of the associations of sample fragments to biological families have changed, even recently.
  – The rate of change does not decrease over time; in many cases it increases.
  – This reveals that the complete diversity of such communities is not sufficiently well described in current databases.

12 GMIRA
http://www.e-ciencia.es/wiki/index.php/MIRA
Assembly of Pyrosequences
  – Proposed by the “Instituto de Biología Molecular y Celular de Plantas” and the Grid and High Performance Computing Research Group of the Universidad Politécnica de Valencia.
  – The new high-throughput sequencing techniques produce millions of reads of between 80 and 500 nucleotides each, which require intensive post-processing for their assembly.
  – This pilot focuses on porting to the Grid one well-known code for this kind of sequence, which requires vast computing and memory resources.

13 Filogen
http://www.e-ciencia.es/wiki/index.php/Filogen
Construction of Phylogenetic Trees
  – Proposed by the Institute of Research on Engineering in Aragon (I3A).
  – Phylogenetics aims at reconstructing the evolutionary relations among species and living beings using the information from their genomes.
  – This pilot focuses on porting a suite of general-purpose codes for this purpose, in order to reduce the long response times of challenging executions.

14 Current Status
4 projects already have a VO created (vo.odthpiv.es-ngi.eu, vo.blast.es-ngi.eu, vo.filogen.es-ngi.eu and vo.frodock.es-ngi.eu).
3 projects (GBLAST, FILOGEN and G-MIRA) have been granted resources for porting through an internal project call.
33% of the resources have been consumed by the biomed applications.
[Chart: Resource Usage]

15 Challenges 1/2
From the point of view of the resources
  – Improved scheduling of jobs
      • Highly dynamic behaviour of resources (multiple entry points, information system refresh delays, wide geographic distribution, …).
      • Need for Quality of Service and job run-length prediction.
      • Need for much more scalable algorithms and models: go beyond the simple high-throughput approach based on splitting the input.
      • Minimisation of I/O bandwidth consumption: improve locality of reference for large databases.
  – Specialised resources
      • Main memory constraints.
      • Availability of pre-existing tuned configurations of widely used software.
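The "simple high-throughput approach based on splitting the input" mentioned above can be sketched as follows: a FASTA input is cut into fixed-size groups of records, and each group becomes the input of one independent job. This is a minimal illustration, not code from any of the projects; the function names and the chunk size are assumptions.

```python
from typing import Iterable, Iterator, List


def read_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">") and record:
            yield "\n".join(record)
            record = []
        record.append(line)
    if record:
        yield "\n".join(record)


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size, one list per job."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Made-up input with five short sequences.
    fasta = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n>s4\nCCCC\n>s5\nTTTT\n"
    for i, chunk in enumerate(chunk_records(read_fasta(fasta.splitlines()), 2)):
        print(f"job {i}: {len(chunk)} sequences")
```

Splitting only the input leaves every job needing the full reference database, which is one reason the slide lists I/O bandwidth and locality of reference as separate challenges.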

16 Challenges 2/2
From the point of view of the community
  – Trade-off in public databases between extensive coverage of the available information and its quality.
      • Many results of using the Grid in bioinformatics have focused on this issue.
      • Since databases are growing exponentially in size, this issue is likely to remain relevant in the medium term.
  – Popularisation of community access
      • Availability of simpler interfaces and configurable workflows.
      • But Grids are not suited to every kind of problem: do not raise excessive expectations. Many research groups already have medium-size computing resources that can tackle most of the daily work. Build user confidence.

