Hadoop on the EGI Federated Cloud
Dr Tamas Kiss, CloudSME Project Director, University of Westminster, London, UK
Carlos Blanco – University of Cantabria
Tamas Kiss, Gabor Terstyanszky – University of Westminster
Peter Kacsuk – MTA SZTAKI

MapReduce/Hadoop
* MapReduce: processes large datasets in parallel, on thousands of nodes, in a reliable and fault-tolerant manner
* Map: input data is divided into chunks and analysed on different nodes in parallel
* Reduce: collates the partial results and combines them into a single output
* Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework
* Originally designed for bare-metal clusters, but its popularity in the cloud is growing
* Hadoop: open-source implementation of the MapReduce framework introduced by Google in 2004
Introduction: MapReduce and big data
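To make the map and reduce roles concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python. The demo later in this deck uses an equivalent Java WordCount jar; the file names here are illustrative only.

    #!/usr/bin/env python
    # mapper.py - emits a (word, 1) pair for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py - sums the counts per word; Hadoop sorts the mapper output
    # by key, so all lines for the same word arrive consecutively
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

    # submitted on a cluster with, e.g.:
    # hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #   -input input -output output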

Motivation
* Many scientific applications (e.g. weather forecasting, DNA sequencing, molecular dynamics) have been parallelised using the MapReduce framework
* Installing and configuring a Hadoop cluster is well beyond the capabilities of most domain scientists
Aim
* Integrate Hadoop with workflow systems and science gateways
* Set up the Hadoop software and infrastructure automatically
* Utilise the power of cloud computing
Motivations

CloudSME project
* Develops a cloud-based simulation platform for manufacturing and engineering
* Funded by the European Commission FP7 programme, FoF: Factories of the Future
* July 2013 – March 2016, EUR 4.5 million overall funding
* Coordinated by the University of Westminster
* 29 project partners from 8 European countries: 24 companies (all SMEs) and 5 academic/research institutions
* Spin-off company established – CloudSME UG
* One of the industrial use cases: data mining of aircraft maintenance data using MapReduce-based parallelisation
Motivations

* Set up a disposable cluster in the cloud, execute the Hadoop application, then destroy the cluster
* Cluster-related parameters and input files are provided by the user
* The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster, and executes the Hadoop job
* Two methods proposed (both detailed below):
* Single node method
* Three node method
Approach

* Connect to the cloud and launch servers
* Connect to the master node and set up the cluster configuration
* Transfer input files and the job executable to the master node
* Start the Hadoop job by running a script on the master node
* When the job is finished, retrieve the output (if the job was successful) and delete the servers from the cloud
Approach Single node method
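A rough sketch of such a workflow node executable, assuming SSH access to the master node via the paramiko library. launch_cluster and destroy_cluster are hypothetical stand-ins for the OCCI/CloudBroker calls described later; script and file names are illustrative.

    import paramiko

    def run_single_node(master_ip, key_file, jar, job_class):
        # connect to the master node of the freshly launched cluster
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect(master_ip, username="hadoop", key_filename=key_file)
        sftp = ssh.open_sftp()
        # transfer input data and the job executable
        sftp.put("input.tar", "/home/hadoop/input.tar")
        sftp.put(jar, "/home/hadoop/" + jar)
        # start the Hadoop job through a script on the master and wait for it
        _, out, _ = ssh.exec_command("./run_hadoop_job.sh %s %s" % (jar, job_class))
        ok = out.channel.recv_exit_status() == 0
        # retrieve the output only if the job succeeded
        if ok:
            sftp.get("/home/hadoop/output.tar", "output.tar")
        ssh.close()
        return ok

    master_ip = launch_cluster(nodes=2, flavour="Small")   # hypothetical helper
    try:
        run_single_node(master_ip, "key.pem", "wordcount.jar", "WordCount")
    finally:
        destroy_cluster()                                  # hypothetical helper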

* Stage 1 (Deploy Hadoop node): launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the cluster configuration
* Stage 2 (Execute node): upload input files and the job executable to the master node, execute the job and retrieve the results
* Stage 3 (Destroy Hadoop node): destroy the cluster to free up resources
Approach Three node method
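The state passed between the three workflow nodes can be pictured as a small cluster descriptor that stage 1 writes and stages 2 and 3 receive as an input file. A minimal sketch with assumed field names follows; this is not the actual WS-PGRADE node interface.

    import json

    # Stage 1 (Deploy Hadoop): once the VMs are up, persist what later stages need
    descriptor = {"master_ip": "194.160.0.12",      # placeholder address
                  "worker_ids": ["vm-2", "vm-3"],   # placeholder resource ids
                  "key_file": "cluster_key.pem"}
    with open("cluster.json", "w") as f:
        json.dump(descriptor, f)

    # Stage 2 (Execute) and stage 3 (Destroy Hadoop) simply load it back
    with open("cluster.json") as f:
        cluster = json.load(f)
    print("master node:", cluster["master_ip"])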

Current implementations
1. EGI FedCloud implementation using OCCI
2. CloudBroker Platform implementation to access multiple heterogeneous clouds (e.g. Amazon, CloudSigma, OpenStack, OpenNebula)

EGI FedCloud Implementation

CloudBroker Implementation

EGI FedCloud Implementation
* The Hadoop application is registered in the EGI AppDB
* A Python API interacts with the OCCI interface to create, monitor and destroy VMs
* A grid certificate is needed to manage VMs
* A JSON catalogue lists all available EGI FedCloud sites and flavours
* EGI Block Storage (via the OCCI interface) hosts HDFS, offering more storage than the VM disk provides
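The project's Python API itself is not reproduced here; the following minimal sketch shows the same VM lifecycle by shelling out to the rOCCI command-line client with an X.509 proxy. The endpoint, template names and proxy path are placeholders, and the exact flags may differ between rOCCI versions.

    import subprocess

    ENDPOINT = "https://occi.example.org:11443"   # placeholder OCCI endpoint
    PROXY = "/tmp/x509up_u1000"                   # proxy from myproxy-logon

    def occi(*args):
        # every call authenticates with the X.509 proxy against the site endpoint
        cmd = ["occi", "--endpoint", ENDPOINT,
               "--auth", "x509", "--user-cred", PROXY, "--voms"] + list(args)
        return subprocess.check_output(cmd).decode()

    # create: instantiate a VM from an OS template and a flavour (resource template)
    vm = occi("--action", "create", "--resource", "compute",
              "--mixin", "os_tpl#hadoop-image",
              "--mixin", "resource_tpl#small",
              "--attribute", "occi.core.title=hadoop-master").strip()

    print(occi("--action", "describe", "--resource", vm))   # monitor
    occi("--action", "delete", "--resource", vm)            # destroy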

* General purpose, workflow-oriented gateway framework
* Supports the development and execution of workflow-based applications
* Enables the multi-cloud and multi-grid execution of any workflow
* Supports the fast development of gateway instances by a customization technology
Implementation WS-PGRADE/gUSE

* Each box describes a task
* Each arrow describes information flow, such as input and output files
* A special node describes parameter sweeps
Implementation WS-PGRADE/gUSE

Implementation SHIWA workflow repository
* Workflow repository to store directly executable workflows
* Supports various workflow systems, including WS-PGRADE, Taverna, Moteur, Galaxy, etc.
* Fully integrated with WS-PGRADE/gUSE

Implementation Supported storage solutions
Local (user's machine):
* Bottleneck for large files
* Multiple file transfers: local machine → WS-PGRADE → bootstrap node → master node → HDFS
sftp/ftp/https/http:
* Two-stage file transfer: server → master node → HDFS
Swift:
* Direct transfer from Swift to HDFS using Hadoop's distributed copy application
Amazon S3:
* Direct transfer from S3 to HDFS using Hadoop's distributed copy application
Input/output locations can be mixed and matched in one workflow
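The direct Swift and S3 transfers rely on Hadoop's distributed copy tool (distcp). A minimal sketch of the S3 → HDFS case follows, with placeholder bucket name and credentials; Swift works analogously via the swift:// scheme once the connector is configured.

    import subprocess

    # copy input data straight from Amazon S3 into the cluster's HDFS;
    # distcp itself runs the copy as a distributed MapReduce job
    subprocess.check_call([
        "hadoop", "distcp",
        "-Dfs.s3a.access.key=AKIA...",      # placeholder credentials
        "-Dfs.s3a.secret.key=...",
        "s3a://my-bucket/input",            # source: S3 bucket (placeholder)
        "hdfs:///user/hadoop/input",        # destination: HDFS on the cluster
    ])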

How to use it?
1. Create an account on the EGI FedCloud WS-PGRADE Gateway:
2. Import the Hadoop workflow(s) into your account from the SHIWA Workflow Repository
3. Download and customise the sample configuration files
4. Configure the workflow by uploading the configuration files and the Hadoop sources/executables
5. Submit
See the demonstration and user manual for further details

How to use it? Single node method

How to use it? Three node method

How to use it? Cluster configuration file – cluster.cfg:
Cluster section:
* infrastructure: infrastructure tag (e.g. fedcloud, cloudbroker)
* cloud: resource cloud (e.g. CESNET, BIFI)
* app: Hadoop version (e.g. Hadoop-2.7.1)
* flavour: size of the VM (Small, Medium, Large or XLarge)
* nodes: number of Hadoop nodes
* volume: size of block storage in GB for each node (only available for the fedcloud infrastructure)
Credentials section:
* myproxy_server: hostname of the MyProxy server (e.g. myproxy1.egee.cesnet.cz)
* myproxy_password: MyProxy password
* username: MyProxy user name

How to use it? Cluster configuration file example:

    [cluster]
    infrastructure=fedcloud
    cloud=CESNET
    app=Hadoop
    flavour=Small
    nodes=2

    [credentials]
    myproxy_server=myproxy1.egee.cesnet.cz
    myproxy_password=******
    username=josecarlos.blanco

How to use it? Job configuration file – job.cfg:
* input_data: where to get input data from; currently supported sources are local (uploaded directly by the user to the portal server), s3/s3n/s3a, hftp, hdfs, swift, http/https and sftp/ftp
* output_data: where to transfer output data files; currently supported destinations are local, s3/s3n/s3a, hftp, hdfs, swift, http/https and sftp/ftp
* job_class: the main class name of the Hadoop application to be executed on the cluster, complete with package info (e.g. org.myorg.WordCount)
* jar_file: jar file of the Hadoop application to be executed (e.g. WordCount.jar)
* map_tasks: number of map tasks for the job
* reduce_tasks: number of reduce tasks for the job

How to use it? Job configuration file example:

    [job]
    input_data = local
    output_data = local
    job_class = WordCount
    jar_file = wordcount.jar
    map_tasks = 2
    reduce_tasks = 1

How to use it? How to create the data bundle – data.tar?
* Copy the Hadoop job executable (jar file) OR its source code files, together with the build.sh script to compile them, into a folder
* If your input files are local: compress your Hadoop input files into an input.tar and copy this archive to the same folder as your executable/source code
* Compress the content of the folder (the job executable or its sources with build.sh, plus input.tar if input_data is local) into a tar file called data.tar
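A minimal sketch of building the bundle with Python's tarfile module, assuming a bundle/ folder that already holds the example wordcount.jar; file names follow the slide's example.

    import tarfile

    # 1. if input_data is local, pack the Hadoop input files into input.tar
    with tarfile.open("bundle/input.tar", "w") as tar:
        tar.add("input", arcname=".")       # the local input directory

    # 2. pack the *content* of the bundle folder (executable or sources +
    #    build.sh, plus input.tar) into data.tar, without the folder prefix
    with tarfile.open("data.tar", "w") as tar:
        for name in ["wordcount.jar", "input.tar"]:
            tar.add("bundle/" + name, arcname=name)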

Any questions? Contact: Carlos Blanco: Tamas Kiss:

Demo overview
* Workflow: Hadoop Three Node (deploy, execute and destroy)
* Cluster:
  * Two nodes
  * Flavour: Small
  * 10 GB of block storage for each node (HDFS)
  * CESNET site
* Job:
  * WordCount as a simple example of a MapReduce application
  * Input data: Amazon S3 ( )
  * Output data: local
* WS-PGRADE portal: guse-fedcloud-gateway.sztaki.hu/