
1 Hadoop on the EGI Federated Cloud Dr Tamas Kiss, CloudSME Project Director University of Westminster, London, UK kisst@wmin.ac.uk Carlos Blanco – University of Cantabria Tamas Kiss, Gabor Terstyanszky – University of Westminster Peter Kacsuk – MTA SZTAKI

2 MapReduce/Hadoop * MapReduce: processes large datasets in parallel on thousands of nodes in a reliable and fault-tolerant manner * Map: input data is divided into chunks and analysed on different nodes in parallel * Reduce: collates the work and combines the results into a single value * Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework * Originally designed for bare-metal clusters – popularity in the cloud is growing * Hadoop: open-source implementation of the MapReduce framework introduced by Google in 2004 Introduction MapReduce and big data
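
A minimal sketch of the map/reduce pattern, using the classic word-count example. This is an illustrative Python version only; the workflows described later run a Java WordCount jar on the cluster.

```python
#!/usr/bin/env python3
# Illustrative word count in the map/reduce style (Hadoop Streaming-like).
# This sketch only demonstrates the pattern; it is not the Java WordCount
# application used in the demo.
import sys
from collections import defaultdict

def mapper(lines):
    # Map: split each input line into words and emit (word, 1) pairs.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: combine all counts emitted for the same word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts

if __name__ == "__main__":
    # Run locally on text piped to stdin: cat input.txt | python3 wordcount.py
    for word, total in sorted(reducer(mapper(sys.stdin)).items()):
        print(f"{word}\t{total}")
```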

3 Motivation * Many scientific applications (such as weather forecasting, DNA sequencing and molecular dynamics) are parallelized using the MapReduce framework * Installation and configuration of a Hadoop cluster is well beyond the capabilities of most domain scientists Aim * Integration of Hadoop with workflow systems and science gateways * Automatic setup of the Hadoop software and infrastructure * Utilization of the power of cloud computing Motivations

4 CloudSME project * To develop a cloud-based simulation platform for manufacturing and engineering * Funded by the European Commission FP7 programme, FoF: Factories of the Future * July 2013 – March 2016, EUR 4.5 million overall funding * Coordinated by the University of Westminster * 29 project partners from 8 European countries: 24 companies (all SMEs) and 5 academic/research institutions * Spin-off company established – CloudSME UG * One of the industrial use-cases: data mining of aircraft maintenance data using MapReduce-based parallelisation Motivations

5

6 * Set up a disposable cluster in the cloud, execute the Hadoop application and destroy the cluster * Cluster-related parameters and input files are provided by the user * The workflow node executable is a program that sets up the Hadoop cluster, transfers files to and from the cluster and executes the Hadoop job * Two methods are proposed: * Single Node Method * Three Node Method Approach

7 * Connect to the cloud and launch servers * Connect to the master node server and set up the cluster configuration * Transfer input files and the job executable to the master node * Start the Hadoop job by running a script on the master node * When the job is finished, retrieve the output (if the job was successful) and delete the servers from the cloud (a rough sketch of these steps is shown below) Approach Single node method
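
A rough sketch of these steps, assuming SSH/SFTP access to the launched master node. The user name, remote paths and the run_job.sh script are placeholders invented for illustration, and the cloud provisioning call itself is omitted; this is not the actual workflow executable.

```python
# Minimal sketch of the single node method after the servers have been launched:
# connect to the master node, upload the job, run it, fetch the output on success.
# Host, user, paths and run_job.sh are hypothetical placeholders.
import paramiko

def run_hadoop_job(master_ip, key_file, jar_file, input_tar):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(master_ip, username="hadoop", key_filename=key_file)

    # Transfer the job executable and input data to the master node.
    sftp = ssh.open_sftp()
    sftp.put(jar_file, "/home/hadoop/job.jar")
    sftp.put(input_tar, "/home/hadoop/input.tar")

    # Start the Hadoop job by running a script on the master node.
    _, stdout, _ = ssh.exec_command("bash run_job.sh job.jar input.tar")
    exit_status = stdout.channel.recv_exit_status()

    # Retrieve the output only if the job succeeded; the servers would then
    # be deleted through the cloud interface (not shown here).
    if exit_status == 0:
        sftp.get("/home/hadoop/output.tar", "output.tar")
    sftp.close()
    ssh.close()
    return exit_status == 0
```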

8 * Stage 1 or Deploy Hadoop Node: launch servers in the cloud, connect to the master node, set up the Hadoop cluster and save the Hadoop cluster configuration * Stage 2 or Execute Node: upload input files and the job executable to the master node, execute the job and get the results back * Stage 3 or Destroy Hadoop Node: destroy the cluster to free up resources Approach Three node method
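
One way the three workflow nodes could hand the cluster description from Stage 1 to Stages 2 and 3 is via a small state file passed along the workflow; the field names below are invented for illustration and do not reflect the actual format used by the implementation.

```python
# Illustrative handover between the three workflow nodes: Stage 1 (Deploy)
# saves the cluster description, Stage 2 (Execute) and Stage 3 (Destroy)
# read it back. The JSON layout is an assumption, not the real format.
import json

def save_cluster_state(path, master_ip, worker_ips, key_file):
    # Stage 1: persist what the later stages need to reach the cluster.
    state = {"master": master_ip, "workers": worker_ips, "ssh_key": key_file}
    with open(path, "w") as f:
        json.dump(state, f)

def load_cluster_state(path):
    # Stages 2 and 3: recover the cluster description saved by Stage 1.
    with open(path) as f:
        return json.load(f)
```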

9 Current implementations 1. EGI FedCloud implementation using OCCI 2. CloudBroker Platform implementation to access multiple heterogeneous clouds (e.g. Amazon, CloudSigma, OpenStack, OpenNebula)

10 EGI FedCloud Implementation

11 CloudBroker Implementation

12 EGI FedCloud Implementation * The Hadoop application is registered in the EGI AppDB: https://appdb.egi.eu/store/vappiance/hadoop.2.7.1 * A Python API interacts with the OCCI interface to create, monitor and destroy VMs. * A grid certificate is needed to manage VMs. * A JSON catalog lists all available EGI FedCloud sites and flavours. * EGI Block Storage (via the OCCI interface) hosts HDFS, offering more storage than the VM disk provides. (A sketch of catalog-driven VM creation follows below.)
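
A sketch of how a site and flavour could be picked from such a JSON catalog and a VM created through the OCCI interface. The catalog layout is invented for illustration, and the command assumes the rOCCI command-line client with an X.509 proxy; exact flags may differ between client versions.

```python
# Sketch: look up a site/flavour in a JSON catalog and create a compute
# resource through an OCCI client. Catalog structure and rOCCI-cli flags
# are assumptions; consult the client documentation for the exact syntax.
import json
import subprocess

def create_vm(catalog_path, site_name, flavour, proxy_file):
    with open(catalog_path) as f:
        catalog = json.load(f)
    # Assumed catalog entry: {"endpoint": ..., "os_tpl": ..., "resource_tpls": {"Small": ..., ...}}
    site = catalog[site_name]
    cmd = [
        "occi",
        "--endpoint", site["endpoint"],
        "--auth", "x509", "--user-cred", proxy_file, "--voms",
        "--action", "create", "--resource", "compute",
        "--mixin", "os_tpl#" + site["os_tpl"],
        "--mixin", "resource_tpl#" + site["resource_tpls"][flavour],
        "--attribute", "occi.core.title=hadoop-node",
    ]
    # On success the client prints the URL of the new compute resource.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
```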

13 General purpose, workflow-oriented gateway framework Supports the development and execution of workflow-based applications Enables the multi-cloud and multi-grid execution of any workflow Supports the fast development of gateway instances by a customization technology Implementation WS-PGRADE/gUSE

14 * Each box describes a task * Each arrow describes information flow such as input and output files * Special nodes describe parameter sweeps Implementation WS-PGRADE/gUSE

15 Implementation SHIWA workflow repository Workflow repository to store directly executable workflows Supports various workflow systems including WS-PGRADE, Taverna, Moteur, Galaxy etc. Fully integrated with WS-PGRADE/gUSE

16 Implementation Supported storage solutions Local (user’s machine): * Bottleneck for large files * Multiple file transfers: local machine – WS-PGRADE – Bootstrap node – Master node – HDFS sftp/ftp/https/http: * Two stage file transfer: Server – Master node – HDFS Swift: * Direct transfer from Swift to HDFS * Using Hadoop’s distributed copy application Amazon S3: * Direct transfer from S3 to HDFS * Using Hadoop’s distributed copy application Input/output locations can be mixed and matched in one workflow
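
For the direct Swift and S3 cases, the transfer relies on Hadoop's distributed copy tool (DistCp) running on the cluster. A minimal sketch, with placeholder bucket/container names; Swift support assumes the hadoop-openstack module is configured on the cluster.

```python
# Sketch: copy input data straight from S3 or Swift into HDFS using DistCp,
# invoked on the master node. URIs below are placeholders.
import subprocess

def distcp(source_uri, hdfs_path):
    # e.g. source_uri = "s3n://KeyID:SecretKey@mybucket/input"
    #  or  source_uri = "swift://mycontainer.provider/input"
    subprocess.run(["hadoop", "distcp", source_uri, hdfs_path], check=True)

# distcp("s3n://KeyID:SecretKey@mybucket/input", "hdfs:///user/hadoop/input")
```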

17 How to use it? 1. Create an account on the EGI FedCloud WS-PGRADE Gateway: https://fedcloud-gateway.lpds.sztaki.hu/ 2. Import the Hadoop workflow(s) to your account from the SHIWA Workflow Repository 3. Download and customise the sample configuration files 4. Configure the workflow by uploading the configuration files and the Hadoop sources/executables 5. Submit. See the demonstration and user manual for further details.

18 How to use it? Single node method

19 How to use it? Three node method

20 How to use it? Cluster configuration file – cluster.cfg: * Cluster section * infrastructure: Infrastructure tag (e.g. fedcloud, cloudbroker) * cloud: Resource cloud (e.g. CESNET, BIFI) * app: Hadoop version (e.g. Hadoop-2.7.1) * flavour: Size of the VM (e.g. Small, Medium, Large and XLarge) * nodes: Number of Hadoop nodes * volume: Size of block storage in GB for each node (only available for the fedcloud infrastructure) * Credentials section * myproxy_server: Hostname of the MyProxy server (e.g. myproxy1.egee.cesnet.cz) * myproxy_password: MyProxy password * username: MyProxy user name

21 How to use it? Cluster configuration file example: * [cluster] * infrastructure=fedcloud * cloud=CESNET * app=Hadoop-2.7.1 * flavour=Small * nodes=2 * * [credentials] * myproxy_server=myproxy1.egee.cesnet.cz * myproxy_password=****** * username=josecarlos.blanco

22 How to use it? Job configuration file – job.cfg: * input_data: Where to get input data from. Currently supported input data sources are local (uploaded directly by the user to the portal server), s3/s3n/s3a, hftp, hdfs, swift, http/https and sftp/ftp. * output_data: Where to transfer output data files. Currently supported output data destinations are local, s3/s3n/s3a, hftp, hdfs, swift, http/https and sftp/ftp. * job_class: The main class name of the Hadoop application to be executed on the cluster, complete with package info (e.g. org.myorg.WordCount) * jar_file: jar file of the Hadoop application to be executed (e.g. WordCount.jar) * map_tasks: Number of map tasks for the job * reduce_tasks: Number of reduce tasks for the job

23 How to use it? Job configuration file example: * [job] * input_data = local * output_data = local * job_class = WordCount * jar_file = wordcount.jar * map_tasks = 2 * reduce_tasks = 1

24 How to use it? How to create the data bundle – data.tar? * Copy the Hadoop job executable (jar file) OR its source code files together with the build.sh script to compile it into a folder * If your input files are local: compress your Hadoop input files into an input.tar and copy this compressed file into the same folder as your executable/source code * Compress the content of your folder (the job executable or its source code with the build.sh script, plus the compressed job input files if input_data is local) into a tar file called data.tar (a scripted sketch follows below)
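
A scripted sketch of assembling the bundle, assuming the folder already contains the jar (or the sources plus build.sh) and, for local input, the input.tar; the folder name used here is a placeholder.

```python
# Sketch: pack the content of a prepared folder into data.tar.
# arcname keeps the archive flat, so the folder content (not the folder
# itself) ends up at the top level of the archive.
import os
import tarfile

def make_data_bundle(folder, bundle_name="data.tar"):
    with tarfile.open(bundle_name, "w") as tar:
        for name in os.listdir(folder):
            tar.add(os.path.join(folder, name), arcname=name)

# make_data_bundle("wordcount_job")
```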

25 Any questions? Contact: Carlos Blanco: josecarlos.blanco@unican.es Tamas Kiss: kisst@wmin.ac.uk

26 Demo overview * Workflow * Hadoop Three Node: deploy, execute and destroy * Cluster * Two nodes * Flavour: Small * 10 GB of block storage for each node (HDFS) * CESNET site * Job * WordCount as a simple example of a MapReduce application * Input data: Amazon S3 (s3n://KeyID:SecretKey@testegi/file) * Output data: local * WS-PGRADE portal * guse-fedcloud-gateway.sztaki.hu/

