Petr Škoda, Jakub Koza (Astronomical Institute, Academy of Sciences)

Presentation transcript:

VO-CLOUD upgrade: Integration of Spark, Jupyter and HDFS in a UWS-driven cloud service
Petr Škoda, Jakub Koza
Astronomical Institute, Academy of Sciences, Ondřejov, Czech Republic
Supported by grant COST LD-15113 of the Ministry of Education, Youth and Sports of the Czech Republic, and by the National Science Foundation of China.
IVOA Interoperability Meeting, GWS Session 1, Shanghai, China, 16 May 2017

Concept of a scientific "CLOUD"
- ITERATIVE REPETITION of the SAME computation (workflow), e.g. machine learning on emission-line profiles from LAMOST
- LARGE, stable INPUT data + small, changing PARAMETERS
- Many runs on the SAME data (tuning required)
- Graphical visualization from post-processed output (text) files
- Driven from a WWW browser: supercomputing from a PDA/mobile device

VO-CLOUD Architecture: a distributed engine
- MASTER (frontend): database of users and their experiments, visualization, scheduling, load balancing
- SHARED DATA STORAGE: controlled access (Big Data)
- WORKERS (backend): computation [+ output for visualization]
Job control follows the UWS pattern; see the sketch below.
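The service is UWS-driven, so a client talks to the master through the standard IVOA UWS job lifecycle. The following is a minimal sketch of that pattern only; the endpoint URL and the "config" parameter name are illustrative assumptions, not the actual VO-CLOUD API.

```python
import time
import requests

# Hypothetical UWS job list of the VO-CLOUD master (placeholder URL)
UWS_BASE = "https://vocloud.example.org/uws/jobs"

with open("job.json") as f:
    job_config = f.read()

# UWS: POSTing to the job list creates a new job in PENDING phase;
# the service answers with a redirect whose Location is the job resource
resp = requests.post(UWS_BASE, data={"config": job_config}, allow_redirects=False)
job_url = resp.headers["Location"]

# UWS: start the job by setting PHASE=RUN
requests.post(f"{job_url}/phase", data={"PHASE": "RUN"})

# Poll the phase until the job reaches a terminal state
while True:
    phase = requests.get(f"{job_url}/phase").text.strip()
    if phase in ("COMPLETED", "ERROR", "ABORTED"):
        break
    time.sleep(5)

print("Job finished with phase:", phase)
```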

Getting spectra + storing them (restricted access, big files)
Sources of spectra:
- Files: UPLOAD from a given local directory (recursive)
- DOWNLOAD by HTTP + index, FTP (recursive)
- VOTable: UPLOAD a VOTable (e.g. prepared in TOPCAT, metadata only)
- REMOTE VOTable: SSAP query + Accref + DataLink + SODA (a minimal SSAP sketch follows)
- SAMP control: send to SPLAT
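For the "REMOTE VOTable" path, the spectra can be located through an SSAP query and resolved to download URLs. A minimal sketch using pyvo is shown below; the service URL and the search position are placeholders, not the services actually configured in VO-CLOUD.

```python
# Query an SSAP service and resolve each record to the URL of the spectrum file
import pyvo
from astropy import units as u
from astropy.coordinates import SkyCoord

ssa_url = "http://example.org/ssa"          # placeholder SSAP service endpoint
service = pyvo.dal.SSAService(ssa_url)

# Positional query: centre and diameter of the search region
pos = SkyCoord(280.0, -30.0, unit="deg")
results = service.search(pos=pos, diameter=0.1 * u.deg)

for record in results:
    # getdataurl() resolves the access reference (Accref) to a downloadable URL
    print(record.getdataurl())
```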

Machine Learning of BIG Archive

VO-CLOUD deployment

SOM Worker example

VO-CLOUD spectra visualisation
- Big Data visualization (thousands of spectra)
- Implemented in Python using Matplotlib
- Can visualise multiple selected spectra files
- The figure is generated server-side and then transferred to the client
- The user can pan, zoom and export to different formats
- Uses WebSocket for server-client communication
(A minimal server-side plotting sketch follows.)
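Since the slide states the visualisation is implemented in Python with Matplotlib and rendered on the server, the sketch below shows that server-side pattern in miniature. The file layout (two-column wavelength/flux text files) and function name are assumptions for illustration, not the actual VO-CLOUD code.

```python
# Render several spectra on the server and return PNG bytes to push to the client
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # headless backend: no display on the server
import matplotlib.pyplot as plt

def render_spectra(spectra_files):
    """Plot each two-column (wavelength, flux) text file into one figure."""
    fig, ax = plt.subplots(figsize=(10, 4))
    for path in spectra_files:
        wavelength, flux = np.loadtxt(path, unpack=True)
        ax.plot(wavelength, flux, lw=0.5, label=path)
    ax.set_xlabel("Wavelength [Å]")
    ax.set_ylabel("Flux")
    ax.legend(fontsize="small")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=120)   # PDF/SVG work the same way for export
    plt.close(fig)
    return buf.getvalue()
```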

VO-CLOUD spectra visualisation (screenshots)

Hadoop cluster infrastructure
- HDFS: distributed filesystem used as the storage for Hadoop/Spark jobs
  - NameNode: one process per cluster; holds the metadata of all files stored in HDFS
  - DataNode: one process per cluster node; stores the individual file data blocks
- YARN: resource manager and scheduler for Hadoop/Spark jobs
  - ResourceManager: main process managing resources and scheduling jobs; one process per cluster
  - NodeManager: process executing the assigned work on each node; one process per cluster node
- Problem of millions of small files (FITS): Apache Avro (sequence files); see the packing sketch below
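HDFS performs poorly with millions of tiny files, which is why the slide mentions Apache Avro containers. Below is a minimal sketch of packing many small FITS files into one Avro file before copying it to HDFS; the schema and field names are assumptions for illustration, not the actual VO-CLOUD format.

```python
# Pack many small FITS files into a single Avro container
import glob
from fastavro import writer, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Spectrum",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "content",  "type": "bytes"},   # raw FITS file bytes
    ],
})

def fits_records(pattern):
    """Yield one record per FITS file matching the glob pattern."""
    for path in glob.glob(pattern):
        with open(path, "rb") as f:
            yield {"filename": path, "content": f.read()}

# One large Avro file is far friendlier to the NameNode than millions of FITS files
with open("spectra.avro", "wb") as out:
    writer(out, schema, fits_records("spectra/*.fits"))
```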

Hadoop cluster deployment

Spark Worker in VO-CLOUD
- Accepts a JSON configuration
- Downloads the requested files from the master server to HDFS
- Executes the spark-submit script using implicit parameters or parameters present in the JSON configuration (see the sketch below)
- Awaits spark-submit completion
- Downloads the requested result files from HDFS
- Master FILESYSTEM: NFS emulation for HDFS
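A minimal sketch of how such a worker might turn the received JSON configuration into a spark-submit invocation is shown below. The configuration keys ("master", "deploy_mode", "application", "arguments") are assumptions, not the actual VO-CLOUD job format.

```python
# Launch a Spark job from a JSON configuration and wait for it to finish
import json
import subprocess

def run_spark_job(config_path):
    with open(config_path) as f:
        config = json.load(f)

    cmd = [
        "spark-submit",
        "--master", config.get("master", "yarn"),            # default: run on YARN
        "--deploy-mode", config.get("deploy_mode", "cluster"),
        config["application"],                                # path to the Spark application
    ] + config.get("arguments", [])

    # Block until spark-submit exits; a non-zero return code marks the job as failed
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.returncode == 0, completed.stdout, completed.stderr
```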

JupyterHub
- Consists of a Proxy, the Hub and individual Jupyter Notebook server instances
- One Jupyter Notebook server instance per authenticated user
- JupyterHub itself runs as a Docker container
- JupyterHub spawns additional Docker containers, one for each Jupyter Notebook server instance
- Users are isolated from each other and from the hosting system itself
(A minimal DockerSpawner configuration sketch follows.)
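The per-user container setup described above is what JupyterHub's DockerSpawner provides. A minimal jupyterhub_config.py sketch is shown below; the image name and network name are placeholders, not the actual deployment values.

```python
# Minimal jupyterhub_config.py: spawn one Docker container per authenticated user
c = get_config()  # noqa: F821  (injected by JupyterHub when loading the config file)

c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook"      # placeholder single-user image
c.DockerSpawner.network_name = "jupyterhub-net"       # containers share a Docker network
c.DockerSpawner.remove = True                         # remove containers when they stop

# The Hub runs in its own container, so single-user servers reach it by service name
c.JupyterHub.hub_connect_ip = "jupyterhub"
```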

JupyterHub authentication
- VO-CLOUD generates a temporary token and passes it to the user's web browser
- The web browser uses the username and token for JupyterHub authentication
- JupyterHub uses the vocloud-authenticator package for authentication
- The authenticator asks VO-CLOUD whether the token is valid for the given username
- JupyterHub then spawns a new Docker container with a new Jupyter Notebook server instance for the user
(A minimal custom-authenticator sketch follows.)
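The sketch below shows the general shape of a JupyterHub Authenticator that validates a username/token pair against an external service. It is not the actual vocloud-authenticator package; the validation endpoint, its parameters and its response format are assumptions for illustration.

```python
# Custom JupyterHub authenticator that checks a temporary token with VO-CLOUD
import requests
from jupyterhub.auth import Authenticator
from traitlets import Unicode

class VOCloudAuthenticator(Authenticator):
    vocloud_url = Unicode(
        "https://vocloud.example.org/api/token-check",   # placeholder endpoint
        config=True,
        help="VO-CLOUD token validation endpoint",
    )

    async def authenticate(self, handler, data):
        username = data["username"]
        token = data["password"]   # the temporary token arrives in the password field
        # Ask VO-CLOUD whether this token is valid for this username
        resp = requests.get(self.vocloud_url,
                            params={"user": username, "token": token})
        if resp.status_code == 200 and resp.text.strip() == "valid":
            return username        # returning the username grants access
        return None                # None rejects the login
```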

JupyterHub deployment

JupyterHub example (screenshots)

Conclusions
- VO-CLOUD is now a very powerful machine learning environment capable of visualizing Big Data
- It can spawn jobs on a remote Spark cluster
- It provides a sandbox for playing with big data in Jupyter notebooks ON A BIG SERVER (memory, CPU, GPU)
- but some important capabilities are still missing:
  - Avro is not yet part of the Spark worker
  - Uses Docker, but is not yet deployable as a Docker (Compose) stack
- Combines Java EE + Python (+ Scala)
- Focused on machine learning of 1D vectors (spectra, time series) employing VO technology and protocols

Source code: https://github.com/vodev/vocloud