Spark and Jupyter - IT Analytics Working Group - Luca Menichetti.


Analytics Working Group A cross-group activity in CERN IT whose main goal is to perform analyses of monitoring data coming from different CERN IT services. Members of the AWG commit to sharing their data with the others, using the Hadoop cluster (HDFS) as a common repository; this is implicit when their applications already run inside Hadoop. An analysis task can therefore be performed inside the cluster with any Hadoop-compliant technology, such as MapReduce, Spark, Pig, HBase, Hive or Impala.

Analysis with Spark Why? Spark is the most popular Big Data project, partly because bringing it into the analysis process makes it possible to gather together many of the steps required by the analysis workflow. Submitting a job, however, is not straightforward: spark-submit --num-executors 10 --executor-memory 4G --executor-cores 4 --class MyMainClass myCode.jar args Every change to the code requires recompiling and resubmitting the jar; it is impossible to re-execute only the modified parts. Developing in Spark within a web notebook will improve both the overall analysis process and the learning curve.

Spark notebooks – current status Requirement: to fully exploit Hadoop and Spark, the notebook kernel has to run on a machine that is a client of the Hadoop cluster. Read/write HDFS directly (avoiding data migration) The Spark application can scale up/down within the cluster Current solution: IPython, Jupyter and Apache Zeppelin (installed manually) running on OpenStack nodes with a puppetized Hadoop client installation. Jupyter runs PySpark (the Python API for Spark) as a kernel. Zeppelin (a multi-interpreter, interactive web notebook solution) provides built-in Spark integration.
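A rough sketch of the manual setup the slide describes, on a Hadoop-client VM. The paths and port are hypothetical placeholders, not the actual CERN configuration: the point is only that Jupyter's Python must be able to find the Spark installation on the client node.

```shell
# Hypothetical setup on a puppetized Hadoop-client node: expose the
# local Spark installation to Python, then start the notebook server.
export SPARK_HOME=/usr/lib/spark               # placeholder path
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
jupyter notebook --no-browser --port 8888
```

Because the node is already a Hadoop client, a PySpark session started from such a notebook can read and write hdfs:// paths directly, with no data migration.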

Limitations and issues Each user has to create their own virtual machine and install the notebook platform they would like to use No way to properly share notebooks (currently GitLab) Running in "yarn mode" (jobs distributed across the cluster) is not fully supported, and it is still not clear how to manage the distributed application Each Hadoop client needs a Kerberos token, used by the Jupyter/Zeppelin instance to access HDFS and run jobs (so users cannot share an instance).
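The Kerberos constraint can be made concrete with a short sketch. The principal below is a hypothetical example, not an actual account: the notebook instance reuses whatever ticket cache was created on its node, which is why one instance is tied to one user.

```shell
# Hypothetical per-user step on the client node: obtain a Kerberos
# ticket before starting Jupyter/Zeppelin (principal is a placeholder).
kinit someuser@CERN.CH

# Spark jobs launched from the notebook then run against the cluster,
# e.g. in yarn client mode, using that single user's credentials:
spark-submit --master yarn --deploy-mode client myScript.py
```

Since the ticket cache belongs to one user, two users sharing the same Jupyter/Zeppelin instance would also share credentials, which is not acceptable.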

Interests and ideas We are supporting users who run Spark with notebooks, but we will not provide this as a service. Work is in progress on a Puppet module to install Jupyter/Zeppelin. We hope that in the future this can be integrated into a proper service providing ready-to-use Spark notebooks, solving the sharing/portability problem (avoiding dependency issues, whether missing libraries or version mismatches). Features: additional interpreters for Spark components (SparkSQL) Notebook versioning Schedulers Notebook web visualization.