Hadoop Tutorials Spark


Hadoop Tutorials: Spark
Kacper Surdy, Prasanth Kothuri

About the tutorial
The third session in the Hadoop tutorial series, this time given by Kacper and Prasanth
Session fully dedicated to the Spark framework: extensively discussed, actively developed, used in production
Mixture of a talk and hands-on exercises

What is Spark
A framework for performing distributed computations
Scalable, applicable to processing TBs of data
Easy programming interface: supports Java, Scala, Python, R
Varied APIs: DataFrames, SQL, MLlib, Streaming
Multiple cluster deployment modes
Multiple data sources: HDFS, Cassandra, HBase, S3
(Why is this good? TBs of data, massive computations, scalability)
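A minimal sketch of the DataFrame and SQL APIs listed above (not from the tutorial). It assumes a Spark 2.x spark-shell, where a SparkSession named spark is predefined; the HDFS path and the "category" column are hypothetical:

// Read JSON from HDFS (one of the supported data sources)
val df = spark.read.json("hdfs:///tmp/events.json")
df.printSchema()
df.groupBy("category").count().show()            // DataFrame API

df.createOrReplaceTempView("events")
spark.sql("SELECT category, COUNT(*) AS n FROM events GROUP BY category").show()   // SQL API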

Compared to Impala
Similar concept of workload distribution
Overlap in SQL functionality
Spark is more flexible and offers a richer API
Impala is fine-tuned for SQL queries
[Diagram: nodes 1..X, each with its own memory, CPU and disks, connected by an interconnect network]

Evolution from MapReduce
Timeline 2002–2014: MapReduce at Google, MapReduce paper, Hadoop at Yahoo!, Hadoop Summit, Spark paper, Apache Spark becomes an Apache top-level project
Speaker note: quote from Google ("we stopped using MapReduce since …"); MapReduce was decommissioned in 2014
Based on a Databricks slide

Deployment modes
Local mode: makes use of multiple cores/CPUs with thread-level parallelism
Cluster modes: Standalone, Apache Mesos, Hadoop YARN (typical for Hadoop clusters, with centralised resource management)
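The deployment mode is selected through the master URL, usually passed to spark-shell or spark-submit with --master. A minimal sketch of the same choice made in application code (host names are placeholders; inside spark-shell a SparkContext already exists, so this applies to standalone applications):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("deployment-demo")
  .setMaster("local[*]")               // local mode: thread-level parallelism on one machine
  // .setMaster("spark://host:7077")   // standalone cluster manager
  // .setMaster("mesos://host:5050")   // Apache Mesos
  // .setMaster("yarn")                // Hadoop YARN (Spark 2.x master URL)
val sc = new SparkContext(conf)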

How to use it
Interactive shell: spark-shell, pyspark
Job submission: spark-submit (use also for Python)
Notebooks: Jupyter, Zeppelin, SWAN (centralised CERN Jupyter service)
Interfaces: terminal, web interface

Example – Pi estimation

import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)

Monte Carlo method: https://en.wikipedia.org/wiki/Monte_Carlo_method

Example – Pi estimation
Log in to one of the cluster nodes haperf1[01-12]
Start spark-shell
Copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark
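Instead of copying the file contents by hand, the Scala REPL's :load command inside spark-shell can run the script directly (a convenience, not part of the original instructions):

scala> :load /afs/cern.ch/user/k/kasurdy/public/pi.spark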

Spark concepts and terminology

Transformations, actions (1)
Transformations define how you want to transform your data. The result of a transformation of a Spark dataset is another Spark dataset. They do not trigger a computation. Examples: map, filter, union, distinct, join
Actions are used to collect the result from a dataset defined previously. The result is usually an array or a number. They start the computation. Examples: reduce, collect, count, first, take
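A minimal sketch of this laziness, assuming a spark-shell session where sc (the SparkContext) is predefined:

val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing computed yet
val doubled = evens.map(_ * 2)             // another transformation, still lazy
val total   = doubled.reduce(_ + _)        // action: triggers the computation
println(total)                             // 60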

Transformations, actions (2)

import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)

[Diagram: elements 1, 2, 3, …, n-1, n flow through the map function (a transformation) and the reduce function (an action); with count = 157018 and n = 200000, 4.0 * count / n ≈ 3.14]

Driver, worker, executor (typically on a cluster)
The same Pi estimation code as before, now run on a cluster.
[Diagram: the Driver (holding the SparkContext) talks to a Cluster Manager; Executors run on Worker nodes and carry out the work]

Stages, tasks
Task: a unit of work that will be sent to one executor
Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (i.e. synchronization points)
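A minimal sketch of how a stage boundary appears (assumes spark-shell with sc): a wide dependency such as reduceByKey requires a shuffle, so the job below runs as two stages, each split into one task per partition:

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs  = words.map(w => (w, 1))        // narrow transformation: stays in the same stage
val counts = pairs.reduceByKey(_ + _)      // shuffle boundary: starts a new stage
counts.collect().foreach(println)          // action: submits the job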

Your job as a directed acyclic graph (DAG)
[Diagram of a Spark job represented as a DAG of stages]
Source: https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/
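The lineage that Spark turns into such a DAG can be inspected from the shell with RDD.toDebugString (a sketch, assuming sc is available; indentation in the output marks shuffle boundaries):

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)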

Monitoring pages
If you can, scroll the console output up to get an application ID (e.g. application_1468968864481_0070) and open http://haperf100.cern.ch:8088/proxy/application_XXXXXXXXXXXXX_XXXX/
Otherwise use the link below to find your application on the list: http://haperf100.cern.ch:8088/cluster
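If scrolling the console output is inconvenient, the application ID can also be read programmatically from the running session (assumes spark-shell with sc; the printed value below only shows the format, not a real ID):

println(sc.applicationId)   // e.g. application_XXXXXXXXXXXXX_XXXX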

Monitoring pages

Data APIs