The Hadoop Stack, Part 3: Introduction to Spark

The Hadoop Stack, Part 3: Introduction to Spark
CSE 40822 – Cloud Computing – Fall 2014
Prof. Douglas Thain, University of Notre Dame

Three Case Studies

Workflow: Pig Latin. A dataflow language and execution system that provides an SQL-like way of composing workflows of multiple Map-Reduce jobs.

Storage: HBase. A NoSQL storage system that brings a higher degree of structure to the flat-file nature of HDFS.

Execution: Spark. An in-memory data analysis system that can use Hadoop as a persistence layer, enabling algorithms that are not easily expressed in Map-Reduce.

References

Matei Zaharia et al., "Spark: Cluster Computing with Working Sets," USENIX HotCloud 2010. http://dl.acm.org/citation.cfm?id=1863103.1863113

Holden Karau et al., Learning Spark: Lightning-Fast Big Data Analytics, O'Reilly 2014. http://shop.oreilly.com/product/0636920028512.do

Apache Spark Documentation. http://spark.apache.org

Overview

The Map-Reduce paradigm is fundamentally limited in expressiveness.
The Hadoop implementation of Map-Reduce is designed for out-of-core data, not in-memory data.
Idea: Layer an in-memory system on top of Hadoop.
Achieve fault tolerance by re-execution instead of replication.

Map-Reduce Limitations

As a general programming model:
It is perfect... if your goal is to make a histogram from a large dataset!
Hard to compose and nest multiple operations.
No means of expressing iterative operations.
Not obvious how to perform operations with different cardinality. (Example: try implementing All-Pairs efficiently.)

As implemented in Hadoop (HDFS):
All datasets are read from disk, then stored back onto disk.
All data is (usually) triple-replicated for reliability.
Optimized for simple operations on a large amount of data.
Java is not a high-performance programming language.

A Common Iterative Pattern in Data Mining

X = initial value
for( i=0; ; i++ ) {
    set S[i+1] = apply F to set S[i]
    value X = extract statistic from S[i+1]
    if( X is good enough ) break;
}

On Board: Implement in Map-Reduce. Can we do better? (A sketch of this loop as driver code follows.)
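To make the cost concrete, here is a minimal Scala sketch of the loop above. It is not from the slides, and applyF, extractStatistic, and goodEnough are hypothetical stand-ins. The key point: when applyF is realized as a Map-Reduce job, every iteration must read S[i] from HDFS and write S[i+1] back to HDFS, even though only the final result matters.

def iterate[S, X](initial: S,
                  applyF: S => S,            // one full Map-Reduce job per call
                  extractStatistic: S => X,
                  goodEnough: X => Boolean): S = {
  var s = initial                            // S[0]
  var x = extractStatistic(s)
  while( !goodEnough(x) ) {
    s = applyF(s)                            // disk in, disk out, on every pass
    x = extractStatistic(s)
  }
  s
}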

The Working Set Idea

Peter Denning, "The Working Set Model for Program Behavior," Communications of the ACM, May 1968. http://dl.acm.org/citation.cfm?id=363141

Idea: conventional programs on one machine generally exhibit a high degree of locality, returning to the same data over and over again. The entire operating system, virtual memory system, compiler, and microarchitecture are designed around this assumption! Exploiting this observation makes programs run 100X faster than simply using plain old main memory in the obvious way.

(But in Map-Reduce, access to all data is equally slow.)

The Working Set Idea in Spark

The user should identify which datasets they want to access.
Load those datasets into memory, and use them multiple times.
Keep newly created data in memory until explicitly told to store it.

Master-Worker architecture: the master (driver) contains the main algorithmic logic, and the workers simply keep data in memory and apply functions to the distributed data. The master knows where data is located, so it can exploit locality.

The driver is written in a functional programming language (Scala), so let's detour to see what that means.

Detour: Pure Functional Programming

Functions are first-class citizens: the primary means of structuring a program. A function need not have a name! A function can be passed to another function as a value. A pure function has no side effects.

In a pure functional programming language like LISP:
There are no variables, only values.
There are no side effects, only values.

Hybrid languages have functional capabilities but do not prohibit non-functional idioms: Scala, F#, JavaScript...

By the way, Map-Reduce is inspired by LISP:

map( (lambda(x) (* x x)) (1 2 3 4) )
reduce( (lambda(x y) (+ x y)) (1 2 3 4) )
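For comparison (not on the slide), here are the same two calls written in Scala, the language used for Spark drivers. This is plain standard-library code, not Spark:

val squares = List(1, 2, 3, 4).map( x => x * x )          // List(1, 4, 9, 16)
val total   = List(1, 2, 3, 4).reduce( (x, y) => x + y )  // 10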

Functions in Scala

Define a function in the ordinary way: def name (arguments) { code }
Construct an anonymous function as a value: ( arguments ) => code
Accept an anonymous function as a parameter: name: ( arguments ) => code

Example code:

def oncePerSecond(callback: () => Unit) {
  while( true ) { callback(); Thread sleep 1000 }
}

def main(args: Array[String]) {
  oncePerSecond( () => println("time flies like an arrow...") )
}

A Scala Tutorial for Java Programmers: http://www.scala-lang.org/docu/files/ScalaTutorial.pdf

Parallel Operations in Scala

val n = 10;
for( i <- 1 to n ) {
  // run code for each value of i in parallel
}

var items = List(1,2,3);
for( i <- items ) {
  // run code for each value of i in parallel
}
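One caveat worth noting (not on the slide): a plain Scala for-comprehension actually runs sequentially; it is Spark's distributed datasets that make loops like these parallel across a cluster. On a single machine, one concrete way to get parallel iteration is Scala's parallel collections, as in this minimal sketch:

// On Scala 2.13+ this import comes from the scala-parallel-collections
// module; on Scala 2.12 and earlier, .par is built in and needs no import.
import scala.collection.parallel.CollectionConverters._

val items = List(1, 2, 3)
for( i <- items.par ) {
  // the loop body now runs in parallel across the available cores
  println(i * i)
}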

Back to Spark, Using Scala

A program to count all the error lines in a large text file, written with Scala's underscore shorthand:

val file  = spark.textFile("hdfs://path/to/file");
val errs  = file.filter( _.contains("ERROR") );
val ones  = errs.map( _ => 1 );
val count = ones.reduce( _+_ );

The same program with explicit anonymous functions:

val file  = spark.textFile("hdfs://path/to/file");
val errs  = file.filter( (x) => x.contains("ERROR") );
val ones  = errs.map( (x) => 1 );
val count = ones.reduce( (x,y) => x+y );

_ means "the default thing that should go here."

On Board: Implement in Spark.
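In the HotCloud paper cited above, this example continues with a cache() call, which is where the working-set idea pays off. A sketch (the second "timeout" query is a made-up follow-up for illustration):

val cachedErrs = errs.cache();   // ask Spark to keep errs in memory after first use
// The first action computes errs and caches it; later actions reuse the
// in-memory copy instead of rereading and refiltering the HDFS file.
val count1 = cachedErrs.count();
val count2 = cachedErrs.filter( _.contains("timeout") ).count();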

Logistic Regression in Spark

val points = spark.textFile( … ).map(parsePoint).cache()
var w = Vector.random(D)
for( i <- 1 to ITERATIONS ) {
  val grad = spark.accumulator( new Vector(D) )
  for( p <- points ) {
    val s = (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
    grad += s * p.x
  }
  w -= grad.value
}
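Two details in this example carry the whole performance story: points.cache() keeps the parsed dataset in memory, so each of the ITERATIONS passes reads it from RAM rather than rereading HDFS, and the accumulator lets the workers send back only the small gradient vector while the driver ships out only the updated weight vector w each round.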

Fault Tolerance via Recomputation (Work out on the board.)
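(A sketch of the board discussion, following the Spark papers cited in the References: each dataset in Spark records the lineage of transformations that produced it, e.g. textFile -> filter -> map in the error-counting program. If a worker crashes and a partition of errs is lost, the driver simply reapplies the filter to the corresponding partition of the input file to rebuild that one piece; nothing else is redone. Fault tolerance therefore costs nothing until a failure actually occurs, in contrast to paying for triple replication up front.)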

Result: Spark is 10-100X faster than Hadoop on equivalent iterative problems, because it keeps the working set in memory instead of on disk.