Spark, Shark and Spark Streaming Introduction, Part 2
Tushar Kale, June 2015
This Talk
- Introduction to Shark, Spark and Spark Streaming
- Architecture
- Deployment Methodology
- Performance
- References
Data Processing Stack
- Data Processing Layer
- Resource Management Layer
- Storage Layer
Hadoop Stack
- Data Processing Layer: Hadoop MR, Hive, Pig, HBase, Storm, …
- Resource Management Layer: Hadoop YARN
- Storage Layer: HDFS, S3, …
BDAS Stack
- Data Processing Layer: Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase (on Spark)
- Resource Management Layer: Mesos
- Storage Layer: HDFS, S3, …, Tachyon
How do BDAS & Hadoop fit together?
The BDAS components (Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase) sit alongside the Hadoop ecosystem (Hadoop MR, Hive, Pig, HBase, Storm), sharing the same resource management layer (Mesos, Hadoop YARN) and the same storage layer (HDFS, S3, …, Tachyon).
Apache Mesos
- Enables multiple frameworks to share the same cluster resources (e.g., Hadoop, Storm, Spark)
- Twitter's large-scale deployment: 6,000+ servers, 500+ engineers running jobs on Mesos
- Third-party Mesos schedulers: Airbnb's Chronos, Twitter's Aurora
- Mesosphere: startup to commercialize Mesos
Apache Spark: Distributed Execution Engine
- Fault-tolerant, efficient in-memory storage (RDDs)
- Powerful programming model and APIs (Scala, Python, Java)
- Fast: up to 100x faster than Hadoop
- Easy to use: 5-10x less code than Hadoop
- General: supports interactive & iterative apps
- Two major releases since the last AMP Camp
Spark Streaming: Large-Scale Streaming Computation
- Implements streaming as a sequence of <1 s jobs (see the sketch below)
- Fault tolerant
- Handles stragglers
- Ensures exactly-once semantics
- Integrated with Spark: unifies batch, interactive, and streaming computations
- Alpha release (Spring 2013)
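To make the "sequence of small jobs" model concrete, here is a minimal Scala word-count sketch against the Spark Streaming API; the 1-second batch interval, local master, host, and port are illustrative assumptions, not values from the deck.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]) {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each 1-second batch of input becomes a small Spark job
    val ssc = new StreamingContext(conf, Seconds(1))

    // Assumed source: a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}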
Shark: Hive over Spark
- Full support for HQL and UDFs
- Up to 100x faster than Hive when the input is in memory
- Up to 5-10x faster when the input is on disk
- Running on hundreds of nodes at Yahoo!
- Two major releases alongside Spark
Unified Programming Models
- Unified system for SQL, graph processing, machine learning
- All share the same set of workers and caches
BlinkDB: Trade Query Performance for Accuracy Using Sampling
- Why? In-memory processing doesn't guarantee interactive processing
  - E.g., ~10s of seconds just to scan 512 GB of RAM (at 40-60 GB/s on a 16-core node)
  - The gap between memory capacity and transfer rate keeps increasing: capacity doubles roughly every 18 months, transfer rate roughly every 36 months
BlinkDB Key Insights
- Input is often noisy: exact computations do not guarantee exact answers
- Error is often acceptable if it is small and bounded
- Main challenge: estimating errors for arbitrary computations
- Alpha release (August 2013): allows users to build uniform and stratified samples, and provides error bounds for simple aggregate queries
GraphX: Combine Data-Parallel and Graph-Parallel Computations
- Provides powerful abstractions: PowerGraph and Pregel implemented in less than 20 LOC!
- Leverages Spark's fault tolerance
- Alpha release expected in fall 2013
MLlib and MLbase
- MLlib: high-quality library of ML algorithms; will be released with Spark 0.8 (September 2013)
- MLbase: make ML accessible to non-experts
  - Declarative interface: users say what they want, e.g., classify(data)
  - Automatically picks the best algorithm for the given data and time budget
  - Lets developers easily add and test new algorithms
- Alpha release of MLI, the first component of MLbase, in September 2013
Tachyon: In-Memory, Fault-Tolerant Storage System
- Flexible API, including the HDFS API
- Allows multiple frameworks (including Hadoop) to share in-memory data
- Alpha release (June 2013)
Compatibility with the Existing Ecosystem
- Data Processing Layer: Spark Streaming accepts inputs from Kafka, Flume, Twitter, TCP sockets, …; Shark SQL exposes the Hive API; GraphX exposes the GraphLab API
- Resource Management Layer: Mesos also supports Hadoop, Storm, MPI
- Storage Layer: HDFS, S3, …, Tachyon expose the HDFS API
Summary: BDAS Addresses the Next Big Data Challenges
- Unifies batch, interactive, and streaming computations
- Easy to develop sophisticated applications: supports graph & ML algorithms, approximate queries
- Has witnessed significant adoption: 20+ companies, 70+ individuals contributing code
- Exciting ongoing work: MLbase, GraphX, BlinkDB, …
This Talk
- Introduction to Shark, Spark and Spark Streaming
- Architecture
- Deployment Methodology
- Performance
- References
RDDs
Three methods for creation (illustrated in the sketch below):
- Parallelizing an existing collection
- Referencing a dataset in any storage supported by Hadoop (HDFS, Cassandra, HBase, Amazon S3, others)
- Transforming another RDD
Supported file types: text files, SequenceFiles, anything with a Hadoop InputFormat
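A minimal Scala sketch of the three creation methods, assuming the interactive shell where sc is already defined; the HDFS path is a placeholder.

// 1. Parallelize an existing collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference an external dataset (hypothetical HDFS path)
val lines = sc.textFile("hdfs:///sparkdata/input.txt")

// 3. Create a new RDD by transforming an existing one
val upperLines = lines.map(_.toUpperCase)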
Scala and Python
- Spark comes with two shells: Scala and Python
- APIs are available for Scala, Python and Java, with appropriate versions for each Spark release
- Spark's native language is Scala, so it is most natural to write Spark applications in Scala; this presentation focuses on code examples in Scala
Spark's Scala and Python Shells
- A powerful tool to analyze data interactively
- The Scala shell runs on the Java VM, so it can leverage existing Java libraries
Scala: to launch the Scala shell (from the Spark home directory):
./bin/spark-shell
To read in a text file:
scala> val textFile = sc.textFile("README.txt")
Python: to launch the Python shell (from the Spark home directory):
./bin/pyspark
>>> textFile = sc.textFile("README.txt")
Scala: 'Scalable Language'
- Object-oriented, functional programming language
- Runs in a JVM: Java interoperability
- Functions are passable objects
- Two approaches (compared in the sketch below):
Anonymous function syntax:
x => x + 1
Static methods in a global singleton object:
object MyFunctions {
  def func1(s: String): String = {…}
}
myRdd.map(MyFunctions.func1)
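A small sketch contrasting the two approaches on an RDD; the sample data and the MyFunctions.clean helper are invented for illustration.

val words = sc.parallelize(Seq(" spark ", " shark ", " streaming "))

// Approach 1: anonymous function syntax
val lengths = words.map(w => w.trim.length)

// Approach 2: a static method in a global singleton object
object MyFunctions {
  def clean(s: String): String = s.trim.toUpperCase
}
val cleaned = words.map(MyFunctions.clean)

cleaned.collect().foreach(println)   // SPARK, SHARK, STREAMING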
Code Execution (1)
'spark-shell' provides the Spark context as 'sc'
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
File: sparkQuotes.txt
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
Code Execution (2)
After the first statement, the RDD quotes is defined (same code and input file as above).
Code Execution (3)
The first transformation adds the RDD danQuotes.
Code Execution (4)
The second transformation adds the RDD danSpark.
Code Execution (5)
When the action count() is called, the whole lineage executes: the HadoopRDD reads sparkQuotes.txt, quotes holds the five lines of the file, danQuotes keeps the two lines starting with "DAN", danSpark maps them to their second word ("Spark", "Scala"), and the final filter on "Spark" returns a count of 1.
RDD Transformations
Transformations are lazy evaluations: nothing is executed until an action is called. Each transformation updates the lineage graph and returns a pointer to the new, transformed RDD; the graph is executed only when an action is called. A few of the available transformations (used in the sketch below):
- map(func): return a new dataset formed by passing each element of the source through the function func
- filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
- flatMap(func): similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item
- join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
- reduceByKey(func): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func
- sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order
The full list is documented on Spark's website.
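A short Scala sketch exercising these transformations; the sample datasets are invented for illustration, and nothing runs until an action is called.

val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val labels = sc.parallelize(Seq(("apples", "fruit"), ("pears", "fruit")))

val doubled  = sales.map { case (k, v) => (k, v * 2) }          // map
val nonEmpty = sales.filter { case (_, v) => v > 0 }            // filter
val letters  = sc.parallelize(Seq("ab", "cd")).flatMap(_.toSeq) // flatMap: "ab" -> 'a', 'b'
val totals   = sales.reduceByKey(_ + _)                         // reduceByKey: apples -> 8
val sorted   = totals.sortByKey()                               // sortByKey
val joined   = totals.join(labels)                              // join: (apples,(8,fruit)), ...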
RDD Actions
Actions return values. A few of the available actions (used in the sketch below):
- collect(): return all the elements of the dataset as an array at the driver program; usually useful after a filter or another operation that returns a sufficiently small subset of the data
- count(): return the number of elements in the dataset
- first(): return the first element of the dataset
- take(n): return an array with the first n elements of the dataset (note: currently not executed in parallel; the driver computes the result)
- foreach(func): run the function func on each element of the dataset
The full list is documented on Spark's website.
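A companion sketch for the actions, again with invented data; each call below triggers execution.

val totals = sc.parallelize(Seq(("apples", 8), ("pears", 2)))

val all      = totals.collect()   // Array((apples,8), (pears,2)) at the driver
val n        = totals.count()     // 2
val head     = totals.first()     // (apples,8)
val firstOne = totals.take(1)     // Array((apples,8))
totals.foreach(println)           // runs on the executors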
RDD Persistence
Each node stores in memory any partitions of the cached dataset that it computes, and reuses them in other actions on that dataset (or on datasets derived from it), making future actions much faster (often by more than 10x). Caching is fault tolerant: if any partition is lost, it is automatically recomputed using the transformations that originally created it. There are two methods for RDD persistence: persist(), which lets you specify a storage level, and cache(), which uses the default level (see the sketch below). Storage levels:
- MEMORY_ONLY: store as deserialized Java objects in the JVM; if the RDD does not fit in memory, part of it will be cached and the rest recomputed as needed; this is the default, used by cache()
- MEMORY_AND_DISK: same, except partitions that don't fit in memory are also stored on disk and read from memory and disk when needed
- MEMORY_ONLY_SER: store as serialized Java objects (one byte array per partition); space-efficient, but more CPU-intensive to read, since the RDD must be deserialized before it can be used
- MEMORY_AND_DISK_SER: similar to MEMORY_AND_DISK, but stored as serialized objects
- DISK_ONLY: store only on disk
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as above, but replicate each partition on two cluster nodes
- OFF_HEAP (experimental): store the RDD in serialized format in Tachyon; this reduces garbage-collection overhead and allows executors to be smaller and to share a pool of memory
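A minimal sketch of the two persistence methods; the file path and filters are placeholders.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///sparkdata/logs.txt")

// cache() uses the default storage level, MEMORY_ONLY
val errors = logs.filter(_.contains("ERROR")).cache()

// persist() lets you pick a different storage level explicitly
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK_SER)

errors.count()   // first action computes and caches the partitions
errors.count()   // second action reads from the cache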
Quick Introduction to DataFrames
- Experimental API introduced in Spark 1.3
- A distributed collection of data organized in columns
- Targeted at the Python ecosystem
- Equivalent to tables in databases or data frames in R/Python
- Much richer optimization than any other DataFrame implementation
- Can be constructed from a wide variety of sources and APIs
Create a DataFrame
val df = sqlContext.jsonFile("/home/ned/attendees.json")
df.show()
df.printSchema()
df.select("First Name").show()
df.select("First Name", "Age").show()
df.filter(df("age") > 40).show()
df.groupBy("age").count().show()
Create a DataFrame from an RDD
import sqlContext.implicits._   // needed for toDF()
case class attendees_class(first_name: String, last_name: String, age: Int)
val attendees = sc.textFile("/home/ned/attendees.csv")
  .map(_.split(","))
  .map(p => attendees_class(p(0), p(1), p(2).trim.toInt))
  .toDF()
attendees.registerTempTable("attendees")
val youngppl = sqlContext.sql("select first_name,last_name from attendees where age < 35")
youngppl.map(t => "Name: " + t(0) + " " + t(1)).collect().foreach(println)
SparkContext in Applications
The SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables on that cluster (see the sketch below). In the Spark shell, the SparkContext, sc, is automatically initialized for you. In a Spark program, you must first import some classes and implicit conversions; for our purposes it is enough to know that these three statements are required:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
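A hedged sketch of the three things the slide says a SparkContext creates; the names and values are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf().setAppName("ContextDemo")
val sc = new SparkContext(conf)

val data   = sc.parallelize(1 to 100)          // an RDD
val tenths = sc.accumulator(0)                 // an accumulator (Spark 1.x API)
val lookup = sc.broadcast(Map(1 -> "one"))     // a broadcast variable

data.foreach { x => if (x % 10 == 0) tenths += 1 }
println(tenths.value)       // 10
println(lookup.value(1))    // "one"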
A Spark Standalone Application in Scala
- Import statements
- Transformations and Actions
- SparkConf and SparkContext
This slide (and the reconstruction below) shows how to create and run a standalone application; the application counts the number of lines containing 'a' and the number of lines containing 'b' in a file. Replace YOUR_SPARK_HOME with the directory where Spark is installed. Unlike in the Spark shell, you have to initialize the SparkContext in a program: first create a SparkConf to set up your application's name, then create the SparkContext by passing in the SparkConf object. Next, create the RDD by loading the text file and caching it; since a couple of transformations will be applied to it, caching helps speed things up, especially if the logData RDD is large. Finally, get the values by executing the count action and print them to the console.
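The slide's code is not preserved in this transcript, so below is a reconstruction of the line-count application the notes describe; the exact file name and variable names are assumptions.

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md"   // any text file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()

    // Transformations (filter) followed by actions (count)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}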
Running Standalone Applications
- Define the dependencies (Scala: simple.sbt)
- Create the typical directory structure with the files:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
- Create a JAR package containing the application's code (Scala: sbt package)
- Use spark-submit to run the program
To run the application, you first define the dependencies: in Scala they go in the simple.sbt file, in Java in the pom.xml file. In Python you don't need to define any dependencies for this simple application, but if you use third-party libraries you can handle them with the --py-files argument. Next, place your files in the typical directory structure shown above (Scala and Java only; Python does not need this). Finally, create the JAR package with the appropriate tool and run spark-submit to execute the application (see the sketch below).
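A hedged sketch of what simple.sbt and the build/submit commands might look like for this application; the Scala and Spark version numbers are assumptions for a mid-2015 installation, not values from the deck.

// simple.sbt (assumed versions)
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"

Then, from the project root:

# Package and submit (the JAR path depends on your project name and Scala version)
sbt package
$SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar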
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data. Example (Python):
msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
Lineage: HDFS file -> filter -> Filtered RDD -> map -> Mapped RDD
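From the Scala side, toDebugString makes the tracked lineage visible; a small sketch with a placeholder path:

val textFile = sc.textFile("hdfs:///sparkdata/app.log")
val msgs = textFile.filter(_.startsWith("ERROR"))
                   .map(_.split("\t")(2))

// Prints the chain of RDDs back to the HadoopRDD that reads the file,
// which is what Spark replays to recompute a lost partition
println(msgs.toDebugString)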
Which Language Should I Use?
- Standalone programs can be written in any of the three, but the interactive shell is Python & Scala only
- Python users: can use Python for both
- Java users: consider learning Scala for the shell
- Performance: Java & Scala are faster due to static typing, but Python is often fine
Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7       // type inferred
val y = "hi"    // read-only
Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x   // last line returned
}
Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)   // {3,4,5}
nums.map(x => x + 2)          // same
nums.map(_ + 2)               // same
nums.reduce((x, y) => x + y)  // 6
nums.reduce(_ + _)            // same
Java interop:
import java.net.URL
new URL("…")
More details: scala-lang.org
Spark in Scala and Java
// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
This Talk
- Introduction to Shark, Spark and Spark Streaming
- Architecture
- Deployment Methodology
- Performance
- References
Behavior with Less RAM
Performance
- Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
- Tested with 100 text streams on 100 EC2 instances with 4 cores each
Performance and Generality (Unified Computation Models)
- Streaming (Spark Streaming)
- Interactive (SQL, Shark)
- Batch (ML, Spark)
Example: Video Quality Diagnosis
The query, run over real anonymized data, asks for the top ten ISP/city combinations with the worst video performance. The exact query runs over 17 TB of input and takes 772 seconds. The same query on a 1.7 GB in-memory sample takes 1.78 seconds and returns approximate results with error bounds for 95% confidence intervals, yet the top ten worst performers and their rankings are identical. The approximate query therefore gives essentially the same answer roughly 440x faster.
Latency: 772 sec (17 TB input) vs. 1.78 sec (1.7 GB input)
This Talk
- Introduction to Shark, Spark and Spark Streaming
- Architecture
- Deployment Methodology
- Implementation
- Next Steps
- References
https://amplab.cs.Berkeley.edu/software
fundamentals/
THANK YOU