
Introduction to Apache Spark


1 Introduction to Apache Spark
Nicolas A Perez
Software Engineer at IPC
MapR Certified Spark Developer
Organizer of the Miami Scala Meetup

2 What is Apache Spark?

3 Spark: a General-Purpose Computing Framework
Faster than anything else for large-scale data processing
Runs everywhere
Flexible (SQL, Streaming, GraphX, MLlib)
Easy to use (Scala, Java, Python, and R APIs)
. Daytona Gray Sort Contest (see next slide).
. Can be used for any kind of computation (cluster computing), not only big data.
. Can be deployed on different platforms, so we are not tied to any platform in particular.
. It is used for different workloads: SQL, ML, GraphX, Streaming.
. Built on the functional programming paradigm, but it also supports more classic approaches.

4 Daytona Gray Sort Contest
. In the Daytona Gray Sort Contest, Spark sorted the data 3X faster using 10X fewer machines than the previous Hadoop MapReduce record.

5 General Purpose
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
. Monte Carlo method for Pi.
. An approximation of Pi running on a cluster.
. The bigger NUM_SAMPLES is, the better the approximation we can get. But the higher this number is, the more power we need to compute all those samples!
. We need cluster computing to get good approximations.

6 The Stack
. Spark Components:
. SQL for standard SQL access to any data source, and to integrate different data sources.
. Streaming for streaming data (e.g., the IoT).
. MLlib to create apps that learn from the data and predict the future.
. GraphX and graph data frames.

7 The Spark Context (sc)
The SparkContext class tells Spark how to access the cluster.
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
. The SparkContext is how we tell Spark how to communicate with the environment where it runs.
. The Context controls logging, serialization, fault tolerance, etc.
. Only one SparkContext can be active at a time.
. The Context can be created locally for testing purposes (see the sketch below).
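. A minimal sketch (not from the slides) of a locally-created context for testing; the app name and the local[*] master are illustrative choices:
import org.apache.spark.{SparkConf, SparkContext}
// Local mode: runs Spark inside the current JVM, one worker thread per core.
// Handy for unit tests and quick data exploration.
val conf = new SparkConf()
  .setAppName("local-testing")   // hypothetical application name
  .setMaster("local[*]")
val sc = new SparkContext(conf)
val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
println(distData.reduce(_ + _))  // simple action to verify everything runs
sc.stop()                        // stop it so another context can be created later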

8 Resilient Distributed Datasets (RDD)
Resilient: if one part is lost, it will be recomputed and recovered. RDDs are defined by a DAG (Directed Acyclic Graph); if part of the collection is lost, the DAG for that part of the collection is executed and the data is reconstructed on a different Spark Executor.
Distributed: an RDD is a collection of a type (they are type-annotated), partitioned across the cluster.

9 Transformations on RDDs
map, distinct, join, flatMap, groupByKey, filter, reduceByKey, union, aggregateByKey, intersection, sortByKey
. These transformations do not execute anything on the RDD (our data set).
. They return new RDDs based on the transformation.
. They actually change the DAG to represent the transformations.
. The DAG will be executed later in the application, when an action is invoked (see the sketch below).
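. A small sketch of how chained transformations stay lazy; the sample data and names are made up for illustration:
// Nothing is executed here: each call only adds a step to the DAG.
val numbers = sc.parallelize(1 to 1000)      // illustrative sample data
val evens   = numbers.filter(_ % 2 == 0)     // new RDD, not computed yet
val pairs   = evens.map(n => (n % 10, n))    // still lazy
val sums    = pairs.reduceByKey(_ + _)       // still lazy
// Only this action triggers execution of the whole DAG above.
sums.collect().foreach(println)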

10 Actions on RDDs
reduce, countByKey, take, collect, saveAsTextFile, first, takeSample
. These are actions that cause the DAG to execute, applying the transformations on it.
. Every time an action executes, the entire DAG is executed again. Performance tuning by caching (see the sketch below).
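. A minimal sketch of why caching matters when several actions reuse the same RDD; the file path is hypothetical:
val words = sc.textFile("/tmp/big-file.txt")   // illustrative input path
  .flatMap(_.split(" "))
  .cache()   // keep the computed partitions in memory
// Without cache(), each action below would re-read the file and re-run flatMap;
// with cache(), the second action reuses the partitions materialized by the first.
val total    = words.count()
val distinct = words.distinct().count()
println(s"$distinct distinct words out of $total")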

11 RDD Execution Graph - Logical View
Each transformation creates a new RDD.

12 Physical View
RDD partitions are distributed across the cluster.
. A shuffle redistributes the data, creating high network traffic.
. The execution plan is divided into stages.

13 RDD
RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) — see the sketch below. RDDs are lazy when calling transformations on them. RDDs are represented by Spark as a DAG (for re-computation): an immutable, deterministically re-computable, distributed dataset.
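. A rough sketch of the same RDD API over different storage systems; the URIs are illustrative, and S3 access additionally needs the proper Hadoop/AWS libraries and credentials configured:
val local = sc.parallelize(Seq(1, 2, 3))                     // in-memory collection
val hdfs  = sc.textFile("hdfs://namenode:8020/data/events")  // HDFS directory
val s3    = sc.textFile("s3n://my-bucket/logs/")             // AWS S3 bucket
println(local.count())   // only this local RDD is actually materialized here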

14 Deployment Platforms
YARN, Mesos, AWS EMR, Azure HDInsight, Standalone
. YARN: widely used by Hadoop.
. Allows dynamic resource allocation.
. We can avoid hard-wiring parameters in spark-submit.
. It helps with performance tuning.
. Mesos is similar to YARN but built on different ideas.
. I would say that Mesos is different from YARN in that it can manage the entire data center's resources, while YARN is for Hadoop jobs, as a new version of MapReduce v1.
. AWS EMR is a managed Amazon framework where Spark can be deployed.
. It runs on Amazon-managed infrastructure (EC2 instances).
. Very scalable by nature.
. An interesting deployment choice if the data is stored on Amazon S3.
. Easily integrated with other Amazon resources like CloudWatch, Redshift, DynamoDB, etc.
. HDInsight runs on Windows Azure.
. It has capabilities similar to EMR, but using Azure resources.
. We found the automation of the deployment a little more complicated than with AWS.
. Standalone uses Spark's own scheduler; it can be deployed on a single computer or on a cluster managed by Spark.
. It is not recommended unless we run only Spark applications that do not interact with other Hadoop resources.
. It is very useful when testing and doing quick data exploration (we are going to use it today).

15 The ABC Example

16 Word Counting
val rdd = sc.textFile("path to the file")
val counts: RDD[(String, Int)] =
  rdd.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
The Word Counting app

17 DEMO
. sbt package
. run the example using Spark
. run the same example in shell mode

18 But there is a word count on Hadoop!

19 . A very complicated job for a very simple operation.
. If we look deeper at it, it seems like two different jobs (one for mapping and one for reducing).
. Our application domain problems should not be restricted by the framework.

20 Specialized Systems Built Around Hadoop
Specialized systems built around Hadoop. Because they are specialized systems, we need specialized people to maintain them. Normally they are huge systems, very complicated to use, with little flexibility outside of their own focus.
. Spark solves this problem in a different way: instead of building specialized systems, it has specialized libraries that all share the same API.

21 Spark SQL
Allows us to represent our data in a tabular format.
We can run SQL queries on it.
We can easily integrate different sources and run parallel queries on all of them at once.
We can use standard tools that speak SQL to query any kind of data at scale.
. Spark SQL runs on top of Spark Core, with RDDs as the common data abstraction.
. SQLContext and HiveContext (Hive is a data warehouse system for Hadoop).
. Data Frames are the main abstraction.
. They represent our data in tabular format, like a table.
. Using the SQL context we can query the data with SQL or with a language-integrated query similar to LINQ in .NET (see the sketch below).
. We can expose our data sources through an SQL endpoint where Spark SQL works as a distributed SQL engine.
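. A small sketch of both query styles on a DataFrame; the JSON path, table name, and columns are made up for illustration:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Hypothetical input: a JSON file with "name" and "age" fields.
val peopleDF = sqlContext.read.json("/tmp/people.json")
// Plain SQL over a registered temporary table...
peopleDF.registerTempTable("people")
val adultsSql = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
// ...or the language-integrated query API, similar in spirit to LINQ.
val adultsDsl = peopleDF.filter(peopleDF("age") >= 18).select("name")
adultsSql.show()
adultsDsl.show()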

22 Built-in Data Sources
JSON, Parquet, text files, ORC, SerDe, Hive tables, JDBC
. With Spark SQL we can access different data sources, but now using an SQL access style.
. We can join data sets and query them all at once using the same API (see the sketch below).
. Data Frames are optimized for columnar storage access, so we can push filters down into the storage layer itself, gaining huge performance improvements.
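. A rough sketch of mixing two built-in sources in one query; the paths, column names, and JDBC URL are assumptions for illustration:
// Hypothetical Parquet file of (user_id, amount) and a "users" JDBC table of (id, name).
val transactions = sqlContext.read.parquet("/data/transactions.parquet")
val users = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://db-host/sales",   // illustrative connection string
  "dbtable" -> "users"
)).load()
// Different storage systems, one DataFrame API: join and aggregate across them.
val totals = transactions
  .join(users, transactions("user_id") === users("id"))
  .groupBy("name")
  .sum("amount")
totals.show()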

23 Third-Party Data Sources
Cassandra, Impala, Drill, CSV files, other custom sources (read my blog to see how to implement your own).
. The Data Frame API can be extended (read my blog post).

24 Spark SQL Important Abstractions
Data Frames
Data Sets
val people = sqlContext.read.json(path).as[Person]
. Data Frames are not type-annotated.
. Data Sets were built to solve this problem.
. They have type annotations.
. They use optimized encoders.
. They can be optimized by the Catalyst optimizer.

25 Spark Data Set API
Strongly typed tabular representation
Uses a schema for data representation (typed schema)
Encoder optimizations for faster data access (see the sketch below)
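. A brief sketch of the typed Dataset API in the Spark 1.6 style; the Person case class and JSON path are assumptions:
// Hypothetical schema for a JSON file with "name" and "age" fields.
case class Person(name: String, age: Long)
import sqlContext.implicits._   // provides the encoders that .as[Person] needs
// Same read as on the slide, but now the result is typed: Dataset[Person].
val people = sqlContext.read.json("/tmp/people.json").as[Person]
// Fields are checked at compile time, unlike untyped DataFrame columns.
val adultNames = people.filter(p => p.age >= 18).map(_.name)
adultNames.collect().foreach(println)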

26

27

28

29 Spark SQL, a Distributed Query Engine

30 Spark Data Frames
val sc = new SparkContext(config)
val sql = new HiveContext(sc)
val transactionsDF = sql
  .read
  .format("com.nico.datasource.dat")
  .load("~/transactions/")
transactionsDF.registerTempTable("some_table")
————————————————————————
More at:
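. A small follow-up sketch, assuming the temp table registered above, to show how the name can then be queried with plain SQL; the column names are hypothetical:
val recent = sql.sql(
  """SELECT account_id, SUM(amount) AS total
     FROM some_table
     GROUP BY account_id""")
recent.show(10)   // print the first rows of the result DataFrame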

31 Spark Streaming
StreamingContext
Built-in file streaming and raw socket streaming.
Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc.
Can be extended to stream from any source.
Batch processing (micro-batches).
Streams can look back over past data.
Windowed operations, e.g. stream.countByWindow(Seconds(20), Seconds(10))
. The StreamingContext is the entry point to create streams from different input sources (see the sketch below).
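. A minimal sketch of a StreamingContext with a windowed count; the host, port, paths, and durations are illustrative, not from the demo:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Micro-batches of 5 seconds, built on top of the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("/tmp/streaming-checkpoint")   // required by windowed operations
// Hypothetical source: a text stream over a raw socket.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
// Count the words seen in the last 20 seconds, recomputed every 10 seconds.
val counts = words.countByWindow(Seconds(20), Seconds(10))
counts.print()
ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the stream is stopped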

32 Streaming Architecture Overview
In any stream processing system, broadly speaking, there are three steps in processing the data. Receiving the data: The data is received from sources using Receivers or otherwise. Transforming the data: The received data is transformed using DStream and RDD transformations. Pushing out the data: The final transformed data is pushed out to external systems like file systems, databases, dashboards, etc.

33 Streaming Internals DStream: Discretized Stream
A DStream is a continuous sequence of RDDs of the same type; each micro-batch is one RDD in that sequence. Operations on a DStream work the same way they do on RDDs.

34 From DStream to RDD

35 What can you get from Spark Streaming?
Millions of events per second (billions with the right deployment)
A concise API that is used in every other component of Spark
Fault tolerance
Exactly-once semantics out of the box (for DFS and Kafka)
Integration with Spark SQL, MLlib, GraphX
. About 500K events/s with RabbitMQ.
. About 26 million with 2 Kafka brokers; can be scaled out by adding power to Kafka, since it scales better than RabbitMQ.
. Kafka can distribute topics and doesn't have to deal with delivery semantics, which makes it quite a bit faster than RabbitMQ.
. MapR Streams!
. Uses the Kafka API, making the transition painless.
. Billions of events with a 4-node MapR cluster.
. Designed to scale. Can replicate topics and partitions across data centers.
1. At most once: each record will be either processed once or not processed at all.
2. At least once: each record will be processed one or more times. This is stronger than at-most-once as it ensures that no data will be lost, but there may be duplicates.
3. Exactly once: each record will be processed exactly once - no data will be lost and no data will be processed multiple times. This is obviously the strongest guarantee of the three.
Write-ahead logs.

36 Be careful
Not everyone needs streaming.
Processing time must be smaller than the batch time (back pressure; see the sketch below).
You might get out-of-order data.
Applications need fine tuning since they have to run all the time.
You need to plan your deployment strategy carefully.
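. One concrete knob for the back-pressure point above, as a sketch assuming Spark 1.5+; the app name and rate cap are illustrative:
// Let Spark Streaming adjust the ingestion rate when batches start falling behind.
val streamingConf = new SparkConf()
  .setAppName("streaming-app")                          // hypothetical name
  .set("spark.streaming.backpressure.enabled", "true")  // dynamic rate control (Spark 1.5+)
  .set("spark.streaming.receiver.maxRate", "10000")     // optional hard cap, records/sec per receiver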

37 Twitter Streaming Demo?

38

39 It is better if we create our own streaming server!
Python Application that works as a streaming service

40 . Python server that streams data over a socket once a client connects.
. A better example than the Twitter one, because we can see how the server works.
. Cool, since I created my own streaming server.
Why not use Spark's built-in socket receiver?

41 Demo python server: ~/PycharmProjects/socket_server/
client: ~/custom-spark-streaming-receiver/

42 Gotchas
dstream.foreachRDD { rdd =>
  // Connection created on the driver, but used inside rdd.foreach on the executors:
  // it would have to be serialized and shipped, which usually fails at runtime.
  val connection = createNewConnection()
  rdd.foreach { record =>
    connection.send(record)
  }
}

43 Gotchas
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // A connection per record works, but opening and closing one
    // for every single record is very expensive.
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

44 Gotchas
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection per partition: created on the executor that owns
    // the partition and reused for all of its records.
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

45 MLlib & GraphX (to be continued...)
Classifiers: decision trees, Bayesian.
Linear regression.
Clustering: K-means.
FP-Growth for association rules and correlation.
PCA, Principal Component Analysis (an important dimensionality-reduction algorithm used in face recognition).
GraphX: vertices, edges, graph frames. Shortest path (Dijkstra), articulation points.

46 Questions?

