
Introduction to Apache Spark

Presentation on theme: "Introduction to Apache Spark"— Presentation transcript:

1 Introduction to Apache Spark
Last updated for Spark 2.0 on July 29, 2016. IBM Analytics © 2016 IBM Corporation

2 Spark for Big Data Analytics
Spark is often considered a new, faster, and more advanced engine for Big Data analytics that could soon overtake Hadoop as the most widely used Big Data tool. In fact, there is already a visible trend for many businesses to opt for Spark rather than Hadoop in their daily data processing activities. Undoubtedly, Spark has several selling points that make it a more attractive alternative to the somewhat complicated, and sometimes clunky, Hadoop:
It can reduce processing time by up to 100 times when run in memory.
It can use a variety of data sources, including standard relational databases.
It is a very flexible tool that can run as a standalone application but can also be deployed on top of Hadoop and HDFS.
It is easy to use and appeals to a large number of developers and data scientists because it supports multiple popular programming languages: Java, Python, Scala, and, of course, R.

3 Spark Background Since 2012, the Apache Spark project has grown from a university research project, initiated by Matei Zaharia at the University of California, Berkeley, into the largest open-source Big Data solution, with hundreds of contributors and a vibrant community of active users. Over the years it has evolved into a general-purpose engine that supports not only simple processing of massive datasets but also lets users carry out complex analytics: a large selection of built-in machine learning algorithms through its MLlib library, graph analysis through the GraphX component, SQL querying through the Spark SQL library, and streaming data applications through Spark Streaming.

4 Key Reasons for the Interest in Spark
Performant: in-memory architecture greatly reduces disk I/O; up to 100x faster for common tasks.
Productive: concise and expressive syntax, especially compared to prior approaches; a single programming model across a range of use cases and steps in the data lifecycle; integrated with common programming languages (Java, Python, Scala, R); new tools continually reduce the skill barrier for access (e.g. SQL for analysts).
Leverages existing investments: works well within the existing Hadoop ecosystem.
Improves with age: a large and growing community of contributors continuously improves the full analytics stack and extends its capabilities.

Note that this is a marketing slide, and we can correct some of the perception it gives. Performance: the "up to 100x faster" claim is not qualified: faster than what? It turns out that this figure comes from the home page of the Apache Spark site, where it is compared to Hadoop MapReduce. We should not set this type of expectation, since it is based on one very specific, highly iterative benchmark. We don't need to sell Spark; what is important is to demonstrate our commitment to it. It would be much better to tell customers that they should see performance improvements compared to MapReduce, but never to promise those multipliers. Productivity: one of the reasons for increased productivity is that Spark comes with a set of classes and methods that allow a programmer to get to the desired result faster; MapReduce did not have those pre-built capabilities. Note that having pre-built capabilities does not remove the learning curve to become competent with the product. Finally, mentioning "SQL" as part of the new tools speaks to the state of open-source projects: they are mainly for technologists (programmers), and a lot more needs to be done to make them easy to use. Leverages existing investments: you can still use your Hadoop (BigInsights) environment. Improves with age: Spark is the largest and most active Apache project; as we'll see later, the product has evolved quickly and will certainly continue to do so for quite a while.

5 Motivation for Apache Spark
Traditional approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O.
[Diagram: Input → HDFS Read → Iteration 1 (CPU/Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU/Memory) → HDFS Write → Result]

The single most defining attribute of Spark is its departure from slow I/O operations against spinning disks. For traditional Hadoop/MapReduce, the slow performance of disks for frequent read and write operations is a significant, speed-hampering bottleneck. MapReduce jobs depend heavily on high throughput of read and write requests; every time the system needs to perform a look-up or a commit to a spinning disk, there is some latency associated with that process. Over time, the cumulative effect of this bottleneck on performance can be significant; more importantly, it limits the types of use cases and data streams that traditional Hadoop distributions have been able to work with. To illustrate, consider a complex MapReduce job that first performs an HDFS look-up from a spinning disk. Every job, task, or "iteration" (we can use these terms interchangeably) begins with a read from Hadoop (HDFS) storage. That data is transferred over the network and is swapped between the cluster's CPU(s) and memory (RAM) while the MapReduce result is calculated. At the conclusion of the job, that result must be written back to disk: an HDFS write operation is performed, incurring the I/O penalty just described. Note that every iteration requires an HDFS read and an HDFS write step, and the same pattern of operations repeats whenever multiple iterations are "chained" together.

6 Motivation for Apache Spark
Traditional approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O. Solution: keep more data in-memory with a new distributed execution engine.

Spark's solution is to remove disk storage from the execution framework altogether. Ultimately, you may want to persist the final product of your computations to disk, or perhaps you won't require this at all; it is up to the programmer to decide what best fits the use case and whether to persist the results for long-term storage. The key is that the execution and processing framework excludes these performance-hampering reads and writes from spinning disks altogether. Once data has been loaded into Spark (for example, the initial load just prior to Iteration 1), data is swapped between the CPU(s) and memory (RAM) as results are calculated. The output of that iteration, however, does not need to be written back to disk; it can be stored temporarily (cached) in memory. Where Spark is truly unique is its ability to chain together multiple jobs (iterations) without ever needing to write or read data from disk: the output of Iteration 1 can be piped directly as input into Iteration 2, and so on. This is incredibly powerful and performant, and it opens the door to the types of use cases Spark is able to address: interactive query, micro-batch streaming, and more are now possible as a result of bypassing Hadoop's disk dependency.
[Diagram: zero read/write disk bottleneck; chain job output into new job input; Input → HDFS Read → Iteration 1 (CPU/Memory) → Iteration 2 (CPU/Memory); memory is faster than network and disk]
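To make the in-memory idea concrete in the language used later in this deck (R with sparklyr), here is a minimal, hedged sketch: it assumes the Spark connection sc created on the "Initializing the environment" slide, and it uses sparklyr's tbl_cache() to pin a table in executor memory so repeated "iterations" over it avoid going back to disk.
library(sparklyr)
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)  # one initial load into Spark
tbl_cache(sc, "mtcars")                                        # keep the table in executor memory
# two "iterations" over the same cached data; neither one re-reads from disk
avg_by_cyl  <- mtcars_tbl %>% group_by(cyl)  %>% summarise(avg_mpg = mean(mpg)) %>% collect()
avg_by_gear <- mtcars_tbl %>% group_by(gear) %>% summarise(avg_mpg = mean(mpg)) %>% collect()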

7 Spark Common Use Cases

Interactive query: enterprise-scale data volumes accessible to interactive query for business intelligence (BI); faster time to job completion allows analysts to ask the "next" question about their data and business.
Large-scale batch: data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.); nightly ETL processing from production systems.
Complex analytics: forecasting vs. "nowcasting" (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks); data mining across various types of data.
Event processing: web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time; responsive monitoring of RFID-tagged devices.
Model building: predictive modeling answers questions of "what will happen?"; self-tuning machine learning, continually updating algorithms, and predictive modeling.

One of the productivity aspects of Spark is its ability to be used across multiple programming models. This slide should be self-explanatory, but a few comments: "Nowcasting" is relative to what people are used to; if people are used to getting results in one hour and can now get them in 10 minutes, that could be seen as nowcasting. Event processing is still based on batch processing; it simply uses smaller batches (micro-batches) to achieve a semblance of streaming analytics, and latency is one key criterion in deciding whether this is appropriate. Model building refers to the Spark ML and Spark MLlib libraries, which include multiple popular machine learning algorithms.

8 Common Applications for Spark
Spark is implemented inside many types of products, across a multitude of industries and organizations. As the survey results show, the most common use cases for Spark range from business intelligence and data warehousing, used to provide business insight, to deeper analytics such as building recommendation systems (which often leverage machine learning) and log processing for areas like a 360-degree customer view. Other common uses combine multiple data sources, such as clickstream data, social media data, and customer history, to provide real-time customization that presents an online consumer with offers that appeal to their interests. Source: 2015 Databricks Spark Survey

9 SPARK – The Analytics Operating System
"Enabling New Classes of Intelligent Applications Embedded with Analytics"
Spark unifies data, enabling real-time insights.
Spark processes and analyzes data from any data source.
Spark is complementary to Hadoop, but faster with in-memory performance.
Build models quickly. Iterate faster. Apply intelligence.

Bottom line: IBM sees Spark as the analytics operating system.

10 Spark Analytics

SparkML toolkit: Classification (SparkLinearSVM, SparkNaiveBayes); Clustering (SparkClusteringKMeans); Collaborative filtering (SparkCollaborativeFilteringALS); Regression (SparkIsotonicRegression, SparkLinearRegression, SparkLogisticRegression); Tree (SparkDecisionTree, SparkEnsembleGradientBoostedTrees, SparkEnsembleRandomForest).
Streaming scoring: the model can be refreshed.
Data exchange: file system / HDFS, message queues.
Co-existence: use the same cluster for both.

Spark is complementary to IBM Streams. Spark is, at its core, a batch system that operates on data-at-rest. IBM Streams can be used to interface with the required enterprise data sources and prepare the data before putting it in an enterprise repository such as HDFS. From there, Spark can be used to do the required analytics. One of those analytics could be to create machine learning models that can be deployed in IBM Streams to score the incoming streaming data. IBM Streams can also complement Spark Streaming by providing an interface to more data sources and the ability to pre-process and aggregate the data so Spark Streaming can keep up with it. In brief, IBM Streams complements both Spark and Spark Streaming by integrating and adding analytics capabilities that can be essential in streaming analytics.

11 Spark Programming Languages
Scala: functional programming; Spark is written in Scala; Scala compiles into Java byte code.
Java: new features in Java 8 (lambda expressions) make for more compact coding.
Python: always a bit behind Scala in functionality, but becoming the language of choice.
R

Language usage among Spark users:
          2014     2015   2016
Scala      84%      71%    65%
Java       38%      31%    29%
Python     n/a      58%    62%
R        unknown    18%    20%

From Matei Zaharia's presentation at IBM on Nov 2, 2015, and the Big Analytics Blog, Sept 27, 2016. See:

The Databricks surveys also show that programming language usage is changing. This is not to say that Scala is becoming less popular; it is more likely that the people adopting Spark come increasingly from a data science background, which favors the use of Python and R. A word of caution: Spark is written in Scala, a language that compiles into Java byte code, which means that any functionality added to Spark is automatically available in Scala (and likely Java). The same is not true for Python and R. On the bright side, the functionality gap between the languages is shrinking; we may get to a point where the differences won't matter. Surveys done by Databricks, summer 2015 and 2016.

12 Web-Based Notebooks

Notebooks: an "interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media".
Zeppelin: Apache incubator project; current version: (Oct 15, 2016); supports multiple interpreters (Scala, Python, SparkSQL).
Jupyter (part of the Bluemix Apache Spark Service and DSX): based on IPython; current version: (Dec 21, 2016); supports multiple interpreters (Python, Scala, and R).
Definition taken from:

The Spark community introduced (re-introduced?) the concept of the notebook. These notebooks provide a much more compelling environment than using, say, the Spark shell. The leading notebooks appear to be the Apache incubator project Zeppelin and the IPython/Jupyter notebook. Currently the Bluemix Apache Spark service comes with the Jupyter notebook. The IPython-based notebooks are popular since they are already known by Python users. Still, Zeppelin provides compelling functionality, and it is easy to imagine Spark environments where we could have a choice of notebook.

13 Spark Application Architecture
A Spark application is initiated from a driver program.
Spark execution modes: standalone with the built-in cluster manager; using Mesos as the cluster manager; using YARN as the cluster manager; standalone cluster on Amazon EC2.

Spark is not tied to Hadoop. It can be installed as a stand-alone product or be used in conjunction with Mesos or YARN. The key to Spark processing is still to have a distributed file system, so that file operations, including I/O, can be done in parallel. A Spark application is initiated by a driver program that eventually uses executors to execute the tasks that represent the processing of a directed graph. More on directed graphs in a few slides.
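As a small illustration of these execution modes, the sparklyr connection call used later in this deck only changes its master argument; a hedged sketch follows (the yarn-client and Mesos master values are placeholders that depend on your cluster):
library(sparklyr)
sc <- spark_connect(master = "local")                # standalone on a single machine
# sc <- spark_connect(master = "yarn-client")        # YARN as the cluster manager
# sc <- spark_connect(master = "mesos://host:5050")  # Mesos as the cluster manager (placeholder URL)
spark_disconnect(sc)                                 # close the connection when done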

14 Spark Streaming: a Component of Spark

Project started in 2012; first alpha release in spring 2013; out of alpha with Spark 0.9.0.
Discretized Stream (DStream) programming abstraction, represented as a sequence of RDDs (micro-batches); each RDD is the set of records for a specific time interval.
Supports Scala, Java, and Python (with limitations).
Fundamental architecture: batch processing of datasets.

Spark Streaming is an extension of the core Spark API. The project was started in 2012 and released as alpha in the spring of 2013; it came out of alpha with the Spark 0.9.0 release in February 2014. It defines a new abstraction: Discretized Streams, or DStreams. A DStream represents a sequence of RDDs, where each RDD is a collection of records that accumulated within a specific time interval. Just like Spark, it supports Scala, Java and, with limitations, Python. We have to understand that Spark Streaming is a construct that depends on Spark: it consists of the submission of Spark jobs at specific intervals. In this context, as the submission rate increases, the submission overhead becomes a higher percentage of the overall processing load. We'll come back to this as we dig deeper into Spark Streaming.

15 Current Spark Streaming I/O
Input sources: Kafka, Flume, Twitter, ZeroMQ, MQTT, TCP sockets.
Basic sources: sockets, files, Akka actors. Other sources require receiver threads.
Time-based stream processing requires application-level adjustments, adding to complexity.

Spark Streaming may depend on Spark to do the processing, but it needs interfaces to the outside world. Currently, several data sources are supported, as listed on the slide. There is a distinction between basic data sources and other sources: in the case of a source such as a file, Spark Streaming can read it directly, while for other data sources an execution thread must be allocated for each instance. Output targets are supported through Spark, with a few functions already defined for specific targets, plus the foreachRDD operation, which can be used to process each micro-batch (RDD) and send its records through a standard programming interface.

16 Structured Streaming

New streaming engine built on the Spark SQL engine.
Alpha release in Spark 2.0, and still alpha (experimental).
Reduces programming differences between Spark Streaming and Spark.
Includes improvements on state maintenance.
Somewhat early to evaluate; still the micro-batch paradigm.
[Diagram: the Spark stack: Spark Core with Spark SQL, Spark Streaming, Structured Streaming, MLlib (machine learning), and GraphX (graph processing) layered on top]

This is a new interface that was introduced in Spark 2.0. It provides some improvements over Spark Streaming's DStreams, but due to the current lack of functionality it is too early to really evaluate it. An important thing to note is that it is still based on micro-batches, so it is still not a true streaming engine.

17 SparkSQL

Provides relational queries expressed in SQL, HiveQL, and Scala; SQL queries mix seamlessly with Spark programs.
DataFrame/Dataset provide a single interface for efficiently working with structured data, including Apache Hive, Parquet, and JSON files.
Leverages the Hive frontend and metastore: compatibility with Hive data, queries, and UDFs; HiveQL limitations may apply.
Not ANSI SQL compliant; little to no query rewrite optimization, automatic memory management, or sophisticated workload management.
Graduated from alpha status with Spark 1.3; the DataFrames API is marked as experimental.
Standard connectivity through JDBC/ODBC.
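Since the code examples in this deck are in R, here is a hedged sketch of mixing SQL with a Spark program through sparklyr's DBI interface (it assumes the connection sc from the setup slide later in the deck; copy_to registers the data frame as a temporary table that SQL can reference by name):
library(sparklyr)
library(dplyr)
library(DBI)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)   # registers temp table "mtcars"
# a plain SQL query over the same data; the result comes back as an ordinary R data.frame
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")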

18 Spark Machine Learning
Spark ML is Spark's machine learning library.
The RDD-based package spark.mllib is now in maintenance mode; the primary API is now the DataFrame-based package spark.ml. Feature parity for spark.ml is estimated for Spark 2.2.
Provides common algorithms and utilities: classification, regression, clustering, collaborative filtering, dimensionality reduction.

Basically, Spark ML provides you with a toolset to create "pipelines" of different machine-learning-related transformations on your data. It makes it easy, for example, to chain feature extraction, dimensionality reduction, and the training of a classifier into one model, which as a whole can later be used for classification. MLlib is older and has been in development longer, so it has more features.
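The pipeline idea can be approximated from R with sparklyr; below is a minimal, hedged sketch that chains a feature-engineering step (computed in Spark via dplyr) into an ml_* estimator. The derived column power_to_weight is purely illustrative, and the full Pipeline API itself lives in the Scala/Python spark.ml package rather than in this R interface. It assumes the sc connection from the setup slide.
library(sparklyr)
library(dplyr)
model <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE) %>%
  mutate(power_to_weight = hp / wt) %>%                        # feature step, executed in Spark
  ml_logistic_regression(response = "am",                      # estimator step (spark.ml under the hood)
                         features = c("power_to_weight", "wt"))
summary(model)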

19 Spark MLlib Algorithms

From version 2.1.0, Spark MLlib includes:
Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, streaming significance testing, random data generation.
Clustering: K-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, bisecting k-means, streaming k-means.
Frequent pattern mining: FP-growth, association rules, PrefixSpan.
Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS).
Classification and regression: linear models, Naïve Bayes, decision trees, ensembles of trees, isotonic regression.
Collaborative filtering: alternating least squares.
Dimensionality reduction: singular value decomposition, principal component analysis.
Others: evaluation metrics, PMML model export, feature extraction and transformation.

20 Spark ML Algorithms

From version 2.1.0, Spark ML includes:
Classification and regression: logistic regression, decision tree classifier, random forest (classification/regression), gradient-boosted trees, multilayer perceptron classifier, one-vs-rest classifier, linear regression, generalized linear regression, Naïve Bayes, decision trees (classification/regression), survival regression, isotonic regression.
Clustering: K-means, Gaussian mixture, latent Dirichlet allocation, bisecting k-means.
Collaborative filtering: alternating least squares.
Feature extractors: TF-IDF, Word2Vec, CountVectorizer.
Feature transformers: Tokenizer, StopWordsRemover, N-gram, Binarizer, PCA, PolynomialExpansion, StringIndexer, and more.
Feature selectors: VectorSlicer, RFormula, ChiSqSelector.

21 Spark GraphX Flexible Graphing

GraphX unifies ETL, exploratory analysis, and iterative graph computation. You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API.

GraphX is Apache Spark's API for graphs and graph-parallel computation. In addition to a highly flexible API, GraphX comes with a variety of graph algorithms. In short, GraphX gives you a graph representation on which you can calculate things like the shortest path, for example the "six degrees" of connections between people on LinkedIn or Facebook.

22 Sparklyr: Using Spark with R Markdown
sparklyr provides bindings to Spark's distributed machine learning library. In particular, sparklyr allows you to access the machine learning routines provided by the spark.ml package. Together with sparklyr's dplyr interface, you can easily create and tune machine learning workflows on Spark, orchestrated entirely within R.
sparklyr provides three families of functions that you can use with Spark machine learning:
Machine learning algorithms for analyzing data (ml_*)
Feature transformers for manipulating individual features (ft_*)
Functions for manipulating Spark DataFrames (sdf_*)
Spark's machine learning library can be accessed from sparklyr through the ml_* set of functions. For details:

23 Sparklyr and dplyr as an R Interface for Spark ML
Interactively manipulate Spark data using both dplyr and SQL (via DBI).
Filter and aggregate Spark datasets, then bring them into R for analysis and visualization.
Orchestrate distributed machine learning from R using either Spark ML or the H2O Sparkling Water extension.
Integrated support for establishing Spark connections and browsing Spark DataFrames within the RStudio IDE.
IBM has incorporated sparklyr into its Data Science Experience as well.

24 Analytic Workflow with dplyr
Perform SQL queries through the sparklyr dplyr interface.
Use the sdf_* and ft_* families of functions to generate new columns or partition your data set.
Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data.
Inspect the quality of your model fit, and use it to make predictions with new data.
Collect the results for visualization and further analysis in R.

25 Spark ML Machine Learning Library
Spark's machine learning library can be accessed from sparklyr through the ml_* set of functions:

ml_kmeans: K-Means Clustering
ml_linear_regression: Linear Regression
ml_logistic_regression: Logistic Regression
ml_survival_regression: Survival Regression
ml_generalized_linear_regression: Generalized Linear Regression
ml_decision_tree: Decision Trees
ml_random_forest: Random Forests
ml_gradient_boosted_trees: Gradient-Boosted Trees
ml_pca: Principal Components Analysis
ml_naive_bayes: Naive-Bayes
ml_multilayer_perceptron: Multilayer Perceptron
ml_lda: Latent Dirichlet Allocation
ml_one_vs_rest: One vs Rest
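A hedged usage sketch for one of these routines, ml_kmeans (argument names follow the sparklyr documentation of the time and may differ slightly between versions; it assumes the sc connection from the setup slide):
library(sparklyr)
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
kmeans_model <- mtcars_tbl %>%
  select(wt, mpg) %>%          # cluster cars on weight and fuel consumption only
  ml_kmeans(centers = 3)
print(kmeans_model)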

26 Manipulating Data with dplyr
dplyr is an R package for working with structured data both in and outside of R. dplyr makes data manipulation for R users easy, consistent, and performant. With dplyr as an interface for manipulating Spark DataFrames, you can:
Select, filter, and aggregate data
Use window functions (e.g. for sampling)
Perform joins on DataFrames
Collect data from Spark into R

You can read data into Spark DataFrames using the following functions:
spark_read_csv: reads a CSV file and provides a data source compatible with dplyr
spark_read_json: reads a JSON file and provides a data source compatible with dplyr
spark_read_parquet: reads a Parquet file and provides a data source compatible with dplyr

Regardless of the format of your data, Spark supports reading from a variety of data sources, including data stored on HDFS (hdfs:// protocol), Amazon S3 (s3n:// protocol), or local files available to the Spark worker nodes (file:// protocol).
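A short, hedged sketch of these verbs and readers (the CSV path and its column names are placeholders; point spark_read_csv at a real file on HDFS, S3, or a local path):
library(sparklyr)
library(dplyr)
flights_tbl <- spark_read_csv(sc, name = "flights", path = "hdfs:///data/flights.csv")  # placeholder path
flights_tbl %>%
  filter(distance > 1000) %>%                 # filtering and aggregation run inside Spark
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  collect()                                   # bring only the small result back into R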

27 Initializing the environment from Rstudio
# install and load the packages
install.packages("sparklyr")
install.packages("dplyr")
library(sparklyr)
library(dplyr)
# install a local copy of Spark and open a connection
# (use the same Spark version in both calls; master = "local" also works for a laptop install)
spark_install(version = "1.6.2")
sc <- spark_connect(master = "yarn-client", version = "1.6.2")
sc
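A couple of optional follow-up calls to sanity-check the session (a hedged sketch: copy_to pushes a small built-in data set to Spark, src_tbls lists what the connection can see, and spark_disconnect closes it when you are done):
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
src_tbls(sc)
# spark_disconnect(sc)   # run this only once you are finished with the examples that follow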

28 Spark ML example: linear regression
Uses ml_linear_regression to fit a linear regression model. We can demo it with the built-in mtcars dataset. We will attempt to predict a car's fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). We'll assume in each case that the relationship between mpg and each of our features is linear.

29 Code for the linear regression
# copy mtcars into Spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training' and 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit

Reference:

30 Results of linear regression with ML
> summary(fit)
Call: ml_linear_regression(., response = "mpg", features = c("wt", "cyl"))

Deviance Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  ***
wt                                           *
cyl
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-Squared:
Root Mean Squared Error: 1.422

One difference is that the Spark output does not contain the original data (for good reason!), which means that there is no plot method defined on the object:
# DOES NOT WORK!
# plot(fit)
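As a hedged next step after summary(fit), you can score the held-out partition and collect the predictions into R for plotting. sdf_predict was sparklyr's scoring helper at the time (newer releases rename it ml_predict, and the argument order has varied between versions); the sketch assumes the fit and partitions objects from the previous slides:
pred <- sdf_predict(fit, partitions$test) %>%   # adds a "prediction" column to the test data
  select(mpg, prediction) %>%
  collect()                                     # bring the predictions back into R
head(pred)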

31 Links and tutorials: R and SparkML

