Fast and Expressive Big Data Analytics with Python


1 Fast and Expressive Big Data Analytics with Python
Matei Zaharia, UC Berkeley / MIT
spark-project.org

2 What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop.
Improves efficiency through:
- In-memory computing primitives
- General computation graphs
Improves usability through:
- Rich APIs in Scala, Java, Python
- Interactive shell
Up to 100× faster (2-10× on disk); often 5× less code.

3 Project History
Started in 2009, open sourced in 2010.
17 companies now contributing code: Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …
Entered the Apache incubator in June; Python API added in February.

4 An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS), which layers Spark Streaming (real-time), GraphX (graph), Shark (SQL), and MLbase (machine learning) on top of the Spark core.
More details: amplab.berkeley.edu

5 This Talk
- Spark programming model
- Examples
- Demo
- Implementation
- Trying it out

6 Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
- Programmability: a tangle of map/reduce functions
- Speed: MapReduce is inefficient for apps that share data across multiple steps (iterative algorithms, interactive queries)

7 Data Sharing in MapReduce
[Diagram: each iteration or query reads its input from HDFS and writes its output back to HDFS, so data goes through disk at every step.]
Each iteration is, for example, a MapReduce job.
Slow due to data replication and disk I/O.

8 What We'd Like
[Diagram: the input is processed once into distributed memory, and subsequent iterations and queries read from memory instead of HDFS.]
Distributed memory is 10-100× faster than network and disk.

9 Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
- Collections of objects that can be stored in memory or on disk across a cluster
- Built via parallel transformations (map, filter, …)
- Automatically rebuilt on failure
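As a quick illustration of the model (a minimal sketch, not from the talk, assuming a SparkContext named sc as provided by the pyspark shell):

# Build an RDD from a local collection, derive new RDDs via transformations,
# and mark one for in-memory caching. Nothing runs until an action is called.
nums = sc.parallelize(range(1000))          # base RDD
evens = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
squares = evens.map(lambda x: x * x)        # another transformation (lazy)
squares.cache()                             # keep results in memory after first use

print(squares.count())   # action: triggers computation, returns 500
print(squares.take(5))   # reuses the cached data: [0, 4, 16, 36, 64]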

10 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()           # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker reads one block of the file, caches its partition of messages in memory, and returns results to the driver.]

Result: scaled to 1 TB of data in 7 sec (vs 180 sec for on-disk data).
Result: full-text search of Wikipedia in 2 sec (vs 30 sec for on-disk data).

11 Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data.

messages = textFile(...).filter(lambda s: "ERROR" in s) \
                        .map(lambda s: s.split("\t")[2])

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
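As an aside (not from the talk): recent PySpark versions let you print this lineage directly. A minimal sketch, assuming a SparkContext named sc and a placeholder input path:

# Recreate the RDD from the slide and print the dependency chain that
# Spark would replay to recompute lost partitions.
messages = (sc.textFile("hdfs://...")                    # placeholder path
              .filter(lambda s: "ERROR" in s)
              .map(lambda s: s.split("\t")[2]))

lineage = messages.toDebugString()                       # lineage as text
print(lineage.decode() if isinstance(lineage, bytes) else lineage)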

12 Example: Logistic Regression
Goal: find a line separating two sets of points.
[Diagram: a cloud of points from two classes, a random initial line, and the target separating line reached after several iterations.]
Note that the dataset is reused on each gradient computation.

13 Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)

Key idea: add "variables" to the "functions" in functional programming.
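For readers who want to run the example end to end, here is a self-contained sketch of the same computation; the point parser readPoint, the dimensionality D, the iteration count, and the input path are illustrative assumptions, not details from the slide:

from collections import namedtuple
from math import exp

import numpy
from pyspark import SparkContext

Point = namedtuple("Point", ["x", "y"])   # x: feature vector, y: label in {-1, +1}
D = 10                                    # assumed dimensionality
ITERATIONS = 20                           # assumed iteration count

def readPoint(line):
    # Assumed input format: D feature values followed by the label, space-separated.
    values = [float(v) for v in line.split()]
    return Point(numpy.array(values[:D]), values[D])

if __name__ == "__main__":
    sc = SparkContext("local", "LogisticRegression")
    data = sc.textFile("points.txt").map(readPoint).cache()  # hypothetical input file

    w = numpy.random.rand(D)  # random initial plane
    for i in range(ITERATIONS):
        # Each iteration computes the gradient over the cached dataset.
        gradient = data.map(
            lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient

    print("Final w: %s" % w)
    sc.stop()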

14 Logistic Regression Performance
[Chart: Hadoop takes about 110 s per iteration; PySpark takes 80 s for the first iteration and about 5 s for each further iteration.]
100 GB of data on 50 m1.xlarge EC2 machines.

15 Demo This will show interactive search on 50 GB of Wikipedia data

16 Supported Operators
map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
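As a small illustration (not from the talk) of a few of these operators, assuming a SparkContext named sc:

# Word count plus a join, to exercise a handful of the operators above.
lines = sc.parallelize(["to be or not to be", "to spark or not to spark"])

counts = (lines.flatMap(lambda l: l.split())      # split lines into words
               .map(lambda w: (w, 1))             # pair each word with 1
               .reduceByKey(lambda a, b: a + b))  # sum counts per word

stopwords = sc.parallelize([("to", True), ("or", True)])
flagged = counts.leftOuterJoin(stopwords)         # (word, (count, True or None))

print(sorted(counts.collect()))
print(flagged.take(3))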

17 Other Engine Features
- General operator graphs (not just map-reduce)
- Hash-based reduces (faster than Hadoop's sort)
- Controlled data partitioning to save communication (sketched below)
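A minimal sketch of that last point (illustrative, not from the talk; assumes a SparkContext named sc): pre-partitioning two pair RDDs the same way lets a later join reuse that partitioning instead of reshuffling both sides.

# Hash-partition both datasets into the same number of partitions up front,
# so the subsequent join can line up keys using the existing partitioning.
users = sc.parallelize([(i, "user%d" % i) for i in range(1000)]) \
          .partitionBy(8).cache()
events = sc.parallelize([(i % 1000, "click") for i in range(10000)]) \
           .partitionBy(8)

joined = users.join(events)   # co-partitioned inputs reduce communication
print(joined.count())         # 10000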

18 Spark Community
1000+ meetup members, 60+ contributors, 17 companies contributing code (including Alibaba and Tencent).
At Berkeley, we have been working on a solution: a software stack for data analytics called the Berkeley Data Analytics Stack (BDAS). The centerpiece of this stack is Spark. Spark has seen significant adoption, with hundreds of companies using it, of which around sixteen have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.

19 This Talk
- Spark programming model
- Examples
- Demo
- Implementation
- Trying it out

20 Overview
Spark core is written in Scala. PySpark calls the existing scheduler, cache, and networking layers through a roughly 2,000-line wrapper, with no changes to Python itself.
[Diagram: your app uses PySpark, which drives a Spark client; each Spark worker runs Python child processes to execute your code.]

21 Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)

22 Object Marshaling
Uses the pickle library for both communication and cached data; pickled byte strings are much cheaper to hold in RAM than live Python objects.
Lambda marshaling library by PiCloud.
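As an aside, PiCloud's marshaling code lives on today as the standalone cloudpickle package (an observation about current packaging, not a claim from the talk). A small sketch of the problem it solves: the standard pickle module refuses to serialize lambdas, while cloudpickle handles them.

import pickle
import cloudpickle  # pip install cloudpickle

double = lambda x: 2 * x

try:
    pickle.dumps(double)              # the standard library refuses lambdas
except Exception as err:
    print("pickle failed:", err)

blob = cloudpickle.dumps(double)      # serializes the function's code object
restored = pickle.loads(blob)         # plain pickle can load what cloudpickle wrote
print(restored(21))                   # 42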

23 Job Scheduler
Supports general operator graphs, automatically pipelines functions, and is aware of data locality and partitioning.
[Diagram: a DAG of RDDs A through G combined via map, groupBy, union, and join, cut into three stages; shaded boxes mark cached data partitions.]
NOT a modified version of Hadoop.

24 Interoperability
Runs in standard CPython, on Linux / Mac.
Works fine with extensions, e.g. NumPy.
Input from the local file system, NFS, HDFS, or S3 (only text files for now).
Works in IPython, including the notebook, and in doctests (see our tests).

25 Getting Started Visit spark-project.org for video tutorials, online exercises, docs Easy to run in local mode (multicore), standalone clusters, or EC2 Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

26 Getting Started
Easiest way to learn is the shell:

$ ./pyspark
>>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
>>> nums.count()
3
>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]

27 Writing Standalone Jobs

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")
    lines = sc.textFile("in.txt")
    counts = lines.flatMap(lambda s: s.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile("out.txt")

28 Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python.
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!

29 Behavior with Not Enough RAM

30 The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS), which layers Spark Streaming (real-time), GraphX (graph), Shark (SQL), and MLbase (machine learning) on top of the Spark core.
More details: amplab.berkeley.edu

31 Performance Comparison
[Charts comparing performance on SQL, streaming, and graph workloads.]

