Spark 1.1 and Beyond Patrick Wendell.

Presentation transcript:

Spark 1.1 and Beyond Patrick Wendell

About Me
  Work at Databricks leading the Spark team
  Spark 1.1 release manager
  Committer on Spark since AMPLab days

This Talk
  Spark 1.1 (and a bit about 1.2)
  A few notes on performance
  Q&A with myself, Tathagata Das, and Josh Rosen

A Bit about Spark…
  Spark Streaming (real-time): DStreams, streams of RDDs
  GraphX (graph, alpha): RDD-based graphs
  MLlib (machine learning): RDD-based matrices
  Spark SQL: RDD-based tables
  Spark RDD API
  HDFS, S3, Cassandra
  YARN, Mesos, Standalone

Spark Release Process
  ~3 month release cycle, time-scoped
  2 months of feature development
  1 month of QA
  Maintain older branches with bug fixes
  Upcoming release: 1.1.0 (previous was 1.0.2)

[Diagram: master, branch-1.1 (V1.1.0), and branch-1.0 (V1.0.0, V1.0.1), arranged on an axis from "more features" to "more stable"]
For any P.O.C. or non-production cluster, we always recommend running off of the head of a release branch.

Spark 1.1
  1,297 patches
  200+ contributors (still counting)
  Dozens of organizations
  To get updates, join our dev list: e-mail dev-subscribe@spark.apache.org

Roadmap
  Spark 1.1 and 1.2 have similar themes
  Spark core: usability, stability, and performance
  MLlib/SQL/Streaming: expanded feature set and performance
  About 40% of mailing list traffic is about these libraries.

Spark Core in 1.1
  Performance “out of the box”: sort-based shuffle, efficient broadcasts, disk spilling in Python, YARN usability improvements
  Usability: task progress and user-defined counters, UI behavior for failing or large jobs

Spark SQL in 1.1
  1.0 was the first “preview” release
  1.1 provides an upgrade path for Shark
  Replaced Shark in our benchmarks with 2-3X perf gains
  Can perform optimizations with 10-100X less effort than Hive

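Turning an RDD into a Relation
  // Assumed setup (not on the original slide): a SQLContext built from an existing SparkContext `sc`.
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._  // brings sql(...) and the RDD-to-SchemaRDD conversion into scope

  // Define the schema using a case class.
  case class Person(name: String, age: Int)

  // Create an RDD of Person objects, register it as a table.
  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")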

Querying using SQL
  // SQL statements can be run directly on RDDs.
  val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

  // The results of SQL queries are SchemaRDDs and support normal RDD operations.
  val nameList = teenagers.map(t => "Name: " + t(0)).collect()

  // Language-integrated queries (à la LINQ); same age range as the SQL query above.
  val teenagersDsl = people.where('age >= 13).where('age <= 19).select('name)

Spark SQL in 1.1
  JDBC server for multi-tenant access and BI tools
  Native JSON support
  Public types API: “make your own” SchemaRDDs
  Improved operator performance
  Native Parquet support and optimizations

Spark Streaming
  Stability improvements across the board
  Amazon Kinesis support
  Rate limiting for streams
  Support for polling Flume streams
  Streaming + ML: streaming linear regressions

What’s new in MLlib v1.1
  Contributors: 40 (v1.0) -> 68
  Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression
  Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec
  Statistics: sampling (core), correlations, hypothesis testing, random data generation
  Performance and scalability: major improvement to decision tree, tree aggregation
  Python API: decision tree, statistics, linear methods

Performance (v1.0 vs. v1.1)

Sort-based Shuffle
  Old shuffle: each mapper opens a file for each reducer and writes output simultaneously. Files = # mappers * # reducers
  New shuffle: each mapper buffers reduce output in memory, spills, then sort-merges the on-disk data.
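
To make the file-count formula concrete, here is a small worked example in plain Scala (no Spark dependency). The function names are hypothetical. The sort-based count assumes each mapper ends up with a single merged data file and ignores any index or temporary spill files, which the slide does not state.

```scala
object ShuffleFileCount {
  // Old shuffle, per the slide: one file per (mapper, reducer) pair.
  def hashShuffleFiles(mappers: Int, reducers: Int): Long =
    mappers.toLong * reducers.toLong

  // Assumption (not on the slide): after the sort-merge, each mapper leaves a
  // single data file; index and temporary spill files are not counted here.
  def sortShuffleFiles(mappers: Int): Long =
    mappers.toLong

  def main(args: Array[String]): Unit = {
    val mappers = 1000
    val reducers = 1000
    println(s"old shuffle:        ${hashShuffleFiles(mappers, reducers)} files")
    println(s"sort-based shuffle: ${sortShuffleFiles(mappers)} files")
  }
}
```

Under these assumptions the old shuffle's file count grows with the product of mappers and reducers, while the sort-based count grows only with the number of mappers.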

GroupBy Operator
  Spark groupByKey != SQL groupBy
  NO:
    people.map(p => (p.zipCode, p.getIncome))
      .groupByKey()
      .map { case (_, incomes) => incomes.sum }
  YES:
    people.map(p => (p.zipCode, p.getIncome))
      .reduceByKey(_ + _)
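
The difference between the two forms can be sketched on local Scala collections, without Spark. This is only an analogy: `Person`, `sumViaGroup`, and `sumViaReduce` are hypothetical names, and the local `groupBy` and `foldLeft` stand in for `groupByKey` and `reduceByKey`. Both give the same totals, but the grouped version collects every income for a key before summing, while the reduce version keeps only a running total per key.

```scala
object GroupByVsReduceByKey {
  // Hypothetical record type standing in for the elements of `people`.
  final case class Person(zipCode: String, income: Long)

  // Stand-in for groupByKey followed by a sum: all incomes for a key are
  // gathered into a collection first, then summed.
  def sumViaGroup(people: Seq[Person]): Map[String, Long] =
    people.groupBy(_.zipCode).map { case (zip, ps) => zip -> ps.map(_.income).sum }

  // Stand-in for reduceByKey(_ + _): one running total per key, so no
  // per-key collection is ever built.
  def sumViaReduce(people: Seq[Person]): Map[String, Long] =
    people.foldLeft(Map.empty[String, Long]) { (acc, p) =>
      acc.updated(p.zipCode, acc.getOrElse(p.zipCode, 0L) + p.income)
    }

  def main(args: Array[String]): Unit = {
    val people = Seq(Person("94105", 50L), Person("94105", 70L), Person("10001", 30L))
    println(sumViaGroup(people))
    println(sumViaReduce(people))
  }
}
```

In a cluster setting the running-total form matters more than it does locally, since partial sums can be combined before data is moved between machines.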

GroupBy Operator
  Spark groupByKey != SQL groupBy
  NO:
    people.map(p => (p.zipCode, p.getIncome))
      .groupByKey()
      .map { case (_, incomes) => incomes.sum }
  YES:
    people.groupBy('zipCode).select(sum('income))

GroupBy Operator
  Spark groupByKey != SQL groupBy
  NO:
    people.map(p => (p.zipCode, p.getIncome))
      .groupByKey()
      .map { case (_, incomes) => incomes.sum }
  YES:
    SELECT sum(income) FROM people GROUP BY zipCode;

Other efforts
  Ooyala Job Server, Hive on Spark, Pig on Spark
  Spark Streaming (real-time): DStreams, streams of RDDs
  GraphX (graph, alpha): RDD-based graphs
  MLlib (machine learning): RDD-based matrices
  Spark SQL: RDD-based tables
  Spark RDD API
  HDFS, S3, Cassandra
  YARN, Mesos, Standalone

Looking Ahead to 1.2+
  [Core] Scala 2.11 support; debugging tools (task progress, visualization); Netty-based communication layer
  [SQL] Portability across Hive versions; performance optimizations (TPC-DS and Parquet); planner integration with Cassandra and other sources

Looking Ahead to 1.2+
  [Streaming] Python support; lower-level Kafka API w/ recoverability
  [MLlib] Multi-model training; many new algorithms; faster internal linear solver

Q and A
  Josh Rosen: PySpark and Spark Core
  Tathagata Das: Spark Streaming Lead