Spark 1.0 and Beyond
Patrick Wendell, Databricks
spark.incubator.apache.org

About Me
- Committer and PMC member of Apache Spark
- "Former" PhD student at Berkeley
- Release manager for Spark 1.0
- Background in networking and distributed systems

Today's Talk
- Spark background
- About the Spark release process
- The Spark 1.0 release
- Looking forward to Spark 1.1

What is Spark?
A fast and expressive cluster computing engine, compatible with Apache Hadoop.
Efficient:
- General execution graphs
- In-memory storage
- Up to 10× faster than Hadoop on disk, 100× in memory
Usable:
- Rich APIs in Java, Scala, Python
- Interactive shell
- 2-5× less code
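
To make the conciseness claim concrete, here is a minimal word count in Spark's Scala API (a sketch; the application name and HDFS paths are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
      sc.textFile("hdfs://namenode/input.txt")   // hypothetical input path
        .flatMap(_.split(" "))                   // one record per word
        .map(word => (word, 1))
        .reduceByKey(_ + _)                      // sum counts per word
        .saveAsTextFile("hdfs://namenode/counts")// hypothetical output path
      sc.stop()
    }
  }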

30-Day Commit Activity

Spark Philosophy
Make life easy and productive for data scientists:
- Well-documented, expressive APIs
- Powerful domain-specific libraries
- Easy integration with storage systems … and caching to avoid data movement
- Predictable releases, stable APIs

Spark Release Process
Quarterly release cycle (3 months): 2 months of general development, then 1 month of polishing, QA, and fixes.
- Spark 1.0: development Feb 1 – April 8th, QA from April 8th on
- Spark 1.1: development May 1 – July 8th, QA from July 8th on

Spark 1.0: By the Numbers
- 3 months of development
- 639 patches
- JIRA issues
- contributors

API Stability in 1.X
- APIs are stable for all non-alpha projects
- Spark 1.1, 1.2, … will be compatible
- @DeveloperApi: internal API that is unstable
- @Experimental: user-facing API that might stabilize later
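
As a sketch of how these markers appear in source, the annotations live in org.apache.spark.annotation (the class and method below are hypothetical examples):

  import org.apache.spark.annotation.{DeveloperApi, Experimental}

  @DeveloperApi          // internal API: may change between minor releases
  class LowLevelScheduler

  class PublicThing {
    @Experimental        // user-facing, but might change until it stabilizes
    def newFeature(): Unit = { }
  }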

Today's Talk
- About the Spark release process
- The Spark 1.0 release
- Looking forward to Spark 1.1

Spark 1.0 Features
- Core engine improvements
- Spark Streaming
- MLlib
- Spark SQL

Spark Core
- History server for the Spark UI
- Integration with the YARN security model
- Unified job submission tool
- Java 8 support
- Internal engine improvements

History Server
Configure with:
  spark.eventLog.enabled=true
  spark.eventLog.dir=hdfs://XX
In Spark Standalone, the history server is embedded in the master. On YARN/Mesos, run the history server as a daemon.
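
The same settings can also be applied programmatically; a minimal sketch, assuming a hypothetical application name and log directory:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MyApp")                                      // hypothetical app name
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "hdfs://namenode/spark-logs")  // hypothetical HDFS path
  val sc = new SparkContext(conf)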

Job Submission Tool
Apps don't need to hard-code the master:

  conf = new SparkConf().setAppName("My App")
  sc = new SparkContext(conf)

  ./bin/spark-submit \
    --class my.main.Class \
    --name myAppName \
    --master local[4]            # or: --master spark://some-cluster

Java 8 Support
RDD operations can use lambda syntax.

Old:

  class Split extends FlatMapFunction<String, String> {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
  JavaRDD<String> words = lines.flatMap(new Split());

New:

  JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));

Java 8 Support
Note: minor API changes.
(a) If you are extending Function classes, use implements rather than extends.
(b) Return-type-sensitive functions: mapToPair, mapToDouble.

Python API Coverage
- RDD operators: intersection(), take(), top(), takeOrdered()
- Metadata: name(), id(), getStorageLevel()
- Runtime configuration: setJobGroup(), setLocalProperty()

Integration with YARN Security
Supports Kerberos authentication in YARN environments:
  spark.authenticate=true
ACL support for user interfaces:
  spark.ui.acls.enable=true
  spark.ui.view.acls=patrick,matei

Engine Improvements
- Job cancellation directly from the UI
- Garbage collection of shuffle and RDD data
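
Cancellation is also exposed programmatically through job groups; a sketch with a hypothetical group id and job, assuming the cancel call is issued from another thread:

  // Tag all jobs submitted from this thread with a group id.
  sc.setJobGroup("nightly-etl", "Nightly ETL run")
  val count = sc.textFile("hdfs://namenode/events").count()  // hypothetical job

  // From another thread, cancel everything in the group:
  sc.cancelJobGroup("nightly-etl")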

Documentation
- Unified Scaladocs across modules
- Expanded MLlib guide
- Deployment and configuration specifics
- Expanded API documentation

The Spark stack:
- Spark core: RDDs, transformations, and actions
- Spark Streaming (real-time): DStreams, streams of RDDs
- Spark SQL: SchemaRDDs
- MLlib (machine learning): RDD-based matrices

Spark SQL

Turning an RDD into a Relation

  // Define the schema using a case class.
  case class Person(name: String, age: Int)

  // Create an RDD of Person objects, register it as a table.
  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")

Querying using SQL

  // SQL statements can be run directly on RDDs.
  val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
  teenagers.map(t => "Name: " + t(0)).collect()

  // Language-integrated queries (à la LINQ)
  val teenagers = people.where('age >= 10).where('age <= 19).select('name)

Import and Export

  // Save SchemaRDDs directly to Parquet.
  people.saveAsParquetFile("people.parquet")

  // Load data stored in Hive.
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import hiveContext._

  // Queries can be expressed in HiveQL.
  hql("FROM src SELECT key, value")

In-Memory Columnar Storage
Spark SQL can cache tables using an in-memory columnar format:
- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects the best compression
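
For instance, a sketch of caching the "people" table registered earlier (assuming a SQLContext built on the same SparkContext):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  sqlContext.cacheTable("people")            // stored in the columnar format
  sql("SELECT name FROM people").collect()   // scans only the 'name' column
  sqlContext.uncacheTable("people")          // free the cached columns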

Spark Streaming
- Web UI for streaming
- Graceful shutdown
- User-defined input streams
- Support for creating streams in Java
- Refactored API
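
A minimal streaming sketch in Scala (the socket source, port, and batch interval are illustrative):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(1))        // 1-second batches
  val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
  lines.count().print()                                 // records per batch
  ssc.start()
  ssc.awaitTermination()

  // Graceful shutdown: stop receiving, then finish processing buffered data.
  // ssc.stop(stopSparkContext = true, stopGracefully = true)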

MLlib
- Sparse vector support
- Decision trees
- Linear algebra: SVD and PCA
- Evaluation support
- 3 contributors in the last 6 months
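
As a quick sketch of the new sparse vector support (values are illustrative):

  import org.apache.spark.mllib.linalg.Vectors

  // Size-5 vector with nonzeros at indices 1 and 3;
  // equivalent dense form: [0.0, 2.5, 0.0, 4.0, 0.0].
  val sparse = Vectors.sparse(5, Array(1, 3), Array(2.5, 4.0))
  val dense  = Vectors.dense(0.0, 2.5, 0.0, 4.0, 0.0)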

MLlib
Note: minor API change.

Old:

  val data = sc.textFile("data/kmeans_data.txt")
  val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
  val clusters = KMeans.train(parsedData, 4, 100)

New:

  val data = sc.textFile("data/kmeans_data.txt")
  val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
  val clusters = KMeans.train(parsedData, 4, 100)

1.1 and Beyond
- Data import/export leveraging Catalyst (HBase, Cassandra, etc.)
- Shark-on-Catalyst
- Performance optimizations: external shuffle, pluggable storage strategies
- Streaming: reliable input from Flume and Kafka

Unifying Experience
- SchemaRDD represents a consistent integration point for data sources
- spark-submit abstracts the environmental details (YARN, hosted cluster, etc.)
- API stability across versions of Spark

Conclusion
Visit spark.apache.org for videos, tutorials, and hands-on exercises.
Help us test a release candidate!
Spark Summit on June 30th: spark-summit.org
Meetup group: meetup.com/spark-users