Presentation on theme: "Spark Community Update" — Presentation transcript:

1 Spark Community Update
Matei Zaharia

2 An Exciting Year for Spark
                           May 2013    May 2014
Developers contributing          60         200
Companies contributing           17          50
Total lines of code          49,000     155,000
Commercial support             none     all major Hadoop distros


4 Community Growth
Contributors per release:
Spark 0.6 (Oct '12): 17 contributors
Spark 0.7 (Feb '13): 31 contributors
Spark 0.8 (Sept '13): 67 contributors
Spark 0.9 (Feb '14): 83 contributors
Spark 1.0 (May '14): 110 contributors
This trend is fuelled by a hyperactive open source community.

5 Community Growth
Activity in the last 30 days

6 Events
Spark Summit 2013 (December 2-3, 2013): talks from 22 organizations, 450 attendees
Spark Summit 2014 (June 30-July 2, 2014): talks from 50+ organizations. Sign up now!
Videos, slides, registration: spark-summit.org

7 Users & Presenters

8 Next-Gen MapReduce
Many pundits are already declaring Spark the next-gen MapReduce. Influential bloggers:
"Leading successor of MapReduce" – Mike Olson, Cloudera
"Two years ago and last year were about Hadoop; this year is about Spark" – Derrick Harris, GigaOM
"Just about everybody seems to agree" that … "Spark will be the replacement of Hadoop MapReduce" – Curt Monash, DBMS2

9 What’s Happening Next?

10 Many Features Added to Core…
APIs: full parity in Java & Python, Java 8 lambda support
Management: high availability, YARN security
Monitoring: greatly improved UI, metrics

11 But Most Action Now in Libraries
An expressive API is good, but it's even better to be able to call your whole algorithm in one line!
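
For instance, a minimal sketch of that idea using MLlib's k-means from Spark 1.0; the input file name and the choice of 2 clusters and 20 iterations are hypothetical:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse a text file of space-separated numbers into MLlib vectors.
val points = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// The entire clustering algorithm is a single call: k = 2 clusters, 20 iterations.
val model = KMeans.train(points, 2, 20)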

12 Additions to the Stack
Libraries on top of Spark Core: Spark SQL (SQL), Shark (SQL), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph)

13 Overview
Spark SQL = Catalyst optimizer framework + implementations of SQL & HiveQL on Spark
Provides native support for executing relational queries (SQL) in Spark
Alpha version in Spark 1.0
Led by another AMP alum: Michael Armbrust

14 Relationship to Shark
Shark modified the Hive backend to run over Spark, but had two challenges:
Limited integration with Spark programs
Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark:
Borrows: Hive data loading, in-memory column store
Adds: RDD-aware optimizer, rich language interfaces

15 Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
Support for writing queries in HQL
Catalog info from the Hive MetaStore
Tablescan operator that uses Hive SerDes
Wrappers for Hive UDFs, UDAFs, UDTFs
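
As a minimal sketch of what this enables, assuming a Hive table named people_hive already exists in the MetaStore (the table and column names here are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

// The HQL query is resolved against the Hive MetaStore, the table is read
// through Hive SerDes, and upper() is an ordinary built-in Hive UDF.
val names = hql("SELECT upper(name) FROM people_hive WHERE age > 18")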

16 Parquet Compatibility
Native support for reading data in Parquet:
Columnar storage avoids reading unneeded data.
RDDs can be written to Parquet files, preserving the schema.
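
A minimal sketch of the round trip, assuming the people SchemaRDD built later in this deck and a hypothetical output path:

// Write the data out as Parquet; the schema travels with the files.
people.saveAsParquetFile("people.parquet")

// Read it back; the result is again a SchemaRDD that can be queried.
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerAsTable("parquetPeople")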

17 Abstraction: SchemaRDDs
Resilient Distributed Datasets (RDDs) are Spark's core abstraction.
Pro: distributed coarse-grained transformations
Con: operations opaque to the engine
SchemaRDDs add:
Awareness of names & types of data stored
Optimization using database techniques
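
A quick illustration of the difference, assuming the people table registered later in the deck and the Spark 1.0 SchemaRDD API (printSchema, ordinal row access):

// The engine knows the names and types of the result's columns...
val teens = sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teens.printSchema()

// ...but a SchemaRDD is still an RDD, so normal transformations apply.
val upperNames = teens.map(row => row(0).toString.toUpperCase)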

18 Examples
Consider a text file filled with people's names and ages:
Michael, 30
Andy, 31
Justin Bieber, 19
…

19 Turning an RDD into a Relation
// Create a SQLContext and import its members so RDDs of case classes can be
// used as tables and sql(...) can be called directly (setup assumed by the
// following slides).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

20 Querying Using SQL
// SQL statements can be run by using the sql method provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs but also support normal RDD
// operations. The columns of a row in the result are accessed by ordinal.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

21 SQL + Machine Learning
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since sql returns an RDD, the results can be easily used in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model = new LogisticRegressionWithSGD().run(trainingData)

22 Joining Diverse Sources
val hiveContext = new HiveContext(sc)
import hiveContext._

// Data in Hive
hql("CREATE TABLE IF NOT EXISTS hiveTable (key INT, val STRING)")
hql("LOAD DATA LOCAL INPATH 'kv.txt' INTO TABLE hiveTable")

// Data in existing RDDs
val rdd = sc.parallelize((1 to 100).map(i => Record(i, "val" + i)))
rdd.registerAsTable("rddTable")

// Data in Parquet
hiveContext.loadParquetFile("f.parquet").registerAsTable("parqTable")

// Query all sources at once!
sql("SELECT * FROM hiveTable JOIN rddTable JOIN parqTable WHERE ...")

23 Spark SQL in Java
public class Person implements Serializable {
  public String getName() {...}
  public void setName(String name) {...}
  public int getAge() {...}
  public void setAge(int age) {...}
}

// Build each Person with the setters and return it from the lambda.
JavaRDD<Person> people = sc.textFile("people.txt").map(line -> {
  String[] parts = line.split(",");
  Person person = new Person();
  person.setName(parts[0]);
  person.setAge(Integer.parseInt(parts[1]));
  return person;
});

JavaSQLContext ctx = new JavaSQLContext(sc);
JavaSchemaRDD peopleTable = ctx.applySchema(people, Person.class);

24 Spark SQL in Python
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

lines = sc.textFile("people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

# Infer the schema from the dictionaries and register the result as a table.
peopleTable = sqlCtx.inferSchema(people)
peopleTable.registerAsTable("people")

teenagers = sqlCtx.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)

25 Spark SQL Research
Catalyst framework: a compact optimizer based on functional language techniques
(pattern matching, fixpoint convergence of rules)
Complex analytics: expose and optimize MLlib and GraphX algorithms in SQL
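
To give a flavour of those techniques (an illustrative toy, not Catalyst's actual API): a rewrite rule written as a Scala pattern match and driven to a fixpoint over a tiny, hypothetical expression tree.

// Toy expression tree standing in for a query plan.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// A rule is a rewrite expressed by pattern matching:
// fold the addition of two literals into a single literal.
def constantFold(e: Expr): Expr = e match {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
  case Add(l, r)           => Add(constantFold(l), constantFold(r))
  case other               => other
}

// Apply the rule until it no longer changes the tree (fixpoint convergence).
def toFixpoint(e: Expr): Expr = {
  val next = constantFold(e)
  if (next == e) e else toFixpoint(next)
}

// Add(Add(1, 2), 3) folds down to Lit(6) after two passes.
println(toFixpoint(Add(Add(Lit(1), Lit(2)), Lit(3))))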

26 Learn More Visit spark.apache.org for the latest Spark news, docs & tutorials Join us at this year’s Summit: spark-summit.org
