Spark SQL.

Some History (for Dremel and SparkSQL)
Parallel DB systems had been around for years before Dremel and SparkSQL
Historical DB companies supporting parallelism include: Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, DB2

Common Complaints
Complaints included:
  Too slow (especially for internet-scale applications)
  Too much loading time
  Too monolithic and complex
    Instruction manuals of ~500 pages
    Too much heft for “internet scale” applications
  Too expensive
  Too hard to understand
  Poor support for complex non-relational ops

NoSQL
The story of NoSQL
This is the OLAP story, not the OLTP story
  (Online Analytical Processing, not Online Transaction Processing)
OLTP story: BigTable (06) => MegaStore (11) => Spanner, F1 (12)
  Less consistency => More consistency
  Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo

A Timeline
[Timeline figure; recoverable labels]
  Systems: Map Reduce (04), Column Stores (05), Dremel (10), Spark (12), SparkSQL (14)
  Google: SQL on MR; Main-Mem MR
  Others: Pig, Hive, Impala
  Sentiment along the timeline: "DBs are slow for OLAP", "SQL is bad! Yay NoSQL!", "SQL is good!"

Column Stores
For OLAP, column stores are a lot better than row stores
  Idea from the 80s, commercialized as Vertica in 2005
  Key idea: store values for a single column together
Why is this better for aggregation?
  Better compression; similar values can be packed together
  Can skip over unnecessary columns
  Much less data read from disk
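
The aggregation argument above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the toy table, the function names, and the "values touched" counter are all hypothetical and stand in for bytes read from disk. Run-length encoding is used as one simple example of column compression; the slides do not name a specific scheme.

```python
from itertools import groupby

# Hypothetical toy table: (user, country, amount). Not from the slides.
rows = [
    ("ann", "US", 10),
    ("bob", "US", 20),
    ("cat", "US", 5),
    ("dan", "IN", 7),
    ("eve", "IN", 3),
]


def sum_amount_row_store(records):
    """Row layout: each record is stored together, so summing one field
    still reads every field of every record."""
    values_touched = 0
    total = 0
    for record in records:
        values_touched += len(record)  # the whole record is read
        total += record[2]
    return total, values_touched


# Column layout: one list per column.
columns = {
    "user": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_column_store(cols):
    """Column layout: only the 'amount' column is read; the rest is skipped."""
    amounts = cols["amount"]
    return sum(amounts), len(amounts)


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


row_total, row_touched = sum_amount_row_store(rows)
col_total, col_touched = sum_amount_column_store(columns)
encoded_country = run_length_encode(columns["country"])
```

Both layouts return the same sum, but the column layout touches a third of the values here, and the country column shrinks to two runs once similar values sit next to each other.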

Map-Reduce
2004: Google published MapReduce.
Parallel programming paradigm
Pros:
  Fast fast fast
  Imperative
  Many real use-cases
Cons:
  Checkpointing all intermediate results
  No real logic or optimization
  Very “rigid”, no room for improvement
  Many bottlenecks

NoSQL
One OLAP story: MapReduce (04) => Dremel (10)
  Less use of parallel DB (pdb) principles => More use of pdb principles
By 2010, Google had restricted MapReduce to complex batch processing, with Dremel for interactive analytics
Contemporaries:
  MapReduce: Hadoop (Yahoo)
  PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook)
  PSQL-not-on-MapReduce: Impala

Along comes Dremel
2010: Eliminating limitations in MapReduce in multiple ways:
  Tree-based computation
  SQL-based specification
  Column Store encoding
  Native JSON support
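
As a rough illustration of tree-based computation, the sketch below has leaf servers compute partial aggregates over their own partitions and has each level of the tree merge the results of its children. This is a simplified model, not Dremel's actual implementation; the data, the function names, and the fanout parameter are assumptions made for the example.

```python
# Hypothetical data: each leaf server holds one partition of a column.
leaf_partitions = [[3, 1, 4], [1, 5], [9, 2, 6], [5, 3]]


def leaf_aggregate(values):
    """Leaf server: scan the local partition, return a partial (sum, count)."""
    return (sum(values), len(values))


def combine(partials):
    """Intermediate or root server: merge partial results from its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def tree_aggregate(partitions, fanout=2):
    """Aggregate bottom-up, merging `fanout` children at each tree level."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not partitions:
        raise ValueError("need at least one partition")
    level = [leaf_aggregate(p) for p in partitions]
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


total, count = tree_aggregate(leaf_partitions)
average = total / count
```

Because sum and count are merged separately, the average comes out the same whatever the shape of the tree, which is what lets the work be spread across many servers.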

Spark vs. Dremel
2012: Berkeley folks
Similar to Dremel in that:
  the focus is on interactive ad-hoc tasks
    Caveat: Dremel is primarily aggregation, primarily read-only
  moving away from the drawbacks of MR (but in different ways)
    Dremel uses Column Store ideas + Disk
    Spark uses Memory (Java objects) + Avoiding checkpointing + Persistence

Disadvantages of MapReduce
1. Extremely rigid data flow (M -> R)
   Other flows constantly hacked in: join, union, split, chains (e.g. M -> M -> R -> M)
2. Common operations must be coded by hand
   Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   Difficult to maintain, extend, and optimize
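
To make point 2 concrete, here is a minimal sketch of what coding a join by hand looks like in a map-reduce style (a reduce-side join). It simulates the map, shuffle, and reduce phases in plain Python rather than using any real MapReduce framework; the tables and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: a visits table and a URL-to-category table.
visits = [("ann", "a.com"), ("bob", "b.com"), ("cat", "a.com")]
url_info = [("a.com", "news"), ("b.com", "sports")]


def map_visits(record):
    user, url = record
    yield url, ("visit", user)  # tag each value with its source table


def map_url_info(record):
    url, category = record
    yield url, ("info", category)


def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(url, tagged_values):
    """Reducer: untag the values and emit the cross product for this key."""
    users = [v for tag, v in tagged_values if tag == "visit"]
    categories = [v for tag, v in tagged_values if tag == "info"]
    for user in users:
        for category in categories:
            yield (user, url, category)


mapped = [kv for r in visits for kv in map_visits(r)]
mapped += [kv for r in url_info for kv in map_url_info(r)]
joined = sorted(row for url, vals in shuffle(mapped).items()
                for row in reduce_join(url, vals))
```

The join logic lives entirely inside the mapper tags and the reducer body, so the framework has no way to recognize it as a join. That is also point 3: the semantics are hidden from any optimizer.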

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
  PIG: Imperative style, like Spark. From Yahoo!

Another Example: PIG
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Another Example: DryadLINQ
string uri = ...;
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");
[Figure: Execution Plan Graph: Get -> SM -> G -> S -> O -> Take]

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
Unlike Spark, most of them cannot have datasets persist across queries.
  PIG: Imperative style, like Spark. From Yahoo!
  DryadLINQ: Imperative programming interface. From Microsoft.
  HIVE: SQL-like. From Facebook.
  HadoopDB: SQL-like (hybrid of MR + databases). From Yale.

What did you think of this paper?

This paper
Appeared at the “Industry” Track of SIGMOD
  Lightly reviewed
  Use-cases and impact more important than new technical contributions
Light on experiments
Light on details
  Esp. on optimization

Key Benefits of SparkSQL
Bridging the gap between procedural and relational
  Allowing analysts to mix both
  Not just fully A or fully B, but intermingled
At the same time, doesn’t force one single format of intermingling
  Can issue fully SQL
  Can issue fully procedural
Not better than Impala, but that is not their contribution.

Impala
From Cloudera
Since 2012
SQL on Hadoop clusters
Open-source
Support for a Protocol Buffers-like format (Parquet)
C++ based: avoids the overhead of Java/Scala
May circumvent MR by using a distributed query engine similar to a parallel RDBMS

History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
UDFs (user-defined functions)
  Been there since the early 90s
  Rage back then: object-relational databases
    OOP was starting to pick up
    Representing and reasoning about objects in databases
  Postgres was one of the first to use it
  Used to call custom code in the middle of SQL
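
A small sketch of what calling custom code in the middle of SQL looks like. It uses Python's built-in sqlite3 module purely because it is self-contained; the slides discuss Postgres, not SQLite, and the table, the data, and the url_domain helper are hypothetical.

```python
import sqlite3


def url_domain(url):
    """Procedural helper (hypothetical): extract the host part of a URL."""
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("ann", "http://a.com/index.html"),
        ("bob", "http://b.com/news"),
        ("cat", "http://a.com/about"),
    ],
)

# Register the Python function so that SQL can call it by name.
conn.create_function("url_domain", 1, url_domain)

# Relational part (GROUP BY, COUNT) and procedural part (url_domain) together.
rows = conn.execute(
    "SELECT url_domain(url) AS domain, COUNT(*) AS n "
    "FROM visits GROUP BY domain ORDER BY domain"
).fetchall()
conn.close()
```

The database treats the registered function as a black box: it can call it once per row, but it cannot look inside to optimize it.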
