Spark SQL.

Some History (for Dremel and SparkSQL)
Parallel DB systems had already been around for 20-30 years. Historical DB vendors and products supporting parallelism include Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, and DB2.

Common Complaints
Complaints about parallel databases included: too slow (especially for internet-scale applications); too much loading time; too monolithic and complex, with instruction manuals of ~500 pages and too much heft for “internet scale” applications; too expensive; too hard to understand; and poor support for complex non-relational operations.

NoSQL
The story of NoSQL. What follows is the OLAP story, not the OLTP story (Online Analytical Processing, not Online Transaction Processing).
Briefly, the OLTP story: BigTable (06) => MegaStore (11) => Spanner, F1 (12), that is, less consistency => more consistency.
Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo.

A Timeline (figure)
Dated milestones: Map Reduce (04); Column Stores (05); Dremel (10); Spark (12); SparkSQL (14).
Annotations: DBs are slow for OLAP; Google: SQL on MR; others: Pig, Hive, Impala; Google; main-memory MR.
Sentiment along the timeline: “SQL is bad! Yay NoSQL!”, then “SQL is good!”

Column Stores
For OLAP, column stores are a lot better than row stores. The idea dates from the 80s and was commercialized as Vertica in 2005.
Key idea: store the values of a single column together.
Why is this better for aggregation? Compression is better, because similar values can be packed together; unnecessary columns can be skipped; and as a result much less data is read from disk.
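
The three benefits above can be made concrete with a small sketch. The toy table, the field names, and the "fields touched" counter are all hypothetical and only stand in for disk reads; this is an illustration of the idea, not how any particular column store is implemented.

```python
from itertools import groupby

# Hypothetical toy table of (ticker, day, price) records.
rows = [
    ("GM", 1, 30.0),
    ("GM", 2, 31.0),
    ("GM", 3, 29.5),
    ("AAPL", 1, 93.0),
    ("AAPL", 2, 94.5),
]

# Column layout: one list per column instead of one tuple per record.
columns = {
    "ticker": [r[0] for r in rows],
    "day": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}


def avg_price_row_store(records):
    """Average price over a row layout; every field of every record is read."""
    touched = 0
    total = 0.0
    for rec in records:
        touched += len(rec)  # the whole record comes off "disk"
        total += rec[2]
    return total / len(records), touched


def avg_price_column_store(cols):
    """Average price over a column layout; only the price column is read."""
    prices = cols["price"]
    return sum(prices) / len(prices), len(prices)


def run_length_encode(values):
    """Pack runs of equal adjacent values as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


row_avg, row_touched = avg_price_row_store(rows)
col_avg, col_touched = avg_price_column_store(columns)
ticker_rle = run_length_encode(columns["ticker"])
```

Both layouts give the same average, but the row layout touches three fields per record while the column layout touches one. The run-length encoding of the ticker column shows why storing similar values together compresses well.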

Map-Reduce
2004: Google published MapReduce, a parallel programming paradigm.
Pros: fast, fast, fast; imperative; many real use-cases.
Cons: checkpoints all intermediate results; no real logic or optimization; very “rigid”, with no room for improvement; many bottlenecks.
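
As a reminder of what the paradigm looks like, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the job. All function names are hypothetical, and everything stays in memory; a real MapReduce system runs these steps across machines and writes intermediate results to disk, which is the checkpointing cost listed under the cons.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every input record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each key and its list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# User-supplied functions for a word-count job.
def word_map(line):
    return [(word, 1) for word in line.split()]


def word_reduce(word, counts):
    return sum(counts)


lines = ["spark sql", "sql on map reduce", "map reduce"]
word_counts = reduce_phase(shuffle(map_phase(lines, word_map)), word_reduce)
```

The user only writes `word_map` and `word_reduce`; the fixed map-then-reduce shape of the pipeline is what the later slides call rigid.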

NoSQL
One OLAP story: MapReduce (04) => Dremel (10), that is, less use of parallel-database (pdb) principles => more use of pdb principles.
By 2010, Google had restricted MapReduce to complex batch processing, with Dremel for interactive analytics.
Contemporaries: MapReduce: Hadoop (Yahoo). PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook). PSQL-not-on-MapReduce: Impala.

Along comes Dremel
2010: Dremel eliminates limitations of MapReduce in multiple ways: tree-based computation, SQL-based specification, column-store encoding, and native support for nested, JSON-like data.
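
Tree-based computation can be sketched as follows: leaf servers compute partial aggregates over their own partitions, and each level above combines the partials it receives until the root holds the answer. This is a simplified illustration of the general idea, not Dremel's actual implementation; the function names and the `fanout` parameter are hypothetical, and the sketch assumes at least one partition and a fanout of 2 or more.

```python
def leaf_partial(values):
    """A leaf server scans its own partition and returns a partial (sum, count)."""
    return (sum(values), len(values))


def combine(partials):
    """An intermediate server merges the partial results of its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def tree_aggregate(partitions, fanout=2):
    """Aggregate bottom-up, merging `fanout` children at a time until one result remains."""
    level = [leaf_partial(p) for p in partitions]
    while len(level) > 1:
        level = [combine(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


# Four hypothetical partitions of one numeric column.
partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
total, count = tree_aggregate(partitions)
average = total / count
```

Only small (sum, count) pairs travel up the tree, so no single node has to see all of the raw data.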

Spark vs. Dremel
2012: Spark, from the Berkeley folks. Similar to Dremel in that the focus is on interactive ad-hoc tasks.
Caveat: Dremel is primarily for aggregation and primarily read-only.
Both move away from the drawbacks of MR, but in different ways: Dremel uses column-store ideas plus disk; Spark uses memory (Java objects), avoids checkpointing, and supports persistence.

Disadvantages of MapReduce
1. Extremely rigid data flow (M => R). Other flows are constantly hacked in: joins, unions, splits, chains.
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct.
3. Semantics are hidden inside the map and reduce functions, which makes programs difficult to maintain, extend, and optimize.
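
Point 2 is easier to appreciate with an example. Below is a sketch of a hand-coded equi-join in map/reduce style (often called a reduce-side join): the mapper tags each record with its source, and the reducer pairs up records that share a key. The table contents and all function names are hypothetical, and the whole thing runs in one process purely for illustration.

```python
from collections import defaultdict


def join_map(tag, record, key_index):
    """Emit (join key, (source tag, record)) so the reducer can tell the inputs apart."""
    return (record[key_index], (tag, record))


def join_reduce(key, tagged_records):
    """Pair every 'visits' record with every 'urlInfo' record that shares the key."""
    left = [rec for tag, rec in tagged_records if tag == "visits"]
    right = [rec for tag, rec in tagged_records if tag == "urlInfo"]
    # Drop the duplicate url field (position 0) from the right-hand record.
    return [l + r[1:] for l in left for r in right]


# Hypothetical inputs: visits(user, url, time) and urlInfo(url, category, pRank).
visits = [("amy", "cnn.com", 8), ("bob", "cnn.com", 9), ("amy", "espn.com", 10)]
url_info = [("cnn.com", "news", 0.9), ("espn.com", "sports", 0.8), ("bbc.com", "news", 0.7)]

# Map: tag every record and key it by url.
pairs = [join_map("visits", v, 1) for v in visits]
pairs += [join_map("urlInfo", u, 0) for u in url_info]

# Shuffle: group the tagged records by key.
groups = defaultdict(list)
for key, tagged in pairs:
    groups[key].append(tagged)

# Reduce: join within each key group.
joined = []
for key in sorted(groups):
    joined.extend(join_reduce(key, groups[key]))
```

In SQL the same result is a one-line join. Here the join logic lives inside `join_map` and `join_reduce`, where no optimizer can see it, which is exactly point 3.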

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
PIG: Imperative style, like Spark. From Yahoo!

Another Example: PIG

    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Another Example: DryadLINQ

    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    var words = input.SelectMany(x => SplitLineRecord(separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x[2]);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");

Execution plan graph (figure): Get => SM => G => S => O => Take, one stage per operator in the code above.

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce. Unlike Spark, most of them cannot have datasets persist across queries.
PIG: Imperative style, like Spark. From Yahoo!
DryadLINQ: Imperative programming interface. From Microsoft.
HIVE: SQL-like. From Facebook.
HadoopDB: SQL-like (hybrid of MR + databases). From Yale.

What did you think of this paper?

This paper
Appeared in the “Industry” track of SIGMOD. Lightly reviewed; use-cases and impact are more important than new technical contributions.
Light on experiments. Light on details, especially on optimization.

Key Benefits of SparkSQL
Bridging the gap between procedural and relational: analysts can mix both, not just fully A or fully B but intermingled.
At the same time, it doesn’t force one single format of intermingling: a query can be fully SQL or fully procedural.
Not better than Impala, but beating Impala is not the authors’ contribution.

Impala
From Cloudera, since 2012. SQL on Hadoop clusters. Open-source.
Supports a Protocol Buffers-like format (Parquet).
C++ based: avoids the overhead of Java/Scala.
Circumvents MR by using a distributed query engine similar to a parallel RDBMS.

History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational? UDFs (user-defined functions), which have been around since the early 90s.
The rage back then was object-relational databases: OOP was starting to pick up, and people wanted to represent and reason about objects in databases.
Postgres was one of the first systems to support UDFs, which were used to call custom code in the middle of SQL.
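
To show what “custom code in the middle of SQL” looks like, here is a small sketch using Python's built-in `sqlite3` module as a stand-in (not Postgres, and not Spark SQL). The table, the data, and the `url_domain` function are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse


def url_domain(url):
    """Procedural logic that would be awkward to express in plain SQL."""
    return urlparse(url).netloc


conn = sqlite3.connect(":memory:")

# Register the Python function as a one-argument SQL function.
conn.create_function("url_domain", 1, url_domain)

conn.execute("CREATE TABLE visits (visitor TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("amy", "http://cnn.com/world"),
        ("bob", "http://cnn.com/sports"),
        ("amy", "http://espn.com/nba"),
    ],
)

# The UDF is called from inside a relational query: SQL handles the grouping
# and counting, Python handles the URL parsing.
rows = conn.execute(
    "SELECT url_domain(url) AS domain, COUNT(*) AS n "
    "FROM visits GROUP BY url_domain(url) ORDER BY n DESC, domain"
).fetchall()
conn.close()
```

The database treats the UDF as a black box: it can call the function, but it cannot look inside it to optimize the query.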