Approximate Queries on Very Large Data
Sameer Agarwal, UC Berkeley
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, and Ion Stoica

Our Goal: Support interactive SQL-like aggregate queries over massive sets of data

Our Goal: Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log
AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Our Goal: Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'
FILTERS, GROUP BY clauses

Our Goal: Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
JOINS, Nested Queries, etc.

Our Goal: Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
ML Primitives, User Defined Functions

Query Execution on Samples
Scanning 100 TB on 1,000 machines takes roughly ½-1 hour from hard disks and 1-5 minutes from memory; can we answer in about 1 second?

Query Execution on Samples
What is the average buffering ratio in the table?
ID | City     | Buff Ratio
1  | NYC      | 0.78
2  | NYC      | 0.13
3  | Berkeley | 0.25
4  | NYC      | 0.19
5  | NYC      | 0.11
6  | Berkeley | 0.09
7  | NYC      | 0.18
8  | NYC      | 0.15
9  | Berkeley | ...
10 | Berkeley | ...
11 | NYC      | ...
12 | Berkeley | 0.10

Query Execution on Samples
What is the average buffering ratio in the table? (full table as above)
Uniform Sample, sampling rate 1/4:
ID | City     | Buff Ratio | Sampling Rate
2  | NYC      | 0.13       | 1/4
6  | Berkeley | 0.25       | 1/4
8  | NYC      | 0.19       | 1/4

Query Execution on Samples
What is the average buffering ratio in the table? (full table as above)
Larger Uniform Sample, sampling rate 1/2:
ID | City     | Buff Ratio | Sampling Rate
2  | NYC      | 0.13       | 1/2
3  | Berkeley | 0.25       | 1/2
5  | NYC      | 0.19       | 1/2
6  | Berkeley | 0.09       | 1/2
8  | NYC      | 0.18       | 1/2
12 | Berkeley | 0.49       | 1/2
Estimate: 0.22 +/- 0.05
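For reference, the closed-form error bar behind an estimate like the one above is the standard one for a sample mean (a generic sketch, not BlinkDB-specific notation):

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \text{error} \approx z_{\alpha/2}\,\frac{s}{\sqrt{n}}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - \hat{\mu}\bigr)^2

where n is the number of sampled rows, s is the sample standard deviation, and z_{\alpha/2} is about 1.96 at 95% confidence (2.58 at 99%).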

Speed/Accuracy Trade-off
[Plot: Error vs. Execution Time; executing on the entire dataset takes about 30 minutes, while interactive queries need answers in about 5 seconds]

Speed/Accuracy Trade-off
[Same plot, with the pre-existing noise floor in the data marked: about 30 minutes to execute on the entire dataset vs. about 5 seconds for interactive queries]

Sampling vs. No Sampling
[Plot: Query Response Time (seconds) vs. fraction of full data; sampling gives roughly a 10x speedup because response time is dominated by I/O]

Sampling vs. No Sampling
[Same plot with error bars: roughly 0.02%, 0.07%, 1.1%, 3.4%, and 11% across the sampled fractions]

Video Quality Diagnosis
Latency on the full 17 TB input vs. 1.78 sec on a 1.7 GB sample: the top 10 worst performers are identical, and the sampled query is 440x faster!

What is BlinkDB? A data analysis (warehouse) system that…
- creates and maintains a variety of random and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- is compatible with Apache Hive, AMP Lab's Shark, and Facebook's Presto (storage, serdes, UDFs, types, metadata)

Samples (built from the full table above)
Stratified Sample:
ID | City     | Buff Ratio | Sampling Rate
2  | NYC      | 0.13       | 2/7
8  | NYC      | 0.25       | 2/7
6  | Berkeley | 0.09       | 2/5
12 | Berkeley | 0.49       | 2/5
Uniform Sample:
ID | City     | Buff Ratio | Sampling Rate
2  | NYC      | 0.13       | 1/3
8  | NYC      | 0.25       | 1/3
6  | Berkeley | 0.09       | 1/3
11 | NYC      | 0.19       | 1/3

Uniform Samples
Create a 1% random sample A_sample from A:
blinkdb> create table A_sample as select * from A samplewith 0.01;
Equivalent SQL: SELECT A.* FROM A WHERE rand() < 0.01 ORDER BY rand()
Keep track of per-row scaling information
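For concreteness, a minimal HiveQL sketch (an illustration under assumed names, not BlinkDB's actual internal rewrite) of materializing a 1% uniform sample while keeping the per-row scaling information as an extra column:

-- Assumed table/column names; 0.01 is the sampling rate remembered per row.
CREATE TABLE A_sample AS
SELECT A.*,
       0.01 AS sampling_rate   -- each kept row stands in for roughly 1/0.01 = 100 original rows
FROM A
WHERE rand() < 0.01;

A SUM over the sample can then be scaled back up as SUM(x / sampling_rate).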

Stratified Samples
Create a 1% stratified sample A_sample from A biased on col_A:
blinkdb> create table A_sample as select * from A samplewith 0.01 stratify on (col_A);
Equivalent SQL: SELECT A.* FROM A JOIN (SELECT K, logic(count(*)) AS ratio FROM A GROUP BY K) USING K WHERE rand() < ratio;
Keep track of per-row scaling information
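A hedged sketch of one way the per-stratum ratio could be computed in plain HiveQL; the 10,000-row cap per group and the names are assumptions for illustration, not BlinkDB's actual logic() rewrite:

-- Keep small groups in full (ratio = 1.0) and down-sample large groups,
-- recording the per-row ratio so aggregates can be re-scaled later.
CREATE TABLE A_stratified AS
SELECT A.*,
       r.ratio AS sampling_rate
FROM A
JOIN (
  SELECT col_A,
         LEAST(1.0, 10000 / COUNT(*)) AS ratio   -- assumed cap of 10,000 rows per stratum
  FROM A
  GROUP BY col_A
) r ON A.col_A = r.col_A
WHERE rand() < r.ratio;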

What is BlinkDB? A data analysis (warehouse) system that…
- creates and maintains a variety of random and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- is compatible with Apache Hive, AMP Lab's Shark, and Facebook's Presto (storage, serdes, UDFs, types, metadata)

Approximate Answers
blinkdb> select count(1) from A_sample where event = 'foo';
(estimate) +/- (error) at 99% confidence
Also supports: sum(), avg(), stdev(), var()
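As a rough illustration of the closed-form scaling behind such an answer, assuming a plain uniform sample drawn at a known rate of 0.01 (a sketch, not BlinkDB's approx_count() implementation):

-- Scale the sample count up by 1/rate; under independent row sampling the 99% error
-- is approximately 2.58 * sqrt(k * (1 - rate)) / rate, where k is the sample count.
SELECT COUNT(*) / 0.01                           AS approx_count,
       2.58 * SQRT(COUNT(*) * (1 - 0.01)) / 0.01 AS error_99
FROM A_sample
WHERE event = 'foo';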

What is BlinkDB? A data analysis (warehouse) system that…
- creates and maintains a variety of random and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- is compatible with Apache Hive, AMP Lab's Shark, and Facebook's Presto (storage, serdes, UDFs, types, metadata)

BlinkDB Architecture
[Diagram: Command-line Shell and Thrift/JDBC interfaces on top of an Execution Driver containing a SQL Parser, Query Optimizer, Physical Plan, and SerDes/UDFs; a Metastore; Hadoop/Spark/Presto execution engines; and Hadoop storage (e.g., HDFS, HBase, Presto)]

BlinkDB alpha
1. Released and publicly available
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL: approx_avg(), approx_sum(), approx_count(), etc. (see the example below)
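For example, a hypothetical session using these aggregates (the table and column names are the illustrative ones from earlier slides; the exact output format may differ):
blinkdb> select approx_avg(jobtime) from very_big_log where src = 'hadoop';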

Feature Roadmap
1. Integrating BlinkDB with Facebook's Presto and Shark as an experimental feature
2. Automatic Sample Management
3. More Hive Aggregates, UDAF Support
4. Runtime Correctness Tests

Automatic Sample Management
Goal: the API should abstract the details of creating, deleting, and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS
(returns an estimate +/- error)

Automatic Sample Management
Goal: the API should abstract the details of creating, deleting, and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS
(returns an estimate +/- 4.96)

Automatic Sample Management
Goal: the API should abstract the details of creating, deleting, and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%

Automatic Sample Management
[Diagram: a Sampling Module builds samples from the original data (TABLE)]
Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics

Automatic Sample Management
[Diagram: the Sampling Module produces in-memory and on-disk samples from the original data]
Sample placement: samples are striped over 100s or 1,000s of machines, both on disk and in memory

Automatic Sample Management
[Diagram: a HiveQL/SQL query such as SELECT foo(*) FROM TABLE WITHIN 2 goes through sample selection to produce a query plan over the in-memory and on-disk samples]

Automatic Sample Management
[Diagram as above]
Online sample selection: pick the best sample(s) based on the query's latency and accuracy requirements

Automatic Sample Management
[Diagram: the new query plan runs on Hive/Shark/Presto over the selected samples and returns a result with error bars and confidence intervals, e.g., Result +/- 5.56 (95% confidence)]
Parallel query execution on multiple samples striped across multiple machines

More Aggregates / UDAF Support
Generalized Error Estimators:
- Statistical Bootstrap
- Applicable to complex and nested queries, UDAFs, joins, etc.

More Aggregates / UDAF Support
Generalized Error Estimators:
- Statistical Bootstrap
- Applicable to complex and nested queries, UDAFs, joins, etc.
[Diagram: from sample A, draw resamples A1 … A100, run the query on each resample, and use the spread of the results to report the answer as +/- ε]
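A sketch of the bootstrap error estimate the diagram describes, in standard notation (B = 100 resamples here, following the slide; q is the query):

\hat{\theta}_i = q(A_i^*), \quad i = 1, \dots, B, \qquad \bar{\theta} = \frac{1}{B}\sum_{i=1}^{B}\hat{\theta}_i, \qquad \varepsilon \approx z_{\alpha/2}\sqrt{\frac{1}{B-1}\sum_{i=1}^{B}\bigl(\hat{\theta}_i - \bar{\theta}\bigr)^2}

where each A_i^* is drawn with replacement from the original sample A and has the same size as A.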

Runtime Diagnostics
1. Given a query, how do you know if it can be approximated at runtime?
   - Depends on the query, data distribution, and sample size
2. Need for runtime diagnosis tests
   - Check whether the error improves as the sample size increases
   - Need to be extremely fast!

Getting Started
1. BlinkDB alpha released and publicly available
2. Takes just 5-10 minutes to run locally or to spin up an EC2 cluster
3. Hands-on exercises in the afternoon!

Summary
1. Approximate queries are an important means of achieving interactivity when processing large datasets, without meaningfully affecting the quality of results
2. BlinkDB…
   - returns approximate answers with error bars by executing queries on small samples of data
   - supports existing Hive/Shark/Presto queries
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2013 (http://bit.ly/blinkdb-2) papers
Thanks!