EECS 262a Advanced Topics in Computer Systems
Lecture 16: Comparison of Parallel DB, CS, MR and Jockey
March 16th, 2016
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley

Today's Papers
– A Comparison of Approaches to Large-Scale Data Analysis
  Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker. Appears in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009
– Jockey: Guaranteed Job Latency in Data Parallel Clusters
  Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. Appears in Proceedings of the European Conference on Computer Systems (EuroSys), 2012
Thoughts?

Two Approaches to Large-Scale Data Analysis ("Shared nothing")
MapReduce
– Distributed file system
– Map, Split, Copy, Reduce
– MR scheduler
Parallel DBMS
– Standard relational tables (physical location transparent)
– Data are partitioned over cluster nodes
– SQL
– Join processing: T1 joins T2
  » If T2 is small, copy T2 to all the machines
  » If T2 is large, hash partition T1 and T2 and send partitions to different machines (this is similar to the split-copy in MapReduce)
– Query optimization
– Intermediate tables not materialized by default
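To make the hash-partition join strategy concrete, here is a minimal single-process Python sketch of a repartition join; the toy tables, the number of partitions, and the helper names are illustrative assumptions, not anything from the paper.

    from collections import defaultdict

    NUM_PARTITIONS = 4  # stand-in for the number of worker machines

    def partition(rows, key_index):
        """Hash-partition rows by the join key (analogous to MR's split/copy)."""
        parts = defaultdict(list)
        for row in rows:
            parts[hash(row[key_index]) % NUM_PARTITIONS].append(row)
        return parts

    def local_hash_join(t1_rows, t2_rows):
        """Join co-located partitions with an in-memory hash table, as each node would."""
        index = defaultdict(list)
        for row in t2_rows:
            index[row[0]].append(row)
        return [r1 + r2 for r1 in t1_rows for r2 in index.get(r1[0], [])]

    # Illustrative tables: T1(key, a) and T2(key, b).
    T1 = [(1, "a1"), (2, "a2"), (3, "a3")]
    T2 = [(1, "b1"), (3, "b3"), (4, "b4")]

    p1, p2 = partition(T1, 0), partition(T2, 0)
    result = []
    for p in range(NUM_PARTITIONS):   # each iteration would run on a different node
        result.extend(local_hash_join(p1.get(p, []), p2.get(p, [])))
    print(sorted(result))             # [(1, 'a1', 1, 'b1'), (3, 'a3', 3, 'b3')]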

Architectural Differences

                       Parallel DBMS                  MapReduce
  Schema Support       O                              X
  Indexing             O                              X
  Programming Model    Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, …)
  Optimization         O                              X
  Flexibility          Spotty UDF support             Good
  Fault Tolerance      Not as good                    Good
  Node Scalability     <100                           >10,000

Schema Support
MapReduce
– Flexible: programmers write code to interpret input data
– Good for a single-application scenario
– Bad if data are shared by multiple applications: must address data syntax, consistency, etc.
Parallel DBMS
– Relational schema required
– Good if data are shared by multiple applications

Programming Model & Flexibility
MapReduce
– Low level: "We argue that MR programming is somewhat analogous to Codasyl programming…"
– "Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets."
– Very flexible
Parallel DBMS
– SQL
– User-defined functions, stored procedures, user-defined aggregates

Indexing
MapReduce
– No native index support
– Programmers can implement their own index support in Map/Reduce code
– But the customized indexes are hard to share across multiple applications
Parallel DBMS
– Hash/B-tree indexes well supported

Execution Strategy & Fault Tolerance
MapReduce
– Intermediate results are saved to local files
– If a node fails, the node's task is run again on another node
– At a mapper machine, when multiple reducers are reading multiple local files, there can be large numbers of disk seeks, leading to poor performance
Parallel DBMS
– Intermediate results are pushed across the network
– If a node fails, the entire query must be re-run

Avoiding Data Transfers
MapReduce
– Schedules Map tasks close to their data
– But beyond this, programmers must avoid data transfers themselves
Parallel DBMS
– Many optimizations, such as determining where to perform filtering

Node Scalability
MapReduce
– 10,000s of commodity nodes
– 10s of petabytes of data
Parallel DBMS
– <100 expensive nodes
– Petabytes of data

Performance Benchmarks
Benchmark environment
Original MR task (Grep)
Analytical tasks
– Selection
– Aggregation
– Join
– User-defined-function (UDF) aggregation

Node Configuration
100-node cluster
– Each node: 2.40GHz Intel Core 2 Duo, 64-bit Red Hat Enterprise Linux 5 (kernel ), 4GB RAM, two 250GB SATA HDDs
Nodes interconnected with Cisco Catalyst 3750E 1Gb/s switches
– Internal switching fabric has 128Gbps
– 50 nodes per switch
Multiple switches interconnected via a 64Gbps Cisco StackWise ring
– The ring is only used for cross-switch communication

Tested Systems
Hadoop ( on Java 1.6.0)
– HDFS data block size: 256MB
– JVMs use a 3.5GB heap size per node
– "Rack awareness" enabled for data locality
– Three replicas, no compression: compression or fewer replicas in HDFS does not improve performance
DBMS-X (a parallel SQL DBMS from a major vendor)
– Row store
– 4GB shared memory for buffer pool and temp space per node
– Compressed tables (compression often reduces time by 50%)
Vertica
– Column store
– 256MB buffer size per node
– Compressed columns by default

Benchmark Execution
Data loading time:
– Actual loading of the data
– Additional operations after loading, such as compressing or building indexes
Execution time:
– DBMS-X and Vertica:
  » Final results are piped from a shell command into a file
– Hadoop:
  » Final results are stored in HDFS
  » An additional Reduce job step combines the multiple files into a single file

Performance Benchmarks
Benchmark environment
Original MR task (Grep)
Analytical tasks
– Selection
– Aggregation
– Join
– User-defined-function (UDF) aggregation

Task Description
From the MapReduce paper:
– Input data set: 100-byte records
– Look for a three-character pattern
– One match per 10,000 records
Varying the number of nodes:
– Fix the size of data per node (535MB/node)
– Fix the total data size (1TB)
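As a worked illustration of this input, here is a small Python sketch that generates 100-byte records and plants the pattern in one record out of every 10,000; the alphabet, the pattern "XYZ", and the record layout are assumptions made for the example, not the benchmark's exact generator.

    import random
    import string

    RECORD_LEN, PATTERN, MATCH_RATE = 100, "XYZ", 10_000  # from the task description

    def make_record(plant_pattern):
        """Build one 100-byte record, optionally planting the three-character pattern."""
        chars = [random.choice(string.ascii_lowercase) for _ in range(RECORD_LEN)]
        if plant_pattern:
            pos = random.randrange(RECORD_LEN - len(PATTERN))
            chars[pos:pos + len(PATTERN)] = PATTERN
        return "".join(chars)

    records = [make_record(i % MATCH_RATE == 0) for i in range(20_000)]
    print(sum(PATTERN in r for r in records), "matching records out of", len(records))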

Data Loading
Hadoop:
– Copy text files into HDFS in parallel
DBMS-X:
– LOAD SQL command executed in parallel: it performs hash partitioning and distributes records to multiple machines
– Reorganize data on each node: compress data, build indexes, perform other housekeeping
  » This happens in parallel
Vertica:
– COPY command to load data in parallel
– Data is re-distributed, then compressed

Data Loading Times
DBMS-X: grey is loading, white is re-organization after loading
– Loading is actually sequential despite the parallel load commands
Hadoop does better because it only copies the data to three HDFS replicas

Execution
SQL:
– SELECT * FROM data WHERE field LIKE '%XYZ%';
– Full table scan
MapReduce:
– Map: pattern search
– No Reduce
– An additional Reduce job combines the output into a single file
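A minimal Python sketch of the map-only pattern search described above (a stand-in for the Hadoop job, not its actual code; the pattern and the byte-offset keys are assumptions):

    import re

    PATTERN = re.compile(b"XYZ")  # assumed three-character pattern

    def grep_map(offset, record):
        """Map: emit the record if it contains the pattern; there is no Reduce phase."""
        if PATTERN.search(record):
            yield offset, record

    # Illustrative input: 100-byte records keyed by their byte offset.
    records = {0: b"a" * 97 + b"XYZ", 100: b"b" * 100}
    for key, value in records.items():
        for out_key, out_value in grep_map(key, value):
            print(out_key, out_value[-10:])   # 0 b'aaaaaaaXYZ'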

Execution Time
– Hadoop's large start-up cost shows up in Figure 4, when the data per node is small
– Vertica benefits from good data compression
(figure: stacked bars of "grep" time and "combine output" time)

Performance Benchmarks
Benchmark environment
Original MR task (Grep)
Analytical tasks
– Selection
– Aggregation
– Join
– User-defined-function (UDF) aggregation

Input Data
Input #1: random HTML documents
– Inside an HTML doc, links are generated with a Zipfian distribution
– 600,000 unique HTML docs with unique URLs per node
Input #2: 155 million UserVisits records
– 20GB/node
Input #3: 18 million Rankings records
– 1GB/node

Selection Task
Find the pageURLs in the Rankings table (1GB/node) with a pageRank above a threshold
– 36,000 records per data file (very selective)
SQL:
  SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
MR: single Map, no Reduce
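A hedged Python sketch of that single-Map selection (the comma-separated record layout and the threshold value are assumptions made for the example):

    THRESHOLD = 10  # stand-in for X

    def selection_map(line):
        """Map: parse a Rankings record 'pageURL,pageRank,avgDuration' and keep high ranks."""
        page_url, page_rank, _ = line.split(",")
        if int(page_rank) > THRESHOLD:
            yield page_url, int(page_rank)

    sample = ["url_a,15,3", "url_b,4,7", "url_c,22,1"]
    for line in sample:
        for url, rank in selection_map(line):
            print(url, rank)   # url_a 15, then url_c 22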

Selection Task
– Hadoop's start-up cost
– DBMS uses an index
– Vertica's reliable message layer becomes a bottleneck

Aggregation Task
Calculate the total adRevenue generated for each sourceIP in the UserVisits table (20GB/node), grouped by the sourceIP column
– Nodes must exchange info to compute the group-by
– Generates 53MB of data regardless of the number of nodes
SQL:
  SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
MR:
– Map: outputs (sourceIP, adRevenue)
– Reduce: compute the sum per sourceIP
– A "Combine" step is used
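A minimal single-process Python sketch of this MR job with a combiner; the UserVisits field positions are an assumption, and the combine and reduce steps intentionally share the same summing logic:

    from collections import defaultdict

    def agg_map(line):
        """Map: emit (sourceIP, adRevenue) from a record 'sourceIP,destURL,visitDate,adRevenue,...'."""
        fields = line.split(",")
        yield fields[0], float(fields[3])

    def agg_combine_or_reduce(pairs):
        """Combine/Reduce: sum adRevenue per sourceIP."""
        totals = defaultdict(float)
        for source_ip, revenue in pairs:
            totals[source_ip] += revenue
        return dict(totals)

    sample = ["1.2.3.4,u1,2000-01-16,0.50",
              "5.6.7.8,u2,2000-01-17,1.25",
              "1.2.3.4,u3,2000-01-18,0.25"]
    mapped = [pair for line in sample for pair in agg_map(line)]
    print(agg_combine_or_reduce(mapped))   # {'1.2.3.4': 0.75, '5.6.7.8': 1.25}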

Aggregation Task
DBMS: local group-by, then the coordinator performs the global group-by; performance is dominated by data transfer

Join Task
Find the sourceIP that generated the most revenue within Jan 15-22, 2000, then calculate the average pageRank of all the pages visited by that sourceIP during this interval
SQL:
  SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
  FROM Rankings AS R, UserVisits AS UV
  WHERE R.pageURL = UV.destURL
    AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
  GROUP BY UV.sourceIP;

  SELECT sourceIP, totalRevenue, avgPageRank
  FROM Temp
  ORDER BY totalRevenue DESC LIMIT 1;

MapReduce
– Phase 1: filter UserVisits records that are outside the desired date range, then join the qualifying records with records from the Rankings file
– Phase 2: compute the total adRevenue and average pageRank per sourceIP
– Phase 3: produce the record with the largest totalRevenue
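A compressed, single-process Python sketch of those three phases as a reduce-side join; the toy data, record layouts, and variable names are assumptions, not the paper's Hadoop code:

    from collections import defaultdict
    from datetime import date

    START, END = date(2000, 1, 15), date(2000, 1, 22)
    rankings = {"u1": 10, "u2": 30}                              # pageURL -> pageRank (toy data)
    user_visits = [("1.2.3.4", "u1", date(2000, 1, 16), 0.50),   # (sourceIP, destURL, visitDate, adRevenue)
                   ("1.2.3.4", "u2", date(2000, 1, 17), 1.25),
                   ("5.6.7.8", "u1", date(2000, 2, 1), 9.99)]    # outside the date range

    # Phase 1: filter UserVisits by date and join with Rankings on destURL == pageURL.
    joined = [(ip, rankings[url], rev) for ip, url, d, rev in user_visits
              if START <= d <= END and url in rankings]

    # Phase 2: per sourceIP, accumulate total adRevenue and average pageRank.
    acc = defaultdict(lambda: [0.0, 0.0, 0])                     # ip -> [revenue, rank_sum, count]
    for ip, rank, rev in joined:
        acc[ip][0] += rev
        acc[ip][1] += rank
        acc[ip][2] += 1
    per_ip = {ip: (rev, rank_sum / n) for ip, (rev, rank_sum, n) in acc.items()}

    # Phase 3: a single reducer picks the sourceIP with the largest total revenue.
    best_ip = max(per_ip, key=lambda ip: per_ip[ip][0])
    print(best_ip, per_ip[best_ip])                              # 1.2.3.4 (1.75, 20.0)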

Join Task
– The DBMSs can use an index, and both relations are partitioned on the join key; MR has to read all the data
– MR phase 1's average runtime includes roughly 600 seconds of raw I/O to read the table and 300 seconds to split, parse, and deserialize; CPU overhead is thus the limiting factor

UDF Aggregation Task
Compute the inlink count per document
SQL:
  SELECT INTO Temp F(contents) FROM Documents;
  SELECT url, SUM(value) FROM Temp GROUP BY url;
– Needs a user-defined function to parse the HTML docs (a C program using the POSIX regex library)
– Neither DBMS supports UDFs well, requiring a separate program that uses local disk and bulk loading of the DBMS
– Why was MR always forced to use a Reduce to combine results?
MR:
– A standard MR program
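A small Python sketch of the MR version of the inlink count; the simplistic href-matching regex and the document format are assumptions for the example:

    import re
    from collections import defaultdict

    LINK_RE = re.compile(r'href="([^"]+)"')   # assumed, naive URL extractor

    def inlink_map(doc_url, contents):
        """Map: emit (target_url, 1) for each outgoing link found in a document."""
        for target in LINK_RE.findall(contents):
            yield target, 1

    def inlink_reduce(pairs):
        """Reduce: sum the 1s per target URL to get its inlink count."""
        counts = defaultdict(int)
        for url, one in pairs:
            counts[url] += one
        return dict(counts)

    docs = {"d1": '<a href="u1">x</a> <a href="u2">y</a>',
            "d2": '<a href="u1">z</a>'}
    mapped = [pair for url, text in docs.items() for pair in inlink_map(url, text)]
    print(inlink_reduce(mapped))   # {'u1': 2, 'u2': 1}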

UDF Aggregation
– DBMS bars: lower segment is UDF time; upper segment is the rest of the query time
– Hadoop bars: lower segment is query time; upper segment is combining all results into one file

Discussion
Throughput experiments?
Parallel DBMSs are much more challenging than Hadoop to install and configure properly
– DBMSs require professional DBAs to configure/tune
Alternatives: Shark (Hive on Spark)
– Eliminates Hadoop task start-up cost and answers queries with sub-second latencies
  » 100-node system: 10 seconds until the first task starts, 25 seconds until all nodes run tasks
– Columnar memory store (multiple orders of magnitude faster than disk)
Compression: does it not help in Hadoop?
– An artifact of Hadoop's Java-based implementation?
Execution strategy (DBMS), failure model (Hadoop), ease of use (H/D)
Other alternatives? Apache Hive, Impala (Cloudera), HadoopDB (Hadapt), …

Alternative: HadoopDB?
The basic idea (an architectural hybrid of MR & DBMS):
– Use MR as the communication layer above multiple nodes running single-node DBMS instances
– Queries are expressed in SQL and translated into MR by extending existing tools
– As much work as possible is pushed into the higher-performing single-node databases
How many of the complaints from the Comparison paper still apply here?
The Hadapt startup is commercializing HadoopDB

Is this a good paper?
– What were the authors' goals?
– What about the evaluation/metrics?
– Did they convince you that this was a good system/approach?
– Were there any red flags?
– What mistakes did they make?
– Does the system/approach meet the "Test of Time" challenge?
– How would you review this paper today?

BREAK

Domain for Jockey: Large Cluster Jobs
– Predictability very important
– Enforcement of deadlines is one way toward predictability

Variable Execution Latency: Prevalent
Even for the job with the narrowest latency profile:
– Over 4.3x variation in latency
Reasons for latency variation:
– Pipeline complexity
– Noisy execution environment
– Excess resources

Job Model: Graph of Interconnected Stages
(figure: a job is a graph of stages; each stage is a set of tasks)

Dryad's DAG Workflow
– Many simultaneous job pipelines executing at once
– Some on behalf of Microsoft, others on behalf of customers

Compound Workflow
– Dependencies mean that deadlines on the complete pipeline create deadlines on the constituent jobs
– The median job's output is used by 10 additional jobs

– Priorities? Not expressive enough
– Weights? Difficult for users to set
– Utility curves? Capture deadline & penalty; the best way to express performance targets
Jockey's goal: maximize utility while minimizing resources by dynamically adjusting the allocation
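To make the utility-curve idea concrete, here is a toy Python sketch of a deadline-plus-penalty curve; the numbers and the decay shape are invented for illustration and are not from the paper:

    def utility(finish_time_min, deadline_min=60, reward=100.0, penalty=-50.0):
        """Step-like utility: full reward if the job finishes by the deadline,
        decaying linearly to a flat penalty if it finishes late (assumed shape)."""
        if finish_time_min <= deadline_min:
            return reward
        overrun = finish_time_min - deadline_min
        # Linear decay over a 30-minute grace window, then a flat penalty.
        return max(penalty, reward - (reward - penalty) * overrun / 30.0)

    for t in (50, 60, 75, 120):
        print(t, "min ->", utility(t))   # 100.0, 100.0, 25.0, -50.0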

Application Modeling
Techniques:
– Job simulator:
  » Input from profiling feeds a simulator which explores possible scenarios
  » Computes an estimate of the remaining run time
– Amdahl's Law:
  » Time = S + P/N
  » Estimate S and P from the standpoint of the current stage
Progress metric? Many were explored
– totalworkWithQ: total time completed tasks spent enqueued or executing
Optimization: the minimum allocation that maximizes utility
Control loop design: slack (1.2), hysteresis, dead zone (D)
C(progress, allocation) → remaining run time
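A tiny Python sketch of the Amdahl's-Law-style estimate above: given serial work S, parallel work P, an allocation of N nodes, and the fraction of work already done, predict the remaining run time. The simple "scale by remaining fraction" progress model is an assumption of mine, not Jockey's actual C(progress, allocation) function.

    def remaining_time(S, P, N, progress):
        """Estimate remaining run time from Time = S + P/N, scaled by the
        fraction of work still outstanding (assumed, simplified progress model)."""
        total = S + P / N          # Amdahl-style completion time on N nodes
        return (1.0 - progress) * total

    # Example: 10 minutes of serial work, 600 node-minutes of parallel work, 5% done.
    for n in (10, 20, 30):
        print(n, "nodes ->", round(remaining_time(10, 600, n, progress=0.05), 1), "minutes left")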

Jockey – Control Loop
Example: completion 1%, deadline 50 minutes

                 10 nodes     20 nodes     30 nodes
  1% complete    60 minutes   40 minutes   25 minutes
  2% complete    59 minutes   39 minutes   24 minutes
  3% complete    58 minutes   37 minutes   22 minutes
  4% complete    56 minutes   36 minutes   21 minutes
  5% complete    54 minutes   34 minutes   20 minutes

Jockey – Control Loop (same lookup table as above)
Example: completion 3%, deadline 50 minutes

Jockey – Control Loop (same lookup table as above)
Example: completion 5%, deadline 30 minutes
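A minimal Python sketch of the control-loop decision these three slides step through: look up the predicted completion time for the current progress under each candidate allocation and pick the smallest allocation that still meets the deadline. The table values mirror the slides and the slack factor of 1.2 comes from the previous slide, while the tie-breaking and fallback details are assumptions.

    # Predicted completion time (minutes), indexed by (percent complete, node count),
    # copied from the lookup table on the slides.
    PREDICTED = {
        1: {10: 60, 20: 40, 30: 25},
        2: {10: 59, 20: 39, 30: 24},
        3: {10: 58, 20: 37, 30: 22},
        4: {10: 56, 20: 36, 30: 21},
        5: {10: 54, 20: 34, 30: 20},
    }

    def choose_allocation(percent_complete, deadline_min, slack=1.2):
        """Return the smallest allocation whose slack-inflated prediction meets the deadline."""
        row = PREDICTED[percent_complete]
        for nodes in sorted(row):              # try the cheapest allocation first
            if row[nodes] * slack <= deadline_min:
                return nodes
        return max(row)                        # deadline unreachable: use everything available

    print(choose_allocation(1, 50))   # 20 nodes (40 * 1.2 = 48 <= 50)
    print(choose_allocation(3, 50))   # 20 nodes (37 * 1.2 = 44.4 <= 50)
    print(choose_allocation(5, 30))   # 30 nodes (20 * 1.2 = 24 <= 30)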

Jockey in Action
– Initial deadline: 140 minutes

Jockey in Action
– New deadline: 70 minutes

Jockey in Action
– New deadline: 70 minutes
– Resources are released due to excess pessimism

Jockey in Action
(figure: the "oracle" allocation shown as total allocation-hours, with the deadline marked)

Jockey in Action
(figure: "oracle" allocation and deadline as above; available parallelism is less than the allocation)

Jockey in Action
(figure: "oracle" allocation and deadline as above; the allocation is above the oracle)

Evaluation
(figure: fraction of jobs which met the SLO, annotated "1.4x")

Evaluation
– Allocated too many resources
– The simulator made good predictions: 80% finish before the deadline
– The control loop is stable and successful
– Missed only 1 of 94 deadlines

Evaluation

Is this a good paper?
– What were the authors' goals?
– What about the evaluation/metrics?
– Did they convince you that this was a good system/approach?
– Were there any red flags?
– What mistakes did they make?
– Does the system/approach meet the "Test of Time" challenge?
– How would you review this paper today?