Berkeley Data Analytics Stack


1 Berkeley Data Analytics Stack
Prof. Harold Liu 15 December 2014

2 Data Processing Goals
Low latency (interactive) queries on historical data: enable faster decisions. E.g., identify why a site is slow and fix it.
Low latency queries on live data (streaming): enable decisions on real-time data. E.g., detect and block worms in real time (a worm may infect 1 million hosts in 1.3 seconds).
Sophisticated data processing: enable "better" decisions. E.g., anomaly detection, trend analysis.
So what does this mean? We want low response time on historical data, since the faster we can make a decision, the better. We want the ability to query live data, since decisions on real-time data beat decisions on stale data. Finally, we want sophisticated processing on massive data since, in principle, processing more data leads to better decisions.

3 Today's Open Analytics Stack…
…mostly focused on large on-disk datasets: great for batch but slow.
Stack layers (top to bottom): Application, Data Processing, Storage, Infrastructure.
Today's open analytics stack is mostly focused on large on-disk datasets. As a result, it provides sophisticated processing on massive data, but it is slow. The stack consists of roughly four layers. Except for Google, Microsoft, and to some extent Amazon, this is the de facto standard stack for data analysis today; it is used by Yahoo!, Facebook, and Twitter, to name a few.

4 Goals
One stack to rule them all: batch, interactive, and streaming!
Easy to combine batch, streaming, and interactive computations
Easy to develop sophisticated algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)

5 Support Interactive and Streaming Computations
Aggressive use of memory. Why?
1. Memory transfer rates >> disk or SSD transfer rates.
2. Many datasets already fit into memory: the inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory (e.g., 1 TB = 1 billion records of 1 KB each).
3. Memory density (still) grows with Moore's law; RAM/SSD hybrid memories are on the horizon.
Figure: a high-end datacenter node: 16 cores, 10 Gbps network, RAM at 40-60 GB/s, 0.2-1 GB/s across 10 disks (10-30 TB), 1-4 GB/s across 4 drives (1-4 TB).
What do people use big data for? First, to generate reports, to track and better understand business processes and transactions. Second, to diagnose problems and answer questions such as: why is user engagement dropping? why is the system slow? or to detect spam, worms, or DDoS attacks. Most importantly, they use it to make decisions: improving a business process, deciding what features to add to a product, deciding what ad to show, or, once spam is identified, blocking it. The development of the BDAS stack is driven by the belief that data is only as useful as the decisions you can make based on it.

6 Support Interactive and Streaming Computations
Increase parallelism. Why? Reducing the work per node improves latency.
Techniques:
Low-latency parallel scheduler that achieves high locality
Optimized parallel communication patterns (e.g., shuffle, broadcast)
Efficient recovery from failures and straggler mitigation
Figure: spreading the same job over more nodes reduces the response time from T to Tnew (< T).

7 Support Interactive and Streaming Computations
Trade result accuracy for response time. Why? In-memory processing alone does not guarantee interactive queries: it takes on the order of 10 seconds just to scan 512 GB of RAM (16 cores at 40-60 GB/s), and the gap between memory capacity and transfer rate keeps growing (capacity doubles every 18 months, bandwidth only every 36 months).
Challenge: accurately estimate error and running time for arbitrary computations.

8 Berkeley Data Analytics Stack (BDAS)
New apps: AMP-Genomics, Carat, …
BDAS differs from today's stack in two fundamental aspects: efficient data sharing across frameworks (data management layer) and sharing infrastructure across frameworks, i.e., multi-programming for datacenters (resource management layer). At the data processing layer it adds in-memory processing and trades between time, quality, and cost.
At the application layer we build new, real applications, such as AMP-Genomics (a genomics pipeline) and Carat. Building real applications lets us drive the features and design of the lower layers.

9 Berkeley AMPLab
AMP = Algorithms, Machines, People
"Launched" January 2011 with a 6-year plan: 8 CS faculty, ~40 students, 3 software engineers
Organized for collaboration

10 Berkeley AMPLab
Funding: XData, CISE Expedition Grant; industrial founding sponsors; 18 other sponsors
Goal: the next generation of the analytics data stack for industry and research: the Berkeley Data Analytics Stack (BDAS), released as open source

11 Berkeley Data Analytics Stack (BDAS)
Existing stack components:
Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI
Data Management: HDFS
Resource Management: (see next slide)

12 Mesos
Management platform that allows multiple frameworks to share a cluster
Compatible with the existing open analytics stack
Deployed in production at Twitter on 3,500+ servers
Our instantiation of the resource management layer is Mesos; it sits under HDFS (data management) and Hadoop, HIVE, Pig, HBase, Storm, and MPI (data processing).

13 Spark
In-memory framework for interactive and iterative computations
Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
Scala interface; Java and Python APIs
Spark joins Hadoop, HIVE, Pig, Storm, and MPI at the data processing layer, on top of HDFS and Mesos.

14 Spark Streaming [Alpha Release]
Large-scale streaming computation
Ensures exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations!
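To make that unification concrete, here is a minimal sketch against the Spark Streaming Scala API (the socket source, host/port, and 1-second batch interval are illustrative assumptions, not from the slides):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // One StreamingContext per application; input is chopped into 1 s batches.
  val ssc = new StreamingContext("local[2]", "ErrorCounter", Seconds(1))

  // Ingest lines from a TCP socket (Kafka, Flume, etc. are also supported).
  val lines = ssc.socketTextStream("localhost", 9999)

  // The same functional operators used on RDDs apply to streams.
  lines.filter(_.contains("ERROR")).count().print()

  ssc.start()             // start receiving and processing
  ssc.awaitTermination()  // block until the computation is stopped

The point of the slide is visible in the middle two lines: the streaming code is the same filter/count code you would write for a batch RDD.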

15 Shark (later evolved into Spark SQL)
HIVE over Spark: SQL-like interface (supports Hive 0.9)
Up to 100x faster for in-memory data, and 5-10x for on-disk data
In tests on a cluster of hundreds of nodes

16 Tachyon
High-throughput, fault-tolerant in-memory storage
Interface compatible with HDFS
Support for Spark and Hadoop

17 BlinkDB
Large-scale approximate query engine
Allows users to specify error or time bounds
Preliminary prototype starting to be tested at Facebook
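BlinkDB's actual interface is SQL with explicit error/time bounds; purely to illustrate the underlying idea, here is a crude sampled estimate written against plain Spark (not BlinkDB's machinery; the path, the ERROR predicate, and the 1% fraction are made up, and sc is an assumed SparkContext):

  // Estimate the number of ERROR lines from a 1% uniform sample.
  // A real engine like BlinkDB also derives error bars for the answer;
  // this sketch only scales the sampled count back up.
  val fraction = 0.01
  val lines = sc.textFile("hdfs://.../logs")
  val sampleCount = lines
    .sample(withReplacement = false, fraction, seed = 42L)
    .filter(_.contains("ERROR"))
    .count()
  val estimate = sampleCount / fraction
  println("Estimated ERROR lines: " + estimate)

Scanning 1% of the data is roughly 100x cheaper, which is exactly the accuracy-for-latency trade described on slide 7.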

18 SparkGraph
GraphLab API and toolkits on top of Spark
Fault tolerance by leveraging Spark

19 MLlib
Declarative approach to ML
Develop scalable ML algorithms
Make ML accessible to non-experts

20 Compatible with Open Source Ecosystem
Support existing interfaces whenever possible:
GraphLab API
Hive interface and shell
HDFS API
Compatibility layer so that Hadoop, Storm, MPI, etc. can run over Mesos

21 Compatible with Open Source Ecosystem
Use existing interfaces whenever possible:
Accept inputs from Kafka, Flume, Twitter, TCP sockets, …
Support the Hive API
Support the HDFS API, S3 API, and Hive metadata

22 Summary
Support interactive and streaming computations: in-memory fault-tolerant storage abstraction, low-latency scheduling, …
Easy to combine batch, streaming, and interactive computations: the Spark execution engine supports all three computation models
Easy to develop sophisticated algorithms: Scala interface plus APIs for Java, Python, Hive QL, …; new frameworks targeted at graph-based and ML algorithms
Compatible with the existing open source ecosystem
Open source (Apache/BSD) and fully committed to releasing high-quality software; three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

23 Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
UC Berkeley

24 Background
Commodity clusters have become an important computing platform for a variety of applications
In industry: search, machine translation, ad targeting, …
In research: bioinformatics, NLP, climate simulation, …
High-level cluster programming models like MapReduce power many of these apps
Theme of this work: provide similarly powerful abstractions for a broader class of applications
(Trends in data collection rates vs. processor and I/O speeds are driving this.)

25 Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage
E.g., MapReduce: Input → Map → Reduce → Output
(The same applies to Dryad, SQL, etc.)
Benefits: easy to do fault tolerance and …

26 Motivation
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (many in machine learning)
Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently support these apps

27 Spark Goal
Provide distributed memory abstractions for clusters to support apps with working sets
Retain the attractive properties of MapReduce: fault tolerance (for crashes & stragglers), data locality, scalability
Solution: augment the data flow model with "resilient distributed datasets" (RDDs), a first-class way to manipulate and persist intermediate datasets

28 Programming Model
Resilient distributed datasets (RDDs):
Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
Can be cached across parallel operations
Parallel operations on RDDs: reduce, collect, count, save, …
Restricted shared variables: accumulators, broadcast variables
You write a single driver program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations: accumulators are useful for sending information back to the driver, and broadcast (cached) variables are an optimization. All of it is designed to be easy to distribute in a fault-tolerant fashion.
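A minimal sketch of the two restricted shared-variable types, using the classic Scala API (sc is an assumed SparkContext; the log format and severity table are made-up placeholders):

  // Broadcast variable: ship a read-only value to every worker once.
  val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))

  // Accumulator: workers may only add to it; only the driver reads it.
  val badLines = sc.accumulator(0)

  val levels = sc.textFile("hdfs://...").map { line =>
    val level = line.split('\t')(0)
    if (!severity.value.contains(level)) badLines += 1
    severity.value.getOrElse(level, 0)
  }
  levels.count()  // an action forces evaluation
  println("Unrecognized lines: " + badLines.value)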

29 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")           // base RDD
  errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()                  // cached RDD

  cachedMsgs.filter(_.contains("foo")).count     // parallel operation
  cachedMsgs.filter(_.contains("bar")).count
  . . .

Diagram: the driver ships tasks to workers; each worker caches its blocks of the dataset (Cache 1-3 over Blocks 1-3) and returns results.
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

30 RDDs in More Detail
An RDD is an immutable, partitioned, logical collection of records
Need not be materialized; instead, it contains the information needed to rebuild the dataset from stable storage
Partitioning can be based on a key in each record (using hash or range partitioning)
Built using bulk transformations on other RDDs
Can be cached for future reuse
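For example, key-based partitioning can be requested explicitly; a sketch (the path, key field, and partition count are arbitrary, and the import assumes the modern org.apache.spark package layout):

  import org.apache.spark.HashPartitioner

  // Pair RDD keyed by the first tab-separated field of each record.
  val pairs = sc.textFile("hdfs://...")
    .map(line => (line.split('\t')(0), line))

  // Hash-partition into 16 partitions so equal keys are co-located,
  // then cache; later joins/groupBys on the same key avoid a shuffle.
  val byKey = pairs.partitionBy(new HashPartitioner(16)).cache()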


32 RDD Operations
Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
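Putting a few of these together (a sketch; the file layout is an assumption, with an HTTP status code in the third tab-separated field):

  val lines = sc.textFile("hdfs://.../access.log")

  // Transformations are lazy: nothing executes until an action is called.
  val byStatus = lines
    .map(_.split('\t'))
    .filter(_.length > 2)
    .map(fields => (fields(2), 1))
    .reduceByKey(_ + _)

  // An action triggers execution and returns results to the driver.
  byStatus.collect().foreach { case (status, n) => println(status + ": " + n) }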


38 RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. E.g.:

  cachedMsgs = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
                            .cache()

Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(…)) → MappedRDD (func: split(…)) → CachedRDD

39 Example 1: Logistic Regression
Goal: find the best line separating two sets of points. Starting from a random initial line, each iteration moves it toward the target. Note that the dataset is reused on each gradient computation.

40 Logistic Regression Code

  val data = spark.textFile(...).map(readPoint).cache()  // dataset reused every iteration
  var w = Vector.random(D)                               // random initial separating plane

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
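Because data is cached, only the first pass pays the cost of reading and parsing the input; every later gradient step runs against in-memory data, which is exactly the effect the timings on the next slide show.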

41 Logistic Regression Performance
127 s / iteration
First iteration: 174 s; further iterations: 6 s
Setup: 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)

42 Example 2: MapReduce
MapReduce data flow can be expressed using RDD transformations:

  res = data.flatMap(rec => myMapFunc(rec))
            .groupByKey()
            .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

  res = data.flatMap(rec => myMapFunc(rec))
            .reduceByKey(myCombiner)
            .map { case (key, v) => myReduceFunc(key, v) }
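As a concrete instance of this pattern (a sketch; the paths are placeholders), word count emits a (word, 1) pair per word and sums the counts per key:

  val counts = sc.textFile("hdfs://.../input")
    .flatMap(line => line.split(" ").map(word => (word, 1)))  // the "map" phase
    .reduceByKey(_ + _)                                       // combiner + reduce
  counts.saveAsTextFile("hdfs://.../counts")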

43 Example 3


46 Other Spark Applications
Twitter spam classification (Justin Ma)
EM algorithm for traffic prediction (Mobile Millennium)
K-means clustering
Alternating least squares matrix factorization
In-memory OLAP aggregation on Hive data
SQL on Spark (future work)


48 Conclusion
By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics.
RDDs provide:
Lineage information for fault recovery and debugging
Adjustable in-memory caching
Locality-aware parallel operations
We plan to make Spark the basis of a suite of batch and interactive data analysis tools.

