Berkeley Data Analytics Stack Prof. Chi (Harold) Liu November 2015

Data Processing Goals
Low latency (interactive) queries on historical data: enable faster decisions
–e.g., identify why a site is slow and fix it
Low latency queries on live data (streaming): enable decisions on real-time data
–e.g., detect and block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
Sophisticated data processing: enable “better” decisions
–e.g., anomaly detection, trend analysis

Today’s Open Analytics Stack
…mostly focused on large on-disk datasets: great for batch, but slow
(diagram: stack layers, bottom to top: Infrastructure, Storage, Data Processing, Application)

Goals
One stack to rule them all!
(diagram: Batch, Interactive, and Streaming converging on a single stack)
 Easy to combine batch, streaming, and interactive computations
 Easy to develop sophisticated algorithms
 Compatible with the existing open source ecosystem (Hadoop/HDFS)

Support Interactive and Streaming Computations
Aggressive use of memory. Why?
1. Memory transfer rates >> disk or SSDs
2. Many datasets already fit into memory
–the inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
–e.g., 1TB = 1 billion records of 1KB each
3. Memory density (still) grows with Moore’s law
–RAM/SSD hybrid memories on the horizon
(diagram: a high-end datacenter node: 16 cores, 10Gbps network, 10-30TB of disk (x10 disks), 1-4TB of SSD at 1-4GB/s (x4 disks), and RAM at 40-60GB/s)

Support Interactive and Streaming Computations
Increase parallelism. Why?
–Reduce work per node  improve latency
Techniques:
–low-latency parallel scheduler that achieves high locality
–optimized parallel communication patterns (e.g., shuffle, broadcast)
–efficient recovery from failures and straggler mitigation
(diagram: adding nodes shrinks query response time from T to T_new < T)

Support Interactive and Streaming Computations
Trade off result accuracy against response time. Why?
–In-memory processing does not guarantee interactive query processing: e.g., it takes ~10s of seconds just to scan 512GB of RAM!
–The gap between memory capacity and transfer rate keeps increasing (capacity doubles every 18 months; bandwidth only every 36 months)
Challenges:
–accurately estimate error and running time for…
–…arbitrary computations

Berkeley Data Analytics Stack (BDAS)
(diagram: layers Infrastructure, Storage, Data Processing, Application, spanned by Resource Management and Data Management)
–Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
–Data Management: efficient data sharing across frameworks
–Data Processing: in-memory processing; trade off between time, quality, and cost
–Application: new apps such as AMP-Genomics, Carat, …

Berkeley AMPLab
“Launched” January 2011: a 6-year plan
–8 CS faculty
–~40 students
–3 software engineers
Organized for collaboration

Berkeley AMPLab Funding
–DARPA XData, NSF CISE Expeditions grant
–industrial founding sponsors
–18 other sponsors
Goal: a next-generation analytics data stack for industry and research:
–the Berkeley Data Analytics Stack (BDAS)
–released as open source

Berkeley Data Analytics Stack (BDAS)
Existing stack components…
–Data Processing: Hadoop, Hive, Pig, HBase, Storm, MPI, …
–Data Management: HDFS
–Resource Management: (not yet shared across frameworks; see Mesos next)

Mesos
Management platform that allows multiple frameworks to share a cluster
Compatible with the existing open analytics stack
Deployed in production at Twitter on 3,500+ servers
(diagram: Mesos becomes the Resource Management layer beneath Hadoop, Hive, Pig, HBase, Storm, MPI, and HDFS)

Spark
In-memory framework for interactive and iterative computations
–Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
Scala interface; Java and Python APIs
(diagram: Spark joins Hadoop, Hive, Pig, Storm, and MPI in the Data Processing layer, above Mesos and HDFS)

Spark Streaming [alpha release]
Large-scale streaming computation
Ensures exactly-once semantics
Integrated with Spark  unifies batch, interactive, and streaming computations!
(diagram: Spark Streaming sits on top of Spark in the Data Processing layer)
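To make the model concrete, here is a minimal word-count sketch, assuming the classic Spark 1.x StreamingContext API and a hypothetical plain-text source on localhost:9999:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount")
        // Discretize the stream into 1-second micro-batches (DStreams).
        val ssc = new StreamingContext(conf, Seconds(1))
        // Hypothetical source: a plain-text stream on a local socket.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same flatMap/map/reduceByKey chain would run unchanged as a batch job on an RDD, which is the point of unifying the computation models.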

Shark  Spark SQL
Hive over Spark: SQL-like interface (supports Hive 0.9)
–up to 100x faster for in-memory data, and 5-10x for on-disk data
–tested on a cluster of hundreds of nodes at …
(diagram: Shark joins the Data Processing layer on top of Spark)
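For flavor, a sketch of the SQL-on-Spark idea using the later Spark SQL HiveContext API (Shark itself was eventually folded into Spark SQL); the access_log table and its columns are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object SqlOnSpark {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SqlOnSpark"))
        val hive = new HiveContext(sc) // reads existing Hive metastore tables
        // The SQL query plan is compiled down to Spark jobs.
        val slow = hive.sql(
          "SELECT page, latency FROM access_log WHERE latency > 1000")
        slow.show()
      }
    }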

Tachyon
High-throughput, fault-tolerant in-memory storage
Interface compatible with HDFS
Supports Spark and Hadoop
(diagram: Tachyon joins HDFS in the Data Management layer)
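Because the interface is HDFS-compatible, a Spark job can read and write Tachyon through a tachyon:// URI instead of hdfs://. A hedged sketch with hypothetical master address and paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object TachyonIO {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TachyonIO"))
        // Hypothetical master address and paths (19998 was Tachyon's default
        // master port); requires the Tachyon client on the classpath.
        val logs = sc.textFile("tachyon://master:19998/logs/access.log")
        val errors = logs.filter(_.contains("ERROR"))
        // The output lands in memory-speed storage shared across frameworks.
        errors.saveAsTextFile("tachyon://master:19998/logs/errors")
      }
    }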

BlinkDB
Large-scale approximate query engine
Allows users to specify error or time bounds
A preliminary prototype is being tested at Facebook
(diagram: BlinkDB joins Shark and Spark Streaming in the Data Processing layer)

SparkGraph
GraphLab API and toolkits on top of Spark
Fault tolerance by leveraging Spark
(diagram: SparkGraph joins the Data Processing layer)
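As a taste of graph processing on Spark, a sketch using the later GraphX API (a successor in the same spirit, not the SparkGraph prototype itself), with a hypothetical edge-list file:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    object PageRankExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("PageRank"))
        // Load a graph from a file of "srcId dstId" pairs, one per line.
        val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")
        // Iterate PageRank until ranks move less than the tolerance;
        // lineage-based fault tolerance comes for free from Spark.
        val ranks = graph.pageRank(0.0001).vertices
        ranks.take(10).foreach(println)
      }
    }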

MLlib (MLbase)
Declarative approach to ML
Develop scalable ML algorithms
Make ML accessible to non-experts
(diagram: MLbase joins the Data Processing layer)
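A sketch of what a scalable ML algorithm looks like from the user's side, using MLlib's RDD-based API (the training-data path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    object MLlibExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("MLlibExample"))
        // Load labeled points in LIBSVM format; cache them, since the
        // iterative optimizer scans the training set repeatedly.
        val training = MLUtils.loadLibSVMFile(sc, "hdfs://.../sample.txt").cache()
        // Train logistic regression with 20 iterations of gradient descent.
        val model = LogisticRegressionWithSGD.train(training, 20)
        println("Weights: " + model.weights)
      }
    }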

Compatible with Open Source Ecosystem
Support existing interfaces whenever possible:
–GraphLab API (SparkGraph)
–Hive interface and shell (Shark)
–HDFS API (Tachyon)
–a compatibility layer lets Hadoop, Storm, MPI, etc. run over Mesos

Compatible with Open Source Ecosystem
Use existing interfaces whenever possible:
–support the HDFS API, S3 API, and Hive metadata
–support the Hive API (Shark)
–accept inputs from Kafka, Flume, Twitter, TCP sockets, … (Spark Streaming)

Summary
Support interactive and streaming computations
–in-memory, fault-tolerant storage abstraction; low-latency scheduling; …
Easy to combine batch, streaming, and interactive computations
–the Spark execution engine supports all three computation models
Easy to develop sophisticated algorithms
–Scala interface; APIs for Java, Python, Hive QL, …
–new frameworks targeted at graph-based and ML algorithms
Compatible with the existing open source ecosystem
Open source (Apache/BSD) and fully committed to releasing high-quality software
–three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

In-Memory Cluster Computing for Iterative and Interactive Applications UC Berkeley

Background
Commodity clusters have become an important computing platform for a variety of applications
–in industry: search, machine translation, ad targeting, …
–in research: bioinformatics, NLP, climate simulation, …
High-level cluster programming models like MapReduce power many of these apps
Theme of this work: provide similarly powerful abstractions for a broader class of applications

Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage
–e.g., MapReduce: Input  Map  Reduce  Output

Motivation
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
–iterative algorithms (many in machine learning)
–interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently support these apps

Spark Goal
Provide distributed memory abstractions for clusters to support apps with working sets
Retain the attractive properties of MapReduce:
–fault tolerance (for crashes and stragglers)
–data locality
–scalability
Solution: augment the data flow model with “resilient distributed datasets” (RDDs)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    // Base RDD, backed by blocks of a file in HDFS
    val lines = spark.textFile("hdfs://...")
    // Transformed RDDs
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))
    // Cached RDD: partitions stay in each worker's memory after first use
    val cachedMsgs = messages.cache()

    // Parallel operations: the driver ships tasks to workers, results come back
    cachedMsgs.filter(_.contains("foo")).count
    cachedMsgs.filter(_.contains("bar")).count
    ...

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Spark Components (diagram)

Programming Model: RDDs
Resilient distributed datasets (RDDs)
–immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
–created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
–can be cached across parallel operations
Parallel operations on RDDs
–reduce, collect, count, save, …

RDDs in More Detail
An RDD is an immutable, partitioned, logical collection of records
–it need not be materialized; rather, it carries enough information to rebuild the dataset from stable storage
Partitioning can be based on a key in each record (using hash or range partitioning)
Built using bulk transformations on other RDDs
Can be cached for future reuse
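For example (a sketch, not from the slides), a key-value RDD can be hash-partitioned with Spark's Partitioner API so that key-oriented operations know where each key lives:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Partition"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        // Hash-partition by key into 8 partitions and cache the layout,
        // so later key-based operations avoid a reshuffle.
        val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
        println(partitioned.lookup("a")) // scans only the matching partition
      }
    }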

RDD Operations
Transformations (define a new RDD):
–map, filter, sample, union, groupByKey, reduceByKey, join, cache, …
Parallel operations / actions (return a result to the driver):
–reduce, collect, count, save, lookupKey, …
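To make the split concrete (a hedged sketch): transformations are lazy and only extend the lineage graph, while an action triggers actual execution:

    import org.apache.spark.{SparkConf, SparkContext}

    object OpsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Ops"))
        val nums = sc.parallelize(1 to 1000)
        val evens = nums.filter(_ % 2 == 0)     // transformation: lazy
        val pairs = evens.map(n => (n % 10, n)) // transformation: lazy
        val sums = pairs.reduceByKey(_ + _)     // transformation: lazy
        println(sums.count())                   // action: runs the job
        sums.collect().foreach(println)         // action: results to driver
      }
    }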

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions, e.g.:

    val cachedMsgs = textFile(...)
      .filter(_.contains("error"))
      .map(_.split('\t')(2))
      .cache()

Lineage: HdfsRDD (path: hdfs://…)  FilteredRDD (func: contains(...))  MappedRDD (func: split(…))  CachedRDD

Example 1: Logistic Regression
Goal: find the best line separating two sets of points
(figure: + and – points in the plane, a random initial line, and the target separating line)

Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)

    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)
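The snippet assumes several names the slide leaves undefined: readPoint, D, ITERATIONS, and a small Vector class with random, dot, and arithmetic operators (early Spark shipped a similar spark.util.Vector utility). Purely illustrative, shell-style stand-ins one could paste alongside the slide's code:

    import scala.math.exp

    // Hypothetical stand-ins; not Spark API.
    class Vector(val elems: Array[Double]) extends Serializable {
      def dot(o: Vector): Double =
        elems.zip(o.elems).map { case (a, b) => a * b }.sum
      def +(o: Vector) = new Vector(elems.zip(o.elems).map { case (a, b) => a + b })
      def -(o: Vector) = new Vector(elems.zip(o.elems).map { case (a, b) => a - b })
    }
    object Vector {
      def random(d: Int) = new Vector(Array.fill(d)(scala.util.Random.nextDouble()))
    }
    implicit class Scale(s: Double) { // allows the slide's scalar * vector
      def *(v: Vector) = new Vector(v.elems.map(_ * s))
    }
    case class Point(x: Vector, y: Double) // label y in {-1, +1}
    def readPoint(line: String): Point = { // hypothetical input format
      val nums = line.split(' ').map(_.toDouble)
      Point(new Vector(nums.init), nums.last)
    }
    val D = 10          // feature dimensionality (assumed)
    val ITERATIONS = 20 // number of gradient steps (assumed)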

Logistic Regression Performance
(chart: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration, then 6 s for each further iteration once the data is cached)

Example 2: MapReduce
MapReduce data flow can be expressed using RDD transformations:

    val res = data.flatMap(rec => myMapFunc(rec))
                  .groupByKey()
                  .map { case (key, vals) => myReduceFunc(key, vals) }

Or, with combiners:

    val res = data.flatMap(rec => myMapFunc(rec))
                  .reduceByKey(myCombiner)
                  .map { case (key, value) => myReduceFunc(key, value) }

Example 3 (diagram)

RDD Graph (diagram)

RDD Dependency Types
(diagram: narrow vs. wide dependencies between the partitions of parent and child RDDs)

Scheduling (diagram)

Scheduler Optimization (diagram)

Event Flow (Directed Acyclic Graph) (diagram)

Conclusion