AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)

1 AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
March 15, 2013 ECNU, Shanghai, China

2 Algorithms, Machines, People Lab (AMPLab)
8 CS Faculty: Michael Franklin, Ion Stoica, Michael Jordan, David Patterson, Randy Katz, ...
~40 students, 3 software engineers
NSF + DARPA + 21 industrial sponsors
Goal: Next Generation of Analytics Data Stack for Industry & Research:
  Berkeley Data Analytics Stack (BDAS)
  Release as Open Source

3 Previous AMPCamps
Aug 2012 at UC Berkeley
  150 people from industry and academia
  5000 signed up online to watch live stream
Feb 2013 at Strata Conference, Silicon Valley

4 AMPCamp @ ECNU
Friday morning
  Introduction
  Parallel Programming with Spark
  Shark
  Machine Learning with Spark (K-Means)
Friday afternoon
  Hands-on exercise
Saturday
  Exercises + discussions

5 Myself
Reynold Xin (辛湜, shi2), PhD student at UC Berkeley AMPLab
Work/intern experience at Google Research, IBM DB2, Altera

6 Today’s Open Analytics Stack…
…mostly focused on large on-disk datasets: great for batch but slow
[Stack diagram layers: Application, Data Processing, Storage, Infrastructure]
Speaker notes: Today’s open analytics stack is mostly focused on large on-disk datasets. As a result, it provides sophisticated processing on massive data, but it is slow. This stack consists of roughly four layers. Except for Google, Microsoft, and to some extent Amazon, this is the de facto standard stack for data analysis today, and it is used by Yahoo!, Facebook, and Twitter, to name a few.

7 Goals
Batch, Interactive, Streaming: one stack to rule them all!
Easy to combine batch, streaming, and interactive computations
Easy to develop sophisticated algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)

8 Our Approach: Support Interactive and Streaming Comp.
Aggressive use of memory
Why?
  Memory transfer rates >> disk or even SSDs
    Gap is growing, especially w.r.t. disk
  Many datasets already fit into memory
    The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
    E.g., 1 TB = 1 billion records of 1 KB each
  Memory density (still) grows with Moore’s law
    RAM/SSD hybrid memories on the horizon
[Figure: high-end datacenter node. Labels: 16 cores; 10 Gbps; 40-60 GB/s; 0.2-1 GB/s (x10 disks); 1-4 GB/s (x4 disks); 10-30 TB; 1-4 TB; GB]
Speaker notes: What do people use big data for? First, they use it to generate reports to track and better understand business processes and transactions. Second, they use it to diagnose problems and answer questions such as "Why is user engagement dropping?" or "Why is the system slow?", or to detect spam, worms, or DDoS attacks. But most importantly, they use it to make decisions, such as improving a business process, deciding what features to add to a product, deciding what ad to show, or blocking spam once it has been identified. Thus, the development of the BDAS stack is driven by the belief that “data is as useful as the decisions you can take based on that data”.
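The bandwidth argument on this slide can be made concrete with a little arithmetic. The sketch below estimates how long one full scan of a 1 TB dataset would take at the bandwidth ranges quoted in the figure. Which range belongs to which device class is an assumption here (the transcript lost the figure layout), and the function and dictionary names are made up for illustration.

```python
# Back-of-the-envelope scan times for a 1 TB dataset, using the
# bandwidth ranges quoted on the slide. Mapping each range to a
# device class is an ASSUMPTION made for this sketch.
DATASET_GB = 1000  # 1 TB = 1 billion records x 1 KB each (decimal units)

# (low, high) aggregate bandwidth in GB/s
BANDWIDTH_GB_PER_S = {
    "memory": (40.0, 60.0),
    "ssd_x4": (1.0, 4.0),     # assumed: the "1-4 GB/s (x4 disks)" label
    "disk_x10": (0.2, 1.0),   # assumed: the "0.2-1 GB/s (x10 disks)" label
}


def scan_seconds(dataset_gb, bandwidth_gb_per_s):
    """Time to read the whole dataset once at a given bandwidth."""
    return dataset_gb / bandwidth_gb_per_s


def scan_time_ranges(dataset_gb=DATASET_GB):
    """Return {device: (best_case_seconds, worst_case_seconds)}."""
    return {
        device: (scan_seconds(dataset_gb, high), scan_seconds(dataset_gb, low))
        for device, (low, high) in BANDWIDTH_GB_PER_S.items()
    }


if __name__ == "__main__":
    for device, (best, worst) in scan_time_ranges().items():
        print(f"{device:9s} {best:8.1f} s to {worst:8.1f} s")
```

Under these assumptions a memory scan finishes in well under a minute, while the disk array needs on the order of tens of minutes, which is the gap between interactive and batch response times.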

9 Our Approach
Easy to combine batch, streaming, and interactive computations
  Single execution model that supports all computation models
Easy to develop sophisticated algorithms
  Powerful Python and Scala shells
  High-level abstractions for graph-based and ML algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)
  Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ...)
  Support existing execution models (e.g., Hive, GraphLab)

10 Berkeley Data Analytics Stack (BDAS)
Existing stack components…
[Stack diagram: Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI; Data Management: HDFS; Resource Management]
Speaker notes: Our instantiation of the resource management is Mesos.

11 Mesos [Released, v0.9]
Management platform that allows multiple frameworks to share a cluster
Compatible with existing open analytics stack
Deployed in production at Twitter on 3,500+ servers
[Stack diagram: Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]

12 Spark [Release, v0.7]
In-memory framework for interactive and iterative computations
Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
Scala interface, Java and Python APIs
[Stack diagram: Data Processing: Spark, Hadoop, HIVE, Pig, Storm, MPI; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
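One way to picture a fault-tolerant, in-memory storage abstraction is a dataset that remembers how it was derived, so a lost partition can be recomputed instead of being restored from a replica. The toy below is a single-process sketch of that idea only. `ToyRDD` and all of its methods are hypothetical names invented for this example; they are not Spark's API, and nothing here is distributed.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical class, NOT Spark's API).

    Each dataset keeps its lineage: the parent it was derived from and
    the function applied per partition. A partition dropped from the
    in-memory cache is rebuilt from that lineage on the next access.
    """

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions  # only set for the base dataset
        self._parent = parent      # lineage: where the data comes from
        self._fn = fn              # lineage: how a partition is derived
        self._cache = {}           # partition index -> materialized list

    @classmethod
    def from_partitions(cls, partitions):
        # The base data stands in for durable storage such as a file system.
        return cls(partitions=[list(p) for p in partitions])

    def num_partitions(self):
        if self._parent is not None:
            return self._parent.num_partitions()
        return len(self._source)

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if not cached."""
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


if __name__ == "__main__":
    base = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
    even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(even_squares.collect())
    even_squares.lose_partition(1)   # "node failure"
    print(even_squares.collect())    # same result, rebuilt from lineage
```

The design choice worth noticing is that only the recipe is stored redundantly, not the data: recovery costs recomputation time rather than extra memory.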

13 Spark Community
3000 people attended online training in August
500+ meetup members
14 companies contributing
31 contributors in the last release
spark-project.org

14 Spark Streaming [Alpha Release]
Large scale streaming computation
Ensure exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations!
[Stack diagram: Data Processing: Spark Streaming, Spark, Hadoop, HIVE, Pig, Storm, MPI; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]
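The claim that streaming can share an engine with batch is easier to see with a small example. The sketch below cuts a stream into small finite batches and runs an ordinary batch word count on each one, keeping a running total. It is plain Python with hypothetical function names, it batches by record count rather than by time interval (an assumption made to keep it deterministic), and it does not attempt to show the exactly-once guarantee.

```python
from collections import Counter
from itertools import islice


def word_count(lines):
    """Batch job: count words in a finite collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into small finite batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size=2):
    """Run the same batch job on every micro-batch; yield running totals."""
    running = Counter()
    for batch in micro_batches(stream, batch_size):
        running += word_count(batch)  # the batch code is reused unchanged
        yield dict(running)


if __name__ == "__main__":
    lines = ["a b", "b c", "a a"]
    for totals in streaming_word_count(lines, batch_size=2):
        print(totals)
```

Because `word_count` is the same function in both modes, a result computed over the whole input as one batch matches the final running total of the streaming version.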

15 Shark [Release, v0.2]
HIVE over Spark: SQL-like interface (supports Hive 0.9)
  Up to 100x faster for in-memory data, and 5-10x for disk
In tests on a cluster of hundreds of nodes at
[Stack diagram: Data Processing: Spark Streaming, Shark, Spark, Hadoop, HIVE, Pig, Storm, MPI; Data Mgmnt.: HDFS; Resource Mgmnt.: Mesos]

16 Tachyon [Alpha Release, this Spring]
High-throughput, fault-tolerant in-memory storage
Interface compatible with HDFS
Support for Spark and Hadoop
[Stack diagram: Data Processing: Spark Streaming, Shark, Spark, Hadoop, HIVE, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]

17 BlinkDB [Alpha Release, this Spring]
Large scale approximate query engine
Allow users to specify error or time bounds
Preliminary prototype starting to be tested at Facebook
[Stack diagram: Data Processing: Spark Streaming, BlinkDB, Shark, HIVE, Spark, Hadoop, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]
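To illustrate what answering a query within a user-specified error bound can mean, the sketch below estimates a mean from a uniform random sample and reports a confidence-interval half-width, growing the sample until the requested bound is met. This is plain uniform sampling with a normal approximation, swapped in for illustration; it is not a description of how BlinkDB builds or chooses its samples, and the function names are hypothetical.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, z=1.96, rng=None):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin): the sample mean and the half-width of an
    approximate 95% confidence interval (z = 1.96, normal approximation).
    Needs sample_size >= 2 so a sample standard deviation exists.
    """
    rng = rng or random.Random()
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def approx_mean_within(values, max_margin, start=100, rng=None):
    """Double the sample size until the error bound is met or data runs out."""
    rng = rng or random.Random()
    n = min(start, len(values))
    while True:
        estimate, margin = approx_mean(values, n, rng=rng)
        if margin <= max_margin or n == len(values):
            return estimate, margin, n
        n = min(2 * n, len(values))


if __name__ == "__main__":
    data_rng = random.Random(0)
    data = [data_rng.uniform(0, 100) for _ in range(20000)]
    est, margin, n = approx_mean_within(data, max_margin=2.0, rng=random.Random(1))
    print(f"estimate {est:.2f} +/- {margin:.2f} using {n} of {len(data)} rows")
```

The trade-off is the one the slide names: a looser error bound lets the query touch fewer rows, and a tighter bound costs more rows and more time.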

18 SparkGraph [Alpha Release, this Spring]
GraphLab API and Toolkits on top of Spark
Fault tolerance by leveraging Spark
[Stack diagram: Data Processing: Spark Streaming, Spark Graph, BlinkDB, Shark, HIVE, Spark, Hadoop, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]

19 MLbase [In development]
Declarative approach to ML
Develop scalable ML algorithms
Make ML accessible to non-experts
[Stack diagram: Data Processing: Spark Streaming, Spark Graph, MLbase, BlinkDB, Shark, HIVE, Spark, Hadoop, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]

20 Compatible with Open Source Ecosystem
Support existing interfaces whenever possible
  GraphLab API
  Hive Interface and Shell
  Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos
  HDFS API
[Stack diagram: Data Processing: Spark Streaming, Spark Graph, MLbase, BlinkDB, Shark, HIVE, Spark, Hadoop, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]

21 Compatible with Open Source Ecosystem
Use existing interfaces whenever possible
  Accept inputs from Kafka, Flume, Twitter, TCP Sockets, …
  Support Hive API
  Support HDFS API, S3 API, and Hive metadata
[Stack diagram: Data Processing: Spark Streaming, Spark Graph, MLbase, BlinkDB, Shark, HIVE, Spark, Hadoop, Pig, Storm, MPI; Data Mgmnt.: HDFS, Tachyon; Resource Mgmnt.: Mesos]

22 Summary
Holistic approach to address next generation of Big Data challenges!
Support interactive and streaming computations
  In-memory, fault-tolerant storage abstraction, low-latency scheduling, ...
Easy to combine batch, streaming, and interactive computations
  Spark execution engine supports all comp. models
Easy to develop sophisticated algorithms
  Scala interface, APIs for Java, Python, Hive QL, …
  New frameworks targeted to graph-based and ML algorithms
Compatible with existing open source ecosystem
Open source (Apache/BSD) and fully committed to release high quality software
  Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
[Diagram: Batch, Interactive, Streaming on Spark]

