AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)

Slides:



Advertisements
Similar presentations
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma,
Advertisements

The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Spark in the Hadoop Ecosystem Eric Baldeschwieler (a.k.a. Eric14)
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Berkeley Data Analytics Stack
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Fast and Expressive Big Data Analytics with Python
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
1 Atigeo Confidential Building an intelligent big data app in 30 minutes Strata Barcelona Nov 2014 David TalbyClaudiu Barbura SVP EngineeringSr. Director,
1 CS : Cloud Computing, Systems, Networking, and Frameworks Fall 2011 (MW 1:00-2:30, 293 Cory Hall) Ion Stoica (
Hadoop Ecosystem Overview
Outline | Motivation| Design | Results| Status| Future
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Apache Spark and the future of big data applications Eric Baldeschwieler.
Clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi.
Tyson Condie.
Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Contents HADOOP INTRODUCTION AND CONCEPTUAL OVERVIEW TERMINOLOGY QUICK TOUR OF CLOUDERA MANAGER.
Tachyon: memory-speed data sharing Haoyuan (HY) Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica Good morning everyone. My name is Haoyuan,
1 CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Introduction to Hadoop Programming Bryon Gill, Pittsburgh Supercomputing Center.
Spark Streaming Large-scale near-real-time stream processing
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Welcome to the Intermountain Big Data Conference! 2 Data Science and Machine Learning Tools from Python to R, with Hands-On R/Shiny U Student – Math major.
Matthew Winter and Ned Shawa
SampleClean: Bringing Data Cleaning into the BDAS Stack Sanjay Krishnan and Daniel Haas In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim.
Berkeley Data Analytics Stack Prof. Chi (Harold) Liu November 2015.
Operating Systems and The Cloud, Part II: Search => Cluster Apps => Scalable Machine Learning David E. Culler CS162 – Operating Systems and Systems Programming.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
Applications on Spark Prof. Harold Liu Beijing Institute of Technology December 2015.
Next Generation of Apache Hadoop MapReduce Owen
Spark System Background Matei Zaharia  [June HotCloud ]  Spark: Cluster Computing with Working Sets  [April NSDI.
Dato Confidential 1 Danny Bickson Co-Founder. Dato Confidential 2 Successful apps in 2015 must be intelligent Machine learning key to next-gen apps Recommenders.
Microsoft Ignite /28/2017 6:07 PM
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
Hadoop Javad Azimi May What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data. It includes:
Big Data Analytics and HPC Platforms
Berkeley Data Analytics Stack - Apache Spark
PROTECT | OPTIMIZE | TRANSFORM
SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.
Introduction to Spark Streaming for Real Time data analysis
Spark Presentation.
Data Platform and Analytics Foundational Training
Berkeley Data Analytics Stack (BDAS) Overview
Hadoop Clusters Tess Fulkerson.
NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.
Introduction to Spark.
CS110: Discussion about Spark
Overview of big data tools
Spark and Scala.
Big-Data Analytics with Azure HDInsight
Twister2 for BDEC2 Poznan, Poland Geoffrey Fox, May 15,
Presentation transcript:

AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS) March 15, 2013 AMPCamp @ ECNU, Shanghai, China

Algorithms, Machines, People Lab (AMPLab) 8 CS Faculty: Michael Franklin, Ion Stoica, Michael Jordon, David Patterson, Randy Katz, ... ~ 40 students, 3 software engineers NSF + DARPA + 21 industrial sponsors Goal: Next Generation of Analytics Data Stack for Industry & Research: Berkeley Data Analytics Stack (BDAS) Release as Open Source

Previous AMPCamps Aug 2012 at UC Berkeley 150 people from industry and academia 5000 signed up online to watch live stream Feb 2013 at Strata Conference, Silicon Valley

AMPCamp @ ECNU Friday morning Introduction Parallel Programming with Spark Shark Machine Learning with Spark (K-Means) Friday afternoon Hands-on exercise Saturday Exercises + discussions

Myself Reynold Xin, PhD student at UC Berkeley AMPLab 辛湜 shi2 Work/intern experience at Google Research, IBM DB2, Altera

Today’s Open Analytics Stack… 4/19/2017 Today’s Open Analytics Stack… ..mostly focused on large on-disk datasets: great for batch but slow Data Processing Application Today’s open analytics stack is mostly focused on large on-disk datasets. As a result, it provides sophisticated processing on massive data but it is slow.. This stack consists of roughly four layers… Except Google, Microsoft, and at some extend Amazon, this is the de-facto standarrd stack for data analysis today, and it is used by Yahoo!, Facebook, Twitter, to name a few. Storage Infrastructure

Goals Batch Interactive Streaming One stack to rule them all! Easy to combine batch, streaming, and interactive computations Easy to develop sophisticated algorithms Compatible with existing open source ecosystem (Hadoop/HDFS)

Our Approach: Support Interactive and Streaming Comp. High end datacenter node 10Gbps Aggressive use of memory Why? Memory transfer rates >> disk or even SSDs Gap is growing especially w.r.t. disk Many datasets already fit into memory The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory E.g., 1TB = 1 billion records @ 1 KB each Memory density (still) grows with Moore’s law RAM/SSD hybrid memories at horizon 128-512GB What do people use big data for? First, they use it to generate reports to track and better understand business processes, ransactions Second, they use it to diagnose and answer questions such as Why the user engagement dropping?, why is the system slow? Or to detect spam, worms, or DDoS attacks But most importantly they use it to make decisions, such us improving the business process, deciding what features to add to the product, deciding what ad to show, or, once it identifies a spam, to block it. Thus, the development of the BDAS stack is driven by the believe that “data is as useful as the decisions you can take based on that data” 40-60GB/s 16 cores 0.2-1GB/s (x10 disks) 1-4GB/s (x4 disks) 10-30TB 1-4TB

Our Approach Easy to combine batch, streaming, and interactive computations Single execution model that supports all computation models Easy to develop sophisticated algorithms Powerful Python and Scala shells High level abstractions for graph based, and ML algorithms Compatible with existing open source ecosystem (Hadoop/HDFS) Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ..) Support existing execution models (e.g., Hive, GraphLab)

Berkeley Data Analytics Stack (BDAS) Existing stack components…. Our instantiation of the resource management is Mesos Data Management Data Processing Resource Management HDFS MPI … Resource Mgmnt. Data Processing Hadoop HIVE Pig HBase Storm

Mesos [Released, v0.9] Management platform that allows multiple framework to share cluster Compatible with existing open analytics stack Deployed in production at Twitter on 3,500+ servers Our instantiation of the resource management is Mesos HIVE Pig HBase Storm MPI … Data Processing Hadoop HDFS Data Mgmnt. Resource Mgmnt. Mesos

Spark [Release, v0.7] In-memory framework for interactive and iterative computations Resilient Distributed Dataset (RDD): fault-tolerance, in-memory storage abstraction Scala interface, Java and Python APIs Our instantiation of the resource management is Mesos HIVE Pig Storm MPI Data Processing … Spark Hadoop HDFS Data Mgmnt. Resource Mgmnt. Mesos

Spark Community 3000 people attended online training in August 500+ meetup members 14 companies contributing 31 contributors in the last release spark-project.org

Spark Streaming [Alpha Release] Large scale streaming computation Ensure exactly one semantics Integrated with Spark  unifies batch, interactive, and streaming computations! Our instantiation of the resource management is Mesos Spark Streaming HIVE Pig Storm MPI Data Processing … Spark Hadoop HDFS Data Mgmnt. Resource Mgmnt. Mesos

Shark [Release, v0.2] HIVE over Spark: SQL-like interface (supports Hive 0.9) up to 100x faster for in-memory data, and 5-10x for disk In tests on hundreds node cluster at Our instantiation of the resource management is Mesos Spark Streaming HIVE Pig Storm MPI Data Processing Shark … Spark Hadoop HDFS Data Mgmnt. Resource Mgmnt. Mesos

Tachyon [Alpha Release, this Spring] High-throughput, fault-tolerant in-memory storage Interface compatible to HDFS Support for Spark and Hadoop Our instantiation of the resource management is Mesos Spark Streaming HIVE Pig Storm MPI Data Processing Shark … Hadoop Spark HDFS Tachyon Data Mgmnt. Resource Mgmnt. Mesos

BlinkDB [Alpha Release, this Spring] Large scale approximate query engine Allow users to specify error or time bounds Preliminary prototype starting being tested at Facebook Our instantiation of the resource management is Mesos Spark Streaming BlinkDB Pig Storm MPI Data Processing Shark HIVE … Spark Hadoop HDFS Tachyon Data Mgmnt. Resource Mgmnt. Mesos

SparkGraph [Alpha Release, this Spring] GraphLab API and Toolkits on top of Spark Fault tolerance by leveraging Spark Our instantiation of the resource management is Mesos Spark Streaming Spark Graph BlinkDB Pig Storm MPI Data Processing Shark HIVE … Spark Hadoop HDFS Tachyon Data Mgmnt. Resource Mgmnt. Mesos

MLbase [In development] Declarative approach to ML Develop scalable ML algorithms Make ML accessible to non-experts Our instantiation of the resource management is Mesos Spark Streaming Spark Graph MLbase BlinkDB Pig Storm MPI Data Processing Shark HIVE … Spark Hadoop HDFS Tachyon Data Mgmnt. Resource Mgmnt. Mesos

Compatible with Open Source Ecosystem Support existing interfaces whenever possible GraphLab API Our instantiation of the resource management is Mesos Hive Interface and Shell Spark Streaming Spark Graph MLbase BlinkDB Pig Storm MPI Data Processing Shark HIVE … Spark Hadoop Compatibility layer for Hadoop, Storm, MPI, etc to run over Mesos HDFS Tachyon Data Mgmnt. HDFS API Resource Mgmnt. Mesos

Compatible with Open Source Ecosystem Use existing interfaces whenever possible Accept inputs from Kafka, Flume, Twitter, TCP Sockets, … Support Hive API Our instantiation of the resource management is Mesos Spark Streaming Spark Graph MLbase BlinkDB Pig Storm MPI Data Processing Shark HIVE … Support HDFS API, S3 API, and Hive metadata Spark Hadoop HDFS Tachyon Data Mgmnt. Resource Mgmnt. Mesos

Holistic approach to address next generation of Big Data challenges! Summary Holistic approach to address next generation of Big Data challenges! Support interactive and streaming computations In-memory, fault-tolerant storage abstraction, low-latency scheduling,... Easy to combine batch, streaming, and interactive computations Spark execution engine supports all comp. models Easy to develop sophisticated algorithms Scala interface, APIs for Java, Python, Hive QL, … New frameworks targeted to graph based and ML algorithms Compatible with existing open source ecosystem Open source (Apache/BSD) and fully committed to release high quality software Three-person software engineering team lead by Matt Massie (creator of Ganglia, 5th Cloudera engineer) Batch Interactive Streaming Spark