CS 294-110: Project Suggestions
Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)
September 14, 2015

Projects
- This is a project-oriented class.
- Reading papers should be a means to a great project, not a goal in itself!
- We strongly prefer groups of two students.
- Today, I'll present some suggestions, but you are free to come up with your own proposal.
- Main goal: just do a great project.

Projects
- Many projects center around Spark:
  - Local expertise
  - A great platform to disseminate your work
- First, a short review of Spark, based on a log mining example, to provide context.

Spark Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns. The driver builds the RDDs below and ships tasks to the workers:

    lines = spark.textFile("hdfs://...")                    # base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()
    messages.filter(lambda s: "mysql" in s).count()         # action
    messages.filter(lambda s: "php" in s).count()

[Animation: the driver sends tasks to the workers; for the first action, each worker reads its HDFS block (Block 1, 2, 3), processes and caches the data (Cache 1, 2, 3), and returns its results to the driver. For the second query, the workers process the data directly from their caches and return the results.]

Cache your data → faster results. Full-text search of Wikipedia: 60 GB on 20 EC2 machines, 0.5 sec from memory vs. 20 sec on disk.
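For reference, here is a minimal self-contained version of the example above as a PySpark (1.x) script; "spark" in the slides is the SparkContext, and the elided HDFS path and tab-separated log format are assumptions carried over from the slides:

    # Minimal, self-contained sketch of the log-mining example.
    # Assumes Spark 1.x; replace the elided "hdfs://..." path with a real log.
    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")  # "spark" in the slides is this context

    lines = sc.textFile("hdfs://...")  # base RDD: one element per log line
    errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
    # Assumes tab-separated lines whose third field is the message text.
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()  # keep the filtered messages in memory across queries

    # First action: reads the HDFS blocks, filters, caches, and counts.
    print(messages.filter(lambda s: "mysql" in s).count())
    # Second action: served from the in-memory cache, so it is much faster.
    print(messages.filter(lambda s: "php" in s).count())

    sc.stop()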

Spark
[Stack diagram: Spark Core at the base; on top of it, Spark SQL, Spark Streaming, MLlib, and GraphX, along with DataFrames and ML Pipelines; data sources include {JSON} and others; a "?" marks room for future components.]

Pipeline Shuffle
Problem:
- Right now, shuffle senders write data to storage, after which the data is shuffled to the receivers.
- Shuffle is often the most expensive communication pattern, and sometimes dominates job completion time.
Project:
- Start sending shuffle data as it is being produced. (A toy sketch of the idea follows.)
Challenge:
- How do you do recovery and speculation? You could store the data as it is being sent, but it is still not easy...
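To make the pipelining idea concrete, here is a toy single-process Python sketch (not Spark code; all names are illustrative) contrasting today's write-everything-then-shuffle pattern with sending records as they are produced:

    import time

    def produce_records(n):
        """Stand-in for a map task emitting shuffle records."""
        for i in range(n):
            time.sleep(0.001)  # simulate per-record work
            yield i

    def shuffle_materialized(n):
        # Current model: the sender materializes ALL output on "storage"
        # before the receiver reads any of it.
        stored = list(produce_records(n))  # barrier: full materialization
        return sum(stored)                 # receiver work starts only now

    def shuffle_pipelined(n):
        # Proposed model: records flow to the receiver as they are produced,
        # overlapping sender and receiver work (no barrier).
        return sum(produce_records(n))     # generator consumed incrementally

    assert shuffle_materialized(100) == shuffle_pipelined(100)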

Fault Tolerance & Performance Tradeoffs
Problem:
- Maintaining lineage in Spark provides fault recovery, but comes at a performance cost.
- E.g., it is hard to support very small tasks due to lineage overhead.
Project:
- Evaluate how much you can speed up Spark by ignoring fault tolerance.
- This can generalize to other cluster computing engines.
Challenge:
- What do you do for large jobs, and how do you treat stragglers?
- Maybe a hybrid method, i.e., skip lineage only for small jobs? You then need to figure out when a job is small...

(Eliminating) Scheduling Overhead
Problem:
- With Spark, the driver schedules every task.
- Latency is 100s of ms or higher, so millisecond-scale queries cannot run.
- The driver can become a bottleneck.
Project:
- Have the workers perform scheduling.
Challenge:
- How do you handle faults? Maybe some hybrid solution across driver and workers?

Cost-based Optimization in Spark SQL
Problem:
- Spark employs a rule-based query planner (Catalyst).
- This limits optimization opportunities, especially when operator performance varies widely based on the input data, e.g., join and selection on skewed data.
Project:
- Build a cost-based optimizer: estimate operators' costs, and use these costs to compute the query plan. (A toy sketch of the idea follows.)
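As a toy illustration of the cost-based idea (plain Python, not Catalyst; the statistics and cost formulas are invented for illustration): estimate the cost of two join strategies from input sizes and pick the cheaper plan.

    # Toy cost-based plan choice between two join strategies.
    # The cost formulas are illustrative, not Catalyst's.

    def cost_shuffle_join(left_rows, right_rows):
        # Both sides are repartitioned over the network.
        return left_rows + right_rows

    def cost_broadcast_join(left_rows, right_rows, num_workers):
        # The smaller side is replicated to every worker.
        return min(left_rows, right_rows) * num_workers

    def choose_join_plan(left_rows, right_rows, num_workers=20):
        plans = {
            "shuffle_join": cost_shuffle_join(left_rows, right_rows),
            "broadcast_join": cost_broadcast_join(left_rows, right_rows, num_workers),
        }
        return min(plans, key=plans.get)  # pick the lowest estimated cost

    # A tiny table joined against a huge one favors broadcasting the tiny side.
    print(choose_join_plan(left_rows=10**9, right_rows=10**3))  # -> broadcast_join
    print(choose_join_plan(left_rows=10**6, right_rows=10**6))  # -> shuffle_join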

Streaming Graph Processing
Problem:
- With GraphX, queries can be fast, but updates typically happen in batches (slow).
Project:
- Incrementally update graphs.
- Support window-based graph queries.
Note: Discuss with Anand Iyer and Ankur Dave if interested.

Streaming ML
Problem:
- Today, ML algorithms are typically run on static data.
- The model cannot be updated in real time.
Project:
- Develop online ML algorithms that update the model continuously as new data is streamed. (See the sketch below.)
Note: Also contact Joey Gonzalez if interested.
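A minimal illustration of the online-update idea in plain Python (not an actual Spark Streaming job; the model, learning rate, and data are illustrative): a linear model refined with one SGD pass per arriving mini-batch.

    # Toy online learning loop: the model is updated as each new
    # mini-batch "arrives", rather than being fit once on static data.

    def sgd_step(weights, batch, lr=0.01):
        """One SGD pass for least-squares linear regression (no bias term)."""
        for features, target in batch:
            prediction = sum(w * x for w, x in zip(weights, features))
            error = prediction - target
            weights = [w - lr * error * x for w, x in zip(weights, features)]
        return weights

    weights = [0.0, 0.0]
    # Stand-in for a stream; in Spark Streaming each element would be a
    # DStream mini-batch.
    stream = [
        [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)],
        [([0.0, 3.0], 6.0)],
    ]
    for batch in stream:
        weights = sgd_step(weights, batch)  # model evolves with the stream
        print(weights)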

Beyond the JVM: Using Non-Java Libraries
Problem:
- Spark tasks are executed within JVMs.
- This limits performance and the use of popular non-Java libraries.
Project:
- Find a general way to add support for non-Java libraries.
- Example: use JNI to call arbitrary libraries. (A rough Python analogue is sketched below.)
Challenges: Define the interface, shared data formats, etc.
Note: Contact Guanhua and Shivaram if needed.
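The slide's example is JNI on the JVM side; as a rough Python analogue, the sketch below loads a C library with ctypes inside each Spark task (the library path assumes Linux; everything else is illustrative):

    # Calling a native (C) library from Spark tasks, sketched with ctypes.
    from pyspark import SparkContext

    def native_sqrt_partition(values):
        # Load the library inside the task: ctypes handles are not picklable,
        # so they cannot be captured in a closure on the driver.
        import ctypes
        libm = ctypes.CDLL("libm.so.6")  # assumes Linux; adjust per platform
        libm.sqrt.restype = ctypes.c_double
        libm.sqrt.argtypes = [ctypes.c_double]
        for v in values:
            yield libm.sqrt(v)

    sc = SparkContext(appName="NativeLibDemo")
    print(sc.parallelize([1.0, 4.0, 9.0])
            .mapPartitions(native_sqrt_partition)
            .collect())  # [1.0, 2.0, 3.0]
    sc.stop()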

Beyond the JVM: Dynamic Code Generation
Problem:
- Spark tasks are executed within JVMs.
- This limits performance and the use of popular non-Java libraries.
Project:
- Generate non-Java code, e.g., C++ or CUDA for GPUs.
Challenges: API and shared data format.
Note: Contact Guanhua and Shivaram if needed.

Beyond the JVM: Resource Management and Scheduling
Problem:
- Need to schedule the processes hosting non-Java code.
- A GPU cannot be invoked by more than one process.
Project:
- Develop scheduling and resource management algorithms.
Challenge: Preserve fault tolerance and straggler mitigation.
Note: Contact Guanhua and Shivaram if needed.

Time Series for DataFrames
Inspired by Pandas and R data frames, Spark recently introduced DataFrames.
Problem:
- Spark DataFrames don't support time series.
Project:
- Develop and contribute distributed time series operations for DataFrames. (One such operation is sketched below.)
Challenge: Spark doesn't have indexes.
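A minimal sketch of one such candidate operation, resampling a timestamped DataFrame into fixed 60-second buckets (the column names, bucket width, and data are illustrative; the derived bucket key works around the lack of indexes):

    # Resample a timestamped DataFrame into 60-second buckets and average
    # the values. Assumes "ts" holds epoch seconds.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col, floor, avg

    sc = SparkContext(appName="TimeSeriesDemo")
    sqlContext = SQLContext(sc)

    df = sqlContext.createDataFrame(
        [(0, 1.0), (30, 3.0), (65, 10.0)], ["ts", "value"])

    resampled = (df
        .withColumn("bucket", floor(col("ts") / 60))  # derive a time-bucket key
        .groupBy("bucket")
        .agg(avg("value").alias("mean_value"))
        .orderBy("bucket"))

    resampled.show()  # bucket 0 -> 2.0, bucket 1 -> 10.0
    sc.stop()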

ACID Transactions for Spark SQL
Problem:
- Spark SQL is used for analytics and doesn't support ACID.
Project:
- Develop and add row-level ACID transactions on top of Spark SQL.
Challenge: It is challenging to provide transactions and analytics in one system.

Typed DataFrames
Problem:
- DataFrames in Spark, unlike Spark RDDs, do not provide type safety.
Project:
- Develop a typed DataFrame framework for Spark.
Challenge: SQL-like operations are inherently dynamic (e.g., filter("col")), which makes it hard to have static typing unless fancy reflection mechanisms are used.

General Pipelines for Spark
Problem:
- Spark.ml provides a pipeline abstraction for ML; generalize it to cover all of Spark.
Project:
- Develop a pipeline abstraction (similar to ML pipelines) that spans all of Spark, allowing users to perform SQL operations, GraphX operations, etc.

Beyond BSP
Problem:
- With BSP, each worker executes the same code.
Project:
- Can we extend Spark (or another cluster computing framework) to support non-BSP computation?
- How much better is this than emulating everything with BSP?
Challenge:
- Maintain simple APIs.
- Handle more complex scheduling and communication patterns.

Project Idea: Cryptography & Big Data (Alessandro Chiesa)
As data and computations scale up to larger sizes... can cryptography follow?
One direction: zero-knowledge proofs for big data.

Classical setting: zero-knowledge proofs on one machine
[Diagram: a server returns a result to a client. Server: "Here is the result of your computation." Client: "I don't believe you. I don't want to give you my private data. Send me a ZK proof of correctness?" Adding crypto magic, the server generates a ZK proof alongside the result, and the client verifies the proof.]

New setting for big data: zero-knowledge proofs on clusters
[Diagram: a cluster returns a result plus a ZK proof to a client, which verifies it.]
Problem: the ZK proof cannot be generated on one machine (as before).
Challenge: generate the ZK proof over a cluster (e.g., using Spark).
End goal: "scaling up" ZK proofs to computations on big data, and exploring security applications!

Succinct (Quick Overview)
Queries on compressed data. Basic operations (pinned down in the sketch below):
- Search: given a substring "s", return the offsets of all occurrences of "s" within the input.
- Extract: given an offset "o" and a length "l", uncompress and return "l" bytes of the original file starting at "o".
- Count: given a substring "s", return the number of occurrences of "s" within the input.
A key-value store can be implemented on top of it.
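To pin down what the three operations return, here is a naive uncompressed reference implementation in Python (Succinct answers these queries directly on a compressed representation; this sketch only fixes the semantics):

    # Naive reference semantics for Succinct's basic operations.
    # The input is kept uncompressed purely to make the results unambiguous.

    def search(data, s):
        """Offsets of all occurrences of substring s within the input."""
        offsets, i = [], data.find(s)
        while i != -1:
            offsets.append(i)
            i = data.find(s, i + 1)
        return offsets

    def extract(data, o, l):
        """l bytes of the original input starting at offset o."""
        return data[o:o + l]

    def count(data, s):
        """Number of occurrences of substring s within the input."""
        return len(search(data, s))

    log = "ERROR mysql down\nERROR php slow\n"
    print(search(log, "ERROR"))   # [0, 17]
    print(extract(log, 6, 5))     # 'mysql'
    print(count(log, "ERROR"))    # 2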

Succinct: Efficient Point Query Support
Problem:
- The Spark implementation is expensive, as it always queries all workers.
Project:
- Implement Succinct on top of Tachyon (the storage layer).
- Provide efficient key-value store lookups, i.e., look up only the single worker that holds the key.
Note: Contact Anurag and Rachit if interested.

Succinct: External Memory Support
Problem:
- Some data grows faster than main memory.
- Queries need to execute on external storage (e.g., SSDs).
Project:
- Design and implement compressed data structures for efficient external-memory execution.
- There is a lot of work in the theory community that could be exploited.
Note: Contact Anurag and Rachit if interested.

Succinct: Updates
Problem:
- Current systems use a multi-store architecture; it is expensive to update the compressed representation.
Project:
- Develop a low-overhead update solution with minimal impact on memory overhead and query performance.
- Start from the multi-store architecture (see the NSDI paper).
Note: Contact Anurag and Rachit if interested.

Succinct: SQL
Problem:
- Arbitrary substring search is powerful, but not that many workloads need it directly.
Project:
- Support SQL on top of Succinct.
- Start from Spark SQL and the Succinct Spark package?
Note: Contact Anurag and Rachit if interested.

Succinct: Genomics
Problem:
- The genomics pipeline is still expensive.
Project:
- Do genome processing on a single machine (using compressed data).
- Enable queries on compressed genomes.
Challenge: Domain-specific query optimizations.
Note: Contact Anurag and Rachit if interested.