CS 294-42: Project Suggestions
Ion Stoica (http://www.cs.berkeley.edu/~istoica/classes/cs294/11/)
September 14, 2011


Projects
- This is a project-oriented class
- Reading papers should be a means to a great project, not a goal in itself!
- Groups of two are strongly preferred
- It is perfectly fine to use the same project for CS 262
- Today, I'll present some suggestions
- But you are free to come up with your own proposal
- Main goal: just do a great project

Where I'm Coming From?
- Key challenge: maximize the economic value of data, i.e., extract value from data while reducing costs (e.g., storage, computation)

Where I'm Coming From?
- Tools to extract value from big data
  - Scalability
  - Response time
  - Accuracy
- Provide high cluster utilization for heterogeneous workloads
- Support diverse SLAs
  - Predictable performance
  - Isolation
  - Consistency

Caveats
- Cloud computing is HOT, but there is a lot of NOISE!
- It is not easy to
  - differentiate between narrow engineering solutions and fundamental tradeoffs
  - predict the importance of the problem you solve
- Cloud computing is akin to a Gold Rush!

Background: Mesos
- Rapid innovation in cloud computing (e.g., Dryad, Pregel, Cassandra, Hypertable)
- No single framework is optimal for all applications
- Running each framework on its own dedicated cluster is
  - Expensive
  - Hard to share data
- Need to run multiple frameworks on the same cluster

Background: Mesos – Where We Want to Go
- Today: static partitioning of the cluster across Hadoop, Pregel, MPI ("uniprogramming")
- Mesos: dynamic sharing of one cluster ("multiprogramming")

Background: Mesos – Solution
- Mesos is a common resource-sharing layer over which diverse frameworks (e.g., Hadoop, MPI) can run on the same nodes
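A toy sketch of the offer-based sharing model the slide describes: free resources are offered to frameworks, which take what fits and leave the rest. The `Framework` class, per-task demands, and fixed offer order are illustrative assumptions, not the real Mesos API.

```python
class Framework:
    """A framework with identical tasks, each needing (cpus, mem)."""
    def __init__(self, name, task_demand):
        self.name = name
        self.task_demand = task_demand
        self.launched = 0

    def on_offer(self, offer):
        """Accept as many tasks as fit in the offer; return what is used."""
        cpus, mem = offer
        d_cpu, d_mem = self.task_demand
        n = min(cpus // d_cpu, mem // d_mem)
        self.launched += n
        return (n * d_cpu, n * d_mem)

def offer_round(free, frameworks):
    """Offer the remaining free resources to each framework in turn."""
    cpus, mem = free
    for fw in frameworks:
        used_cpu, used_mem = fw.on_offer((cpus, mem))
        cpus -= used_cpu
        mem -= used_mem
    return (cpus, mem)
```

Note how the design choice shows up directly: the master never needs to know *why* a framework wants resources; it only tracks what was accepted.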

Background: Workload in Datacenters
- Workloads span two axes: response (interactive/low-latency vs. batch) and priority (high vs. low)
  - Frontend (web servers, databases)
  - Decision-driven processes
  - Exploratory queries (e.g., Dremel)
  - Production jobs (e.g., computing summaries)
  - Analytics jobs

Datacenter OS: Resource Management, Scheduling

Hierarchical Scheduler (for Mesos)
- Allow administrators to organize users into groups
- Provide resource guarantees per group
- Share available resources (fairly) across groups
- Research questions:
  - What is the right abstraction (when using multiple resources)?
  - How to implement it using resource offers?
  - What policies are compatible at different levels in the hierarchy?
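The simplest point in this design space can be sketched in a few lines: split capacity across groups by weight, then split each group's share among its members. Group names, weights, and the equal within-group split are illustrative assumptions; the project's open questions (multiple resources, resource offers) start exactly where this sketch stops.

```python
def hierarchical_shares(capacity, groups):
    """Two-level fair sharing of a single scalar resource.

    groups: {group_name: {"weight": w, "users": [user, ...]}}
    Returns each user's share of `capacity`.
    """
    total_weight = sum(g["weight"] for g in groups.values())
    shares = {}
    for g in groups.values():
        # Level 1: split the cluster across groups by weight.
        group_share = capacity * g["weight"] / total_weight
        # Level 2: split the group's share equally among its users.
        for user in g["users"]:
            shares[user] = group_share / len(g["users"])
    return shares
```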

Cross-Application Resource Management
- An app uses many services (e.g., file systems, key-value storage, databases)
- If an app has high priority but the services it uses don't, the app's SLA (Service Level Agreement) might be violated
- Research questions:
  - Abstraction, e.g., resource delegation, priority propagation?
  - Clean-slate mechanisms vs. incremental deployability
- This is also highly challenging in single-node OSes!

Resource Management using VMs
- Most cluster resource managers use Linux containers (e.g., Mesos)
- Thus, schedulers assume no task migration
- Research questions:
  - Develop a scheduler for VM environments (e.g., extend DRF)
  - Tradeoffs between migration, delay, and preemption

Task Granularity Selection (Yanpei Chen)
- Problem: the number of tasks per stage in today's MapReduce apps is (highly) sub-optimal
- Research question: derive algorithms to pick the number of tasks that optimize various performance metrics (e.g., utilization, response time, network traffic) subject to various constraints (e.g., capacity, network)

Resource Revocation
- Which task should we revoke/preempt? Two questions:
  - Which slot has the least impact on the giving framework?
  - Is the slot acceptable to the receiving framework?
- Research questions:
  - Identify a feasible slot for the receiving framework with the least impact on the giving framework
  - Light-weight protocol design

Control Plane Consistency Model
- What type of consistency is "good enough" for various control-plane functions?
  - File system metadata (Hadoop)
  - Routing (Nicira)
  - Scheduling
  - Coordinated caching
  - …
- Research questions:
  - What are the trade-offs between performance and consistency?
  - Develop a generic framework for the control plane

Decentralized vs. Centralized Scheduling
- Decentralized schedulers
  - E.g., Mesos, Hadoop 2.0
  - Delegate decisions to apps (i.e., frameworks, jobs)
  - Advantages: scale and separation of concerns (i.e., apps know best where and which tasks to run)
- Centralized schedulers
  - Know all app requirements
  - Advantage: optimal decisions
- Research challenges:
  - Evaluate centralized vs. decentralized schedulers
  - Characterize the class of workloads for which a decentralized scheduler is good enough

Opportunistic Scheduling
- Goal: schedule interactive jobs (e.g., < 100 ms latency)
- Existing schedulers: high overhead (e.g., Mesos needs to decide on every offer)
- Research challenges:
  - Tradeoff between utilization and response time
  - Evaluate a hybrid approach

Background: Dominant Resource Fairness
- Implements fair (proportional) allocation for multiple types of resources
- Key properties:
  - Strategy-proof: users cannot get an advantage by lying about their demands
  - Sharing incentives: users are incentivized to share a cluster rather than partitioning it
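A minimal sketch of the DRF allocation loop, assuming each user runs identical tasks with a fixed demand vector (the two-user, 9-CPU/18-GB numbers in the usage note follow the style of the DRF paper's running example and are illustrative).

```python
def drf_allocate(capacity, demands):
    """Repeatedly launch a task for the user with the smallest dominant
    share until no remaining user's next task fits in the cluster.

    capacity: [amount per resource]; demands: {user: [per-task demand]}.
    Returns the number of tasks launched per user.
    """
    n = len(capacity)
    used = [0.0] * n                          # cluster-wide usage
    alloc = {u: [0.0] * n for u in demands}   # per-user usage
    tasks = {u: 0 for u in demands}
    active = list(demands)
    while active:
        # Dominant share = user's largest fraction of any one resource.
        user = min(active,
                   key=lambda u: max(alloc[u][i] / capacity[i]
                                     for i in range(n)))
        d = demands[user]
        if all(used[i] + d[i] <= capacity[i] for i in range(n)):
            for i in range(n):
                used[i] += d[i]
                alloc[user][i] += d[i]
            tasks[user] += 1
        else:
            active.remove(user)               # next task no longer fits
    return tasks
```

For a cluster of (9 CPUs, 18 GB) with user A's tasks needing (1 CPU, 4 GB) and user B's needing (3 CPUs, 1 GB), this loop equalizes dominant shares at A = 3 tasks and B = 2 tasks. The non-linear-resources project below asks what replaces the `max` dominant-share computation when demands don't add up this way.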

DRF for Non-linear Resources/Demands
- DRF assumes resources & demands are additive
  - E.g., if task 1 needs (1 CPU, 1 GB) and task 2 needs (1 CPU, 3 GB), together they need (2 CPU, 4 GB)
- Sometimes demands are non-linear
  - E.g., shared memory
- Sometimes resources are non-linear
  - E.g., disk throughput, caches
- Research challenge: a DRF-like scheduler for non-linear resources & demands (there could be two projects here!)

DRF for OSes
- DRF was designed for clusters using the resource-offer mechanism
- Redesign DRF to support multi-core OSes
- Research questions:
  - Is the resource offer the best abstraction?
  - How to best leverage preemption? (in Mesos, tasks are not preempted by default)
  - How to support gang scheduling?

Storage & Data Processing

Resource Isolation for Storage Services
- Share storage (e.g., a key-value store) between
  - Frontend, e.g., web services
  - Backend, e.g., analytics on the freshest data
- Research challenge:
  - Isolation mechanism: protect front-end performance from the back-end workload

"Quicksilver" DB
- Goal: interactive queries with bounded error on "unbounded" data
  - Trade efficiency for accuracy
  - Query response time target: < 100 ms
- Approach: random pre-sampling across different dimensions (columns)
- Research question: given a query and an error bound, find
  - The smallest sample to compute the result
  - The sample minimizing disk (or memory) access times
- (Talk with Sameer if interested)
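One way to make "smallest sample for a given error bound" concrete is the textbook normal-approximation bound n ≈ (z·σ/ε)², with σ estimated from a small pilot sample. The sketch below is an illustrative baseline for a single AVG query, not the slide's proposed system, which would maintain pre-computed samples rather than sampling at query time.

```python
import random
import statistics

def approx_avg(data, error, confidence_z=1.96, pilot=100, seed=0):
    """Estimate the mean of `data` to within +/- `error` (at ~95%
    confidence for the default z), using roughly the smallest sample the
    normal approximation n >= (z * sigma / error)^2 suggests."""
    rng = random.Random(seed)
    # Pilot sample to estimate the column's standard deviation.
    sigma = statistics.stdev(rng.sample(data, min(pilot, len(data))))
    # Required sample size, clamped to the table size.
    n = min(len(data),
            max(pilot, int((confidence_z * sigma / error) ** 2) + 1))
    return statistics.mean(rng.sample(data, n)), n
```

A very tight error bound drives n up to the full table, which is exactly the efficiency/accuracy trade the slide names: the error bound the user accepts determines how little data must be touched.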

Split-Privacy DB (1/2)
- Partition data & computation into private and public parts (the public part is stored in the cloud)
- Goal: use the cloud without revealing the computation result
- Example: operation f(x, y) = x + y, where x is private and y is public
  - Pick a random number a, and compute x' = x + a
  - Compute f(x', y) = r' = x' + y
  - Recover the result: r = r' – a = (x' – a) + y = x + y
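The additive-blinding steps above can be coded directly; the function split below stands in for the private/public DB partition (`cloud_add` is a hypothetical name for the untrusted side).

```python
import secrets

def cloud_add(x_blinded, y):
    """Runs on the untrusted cloud: sees only the blinded value x' = x + a."""
    return x_blinded + y

def private_sum(x, y):
    """Runs on the private side: blinds x, delegates the addition,
    then removes the mask from the returned result."""
    a = secrets.randbelow(1 << 64)    # random mask, never leaves this side
    r_blinded = cloud_add(x + a, y)   # cloud computes f(x', y) = x' + y
    return r_blinded - a              # r = r' - a = x + y
```

The cloud learns x + a (statistically masked) and never the true x; whether this generalizes beyond operations with such algebraic structure is exactly the research question on the next slide.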

Split-Privacy DB (2/2)
- Example: patient data (private), public clinical and genomics data sets
- Goal: use the cloud without revealing the computation result
- Research questions:
  - What types of computation can be implemented?
  - Is this any more powerful than privacy-preserving computation / data mining?

RDDs as an OS Abstraction
- Resilient Distributed Datasets (RDDs)
  - Fault-tolerant (in-memory) parallel data structures
  - Allow Spark apps to efficiently reuse data
- Design cross-application RDDs
- Research questions:
  - RDD reconstruction (track software and platform changes)
  - Enable users to share intermediate results of queries (identify when two apps compute the same RDD)
  - RDD cluster-wide caching
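The core RDD idea, fault tolerance via lineage rather than replication, fits in a toy sketch: each dataset records the deterministic operation and parents that produced it, so a lost in-memory copy is rebuilt by recomputation. This is illustrative only; real RDDs are partitioned, lazily evaluated, and distributed.

```python
class RDD:
    """A dataset defined by its lineage: a deterministic `compute`
    function over the data of its parent RDDs."""
    def __init__(self, compute, parents=()):
        self.compute = compute
        self.parents = parents
        self.cache = None                 # in-memory copy; may be lost

    def materialize(self):
        if self.cache is None:            # lost or never computed: rebuild
            inputs = [p.materialize() for p in self.parents]
            self.cache = self.compute(*inputs)
        return self.cache

    def map(self, f):
        """Derive a new RDD; only the transformation is recorded."""
        return RDD(lambda data: [f(x) for x in data], parents=(self,))

def parallelize(data):
    """Base RDD whose lineage is the input collection itself."""
    return RDD(lambda: list(data))
```

The cross-application questions on this slide start where the sketch stops: lineage here assumes `compute` stays deterministic, which is why tracking software and platform changes matters once RDDs outlive a single app.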

Provenance-based Efficient Storage (Peter B and Patrick W)
- Reduce storage by deleting data that can be recreated
- Generalization of the previous project
- Research challenges:
  - Identify data that can be deterministically recreated, and the code to do so
    - Use hints?
  - Tradeoff between re-creation and storage
    - May take into account access pattern, frequency, performance

Very Low-Latency Streaming
- Challenge: stragglers, failures
- Approaches to reduce latency:
  - Redundant computations
  - Speculative execution
- Research questions:
  - Theoretical trade-off between response time and accuracy?
  - Achieve target latency and accuracy while minimizing overhead