SDN + Storage.

SDN + Storage

Outline
- Measurement of storage traffic
- Network-aware placement
- Control of resources
- SDN + resource allocation
- Predicting resource utilization
- Bringing it all together

HDFS Storage Patterns
- Map tasks read their input from HDFS
- Local read versus non-local read
- Rack-local or not
- Locality! (80%)

HDFS Storage Patterns
- Map tasks read their input from HDFS
- Local read versus non-local read
- Rack-local or not
- Cross-rack traffic (80%)

HDFS Storage Patterns
- Reducers write their output to HDFS
- 3 copies of each block are written: 2 rack-local and 1 off-rack
- This gives fault tolerance and good performance
- There must be cross-rack traffic
- Ideal goal: minimize congestion (a placement-check sketch follows)
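To make the placement constraint concrete, here is a minimal Python sketch (not HDFS code) that checks whether a proposed set of three replica locations satisfies the 2-racks, at-most-2-per-rack policy described above; the (rack, host) representation and the function name are illustrative assumptions.

```python
# Minimal sketch (not HDFS code): check the default-style placement policy
# described above -- 3 replicas, spread across at least 2 racks (fault
# domains), with no single rack holding all three copies.
from collections import Counter

def placement_is_valid(replica_nodes):
    """replica_nodes: list of (rack_id, host_id) tuples chosen for one block."""
    if len(replica_nodes) != 3:
        return False
    racks = Counter(rack for rack, _host in replica_nodes)
    # At least two distinct racks, and at most two replicas in any one rack.
    return len(racks) >= 2 and max(racks.values()) <= 2

print(placement_is_valid([("r1", "h1"), ("r1", "h2"), ("r2", "h7")]))  # True
print(placement_is_valid([("r1", "h1"), ("r1", "h2"), ("r1", "h3")]))  # False
```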

Real-Life Traces
- Analysis of Facebook traces: 33% of job time is spent in the network
- Network links are highly utilized; why?
- Determine the causes of network traffic: job input, pre-processing, and job output

Current Ways to Improve HDFS Transfers
- Change network paths: Hedera, MicroTE, c-Through, Helios
- Change network rates: Orchestra, D3
- Increase network capacity: VL2, PortLand (fat-tree)

The Case for Flexible Endpoints
- The traffic matrix limits the benefits of techniques that change network paths or rates
- The ability to change the traffic matrix itself is important
(Figure: example link utilizations illustrating the limits of path and rate changes)

Flexible Endpoints in HDFS
- Recall the constraints placed by HDFS: 3 replicas, spread across 2 fault domains
- It doesn't matter where the replicas go, as long as the constraints are met
- The source of the transfer is fixed, but the destinations (the locations of the 3 replicas) are not

Sinbad
- Determines placement for block replicas
- Constraints: 3 copies, spread across 2 fault domains
- Benefits: replicas are placed to avoid hotspots; faster writes and faster transfers

Sinbad: Ideal Algorithm
- Input: blocks of different sizes, links of different capacities
- Objective: minimize write time (transfer time)
- Challenges: lack of future knowledge about the location and duration of hotspots, and about the size and arrival times of new replicas

Sinbad Heuristic
- Assumptions: link utilizations are stable (true for 5-10 seconds); all blocks have the same size (fixed-size large blocks)
- Heuristic: pick the least-loaded link/path, and send the block from the file with the least amount of data left to send (see the sketch below)
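A minimal sketch of the greedy heuristic described above, under the stated assumptions (stable link utilization, fixed-size blocks); this is not the Sinbad implementation, and the rack names, utilization values, and helper names are hypothetical.

```python
# Minimal sketch (not the Sinbad implementation): greedy replica-destination
# choice under the assumptions above.

def pick_destination(candidate_racks, link_utilization):
    """Pick the rack reachable over the least-loaded downlink.

    candidate_racks: racks that still satisfy the fault-domain constraint.
    link_utilization: dict mapping rack -> current downlink utilization (0..1).
    """
    return min(candidate_racks, key=lambda rack: link_utilization[rack])

def pick_next_block(pending_writes):
    """Among files with blocks waiting to be written, prefer the one with the
    least data remaining (shortest-remaining-first ordering of writers)."""
    return min(pending_writes, key=lambda w: w["bytes_remaining"])

# Example: three candidate racks; rack "r2" has the least-loaded downlink.
util = {"r1": 0.8, "r2": 0.2, "r3": 0.5}
print(pick_destination(["r1", "r2", "r3"], util))  # -> "r2"
```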

Sinbad Architecture
- Recall: the original DFS uses a master-slave architecture
- Sinbad has a similar structure

Sinbad (recap)
- Determines placement for block replicas
- Constraints: 3 copies, spread across 2 fault domains
- Benefits: replicas are placed to avoid hotspots; faster writes and faster transfers

Orchestrating the Entire Cluster
- How do we control compute, network, and storage together?
- Challenges from Sinbad:
- How to determine future replica demands? You can't control job arrivals, but you can control task scheduling; if you can predict job characteristics, you can determine future demand
- How to determine future hotspots? Control all network traffic (with SDN) and use the predicted future demand

Ideal Centralized Entity
- Controls: storage, CPU, network
- Determines: which task to run, where to run it, when to start each network transfer, what rate to transfer at, and which network path to use
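As a rough illustration of what such a controller would have to decide jointly, here is a hypothetical decision record; the field names and values are illustrative and not taken from any real system.

```python
# Minimal sketch of the joint decision such a centralized controller would
# emit per task; all names and values are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class ScheduleDecision:
    task_id: str               # which task to run
    node: str                  # where to run it
    start_time: float          # when to start (seconds from now)
    transfer_rate_mbps: float  # what rate its network transfer gets
    network_path: List[str]    # which switches/links the transfer traverses

decision = ScheduleDecision("job42-map-3", "rack2-host7", 0.0, 800.0,
                            ["tor2", "agg1", "tor5"])
print(decision)
```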

Predicting Job Characteristics
- To predict the resources a job needs to complete, what do you need?

Predicting Job Characteristics
- The job's DAG (from its trace history)
- Computation time for each node
- Data transfer size between nodes
- Transfer time between nodes

Things You Absolutely Know
- Input data: its size, the location of all replicas, and how the input is split
- The job's DAG: the number of map tasks and the number of reduce tasks
(Figure: example job with 200 GB of input in HDFS, 3 mappers, 2 reducers, output written back to HDFS)

Approaches to Prediction: Input/Intermediate/Output Data
- Assumption: map and reduce run the same code over and over, so the code gives the same reduction ratio each time
- E.g., 50% reduction from map input to intermediate data; 90% reduction from intermediate data to output
- Implication: given the input size, you can determine the size of future transfers (see the sketch below)
- Problem: this is not always true!
(Figure: 200 GB in HDFS -> map -> 100 GB intermediate -> reduce -> 10 GB in HDFS)
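A minimal sketch of this ratio-based size prediction, using the slide's example ratios (50% map reduction, 90% reduce reduction); the function and parameter names are illustrative.

```python
# Minimal sketch of ratio-based size prediction: reuse per-stage reduction
# ratios observed in previous runs of the same job.

def predict_stage_sizes(input_gb, map_selectivity=0.5, reduce_selectivity=0.1):
    """Estimate intermediate and output data sizes from the input size and
    the per-stage reduction ratios."""
    intermediate_gb = input_gb * map_selectivity
    output_gb = intermediate_gb * reduce_selectivity
    return intermediate_gb, output_gb

# Matches the slide's example: 200 GB input -> 100 GB intermediate -> 10 GB output.
print(predict_stage_sizes(200))  # (100.0, 10.0)
```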

Approaches to Prediction: Task Run Time
- Assumption: a task is dominated by reading its input, so its run time is essentially its read time
- If local: time to read from disk; if non-local: time to read across the network
- Implication: if you can model read time, you can determine task run time (see the sketch below)
- Problems: how do you model disk I/O? How do you model I/O and interrupt contention?
(Figure: same 200 GB -> 100 GB -> 10 GB pipeline as before)
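A minimal sketch of the read-time model described above; the disk and network throughput numbers are illustrative assumptions, not measurements.

```python
# Minimal sketch: approximate task run time as input read time, using disk
# bandwidth for local reads and available network bandwidth for non-local
# reads. Throughput numbers below are assumed, not measured.

def predict_task_runtime(input_gb, is_local,
                         disk_mbps=200.0, network_mbps=100.0):
    """Return the estimated task run time in seconds."""
    size_mb = input_gb * 1024
    bandwidth = disk_mbps if is_local else network_mbps
    return size_mb / bandwidth

print(predict_task_runtime(64 / 1024, is_local=True))   # one 64 MB block from local disk
print(predict_task_runtime(64 / 1024, is_local=False))  # same block read over the network
```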

Predicting Job Runs
- Given predictions for tasks, transfers, and the DAG, can you predict job completion time?
- How do you account for interleaving between jobs?
- How do you determine the optimal number of slots?
- How do you determine the optimal network bandwidth?

Really easy, right? But what happens if only 2 slots are available? You can't run all the maps in parallel.
(Figure: timeline of the example job's map and reduce tasks, with markers at 0, 10, and 40 sec)

Which tasks to run, and in which order? How many slots to assign? (See the sketch below.)
(Figure: an alternative schedule of the same tasks, with markers at 0, 3, 13, and 33 sec)
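To make the question concrete, here is a small illustrative simulation: with a fixed number of slots, the order in which tasks are launched changes the overall finish time. The task durations are made up for the example, not taken from the figures.

```python
# Minimal sketch: greedy list scheduling on a fixed number of slots shows
# that task ordering affects the makespan. Durations are illustrative.
import heapq

def simulate(task_durations, num_slots):
    """Assign each task, in the given order, to the slot that frees up first;
    return the overall finish time (makespan) in seconds."""
    slots = [0.0] * num_slots          # time at which each slot becomes free
    heapq.heapify(slots)
    for d in task_durations:
        start = heapq.heappop(slots)   # earliest-available slot
        heapq.heappush(slots, start + d)
    return max(slots)

tasks = [2, 23, 3, 30, 8]                         # task run times in seconds
print(simulate(tasks, 2))                         # submission order -> 35.0
print(simulate(sorted(tasks, reverse=True), 2))   # longest-first     -> 33.0
```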

Approaches to Prediction: Job Run Times
- Assumption: job runtime is a function of the number of slots
- Implication: given N slots, you can predict completion time
- Jockey approach [EuroSys'12]:
- Track job progress as the fraction of completed tasks
- Build a map from {% done, # of slots} to time-to-complete, using a simulator to iterate over all combinations of slot counts and progress (see the sketch below)
- Problems: ignores network transfers and congestion; cross-job contention on a server can affect completion time; not all tasks are equal, so the fraction of tasks done isn't a good measure of progress
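A minimal sketch of the table-based idea described above, not Jockey's actual code: an offline simulator fills a map from (fraction done, number of slots) to predicted remaining time, which the scheduler then looks up. The toy simulator and all numbers are illustrative.

```python
# Minimal sketch (not Jockey's code): precompute C[(progress, slots)] =
# predicted remaining time with an offline simulator, then look it up online.

def build_completion_map(simulate_remaining, progress_points, slot_options):
    """simulate_remaining(progress, slots) -> remaining seconds, supplied by
    an offline simulator driven by the job's history."""
    return {(p, s): simulate_remaining(p, s)
            for p in progress_points
            for s in slot_options}

# Toy simulator: 100 equal tasks of 10 s each, perfectly parallelizable.
def toy_simulator(progress, slots):
    remaining_tasks = round((1.0 - progress) * 100)
    return 10.0 * remaining_tasks / slots

table = build_completion_map(toy_simulator,
                             progress_points=[i / 10 for i in range(11)],
                             slot_options=[10, 20, 50])
print(table[(0.5, 20)])  # predicted time left at 50% done with 20 slots -> 25.0
```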

Open Questions
- What about background traffic? Control messages and other bulk transfers
- What about unexpected events? Failures? Loss of data?
- What about protocol inefficiencies? Hadoop scheduling, TCP inefficiencies, server scheduling