Demystifying and Controlling the Performance of Big Data Jobs. Theophilus Benson, Duke University.

Presentation transcript:

Demystifying and Controlling the Performance of Big Data Jobs. Theophilus Benson, Duke University

Why are Data Centers Important? Internal users: line-of-business apps, production test beds. External users: web portals, web services, multimedia applications, chat/IM.

Why are Data Centers Important? Poor performance → loss of revenue. Understanding traffic is crucial; traffic engineering is crucial.

Canonical Data Center Architecture [figure]: Core (L3) → Aggregation (L2) → Top-of-Rack Edge (L2) → Application servers.

Dataset: Data Centers Studied

DC Role              DC Name   Location     Number of Devices
Universities         EDU1      US-Mid       22
                     EDU2      US-Mid       36
                     EDU3      US-Mid       11
Private Enterprise   PRV1      US-Mid       97
                     PRV2      US-West      100
Commercial Clouds    CLD1      US-West      562
                     CLD2      US-West      763
                     CLD3      US-East      612
                     CLD4      S. America   427
                     CLD5      S. America   427

10 data centers in 3 classes: universities, private enterprise, and commercial clouds.
– Internal users (universities/private): small deployments, local to campus
– External users (clouds): large deployments, globally diverse

Dataset: Collection
SNMP: poll SNMP MIBs for bytes-in, bytes-out, and discards; 10+ days of data, averaged over 5-minute intervals.
Packet traces: Cisco port span, 12 hours.
Topology: Cisco Discovery Protocol.

DC Name   SNMP   Packet Traces   Topology
EDU1      Yes    Yes             Yes
EDU2      Yes    Yes             Yes
EDU3      Yes    Yes             Yes
PRV1      Yes    Yes             Yes
PRV2      Yes    Yes             Yes
CLD1      Yes    No              No
CLD2      Yes    No              No
CLD3      Yes    No              No
CLD4      Yes    No              No
CLD5      Yes    No              No
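
As a hedged illustration (not the study's actual tooling) of what this style of 5-minute SNMP polling yields, the sketch below converts two successive IF-MIB byte-counter readings into average link utilization, with simplified handling of 32-bit counter wrap.

```python
# Illustrative sketch: convert two successive SNMP byte counters
# (IF-MIB::ifInOctets / ifOutOctets, polled every 5 minutes) into
# average link utilization over the polling interval.

POLL_INTERVAL_S = 300          # 5-minute polling, as in the study
COUNTER_MAX = 2 ** 32          # 32-bit SNMP counters wrap around

def delta(prev, curr, counter_max=COUNTER_MAX):
    """Byte delta between two counter samples, handling a single wrap."""
    return curr - prev if curr >= prev else counter_max - prev + curr

def utilization(prev_octets, curr_octets, link_speed_bps):
    """Average utilization (0..1) of a link over one polling interval."""
    bits = 8 * delta(prev_octets, curr_octets)
    return bits / (link_speed_bps * POLL_INTERVAL_S)

# Example: a 1 Gbps link that moved ~3.75 GB in 5 minutes is ~10% utilized.
print(utilization(prev_octets=100_000_000,
                  curr_octets=100_000_000 + 3_750_000_000,
                  link_speed_bps=1_000_000_000))
```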

Canonical Data Center Architecture [figure]: Core (L3) → Aggregation (L2) → Top-of-Rack Edge (L2) → Application servers, with packet sniffers attached at the edge.

Applications. Start at the bottom of the architecture: analyze the running applications using the packet traces. The BroID tool is used for application identification, to quantify the amount of traffic from each app.

Applications. CLD4, CLD5: MapReduce applications. CLD1, CLD2, CLD3: MSN, Mail.

Analyzing Packet Traces. Transmission patterns of the applications; packet-level properties are crucial for understanding the effectiveness of techniques. Traffic at the edges is ON-OFF: binned at 15 ms and 100 ms, we observe that the ON-OFF pattern persists. Routing must react quickly to overcome bursts.
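
To make the binning concrete, here is an illustrative sketch (my own code, not the authors' analysis scripts): packet timestamps are bucketed into fixed-size bins, and runs of busy and idle bins give the ON and OFF period lengths.

```python
# Illustrative sketch: derive ON/OFF periods from packet timestamps by
# binning, in the spirit of the 15 ms / 100 ms bins used in the talk.
from itertools import groupby

def on_off_periods(timestamps, bin_size=0.015):
    """Return (on_periods, off_periods) in seconds from packet timestamps."""
    if not timestamps:
        return [], []
    start, end = min(timestamps), max(timestamps)
    n_bins = int((end - start) / bin_size) + 1
    busy = [False] * n_bins
    for t in timestamps:
        busy[int((t - start) / bin_size)] = True
    on, off = [], []
    for is_busy, run in groupby(busy):
        length = sum(1 for _ in run) * bin_size
        (on if is_busy else off).append(length)
    return on, off

# Example: two short bursts separated by a quiet gap.
on, off = on_off_periods([0.001, 0.004, 0.010, 0.200, 0.203], bin_size=0.015)
print(on, off)   # two ON periods, one OFF period in between
```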

Possible Causes of Burstiness. Transport level: TCP delayed ACK. Ideal: one ACK releases one data packet (Data 1 / Ack 1, Data 2 / Ack 2 alternate in lock step).

Possible Causes of Burstiness. Transport level: TCP delayed ACK. Reality: one ACK releases multiple packets (Data 1, Data 2 ... Ack 1 & 2 ... Data 3, Data 4). The delayed-ACK mechanism is an optimization to reduce network load, but it makes transmissions burstier.
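
As a side note, on Linux the delayed-ACK behaviour can be suppressed per socket with the TCP_QUICKACK option. The snippet below is an illustrative sketch only (the endpoint is a placeholder, and TCP_QUICKACK is not sticky, so long-lived connections generally re-arm it after each read).

```python
# Hedged illustration: ask the kernel to ACK immediately instead of
# delaying/coalescing ACKs (Linux-specific; TCP_QUICKACK is not sticky).
import socket

sock = socket.create_connection(("example.com", 80))   # placeholder endpoint
if hasattr(socket, "TCP_QUICKACK"):                     # Linux only
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
sock.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(sock.recv(200))
sock.close()
```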

Possible Causes of Burstiness. Physical level: interrupt coalescing. Ideal: each packet is processed as it arrives (NIC → driver → OS → application, one packet at a time).

Possible Causes of Burstiness. Physical level: interrupt coalescing. Reality: packets are batched between the NIC and the OS (the NIC raises one interrupt for several packets, e.g. Data 1, Data 2, Data 3), so they are handed up and released in bursts.
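
Interrupt coalescing is typically visible and tunable through ethtool. The sketch below (the interface name eth0 is a placeholder; ethtool must be installed) simply shells out to `ethtool -c` to read the current rx coalescing parameters.

```python
# Hedged sketch: inspect NIC interrupt-coalescing settings via ethtool.
# "eth0" is a placeholder interface name.
import subprocess

def show_coalescing(iface="eth0"):
    out = subprocess.run(["ethtool", "-c", iface],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith(("rx-usecs", "rx-frames")):
            print(line)

# Lowering rx-usecs/rx-frames (via ethtool -C) trades CPU load for smaller
# batches handed to the stack, i.e. less burstiness from coalescing.
show_coalescing()
```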

Data-Center Traffic is Bursty. Understanding the arrival process: what range of models is acceptable, and what is the arrival process? The ON periods, OFF periods, and inter-arrival times are all heavy-tailed and are fit by lognormal distributions across the data centers (with Weibull appearing for the EDU traces). This differs from the Pareto behaviour typical of WAN traffic, so new models are needed to generate representative traffic.
– PRV2_1, PRV2_2, PRV2_3, PRV2_4: Lognormal
– EDU1, EDU2, EDU3: Lognormal, Weibull
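
As a sketch of the kind of fitting behind these results (synthetic data, SciPy assumed, not the study's actual code), one can fit lognormal and Weibull candidates to observed inter-arrival times and compare their log-likelihoods:

```python
# Illustrative sketch: fit candidate heavy-tailed distributions to
# inter-arrival times and compare the fits (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
interarrivals = rng.lognormal(mean=-6.0, sigma=1.2, size=5000)  # synthetic

for name, dist in [("lognormal", stats.lognorm), ("weibull", stats.weibull_min)]:
    params = dist.fit(interarrivals, floc=0)            # fix location at 0
    loglik = np.sum(dist.logpdf(interarrivals, *params))
    print(f"{name:10s} log-likelihood = {loglik:.1f}")
```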

Possible Causes for the Heavy Tail. The distribution of file sizes in a file system (NFS) → log-normal. The distribution of file sizes on the web (HTTP files) → log-normal.

Implications of the Distribution. Lognormal vs. Pareto: both are heavy-tailed distributions; the difference is in the length of the tail. The 99th percentile is used to determine SLAs, and the length of the tail impacts the 99th percentile, so modeling with the wrong distribution → SLA violations.
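
To make the SLA point concrete, here is a small illustrative comparison (the parameters are made up for illustration, not taken from the study): a lognormal and a Pareto distribution with a similar median can still disagree noticeably at the 99th percentile.

```python
# Illustrative sketch: the 99th percentile depends strongly on which
# heavy-tailed family is fit (parameters chosen only for illustration).
from scipy import stats

lognormal = stats.lognorm(s=1.0, scale=1.0)   # median 1.0
pareto = stats.pareto(b=1.5, scale=0.63)      # roughly the same median

for name, dist in [("lognormal", lognormal), ("pareto", pareto)]:
    print(f"{name:10s} median={dist.median():.2f}  p99={dist.ppf(0.99):.2f}")
```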

Canonical Data Center Architecture [figure]: Core (L3) → Aggregation (L2) → Top-of-Rack Edge (L2) → Application servers.

Data Center Interconnect: Questions. How much locality exists in data centers? What does utilization at the core look like?

Intra-Rack Versus Extra-Rack. Quantify the amount of traffic that uses the interconnect, as perspective for the interconnect analysis. Extra-rack traffic = sum of the ToR uplink traffic; intra-rack traffic = sum of the server-link traffic minus the extra-rack traffic.
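
The decomposition is just arithmetic over per-link byte counts; a hedged sketch with hypothetical inputs:

```python
# Hedged sketch of the decomposition above, using per-interval byte counts.
def rack_traffic_split(server_link_bytes, uplink_bytes):
    """server_link_bytes: bytes seen on each server-facing ToR port;
    uplink_bytes: bytes seen on each ToR uplink toward aggregation."""
    extra_rack = sum(uplink_bytes)                      # traffic leaving the rack
    intra_rack = sum(server_link_bytes) - extra_rack    # stays within the rack
    return intra_rack, extra_rack

intra, extra = rack_traffic_split(server_link_bytes=[40e9, 35e9, 25e9],
                                  uplink_bytes=[20e9, 5e9])
print(f"intra-rack share = {intra / (intra + extra):.0%}")  # e.g. 75%
```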

Intra-Rack Versus Extra-Rack: Results. Clouds: most traffic (75%) stays within a rack, due to colocation of applications and their dependent components. Other DCs: more than 50% of traffic leaves the rack, reflecting un-optimized placement.

Extra-Rack Traffic on the DC Interconnect. Utilization: core > aggregation > edge, since traffic from many links is aggregated onto few. The tail of the core utilization distribution differs: hot-spots are links with > 70% utilization, and the prevalence of hot-spots differs across data centers.

Persistence of Core Hot-Spots. Low persistence: PRV2, EDU1, EDU2, EDU3, CLD1, CLD3. High persistence / low prevalence: PRV1, CLD2, where 2-8% of links are hot-spots more than 50% of the time. High persistence / high prevalence: CLD4, CLD5, where 15% of links are hot-spots more than 50% of the time.
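
One way to tally persistence (how often a given link is hot) and prevalence (how many links are ever hot), assuming per-interval utilization series per link; this is my illustration of the metrics, not the authors' code:

```python
# Illustrative sketch: classify links by how often they exceed the
# hot-spot threshold (70% utilization, as in the talk).
HOT = 0.70

def hotspot_stats(util):
    """util: dict mapping link name -> list of per-interval utilizations (0..1)."""
    persistence = {link: sum(u > HOT for u in series) / len(series)
                   for link, series in util.items()}
    prevalence = sum(p > 0 for p in persistence.values()) / len(persistence)
    return persistence, prevalence

util = {"core-1": [0.9, 0.85, 0.4], "core-2": [0.2, 0.3, 0.25]}
persistence, prevalence = hotspot_stats(util)
print(persistence)   # core-1 is a hot-spot 2/3 of the time
print(prevalence)    # half of the links are ever hot-spots
```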

Traffic Matrix Analysis (figure: ToR-to-ToR traffic matrix over racks A, B, C, D). Predictability: at most a 5% difference across a set of time periods.
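
A sketch of one reasonable reading of the 5% test (my interpretation, not the paper's exact method): consecutive ToR-to-ToR traffic matrices are compared against a base matrix, and the matrix counts as predictable over the window if the relative change stays under 5%.

```python
# Illustrative sketch: call a traffic matrix "predictable" over a window of
# time periods if it stays within 5% (relative change) of the first matrix.
import numpy as np

def is_predictable(matrices, tolerance=0.05):
    base = matrices[0].astype(float)
    for tm in matrices[1:]:
        diff = np.abs(tm - base).sum() / max(base.sum(), 1e-12)
        if diff > tolerance:
            return False
    return True

tms = [np.array([[0, 10], [12, 0]]),
       np.array([[0, 10.2], [11.9, 0]])]
print(is_predictable(tms))   # True: total change well under 5%
```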

Is Data Center Traffic Predictable? (Figure: values of 99% and 27% shown.)

How Long is Traffic Predictable? There are different patterns of predictability; 1 second of historical data is able to predict the future 1.5 –

Observations from the Study. Link utilization is low at the edge and aggregation layers; the core is the most utilized. Hot-spots exist (> 70% utilization), but fewer than 25% of links are hot-spots, and loss occurs on less-utilized links (< 70%), implicating momentary bursts. Time-of-day variations exist, and the variation is an order of magnitude larger at the core. Packet sizes follow a bimodal distribution.

Retrospective Remarks. A new age of data center networks: Bing is deploying Clos-style networks.

Future Data Center Networks. Larger bisection bandwidth → less contention for network resources.

Retrospective Remarks. A new age of data center networks: Bing is deploying Clos-style networks, and server links are moving to 10 Gig. Implications for the results: burstiness at the edge holds and perhaps grows (applications are still the same, and there are more NIC optimizations); flow characteristics remain unchanged; core utilization is likely to change, with traffic more evenly distributed across the core layers.

Insights Gained. 75% of traffic stays within a rack in the clouds, because applications are not uniformly placed. At most 25% of core links are highly utilized. Data center traffic is partially predictable.

End of the Talk! Please turn around and proceed to a safe part of this talk