Presentation on theme: "Group Research 1: AKHTAR, Kamran; SU, Hao; SUN, Qiang; TANG, Yue"— Presentation transcript:
1 Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. Group Research 1: AKHTAR, Kamran; SU, Hao; SUN, Qiang; TANG, Yue; YANG, Xiaofan
2 Summary
This is the first large-scale analysis of failures in a data center network. Using multiple data sources commonly collected by network operators, the study characterizes failure events within data centers and estimates failures, their impact, and the effectiveness of network redundancy.
Key findings: Commodity switches exhibit high reliability, which supports current proposals to design flat networks from commodity components. Middle boxes such as load balancers exhibit high failure rates, highlighting the need for studies to better manage them. Finally, at both the network and application layers, more investigation is needed to analyze and improve the effectiveness of redundancy.
3 OUTLINE
Part 1: Introduction
Part 2: Background
Part 3: Methodology and Data Sets
Part 4: Failure Analysis
Part 5: Estimating Failure Impact
Part 6: Discussion
Part 7: Related Work
Part 8: Conclusions and Future Work
13 3.1 Existing data sets
1. Network event logs (SNMP/syslog)
2. NOC tickets: information about when and how events were discovered, and when they were resolved (used by operators)
3. Network traffic data
4. Network topology data
14 3.2–3.4 Defining and identifying failures with impact
Link failures
Device failures
"Provisioning": no data transferred before the failure, some data transferred during it
15 For link failures:
Eliminate spurious notifications
Focus on measurable events
For device failures:
Require at least one link failure within a time window of five minutes
Only failure events that impacted network traffic are kept
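As a sketch of these two filters (field names and the impact test are illustrative assumptions, not the paper's exact code; the paper works from SNMP/syslog timestamps and traffic measurements):

```python
# Sketch of the failure filters described on this slide.
# All names here are hypothetical.

def has_impact(traffic_before: float, traffic_during: float) -> bool:
    """A failure 'with impact': the link carried traffic before the
    event and carried less traffic while the failure was active."""
    return traffic_before > 0 and traffic_during < traffic_before

def confirms_device_failure(device_down_ts: float,
                            link_failure_ts: list[float],
                            window_s: float = 300.0) -> bool:
    """Keep a device-down event only if at least one of the device's
    links also logged a failure within a five-minute (300 s) window."""
    return any(abs(t - device_down_ts) <= window_s for t in link_failure_ts)
```

The "provisioning" case from the previous slide (no traffic before, some traffic during) is excluded by the `traffic_before > 0` check.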
17 Outline
Failure event panorama
Daily volume of failures
Probability of failure
Aggregate impact of failures
Properties of failures
Grouping link failures
Root causes of failures
18 Failure event panorama
All failures vs. failures with impact
Widespread failures
Long-lived failures
19 Daily volume of failures
Link failures are variable and bursty
Device failures are usually caused by maintenance
Table 4: Failures per time unit
20 Probability of failure
Load balancers have the highest failure probability
ToRs have low failure rates
Load balancer links have the highest rate of logged failures
Management and inter-data center links have the lowest failure rates
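One way to read "probability of failure" on this slide: the fraction of devices of a given type that logged at least one failure over the measurement period. A minimal sketch (device types and counts below are made up for illustration):

```python
def failure_probability(population: dict[str, int],
                        failed_devices: dict[str, set]) -> dict[str, float]:
    """Fraction of each device type's population that logged at least
    one failure during the measurement period."""
    return {dtype: len(failed_devices.get(dtype, set())) / count
            for dtype, count in population.items()}

# Hypothetical populations: 10 load balancers, 100 ToRs,
# of which two load balancers failed at least once.
probs = failure_probability({"LB": 10, "ToR": 100},
                            {"LB": {"lb-1", "lb-2"}})
```

With these made-up numbers, `probs` is `{"LB": 0.2, "ToR": 0.0}`: counting distinct failed devices (a set) rather than failure events keeps a single flapping device from inflating the probability.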
21 Aggregate impact of failures
Load balancers have the most failures, but ToRs have the most downtime
Load balancer links experience many failure events but relatively little downtime
Load balancer failures are dominated by a few failure-prone devices
22 Properties of failures: time to repair
Load balancers experience short-lived failures
ToRs experience correlated failures
Inter-data center links take the longest to repair
23 Properties of failures: time between failures
Load balancer failures are bursty
Link flapping is absent from the actionable network logs
MGMT, CORE, and ISC links are the most reliable in terms of time between failures
24 Properties of failures: reliability of network elements
Data center networks experience high availability
Links have high availability (many links have more than four 9's of reliability)
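"9's of reliability" can be derived from logged downtime; a minimal sketch, assuming total downtime has been summed from the failure logs:

```python
import math

def availability(downtime_s: float, period_s: float) -> float:
    """Fraction of the measurement period the element was up."""
    return 1.0 - downtime_s / period_s

def nines(avail: float) -> float:
    """Number of leading 9's in the availability figure,
    e.g. 0.999 -> 3 nines, 0.99999 -> 5 nines."""
    return -math.log10(1.0 - avail)

# Hypothetical link: about 5.26 minutes of downtime in a year.
a = availability(315.36, 365 * 24 * 3600)
```

For this made-up link, `a` is 0.99999, i.e. five 9's of availability.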
25 Grouping link failures
To group correlated failures:
Require that link failures occur in the same data center
Require that failures occur within a predefined time threshold
Finding: link failures tend to be isolated
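A sketch of this grouping rule (the one-hour threshold below is an illustrative choice, not necessarily the value used in the paper):

```python
def group_link_failures(events: list[tuple[str, float]],
                        threshold_s: float = 3600.0) -> list[list[tuple[str, float]]]:
    """Cluster (data_center, start_time) failure events: an event joins
    the current group if it is in the same data center and starts within
    `threshold_s` seconds of the group's most recent event; otherwise it
    opens a new group."""
    groups: list[list[tuple[str, float]]] = []
    for dc, t in sorted(events, key=lambda e: (e[0], e[1])):
        if groups and groups[-1][-1][0] == dc and t - groups[-1][-1][1] <= threshold_s:
            groups[-1].append((dc, t))
        else:
            groups.append([(dc, t)])
    return groups
```

The finding that "link failures tend to be isolated" corresponds to most groups produced by a rule like this containing a single event.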
26 Root causes of failures
The study leverages the "Problem type" field of the NOC tickets
Hardware problems take longer to mitigate
Load balancers are affected by software problems
Link failures are dominated by connection and hardware problems
28 5.1 Is redundancy effective in reducing impact?
Several reasons why redundancy may not be 100% effective:
1. Bugs in fail-over mechanisms can arise if there is uncertainty as to which link or component is the backup.
2. If the redundant components are not configured correctly, they will not be able to reroute traffic away from the failed component.
3. Protocol issues such as TCP backoff, timeouts, and spanning tree reconfigurations may result in loss of traffic.
Network redundancy helps, but it is not entirely effective.
29 5.2 Redundancy at different layers of the network topology
Links highest in the topology benefit most from redundancy.
Links from ToRs to aggregation switches benefit the least from redundancy, but have low failure impact.
31 Low-end switches exhibit high reliability
Low cost
The lowest failure rates and failure probabilities
However, as populations of these devices rise, the absolute number of failures observed will inevitably increase.
32 Improve reliability of middle-boxes
Middle-box failures need to be taken into account
Development of better management and debugging tools
Software load balancers running on commodity servers
Load balancer links have the highest rate of logged failures
Management and inter-data center links have the lowest failure rates
33 Improve the effectiveness of network redundancy
Network redundancy in the studied system is only 40% effective at masking the impact of network failures.
One cause: configuration issues that leave redundancy ineffective at masking failures, e.g. the backup link was subject to the same flaw as the primary.
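The "40% effective" figure compares traffic carried during a failure to traffic carried before it; a sketch of that normalized-traffic metric (the paper computes it from median traffic over measurement windows, and compares the per-link ratio against the ratio across the whole redundancy group):

```python
def normalized_traffic(median_before: float, median_during: float) -> float:
    """Traffic carried during the failure as a fraction of traffic
    carried before it. 1.0 means the failure was fully masked;
    values near 0 mean the traffic was lost."""
    if median_before <= 0:
        return float("nan")  # link carried no traffic before the failure
    return median_during / median_before
```

If the failed link alone drops to a ratio of 0.0 but the redundancy group as a whole still carries a ratio of 0.4, redundancy masked 40% of the lost traffic, matching the slide's claim.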
34 Separate control plane from data plane
As the NOC ticket cases show, separating the control plane from the data plane becomes even more crucial to avoid impact on hosted applications.
36 Application failures
Another study found:
The majority of failures occur during the TCP handshake as a result of end-to-end connectivity issues.
Web access failures are dominated by server-side issues.
These findings highlight the importance of studying failures in data centers hosting Web services.
37 Network failures
Some studies observe significant instability and flapping as a result of external routing protocols.
Unlike those studies, this work does not observe link flapping, owing to data sources geared towards actionable events.
Some studies find that 70% of failures involve only a single link; similarly, this work observes that the majority of failures in data centers are isolated.
Some studies also observe longer times to repair on wide area links, similar to this work's observations for wide area links connecting data centers.
38 Failures in cloud computing
Some studies consider the availability of distributed storage and observe that the majority of failures involving more than ten storage nodes are localized within a single rack.
This work also observes spatial correlations, but they occur higher in the network topology, where multiple ToRs associated with the same aggregation switch have correlated failures.
40 Give your own opinion about what you think is good or bad about the paper, e.g. how could it be improved?
41 References
V. Padmanabhan, S. Ramabhadran, S. Agarwal, and J. Padhye. A study of end-to-end web access failures. In CoNEXT.
B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: A large-scale field study. In SIGMETRICS.
A. Shaikh, C. Isett, A. Greenberg, M. Roughan, and J. Gottlieb. A case study of OSPF behavior in a large enterprise network. In ACM IMW.
D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage. California fault lines: Understanding the causes and impact of network failures. In SIGCOMM.
K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In Symposium on Cloud Computing (SOCC).
The figures come from the project paper, Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications.