0 HOMME Trace Analysis
  Fabrice Mizero
  Mentor: Dr. John Dennis
  Collaborators: Prof. Malathi Veeraraghavan (University of Virginia), Prof. Robert D. Russell (University of New Hampshire), Qian Liu (University of New Hampshire)
  Aug 1, 2014
1 Roadmap
  Motivation
  Background
  Methodology
  Results
  Conclusion and Solutions
  Future Work
2 Big Picture
  Understanding the causes of poor performance of CESM on Yellowstone: a 5-step approach
  1. Experimental execution and data collection
  2. HOMME trace analysis
  3. IBMgtSim: routing study
  4. Network simulation
  5. Integrated simulation
3 [Figure: 3-level network diagram showing 2-hop, 4-hop, and 6-hop paths. Credit: Dr. John Dennis, Zhengyang Liu]
4 Suspected Causes: Network Congestion, OS Jitter
  “…OS noise, shape of the allocated partition, and interference from other jobs.” (Abhinav Bhatele et al., SC13)
  Network congestion: head-of-line blocking; credit-based flow control
  OS jitter: kernel interrupts
  Application interference: self-interference; interference with others (neighborhood effect)
  Competition against OS daemons, timer interrupts, buffer-cache synchronization, etc.
5 Congestion: Head-of-Line (HOL) Blocking
  Worst-case scenario: congestion spreading due to HOL blocking
  [Figure: hosts H1 to H7 attached to switches S1 and S2; annotations mark a victim flow, two "Out of buffer space" points, and a "Stuck" flow]
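The head-of-line effect on this slide can be illustrated with a toy model. The sketch below is not taken from the presentation: it assumes a single FIFO input queue in which only the head packet may leave, and only if its output port is not blocked. The function name `forward` and the port labels are hypothetical.

```python
from collections import deque


def forward(queue, blocked_ports, ticks):
    """Toy model of one FIFO input queue (an assumption, not a real switch).

    queue: output-port labels of the queued packets, head first.
    blocked_ports: set of output ports that cannot accept packets.
    Returns (delivered packets, packets still queued) after `ticks` steps.
    """
    q = deque(queue)
    delivered = []
    for _ in range(ticks):
        # Only the head packet is eligible; packets behind it must wait
        # even if their own output port is free.
        if q and q[0] not in blocked_ports:
            delivered.append(q.popleft())
    return delivered, list(q)


# Port "A" is congested. The packets for the free port "B" that sit
# behind the "A" packet are the victims.
delivered, stuck = forward(["B", "A", "B", "B"], {"A"}, ticks=4)
print(delivered, stuck)
```

In this model the first "B" packet leaves, then the "A" packet at the head blocks the two remaining "B" packets for the rest of the run, even though their port is idle.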
6 OS Jitter
  Each compute node runs its own OS (RHEL).
  Interference caused by OS routines: timer interrupts, OS daemons, hardware interrupts
  Competition for CPU resources. Example: line printer daemon
7 Three Questions
  How does congestion impact network latency?
  How important is OS jitter to network latency?
  Which has a bigger impact on message latency: OS jitter or congestion?
8 Experimental Set-Up
  Congestion:
    2 platforms: Jellystone (non-production machine), Yellowstone (production machine)
    Different message sizes and hop distances
  OS jitter:
    Linux Transparent Huge Pages (THP)
9 Methodology
  [Flowchart: Extrae trace collection, then clock skew correction, then samples labeled by (hop, size), then Wilcoxon rank sum test]
10 Extrae
  Tracing tool developed at BSC
  Chronological event, state, and communication records
  One-way communication delays; visualization with Paraver
  [Figure: timeline of an MPI-Isend call with Start and End marked on a Time axis]
11 Clock Skew
  Correction applied at the same-size, same-hop-count, host-pair level
  Host A timestamp: $C_a(t_1)$; Host B timestamp: $C_b(t_2)$
  Ideally, $C_{AB} = C_b(t_2) - C_a(t_1)$
  In reality, offset $= C_a(t) - C_b(t) \neq 0$ and skew $= C_a'(t) - C_b'(t) \neq 0$
  Min delay: best approximation of offset
  Corrected delay: $C_{AB}(t) - \min(C_{AB}(t)) + \mathrm{minpingpong}$
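A minimal sketch of the correction formula on this slide follows. It assumes that the measured delays all belong to one host pair at one message size and hop count, and that `min_pingpong` is a baseline minimum delay obtained from a separate ping-pong measurement. That reading of "minpingpong" is an assumption, and the function and parameter names are hypothetical.

```python
def correct_delays(measured, min_pingpong):
    """Apply C_AB(t) - min(C_AB(t)) + minpingpong to each sample.

    measured: raw one-way delays C_AB(t) for a single host pair, taken
              at the same message size and hop count (assumption).
    min_pingpong: baseline minimum delay from a ping-pong test
                  (an assumed interpretation of the slide's term).
    """
    if not measured:
        return []
    # The smallest raw delay is used as the best estimate of the clock offset.
    offset_estimate = min(measured)
    return [d - offset_estimate + min_pingpong for d in measured]


# Raw delays are dominated by a clock offset of about 100 time units.
print(correct_delays([105.0, 103.0, 110.0], min_pingpong=2.0))
```

This removes a constant offset only. A drift between the two clocks over the duration of the trace (the skew term on the slide) would need a time-dependent correction, which this sketch does not attempt.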
12 Statistical Methods
  Wilcoxon rank sum test: non-parametric significance test
  Compares two independent samples to test whether one tends to yield larger values than the other (it compares ranks, not means)
  Tests:
    OS jitter? Jellystone: no THP <=> with THP
    Congestion? Yellowstone: 0-hop delays <=> 4-hop delays
    Jellystone: THP <=> Yellowstone: THP
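To show how the test works, here is a small pure-Python sketch of a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction. It is not the tool used in the study, which the slides do not name, and the function name `rank_sum_test` is hypothetical. The approximation is reasonable for sample sizes as large as those in the results; for small samples an exact test is preferable.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (U statistic of x, z score, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of the ranks they occupy.
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # All values identical: no evidence of a difference.
        return u_x, 0.0, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, z, p_two_sided


# Hypothetical delay samples: the first group is consistently smaller.
u, z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, z < 0, p < 0.05)
```

A negative z score means the first sample tends to have smaller values than the second, which for delay data reads as "the first configuration is faster".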
13 Perfquery
  Perfquery: IB performance counters query tool
  PortXmitWait: port congestion monitoring
  Credit-based flow control
  [Figure: Host A and a TOR switch; a "Credits?" decision with Yes and No branches, labeled with PortXmitWait]
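The relationship between credits and PortXmitWait can be sketched with a toy model. The code below is an illustration under simplifying assumptions, not the InfiniBand implementation: one packet costs one credit, at most one packet is sent per tick, and the wait counter grows by one for every tick in which data is queued but no credit is available. All names are hypothetical.

```python
def simulate_port(arrivals, credit_returns, initial_credits):
    """Toy credit-based sender (simplified assumptions, see lead-in).

    arrivals[t]: packets queued for transmission at tick t.
    credit_returns[t]: credits returned by the receiver at tick t.
    Returns (packets sent, ticks spent waiting for credits).
    """
    credits = initial_credits
    backlog = 0
    sent = 0
    xmit_wait = 0
    for arrived, returned in zip(arrivals, credit_returns):
        backlog += arrived
        credits += returned
        if backlog > 0:
            if credits > 0:
                # Credits available: send one packet and consume a credit.
                backlog -= 1
                credits -= 1
                sent += 1
            else:
                # Data is ready but the receiver has no buffer space.
                xmit_wait += 1
    return sent, xmit_wait


# Two packets arrive at once, one credit is on hand, and the receiver
# returns the next credit only at tick 3.
print(simulate_port([2, 0, 0, 0], [0, 0, 0, 1], initial_credits=1))
```

A wait count that keeps rising while traffic is queued is the signature this model associates with a congested port.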
14 Results
  How important is OS jitter to network latency?
  Jellystone::0-Hop::NoTHP vs. Jellystone::0-Hop::THP
  Intranode communication delays are longer with THP enabled than without it.

  Msg size   Sample size      p-value              Interpretation
  488B       54624::45727     <0.001, <0.001, 1    NoTHP is faster than with THP
  1952B      9503::7950
  2440B      102120::85468
  2928B      47504::39764
15 Results
  Which has a bigger impact on message latency: OS jitter or congestion?
  Comparing: Yellowstone: 0-hop delays, 4-hop delays
  For all considered message sizes, intranode communication delays can exceed internode delays.

  Msg size   Sample size      p-values             Interpretation
  488B       54325::23621     <0.001, <0.001, 1    4-Hop is faster than 0-Hop
  2440B      101581::16529
  2928B      47243::21259
  4880B      49603::4720
16 Conclusion
  OS jitter can cause performance degradation or variability.
  Inter-job interference can lead to application performance variability.
  Solutions
    Congestion: dynamic allocation of virtual lanes to redirect victim flows around congested ports
    OS jitter: Linux tickless kernel; MPI-3 for better control over shared-memory communications
17 Future Work
  Further study of the dynamic virtual lanes assignment solution
  Plan and collect new HOMME traces with PortXmitWait monitored and LSF logs saved
  Study intra-job interference
  More efficient algorithm for correcting clock skew