Fabrice Mizero Mentor: Dr. John Dennis Collaborators: Prof. Malathi Veeraraghavan (University of Virginia) Prof. Robert D. Russell (University of New Hampshire)


1 Fabrice Mizero
Mentor: Dr. John Dennis
Collaborators: Prof. Malathi Veeraraghavan (University of Virginia), Prof. Robert D. Russell (University of New Hampshire), Qian Liu (University of New Hampshire)
Aug 1, 2014

2 Roadmap
- Motivation
- Background
- Methodology
- Results
- Conclusion and Solutions
- Future Work

3 Big Picture
Understanding the causes of poor performance of CESM on Yellowstone: a 5-step approach
- Experimental execution and data collection
- HOMME trace analysis
- IBMgtSim: routing study
- Network simulation
- Integrated simulation

4 [Figure with panels labeled 2-hop, 4-hop, and 6-hop]
*Credit: Dr. John Dennis, Zhengyang Liu

5 Suspected Causes
Network Congestion
- Head of Line Blocking
- Credit-Based Flow Control
OS Jitter
- Kernel Interrupts
Application Interference
- Self-Interference
- Interference with others (Neighborhood Effect)
"…OS noise, shape of the allocated partition, and interference from other jobs." Abhinav Bhatele et al., SC13

6 Congestion
- Head of Line Blocking (HOL)
- Worst Case Scenario: Congestion Spreading due to HOL
[Figure: hosts H1 to H7 attached to switches S1 and S2; labels "Stuck!!!", "Out of Buffer Space!!", and "Victim Flow"]
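The "Victim Flow" idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not from the slides: a switch input is modeled as a FIFO, one output port is permanently out of buffer space, and the function name `drain` and the flow labels are hypothetical.

```python
from collections import deque

def drain(queues, blocked_port, ticks):
    """Forward at most one packet per queue per tick; return delivered packets.

    Each queue is a FIFO of (flow, output_port) tuples. Only the head of a
    queue may be forwarded, so a head packet bound for the blocked port
    stalls everything queued behind it.
    """
    queues = [deque(q) for q in queues]
    delivered = []
    for _ in range(ticks):
        for q in queues:
            if q and q[0][1] != blocked_port:
                delivered.append(q.popleft())
    return delivered

# Hypothetical traffic: the "hot" flow targets congested port 1,
# the "victim" flow targets idle port 2.
packets = [("hot", 1), ("victim", 2), ("victim", 2)]

# One shared FIFO: the victim packets sit behind the stuck head packet.
shared = drain([packets], blocked_port=1, ticks=5)

# One queue per output port: the victim flow is no longer held up.
per_port = drain(
    [[p for p in packets if p[1] == 1], [p for p in packets if p[1] == 2]],
    blocked_port=1,
    ticks=5,
)

print(len(shared), len(per_port))
```

With the shared FIFO nothing is delivered, even though port 2 is idle. Splitting traffic into separate queues is loosely analogous to steering a victim flow onto a different lane, but real switches are considerably more involved than this model.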

7 OS Jitter
Each compute node runs its own OS: RHEL
Interference caused by OS routines
- Timer interrupts
- OS Daemons
- Hardware interrupts
Competition for CPU resources
- Example: Line Printer Daemon

8 3 Questions
- How does congestion impact network latency?
- How important is OS Jitter to network latency?
- What has a bigger impact on message latency: OS Jitter or Congestion?

9 Experimental Set-Up
Congestion:
- 2 Platforms
  Jellystone: non-production machine
  Yellowstone: production machine
- Different message sizes and hop distances
OS Jitter:
- Linux Transparent Huge Pages (THP)

10 Methodology
[Diagram labels: Extrae; Trace Collection; Hop, Size; Wilcoxon Rank Sum Test; Clock Skew Correction]

11 Extrae
- Tracing tool developed at BSC
- Chronological event, state, and communication records
- One-way communication delays
- Visuals with Paraver
[Figure labels: MPI-Isend, Start, End, Time]

12 Clock Skew
Host A: C_a(t1)    Host B: C_b(t2)
In reality:
  Offset = C_a(t) - C_b(t) != 0
  Skew = C_a'(t) - C_b'(t) != 0
Ideally, C_AB = C_b(t2) - C_a(t1)
Correction at the host-pair level, for the same message size and the same hop count:
- Min delay: best approximation of offset
- Corrected delay = C_AB(t) - min(C_AB(t)) + min pingpong
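The correction rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record layout, the function name `correct_delays`, and the reading of "min pingpong" as the smallest one-way delay estimated from a ping-pong test are all assumptions.

```python
from collections import defaultdict

def correct_delays(records, min_pingpong):
    """Apply corrected = measured - min(measured) + min_pingpong per group.

    records: iterable of (src, dst, size, hops, measured_delay), where
             measured_delay = C_b(t2) - C_a(t1) still includes the offset.
    min_pingpong: dict mapping (src, dst, size, hops) to the smallest one-way
             delay estimated from a ping-pong test (assumed input).
    """
    groups = defaultdict(list)
    for src, dst, size, hops, delay in records:
        groups[(src, dst, size, hops)].append(delay)

    corrected = {}
    for key, delays in groups.items():
        floor = min(delays)  # best available approximation of the offset
        corrected[key] = [d - floor + min_pingpong[key] for d in delays]
    return corrected

# Made-up numbers (microseconds): host B's clock runs about 500 us ahead.
records = [
    ("A", "B", 1024, 4, 503.0),
    ("A", "B", 1024, 4, 510.0),
    ("A", "B", 1024, 4, 505.5),
]
result = correct_delays(records, {("A", "B", 1024, 4): 2.0})
print(result[("A", "B", 1024, 4)])  # [2.0, 9.0, 4.5]
```

The sketch removes a constant offset only. It does not model drift between the two clocks over the duration of a trace.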

13 Statistical Methods
Wilcoxon Rank Sum Test:
- Non-parametric significance test
- Tests whether two independent samples come from the same distribution (sensitive to a shift in location)
- Tests:
  OS Jitter?
  - Jellystone: no THP vs. with THP
  Congestion?
  - Yellowstone: 0-Hop delays vs. 4-Hop delays
  - Jellystone: THP vs. Yellowstone: THP
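For readers unfamiliar with the test, here is a minimal standard-library sketch of a two-sided rank-sum test using the normal approximation. The function name `rank_sum_test` and the sample values are made up for illustration, and the slides do not say which implementation was actually used.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (z, p). Ties receive average ranks; no tie or continuity
    correction is applied to the variance, so treat p as approximate.
    """
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        start = end + 1

    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p

# Made-up delays (microseconds), for illustration only.
no_thp = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
with_thp = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7]
z, p = rank_sum_test(no_thp, with_thp)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A negative z here means the first sample tends to hold the smaller values. For real analyses a library routine such as `scipy.stats.ranksums` is likely a better choice than this hand-rolled version.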

14 Perfquery
Perfquery: IB performance counters query tool
PortXmitWait: port congestion monitoring
- Credit-Based Flow Control
[Diagram: Host A sending to a TOR Switch; decision "Credits?" with Yes and No branches, the No branch leading to PortXmitWait]
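The "Credits?" decision in the diagram can be mimicked with a toy model. This is a sketch under stated assumptions, not a description of how the hardware counter or perfquery works: the sender always has data queued, one packet costs one credit, and the function name `simulate_port` and its parameters are hypothetical.

```python
def simulate_port(ticks, initial_credits, credit_return_period):
    """Toy sender port that always has data queued.

    One packet costs one credit. The receiver hands back one credit every
    `credit_return_period` ticks. A tick with no credit available is counted
    as a wait tick, the kind of stall a PortXmitWait-style counter exposes.
    """
    credits = initial_credits
    sent = 0
    xmit_wait = 0
    for tick in range(1, ticks + 1):
        if tick % credit_return_period == 0:
            credits += 1          # downstream buffer space was freed
        if credits > 0:
            credits -= 1          # "Credits? Yes": transmit
            sent += 1
        else:
            xmit_wait += 1        # "Credits? No": stall and count the tick
    return sent, xmit_wait

fast = simulate_port(ticks=100, initial_credits=4, credit_return_period=1)
slow = simulate_port(ticks=100, initial_credits=4, credit_return_period=5)
print(fast, slow)  # (100, 0) (24, 76)
```

When the receiver drains slowly, the wait count grows while throughput drops, which is why a rising wait counter on a port is a useful hint that the port is congested.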

15 Results
How important is OS Jitter to network latency?
- Jellystone::0-Hop::NoTHP vs. Jellystone::0-Hop::THP
- Intranode communication delays are longer with THP enabled than without THP.

16 Results
What has a bigger impact on message latency: OS Jitter or Congestion?
- Comparing Yellowstone 0-Hop delays with 4-Hop delays
- For all considered message sizes, intranode communication delays can exceed internode delays.

17 Conclusion
- OS Jitter can cause performance degradation or variability.
- Inter-job interference can lead to application performance variability.
Solutions
Congestion:
- Dynamic allocation of Virtual Lanes to redirect victim flows around congested ports
OS Jitter:
- Linux Tickless Kernel
- MPI-3 for better control over shared-memory communications

18 Future Work
- Further study of the Dynamic Virtual Lanes assignment solution
- Plan and collect new HOMME traces with PortXmitWait monitored and LSF logs saved
- Study intra-job interference
- Develop a more efficient algorithm for correcting clock skew

19 Fabrice Mizero

