
1 Performance and Scalability of FLOW-3D/MP 5.0 in HPC Cluster Environment
Pak Lui, Application Performance Manager, HPC Advisory Council

2 Agenda
– Introduction to HPC Advisory Council
– Benchmark Configuration
– Performance Benchmark Testing and Results
– Summary
– Q&A / For More Information

3 The HPC Advisory Council
– World-wide HPC organization (360+ members)
– Bridges the gap between HPC usage and its full potential
– Provides best practices and a support/development center
– Explores future technologies and future developments
– Working Groups: HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage
– Leading edge solutions and technology demonstrations

4 HPC Advisory Council Members

5 HPC Council Board
– HPC Advisory Council Chairman: Gilad Shainer - gilad@hpcadvisorycouncil.com
– HPC Advisory Council Media Relations and Events Director: Brian Sparks - brian@hpcadvisorycouncil.com
– HPC Advisory Council China Events Manager: Blade Meng - blade@hpcadvisorycouncil.com
– Director of the HPC Advisory Council, Asia: Tong Liu - tong@hpcadvisorycouncil.com
– HPC Advisory Council HPC|Works SIG Chair and Cluster Center Manager: Pak Lui - pak@hpcadvisorycouncil.com
– HPC Advisory Council Director of Educational Outreach: Scot Schultz - scot@hpcadvisorycouncil.com
– HPC Advisory Council Programming Advisor: Tarick Bedeir - Tarick@hpcadvisorycouncil.com
– HPC Advisory Council HPC|Scale SIG Chair: Richard Graham - richard@hpcadvisorycouncil.com
– HPC Advisory Council HPC|Cloud SIG Chair: William Lu - william@hpcadvisorycouncil.com
– HPC Advisory Council HPC|GPU SIG Chair: Sadaf Alam - sadaf@hpcadvisorycouncil.com
– HPC Advisory Council India Outreach: Goldi Misra - goldi@hpcadvisorycouncil.com
– Director of the HPC Advisory Council Switzerland Center of Excellence and HPC|Storage SIG Chair: Hussein Harake - hussein@hpcadvisorycouncil.com
– HPC Advisory Council Workshop Program Director: Eric Lantz - eric@hpcadvisorycouncil.com
– HPC Advisory Council Research Steering Committee Director: Cydney Stevens - cydney@hpcadvisorycouncil.com

6 HPC Advisory Council HPC Center
– Clusters: Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury, Mala
– Cluster sizes range from 64 to 704 cores, plus 16 GPUs
– InfiniBand-based storage (Lustre file system)

7 Special Interest Subgroups Missions
– HPC|Scale – To explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services.
– HPC|Cloud – To explore usage of HPC components as part of the creation of external/public/internal/private cloud computing environments.
– HPC|Works – To provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines.
– HPC|Storage – To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions, and to expose more users to the potential of Lustre over high-speed networks.
– HPC|GPU – To explore usage models of GPU components as part of next-generation compute environments and potential optimizations for GPU-based computing.
– HPC|FSI – To explore the usage of high-performance computing solutions for low-latency trading, more productive simulations (such as Monte Carlo) and overall more efficient financial services.

8 HPC Advisory Council
HPC Advisory Council (HPCAC)
– 360+ members
– http://www.hpcadvisorycouncil.com/
– Application best practices, case studies (over 150)
– Benchmarking center with remote access for users
– World-wide workshops
– Value add for your customers to stay up to date and in tune to the HPC market
2013 Workshops
– USA (Stanford University) – January 2013
– Switzerland – March 2013
– Germany (ISC13) – June 2013
– Spain – Sep 2013
– China (HPC China) – Oct 2013
For more information
– www.hpcadvisorycouncil.com
– info@hpcadvisorycouncil.com

9 ISC14 – Student Cluster Competition
– University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC14 show floor
– The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community

10 FLOW-3D/MP Performance Study
Research performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource: HPC Advisory Council Cluster Center
Objectives
– Give an overview of FLOW-3D/MP performance
– Compare different MPI libraries, network interconnects and other factors
– Understand FLOW-3D/MP communication patterns
– Provide best practices to increase FLOW-3D/MP productivity

11 FLOW-3D/MP
FLOW-3D/MP is a powerful and highly-accurate CFD software package
– Provides engineers valuable insight into many physical flow processes
FLOW-3D/MP is the ideal computational fluid dynamics software
– To use in the design phase as well as in improving production processes
– Provides special capabilities for accurately predicting free-surface flows
FLOW-3D/MP is a standalone, all-inclusive CFD package
– Includes an integrated GUI that ties components from problem setup to post-processing

12 Test Cluster Configuration
Dell PowerEdge R720xd 32-node Jupiter cluster
– 16 nodes: Dual-Socket Eight-Core Intel E5-2680 @ 2.70 GHz CPUs
– 16 nodes: Dual-Socket Ten-Core Intel E5-2680 V2 @ 2.80 GHz CPUs
– Memory: 64GB, DDR3 1600 MHz
– OS: RHEL 6.2, OFED 1.5.3-3.1.0 (for v4.2) and 2.0-3.0.0 (for v5.0) InfiniBand SW stack
– Hard drives: 24x 250GB 7.2K RPM 2.5" SATA in RAID 0
Mellanox Connect-IB FDR InfiniBand adapters and ConnectX-3 Ethernet adapters
Mellanox SwitchX SX6036 InfiniBand switch
MPI (vendor provided): Intel MPI 3.2.0.011 (v4.2) and 4.0.0.025 (v5.0)
Application: FLOW-3D/MP 4.2 and 5.0
Benchmarks:
– Lid Driven Cavity Flow: completely filled with fluid (in this case air)
– P2 Engine Block: a casting application with free surface (one-fluid simulation with sharp interface tracking)

13 About Intel® Cluster Ready
Intel® Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity
– Simplifies selection, deployment, and operation of a cluster
A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers
– Focus on your work productivity, spend less management time on the cluster
Select Intel Cluster Ready
– Where the cluster is delivered ready to run
– Hardware and software are integrated and configured together
– Applications are registered, validating execution on the Intel Cluster Ready architecture
– Includes the Intel® Cluster Checker tool, to verify functionality and periodically check cluster health
FLOW-3D/MP is Intel Cluster Ready certified

14 PowerEdge R720xd
Massive flexibility for data intensive operations
Performance and efficiency
– Intelligent hardware-driven systems management with extensive power management features
– Innovative tools including automation for parts replacement and lifecycle manageability
– Broad choice of networking technologies from 1GbE to IB
– Built-in redundancy with hot-plug and swappable PSU, HDDs and fans
Benefits
– Designed for performance workloads: from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
– High performance scale-out compute and low cost dense storage in one package
Hardware Capabilities
– Flexible compute platform with dense storage capacity: 2S/2U server, 6 PCIe slots
– Large memory footprint (up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
– HDD options: 12 x 3.5" or 24 x 2.5", plus 2x 2.5" HDDs in rear of server
– Up to 26 HDDs, with 2 hot-plug drives in rear of server for boot or scratch

15 FLOW-3D/MP Performance – MPI vs Hybrid
FLOW-3D/MP supports an MPI+OpenMP Hybrid mode (both launch modes are sketched below)
– Up to 327% higher performance than MPI mode at 16 nodes
Hybrid mode enables higher scalability versus the pure MPI mode
– Hybrid version delivers better scalability after 4 nodes
– MPI processes spawn OpenMP threads for computation on the CPU cores
– Streamlines and reduces communication endpoints to improve scalability
Intel E5-2680 V2, FDR InfiniBand
*Performance Rating = Jobs/Day
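For orientation, a minimal sketch of what the two launch modes look like with Intel MPI; the solver binary name flow3d_mp_solver, the node count and the cores-per-node figure are illustrative assumptions, since the real launch is wrapped by the runhyd/runhyd_par scripts shipped with FLOW-3D/MP:

    # Pure MPI mode (assumed 16 nodes x 16 cores): one MPI rank per core
    mpirun -np 256 -ppn 16 ./flow3d_mp_solver

    # Hybrid mode: one MPI rank per node, OpenMP threads fill the node's cores,
    # which cuts the number of communication endpoints per node from 16 to 1
    mpirun -np 16 -ppn 1 -genv OMP_NUM_THREADS 16 ./flow3d_mp_solver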

16 FLOW-3D/MP Performance – Hybrid Mode (FLOW-3D/MP 5.0)
Default for FLOW-3D/MP Hybrid mode is to run 1 PPN with 16 threads
– Which uses -genv I_MPI_PIN_DOMAIN node, as specified in the runhyd_par script
For best performance on 2P platforms (see the sketch below):
– Use -genv I_MPI_PIN_DOMAIN socket in the runhyd_par script
– 2 PPN with 8/10 threads can yield better performance than 1 PPN with 16/20 threads
– The flag allows each MPI process to spawn threads within its own socket
– Instead of both MPI processes sharing the same socket
FDR InfiniBand; gains of 110% and 178%
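A sketch of the pinning change described above, assuming a 2-socket node with 8 cores per socket and the Intel MPI -genv syntax already used in runhyd_par; the solver binary name is again a placeholder for whatever the script actually invokes:

    # Default in runhyd_par: 1 rank per node, 16 threads, pin domain = whole node
    mpirun -np 16 -ppn 1 -genv I_MPI_PIN_DOMAIN node \
        -genv OMP_NUM_THREADS 16 ./flow3d_mp_solver

    # Tuned for 2P nodes: 2 ranks per node, 8 threads each, pin domain = socket,
    # so each rank's OpenMP threads stay on that rank's own socket
    mpirun -np 32 -ppn 2 -genv I_MPI_PIN_DOMAIN socket \
        -genv OMP_NUM_THREADS 8 ./flow3d_mp_solver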

17 FLOW-3D/MP Performance – Hybrid Mode (FLOW-3D/MP 4.2)
2 PPN with 8 threads can provide better performance than 1 PPN with 16 threads
– Each MPI process spawns its threads within its own socket
– With I_MPI_PIN_DOMAIN=socket specified in the runhyd_par script
Default for FLOW-3D/MP Hybrid mode is to run 1 PPN with 16 threads
– With I_MPI_PIN_DOMAIN=node specified in the runhyd_par script
The flag is changed to socket to allow spawning of threads within a socket
– For the case of 2 PPN with 8 threads
FDR InfiniBand; gains of 9% and 35%

18 FLOW-3D/MP Performance – Processors (FLOW-3D/MP 4.2)
Intel E5-2680 (Sandy Bridge) cluster outperforms the Intel Xeon X5670 (Westmere) cluster
– Performs 70% better than the X5670 cluster at 16 nodes
System components used:
– Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks
– Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk
FDR InfiniBand

19 FLOW-3D/MP Performance – Processors (FLOW-3D/MP 5.0)
Intel E5-2680 V2 (Ivy Bridge) cluster outperforms the Intel E5-2680 cluster
– Performs 8% better than the E5-2680 (Sandy Bridge) cluster at 16 nodes
System components used:
– Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz
– Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz
FDR InfiniBand; gains of 5% and 8%

20 FLOW-3D/MP Performance – Software Versions
FLOW-3D/MP v5.0 outperforms v4.2 in scalability in Hybrid mode
– Runs up to 37% faster in Hybrid mode at 16 nodes
Running in MPI mode shows slightly longer runtime than the previous version
– Appears to be caused by a change in the communication algorithm
– More MPI collective operations are used compared to the prior version
16 MPI Processes/Node
*Performance Rating = Jobs/Day

21 FLOW-3D/MP Performance – Network
InfiniBand FDR provides better scalability than Ethernet
– The scalability gap widens as more nodes are involved in the simulation
– FDR InfiniBand provides up to 246% better performance than 1GbE
– FDR InfiniBand delivers up to 39% better performance than 10GbE
– Hybrid mode is shown
16 MPI Processes/Node
*Performance Rating = Jobs/Day

22 FLOW-3D/MP Profiling – Time Ratio
InfiniBand FDR reduces the communication time at scale
– InfiniBand FDR consumes about 17% of total runtime in a 16-node Hybrid job
– 10GbE consumes 39% of total time, while 1GbE consumes about 75%
IB RDMA technology allows communication to bypass CPU involvement
– Reduces CPU overhead in handling communication
– Which leaves more time for application processing

23 FLOW-3D/MP Profiling – # of MPI Calls
Overall runtime drops as more nodes take part in the MPI job
– More compute nodes reduce runtime by spreading out the workload
Computation time drops while the communication time stays flat
– As the cluster scales, MPI time stays roughly constant
16 MPI Processes/Node, Pure MPI Mode

24 FLOW-3D/MP Profiling – Communication Time
The most time-consuming MPI functions are:
– FDR: MPI_Allreduce (51%), MPI_Bcast (24%), MPI_Waitall (16%)
InfiniBand spends less time in collective operations than Ethernet does
– Collective communications are the most used in FLOW-3D v5.0 (a way to measure this is sketched below)
– These communications account for the highest share of communication time
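One way to reproduce this kind of per-function breakdown is Intel MPI's built-in statistics gathering, sketched below; I_MPI_STATS and I_MPI_STATS_FILE are documented Intel MPI environment variables, but the useful level and the report layout vary between versions, so treat the details (and the placeholder solver name) as assumptions to verify against the installed library:

    # Ask Intel MPI to record call counts, bytes and time per MPI routine
    mpirun -genv I_MPI_STATS 4 -genv I_MPI_STATS_FILE flow3d_mpi_stats.txt \
        -np 320 -ppn 20 ./flow3d_mp_solver    # binary name is a placeholder

    # The resulting text report lists routines by name; look for the collectives
    grep -i allreduce flow3d_mpi_stats.txt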

25 FLOW-3D/MP Profiling – MPI Message Sizes
There is a wide range of message sizes seen:
– MPI_Allreduce: concentration between 4B and 16B
– MPI_Waitall: around 4MB
– MPI_Bcast: around 1-4MB

26 FLOW-3D/MP Performance – File System
Storing data files on a local file system or tmpfs improves performance (see the sketch below)
– Scalability is otherwise limited by NFS when running at scale beyond 8 nodes
– The NFS used in this case runs over a 1GbE network
InfiniBand FDR; up to 25% gain (higher is better)
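A minimal sketch of staging a run on node-local storage or tmpfs and copying results back afterwards; the paths, the placeholder solver path and the FLOW-3D-style file names (prepin/flsgrf/flsplt) are assumptions to adapt to the actual runhyd setup:

    # /dev/shm is a tmpfs mount available on most Linux nodes
    SCRATCH=/dev/shm/flow3d_run_$$            # or a node-local disk, e.g. /scratch
    SOLVER=/opt/flow3d/bin/flow3d_mp_solver   # placeholder path for the real solver

    # Create the same scratch directory on every node (one process per node)
    mpirun -np 16 -ppn 1 mkdir -p "$SCRATCH"

    # Stage the input deck and run from the local scratch instead of NFS
    cp prepin.* "$SCRATCH"/
    cd "$SCRATCH"
    mpirun -np 320 -ppn 20 "$SOLVER"

    # Copy results back to the NFS home, then clean up on every node
    cp flsgrf.* flsplt.* "$HOME/results/"
    mpirun -np 16 -ppn 1 rm -rf "$SCRATCH"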

27 FLOW-3D/MP Profiling – Disk IO Time
File IO occurs at certain periods during the MPI solver run
– Large spikes appear when writing the restart and spatial data
– Files are written to local storage instead of NFS to avoid an IO bottleneck

28 FLOW-3D/MP – Summary
Scalability
– FLOW-3D/MP v5.0 Hybrid mode enables higher scalability versus the pure MPI version
– Hybrid version delivers good scalability to 16 nodes (320 cores); >37% faster than v4.2
Performance
– Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores
– Allocating MPI processes to the proper socket in Hybrid mode boosts performance by up to 178%
– Hybrid mode allows FLOW-3D/MP to scale to 16 nodes, up to 327% faster than MPI mode
Network
– InfiniBand FDR delivers the best scalability with its 56Gb/s rate
– Outperforms 1GbE by 246% at 16 nodes (320 cores)
– Outperforms 10GbE by 39% at 16 nodes (320 cores)
– RDMA technology in InfiniBand allows bypassing the CPU for network transfers
– This offload reduces CPU overhead in handling communication, so the CPU can focus on the application
Profiling
– MPI communication time is spent mostly in MPI_Allreduce, at 51% of overall MPI time
– InfiniBand can process collective operations in the network faster than Ethernet
– Large concentration of small messages, typical for latency-sensitive HPC applications

29 All trademarks are property of their respective owners. All information is provided "as-is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You
Special thanks to Anup Gokarn of Flow Science
Questions?
– Pak Lui
– pak@hpcadvisorycouncil.com

