
1 Life Sciences and Meteorology: High-Performance Computing Solutions and Success Stories
凌巍才, HPC Product Technology Consultant, Dell (China) Co., Ltd.

2 Contents
Life sciences HPC solutions: GPU acceleration solutions; high-performance storage solutions
WRF V3.3 (meteorology application): testing and optimization on the Dell R720 server, with the GCC compilers and the Intel compilers
Success stories

3 Life Sciences HPC: GPU Solutions

4 In the life sciences, many users adopt GPU-accelerated solutions

5 CPU + GPU computing

6 HPCC GPU heterogeneous platform

7 Dell server options with GPU support (2012, 12th-generation servers)
Systems compared: external solutions (PowerEdge C with C410x): C6220, C6145; internal solutions: T620, R720; plus the C410x itself
GPU:socket ratio: 1:1 / 2:1
Total system boards: 8 / 4 / 2 / 1
Total HIC: -
IB capable: Yes / Yes*
Total GPUs: 16
Per-GPU B/W: -
MSRP (M2075): $117,000 / $86,900 / $114,000 / $85,250 / $19,000 / $13,000
Power envelope (est.): 5.525 kW / 4.118 kW / 5.030 kW / 3.802 kW
Theoretical GFLOPS: TBD / 9,326 / 8,932 / 2,431 / 1,401
Est. GFLOPS: 2,891 / 1,697
GFLOPS per rack U: 413 / 339 / 486 / 701
$/GFLOPS: 39 / 50 / 9
Rack size (U): 7 / 5
GPUs per rack U: 2.3 / 3.2 / 0.8 / 1.0

8 GPU expansion chassis (external GPU solution): Dell PowerEdge C410x
PCIe expansion chassis connecting 1-8 hosts to 1-16 PCIe modules
Great for: HPC including universities, oil & gas, biomedical research, design, simulation, mapping, visualization, rendering, and gaming
3U chassis, 19" wide, 143 pounds
Management: on-board BMC; IPMI 2.0; dedicated management port (an out-of-band ipmitool sketch follows this list)
PCI Express modules: 10 front, 6 rear
Power supplies: 4 x 1400 W hot-plug, high-efficiency PSUs; N+1 power redundancy
PCI form factors: HH/HL and FH/HL; up to 225 W per module
Services vary by region: IT consulting, server and storage deployment, rack integration (US only), support services
PCIe inputs: 8 PCIe x16 iPASS ports
PCI fan-out options: x16 to 1 slot, x16 to 2 slots, x16 to 3 slots, x16 to 4 slots
GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)
Thermals: high-efficiency 92 mm fans; N+1 fan redundancy
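Because the chassis exposes an on-board BMC with IPMI 2.0 on a dedicated management port, it can be queried out of band. A minimal sketch with ipmitool, assuming a reachable BMC address and default-style credentials (both hypothetical):

# Query chassis power state and fan/temperature sensors on the C410x BMC
# (the 192.168.0.120 address and root/calvin credentials are assumptions)
ipmitool -I lanplus -H 192.168.0.120 -U root -P calvin chassis status
ipmitool -I lanplus -H 192.168.0.120 -U root -P calvin sensor list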

9 PowerEdge C410x PCIe module
Serviceable PCIe module ("taco") capable of supporting any half-height/half-length (HH/HL) or full-height/half-length (FH/HL) card; FH/FL cards are supported with an extended PCIe module
Future-proofing for next generations of NVIDIA and AMD ATI GPU cards
Module components: power connector for the GPGPU card, LED, board-to-board connector for x16 Gen PCIe signals and power, and the GPU card itself

10 PowerEdge C410x configurations
Enabling HPC applications to optimize the cost/performance equation off a single x16 connection (diagram: host HIC, iPass cable, C410x PCI switch, GPUs)
1 GPU per x16: 8 GPUs in 7U; (1) C410x + (2) C6100
2 GPUs per x16: 16 GPUs in 7U; (1) C410x + (2) C6100
3 GPUs per x16: 12 GPUs in 5U; (1) C410x + (1) C6100
4 GPUs per x16: 16 GPUs in 5U; (1) C410x + (1) C6100
GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis

11 Flexibility of the PowerEdge C410x
Ratios up to 8:1 are possible with dual x16 HICs per host (diagram: two hosts, each with two x16 HICs, connected by iPass cables through the C410x PCI switches to the GPUs)

12 PowerEdge C6100 configurations: the "2:1 sandwich"
Details: two C6100 (8 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPASS)); one C410x with 16 GPUs (fully populated); PCIe x8 per GPU; total space = 7U
Summary: one Dell C410x (16 GPUs) + two C6100 (8 nodes); one x16 slot per node feeds 2 GPUs; 7U total; 16 GPUs total; 8 nodes total (2 GPUs per board)
Note: this configuration is equivalent to using the C6100 with the NVIDIA S2050, but it is denser (a quick GPU-visibility check from a host node is sketched below)
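A minimal sketch for confirming, from one of the C6100 nodes, that GPUs assigned through the C410x appear as ordinary PCIe devices; it assumes the NVIDIA driver is already installed on the node:

# List NVIDIA devices on the PCIe bus (the 2:1 sandwich should show 2 GPUs per node)
lspci | grep -i nvidia
# Confirm the driver sees the same devices and report their names/UUIDs
nvidia-smi -L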

13 PowerEdge C6100 configurations: the "4:1 sandwich"
Details: one C6100 (4 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPASS)); one C410x with 16 GPUs (fully populated); PCIe x4 per GPU; total space = 5U
Summary: one Dell C410x (16 GPUs) + one C6100 (4 nodes); one x16 slot per node feeds 4 GPUs; 5U total; 16 GPUs total; 4 nodes total (4 GPUs per board)

14 PowerEdge C6100 configurations: the "8:1 sandwich" (possible future development)
Details: one C6100 (4 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPASS)); two C410x with 32 GPUs (fully populated); PCIe x2 per GPU; total space = 8U; see the later table for metrics
Summary: two Dell C410x (32 GPUs) + one C6100 (4 nodes); one x16 slot per node feeds 8 GPUs; 8U total; 32 GPUs total; 4 nodes total (8 GPUs per board)

15 PowerEdge C6145 configurations: the "8:1 sandwich"
5U of rack space
Details: one C6145 (2 system boards; 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host; 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)); one C410x with 16 GPUs (fully populated); PCIe x4-x8 per GPU; total space = 5U
Summary: one Dell C410x (16 GPUs) + one C6145 (2 nodes); two to four HIC slots per node feed 8 GPUs; 5U total; 16 GPUs total; 2 nodes total (8 GPUs per board)

16 PowerEdge C6145 configurations: the "16:1 sandwich"
8U of rack space
Details: one C6145 (2 system boards; 4S MagnyCours, 32 DIMM slots, QDR IB, up to 12 drives per host; 3 x single-port x16 HIC (iPASS) + 1 x single-port onboard x16 HIC (iPASS)); two C410x with 32 GPUs (fully populated); PCIe x4 per GPU; total space = 8U
Summary: two Dell C410x (32 GPUs) + one C6145 (2 nodes); four HIC slots per node feed 16 GPUs; 8U total; 32 GPUs total; 2 nodes total (16 GPUs per board)

17 PowerEdge C410x block diagram
Block diagram: 16 GPUs; 4 level-2 switches; 8 level-1 switches; 8 host connections

18 C410x BMC console configuration interface

19 Supported servers for the GPU expansion chassis: HIC/C410x support matrix
Dell external GPU solution support: a hardware interface card (HIC) in a PCIe slot connects to the external GPU(s) in the C410x; Dell "slot validates" NVIDIA interface cards to verify power, thermals, etc.
Servers (C410x support / planned support date): C6100 (Yes, now); C6105 (RTS+, now: BIOS or later); C6145 (RTS); C1100; Precision R5500 (now: disable SSC in BIOS); R710; M610x; R410; R720; R720xd; R620; C6220
Note: add 12G servers to graphic

20 Life sciences application test: GPU-HMMER
Speedups: 1.8X, 2.7X, 2.8X, 2.9X
1.8X speedup; the speedup appears to decrease with increasing length.

21 GPU:host scaling: GPU-HMMER
Speedup: 1.8X, 3.6X, 7.2X
The speedup shown is for the last/worst case. Like our poster child NAMD, it scales nicely to at least 4 GPUs; we don't know how much further it will scale. Recall that in head-to-head testing, the difference between external and internal GPUs was less than a quarter of a percent, so going external is definitely the way to go in this case. With the C410x and 4 GPUs, you get results in half the time compared to internal 2-x16.

22 GPU:host scaling: NAMD
Speedup: 4.7X, 8.2X, 15.2X, 9.5X
"Since the CPU performance is 0.10, it is very easy to compute the scaling factors. It may be important not to settle for (or squabble over) an 8-9X speedup when 15X is available, and we don't know whether it will continue with even more GPUs. The 13% difference between 2 internal and 2 external GPUs is now insignificant. 4 external GPUs are about 37% faster than 2 internal GPUs."
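A minimal sketch of the kind of multi-GPU NAMD run benchmarked here, assuming a CUDA build of NAMD 2.x; the input file name and core count are assumptions, not values from the slide:

# Run NAMD on 12 cores of one node, offloading nonbonded work to 4 GPUs
./charmrun ++local +p12 ./namd2 +idlepoll +devices 0,1,2,3 apoa1.namd > namd_4gpu.log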

23 GPU:host scaling: LAMMPS LJ-Cut
Speedup: 8.5X, 13.5X, 14.4X, 14.0X
Speedup for the last test only, relative to a single CPU. The difference between 2 internal and 2 external GPUs is similar to what we have seen before with 2-4 GPUs. We would conclude that LAMMPS LJ-Cut benefits from 1-2 GPUs but does not scale beyond that.

24 Life Sciences: Storage Solutions

25 Life sciences: growth rates of compute and data capacity

26 The Lustre Parallel File System
Key Lustre components:
Clients (compute nodes): the "users" of the file system where applications run; the Dell HPC cluster
Metadata Server (MDS): holds the metadata information
Object Storage Server (OSS): provides back-end storage for the users' files; additional OSS units increase throughput linearly (a client-side mount and striping sketch follows this list)
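A minimal sketch of how a compute node uses such a file system, assuming a Lustre client module is installed and the MGS/MDS is reachable over InfiniBand; the hostname, file system name, mount point, and stripe count are all assumptions:

# Mount the Lustre file system on a client (compute node)
mount -t lustre mds01@o2ib:/lfs01 /mnt/lustre
# Stripe a directory across 4 OSTs so large files are served by several OSS units in parallel
lfs setstripe -c 4 /mnt/lustre/genomics
# Show per-OST capacity and usage
lfs df -h /mnt/lustre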

27

28

29 InfiniBand (IPoIB) NFS performance: sequential read
Peaks: NSS Small: 1 node doing I/O (fairly level until 4 nodes); NSS Medium: 4 nodes doing I/O (not much drop-off); NSS Large: 8 nodes doing I/O (good performance over the range)

30 InfiniBand (IPoIB) NFS performance: sequential write
Peaks: NSS Small: 1 node doing I/O (steady drop-off to 16 nodes); NSS Medium: 2 nodes doing I/O (good performance for up to 8 nodes); NSS Large: 4 nodes doing I/O (good performance over the range)
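A minimal sketch of the kind of per-client sequential-throughput measurement behind these curves, assuming the NSS export is mounted over IPoIB at /mnt/nss; the mount point and file size are assumptions, and the published NSS numbers come from a dedicated benchmark, so this is only illustrative:

# Sequential write: stream 16 GB to the NFS mount and flush it before timing ends
dd if=/dev/zero of=/mnt/nss/testfile bs=1M count=16384 conv=fdatasync
# Sequential read: drop the client page cache first, then stream the file back
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/nss/testfile of=/dev/null bs=1M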

31

32 WRF V3.3 application testing and tuning

33 Dell test environment
Dell R720: CPU: 2x Intel Sandy Bridge E; memory: 8x 8 GB (64 GB total); hard disks: 2x 300 GB 15K rpm (RAID 0)
BIOS settings: Hyper-Threading disabled; memory operating mode optimized; high-performance system profile (maximum power)
OS: Red Hat Enterprise Linux 6.3
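A quick way to confirm this configuration from the OS after the BIOS changes, using standard RHEL 6 tools (nothing here is specific to the R720):

# With HT disabled, the two 8-core Sandy Bridge sockets should expose 16 logical CPUs
grep -c ^processor /proc/cpuinfo
# Socket/core layout and cache sizes
lscpu
# Installed memory should report roughly 64 GB
free -g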

34 GCC test
Compilers: gcc, gfortran, g++
Library stack: zlib 1.2.5, HDF5 1.8.8, NetCDF 4, WRF V3.3 (a build-order sketch follows)
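A minimal sketch of the build order implied by this stack (zlib, then HDF5, then NetCDF, then WRF), assuming the sources sit under /opt/src and install under /opt/wrf-deps; the paths, the NetCDF 4.1.3 version, and the smpar choice are assumptions:

# Build the I/O stack with the GNU compilers, then point WRF at it
export CC=gcc FC=gfortran CXX=g++
export PREFIX=/opt/wrf-deps

cd /opt/src/zlib-1.2.5   && ./configure --prefix=$PREFIX && make && make install
cd /opt/src/hdf5-1.8.8   && ./configure --prefix=$PREFIX --with-zlib=$PREFIX && make && make install
cd /opt/src/netcdf-4.1.3 && CPPFLAGS=-I$PREFIX/include LDFLAGS=-L$PREFIX/lib \
                            ./configure --prefix=$PREFIX --enable-netcdf-4 && make && make install

# WRF picks up NetCDF through the NETCDF environment variable
export NETCDF=$PREFIX
cd /opt/src/WRFV3 && ./configure      # choose the gfortran/gcc (smpar) option
./compile em_real >& compile.log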

35

36 Test results (output file wrf): simulation period Nov 30, 2011 to Dec 5, 2011; wall-clock time 13h 9m 53s
wrf.exe started at: Sun Apr 29 09:35:36 CST 2012
...
wrf: SUCCESS COMPLETE WRF
wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012

37 Configuration file (configure.wrf)
# Settings for x86_64 Linux, gfortran compiler with gcc (smpar)
DMPARALLEL      =
OMPCPP          = -D_OPENMP
OMP             = -fopenmp
OMPCC           = -fopenmp
SFC             = gfortran
SCC             = gcc
CCOMP           = gcc
DM_FC           = mpif90 -f90=$(SFC)
DM_CC           = mpicc -cc=$(SCC)
FC              = $(SFC)
CC              = $(SCC) -DFSEEKO64_OK
LD              = $(FC)
RWORDSIZE       = $(NATIVE_RWORDSIZE)
PROMOTION       = # -fdefault-real-8  # uncomment manually
ARCH_LOCAL      = -DNONSTANDARD_SYSTEM_SUBR
CFLAGS_LOCAL    = -w -O3 -c -DLANDREAD_STUB
LDFLAGS_LOCAL   =
CPLUSPLUSLIB    =
ESMF_LDFLAG     = $(CPLUSPLUSLIB)
FCOPTIM         = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
FCREDUCEDOPT    = $(FCOPTIM)
FCNOOPT         = -O0
FCDEBUG         = # -g $(FCNOOPT)
FORMAT_FIXED    = -ffixed-form
FORMAT_FREE     = -ffree-form -ffree-line-length-none
FCSUFFIX        =
BYTESWAPIO      = -fconvert=big-endian -frecord-marker=4
FCBASEOPTS_NO_G = -w $(FORMAT_FREE) $(BYTESWAPIO)
FCBASEOPTS      = $(FCBASEOPTS_NO_G) $(FCDEBUG)
MODULE_SRCH_FLAG =
TRADFLAG        = -traditional
CPP             = /lib/cpp -C -P
AR              = ar
ARFLAGS         = ru
M4              = m4 -G
RANLIB          = ranlib
CC_TOOLS        = $(SCC)

38 wrf.out
....
WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16
WRF TILE 1 through WRF TILE 16: IS/IE/JS/JE bounds printed for each of the 16 tiles
WRF NUMBER OF TILES = 16
.....
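The 16 tiles come from OMP_GET_MAX_THREADS, so this was a pure shared-memory (smpar) run with one thread per core. A minimal sketch of launching it that way; the stack-size settings are conventional assumptions, not values from the slide:

# One OpenMP thread per physical core on the 16-core R720
export OMP_NUM_THREADS=16
export OMP_STACKSIZE=64M       # WRF threads need a generous private stack
ulimit -s unlimited
./wrf.exe >& wrf.out &
tail -f wrf.out                # watch the tile decomposition and timing output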

39 System resource analysis (CPU): mpstat -P ALL
Linux el6.x86_64 (r720)      04/29/2012      _x86_64_        (16 CPU) 04:06:40 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle 04:06:40 PM  all   85.27    0.00    2.62    0.01    0.00    0.00    0.00    0.00   12.10 04:06:40 PM    0   85.71    0.00    2.58    0.01    0.00    0.00    0.00    0.00   11.69 04:06:40 PM    1   85.05    0.00    2.77    0.05    0.00    0.04    0.00    0.00   12.09 04:06:40 PM    2   85.26    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.05 04:06:40 PM    3   85.24    0.00    2.65    0.01    0.00    0.00    0.00    0.00   12.10 04:06:40 PM    4   87.36    0.00    1.90    0.00    0.00    0.00    0.00    0.00   10.73 04:06:40 PM    5   84.97    0.00    2.70    0.00    0.00    0.00    0.00    0.00   12.33 04:06:40 PM    6   85.23    0.00    2.64    0.00    0.00    0.00    0.00    0.00   12.13 04:06:40 PM    7   84.97    0.00    2.71    0.00    0.00    0.00    0.00    0.00   12.32 04:06:40 PM    8   85.33    0.00    2.60    0.00    0.00    0.00    0.00    0.00   12.06 04:06:40 PM    9   85.32    0.00    2.57    0.00    0.00    0.00    0.00    0.00   12.11 04:06:40 PM   10   84.88    0.00    2.77    0.00    0.00    0.00    0.00    0.00   12.35 04:06:40 PM   11   84.93    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.38 04:06:40 PM   12   85.16    0.00    2.62    0.00    0.00    0.00    0.00    0.00   12.21 04:06:40 PM   13   85.00    0.00    2.69    0.00    0.00    0.00    0.00    0.00   12.31 04:06:40 PM   14   84.91    0.00    2.75    0.00    0.00    0.00    0.00    0.00   12.34 04:06:40 PM   15   85.02    0.00    2.65    0.00    0.00    0.00    0.00    0.00   12.33 Confidential

40 System resource analysis (memory): free
Output of free (total / used / free / shared / buffers / cached); swap total: 0 (no swap in use)

41 System resource analysis (I/O, disk): iostat, df
iostat: sda 9.01 tps; dm-0 0.64 tps (12.63 blocks read/s, 1.99 blocks written/s); dm-1 0.01 tps (0.10 blocks read/s); dm-2 negligible
df: / (/dev/mapper/vg_r720-lv_root) 11% used; /dev/shm (tmpfs) 1%; /boot (/dev/sda1, 495844 1K-blocks, 37433 used) 8%; /home (/dev/mapper/vg_r720-lv_home) 14%
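A minimal sketch of collecting these same statistics at intervals while wrf.exe runs, so CPU, memory, and disk behaviour can be compared afterwards; the log file name and 60-second interval are assumptions:

# Sample CPU, memory, and I/O once a minute for the duration of the run
while pgrep -x wrf.exe > /dev/null; do
    date              >> sys-usage.log
    mpstat -P ALL 1 1 >> sys-usage.log
    free -m           >> sys-usage.log
    iostat -x 1 1     >> sys-usage.log
    sleep 60
done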

42 Intel test

43

44 Intel links: improving-performance-on-intel-architecture/ ; and-intelr-mpi/

45 Intel compiler flags
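The flag values themselves were not captured for this slide, so the following is only an illustrative configure.wrf fragment for an Intel (ifort/icc) build on Sandy Bridge; every value here is an assumption rather than the deck's exact settings:

# Hypothetical Intel-compiler counterparts of the GNU settings shown earlier
SFC        = ifort
SCC        = icc
DM_FC      = mpif90 -f90=$(SFC)
DM_CC      = mpicc -cc=$(SCC)
OMP        = -openmp            # OpenMP flag on 2012-era Intel compilers
FCOPTIM    = -O3 -xAVX -ip      # target the Sandy Bridge AVX units
BYTESWAPIO = -convert big_endian
FCBASEOPTS = -w -ftz -align all -fno-alias $(FORMAT_FREE) $(BYTESWAPIO)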

46 Intel tuning
1. Reducing MPI overhead: -genv I_MPI_PIN_DOMAIN omp, -genv KMP_AFFINITY=compact, -perhost
2. Improving cache and memory bandwidth utilization: numtiles = X
3. Using the Intel Math Kernel Library (MKL) DFT for polar filters: depending on the workload, Intel MKL DFT may provide up to a 3x speedup in simulation speed
4. Speeding up computation by reducing precision: -fp-model fast=2 -no-prec-div -no-prec-sqrt
A combined launch sketch for settings 1 and 2 is shown below.
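A minimal sketch of an Intel MPI + OpenMP launch that applies settings 1 and 2 above, assuming a dmpar+smpar WRF build with 2 MPI ranks per node and 8 OpenMP threads each; the host names node01/node02 and the numtiles value are assumptions:

# Pin each MPI rank to its own OpenMP domain and pack threads onto adjacent cores
export OMP_NUM_THREADS=8
mpirun -hosts node01,node02 -perhost 2 -np 4 \
       -genv I_MPI_PIN_DOMAIN omp \
       -genv KMP_AFFINITY compact \
       ./wrf.exe

# In namelist.input, give each rank more, smaller tiles for better cache reuse (example value)
# &domains
#   numtiles = 8
# /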

47 Case studies

48 Beijing Genomics Institute (BGI)

49 Tsinghua University School of Life Sciences

50 Success references in life sciences
Domestic (China): Beijing Genome Institute (BGI); Tsinghua University Life Institute; Beijing Normal University; Jiangsu Taicang Life Institute; The Fourth Military Medical University
International: David H. Murdock Research Institute; Virginia Bioinformatics Institute; University of Florida speeds up memory-intensive gene; UCSF; National Center for Supercomputing Applications

51 Thank you!

