1 Life Sciences and Meteorology: High-Performance Computing Solutions and Success Stories. 凌巍才, HPC Product Technology Consultant, Dell (China) Co., Ltd.

2 Agenda: Life sciences HPC solutions – GPU acceleration solutions – high-performance storage solutions; WRF V3.3 (a meteorology application) testing and tuning on the Dell R720 server – gcc compiler – Intel compiler; success stories.

3 Life Sciences HPC: GPU Solutions

4 In the life sciences, many users adopt GPU-accelerated solutions.

5 CPU + GPU computing

6 HPCC GPU heterogeneous platform

7 Dell server options with GPU support (2012, 12th-generation servers). Internal solutions: C6220, C6145, T620, R720; external solutions (PowerEdge C): C6220 or C6145 paired with the C410x expansion chassis. The comparison table covered, per configuration, the GPU:socket ratio (1:1 or 2:1), total system boards, total HICs, InfiniBand capability, total GPUs, per-GPU bandwidth, MSRP with M2075 GPUs, estimated power envelope, theoretical and estimated GFLOPS, GFLOPS per rack U, $/GFLOPS, rack size, and GPUs per rack U.

8 GPU expansion chassis (external GPU solution): Dell PowerEdge C410x
– PCIe expansion chassis connecting 1-8 hosts to 1-16 PCIe devices
– 3U chassis, 19" wide, 143 pounds
– PCIe modules: 10 front, 6 rear; PCI form factors: HH/HL and FH/HL; up to 225 W per module
– PCIe inputs: 8 PCIe x16 iPass ports; fan-out options: x16 to 1, 2, 3, or 4 slots
– GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)
– Thermals: high-efficiency 92 mm fans, N+1 fan redundancy
– Management: on-board BMC, IPMI 2.0, dedicated management port
– Power supplies: 4 x 1400 W hot-plug, high-efficiency PSUs, N+1 power redundancy
– Services (vary by region): IT Consulting, Server and Storage Deployment, Rack Integration (US only), Support Services
– Great for: HPC including universities, oil & gas, biomedical research, design, simulation, mapping, visualization, rendering, and gaming

9 PowerEdge C410x PCIe module
– Serviceable PCIe module ("taco") capable of supporting any half-height/half-length (HH/HL) or full-height/half-length (FH/HL) card
– FH/FL cards supported with an extended PCIe module
– Future-proofing for next generations of NVIDIA and AMD ATI GPU cards
– Power connector for the GPGPU card; board-to-board connector for x16 Gen PCIe signals and power; GPU card LED

10 PowerEdge C410x configurations: enabling HPC applications to optimize the cost/performance equation off a single x16 connection.
– 4 GPUs per x16 HIC: 16 GPUs in 5U; 3 GPUs per x16: 12 GPUs in 5U; 2 GPUs per x16: 16 GPUs in 7U; 1 GPU per x16: 8 GPUs in 7U
– 5U = (1) C410x + (1) C6100; 7U = (1) C410x + (2) C6100
– GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis
– (Diagrams showed each host's x16 HIC connected by an iPass cable to the C410x, whose internal PCI switches fan out to the GPUs.)

11 Flexibility of the PowerEdge C410x: the GPU-to-host ratio increases to 8:1, possible with dual x16 HICs per host. (Diagram: a single host with two x16 HICs, each connected by an iPass cable to a C410x PCI switch that fans out to GPUs.)

12 PowerEdge C6100 configurations: the "2:1 sandwich"
Summary: one Dell C410x (16 GPUs) + two C6100 (8 nodes); one x16 slot per node connects to 2 GPUs; 7U total; 16 GPUs total; 8 nodes total (2 GPUs per board).
Details: two C6100 chassis (2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host); single-port x16 HIC (iPass) per node; a single C410x with 16 GPUs (fully populated); PCIe x8 per GPU; total space = 7U.
Note: this configuration is equivalent to using the C6100 with the NVIDIA S2050, but it is denser.

13 PowerEdge C6100 configurations: the "4:1 sandwich"
Summary: one Dell C410x (16 GPUs) + one C6100 (4 nodes); one x16 slot per node connects to 4 GPUs; 5U total; 16 GPUs total; 4 nodes total (4 GPUs per board).
Details: one C6100 chassis (2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host); single-port x16 HIC (iPass) per node; a single C410x with 16 GPUs (fully populated); PCIe x4 per GPU; total space = 5U.

14 PowerEdge C6100 configurations: the "8:1 sandwich" (possible future development)
Summary: two Dell C410x (32 GPUs) + one C6100 (4 nodes); one x16 slot per node connects to 8 GPUs; 8U total; 32 GPUs total; 4 nodes total (8 GPUs per board).
Details: one C6100 chassis (2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host); single-port x16 HIC (iPass) per node; two C410x with 32 GPUs (fully populated); PCIe x2 per GPU; total space = 8U. See a later table for metrics.

15 PowerEdge C6145 configurations: the "8:1 sandwich"
Summary: one Dell C410x (16 GPUs) + one C6145 (2 nodes); two to four HIC slots per node connect to the 16 GPUs; 5U total; 16 GPUs total; 2 nodes total (8 GPUs per node).
Details: one C6145 (4S Magny-Cours, 32 DIMM slots, QDR IB, up to 12 drives per host); 3 x single-port x16 HIC (iPass) plus 1 x single-port onboard x16 HIC (iPass) per node; one C410x with 16 GPUs (fully populated); PCIe x4-x8 per GPU; total space = 5U (5U of rack space).

16 PowerEdge C6145 configurations: the "16:1 sandwich"
Summary: two Dell C410x (32 GPUs) + one C6145 (2 nodes); four HIC slots per node connect to 16 GPUs each; 8U total; 32 GPUs total; 2 nodes total (16 GPUs per board).
Details: one C6145 (4S Magny-Cours, 32 DIMM slots, QDR IB, up to 12 drives per host); 3 x single-port x16 HIC (iPass) plus 1 x single-port onboard x16 HIC (iPass) per node; two C410x with 32 GPUs (fully populated); PCIe x4 per GPU; total space = 8U (8U of rack space).

17 PowerEdge C410x block diagram: 16 GPUs, 4 level-2 switches, 8 level-1 switches, and 8 host connections.

18 C410x BMC console configuration interface

19 GPU expansion chassis: supported server list
Dell external GPU solution support: a Hardware Interface Card (HIC) in a PCIe slot connects to external GPU(s) in the C410x; Dell "slot validates" NVIDIA interface cards to verify power, thermals, etc.
HIC/C410x support matrix (server / C410x support / planned support date):
– C6100: Yes, now
– C6105: RTS+, now (BIOS or later)
– C6145: RTS, now
– C1100: Yes, now
– Precision R5500: Yes, now (disable SSC in BIOS)
– R710: Yes, now
– M610x: Yes, now
– R410: Yes, now
– R720: RTS
– R720xd: RTS
– R620: RTS
– C6220: RTS

20 Life sciences application test: GPU-HMMER. Chart: reported speedups include 2.8X, 2.7X, and 1.8X.

21 GPU:host scaling, GPU-HMMER. Speedups: 1.8X, 3.6X, 7.2X, 3.6X.

22 GPU:host scaling, NAMD. Speedups: 4.7X, 8.2X, 15.2X, 9.5X.

23 GPU:host scaling, LAMMPS LJ-Cut. Speedups: 8.5X, 13.5X, 14.4X, 14.0X.

24 Life Sciences Storage Solutions

25 Life sciences: growth rates of compute and data capacity

26 The Lustre Parallel File System. Key Lustre components:
1. Clients (compute nodes): the "users" of the file system, where applications run (the Dell HPC cluster).
2. Metadata Server (MDS): holds metadata information.
3. Object Storage Server (OSS): provides back-end storage for users' files; additional OSS units increase throughput linearly.
(Diagram: clients connect to the Metadata Server (MDS) and to a row of OSS nodes.)
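As a hedged, minimal sketch (not from the slides) of what that architecture looks like from a compute node: the mount point, MGS/MDS host name, file-system name, and stripe count below are assumptions for illustration.

    # Mount the Lustre file system on a client/compute node (MGS/MDS "mds01" and fsname "lfs01" are hypothetical)
    mount -t lustre mds01@o2ib:/lfs01 /lustre
    # Show free space per MDT and OST; each additional OSS/OST adds aggregate capacity and bandwidth
    lfs df -h /lustre
    # Stripe a results directory across 4 OSTs so large files are served by several OSS nodes in parallel
    lfs setstripe -c 4 /lustre/results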

27

28

29 InfiniBand (IPoIB) NFS performance, sequential read. Peaks:
– NSS Small: 1 node doing I/O (fairly level until 4 nodes)
– NSS Medium: 4 nodes doing I/O (not much drop-off)
– NSS Large: 8 nodes doing I/O (good performance over the range)

30 InfiniBand (IPoIB) NFS performance, sequential write. Peaks:
– NSS Small: 1 node doing I/O (steady drop-off to 16 nodes)
– NSS Medium: 2 nodes doing I/O (good performance for up to 8 nodes)
– NSS Large: 4 nodes doing I/O (good performance over the range)
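The slides do not name the benchmark tool; as an illustrative sketch only, a sequential read/write test of this kind could be driven with IOzone against the NSS export mounted over IPoIB. The server name, mount options, path, file size, and thread count below are assumptions.

    # Mount the NSS NFS export over IPoIB on each client node (server name and export path are hypothetical)
    mount -o vers=3,proto=tcp nss-server-ib:/export /mnt/nss
    # Throughput mode: sequential write (-i 0) and read (-i 1), 1 MB records, 16 GB per thread, 4 threads
    iozone -i 0 -i 1 -r 1m -s 16g -t 4 -F /mnt/nss/f1 /mnt/nss/f2 /mnt/nss/f3 /mnt/nss/f4

Scaling the number of client nodes (1, 2, 4, 8, 16) while recording aggregate MB/s produces curves like the ones summarized on these two slides.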

31

32 WRF V3.3 Application Testing and Tuning

33 Dell test environment
Dell R720:
– CPU: 2x Intel Sandy Bridge-E
– Memory: 8x 8 GB (64 GB total)
– Hard disks: 2x 300 GB 15K rpm (RAID 0)
BIOS settings:
– Disable HT
– Memory optimized
– High performance enabled (power max)
OS:
– Red Hat Enterprise Linux 6.3

34 gcc test: build the GNU toolchain stack and WRF, in order:
– gcc, gfortran, g++
– zlib
– HDF5
– netCDF 4
– WRF V3.3
(A condensed build sketch follows.)
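A condensed sketch of that build order with the GNU toolchain; the install prefix, archive names, and configure options are assumptions, and WRF's own configure step is covered on the following slides.

    # Build zlib -> HDF5 -> netCDF-4 -> WRF, all with gcc/gfortran
    export CC=gcc FC=gfortran CXX=g++
    export DIR=/opt/wrf-libs                                   # hypothetical install prefix
    (cd zlib-*   && ./configure --prefix=$DIR && make && make install)
    (cd hdf5-*   && ./configure --prefix=$DIR --enable-fortran --with-zlib=$DIR && make && make install)
    (cd netcdf-* && CPPFLAGS=-I$DIR/include LDFLAGS=-L$DIR/lib ./configure --prefix=$DIR && make && make install)
    export NETCDF=$DIR                                         # WRF's configure script reads this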

35

36 Test results
Output file wrf: simulation period 30 November 2011 to 5 December 2011; wall-clock time 13h 9m 53s.
– wrf.exe starts at: Sun Apr 29 09:35:36 CST 2012
– wrf: SUCCESS COMPLETE WRF
– wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012
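A quick sanity check of the reported wall-clock time from the two timestamps above (assumes GNU date):

    start=$(date -d "2012-04-29 09:35:36" +%s)
    end=$(date -d "2012-04-29 22:45:29" +%s)
    elapsed=$(( end - start ))
    echo $(( elapsed / 3600 ))h $(( elapsed % 3600 / 60 ))m $(( elapsed % 60 ))s   # -> 13h 9m 53s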

37 Configuration file (configure.wrf)
    # Settings for x86_64 Linux, gfortran compiler with gcc (smpar)
    DMPARALLEL      = 1
    OMPCPP          = -D_OPENMP
    OMP             = -fopenmp
    OMPCC           = -fopenmp
    SFC             = gfortran
    SCC             = gcc
    CCOMP           = gcc
    DM_FC           = mpif90 -f90=$(SFC)
    DM_CC           = mpicc -cc=$(SCC)
    FC              = $(SFC)
    CC              = $(SCC) -DFSEEKO64_OK
    LD              = $(FC)
    RWORDSIZE       = $(NATIVE_RWORDSIZE)
    PROMOTION       = # -fdefault-real-8 # uncomment manually
    ARCH_LOCAL      = -DNONSTANDARD_SYSTEM_SUBR
    CFLAGS_LOCAL    = -w -O3 -c -DLANDREAD_STUB
    LDFLAGS_LOCAL   =
    CPLUSPLUSLIB    =
    ESMF_LDFLAG     = $(CPLUSPLUSLIB)
    FCOPTIM         = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
    FCREDUCEDOPT    = $(FCOPTIM)
    FCNOOPT         = -O0
    FCDEBUG         = # -g $(FCNOOPT)
    FORMAT_FIXED    = -ffixed-form
    FORMAT_FREE     = -ffree-form -ffree-line-length-none
    FCSUFFIX        =
    BYTESWAPIO      = -fconvert=big-endian -frecord-marker=4
    FCBASEOPTS_NO_G = -w $(FORMAT_FREE) $(BYTESWAPIO)
    FCBASEOPTS      = $(FCBASEOPTS_NO_G) $(FCDEBUG)
    MODULE_SRCH_FLAG =
    TRADFLAG        = -traditional
    CPP             = /lib/cpp -C -P
    AR              = ar
    ARFLAGS         = ru
    M4              = m4 -G
    RANLIB          = ranlib
    CC_TOOLS        = $(SCC)
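With a configure.wrf like the one above, the smpar build itself is typically produced as follows; choosing the real-data case (em_real) is an assumption based on the forecast run shown on the previous slide.

    ./clean -a                          # remove any previous build artifacts
    ./configure                         # pick the "x86_64 Linux, gfortran with gcc (smpar)" option
    ./compile em_real >& compile.log    # builds main/wrf.exe and main/real.exe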

38 wrf.out (OpenMP tiling)
    WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16
    WRF TILE  1 IS 1 IE 250 JS   1 JE  10
    WRF TILE  2 IS 1 IE 250 JS  11 JE  20
    WRF TILE  3 IS 1 IE 250 JS  21 JE  30
    WRF TILE  4 IS 1 IE 250 JS  31 JE  39
    WRF TILE  5 IS 1 IE 250 JS  40 JE  48
    WRF TILE  6 IS 1 IE 250 JS  49 JE  57
    WRF TILE  7 IS 1 IE 250 JS  58 JE  66
    WRF TILE  8 IS 1 IE 250 JS  67 JE  75
    WRF TILE  9 IS 1 IE 250 JS  76 JE  84
    WRF TILE 10 IS 1 IE 250 JS  85 JE  93
    WRF TILE 11 IS 1 IE 250 JS  94 JE 102
    WRF TILE 12 IS 1 IE 250 JS 103 JE 111
    WRF TILE 13 IS 1 IE 250 JS 112 JE 120
    WRF TILE 14 IS 1 IE 250 JS 121 JE 130
    WRF TILE 15 IS 1 IE 250 JS 131 JE 140
    WRF TILE 16 IS 1 IE 250 JS 141 JE 150
    WRF NUMBER OF TILES = 16
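The tile listing above corresponds to OMP_GET_MAX_THREADS = 16, i.e. one OpenMP thread per physical core of the two-socket R720 (HT disabled in BIOS). A minimal way to launch the smpar binary with that thread count; the stack-size settings are a common precaution, not something stated on the slide.

    export OMP_NUM_THREADS=16      # one thread per core
    export OMP_STACKSIZE=512M      # WRF's large automatic arrays need generous thread stacks
    ulimit -s unlimited
    ./wrf.exe                      # prints the tile decomposition shown above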

39 System resource analysis: CPU (mpstat -P ALL) on the R720 (Linux el6.x86_64, x86_64, 16 CPUs, 04/29/2012). Per-CPU columns reported for "all" and CPUs 0-15: %usr %nice %sys %iowait %irq %soft %steal %guest %idle (numeric values not preserved in the transcript).

40 System resource analysis: memory (free). Columns total, used, free, shared, buffers, cached for Mem, -/+ buffers/cache, and Swap (numeric values not preserved in the transcript).

41 System resource analysis: I/O and disks
I/O (iostat): columns tps, Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn for devices sda, dm-0, dm-1, dm-2 (numeric values not preserved).
Disk usage (df): /dev/mapper/vg_r720-lv_root on /, tmpfs on /dev/shm, a /dev/sda partition on /boot, /dev/mapper/vg_r720-lv_home on /home.
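A sketch of how snapshots like these might be collected at intervals while wrf.exe runs; the 60-second interval and log file names are assumptions.

    mpstat -P ALL 60 > mpstat.log &    # per-core CPU utilization
    iostat 60        > iostat.log &    # block-device throughput
    free -s 60       > free.log   &    # memory, buffers, cache
    ./wrf.exe
    kill %1 %2 %3                      # stop the collectors once the run finishes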

42 Intel compiler test

43

44 Intel links
– http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/
– http://software.intel.com/en-us/articles/wrf-and-wps-v311-installation-bkm-with-inter-compilers-and-intelr-mpi/
– http://www.hpcadvisorycouncil.com/pdf/WRF_Best_Practices.pdf

45 Intel Compiler Flags

46 Intel tuning
1. Reducing MPI overhead: -genv I_MPI_PIN_DOMAIN omp -genv KMP_AFFINITY=compact -perhost
2. Improving cache and memory bandwidth utilization: numtiles = X
3. Using Intel® Math Kernel Library (MKL) DFT for polar filters: depending on the workload, Intel® MKL DFT may provide up to a 3x speedup in simulation speed.
4. Speeding up computations by reducing precision: -fp-model fast=2 -no-prec-div -no-prec-sqrt
(A hedged launch example follows below.)
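An illustration of how points 1 and 2 might look in practice with Intel MPI and a dm+sm (MPI + OpenMP) WRF build; the process count, the -perhost value, the thread count, and numtiles = 8 are illustrative choices, since the slide leaves those values unspecified.

    # namelist.input, &domains section (more tiles than threads can improve cache blocking):
    #   numtiles = 8
    export OMP_NUM_THREADS=8
    mpirun -np 4 -perhost 2 \
           -genv I_MPI_PIN_DOMAIN omp \
           -genv KMP_AFFINITY compact \
           ./wrf.exe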

47 Case Studies

48 BGI (Beijing Genomics Institute)

49 School of Life Sciences, Tsinghua University

50 Success references in life science
Domestic (China): Beijing Genome Institute (BGI); Tsinghua University Life Institute; Beijing Normal University; Jiangsu Taicang Life Institute; the Fourth Military Medical University; ...
International: David H. Murdock Research Institute; Virginia Bioinformatics Institute; University of Florida speeds up memory-intensive gene ...; UCSF; National Center for Supercomputing Applications; ...

51 Thank you!

