LLNL-PRES-600932 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.


1 PyHPC 2012 Workshop
Cyrus Harrison, Lawrence Livermore National Laboratory
Paul Navrátil, Texas Advanced Computing Center, Univ of Texas at Austin
Maysam Moussalem, Department of Computer Science, Univ of Texas at Austin
Ming Jiang, Lawrence Livermore National Laboratory
Hank Childs, Lawrence Berkeley National Laboratory
Friday, Nov 16, 2012
Lawrence Livermore National Security, LLC

2
- Motivation
- System Architecture
  - Framework Components
  - Execution Strategies
- Evaluation Methodology
- Evaluation Results


4 Motivation
This is a Python-fueled HPC research success story.
- Our goal: start to address uncertainty with future HPC hardware architectures and programming models.
- This work: explores moving a key visualization and analysis capability to many-core architectures.
- Why Python? Productivity plus powerful tools (PLY, NumPy, PyOpenCL).

5 Motivation
Derived field generation:
- Creating new fields from existing fields in simulation data.
- A critical component of scientific visualization and analysis tool suites.
Example expressions:
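As a concrete illustration (not taken from the slides), a derived field in this setting is simply a new NumPy ndarray computed element-wise from existing field ndarrays; the fields u, v, w below are hypothetical stand-ins for mesh data:

```python
import numpy as np

# Hypothetical velocity component fields on a mesh, exposed as NumPy
# ndarrays, the same data interface the framework consumes and produces.
u = np.full(8, 3.0)
v = np.full(8, 4.0)
w = np.zeros(8)

# Derived field: element-wise velocity magnitude from the existing fields.
v_mag = np.sqrt(u * u + v * v + w * w)  # every entry is 5.0
```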

6 Motivation
Derived field generators:
- Are present in many post-processing tools: ParaView, VisIt, etc.
- Include three key components:
  - A set of primitives that can be used to create derived quantities.
  - An interface that allows users to compose these primitives.
  - A mechanism that transforms and executes the composed primitives.
- Ongoing issues:
  - Lack of flexibility to exploit many-core architectures.
  - Inefficiency in executing composed primitives.

7 Motivation
Unique contributions:
1) The first-ever implementation targeting many-core architectures.
2) A flexible Python infrastructure that enables the design and testing of a wide range of execution strategies.
3) An evaluation exploring the tradeoffs between runtime performance and memory constraints.


9 System Architecture: Framework Components
- Host Application Interface: Our framework is designed to work in situ for codes with a NumPy interface to mesh data fields. Ndarrays are used as the input/output data interface.
- PLY-based front-end parser: Transforms user expressions into a dataflow specification.
- Dataflow Network Module: Coordinates OpenCL execution using PyOpenCL. Designed to support multiple execution strategies.
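The framework's front end is built with PLY; purely to illustrate the expression-to-dataflow step, the sketch below uses the standard-library ast module instead (all names here are hypothetical, not the framework's API):

```python
import ast

def expr_to_dataflow(expression):
    """Turn 'name = expr' into an ordered list of (filter, inputs, output)
    steps, one per primitive, suitable for building a dataflow network."""
    target, src = (s.strip() for s in expression.split("=", 1))
    ops = {ast.Mult: "mult", ast.Add: "add", ast.Sub: "sub", ast.Div: "div"}
    steps = []

    def walk(node):
        if isinstance(node, ast.Name):            # field reference
            return node.id
        if isinstance(node, ast.BinOp):           # binary primitive
            ins = (walk(node.left), walk(node.right))
            op = ops[type(node.op)]
        elif isinstance(node, ast.Call):          # e.g. sqrt(...)
            ins = tuple(walk(a) for a in node.args)
            op = node.func.id
        else:
            raise ValueError("unsupported expression node")
        out = "f%d" % (len(steps) + 1)            # fresh intermediate name
        steps.append((op, ins, out))
        return out

    walk(ast.parse(src, mode="eval").body)
    op, ins, _ = steps[-1]
    steps[-1] = (op, ins, target)                 # final step yields the target
    return steps
```

For example, `expr_to_dataflow("mag = sqrt(x*x + y*y + z*z)")` yields six steps, from `("mult", ("x", "x"), "f1")` through `("sqrt", ("f5",), "mag")`.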

10 System Architecture
Diagram: user expressions pass from the host application to a PLY-based expression parser, which feeds the Python dataflow network; execution strategies built on PyOpenCL move data to and launch work on the OpenCL target device(s).

11 System Architecture: Basic Features
- Simple “create and connect” API for network definition. The API used by the parser front end is usable by humans.
- Execution is decoupled from network definition and traversal:
  - A topological sort is used to ensure precedence.
  - Results are managed by a reference-counting registry.
- A straightforward filter API is used to implement derived field primitives.
- Network structure can be visualized using Graphviz.
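A minimal sketch of these features (hypothetical names, not the framework's actual API): a create-and-connect network whose execute() orders filters with Kahn's topological sort and frees intermediates through a reference-counting registry:

```python
from collections import defaultdict, deque

class Network:
    """Dataflow sketch: named filter nodes, execution order from a
    topological sort, intermediates freed by reference counting."""
    def __init__(self):
        self.nodes = {}                     # name -> (func, input names)

    def add(self, name, func, inputs=()):
        self.nodes[name] = (func, tuple(inputs))
        return name

    def execute(self, sources):
        # Topological sort (Kahn's algorithm) to ensure precedence.
        indeg = {n: 0 for n in self.nodes}
        users = defaultdict(list)
        for name, (_, ins) in self.nodes.items():
            for i in ins:
                if i in self.nodes:
                    indeg[name] += 1
                    users[i].append(name)
        ready = deque(n for n, d in indeg.items() if d == 0)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for u in users[n]:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)

        # Reference-counting registry: drop a result once all users ran.
        refs = {n: len(users[n]) for n in self.nodes}
        registry = dict(sources)
        result = None
        for n in order:
            func, ins = self.nodes[n]
            result = registry[n] = func(*(registry[i] for i in ins))
            for i in ins:
                if i in refs:
                    refs[i] -= 1
                    if refs[i] == 0:
                        del registry[i]     # free the intermediate result
        return result
```

Wiring up the slide's mag = sqrt(x*x+y*y+z*z) network and calling execute({"x": 3.0, "y": 4.0, "z": 0.0}) returns 5.0, with f1 through f5 released as soon as their consumers finish.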

12 System Architecture: OpenCL Environment
- Built using PyOpenCL.
- Records and categorizes OpenCL timing events:
  - Host-to-device transfers (inputs)
  - Kernel executions
  - Device-to-host transfers (results)
- Manages OpenCL device buffers:
  - Tracks allocated device buffers, available global device memory, and the global memory high-water mark.
  - Enables reuse of allocated buffers.
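A hedged sketch of the buffer bookkeeping described above, in pure Python (in the real module these would be pyopencl.Buffer allocations; the class and its names are assumptions for illustration): it tracks allocated bytes against the device's global memory, records the high-water mark, and reuses released buffers of matching size.

```python
class BufferManager:
    """Sketch of device-buffer bookkeeping: budget tracking,
    high-water mark, and reuse of released buffers."""
    def __init__(self, global_mem_bytes):
        self.capacity = global_mem_bytes
        self.in_use = {}          # buffer id -> size in bytes
        self.free_pool = {}       # size in bytes -> count of reusable buffers
        self.allocated = 0        # bytes currently allocated on the device
        self.high_water = 0       # global memory high-water mark
        self.next_id = 0

    def request(self, nbytes):
        # Reuse a released buffer of the same size when one is available.
        if self.free_pool.get(nbytes):
            self.free_pool[nbytes] -= 1
        else:
            if self.allocated + nbytes > self.capacity:
                raise MemoryError("exceeds global device memory")
            self.allocated += nbytes
            self.high_water = max(self.high_water, self.allocated)
        self.next_id += 1
        self.in_use[self.next_id] = nbytes
        return self.next_id

    def release(self, buf_id):
        # Released buffers stay allocated on the device for reuse.
        nbytes = self.in_use.pop(buf_id)
        self.free_pool[nbytes] = self.free_pool.get(nbytes, 0) + 1
```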

13 System Architecture: Execution Strategies
- Control data movement and how the OpenCL kernels of each primitive are composed to compute the final result.
- Implementations leverage the features of our dataflow network module:
  - Precedence from the dataflow graph
  - Reference counting for intermediate results
- OpenCL kernels for the primitives are written once and used by all strategies.

14 System Architecture
- Roundtrip: Dispatches a single kernel for each primitive; transfers each intermediate result from the OpenCL target device back to the host environment.
- Staged: Dispatches a single kernel for each primitive; stores intermediate results in the global memory of the OpenCL target device.
- Fusion: Employs kernel fusion to construct and execute a single OpenCL kernel that composes all selected primitives.
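To make the Fusion idea concrete, here is a sketch (a hypothetical helper, not the framework's actual code generator) that composes an ordered list of primitive steps into the source text of one OpenCL kernel; no OpenCL runtime is needed just to build the string:

```python
def fuse_kernel(steps, inputs, output, name="fused_kernel"):
    """Compose ordered primitive steps, e.g. ("mult", ("x", "x"), "f1"),
    into a single OpenCL kernel's source, so one dispatch replaces
    one dispatch per primitive."""
    infix = {"mult": "*", "add": "+", "sub": "-", "div": "/"}
    params = "".join("__global const float *%s_in, " % i for i in inputs)
    lines = ["__kernel void %s(%s__global float *%s_out)" % (name, params, output),
             "{",
             "    int gid = get_global_id(0);"]
    for i in inputs:                       # load each input field once
        lines.append("    float %s = %s_in[gid];" % (i, i))
    for op, args, out in steps:            # inline every primitive
        if op in infix:
            expr = (" %s " % infix[op]).join(args)
        else:                              # unary built-in such as sqrt
            expr = "%s(%s)" % (op, ", ".join(args))
        lines.append("    float %s = %s;" % (out, expr))
    lines.append("    %s_out[gid] = %s;" % (output, steps[-1][2]))
    lines.append("}")
    return "\n".join(lines)
```

Feeding it the six steps of mag = sqrt(x*x+y*y+z*z) produces one kernel that reads x, y, z once, evaluates all primitives in registers, and writes only mag.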

15 System Architecture
Example expression: mag = sqrt(x*x+y*y+z*z)
Corresponding dataflow network: x, y, z feed three mult filters, whose outputs chain through two add filters into sqrt, producing mag.

16 System Architecture
Roundtrip execution of mag = sqrt(x*x+y*y+z*z): each primitive is dispatched as its own kernel on the OpenCL target, and every intermediate result is transferred back to the OpenCL host:
f1 = mult(x,x)
f2 = mult(y,y)
f3 = mult(z,z)
f4 = add(f1,f2)
f5 = add(f4,f3)
f6 = sqrt(f5)

17 System Architecture
Staged execution of mag = sqrt(x*x+y*y+z*z): each primitive is still dispatched as its own kernel, but the intermediate results stay in global memory on the OpenCL target:
f1 = mult(x,x)
f2 = mult(y,y)
f3 = mult(z,z)
f4 = add(f1,f2)
f5 = add(f4,f3)
f6 = sqrt(f5)

18 System Architecture
Fusion execution of mag = sqrt(x*x+y*y+z*z): x, y, and z are transferred to the OpenCL target once, a single fused kernel evaluates the mult, add, and sqrt primitives, and only mag is read back by the host.

19 System Architecture
Side-by-side comparison of the Roundtrip, Staged, and Fusion strategies (figure).


21 Evaluation Methodology
- Evaluation expressions: detection of vortical structures in a turbulent mixing simulation.
- Host application: VisIt
- Three studies:
  - Single-device performance
  - Single-device memory usage
  - Distributed-memory parallel
- Test environment: LLNL's Edge HPC cluster, which provides OpenCL access to both NVIDIA Tesla M2050s and Intel Xeon processors.

22 Evaluation Methodology
We selected three expressions used for vortex detection and analysis.

Vector magnitude:
v_mag = sqrt(u*u + v*v + w*w)

Vorticity magnitude:
du = grad3d(u,dims,x,y,z)
dv = grad3d(v,dims,x,y,z)
dw = grad3d(w,dims,x,y,z)
w_x = dw[1] - dv[2]
w_y = du[2] - dw[0]
w_z = dv[0] - du[1]
w_mag = sqrt(w_x*w_x + w_y*w_y + w_z*w_z)
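grad3d above is a framework primitive; as an illustrative stand-in (an assumption, not the framework's implementation), numpy.gradient on a uniform unit-spaced grid returns the same kind of per-axis derivative list, and a solid-body rotation field makes the vorticity expression easy to check:

```python
import numpy as np

# Solid-body rotation (u = -y, v = x, w = 0) on a uniform, unit-spaced
# grid; its vorticity is (0, 0, 2) everywhere, so w_mag should be 2.
n = 8
x, y, z = np.meshgrid(np.arange(n, dtype=float),
                      np.arange(n, dtype=float),
                      np.arange(n, dtype=float), indexing="ij")
u, v, w = -y, x, np.zeros_like(x)

# Stand-in for grad3d: lists of derivatives along the x, y, z axes.
du = np.gradient(u)
dv = np.gradient(v)
dw = np.gradient(w)

# Vorticity components and magnitude, following the slide's expression.
w_x = dw[1] - dv[2]
w_y = du[2] - dw[0]
w_z = dv[0] - du[1]
w_mag = np.sqrt(w_x * w_x + w_y * w_y + w_z * w_z)  # 2.0 everywhere
```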

23 Evaluation Methodology
Q-criterion:
du = grad3d(u,dims,x,y,z)
dv = grad3d(v,dims,x,y,z)
dw = grad3d(w,dims,x,y,z)
s_1 = 0.5 * (du[1] + dv[0])
s_2 = 0.5 * (du[2] + dw[0])
s_3 = 0.5 * (dv[0] + du[1])
s_5 = 0.5 * (dv[2] + dw[1])
s_6 = 0.5 * (dw[0] + du[2])
s_7 = 0.5 * (dw[1] + dv[2])
w_1 = 0.5 * (du[1] - dv[0])
w_2 = 0.5 * (du[2] - dw[0])
w_3 = 0.5 * (dv[0] - du[1])
w_5 = 0.5 * (dv[2] - dw[1])
w_6 = 0.5 * (dw[0] - du[2])
w_7 = 0.5 * (dw[1] - dv[2])
s_norm = du[0]*du[0] + s_1*s_1 + s_2*s_2 + s_3*s_3 + dv[1]*dv[1] + s_5*s_5 + s_6*s_6 + s_7*s_7 + dw[2]*dw[2]
w_norm = w_1*w_1 + w_2*w_2 + w_3*w_3 + w_5*w_5 + w_6*w_6 + w_7*w_7
q_crit = 0.5 * (w_norm - s_norm)
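The same numpy.gradient stand-in for grad3d (an illustrative assumption) lets the Q-criterion expression be sanity-checked on a pure-rotation field, where the strain norm vanishes and Q is positive; since the slide's s_3, s_6, s_7 equal s_1, s_2, s_5 and w_3, w_6, w_7 are their negatives, the squared norms fold into factors of two:

```python
import numpy as np

# Solid-body rotation: pure rotation, so the strain-rate norm is zero
# and the Q-criterion is positive everywhere.
n = 8
x, y, z = np.meshgrid(np.arange(n, dtype=float),
                      np.arange(n, dtype=float),
                      np.arange(n, dtype=float), indexing="ij")
u, v, w = -y, x, np.zeros_like(x)
du, dv, dw = np.gradient(u), np.gradient(v), np.gradient(w)

# Off-diagonal strain (s_*) and rotation (w_*) terms from the slide.
s_1 = 0.5 * (du[1] + dv[0])
s_2 = 0.5 * (du[2] + dw[0])
s_5 = 0.5 * (dv[2] + dw[1])
w_1 = 0.5 * (du[1] - dv[0])
w_2 = 0.5 * (du[2] - dw[0])
w_5 = 0.5 * (dv[2] - dw[1])

# Squared norms, with symmetric duplicates folded into factors of two.
s_norm = (du[0]*du[0] + dv[1]*dv[1] + dw[2]*dw[2]
          + 2.0 * (s_1*s_1 + s_2*s_2 + s_5*s_5))
w_norm = 2.0 * (w_1*w_1 + w_2*w_2 + w_5*w_5)
q_crit = 0.5 * (w_norm - s_norm)  # 1.0 everywhere for this flow
```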

24 Evaluation Methodology
- A timestep of a Rayleigh–Taylor instability simulation.
- DNS simulation with intricate embedded vortical features.
- 27 billion cells: 3072 sub-grids, each 192x129x256 cells.
Data courtesy of Bill Cabot and Andy Cook, LLNL

25 Evaluation Methodology
- 12 sub-grids varying from 9.3 to 113 million cells.
- Fields: mesh coordinates (x,y,z); velocity vector field (u,v,w)
Figure: velocity magnitude on the sub-grids used for single-device evaluation.
Data courtesy of Bill Cabot and Andy Cook, LLNL

26 Evaluation Methodology
VisIt's Python Interfaces (diagram): local Python clients (GUI, CLI) drive the Viewer (state manager) over a network connection via the Python Client Interface (state control), while the Python Filter Runtime (direct mesh manipulation) runs inside the MPI-parallel compute engine on the cluster, next to the data.

27 Evaluation Methodology
Single Device Evaluation
- Recorded runtime performance and memory usage.
- Two OpenCL target devices:
  - GPU: Tesla M2050 (3 GB RAM)
  - CPU: Intel Xeons (96 GB RAM, shared with the host environment)
- 144 test cases per device:
  - Three test expressions
  - Our three strategies and a reference kernel
  - Data: 12 RT3D sub-grids, ranging from 9.6 million to 113 million cells

28 Evaluation Methodology
Distributed-Memory Parallel Test
- A “smoke” test: Q-criterion using the Fusion strategy.
- 128 nodes using two Tesla M2050s per node.
- Data: full mesh from a single RT3D timestep
  - 3072 sub-grids, each with 192x192x256 cells
  - 27 billion total cells + ghost data
- Each of the 256 Teslas streams 12 sub-grids.
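The counts above divide evenly; as a sketch (the actual assignment scheme is not described on the slide, so the round-robin mapping below is an assumption), each device ends up streaming exactly 3072 / 256 = 12 sub-grids:

```python
def assign_subgrids(num_subgrids=3072, num_devices=256):
    """Round-robin assignment of sub-grid indices to devices, so each
    device streams an equal share (here 3072 / 256 = 12 apiece)."""
    work = {d: [] for d in range(num_devices)}
    for g in range(num_subgrids):
        work[g % num_devices].append(g)
    return work
```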


30 Evaluation Results: Velocity Magnitude (figure)

31 Evaluation Results: Vorticity Magnitude (figure)

32 Evaluation Results: Q-criterion (figure)

33 Evaluation Results: Velocity Magnitude (figure)

34 Evaluation Results: Vorticity Magnitude (figure)

35 Evaluation Results: Q-criterion (figure)

36 Evaluation Results

Expression           Strategy    Device Writes  Device Reads  Kernel Executions
Velocity Magnitude   Roundtrip        11             6               6
                     Staged            3             1               6
                     Fusion            3             1               1
Vorticity Magnitude  Roundtrip        32            12              12
                     Staged            7             1              18
                     Fusion            7             1               1
Q-criterion          Roundtrip
                     Staged            7             1              67
                     Fusion            7             1               1

37 Evaluation Results: Q-criterion of the 27 billion cell mesh (figure)

38 Evaluation Results
Strategy comparison:
- Roundtrip: slowest, and least constrained by target device memory.
- Staged: faster than Roundtrip, and most constrained by target device memory.
- Fusion: fastest, with the least data movement.
Device comparison:
- GPU: best runtime performance for test cases that fit into the 3 GB of global device memory.
- CPU: successfully completed all test cases.

39
- Our framework provides a flexible path forward for exploring strategies for efficient derived field generation on future many-core architectures.
- The Python ecosystem made this research possible.
- Future work: distributed-memory parallel performance; strategies for streaming and using multiple devices on-node.
- Thanks, PyHPC 2012!
Contact Info: Cyrus Harrison

