

1 Copyright HiPERiSM Consulting, LLC, 2013 http://www.hiperism.com George Delic, Ph.D., HiPERiSM Consulting, LLC, (919) 484-9803, P.O. Box 569, Chapel Hill, NC 27514, george@hiperism.com

2 UPDATE ON A NEW PARALLEL SPARSE CHEMISTRY SOLVER FOR CMAQ. George Delic, HiPERiSM Consulting, LLC. 12th Annual CMAS Conference, Chapel Hill, NC, 30 October 2013

3 Overview
- Overview: CMAQ from HiPERiSM and the U.S. EPA
- Hardware platforms
- Software and compilers
- Episode studied
- Thread parallel performance metrics
- 2 compilers, 2 platforms (24 hr run)
- Chemistry solver parallel efficiency (1 hr run)
- Accuracy metrics for sparse solution of Ax = y
- CMAQ numerical performance
- Numerical error in U.S. EPA code
- Concentrations for O3, NO2 at hour 23
- Lessons learned
- Conclusions
- Next steps for CMAQ development

4 Hardware platforms
- Intel: 2 x 4-core CPUs = 8 cores, W5590 Nehalem, 3.3 GHz
- AMD: 4 x 12-core CPUs = 48 cores, 6176 SE Opteron, 2.3 GHz

5 Software and compilers
- OS: Linux 64-bit
- CMAQ versions (Rosenbrock solver *)
  - U.S. EPA version uses JSPARSE (serial)
  - HiPERiSM version uses FSPARSE (parallel)
- Compilers (legend)
  - Intel 12.1 (ifort/Intel)
  - Portland 13.4 (pgf90)
(*) The chemistry solution requires a sparse linear solver for the linear system Ax = y: FSPARSE replaces JSPARSE.
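Neither the JSPARSE nor the FSPARSE source appears in these slides; as a reminder of the underlying task they both perform, a minimal dense Gaussian elimination for Ax = y is sketched below. This is purely illustrative: real sparse solvers such as these exploit the matrix's zero pattern rather than storing and eliminating the full dense matrix.

```python
def solve(A, y):
    """Solve Ax = y by Gaussian elimination with partial pivoting
    (dense illustration only, not the JSPARSE/FSPARSE algorithm)."""
    n = len(y)
    A = [row[:] for row in A]  # work on copies, leave inputs intact
    x = y[:]
    for k in range(n):
        # Partial pivoting: bring the row with the largest pivot to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    # Back substitution.
    for k in range(n - 1, -1, -1):
        x[k] = (x[k] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0]))  # [1.0, 1.0]
```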

6 Episode studied
- Grid: 279 x 240 Eastern US domain at 12 km grid spacing with 34 vertical layers
- CMAQ 4.7.1 24-hour episode: August 9, 2006, using the CB05 mechanism with chlorine extensions and the Aero 4 version for PM modeling
- Total output file size: ~37.7 GB (137 variables)

7 Thread parallel performance metrics
- SPEEDUP: U.S. EPA time / thread parallel time
- PARALLEL SCALING: S_P = T_1 / T_P
- PARALLEL EFFICIENCY: E_P = S_P / P
where T_1 is the runtime for a single thread and T_P is the runtime for P threads.
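The three metrics above can be written down as one-line functions. This is a minimal sketch; the timings used in the example are hypothetical placeholders, not measured CMAQ runtimes.

```python
def speedup_vs_epa(epa_time, parallel_time):
    # SPEEDUP: U.S. EPA time / thread parallel time
    return epa_time / parallel_time

def parallel_scaling(t1, tp):
    # S_P = T_1 / T_P, with T_1 the single-thread runtime
    return t1 / tp

def parallel_efficiency(t1, tp, p):
    # E_P = S_P / P, for P threads
    return parallel_scaling(t1, tp) / p

# Hypothetical timings (seconds) for illustration:
t1, t4 = 100.0, 28.0                   # single-thread and 4-thread runtimes
print(parallel_scaling(t1, t4))        # S_4 ~ 3.57
print(parallel_efficiency(t1, t4, 4))  # E_4 ~ 0.89
```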

8 2 compilers, 2 platforms (24 hr run)
[Left figure] CMAQ wall clock time (hours) for the EPA and parallel versions with 1 to 8 threads on the Intel and AMD platforms.
[Right figure] Parallel CMAQ speedup versus EPA for 1 to 8 threads on the Intel and AMD platforms.

9 Chemistry solver parallel efficiency (pgf90 on Intel node, 1 hr run)
[Figure] Parallel efficiency by thread count (2-8). Parallel efficiency exceeds 87% with 2-6 threads.

10 CMAQ 4.6.1 MPI efficiency and (estimated OpenMP speedup), Portland compiler on x86_64 cluster

MPI processes   Hours   Speedup (OpenMP)   MPI efficiency
2               15.1    1.9                96%
4                8.2    3.5 (x 1.3)        88%
8                5.1    5.7 (x 1.4)        71%
16               3.3    8.7 (x 1.5)        54%

11 Accuracy metrics for sparse solution of Ax = y

Value      Norm (1)           Statistic (2)
Residual   norm(Ax-y, inf)    mean or sample
Solution   norm(x, inf)       mean or sample

(1) Uses the "inf" norm, i.e. the maximum value, over the vector Ax-y, whose length equals the number of chemistry species.
(2) Mean over cells in each block, or sampled at one cell in each of the 47,430 blocks over the grid domain.
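The residual metric in the table can be computed as below. This is a tiny dense example for illustration; in CMAQ the vector Ax-y has one entry per chemistry species and A is sparse, which this sketch does not reproduce.

```python
def inf_norm(v):
    # norm(v, inf): the maximum absolute component of the vector
    return max(abs(x) for x in v)

def residual_inf_norm(A, x, y):
    # norm(Ax - y, inf) over a vector of length equal to the system size
    n = len(y)
    r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(n)]
    return inf_norm(r)

A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, 1.0]   # exact solution of Ax = y below
y = [3.0, 4.0]
print(residual_inf_norm(A, x, y))  # 0.0 for this exact solution
```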

12 CMAQ numerical performance
[Figure] norm(Ax-y, inf) in the JSPARSE (■) and FSPARSE (■) methods. At the end of the first simulation hour, this shows the norm of the residual Ax-y at the last call to the CMAQ chemistry solver, sampled in cell 48 of each of the 47,430 blocks.

13 Numerical error in U.S. EPA code
- Uses mixed mode arithmetic (DP and SP)
- Inconsistent promotion of SP to DP for constants and variables
- Worst case in CALCKS, where thermal and photolytic reaction rates are computed in SP
- Inherited SP values amplify precision loss in the three Rosenbrock solve stages
- Use of ATOL = 1E-07 is moot
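The kind of precision loss described above can be demonstrated by round-tripping a value through IEEE single precision with Python's struct module. This illustrates SP storage generally, not the actual CALCKS code: a value computed or stored in SP retains roughly 7 significant digits, so its error sits near the ATOL = 1E-07 tolerance.

```python
import struct

def to_single(x):
    # Pack a double into 32-bit single precision and read it back:
    # the same truncation occurs when a DP value lands in an SP variable.
    return struct.unpack('f', struct.pack('f', x))[0]

rate_dp = 1.0 / 3.0            # double precision: ~16 significant digits
rate_sp = to_single(rate_dp)   # single precision: ~7 significant digits
print(abs(rate_dp - rate_sp))  # error near 1e-8, close to ATOL = 1e-7
```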

14 Concentrations for O3 at hour 23
[Figure] Histogram of all 66,960 Layer 1 concentration values in decade bins: difference in predictions (■) and concentration value (■).

15 Concentrations for NO2 at hour 23
[Figure] Histogram of all 66,960 Layer 1 concentration values in decade bins: difference in predictions (■) and concentration value (■).

16 Lessons learned

Numerical precision:
1. Limitations due to EPA's inconsistent use of mixed mode arithmetic
2. The FSPARSE method is more precise by many orders of magnitude
3. The FSPARSE method allows relaxation of the chemistry time step convergence parameter ATOL

Species concentrations:
1. JSPARSE and FSPARSE showed good agreement for values of O3, NO2, NO, H2O2
2. Degraded agreement for species such as ASO4I
3. Remaining differences result from cumulative errors in the EPA code

17 Conclusions
- CMAQ computational performance shows speedup in the range 1.4-1.5 with two compilers on two platforms in a thread parallel model for the Rosenbrock solver, compared to the U.S. EPA release
- The FSPARSE algorithm yields more precision in the sparse matrix chemistry solver than the U.S. EPA release
- The FSPARSE algorithm's performance gains are portable across platforms and compilers

18 Next steps for CMAQ development
- Short term goals
  - OpenMP parallel model extensions to other portions of the CMAQ code
  - Explore a port of FSPARSE to GPGPU technology
- Long term goals
  - Plan a code architecture (re)design throughout the whole of CMAQ to change the memory footprint and increase computational efficiency
  - Develop a thread safe version of CMAQ with the Gear solver

19 Extra Slides

20 Chemistry solver time step count (pgf90 on Intel node, 1 hr run)
[Left figure] CMAQ time step count for the EPA and parallel (single thread) versions with ATOL = 1E-07.
[Right figure] CMAQ time step count for the parallel (single thread) version with ATOL = 1E-05.

21 Chemistry solver scaling and speedup (pgf90 on Intel node, 1 hr run)
[Left figure] Parallel CMAQ scaling by thread count versus a single thread with ATOL = 1E-05.
[Right figure] Parallel CMAQ speedup by thread count (1-8) versus EPA.

22 Concentrations for ASO4I at hour 23
[Figure] Histogram of all 66,960 Layer 1 concentration values in decade bins: difference in predictions (■) and concentration value (■).

23 Parallel paradigm nomenclature
- MPI = Message Passing Interface (coarse grain chunks of work)
- OpenMP = a thread based model (fine grain chunks of work)
- Vector/SSE = instruction level (really fine grain tasks)
- GPGPU = General Purpose Graphical Processing Unit (multi-grain tasks)
Moving down this list, bandwidth increases and latency decreases.

24 Software evolution
- Compiler technology has grown
- CMAQ software development for computational efficiency is lagging
- CMAQ users need more throughput as problem size grows
- Penalty for not adapting to growth:
  - Lost performance (more than 10x)
  - Decrease in efficiency and throughput

25 Riding the revolution
- HPC mantra: "Map the model to the architecture"
- Shared memory parallel model
  - OpenMP port with up to 24 threads
  - GPGPU port with up to hundreds of threads
- Decision points
  - Assessing the level of effort to adapt
  - Blending with existing MPI models

26 CMAQ has not kept up to date with HPC growth. Why?
- Architecture has evolved rapidly to support multiple levels of parallelism
- CMAQ traditionally uses only one level of parallelism
- Model development has effectively shifted the CMAQ work load balance toward more scalar work

27 Parallel CMAQ approach (old parallel school: 1980s)
- Data parallelism
  - Partition the data domain (i.e., the grid)
  - Distribute partitions to cluster nodes
- Apply MPI
  - To distribute coarse work chunks
  - To coordinate synchronization and data collection
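The data-parallel partitioning described above amounts to splitting grid rows (or columns) among MPI ranks. A sketch using the episode's 279-row domain follows; the rank count of 4 is arbitrary, and the helper name is invented for illustration.

```python
def partition_rows(nrows, nranks):
    """Split nrows grid rows into nranks contiguous chunks,
    handing out the remainder one extra row at a time, as MPI
    domain decompositions typically do."""
    base, extra = divmod(nrows, nranks)
    bounds, start = [], 0
    for r in range(nranks):
        size = base + (1 if r < extra else 0)
        bounds.append((start, start + size))  # half-open [start, end)
        start += size
    return bounds

# 279-row Eastern US domain over 4 ranks:
print(partition_rows(279, 4))  # [(0, 70), (70, 140), (140, 210), (210, 279)]
```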

28 Proposed parallel CMAQ approach (new parallel school: 2000s)
- Task parallelism (OpenMP)
  - Distribute tasks to parallel thread teams
  - Utilize separate cores (one per thread)
- Instruction level parallelism (Vector)
  - Construct code that vectorizes
  - Utilize vector instructions on commodity processors
- Target the same code to a GPGPU
  - All instruction-level parallel loops also parallelize for a GPGPU target
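In Fortran the thread teams above come from OpenMP directives; the same task-parallel shape can be sketched in Python with a thread pool. This is purely illustrative: solve_block is a stand-in for a per-block chemistry solve, not CMAQ code, and independent blocks map naturally to separate threads.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_block(block_id):
    # Stand-in for the per-block chemistry solve: each block is
    # independent of the others, so blocks can go to separate
    # threads (one per core) with no synchronization between them.
    return block_id * block_id

blocks = range(8)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order, so results line up with blocks
    results = list(pool.map(solve_block, blocks))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```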

