QDP++ and Chroma
Robert Edwards, Jefferson Lab
Collaborators: Balint Joo

Lattice QCD – extremely uniform
– Periodic or very simple boundary conditions
– SPMD: identical sublattices per processor
– Lattice operator: the Dirac operator (a standard form is given below)
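The Dirac-operator formula itself did not survive transcription of the slide. For orientation, the standard Wilson–Dirac operator, presumably close to what the slide showed (up to conventions), reads:

\[
M(x,y) \;=\; (m + 4r)\,\delta_{x,y}
\;-\; \tfrac{1}{2}\sum_{\mu=1}^{4}\Big[(r-\gamma_\mu)\,U_\mu(x)\,\delta_{x+\hat\mu,\,y}
\;+\;(r+\gamma_\mu)\,U_\mu^\dagger(x-\hat\mu)\,\delta_{x-\hat\mu,\,y}\Big]
\]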

Software Infrastructure
Goals: create a unified software environment that enables the US lattice community to achieve very high efficiency on diverse multi-terascale hardware.
Tasks and libraries:
I. QCD Data Parallel API – QDP
II. Optimized message passing – QMP
III. Optimized QCD linear algebra – QLA
IV. I/O, data files and data grid – QIO
V. Optimized physics codes – CPS/MILC/Chroma/etc.
VI. Execution environment – unify BNL/FNAL/JLab

Data layout over processors
Overlapping communications and computations, e.g. C(x) = A(x) * shift(B, +mu):
– Send the face forward (non-blocking) to the neighboring node.
– Receive the face into a pre-allocated buffer.
– Meanwhile do A*B on the interior sites.
– “Wait” on the receive to perform A*B on the face.
Lazy evaluation (C style):
Shift(tmp, B, +mu);
Mult(C, A, tmp);
(A sketch of the overlap pattern is given below.)
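The following sketch spells out that overlap pattern using the QMP calls shown later in these slides; pack_face, multiply_interior, multiply_face and FACE_BYTES are schematic placeholders, not real API, so read this as an illustration rather than working code.

// Overlap the +x shift with interior computation (schematic).
char send_buf[FACE_BYTES], recv_buf[FACE_BYTES];

QMP_msgmem_t sm = QMP_declare_msgmem(send_buf, FACE_BYTES);
QMP_msgmem_t rm = QMP_declare_msgmem(recv_buf, FACE_BYTES);
QMP_msghandle_t sh = QMP_declare_send_relative(sm, +x);
QMP_msghandle_t rh = QMP_declare_receive_from(rm, -x);

pack_face(send_buf, B, +x);          // copy B's +x face into the send buffer
QMP_start(sh);                       // non-blocking send of the face
QMP_start(rh);                       // non-blocking receive into recv_buf

multiply_interior(C, A, B, +x);      // A*B on sites whose +x neighbour is on-node

QMP_wait(rh);                        // face has arrived
multiply_face(C, A, recv_buf, +x);   // finish A*B on the boundary sites
QMP_wait(sh);                        // safe to reuse the send buffer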

SciDAC Software Structure
– Level 3: optimised Dirac operators, inverters – Wilson op and DWF inverter for P4; Wilson and staggered ops for QCDOC
– Level 2: QDP (QCD Data Parallel) – lattice-wide operations, data shifts – exists in C and C++, scalar and MPP, built on QMP
– Level 1: QMP (QCD Message Passing), QLA (QCD Linear Algebra) – QMP exists, implemented over MPI, GM, gigE and QCDOC
– QIO: XML I/O, LIME

QMP Simple Example

char buf[size];
QMP_msgmem_t mm;
QMP_msghandle_t mh;

mm = QMP_declare_msgmem(buf, size);
mh = QMP_declare_send_relative(mm, +x);
QMP_start(mh);
// Do computations
QMP_wait(mh);

The receiving node follows the same steps, except:
mh = QMP_declare_receive_from(mm, -x);
Multiple calls.

Data Parallel QDP/C, C++ API
– Hides architecture and layout
– Operates on lattice fields across sites
– Linear algebra tailored for QCD
– Shifts and permutation maps across sites
– Reductions
– Subsets
– Entry/exit – attach to existing codes
(A short usage sketch is given below.)
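A brief usage sketch combining these operations, written in the same style as the slides' own examples (gaussian, shift(b,mu), subsets, norm2); the header name and exact signatures are assumptions here rather than checked against the QDP++ sources.

#include "qdp.h"
using namespace QDP;

// Nearest-neighbour contraction on the even subset, then a reduction.
Real example(const multi1d<LatticeColorMatrix>& u, int mu)
{
    LatticeDiracFermion b, c;
    gaussian(b);                     // platform-independent Gaussian fill

    c[even] = u[mu] * shift(b, mu);  // gauge link times shifted fermion, even sites only

    return norm2(c, even);           // reduction over the even subset
}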

QDP++ Type Structure

Lattice fields have various kinds of indices:
– Color: U^{ab}(x)
– Spin: Γ_{αβ}
– Mixed: ψ^a_α(x), Q^{ab}_{αβ}(x)

The tensor product of indices forms the type:

                Lattice    Color         Spin          Complexity
Gauge fields    Lattice    Matrix(Nc)    Scalar        Complex
Fermions        Lattice    Vector(Nc)    Vector(Ns)    Complex
Scalars         Scalar     Scalar        Scalar        Scalar
Propagators     Lattice    Matrix(Nc)    Matrix(Ns)    Complex
Gamma           Scalar     Scalar        Matrix(Ns)    Complex

QDP++ forms these types via nested C++ templating.
Formation of new types (e.g. a half fermion) is possible.
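A sketch of what the nested templating looks like in code. The names below (OLattice, PSpinVector, PColorVector, RComplex, and the constants Nc, Ns, REAL) follow QDP++ conventions, but the exact typedefs should be treated as illustrative assumptions rather than quotes from the library headers.

// Outermost index is the lattice, then spin, then color, then reality.
typedef OLattice< PScalar< PColorMatrix< RComplex<REAL>, Nc > > >           LatticeColorMatrix;   // gauge field
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc >, Ns > >   LatticeDiracFermion;  // fermion
typedef OLattice< PSpinMatrix< PColorMatrix< RComplex<REAL>, Nc >, Ns > >   LatticePropagator;    // propagator

// New types are formed the same way, e.g. a half fermion with Ns/2 spin components:
typedef OLattice< PSpinVector< PColorVector< RComplex<REAL>, Nc >, Ns/2 > > LatticeHalfFermion;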

Data-Parallel Operations
– Unary and binary: -a; a-b; …
– Unary functions: adj(a), cos(a), sin(a), …
– Random numbers (platform independent): random(a), gaussian(a)
– Comparisons (booleans): a <= b, …
– Broadcasts: a = 0, …
– Reductions: sum(a), …
Fields have various types (indices): tensor product

QDP Expressions
Can create expressions. QDP/C++ code:

multi1d<LatticeColorMatrix> u(Nd);
LatticeDiracFermion b, c, d;
int mu;

c[even] = u[mu] * shift(b, mu) + 2 * d;

PETE: Portable Expression Template Engine.
Temporaries eliminated, expressions optimised.

Linear Algebra Implementation
– Naïve ops involve lattice temporaries – inefficient
– Eliminate lattice temporaries – PETE
– Allows further combining of operations (adj(x)*y)
– Overlap communications/computations
– Full performance – expressions at site level

// Lattice operation
A = adj(B) + 2 * C;

// Lattice temporaries
t1 = 2 * C;
t2 = adj(B);
t3 = t2 + t1;
A = t3;

// Merged lattice loop
for (i = ...; ...; ...) {
  A[i] = adj(B[i]) + 2 * C[i];
}

QDP++ Optimization
Optimizations “under the hood”:
– Select numerically intensive operations through template specialization.
– PETE recognises expression templates like z = a * x + y from type information at compile time.
– Calls a machine-specific optimised routine (axpyz).
– The optimized routine can use assembler, reorganise loops, etc.
– Optimized routines can be selected at configuration time; unoptimized fallback routines exist for portability.
(A schematic example of the idea is given below.)
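The following self-contained example illustrates the mechanism in miniature; it is not PETE or QDP++ code. Operator overloads record the structure of a*x + y as a lightweight node type without doing any arithmetic, and the assignment recognises that node shape and dispatches to an axpyz-style kernel (overload resolution here plays the role that template specialization of the evaluator plays in PETE).

#include <cstddef>
#include <vector>

// Minimal stand-in for a lattice-wide field.
struct Field { std::vector<double> data; };

// Expression node for  a * x.
struct ScaleNode { double a; const Field& x; };
// Expression node for  a * x + y.
struct AxpyNode  { ScaleNode ax; const Field& y; };

// Building the expression only records its structure; no arithmetic yet.
inline ScaleNode operator*(double a, const Field& x)     { return {a, x}; }
inline AxpyNode  operator+(ScaleNode ax, const Field& y) { return {ax, y}; }

// The "optimized" kernel; in QDP++ this is where an assembler or
// vectorised routine would be plugged in.  Assumes equal-sized fields.
inline void axpyz(Field& z, double a, const Field& x, const Field& y)
{
    for (std::size_t i = 0; i < z.data.size(); ++i)
        z.data[i] = a * x.data[i] + y.data[i];
}

// Assignment recognises the AxpyNode shape and calls the kernel,
// so no field-sized temporaries are ever created.
inline void assign(Field& z, const AxpyNode& e)
{
    axpyz(z, e.ax.a, e.ax.x, e.y);
}

// Usage:
//   Field x, y, z;   // ... all sized identically ...
//   assign(z, 2.0 * x + y);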

Performance Test Case – Wilson Conjugate Gradient

LatticeFermion psi, p, r, mp, mmp;
Real c, cp, a, d, b;
Subset s;

for(int k = 1; k <= MaxCG; ++k)
{
  // c = | r[k-1] |**2
  c = cp;

  // a[k] := | r[k-1] |**2 / | M p[k] |**2
  // Mp = M(u) * p
  M(mp, p, PLUS);      // Dslash
  // d = | mp |**2
  d = norm2(mp, s);
  a = c / d;

  // psi[k] += a[k] p[k]
  psi[s] += a * p;

  // r[k] -= a[k] M^dag.M.p[k]
  M(mmp, mp, MINUS);
  r[s] -= a * mmp;

  cp = norm2(r, s);
  if ( cp <= rsd_sq ) return;

  // b[k+1] := |r[k]|**2 / |r[k-1]|**2
  b = cp / c;

  // p[k+1] := r[k] + b[k+1] p[k]
  p[s] = r + b * p;
}

In C++ there is significant room for performance degradation.
The performance limitations are in the linear-algebra ops (VAXPY) and the norms.
Optimization: functions return a container holding the function type and operands; at “=” the expression is replaced with optimized code by template specialization.
Performance: QDP overhead is ~1% of peak for Wilson on QCDOC, 4^4 sites per node (the slide's plots show the VAXPY operations and the norm squares).

Chroma
A lattice QCD toolkit/library built on top of QDP++.
The library is a module – it can be linked with other codes.
Features:
– Utility libraries (gluonic measurements, smearing, etc.)
– Fermion support (DWF, Overlap, Wilson, Asqtad)
– Applications: spectroscopy, propagators & 3-pt functions, eigenvalues, heatbath, HMC
– Optimization hooks – level 3 Wilson-Dslash for Pentium, QCDOC, BG/L, IBM SP-like nodes (via Bagel)
E.g. McNeile: compute propagators with CPS, measure pions with Chroma, all in the same code.

Software Map (features and directory structure shown as a figure on the slide)

Chroma Lib Structure
Chroma Lattice Field Theory library.
Support for gauge and fermion actions:
– Boson action support
– Fermion action support
  – Fermion actions
  – Fermion boundary conditions
  – Inverters
  – Fermion linear operators
  – Quark propagator solution routines
– Gauge action support
  – Gauge actions
  – Gauge boundary conditions
IO routines
– Enums
Measurement routines:
– Eigenvalue measurements
– Gauge fixing routines
– Gluonic observables
– Hadronic observables
– Inline measurements (eigenvalue, glue, hadron and smear measurements)
– Psibar-psi measurements
– Schroedinger functional
– Smearing routines
– Trace-log support
Gauge field update routines:
– Heatbath
– Molecular dynamics support (Hamiltonian systems, HMC trajectories, HMD integrators, HMC monomials, HMC linear system solver initial guess)
Utility routines:
– Fermion manipulation routines
– Fourier transform support
– Utility routines for manipulating color matrices
– Info utilities

Fermion Actions
– Actions are factory objects (foundries)
– Do not hold gauge fields – only parameters
– Factory/creation functions take a gauge field argument:
  – takes a gauge field – creates a State and applies the fermion BC
  – takes a State – creates a Linear Operator (dslash)
  – takes a State – creates quark propagator solvers
– Linear operators are function objects
  – e.g., class Foo { int operator()(int x); } fred;  // int z = fred(1);
  – argument to CG, MR, etc. – simple functions
– Created with XML
(A schematic of the creation chain is given below.)
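A schematic of the creation chain just described. The class and member names here are simplified illustrations of the pattern, not Chroma's exact interfaces.

// Schematic only: simplified names, not Chroma's class hierarchy.
struct GaugeField;            // the gauge configuration U
struct FermState;             // gauge field with fermion boundary conditions applied
struct LatticeDiracFermion;   // a fermion field

// A linear operator is a function object:  chi = M(psi).
struct LinearOperator {
    virtual void operator()(LatticeDiracFermion& chi,
                            const LatticeDiracFermion& psi) const = 0;
    virtual ~LinearOperator() = default;
};

// The fermion action is a factory: it holds only parameters and
// manufactures states, linear operators and solvers on demand.
struct FermionAction {
    double kappa;   // parameters only -- no gauge field is stored

    FermState*      createState(const GaugeField& u) const;  // applies the fermion BC
    LinearOperator* linOp(const FermState& state) const;     // dslash-based operator
    // ...and, similarly, a quark-propagator solver built from the State
};

// A solver such as CG or MR simply takes the function object as an argument.
void cg_solve(const LinearOperator& M,
              LatticeDiracFermion& psi,
              const LatticeDiracFermion& chi);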

Fermion Actions – XML
– The FermAct tag is the key into a lookup map of constructors
– During construction, the action reads its XML
– The FermBC tag invokes another lookup
The slide's XML fragment lost its tags in transcription; the surviving values are WILSON, 0.11, SIMPLE_FERMBC and false.
XPath used in chroma/mainprogs/main/propagator.cc:
/propagator/Params/FermionAction/FermAct
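A sketch of what such a fragment typically looks like, rebuilt around the surviving values; the tag names follow common Chroma usage but are assumptions here, not a quote of the slide (the stray value false presumably belonged to a boolean parameter whose tag did not survive).

<FermionAction>
  <FermAct>WILSON</FermAct>
  <Kappa>0.11</Kappa>
  <FermionBC>
    <FermBC>SIMPLE_FERMBC</FermBC>
  </FermionBC>
</FermionAction>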

HMC and Monomials
– HMC is built on Monomials
– Monomials define N_f, the gauge action, etc.
– They only provide Mom → deriv(U) and S(U); pseudofermions are not visible
– Have N_f = 2 and rational N_f = 1
– Both 4D and 5D versions
The slide's XML fragment lost its tags; the surviving values are TWO_FLAVOR_WILSON_FERM_MONOMIAL, WILSON, CG_INVERTER, 1.0e…, MINIMAL_RESIDUAL_EXTRAPOLATION_4D_PREDICTOR and 7.
(A sketch of the monomial interface is given below.)
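A schematic of the monomial interface as described above: each monomial exposes only its action contribution S(U) and its force contribution deriv(U), while any pseudofermions stay internal. Names are illustrative, not Chroma's exact signatures.

// Schematic only.
struct GaugeField;      // U
struct MomentumField;   // conjugate momenta / force accumulator

struct Monomial {
    // This term's contribution to the action; pseudofermions, if any,
    // are handled internally and never exposed to the integrator.
    virtual double S(const GaugeField& U) const = 0;

    // Add this term's force dS/dU to the momentum update.
    virtual void deriv(MomentumField& F, const GaugeField& U) const = 0;

    virtual ~Monomial() = default;
};

// The HMC integrator only sees a list of monomials:
//   S_tot(U) = sum_i S_i(U),   F_tot(U) = sum_i deriv_i(U).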

Gauge Monomials
Gauge monomials: plaquette, rectangle, parallelogram.
The monomial constructor will invoke the constructor for the Name in GaugeAction.
The slide's XML fragment lost its tags; the surviving values are WILSON_GAUGEACT_MONOMIAL, WILSON_GAUGEACT, 5.7 and PERIODIC_GAUGEBC.

Chroma – Inline Measurements
– HMC has inline measurements; chroma.cc is inline-only code
– Former main programs are now inline measurements
– Measurements are registered with a constructor call
– A measurement is given the gauge field – no return value
– They only communicate with each other via disk (maybe memory buffers??)
The slide's XML fragment lost its tags; the surviving values are MAKE_SOURCE, ./source_0, MULTIFILE, PROPAGATOR, ./source_0, ./propagator_0 and MULTIFILE.
(A sketch of such a measurement chain is given below.)
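A sketch of how two inline measurements chain through files on disk, built around the surviving values above: MAKE_SOURCE writes ./source_0, and PROPAGATOR reads it and writes ./propagator_0. The surrounding tag names are assumptions in the spirit of Chroma's inline-measurement XML, not a copy of the slide.

<InlineMeasurements>
  <elem>
    <Name>MAKE_SOURCE</Name>
    <Prop>
      <source_file>./source_0</source_file>
      <source_volfmt>MULTIFILE</source_volfmt>
    </Prop>
  </elem>
  <elem>
    <Name>PROPAGATOR</Name>
    <Prop>
      <source_file>./source_0</source_file>
      <prop_file>./propagator_0</prop_file>
      <prop_volfmt>MULTIFILE</prop_volfmt>
    </Prop>
  </elem>
</InlineMeasurements>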

Binary File/Interchange Formats
– Metadata – data describing data, e.g. physics parameters
– Use XML for metadata
File formats:
– Files are mixed mode – XML ASCII + binary
– Using DIME (similar to MIME) to package
– Use BinX (Edinburgh) to describe the binary
Replica-catalog web-archive repositories

For More Information
– U.S. Lattice QCD home page
– The JLab Lattice Portal
– High Performance Computing at JLab