I/O and the SciDAC Software API
Robert Edwards
U.S. SciDAC Software Coordinating Committee
May 2, 2003

JLab SciDAC Activities

QCD-API design and other activities at JLab include:
–Message passing design and code (Level 1) [Watson, Edwards]
  First implementation over MPI is available in C [Edwards]
  GM implementation underway [Chen]
–Basic linear algebra routines (Level 1)
  Prototype design specification available [Edwards]
  Prototype code available [M4 version - Pochinsky]
  New (not yet accepted) implementation [Pochinsky]
  Subset optimized for Pentium 4 [Edwards, McClendon]
–Data-parallel (Level 2)
  Prototype design specification available [Edwards]
  Example implementation – SZIN [Edwards]
  C interface in a first version; a C++ version is under consideration
–Optimized Dirac operators, CG (Level 3)
  Parallel-threaded Pentium 4 version available [Edwards, McClendon]

Lattice QCD – extremely uniform
–Periodic or very simple boundary conditions
–The problem is uniform across the lattice
–Discretize on a lattice – the Dirac operator becomes a finite-difference operator
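
The uniformity is what the API can exploit: every site applies the same nearest-neighbor stencil. A minimal sketch of such a finite-difference operator on a periodic lattice (illustration only, not the JLab code; the 1-D layout, types, and mass term are assumptions):

  /* Illustrative 1-D periodic finite-difference operator, showing the
     uniform nearest-neighbor structure a lattice Dirac operator has in 4-D.
     The extent, types, and mass term are assumptions for the sketch. */
  #include <complex.h>

  #define L 16                          /* sites in one dimension */

  void apply_stencil(double complex dest[L],
                     const double complex src[L],
                     double mass)
  {
      for (int x = 0; x < L; ++x) {
          int xp = (x + 1) % L;         /* forward neighbor (periodic) */
          int xm = (x + L - 1) % L;     /* backward neighbor (periodic) */
          /* same operation at every site: mass term plus hopping terms */
          dest[x] = mass * src[x] + 0.5 * (src[xp] - src[xm]);
      }
  }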

Data-Parallel Programming Model
–Basic uniform operations across the lattice: C(x) = A(x)*B(x)
–Distribute the problem grid across a machine grid
–Want the API to hide the subgrid layout and the communications between nodes (bundling of subgrid faces)
–Basic idea is something like Fortran 90 data-parallel support
–However, given the uniform structure, the API can be implemented without directly writing a compiler
–Providing a meta-language (??)
[Figure: data layout over processors]
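
A data-parallel statement such as C(x) = A(x)*B(x) reduces, on each node, to a loop over that node's subgrid sites; the API hides how the sites are distributed. A minimal sketch under assumed names (the Complex type and the subgrid_volume argument are illustrative, not the API's definitions):

  /* Illustrative site-wise multiply over the local subgrid. Each node owns
     subgrid_volume sites; the global layout and any communication are
     hidden behind the API (names here are assumptions). */
  typedef struct { double re, im; } Complex;

  void local_multiply(Complex *c, const Complex *a, const Complex *b,
                      int subgrid_volume)
  {
      for (int i = 0; i < subgrid_volume; ++i) {
          c[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
          c[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
      }
  }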

QCD-API

Level 3: Dirac operators, CG routines, etc., in C, C++, etc. (organized by MILC or SZIN or CPS, etc.)
Level 2: Data-parallel QCD lattice operations (overlapping algebra and messaging), e.g. A = SHIFT(B, mu) * C; global sums, etc.
Level 1: Lattice-wide linear algebra (no communication), e.g. A = B * C; lattice-wide data movement (pure communication, non-blocking), e.g. Atemp = SHIFT(A, mu)
Underneath: single-site linear algebra API (SU(3), gamma algebra, etc.) and the message-passing API (which knows about the mapping of the lattice onto the network geometry)

API Design Criteria
–Design must allow for overlapping of computation and communications.
–Hide the data layout. Internally, a query can be made to return layout info, in order to support a very limited number of layouts.
–Allow different underlying layouts to optimize implementations on a wide range of machines – notably (super)scalar microprocessors with primary and secondary caches, and also classic vector machines.
–Support for multi-process or multi-threaded computations, hidden from user control.
–Operations do not (by default) convert objects from one layout into another in order to perform the operation; an error is generated if objects are in incompatible layouts.
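
As one illustration of the "hide the layout, but allow an internal query" criterion, an internal layout query might look roughly like the sketch below; the names, enum values, and fields are hypothetical, not part of the actual API specification:

  /* Hypothetical internal layout query (names and fields are assumptions,
     not the QCD-API). Only a small, fixed set of layouts is describable. */
  typedef enum { LAYOUT_LEXICOGRAPHIC, LAYOUT_CHECKERBOARD } LayoutKind;

  typedef struct {
      LayoutKind kind;        /* which of the supported layouts is in use */
      int        nd;          /* number of lattice dimensions */
      int        subgrid[4];  /* local subgrid extent per dimension */
  } LayoutInfo;

  /* Each implementation reports its own layout; user code never assumes one. */
  void query_layout(LayoutInfo *info);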

Motivation

Consider the equation
  dest(x) = src1(x)*src2(x+dir)   (for all x)
This can be decomposed as
  tmp1(x) = src2(x+dir)
  dest(x) = src1(x)*tmp1(x)
Implementation 1: this can be implemented as two functions as follows:
  Shift(tmp1, src2, FORWARD, dir);
  Multiply(dest, src1, tmp1, REPLACE);
where FORWARD and dir give the sign of the shift and the direction, and REPLACE is an option indicating that dest is overwritten (allowing other options, like +=, later).
Implementation 2: if Shift is changed to also return its result, this can be rewritten as:
  Multiply(dest, src1, Shift(src2, FORWARD, dir), REPLACE);

Performance Considerations

Predefined channels are opened between neighboring nodes of a d-dimensional grid.
Overlapping communications and computations, e.g. C(x) = A(x)*shift(B,d):
–On clusters of scalar nodes, the face of a subgrid is sent non-blocking to a neighboring node, say in the forward direction.
–The neighboring node in the corresponding backward direction sends its face into a preallocated buffer.
–While this is going on, the specified operation is performed on the interior sites.
–A wait is issued and the operation is performed on the face.
[Figure: data layout over processors]
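
Since the first Level 1 message-passing implementation runs over MPI, the same send/compute/wait pattern can be illustrated with plain non-blocking MPI calls. This is only a schematic (the face size, neighbor ranks, and compute steps are placeholders), not the API code, which appears on a later slide:

  /* Schematic overlap of a face exchange with interior computation using
     non-blocking MPI (illustration only; sizes and neighbors are placeholders). */
  #include <mpi.h>

  #define FACE_SIZE 1024

  void multiply_with_shift(double *send_face, double *recv_face,
                           int fwd_rank, int bwd_rank, MPI_Comm comm)
  {
      MPI_Request req[2];

      /* post the receive for the neighbor's face, then send ours */
      MPI_Irecv(recv_face, FACE_SIZE, MPI_DOUBLE, bwd_rank, 0, comm, &req[0]);
      MPI_Isend(send_face, FACE_SIZE, MPI_DOUBLE, fwd_rank, 0, comm, &req[1]);

      /* ... perform the operation on the interior sites here ... */

      MPI_Wait(&req[0], MPI_STATUS_IGNORE);  /* the face has arrived */
      /* ... perform the operation on the face sites here ... */
      MPI_Wait(&req[1], MPI_STATUS_IGNORE);  /* send buffer may be reused */
  }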

Level 2 Example

/* Nd = number of dimensions
   Nc = number of colors
   Volume = Nd-volume
   Forward is an enum. There is also a Backward.
   The type specifiers below are suggestive, but not fleshed out yet */
Lattice_Gauge u[Nd], tmp_0, tmp_1;
Lattice_Real wplaq_tmp;
double tmp, w_plaq;
int mu, nu;

w_plaq = 0;
for(mu=1; mu < Nd; ++mu)
  for(nu=0; nu < mu; ++nu)
  {
    /* tmp_0 = u(x+mu,nu)*u_dag(x+nu,mu) */
    Multiply(tmp_0, Shift(u[nu],FORWARD,mu), Shift(Conjugate(u[mu]),FORWARD,nu), REPLACE);

    /* tmp_1 = tmp_0*u_dag(x,nu) = u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu) */
    Multiply(tmp_1, tmp_0, Conjugate(u[nu]), REPLACE);

    /* tmp_0 = u(x,mu)*tmp_1 = u(x,mu)*u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu) */
    Multiply(tmp_0, u[mu], tmp_1, REPLACE);

    /* wplaq_tmp = tr(tmp_0) = tr(u(x,mu)*u(x+mu,nu)*u_dag(x+nu,mu)*u_dag(x,nu)) */
    Trace(wplaq_tmp, tmp_0, REAL_PART);

    /* w_plaq += sum(wplaq_tmp); */
    Sum(w_plaq, wplaq_tmp, ADD);
  }
w_plaq = 2.0*w_plaq/(float)(volume*Nd*(Nd-1)*Nc);

Data Types

Fields have various types (indices): lattice fields or scalars (the same on all sites).
Index type (at a site):
  Gauge fields: Product(Matrix(Nc), Scalar)
  Fermions: Product(Vector(Nc), Vector(Ns))
  Scalars: Scalar
  Propagators: Product(Matrix(Nc), Matrix(Ns))
Support compatible operations on types:
  Matrix(color)*Matrix(spin)*Vector(color,spin)
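
As a concrete, purely illustrative picture of these per-site index structures, the types might be laid out as below; the struct names and the fixed Nc = 3, Ns = 4 are assumptions for the sketch, not the API's definitions:

  /* Illustrative per-site data types (names, layout, Nc and Ns are assumptions). */
  #include <complex.h>

  #define NC 3   /* colors */
  #define NS 4   /* spins  */

  typedef struct { double complex e[NC][NC]; }         ColorMatrix;   /* gauge link: Matrix(Nc) */
  typedef struct { double complex c[NS][NC]; }         DiracFermion;  /* Vector(Nc) x Vector(Ns) */
  typedef struct { double complex e[NS][NS][NC][NC]; } Propagator;    /* Matrix(Nc) x Matrix(Ns) */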

C++ Interface for Level 2

Unary operations: operate on one source and return (or into) a target
  Lattice_Field Shift(Lattice_Field source, enum sign, int direction);
  void Copy(Lattice_Field dest, Lattice_Field source, enum option);
  void Trace(double dest, Lattice_Field source, enum option);
Binary operations: operate on two sources and return into a target
  void Multiply(Lattice_Field dest, Lattice_Field src1, Lattice_Field src2, enum option);
  void Compare(Lattice_Bool dest, Lattice_Field src1, Lattice_Field src2, enum compare_func);
Broadcasts: broadcast throughout the lattice
  void Fill(Lattice_Field dest, float val);
Reductions: reduce through the lattice
  void Sum(double dest, Lattice_Field source);
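
Putting a few of these together in the same suggestive style as the deck's own examples (not standalone code, since the types are not fleshed out), a user-level fragment in the spirit of the Motivation slide might read:

  /* Illustrative use of the Level 2 interface listed above; FORWARD and
     REPLACE are the enum values used elsewhere in this deck. */
  Lattice_Field dest, src1, src2;
  double total;

  Fill(dest, 0.0);                                         /* broadcast a constant */
  Multiply(dest, src1, Shift(src2, FORWARD, 0), REPLACE);  /* dest(x) = src1(x) * src2 shifted forward in direction 0 */
  Sum(total, dest);                                        /* global reduction over the lattice */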

C Interface for Level 2

Binary: multiplication is a fundamental operation and quite important for speed.
  void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)
Here T1, T2, T3 are shortened type names for the types Type1, Type2 and Type3 (LatticeGaugeF, LatticeDiracFermionF, LatticeHalfFermionF, LatticePropagatorF), and op3 are options like:
  nnrr = a*b                   nnnr = -a*b
  ncrr = a*conj(b)             ncnr = -a*conj(b)
  cnrr = conj(a)*b             cnnr = -conj(a)*b
  ccrr = conj(a)*conj(b)       ccnr = -conj(a)*conj(b)
  nnar = r + a*b               nnsr = r - a*b
  ncar = r + a*conj(b)         ncsr = r - a*conj(b)
  cnar = r + conj(a)*b         cnsr = r - conj(a)*b
  ccar = r + conj(a)*conj(b)   ccsr = r - conj(a)*conj(b)
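
As an illustration of how this naming scheme expands, one instantiation could look like the sketch below; the type abbreviation "G" (for LatticeGaugeF), the struct layout, and the lattice volume are invented for the sketch, since the slide does not spell out the actual abbreviations:

  /* Hypothetical expansion of the naming scheme: site-by-site multiplication
     of two gauge fields with the "nnrr" option (r = a*b). The abbreviation
     "G", the struct layout, and VOLUME are assumptions. */
  #include <complex.h>

  #define NC 3
  #define VOLUME (16*16*16*32)          /* illustrative lattice volume */

  typedef struct { float complex e[NC][NC]; } LatticeGaugeF_site;

  void QCDF_mult_GGG_nnrr(LatticeGaugeF_site *r,
                          const LatticeGaugeF_site *a,
                          const LatticeGaugeF_site *b)
  {
      for (int s = 0; s < VOLUME; ++s)  /* every lattice site */
          for (int i = 0; i < NC; ++i)
              for (int j = 0; j < NC; ++j) {
                  float complex sum = 0;
                  for (int k = 0; k < NC; ++k)
                      sum += a[s].e[i][k] * b[s].e[k][j];
                  r[s].e[i][j] = sum;   /* nnrr: r = a*b, replacing r */
              }
  }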

C Example of Level 2 calling Level 1

Scalar processor:
  void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)
  {
    for(I=0; I < volume; ++I)
      QCDF_mult_T1T2T3_op3_S(r[I], a[I], b[I]);
  }

Parallel threaded:
  void QCDF_mult_T1T2T3_op3(Type3 *r, const Type1 *a, const Type2 *b)
  {
    thread_dispatch(&foo, subgrid_volume, r, a, b);
  }

  void foo(int start, int end, va_list args)
  {
    for(I=start; I < end; ++I)
      QCDF_mult_T1T2T3_op3_S(r[I], a[I], b[I]);
  }

Communications-Computations Overlap

T = shift(A, d): sends the face in direction d and receives in -d
  Type *QCDF_shift_T(Type *a, int d)
  {
    for(I=0; I < size_of_face; ++I)
      QCDF_copy_T1T2_repl(msgmem_f[d][I], a[offset[I]]);
    QMP_start(msgh_b[d]);    // start receive
    QMP_start(msgh_f[d]);    // start send
  }

D = B * T: calls wait on the send/receive
  void QCDF_mult_T1T2T3_op3(Type3 *d, const Type1 *b, const Type2 *t)
  {
    thread_dispatch(&interior, interior_volume, d, b, t);
    QMP_wait(msgh_b);        // wait on receive
    thread_dispatch(&face, face_size, d, b, t);
    QMP_wait(msgh_f);        // wait on send
  }

  void interior(int start, int end, va_list args)   // similar for face
  {
    for(I=start; I < end; ++I)
      QCDF_mult_T1T2T3_op3_S(d[off3[I]], b[off1[I]], t[off2[I]]);
  }

SZIN: Data-Parallel

LHPC (Lattice Hadron Physics Collaboration) uses SZIN:
–M4-preprocessed object-oriented data-parallel programming interface
–Base code is C
–Supports scalar nodes, multi-process SMP, multi-threaded SMP, vector, CM-2, CM-5, and clusters of SMPs
–The Level 2 routines are generated at M4 time into a repository
–Supports overlapping communications/computations
–No exposed Level 1 layer
–Can (and does) use any Level 1 or Level 2 routines, transparently to the user
–Optimized Level 3 routines (Dirac operator) available
–Code publicly available by tar and CVS
Plan to move to a conventional OO language.

Linear Algebra (Level 1): Intel Pentium with Streaming SIMD Extensions (SSE)
–Streaming SIMD (Single Instruction Multiple Data) Extensions are length-4 vector instructions
–Main benefit is lower instruction bandwidth and easy scheduling
–First introduced in the Pentium 2 as MMX instructions (32-bit integer)
–In the Pentium 3, SSE has eight 128-bit registers used in single precision, so each vector register holds and operates on four 32-bit values
–In the Pentium 4, SSE2 was introduced, which allows use of the same registers in 64-bit (double precision) mode
–The Pentium 4 is capable of four 32-bit operations per clock (e.g., two mult-adds) or two 64-bit operations (e.g., one mult-add) per clock
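
For concreteness, a minimal single-precision SSE fragment using the standard compiler intrinsics (not the hand-optimized code the Level 1 routines actually use) multiplies four packed floats in one instruction:

  /* Minimal SSE illustration: four 32-bit multiplies on one 128-bit
     register, as described above. Not the JLab-optimized code. */
  #include <xmmintrin.h>

  void multiply4(float *out, const float *a, const float *b)
  {
      __m128 va = _mm_loadu_ps(a);     /* load four 32-bit floats */
      __m128 vb = _mm_loadu_ps(b);
      __m128 vc = _mm_mul_ps(va, vb);  /* four multiplies at once */
      _mm_storeu_ps(out, vc);          /* store the four results */
  }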

Level 1 Multiply Performance (32bit)

Level 1 Multiply Performance (64bit)

Single Node

Parallel Version on Single Node

Typical steps in an LQCD calculation
–Some code generates gauge configurations
  Largish binary image
  Need a header describing the context
–Read a configuration and compute quark propagators (solve the Dirac equation)
  Even larger binary image
  Carry the gauge header and a new propagator header – nested headers
–Read propagators and the configuration to compute matrix elements (hadron decays, etc.)
  Generates important but much smaller (often ASCII) data sets
  Usually a large number of quantum numbers (channels)

I/O and Data Formats
–Currently using the Namelist (from Fortran) data format; it supports nested groups and resources (tags) within groups.
–Switching now to a new nested-header format for configurations, propagators, input files, etc.
–Will propose to use XML for ASCII-formatted headers
  These headers can be used for a user-extensible replica(ted) catalog of the locations of configurations and propagators.
  Search user data using SQL to find locations.
–Use XML input and output data files
  Parameter input for program execution
  Output for cooked data can be sent over the net to a receiver
  Want an SQL-like database for storing and retrieving data
–Problems:
  Long computation time for typical SQL commits or rollbacks
  SQL data types are too primitive – want multi-dimensional arrays and complex numbers
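
As an illustration of the kind of nested, user-extensible XML header being proposed (every tag name and value below is invented for the sketch, not a specification), a propagator header carrying its parent configuration's header could look roughly like the string constant in this C fragment:

  /* Sketch of a nested XML header; all tag names and values are invented. */
  const char *example_propagator_header =
      "<propagator>\n"
      "  <quark_mass>0.03</quark_mass>\n"
      "  <source_type>point</source_type>\n"
      "  <gauge_configuration>  <!-- nested header from the parent configuration -->\n"
      "    <lattice_size>16 16 16 32</lattice_size>\n"
      "    <beta>6.0</beta>\n"
      "    <ensemble_id>example_run_001</ensemble_id>\n"
      "  </gauge_configuration>\n"
      "</propagator>\n";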