QDP++ and Chroma Robert Edwards Jefferson Lab

Data-Parallel Programming Model
Basic uniform operations across lattice: C(x) = A(x)*B(x)
Distribute problem grid across a machine grid
Want API to hide subgrid layout and communications (bundling subgrid faces) between nodes
Data-parallel: basic idea something like Fortran 90, with more complex types
Implement via a programming interface (API), e.g. a C interface
Data layout over processors
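
As a rough sketch of what the API hides (the types and names below are illustrative stand-ins, not part of any proposed interface): each node applies the same site-wise operation over its own subgrid, so user code never writes the loop over sites.

  #include <complex>
  #include <cstddef>
  #include <vector>

  // Hypothetical stand-in for a per-site value; a real field would hold an
  // Nc x Nc color matrix, a fermion vector, etc.
  using Site = std::complex<double>;

  // The data-parallel API hides this loop: C(x) = A(x) * B(x) over the sites
  // this node owns (communications only enter for shifted operands).
  void multiply_replace(std::vector<Site>& C,
                        const std::vector<Site>& A,
                        const std::vector<Site>& B)
  {
      for (std::size_t x = 0; x < C.size(); ++x)   // local subgrid sites only
          C[x] = A[x] * B[x];
  }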

QCD Data Types
Fields have various types (indices):
  Lattice fields or scalar (same on all sites)
  Index type (at a site):
    Gauge fields:  Product(Matrix(Nc), Scalar)
    Fermions:      Product(Vector(Nc), Vector(Ns))
    Scalars:       Scalar
    Propagators:   Product(Matrix(Nc), Matrix(Ns))
Support compatible operations on types:
  Matrix(color)*Matrix(spin)*Vector(color,spin)
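
For concreteness, a minimal sketch (the types here are illustrative stand-ins, not the API) of how a compatible operation such as Matrix(color)*Vector(color,spin) contracts only the matching index at a single site:

  #include <array>
  #include <complex>

  constexpr int Nc = 3, Ns = 4;                 // colors and spins
  using Cmplx = std::complex<double>;

  using ColorMatrix = std::array<std::array<Cmplx, Nc>, Nc>;   // gauge link at a site
  using Fermion     = std::array<std::array<Cmplx, Nc>, Ns>;   // spin x color components

  // Matrix(color) * Vector(color,spin): the color index is contracted,
  // while the spin index is untouched and simply carried along.
  Fermion mult(const ColorMatrix& U, const Fermion& psi)
  {
      Fermion chi{};
      for (int s = 0; s < Ns; ++s)
          for (int i = 0; i < Nc; ++i)
              for (int j = 0; j < Nc; ++j)
                  chi[s][i] += U[i][j] * psi[s][j];
      return chi;
  }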

QCD-API
Dirac Operators, CG Routines etc. (C, C++, etc., organized by high level codes etc.)
Level 3: Data Parallel QCD Lattice Operations (overlapping Algebra and Messaging)
         e.g. A = SHIFT(B, mu) * C; global sums, etc.
Level 2: Lattice Wide Lin Alg (No Communication), e.g. A = B * C
         Lattice Wide Data Movement (Blocking Communication), e.g. Atemp = SHIFT(A, mu)
Level 1: QLA: Linear Algebra API (SU(3), gamma algebra etc.)
         QMP: Message Passing API (Communications)
         I/O (XML), Serialization/Marshalling
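
A minimal sketch of why the Level 3 layer exists (all names below are hypothetical): a fused A = SHIFT(B, mu) * C can overlap the face communication with the algebra on interior sites, whereas the Level 2 decomposition (a blocking SHIFT followed by a communication-free multiply) runs the two steps back to back.

  // Hypothetical per-node field; subgrid storage elided.
  struct Field { /* per-node subgrid data */ };

  void start_face_exchange(const Field&, int /*mu*/) { /* post non-blocking comms */ }
  void wait_face_exchange()                          { /* complete them */ }
  void multiply_interior(Field&, const Field&, const Field&) { /* sites needing no remote data */ }
  void multiply_boundary(Field&, const Field&, const Field&) { /* sites on the subgrid faces */ }

  // Level 3 style: communication and computation overlap.
  void fused_shift_multiply(Field& A, const Field& B, const Field& C, int mu)
  {
      start_face_exchange(B, mu);     // communication in flight...
      multiply_interior(A, B, C);     // ...while interior algebra proceeds
      wait_face_exchange();           // remote faces have arrived
      multiply_boundary(A, B, C);     // finish the face sites
  }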

Level 2 Functionality
Object types/declarations: LatticeGaugeF, RealF, ComplexF, …
Unary operations: operate on one source and return (or into) a target
  void Shift(LatticeGaugeF target, LatticeGaugeF source, int sign, int direction, Subset s);
  void Copy(LatticeFermionF dest, LatticeFermionF source, Subset s);
  void Trace(LatticeComplexF dest, LatticeGaugeF source, Subset s);
Binary operations: operate on two sources and return into a target
  void Multiply(LatticePropagatorF dest, LatticePropagatorF src1, LatticeGaugeF src2, Subset s);
  void Gt(LatticeBooleanF dest, LatticeRealF src1, LatticeRealF src2, Subset s);
Broadcasts: broadcast throughout the lattice
  void Fill(LatticeGaugeF dest, RealF val, Subset s);
Reductions: reduce through the lattice
  void Norm2(RealD dest, LatticeFermionF source, Subset s);
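
As an illustration of why reductions get their own category (a sketch of one possible implementation, not the API's): a Norm2-style reduction needs a local sum over the node's subgrid followed by a global sum across nodes, for example with MPI.

  #include <mpi.h>
  #include <complex>
  #include <vector>

  // Sketch: squared norm of a field distributed over nodes. The per-site
  // data is illustrative; a real fermion holds Nc x Ns complex components.
  double norm2(const std::vector<std::complex<double>>& local_sites)
  {
      double local = 0.0;
      for (const auto& z : local_sites)        // sum over this node's subgrid
          local += std::norm(z);               // |z|^2

      double global = 0.0;                     // combine partial sums from all nodes
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return global;
  }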

C Interface for Level 2
Binary: multiplication has many varieties
  void QDP_F_T3_eqop_T1op1_mult_T2op2(restrict Type3 *r, Type1 *a, Type2 *b, const Subset s)
Multiply operations, where T1, T2, T3 are shortened type names for the types Type1, Type2 and Type3:
  LatticeGaugeF, LatticeDiracFermionF, LatticeHalfFermionF, LatticePropagatorF,
  ComplexF, LatticeComplexF, RealF, LatticeRealF, …
and eqop, op1, op2 considered as a whole imply operations like:
  eq,,      r = a*b                     eqm,,     r = -a*b
  eq,,A     r = a*conj(b)               eqm,,A    r = -a*conj(b)
  eq,A,     r = conj(a)*b               eqm,A,    r = -conj(a)*b
  eq,A,A    r = conj(a)*conj(b)         eqm,A,A   r = -conj(a)*conj(b)
  peq,,     r = r + a*b                 meq,,     r = r - a*b
  peq,,A    r = r + a*conj(b)           meq,,A    r = r - a*conj(b)
  peq,A,    r = r + conj(a)*b           meq,A,    r = r - conj(a)*b
  peq,A,A   r = r + conj(a)*conj(b)     meq,A,A   r = r - conj(a)*conj(b)
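
To make the naming scheme concrete (purely illustrative: the shortened type code "G" for LatticeGaugeF and the resulting function name are hypothetical, not agreed API names), the peq,A, variant with three gauge-field arguments would expand to something like the following, mirroring the slide's C prototype:

  /* Hypothetical expansion of QDP_F_T3_eqop_T1op1_mult_T2op2 with
     Type1 = Type2 = Type3 = LatticeGaugeF (abbreviated "G" here),
     eqop = peq, op1 = A, op2 = (empty), i.e.  r = r + conj(a)*b  on subset s. */
  void QDP_F_G_peq_GA_mult_G(restrict LatticeGaugeF *r,
                             LatticeGaugeF *a, LatticeGaugeF *b,
                             const Subset s);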

QDP++
Why C++?
–Many OOP benefits:
  Simplify naming with operator overloading
  Strong type checking: hard to call the wrong routine
  Can create expressions!
  Closures straightforward: can return types easily
  Can straightforwardly automate memory allocation/destruction
  Good support tools (e.g., doxygen, a documentation generator)
–Maintenance:
  Generate large numbers of variant routines via templates
  Contrary to popular opinion, there are compilers that meet the standard; the minimum level is GNU g++, and we expect to use the full 3.0 standards
  Simple to replace crucial sections (instantiations) with optimized code
  Initial tests: g++ generating decent assembly!
–User acceptance:
  Language in common use

Flavors
Operation syntax comes in two variants:
Global functional form (optimization straightforward):
  void Multiply_replace(Type3& dest, const Type1& src1, const Type2& src2);
Operator form (optimizations require delayed evaluation):
  Type3 operator*(const Type1& src1, const Type2& src2);
  void operator=(Type3& dest, const Type3& src1);
where the types Type1, Type2 and Type3 are again names like LatticeGaugeF, LatticeDiracFermionF, …
Declarations: objects are declared and allocated across the lattice (not initialized)
  LatticeGaugeF a, b;

Templates for Operations
Without templates, multiple overloaded declarations are required:
  void Multiply_op3(LatticeComplexF& dest, const RealF& src1, const LatticeComplexF& src2);
  void Multiply_op3(LatticeDiracFermionF& dest, const LatticeGaugeF& src1, const LatticeDiracFermionF& src2);
Can merge similar operations with templates:
  template<class T3, class T1, class T2>
  void Multiply_op3(T3& dest, const T1& src1, const T2& src2);
Optimizations easily provided via specializations:
  template<>
  void Multiply_op3(LatticeDiracFermionF& dest, const LatticeGaugeF& src1, const LatticeDiracFermionF& src2)
  { /* Call your favorite QLA or QDP routine here */ }
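
A minimal, self-contained illustration of that overload-resolution pattern with empty stand-in types (the struct bodies and print statements are placeholders, not QDP++ code): the hot case resolves to the specialization, everything else falls back to the generic template.

  #include <cstdio>

  struct LatticeGaugeF {};          // stand-in types, purely illustrative
  struct LatticeDiracFermionF {};

  // Generic version: handles any compatible (dest, src1, src2) combination.
  template <class T3, class T1, class T2>
  void Multiply_op3(T3& dest, const T1& src1, const T2& src2)
  {
      std::puts("generic template");
  }

  // Full specialization: the hot gauge-times-fermion case gets tuned code.
  template <>
  void Multiply_op3(LatticeDiracFermionF& dest, const LatticeGaugeF& src1,
                    const LatticeDiracFermionF& src2)
  {
      std::puts("optimized specialization");   // call a QLA routine here
  }

  int main()
  {
      LatticeGaugeF u;
      LatticeDiracFermionF psi, chi;
      Multiply_op3(chi, u, psi);    // picks the specialization
      Multiply_op3(chi, psi, psi);  // falls back to the generic template
  }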

Templates for Declarations
Expect to have data-driven (and not method-driven) class types:
  class LatticeDiracFermionF {
  private:
    DiracFermionF sites[layout.Volume()];
  };
With templates:
  template<class T> class Lattice {
  private:
    T sites[layout.Volume()];
  };
  typedef Lattice<DiracFermionF> LatticeDiracFermionF;
Type composition: the fiber (site) types can themselves be made from templates (an implementation choice)
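
One way the fiber types might themselves be composed from templates (an illustrative sketch; the names, nesting order and fixed Nc/Ns values are assumptions, not the QDP++ design):

  #include <complex>

  // Hypothetical nested site types.
  template <class T, int N> struct SiteVector { T elem[N]; };
  template <class T, int N> struct SiteMatrix { T elem[N][N]; };

  typedef std::complex<float>          ComplexF;        // stand-in complex type
  typedef SiteVector<ComplexF, 3>      ColorVectorF;    // color index
  typedef SiteVector<ColorVectorF, 4>  DiracFermionF;   // spin outer, color inner
  typedef SiteMatrix<ComplexF, 3>      ColorMatrixF;    // gauge link

  template <class T> struct Lattice { /* T sites[volume]; as on the slide */ };

  typedef Lattice<DiracFermionF> LatticeDiracFermionF;
  typedef Lattice<ColorMatrixF>  LatticeGaugeF;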

Implementation Example
Use a container class to hold info on the object
  Easily separates the functional and operator syntactic forms
  Closures implemented at this container level
Underlying object class does the real work
No additional overhead: use inlining
Layering: the operator and functional syntax sit on the QDP object syntax, which sits on
  QLA (optimized linear algebra routines) and QMP (message passing / communications)

  // Functional form
  template<class T>
  inline void Multiply_replace(QDP<T>& dest, const QDP<T>& src1, const QDP<T>& src2)
  { dest.obj().mult_rep(src1.obj(), src2.obj()); }

  // Operator form without closure optimizations
  template<class T>
  inline QDP<T> operator*(const QDP<T>& src1, const QDP<T>& src2)
  { QDP<T> tmp; tmp.obj().mult_rep(src1.obj(), src2.obj()); return tmp; }

Example
  // Declarations and construction - no default initialization
  LatticeGaugeF u, tmp0, tmp1;

  // Initialization
  Gaussian(u);
  Zero(tmp0);
  tmp1 = 1.0;

  // Two equivalent examples
  Multiply_replace(tmp0, u, tmp1);
  tmp0 = u * tmp1;

  // Three equivalent examples
  Multiply_conj2_shift2_replace(tmp1, u, tmp0, FORWARD, mu);
  Multiply_replace(tmp1, u, Conj(Shift(tmp0, FORWARD, mu)));
  tmp1 = u * Conj(Shift(tmp0, FORWARD, mu));

  {
    // Change the default subset as a side-effect of a context declaration
    Context foo(Even_subset);
    tmp1 = u * Conj(Shift(tmp0, FORWARD, mu));  // Only on the even subset
    // Context popped on destruction
  }

SZIN
SZIN and the Art of Software Maintenance
–M4-preprocessed object-oriented data-parallel programming interface
–Base code is C
–High level code implemented over an architecture-dependent Level 2 layer
–Supports scalar nodes, multi-process SMP, multi-threaded SMP, vector, CM-2, CM-5, clusters of SMPs
–The Level 2 routines are generated at M4 time into a repository
–Supports overlapping communications/computations
–No exposed Level 1 layer
–Can/does use (transparently to the user) any external Level 1 or 2 routines
–Optimized Level 3 routines (Dirac operator) available
–Code publicly available by tar and CVS, with many contributors and no instances of separate code camps
Data parallel interface
–SZIN is really the data-parallel interface and not the high level code
–Some projects have only used the low level code and library
–Could easily produce a standalone C implementation of the Level 2 API

Problems with SZIN
Why change the current system? Cons:
–Maintenance:
  M4 has no scope: the preprocessor runs over a file with no knowledge of C syntax or scope, so bizarre errors can result
  M4 has no programming support for holding type info; must invent it
  Interface problem: awkward to hook onto Level 1-like optimizations
–Extensibility:
  Problems with C: awkward to write generic algorithms, e.g. CG for different Dirac operators (like Clover requiring sigma.F but not Wilson), or algorithms on lattices or just single sites
  No real language support for closures
  C very limited on type checking
  No automated memory allocation/freeing
  No simple way to write expressions
–User acceptance:
  M4 not familiar (and limited) to most users

Current Development Efforts
QDP++ development
–Level-2 C/C++ API not yet finalized by SciDAC; documentation underway
–Currently testing the basic object model code needed for the single node version
–Shortly start the parallel version, which uses the object model code
–Will incorporate QLA library routines once agreed upon by SciDAC
Stand-alone demonstration/prototyping suite
–Some test routines and drivers for physics routines live in the QDP++ test directory
–David/Richard/Andrew/others writing some simple stand-alone routines (like pure gauge) as a prototyping effort
–Expect to produce a user's guide
SZIN port
–Porting existing SZIN code with merging of the MIT and Maryland code bases
–The M4 architectural support part is completely removed; only QDP is used
–Will become another high level code base using implementation-dependent QDP

Hardware Acquisitions
FY02
–Now: 128 single-processor P4, 2 GHz, Myrinet; expect ~130 Gflops
–Late summer: probably 256 single-processor P4, 2 GHz, 3-dimensional Gigabit Ethernet; expect ~200 Gflops
FY03
–Late summer: probably 256 single-processor P4, 2.4 GHz, 3-dimensional Gigabit Ethernet; expect ~240 Gflops