Presentation transcript: "Looking for a (standard) Common Format for (Quantum) Computational Chemistry"
1 Looking for a (standard) Common Format for (Quantum) Computational Chemistry
A WG activity within COST Action D23 (WG D23/0006/01)
Elda Rossi, Andrew Emerson - CINECA
Gian Luigi Bendazzoli, Antonio Monari - Università di Bologna
Renzo Cimiraglia, Celestino Angeli, Stefano Borini - Università di Ferrara
Daniel Maynau, Stefano Evangelisti - IRSAMC, Toulouse
José Sanchez-Marin - Universitat de Valencia
Peter Szalay - Eötvös Loránd University
Rosa Caballol - Universitat Rovira i Virgili, Tarragona
This talk is about the activity recently carried out in the framework of "COST in Chemistry", an EC-funded project.
The final aim is to provide a "workflow" tool that lets researchers collaborate by exchanging different programs. To this end, the first problem we faced, and the core of this presentation, was defining a common format for Quantum Chemistry programs.
An XML-based format is proposed, designed to describe a quantum mechanical system in a quite general way. This format is used for a repository where all data on the system under investigation are maintained. From the repository, data are retrieved and converted to the input stream of the specific program to be run. The conversion is done by a wrapper code, specifically designed for each program. Two possible ways to write the wrappers are discussed, using the Fortran and Python programming languages respectively.
2 Motivation for the work
To build a meta-system for supporting research collaboration in the field of
"Localised Orbitals in post-SCF methods … Linear Scaling methods in a Multi-Reference context"
3 The scenario
- Different laboratories need to collaborate
- Different "home-made" codes need to be used together, since they give different views of the same problem
- General purpose "basic" codes are needed to pre-compute data in a sort of pipeline
- Programmes should remain on their original sites under the responsibility of their authors
- Different platforms
- Network connections (grid architecture)
- Workflow
4 The need for a Common Format
The first problem we faced: how can different codes (on different platforms) communicate?
We need a Common Format for (at least) Quantum Chemistry codes.
5 Preliminary steps
Looking around …
- CML has been available for a long time
- XML is used by Accelrys for internal files
- XML is used by ArgusLab for internal files
- None of them is completely suited to computational chemistry: mainly structural chemistry, no Quantum Chemistry properties
XML seems the best technology, so we decided to try another XML-based format.
HDF5 looked nice for storing the large binary data typical of QC.
6 Why XML
Pro:
- Standard
- Self referencing
- Extensible
- Some experience (CML) already exists
Con:
- (Verbose)
- Very few applications to "science"
- Seems difficult to use with Fortran
- Apparently no specific data types for quantum mechanical concepts (e.g. integrals, wavefunction, orbitals, coefficients, energy levels, etc.)
7 Which codes to be integrated: the meta-system building blocks
- CAS-DI (Multi-Reference Configuration Interaction)
- EPCISO (Spin-orbit Configuration Interaction)
- NEVPT (MR PT, P-Variational approaches)
- LOCNAT (Localized Multireference algorithm)
- FCI (Full Configuration Interaction)
- PROP (Property Calculation)
- COLUMBUS (General ab-initio electronic package)
- DALTON (General ab-initio package)
- MolPRO, MolCAS, Nwchem, Gaussian, …
8 How the engine should work
- Leaves the program unchanged
- One wrapper for each program: if a code is added, only one wrapper has to be written
[Diagram: Data Repository (XML/HDF) -> IN-wrapper -> IN-files -> Program -> OUT-files -> OUT-wrapper]
9 Repository, Wrappers, Workflow engine
[Diagram labels: Input tool, Data repository, wrapper, wrapper, In files, Out files, code1, code2, code3; data items labelled x, y, u, v, w]
10 QCML: an XML format for QC
In order to be as general as possible, we need to write down a hierarchical schema of Quantum Chemistry quantities.
As a first approximation, three domains can be identified:
- FACTS: base facts, the initial data describing the physics of the system
- DERIVED: quantities computed from FACTS using QC algorithms (energies, properties, integrals, coefficients, …)
- W-FLOW: which codes are in the pipeline, specific input parameters, data, …
A base fact is a fact that is a given in the world and is remembered (stored) in the system.
A derived fact is created by an inference or a mathematical calculation from terms, facts, other derivations, or even action assertions.
11 FACT: molecule
<system title date program author>
  <molecule nElectrons charge spinMultiplicity spaceSymmetry>
    <symmetry groupName/>
    <geometry type unit numAtoms symmetryRef>
      <atom symbol isotope x3 y3 z3/>
    <basis name type numOrbitals>
      <atomBase angularMomMAX symbol>
        <angularMom value symbol numOrbitals>
          <orbital id numPrimitives>
            <exps/> <coeffs/>
Symmetry: group name & other symmetry data
Geometry: only Cartesian, full or symmetry-unique
Basis: by name or fully defined
13 FACT: molecule/basis
Name: vdz
Alias: molpro:vdz/molcas:Vdz/g03:V-dz/…
Type: spherical
definedFor: C+O+H+…
atomBase: H (max angularMom, numPrimitives, numAO)
  angularMom: s (numAO)
    orbital: 1st (numPrimitives)
      exponents:
      coefficients:
    orbital: 2nd …
  angularMom: p …
atomBase: O
atomBase: C
This archive could be organised in XML form, or better, use already available data banks (EMSL Basis Set Library).
14 DERIVED data: computedData
<system …>
  <computedData>
    <energy unit levelOfTheory quality value>
      <state spaceSymmetry spinMultiplicity excitationLevel/>
    <property unit levelOfTheory quality value>
      <state "bra" spaceSymmetry spinMultiplicity excitationLevel/>
      <state "ket" spaceSymmetry spinMultiplicity excitationLevel/>
      <operator order name/>
    <file address URL/>
A "schema" has been written for QCML.
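A minimal Python sketch of how a wrapper might read a computedData block like the one above through the standard-library DOM interface. Only the element and attribute names come from the slide; the nesting, all attribute values, the file address and the helper names `read_energies` and `read_file_refs` are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical QCML fragment: names follow the slide, values are invented.
QCML_TEXT = """<system title="water test">
  <computedData>
    <energy unit="hartree" levelOfTheory="SCF" quality="converged" value="-76.02">
      <state spaceSymmetry="A1" spinMultiplicity="1" excitationLevel="0"/>
    </energy>
    <file address="integrals.h5"/>
  </computedData>
</system>"""


def read_energies(xml_text):
    """Return (levelOfTheory, value, unit) for every <energy> element."""
    doc = minidom.parseString(xml_text)
    energies = []
    for node in doc.getElementsByTagName("energy"):
        energies.append((
            node.getAttribute("levelOfTheory"),
            float(node.getAttribute("value")),
            node.getAttribute("unit"),
        ))
    return energies


def read_file_refs(xml_text):
    """Return the address of every <file> element (a reference, not the data)."""
    doc = minidom.parseString(xml_text)
    return [n.getAttribute("address") for n in doc.getElementsByTagName("file")]


energies = read_energies(QCML_TEXT)
file_refs = read_file_refs(QCML_TEXT)
print(energies)
print(file_refs)
```

The `<file>` element only carries an address, so large arrays stay outside the XML document and are looked up separately.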
15 DERIVED: computedData/file
Problem with large binary datasets: include the reference, not the actual data.
Two possible strategies:
- Leave data in their native format and translate them only when needed; maintain different versions (formats) of the same data
- Define a "standard" format for binary data and convert them anyway
The second was the solution of choice. HDF5 appears to be a good solution.
16 HDF Mission
To develop, promote, deploy, and support open and free technologies that facilitate scientific data storage, exchange, access, analysis and discovery.
- Format and software for scientific data
- Stores images, multidimensional arrays, tables, etc.
- Emphasis on storage and I/O efficiency
- Free and commercial software support
- Emphasis on standards
- Users from many engineering and scientific fields
17 Example HDF5 file
[Diagram labels: "/" (root), "/foo", 3-D array, palette, Table, Raster image, Raster image, 2-D array]
Table:
lat | lon | temp
----|-----|-----
12  | 23  | 3.1
15  | 24  | 4.2
17  | 21  | 3.6
Like HDF4, HDF5 has a grouping structure. The main difference is that every HDF5 file starts with a root group, whereas HDF4 doesn't need any groups at all.
18 Example HDF5 file
[Diagram labels: "/" (root), "/MO", "/AO", "/mono", "/bi", "/coefficients", Kinetic, Overlap, Repulsion, Kinetic + Repulsion, Property, Table, 4-D array]
Table:
Orb | occ | energy
----|-----|-------
1   | 0   | 0.35
2   | 0.5 | 0.26
3   | 2.  | 0.69
This shows that you can mix objects of different types according to your needs. Typically, there will be metadata stored with objects to indicate what type of object they are.
19 HDF Software
Layers, from top to bottom:
- General Applications: utilities and applications for manipulating, viewing, and analyzing data
- Application Programming Interfaces: high-level, object-specific APIs
- Low-level Interface: low-level API for I/O to files, etc.
- HDF file: file or other data source
The two middle layers together form the HDF I/O library.
It is useful to think about HDF software in terms of layers.
At the bottom layer is the HDF5 file or other data source.
Above that are two layers corresponding to the HDF library.
First there is a low-level interface that concentrates on basic I/O: opening and closing files, reading and writing bytes, seeking, etc. HDF5 provides a public API at this level so that people can write their own drivers for reading and writing to places other than those already provided with the library. Those that are already provided include UNIX stdio and MPI-IO.
Then comes the high-level, object-specific interface. This is the API that most people who develop HDF5 applications use. This is where you create a dataset or group, read and write datasets and subsets, etc.
At the top are applications, or perhaps APIs used by applications. Examples of the latter are the HDF-EOS API that supports NASA's EOSDIS datatypes, and the DSL API that supports the ASCI data models.
20 HDF file structure for QC
Root
  AO: <i/j>, <i/T/j>, <i/Vnuc/j>, <i/T/j>+<i/Vnuc/j>, <ij/kl>
  MO: <i/T/j>, <i/V/j>, coeff(i,j)
  Property: <i/p/j>
[Diagram labels: Norb, Name, QCML_ref, Norb, Spin Polar.: a=b / a / b, Orb Classif: Core / Active / Virtual, Orb Energies:, Orb Symm:, [1-order]]
+ format metadata (integer, binary, Endian-ism, …)
21 Workflow parameters
Future work (web services, bottom-up approach (top-down?), …)
- Each "code" must be divided into "elementary recipes" (catalog of recipes)
- The interface of each "recipe" must be described (IDL, XML, …), in addition to more dynamic information
- A "master of ceremonies" must exist with the following tasks:
  - ORB-like "inter-client" communication
  - Job planning
  - User interface (grid abstraction)
  - …
22 QCML processing: wrappers
- One pair of wrappers for each code in the metasystem
- They should be written & maintained by the authors of the chemical codes
- XML processing can be used (DOM) but … what language?
  - Fortran: no easy and stable DOM available
  - Scripting languages (Perl/Python/Java): not known by chemists
- We tried both ways (Fortran & Python)
23 Fortran DOM: drawbacks
- The only problem is the Fortran binding: it doesn't exist (at least it didn't last year …)
- DOM is OO and Fortran is not
- A C binding exists (Gdome2)
- Gdome2 was installed, with very hard work, on a mainframe platform (it was conceived for Linux)
- We are currently converting it to Fortran, adopting the DOM recommendations (simplified …)
24 Why Fortran
GOOD
- Users don't need to learn a new language
- Homogeneous environment
BAD
- Tricky: needs an external library (f77xml) built on top of gdome2
- Porting problems for gdome2/libxml2 may arise
25 F77xml library
- Still in development
- v0.4 is out (experimental, with limited features)
- v1.0 upcoming, API changed to be nearly DOM2 compliant
- Written in C on top of gdome2
- Designed for interfacing to F77 (also F90 soon)
- Reduced namespace pollution
Cons:
- F77 syntax is difficult (DOM2 + tricks)
- F90 syntax is simpler
- A pre-processor will convert F90 syntax to F77
26 F77xml library - V1.0 example
Gdome2 (C):
  GdomeNode* gdome_el_firstChild (GdomeElement *self, GdomeException *exc);
F90:
  Call f77xml_el_firstChild(nodeCode, elemCode, exc)
  First position: return value
  nodeCode, elemCode, exc mapped to INTEGER
F77:
  Func='el_firstChild'
  Call xp3t1(nodeCode,func,elemCode,exc)
  Multiplexer function:
    x:
    p3: 3 parameters (+ function name)
    t1: type 1 parameter schema (code/code/error)
27 Why Python
GOOD
- Very easy object-oriented language
- Works well with strings
- Simple and efficient DOM interface for XML
- Present in almost all UNIX/Linux distributions
BAD
- Users do need to learn a new language
- Maybe less powerful than Perl
- Usually not used by chemists
28 Python Wrapper
At present a prototype works with the molpro-fci chain.
- It takes information from the XML repository
- Writes the proper MOLPRO and FCI input
- Starts the two programs
With a different XML file, users only need to specify the file name and some simple parameters (orbital guess for FCI).
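The first two steps of such an IN-wrapper can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype itself: the QCML values are invented, the function name `make_input` is hypothetical, and the generated text uses a made-up keyword layout rather than the real input syntax of MOLPRO or FCI.

```python
from xml.dom import minidom

# Hypothetical repository content; attribute names follow the QCML slides.
QCML_TEXT = """<system title="water">
  <molecule nElectrons="10" charge="0" spinMultiplicity="1">
    <geometry type="cartesian" unit="angstrom" numAtoms="3">
      <atom symbol="O" x3="0.0" y3="0.0" z3="0.0"/>
      <atom symbol="H" x3="0.0" y3="0.757" z3="0.587"/>
      <atom symbol="H" x3="0.0" y3="-0.757" z3="0.587"/>
    </geometry>
    <basis name="vdz" type="spherical"/>
  </molecule>
</system>"""


def make_input(xml_text):
    """Translate a QCML document into a schematic, program-neutral input text."""
    doc = minidom.parseString(xml_text)
    system = doc.documentElement
    geometry = doc.getElementsByTagName("geometry")[0]
    basis = doc.getElementsByTagName("basis")[0]

    lines = ["title " + system.getAttribute("title")]
    lines.append("basis " + basis.getAttribute("name"))
    lines.append("geometry " + geometry.getAttribute("unit"))
    for atom in geometry.getElementsByTagName("atom"):
        # Coordinates are copied as text so no precision is lost in conversion.
        lines.append(" ".join([
            atom.getAttribute("symbol"),
            atom.getAttribute("x3"),
            atom.getAttribute("y3"),
            atom.getAttribute("z3"),
        ]))
    lines.append("end")
    return "\n".join(lines) + "\n"


input_text = make_input(QCML_TEXT)
print(input_text)
```

A real wrapper would replace the body of `make_input` with the syntax of its target program and then launch it, for example through the `subprocess` module.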
29 Wrappers in the future
- We have to develop a script to write the initial XML file
- Every wrapper should be able to take information from the program output and append it to the XML file
- The user interface could be a GUI built with Tkinter, a package integrated in Python
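The "append to the XML file" step could look roughly like the Python sketch below. The output line format (`FINAL ENERGY = …`), the unit and the function name `append_energy` are assumptions for illustration; a real OUT-wrapper would parse the actual output of its program.

```python
import re
from xml.dom import minidom

# Hypothetical program output; no real code is claimed to print this line.
OUTPUT_TEXT = "some banner\nFINAL ENERGY = -76.026765\n"


def append_energy(xml_text, output_text, level_of_theory):
    """Extract an energy from program output and append it under <computedData>."""
    match = re.search(r"FINAL ENERGY\s*=\s*(-?\d+\.\d+)", output_text)
    if match is None:
        raise ValueError("no energy found in program output")

    doc = minidom.parseString(xml_text)
    root = doc.documentElement

    # Reuse the existing <computedData> element, or create it on first use.
    found = root.getElementsByTagName("computedData")
    if found:
        computed = found[0]
    else:
        computed = doc.createElement("computedData")
        root.appendChild(computed)

    energy = doc.createElement("energy")
    energy.setAttribute("unit", "hartree")  # assumed unit
    energy.setAttribute("levelOfTheory", level_of_theory)
    energy.setAttribute("value", match.group(1))
    computed.appendChild(energy)
    return doc.toxml()


updated = append_energy('<system title="water"/>', OUTPUT_TEXT, "FCI")
print(updated)
```

Keeping the value as the original text string avoids rounding the number on its way into the repository.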
30 Python or not
- Python is very simple to learn and works very efficiently with XML
- Scripts written in Python (at least for prototypes) are quite clear, linear and easy to maintain or upgrade
- The possibility of a GUI could make our project much more user-friendly
31 What we have done …
Single platform: IBM SP4
Two code chains:
- MolPro to FCI
- MolPro to CasDI
[Diagram labels: Start here, QCML Repository, HDF5 Repository, IN-wrapper, MolPro IN-file, MolPro, OUT-wrapper, FCIDUMP, IN-wrapper, Bin file for FCI, FCI IN-file, FCI, Stop here]
32 In conclusion …
Two important hints on data:
- Use some XML dialect for describing simple structured data
- Use HDF5 for storing large arrays and binary data
Open points:
- Need for a good and easy API to XML & HDF
- How to manage the workflow
- How to manage the grid connection
33 XML processor
A set of rules and interfaces to interact with XML data from a user program.
There are two main API specifications from the W3C:
- DOM: Document Object Model
- SAX: Simple API for XML
To do something useful with XML, you must be able to programmatically access the data. A software module capable of reading XML documents and providing access to their content and structure is referred to as an XML processor or an XML API.
While developers are free to implement their own XML APIs, it is in their best interests to leverage industry-accepted standard APIs. By accepting an industry standard API, a developer can write code for a given API implementation that should be capable of running under any other compliant implementation of the same API without modifications.
There are two main API specifications that have gained popularity among developers today and are striving to become industry standards: the Document Object Model (DOM) and the Simple API for XML (SAX).
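To make the difference between the two APIs concrete, here is a small Python sketch of the SAX style, using the standard-library `xml.sax` module. DOM builds the whole document tree in memory, whereas SAX calls back into a handler as each element is read. The geometry fragment and the class name `AtomCollector` are invented for illustration.

```python
import xml.sax


class AtomCollector(xml.sax.ContentHandler):
    """Collect the symbol of every <atom> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no document tree is ever built.
        if name == "atom":
            self.symbols.append(attrs.get("symbol", "?"))


# Hypothetical QCML-like geometry; attribute names follow the slides.
GEOMETRY_XML = b"""<geometry type="cartesian" unit="angstrom" numAtoms="3">
  <atom symbol="O" x3="0.0" y3="0.0" z3="0.0"/>
  <atom symbol="H" x3="0.0" y3="0.757" z3="0.587"/>
  <atom symbol="H" x3="0.0" y3="-0.757" z3="0.587"/>
</geometry>"""

handler = AtomCollector()
xml.sax.parseString(GEOMETRY_XML, handler)
print(handler.symbols)
```

SAX suits large documents that are read once. DOM is more convenient when a wrapper has to navigate the document or modify it and write it back.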
34 Building the wrappers
Instead of … reinventing the wheel …
DOM (Document Object Model) defines a platform- and language-neutral interface to the structure of XML documents. This interface allows programs to dynamically access and update the document.
From the specification, DOM provides:
- a standard set of objects for representing XML documents,
- a standard model of how these objects can be combined,
- and a standard interface for accessing and manipulating them.