Presentation is loading. Please wait.

Presentation is loading. Please wait.

The CCPN Project Technical Introduction Expanded from presentation, Utrecht Feb. 2007.

Similar presentations


Presentation on theme: "The CCPN Project Technical Introduction Expanded from presentation, Utrecht Feb. 2007."— Presentation transcript:

1 The CCPN Project Technical Introduction Expanded from presentation, Utrecht Feb. 2007

2 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status, resources and plans ■ Credits

3 The CCPN Project Collaborative Computing Project for NMR Started in 1999 Unifying platform for NMR software similar to CCP4 for X-ray crystallography Community-based, open-source, software development

4 Context High-throughput structural biology relies on software pipelines. Programs that do various tasks must be combined, preferably so that there are alternative programs for each task. Yet, connecting one program to another can be quite difficult in practice. Often all you have is an incomplete set of partial and badly maintained conversion scripts.

5 Native Anarchy Convert Task1 Task2 Convert Task2 Task1 Convert Task3 Convert Task3 Convert Task3

6 Disunified Software Every program does things its own way. - proprietary data formats. No provision for data exchange. - limited choice, and data loss. No provision for data harvesting. - lost and neglected data. No basis for task or code sharing. - you must keep re-inventing the wheel. Software pipelines require interoperability.

7 CCPN Activities Data standards and software integration –Data Model with associated subroutine libraries. –Tools – a framework for data modeling. –Collaborations Software development and distribution Meetings and workshops

8 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status, resourcesand plans ■ Status, resources and plans ■ Credits

9 Data standard - objectives –Lossless data transfer between programs –Completeness, integrity of data –Data harvesting All data retained till deposition Allows data mining

10 Conversion with Data Standard Data Standard Convert Task1 Convert Task2 Convert Task2 Convert Task1 Convert Task1 Convert Task3 Convert Task3 Convert Task3

11 Software integration - objectives Possible to work between programs Link and integrate software Allow shared modules –All data in one shared storage

12 Runtime storage hub Data Storage Convert Task1 Convert Task2 Convert Task2 Convert Task1 Convert Task1 Convert Task3 Convert Task3 Convert Task3

13 CCPN focus Data creation rather than querying final results Applications rather than documents Information store rather than message passing Full details rather than summaries Precise specifications

14 Framework requirements Precisely defined standard –A single central description –Validation directly against standard Support many parallel implementations Ensure data consistency and validity Easy to maintain and modify Programmer-friendly

15 Model Driven Architecture Abstract data model – UML No stable format – stable API –easier to maintain as model changes Support XML and SQL, underneath Support several programming languages Automatic code generation to make it feasible Chosen approach

16 Preferences All science in one, shared multipackage model made in collaboration. API, storage and I/O libraries reflect model exactly CCPN framework provides storage, I/O and all bookkeeping Others do science and user interaction Program-specific needs in extensions

17 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status,esources and plans ■ Status, resources and plans ■ Credits

18 CCPN code generation The applications that the user sees remain independent, but do all their data access through the CCPN APIs. These in turn store and load the data transparently from disk (XML or database). All subroutine libraries are generated fully automatically from an abstract logical model, that describes the structure and characteristics of the data without reference to the implementation. The logical model is made in UML from a series of connected packages, designed by teams of domain specialists for different scientific domains. All generated code is by definition consistent, since it derives from the same model. Code generation is fully automatic, and the resulting subroutine libraries can be used directly, without manual tweaking. Implementation-specific code is stored as code snippets in tagged values inside the UML; it amounts to less than 0.5% (character count) of the total. The API implementations are object oriented and are guaranteed to be stable – unlike the underlying storage formats. Storage is transparent, and applications should work equally well with either a file-based or a database implementation. Some features, like transaction handling, roll-back, or access control, will only be properly implemented in database implementations.

19 Code Generation Framework Domain Experts MEMOPS framework Software Developers User Documentation Application Deposition APIs Python Java C Perl Storage SQL XML Handcoded(< 1%) UML Model Package 1 Package 2 Package 3 Autogeneration Wrappers

20 CCPN APIs Application Programming Interface –Is the standard –Subroutine library –Data accessed in memory as if stored inside the data model –Classes and objects, with functions (get, set, …) Comes with: –Integrated, transparent I/O (file or database) –Complete validity checking –Protection against casual change (data encapsulation) –Versioning and backwards compatibility

21 Developer Benefits Standard shared by many programs –Comes with reference data No need to write I/O or checking code Concentrate on the science, not book keeping Extendible –Program-specific data in all objects –New model packages Functionality –Utility functions (e.g. Molecule from 1-letter codes) –Notification system (for GUI)

22 Languages and Formats PythonJava CPerl XML SQL Analysis, FormatConverter Extend-NMR Bruker TopSpin, NMRVIEW Processing Extend-NMR, EU-NMR, AUTOPSY, (Varian) (CYANA) (Bioinformatics) MSD NMR database, BIOXDM ? PIMS?? BIOXDM ? EuroCarbDB (Bioinformatics) Format Language For all languages: Model description Documentation For all formats: Schemas I/O mappings Implementations marked in pink are available.

23 API generator architecture Reduce burden of adding new languages, formats –Languages (Python, Java, C, …) –Storage formats (XML, SQL) Language & Format Independent (80% of the code) Format dependent only Language dependent only Language & Format dependent Code required for new language Code required for new format Most of the logic

24 API functions ‘get’ and ‘set’ (Attributes and links) ‘add’ and ‘remove’ (Multiple attributes and links) ‘sorted’ (Multiple unordered links) ‘findFirst’ and ‘findAll’ (Multiple links) –Simple filtering (attribute == value) create and ‘new’ (Objects) –Normal and ‘factory function’ object creation delete (Objects) –‘Delete’ function. checkValid, checkAllValid (Objects) ‘current’ (e.g current NmrProject) API classes are strongly coupled. For efficiency reasons object-to-object links are two-way, with information stored on both sides, and this forces objects to interact directly with the internal representations of objects they are linked to.

25 Code Generation UML is edited in a commercial editor (ObjectDomain) that includes a Python scripting environment. An ObjectDomain- specific script extracts the model description into a Python in-memory representation, that is then written to disk as a series of XML files. The UML editor is not used further, and another one could be substituted simply by providing a new data extraction script. Subsequent code generation is done outside the editor. The XML model description is read into memory as a series of Python objects, based on the Python classes of a MetaModel. Code generators then uses the information in the in-memory description to generate all necessary files

26 Code Generation ObjectDomain UML data edit UML MetaModel In-Memory Model Python objects On-disk model XML file API code Schemas Mappings etc. Autogeneration Generation: CCPN Python generators Auxiliary: CVS/SVN, Ant, Eclipse CCPN code Off-the-shelf Application code files CCPN generated Legend:

27 Runtime implementations File and database implementations are organised differently, with varying use of off-the-shelf code. For both implementations a large fraction of eth API code is taken up by constraint checking, to make sure that data remain consistent with the rules embedded in the model.

28 Science code User Interface Utility functions Runtime Python XML Application Python API XML I/O code XML I/O mappings Data Storage XML files User application Data get, set. Validity check Generic XML read/write User data in CCPN XML format What to do for which element CCPN code Off-the-shelf Application code files CCPN generated Legend: XML parser

29 Science code User Interface Utility functions Runtime Java SQL Application Java API Hibernate Hibernate mappings Database User application Data get, set. Validity check User data in database Presentation layer Database Schema Object-DB synchronisation Commit, input check CCPN code Off-the-shelf Application code files CCPN generated Legend: PIMS: Backing beans, JSF Client/Server ? PIMS: Servlets, JSP (JSF)

30 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status,esourcesand plans ■ Status, resources and plans ■ Credits

31 Model requirements Comprehensive –Contains all data you need to run an application –Final and intermediate results General –can handle all likely ways of working Unequivocal –Normalised - data stored only once –Data stored in one way, std. units, … Unavoidably complex

32 Data Model Coverage Preparation Experiments Molecule Description NMRCitations Nuclei and Isotopes Annotation Classifications Organisms, Taxonomy Specific Programs People Access Control Structure Targets Crystallisation Project Tracking Structure and Coordinates Residue Template Molecular System Samples LIMS MOLECULE UTILITY X-RAY CRYSTALLOGRAPHY Beamlines

33 CCPN Packages Grouping – model, code and data –e.g. NMR, X-ray, Molecular description Package ‘import’ –e.g. Coordinates loads MolSystem information Allows lazy loading –Only load relevant data, code. –Load on demand Different groups can ‘own’ different packages –New domains – new collaborators

34 Main Concepts Model is Normalised –Every piece of information stored in only one place Classes and Objects – Inheritance and subtypes (single inheritance only) – Elements (attributes and links) and operations Most operations not entered explicitly – Classes may be abstract Attributes – String, numeric, special complex types – Compare by value Links – Connection to other objects – Generally two-way – All objects have identity – Most –to-many links are sets

35 Example: ChemComp

36 More Concepts Parent links –Join all objects in a tree, rooted in MemopsRoot/Project Local keys – Uniquely identify an object relative to its siblings – One or more attributes or links – Use serial number if no other candidate – Combined to form global keys Derived attributes and links – Not stored, but calculated when needed – Behave like normal attributes

37 ChemElement - Details ChemElement +symbol: Token +atomNumber: Int +atomicRadius: Float +covalentRadius: Float +mass: Float +name: Token Isotope +massNumber: Int +gyroMagneticRatio: Float +spin: Word +mass: Float +abundance: Float +receptivity: Float +magneticMoment: Float +quadrupoleMoment: Float ccp.nmr.NmrConstraint.ChemShiftConstraintList +isotopeCode: Word +getIsotope() ccp.nmr.Nmr.ExpDimRef +serial: Int +sf: Float +isotopeCodes: Word +measurementType: ExpMeasurementType = shift +isFolded: Boolean = False +unit: Word +isSplitting: Boolean = False +isAxisReversed: Boolean = True +maxAliasedFreq: Float +minAliasedFreq: Float +hasAliasedFreq: Boolean +variableIncrFraction: FloatRatio +constantTimePeriod: Float +nominalRefValue: Float +baseFrequency: Float +getIsotopes() +getHasAliasedFreq() ccp.nmr.Nmr.ShiftReference +serial: Int +isotopeCode: Word +molName: ShiftReferenceMolecule +atomGroup: Line +unit: Word = ppm +value: Float +referenceType: ShiftRefType +indirectShiftRatio: Double +getIsotope() ccp.nmr.Nmr.Resonance +serial: Int +name: Word +assignNames: Word +isotopeCode: Word +details: Text +getIsotope() memops.Implementation.Project ccp.molecule.ChemComp.ChemAtom +chirality: AtomChirality +nuclGroupType: Word +elementSymbol: Word +shortVegaType: Line +waterExchangeable: Boolean = False +getChemElement() ChemElementStore +name: Line * 1 1 * 1 +covalentlyBound ** 1 1 * 1 * +currentChemElementStore

38 TopObjects and Implementation TopObjects –Every package has a TopObject (e.g. ChemComp, NmrProject) –TopObjects are children of MemopsRoot/Project –File Implementations have one file per TopObject. –No links between objects from same package with different TopObjects –TopObjects have universal identifiers (GUID) Implementation package contains no real data, only file and database locations

39 Implementation 1.1

40 Constraints Frozen (unchangeable) attribute – Change-once to be introduced Cardinality – 0..1, 1..1, 2..2, n..m, * (0..infinity) Collections – Unique, unordered = set – Nonunique, ordered = list – Unique, ordered = list with uniqueness checking Ad-hoc (handcoded) constraints – DataType (value) – Attribute and Link (object and value) – Class (object) Constraints are always enforced.

41 Handcode In constraints on DataType, Class, or Element In function bodies – May override an autogenerated function e.g. a getter for a derived element – May define a new function, e.g. MemopsRoot.saveAll To be added at the end of constructors, e.g. to create more objects automatically To be added at the beginning of delete, e.g. to prevent delete in some situations. Handcode is entered separately for each language

42 Simple data types Complex data types (example)

43 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status,esourcesand plans ■ Status, resources and plans ■ Credits

44 API facilities Data access only (get, set, create, delete, …) – CCPN does not do things with the data – that is up to you Data Encapsulation –Internal lists etc. not returned outside objects Functions only defined where they make sense – No ‘add’ functions for frozen or 2..2 links Functions raise error if they cannot do their job

45 API examples Examplkes are java specific. Many (but not all) would look identical in Python. In Python we also use keyword arguments, and allow the standard Python syntax ( value = obj.attr, obj.attr = value).

46 ChemComp – basis for examples

47 Get x = myChemComp.getMerckCode() returns value x = myChemCompVar.getChemAtoms() returns list or set of values Function always present

48 Set myChemComp.setMerckCode(x) sets value – x may be null if constraints permit myChemCompVar.setChemAtoms(atList) – accepts any collection values – removes old values and sets new ones – only function that can reorder ordered elements Function not present for frozen elements, or for parent or child links May be present for derived elements First checks cardinality and handcoded constraints (unless called from a constructor)

49 Add, remove myChemCompVar.addChemAtom(atom) myChemCompVar.removeChemAtom(atom) Function not present for frozen elements, parent elements, child elements, or where cardinality is n..n. May be present for derived elements First checks cardinality and handcoded constraints (unless called from a constructor)

50 Sorted and current myList = myChemComp.sortedChemAtoms() myList = myChemCompVar.sortedChemAtoms() –For child links sorts on local key (here ‘name’, ‘subtype’) –For other link sorts on full key (here ‘molType’, ‘ccpCode’, ‘name’, ‘subtype’) Function present for unordered multiple links myChemComp = myProject.getCurrentChemComp() Only between MemopsRoot/Project and TopObjects. Remembered between sessions

51 findFirst and findAll myBond = myChemComp.findFirstChemAtom(constraintMap) myAtom = myChemComp.findFirstChemAtom(“name”, “HA”, “subType”, 1) myAtom = myChemComp.findFirstChemAtom() atList = myChemComp.findAllChemAtoms(“name”, “HA”) – Selects or filters based on ‘self.keyword == value’ – findFirst mainly used where there is at most one element that fits, or where you want a random element. Function present for all multiple links, also if derived

52 Delete myObj.delete() – sets myObj.isDeleted=True – removes all known links from undeleted objects to myObj – Recursively deletes any object that: has a frozen or add-once link to myObj (this includes all children) has a mandatory (n..n) link to myObj has an n..m link to myObj if there are fewer than n objetcs left in the link

53 Object Creation newLinkAtom = new ChemAtom(myChemComp, “HA”, 2) newLinkAtom = myChemComp.newChemAtom(“HA”, 1) newAtom = myChemComp.newChemAtom(attrLinkMap) Suspends most validity checking, but calls myObj.checkValid() afterwards.

54 Validity Checking myObj.checkValid() – Checks that object and their attributes and links satisfy cardinality and handcoded constraints myObj.checkAllValid() – recursively calls checkValid for myObj and all child objects. If called as ‘checkValid(True)’ either function also checks DataType constraints, and that two-way links are self-consistent. –‘complete=True’ is slow!

55 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status,esourcesand plans ■ Status, resources and plans ■ Credits

56 Resources University of Cambridge (CCPN grant) Wayne Boucher Code generation, installation, C routines Rasmus Fogh Code generation, Data Model Tim Stevens CcpNmr Analysis and NEXUS, user interface Position unfilled Interfacing NMR applications EBI – MSD group Wim Vranken NMR Databases, reference information, CcpNmr FormatConverter Anne Pajon (PIMS) Java/SQL implementation, LIMS model, PIMS

57 Ongoing code generation upgrade  New capabilities  Result generally streamlined  Machinery easier to expand and maintain  New collection types (sets)  Complex datatype objects  Cleaner, more DB-friendly model  Changes  New MetaModel  All generation scripts (to be) reorganised  Model change for core (Implementation) package  NB Old application code must change!

58 CCPN / U Cambridge objectives  Short term  Further develop CcpNmr Analysis and NEXUS  Promote CCPN software and collaboration  Further integrate third-party NMR software  Upgrade all implementations to newest version  Support existing collaborations  Medium term  Expand data model to structural/systems biology  Sample description, molecular dynamics, …  Additional collaborations  Add new implementations  Long term  Help create LIMS system that includes structural NMR  Add Client/server code generation

59 Major Collaborations  EXTEND-NMR, EUNMR  NMR software integration  BIOXDM  Beamline crystallography online software  Model for Experimental crystallography  EUROCarbDB  Carbohydrate NMR  PIMS  LIMS system  Biochemistry/Biology laboratory data model  has suspended collaboration pending review

60 Languages and Formats PythonJava CPerl XML SQL Analysis, FormatConverter Extend-NMR Bruker TopSpin, NMRVIEW Processing Extend-NMR, EU-NMR, AUTOPSY, (Varian) (CYANA) (Bioinformatics) MSD NMR database, BIOXDM ? PIMS, BIOXDM ? EuroCarbDB? (Bioinformatics) Format Language Almost Ready Awaiting upgrade New Implementation requested

61 Python and C APIs  Python/XML  Available in newest stable release  API upgrade complete, partially tested  XML I/O upgrade in progress  C/XML  Full functionality available in newest stable release  Wrapper around Python API code  calls C libraries from the Python distribution  Native implementation planned – should be faster.  Python/database  Under discussion

62 Java APIs  Java/XML  Available in newest stable release  PIMS/Java/SQL  Currently working and in use in PIMS  Older, dead-end branch. Upgrade or lose support  Java/SQL  Requires completion of new-machinery upgrade. 1.New Model. Instantaneous 2.New DB Schema and Hibernate mapping. <1 man-month 3.New Java/XML API. Further 2 man months 4.New Java/SQL API. Further 2 man-months

63 Current Java/SQL/PIMS Implementation  Generation not integrated with main machinery.  Model structure two versions behind stable release  Some essential functionality in PIMS-specific code  LIMS model only. Hand coding missing for other model parts  Missing:  ‘sorted’, ‘checkValid’, ‘newChildClass’ factory functions Attribute, link, and class constraints. Constructor and destructor handcode. Support for some cardinalities (e.g. 2..2)  Works differently:  ‘findFirst’, ‘findAll’, constructors object composite key. Object print. Constraint evaluation policy. Handcode handling

64 Size statistics  Data Model  38 packages  467 classes, datatypes and dataobjtypes  XML description 14 500 lines  Python/XML API  API code 530 000 lines (17 000 000 chars)  XML I/O mappings 46 000 lines (1 100 000 chars)  Handcode 38 000 chars (0.2%)  Generation machinery  31 000 lines of Python

65 ■ CCPN aims ■ Data standards ■ Code Generation and libraries ■ Data Model structure ■ API details ■ Status,esourcesand plans ■ Status, resources and plans ■ Credits

66 CCPN people Cambridge (Biochemistry) –Ernest Laue –Wayne Boucher –Rasmus Fogh –Tim Stevens –Dan O’Donovan –Wolfgang Rieping –Alan da Silva EBI (MSD), Hinxton –Kim Henrick –John Ionides –Wim Vranken –Anne Pajon Plus external collaborators

67 Funding BBSRC (UK research council) Industry –AstraZeneca, Dupont Pharma (now BMS), Genentech, GlaxoSmithKline European Community –EXTEND-NMR, EU-NMR, NMR-Life, NMRQUAL, and TEMBLOR contracts

68 Collaborations Institututions –Carnegie Mellon University, Institut Pasteur, University of Gothenburg, University of Nijmegen, University of Utrecht, University of Regensburg, FMP, BRI, EBI-MSD, BMRB, Bruker, Global Phasing, IGBMC, LEBS, NKI and RCSB Projects –EXTEND-NMR (NMR software) –PIMS (Biochemistry LIMS) –BIOXDM (Beamline crystallography) –EUROCarbDB (Carbohydrate data)

69 END END


Download ppt "The CCPN Project Technical Introduction Expanded from presentation, Utrecht Feb. 2007."

Similar presentations


Ads by Google