The Life Cycle of Structural Biology Data: A discipline with a culture of sharing Chris Morris, STFC DI4R Sept 2016.

1 The Life Cycle of Structural Biology Data: A discipline with a culture of sharing
Chris Morris, STFC DI4R Sept 2016

2 West-Life: Life Sciences in the Cloud
Background

3 Structural Biologists are mature computer users
[Figure: Protein Data Bank, new entries by year (log scale)]
PDB founded 1971 [Protein Data Bank, Nature New Biology 233:223]
First submissions by mailing punch cards
First use of digital computers in 1940s
FAIR principles come naturally to structural biologists

4 New scientific goals
Larger macromolecular machines
Membrane association
4D (structure + dynamics)
Transient interactions
> 25,000 experimental sessions in 2015

5 New experimental methods
Combined techniques
Users are not always experts
Small samples
Data noisy and incomplete
Deliver results to other life scientists
Calls for integrative, user-friendly solutions

6 INSTRUCT user survey
73% working on eukaryotic rather than prokaryotic systems
84% working on complexes rather than single gene products
Each research team routinely uses three to four different techniques
83% would use combined SB techniques more often if it was easier to get access to experimental facilities
73% found it hard to combine software tools for different techniques in integrated workflows

7 New data challenges
Data volume:
  Combined output of European SB facilities > LHC
  XFEL will double it
  DLS archived >1PB in 2015
Improve archiving of data and metadata
Support for data moving / replication
Improve automated pipelines for MX
… create pipelines for other techniques

8 Crowdsourcing from the middle tier
Community includes:
  Life scientists who use computers
  End user programmers
  Algorithm developers
We aim at easing the process of creating web-based services

9 West-Life H2020 Project
10 Partners:
  STFC (UK) (lead partner, Martyn Winn Coordinator)
  Dutch Cancer Institute (NKI) (NL)
  EMBL (DE)
  Masaryk University (MU) (CZ)
  Consejo Superior De Investigaciones Cientificas (CSIC) (ES)
  Consorzio Interuniversitario Risonanze Magnetiche Di Metallo Proteine (CIRMPP) (IT)
  INSTRUCT (UK)
  Utrecht University (NL)
  Luna (FR) (SME)
  INFN (IT)
Budget: €4 000 000
Duration: 36 months
Started: 1 Nov 2015
Proposal ID

10 West-Life: Life Sciences in the Cloud
Data Life Cycle

11 SB Data Life Cycle
“four stages ...: gathering of data; choosing the representation and encoding of all data within a numerical scoring function consisting of spatial restraints; configurational sampling to identify structural models with good scores; and analyzing the models, ...” [Sali et al]
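
The four stages quoted above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the method of Sali et al: the three point-like components, the distance restraints, the quadratic score and the greedy random search are all hypothetical choices made for illustration.

```python
import math
import random

# Stage 1 (gathering data): hypothetical distance restraints between three
# point-like components, given as (i, j, target_distance) in arbitrary units.
RESTRAINTS = [(0, 1, 3.0), (1, 2, 4.0), (0, 2, 5.0)]


def score(coords, restraints=RESTRAINTS):
    """Stage 2 (representation): sum of squared restraint violations."""
    total = 0.0
    for i, j, target in restraints:
        total += (math.dist(coords[i], coords[j]) - target) ** 2
    return total


def sample(n_points=3, n_steps=20000, step=0.2, seed=0):
    """Stage 3 (sampling): greedy random search for a low-scoring 2D model."""
    rng = random.Random(seed)
    coords = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(n_points)]
    best = score(coords)
    for _ in range(n_steps):
        k = rng.randrange(n_points)
        trial = list(coords)
        # Perturb one point; keep the move only if the score does not get worse.
        trial[k] = (coords[k][0] + rng.gauss(0, step),
                    coords[k][1] + rng.gauss(0, step))
        s = score(trial)
        if s <= best:
            coords, best = trial, s
    return coords, best


def analyse(coords, restraints=RESTRAINTS):
    """Stage 4 (analysis): residual violation of each restraint."""
    return {(i, j): abs(math.dist(coords[i], coords[j]) - target)
            for i, j, target in restraints}


if __name__ == "__main__":
    model, final_score = sample()
    print(final_score, analyse(model))
```

Real integrative modelling uses far richer representations and sampling schemes; the point here is only how restraints, a scoring function, sampling and analysis fit together.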

12 General Data Life Cycle
Life cycle defined by UK Data Archive
Six stages:
  Creating
  Processing
  Analysing
  Preserving
  Giving access
  Reusing
Also: retiring data

13 Creating data
Data acquired in experiment
  MX: gigabytes
  NMR: megabytes to gigabytes
  EM: terabytes
Metadata
  About experiment e.g. wavelength
  About sample e.g. construct
  In reality, usually lacking
Facilities:
  47 synchrotrons
  200 NMR groups in Europe
  Growing numbers of electron microscopes
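
As an illustration of the metadata point on this slide, a record for an acquired dataset could carry the experiment and sample metadata alongside the data, and report what is lacking. This is a minimal sketch; the class name, field names and units are assumptions for illustration and do not come from any facility's schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AcquisitionRecord:
    """Hypothetical record for one experimental dataset."""
    technique: str                                # "MX", "NMR" or "EM"
    size_bytes: int                               # size of the raw data
    wavelength_angstrom: Optional[float] = None   # metadata about the experiment
    construct: Optional[str] = None               # metadata about the sample


def missing_metadata(record: AcquisitionRecord) -> List[str]:
    """Names of metadata fields that were never filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


if __name__ == "__main__":
    record = AcquisitionRecord(technique="MX", size_bytes=2 * 10**9)
    print(missing_metadata(record))
```

Capturing such fields automatically at acquisition time, rather than asking users to type them in later, is one way to address the "usually lacking" problem.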

14 Creating data
The wwPDB Hybrid/Integrative Methods Task Force: “all relevant experimental data and metadata … should be archived”
MyTardis, SBGrid, EMPIAR, BMRB
Instruct: “storage of data is the responsibility of the User to whom it belongs … Instruct Centres aspire to offer an archive to store data, especially in cases where the data volume makes this more practical than transferring the data ...”
DLS processing pipeline and archive
Want “research project” view, not facility view
Orthogonality:
  Data size
  Data lifetime
  Authentication method

15 Processing data: data reduction
“choosing the representation and encoding of all data within a numerical scoring function consisting of spatial restraints” [Sali et al]:
  MX: integration and merging, megabytes
  EM: class assignment, megabytes
  NMR: Fourier transform, gigabytes, then peak picking
Instruct: “supporting data must be deposited in a public database or … made otherwise available within one year after publication of the results”
Licenses:
  Free for academic use
  Not always open source
Automated provenance
  PROV-O please, don’t re-invent!
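
To make the PROV-O suggestion concrete, a data reduction step can be recorded as a handful of triples using terms from the W3C PROV-O vocabulary (prov:Entity, prov:Activity, prov:SoftwareAgent, prov:used, prov:wasGeneratedBy, prov:wasAssociatedWith, prov:wasDerivedFrom). The sketch below uses plain Python tuples rather than an RDF library, and the example.org IRIs and function names are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def reduction_provenance(raw_iri, reduced_iri, activity_iri, software_iri):
    """Return PROV-O triples describing one data reduction step."""
    return [
        (raw_iri, RDF_TYPE, PROV + "Entity"),
        (reduced_iri, RDF_TYPE, PROV + "Entity"),
        (activity_iri, RDF_TYPE, PROV + "Activity"),
        (software_iri, RDF_TYPE, PROV + "SoftwareAgent"),
        # The reduction run read the raw data and was carried out by the software.
        (activity_iri, PROV + "used", raw_iri),
        (activity_iri, PROV + "wasAssociatedWith", software_iri),
        # The reduced data came out of that run and derives from the raw data.
        (reduced_iri, PROV + "wasGeneratedBy", activity_iri),
        (reduced_iri, PROV + "wasDerivedFrom", raw_iri),
    ]


def to_ntriples(triples):
    """Serialise IRI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    base = "https://example.org/"  # hypothetical namespace
    print(to_ntriples(reduction_provenance(
        base + "raw/images-001",
        base + "reduced/merged-001",
        base + "run/integration-001",
        base + "software/integrator",
    )))
```

A pipeline that emits such statements at each step records provenance automatically, without inventing a new vocabulary.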

16 Analysing data: structure determination and interpretation
“all structures are in fact integrative models that have been derived both from experimental measurements involving a physical sample of a biological macromolecule and prior knowledge of the underlying stereochemistry.” [Sali et al]
Initial structure
  Direct methods e.g. experimental phasing
  Molecular replacement uses existing model
The refinement
  Uses experimental data and prior knowledge
Processing on:
  MX: laptops
  NMR: cloud
  EM: HPC with Infiniband

17 Preserving data Giving access to data
“For all structural studies of macromolecules, coordinates and the related experimental data … must be deposited at a member site of the Worldwide Protein Data Bank” [IUCR author guidelines]

18 Reusing data
From 2012 to 2014 there were 5913 papers citing one or more PDB entries [Bousfield]
Synoptic studies
  “the totality of the data in the PDB provides a rich source of more generalized knowledge about proteins, their molecular biology, and evolution” [Furman et al, 2013]
Molecular replacement
2015: 526,126,409 downloads from PDB
Automatic search by MrBUMP, BALBES, etc
PDB-REDO recalculates structures with new software
FAIR principles are baked into SB workflow

19 Discarding data
3,404 PDB entries are marked as obsolete
  Better samples
  Better analysis
  Rare cases of fabricated data
Self-policed community
  Charter of PDB permits obsolete mark only if author or institution retracts
E-Infrastructure facilitates research
  Should not facilitate research misconduct

20 Conclusion
Common data infrastructure
  giving a simple user interface and simple programmatic access to scattered data
  workflows to use datasets from different facilities and techniques
  automatic acquisition of metadata
Data management integrated with data processing
Standard package for small experimental centres
Support for crowdsourcing

21 Cryo-EM workshop @ ISGC 2016 in Taipei
?

22 Supplementary material

23 Reinvent nothing
Existing best practice includes:
  WeNMR
  PaNData
  Diamond: pipelines and archives
  Scipion
  Data Life Cycle Lab
Integration, not competition

24 AAI requirements
Saved sessions, data access
  “I am the person you gave these credentials to…”
Collaborations
  “I am the person you think I am”
Remote experiments
  “I am definitely the person you think I am”
Personal certificates
  Implausible that our community would use them broadly (but examples within WeNMR)
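
The three requirements on this slide can be read as increasing levels of identity assurance. The sketch below encodes that reading; the enum names, the use-case keys and the mapping between them are assumptions made for illustration, not part of any West-Life or eduGAIN specification.

```python
from enum import IntEnum


class Assurance(IntEnum):
    """Hypothetical assurance levels, ordered from weakest to strongest."""
    SAME_CREDENTIAL_HOLDER = 1   # "I am the person you gave these credentials to"
    IDENTIFIED = 2               # "I am the person you think I am"
    VERIFIED = 3                 # "I am definitely the person you think I am"


# Assumed minimum level for each use case named on the slide.
REQUIRED = {
    "saved_session": Assurance.SAME_CREDENTIAL_HOLDER,
    "data_access": Assurance.SAME_CREDENTIAL_HOLDER,
    "collaboration": Assurance.IDENTIFIED,
    "remote_experiment": Assurance.VERIFIED,
}


def is_allowed(use_case: str, level: Assurance) -> bool:
    """True if the presented assurance level is sufficient for the use case."""
    return level >= REQUIRED[use_case]


if __name__ == "__main__":
    print(is_allowed("collaboration", Assurance.SAME_CREDENTIAL_HOLDER))
    print(is_allowed("remote_experiment", Assurance.VERIFIED))
```

Separating the level of assurance from the authentication technology keeps the requirement independent of whether a service relies on SSO, federated identity or personal certificates.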

25 Current AAI status
WeNMR uses SSO
  Accepts eduGAIN and social media id
Experimental facilities issue userids
  Moving to Moonshot / Umbrella integration with eduGAIN
  Check passports at gate … but moving to remote access
Instruct issues userids
  Verifies identity by phone call to PI (small community)
West-Life not started yet

26 AAI solutions
Solution for homeless users
  Create a local id without associating it with a homeid
  We have colleagues not in eduGAIN
Solutions to handle user attributes
  Stored locally
  Updates are checked by administrator
Preferred technology
  Shibboleth has become standard
  SAML probably sufficient for authorization
  Web access, with delegation

28 Main Concepts

