
1 Treating Data Like Software: A Case for Production Quality Data
Jennifer M. Schopf
WHOI Ocean Informatics Working Group
(Also NSF – GEO/OAD) (Soon to be IEEE Computer Society!)
July, 2012

2 Abstract
Treating Data Like Software: A Case for Production Quality Data
Jennifer Schopf
ABSTRACT
In this short paper, we describe the production data approach to data curation. We argue that by treating data in a similar fashion to how we build production software, that data will be more readily accessible and available for broad re-use. This includes considering third-party; planning for cyclical releases; bug fixes, tracking, and versioning; and issuing licensing and citation information with each release.
TED-style (9 mins, control of your slides) talks

3 The Hard Problem of Data
- The amount of data generated by scientists is growing exponentially
- And yet we still don’t know how to:
  - Collect data sets sustainably
  - Tag data sets in ways that others will agree with
  - Discover data sets others have created
  - Make our own data sets accessible to a broad audience

4 Data handling is really hard…
…but maybe we can leverage what we know about building software:
- 45% of scientists say they spend more time developing software now than they did 5 years ago
- 38% spend at least 1/5th of their time developing software
http://www.nature.com/news/2010/101013/full/467775a.html

5

6 Today’s Question
Can we leverage the (slightly more) formalized process of producing software to help us produce data?

7 “Personal” use (Pre-Prototype)
- Used by me
- I do all the coding
- My server for “repository”
- My coding “standards”
- I’m the end user

8 “Personal” use (Pre-Prototype)
- No testing besides use
- No documentation (~code comments)
- No “release”: goes straight from code to compile to use (might have versioning)

9 Prototype
- Used by my “group” (~5-10 people?)
- Coding: I do most, but they might add
- Repository: code mostly with me still
- No common coding standards
- We’re the end users, informally
- No testing besides use
- No real documentation, maybe a readme
- People might pick up new source once a day, or not bother

10 Prototype
- Used by my “group” (~5-10 people?)
- Coding: I do most, but they might add
- No real testing, documentation, etc.
- People might pick up new source once a day, or not bother

11 Moving Toward Production
- Used by someone I don’t know
- Coding by several folks
- Might have a common repository
- Coding “standards” depending
- Testing by my friends (end users)
  - Might get email with suggestions once in a while
  - Might have common test cases
- Readme for doc
- Might have a “release” if there’s a repo

12 Moving Toward Production
- Used by someone I don’t know
- Coding by several folks
- Might have a common repository
- Coding “standards” depending
- Some testing
- Readme for doc
- Might have a “release” if there’s a repo

13 Production Software (for Academics)
- Used by a lot of people I don’t know

14 Production Software (for Academics)
- Coding by a larger group
- Common repository with check-in procedures
- Agreed-on coding standards
  - Real sw architecture, naming, spacing, etc.

15 Production Software (for Academics)
- Formal testing
  - Unit tests, test harness, etc.
- Documentation (and a bug fixing process)
- Formal release process
- License

16 Production Software Features
Production Software
- End user considerations
- Multiple coders
- Repository with check-in procedures
- Coding conventions
- Formal testing (bug fixes)
- Documentation (commenting, readme)
- Formal release process
- License

17 So how does this relate to data?
Production Software | Production Data
End user considerations | End user considerations
Multiple coders | Multiple producers/collectors
Repository with check-in procedures | (Local) archive with check-in procedures
Coding conventions | Collection conventions
Formal testing (bug fixes) | Formal testing (QA/QC, bug fixes)
Documentation (commenting, readme) | Documentation (metadata, workflow compat)
Formal release process | Formal release process to external archive
License | License and citation

18 Bottom Line
- As more people use your “stuff”, you need to formalize how you approach it to keep it useful
- The more people you collaborate with to create your “stuff”, the more process you need to make sure things are coordinated

19 What is “data”?
- Observations?
- Data analysis results?
- Modeling results?
- Software?
- Metadata? (One person’s metadata is another person’s data…)

20 “Data” refers to everything needed to have reproducible science

21 Who’s Using Your Data Sets
- This is all about sharing
  - If no one else has access to your data/code, then it doesn’t matter
- Collaborative science
  - Approach to science is fundamentally changing
  - Your noise is someone else’s signal
- Reproducible science

22 Local Archive Check-in
- In SW-world this involves some kind of code check-in to a repository
- Get a sanity check
- When data comes off an instrument or out of a notebook, there needs to be a (very basic) correctness check
  - Columns in the right order
  - Fields fully propagated
  - Boundary conditions
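The three checks on this slide could be sketched roughly as below. This is a minimal illustration, not the speaker's tooling: the column names, the limits, and the `check_in` function are all hypothetical, and "fields fully propagated" is read here as "no field is left empty".

```python
import csv
import io

# Hypothetical column layout and limits, for illustration only.
EXPECTED_COLUMNS = ["timestamp", "depth_m", "temperature_c"]
LIMITS = {"depth_m": (0.0, 11000.0), "temperature_c": (-5.0, 40.0)}


def check_in(csv_text):
    """Return a list of problems found; an empty list means the file passes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Check 1: columns in the right order.
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name in EXPECTED_COLUMNS:
            value = row[name]
            # Check 2: every field has a value.
            if value is None or value.strip() == "":
                problems.append(f"line {line_no}: {name} is empty")
                continue
            # Check 3: boundary conditions on numeric fields.
            if name in LIMITS:
                low, high = LIMITS[name]
                try:
                    number = float(value)
                except ValueError:
                    problems.append(f"line {line_no}: {name} is not a number")
                    continue
                if not low <= number <= high:
                    problems.append(
                        f"line {line_no}: {name}={number} outside [{low}, {high}]"
                    )
    return problems
```

A file would only be accepted into the local archive when `check_in` returns an empty list, much as a commit is only merged when its pre-merge checks pass.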

23 Testing, QA/QC, Bug fixes
- Make it reliable, make it useful
- Quality assurance echoes running a test suite
  - Check data ranges
  - Correct for known instrument error
  - Sometimes first derived data products
- One difference from SW
  - Some people want the data pre-QA/QC
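One way to picture QA/QC as a test suite, while still serving people who want the pre-QA/QC values, is to keep the raw reading next to the corrected one. The sketch below assumes a single temperature sensor; the offset, the valid range, and the names `Reading` and `qa_qc` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical values for illustration: a known sensor offset and a plausible range.
KNOWN_OFFSET_C = 0.3
VALID_RANGE_C = (-2.0, 35.0)


@dataclass
class Reading:
    raw_c: float        # value as it came off the instrument (kept for pre-QA/QC users)
    corrected_c: float  # value after the known-error correction
    flag: str           # "good" or "out_of_range"


def qa_qc(raw_values):
    """Apply the known-offset correction, then flag values outside the valid range."""
    low, high = VALID_RANGE_C
    results = []
    for raw in raw_values:
        corrected = raw - KNOWN_OFFSET_C
        flag = "good" if low <= corrected <= high else "out_of_range"
        results.append(Reading(raw, corrected, flag))
    return results
```

Flagging rather than deleting suspect values is a design choice in this sketch: nothing is lost, and downstream users can decide which flags they trust.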

24 Bug Fixes
- One of the fatal flaws with the “publish” approach to data
- Sometimes data needs to be updated!
  - You may find this, or someone else may
- Any fix should become a step in the QA/QC process
- Sometimes bug fixes are actually suggestions for new features
  - Needed as well for the next time you collect data
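The idea that any fix should become a step in the QA/QC process resembles adding a regression test after a software bug. A minimal sketch, assuming records are plain dictionaries; the registry, the step names, and the bug report are hypothetical.

```python
# Each QA/QC step is a function that takes a list of records (dicts) and
# returns a list of problem descriptions. All names here are hypothetical.
QA_STEPS = []


def qa_step(func):
    """Register a function as a QA/QC step."""
    QA_STEPS.append(func)
    return func


@qa_step
def no_missing_station(records):
    return [
        f"record {i}: missing station"
        for i, r in enumerate(records)
        if not r.get("station")
    ]


# Added after a (hypothetical) bug report: some salinity values were negative.
@qa_step
def salinity_not_negative(records):
    return [
        f"record {i}: negative salinity"
        for i, r in enumerate(records)
        if r.get("salinity", 0) < 0
    ]


def run_qa(records):
    """Run every registered step and collect all reported problems."""
    problems = []
    for step in QA_STEPS:
        problems.extend(step(records))
    return problems
```

Because the new check stays registered, the same problem is caught automatically the next time data is collected.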

25 Documentation
- Make it usable
- Need more than just metadata over time
- How was the data collected?
  - Details on instruments, QA/QC, etc.
- How can the data be used?
  - And how should the data NOT be used?
- Where can someone find out more about your science?

26 Common Metadata Conventions
- Make the data understandable
- Without basic metadata, no one can use the collected data (not even you, with time)
- Shared standards
  - “Biologists would rather use someone else’s toothbrush than use someone else’s metadata standards” -C. Stewart, IU
  - Hundreds of ontologies to choose from, none are quite right
- Necessary starting point

27 Metadata and Standards

28 Formal Release Process (to external archive)
- Note “Release” – Not “publication”
  - “The data publication metaphor can be misleading and may even countermand aspects of good data stewardship.” -Mark Parsons and Peter Fox, Is Data Publication the Right Metaphor?
  - http://mp-datamatters.blogspot.com/2011/12/seeking-open-review-of-provocative-data.html
- Similar to software release: formal and planned for production quality
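A release, unlike a one-time publication, can be versioned and compared with the one before it. The sketch below shows one possible way to do that with a checksum manifest per release; the manifest layout, the version labels, and both function names are assumptions made for illustration, not a format required by any archive.

```python
import hashlib
import json


def build_manifest(version, files):
    """Return a JSON manifest for one release.

    `files` maps a file name to its bytes; `version` is a label such as "1.0.1".
    """
    entries = {
        name: hashlib.sha256(content).hexdigest()
        for name, content in sorted(files.items())
    }
    return json.dumps({"version": version, "sha256": entries}, indent=2, sort_keys=True)


def changed_files(old_manifest, new_manifest):
    """List files whose checksum differs, or that exist in only one release."""
    old = json.loads(old_manifest)["sha256"]
    new = json.loads(new_manifest)["sha256"]
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

Shipping the manifest with each release lets a user confirm which version they hold and see exactly which files a bug-fix release touched.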

29 License
- Get credit for your work
- Creative Commons license
  - You keep your copyright but allow people to copy and distribute your work provided they give you credit, and only on the conditions you specify
- Every data set should come with citation information
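Issuing citation and license information with each release can be as simple as generating one line from the release record. In the sketch below the `Release` fields, the citation layout, and the sample values are all hypothetical; the layout is an illustrative choice, not a citation standard.

```python
from dataclasses import dataclass


@dataclass
class Release:
    authors: list   # e.g. ["Doe, J.", "Roe, R."] (placeholder names)
    year: int
    title: str
    version: str
    archive: str    # name of the external archive holding the release
    license: str    # e.g. "CC BY"


def citation(release):
    """Format a one-line citation for a data release."""
    names = "; ".join(release.authors)
    return (
        f"{names} ({release.year}). {release.title}, version {release.version}. "
        f"{release.archive}. License: {release.license}."
    )
```

Including the version in the citation ties the credit to a specific release, so a later bug-fix release does not silently change what was cited.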

30 Open Source Software is Like a Free Puppy

31

32 Recap on building production data
- Local archive – get a sanity check
- Testing – make it reliable
- QA/QC, Bug fixes – make it useful
- Documentation – make it usable
- Metadata – make it understandable
- Formal release – make it stable
- Citation – get some credit

33 Today’s Question
Can we leverage the (slightly more) formalized process of producing software to help us produce data?

34 Managing Data Like Software
Production Software | Production Data
End user considerations | End user considerations
Multiple coders | Multiple producers/collectors
Repository with check-in procedures | (Local) archive with check-in procedures
Coding conventions | Collection conventions
Formal testing (bug fixes) | Formal testing (QA/QC, bug fixes)
Documentation (commenting, readme) | Documentation (metadata, workflow compat)
Formal release process | Formal release process to external archive
License | License and citation

35 Contact Points
Jennifer Schopf
jmschopf@gmail.com
This talk is based on content written up in: “Treating Data Like Software: A Case for Production Quality Data”, Proceedings of the Joint Conference on Digital Libraries, June 2012.
http://delivery.acm.org/10.1145/2240000/2232846/p153-schopf.pdf

36 Digital resources that are not properly curated do not remain accessible for long
Study | Resource Type | Resource Half-life
Koehler (1999 and 2002) | Random Web pages | 2.0 years
Nelson and Allen (2002) | Digital Library Object | 24.5 years
Harter and Kim (1996) | Scholarly Article Citations | 1.5 years
Rumsey (2002) | Legal Citations | 1.4 years
Markwell and Brooks (2002) | Biological Science Education Resources | 4.6 years
Spinellis (2003) | Computer Science Citations | 4.0 years
Source: Koehler W. (2004) Information Research, 9 (2), 174
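To make the half-life figures concrete, the sketch below converts a half-life into the fraction of resources still accessible after a given time. It assumes simple exponential decay, which is what a half-life implies but which the cited studies may not model exactly; the function name is hypothetical.

```python
def fraction_remaining(years, half_life_years):
    """Fraction of resources still accessible after `years`, assuming exponential decay."""
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)
```

Under this assumption, a 2.0-year half-life (random Web pages) leaves 25% accessible after 4 years, while a 24.5-year half-life (digital library objects) leaves roughly 89%.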


38 Poor Data Practices
[Figure (Michener et al. 1997): Information Content versus Time. Labels: Time of publication; Specific details; General details; Accident; Retirement or career change; Death]

