Using DSpace as a Disciplinary Data Repository


1 Using DSpace as a Disciplinary Data Repository
Ryan Scherle National Evolutionary Synthesis Center

2 NESCent is funded by NSF, and jointly run by Duke, UNC, and NC State.
One of only 3 synthesis centers in the US.

3

4 NESCent’s Mission
Support synthetic research
Develop informatics tools
Increase public understanding of science
Promote a culture of data sharing
Data sharing = promote open source, build a repository

5 A Repository of Data Underlying Journal Articles
*** Repository != database
*** Data != publications
*** Journal articles => relationships with publishers, collect primary data as it is produced.
**** Coupling of publication and data submission.

6 Dryad Partners
The project is led by NESCent and the UNC Metadata Research Center. Additional partners include the University of New Mexico (integration with the Knowledge Network for Biocomplexity), Yale (integration with TreeBASE), and NCSU (LOCKSS replication, server management). Primary funding comes from NSF, with IMLS support for one targeted project. NSF grants supporting Dryad include: NSF #EF NSF #DBI NSF #DBI

7 Databases in Biology
GenBank, MorphBank, Morphobank, PaleoDB, Phylota, Protein Data Bank, TreeBASE, Tree of Life, AntWeb, FishBase, FlyBase, HerpNet, MaNIS, ORNIS, WormBase, ZFIN
Biology has a long history of databases, but each is specialized to store a single type of data or data about a single type of organism. There are many more; I made this list off the top of my head without really trying, and I’m not even a biologist.

8 The Goal
Store all data underlying publications in evolutionary biology, ecology, and related disciplines, at the time of publication.
(diagram: GenBank, TreeBASE, Dryad; sample sequence ccaattggct gttcttcgat tctggcgagt)
Archiving at publication time is the only model that has been proven to work. Let’s say I went to the nearest river and started collecting frogs. I might sequence some DNA, measure various characteristics of the frogs, and use the data to build a phylogenetic tree illustrating the relationships between the species I found. I write a paper and send it off to be published. The sequences could be deposited in GenBank, and the tree could be deposited in TreeBASE. But what if someone wants to replicate my work? Where would they find the leg measurements I made? Or the latitude/longitude of my collection sites? This is the type of data Dryad wants to capture.

9 (Riju et al., 2007) hdl:10255/dryad.158
This is what most people think of when they imagine data from biology: DNA sequences. Here, we have a sequence with many possible mutation points indicated.

10 (Payne et al., 2008) hdl:10255/dryad.222
Typical synthesis data, in spreadsheet form. The work of many researchers has been combined to investigate the size of the largest living organism at various points in history.

11 (Sidlauskas 2007) hdl:10255/dryad.23
Add pictures of biologists and their data -- Sidlauskas, McClain. The data is very diverse, so for now we’re just archiving, and we will build more detailed access/analysis tools later. But I didn’t come to talk about data; I came to talk about DSpace.

12 (Taylor and Naish 2007) hdl:10255/dryad.31

13 (Price et al., 2004) hdl:10255/dryad.82
Other than Excel, the most common data format in evolutionary biology is Nexus, an ASCII format that can store genetic sequences and evolutionary (phylogenetic) trees.

14 Joint Data Archiving Policy
Deposit at time of publication
Repeatability
Embargo
Exceptions
Coordination
<<This journal>> requires, as a condition for publication, that data used in the paper should be archived in an appropriate public archive, such as <<list of approved archives here>>. The data should be given with sufficient details that, together with the contents of the paper, allow each result in the published paper to be re-created. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as the location of endangered species.
Full list of partner journals at

15 Why DSpace?
Aren’t data objects usually stored in Fedora?
User registration
Submission system
Administrative interface
Search/browse system
Manakin
Speed of initial implementation
Data objects in Fedora: of course, more universities are starting to use DSpace for data, but as far as I know, none of those repositories are dedicated to data. Initially, cultural change is more important than the data itself. We needed to get something up quickly while we had the community’s interest. The submission system must be easy! Unfortunately, DSpace/Manakin has a steep learning curve. Fedora would have taken longer to get to version 0.5, but perhaps the same time for 1.0. No one is more excited about DuraSpace than I am; I miss a lot of the features of Fedora!

16

17 Disciplinary repositories
Don’t serve the needs of a single institution
Lack a formal organization
No formal structure
The repository is the “organization”
Must locate pockets of dedicated users
Must integrate with other community resources
Our "audience" is researchers in the discipline, and we must implement features that appeal to them. (It can be argued that the current DSpace is optimized for appealing to librarians and university administrators.) Integration with other repositories is paramount: people want to be able to find things in the other repositories they use. If we're not connected to the other repositories, people will say "why should I put it in yet another database/repository?" (ref the talk on "Biology doesn't need another database")

18 Data repositories
Customized metadata fields are required
Data often lacks complete metadata
Publications can provide context
Data comes in a wide variety of formats
Connections to other data are valuable

19 We performed many minor tweaks
Hid communities and collections -- The collections/communities model doesn’t fit well for an unstructured community.
Modified handles -- We needed to add a bit of branding to handles.
Expanded metadata fields -- Added metadata fields that were useful to the community (thinking about removing many standard fields to reduce curator confusion).
Added surrounding static pages -- Surrounding static pages provide instructions for usage and allow additional repository functionality (Dry-ed).
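
The branded handles cited throughout these slides (for example hdl:10255/dryad.158) have a prefix and a suffix separated by a slash. The following is a minimal Python sketch of that format only; it is not the DSpace handle code, and the names `Handle`, `parse_handle`, `branded_suffix`, and `resolver_url` are hypothetical. It assumes the branding is a repository name placed in front of an item number, as the slide examples suggest.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    prefix: str  # naming authority, e.g. "10255"
    suffix: str  # local name, e.g. "dryad.158"


def parse_handle(text: str) -> Handle:
    """Split 'hdl:10255/dryad.158' (or '10255/dryad.158') into prefix and suffix."""
    value = text.strip()
    if value.lower().startswith("hdl:"):
        value = value[4:]
    prefix, sep, suffix = value.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {text!r}")
    return Handle(prefix, suffix)


def branded_suffix(brand: str, item_number: int) -> str:
    """Build a suffix that carries the repository name (assumed scheme)."""
    return f"{brand}.{item_number}"


def resolver_url(handle: Handle) -> str:
    """URL for the public Handle System proxy at hdl.handle.net."""
    return f"https://hdl.handle.net/{handle.prefix}/{handle.suffix}"
```

For instance, `parse_handle("hdl:10255/dryad.158")` yields the prefix `10255` and the suffix `dryad.158`, where `dryad` is the branding and `158` the item number.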

20 Embargo

21 Embargo
We contracted for some of this functionality.
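
The archiving policy lets authors embargo data for up to a year after publication. A rough Python sketch of that date rule follows; it is not the contracted DSpace embargo code. The function names and the choice to count the embargo in whole months (clamping to the last day of shorter months) are assumptions made for illustration.

```python
import calendar
from datetime import date


def embargo_end(published: date, months: int = 12) -> date:
    """Date on which embargoed data becomes public (assumed month-based rule)."""
    if not 0 <= months <= 12:
        raise ValueError("embargo must be between 0 and 12 months")
    total = published.month - 1 + months
    year = published.year + total // 12
    month = total % 12 + 1
    # Clamp the day so that e.g. Feb 29 + 12 months lands on Feb 28.
    day = min(published.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_public(published: date, today: date, months: int = 12) -> bool:
    """True once the embargo period has elapsed."""
    return today >= embargo_end(published, months)
```

With `months=0` the data is public on the publication date, which matches the policy's option of immediate release.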

22 Search modifications
Again, these modifications were performed
Search results grouping
More configurable advanced search
OpenURL/COinS support
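
COinS embeds an OpenURL ContextObject in an HTML span with class Z3988, so that browser tools can pick up citation metadata from a results page. Below is a minimal Python sketch of generating such a span for a journal article. It is not the Dryad or DSpace implementation; the function name `coins_span` and the particular set of fields are illustrative assumptions.

```python
from html import escape
from typing import List
from urllib.parse import urlencode


def coins_span(title: str, authors: List[str], journal: str, year: int) -> str:
    """Return a COinS <span> describing a journal article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
        ("rft.date", str(year)),
    ]
    # rft.au may repeat, once per author.
    fields += [("rft.au", author) for author in authors]
    query = urlencode(fields)
    # The query string sits in an HTML attribute, so '&' must become '&amp;'.
    return f'<span class="Z3988" title="{escape(query, quote=True)}"></span>'
```

The span is left empty on purpose: it is invisible to readers and only carries metadata in its title attribute.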

23 The default DSpace submission system is infamous
Even with the Configurable Submission System:
too complex for the average user
lots of metadata fields are off-putting
doesn’t handle relationships between objects
Simplify, simplify, simplify! (we're dealing with biologists)

24 (screenshot – submission summary page)
Submission “from scratch” is greatly simplified from the normal DSpace. Our basic submission system collapses 7 “steps” into 3 (there are still a few steps, but they’re hidden within the 3 pages). Journal integration is even more simplified: the journal sends an acceptance letter with a link to the submission system, and all bibliographic metadata is automatically imported.

25 Submission system
For data entry, all they have to do is upload the file; everything else is automatic or optional. Note that authors and keywords are automatically inherited from the publication, but can be changed.
Ongoing research for automated metadata
Eventually, we want the author to get an , say “duhh… ok”, and everything else is automatic from there.
This is not the ideal implementation; we will better integrate with DSpace in the future.
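
The inheritance described on this slide (a data file picks up authors and keywords from its publication unless the submitter changes them) could look roughly like the Python sketch below. This is an illustration under assumptions, not the Dryad submission code: the dictionary layout, the `is_part_of` field, and the function name `describe_data_file` are all hypothetical.

```python
from typing import Any, Dict, Optional

# Fields copied from the publication record by default (per the slide).
INHERITED_FIELDS = ("authors", "keywords")


def describe_data_file(
    publication: Dict[str, Any],
    filename: str,
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build metadata for an uploaded data file from its publication."""
    record: Dict[str, Any] = {
        "title": filename,
        "is_part_of": publication["title"],
    }
    for name in INHERITED_FIELDS:
        # Copy the list so later edits do not alter the publication record.
        record[name] = list(publication.get(name, []))
    # Anything the submitter typed wins over the inherited value.
    record.update(overrides or {})
    return record
```

The submitter only has to supply the file; `overrides` stays empty unless they choose to edit an inherited field.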

26 What’s next?
OAI harvesting
Versioning
Authority control
Ontology integration
Curation interface
Faceted search
Replication services
Tagging, annotation
Integration with more journals
Integration with partner repositories
More submission enhancements
Data-specific analysis tools
Linking to outside sources: with a data repo, the linkages are more important. Articles have a certain type of structure, and people typically just want to read the article. With data, people want to read the article, but they also want to know about related datasets.
Bending DSpace to your will is possible, but even with the latest improvements (Manakin, Configurable Submission), it’s still a serious pain. We're aware the community will move forward, but will always be tangential to our needs, and we’re willing to accept some extra overhead, but we can’t keep up this level of involvement forever. We’re putting a lot of hope into 2.0 and a long-term roadmap that is based more on the Fedora model.
Better developer documentation is needed, especially on how to hack Manakin. Too much of what I know comes from Caveat Lector and random mailing list posts. (But I'm as guilty as anyone, because I've learned a lot that I haven't posted yet.)
We want to implement new functionality so it best fits into the DSpace roadmap has been very helpful for this)
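
OAI harvesting, the first item on the list above, works by sending HTTP requests with a `verb` parameter to a repository's OAI-PMH endpoint. The Python sketch below only builds a `ListRecords` request URL; it does not fetch or parse anything. The base URL in the usage note is a placeholder, not a real Dryad address, and the function name `list_records_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # A resumption token is exclusive: it replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

A harvester would call this once with `from_date` set to its last harvest, for example `list_records_url("https://repo.example.org/oai/request", from_date="2008-01-01")`, then keep calling it with each `resumption_token` the server returns until none is left.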

27 Broader Collaboration
Remember, data benefits from being remixed
*** Dryad is part of the D1 cooperative
**** integrating the suite of “investigator toolkit”
**** D1 will want input from the community

28 To learn more… Repository: Project info: Source code:

