Open Dialogue on Digital Data Management. Pat Burns, Dean; Dawn Paschal, Assistant Dean; CSU Libraries. October 13, 2010.
Background: NSF requires proposals submitted as of Jan. 18, 2011 to include plans for data management (http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j). NIH & USDA have similar requirements. Other agencies looming: ‘Federal Research Public Access Act’. Maximizing the value of data by sharing: Discoverability, Access, Preservation, Management.
Science ‘Then’ (5-10 years ago): Theory, Computation, Experiment.
Science ‘Now’: Theory, Computation, Experiment, and Data (the diagram repeats the ‘Data’ label many times).
Science ‘Emerging’: Theory, Experiment, Data, Computation.
Science 2.0 ‘Now’? Data, Theory, Experiment, Data, Computation.
Large Digital Data Sets. Satellite imagery can generate > 1 petabyte (10^15 bytes) of data per day! Supercomputers also generate massive data sets. Can we transport them? E.g., at 10 Gbits per second (note bits, not bytes: 1 byte = 8 bits): Time = 8x10^15 bits/(10^10 bits/sec) = 8x10^5 secs ≈ 222 hours = 1 week, 2 days, 6 hours, 13 mins. Can we store them? Requires 500 ea. 2 TByte disks @ $250 ea. = $125,000; @ 5 year lifetime = $25,000/yr. Requires 1 full rack in a data center: space, power, cooling, …
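The transport and storage arithmetic on this slide can be checked with a short sketch. It uses only the slide's own inputs (1 petabyte, a 10 Gbit/s link, 2 TByte disks at $250 each, a 5 year lifetime); the function names are illustrative, and the link is assumed to run at its full rated speed with no protocol overhead.

```python
PETABYTE = 10**15            # bytes, as on the slide
LINK_BITS_PER_SEC = 10**10   # 10 Gbit/s
DISK_BYTES = 2 * 10**12      # one 2 TByte disk
DISK_PRICE = 250             # dollars per disk
LIFETIME_YEARS = 5


def transfer_seconds(num_bytes, bits_per_sec):
    """Ideal transfer time: 8 bits per byte, no overhead (an assumption)."""
    return num_bytes * 8 / bits_per_sec


def breakdown(seconds):
    """Split a duration into (weeks, days, hours, minutes, seconds)."""
    seconds = int(seconds)
    weeks, rem = divmod(seconds, 7 * 24 * 3600)
    days, rem = divmod(rem, 24 * 3600)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return weeks, days, hours, minutes, secs


def storage_cost(num_bytes, disk_bytes, disk_price, lifetime_years):
    """Return (disk count, purchase cost, cost per year of lifetime)."""
    disks = -(-num_bytes // disk_bytes)  # ceiling division
    total = disks * disk_price
    return disks, total, total / lifetime_years


secs = transfer_seconds(PETABYTE, LINK_BITS_PER_SEC)
print(secs, breakdown(secs))   # 800000.0 (1, 2, 6, 13, 20)
print(storage_cost(PETABYTE, DISK_BYTES, DISK_PRICE, LIFETIME_YEARS))
# (500, 125000, 25000.0)
```

With these inputs, one petabyte takes about 222 hours to move and needs 500 disks costing $125,000 in total, or $25,000 per year over a 5 year lifetime. The figures exclude the rack, power, and cooling the slide also mentions.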
Incoming! An individual researcher can generate many data sets. We have many researchers who generate large data sets. Number: Many x many = Very many! Size: Very many x Very big = Enormous!
Projected Needs (2009 CSU Survey): CSU-DR = 3 Tbytes!!!
How Can We Help? IT and Libraries (slide items, in order): Storage capacity; Transport capacity; Back-up; Sysadmin; IT security/privacy; Transcoding; Data organization & structure; IP issues; Metadata; Discoverability; Preservation. Joint Operations. “There comes a time in one’s life, where one must grab the bull by the tail, and face the situation.” System Stewards; Data/Info Stewards. The ‘front end’: Interactions w/ researchers. The ‘back end’.
How Can We Help (cont’d)? Agreement upon a framework. Draft of a framework, present to faculty. Language for our faculty to include in their proposals. Strategy, policy, procedures. Definition of work flow(s). Architectures for operations & preservation. Back-up vs. preservation, LOCKSS?
Policies: The ‘Front End’. DRM: IP/ownership issues: data sets not ‘copyrightable’ (not creative works). But there may be local, institutional IP policies that override this. Note that IP ≠ copyright. Creative Commons or Science Commons licensing may apply. An embargo period is required. What are the preservation periods?
NSB Data Type Definitions*: Research collections (small, useful to individuals/teams for life of a project, limited curation, standards typically lacking); Resource collections (medium, useful to a community, follow group’s standards, mid- to long-term utility); Reference collections (large, serve many segments of science/engineering, conform to robust standards, indefinite support). *National Science Board
Workflow. Faculty provide data/information: enter metadata, user’s manuals, select embargo period, select licensing options, enter pubs, point to or supply data sets, … Librarians manage data/information: review metadata, ingest and make accessible, review periodically, deaccession periodically (annually?), manage data, interact w/ faculty. IT staff implement and operate systems: operate system, backups, security, upgrading storage, transport, move to LOCKSS, etc.
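As an illustration of what a faculty submission in this workflow might carry, here is a minimal sketch of a deposit record. The `Deposit` class and all of its field names are hypothetical; the slides describe the items to collect (metadata, user's manuals, embargo period, licensing option, publications, and a pointer to or copy of the data) but not a schema.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Deposit:
    """Hypothetical record for one faculty submission (not a CSU schema)."""
    title: str
    metadata: dict            # lightweight descriptive metadata
    license: str              # chosen licensing option
    embargo_days: int         # embargo period selected by the depositor
    submitted: date
    data_location: str        # pointer to, or path of, the data set
    publications: list = field(default_factory=list)
    manuals: list = field(default_factory=list)

    def embargo_ends(self):
        """Date on which the embargo lapses."""
        return self.submitted + timedelta(days=self.embargo_days)

    def is_public(self, today):
        """True once the embargo period has passed."""
        return today >= self.embargo_ends()


example = Deposit(
    title="Example data set",
    metadata={"creator": "A. Researcher"},
    license="CC BY",
    embargo_days=365,
    submitted=date(2010, 10, 13),
    data_location="https://example.org/datasets/1",  # placeholder URL
)
print(example.embargo_ends(), example.is_public(date(2011, 1, 1)))
```

A record like this would give librarians one place to review metadata and embargo status before ingest; the periodic review and deaccession steps are outside this sketch.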
Digital Assets - the 4 Pieces. The Metadata, ideally on the CSU-DR: Typical, what we collect today, e.g. lightweight metadata (probably not copyrightable); Contextual, e.g., user’s manuals (yes, copyrightable). Scholarly publications associated with the data – ideally on the CSU-DR. The data itself – should be in the most appropriate place (pointers?).
Digital Assets Management (diagram labels): 1. Metadata; 2. User’s Manuals; 3. Pubs; 4. Data Sets (Small, Medium, Large). Libraries-DR; Local Storage; Disciplinary Repositories, SC Centers, etc.; “The Cloud”; “Pointers”.
Architecture (diagram labels): LOCKSS; High-speed Networks; Primary System; The Digital Repository; Preservation System.
8/19/2009 CSU Storage Project: 45 TBytes (raw) for ~$8k
Strategy for Storage of Data Sets. Small, < 100 GB: we would agree to store on the DR, but not forever. Medium: we would agree to store on the DR for a limited time at a cost, or on a local server somewhere and we point to it. Large: stored on a disciplinary DR somewhere, at a supercomputer center, or at a large instrument center; we point to it (persistent URL?). How do we deal with exceptions?
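The size-based strategy above can be sketched as a simple lookup. The 100 GB (0.1 TB) boundary comes from this slide; the 10 TB boundary between medium and large is taken from the cost table on the following slide. The function name, the treatment of exact boundary values, and the wording of the placement strings are assumptions for illustration only.

```python
SMALL_LIMIT_TB = 0.1    # "Small, < 100 GB"
MEDIUM_LIMIT_TB = 10.0  # "MEDIUM 0.1-10 TB" in the cost table

PLACEMENT = {
    "small": "store on the DR, but not forever",
    "medium": "store on the DR for a limited time at a cost, "
              "or on a local server that the DR points to",
    "large": "store at a disciplinary DR, supercomputer center, or large "
             "instrument center; the DR points to it",
}


def size_tier(size_tb):
    """Classify a data set by size in TB (boundary handling is assumed)."""
    if size_tb < 0:
        raise ValueError("size must be non-negative")
    if size_tb < SMALL_LIMIT_TB:
        return "small"
    if size_tb <= MEDIUM_LIMIT_TB:
        return "medium"
    return "large"


for size in (0.05, 3, 45):
    tier = size_tier(size)
    print(f"{size} TB -> {tier}: {PLACEMENT[tier]}")
```

Exceptions, which the slide leaves as an open question, are not handled here; a real policy would need an override path.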
What CSUL will Store & at What Cost. Preservation period vs. size (+ means beyond end of grant period). Sizes: SMALL 0.1 TB; MEDIUM 0.1-10 TB; LARGE > 10 TB. Short (1 yr.+): Free; Maybe +. Medium (2 yrs. +): $500/TB; Maybe -. Long (> 5 yrs. +): $1,000/TB; No. Forever is a long time…
Needs. Libraries-IT partnership. Define policies for usage. Define practice for usage. Definition of workflows. Operations. Develop needed tools. Build an on-line, self-service submission tool + requirements for review of user-created metadata. Establish systems. Develop preservation infrastructure.
Issues. Will the DR become a ‘Trusted Digital Repository’? Will this enhance our proposals? What will be stored where? Will disciplinary digital repositories emerge, e.g. at NCAR and elsewhere? Flexibility is key. How best to engage: the VPR (probably already accomplished); the faculty; Library staff: faculty and operational (DM Librarians at UNM?).
Discussion is most welcome.