Theme 3: Architecture

Q1: Who houses stuff, both records and identifiers
All useful services and repositories are centralized (latency, etc.) … but centralizing content will be costly, require agreements, and create liabilities re: versioning, etc. – problematic as a short-term goal
Overall, specialized repositories are proliferating, not converging
If the content stays only in the subject-specific repositories (SSRs):
–Provide opt-in storage services (funding model?)
–Provide audit function re: repository compliance with standards (e.g. RLG/OCLC trusted repository guidelines)
–Provide information/guidance on formats (risk, migration)
–Extend JHOVE for key formats important to the community

Q1: Who houses stuff, both records and identifiers (cont.)
Many formats in the field … most data is in a small number of formats, but data in the long tail is very important (engage GDFR?)
Metadata may be more widely replicated than data
External resources (SSRs): utilize OpenURL to facilitate (and distinguish between) access to data, services, metadata, etc. for a single item – link journal-hosted data with additional/ancillary data hosted by DRIADE?
Service level agreements
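
The OpenURL point above can be illustrated with a minimal sketch. It assumes the key/encoded-value form of OpenURL 1.0 (`url_ver`, `rft_id`, `svc_id`); the resolver address, the service identifiers, and the example DOI are all hypothetical placeholders, not anything DRIADE is known to have deployed.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical resolver address, used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"


def build_openurl(item_id: str, service: str) -> str:
    """Build one OpenURL for a single item and a single requested service."""
    params = {
        "url_ver": "Z39.88-2004",  # OpenURL 1.0 version tag
        "rft_id": item_id,         # the referent: the item being asked about
        # Hypothetical service identifier; a real deployment would agree on these.
        "svc_id": f"https://resolver.example.org/svc/{service}",
    }
    return f"{RESOLVER_BASE}?{urlencode(params)}"


# Same item, three different kinds of access.
item = "info:doi/10.0000/example"  # placeholder DOI
links = {svc: build_openurl(item, svc) for svc in ("data", "metadata", "services")}

# The item identifier survives a round trip through the query string.
parsed = parse_qs(urlsplit(links["metadata"]).query)
print(parsed["rft_id"][0])
```

The design choice here is that the item identifier stays constant while only the service key changes, which is what lets one link family distinguish data, metadata, and services for a single item.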

Q2: Is it productive to process full text for automated generation of context metadata?
Yes, but …
There are a variety of ways to do this … quantitative analysis is less costly, natural language processing requires more investment
More can be done if access to full text is allowed (comb full text for linkages, etc.)
Portal searches can also be contextualized using a "bag of words" approach to describing subfields as indexes
A combination of statistical processing, natural language processing, and the rise of XML-based metadata can help
Can capture administrative/technical metadata in data flows
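
As a rough sketch of the "bag of words" idea above: each subfield is described by a word-count vector, and a portal query is matched to the closest subfield by cosine similarity. The subfield names and vocabularies below are invented for illustration; in practice they would be derived from full text or abstracts.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical subfield descriptions, standing in for indexes built from real text.
SUBFIELDS = {
    "phylogenetics": bag_of_words("phylogeny tree clade taxa sequence alignment"),
    "ecology": bag_of_words("population habitat species abundance community survey"),
}


def rank_subfields(query: str) -> list:
    """Return subfield names ordered from best to worst match for the query."""
    q = bag_of_words(query)
    return sorted(SUBFIELDS, key=lambda name: cosine(q, SUBFIELDS[name]), reverse=True)


print(rank_subfields("species abundance in a forest habitat"))
```

This is the cheaper, purely statistical end of the spectrum the slide describes; it uses no natural language processing beyond tokenizing.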

Q3: Does storing a local copy make sense for SSR handshaking?
Helps to assure persistent access to content (as with CiteSeer) … but comes with burden and responsibility
Data vs. application – need to secure access to underlying data … replicating AJAX-y services is very, very hard
Versioning is a key issue here

Q4: Is everyone in agreement with the "don't compete with Google" conclusion?
Yes and no: develop community-specific discovery environments … but also expose content to Google (expose, contextualize, refer to domain-specific systems) – leverage commonly used interfaces
Google, Microsoft, etc. now highly value highly-curated collections and are actively engaging them
Google's current interface is the big thing now … be prepared to interface with the next big thing
WorldCat.org as an advanced discovery environment for scholarly material, including (increasingly) data

Q5: What are the pros and cons of DOIs, handles, and other identifiers?
One of the most important issues DRIADE will face
Persistent, actionable identifiers vs. unique identifiers in various sub-domains and individual institutions (an item will have many IDs)
Question of DOI expense, connection to publishers
Need community understanding of a canonical identifier
Need a community discussion in terms of what is important about identifiers
–Who controls/changes, software used, locally-hosted?
–What cost? Branding? Need resolution data?
–3rd-party assignment of persistent identifiers?

Q5: What are the pros and cons of DOIs, handles, and other identifiers? (cont.)
Need to promote datasets to primary resources (not just subordinated to the article) in references and discovery
For multi-file datasets – need to link to surrogate or package
Identifiers as "micro-billboards" and generators of data about contextual use of data (resolution data)
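
To make the "many IDs, one canonical identifier" point concrete, here is a minimal in-memory sketch. The class, its methods, and every identifier string are hypothetical; it only illustrates mapping local or sub-domain IDs onto one canonical ID and counting resolutions as a source of usage data.

```python
from collections import Counter


class IdentifierRegistry:
    """Toy registry: many aliases per item, one canonical identifier."""

    def __init__(self):
        self._canonical = {}          # any known ID -> canonical ID
        self.resolutions = Counter()  # canonical ID -> number of successful lookups

    def register(self, canonical_id, aliases=()):
        """Record a canonical ID together with any alternative IDs for the same item."""
        self._canonical[canonical_id] = canonical_id
        for alias in aliases:
            self._canonical[alias] = canonical_id

    def resolve(self, identifier):
        """Return the canonical ID for any known identifier, or None if unknown."""
        canonical = self._canonical.get(identifier)
        if canonical is not None:
            self.resolutions[canonical] += 1  # "resolution data" about use
        return canonical


registry = IdentifierRegistry()
# All three identifiers below are placeholders, not real assignments.
registry.register("doi:10.0000/example.1", aliases=["hdl:0000/42", "local:lab-7"])

print(registry.resolve("hdl:0000/42"))
print(registry.resolutions["doi:10.0000/example.1"])
```

Who runs such a registry, and at what cost, is exactly the open question the slides raise; the sketch takes no position on that.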

Q6: Data and applications: where does the complexity live?
Leave it up to the community to develop best practices over time
Over-engineering here will make it harder to be responsive to change
Facilitate and let practice develop within sub-communities (testbeds for innovation)
Content packaging plays a role here: bundling data with services, documentation, etc.
Utilize (and cultivate) web services and lightweight APIs to facilitate access across and between systems
Some opportunities to desiccate replications from complex applications

Q7: How does death fit into the metadata lifecycle?
Tombstoning for dead data
Data euthanasia?
Shifts in contact info (author, data custodian)
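
One possible reading of "tombstoning" is sketched below: when data is withdrawn, the identifier and descriptive metadata are kept and only the payload is dropped, so references to the item still resolve to something meaningful. The record fields and function names are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    identifier: str
    title: str
    custodian: str                      # current contact; may change over time
    payload: Optional[bytes]            # None once the data has been withdrawn
    withdrawn_reason: Optional[str] = None

    @property
    def is_tombstone(self) -> bool:
        return self.payload is None


def withdraw(record: Record, reason: str) -> Record:
    """Return a tombstone: same identifier and metadata, no payload."""
    return Record(record.identifier, record.title, record.custodian, None, reason)


def describe(record: Record) -> str:
    """Summarize a record for whoever follows its identifier."""
    if record.is_tombstone:
        return (f"{record.identifier}: withdrawn ({record.withdrawn_reason}); "
                f"metadata retained, contact {record.custodian}")
    return f"{record.identifier}: {record.title} ({len(record.payload)} bytes)"


live = Record("local:dataset-1", "Example dataset", "custodian@example.org", b"a,b\n1,2\n")
dead = withdraw(live, "superseded")
print(describe(live))
print(describe(dead))
```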

Q8: How to nurture bottom-up growth of data standards?
Help to foster individual sub-communities, and cultivation of best practices at the sub-community level that can be used to inform other efforts or the broader infrastructure
Sharing and re-use encourage consolidation of standards/best practice; cultivating mechanisms for sharing/re-use may help with achieving data consistency
Start from existing baseline standards – perhaps offer broad generalized standards as a starting point?