More Better Metadata SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need? Andrea Goethals, Harvard Library Even v.


1 More Better Metadata SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need? Andrea Goethals, Harvard Library Even v

2 How much metadata do we really need? That depends on the quality of the metadata...

3 Context of my remarks Experience developing for and now managing Harvard Library’s Digital Repository Service (DRS) (In production from 2000 – Present) – ~ 47 million files Recent multi-year overhaul of repository to the new DRS – Provided chance to analyze metadata & rethink approach

4 Prior to the new DRS Almost all metadata was user-contributed – Expertise ranged from professional labs to curators, archivists and other staff Very little validation of user-contributed metadata Metadata elements had grown organically rather than systematically. For example...

5 Some elements weren’t specific enough File format one of: ICC, GIF, JPEG, TIFF, TDF, TEXT, PCD, AIFF, RealAudio, APP, WAV, WFR, JP2, JPF, ZIP, GZIP, PDF – Format variations and versions not recorded

6 Some elements were too specific Text abstract character repertoire one of: ‘US-ASCII’, ‘Unicode’ Text character map one of: ‘ISO_646.irv:1983’, ‘UTF-8’ – These weren’t validated so in reality the text could be in any character set but would be recorded as one of these regardless
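The missing validation described on this slide can be illustrated with a minimal sketch: check whether a file's bytes actually decode under the recorded character set before accepting the recorded value. The function name `decodes_as` is hypothetical and this is not DRS code.

```python
def decodes_as(data: bytes, declared_charset: str) -> bool:
    """Return True if `data` decodes without error under `declared_charset`."""
    try:
        data.decode(declared_charset)
        return True
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in that charset.
        # LookupError: Python does not know a codec by that name.
        return False


# Text saved as Latin-1 but recorded as UTF-8 would be caught:
latin1_bytes = "café".encode("latin-1")
print(decodes_as(b"plain text", "ascii"))
print(decodes_as(latin1_bytes, "utf-8"))
```

A clean decode is necessary but not sufficient: single-byte encodings such as Latin-1 accept any byte sequence, so a check like this can only rule a recorded value out, not confirm it.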

7 Some generic elements were tracked only for certain formats For images only: – enhancements – history – methodology – producer – production software – system And the above elements allowed free text, leading to varied interpretations over time

8 Errors in relationship metadata Missing relationships (e.g. referenced in the METS descriptor file but lacking explicit relationships) Redundant relationships (files related more than once to the same files) Illogical relationships (only discoverable because of redundant metadata) – Examples: – Target images related to other target images – Non-target images described as target images – A METS descriptor file described as a scanned image – Objects merged into themselves
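Two of the error classes above (redundant relationships and objects related to themselves) lend themselves to mechanical checks. The sketch below assumes relationships are available as `(source_id, relation_type, target_id)` tuples; that representation, the relation type names, and the function name are illustrative assumptions, not the DRS data model.

```python
from collections import Counter


def check_relationships(relationships):
    """Report redundant and self-referential relationships.

    `relationships` is a list of (source_id, relation_type, target_id) tuples.
    Returns a list of (problem_kind, relationship) pairs.
    """
    problems = []
    counts = Counter(relationships)
    for rel, n in counts.items():
        source, _rel_type, target = rel
        if n > 1:
            # The same relationship was recorded more than once.
            problems.append(("redundant", rel))
        if source == target:
            # An object related to (e.g. merged into) itself.
            problems.append(("self_reference", rel))
    return problems


rels = [
    ("file-a", "HAS_TARGET", "file-b"),
    ("file-a", "HAS_TARGET", "file-b"),
    ("obj-c", "MERGED_INTO", "obj-c"),
]
for kind, rel in check_relationships(rels):
    print(kind, rel)
```

Missing relationships and mislabeled target images need knowledge of the content itself (for example, the METS descriptor), so they are out of scope for a purely structural check like this one.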

9 Strategies in the new DRS for improving metadata Automated format identification, validation & metadata extraction at ingest Validation when files are ingested, added or removed, or when relationship metadata is changed Sync with catalogs, check and improve metadata on migration Pull descriptive metadata from catalogs at ingest or on request

10 File Information Tool Set (FITS) Identifies many file formats Validates a few file formats Extracts metadata from files Aggregates metadata from many tools Calculates basic file info (file size, MD5, etc.) Outputs technical metadata – Community-standard metadata schemas Identifies problem files – Conflicting tool opinions on format, metadata values – Unidentifiable file formats – Encrypted, rights metadata embedded in files
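As an illustration of the "basic file info" item, the sketch below computes a file's size and MD5 digest with the Python standard library. It is not FITS code (FITS is a separate tool with its own implementation); the function name and the returned keys are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def basic_file_info(path, chunk_size=65536):
    """Return the size in bytes and the MD5 hex digest of the file at `path`."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return {"size": os.path.getsize(path), "md5": md5.hexdigest()}


# Usage: write a small temporary file and inspect it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
info = basic_file_info(tmp.name)
os.remove(tmp.name)
print(info)
```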

11 File Information Tool Set (FITS) [Diagram: any file is passed to each wrapped tool (JHOVE, DROID, NLNZ ME, ExifTool, File utility, FFIdent); each tool has its own FITS wrapper + XSL that produces FITS XML; the results are output as FITS XML and standard XML. Also listed: Tika, OIS Audio Information, ADL Tool, OIS File Information, OIS XML Metadata]

12 FITS configured to get high quality metadata Metadata normalization – ‘JPEG2000’ = ‘JPEG 2000’ = ‘JPEG 2000 image’ – ‘inches’ = ‘2’ = ‘in.’ Plays to strengths of tools and downplays their weaknesses – Overall trust tool x over tool y – Don’t run tool x for format z Format tree (hierarchy of related formats) – ‘OpenDocument’ is more specific than ‘Zip’
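The three ideas on this slide (normalizing names, ranking tools by trust, and a format tree) can be combined in a small sketch. This is not how FITS is implemented or configured; the lookup tables, the choice of 'JPEG 2000' as the normalized form, the tool names, and the function names are all assumptions for illustration.

```python
# Variant spellings mapped to one normalized name (normalized form chosen arbitrarily).
NORMALIZED_FORMAT_NAMES = {
    "JPEG2000": "JPEG 2000",
    "JPEG 2000": "JPEG 2000",
    "JPEG 2000 image": "JPEG 2000",
}

# Format tree as child -> parent: 'OpenDocument' is more specific than 'Zip'.
FORMAT_TREE = {"OpenDocument": "Zip"}


def normalize(name):
    return NORMALIZED_FORMAT_NAMES.get(name, name)


def is_more_specific(fmt, other):
    """True if `fmt` is a descendant of `other` in the format tree."""
    parent = FORMAT_TREE.get(fmt)
    while parent is not None:
        if parent == other:
            return True
        parent = FORMAT_TREE.get(parent)
    return False


def consolidate(reports, tool_ranking):
    """Pick one format from per-tool reports.

    `reports` maps tool name -> reported format name.
    `tool_ranking` lists tool names from most to least trusted.
    Returns (format_name, conflict_flag).
    """
    if not reports:
        return None, False
    names = {tool: normalize(n) for tool, n in reports.items()}
    distinct = set(names.values())
    # Drop any format that is merely an ancestor of another reported format.
    most_specific = {
        f for f in distinct
        if not any(is_more_specific(g, f) for g in distinct)
    }
    if len(most_specific) == 1:
        return most_specific.pop(), False
    # Genuine disagreement: fall back to the most trusted tool.
    for tool in tool_ranking:
        if tool in names and names[tool] in most_specific:
            return names[tool], True
    return None, True


print(consolidate({"tool_x": "JPEG2000", "tool_y": "JPEG 2000 image"}, ["tool_x", "tool_y"]))
print(consolidate({"tool_x": "Zip", "tool_y": "OpenDocument"}, ["tool_x", "tool_y"]))
print(consolidate({"tool_x": "PDF", "tool_y": "TIFF"}, ["tool_y", "tool_x"]))
```

Spelling variants and ancestor/descendant pairs are resolved without raising a conflict; only reports that still disagree after both steps fall back to the trust ranking and are marked as conflicting.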

13 Example of what we know about a file pre- and post-FITS adoption at ingest
Pre-FITS (user-contributed metadata):
– Format = PDF
– MIME media-type = application/pdf
Post-FITS adoption at ingest:
– Format = Portable Document Format
– Format version = 1.4
– Format registry record: Registry: PRONOM; Registry key: fmt/18
– Page count: 24
– Date created by application: T17:43:27-04:00
– Title: JPCDHEP492
– Creation application: ComSquare ImPDF Library v0.89
– Admin flag: INHIBITOR

14 Additional strategies in the new DRS Move away from overly restrictive metadata elements where needed – Examples: – Allow free text for format names – Any text character set Add elements at the format-agnostic file level when they can apply to files in any format, e.g. producer or methodology Flag suspicious metadata (and content) for later analysis

15 Administrative flags Help pinpoint incorrect metadata, problem content or where metadata tools need improvement Some examples: – FAILED_METADATA_EXTRACTION – FORMAT_ID_CONFLICT – INCORRECT_METADATA – INHIBITOR – RIGHTS_METADATA
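A rough sketch of how such flags might be derived from a tool report follows. The flag names come from the slide; the report keys (`metadata`, `formats_reported`, `encrypted`, `embedded_rights`) and the rules that map them to flags are assumptions made for this example, not the DRS logic. INCORRECT_METADATA is left unassigned here because the slide does not say how it is determined.

```python
from enum import Enum


class AdminFlag(Enum):
    FAILED_METADATA_EXTRACTION = "FAILED_METADATA_EXTRACTION"
    FORMAT_ID_CONFLICT = "FORMAT_ID_CONFLICT"
    INCORRECT_METADATA = "INCORRECT_METADATA"
    INHIBITOR = "INHIBITOR"
    RIGHTS_METADATA = "RIGHTS_METADATA"


def flags_for(report):
    """Derive administrative flags from a (hypothetical) tool report dict."""
    flags = set()
    if report.get("metadata") is None:
        # No metadata could be extracted at all.
        flags.add(AdminFlag.FAILED_METADATA_EXTRACTION)
    if len(set(report.get("formats_reported", []))) > 1:
        # Tools disagreed about the file's format.
        flags.add(AdminFlag.FORMAT_ID_CONFLICT)
    if report.get("encrypted"):
        # Assumed meaning: encryption inhibits access to the content.
        flags.add(AdminFlag.INHIBITOR)
    if report.get("embedded_rights"):
        flags.add(AdminFlag.RIGHTS_METADATA)
    return flags


report = {"metadata": {}, "formats_reported": ["PDF", "TIFF"], "encrypted": True}
print(sorted(flag.name for flag in flags_for(report)))
```

Keeping the flags as a fixed enumeration rather than free text avoids the drift in interpretation that the earlier slides describe for free-text elements.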

16 They said it better “It is quality rather than quantity that matters.” – Lucius Annaeus Seneca “Quality is not an act, it is a habit.” – Aristotle “Quality is never an accident. It is always the result of intelligent effort.” – John Ruskin

17 Thank you!

