2 Dr Birgit Plietzsch Arts Computing Advisor Swithun Crowe Developer for Arts and Humanities Computing projects

1 2 Dr Birgit Plietzsch Arts Computing Advisor Swithun Crowe Developer for Arts and Humanities Computing projects & IT Services, University of St Andrews

2 3
1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

3 4 Digital Preservation is …
- the active management of digital information over time to ensure its accessibility
- long-term, error-free storage of digital information, with means for retrieval and interpretation, for the entire time span the information is required for

Long-term is defined as "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely".

Retrieval means obtaining needed digital files from the long-term, error-free digital storage, without possibility of corrupting the continued error-free storage of the digital files.

Interpretation means that the retrieved digital files, files that, for example, are of texts, charts, images or sounds, are decoded and transformed into usable representations. This is often interpreted as "rendering", i.e. making it available for a human to access. However, in many cases it will mean able to be processed by computational means.

(Source: Wikipedia)

4 5
- Legal requirements (e.g. Freedom of Information Act)
- Protection of institutional intellectual property
- Funding body requirements
  - until 2008: Arts and Humanities Data Service (national depository for arts and humanities research data)
  - no such body exists now for the Arts and Humanities
  - other subjects: national support is patchy
- Moral obligations
  - protection of cultural and corporate memory

5 6
- proceedings of the Scottish Parliament from the first surviving act of 1235 to the union of 1707
- 10 years of research
- no print publication
- c. 16.5m words
- issues:
  - inconsistent editorial practices
  - obsolescence of software originally used
  - long-term sustainability of research data

6 7 Pilot project
Scope: data contained in electronic resources produced within the Faculty of Arts, University of St Andrews
Aims:
- ensure long-term sustainability of RPS (Records of the Parliaments of Scotland) data
- investigate the requirements of digital archiving and obtain experience
- meet funding body requirement
- flexible implementation (to allow for additional future uses)

7 8 Concepts and Properties of Archives and Hosting in the Strategy and their Relationships. © Charles Beagrie Ltd 2009. Creative Commons Attribution-Share Alike 3.0.
Key: solid colour represents core properties and fading colour represents weaker properties of archives and hosting services.

8 9
1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

9 10 An Open Archival Information System (or OAIS) is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
Reference model: ISO 14721:2003

10 11 Seven functions
- Ingest
- Archival Storage
- Data Management
- Administration
- Preservation Planning
- Access
- Management

SIP: Submission Information Package
AIP: Archival Information Package
DIP: Dissemination Information Package

11 12 Implementation
- Content Information: XML, TIFF, DOC, etc.
- Preservation Description Information: PREMIS
- Descriptive Information: MODS
- Packaging Information: METS

12 13 What needs to be preserved?
- data
- layout
- functionality
- user experience

What are the significant properties?
- generic low-level properties (e.g. basic data unit, byte-level encoding, data type, and logical schema)
- data type specific properties (example: text)
  - underlying abstract forms (font, spacing, layout)
  - sub-properties (e.g. font type, style, family, size, colour)

How do we preserve?
- bit stream preservation
- emulation
- migration

Adopted approach:
- data is preserved
- combination of bit stream preservation and file format migration upon ingest
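Bit stream preservation is commonly backed by fixity checks: a checksum is recorded at ingest and recomputed later to confirm that the stored bytes are unchanged. The slides do not say how the project verifies fixity, so the following is only a minimal sketch under that assumption; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixityCheck {

    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    public static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True if the content still matches the checksum recorded at ingest.
    public static boolean isIntact(byte[] content, String recordedChecksum) {
        return sha256Hex(content).equals(recordedChecksum);
    }

    public static void main(String[] args) {
        byte[] original = "abc".getBytes(StandardCharsets.UTF_8);
        String recorded = sha256Hex(original);

        // Unchanged content passes; a single altered byte fails.
        System.out.println(isIntact(original, recorded));
        System.out.println(isIntact("abd".getBytes(StandardCharsets.UTF_8), recorded));
    }
}
```

In a preservation workflow the recorded checksum would normally be stored alongside the object's preservation metadata rather than held in memory.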

13 14
- description needs of different types of material
  - electronic resources
  - digital images
  - video
  - research papers
  - University records
  - etc.
- introduce flexibility
- future wider uses of the archive

14 15 Resource Discovery Metadata
- expressed in MODS
- 3 layers: Project; Resource type (Research data, Documentation, Code); Digital object
- use for pilot
- more models can be developed

15 16 Breakdown into OAIS requirements
Monolithic approach:
- DSpace: issues with Archival Storage and Data Management functions
- EPrints: issues with Administration and Access functions
- RODA: technical issues
- No support for Preservation Planning

Repository framework:
- Fedora Commons
  - issues with suitable front end for Ingest, Access, Preservation Planning, or Administration functions
  - highly customisable
  - Metadata: MODS, METS, PREMIS

16 17 Software used
- Alfresco: Share, Explorer, Records Management
- Fedora Commons (fedora-)
- Planets Suite (www.openplanets): Plato, Testbed

[Diagram: the software above mapped to the OAIS functions Ingest, Archival storage & Data Management, Administration, Preservation Planning and Access]

17 18

18 19
- Version control of AIPs: Alfresco / Fedora interaction?
- Access front end: Fedora Commons front ends do not normally support OAIS functions
- Can extra properties be added to folders and files in Records Management site?

We welcome ideas that might help us resolve the above three issues.

19 20
1. Introduction to the University of St Andrews Digital Archiving Project (DAP)
2. The DAP Open Archival Information System
3. Developing the OAIS Ingest function in Alfresco

20 21 Introduction
- FITS and PREMIS: technical metadata
- RPS and MODS: resource discovery metadata
- Antivirus scanning
- METS: wrapping files and metadata

21 22 Introduction
FITS (File Information Tool Set)
- Consolidates file format metadata from 3rd party tools: Jhove, DROID, NLNZ ME, Exiftool and others
- Output as XML

PREMIS (PREservation Metadata: Implementation Strategies)
- Data dictionary of semantic units, maps to XML
- Transform FITS XML to PREMIS using XSLT
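The FITS-to-PREMIS step can be illustrated with the XSLT support in the standard Java library. This is only a sketch: the stylesheet and the input below are heavily simplified, namespace-free stand-ins, not the project's stylesheet, and real FITS and PREMIS documents use namespaces and far richer structures. The class name and the sample element layout are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FitsToPremisSketch {

    // Hypothetical, simplified stylesheet: copies the format name and the
    // registry key out of a FITS-like document into PREMIS-like elements.
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/fits\">"
        + "<format>"
        + "<formatDesignation><formatName>"
        + "<xsl:value-of select=\"identification/identity/@format\"/>"
        + "</formatName></formatDesignation>"
        + "<formatRegistry><formatRegistryKey>"
        + "<xsl:value-of select=\"identification/identity/externalIdentifier\"/>"
        + "</formatRegistryKey></formatRegistry>"
        + "</format>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the result as text.
    public static String transform(String fitsXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(fitsXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String fits =
              "<fits><identification><identity format=\"Microsoft Word\">"
            + "<externalIdentifier type=\"puid\">fmt/111</externalIdentifier>"
            + "</identity></identification></fits>";
        System.out.println(transform(fits));
    }
}
```

Keeping the mapping in a stylesheet rather than in Java code means it can be revised as the PREMIS profile evolves without recompiling the module.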

22 23 The action
- Text property defined in custom aspect for storing FITS XML in node metadata
- Create temporary file containing content of node
- Run FITS on temporary file
- Put output into custom property
- Later on, transform this to PREMIS XML
- Can be run as space rule
- Compile to AMP using Alfresco SDK
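The core of the action (copy content to a temporary file, run a tool on it, store the tool's output as a property) can be sketched without any Alfresco dependency. Everything here is a stand-in: `Node` is a hypothetical substitute for a repository node, and the analysis tool is passed in as a function rather than being FITS itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class TempFileAnalysisSketch {

    // Hypothetical stand-in for a repository node: content plus properties.
    public static class Node {
        public final byte[] content;
        public final Map<String, String> properties = new HashMap<>();

        public Node(byte[] content) {
            this.content = content;
        }
    }

    // Writes the node content to a temporary file, runs the tool on that file
    // and stores the tool's output in the "fitsOutput" property.
    public static void analyse(Node node, Function<Path, String> tool) {
        Path temp = null;
        try {
            temp = Files.createTempFile("analysis_", ".bin");
            Files.write(temp, node.content);
            node.properties.put("fitsOutput", tool.apply(temp));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Remove the temporary copy whether or not the tool succeeded.
            if (temp != null) {
                try {
                    Files.deleteIfExists(temp);
                } catch (IOException ignored) {
                    // best effort clean-up only
                }
            }
        }
    }

    public static void main(String[] args) {
        Node node = new Node(new byte[] {1, 2, 3});
        // Placeholder tool that reports the file size instead of real metadata.
        analyse(node, path -> "<size>" + path.toFile().length() + "</size>");
        System.out.println(node.properties.get("fitsOutput"));
    }
}
```

Working on a temporary copy keeps the external tool away from the repository's own content store, which matches the retrieval requirement stated earlier in the slides.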

23 24 fits-action-context.xml
- alfresco.module.FitsAction.fits-action-messages
- alfresco/module/FitsAction/context/fitsModel.xml

24 25 FitsActionExecuter

    package;

    public class FitsActionExecuter extends ActionExecuterAbstractBase
    {
        public void setServiceRegistry(ServiceRegistry serviceRegistry);
        protected void addParameterDefinitions(List paramList);
        protected void executeImpl(Action action, NodeRef actionedUponNodeRef);
    }

25 26 FitsActionExecuter.executeImpl (fragment)

    // make sure node exists
    if (!nodeService.exists(actionedUponNodeRef))
    {
        throw new Exception("no node");
    }

    // make sure that node has fits aspect
    QName fitsAspect = QName.createQName(fitsURI, "fitsAspect");
    if (!nodeService.hasAspect(actionedUponNodeRef, fitsAspect))
    {
        this.nodeService.addAspect(actionedUponNodeRef, fitsAspect, null);
    }

    // create new FITS instance
    Fits fits = new Fits();
    Fits.allowRounding = true;
    FitsOutput result = null;

26 27 FitsActionExecuter.executeImpl (fragment cont.)

    // put input into temp file
    ContentReader reader =
        contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
    String fileName =
        (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
    File inputFile =
        TempFileProvider.createTempFile("FitsActionExecuter_", "." + fileName);
    reader.getContent(inputFile);

    // transform into technical metadata
    result = fits.examine(inputFile);
    Document doc = result.getFitsXml();

    // put result of transformation into output
    XMLOutputter serializer = new XMLOutputter(Format.getPrettyFormat());
    String output = serializer.outputString(doc);

    // get property to write to
    QName fitsProp = QName.createQName(fitsURI, "fitsOutput");
    nodeService.setProperty(actionedUponNodeRef, fitsProp, output);

27 28 fmt/111 Fragment of FITS XML showing conflicting file formats

28 29 Microsoft Word OLE2 Compound Document Format Droid (3.0) fmt/111 puid Corresponding fragment of PREMIS XML

29 30 Introduction
- Records of the Parliaments of Scotland marked up in thousands of XML documents
- Using Text Encoding Initiative (TEI)
- TEI headers contain resource discovery metadata
- Extract metadata from documents and populate custom metadata fields
- Can be run as space rule
- Compile as AMP using Alfresco SDK

30 31 TEI example
- Unique ID for document: william_and_mary_t1689_3_1_d2_trans
- Document belongs to translated version of records from reign of William and Mary
- Main heading in document: A committee appointed for controverted elections
- Pointer to session that document belongs to
- Date of document, in YYYYMMDD format: 16890314...

31 32 RPSMetadataExtracter

    package;

    public class RPSMetadataExtracter
        extends org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
    {
        public RPSMetadataExtracter();
        protected Map extractRaw(ContentReader reader);
    }

32 33 RPSMetadataExtracter.extractRaw

    // set up parser
    SAXParser sp = spf.newSAXParser();
    InputStream cis = reader.getContentInputStream();
    InputSource is = new InputSource(cis);
    RPSSaxParser teip = new RPSSaxParser();

    // do parsing
    teip.setProperties(map);
    sp.parse(is, teip);
    map = teip.getProperties();

    // loop over properties found
    Set s = map.entrySet();
    Iterator it = s.iterator();
    while (it.hasNext())
    {
        // advance the iterator and take the current entry
        Map.Entry m = (Map.Entry) it.next();
        putRawValue((String) m.getKey(), (String) m.getValue(), rawProperties);
    }

33 34 RPSSaxParser

    package;

    public class RPSSaxParser extends org.xml.sax.helpers.DefaultHandler
    {
        public void setProperties(Map prop);
        public Map getProperties();
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes);
        public void endElement(String uri, String localName, String qName);
        public void characters(char[] ch, int start, int length);
        private void handleID(String id);
        private void handleDate(String d);
    }

34 35 RPSSaxParser

    // property names
    private static final String KEY_ID = "rpsID";
    private static final String KEY_REIGN = "rpsReign";
    private static final String KEY_VERSION = "rpsVersion";
    private static final String KEY_HEADING = "rpsHeading";
    private static final String KEY_SESSION = "rpsSession";
    private static final String KEY_DATE = "rpsDate";
    private static final String KEY_TITLE = "cmTitle";

    // some properties get set in RPSSaxParser.characters
    if (inTitle)
    {
        rawProperties.put(KEY_TITLE, new String(ch, start, length));
    }
    else if (inSession)
    {
        rawProperties.put(KEY_SESSION, new String(ch, start, length));
    }
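A self-contained sketch of the same SAX technique, using only the standard library, is given here. The element names (`doc`, `title`, `date`) and the class name are hypothetical and much simpler than real TEI headers; the property keys reuse the names shown on the slide. One design difference is deliberate: SAX parsers are allowed to deliver a single run of text in several `characters` calls, so this sketch buffers text and stores it in `endElement`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TeiHeaderSketch extends DefaultHandler {

    private final Map<String, String> found = new HashMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        text.setLength(0); // start collecting text for this element
        if ("doc".equals(qName)) {
            found.put("rpsID", attributes.getValue("id"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than overwrite.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if ("title".equals(qName)) {
            found.put("cmTitle", value);
        } else if ("date".equals(qName)) {
            // YYYYMMDD to ISO form; no calendar conversion is attempted.
            found.put("rpsDate",
                LocalDate.parse(value, DateTimeFormatter.BASIC_ISO_DATE).toString());
        }
    }

    // Parses the XML text and returns the properties that were found.
    public static Map<String, String> extract(String xml) {
        TeiHeaderSketch handler = new TeiHeaderSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        String xml = "<doc id=\"william_and_mary_t1689_3_1_d2_trans\">"
            + "<title>A committee appointed for controverted elections</title>"
            + "<date>16890314</date></doc>";
        System.out.println(extract(xml));
    }
}
```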

35 36

    # Namespaces
    namespace.prefix.rps=

    # Mapping of property names to Qualified names used in model
    rpsID=rps:id
    rpsReign=rps:reign
    rpsSession=rps:session
    rpsDate=rps:date
    rpsVersion=rps:version
    rpsHeading=rps:heading
    cmTitle=cm:title
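To show how such a mapping file is read, this sketch loads it with `java.util.Properties` and expands a prefixed name into the `{namespace-uri}local-name` form. The namespace URI is not given on the slide, so `http://example.org/model/rps/1.0` is a placeholder, and the class and method names are hypothetical. Alfresco's own extracter base class performs this resolution itself; the sketch only illustrates the idea.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class MappingSketch {

    // Sample mapping; the namespace URI is a placeholder, not the project's.
    public static final String MAPPING =
          "namespace.prefix.rps=http://example.org/model/rps/1.0\n"
        + "rpsID=rps:id\n"
        + "rpsDate=rps:date\n";

    // Loads mapping text in .properties syntax.
    public static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolves a raw key such as "rpsID" to "{uri}id"; returns null if unmapped.
    public static String resolve(Properties props, String rawKey) {
        String prefixed = props.getProperty(rawKey); // e.g. "rps:id"
        if (prefixed == null) {
            return null;
        }
        int colon = prefixed.indexOf(':');
        if (colon < 0) {
            return prefixed; // no prefix to expand
        }
        String uri = props.getProperty("namespace.prefix." + prefixed.substring(0, colon));
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }

    public static void main(String[] args) {
        Properties props = load(MAPPING);
        System.out.println(resolve(props, "rpsID"));
        System.out.println(resolve(props, "rpsDate"));
    }
}
```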

36 37 RPS Metadata d:text rpsModel.xml (fragment showing aspect)

37 38

    # I18N strings
    rpsID=RPS ID
    rpsReign=RPS Reign
    rpsSession=RPS Session
    rpsDate=RPS Date
    rpsVersion=RPS Version
    rpsHeading=RPS Heading

38 39 Using MODS
- Metadata Object Description Schema
- MODS is a resource discovery metadata standard
- Working on defining MODS data models for Project, Resource Type and Digital Object levels
- Will move RPS metadata into MODS fields

39 40 Introduction
- Creates an action for scanning files for viruses
- Uses ClamAV
- Can be configured for other tools
- Emails creator of file if virus found
- Deletes file from repository if virus found
- Can be run as space rule
- Compile as AMP using Alfresco SDK

40 41 antivirus-action.xml (fragment) ${antivirus.mailer} ${antivirus.template}

41 42 antivirus-action.xml (fragment, cont.) 1

42 43 AntivirusActionExecuter

    package;

    public class AntivirusActionExecuter extends ActionExecuterAbstractBase
    {
        public void setContentService(ContentService contentService);
        public void setNodeService(NodeService nodeService);
        public void setTemplateService(TemplateService templateService);
        public void setActionService(ActionService actionService);
        public void setPersonService(PersonService personService);
        public void setFromEmail(String fromEmail);
        public void setCommand(RuntimeExec command);
        public void setEmailTemplate(String emailTemplate);
        public void init();
        protected void addParameterDefinitions(List paramList);
        protected void executeImpl(final Action ruleAction, final NodeRef actionedUponNodeRef);
    }

43 44 AntivirusActionExecuter.executeImpl (fragment)

    // put content into temp file
    ContentReader reader =
        contentService.getReader(actionedUponNodeRef, ContentModel.PROP_CONTENT);
    String fileName =
        (String) nodeService.getProperty(actionedUponNodeRef, ContentModel.PROP_NAME);
    File sourceFile =
        TempFileProvider.createTempFile("anti_virus_check_", "_" + fileName);
    reader.getContent(sourceFile);

    // set source property for command
    Map<String, String> properties = new HashMap<String, String>(1);
    properties.put(VAR_SOURCE, sourceFile.getAbsolutePath());

    // execute the antivirus command
    ExecutionResult result = null;
    try
    {
        result = command.execute(properties);
    }
    catch (Throwable e)
    {
        throw new AlfrescoRuntimeException("Antivirus check error: \n" + command, e);
    }

44 45 AntivirusActionExecuter.executeImpl (fragment, cont.)

    // try to get document creator's details
    String creatorName = (String) nodeService.getProperty(actionedUponNodeRef,
                                                          ContentModel.PROP_CREATOR);
    if (null == creatorName || 0 == creatorName.length())
    {
        throw new Exception("couldn't get creator's name");
    }

    NodeRef creator = personService.getPerson(creatorName);
    if (null == creator)
    {
        throw new Exception("couldn't get creator");
    }

    String creatorEmail = (String) nodeService.getProperty(creator,
                                                           ContentModel.PROP_EMAIL);
    if (null == creatorEmail || 0 == creatorEmail.length())
    {
        throw new Exception("couldn't get creator's email address");
    }

45 46 AntivirusActionExecuter.executeImpl (fragment, cont.)

    // put together message
    Map<String, Object> model = new HashMap<String, Object>(8, 1.0f);
    model.put("filename", fileName);
    model.put("message", result);

    String emailMsg = templateService.processTemplate("freemarker", emailTemplate, model);

    // send email message
    Action emailAction = actionService.createAction("mail");
    emailAction.setParameterValue(MailActionExecuter.PARAM_TO, creatorEmail);
    emailAction.setParameterValue(MailActionExecuter.PARAM_FROM, fromEmail);
    emailAction.setParameterValue(MailActionExecuter.PARAM_SUBJECT,
                                  "Virus found in " + fileName);
    emailAction.setParameterValue(MailActionExecuter.PARAM_TEXT, emailMsg);
    emailAction.setExecuteAsynchronously(true);
    actionService.executeAction(emailAction, null);

    // delete node
    nodeService.addAspect(actionedUponNodeRef, ContentModel.ASPECT_TEMPORARY, null);
    nodeService.deleteNode(actionedUponNodeRef);
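The fragments above do not show how the scanner's result is interpreted. As a sketch only, and assuming the configured command is ClamAV's `clamscan` (which documents exit code 0 for a clean file, 1 for a detected virus and 2 for an error), the decision could look like the code below. The class, enum and method names are hypothetical, and `Redirect.DISCARD` requires Java 9 or later.

```java
import java.io.IOException;
import java.nio.file.Path;

public class VirusScanSketch {

    public enum Verdict { CLEAN, INFECTED, ERROR }

    // Maps a clamscan exit code to a verdict: 0 clean, 1 infected, else error.
    public static Verdict interpret(int exitCode) {
        switch (exitCode) {
            case 0:
                return Verdict.CLEAN;
            case 1:
                return Verdict.INFECTED;
            default:
                return Verdict.ERROR;
        }
    }

    // Runs clamscan on one file; assumes clamscan is installed and on the PATH.
    public static Verdict scan(Path file) {
        try {
            Process process = new ProcessBuilder("clamscan", "--no-summary", file.toString())
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
            return interpret(process.waitFor());
        } catch (IOException e) {
            return Verdict.ERROR; // scanner missing or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Verdict.ERROR;
        }
    }

    public static void main(String[] args) {
        System.out.println(interpret(0));
        System.out.println(interpret(1));
        System.out.println(interpret(2));
    }
}
```

Treating every unexpected exit code as an error, rather than as clean, avoids silently accepting files when the scanner itself has failed.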

46 47 Introduction
Metadata Encoding and Transmission Standard (METS)
- METS is a wrapper for other metadata documents
- Plan to generate METS documents containing/referencing:
  - Ingested files
  - Renderings of these files (thumbnails, reference copies, archival formatted versions etc.)
  - Resource discovery metadata
  - Technical metadata
- Fedora Commons can ingest METS documents as SIPs
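Since METS generation is described as planned work, the following is only a sketch of what a minimal wrapper might look like, built with the standard DOM API. It references a single file from a file section and a structural map, omits all metadata sections, and is not a profile that Fedora Commons is known to accept. The class name, file identifier and path are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MetsSketch {

    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Creates a METS-namespaced child element and appends it to the parent.
    private static Element child(Document doc, Element parent, String name) {
        Element element = doc.createElementNS(METS_NS, "mets:" + name);
        parent.appendChild(element);
        return element;
    }

    // Builds a minimal METS document that references one file by URL.
    public static String wrap(String fileId, String href) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element mets = doc.createElementNS(METS_NS, "mets:mets");
            mets.setAttributeNS(XMLNS_NS, "xmlns:mets", METS_NS);
            mets.setAttributeNS(XMLNS_NS, "xmlns:xlink", XLINK_NS);
            doc.appendChild(mets);

            // File section: where the ingested file lives.
            Element file = child(doc, child(doc, child(doc, mets, "fileSec"), "fileGrp"), "file");
            file.setAttribute("ID", fileId);
            Element location = child(doc, file, "FLocat");
            location.setAttribute("LOCTYPE", "URL");
            location.setAttributeNS(XLINK_NS, "xlink:href", href);

            // Structural map: required by METS, points back at the file.
            Element pointer = child(doc, child(doc, child(doc, mets, "structMap"), "div"), "fptr");
            pointer.setAttribute("FILEID", fileId);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build METS document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("F1", "rps/doc1.xml"));
    }
}
```

Technical metadata (PREMIS) and resource discovery metadata (MODS) would normally be embedded or referenced in `amdSec` and `dmdSec` elements, which this sketch leaves out.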

47 48 University of St Andrews Digital Archiving Project
Project source code available on Alfresco Forge:
- FITS in Alfresco
- RPS Metadata Extracter
- Antivirus

48 49 Dr Birgit Plietzsch Arts Computing Advisor Swithun Crowe Developer for Arts and Humanities Computing projects & IT Services, University of St Andrews
