
Semantic Digital Preservation Rathachai Chawuthai Information Management CSIM / AIT Introduction Issued document 1.0.


1 Semantic Digital Preservation Rathachai Chawuthai rathachai.chawuthai@live.com Information Management CSIM / AIT Introduction Issued document 1.0

2 22nd Century; Digital Preservation; Needs of Archive in IR; Knowledge Preservation; Technology Review


4 Assume that the following scenario takes place in the 22nd century.

5 Imagine how a person in the future would read the digital documents you create today. Alice (Reader), Bob (Archivist)

6 Alice: Hi Bob, do you have information about USA president “Barack Obama”? Bob: Oh! It is hard to find, because the information is more than 100 years old.

7 Bob: Hi Alice. Luckily, I found a DVD containing his information. Alice: What is a DVD?

8 Do you believe that your current media will be useful in the future?

9 Bob: Don’t be silly, Alice. It was popular 100 years ago. It can be read by a DVD reader. See! (Error: DVD unreadable) Alice: No! That thing is unreadable!

10 The lifespan of digital media is quite short. Do you have a plan to move your data to fresh media?

11 Bob: Fortunately, I can get that file. Can you open “obama2009.pdf”? Alice: Hey, how do I open a PDF file? (Error: No program can open file format PDF)

12 Do you tell future readers which software, hardware, and versions are needed to open your file?

13 Bob: As I see, it needs Adobe Reader 9.0 to open it. (File is read protected. Please key password.) Alice: How would I know the password?

14 Your file might be secured. Do you tell future readers how to access your file?

15 !7rò??àÕ ??ߟ²ÂÚ Õ??ߟ²ÂÚ ðŽɳ !Z?g! Õr / ÕŸ / ?rò? Why did the author write in an alien language?

16 There are still issues with encoding (such as ASCII, ANSI, ISO-8859, UTF-7, big-endian, little-endian) and with fonts (such as Tahoma, Verdana). How do you tell future readers what is required to render the file?
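The encoding problem on this slide can be illustrated with a minimal Python sketch. The sample word and the helper name `decode_with` are assumptions made for illustration; the point is only that the same bytes read differently when the reader guesses the wrong encoding.

```python
def decode_with(raw: bytes, encoding: str) -> str:
    """Decode stored bytes using the encoding the reader *believes* was used."""
    return raw.decode(encoding)


# Illustrative sample text (not from the slides).
original = "Café"

# The producer stores the text as UTF-8 bytes.
stored = original.encode("utf-8")

# A reader who knows the encoding recovers the original text.
right = decode_with(stored, "utf-8")

# A reader who assumes ISO-8859-1 (Latin-1) gets garbled characters instead.
wrong = decode_with(stored, "latin-1")

print(right)
print(wrong)
```

Recording the encoding next to the file removes the guesswork for the future reader.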

17 Barack Obama, 44th president of USA, Born 08/04/1961. Alice: Confusing! When was he born? 4th August or 8th April? Bob: No idea! You would need to ask the author, who lived 100 years ago.

18 The knowledge of today's creator and that of the future reader may differ. How do we ensure that the reader understands the content correctly?

19 Alice: What should I do if I need to find more information relevant to Barack Obama’s family? Bob: You may have to browse every file from here. Good luck …

20 Many files are related to other files. How do we let future readers know?

21 It would be good if the older generation had a good plan for digital preservation.


23 Printed Age
– Paper is a durable format
– Stored under proper conditions
Digital Age
– Information is fragile: technological obsolescence, deterioration of media

24 Digitized Object: A digital object that is copied from a printed document and stored in a common format such as TIFF.

25 Born-Digital Object: A digital object that is created with software. Its versions need to be kept, rather than only a finalized document.

26 Capacity vs. Age (1000 years vs. 15 years): Digital media can hold much more information than printed paper in the same volume, but the life of digital media is shorter than that of printed paper. Fortunately, content on digital media is easily duplicated to other media.

27 Digital Preservation: the active management of digital information to ensure its
– Maintainability: the bitstream still exists in its original form
– Accessibility: the bitstream forming a file can be opened
– Renderability: an opened file presents the digital object as originally intended
– Understandability: a reader understands the digital object as originally intended over time
(wikipedia.org)

28 Do you have these? Issue: How can a bitstream be preserved when the life of digital media is short and the media itself becomes obsolete?

29 Proposed Solution: The current solution is migration, that is, duplicating the bitstream from one medium to another at regular intervals.
Challenge
– How do we know when it is time to migrate?
– Does anyone hold the right, granted by the intellectual property owner, to copy the work?
– How do we guarantee that nothing is lost during the migration process?
– How do we keep track of changes made by the migration process?
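One common way to approach the "nothing is lost" challenge is to compare checksums before and after the copy. The sketch below is only an illustration under that assumption; the function names `sha256_of` and `migrate` are hypothetical and do not come from the presentation.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, target: Path) -> str:
    """Copy a bitstream to new media and verify that the copy is identical.

    Returns the checksum so that it can be recorded as evidence of the
    migration. Raises IOError if the copy differs from the source.
    """
    before = sha256_of(source)
    shutil.copy2(source, target)  # copies content and basic file metadata
    after = sha256_of(target)
    if before != after:
        raise IOError(f"migration of {source} failed: checksum mismatch")
    return before
```

Storing the returned checksum together with the date of each migration would also give a simple record of what changed and when.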

30 Issue: A bitstream needs to be represented as a file in order to be opened by software. To form an accessible file, the bitstream must be organized into an object structure that software can understand.
- Datatype: number, string, array, …
- Format: text, image, video, audio, …
Opening a file requires an environment that includes hardware, software, and their versions. Furthermore, some files cannot be accessed because they are protected for security reasons.

31 Proposed Solution
Use metadata to record the information that anyone needs in order to access the file, such as
– Byte encoding
– File format
– Hardware and software, and their versions
– Password to open the file
Provide a way to access the file
– Use a virtual environment to access the file
– Migrate the file to match newer software
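The metadata items listed on this slide could be captured in a small record like the one sketched below. The class name, field names, and sample values are assumptions for illustration only; they are not taken from any metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AccessMetadata:
    """Hypothetical record of what a future reader needs to open a file."""
    byte_encoding: str
    file_format: str
    hardware: str
    software: str
    software_version: str
    access_instructions: str  # e.g. where the password is kept


def to_json(metadata: AccessMetadata) -> str:
    """Serialize the record as JSON text so it can be stored beside the file."""
    return json.dumps(asdict(metadata), indent=2, sort_keys=True)


# Illustrative values based on the scenario in the earlier slides.
record = AccessMetadata(
    byte_encoding="UTF-8",
    file_format="PDF",
    hardware="unspecified",
    software="Adobe Reader",
    software_version="9.0",
    access_instructions="password held by the archive",
)

print(to_json(record))
```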

32 Challenge
– How do we define a common metadata structure, that is, which information does every organization agree to include?
– How do we know when it is time to migrate to new software?
– Does anyone hold the right, granted by the intellectual property owner, to copy and modify the work in order to support newer software?
– How do we guarantee that nothing is lost during the migration process?
– How do we keep track of changes made by the migration process?

33 Issue: Even if a digital object can be opened, how do we guarantee that it is rendered as originally intended?

34 Proposed Solution
Use metadata to record information about the look and feel of the digital object, such as
– Character code
– Font
– Color template
Challenge
– Which information is necessary to include in the metadata?
– Is there a process to verify the correctness of the rendered object?

35 Issue: How do we ensure that the characteristics of today's digital objects, including
– Documentation style: date format; number format; grammar, sentence, phrase, vocabulary, symbol
– Contemporary knowledge: common sense; contextual knowledge; knowledge that is implicitly understood within a community
are understood by future readers who have different knowledge?

36 Proposed Solution
– Preserve the underlying community knowledge alongside the digital objects
– Link relevant digital objects and their contents to explore original knowledge and new knowledge, using semantic technology

37 Challenge
– How do we model and implement a theory of underlying community knowledge?
– How do we collect contextual knowledge for each period?
– How do we establish the correctness of that knowledge?

38 To meet the preservation requirements, an archive information system appears to be the answer. A good system should support:
– Flexible information model
– Long-term storage
– Well-formed metadata
– Preservation activities
– Browsing and searching
– Knowledge exploration
– Preservation policy
– Access policy
– Rights and agreement policy

39 To be complete, the system needs to support the following roles:
– Provider: one who ingests digital objects into the archive
– Consumer: one who retrieves preserved information
– Management: one who provides preservation strategies and carries out preservation activities such as migration
A good system should support the use cases of each of these roles.

40 The goal of preservation is to maintain knowledge over time. Preservation requires well-established metadata and a well-established system. A preservation system should serve the provider, consumer, and management roles.

41 Institutional Repositories and Digital Preservation: Assessing Current Practices at Research Libraries Yuan Li Syracuse University yli115@syr.edu Meghan Banach University of Massachusetts Amherst mbanach@library.umass.edu

42 Archive
– Is a collection of historical records, or the physical place where they are located.
– Contains primary source documents that have accumulated over the course of an individual's or organization's lifetime, and are kept to show the function of an organization.
Digital Archive
– Is an archive in digital form, which requires digital preservation (digital media, environment to render)
(wikipedia.org)

43 An Institutional Repository is an online locus for collecting, preserving, and disseminating - in digital form - the intellectual output of an institution, particularly a research institution. For a university, this would include materials such as research journal articles, before (preprints) and after (postprints) undergoing peer review, and digital versions of theses and dissertations, but it might also include other digital assets generated by normal academic life, such as administrative documents, course notes, or learning objects. The four main objectives for having an institutional repository are: – to provide open access to institutional research output by self-archiving it; – to create global visibility for an institution's scholarly research; – to collect content in a single location; – to store and preserve other institutional digital assets, including unpublished or otherwise easily lost ("grey") literature (e.g., theses or technical reports). wikipedia.org

44 Review – Archives within IRs – Manage digital content – Produce digital copies

45 A preservation system requires – Natural and juridical persons – Institutions – Applications – Infrastructure – Procedures

46 Issues of Preservation – Little control over ingestion process – Less-optimal formats – Poor metadata – Insufficient intellectual property rights clearance – Difficult or costly to preserve

47 Analyze the needs of digital preservation (digital archive) in the domain of institutional repositories

48 Is preservation part of the mission and goal of IRs?
What preservation policies exist for IRs?
What preservation strategies are IRs currently implementing?
Are the necessary rights and agreements in place to preserve the content of IRs?
Are all of the materials in IRs of sufficient quality and importance to warrant long-term preservation (Content policies)?
Do IRs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of their contents?

49 Is preservation part of the mission and goal of IRs?

50 Is preservation part of IRs? (Chart: Yes / No)

51 What preservation policies exist for IRs?

52 Preservation Policies
Duration – Short | Medium | Long
Recommended file formats
– Text formats: pdf, txt, rtf, xml, odb, ods, odp
– Image file formats: tiff, jp2, jpg
– Audio formats: aif, aiff, wav
– Video formats: avi, mj2, mjp2
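A repository could check incoming files against the recommended formats with something like the sketch below. The extension lists are copied from the slide as given; the names `RECOMMENDED_FORMATS` and `recommended_category` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Extension lists copied from the slide; the grouping keys are assumptions.
RECOMMENDED_FORMATS = {
    "text": {"pdf", "txt", "rtf", "xml", "odb", "ods", "odp"},
    "image": {"tiff", "jp2", "jpg"},
    "audio": {"aif", "aiff", "wav"},
    "video": {"avi", "mj2", "mjp2"},
}


def recommended_category(filename: str) -> Optional[str]:
    """Return the category of a recommended format, or None if not listed."""
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in RECOMMENDED_FORMATS.items():
        if extension in extensions:
            return category
    return None


print(recommended_category("thesis.PDF"))
print(recommended_category("notes.docx"))
```

A file whose extension is not listed would be a candidate for conversion before deposit, or at least for a warning to the contributor.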

53 What preservation strategies are IRs currently implementing?

54 Preservation Strategies Backup System Security Storage System Checksum

55 Preservation Strategies – By IR system – By external system – Preservation metadata

56 Preservation Strategies: Metadata varies based on the sophistication of the collection. Work is ongoing on standards and best practices that address all types of metadata.

57 Are the necessary rights and agreements in place to preserve the content of IRs?

58 Rights and Agreements: Digital content may be changed if technology changes. Does this impact copyright? Players – Content contributor – Copyright holder

59 Rights and Agreements: What is an agreement? – Click-through – Written – Policies – MOUs – Verbal. Most agreements: the contributor needs permission to submit work that is owned by another party.

60 Are all of the materials in IRs of sufficient quality and importance to warrant long-term preservation (Content policies)?

61 Content Policies: Collect, Manage, Disseminate

62 Content Policies. Problems (to manage, to preserve) – Format obsolescence – Poor quality – Unreadable – Insufficient metadata

63 Content Policies. The IR should – Track user activities, e.g. submitting work – Require peer review before deposit in the IR, to ensure quality (journal articles, conference proceedings)

64 Do IRs currently have the necessary sustainability in terms of funding and staffing to carry out long-term preservation of their contents?

65 Sustainability Period (diagram labels: Time; Technology Change; Short-term; Medium-term; Long-term; Infinity)

66 – To implement a digital archive in an institutional repository
– To make agreements and secure permissions for preserving IR contents
– To give content contributors guidance on digital formats for preservation
– To plan for long-term digital preservation
– To solve the issue of lack of preservation funding

67 Terminology and Wish List for a Formal Theory of Preservation. Giorgos Flouris, FORTH-ICS and ISTI-CNR, fgeo@ics.forth.gr, flouris@isti.cnr.it. Carlo Meghini, ISTI-CNR, meghini@isti.cnr.it

68 (Example: Barack Obama, 44th president of USA, Born 04/08/1961) Currently, the system can do:
– Bit Preservation: the bitstream is preserved for the long term on modern media.
– Object Preservation: the bitstream can be rendered and displayed to the user as originally intended.

69 (Example: Barack Obama, 44th president of USA, Born 08/04/1961) Information Preservation: currently, the system may not focus on this. The new challenge is for the system to preserve the ability to understand the rendered object over time. To meet this challenge, the reader must be able to understand the content of the rendered object by understanding the terms, concepts, or other information that appear in it, and by placing it in its correct context. This feature does not exist in existing preservation approaches.

70 (Example: Barack Obama, 44th president of USA, Born 04-Aug-1961. Diagram labels: Producer, Ingest, Archive System, Render, Consumer) The objective is that a reader (consumer) is able to perceive the information in context, based on his or her background knowledge, and understand it as originally intended.

71 Terms
– Producer (P): The creator of the digital object
– Digital Object (D): An object that presents knowledge in an understood language
– Consumer (C): A reader who reads the digital object
– Designated Community (DC): A group of readers who share common characteristics and knowledge

72 Understanding Process
1. The Producer produces the Digital Object and stores it on storage media.
2. The Consumer opens the Digital Object from the storage media by rendering the sequence of bit values that represents the document.
3. The Consumer perceives the Digital Object through light travelling from the output device to his or her eyes.
4. The Consumer understands the meaning of the Digital Object from D itself and from the contextual knowledge of his or her Designated Community.
Goal: The Consumer is able to understand the Digital Object as originally intended over time.

73 The key is the “meaning” of digital knowledge.
– The meaning of a digital object can be viewed as a special kind of mapping that associates a symbol with a particular real-world concept.
– This association is not always clear from looking at the digital object alone.
A date format is a good example of something that confuses people. (Example: Barack Obama, 44th president of USA, Born 08/04/1961)
– In European notation, he was born on the 8th of April.
– In American notation, he was born on the 4th of August.
(Flouris & Meghini)
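The ambiguity can be reproduced directly with Python's standard `datetime` module. This is only an illustrative sketch; the variable names are assumptions.

```python
from datetime import datetime

# The symbol exactly as it appears on the slide (without the stray space).
raw = "08/04/1961"

# American notation: month/day/year.
american = datetime.strptime(raw, "%m/%d/%Y").date()

# European notation: day/month/year.
european = datetime.strptime(raw, "%d/%m/%Y").date()

print(american.isoformat())  # 1961-08-04
print(european.isoformat())  # 1961-04-08
```

Both parses succeed without error, which is exactly the problem: nothing in the symbol itself tells the reader which notation the producer used.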

74 In order to capture the “meaning” of a Digital Object, the Digital Object needs to be described in a Language.
Language (L): An arrangement of symbols that are associated with real-world concepts.
The Language should be a formal language that can be interpreted by both Producer and Consumer. The purposes of the Language are
– Providing formulation rules that encode real-world concepts as symbols.
– Providing logical semantics that use contextual, background, or commonsense information in order to decode symbols into real-world concepts.

75 (Diagram labels: P, D, L; 08/04; August 4) The producer needs to represent “4th of August” in a common language. She therefore uses the contextual, background, or commonsense knowledge that she shares with her community to write a symbol representing “4th of August”. She decides to use “08/04” because everyone in the same community understands it and interprets it as “4th of August”. This means that she and the readers in the same community at that period understand the same meaning.

76 Starting from a simple math function f(x) = y: everyone uses an interpret function to understand the meaning of language.
producer.interpret( “08/04” ) = “4th of August”
reader01.interpret( “08/04” ) = “4th of August”
reader02.interpret( “08/04” ) = “4th of August”
In this case, everyone interprets the language “08/04” as “4th of August” because the interpret process contains a formula. The formula comes from knowledge. If the knowledge is agreed upon within a community, the formula is derived from community knowledge. This means that the Producer and all readers have the same formula, so they all understand the same thing.

77 Underlying Community Knowledge (UCK): Knowledge of the designated community (DC) that helps its members arrive at a similar understanding of the association between language and real-world concepts. The key feature of UCK is therefore to produce formulas that are able to
- Encode real-world concepts as language
- Decode language into real-world concepts

78 (Diagram labels: C, D, L; 08/04; April 8)
producer.interpret( “08/04” ) = “4th of August”
consumer.interpret( “08/04” ) = “8th of April”
Why does the consumer understand incorrectly?
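The interpret calls on these slides can be mimicked in a few lines of Python. In this sketch the "formula" is reduced to the order of the two date fields; the `Reader` class, its `order` attribute, and the helper `ordinal` are hypothetical names, and real community knowledge would of course be far richer.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def ordinal(number: int) -> str:
    """Format 4 as '4th', 1 as '1st', and so on."""
    if 10 <= number % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(number % 10, "th")
    return f"{number}{suffix}"


class Reader:
    """A reader whose 'formula' is the field order agreed in the community."""

    def __init__(self, order):
        # order is ("month", "day") or ("day", "month")
        self.order = order

    def interpret(self, symbol: str) -> str:
        values = (int(part) for part in symbol.split("/"))
        fields = dict(zip(self.order, values))
        return f"{ordinal(fields['day'])} of {MONTHS[fields['month'] - 1]}"


# Producer and consumer belong to communities with different formulas.
producer = Reader(("month", "day"))
consumer = Reader(("day", "month"))

print(producer.interpret("08/04"))  # 4th of August
print(consumer.interpret("08/04"))  # 8th of April
```

The symbol is identical in both calls; only the formula held by each reader differs, which is why the consumer arrives at a different date.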

79 As time passes, the designated community may change, and its knowledge may change. Thus, “understanding” may change, too. The critical cause is a change of UCK.
– A different UCK produces a different formula, which produces a different understanding.
The next challenge is “How to capture the change of UCK”.

80 UCK Evolution Structure (UCKES): A structure that represents the difference (delta) between UCKs. UCKES captures changes in a UCK’s language that result from changes in the UCK’s theory, such as ontology evolution. UCKES represents the gap between the UCK of the Consumer (C) and that of the Producer (P).

81 UCK Mapping Structure (UCKMS): A complex mechanism that uses UCKES to produce the relationship between the Consumer’s (C) formula and the Producer’s (P) formula. Its main function is to change the language so that both arrive at the same understanding of the real-world concept.

82 Is it possible?
producer.interpret( “08/04” ) = “4th of August”
consumer.interpret( “04/08” ) = “4th of August”

83 (Diagram labels: Producer, UCK, Formula, Digital Object D, 08/04, Read, Consumer, UCK) Right now, the Consumer gets an incorrect understanding of the language that the Producer intended to present.

84 (Diagram labels: Producer, UCK, Formula, Digital Object D, 08/04, UCKES, UCKMS, Consumer, UCK) The system should understand the knowledge on the Consumer’s side and generate a mapping between the Producer’s formula and the Consumer’s formula using the UCKES and UCKMS mechanisms.

85 (Diagram labels: Producer, UCK, Formula, Digital Object D, 08/04, UCKES, UCKMS, Digital Object D’, 04/08, Read, Consumer, UCK) Then, the system transforms the digital object D into D’. D’ contains language that makes the Consumer understand the same thing as the Producer.
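For the date example, the transformation from D to D’ amounts to rewriting the symbol from the Producer's field order into the Consumer's. The sketch below shows only that narrow case; the function `transform` and its parameters are hypothetical stand-ins, not the UCKES or UCKMS mechanisms themselves.

```python
def transform(symbol, producer_order, consumer_order):
    """Rewrite a date symbol from the producer's field order to the consumer's.

    Each order is a tuple such as ("month", "day") or ("day", "month").
    The fields are kept as strings so that zero padding is preserved.
    """
    fields = dict(zip(producer_order, symbol.split("/")))
    return "/".join(fields[name] for name in consumer_order)


# D carries "08/04" written month-first; D' carries the day-first equivalent.
d_symbol = "08/04"
d_prime_symbol = transform(d_symbol, ("month", "day"), ("day", "month"))

print(d_prime_symbol)  # 04/08
```

A consumer who reads day-first now sees “04/08” and arrives at the 4th of August, the same concept the producer intended.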

86 D: Barack Obama, 44th president of USA, Born 08/04/1961. D’: Barack Obama, 44th president of USA, Born 04/08/1961. The Consumer understands D’ in the same way that the Producer understands D. This means that D’ has a preservability relation with D.

87 Next step: How do we preserve the underlying community knowledge alongside the digital object? Preservation needs to keep the “Reader” in mind, by providing the information that ensures the reader can understand the digital object as originally intended, starting from the reader's own knowledge.


89 The PREMIS Data Dictionary defines preservation metadata as "the information a repository uses to support the digital preservation process". The metadata includes
– Intellectual information: an intellectual unit such as a book, map, movie, song, …
– Digital object information: a digital object that realizes the intellectual information, e.g. PDF, image, video, audio, …
– Agent information: a person or system involved with the digital object
– Event information: a record of the activities performed on a digital object
– Rights information: agreements concerning the digital object
(wikipedia.org, LOC.gov)

90 An Open Archival Information System (OAIS) is a reference model for an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
Features – Ingest, Archive, Preservation Plan, Administration, Dissemination, and Access
End users – Provider, Consumer, and Management
(wikipedia.org, OCLC.org)


92 http://www.dlib.org/dlib/may11/yuanli/05yuanli.html
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.9681&rep=rep1&type=pdf
http://www.loc.gov/standards/premis/
http://en.wikipedia.org/wiki/Preservation_Metadata:_Implementation_Strategies_(PREMIS)
http://www.oclc.org
http://public.ccsds.org/publications/archive/650x0b1.pdf
http://en.wikipedia.org/wiki/Open_Archival_Information_System

