The XCL Languages. Digital Preservation – The Planets Way. Dresden, April 23rd, 2010. Manfred Thaller, Universität zu Köln.

1 The XCL Languages. Digital Preservation – The Planets Way. Dresden, April 23rd, 2010. Manfred Thaller, Universität zu Köln

2 1. The vision

3 Observation ► Photoshop ► Works only if you are examining the actual image data …

4 Vision stage 1. Diagram: png and tiff (format conversion); Extractor produces image info 1 and image info 2; Comparator: the same?

5 Vision stage 2. Diagram: png and tiff (format conversion); Extractor, with png rules and tiff rules, produces image info 1 and image info 2; Comparator: the same?

6 Vision stage 3. Diagram: Obj 1 and Obj 2 (format conversion); Extractor, with rule set 1 and rule set 2, produces object info 1 and object info 2; Comparator: the same?

7 Vision stage 4. Diagram: Obj 1 and Obj 2 (format conversion); Extractor, with XCEL 1 and XCEL 2, produces XCDL 1 and XCDL 2; Comparator: the same?
• Machine-readable form of a file format specification: “eXtensible Characterisation Extraction Language” (XCEL), able to describe any machine-readable format in a formal language, processable by a software tool for extraction of content as XCDL.
• Abstract description of file content: “eXtensible Characterisation Definition Language” (XCDL), able to describe the content of digital objects (= 1 + n more files), processable by a software tool for further analysis.
• Specification of the “similarity” to be used: “comparator comparison [Language]” (coco).
• Specification of the “similarity” observed: “comparator results [Language]” (copra).

8 2. Examples I

9 XCL by Example. Image width: 277; image length: 339

10 XCEL representation <item xsi:type="structuringItem" identifier="IFDE_256" optional="true"> <symbol interpretation="uint16" length="2" name="imageWidth"/> […]

11 XCEL representation <item xsi:type="structuringItem" identifier="IFDE_256" optional="true"> <symbol interpretation="uint16" length="2" name="imageWidth"/> […]

12 XCEL representation <item xsi:type="structuringItem" identifier="IFDE_256" optional="true"> <symbol interpretation="uint16" length="2" name="imageWidth"/> […]

13 XCDL representation: … imageWidth 277 int... XCEL entry: <symbol interpretation="uint16" length="2" name="imageWidth"/>

14 XCDL representation: … imageWidth 277 int... XCEL entry: <symbol interpretation="uint16" length="2" value="3"/>
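To make concrete what the XCEL entry for `IFDE_256` describes, here is a minimal sketch in Python of reading the image width from a TIFF image file directory by hand. It is not the XCL extractor and does not interpret XCEL. It only relies on the TIFF layout (12-byte directory entries, tag 256 = ImageWidth, field type 3 = 16-bit unsigned). The function name `read_image_width` and the in-memory sample file are assumptions made for illustration.

```python
import struct


def read_image_width(data: bytes) -> int:
    """Return the value of TIFF tag 256 (ImageWidth) from the first IFD."""
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF byte-order mark")

    # Bytes 2-3 hold the magic number 42, bytes 4-7 the offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")

    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for index in range(entry_count):
        start = ifd_offset + 2 + 12 * index  # each IFD entry is 12 bytes long
        tag, field_type, _count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag != 256:
            continue
        if field_type == 3:  # SHORT: 16-bit unsigned, the "uint16" of the XCEL entry
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
        elif field_type == 4:  # LONG: 32-bit unsigned
            (value,) = struct.unpack(endian + "I", data[start + 8:start + 12])
        else:
            raise ValueError("unexpected field type for ImageWidth")
        return value
    raise KeyError("ImageWidth entry not found")


# Hypothetical minimal file: header, one IFD entry (tag 256, type 3, count 1,
# value 277), and a zero offset marking the end of the IFD chain.
sample = (
    struct.pack("<2sHI", b"II", 42, 8)
    + struct.pack("<H", 1)
    + struct.pack("<HHIHH", 256, 3, 1, 277, 0)
    + struct.pack("<I", 0)
)
width = read_image_width(sample)
```

The point of XCEL is that this knowledge about offsets, tags, and types lives in a declarative format description rather than in hand-written code like the above.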

15 XCDL representations can now be compared…
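A comparison of two extracted descriptions could look roughly like the following Python sketch. Plain dictionaries stand in for XCDL documents, the `tolerances` argument plays the role the slides assign to coco (which similarity to use), and the returned dictionary plays the role of copra (which similarity was observed). All names here, including the property name `imageLength`, are assumptions for illustration and not the actual XCL formats.

```python
def compare_properties(first, second, tolerances=None):
    """Compare two property dictionaries and report a verdict per property."""
    tolerances = tolerances or {}
    results = {}
    for name in sorted(set(first) | set(second)):
        if name not in first or name not in second:
            # A property extracted from only one of the two objects.
            results[name] = "missing"
            continue
        difference = abs(first[name] - second[name])
        allowed = tolerances.get(name, 0)  # default: exact equality required
        results[name] = "equal" if difference <= allowed else "different"
    return results


# Hypothetical extraction results for the same image stored in two formats.
tiff_info = {"imageWidth": 277, "imageLength": 339}
png_info = {"imageWidth": 277, "imageLength": 339}
report = compare_properties(tiff_info, png_info)

# A cropped conversion result is flagged unless a tolerance permits it.
cropped_info = {"imageWidth": 276, "imageLength": 339}
strict_report = compare_properties(tiff_info, cropped_info)
lenient_report = compare_properties(tiff_info, cropped_info, {"imageWidth": 1})
```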

16 3. Syntactical aspects of XCL processing

17 The XCEL tree. The XCEL tree describes a format.

18 The result tree. Parsing a file produces a result tree.

19 XCDL: models. All file content is understood as instances of “higher order data types”:
• image
• text
• [ sound ]
• [[ vector graphics ]]

20 XCDL: text model. A text (= ) is composed of
• data (= ) plus
• interpretations / properties of data according to the underlying format specification (= ).

21 Representing a text in XCDL. Text: This is a text. Data (hex): 54 68 69 73 20 69 73 20 61 20 74 65 78 74 … Property: fontsize 48 (unsignedInt8)
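The split between data and properties on this slide can be sketched as a small Python data structure. The class and field names (`TextObject`, `Property`, `value_type`) are assumptions for illustration and not the XCDL element names, which the transcript does not preserve.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Property:
    """An interpretation attached to the data, e.g. a font size."""
    name: str
    value: str
    value_type: str


@dataclass
class TextObject:
    """A text: raw data plus properties that interpret it."""
    data: bytes
    properties: List[Property] = field(default_factory=list)

    def hex_data(self) -> str:
        # Render the bytes as space-separated hex pairs, as on the slide.
        return self.data.hex(" ")


text = TextObject(
    data="This is a text".encode("ascii"),
    properties=[Property("fontsize", "48", "unsignedInt8")],
)
hex_string = text.hex_data()
```

Encoding the sample sentence as ASCII reproduces the byte sequence shown on the slide.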

22 XCDL: recursiveness. XCDL is fully recursive: an arbitrarily complex image can be a property of a textual position. Example: illustrations in a text file.

23 XCDL: recursiveness. XCDL is fully recursive: an arbitrarily complex text can be a property of a textual position. Example: footnotes.

24 XCDL: recursiveness. XCDL is fully recursive: an arbitrarily complex text can be a property of an image segment. Example: embedded image descriptions.
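The recursion described on these three slides amounts to a property whose value may itself be a complete object. A minimal Python sketch follows; the names `XcdlObject`, `XcdlProperty`, and `start` are hypothetical and only model the idea, not the XCDL schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class XcdlObject:
    kind: str  # "text" or "image"
    data: bytes
    properties: List["XcdlProperty"] = field(default_factory=list)


@dataclass
class XcdlProperty:
    name: str
    value: Union[str, XcdlObject]  # a plain value or a nested object
    start: int = 0  # position in the parent's data the property applies to


def depth(obj: XcdlObject) -> int:
    """Return how many levels of nested objects hang below obj, itself included."""
    nested = [depth(p.value) for p in obj.properties if isinstance(p.value, XcdlObject)]
    return 1 + max(nested, default=0)


# A text with a footnote (text in text) and an illustration (image in text),
# where the illustration carries an embedded description (text in image).
footnote = XcdlObject("text", b"A footnote")
caption = XcdlObject("text", b"A caption")
illustration = XcdlObject("image", b"\x00\xff", [XcdlProperty("description", caption)])
main_text = XcdlObject(
    "text",
    b"This is a text",
    [
        XcdlProperty("footnote", footnote, start=4),
        XcdlProperty("illustration", illustration, start=14),
    ],
)
levels = depth(main_text)
```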

25 3. Semantic aspects of processing

26 How do humans do it? Are the following two items equal: VIII, 8

27 How do humans do it? VIII, 8, eight

28 How do humans do it? VIII, 8, eight, otto

29 How do humans do it? VIII, 8, eight, otto, acht

30 How do humans do it? VIII, 8, eight, otto, acht, 8.0
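One way to picture what a human does here is normalisation: each notation is mapped to the same abstract value before comparing. The Python sketch below does this for the examples on these slides. The word table covers only the three words shown (English, Italian, German), and the function names are assumptions for illustration.

```python
ROMAN_DIGITS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
NUMBER_WORDS = {"eight": 8, "otto": 8, "acht": 8}  # only the words from the slides


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral using the subtractive rule (IV = 4, IX = 9)."""
    total = 0
    for index, char in enumerate(text):
        value = ROMAN_DIGITS[char]
        following = ROMAN_DIGITS[text[index + 1]] if index + 1 < len(text) else 0
        total += -value if following > value else value
    return total


def normalise(token: str) -> float:
    """Map one notation of a number to a float."""
    token = token.strip()
    if token.lower() in NUMBER_WORDS:
        return float(NUMBER_WORDS[token.lower()])
    if token and all(char in ROMAN_DIGITS for char in token):
        return float(roman_to_int(token))
    return float(token)  # decimal notations such as "8" or "8.0"


tokens = ["VIII", "8", "eight", "otto", "acht", "8.0"]
distinct_values = {normalise(token) for token in tokens}
```

All six notations collapse to a single value, which is the sense in which they are "equal".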

31 Replicating the approach in a machine: VIII, 8, eight, otto, acht. Information model: “an image” / “a text”.

32 Replicating the approach in a machine: VIII, 8. Information model: “an image” / “a text”. Format ontology: “what terms are used in formats to describe image / textual properties”.

33 Replicating the approach in a machine: Information model: “an image” / “a text”. Format ontology: “what terms are used in formats to describe image / textual properties”. Extraction language: “how to get the terms describing an image / a text out of a file encoded in a specific format”.
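The role of the format ontology can be illustrated with a small Python sketch that maps format-specific field names onto shared terms. TIFF does call its fields ImageWidth and ImageLength, and the PNG header does carry Width and Height. The shared term `imageHeight` and the table structure are assumptions for illustration, not the Planets XCL ontology itself.

```python
# Hypothetical ontology fragment: format-specific term -> shared term.
FORMAT_ONTOLOGY = {
    "tiff": {"ImageWidth": "imageWidth", "ImageLength": "imageHeight"},
    "png": {"Width": "imageWidth", "Height": "imageHeight"},
}


def to_common_terms(format_name, raw_properties):
    """Rename the properties of one format to the shared vocabulary."""
    mapping = FORMAT_ONTOLOGY[format_name]
    return {
        mapping[name]: value
        for name, value in raw_properties.items()
        if name in mapping  # terms unknown to the ontology are dropped
    }


tiff_terms = to_common_terms("tiff", {"ImageWidth": 277, "ImageLength": 339})
png_terms = to_common_terms("png", {"Width": 277, "Height": 339})
```

Once both files are expressed in the shared vocabulary, they can be compared without the comparator knowing anything about TIFF or PNG.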

34 The Planets XCL Approach – The Ontology

35 4. Conceptual aspects of processing

36 Assumption I. Data which represent stored information do so in two forms:
1. As a set of tokens, which describe atomic items of information.
2. By a set of independent parameters, which describe, in a formalized way, the semantic interpretation of these items of information.

37 Assumption II.
1. Most algorithms today are based on “data types” which reflect hardware characteristics (char, int, float ...).
2. “Objects” constructed from these data types are transient concepts, meaningful only within a specific implementation / environment.
3. What we need are considerably higher-order objects, which are persistent by themselves and independent of a specific implementation / environment.

38 Assumption III. The need formulated as assumption II can be fulfilled using assumption I.

39 Generalisation of Langefors’ “Infological Equation”
(1) $I = i(D, S, t)$
(2) $I_2 = i(I_1, S_2, t)$
(3) $I_x = i(I_{x-1}, S_x, t)$
(4) $S_x = s(I_{x-1}, t)$
(5) $I_x = i(I_{x-\alpha}, S_{x-\beta}, t)$
(6) $I_x = i(I_{x-\alpha}, s(I_{x-\beta}, t), t)$
where $I$ = information, $i(\dots)$ = interpretative process, $D$ = data, $S$ = previous knowledge, $t$ = time.

40 5. Inclusion of rendering results

41 Observation: A file in Word 2003

42 Observation: A file in Word 2007

43 Observation: A file in Open Office

44 Observation: A file in Acrobat

45 Proposal to measure layout
Cut out the page from the rendering surface. Scale to common dimensions: 371 +/- 1 x 521 +/- 1. Measure:
1. The leftmost and lowest completely black pixel in the letter “A” starting the first line of the main text.
2. The leftmost and highest completely black pixel in the letter “E” starting the first line of the text in the footnote.
3. The geometrical centre of the period at the end of the main sentence.
4. The geometrical centre of the period at the end of the footnote text.

46 Proposal to measure layout

47 Incidentally, this could (will?) be done algorithmically.
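As a rough idea of what doing this algorithmically might involve, the Python sketch below finds the "leftmost and lowest completely black pixel" in a small grey-scale grid. It assumes that the letter has already been isolated, that black is the value 0, that row 0 is the top of the image, and that "leftmost" takes priority over "lowest". None of these choices is stated on the slides.

```python
def leftmost_lowest_black(pixels, black=0):
    """Return (x, y) of the leftmost, then lowest, completely black pixel.

    pixels is a list of rows; row 0 is the top of the image, so "lowest"
    means the largest row index. Returns None if no pixel is black.
    """
    candidates = [
        (x, y)
        for y, row in enumerate(pixels)
        for x, value in enumerate(row)
        if value == black
    ]
    if not candidates:
        return None
    # Smallest x first; among equal x, the largest y (hence the negation).
    return min(candidates, key=lambda point: (point[0], -point[1]))


# Hypothetical 4 x 3 cut-out: 255 is white, 0 is completely black.
glyph = [
    [255, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 0, 255, 255],
]
anchor = leftmost_lowest_black(glyph)
```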

48 Measuring Word 2003: (i) = 45 / 134; (ii) = 57 / 470; (iii) = 215 / 322; (iv) = 254 / 483

49 Measuring Word 2007: (i) = 45 / 134; (ii) = 57 / 470; (iii) = 215 / 322; (iv) = 254 / 483

50 Measuring Open Office: (i) = 45 / 134; (ii) = 52 / 470; (iii) = 215 / 322; (iv) = 247 / 483

51 Measuring Open Office: (i) = 45 / 134; (ii) = 52 / 470 (annotated: 57); (iii) = 215 / 322; (iv) = 247 / 483 (annotated: 254)

52 Measuring Acrobat Reader: (i) = 45 / 132; 45 / 133; (ii) = 59 / 469; 57 / 470; (iii) = 215 / 321; 215 / 322; (iv) = 254 / 481; 254 / 483
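The measurements above lend themselves to a simple point-by-point comparison. The Python sketch below uses the Word 2003 and Open Office values from the slides. The function names and the tolerance of one pixel (borrowed from the "+/- 1" of the scaling step) are assumptions for illustration.

```python
# Reference points (x, y) as measured on the slides.
WORD_2003 = {"i": (45, 134), "ii": (57, 470), "iii": (215, 322), "iv": (254, 483)}
OPEN_OFFICE = {"i": (45, 134), "ii": (52, 470), "iii": (215, 322), "iv": (247, 483)}


def layout_deviation(reference, candidate):
    """Return the (dx, dy) offset of each reference point in the candidate."""
    return {
        name: (candidate[name][0] - x, candidate[name][1] - y)
        for name, (x, y) in reference.items()
    }


def same_layout(reference, candidate, tolerance=1):
    """True if every point deviates by at most `tolerance` pixels per axis."""
    return all(
        abs(dx) <= tolerance and abs(dy) <= tolerance
        for dx, dy in layout_deviation(reference, candidate).values()
    )


deviation = layout_deviation(WORD_2003, OPEN_OFFICE)
matches = same_layout(WORD_2003, OPEN_OFFICE)
```

With these values, points (i) and (iii) coincide, while points (ii) and (iv), which lie in the footnote and at the end of the footnote text, are shifted horizontally in the Open Office rendering.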

53 Automated by image segmentation

56 Usage
• Used within the comparison logic as described before.
• The layout characteristics will presumably become part of the AIP in a distributed long-term preservation system we may become responsible for.
• Proof-of-concept implementation for static content will become part of Planets final deliverables.
• Proof-of-concept implementation for dynamic content may become part of Planets final deliverables.

57 Thank you! manfred.thaller@uni-koeln.de http://planetarium.hki.uni-koeln.de

58 4. Some Examples

59 Extraction DOCX and PDF (text only): DOCX-extraction | PDF-extraction

60 Extraction DOCX and PDF (text only): DOCX-extraction | PDF-extraction

61 Extraction DOCX and PDF (text only): PDF-extraction | DOCX-extraction | comparison results

62 Font changes in DOCX and PDF: Word 2007 | Adobe Acrobat Reader

63 Font changes in DOCX and PDF: DOCX-extraction | PDF-extraction

64 Font changes in DOCX and PDF: DOCX-extraction | PDF-extraction

65 Font changes in DOCX and PDF: DOCX-extraction | PDF-extraction | document comparison

66 … more fonts …: Word 2007 | Adobe Acrobat Reader

67 Symbol fonts (images or text): DOCX-extraction | PDF-extraction

68 Symbol fonts (images or text): DOCX-extraction | PDF-extraction

69 Symbol fonts (images or text): DOCX-extraction | PDF-extraction | document comparison

70 Extraction DOCX and PDF (text AND image): Word 2007, DOCX with embedded image | Adobe Acrobat Reader, PDF with embedded image

71 Extraction DOCX and PDF (text AND image): DOCX-extraction | PDF-extraction

72 Extraction DOCX and PDF (text AND image): DOCX-extraction | PDF-extraction

73 Extraction DOCX and PDF (text AND image): main document comparison

74 Extraction DOCX and PDF (text AND image): recursive (image) document comparison

75 Audio

76 XCDLs extracted from audio: WAV-extraction | MP3-extraction

77 XCDLs extracted from audio: WAV-extraction | MP3-extraction

