Digital Reformatting of Text Aaron Choate Digital Library Production Services The University of Texas Libraries.

1 Digital Reformatting of Text Aaron Choate Digital Library Production Services The University of Texas Libraries

2 From last time: Calculating potential file size (no really… this time we got it!) file size = (height x width x bit-depth x dpi^2) / 8 bits per byte
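A worked example may make the file size formula concrete. The sketch below assumes height and width are in inches, bit-depth is bits per pixel, and the result is the uncompressed size in bytes; the function name and the letter-size page are illustrative choices, not part of the original slide.

```python
def uncompressed_file_size_bytes(height_in, width_in, bit_depth, dpi):
    """Estimate uncompressed image size in bytes.

    Assumes height and width are in inches and bit_depth is bits per pixel.
    """
    # Pixel count is (height x dpi) x (width x dpi), hence the dpi squared.
    total_bits = height_in * width_in * dpi ** 2 * bit_depth
    # 8 bits per byte.
    return total_bits / 8


# An 8.5 x 11 inch page scanned at 300 dpi in 8-bit grayscale.
size = uncompressed_file_size_bytes(11, 8.5, 8, 300)
print(f"{size:,.0f} bytes (~{size / 1_000_000:.1f} MB)")
```

The same page captured as a 1-bit bitonal image is one eighth of that size, and as 24-bit color three times that size, which is why bit-depth matters as much as resolution when estimating storage.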

3 imaging Benchmarking Subjective evaluation becomes more problematic when the goal is legibility rather than fidelity.

4 imaging Benchmarking Physical Type, size and presentation

5 imaging Benchmarking Physical condition Darkening pages Fading ink Stains Bleed-through Uneven printing Fold lines Smearing

6 imaging Benchmarking Document classification Simple text / printed line art Distinct-edge based representation Bitonal? Manuscripts Soft-edge-based Grayscale / color Mixed material

7 imaging Benchmarking Medium and support Support – (paper, clay tablet, etc.) Thin paper? (bleed through) Medium – (graphite pencil, inks, etc) Fading of ink Variations in color or density

8 imaging Benchmarking Tonal Representation

9 imaging Benchmarking Color Appearance Is color reproduction necessary to the document’s meaning? What purpose does the color serve? How important is maintaining the color appearance?

10 imaging Benchmarking Detail Printed text – Measure the height of the smallest lowercase letter that typifies the item or group of items. Manuscripts, line art – Measure the finest stroke-width that must be represented and characterize the needed level of quality

11 imaging Benchmarking QI…(Quality Index) Defining detail as character height ANSI/AIIM preservation microfilming standard for determining requirements for text legibility Defines a range from barely legible through excellent that maps to technical test targets

12 imaging Benchmarking Line pairs Excellent = 8 line pairs Good = 5 line pairs Marginal = 3.6 line pairs Barely legible = 3.0 line pairs

13 imaging Benchmarking Digital QI Bitonal (only black pixels): QI = (dpi x .039h)/3; h = 3QI/(.039 x dpi); dpi = 3QI/(.039h). Tonal images (grayscale for printed text): QI = (dpi x .039h)/2; h = 2QI/(.039 x dpi); dpi = 2QI/(.039h)
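The digital QI formulas can be sketched as a small calculator. The slide does not state units, so this assumes h is the character height in millimetres and that .039 is the millimetre-to-inch conversion factor; the function names are hypothetical.

```python
MM_TO_INCH = 0.039  # conversion factor as given on the slide (assumed mm -> inch)


def digital_qi(dpi, h_mm, bitonal=True):
    """QI achieved for a character height h_mm at a given resolution."""
    divisor = 3 if bitonal else 2  # 3 for bitonal, 2 for tonal (grayscale)
    return dpi * MM_TO_INCH * h_mm / divisor


def required_dpi(qi, h_mm, bitonal=True):
    """Resolution needed to reach a target QI for a character height h_mm."""
    divisor = 3 if bitonal else 2
    return divisor * qi / (MM_TO_INCH * h_mm)


# Target "excellent" (QI = 8) for a 1 mm lowercase letter.
print(round(required_dpi(8, 1.0, bitonal=True)))   # bitonal capture
print(round(required_dpi(8, 1.0, bitonal=False)))  # grayscale capture
```

Under these assumptions the grayscale capture needs roughly two thirds of the bitonal resolution for the same QI, which follows directly from the divisors 2 and 3.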

14 Text Capture Methods Rekeying OCR Accuracy …

15 Software ScanSoft – OmniPage Pro ABBYY – FineReader Adobe Acrobat … PrimeOCR – Prime Recognition

16 Encoding

17 XML vs SGML SGML (Standard Generalized Markup Language) is the grand-daddy of all markup languages. XML is a subset of SGML intended to be the format for use on the Internet. XML attempts to fill the gap between SGML, which can be used for just about anything, and HTML, which is severely limited and is currently being abused because of it (table structures for layout, clear 1-pixel GIFs, etc.).
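One practical consequence of XML's stricter syntax is that every element must be explicitly closed, whereas HTML-style markup tolerates an unclosed tag such as `<br>`. The following is a minimal sketch using Python's standard-library XML parser; the helper name and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


html_style = "<p>First line<br>Second line</p>"   # <br> is never closed
xml_style = "<p>First line<br/>Second line</p>"   # empty element is self-closed

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

Well-formedness only checks syntax; checking a document against a DTD or schema (validation) is a separate step that this parser does not perform.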

18 xml DTDs vs Schemas

19 xml TEI Text Encoding Initiative Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.

20 xml TEI Levels of encoding Level 1: Fully Automated Conversion and Encoding Level 2: Minimal Encoding Level 3: Simple Analysis Level 4: Basic Content Analysis Level 5: Scholarly Encoding Projects

21 Character sets Unicode – Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

22 character sets Unicode Greek & Coptic
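To illustrate the "unique number for every character" idea, the sketch below looks up the code point, official name, and UTF-8 bytes for a few characters, including two from the Greek and Coptic block (U+0370 to U+03FF). It uses only Python's standard library; the `describe` helper is a hypothetical name chosen for this example.

```python
import unicodedata


def describe(ch):
    """Return (code point label, Unicode name, UTF-8 bytes) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# Latin A, Greek small alpha, Coptic small shei (written as escapes for clarity).
for ch in ["A", "\u03b1", "\u03e3"]:
    code_point, name, utf8 = describe(ch)
    print(code_point, name, utf8)
```

The code point is the platform-independent number; the byte sequence depends on the encoding chosen (UTF-8 here), which is a separate decision when storing encoded text.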

23 Software XMetal Oxygen Cooktop

24 Software MetaE

