
1 Delivering textual resources

2 Overview
- Getting the text ready: decisions & costs
- Structures for delivery
  - Full text
  - Marked-up
  - Image and text
  - Indexed
- How to guidance for:
  - Rekeying
  - OCR

3 Getting the text ready - decisions
Choices:
- Full text: every character and word is searchable, viewable and reusable in digital form
- Marked-up: as above, but with markup added to enable structured searches and use (e.g. XML, SGML)
- Image and text: an image is all the viewer sees; the text is fully searchable but is not seen or reusable
- Indexed: images/files attached to an index or catalogue

4 Getting the text ready - costs
- Full text: generally expensive in time and resources, but this depends on the source; very cheap for born-digital material
- Marked-up: usually the most expensive, because skilled staff are needed for intellectual content markup, although some automated systems exist for format-based markup
- Image and text: comparatively cheap, but with some usability downsides
- Indexed: great if an index or catalogue already exists and each file can simply be linked to a record (e.g. MARC)

5 Full text
- Files (e.g. PDF, Word)
- Formatted text (e.g. HTML)
- Fully searchable
- Reusable: copy, edit, share
- Very high accuracy, i.e. users expect 100%
- Unstructured searches
  - Results can be overwhelming
- Born digital: reformatting for delivery to be considered

6 Markup
- Advantage of structured search and use
- Complex to create specifications and workflow from scratch
- Delivery requires a description of the codes, rules and documents used
- Most projects will adapt a scheme that already exists:
  - TEI: Text Encoding Initiative
  - EAD: Encoded Archival Description
- Some automation is possible, and some system solutions enable this

7 Markup: examples
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.

8 Markup
Two forms commonly used:
Layout and structure based (format)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.

9 Markup
Content based (function)
Thomas Knight was indicted for the wilful murder of Robert Ball. He stood charged on the Coroner's inquest for manslaughter, September 7.
Michael Ball. The deceased Robert Ball was my son; he was a clock-case maker; the prisoner and he had been fighting some time; he stood up against a wall; I said, Robert, will you fight any more? he said, yes; they fought again. I saw but little of it.
The two forms can obviously be combined to deliver function and format at the same time
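
The tags from the slide examples are not reproduced in this transcript, so the following is only a hypothetical sketch of how format and function markup might be combined and then searched. The tag names (`trial`, `p`, `hi`, `person`, `offence`, `date`) and the `role` attribute are assumptions made for illustration; they are not taken from TEI or from the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag set: <p> and <hi> describe layout (format);
# <person>, <offence> and <date> describe meaning (function).
MARKED_UP = """
<trial>
  <p><person role="defendant">Thomas Knight</person> was indicted for the wilful
  <offence>murder</offence> of <person role="victim">Robert Ball</person>.
  He stood charged on the Coroner's inquest for <offence>manslaughter</offence>,
  <date>September 7</date>.</p>
  <p><hi rend="italic"><person role="witness">Michael Ball</person></hi>.
  The deceased Robert Ball was my son; he was a clock-case maker.</p>
</trial>
"""

root = ET.fromstring(MARKED_UP)


def people_with_role(role):
    """Structured search: return the names tagged with the given role."""
    return [p.text for p in root.iter("person") if p.get("role") == role]


def offences():
    """Return every offence named in the text, in document order."""
    return [o.text for o in root.iter("offence")]


print(people_with_role("defendant"))  # ['Thomas Knight']
print(offences())                     # ['murder', 'manslaughter']
```

A plain full-text search for "Ball" would match the victim and the witness alike; the `role` attribute is what lets the query tell them apart.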

10 Markup languages
- Markup is a language, not a programming tool
- All use tags or elements; software interprets those tags for display purposes and/or for search and retrieval
- Allows users (or communities of users) to create their own tag sets
- Markup can encode both logical and physical features of text

11 Markup languages
- SGML: Standard Generalised Markup Language (ISO standard in 1986)
  - Father of all markup languages
- HTML: Hypertext Markup Language (first described in 1991)
  - Markup of physical features of articles to enable Internet sharing of content; is about the format of content
- XML: Extensible Markup Language (W3C Recommendation in 1998)
  - "SGML lite" to enable generic Web use of powerful SGML features; is about the function of content
  - www.w3.org/XML/

12 XML: bits and pieces
- XML content (.xml)
- XML rules (.dtd)
  - Schemas, e.g. TEI, METS
  - DTDs = Document Type Definitions
- Namespaces (used when you want to combine sets of rules together in a single document)
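
As a rough illustration of the namespaces point, the sketch below mixes two tag sets in one document. Both namespace URIs and all element names are invented for the example; the only point is that two vocabularies can each define a `title` element without clashing.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are made up for this example.
NS = {
    "lay": "http://example.org/ns/layout",
    "con": "http://example.org/ns/content",
}

DOC = """
<lay:page xmlns:lay="http://example.org/ns/layout"
          xmlns:con="http://example.org/ns/content">
  <lay:title>Trial of Thomas Knight</lay:title>
  <lay:para>He stood charged on the <con:title>Coroner</con:title>'s inquest.</lay:para>
</lay:page>
"""

root = ET.fromstring(DOC)

# The same local name, "title", means a heading in one rule set
# and a person's office in the other.
layout_titles = [el.text for el in root.findall(".//lay:title", NS)]
content_titles = [el.text for el in root.findall(".//con:title", NS)]

print(layout_titles)   # ['Trial of Thomas Knight']
print(content_titles)  # ['Coroner']
```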

13 DTD explained
- A DTD is the formal definition of the elements, structures and rules for marking up a given type of XML document
- Think of it as an abstraction of the document structure:
  - What tags and elements must/can be used
  - How these tags and elements are structured in relation to each other
- Allows Internet browsers and other software to understand how to interpret XML content
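
To make the "must/can be used" idea concrete, here is a toy stand-in for what a validating parser does with a DTD. It is not a DTD parser: the standard-library parser used here does not validate against DTDs, so the rules are restated by hand in a dictionary, and the element names are hypothetical. Unlike a real DTD, this sketch also ignores the order and number of child elements.

```python
import xml.etree.ElementTree as ET

# A DTD for this toy document type might read (illustrative only):
#   <!ELEMENT trial (title, para+)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT para  (#PCDATA)>
# The dictionary restates those rules in a much simplified form:
# which children each element must have and which it may have.
RULES = {
    "trial": {"required": {"title", "para"}, "allowed": {"title", "para"}},
    "title": {"required": set(), "allowed": set()},
    "para": {"required": set(), "allowed": set()},
}


def check(element):
    """Return a list of rule violations for element and its descendants."""
    rule = RULES.get(element.tag)
    if rule is None:
        return [f"unknown element <{element.tag}>"]
    problems = []
    present = {child.tag for child in element}
    for tag in sorted(rule["required"] - present):
        problems.append(f"<{element.tag}> is missing required <{tag}>")
    for child in element:
        if child.tag not in rule["allowed"]:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(check(child))
    return problems


good = ET.fromstring("<trial><title>Knight</title><para>Text</para></trial>")
bad = ET.fromstring("<trial><para>Text</para><note>?</note></trial>")

print(check(good))  # []
print(check(bad))
```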

14 XML: further bits and pieces
- Entities (.ent)
  - Reusable data inside a DTD or within markup
  - Think of entities as variables that can be used to define common text (e.g. copyright information). You can then use the entity anywhere you would normally use the text.
- Display (.css & .xsl)
  - Cascading Style Sheets (.css)
  - Extensible Stylesheet Language (.xsl)
    - Used for transforming data to another structure
    - Used for formatting objects
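
A minimal sketch of the entity idea follows. The entity name `rights` and its text are invented, and the entity is declared inline in the document's DOCTYPE here for simplicity; a project might instead keep such declarations in a separate .ent file, as the slide suggests.

```python
import xml.etree.ElementTree as ET

# The entity is defined once and can then be used wherever the text is needed.
DOC = """<!DOCTYPE text [
  <!ENTITY rights "Copyright Example Library. All rights reserved.">
]>
<text>
  <body>He was a clock-case maker.</body>
  <footer>&rights;</footer>
</text>"""

root = ET.fromstring(DOC)

# The parser replaces &rights; with the declared text.
footer = root.find("footer").text
print(footer)
```

Changing the declaration in one place changes the text everywhere the entity is referenced, which is the "variable" behaviour described above.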

15 Image and text
- The image is delivered and the text is fully searchable but not viewable
- Text is usually created by uncorrected OCR
- Different ways to do this:
  - Use a PDF document with image and text
  - Deliver an image with text that has been extracted to a searchable database (e.g. JSTOR)
  - Deliver an image with text that has very basic markup (possibly just pages defined) and is searched as XML
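
The second approach (an image with text extracted to a searchable store) can be sketched roughly as below. The file names and OCR text are invented, and a simple in-memory word index stands in for a real database; this is an illustration of the idea, not how JSTOR or any particular system works.

```python
from collections import defaultdict

# Hypothetical page images paired with their uncorrected OCR text.
PAGES = {
    "page_001.tif": "Thomas Knight was indicted for the wilful murder of Robert Ball",
    "page_002.tif": "The deceased Robert Ball was my son he was a clock-case maker",
}


def build_index(pages):
    """Map each lower-cased word to the set of page images containing it."""
    index = defaultdict(set)
    for image, ocr_text in pages.items():
        for word in ocr_text.lower().split():
            index[word].add(image)
    return index


INDEX = build_index(PAGES)


def search(term):
    """Return the page images whose hidden OCR text contains the term."""
    return sorted(INDEX.get(term.lower(), set()))


print(search("murder"))  # ['page_001.tif']
print(search("Ball"))    # ['page_001.tif', 'page_002.tif']
```

The user only ever receives the image files; the OCR text is used for matching and is never displayed, so OCR errors reduce recall without being visible.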

16 Indexed
- Basically just linking text or document formats to a subject index or resource catalogue
- Makes sense and is low cost where the index resource already exists
- Not so good if the index/catalogue has to be created, as this part is costly; in that circumstance XML might be better
- Delivered as a link within the index/catalogue that directs the user to the single text/document file
- Often used with MARC records or museum content management systems
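
In outline, the indexed approach amounts to one extra field on each catalogue record. The record structure, identifiers and URLs below are hypothetical; in a MARC record a link of this kind would typically be carried in field 856 (electronic location and access).

```python
from dataclasses import dataclass


@dataclass
class CatalogueRecord:
    record_id: str
    title: str
    subjects: tuple
    file_url: str  # link to the single delivered text/document file


# Invented records; the text itself is never indexed, only the catalogue data.
CATALOGUE = [
    CatalogueRecord("rec-001", "Trial of Thomas Knight", ("trials", "murder"),
                    "https://example.org/files/knight.pdf"),
    CatalogueRecord("rec-002", "Clock-case makers' accounts", ("trades",),
                    "https://example.org/files/accounts.pdf"),
]


def find_files(subject):
    """Search the catalogue by subject and return (title, link) pairs."""
    return [(r.title, r.file_url) for r in CATALOGUE if subject in r.subjects]


print(find_files("trials"))
```

Searching works only as well as the catalogue data: a word that appears in the document but not in the record will not be found.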

17 How to guidance: Rekeying
- Single rekeying: one pass with checks. Generally 99.5% accurate
- Double rekeying: keyed twice, differences checked. Generally 99.99% accurate
- Rekeyers should key what they see, not what they think!
- Assume they know nothing
- Textual layout and structure provide clues for rekeyers
- Detail all the variations, special characters and spellings that you can
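
The "keyed twice, differences checked" step can be sketched with a character-level comparison of the two passes, and the accuracy figures can be turned into expected error counts. This is an illustrative sketch using Python's standard `difflib`; the function names and sample strings are invented, and a real keying bureau's workflow may differ.

```python
import difflib


def keying_differences(first_pass, second_pass):
    """List the places where two independent keyings disagree."""
    matcher = difflib.SequenceMatcher(None, first_pass, second_pass, autojunk=False)
    return [
        (tag, first_pass[i1:i2], second_pass[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def expected_errors(characters, accuracy_percent):
    """Rough number of wrong characters implied by an accuracy figure."""
    return round(characters * (100 - accuracy_percent) / 100)


first = "he was a clock-case maker"
second = "he was a clock-cafe maker"

# Each disagreement is flagged for a person to check against the source.
print(keying_differences(first, second))  # [('replace', 's', 'f')]

# For a 100,000-character text:
print(expected_errors(100_000, 99.5))   # 500
print(expected_errors(100_000, 99.99))  # 10
```

Double keying only catches errors where the two passes disagree; if both keyers make the same mistake at the same place, it passes unnoticed.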

18 How to guidance: Rekeying
Example from the handout. Note:
- the detail
- the variations
- quality assurance

19 How to guidance: OCR
Handout. Note the need to understand the nature of the document:
- nature of original
- nature of printing
- language
- uniformity
- text alignment
- complexity of alignment
- lines, graphics and pictures
- handwriting

20 OCR Quiz
- Look at the 4 examples on screen
- Make a note of any features you think might affect OCR accuracy
- Have a guess at what you think the accuracy might be in % terms
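
The slides do not say how OCR accuracy is measured. One common convention, assumed here, is character accuracy: the share of characters in a hand-checked reference text that the OCR output gets right, with errors counted as the edit distance between the two. The function names and sample strings are invented for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Percentage of reference characters reproduced correctly by the OCR."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 100.0 * (1 - errors / len(ground_truth)))


# Two wrong characters in a 16-character reference string.
print(character_accuracy("clock-case maker", "c1ock-cafe maker"))  # 87.5
```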

