1 Creating textual resources: Printed documents

2 Content of this session: types of printed documents; methods of capture; some examples.

3 Types of documents: largely textual. Books; periodicals; newspapers; grey literature; documentary surrogates (microfilm etc.).

4 Other types of documents. Miscellaneous materials, including musical scores, ephemera, advertisements, cartoons, posters, etc. These fall more closely into the visual images category, to be discussed later.

5 Diamond Sutra, world's earliest printed book, AD 868

6 Gutenberg Bible, 1450s

8 Goettingen; British Library; Texas; Keio, Japan

9 News of the World, June 1851; News of the World, June 1918

10 Penny Illustrated, October 1861; Weekly Dispatch, June 1856

14 Chopin First Edition

16 Trade card, 18th C.

17 Advertisement for booksellers

18 Imperial War Museum Spanish Civil War Collection: Poster

19 Reel of microfilm

20 Microfiche

21 Characteristics of documents: books. Printed books can date back to the 1450s (Gutenberg Bible). Early printed books, such as those in Early English Books Online, may need to be treated more like manuscript materials.

22 Characteristics of documents: books. Almost certain to be bound: is it possible to disbind? Will they be discarded after scanning? May be printed on unstable media. Different sizes. May have image-rich content. Likely to have language/font/character set issues.

23 Characteristics of documents: books. Varied internal structures depending on topic and type: recipe books, art history books, children's books. Some common structural features: table of contents, index, bibliography, chapters, footnotes, pages.

24 Characteristics of documents: periodicals. May be bound: is it possible to disbind? Will they be discarded after scanning? May be printed on unstable media. Different sizes, supplements, etc. May have image-rich content. Likely to have language/font/character set issues.

25 Characteristics of documents: periodicals. Will have different structures according to type, but structure is likely to be regular within a title: comics, popular magazines, trade magazines, academic journals. Some common features: articles, images, advertisements, columns, diagrams, footnotes, bibliography, table of contents, etc.

26 Characteristics of newspapers. Large in format. Prolific in output. Designed as essentially ephemeral. Fragile. Complex and multipart. Change over time. Many different types of content: text, images, advertisements.

27 Characteristics of newspapers. Difficult to index. Difficult to store because of bulk and volume. Inherently unstable: paper is weak and brittle and deteriorates rapidly. Of great interest to researchers. Difficult to extract information from.

28 Characteristics of documents: grey literature. Catch-all category. Includes many different kinds of unpublished or semi-published materials: reports, personal papers, conference papers, newsletters.

29 Characteristics of documents: grey literature. Difficult to characterize. A collection may have many different formats, periods, conditions. Difficult to catalogue.

30 Characteristics of documents: microform. A good long-term storage alternative but a poor substitute for reading: loss of the sense of the physicality of the original; linear; small format; tiring to read; impossible to search; harder to scan (by eye) than the originals.

31 Capture methods. Depends on the type of material. There may be more than one option. What is the purpose of the digitization: a forensic record of the original? The textual content? Both?

32 Capture methods. The more human input to the materials, the higher the cost is likely to be. It is possible to create good, searchable digital surrogates from certain kinds of documents by largely automated means. Other materials may need more handling and human intervention.

33 Capture methods. Scanning: book scanner, flatbed scanner, drum feed scanner, microfilm scanner.

34 Digitization issues. Preparation of materials. Assessing the collection. Organization of data resources.

35 Scanning into electronic formats. Preparation of materials. Assess the collection. STOP POINT 1.

36 Scanning into electronic formats. STOP 2: OCR for indexing. STOP 3: OCR/rekeying for end-user presentation. STOP 4: Metadata. STOP 5: SGML/XML.

37 Digitization issues. In every case you have to: assess the nature of the collection; prepare the collection for digitization; decide how to organize the end information resource.

38 Creating full text. If digital images are scanned with no added value, digital microfilm is the result. This has many advantages for access. But much more is possible...

39 Creating full text. There are a number of ways to create manipulable text: rekeying (relatively expensive); OCR (Optical Character Recognition) with correction (expensive); uncorrected OCR (relatively low cost).

40 Creating full text. There are a number of ways to create manipulable text: rekeying (relatively expensive); OCR (Optical Character Recognition) with correction (expensive); uncorrected OCR (relatively low cost). These will be discussed further later.

41 Rekeying. Most costly option, but less expensive than it was! Very accurate if done well. Can be used instead of providing a digital image, or attached to a digital image as a means of searching.

42 Case study: Old Bailey Papers. Largest single digital resource on non-elite people. 58,000 pages = >250 million characters rekeyed. Rekeying is the most effective way to address the content of the originals. XML markup is the only way to deliver the content in a structured way.
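
To illustrate what "delivering the content in a structured way" can mean, here is a minimal sketch of wrapping rekeyed text in XML and then querying it by structure. The element and attribute names (`proceedings`, `trial`, `p`, `id`, `date`) and the helper functions are hypothetical, chosen only for illustration; they are not the markup scheme used by the Old Bailey project.

```python
import xml.etree.ElementTree as ET


def mark_up_trial(trial_id, date, paragraphs):
    """Wrap rekeyed paragraphs in a hypothetical <trial> element."""
    trial = ET.Element("trial", attrib={"id": trial_id, "date": date})
    for text in paragraphs:
        p = ET.SubElement(trial, "p")
        p.text = text
    return trial


def find_trials_by_year(root, year):
    """Return the ids of all <trial> elements whose date starts with `year`."""
    return [
        t.get("id")
        for t in root.iter("trial")
        if t.get("date", "").startswith(year)
    ]


# Placeholder content only; no real trial text is reproduced here.
root = ET.Element("proceedings")
root.append(mark_up_trial("example-001", "1750-01-17", ["Placeholder paragraph one."]))
root.append(mark_up_trial("example-002", "1751-02-03", ["Placeholder paragraph two."]))

print(find_trials_by_year(root, "1750"))
print(ET.tostring(root, encoding="unicode"))
```

Because the date is carried in markup rather than buried in running text, a query by year needs no text parsing, which is the kind of structured access plain rekeyed text alone would not give.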

47 OCR. Pattern recognition algorithms which can convert images of alphanumeric characters into ASCII code. Has been around since the 1970s: KDEM (Kurzweil Data Entry Machine); hardware and software were very expensive, so specialist bureaux offered it as a service; move to desktop OCR in the mid-to-late 1990s. See handout for OCR guidance.
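
The idea of converting a character image into a character code can be shown with a toy example. The sketch below uses simple template matching on tiny hand-made bitmaps: it picks the stored template that differs from the input in the fewest pixels. This is a deliberately simplified stand-in, not how the KDEM or modern OCR engines work, and the templates and function names are made up for illustration.

```python
# Hypothetical 3x5 pixel templates: '#' is ink, '.' is background.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognise(glyph):
    """Return (character, ASCII code) of the closest template."""
    best = min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))
    return best, ord(best)


# An "L" with one stray ink pixel, as might come from a poor scan.
noisy_l = ("#..", "#..", "##.", "#..", "###")
print(recognise(noisy_l))
```

The stray pixel does not change the result here because the noisy glyph is still closer to the "L" template than to the others; with heavier damage the nearest template can be the wrong one, which is where OCR errors come from.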

48 OCR accuracy. This depends on the quality of the image being processed. 99% is possible. To what degree is accuracy important? This can depend on the intended use of the captured text.
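
A short worked example helps show what 99% means in practice. The sketch below assumes a hypothetical page of 2,000 characters and an average word length of 5 characters, and it assumes character errors are independent of one another; none of these figures come from the slides. Under those assumptions, 99% character accuracy still leaves about 20 wrong characters on the page.

```python
def expected_char_errors(chars_per_page, char_accuracy):
    """Expected number of wrongly recognised characters on one page."""
    return chars_per_page * (1.0 - char_accuracy)


def approx_word_accuracy(char_accuracy, avg_word_len):
    """Chance that every character in a word is right.

    Assumes errors are independent, which real OCR errors are not,
    so treat the result as a rough guide only.
    """
    return char_accuracy ** avg_word_len


CHARS_PER_PAGE = 2000  # assumption for illustration
AVG_WORD_LEN = 5       # assumption for illustration

print(expected_char_errors(CHARS_PER_PAGE, 0.99))
print(approx_word_accuracy(0.99, AVG_WORD_LEN))
```

Since searching works on whole words, the word-level figure is usually the more relevant one when the captured text is meant for indexing rather than for reading.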

49 Case study: Refugee Studies Centre Digital Library. Grey literature collection. Earliest documents from the 1960s, so copyright a critical issue. Making content widely available the key aim. Forensic fidelity unimportant. Need to capture a large volume.

50 Case study: Refugee Studies Centre Digital Library. Methods: can do destructive scanning; digitization outsourced to HEDS; initially, uncorrected OCR also done by HEDS; later, use of Olive Software Active Paper Archive; OCR for searching, page image for viewing.
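
The "OCR for searching, page image for viewing" arrangement can be sketched as a word index that points back to page images. This is a minimal illustration under stated assumptions: the file names, sample text, and function names are hypothetical, and it does not describe how Olive Software's product is implemented.

```python
import re
from collections import defaultdict


def build_index(ocr_pages):
    """Map each lower-cased word to the page images whose OCR text contains it."""
    index = defaultdict(set)
    for image_name, ocr_text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page images containing every word of the query, sorted by name."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical uncorrected OCR output; "rcport" is a simulated misreading of "report".
pages = {
    "page-001.tif": "Refugee studies rcport",
    "page-002.tif": "Annual report on refugee camps",
}
index = build_index(pages)
print(search(index, "refugee"))
print(search(index, "report"))
```

The reader is only ever shown the page image, so OCR errors are invisible on screen, but a misread word such as the simulated "rcport" means that page is missed by a search for "report".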

57 Case study: British Library Newspaper Pilot. Methods: scanned from microfilm by OCLC; Olive Software's Active Paper Archive used for processing and delivery; all processing and metadata extraction is automatic; papers divided into components using profiles: articles (title/body), images (picture/caption), ads, etc.
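
One way to picture a page divided into components is as a small data model. The sketch below is an assumption made for illustration: the class and field names are hypothetical and are not the format used by Active Paper Archive. It only mirrors the component types named on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    title: str
    body: str


@dataclass
class Image:
    picture_file: str
    caption: str


@dataclass
class Advertisement:
    text: str


@dataclass
class NewspaperPage:
    paper: str
    date: str
    components: list = field(default_factory=list)

    def articles(self):
        """Return only the article components of this page."""
        return [c for c in self.components if isinstance(c, Article)]


def component_counts(page):
    """Count the components on a page by type name."""
    counts = {}
    for component in page.components:
        name = type(component).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


# Placeholder content only.
page = NewspaperPage(
    paper="Example Paper",
    date="1900-01-01",
    components=[
        Article(title="Placeholder headline", body="Placeholder body text."),
        Image(picture_file="placeholder.tif", caption="Placeholder caption."),
        Advertisement(text="Placeholder advertisement."),
        Article(title="Second headline", body="More placeholder text."),
    ],
)
print(component_counts(page))
```

Keeping titles, bodies, and captions as separate fields is what lets a delivery system search or display an individual article rather than a whole page.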
