Text and Books IST 653. Examples of Texts or Documents? Books Letters Log books Databases Website Maps Computer code.

1 Text and Books IST 653

2 Examples of Texts or Documents? Books Letters Log books Databases Website Maps Computer code

3 What is a (digital) document? Buckland – “any phenomena that someone may wish to observe: events, processes, images, and objects as well as texts.” (Buckland, 1997) Otlet – “objects themselves can be regarded as documents if you are informed by observation of them”, 3 dimensionality (Buckland, 1997) Briet – “evidence in support of a fact” (Buckland, 1997) Greenberg – “Any entity, form or mode for which contextual data can be recorded” (Greenberg, 2002, 2003)

4 Settling on a Practical Meaning Documents or Records, what are they? – A movie, painting, book, letter, or sculpture can all be documents. We are mainly concerned with information written or typed in spoken language using words, with some pictures. As librarians, we focus on format, though scholars might view function as more reliable. IST 602, Information and Knowledge Management, is where you will learn more about this debate/conversation.

5 Again, the “Why?” of Digital Libraries We are trying to lower the barriers to access of human knowledge. What are the barriers? – Images, audio, video and text? What processes must take place to do that? What are the costs? – Equipment, skills and staffing?

6 Putting Text into your DL Scanned page of text is just a picture; it’s not searchable Methods of converting to searchable text – Optical Character Recognition (OCR) – Double-keying Text encoding (e.g., in TEI) is yet another level of processing
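Double-keying generally means that two people type the same text independently and the two versions are then compared, so that disagreements can be reviewed by hand. A minimal sketch of that comparison step, assuming plain-text transcriptions (the function name and sample strings are hypothetical):

```python
import difflib

def keying_disagreements(first, second):
    """Return (first_version, second_version) word spans where two keyings differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]

# Hypothetical transcriptions of the same line by two typists.
typist_one = "The quick brown fox"
typist_two = "The quick brovvn fox"
disagreements = keying_disagreements(typist_one, typist_two)
```

Each pair in `disagreements` marks a spot where a person would need to check the page image.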

7 Fragile or Difficult to Scan Bound materials: might need disbinding – Books with maps or foldouts – Handmade books or too “tight” to sit in cr Loose pages or manuscripts: might need special handling

8 Optical Character Recognition (OCR) Forms containing character images are scanned, and the recognition engine of the OCR system then interprets the images and turns the printed characters into ASCII data, that is, machine-readable characters.

9 Optical Character Recognition (OCR) Indexing: OCR text allows for keyword/full-text searching, even if the OCR is not 100% accurate. Full-text retrieval: search engine results are displayed with hit highlighting within the page image. Full-text representation: the user is presented with a text file as a representation of the actual document. Needs extra hands-on cleanup.
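To make the indexing point concrete, here is a minimal sketch of a keyword index built over OCR text, one entry per page. The page texts and function names are made up for illustration; real digital library software uses a full search engine rather than a dictionary like this.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page numbers containing it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return index

def search(index, query):
    """Return sorted page numbers that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

# Hypothetical OCR output; "lihrary" on page 2 simulates an OCR error.
pages = {
    1: "The Library opened a new reading room.",
    2: "The lihrary board met to discuss the reading room.",
    3: "Weather: rain expected on Tuesday.",
}
index = build_index(pages)
```

With this sample, a search for "reading room" still finds pages 1 and 2, while a search for "library" misses page 2 because of the misread word. Imperfect OCR is useful for retrieval, but some hits are lost.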

10 OCR Software The technology as we know it today was developed in 1974. Ray Kurzweil refined it, if he did not invent it, and paired it with text-to-speech. Spurred by the need to assist blind and visually impaired readers. OCR goes hand in hand with the flatbed scanner breakthrough. https://www.youtube.com/watch?v=YbwQvU1aUDc

11 Scanning and OCR Considerations 300 DPI is a minimum standard, though higher does not mean better. Bitonal or grayscale: about the same in accuracy. 50% brightness setting. 100% accuracy requires re-keying, done in-house or by a vendor. Very expensive!

12 Scanning and OCR Considerations Current OCR software boasts rates as high as 97% accuracy. Accuracy is based on letters, not words. Limit materials to post-1850. Certain languages/typesets may be more difficult than others. Small font sizes might require higher dpi. Handwritten text can be OCR’d, but it’s still expensive, cutting-edge technology.
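Because accuracy is counted per letter, word-level accuracy is lower than the headline figure. A rough back-of-the-envelope sketch, assuming character errors are independent and an average word length of 5 letters (both are simplifying assumptions, not figures from the slides):

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every letter of a word is recognized correctly."""
    return char_accuracy ** word_length

def expected_bad_words(char_accuracy, word_length, word_count):
    """Expected number of words containing at least one wrong letter."""
    return word_count * (1 - word_accuracy(char_accuracy, word_length))

# 97% letter accuracy, 5-letter words, a hypothetical 500-word page.
per_word = word_accuracy(0.97, 5)
bad_words = expected_bad_words(0.97, 5, 500)
```

Under these assumptions, 97% letter accuracy works out to roughly 86% word accuracy, or about 70 flawed words on a 500-word page.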

13 Despeckle

14 Settings for Printed Text Bit depth is defined as the number of bits per pixel. Each of the following bit depths will result in a successively larger file: Bitonal A 1-bit pixel has two possible values, black or white. 1-bit bitonal images are appropriate for some machine-produced documents and books with no illustrations, or with black-and-white line art only. They are not sufficient for documents that include handwriting, photographs, or illustrations with half-tones (greys). Bitonal scanning will produce the smallest possible file. 8-bit Greyscale has 256 possible values along a spectrum of greys ranging from pure white to pure black. 8-bit greyscale scanning is appropriate for producing images that include grey tones, but no colors—such as black-and-white photographs, and books or documents containing handwriting or half-tone illustrations. 24-bit Color has 16.7 million possible color values, meaning it will produce greater color fidelity. Use only on color illustrations

15 Settings for Printed Text 300 to 400 dpi, perhaps higher for smaller objects 300 dpi is adequate for OCRing
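As a rough illustration of why each bit depth produces a successively larger file, the sketch below computes the uncompressed size of a scanned page. The 8.5 x 11 inch page size and the function names are assumptions for illustration, and the figures ignore file headers, row padding, and compression.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a scan at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)

def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    """Approximate raw image size in bytes (no header, padding, or compression)."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes

# Hypothetical letter-size page scanned at 300 dpi.
sizes = {
    depth: uncompressed_bytes(8.5, 11, 300, depth)
    for depth in (1, 8, 24)  # bitonal, 8-bit greyscale, 24-bit color
}
```

For this page, the bitonal scan is about 1 MB, the greyscale scan about 8.4 MB, and the color scan about 25 MB before compression.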

16 Halftone Printing Halftone is a technique that simulates continuous-tone imagery through the use of dots, creating an optical illusion. Image without descreening (with moiré patterns). Descreened image.

17 Deskew Skew detection and correction

18 Calibration Gives you a benchmark from which to judge color and grayscale. Calibrate monitor and scanner, not unlike test tone or

19 PDF and PDF/A Adobe Portable Document Format (PDF) is an open format, but was proprietary until 2008. Widely accepted in libraries, industry, and DL. Favored because PDFs look like the original. Works well with screen readers.

20 Preferred Formats

21 Normal PDF This is the most common type of PDF and is most typically created from a document such as Microsoft Word. It contains the full text of the page with appropriate coding to define fonts, sizes, etc. and will provide a faithful print of the original.

22 Image PDF This is a PDF that has been created from one or more images, most commonly as a result of scanning a document either directly to PDF or by converting a scanned TIFF image to PDF.

23 Searchable PDF A “Searchable” PDF is an “Image-Only” PDF that additionally contains a hidden layer of text generated by an OCR engine. This enables the file to be searched in the same fashion as a “Normal” PDF. Text can be copied and pasted.

24 TIFF Considered a master file format. Preservation worthy. No compression. Raster image. Documented and well-understood format. More trusted until Adobe opened up the PDF specification.

25 Indexing and Searchability

26 Book Scanning Equipment Flatbed scanners. Planetary scanner – Needs page curvature correction software – Does not touch the object. The V-shaped book scanner – No need to control for curvature – Good for fragile materials

27 ILL, Document Delivery, Preservation InterLibrary Loan routinely scans articles and book chapters instead of shipping the original or photocopying. Preservation departments engage in conservation work, and will scan in the process. Digitization on demand: the book is out of copyright and a patron inquires. eReaders: Kindle, iPad, etc.

28 Other Use Cases Increased access: people from all around the world can get access to unique documents to read. Corpora: used by linguists and literature scholars to conduct statistical analysis. Digital humanists – Literature: do text mining to study large-scale developments in the novel, without “reading” the books – Historians: search and mine newspapers for historical analysis using statistical tools

29 Metadata

30 Vendor Option Can be a cheap or expensive option. If you are doing a small project, in-house might be a better option. For big projects with grant money, vendors are sometimes more affordable. Great expertise, but you still need to be informed when choosing. Example: http://www.albany.edu/~mwolfe/ist653/week6/

31 Scanning TIFF, 300 DPI, bitonal: recommended for newspapers containing no or few photographs. TIFF (JPEG compression), 300 DPI, 8-bit grayscale: few photographs or graphic elements.

32 Filenaming CCC-nnn-YYYY-MM-DD-VV-NN-XXX.TIFF CCC: Project code (3 digits); nnn: Publication name (3-letter code); YYYY-MM-DD: Issue date; VV: Page version/edition (default is 01); NN: Section (default is 01); XXX: Page # in section (in the order in the paper). Examples: 069-PMV-1985-12-12-01-01-001.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 1) 069-PMV-1985-12-12-01-01-002.tiff (Dec 12, 1985, ed. 1, sec. 1, pg. 2) 069-PMV-1985-12-12-02-02-001.tiff (Dec 12, 1985, ed. 2, sec. 2, pg. 1)
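A naming scheme like this is easiest to keep consistent when filenames are generated and checked by a script. The sketch below builds and parses names in the CCC-nnn-YYYY-MM-DD-VV-NN-XXX pattern; the function names and the choice to accept both .tif and .tiff are assumptions for illustration, not part of the project's specification.

```python
import re
from datetime import date

# Pattern follows the scheme on the slide: project, publication, date, edition, section, page.
NAME_PATTERN = re.compile(
    r"^(?P<project>\d{3})-(?P<pub>[A-Za-z]{3})-"
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-"
    r"(?P<edition>\d{2})-(?P<section>\d{2})-(?P<page>\d{3})\.tiff?$",
    re.IGNORECASE,
)

def build_name(project, pub, issue_date, page, edition=1, section=1):
    """Build a filename; edition and section default to 01 as on the slide."""
    return (
        f"{project:03d}-{pub.upper()}-{issue_date:%Y-%m-%d}-"
        f"{edition:02d}-{section:02d}-{page:03d}.tiff"
    )

def parse_name(name):
    """Split a filename into its parts, raising ValueError if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid scan filename: {name!r}")
    return {
        "project": match["project"],
        "publication": match["pub"].upper(),
        "date": date(int(match["year"]), int(match["month"]), int(match["day"])),
        "edition": int(match["edition"]),
        "section": int(match["section"]),
        "page": int(match["page"]),
    }

first_page = build_name(69, "PMV", date(1985, 12, 12), page=1)
parsed = parse_name("069-PMV-1985-12-12-02-02-001.tiff")
```

Parsing also validates the date, so a name such as 069-PMV-1985-13-40-01-01-001.tiff is rejected rather than silently filed.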

33 Filenaming (NYSHistoricNewspapers.org) Scanned newspapers should meet the following minimum standards: - PDFs (preferred) or TIFF files, with each newspaper title individually identified, and OCR files included if possible - 300 dpi / black and white (400 dpi / grey scale preferred) - Each page filed as: YYYYMMDD – p# For example: nowheresville-gazette-18590621–001.pdf nowheresville-gazette-18590621–002.pdf nowheresville-gazette-18590622–001.pdf nowheresville-gazette-18590622–002.pdf

34 Article Segmentation

35 Access & Delivery Basic web pages – Pros: Easy to do – Cons: Difficult to maintain, no searching, no dynamic browsing Digital library software – Pros: Powerful functionality for searching, browsing, and managing content – Cons: Can require high level of technical skill, can be expensive

36 Text Repository Examples https://ithacalibrary.com/archives/ithacan.php http://nyshistoricnewspapers.org/

37 Scanning Demos Zeutschel "Perfect Book" – https://www.youtube.com/watch?v=-9vxPyC2nYY Stanford Digital Library – https://www.youtube.com/watch?v=RdLcrNeWjIs

