Document Delivery Formats for the Web and Legal Digital Collections. Kevin Reiss, June 18th, 2004. Law Library, Rutgers-Newark School of Law.

1 Document Delivery Formats for the Web and Legal Digital Collections. Kevin Reiss, June 18th, 2004. Law Library, Rutgers-Newark School of Law

2 Delivery Formats & Issues Delivery format: the type of file a user receives when accessing a document in a digital collection. Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing. There is no one format that is right for every type of collection. Important issues to consider: – Open v. Closed Formats – Usability and Accessibility – Subject-Specific Concerns for Legal Materials

3 Open v. Closed Formats Who is "in control" of the document format you choose? A standards body? A single company or organization? Can you count on something that one entity controls to be supported over time? Advantages of Open Formats (a.k.a. Standards): – Interoperability and support over time – Integrate well with open-source or low-cost processing and IR tools – Help web content providers who need to support an increasing variety of devices and platforms

4 Usability & Accessibility What software do users need to view a particular format? Can a web browser natively display it? If the format requires a browser plug-in: – Is it free? Are users likely to have it installed? – Does it work on all computing platforms? Do public search engines index the format? Can dial-up modem users access the material in the collection?

5 Subject-Specific Concerns for Legal Materials Legal digital projects usually manage texts, not images. Some types of legal materials are harder to maintain, e.g., codified material. Legal documents are almost exclusively printed in black & white. Preservation of the page structure is important for citation purposes. Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.

6 Possible Delivery Formats Pure image formats: TIFF, JPEG Open encoded formats: XML, HTML, ASCII, and Unicode Hybrid formats: PDF, DjVu – can contain both image and text Proprietary formats: Microsoft Word, WordPerfect

7 Pure Images: TIFF, JPEG Raster (pixel-based) formats used exclusively for scanned collections. Raster TIFF is the best choice for archival scanned images. Pros: – Web browsers display them natively – Both are open formats Cons: – Large file sizes make viewing on slow connections problematic – Text of the documents is available only through OCR (Optical Character Recognition) – Weak support for multi-page documents – JPEGs have trouble displaying text when they are compressed to levels appropriate for the web – Contain metadata about the physical file itself, not the contents of the file
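To make the slow-connection concern concrete, here is a rough sketch that estimates transfer time from file size and link speed. The page sizes and the 56 kbit/s dial-up figure are illustrative assumptions, not measurements from the presentation, and real throughput would be lower once protocol overhead is counted.

```python
def download_seconds(size_bytes: int, link_kbps: float) -> float:
    """Idealized transfer time: bits to send divided by bits per second."""
    return (size_bytes * 8) / (link_kbps * 1000)


# Hypothetical sizes for a single scanned page (assumptions for illustration).
SAMPLE_PAGES = {
    "large scanned page image (1 MB)": 1_000_000,
    "heavily compressed page (50 KB)": 50_000,
}

if __name__ == "__main__":
    for label, size in SAMPLE_PAGES.items():
        seconds = download_seconds(size, 56)
        print(f"{label}: about {seconds:.0f} s at 56 kbit/s")
```

Under these assumptions a 1 MB page image takes well over two minutes on dial-up, while a 50 KB page takes a few seconds, which is why per-page file size matters so much for image-based delivery.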

8 Imaged Formats Cont. OCR is an important consideration: – A 5% error rate doesn't have an impact on traditional IR measures – A 20% error rate significantly degrades the performance of traditional IR techniques [Doermann 98] – High-quality OCR is now available at relatively low cost: ABBYY FineReader ($300); table and page layout recognition supported
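One way to check where a collection falls between those error rates is to compare OCR output against a hand-corrected sample. The sketch below assumes a character-level error rate (edit distance divided by the length of the corrected text); the slide does not say whether its percentages are per character or per word, and the sample strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(truth: str, ocr: str) -> float:
    """Edits needed to turn the OCR text into the corrected text, per true character."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # Hypothetical sample: three characters misread by the OCR engine.
    truth = "The court held that the statute applies."
    ocr = "The c0urt he1d that the statnte applies."
    print(f"character error rate: {char_error_rate(truth, ocr):.1%}")
```

Running the same measurement over a few hand-corrected sample pages gives a rough estimate of whether uncorrected OCR is good enough to index.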

9 Open Encoded Formats XML, HTML, ASCII, Unicode. Typically easier to integrate into digital libraries [Baird 2004] – Created in 3 ways: born-digital documents, manually keyed documents, corrected OCR – IR applications are easy to build, and open-source support is strong – International standards or W3C recommendations – Accessible with all current web technologies – Metadata is easily embedded in XML and HTML documents – Can be created with any text editor – Improvements in OCR make encoding scanned collections feasible
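As an illustration of why IR applications are easy to build on plain encoded text, here is a minimal inverted-index sketch with AND-style search. It is a toy, not the indexing tool used by any project mentioned in these slides, and the document IDs and texts are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    # Hypothetical documents standing in for extracted full text.
    docs = {
        "report-1": "The statute applies to contracts.",
        "report-2": "The court reviewed the contract.",
    }
    index = build_index(docs)
    print(search(index, "statute contracts"))
```

None of this is possible with a pure image until the text has been recovered by OCR, which is the practical meaning of the IR advantage claimed for open encoded formats.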

10 Open Encoded Formats Cont. Cons: – These documents can be expensive for staff to create: XML encoding may have to be done by hand, and OCR errors must be corrected manually – Getting the full benefit of these formats requires technical expertise on staff (e.g., a Perl programmer) – These formats don't necessarily preserve the "look" of printed documents

11 Hybrid Formats: PDF, DjVu PDF and DjVu are proprietary technologies that have substantial support in the open source community. Both can contain a layer of the document’s text and an image of each page in a document. Both utilize cross-platform, freely available web browser plug-ins. Both try to preserve the look of print documents. It is easy to export born-digital documents to these formats using printer drivers (“print to PDF”).

12 Adobe PDF Pros: – PDF has strong market acceptance in the legal community – PDF-Archive, a standard for using PDF as an archival format, is in development by AIIM (Association for Information and Image Management) – Adobe makes the PDF reference manual and software development kit freely available to developers – Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies Cons: – Plug-in performance is poor for long documents – PDFs composed of scanned images can be very large, even for short documents

13 DjVu Designed to be a scan-to-web technology. Pros: – Best compression of any image format on the web – Users can load lengthy documents very quickly – The DjVu plug-in can be manipulated via CGI-style arguments – The Any2DjVu server can be used to try out the format Cons: – DjVu does not yet have great market acceptance in the legal community – DjVu does not have a standard method for embedding metadata within documents
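The CGI-style arguments make it possible to link straight to a cited page. The sketch below builds such a link; the `djvuopts` keyword and the `page` and `zoom` option names are assumptions based on the DjVuLibre viewer documentation and should be checked against the plug-in actually deployed. The URL is a placeholder.

```python
from urllib.parse import quote


def djvu_view_url(document_url: str, **viewer_options) -> str:
    """Append viewer options to a DjVu document URL as CGI-style arguments.

    Assumption: the plug-in treats everything after the `djvuopts`
    keyword as name=value viewer options.
    """
    parts = ["djvuopts"]
    for name, value in viewer_options.items():
        parts.append(f"{quote(name)}={quote(str(value))}")
    return f"{document_url}?{'&'.join(parts)}"


if __name__ == "__main__":
    # Hypothetical document URL; opens at page 12, zoomed to fit the width.
    print(djvu_view_url("http://example.org/report.djvu", page=12, zoom="width"))
```

For legal materials this pairs well with page-based citation: a search result can open the document at the cited page rather than at the first page.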

14 Proprietary Formats Word processing formats: MS Word, WordPerfect. Not a good choice for document delivery on the web. Cons: – These formats are completely closed – Poor cross-platform support – It is often problematic to index these documents using inexpensive or open source IR tools

15 The New Jersey Digital Legal Library URL: http://njlegallib.rutgers.edu Digitizes New Jersey legal materials not currently available online. Available to users in two formats: DjVu and PDF. Current workflow: – Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu – Extract OCR text from the DjVu to XHTML using XSL stylesheets and DjVuLibre (the open source DjVu library) – Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
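The last two workflow steps produce XHTML that carries both page text and descriptive metadata. The sketch below shows roughly what such a file could look like, built with Python's standard library instead of the XSL stylesheets the project uses. The `DC.`-prefixed `meta` names follow the common convention for Dublin Core in HTML; the one-`div`-per-page layout, the field values, and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def q(tag: str) -> str:
    """Qualify a tag name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{tag}"


def build_xhtml(title: str, dc_fields: dict[str, str], pages: list[str]) -> str:
    """Wrap per-page OCR text in XHTML with Dublin Core meta elements."""
    ET.register_namespace("", XHTML_NS)  # serialize without a namespace prefix
    html = ET.Element(q("html"))
    head = ET.SubElement(html, q("head"))
    ET.SubElement(head, q("title")).text = title
    for name, value in dc_fields.items():
        ET.SubElement(head, q("meta"), {"name": f"DC.{name}", "content": value})
    body = ET.SubElement(html, q("body"))
    # One div per scanned page keeps page boundaries available for citation.
    for number, text in enumerate(pages, start=1):
        div = ET.SubElement(body, q("div"), {"class": "page", "id": f"page-{number}"})
        ET.SubElement(div, q("p")).text = text
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical metadata and page text.
    print(build_xhtml(
        "Sample Report",
        {"title": "Sample Report", "date": "1950"},
        ["First page text.", "Second page text."],
    ))
```

An indexer can then be configured to treat the `meta` elements as searchable fields alongside the full text; how that is set up in swish-e is not covered by these slides.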

16 References 1. Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto, CA, 2004. 2. Doermann, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998, pp. 287-298.
