JSTOR & OCR - A Case Study Kiffany Francis

What is JSTOR? “JSTOR is a not-for-profit organization with a dual mission to create and maintain a trusted archive of important scholarly journals, and to provide access to these journals as widely as possible.”

JSTOR: The name JSTOR stands for "journal storage." JSTOR is building a digital archive of journal back runs, some of which date back to the 1600s. JSTOR has converted over 10 million paper journal pages from over 240 journals representing more than 170 publishers. The JSTOR archive is available at more than 1,450 libraries.

JSTOR: Each journal page digitized by JSTOR is processed by an OCR application. The resulting text files are used to support the full-text searching offered to JSTOR users.
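The idea of searching OCR text files can be sketched with a small inverted index. This is only an illustration of full-text search over per-page text. It is not a description of JSTOR's search system, and the page ids and sample text are invented.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the page ids that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical OCR output for two page images.
pages = {
    "p1": "The economics of scholarly journals",
    "p2": "Journals and the history of science",
}
index = build_index(pages)
print(search(index, "history journals"))
```

Because the user is shown the page image and the text is used only for lookup, an occasional misread word costs a missed hit rather than a visible error.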

What is OCR? Optical Character Recognition (OCR) is the process that converts the text of a printed page or image into editable, digital text.

What does OCR software do? The software analyzes the layout of the text. The order of the paragraphs is determined. Analysis of characters begins. The software compares character groups (words) to the dictionary in the OCR application. When a match is found, the software writes the word to the text file.

What does OCR software do? If a match cannot be found, the software makes a reasonable assumption and flags the word with low confidence. If a word or character cannot be read at all, a default character is inserted as a placeholder.
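The dictionary step described on the two slides above can be sketched as follows. This is a minimal illustration and does not represent the algorithm of any particular OCR product. The placeholder character, the tiny dictionary, the status labels, and the use of Python's `difflib` for the "reasonable assumption" are all assumptions made for the example.

```python
import difflib

PLACEHOLDER = "~"  # hypothetical default character for unreadable input
DICTIONARY = ["archive", "journal", "scholarly"]  # stand-in for the OCR dictionary

def resolve_token(chars, dictionary=DICTIONARY):
    """Resolve one recognized word.

    chars is a list of recognized characters, with None where a character
    could not be read at all. Returns (text, status).
    """
    if any(c is None for c in chars):
        # Unreadable characters become placeholders.
        text = "".join(PLACEHOLDER if c is None else c for c in chars)
        return text, "unreadable"
    text = "".join(chars)
    if text.lower() in dictionary:
        return text, "match"
    # No exact match: take the closest dictionary word as a best guess
    # and flag it with low confidence.
    guesses = difflib.get_close_matches(text.lower(), dictionary, n=1, cutoff=0.7)
    return (guesses[0] if guesses else text), "low_confidence"

print(resolve_token(list("journal")))
print(resolve_token(list("joumal")))   # "rn" misread as "m"
print(resolve_token(["j", "o", None, "r", "n", "a", "l"]))
```

The low-confidence flag is what lets a later quality-control pass find the words most likely to be wrong.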

Problems with OCR: OCR does not handle certain text very well: non-Arabic text, nonmodern type, small print, certain fonts, and complex page layouts.

JSTOR: Production Process The process begins at JSTOR in Ann Arbor, Michigan. Page-by-page examination of journal run. Preservation concerns are addressed. Scanning guidelines are created. A production librarian and serials specialist create indexing guidelines. Journal is shipped to contractor to be scanned and described.

JSTOR: Production Process At the contractor facility: Physical journals are disbound and separated into pages sorted by issue. Each page is scanned in bitonal TIFF format at 600 dpi resolution. Page images are checked for marks, folds, and skewing. A table of contents file is added. If available, abstracts and keywords are added. All digital files created by the contractor (page images and table of contents files) are written to CD-ROM and shipped back to JSTOR in Ann Arbor.

JSTOR: Production Process Rich Digital Masters: Each page is scanned in bitonal TIFF format at 600 dpi. This is preferred because: 1. In 1994, there was some debate about whether 300 dpi or 600 dpi was better because of storage space. 600 dpi won out. 2. 600 dpi printers are now standard. 3. Resolutions higher than 600 dpi are not discernibly better for black-and-white text-based images.
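The storage-space side of the 300 dpi versus 600 dpi debate can be made concrete with a rough calculation. The sketch below assumes an 8.5 x 11 inch page (the slides do not give a page size) and counts one bit per pixel with no compression and no row padding. Real bitonal TIFF files are normally compressed, so actual files would be smaller.

```python
def bitonal_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan (row padding ignored)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px // 8

# Assumed page size: 8.5 x 11 inches.
size_300 = bitonal_bytes(8.5, 11, 300)
size_600 = bitonal_bytes(8.5, 11, 600)

print(size_300)             # 1051875 bytes, about 1 MB
print(size_600)             # 4207500 bytes, about 4 MB
print(size_600 / size_300)  # 4.0
```

Doubling the resolution quadruples the pixel count, which is why the choice mattered when storage was expensive.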

JSTOR: Production Process Back at JSTOR in Ann Arbor: Files are uploaded from CD-ROM to JSTOR file servers. A quality control process verifies image and table of contents quality. After the quality check, each page image is processed by OCR software to create full text for searching. After further quality control, the title is announced to JSTOR participants.

JSTOR: Production Process The quality of OCR for journals: JSTOR reports a 97% accuracy rate for its OCR-created text files. Some journals yield OCR files that are 99.95% accurate. This level of accuracy is satisfactory for searching but not for presentation.
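A quick calculation suggests why these accuracy rates suit searching better than presentation. The slides do not say whether accuracy is measured per character or per word. The sketch below assumes per-character accuracy, independent errors, and a hypothetical page of 3,000 characters.

```python
def expected_errors(units, accuracy):
    """Expected number of wrong units (here, characters) at a given accuracy."""
    return units * (1.0 - accuracy)

def prob_word_correct(length, accuracy):
    """Chance that every character in a word is right, assuming independent errors."""
    return accuracy ** length

CHARS_PER_PAGE = 3000  # assumed figure, not taken from the slides

for accuracy in (0.97, 0.9995):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    word_ok = prob_word_correct(7, accuracy)
    print(f"{accuracy:.2%}: about {errors:.1f} errors per page, "
          f"7-letter word fully correct with probability {word_ok:.3f}")
```

Under these assumptions, 97% accuracy leaves about 90 wrong characters on a page, which a reader would notice at once, while most individual search terms are still recognized correctly. At 99.95% the figure drops to about 1.5 per page.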

Example of JSTOR page.

Example of scanned image from JSTOR

JSTOR: Preservation Issues A PLAN FOR PRESERVATION. Print repositories of JSTOR journals are being started at University of California and Harvard University. The database is currently housed on servers managed and maintained at Princeton University, University of Michigan, and University of Manchester (UK). Archival cold tapes are also stored at the OCLC and at the JSTOR offices in New York City.

Guidelines: Is OCR right for your project? 1. “Select the technology that will enhance your ability to meet the objectives of the project.” From “An OCR Case Study” by Eileen Gifford Fenton

Guidelines: Is OCR right for your project? 2. “Scale matters -- a lot.” From “An OCR Case Study” by Eileen Gifford Fenton

Guidelines: Is OCR right for your project? 3. “There is no right answer.” From “An OCR Case Study” by Eileen Gifford Fenton

Guidelines: Is OCR right for your project? 4. “Costs will be higher than you expect.” From “An OCR Case Study” by Eileen Gifford Fenton

Guidelines: Is OCR right for your project? 5. “The answer that is right for today may not be right in the future.” From “An OCR Case Study” by Eileen Gifford Fenton

Sources for Further Investigation Bibliography: Guthrie, Kevin, JSTOR. “Developing a Digital Preservation Strategy for JSTOR, an interview.” JSTOR website: Kiplinger, John, Director of Production, JSTOR. “Print-Repository Effort Under Way at UCLA and Harvard.” Fenton, Eileen Gifford, JSTOR, University of Michigan. “An OCR Case Study.” In Handbook for Digital Projects: A Management Tool for Preservation and Access.