Presentation on theme: "1 Economic Growth Center Digital Library a project funded by the Andrew W. Mellon Foundation Ann Green, Steve Citron-Pousty, and Marcia Ford Social Science."— Presentation transcript:
1 Economic Growth Center Digital Library a project funded by the Andrew W. Mellon Foundation Ann Green, Steve Citron-Pousty, and Marcia Ford Social Science Research Services Yale ITS Academic Media & Technology Sandy Peterson and Julie Linden Yale Social Science Libraries and Information Services Christopher Udry Professor, Yale Dept of Economics Director, Economic Growth Center and Dept of Economics
2 Goals of the grant: Improving access to statistical resources not born digital Build a prototype archive of statistics not born digital. Implement standard digitization practices and emerging metadata standards for statistical tables. Document the costs and processes of creating a statistical digital library from print. Build the collection based upon long range digital life cycle requirements. Present the prototype digital library to the scholarly community for evaluation.
3 Research questions zWhat effect does online access to the statistical information have on scholarly use of the materials? zAre common digitization practices and standards suited to statistically-intensive documents? zWhat are the long term preservation requirements of the EGCDL? zWhat are the costs of producing high quality statistical tables with OCR and editing? zHow scalable is this process, for what kinds of collections, and for what purposes?
4 Selecting the EGC Why Economic Growth Center collection? extraordinary wealth of statistical material in printed form condition of paper is poor; preservation concerns a range of access problems imposed not only by non-digital physical formats but also by inadequate descriptions of the contents of the publications Why Mexico? Faculty connections, library strengths, image quality and completeness of collection Publisher: Instituto Nacional de Estadística, Geografía e Informática (Mexico, INEGI) Anuarios Estatales
5 Digitization process Vendor review: Part one: request for proposals (13 vendors) Part two: production of samples and extended proposals (6 vendors) Part three: final review and budget evaluation Part four: contract negotiation and processing Prepared and shipped 221 volumes in Fall 03 All materials received back at Yale Feb 04 Quality assessment period complete March 04
6 Deliverables received: images and PDFs 300 dpi TIFFs of each page of each volume, including cover and back of volume (103,115 TIFF files; 460 gigabytes) Color TIFFs for color pages, black and white TIFFs for pages with only black and white PDF image + text files of each chapter of each volume (5,607 files) Separate files for each subject chapter, front matter, indices, and back matter
7 Deliverables received: statistical tables in Excel Excel tables (16,488) of demographic and economic tables for 1994, 1996, 1998, 2000 Selected OCRd statistical tables converted to Excel format DSI operators reviewed each table twice Custom tagging done by DSI improved ability to extract columns and rows into online database and build XML metadata
8 Quality assessment of PDFs Overall assessment was very good; minor problems on a very small number of pages Sampled 5% of the PDFs subjective review of image: Tilting, cut off text, illegible characters, bleed through, color evaluation, noise Vendor did not produce PDFs exactly to our specification, had to reformat 216 files
9 Quality assessment of Excel tables zChecksums: yWrote script to find tables with Total columns yWrote Excel macro to compare column sums with Total values yExcellent numeric transfer of numbers from print zVisual review
10 Automated Dublin Core metadata production for PDF and Excel files Defined metadata format for PDF and Excel documents Dublin Core subset (title, date, format, identifier, source, coverage, subject) Wrote scripts to produce individual metadata records for each PDF chapter and each Excel file Matched chapter numbers with chapter titles: created standard matching tables Generated subject term for each chapter: created standard tables of chapter titles to topic list Generated series title list from library online catalog Other text generated from file name and directory information
12 First generation interface: Select, view, download ssrs.yale.edu/egcdl Features: Select files by year, state, topic, and/or type of file (PDF or Excel) Reconstruct the full volume Based upon Dublin Core metadata
14 Next generation interface: NESSTAR Features: Search across tables View tables, select columns and rows Create graphs and charts Download and extract Based upon the Data Documentation Initiative (DDI) metadata standard for statistical tables Uses statistical software similar to SPSS Nesstar is in use by data archives in Europe, World Bank, Health Canada; in review by ICPSR and The Roper Center
15 Metadata production and data publishing: Nesstar process zScript creates DDI XML file from Dublin Core record and CSV file from Excel table zStaff publish each pair of DDI XML file and CSV file using Nesstar cube builder software zSome editing needed at this stage: yAdd measure (what the table is measuring persons, events, pesos, etc) yAdd column header name zPublish metadata and table data to Nesstar server
20 Evaluation of Nesstar interface zNo major advantage over Excel in terms of viewing or manipulating data zMetadata and table cant be viewed together zCant search within specific fields (e.g. state, year, column headers, etc.) zDe-contextualization of tables (search results dont indicate what volumes the tables belong to) zLack of flexibility in customization
22 Evaluation of Nesstar-produced metadata zLabor-intensive to create and edit; no batch processing zHave to interpret table elements (add column header and measure labels) zSome data lost (textual codes for missing data; footnotes within cells) zDDI records not valid; some deviations from DDI specification
23 Automated DDI production: script-based process zScripts pulled elements from Excel files and Dublin Core records zNo manual editing necessary zDDI records are valid zMetadata describes table as is without our interpretation imposed on it zChallenges with marking up hierarchical tables in DDI
24 Search/browse UI for PDFs and Excel zLucene index for keyword searching yIncludes text from table titles, column and row labels, and footnotestext from table titles, column and row labels, and footnotes yCan use accent marks or not in search terms; Lucene returns appropriate results set
27 Transforming paper to digital resources: Generations of table presentation zPaper zImages of paper zPDF surrogates for paper publications zExcel files for downloading zNesstar online presentation zKeyword search on PDF/Excel interface
28 What we are learning and documenting Costs and processes of digitizing paper and building a statistical digital library: Scanning requirements (TIFFs, PDF/a, etc) OCR of Spanish text OCR of numbers into spreadsheets (zoned scanning) Quality assessment Is it less expensive to key in the tables? What is the tipping point?
29 Table complexity Metadata complexity and volume: linked to table complexity Text legibility: affects OCR accuracy OCR accuracy: formatting OCR accuracy: character ComplexHighPoorLow 1 2 3 SimpleLowGoodHigh 4
30 Cost comparison of keying in the data zSent sample tables to two vendors and asked for cost estimates for keying yVaried widely – different pricing structures (per 1000 characters vs. per Kb) yBoth require shipping volumes (or photocopies) overseas
31 Costs of digitizing Nigeria data zMany volumes have poor quality paper and print ymimeographed copies of typewritten pages yskewing, bleed through, strikeovers, text cut off zGoals of project yTry to find tipping point for digitizing collection vs. digitizing selection of tables yTest whether lessons learned from relatively clean tables apply to lower-quality print materials yDetermine how these materials would fit into EGCDL UI zCost estimates yConsidering TIFF and PDF for all volumes, flagging good quality volumes for automated conversion to Excel
32 What we are learning and documenting Digital Life Cycle considerations: Formats and metadata standards Long term home for digital assets and the Rescue Repository Exactly what assets do we preserve? Questions about versioning XML to Excel as potential preservation format
33 What we are learning and documenting Metadata production: extensive use of automated processes Dublin core for file level metadata DDI for table level metadata User interface development and Spanish character challenges Nesstar implementation for aggregate data Developed own UI for more flexibility, ability to include PDFs and Excel tables
34 What we are learning and documenting Scholarly use of the EGCDL: Finding and accessing tables yGreat advantage in locating data content by the full text of the tables; value in online access to individual tables. yIntegrate searching and access into existing resources yAre investments in online analysis and visualization worth it? Are Excel tables what faculty and students want?
35 What we are learning and documenting Production of more tables: yDo faculty value a collections based digitization effort or is there more value in on demand service? yWhat other countries in the EGC print collection are of interest? yCost comparisons of building full volume equivalents vs selected tables. yThe process can be leveraged to facilitate production for other research and learning projects.
36 Contact information firstname.lastname@example.org email@example.com firstname.lastname@example.org Project web site: ssrs.yale.edu/egcdl