Robin Butterhof & Deborah Thomas, Library of Congress; Leah Weinryb Grohsgal, National Endowment for the Humanities. Digitized Newspapers & Research. DPLAfest.


1 Robin Butterhof & Deborah Thomas, Library of Congress; Leah Weinryb Grohsgal, National Endowment for the Humanities. Digitized Newspapers & Research. DPLAfest, April 16, 2016

2 NDNP / Chronicling America p.2 NATIONAL ENDOWMENT FOR THE HUMANITIES LIBRARY OF CONGRESS From 1836-1922… history’s markers

3 “History’s Rough Draft”
Something for everyone: Crime, Fashion, Travel, Economics, Events, Battles, Tragedy, Politics, Social Activism, Diplomacy, Society, Technology …

4 Working with U.S. Newspapers
- Many types of users, high demand for access
- Newspaper format challenges
  - Physical characteristics: large, brittle, acid paper, poor ink, light damage
  - Content characteristics: many subjects on a page, small text, hard to identify parts
- No single U.S. collection: 153,000 titles published since 1690 (collected across the country)
- Newspapers = fundamentals of U.S. history

5 National Digital Newspaper Program (2004- )
- Enhance access to American newspapers
- Develop permanent digital resource including selected historic content from all US states and territories
- Shared resources and cost distribution (LC/NEH/Awardees)
- Shared practices/specifications = community
- Paced scalability
- Plan for technical change and sustainability requirements

6 National Digital Newspaper Program (2016)
- 10.7 million pages online
- Approx. 70+ TB online, 550+ TB archival storage
- 3.9 million visits in 2015 (chroniclingamerica.loc.gov)
- 40 states and territories participating
- Also received:
  - 1000 newspaper history essays
  - 2000 bibliographic titles (of 153,000 titles published)
  - 10,000 reels of microfilm (duplicate print negative)

7 Chronicling America: Historic American Newspapers
http://chroniclingamerica.loc.gov/

8 Finding Our History
- Page Search (full text)
  - Search by place, time, keyword
  - Page information: Title, Date, Edition, Section, Page (Image)
  - Visual search results (thumbnail view with hit-highlights)
  - Pan and Zoom
  - Full-screen view
- US Newspaper Directory Search
  - Search by place, time, keyword, format, subject, etc. (CONSER/WorldCat data)
  - Keyword search, e.g., “http” (external Web site links) or “times”

9 PARTNERS: 40 institutions | 10.7 million pages now online | 1836-1922

10 Robin’s section
- See notes for text...
- [Put pretty poster here]

11 Halibut prices, Seattle 1880; 1918 Influenza Epidemic; Civil War Editorials; Mark Twain; Great Blizzard of 1888

12 Genealogy and historical romance research... made visible by a Twitter bot

13 ChronAm: What’s Available
- Digitized page images
- OCR:
    Mars has atmosphere, seasons, land, y!?H water, storms, clouds and mountains. "H Mars has i-wr. "'o - H only 3,700 miles awa.y and revolves around ?!i it ni seven and a half 'houvs ? phoot- fciji': ing star.
- Metadata:
    "place_of_publication": "Salt Lake City, Utah",
    "lccn": "sn83045396",
    "start_year": "1890",
    "place": [ "Utah--Salt Lake--Salt Lake City" ],
    "name": "The Salt Lake tribune.",
    "publisher": "Tribune Pub. Co.",
    "url": "http://chroniclingamerica.loc.gov/lccn/sn83045396.json",
    "end_year": "current",
    "issues": [ {
      "url": "http://chroniclingamerica.loc.gov/lccn/sn83045396/1904-01-01/ed-1.json",
      "date_issued": "1904-01-01"
- Newspaper Directory Records
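
The metadata fragment on this slide is machine-readable JSON, so a few lines of code can pull out the fields a researcher usually wants. The following is a minimal Python sketch, not the program's own tooling: the sample string reuses only the fields shown on the slide, the truncated "issues" list is closed after its first entry so that the text parses (an assumption), and the helper name `summarize_title` is hypothetical.

```python
import json

# Sample title record: only fields that appear on the slide. The slide cuts
# off inside "issues"; the list is closed here so the sample parses
# (an assumption made for illustration).
SAMPLE_TITLE_JSON = """
{
  "place_of_publication": "Salt Lake City, Utah",
  "lccn": "sn83045396",
  "start_year": "1890",
  "place": ["Utah--Salt Lake--Salt Lake City"],
  "name": "The Salt Lake tribune.",
  "publisher": "Tribune Pub. Co.",
  "url": "http://chroniclingamerica.loc.gov/lccn/sn83045396.json",
  "end_year": "current",
  "issues": [
    {
      "url": "http://chroniclingamerica.loc.gov/lccn/sn83045396/1904-01-01/ed-1.json",
      "date_issued": "1904-01-01"
    }
  ]
}
"""


def summarize_title(raw_json):
    """Return a small summary dict for one title record (hypothetical helper)."""
    record = json.loads(raw_json)
    return {
        "name": record["name"],
        "lccn": record["lccn"],
        "years": f'{record["start_year"]}-{record["end_year"]}',
        # Collect the issue dates listed in the record, if any.
        "issue_dates": [issue["date_issued"] for issue in record.get("issues", [])],
    }


if __name__ == "__main__":
    print(summarize_title(SAMPLE_TITLE_JSON))
```

A real title record lists many issues, so `issue_dates` would be a long list rather than the single date in this sample.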

14 Usage: Directory Records (Stanford)

15 Usage: OCR (Northeastern, Georgia Tech)

16 Usage: Digitized Page Images (University of Nebraska - Lincoln)

17 ChronAm: How do we make it available?
- Public website
- Open API: no login required
- Industry standard endpoints, like OpenSearch
- Machine readable views (like JSON)
  - Easier to play with the stuff
- Stable URLs
  - Added bonus: URLs make sense (title/date/page)
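
Because the URLs follow a title/date/page pattern, a script can build them without first searching the site. The sketch below is illustrative only: the title and issue patterns follow the two `.json` URLs shown in the metadata example on slide 13, while the `seq-N` page segment and all function names are assumptions not shown in these slides. It only constructs strings and makes no network requests.

```python
from datetime import date

BASE_URL = "http://chroniclingamerica.loc.gov"


def title_json_url(lccn):
    """JSON view of a newspaper title, identified by its LCCN."""
    return f"{BASE_URL}/lccn/{lccn}.json"


def issue_json_url(lccn, issued, edition=1):
    """JSON view of one issue; `issued` is an ISO date string (YYYY-MM-DD)."""
    # fromisoformat raises ValueError on a malformed or impossible date.
    day = date.fromisoformat(issued)
    return f"{BASE_URL}/lccn/{lccn}/{day.isoformat()}/ed-{edition}.json"


def page_json_url(lccn, issued, page, edition=1):
    """JSON view of one page. The seq-N segment is an assumed pattern."""
    day = date.fromisoformat(issued)
    return f"{BASE_URL}/lccn/{lccn}/{day.isoformat()}/ed-{edition}/seq-{page}.json"


if __name__ == "__main__":
    print(title_json_url("sn83045396"))
    print(issue_json_url("sn83045396", "1904-01-01"))
    print(page_json_url("sn83045396", "1904-01-01", page=1))
```

Validating the date before building the URL catches typos locally rather than through a failed request.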

18 ChronAm: How do we make it available?
- As pre-fab datasets (OCR bags)

19 Lessons Learned about the API
- Easiest methods are best
  - CSV file
- People will try to use the API first and contact you as a last resort
- Users may underestimate size of files and downloading time
  - 225,000 pages x 5.2 MB = 1.2 TB = BAD IDEA
  - To avoid: Ask-a-Librarian!
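
The size arithmetic on this slide is easy to run before starting a bulk download. A minimal Python sketch follows; the page count and per-page size come from the slide, while the 100 Mbit/s link speed, the decimal units (1 TB = 1,000,000 MB), and the function name are assumptions chosen for illustration.

```python
def estimate_download(pages, mb_per_page, mbit_per_s):
    """Return (total size in TB, transfer time in hours) for a bulk download.

    Uses decimal units: 1 TB = 1,000,000 MB and 1 MB = 8 Mbit.
    Ignores protocol overhead and server-side rate limits.
    """
    total_mb = pages * mb_per_page
    total_tb = total_mb / 1_000_000
    seconds = total_mb * 8 / mbit_per_s
    return total_tb, seconds / 3600


if __name__ == "__main__":
    # Figures from the slide; the link speed is an assumed value.
    size_tb, hours = estimate_download(pages=225_000, mb_per_page=5.2, mbit_per_s=100)
    print(f"{size_tb:.2f} TB, about {hours:.0f} hours at 100 Mbit/s")
```

With the slide's figures this comes to roughly 1.17 TB (the slide rounds to 1.2 TB) and about 26 hours of uninterrupted transfer at the assumed speed.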

20 Lessons Learned about the API
- Expect the unexpected.

21 ChronAm: What Users Tell Us They Want
- More stuff
  - Practical concerns
  - Structure of program
  - Copyright issues
- Better OCR
  - OCR problems are big with newspapers (multicolumn layout, microfilming artifacts like uneven lighting, bad condition newspapers at time of filming, less contrast than a book, etc.). See UNL visual analysis project.

22 For Libraries: Challenges and Opportunities
- Mixing full-text search and metadata search is hard
  - Lots of bad OCR versus relatively little clean metadata
- Newspapers are serial objects
  - Secondary concerns for monographs (time, place) are critical to newspapers
- Newspapers are big
  - Compared to a book page or photograph, newspaper pages are huge

23 For Researchers: Challenges and Opportunities
- Going bigger
- Interdisciplinary
- Ability to scale up a project
- Getting stuff across project borders
- Gaps in dataset

24 Thank you!
- NDNP Public Web: http://www.loc.gov/ndnp/
- NDNP Web Service, Chronicling America: Historic American Newspapers: http://chroniclingamerica.loc.gov
- Contact us at ndnptech@loc.gov

