Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design Robert B. Allen Drexel University
Theme of Access
Rich cultural record
– Massive data sets. OCR text available. Text processing of the OCR.
Too much content for manual processing
– Segmentation
– Metadata assignment
– Constraints
Beyond Traditional Approaches:
– Events
– Models
– Interfaces
Constraints for Processing and “Understanding”
Page numbers
Surrounding articles
Sections/Features
Cyclic patterns (e.g., “drought”)
Named entities (people, organizations, places, etc.)
Event threads
Community models
Comparisons across resources:
– multiple newspapers
– with digitized books
– with historical records
– with archives and special collections
Segmentation
with Ilya Waldstein and Weizhong Zhu
Finding meaningful regions in the text.
Several methods. Look for headings and then merge other sections of text.
Different sections have different problems.
Tradeoff: more hand-entered knowledge helps but takes effort.

Outcome                                       Count   Percent
Correct                                          64        67
Minor errors (e.g., merging a few words)          6         6
Combined two or more articles                    11        12
Too much segmentation                            14        15
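The "look for headings, then merge other sections of text" idea can be sketched roughly as below. This is only an illustration, not the method used in the project: the heading test (a short, mostly upper-case line), the helper names `looks_like_heading` and `segment_articles`, and the sample lines are all assumptions made for the example.

```python
def looks_like_heading(line, min_upper_ratio=0.8, max_words=12):
    """Assumed heuristic: a heading is a short line that is mostly upper case."""
    letters = [c for c in line if c.isalpha()]
    if not letters or len(line.split()) > max_words:
        return False
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) >= min_upper_ratio


def segment_articles(lines):
    """Start a new article at each heading and merge the following lines into it."""
    articles = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if looks_like_heading(line):
            if current is not None and not current["body"]:
                # Consecutive heading lines are treated as one multi-line headline.
                current["heading"] += " " + line
            else:
                current = {"heading": line, "body": []}
                articles.append(current)
        elif current is None:
            # Text before any heading becomes an article without a heading.
            current = {"heading": "", "body": [line]}
            articles.append(current)
        else:
            current["body"].append(line)
    return articles


# Invented sample lines, loosely modeled on a newspaper page.
page = [
    "STATEHOOD MEASURE WILL",
    "PASS THE HOUSE",
    "Republicans determine to rush the bill through.",
    "The rule will provide that no amendment shall be considered.",
    "ADVANCE SPRING STYLES",
    "New coats and hats are now on display.",
]
articles = segment_articles(page)
for article in articles:
    print(article["heading"], "|", len(article["body"]), "body lines")
```

A rule this simple would show both error types in the table: missed headings combine articles, and upper-case lines inside advertisements cause too much segmentation. Adding hand-entered knowledge about each section is the tradeoff the slide mentions.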
Sections and Features Articles
with Catherine Hall
[Text from example images: J WILLIAM LEE E M EARLE SON THE STORE THAT SAVES YOU MONEY NATIONAL BISCUIT COMPANY ADVANCE SPRING STYLES]
Text Mining for Words: Text for Holidays
Oct-Dec 1916, Philadelphia Evening Ledger
“Thanksgiving”, “Christmas”
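A minimal sketch of how a term such as "Thanksgiving" or "Christmas" might be tracked across issues, assuming the OCR text is available as one string per issue date. The function name `daily_term_rate`, the per-1,000-token normalization, and the two tiny sample issues are illustrative assumptions, not the procedure behind the slide.

```python
from collections import Counter
import re


def daily_term_rate(issues, term):
    """Return {date: occurrences of `term` per 1,000 tokens} for each issue."""
    rates = {}
    for date, text in issues.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        rates[date] = 1000 * counts[term] / len(tokens) if tokens else 0.0
    return rates


# Invented stand-ins for the OCR text of two issues.
issues = {
    "1916-11-29": "thanksgiving dinner plans and thanksgiving football",
    "1916-12-22": "christmas shopping crowds fill the stores before christmas",
}
thanksgiving = daily_term_rate(issues, "thanksgiving")
christmas = daily_term_rate(issues, "christmas")
for date in sorted(issues):
    print(date, round(thanksgiving[date], 1), round(christmas[date], 1))
```

Normalizing by issue length keeps long and short issues comparable. Plotting the rates by date would give the kind of seasonal curve the slide suggests.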
Towards “models” for event streams
Oct-Dec 1916, Philadelphia Evening Ledger
“Campaign”, “Election”
Beyond Keyword-based Search Engines: Finding Important Events by Comparing Multiple Sources of Evidence
Combining information from two newspapers
3 months from 1906, Washington Times, Washington Herald
Find distinctive words, then overlaps of those distinctive words in the two newspapers

Oct 29 1906: awful breaking bridge camden coach dempsey drawbridge heroism motorman picked submerged surface survivors thoroughfare trestle windows
Nov 18 1906: colon dillon hopes lacking princeton princetons teams tigers yale
Dec 31 1906: ambulances awful belt coaches cotta crowded empty horribly identified mangled relief rescuers splintered takoma terra
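One possible reading of "find distinctive words, then overlaps" is sketched below. The scoring rule is an assumption, since the slide does not say how distinctiveness was measured: here a word counts as distinctive for a date when its share of that day's tokens is at least `ratio` times its share in the paper as a whole. The function names and the toy texts are likewise invented for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def distinctive_words(days, min_count=2, ratio=1.5):
    """days: {date: text}. Return {date: words over-represented on that date}."""
    per_day = {date: Counter(tokenize(text)) for date, text in days.items()}
    total = Counter()
    for counts in per_day.values():
        total.update(counts)
    n_total = sum(total.values())
    result = {}
    for date, counts in per_day.items():
        n_day = sum(counts.values())
        # Cross-multiplied form of (k / n_day) >= ratio * (total[w] / n_total).
        result[date] = {
            w for w, k in counts.items()
            if k >= min_count and k * n_total >= ratio * total[w] * n_day
        }
    return result


def shared_distinctive(paper_a, paper_b, **kwargs):
    """Keep only dates on which both papers share at least one distinctive word."""
    a = distinctive_words(paper_a, **kwargs)
    b = distinctive_words(paper_b, **kwargs)
    shared = {}
    for date in a.keys() & b.keys():
        overlap = a[date] & b[date]
        if overlap:
            shared[date] = overlap
    return shared


# Toy texts standing in for two newspapers; not real OCR.
times = {
    "1906-10-29": "drawbridge drawbridge survivors survivors city council",
    "1906-10-30": "city council city council meeting meeting",
}
herald = {
    "1906-10-29": "survivors survivors drawbridge drawbridge weather report",
    "1906-10-30": "weather report weather report city market",
}
events = shared_distinctive(times, herald)
print(events)
```

Requiring agreement between two independent papers filters out words that are distinctive only because of one paper's OCR errors or advertising.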
Focus-Context Timeline for History (Allen, 1999)
Narrative Timeline Causes of American Civil War
Interviews with Historians on Interface Needs
Two Themes: Search and Information Management
– The Chicago Tribune database is good for searching names, but broader topics are hard to research – e.g., race relations brings back too many results.
– A log of all searches – “this is a huge issue for me.” Editing a book manuscript recently, she found it “hugely taxing” to find items she hadn’t cited.
– Searches lead to other searches, so she would like ways to see how searches are nested within each other and to get back to earlier search results.
– A visual map telling you where you are in your search would be especially helpful.
– A system that lets her easily use multiple windows.
– [The historian] used newspapers to fill in gaps in research and corroborate information from other sources.
– Exploratory searching included looking at larger issues and events such as elections and campaigns.
– She used newspapers to find public opinion about changes in liquor license laws – to get a sense of “the texture of the city… how the city was thinking.”
Image Genres
Select images based on IPTC image genres
Cluster the images based on features
Learn to classify those clusters
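The cluster-then-classify steps could look something like the sketch below. It makes several assumptions not stated on the slide: each image is already reduced to a short numeric feature vector, k-means stands in for the clustering step, and a nearest-centroid rule stands in for the learned classifier. The two-dimensional feature values are invented.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # leave a centroid in place if its cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Hypothetical features per image, e.g. (edge density, fraction of dark pixels).
features = [
    (0.10, 0.20), (0.90, 0.80), (0.15, 0.25),
    (0.85, 0.90), (0.20, 0.10), (0.80, 0.85),
]
centroids = kmeans(features, k=2)

# Classify unseen images by assigning them to the nearest learned cluster.
label_a = nearest((0.12, 0.18), centroids)
label_b = nearest((0.95, 0.90), centroids)
print(label_a, label_b)
```

In practice the clusters would still need to be matched to genre names by inspection, and a stronger classifier could replace the nearest-centroid rule.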
OCR sample:
STATEHOOD MEASURE WILL PASS THE HOUSE Republicans Determine to Rush Hamilton Bill Through to Be Ready for Senate in December margin but NSW Mexico which hasnearly I nearly double the population of Arizona is largely Republican at present The Republicans in their rule will provide that no amendment shall be con sidered I THE T MEs71 I world Fair Contests it OFFEH NO lTf acid the three employes of the District or National + t tional Government collecting respectively the < Uteat number ofLouis Sti 4 Louis Exposition coupons to the Worlds Fair for 4 one week and payaIixpenses i pxpenses Note District or National Government ewtploUli es SUKonly Uli only the coupon

The Washington Times for 1904 was digitized from USNP microfilm to METS-ALTO format. The ALTO files have OCR along with fonts, point size, and coordinates. However, the OCR ranges from good to bad to ugly…
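As a rough illustration of what the ALTO files expose, the sketch below reads words with their coordinates, font, and point size using Python's standard XML parser. The element and attribute names (`TextStyle`, `TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`, `WC`, `STYLEREFS`, `FONTFAMILY`, `FONTSIZE`) come from the ALTO schema. The sample document is invented and pared down, not schema-valid, and not taken from the Washington Times files. Whether those files carry word-confidence (`WC`) values is an assumption, as is the 0.5 cutoff.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_alto_words(xml_text):
    """Return one dict per ALTO String: text, position, font, size, confidence."""
    root = ET.fromstring(xml_text)
    styles = {
        el.get("ID"): (el.get("FONTFAMILY"), el.get("FONTSIZE"))
        for el in root.iter() if local(el.tag) == "TextStyle"
    }
    words = []
    for line in root.iter():
        if local(line.tag) != "TextLine":
            continue
        line_refs = line.get("STYLEREFS")
        for s in line:
            if local(s.tag) != "String":
                continue
            # A String may carry its own style reference or inherit the line's.
            refs = (s.get("STYLEREFS") or line_refs or "").split()
            font, size = next((styles[r] for r in refs if r in styles), (None, None))
            wc = s.get("WC")
            words.append({
                "text": s.get("CONTENT"),
                "hpos": float(s.get("HPOS", "0")),
                "vpos": float(s.get("VPOS", "0")),
                "font": font,
                "size": size,
                "wc": float(wc) if wc is not None else None,
            })
    return words


# Invented, minimal ALTO-like sample for illustration only.
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Styles><TextStyle ID="TS1" FONTFAMILY="Times" FONTSIZE="10"/></Styles>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine STYLEREFS="TS1">
      <String CONTENT="STATEHOOD" HPOS="100" VPOS="200" WC="0.95"/>
      <SP/>
      <String CONTENT="hasnearly" HPOS="180" VPOS="200" WC="0.41"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
""".strip()

words = read_alto_words(SAMPLE)
low_confidence = [w["text"] for w in words if w["wc"] is not None and w["wc"] < 0.5]
print(low_confidence)
```

Font size and position are the kind of layout cues that can support heading detection, and confidence values, where present, give one way to separate the good OCR from the bad and the ugly.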