Hitting a Moving Target: Historical Language Change in Information Retrieval
  Miles Efron
  Graduate School of Library and Information Science
  University of Illinois, Urbana-Champaign
The Setting: Digitized Book Repositories
  Google Books: 15 million books
  HathiTrust: 8.4 million books
  Project Gutenberg: 20,000 books
  …
  Growing quickly
  General research problem: How to support information retrieval over such corpora?
  Specific problem: How to support information retrieval in corpora that see substantive linguistic change?
The Setting: Query by Passage
  When April with its sweet showers
  Even a fool, when he holds his peace is counted wise.
  Many hands make light work.
The Setting: Historically Diverse Corpora
  April
  Its April
  April comes & April goes
  April 8
  When April with its sweet showers…
  When April with his showers sweet with fruit
  Like Aprill showre so stream the trickling teares
  soote stormis of Aprille vnto þe rote
  Whan that Aprille with his shoures sote
Two Strategies for Resolving Language Mismatch
  Spelling correction
  Machine translation
  Problems
    Spelling: creature wight
    Translation: out of vocabulary knowledge
  When April with its sweet showers…
  Whan that Aprille with his shoures sote
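The spelling problem listed on this slide can be illustrated with a small sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in measure of surface similarity; this is an assumption, since the slides do not say which spelling model is used, and the helper name `spelling_similarity` is hypothetical. Variant spellings such as "aprille" stay close to their modern forms, while a lexical replacement such as "wight" for "creature" shares almost no characters with it.

```python
from difflib import SequenceMatcher


def spelling_similarity(modern: str, old: str) -> float:
    """Return a 0..1 surface similarity between two word forms.

    Stand-in for a real spelling-correction model (assumption).
    """
    return SequenceMatcher(None, modern.lower(), old.lower()).ratio()


# Word pairs taken from the slides: three spelling variants, one lexical change.
pairs = [
    ("april", "aprille"),
    ("showers", "shoures"),
    ("sweet", "sote"),
    ("creature", "wight"),
]

for modern, old in pairs:
    print(f"{modern:>9} ~ {old:<8} {spelling_similarity(modern, old):.2f}")
```

A spelling-based method alone would rank "wight" far below any near-spelling of "creature", which is why a second source of evidence is needed.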
Combining Approaches via the Noisy Channel Model
  For a modern word m, we have the probability that the correct old word is o, given spelling correction evidence s_o and translation (i.e. dictionary) evidence d_o.
  Build a query model based on these conditional probabilities.
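One way to picture the combination is the sketch below. The slide does not give the formula, so this assumes the two kinds of evidence are conditionally independent given o and that the prior over old words is uniform, which gives P(o | s_o, d_o) ∝ P(s_o | o) · P(d_o | o). The similarity score, the toy dictionary, its weights, the smoothing floor, and all function names are hypothetical stand-ins, not the author's implementation.

```python
from difflib import SequenceMatcher

# Hypothetical toy dictionary: modern word -> {old form: made-up weight}.
TOY_DICTIONARY = {
    "sweet": {"sote": 0.6, "swete": 0.4},
    "creature": {"wight": 1.0},
}

# Assumed smoothing value so that neither evidence source can veto a candidate.
FLOOR = 0.05


def spelling_evidence(modern: str, old: str) -> float:
    """Surface similarity between the two forms, floored at FLOOR."""
    return max(SequenceMatcher(None, modern, old).ratio(), FLOOR)


def dictionary_evidence(modern: str, old: str) -> float:
    """Dictionary weight for the pair, or FLOOR if the pair is not listed."""
    return TOY_DICTIONARY.get(modern, {}).get(old, FLOOR)


def old_word_model(modern: str, candidates: list[str]) -> dict[str, float]:
    """Multiply the two evidence scores and normalize over the candidates."""
    scores = {
        old: spelling_evidence(modern, old) * dictionary_evidence(modern, old)
        for old in candidates
    }
    total = sum(scores.values())
    return {old: score / total for old, score in scores.items()}


print(old_word_model("sweet", ["swete", "sote", "sweet"]))
print(old_word_model("creature", ["wight", "creatur"]))
```

Under these assumptions the dictionary evidence lets "wight" outrank a near-spelling of "creature", while spelling evidence still supports variants such as "swete" that a small dictionary might weight weakly or miss.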
Example
  Query: sweet showers
  Query Model: swete sweet sote showrs shoures showres
  Top Results:
    1. ...OF CAUNTERBURY. Whan that Aprille with his SHOURES sote The droghte of Marche hath perced like this: Whan that Aprille with his SHOURES sote: Whan that Apreele with 'is shoores by grace; Gifts from heaven fall in SHOW'RS, Cheering dark and lonely hours, By our pathway bloom SWEET flow'rs, When we're saved by grace. E'en...
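The effect of the expanded query model can be sketched as follows. The slides do not state the retrieval model or the term weights, so this assumes a simple weighted term-count score and uniform weights over the six expansion terms; `tokenize` and `score` are hypothetical helpers. Apostrophes are dropped during tokenization so that "SHOW'RS" can match "showrs".

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase, drop apostrophes, and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower().replace("'", ""))


def score(passage: str, query_model: dict[str, float]) -> float:
    """Sum of query-term weights times their counts in the passage."""
    counts = Counter(tokenize(passage))
    return sum(weight * counts[term] for term, weight in query_model.items())


# Original modern query, with assumed equal weights.
modern_query = {"sweet": 0.5, "showers": 0.5}

# Expansion terms from the slide, with assumed uniform weights.
terms = ["swete", "sweet", "sote", "showrs", "shoures", "showres"]
expanded_query = {term: 1 / len(terms) for term in terms}

chaucer = "Whan that Aprille with his SHOURES sote"
hymn = "Gifts from heaven fall in SHOW'RS, By our pathway bloom SWEET flow'rs"

for name, passage in [("chaucer", chaucer), ("hymn", hymn)]:
    print(name, score(passage, modern_query), score(passage, expanded_query))
```

In this sketch the modern query alone cannot reach the Middle English line, while the expanded model matches both "shoures" and "sote".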
Interested? Contact Me.
  Miles Efron
  Graduate School of Library and Information Science
  University of Illinois, Urbana-Champaign