Mining Gazetteer Data from Digital Library Collections. David Smith, Perseus Project, Tufts University.


1 Mining Gazetteer Data from Digital Library Collections
David Smith, Perseus Project, Tufts University

2 Corpus Preview (18 July 2002, Perseus Project, JCDL 2002)

3 Preview: 1400-1600

4 What DLs can do for gazetteers
Directly manage gazetteers
Raw materials for gazetteers
–Reference works
–Monolingual and parallel corpora
Testbeds for improving these technologies
–E.g. alignment helps name tagging, and name tagging helps alignment

5 Lexicographical parallels
Original “slipping” process
–First, get a madman...
Creation of Brown and other corpora
–Kucera and Francis
Cobuild dictionary and friends
But names “get no respect” in lexicography (McDonald, 1996)

6 Cultural dependencies

7 Toponym Results

8 Projection principles
Exploits asymmetry in human language technologies (Yarowsky, HLT 2001)
English, French, Chinese, Czech (!) have
–POS taggers, morphological analyzers
–Named entity identifiers
–Parsers and bracketers
Parallel corpus alignment allows projection of these resources
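The projection idea on this slide can be illustrated with a minimal sketch, assuming a word-level alignment is already available as (source index, target index) pairs. The function name `project_tags`, the tag labels, and the transliterated Greek tokens are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
from typing import List, Tuple


def project_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
    default: str = "O",
) -> List[str]:
    """Copy each tagged source token's label onto the target tokens aligned to it."""
    target_tags = [default] * target_length
    for src_idx, tgt_idx in alignment:
        tag = source_tags[src_idx]
        # An untagged source word must not overwrite a tag already projected.
        if tag != default:
            target_tags[tgt_idx] = tag
    return target_tags


# Hypothetical example: tagged English source, untagged (transliterated) Greek target.
source_tokens = ["the", "Athenians", "sailed", "to", "Sicily"]
source_tags = ["O", "ETHNIC", "O", "O", "PLACE"]
target_tokens = ["hoi", "Athenaioi", "epleusan", "es", "Sikelian"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

projected = project_tags(source_tags, alignment, len(target_tokens))
for token, tag in zip(target_tokens, projected):
    print(token, tag)
```

Real alignments are rarely one-to-one, so a practical version would also need a policy for unaligned and many-to-one target tokens.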

9 Projection principles

10 Projection on the cheap
Align texts at coarse structural level
Geocode source text (English)
Optionally winnow target text (e.g. non-capitalized words where applicable)
Calculate mutual information (Church & Hanks, 1990)
Transliteration may be too ad hoc
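The steps on this slide can be sketched roughly as follows, assuming the texts are already aligned into coarse chunks and the English side of each chunk has been reduced to its set of geocoded toponyms. The function name `toponym_pmi`, the capitalization filter, and the toy data are assumptions made for illustration; the score is pointwise mutual information estimated from chunk co-occurrence counts, which may differ in detail from what the talk computed.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Chunk = Tuple[Set[str], List[str]]  # (source toponyms, target tokens)


def toponym_pmi(chunks: List[Chunk]) -> Dict[Tuple[str, str], float]:
    """Score (source toponym, target word) pairs by PMI over aligned chunks."""
    n = len(chunks)
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    pair_count: Counter = Counter()
    for toponyms, target_tokens in chunks:
        # Winnow the target side: keep only capitalized word types.
        candidates = {t for t in target_tokens if t[:1].isupper()}
        src_count.update(toponyms)
        tgt_count.update(candidates)
        pair_count.update((s, t) for s in toponyms for t in candidates)
    # PMI = log2( P(s, t) / (P(s) * P(t)) ), with probabilities as chunk frequencies.
    return {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }


# Hypothetical toy corpus with transliterated Greek on the target side.
chunks = [
    ({"Athens"}, ["Athenai", "polis"]),
    ({"Athens", "Sparta"}, ["Athenai", "Sparte"]),
    ({"Sparta"}, ["Sparte", "kai"]),
    ({"Corinth"}, ["Korinthos"]),
]
scores = toponym_pmi(chunks)
best_for_athens = max(
    (pair for pair in scores if pair[0] == "Athens"), key=scores.get
)
print(best_for_athens, scores[best_for_athens])
```

With counts this small the scores are unreliable; PMI is known to overrate rare pairs, so a minimum-frequency threshold would be a sensible addition on real data.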

11 Preliminary results
Greek/English testbed
98% precision
70.8% recall (Why?)
Ethnic designations present interesting problems
–“Stephanus of Byzantium”
Morphology outside of English

12 Proposals
Preservation of gazetteer source materials
DLs as home for gazetteer “slips”
Parallel texts as key resource
–(also cf. Berkeley TIDES work)
Persistent documents as training sets for automatic methods
http://www.perseus.tufts.edu

