1 The cadastral register of the town Wismar (1677 – 1838): Structure Mining in historical documents
Manja Nelius, Meike Klettke
Department of Computer Science, Database Research Group, University of Rostock

2 Overview
Dagstuhl, 4.-8.12.2006, Manja Nelius, Meike Klettke, University of Rostock
1. Available documents
  – Cadastral register
  – Additional information: index of persons, list of abbreviations
2. Algorithms
  – Overview of the approach
  – Analysis of (implicit) structure, layout and the regular parts
  – Association of markup in the historical texts
  – Rule-based semantic analysis
  – Structuring of information
3. Results
4. Conclusion & Literature

3 Motivation
– The Department of History at the University of Rostock (Prof. Papay) develops historical geographical information systems and their underlying databases
– Cadastral registers contain a lot of information for these geographical information systems, but only as text
– Characteristics of historical texts:
  – No uniform orthography
  – Varying spelling (temporal and regional); the spelling of proper names also varies
  – Influences from different languages
  – Use of Latin words (for instance saints' feast days (Heiligenfeiertage) for dates)
  – Varying use of capitalisation
– Inserting the data into the databases has so far been a manual process
– A diploma thesis tries to automate this process

4 Available documents
– Cadastral register of the town Wismar (1677 – 1838)
  – Contains information about the history of the town
  – Reflects changes in the building development
  – Shows the ownership structure
– Similar cadastral registers are available for the towns Rostock and Stralsund
  – The registers use different categories and different structures
  – The mayor (burgomaster) of the time fixed the structure, which is why it differs in each town
– In 2002 all cadastral registers were digitised (manually); the aim was to preserve the sources
  – The original texts were therefore left unchanged, but an index of persons and a list of common abbreviations were added

5 Example of a cadastral item (labels on the slide: Street, Categories, Neighbour objects)
Alt Wismarsche Strasse vom thor her. Norderseite
1 Grundbuch Nr. 16 ( 16 ), fol. 15v, neues Stadtbuch Nr. 196
2 siehe Grundbuch Nr. 15.
3 Haus
4 jus protimiseos an dem dahinten belegenen Garten. Medard. 1673.
5 ut No. praeced. siehe Grundbuch Nr. 15. Peter Koppe. adjud. et aedif. Tr. Regum 1657. Georg Gammelkern. empt. Medard. 1673. […]
6 400 Rthlr. Joh. Christoff Müller. Clem. 1677. delirt Mattaei 1688. 600 m. l. die Stadt Cämmerey t. Joh. mit 5 procent zu vertzinsen. […]
Contig. Oldeböterstr. Ost Grundbuch Nr. 310.

6 List of persons
– Information about a person
– Alternative spellings of a family name
– References to another spelling of a family name
– Family name with additional information about a person (first name) and the cadastral numbers (where the person is mentioned)
Example entries:
Gagzow, David 1207
Gahde siehe Gade
Gahrtz, Gehrtz
-, Agneta Cristiane Hedwig (Witwe, geborene Haase) 22, 29, 263

7 List of abbreviations
– Abbreviations of phrases
– Alternative abbreviations
– Explanations
Example entries:
infr.: infra (unten)
Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember)
intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen)
inter., interim. : interimistisch
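The entry format shown on this slide (one or more alternative abbreviations, a colon, then the explanation) can be turned into a lookup table with very little code. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that every entry sits on its own line and that the first colon separates the abbreviations from the explanation.

```python
def parse_abbreviation_entry(line):
    """Split one entry into {abbreviation: explanation} pairs.

    Assumption (not from the slides): the first colon separates the
    comma-separated alternative abbreviations from the explanation.
    """
    left, sep, right = line.partition(":")
    if not sep:
        raise ValueError(f"no colon found in entry: {line!r}")
    explanation = right.strip()
    variants = [v.strip() for v in left.split(",") if v.strip()]
    return {variant: explanation for variant in variants}


def build_lookup(lines):
    """Merge all entries into one dictionary, skipping empty lines."""
    lookup = {}
    for line in lines:
        if line.strip():
            lookup.update(parse_abbreviation_entry(line))
    return lookup


entries = [
    "infr.: infra (unten)",
    "Inn., Innoc., Innocent. Puer. : (dies) Innocentum Puerorum (28. Dezember)",
    "intab., intabul.: intabulatus (intabuliert, in das Grundbuch eingetragen)",
    "inter., interim. : interimistisch",
]
lookup = build_lookup(entries)
```

Only the left-hand side is split at commas, so commas inside an explanation (as in the entry for intab.) stay untouched.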

8 Overview
– Combined analysis of layout, structure and texts
– Stepwise enrichment of the original documents with markup (the markup captures the meaning of the text fragments)
– Generation of XML documents (bottom up)
– Storage in databases

9 [Diagram: overview of the approach]
– Input formats: HTML, TXT, XML, DOC, RTF
– Input documents: list of abbreviations, list of persons, cadastral register
– Processing steps named in the diagram: transformation of the input format, analysis of layout and structure, exploitation of regular expressions, normalisation, analysis of full texts, semantic analysis, structuring of information, storage in database tables
– Resources named in the diagram: grammars, rules for replacement, dictionaries, semantic rules, rules for structuring, mapping rules

10 Analysis of layout and structure
– Use of layout characteristics
  – Bold and italic fonts (HTML markup)
  – Use of X-Fetch wrappers
– Analysis of structural characteristics
  – Numbers at the beginning of rows determine the categories
  – Each cadastral item is divided into its categories
  – The implementation uses the parser generator ANTLR
– XML markup is added to the cadastral items
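The idea that a number at the beginning of a row determines the category can be illustrated with a few lines of Python. This is a minimal sketch and not the ANTLR-based implementation mentioned on the slide: it uses a regular expression instead of a generated parser, and the element and attribute names (item, category, street, no) are assumptions made for the illustration. It also assumes that continuation rows never start with a number.

```python
import re
import xml.etree.ElementTree as ET

# A row that opens a category: a one- or two-digit number, whitespace, text.
# The two-digit limit is an assumption; the example item uses 1 to 6.
CATEGORY_ROW = re.compile(r"^(\d{1,2})\s+(.*)$")


def item_to_xml(street, rows):
    """Wrap the rows of one cadastral item in (hypothetical) XML markup."""
    item = ET.Element("item", {"street": street})
    current = None
    for row in rows:
        row = row.strip()
        if not row:
            continue
        match = CATEGORY_ROW.match(row)
        if match:
            # The leading number becomes the category number.
            current = ET.SubElement(item, "category", {"no": match.group(1)})
            current.text = match.group(2)
        elif current is not None:
            # Rows without a leading number continue the previous category.
            current.text += " " + row
    return item


rows = [
    "1 Grundbuch Nr. 16 ( 16 ), fol. 15v, neues Stadtbuch Nr. 196",
    "3 Haus",
    "6 400 Rthlr. Joh. Christoff Müller. Clem. 1677.",
]
item = item_to_xml("Alt Wismarsche Strasse", rows)
categories = item.findall("category")
```

Note that category 6 starts with an amount ("400 Rthlr."); because only the first number of the row is consumed, the amount stays in the category text.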

11 Exploitation of regular expressions
– Some parts of the input documents are regular (list of abbreviations, list of persons, category 1 of the cadastral items)
– Example: Grundbuch Nr. 16 ( 16 ), fol. 15v, neues Stadtbuch Nr. 196
– A grammar describes these expressions
– Further XML markup is added to the cadastral items
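For the category 1 example on this slide, a regular expression with named groups is enough to pull out the individual fields. The Python sketch below stands in for the grammar mentioned above and is derived only from this single example; the group names are hypothetical, and the meaning of the number in parentheses is not stated in the slides, so it is named neutrally.

```python
import re

# Pattern derived from the one example on the slide; real entries may vary.
CATEGORY_1 = re.compile(
    r"Grundbuch Nr\.\s*(?P<grundbuch_no>\d+)\s*"
    r"\(\s*(?P<no_in_parentheses>\d+)\s*\),\s*"
    r"fol\.\s*(?P<folio>\d+[rv]?),\s*"
    r"neues Stadtbuch Nr\.\s*(?P<stadtbuch_no>\d+)"
)


def parse_category_1(text):
    """Return the fields of a category 1 entry, or None if it does not fit."""
    match = CATEGORY_1.fullmatch(text.strip())
    return match.groupdict() if match else None


fields = parse_category_1(
    "Grundbuch Nr. 16 ( 16 ), fol. 15v, neues Stadtbuch Nr. 196"
)
```

Entries that do not follow the pattern return None, so they can be passed on to the later full-text analysis instead of being marked up incorrectly.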

13 Analysis of the historical full texts /1
– Use of 23 different dictionaries
  – Some are generated from the list of persons (first names, family names)
  – Some are created by hand (streets, saints' feast days, professions, stop words)
– Same method as in the GETESS project, developed by DFKI Saarbrücken (group of Prof. Uszkoreit)
– The similarity between terms in the cadastral register and terms in the dictionaries is determined with phonetic encoding and phonetic similarity search

14 Analysis of the historical full texts /2
[Diagram: a term from the cadastral item (Term 1) and a term from the dictionaries (Term 2) are each phonetically encoded; the word distance between the encoded terms is computed with the Levenshtein distance]
– Phonetic encoding normalises all spellings that sound similar
– Example: Friedrich → VRYEDRYCH
– Substitution rules for this encoding were defined as part of the work (because none were available for historical texts)
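The combination of phonetic encoding and Levenshtein distance can be sketched in Python as follows. The substitution rules below are hypothetical: they are not the rules defined in the diploma thesis and were chosen only so that the slide's example (Friedrich → VRYEDRYCH) is reproduced. The function names and the distance threshold are assumptions as well.

```python
# Hypothetical substitution rules, applied in order to the upper-cased term.
SUBSTITUTIONS = [("PH", "V"), ("F", "V"), ("TH", "T"), ("CK", "K"), ("I", "Y")]


def phonetic_encode(term):
    """Map spellings that sound similar onto one normalised form."""
    code = term.upper()
    for old, new in SUBSTITUTIONS:
        code = code.replace(old, new)
    return code


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def best_match(term, dictionary, max_distance=2):
    """Return (entry, distance) for the closest dictionary entry, or None."""
    code = phonetic_encode(term)
    scored = [(levenshtein(code, phonetic_encode(e)), e) for e in dictionary]
    distance, entry = min(scored)
    return (entry, distance) if distance <= max_distance else None


first_names = ["Friedrich", "Georg", "Peter"]
match = best_match("Fridrich", first_names)
```

Comparing the encoded forms rather than the raw strings means that spelling variants which the rules already normalise cost nothing, and the Levenshtein distance only has to absorb the remaining differences.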

16 Semantic analysis with rules
– The association of a term with the dictionaries can be:
  – unique
  – ambiguous (two or more meanings) → semantic rules
  – missing (no association found) → semantic rules
– Different semantic rules (with priorities) are applied
– The rules use information about the context (predecessor, successor)
– Example (a rule in informal description):
  – token: number, has no associated meaning
  – predecessor token: date
  – → the meaning "year" is associated with the token
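The informal rule above translates almost directly into code. The following Python sketch shows one possible way to apply prioritised, context-aware rules; it is not the implementation from the thesis. The class names, the meaning labels and the second rule (a number after "Nr." is a cadastral number) are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Token:
    text: str
    meanings: set = field(default_factory=set)  # filled by dictionary lookup


@dataclass
class Rule:
    priority: int  # lower number = applied first
    apply: Callable[[Token, Optional[Token], Optional[Token]], Optional[str]]


def year_after_date(token, predecessor, successor):
    """Rule from the slide: an unlabelled number after a date is a year."""
    if (token.text.isdigit() and not token.meanings
            and predecessor is not None and "date" in predecessor.meanings):
        return "year"
    return None


def number_after_nr(token, predecessor, successor):
    """Hypothetical rule: an unlabelled number after 'Nr.' is a cadastral number."""
    if (token.text.isdigit() and not token.meanings
            and predecessor is not None and predecessor.text == "Nr."):
        return "cadastral_number"
    return None


RULES = [Rule(1, year_after_date), Rule(2, number_after_nr)]


def apply_rules(tokens, rules=RULES):
    """Assign a meaning to tokens that have no meaning or several meanings."""
    ordered = sorted(rules, key=lambda rule: rule.priority)
    for i, token in enumerate(tokens):
        if len(token.meanings) == 1:
            continue  # unique association, nothing to do
        predecessor = tokens[i - 1] if i > 0 else None
        successor = tokens[i + 1] if i + 1 < len(tokens) else None
        for rule in ordered:
            meaning = rule.apply(token, predecessor, successor)
            if meaning is not None:
                token.meanings = {meaning}
                break  # the first matching rule wins
    return tokens


dated = apply_rules([Token("Medard.", {"date"}), Token("1673")])
numbered = apply_rules([Token("Grundbuch"), Token("Nr."), Token("15")])
```

Sorting by priority and stopping at the first match keeps the behaviour predictable when several rules could fire for the same token.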

18 Structuring of information
– Up to now: meanings have been associated with terms of the cadastral register
– Now: bottom-up structuring of information
– Example: …
– The process is based on rules
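The example on this slide did not survive in the transcript, so the following Python sketch is purely illustrative and does not reproduce it. It shows one way bottom-up structuring could work: adjacent tokens whose meanings match a grouping rule are wrapped in a common parent element. The grouping rules, the meaning labels and the element names are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for structuring: sequence of child meanings -> parent.
GROUPING_RULES = {
    ("first_name", "family_name"): "person",
    ("feast_day", "year"): "date",
}


def structure(tokens):
    """Group (text, meaning) pairs bottom up into a small XML tree."""
    root = ET.Element("entry")
    i = 0
    while i < len(tokens):
        for pattern, parent_name in GROUPING_RULES.items():
            window = tokens[i:i + len(pattern)]
            if tuple(meaning for _, meaning in window) == pattern:
                parent = ET.SubElement(root, parent_name)
                for text, meaning in window:
                    ET.SubElement(parent, meaning).text = text
                i += len(pattern)
                break
        else:
            # No rule matched: keep the token as a direct child of the entry.
            text, meaning = tokens[i]
            ET.SubElement(root, meaning).text = text
            i += 1
    return root


tokens = [
    ("Georg", "first_name"),
    ("Gammelkern", "family_name"),
    ("empt.", "abbreviation"),
    ("Medard.", "feast_day"),
    ("1673", "year"),
]
entry = structure(tokens)
```

Because the rules live in a plain dictionary, further groupings can be added without touching the traversal code, which fits the extensible, rule-based character of the approach.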

20 Result of this process
[Slide content not recoverable from the transcript]

21 Instead of a benchmark
[Bar chart, axis ticks from 0 to 80: tokens with unique markup, tokens without markup and tokens with two or more meanings, each shown after full-text analysis and after application of semantic rules; the bar values are not recoverable from the transcript]
– Semantic rules assign a meaning to numbers (most numbers are four-digit numbers: years or cadastral numbers)
– They resolve ambiguous meanings (first name vs. date, family name vs. profession)

22 Conclusion
– The approach is based on extensible rules and dictionaries, which makes it a flexible tool
– Information is extracted in a stepwise process
– A check mechanism supports error detection
– Evaluating the results is difficult because there are no known correct results to compare the output of the process against
– All methods presented here were developed in the diploma thesis of Manja Nelius
– Future work: semantic analysis and information structuring in one step, by matching XML documents with incomplete markup against a schema

23 Literature
– Ernst Münch: Das Wismarer Grundbuch (1677/80 - 1838), Verlag Schmidt-Römhild, Rostock, 2002
– Hans-Jürgen Martin: Geschichtlicher Abriß der Rechtschreibung, http://www.schriftdeutsch.de/orth-his.htm, 2004
– Justin Zobel, Philip W. Dart: Phonetic string matching: lessons from information retrieval, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, 1996
– Justin Zobel, Philip W. Dart: Finding Approximate Matches in Large Lexicons, Software: Practice and Experience, Volume 25, Number 3, 1995
– Norbert Fuhr: Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung, http://www.is.informatik.uni-duisburg.de/projects/rsnsr/, 2005
– M. Abolhassani, N. Fuhr, N. Gövert: Information Extraction and Automatic Markup for XML documents, in: Intelligent Search on XML Data. Applications, Languages, Models, Implementations, and Benchmarks, ed. Henk M. Blanken, Torsten Grabs, Hans-Jörg Schek, Ralf Schenkel, Gerhard Weikum, Lecture Notes in Computer Science, Vol. 2818, 2003
– Arnaud Sahuguet, Fabien Azavant: Building Light-Weight Wrappers for Legacy Web Data-Sources using W4F, 25th Conference on Very Large Database Systems, Edinburgh, UK, 1999
– Terence Parr: ANTLR -- ANother Tool for Language Recognition, http://www.antlr.org, 2004
– X-Fetch Suite, Republica Corporation, www.x-fetch.com
– Kai-Uwe Sattler, Stefan Conrad, Gunter Saake: Datenintegration und Mediatoren, in: Web & Datenbanken, dpunkt.verlag, Heidelberg, 2003

