Links ICOS 2014 Glasgow Utrecht Leiden Large scale harvesting of variants of proper names Gerrit Bloothooft, UiL-OTS, Utrecht University Marijn Schraagen,


1 Links ICOS 2014 Glasgow Utrecht Leiden
Large scale harvesting of variants of proper names
Gerrit Bloothooft, UiL-OTS, Utrecht University
Marijn Schraagen, LIACS, Leiden University
The Netherlands

2 name variants
different versions of a name that can denote the same object
requires proof that the same object is involved (in at least one example)
– not always easy
– rarely explicitly provided

3 proper names in historical sources
Lots of variation
– spelling variation: Dirk - Dirck
– suffix variation: Willem - Willempje
– abbreviation: Willem - Wim
– translation: Willem - Guillaume, Willem - Wilhelmus
– typos (digitization): Willem - Aillem
– …

4 Links ICOS 2014 Glasgow Utrecht Leiden variation! Guljelmus Wllhelmus Wlhelmus WIllem (Willem) Wiellem Wlllem Gujlelnius Wllem WiIllem Wijllem Wihelmus Willemj Wikllem Wwillem Willlem Guilleam Willeam Willem Wil.lem Wilem Guileam Willelmini willem Wiilem Guillem Weillem Guilelmis Wil;helmus Wilhlem Welhelmus Wiillem Wiehelmus Wulhelmus Willem) Wilehelmus Woillem Wihhelmus Weijlem Willelmus Wi;;em Wilehlmus Wuhelm Guilelmus Wilhlelmus Willem(se) Wilalem Wullem Willem. W#ilhelmus Guillelmus Wliiem Wlihelmus Wilelmus Willemm Wileem Wìllem Willemem Wolhelmus Wechelmus Guilllelmus Wilemm W.ilhelmus Willem] Willemh \Willem Wïllem w8illem Wilhellmus Wilhelm. Wilmhelmus Wilhelmuns Wilhelmua Wilhelmos wilhelmnus Wilhelmnus Wilhelmues Guilleaumme Wilhelmum Guilhelmus Willeml Wilhelmanus Wilhelmjus Wilhelmes Guilliaumme Wilhelmas Willemn Wilhelmus Wilhelmns Willhelmus Guiliaume Willlen Guiilleaume Guilliaume Willenis Guiliermo Wilempjen Willempjen Willepjen Guilliermo Wittem Willen! Wilhlenn Wijlen Wielen Willen wilhem Willempke Guilleaume Wilhellemus Wilhekmus Guiileaume Willeaume Wilhelmuus Guylleaume Guileaumme Guileaume Wilhelemus Guilleauma Willewm Guillesmus Guïllermo Guilermo Guiilermo Guillermo Guillerlmus Guijlleaume Wilheminus Wilhelhmus Guillaum guillaum Gueillaum Wilhemus Guilhemus Wielhemus Wilhehmus Wilhelminus Wilhelmienus Wilherlmus Wilhermus Weilhim Wilhiem Wilheim Wilhein Willaum Guillaim wilhemus Wilhelnmus Woalter Willhem Guillhem Wilheem Wilhem Wölhelm Wilhelimus Wilhelus Willaim Willemerman Wiechem Wiloem Wilhelmius Wilhelmijs Guilhelmis Wilhelmjs Wilhelmis Willemhelmus Willoem Wilhelnus Weilhelmus Wwilhelmus Wylhelmus wWilhelmus (Wilhelmus) (Wilhelmus Wilhelmüs Wilhelmus\ Guiljame Wilhelmus? 
WilhelmusHubertus Wilheelmus Wilhelmmus Wielhelmus Wilhhelmus Wiilhelmus WEilhelmus wilhelmus Wilhelmus) Wilhelmuss WilhelmusStephanus WIlhelmus Willkem Wilkhelmus Wilhelmiem Wilhelmigs Willme Wilme WilhelmusHenricus WilhelmusTheodorus Wilhelmushenricus Wilhelmusn Wilhelmuszn Eilhelmus ilhelmus Ilhelmus Willemcus WilhelmusJohannes Wilhelmushubertus Wilhelmuw Wwilhwlmus Guilliaam Guiliam Guiliaam Guillieam Guilliam Wiliam Wilnelmus Willwm Guillieaume Wilhwlmus Wilhwelmus Willum Guillum William Wilhlemus Wilielmus Willielmus Güilielmus Guililmus Guileilmus Guïllielmus Guilielmus Guillijaam Willemus Wiiliam Guilemus Guillemus Willemmus Wilehmus Wilemus Willliam Wieliam Guillielmus Wilhlmus Guillmus Wiliaam Wilhmus Guiilmus Wilmus Guilmus Aillem JohannesWilhelmus Johanneswilhelmus CornelisWilhelmus Gulliëlmus Guliëlmus Gijlliaume Güliëlmus Guli?lmus Guijelmus Gulielmus Guiëlmus Giliaume Gilliaume Gilliaumme Guihelmus Guikelmus Gullielmus Guielmus Jannwillem Janwillem JanWillem JanWilhelmus MartinusWilhelmus Qwillem 1.4

5 challenge
name variation is difficult to model, therefore:
learn variation in person names from the use of names in real life (let the data speak for itself), automatically, from big data

6 required
big data
– with many references to individuals
true person resolution
– proof that the same individual is concerned
– even with data that contain name variants

7 big data
Dutch vital registration (who-was-who 2011)
early 20th century
– 4.1 million birth certificates (~30%)
– 3.1 million marriage certificates (~90%)
– 7.6 million death certificates (~65%)
55 million name references to persons

8 source names
1,052,000 different full first names (composite), e.g. Jan, Johanna Maria Cornelia
111,900 different female first names (singular), e.g. Maria
82,700 different male first names (singular), e.g. Jan
681,000 different surnames (prefixes included), e.g. Bakker, de Vries
different surnames (prefixes excluded), e.g. Vries

9 information per person
– first name of the person (child, bride or groom, deceased)
– first name of the father
– surname of the father
– first name of the mother
– surname of the mother (always the maiden name in the Netherlands)
– age of the person

10 person resolution
assumption: the available information identifies a person uniquely (if there is exact matching)
relaxed assumption: one of the four parental names (first name or surname of the father or the mother) is not needed for true person resolution

11 example
Johanna Endt
– marries in 1858 as the 29-year-old daughter of Gerrit Endt and Dorothea Kerbert
– dies in 1882 as the 54-year-old daughter of Gerrit Endt and Doortje Kerbert
~1829, Johanna, Gerrit, Endt, Kerbert, Dorothea
~1828, Johanna, Gerrit, Endt, Kerbert, Doortje
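The two reference lines in this example can be derived from the certificate data. A minimal sketch in Python, assuming the approximate birth year is simply the event year minus the stated age; the `PersonRef` type, its field names and the helper `approx_birth_year` are illustrative and not taken from the presentation:

```python
from typing import NamedTuple


class PersonRef(NamedTuple):
    """One name reference to a person (hypothetical layout, mirroring the example)."""
    birth_year: int  # approximate: event year minus stated age
    first_name: str
    father_first: str
    father_surname: str
    mother_surname: str
    mother_first: str


def approx_birth_year(event_year: int, age: int) -> int:
    # Assumption: no correction for whether the birthday has passed in the event year.
    return event_year - age


marriage = PersonRef(approx_birth_year(1858, 29), "Johanna", "Gerrit", "Endt", "Kerbert", "Dorothea")
death = PersonRef(approx_birth_year(1882, 54), "Johanna", "Gerrit", "Endt", "Kerbert", "Doortje")

# Fields on which the two references disagree.
differing = [f for f in PersonRef._fields if getattr(marriage, f) != getattr(death, f)]
print(differing)  # ['birth_year', 'mother_first']
```

The one-year gap in the computed birth year is why matching on an approximate year is needed; the remaining difference, Dorothea versus Doortje, is the candidate variant pair.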

12 test of assumption (of true person resolution)
– consider all matches between birth and death certificates with exact matching of all information
– leave out one name per match
– count the number of multiple matches
result: only 85 out of 1,107,162 matches are not unique
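One possible reading of this test, sketched in Python on made-up toy records. The field names, the helper functions and the choice to leave out only parental names are assumptions for illustration; the presentation does not spell out the implementation:

```python
from collections import Counter

FIELDS = ("first_name", "birth_year", "father_first", "father_surname",
          "mother_first", "mother_surname")
PARENT_NAMES = FIELDS[2:]


def key(record, left_out=None):
    """Tuple of all field values, optionally skipping one field."""
    return tuple(record[f] for f in FIELDS if f != left_out)


def count_non_unique(births, deaths):
    """Return (exact matches, matches that become ambiguous when one name is dropped)."""
    full = Counter(key(d) for d in deaths)
    reduced = {n: Counter(key(d, n) for d in deaths) for n in PARENT_NAMES}
    matches = non_unique = 0
    for b in births:
        if full[key(b)] == 0:
            continue  # no exact match for this birth certificate
        matches += 1
        # Ambiguous if, with one parental name ignored, several death records fit.
        if any(reduced[n][key(b, n)] > 1 for n in PARENT_NAMES):
            non_unique += 1
    return matches, non_unique


def make_record(first, year, ff, fs, mf, ms):
    return dict(zip(FIELDS, (first, year, ff, fs, mf, ms)))


# Toy data, invented for this sketch.
births = [make_record("Jan", 1850, "Piet", "Bakker", "Maria", "Vries"),
          make_record("Dirk", 1860, "Jan", "Bakker", "Johanna", "Vries")]
deaths = [make_record("Jan", 1850, "Piet", "Bakker", "Maria", "Vries"),
          make_record("Jan", 1850, "Piet", "Bakker", "Anna", "Vries"),  # differs only in mother's first name
          make_record("Dirk", 1860, "Jan", "Bakker", "Johanna", "Vries")]

print(count_non_unique(births, deaths))  # (2, 1)
```

On the real data the corresponding figures reported on the slide are 1,107,162 matches, of which 85 are not unique.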

13 harvesting name variant pairs (procedure)
identify all record pairs of individuals (over birth, marriage and death certificates) that exactly share
– first name of the individual
– approximate year of birth
– three out of four names of parents (first names and surnames)
collect pairs of the remaining name, if different
Christiena - Christina
Bloothooft - Bloothoofd
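The procedure can be sketched as a blocking step followed by a comparison of the one name left out. This is a minimal Python illustration, not the authors' implementation; the record layout, the function name `harvest_variant_pairs` and the one-year tolerance on birth year are assumptions:

```python
from collections import Counter
from itertools import combinations

PARENT_NAMES = ("father_first", "father_surname", "mother_first", "mother_surname")


def harvest_variant_pairs(records, year_tolerance=1):
    """Count candidate variant pairs as (field, name_a, name_b) -> number of tokens."""
    pairs = Counter()
    for left_out in PARENT_NAMES:
        kept = [n for n in PARENT_NAMES if n != left_out]
        # Block on the person's first name plus the three parental names that must agree.
        blocks = {}
        for r in records:
            block_key = (r["first_name"], *(r[n] for n in kept))
            blocks.setdefault(block_key, []).append(r)
        for group in blocks.values():
            for a, b in combinations(group, 2):
                if abs(a["birth_year"] - b["birth_year"]) > year_tolerance:
                    continue  # birth years too far apart (tolerance is an assumption)
                if a[left_out] != b[left_out]:
                    pairs[(left_out, *sorted((a[left_out], b[left_out])))] += 1
    return pairs


records = [
    {"first_name": "Johanna", "birth_year": 1829, "father_first": "Gerrit",
     "father_surname": "Endt", "mother_first": "Dorothea", "mother_surname": "Kerbert"},
    {"first_name": "Johanna", "birth_year": 1828, "father_first": "Gerrit",
     "father_surname": "Endt", "mother_first": "Doortje", "mother_surname": "Kerbert"},
    # Different first name of the individual: never paired with the records above.
    {"first_name": "Pieter", "birth_year": 1829, "father_first": "Gerrit",
     "father_surname": "Endt", "mother_first": "Dorothea", "mother_surname": "Kerbert"},
]

print(dict(harvest_variant_pairs(records)))
# {('mother_first', 'Doortje', 'Dorothea'): 1}
```

Two records that differ in exactly one parental name meet in only one block, so each record pair contributes at most one token.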

14 harvesting name variant pairs (results)
female first names    48,600 pairs   246,500 tokens
male first names      31,900 pairs   183,000 tokens
surnames             177,000 pairs   374,900 tokens
average:
– first names: 5 to 6 tokens per variant pair
– surnames: 2 tokens per variant pair
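The averages follow directly from the reported pair and token counts. A short Python check, using only the figures on this slide (the dictionary layout is just for illustration):

```python
# (variant pairs, tokens) as reported on the slide
results = {
    "female first names": (48_600, 246_500),
    "male first names": (31_900, 183_000),
    "surnames": (177_000, 374_900),
}

tokens_per_pair = {name: tokens / pairs for name, (pairs, tokens) in results.items()}

for name, ratio in tokens_per_pair.items():
    print(f"{name}: {ratio:.1f} tokens per variant pair")
# female first names: 5.1 tokens per variant pair
# male first names: 5.7 tokens per variant pair
# surnames: 2.1 tokens per variant pair
```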

15 so far so good, but
the original certificates are not error-free
> variants found can be due to errors in the source, errors during transcription, or typos
theoretical issue: what is a name variant, and what is an error?

16 example
in the source documents: Pieter
– born as son of Jacob Houtlosser and Aafje Spruit
– died as son of Jacob Houtlosser and Grietje Spruit
variant Aafje - Grietje?

17 variants and errors
– the distinction is difficult to make
– variants share the same lemma and errors do not
– requires onomastic expertise (which we would like to avoid: let the data speak for itself)

18 variants and errors
Variants
– Willem - Wilhelm
– Willem - Guillaume
– Willem - W8llem (no indication of different lemma)
Errors
– Grietje - Aafje
– Fijtje - Sijtje (understandable reading error but different lemma)

19 methods for cleaning
– using name dictionaries with lemmas to accept name pairs
– using known non-variants to reject name pairs
– rules to accept name pairs
all with manual intervention (< 2%)
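How the three cleaning methods might be combined is not detailed on the slide. The Python sketch below is one hypothetical arrangement: the order of the checks, the function `triage`, the example rule `same_letters`, and the lemma labels are all assumptions made for illustration.

```python
def same_letters(a, b):
    """Hypothetical acceptance rule: names are equal once non-letters and case are ignored."""
    def letters(s):
        return "".join(ch for ch in s.lower() if ch.isalpha())
    return letters(a) == letters(b)


def triage(pair, lemmas, non_variants, rules=()):
    """Classify a candidate pair as 'accept', 'reject' or 'manual' (assumed order of checks)."""
    a, b = pair
    if frozenset(pair) in non_variants:
        return "reject"                      # known non-variant
    if lemmas.get(a, set()) & lemmas.get(b, set()):
        return "accept"                      # both names share a dictionary lemma
    if any(rule(a, b) for rule in rules):
        return "accept"                      # accepted by a rule
    return "manual"                          # left for manual intervention


# Toy inputs; the lemma label is a placeholder, not taken from a real dictionary.
lemmas = {"Willem": {"LEMMA-WILLEM"}, "Wilhelm": {"LEMMA-WILLEM"}}
non_variants = {frozenset(("Grietje", "Aafje"))}

print(triage(("Willem", "Wilhelm"), lemmas, non_variants, [same_letters]))  # accept
print(triage(("Grietje", "Aafje"), lemmas, non_variants, [same_letters]))   # reject
print(triage(("Willem", "Willem."), lemmas, non_variants, [same_letters]))  # accept
print(triage(("Fijtje", "Sijtje"), lemmas, non_variants, [same_letters]))   # manual
```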

20 cleaning | name dictionaries
dictionary of Dutch first names (20,000), but
– lemmas too detailed
– names with multiple lemmas
– only 8% of all first name pairs share a lemma in the dictionary (43% of tokens)

21 results, in variant pairs
female first name pairs    34,800 accepted   13,900 errors (29%)
male first name pairs      22,500 accepted    9,400 errors (29%)
surname pairs             120,100 accepted   57,100 errors (32%)
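The percentages are consistent with reading each one as the share of errors among all harvested pairs of that type (accepted plus errors). A quick Python check under that assumption, using only the counts on this slide:

```python
# (accepted pairs, error pairs) as reported on the slide
cleaned = {
    "female first names": (34_800, 13_900),
    "male first names": (22_500, 9_400),
    "surnames": (120_100, 57_100),
}

# Assumption: percentage = errors / (accepted + errors)
error_share = {name: errors / (accepted + errors)
               for name, (accepted, errors) in cleaned.items()}

for name, share in error_share.items():
    print(f"{name}: {share:.0%} errors")
# female first names: 29% errors
# male first names: 29% errors
# surnames: 32% errors
```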

22 Links ICOS 2014 Glasgow Utrecht Leiden very many variant pairs (Willemina) WILMINA -WILMIJNA WILLEMJE -WILLEMPJE WELLEMTJE -WILLEMTJE WILMTJE -WILLEMPJE WILLEMTJE -WILEMTJE WILHELMINA -WILLEMPJE WILLEPMJE -WILLEMTJE WILLEMPIE -WILLEMPJE WELLEMTJE -WELLIMTJE WELLEMTJE -WOLLEMTJE WILLEMIJNTJE -WILLEMPJE WILLEMIJNTJE -WLLEMIJNTJE WLLEMIJNTJE -WILLEMPJE WILLEMIJN -WILLEMIJNA WILHELMINA -WILLEMINA WILLEMTIEN -WILMTIEN WILLEMTIEN -WILLEMTJE WILEHELMINA -WILHELMINE WILLEMKE -WILLEMKEN WILLEMKEN -WILLEKEN WILLEMINA -WILLEMINE WILLEMINA -WILLIMINA WILLEMIENA -WILLEMINA WILLEMINA -WILLEMPJE WIHELMINA -WILHELMINA WILLEMKE -WILLENKE WILLEMIJNTJE -WILEMIJNTJE WILHEMINA -WILLEMINA WILLEMKEN -WILMKEN WILLEMPJE -WILLEMTJE WILLEMIJNTE -WILLEMIJNTJE WILLEMIJNTJE -WILLEMYNTJE WILLEMPTJE -WILLEMTJE WILLEMIJNTJE -WILLEMTJE WILLEMIJNTJE -WILLEMYNA WILLEMYNA -WILLEMIJNA WILLEMPJE -WILSJE WILEMPJE -WILLEMPJE WILLEMIJNTJE -WILLEMEINTJE WILLEMIINTJE -WILLEMIJNTJE WILLEMINA -WILLEMINTJE WILLEMINA -WILELMINA WILHELMINA -WILHELMINE WILLEMIJN -WILLEMPJE WILLEMIJN -WILLEMTJE WILLEMINA -WILLEMIJN WILLEMIJNTJE -WILLEMINTJE WILLEMIJNTJE -WILLEMEIJNTJE WILLEMIJN -WILLEMIJNTJE WILHELMINA -WILLEMIJNA WILHELMIMA -WILHELMINA WILHELMINA -WILHLEMINA WILHELMIJNA -WILHELMINA WILLEMKE -WILLEMPJE WILLEPMJE -WILLEMKE WILLEPMJE -WILLEMPJE WILLEMIJNTJE -WILLEMINA WILHELMA -WILLEMIJNA WILLEMINA -WILLLEMINA WILLEINTJE -WILLEMPJE WILHELMIJNA -WILLEMIJNA WILHELMINA -WILHELMUS WILLEMINA -WILHELMUS WILHELMIA -WILHELMINA WILLEMTIEN -WILTIEN WILLEKE -WILLEMKE WILHELMINA -WILHLMINA WILHELMINA -WILHEMINA WILLEMPTJE -WILLEMTJEN WILLEMIEN -WILLEMTIEN WILLEM -WILLEMPJE WILLEMINA -WILLEMIJNE WILTIEN -WILMTIEN WILMKE -WILLEMKEN WELHELMINA -WILHELMINA GUILLIELMINE -GUILLELMINE WILLEMTIEN -WILLEMPIEN WILHELMIENA -WILHELMINA WILMINA -WILMIENA WILLEMKE -WILLEMTIEN WELLEMTJE -WELMTJE WILLEMIN -WILHELMINA WILMTJE -WILLEMTJE WILLEMINA -WILMINA WILLELMIN -WILHELMINA GUILLIELMINE -WILHELMINA WILLEMINA -WILLEMKE WILEMIJNA -WILLEMIJNA 
WILLEMTIJN -WILLEMTJE WILLEMINA -WILLEMMINA WILLEMIJNE -WILLEMIJNA WILLEMS -WILLEMINA WILLEMINE -WILLELMINA WILLEMKE -WILMKE WILLEMIJNTJE -WILLEMIENTJE WILLEMINA -WILLEMIMA WILLEMA -WILLEMINA WILLEMINA -WILLEMEIJNTJE WILHELINA -WILHELMINA WILLEMKEN -WILLENKE WILLEMINA -WILLEMTJE WILLEMIJNTJE -WILLIMPJE WILHELMINA -WILLEMIJNTJE WULLEMPJE -WILLEMPJE WILLEMINA -WELLEMINA WILHELMINE -WILLEMINE WILLEMIJN -WILHELMINA WILLEMIJNE -WILHELMINA WILLEMPTJE -WILMPTJE WILHELM -WILHELMI WILLEMIEN -WILHELMINA WILLEMINA -WILLEMKEN WILHELMA -WILHELMINA WILHELMINE -WILLEMINA WILLEMIN -WILLEMINA GUILLEMINE -WILHELMINE WILLEMIENTJE -WILLEMEINTJE WILLMINA -WILHELMINA WILLEMIJNA -WILEMINA WILLEMINA -WILLMINA GUILLELMINE -WILHELMINE WILLEMIJNTJE -WILMIENA WILLEM -WILLEMS WILHELMINA -WILMINA WILMPJE -WILLEMTJE WILLEMINA -WILLEMIENTJE WILLEMKE -WILLEMTJE WILLEMKE -WILLEMPKE WILLEMIJNTJE -WILLEMKEN WILLEMIJNTJE -WILLEMIJNTIE WILLEMPJE -WILEMTJE WILLEMINA -WILMIJNTJE WILLEINTJE -WILLEMTJE WILLEMTJEN -WILLEMPJE WILLEMTJE -WILLMEPJE WILLEMINA -WILHELMIMA GUILLIELMINE -GUILIELMINE WILLEMPIEN -WILLEMPJE WILHELMINA -WILLEMTJE WILLEMINA -WILLEMEINTJE WILLEMIEN -WILLEMIN WILLEMINA -WILMPJE WILMINE -WILLEMINE WILKENS -WILKES WILLEMINE -WILMINA WILLEMTJEN -WILLMEPJE WIILEMINA -WILLEMINA WILEHELMINA -WILHELMINA WILHELMINA -WILLEMDINA WILLEMKEN -WILHELMINA WILLEMIENTJE -WILLEMIJNA WILLEMA -WILLEMS WILLEMPJEN -WILLEMTJEN WILLEMPIEN -WILLEMTJE WILHELHERMINA -WILHELMINA GUILLEMINE -WILHELMINA WILLEMIJNTJE -WILMIJNTJE WILLEMPJE -WILMPJE WILLEMINE -WILLEMIENE WILLEMINA -WILLEMSEN WILLEMPKE -WILLEMPJE GUILLELMINE -GUILLELMINA WILLEMIENA -WILLEMPJE WILLEMIJNTIE -WILLEMPJE WILLELMINA -WILLEMINA GUILLEMINE -GUILLELMINA WILLEMIENA -WILHELMIENA WILLEMINA -WILHELMIENA WILELMINA -WILHELMINA GUILLEMINA -GUILLELMINE WILLEMKE -WILEMKE WILLEMKE -WILLEM WILLEMTJEN -WILLEMTIJN WILLEMPIEN -WILLEMPJEN WILLEMJE -WILLEMTJE WILLEMKEN -WILLEM WILEMIJNA -WILMIJNA WILHELMINA -WILLEMIENA WILLEMTJE -WILLEMTJEN WILLEMTIEN -WILLEMS 
WILLEMTIEN -WILLEMPJE GUILHELMINE -GUILLELMINE WILLEMKE -WIMPKE WILHELMINA -WILKELINA WILHELLEMINA -WILHELMINA WILEMINA -WILLEMINA WILLEMJEN -WILLEMKEN WILMINE -WILLEMINA WILHELMIN -WILHELMINA WILLEMPJ -WILLEMPJE and many more 22

23 name clusters
variant pairs (are interconnected)
– Jan - Johannes
– Jan - Joannes
– Jan - Johan
– Johannes - Johan, etc.
create cluster
Jan { Jan, Johannes, Johan }
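Interconnected pairs can be grouped by treating names as nodes and accepted variant pairs as edges, then taking connected components. The Python sketch below uses a small union-find structure; this is one common way to build such clusters and is an assumption here, since the presentation does not state which clustering method was used or how the cluster label is chosen.

```python
def cluster_names(pairs):
    """Group names into clusters: names linked by any chain of variant pairs end up together."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


pairs = [("Jan", "Johannes"), ("Jan", "Joannes"), ("Jan", "Johan"),
         ("Johannes", "Johan"), ("Dirk", "Dirck")]

for cluster in sorted(sorted(c) for c in cluster_names(pairs)):
    print(cluster)
# ['Dirck', 'Dirk']
# ['Jan', 'Joannes', 'Johan', 'Johannes']
```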

24 name clusters
male first names    1,221 ( names, 20%)
female first names  1,530 ( names, 21%)
compares to the number of lemmas in the Dutch dictionary of first names, vd Schaar 1964
surnames  ( names, 17%)
compares to the number in the Dutch surnames overview (without many variants), Winkler

25 conclusions
– person name variants need proof from true person links
– expert knowledge is necessary because errors cannot be distinguished fully automatically from true variants (but < 2%)
– final results are promising as a starting point to create a national repository of proven name variants

