Interpretation and fault-tolerant identification of relationship data. Holger Wandt, Colloquium Taal en Spraak, KU Nijmegen, Wednesday 3 March 2004.

1 1 Interpretation and fault-tolerant identification of relationship data Holger Wandt Colloquium Taal en Spraak KU Nijmegen Wednesday 3 March 2004

2 2 Overview: The use of knowledge tables (relationship data: segmentation, storage; attributes; statistics; rules; a closer look). How do we use the knowledge and the rules in interpretation? The Rolodex demo.

3 3 ANK Engineering Ltd. Appleford

4 4 Monsieur e/o Madame Durand

5 5 Int. Transp. Ond. Joh. Tilburg Hardinxv./Giessend. e/o

6 6 Fysiotherapeutisch Centrum Arie en Jolanda Kruizenga Intake Unit 1

7 7 Dr. John Park jr. BA, MR EconS, MKM

8 8 Siemens ElectroCom GmbH & Co. Postdienstautomatisierung und Technologieentwicklung

9 9 DE POST c/o mevrouw A. Vanderwalle-Van Damme Industrieel Ingenieur Logistiek

10 10 RegTP, Regulierungsbehörde für Telekommunikation und Post

11 11 CQCS International Consulting

12 12 Chowhounds Delight Restaurant & Bar Attn: John Peter Arnold

13 13 Eerste Roelofarendveense Papierfabriek Anno 1931 NV h.o.d.n. “Papier Hier”


15 15 Suomen Posti OY Tuotteet/ Mediapalvelut/ Osoitepalvelut

16 16 Let’s summarize…. Surnames Given names Forms of address Titles Prefixes/infixes and prepositions/articles Additions Professions Geographical items Legal forms Company words Divisions Company names Ordinals

17 17 Relationship data: LCR manages and maintains 3 knowledge databases for each country: 1stbase, Fambase and DicMan. LCR also manages and maintains country-specific synonym tables.

18 18 Storage of relationship data Segmentation (define groups of data) Attributes of groups Attributes of particular items Link between items (abbreviation, plural, etc.)
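The storage model on the slide above (groups, group attributes, item attributes, links between items) could be sketched roughly as follows. This is only an illustration: the class names, the `countries` attribute and the `abbreviation_of` relation label are assumptions, not the schema of the actual knowledge databases.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    text: str                                        # surface form, e.g. "Ond."
    group: str                                       # segment the item belongs to
    attributes: dict = field(default_factory=dict)   # item-level attributes
    links: list = field(default_factory=list)        # (relation, target text) pairs

@dataclass
class Group:
    name: str
    attributes: dict = field(default_factory=dict)   # group-level attributes
    items: list = field(default_factory=list)

    def add(self, item):
        """Register an item in this group and hand it back."""
        self.items.append(item)
        return item

# Hypothetical content: one group with a full form and its abbreviation.
company_words = Group("company word", {"countries": ["NL"]})
full = company_words.add(Item("Onderneming", "company word"))
abbr = company_words.add(
    Item("Ond.", "company word", links=[("abbreviation_of", "Onderneming")])
)
```

Keeping the link on the abbreviated item means a lookup of "Ond." can be followed to its full form without scanning the whole group.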

19 19 STATISTICS (columns BE, DE, NL): Surnames 3374101006097277312; Given names 206182242525569; FoA 269131136; Titles 2841739279; Prefix/Infix & articles/prepositions 654664498; Additions 324192143; Professions 9682792355; Geogr. items 124163224818611; Legal forms 2361835138; Company words 2046781215920; Divisions 17216090; Company names 19671504684; Ordinals 42129371

20 20 General and country specific rules Capitalization Punctuation Word break Abbreviation

21 21 Capitalization Belgium: Flemish: Karin Van der Ploeg Walloon: Henri de La Censerie Germany: E.v. Buskirk KG Verband der Chemischen Industrie e.V. Netherlands: Puffelen r.a., Victor van Puffelen RA, de heer Van

22 22 Punctuation Mr Theodor St.John mr. Olaf Oudendijk Martin Klaus Lehmann Martin, Klaus & Lehmann HA.DI.WE. Inh: Hans-Dieter Weber Don Quichotte N.V./S.A. Don Quichotte NV/SA

23 23 Epitaph Here lies my beloved wife Christine In heaven she is not in hell I know It’s written for everyone to be seen

24 24 Word break J.P.L. den He- yer Groepsex- cursies General and country specific rules: -In NL: ma-chi-nes -In GB: ma-chines NEVER: mac-hines
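One way to picture country-specific word-break rules is a per-country table of permitted break points. The sketch below holds only the "machines" example from the slide; the table layout and the function names are assumptions made for illustration.

```python
# Hypothetical per-country hyphenation table; only the slide's example is filled in.
HYPHENATION = {
    "NL": {"machines": "ma-chi-nes"},
    "GB": {"machines": "ma-chines"},
}

def break_points(word, country):
    """Return the character offsets at which `word` may be split in `country`."""
    pattern = HYPHENATION[country][word.lower()]
    points, offset = set(), 0
    for syllable in pattern.split("-")[:-1]:
        offset += len(syllable)
        points.add(offset)
    return points

def is_valid_break(word, offset, country):
    """True if splitting `word` after `offset` characters is allowed."""
    return offset in break_points(word, country)
```

With this table, a break after "mac" (offset 3) is rejected for both countries, while a break after "machi" (offset 5) is accepted only for NL.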

25 25 Abbreviation General rule for BE, DE and NL: no word may be abbreviated further than its first vowel-consonant (VC) group or its first consonant-vowel-consonant (CVC) group. Abbreviation – abbrev. – abbr. Consonant – conson. – cons. There are country-specific abbreviations: Ges.m. beschränkt. Haft. / Handelsmij. / Stnrs. / R.P. and RR.PP. But beware of the Hotel Association Française
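The general VC/CVC rule can be read as "an abbreviation must keep at least everything up to the first consonant that follows the first vowel". The sketch below implements that reading for prefix-style abbreviations only. The vowel set and the function names are assumptions, and country-specific forms such as "Stnrs." are not prefixes and would need a separate table.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: 'y' and digraphs such as Dutch "ij" get no special treatment

def _base(ch):
    """Strip diacritics from one character so that 'ö' counts as 'o'."""
    return unicodedata.normalize("NFD", ch)[0].lower()

def shortest_allowed(word):
    """Shortest prefix that still contains the first VC or CVC group."""
    seen_vowel = False
    for i, ch in enumerate(word):
        if _base(ch) in VOWELS:
            seen_vowel = True
        elif seen_vowel and ch.isalpha():
            return word[: i + 1]
    return word  # no vowel followed by a consonant: the word cannot be shortened

def is_allowed(abbreviation, word):
    """True if `abbreviation` is a prefix of `word` that respects the rule."""
    stem = abbreviation.rstrip(".")
    return (word.lower().startswith(stem.lower())
            and len(stem) >= len(shortest_allowed(word)))
```

Under this reading "abbr." and "cons." pass, while "co." for "consonant" is rejected because it stops before the first CVC group "con".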

26 26 A closer look: Family names Prefixes Names consisting of several parts Names with a foreign language attribute Diacritic symbols

27 27 Prefixes In NL separation of prefix and family name is necessary for sorting purposes. In the Human Inference databases: 22.000 family names with prefix in BE; 15.000 family names with prefix in DE; 30.000 family names with prefix in NL. Validation of names: Le Galloudec, but not Galloudec
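Separating the prefix from the main part of a family name makes sorting on the main part straightforward. The following is a minimal sketch: the prefix list is a small illustrative set, not the contents of the databases mentioned on the slide, and the function names are assumptions.

```python
# Small illustrative prefix list (assumption, not the real table).
PREFIXES = {"de", "den", "le", "van", "van der", "vanden", "tot den"}

def split_prefix(family_name):
    """Split off the longest known leading prefix; return (prefix, main part)."""
    tokens = family_name.split()
    for k in range(len(tokens) - 1, 0, -1):
        candidate = " ".join(tokens[:k])
        if candidate.lower() in PREFIXES:
            return candidate, " ".join(tokens[k:])
    return "", family_name

def sort_key(family_name):
    """Sort on the main part first, then on the prefix."""
    prefix, main = split_prefix(family_name)
    return (main.lower(), prefix.lower())

names = ["van Walbeek", "de Boer", "Van der Ploeg", "van Puffelen"]
ordered = sorted(names, key=sort_key)
```

Here `ordered` places "de Boer" first and "van Walbeek" last, since the names are compared on Boer, Ploeg, Puffelen and Walbeek.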

28 28 Names consisting of several parts Double-barrelled names with and without hyphen: Adelheid de Boer-van Buiten Dirk Segaert vanden Bussche Double-barrelled name with infix: Arie Gansneb genaamd Tengnagel tot den Bonckenhave Double-barrelled name without infix: Martina Galloux Wittevrouw

29 29 Names with a foreign language attribute Three categories: Arabic: el Bahlaoui Husseini al Fharid Chinese/Vietnamese: Cuong Buo Chan Spanish/Portuguese: Fonseca Aranda de Pereira Rodriguez

30 30 Diacritic symbols All diacritics have to be recorded in the database, for two reasons: preferences in capital conversion, and validation of names. Examples: Büch Hällström Özgüleç Güçlütürk
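Storing the diacritics while matching on a folded form is one common way to validate input that arrives in capitals or without accents. The sketch below assumes that approach; it uses Unicode decomposition from the Python standard library, and the names `fold`, `INDEX` and `validate` are hypothetical.

```python
import unicodedata

def fold(name):
    """Upper-case a name and drop combining marks; used for matching only."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()

# The stored forms keep their diacritics (examples from the slide).
STORED = ["Büch", "Hällström", "Özgüleç", "Güçlütürk"]
INDEX = {fold(name): name for name in STORED}

def validate(raw):
    """Return the stored spelling for `raw`, or None if it is not known."""
    return INDEX.get(fold(raw))
```

For example, `validate("GUCLUTURK")` gives back the stored form with its diacritics. A single-valued index like this cannot tell "Buch" from "Büch"; a real table would have to keep every stored spelling per folded key.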

31 31

32 32 Interpretation of relationship data Different kinds of relationship data Different attributes General and country specific rules (capitalization, abbreviation, etc.) Signification differs due to context Due to the ambiguity of relationship data, correct interpretation is no picnic

33 33 Different kinds of relationship data with different attributes Betonmortelfabriek BEMOTI Tilburg bv Tilburgse Betonmortelfabriek BEMOTI bv RegTP, Regulierungsbehörde für Telekommunikation und Post CQCS International Consulting Servicebureau Jansen/ Jansen Elektroservice De Boer Landbouwmachines/ De Boer Machinebouw

34 34 Signification can differ as a consequence of context, rules for abbreviation, capitalization and punctuation Art Gallery Wandt & Wandt Wandt Fachhandel für Kunstart. Art. Wandt Kunsthandel van Walbeek, M.B.A. Van Walbeek, MBA

35 35 Significations: How can they be determined? Does the item exist in the particular knowledge universe? Can the significations be resolved or deduced (acronyms and compounds)? If the item does not exist in the knowledge universe, what is the most probable signification, considering the context?

36 36 Can the item be deduced or resolved? NeVoBo Nederlandse Volleybalbond KLM Koninklijke Nederlandse Luchtvaartmaatschappij AAAA Maschinenfabrik Mertens Carburateurbinnenverlichtingsfabriek Mertens

37 37 The item is not found in the knowledge universe Harry Edward Johnson Harry Edward Ireallygotaweirdsurname IBM Computing HAL Computing Hermans Groente & Fruit, A’dam Johnson Sarvice & Cnosult, Chelsee
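For misspelled items such as "Sarvice", "Cnosult" and "Chelsee", a fault-tolerant lookup can fall back to approximate matching when the exact lookup fails. The sketch below uses `difflib` from the Python standard library as a stand-in; the talk does not say which matching technique is actually used, and the word list and cutoff are assumptions.

```python
import difflib

# Tiny illustrative vocabulary (assumption).
KNOWN = ["service", "consult", "chelsea", "computing", "groente", "fruit"]

def identify(token, cutoff=0.8):
    """Return (known word, status): 'exact', 'corrected' or 'unknown'."""
    word = token.lower().strip(",.")
    if word in KNOWN:
        return word, "exact"
    close = difflib.get_close_matches(word, KNOWN, n=1, cutoff=cutoff)
    if close:
        return close[0], "corrected"
    return None, "unknown"
```

An item such as "Ireallygotaweirdsurname" has no close match and comes back as unknown, which is the case where context has to decide the most probable signification.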

38 38 Context Metzgerei Theo Frankfurt given name/surname? Metzgerei Theo Frankfurt given name/ geographical item? Karin Jansen – Bloemen given name/surname/company word? Karin Jansen – Bloemen given name/surname – surname (maiden name)?

39 39 Patterns Restaurant Die Vier Jahreszeiten Café Het Nerveuze Schaap Jasmijn Bloemen en Planten Helena Catering & Imbiß Consultingservice QCS Amsterdam Aardappelhandel ABC Paterswolde

40 40 Patterns? chr. bond v. ambtenaren chr. bond van zomers KARL OTTO GRAF LAMBSDORFF EVA MARIA BARON POTOCKI Hi-Fi Johanson & Gruber GmbH Em-Lo Emmerich und Lohmeier GmbH

41 41 Multiple occurrences An item must be stored in all its significations. Beh. → Behandlung, Behälter, Behörde, Behinderte. Ond. → Onderzoek, Onderhoud, Onderneming, Onderwijs, Onderling
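Storing an item in all its significations amounts to a one-to-many table. A minimal sketch with the two abbreviations from the slide follows; the names `SIGNIFICATIONS`, `store` and `candidates` are hypothetical.

```python
from collections import defaultdict

# One item maps to every signification it can have.
SIGNIFICATIONS = defaultdict(list)

def store(item, signification):
    """Add a signification for an item, ignoring exact repeats."""
    key = item.lower()
    if signification not in SIGNIFICATIONS[key]:
        SIGNIFICATIONS[key].append(signification)

for full in ["Behandlung", "Behälter", "Behörde", "Behinderte"]:
    store("Beh.", full)
for full in ["Onderzoek", "Onderhoud", "Onderneming", "Onderwijs", "Onderling"]:
    store("Ond.", full)

def candidates(item):
    """Return all stored significations for an item (empty list if unknown)."""
    return list(SIGNIFICATIONS.get(item.lower(), []))
```

The lookup returns every candidate and leaves the choice between them to the later context and scoring steps.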

42 42 Interpretation step by step: (1) Read the appellation. (2) Divide the appellation into relevant sections and ascribe all possible significations to the sections. (3) Apply context and grouping rules and choose the most probable combination of significations. (4) Score the found items, the small context, the large context and the corrections for special cases.
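The steps above can be illustrated with a toy scorer that enumerates every combination of significations and keeps the best-scoring one. Everything in this sketch is assumed for illustration: the knowledge table, the pair weights and the scoring by neighbouring pairs. The talk does not state which reading of "Metzgerei Theo Frankfurt" should win, so the weights here simply favour given name plus surname.

```python
from itertools import product

# Hypothetical knowledge table: every token with all of its significations.
KNOWLEDGE = {
    "metzgerei": ["company_word"],
    "theo": ["given_name"],
    "frankfurt": ["surname", "geographical_item"],
}

# Hypothetical context weights for neighbouring significations.
PAIR_SCORES = {
    ("company_word", "given_name"): 1.0,
    ("given_name", "surname"): 2.0,
    ("given_name", "geographical_item"): 0.5,
}

def interpret(appellation):
    """Return the best-scoring (token, signification) list and its score."""
    tokens = appellation.split()                       # step 2: sections
    options = [KNOWLEDGE.get(t.lower(), ["unknown"]) for t in tokens]
    best, best_score = None, float("-inf")
    for combo in product(*options):                    # every combination
        score = sum(PAIR_SCORES.get(pair, 0.0) for pair in zip(combo, combo[1:]))
        if score > best_score:                         # step 3: most probable
            best, best_score = combo, score
    return list(zip(tokens, best)), best_score

reading, score = interpret("Metzgerei Theo Frankfurt")
```

Exhaustive enumeration is fine for short appellations; with many ambiguous tokens the number of combinations grows quickly and a pruned search would be needed.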

43 43 [Diagram] Interpretation; Signification; Knowledge Universe; Appearance; Context

44 44 The rolodex demo

45 45

46 46 For more information:
