Duplicates & matching in worldwide data: Challenges & Solutions
Carsten Kraus
© OMIKRON Data Quality GmbH

The world becomes more global
Germany generates over 30% of its GNP through exports
In 1993, exports accounted for only 20%

The world becomes more local
When conquered by the Russians, Mongolia changed from its own alphabet to Cyrillic
In 1993, they changed back
Ireland and many other countries are making efforts to strengthen their own languages
→ Do not expect the world to switch to English standards any time soon

The world: so what?
Worldwide customer data needs:
- A data structure that works worldwide
- Entry forms that work worldwide
  - e.g. for the internet
- Processing that works worldwide
  - e.g. matching / duplicate check

Why is matching important?
Save money
- Avoid duplicates in mailings
- Avoid selling to blacklisted customers or above the credit limit
Earn more money
- Avoid customer frustration
- A single view of the customer is needed for BI
  - CLV
  - cross-selling
  - marketing controlling
  - …
Save your life
- Avoid selling to terrorists

Do not trust the postal code
In the UK, the postal code is very fine-grained, sometimes down to the level of a single building
- It is therefore a strong identifier in duplicate checks
Germany, France, Switzerland, Sweden …: only 5 or 4 numeric digits
70 countries do not have any postal code at all, e.g. Ireland
→ Do not assume that you can base strong identification on the postal code
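One way to act on this advice is to let the contribution of a postal-code match depend on the country. The sketch below is only an illustration: the function name `postal_code_evidence`, the country table, and all weight values are assumptions made for this example, not measured figures and not Omikron's method.

```python
# Hypothetical per-country weights: how strongly an exact postal-code match
# supports "same address". The numbers are illustrative assumptions only.
POSTAL_CODE_WEIGHT = {
    "GB": 0.9,  # UK codes can narrow down to a single building
    "DE": 0.4,  # short numeric codes cover a much larger area
    "FR": 0.4,
    "CH": 0.4,
    "SE": 0.4,
}
DEFAULT_WEIGHT = 0.2  # assumed fallback for countries not listed above


def normalize_code(code):
    """Remove whitespace and uppercase, so 'sw1a 1aa' equals 'SW1A1AA'."""
    return "".join(code.split()).upper()


def postal_code_evidence(country, code_a, code_b):
    """Return a score in [0, 1]: evidence that two records share an address."""
    if not code_a or not code_b:
        return 0.0  # a missing code gives no evidence in either direction
    if normalize_code(code_a) != normalize_code(code_b):
        return 0.0
    return POSTAL_CODE_WEIGHT.get(country, DEFAULT_WEIGHT)
```

With this shape, a match on a UK code can carry most of the decision, while in other countries the name and street fields have to supply the rest of the evidence.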

Postal data
Reference data with street names is available for only a few countries
Update cycles differ
Therefore: the duplicate check must also be able to handle addresses that are not postally precise

Characters in the world:
- Alphabet: Александр Пушкин
- Abugida: देवनागरी
- Abjad: أسامة بن لادن
- Syllabic script: あいこ
- Symbol script: 愛子

Processes, e.g. duplicate check
"Unicode capabilities" alone are not enough

Abjads
Semitic languages
- Arabic [ العربية ]
- Hebrew [ עברית ]
- Abjads are written from right to left
Abjads write mainly consonants
- For most words, vowels are optional: they are obvious to native readers and are added when speaking
Problems arise when Arabic names are written in Latin script:
- 27 spellings of Usama bin Laden [ أسامة بن لادن ] in the archives of "Der Spiegel" (a magazine similar to "Time")
- (Demonstration with Omikron technology)
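Because the vowels are largely a choice of the transliterator, a crude way to bring Latin spellings of the same Arabic name together is to compare only their consonants. The following is a minimal sketch of that idea, not the Omikron technology mentioned above; the function name `consonant_skeleton` is made up for this example, and the approach will miss variants that differ in consonants.

```python
import unicodedata

VOWELS = set("aeiou")


def consonant_skeleton(name):
    """Reduce a Latin-script name to its sequence of consonants."""
    # Decompose accented letters so the diacritics can be dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep ASCII letters only; this removes spaces, hyphens and combining marks.
    letters = [ch for ch in decomposed.lower() if ch.isascii() and ch.isalpha()]
    consonants = [ch for ch in letters if ch not in VOWELS]
    # Collapse doubled consonants ("ss" becomes "s").
    skeleton = []
    for ch in consonants:
        if not skeleton or skeleton[-1] != ch:
            skeleton.append(ch)
    return "".join(skeleton)
```

"Usama bin Laden", "Osama bin Ladin" and "Oussama Ben Laden" all reduce to `smbnldn`. The key is deliberately coarse, so it is better suited to finding candidate pairs than to confirming a match.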

Japan
Multiple ways of writing Aiko ("child of love"):
- 愛子
- あいこ

China
- ZHANG Aiguo 张爱国
- ZHANG Aimin 张爱民
- ZHANG Aidang 张爱党
ZHANG is the family name, Ai the generation name
Only the last syllable represents the given name

The 5 most common names
- U.K.: 3.3%
- China: 31%

Thus
Do not trust the family name to be a good differentiator in all countries
Your software should be able to handle these cases
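One common way to handle this is to weight a family-name match by how rare the name is, so that a match on a very frequent name counts for little. The sketch below uses the information content of the name's relative frequency. The function name `surname_weight` is an assumption for this example, and the frequencies would have to come from your own reference data.

```python
import math


def surname_weight(frequency):
    """Weight (in bits) of a family-name match.

    `frequency` is the share of the population carrying the name,
    with 0 < frequency <= 1. Rare names yield high weights.
    """
    if not 0 < frequency <= 1:
        raise ValueError("frequency must be in the interval (0, 1]")
    return -math.log2(frequency)
```

A name carried by a quarter of the population gets a weight of 2 bits, while a name carried by one person in a thousand gets about 10 bits. The rarer name is therefore a much stronger differentiator in a duplicate check.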

Householding
In many countries, male and female names have different endings
- E.g. Greek names
  - Male: Πέτρος Κώτης (Petros Kotis)
  - Female: Αναστασία Κώτη (Anastasia Koti)
When identifying households, searching for a 100% match is not enough
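A tolerant comparison can be sketched with the Python standard library: family names count as household-compatible when their similarity exceeds a threshold, rather than only when they are identical. The function name `surnames_compatible` and the threshold of 0.8 are assumptions for this illustration. A real householding rule would also compare the address.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(surname):
    """Unify Unicode composition and case (also maps final sigma to sigma)."""
    return unicodedata.normalize("NFC", surname).casefold()


def surnames_compatible(surname_a, surname_b, threshold=0.8):
    """True if two family names are similar enough to share a household."""
    ratio = SequenceMatcher(None, normalize(surname_a), normalize(surname_b)).ratio()
    return ratio >= threshold
```

For the example above, Κώτης and Κώτη differ only in the final letter and pass the threshold, while an exact comparison would place the two people in different households.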

Order of given/family name
In the U.K., names begin with the given name:
- John Smith
In France and in many other countries, the given name comes after the family name:
- DUPONT Michel
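If the field order cannot be relied on, the comparison can ignore it. The following is a minimal sketch that treats a name as an unordered collection of words. The function names are hypothetical, and exact word equality is used only to keep the example short.

```python
def name_key(full_name):
    """Order-independent key: the casefolded words of the name, sorted."""
    return tuple(sorted(full_name.casefold().split()))


def same_name_any_order(name_a, name_b):
    """True if both names consist of the same words in any order."""
    return name_key(name_a) == name_key(name_b)
```

With this key, "DUPONT Michel" and "Michel Dupont" compare as equal. Combining it with a fuzzy per-word comparison would additionally tolerate spelling variants.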

Compounds in company names
English: ring tone service Ltd. (3 words)
German: klingeltonservice GmbH (1 word)
Or:
- Klingelton-Service
- Service für Klingeltöne
Not only German:
- Netherlands
- Scandinavia
- Occasional occurrence in many languages
Most algorithms cannot handle this because they compare word by word
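A simple way around word-by-word comparison is to compare the names with spaces and hyphens removed. This sketch shows only that one step; `compact_key` is a name chosen for this example and does not describe how any particular product solves the problem.

```python
import re


def compact_key(company_name):
    """Casefold the name and remove whitespace and hyphens."""
    return re.sub(r"[\s\-]+", "", company_name.casefold())
```

"Klingelton-Service GmbH" and "klingeltonservice GmbH" then produce the same key, as do "ring tone service Ltd." and "Ringtone Service Ltd.". A reordered variant such as "Service für Klingeltöne" is not covered and needs a comparison that also tolerates word order and inflection.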

Omikron offers:
Worldwide matching technology
- At D&B Sweden and at Schober Iberia, we replaced localized solutions because our international matching technology delivered better results, even on local data
- At Reed, we found more duplicates in international addresses that had already been processed by another high-end software product
- Patent pending
Other DQ technology
- e.g. data structuring, upper/lower case …
All built into an SOA-ready solution, the Omikron DQ Server
- Other environments available

Thus:
Handling international data correctly means more than just being able to import Unicode data
Keep in mind the impact on
- Data storage
- Data entry
- Matching
- Salutation
- etc.
The global world becomes more local again: take care of it and you will have a competitive advantage
Feel free to ask us to help you ;-)

