Lan Jiang John Koumarelas Supervisor: Prof. Felix Naumann

Use-case oriented data preparation on address data (dd.mm.yyyy) Lan Jiang John Koumarelas Supervisor: Prof. Felix Naumann

Agenda
Duplicate Detection
- Definition
- Difficulty
- Typical Process
  - Data Preparation
  - Indexing and retrieval of candidates
  - Comparison of candidates
  - Classification
  - Evaluation (Testing)
  - Merging records into a canonical record (Production)
- Other things to consider
  - Gold standard
  - Transitivity
- Tools (Gold Standard, Deduplication)
- Literature
Data normalization
- Phone
- Name (personal, company)
- Address normalization
  - Not a trivial task
  - GIS systems
  - Basic GIS operations
  - Other things to consider

Part 1 – data normalization (name, phone, address)

Data normalization
Data should follow the standards issued by the highest applicable authority (global > national > local > company). Following specific standards helps not only in the preparation phase (e.g. duplicate detection), but in all later processes of your application.
Normalization can apply to many attributes and take many forms:
- Commonly: addresses, names, phone numbers, dates
- Less commonly:
  - price → timestamp for the currency exchange rate (e.g. price in Euros (€))
  - age → birthday (e.g. age in full length, specifying even days, or in another calendar, such as Gregorian, Chinese etc.)
  - weight → imperial system (e.g. pounds) versus metric system (e.g. kilograms)

Phone normalization
Lengths and formats differed widely until E.164, which can be used worldwide. In E.164 a "+" (plus sign) precedes the country code (1 to 3 digits), followed by the actual number; the full number has at most 15 digits. Phone numbers are often also split into blocks for clarity, which E.164 does not describe. This again leads to the use of different delimiters.
415/ Example from the Restaurants dataset...
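The delimiter problem above can be illustrated with a minimal sketch. It assumes that numbers without a leading "+" belong to one default country (here "1", an assumption), and it does not handle national trunk prefixes such as a leading 0. The function name `to_e164` and the sample numbers are made up for illustration.

```python
import re


def to_e164(raw: str, default_country_code: str = "1") -> str:
    """Return an E.164-style string: "+" followed by digits only."""
    text = raw.strip()
    has_country_code = text.startswith("+")
    digits = re.sub(r"\D", "", text)  # drop "/", "-", spaces, parentheses, ...
    if not digits:
        raise ValueError(f"no digits in {raw!r}")
    if not has_country_code:
        # Assumption: numbers without "+" are national numbers of the default country.
        digits = default_country_code + digits
    if len(digits) > 15:  # E.164 allows at most 15 digits
        raise ValueError(f"too long for E.164: {raw!r}")
    return "+" + digits


# Two differently delimited spellings of the same (fictitious) number.
print(to_e164("415/555-0100"))
print(to_e164("(415) 555 0100"))
```

Both spellings map to the same string, so an exact comparison after normalization is enough for this attribute.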

Names (personal, organization) normalization
Personal: First name (F), middle name (M), and last name (L) are not always in the same order (FM L); the reverse order (L FM) is also common, especially in formal use. The middle name is sometimes shortened to just its initial.
"Fahlman, S. and Lebiere, C."
"S. E. Fahlman and C. Lebiere,"
Example from Cora dataset...
Organization: Chain + Location specifier (of a particular hotel)
"QUALITY SUITES"
"QUALITY SUITES BORDEAUX AIRPORT HOTEL AND SPA"
Not enough information 😞
Example from Hotels dataset...
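For the personal-name case, a rough sketch of one possible normalization: reduce each name to a lowercase last name plus the initials of the given names, so that "L, F" and "F M L" become comparable. The function names and the prefix-matching rule are assumptions for illustration, not a method from the slides.

```python
from typing import Tuple


def normalize_person_name(name: str) -> Tuple[str, str]:
    """Return (last name, initials of given names), both lowercase."""
    cleaned = name.strip().strip(",").strip()
    if "," in cleaned:
        # "Last, Given" order
        last, given = (part.strip() for part in cleaned.split(",", 1))
    else:
        # "Given ... Last" order: assume the last token is the last name
        tokens = cleaned.split()
        last, given = tokens[-1], " ".join(tokens[:-1])
    initials = "".join(tok[0].lower() for tok in given.replace(".", " ").split())
    return last.lower(), initials


def may_be_same_person(a: str, b: str) -> bool:
    """Same last name, and one initials string is a prefix of the other."""
    last_a, init_a = normalize_person_name(a)
    last_b, init_b = normalize_person_name(b)
    return last_a == last_b and (init_a.startswith(init_b) or init_b.startswith(init_a))


print(normalize_person_name("Fahlman, S."))
print(normalize_person_name("S. E. Fahlman"))
print(may_be_same_person("Fahlman, S.", "S. E. Fahlman"))
```

The sketch breaks on multi-token last names (e.g. "van der Berg") and says nothing about organization names, where the chain and the location specifier would have to be separated first.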

Address normalization
A typical experience with address normalization: you did not remember, or did not want to type, the full address, e.g. Prof.-Dr.-Helmert-Straße 2-3, Potsdam.

Address normalization
Important, since such information exists in many applications. A complicated task, as every country/area has its own rules (different formats or parts).
Commonly found address parts:
- Street address: House number, Street name
- City
- ZIP-Code: ZIP+4 in USA (5-digit ZIP code plus a 4-digit add-on)
- State-Code (/name): WA
- Country-Code (/name): USA (or US)
The USPS is responsible for defining the rules in the USA.

Address normalization Not a trivial task – Main data types to consider
Main data types:
- Hierarchy
- Polylines
- Points of Interest (POIs) / Address entries: useful during reverse geocoding
Datasets:
- Hierarchy: Who's On First
- Polylines: Polyline
- Address entries: OpenStreetMap
Data is part of companies' success, so good-quality data is usually not available for free…

Address normalization Not a trivial task – Hierarchy
Knowledge of the hierarchy could change a match for Springfield to a completely different place. Maybe it is not in Washington (state) but in California. ... Or it could be that the street address is the part that actually contains the error.

Address normalization Not a trivial task – Address entries datasets
OpenStreetMap:
- Complete planet XML file: 66 GB compressed → 913 GB uncompressed
- Elements (Nodes, Ways, Relations)
Other datasets: GeoNames, Tiger US Census, OpenAddressesIO

Address normalization Geographic information systems (GISs)
Commercial: Google Maps, Factual, ArcGIS, Baidu, Bing, HERE, Yahoo, TomTom ...
Limits:
1. Requests per day (proxies?); depends on the operation (geocoding cost > reverse geocoding)
2. Legality: not allowed to store information for commercial reasons (with the exception of geolocation?)
Open Source: OpenStreetMap (Nominatim), Pelias, Gisgraphy...
Python tool, with a list of geocoders:

Address normalization Basic GISs operations
- Geocoding: Given an address, return the geocoordinates (latitude, longitude).
- Reverse geocoding: Given geocoordinates, return the (nearest) address.
- Normalization (geocoding → reverse geocoding): Repair my address.
- Nearby: K Nearest Neighbors.
- Parsing: Parse an address into Street address, City, ZIP-code, State-code, Country-code.
- Verification: Does this address exist?

Address normalization Basic GISs operations - Geocoding
Geocoding: f(address) = geolocation
The most difficult operation.
Question 1: What if there is no house number 105 in the dataset, but there are houses 101 and 110? → Interpolation
The shape of the road and other building semantics might make this interpolation a much more difficult task.
Picture taken from:
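A minimal sketch of the interpolation idea in Question 1, assuming the street is a straight line between the two nearest known house numbers and that house numbers are evenly spaced. Real geocoders also account for road shape and odd/even sides; the function name and the coordinates below are made up.

```python
from typing import Dict, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)


def interpolate_house_number(target: int, known: Dict[int, Coordinate]) -> Coordinate:
    """Linearly interpolate the position of a missing house number."""
    if target in known:
        return known[target]
    below = [n for n in known if n < target]
    above = [n for n in known if n > target]
    if not below or not above:
        raise ValueError("target is outside the known house-number range")
    lo, hi = max(below), min(above)
    t = (target - lo) / (hi - lo)  # fraction of the way from lo to hi
    (lat0, lon0), (lat1, lon1) = known[lo], known[hi]
    return (lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0))


# Hypothetical coordinates for houses 101 and 110 on the same street.
known_houses = {101: (52.0000, 13.0000), 110: (52.0009, 13.0018)}
print(interpolate_house_number(105, known_houses))
```

House 105 lies 4/9 of the way from 101 to 110, so the estimate is roughly (52.0004, 13.0008).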

Address normalization Basic GISs operations - Geocoding
Question 2: What if there are typos? What happens if you can match on the street-address level but not on the city or state level? And vice versa.
Query: Prof. Dr. Helm. Str. 2, Bellevue, Brandenburg
Level | Candidate 1 | Candidate 2
State: Brandenburg | ✓ | ✓
City: Potsdam | ✓ | ✖
Street address: Prof. Dr. Helm. Str. 2 | ✖ | ✓
Which one should be considered a match? (Higher tolerance in lower levels)

Address normalization Basic GISs operations – Reverse Geocoding
Special case of "Nearby" with k = 1: find the closest place to the given geolocation (latitude, longitude).
Question: What happens if the closest place is too far away? Different limits per type? E.g. 0.5 km for places, 1 km for a street, otherwise return city/state/country.
Apart from the above decisions, the implementation should be straightforward (R-Trees among others).
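The per-type distance limits can be sketched as follows. This is a brute-force scan rather than an R-Tree, the entries and coordinates are hypothetical, and the limits are the ones suggested above (0.5 km for places, 1 km for streets).

```python
import math
from typing import Iterable, Tuple

Entry = Tuple[str, str, float, float]  # (name, kind, latitude, longitude)

MAX_KM = {"place": 0.5, "street": 1.0}  # limits per type, as suggested above


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def reverse_geocode(lat: float, lon: float, entries: Iterable[Entry],
                    fallback: str = "city/state/country") -> str:
    """Nearest entry within its type-specific limit, else a coarser fallback."""
    best = None
    for name, kind, elat, elon in entries:
        d = haversine_km(lat, lon, elat, elon)
        if d <= MAX_KM.get(kind, 0.0) and (best is None or d < best[0]):
            best = (d, name)
    return best[1] if best else fallback


# Hypothetical address entries.
entries = [
    ("Cafe A", "place", 52.3940, 13.1320),
    ("Main Street", "street", 52.4000, 13.1300),
]
print(reverse_geocode(52.3941, 13.1321, entries))
print(reverse_geocode(52.3990, 13.1320, entries))
print(reverse_geocode(53.0000, 14.0000, entries))
```

The second query is slightly more than 0.5 km from the place, so the street is returned instead; the third is too far from everything and falls back to the coarser level.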

Address normalization Basic GISs operations
- Normalization: It should be as straightforward as geocoding & reverse geocoding, unless geocoding fails, in which case syntactic preparation (e.g. with regex tools) could be done instead. USPS: "…A standardized address is one that is fully spelled out, abbreviated by using the Postal Service standard abbreviations (shown in Publication 28 or as shown in the current Postal Service ZIP+4 file), and uses the proper format for the address style…"
- Nearby: Similar to reverse geocoding, but with a general k.
- Parsing: Split into parts. We have seen both regex tools and statistical approaches (libpostal).
- Verification: Does geocoding return a result? Maybe it could be more relaxed: instead of verifying existence, we could verify the format. Does this seem like an address?

Address normalization Other things to consider
Synonyms / Designators / Abbreviations
Tokens:
- ROAD, PATH, WAY, ... ?
- STREET, ST, STR, STRT
- PARKWAY, PARKWY, PKWAY, PKWY, PKY
- AV, AVE, AVEN, AVENU, AVENUE, AVN, AVNUE
Numbers:
- 1, 1st, I, first, one
- 2, 2nd, II, second, two
- ...
Places:
- San Francisco, SF
- Munich, München
- ...
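A small sketch of how such designator variants could be mapped to one canonical token. The variant lists are taken from the slide; the choice of canonical form, the function name, and the tokenization (uppercasing, dropping commas and periods) are assumptions.

```python
import re

# Canonical designator -> variants listed on the slide.
CANONICAL = {
    "STREET": ["ST", "STR", "STRT"],
    "PARKWAY": ["PARKWY", "PKWAY", "PKWY", "PKY"],
    "AVENUE": ["AV", "AVE", "AVEN", "AVENU", "AVN", "AVNUE"],
}
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in CANONICAL.items()
    for variant in variants
}


def normalize_designators(address: str) -> str:
    """Uppercase, drop commas/periods, and replace known designator variants."""
    tokens = re.findall(r"[^\s,.]+", address.upper())
    return " ".join(VARIANT_TO_CANONICAL.get(token, token) for token in tokens)


print(normalize_designators("12 Main Str."))
print(normalize_designators("5th Avn, Springfield"))
```

A purely token-based mapping is ambiguous in practice: "ST" can also stand for "Saint", so position in the address (or a parser) would be needed to decide.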

Language representation
<tag k="name" v="Potsdam"/>
<tag k="name:ar" v="بوتسدام"/>
<tag k="name:be" v="Патсда́м"/>
<tag k="name:bg" v="Потсдам"/>
<tag k="name:ce" v="Потсдам"/>
<tag k="name:ckb" v="پۆتسدام"/>
<tag k="name:cs" v="Postupim"/>
<tag k="name:de" v="Potsdam"/>
<tag k="name:dsb" v="Podstupim"/>
<tag k="name:el" v="Πότσδαμ"/>
<tag k="name:eo" v="Pocdamo"/>
<tag k="name:fa" v="پوتسدام"/>
<tag k="name:he" v="פוטסדאם"/>
<tag k="name:hi" v="पॉट्सडैम"/>
<tag k="name:hsb" v="Podstupim"/>
<tag k="name:ja" v="ポツダム"/>
<tag k="name:ka" v="პოტსდამი"/>
<tag k="name:kk" v="Потсдам"/>
<tag k="name:ko" v="포츠담"/>
<tag k="name:la" v="Potestampium"/>
<tag k="name:lt" v="Potsdamas"/>
...
Encoding issues? Language effect during matching?
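One way to use these tags during matching is to collect all language variants of a place name into a lookup table. The sketch below parses a shortened copy of the tags with Python's standard `xml.etree.ElementTree`; the enclosing `<node>` element and the function name are assumptions made so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tags above, wrapped in an assumed <node> element.
OSM_FRAGMENT = """<node>
  <tag k="name" v="Potsdam"/>
  <tag k="name:cs" v="Postupim"/>
  <tag k="name:de" v="Potsdam"/>
  <tag k="name:el" v="Πότσδαμ"/>
  <tag k="name:lt" v="Potsdamas"/>
</node>"""


def names_by_language(xml_text: str) -> dict:
    """Map language code (or "default") to the name stored in the tags."""
    names = {}
    for tag in ET.fromstring(xml_text).iter("tag"):
        key = tag.get("k", "")
        if key == "name":
            names["default"] = tag.get("v")
        elif key.startswith("name:"):
            names[key.split(":", 1)[1]] = tag.get("v")
    return names


names = names_by_language(OSM_FRAGMENT)
known_spellings = set(names.values())
print(names["cs"], "Postupim" in known_spellings)
```

Passing the XML as a Python string sidesteps encoding issues; when reading from a file, the declared encoding of the dump has to be respected.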

Different systems for geocoordinates...
- degrees minutes seconds (DMS): 40° 26′ 46″ N 79° 58′ 56″ W
- degrees decimal minutes (DDM): 40° 26.767′ N 79° 58.933′ W
- decimal degrees (DD): 40.446° N 79.982° W
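Converting between these representations is plain arithmetic (1° = 60′ = 3600″). A short sketch, with function names chosen for illustration and the common convention that southern and western hemispheres are negative in decimal degrees:

```python
def dms_to_dd(degrees: int, minutes: int, seconds: float, hemisphere: str) -> float:
    """Degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def dms_to_ddm(degrees: int, minutes: int, seconds: float):
    """Degrees/minutes/seconds to (degrees, decimal minutes)."""
    return degrees, minutes + seconds / 60


print(round(dms_to_dd(40, 26, 46, "N"), 3))   # latitude of the DMS example above
print(round(dms_to_dd(79, 58, 56, "W"), 3))   # longitude, negative for West
print(dms_to_ddm(40, 26, 46))
```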

Interesting tools Addresses – Overpass - Queries
For ad-hoc queries: use the wizard on the website. The input "hotel in bellevue" gets translated into the Overpass query language.

Interesting tools Addresses – Overpass – Street model
Overpass can also provide a street model. This way we might be able to answer questions/requests such as:
- Get me nearby streets.
- Is this a crossroad?
- Are these two streets parallel?

Interesting tools Addresses – Overpass - Example

End of part 1 GOTO Jupyter.geocoding; GOTO Jupyter.reverse_geocoding;
GOTO Jupyter.Address_normalization;

Part 2 – duplicate detection

Duplicate Detection Definition
The goal of deduplication is to have a clean DB in the end, where every entity is represented by a single record. Normalized values, completeness of records etc. are not its concern.
Ironically, there are many "duplicate" terms for Duplicate Detection: Duplicate Detection, Deduplication, Record (/Data) Linkage (/Matching), Entity Resolution, Similarity (/Theta) Join, Database Hardening, Object Identification, and more.
A duplicated DB can lead to many issues, such as:
- False positives: what you found is not what I searched for (inconsistency in values; repaired by merging records).
- False negatives: you did not find what I searched for, although you have it (the full query values are scattered across records; also resolved by a merge).
- Incompleteness: I am sure there is a record with more information, just not the one you could find.

Duplicate Detection Difficulty
The task itself begins at the moment when there is no unique ID to distinguish/join records. If there is such an ID, the task is trivial: we only have to merge and select a canonical representation (discussed later).
Deciding whether two records refer to the same entity can be a difficult task for humans too.
hotel name 1 | hotel name 2 | Similarity [0.0, 1.0]
"SUITES NOVOTEL PARIS ISSY LES MOULINEAU" | "HOTEL NOVOTEL SUITES PARIS ISSY LES MOULINEAUX" | 0.953
"MARRIOTT DL AIRPORT" | "MARRIOTT DALLAS DALLAS FT WORTH F" | 0.40
Easy and difficult DPL cases.

Typical process
DATA (duplicated) → Data Preparation → Indexing and retrieval of candidates → Candidates comparison (pairs) → Classification, followed by one of two flows:
A: Evaluation (Experimental flow / Testing)
B: Application (Business flow / Production): Classification (transitivity) → Merging (Canonical Representation) → DATA (duplicate free)
... But deduplication is already part of "data preparation". For the context of duplicate detection there are also preparation steps.

Data Preparation Typical data preparation steps for deduplication:
- Data normalization (address, name, phone) etc.
- Repair encoding issues.
- Other preparators…
We are currently working on a paper that shows how data preparation, and particular preparators, influence duplicate detection results.

Indexing and Retrieval of candidates
It is not just about execution time: the indexing phase can affect both precision and recall. Theoretically, a good similarity measure & classifier could fix the precision later, but a duplicate pair that is never retrieved as a candidate stays lost (recall).
Typical indexing schemes (for alphanumeric fields):
- Blocking (Hashtable)
- Sorted Neighborhood (Trees, e.g. AVL)
- Clustering (Graph): relies on other methods to provide the edges
... Metric trees can also be used.

Indexing and Retrieval of candidates
Blocking key, Sorted Neighborhood key:
John Koumarelas, Potsdam, Germany → jokoumpotsger
Variations: phonetic encoding, MinHashing & Locality Sensitive Hashing (LSH) (approximation of Jaccard distance), character/word N-Grams. Bloom Filters can also be used for further pruning.
3-grams of "abcd efg": {"abc", "bcd", "cd ", "d e", " ef", "efg"}
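The key above appears to be built from the first 2 letters of the first name, 4 of the last name, 4 of the city, and 3 of the country. Under that assumption, blocking and Sorted Neighborhood can be sketched as follows; the function names, the window size, and the records other than the slide's example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def make_key(first: str, last: str, city: str, country: str) -> str:
    """Prefix-based key: 2 + 4 + 4 + 3 characters, lowercased (assumed scheme)."""
    return (first[:2] + last[:4] + city[:4] + country[:3]).lower()


def blocking_pairs(keys: dict) -> set:
    """Candidate pairs = all pairs of records that share exactly the same key."""
    blocks = defaultdict(list)
    for record_id, key in keys.items():
        blocks[key].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_neighborhood_pairs(keys: dict, window: int = 2) -> set:
    """Sort records by key and pair each record with its window - 1 successors."""
    ordered = sorted(keys, key=lambda record_id: (keys[record_id], record_id))
    pairs = set()
    for i, record_id in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            pairs.add(tuple(sorted((record_id, other))))
    return pairs


keys = {
    "r1": make_key("John", "Koumarelas", "Potsdam", "Germany"),
    "r2": make_key("John", "Koumarelas", "Potsdam", "Germany"),
    "r3": make_key("Jhon", "Koumarelas", "Potsdam", "Germany"),  # typo in first name
    "r4": make_key("Jane", "Doe", "Berlin", "Germany"),
}
print(keys["r1"])
print(sorted(blocking_pairs(keys)))
print(sorted(sorted_neighborhood_pairs(keys)))
```

The typo in r3 changes its key, so exact blocking misses the pair (r1, r3), while Sorted Neighborhood still retrieves it because the keys are adjacent after sorting. The price is extra non-duplicate candidates such as (r3, r4).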

Comparison of candidates
We need a way to compare two values and, depending on the type, return a similarity or a difference. Ideally, the returned similarity (or difference) is expressed in the range [0.0, 1.0] (or another predefined range).
Some typical data types:
- Numeric values: age, speed, price, rating, geolocation distance ... Not always easy to normalize.
- Date (or hierarchical data in general?): a difference in the year is more important than one in the day or the seconds.
- Alphanumeric: the most common one... We focus on this.

Comparison of candidates Alphanumeric
Three main types of similarity measures:
- Edit-based (insertion, deletion, substitution): Exact match, Hamming, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, LongestCommonSubsequence
- Token-based (set of tokens, bag similarity): Jaccard, Jaccard N-Grams
- Hybrid (compare tokens with edit-based similarities): Hybrid Jaccard, Monge-Elkan, Stable matching, Soft TF-IDF
Token set example: {"ab", "bc", "cd"} vs. {"bc", "ef"}
Other types: phonetic encoding
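As a minimal illustration of one edit-based and one token-based measure, the sketch below implements Levenshtein distance (scaled to a [0.0, 1.0] similarity by the longer string's length, which is one common choice and an assumption here) and Jaccard similarity on token or n-gram sets.

```python
def ngrams(text: str, n: int = 3) -> list:
    """Character n-grams of a string, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard(a, b) -> float:
    """|A ∩ B| / |A ∪ B| on the sets of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (cs != ct))) # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(s: str, t: str) -> float:
    """Distance scaled into [0.0, 1.0] by the length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


print(jaccard({"ab", "bc", "cd"}, {"bc", "ef"}))  # 1 shared token out of 4 distinct
print(levenshtein("jon", "john"), levenshtein_similarity("jon", "john"))
```

A hybrid measure would combine the two: tokenize both strings, then count tokens as matching when their edit-based similarity exceeds some threshold instead of requiring exact equality.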

Classification
Using the calculated similarities (on records' attributes) as features, we can use a number of classifiers.
- Threshold-based classifier (or linear classifier): typically used in duplicate detection.
- Other classifiers: random forests (RF), support vector machines (SVM), logistic regression, k-nearest neighbors (kNN), neural networks (NN) etc.
pair | sim name | sim city | sim zip-code | duplicate
r1r2 | 0.9 | 0.8 | 0.85 | ✓
r3r4 | 0.4 | 1.0 | | ✖
W_name * sim_name + W_city * sim_city + W_zip-code * sim_zip-code ≥ Threshold
For ML people: F-Measure optimization is a non-convex and non-concave problem.
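The weighted threshold rule can be written down directly. The weights and the threshold below are arbitrary assumptions for illustration (in practice they are tuned against a gold standard), and the zip-code similarity of the second pair is a made-up value.

```python
# Assumed weights and threshold; not taken from the slides.
WEIGHTS = {"name": 0.5, "city": 0.2, "zip-code": 0.3}
THRESHOLD = 0.7


def weighted_score(similarities: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-attribute similarities."""
    return sum(weight * similarities[attr] for attr, weight in weights.items())


def is_duplicate(similarities: dict, threshold: float = THRESHOLD) -> bool:
    """Classify a pair as duplicate when the weighted score reaches the threshold."""
    return weighted_score(similarities) >= threshold


r1r2 = {"name": 0.9, "city": 0.8, "zip-code": 0.85}
r3r4 = {"name": 0.4, "city": 1.0, "zip-code": 0.5}  # zip-code value is hypothetical
print(is_duplicate(r1r2), is_duplicate(r3r4))
```

With these settings r1r2 scores about 0.865 and is classified as a duplicate, while r3r4 scores about 0.55 and is not, even though its city similarity is perfect.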

Evaluation – for experiments / testing
- true positives (TP), false positives (FP), true negatives (TN), false negatives (FN)
- precision, recall
- f-measure:
  - F0.5: higher emphasis on precision
  - F1.0: equal emphasis
  - F2.0: higher emphasis on recall
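These measures follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F_β = (1 + β²) · precision · recall / (β² · precision + recall). A short sketch with made-up counts:

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted duplicates that are true duplicates."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of true duplicates that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F_beta: beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: 8 duplicates found, 2 wrong predictions, 4 missed duplicates.
tp, fp, fn = 8, 2, 4
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_measure(tp, fp, fn, beta), 3))
```

Because precision (0.8) is higher than recall (about 0.667) in this example, F0.5 comes out highest and F2.0 lowest. True negatives do not appear in any of the three measures, which matters because non-duplicate pairs vastly outnumber duplicates.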

Merging records into a canonical record
Ideas:
- Plain majority voting.
- Weighted majority voting: completeness of record, cleanliness of record.
- Merge values: instead of selecting one of the existing values, create a new one based on the available ones.
Example: r1, r2, r3, r4 → r canonical (plain majority voting)
Record | Name | City | ZIP-code | Country
r1 | John | Potsdam | |
r2 | jon | | 14482 | Germany
r3 | | | |
r4 | | | | german
r canonical | | | |
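Plain majority voting can be sketched per attribute: ignore missing values and keep the most frequent remaining one. The records below only loosely follow the slide's example; the values of r3 and the name of r4 are made up so that every attribute has a clear majority.

```python
from collections import Counter


def merge_by_majority(records: list) -> dict:
    """Build a canonical record by taking the most frequent value per attribute."""
    attributes = []
    for record in records:
        for attr in record:
            if attr not in attributes:
                attributes.append(attr)
    canonical = {}
    for attr in attributes:
        values = [record[attr] for record in records if record.get(attr)]
        # Ties are broken by first occurrence; missing values never win.
        canonical[attr] = Counter(values).most_common(1)[0][0] if values else None
    return canonical


records = [
    {"name": "John", "city": "Potsdam"},
    {"name": "jon", "zip": "14482", "country": "Germany"},
    {"name": "John", "city": "Potsdam", "country": "Germany"},  # hypothetical values
    {"name": "John", "country": "german"},                      # name is hypothetical
]
print(merge_by_majority(records))
```

Weighted voting would replace the plain counts with per-record weights (e.g. completeness or cleanliness), and value merging would construct a new value instead of picking an existing one.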

Other things to consider Gold Standard
Creating a gold standard, i.e. pairs of records that are marked as duplicates (DPL) or non-duplicates (NDPL), is a difficult task. Classification needs both classes.
Most papers in the literature:
- Assume everything different from DPL to be NDPL.
- Randomly select a number of pairs equal or proportional to the number of DPL.
We consider both to be wrong, because the resulting NDPL pairs are mostly easy cases. For this reason we apply blocking and remove the DPL. Everything that remains is a difficult case that can be used.
Hotel name 1 | Hotel name 2 | Similarity
OAK TREE INN RAWLINS | DAYS INN RAWLINGS | 0.618
MARYGREEN MANOR | HILTON SUITES F B | 0.098
...
Difficult and easy NDPL cases.
We can further filter pairs, based on their similarity or blocking participation.
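The "apply blocking, then remove DPL" idea can be sketched as a set difference: all pairs that share a block are candidates, and the known duplicate pairs are subtracted. Block keys and record ids below are hypothetical.

```python
from itertools import combinations


def hard_non_duplicates(blocks: dict, duplicate_pairs: list) -> set:
    """Pairs that share a block but are not marked as duplicates (hard NDPL)."""
    dpl = {tuple(sorted(pair)) for pair in duplicate_pairs}
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates - dpl


# Hypothetical blocks (block key -> record ids) and gold-standard duplicates.
blocks = {"inn": ["h1", "h2", "h3"], "hil": ["h4"]}
duplicates = [("h2", "h1")]
print(sorted(hard_non_duplicates(blocks, duplicates)))
```

Records that share a blocking key already look alike in at least one respect, so the remaining pairs tend to be harder to tell apart than randomly drawn ones.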

Other things to consider Transitivity
If we know r1r2 and r2r3 → we assume r1r3.
It is not a given, but we assume it because of its meaning in duplicate detection: if r1 and r2 represent the same entity, and the same holds for r2 and r3, it makes sense that r1 and r3 also represent the same entity.
... But what happens in longer chains (r1, r2, r3, ..., r9, r10, ...)? Apply clustering instead?
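Applying transitivity to all classified pairs amounts to computing connected components. A minimal sketch with a union-find structure (the function name is an assumption):

```python
def duplicate_clusters(pairs: list) -> list:
    """Group records into clusters via the transitive closure of duplicate pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


pairs = [("r1", "r2"), ("r2", "r3"), ("r9", "r10")]
print(duplicate_clusters(pairs))
```

The long-chain concern shows up here directly: a single wrongly classified pair is enough to fuse two otherwise unrelated clusters, which is one argument for clustering methods that also look at the similarities inside a cluster.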

Duplicate detection Census dataset
Census is based on real data, generated by the U.S. Census. The available attributes are: last_name, first_name, middle_name, zip_code, and address.
After address normalization we have:
- Country code
- State (name)
- City
- Postcode
- Road
- House number
Figure: Average similarity before and after address normalization

Duplicate detection Restaurants dataset
Restaurants is a mixed dataset of two other relations, based on Fodor's and Zagat's restaurant guides. Of the seven available attributes, we use name, addr, city, phone, and type.
After address normalization we have:
- Country code
- State (name)
- City
- Postcode
- Road
- House number
Figure: Average similarity before and after address normalization

Summary
Address data preparation: address normalization, geocoding, etc.
→ Complex procedure, better to use GIS systems
→ Prefer OpenStreetMap from open source
→ Essential information to maintain: Country, state, city, zip-code, street-address
Duplicate detection
→ Typical data integration/cleaning step
→ Improve effectiveness with data preparation

Grading
- Presentation in week 4: Describe the method used, the concrete steps, and the libraries used. 15%
- Presentation in week 8: Describe the preparator implementation (algorithm, datasets, results).
- Final presentation: Describe the implementation of the grand task. 20%
- Participation: Participation in discussions, seminar attendance. 10%
- Implementation: Unit tests / correctness, comments, coding style. 40%
Data Preparation for Science, Introduction WS 18/19

Presentation
- Description: What is the task you want to complete with the dataset?
- Dataset
- Problems: What problems do you face when preparing the data for your task?
- Method: Data preparation tool, scripting (libraries), manual editing
- Concrete steps: Add a new column, change the format of value…; order of the steps

Room changes Date: 13th Nov.
Place: G3.E.15/16 (Ground floor in design-thinking school)
