Sharing address cleaning patterns for Patstat: a metadata structure proposal By Gianluca Tarasconi

From chaos to order...
The main milestones in cleaning and standardizing Patstat persons (inventors and applicants) can be summarized as follows:
1. PRE-PARSING
2. CLEANING
3. STANDARDIZATION
4. DEDUPLICATION
Because the process is strictly sequential, the results of the last steps (address standardization and deduplication) depend heavily on the quality of the first two.

... and back to chaos...
- Different teams specialize in local addresses [country-wise data cleaning]
- Standards (i.e. sequence of toponym, street name, number) differ from team to team
- Enrichments / links to other data may need a special data structure
As a result, the outcomes of data parsing and cleaning may differ considerably among work teams.

Metadata structure proposal (I)
We assume that at some point the data coming from Patstat are parsed into an intermediate data structure, where the original strings are split into several fields according to the meaning of the geographic information they contain.
Data origin (206, 206ascii…) -> [Pre-parsing] -> Parsed data structure -> [Cleaning] -> Parsed & clean data -> [Standard.] -> Standard data -> [Dedupl.] -> Standard & disambig data

Metadata structure proposal (II)
ADDRESS     Typically toponym, name, number
LOCALITY    City area (optional)
ADDR_OTHER  Other specifics different from toponyms (floor, building, but also c/o company name) [should be data not relevant for standardization]
CITY        Municipality name
COUNTY      Administrative level above municipality
REGION      Administrative level above county
STATE       Administrative level above region, for federal nations
ZIP_CODE    Alphanumeric zip code
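
As an illustration only, the intermediate structure could be held in memory as a small record type. The sketch below uses Python; the class name `ParsedAddress` and the lowercase attribute names are our assumptions, only the field list comes from the slide.

```python
from dataclasses import dataclass, fields


@dataclass
class ParsedAddress:
    """One parsed person address; attribute names mirror the proposed fields."""
    address: str = ""      # typically toponym, name, number
    locality: str = ""     # city area (optional)
    addr_other: str = ""   # floor, building, c/o company name, ...
    city: str = ""         # municipality name
    county: str = ""       # administrative level above municipality
    region: str = ""       # administrative level above county
    state: str = ""        # level above region, for federal nations
    zip_code: str = ""     # alphanumeric zip code


# A record as it might look right after pre-parsing: the zip is still in CITY.
raw = ParsedAddress(city="LONDON W1 AL2")
print([f.name.upper() for f in fields(raw)])
```

Keeping every field as a plain string leaves room for the cleaning operators to move substrings between fields later.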

Dimensions in data cleaning: operators (I)
Typical operators used to clean or move pieces of information across fields are:
MOVE     moves a string from one field to another
REPLACE  changes a string inside a field
INSERT   inserts a string without removing other strings
DELETE   removes a string without removing other strings
For these operators we should consider two dimensions indicating where the operation takes place:
FIELD FROM  name of the field where the operation starts
FIELD TO    name of the final target of the operator (optional)
We also need to list which string has to be found and what it must be replaced with:
FIND     string to be found
REPLACE  string replacing the string found

Dimensions in data cleaning: operators (II)
We may reduce the number of operators to MOVE alone, treating the others as particular cases of MOVE:
REPLACE = MOVE where FIELD FROM = FIELD TO
INSERT  = MOVE where FIELD FROM = FIELD TO and REPLACE string = FIND + insert string
DELETE  = MOVE where FIELD FROM = FIELD TO and REPLACE string is empty
This structure also allows combinations of operators such as move & replace (e.g. moving a misspelled string from one field to the correct field with the correct spelling, see example #5 below).

Dimensions in data cleaning: example
When using the MOVE operator we will also have to take into account the position in the target field where the string should be placed. This issue is addressed later on.
# | Description | Field from | Field to | Find | Replace
1 | MOVE zip in LONDON W1 AL2 from city to zip_code | CITY | ZIP_CODE | W1 AL2 |
2 | REPLACE ß with SS in ADDRESS | ADDRESS | ADDRESS | ß | SS
3 | INSERT AM in FRANKFURT MAIN | CITY | | FRANKFURT MAIN | FRANKFURT AM MAIN
4 | DELETE / in cities like FRANKFURT /MAIN | CITY | | / |
5 | REMOVE straße from CITY and put it into ADDRESS as STRASSE | CITY | ADDRESS | straße | STRASSE
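
To make the reduction to a single MOVE operator concrete, here is a minimal Python sketch. The function name `move` and the dict-of-fields record are hypothetical. We also assume three behaviours the slides do not spell out: only the first occurrence of FIND is handled, an empty FIELD TO means "same field", and an empty REPLACE in a cross-field move carries the found string over unchanged.

```python
def move(record, field_from, field_to, find, replace="", where="T"):
    """Apply one MOVE rule to a dict of address fields; returns a new dict."""
    field_to = field_to or field_from
    out = dict(record)
    if find not in out[field_from]:
        return out                                   # rule does not apply
    if field_to == field_from:
        # REPLACE, INSERT and DELETE are all in-place rewrites
        out[field_from] = out[field_from].replace(find, replace, 1)
    else:
        # true MOVE: cut FIND out of the source, add the string to the target
        out[field_from] = out[field_from].replace(find, "", 1)
        moved = replace or find
        target = out.get(field_to, "")
        parts = [moved, target] if where == "L" else [target, moved]
        out[field_to] = " ".join(p for p in parts if p)
    for name in (field_from, field_to):
        out[name] = " ".join(out[name].split())      # collapse leftover spaces
    return out


row = {"ADDRESS": "", "CITY": "LONDON W1 AL2", "ZIP_CODE": ""}
row = move(row, "CITY", "ZIP_CODE", "W1 AL2")
print(row)   # CITY is now "LONDON", ZIP_CODE is "W1 AL2"
```

INSERT and DELETE need no code of their own: `move(r, "CITY", None, "FRANKFURT MAIN", "FRANKFURT AM MAIN")` inserts AM, and `move(r, "CITY", None, "/")` deletes the slash.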

Dimensions in data cleaning: endogenous data
The methods used to clean addresses may differ depending on pieces of information contained in the data themselves. Typical cases are:
APPLICATION AUTHORITY             gives some address filling hints and charset
COUNTRY CODE                      gives toponyms, administrative data, etc.
YEAR FROM / TO (OPT.)             some info may change with time (e.g. a change in country code)
PATSTAT EDITION FROM / TO (OPT.)  some info can change with changes in Patstat

Dimensions in data cleaning: match patterns
At string level, this is the core of our interchange format. Our proposal is to use SQL REGEXP operator patterns as the default, including the following parameters:
LIKE                    pattern to be found (inclusion criterion)
LIKE NOT [OPTIONAL]     pattern that must not be present (exclusion criterion)
POSITION (begin / end)  start / end position where the pattern can be
SQLSTANDARD             the standard used for writing the patterns (SQL dialect, LIKE vs REGEXP…), in order to make translation easier

Interchange data structure proposal: operator
It is proposed to use a field called OPERATIONKIND in which we store the origin and destination of the move operation. It would be a multilayer indicator with one digit for each field of the pattern group, indicating the address field to be addressed.
COUNTRY  ADDRESS  LOCALITY  ADDR_OTHER  CITY  COUNTY  REGION  STATE  ZIP  NOWHERE
A        B        C         D           E     F       G       H      I    0
For instance, (LIKE, LIKE NOT, FIND, REPLACE) = BCEF would mean: if the LIKE pattern is in address and the LIKE NOT pattern is not in locality, find the FIND pattern in city and insert the REPLACE pattern in county.
An optional last digit will be added for move operations (where the 1st and 4th digits are different), containing L or T to indicate whether the REPLACE pattern must be inserted leading or trailing in the target field. For instance, BBBDT would mean that LIKE, LIKE NOT and FIND are in address, and the replace string must be inserted at the end of the addr_other field.
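
A decoder for OPERATIONKIND might look like the Python sketch below. The letter-to-field table is the one on the slide; the function name `decode_operationkind` and the shape of the returned dict are our assumptions.

```python
FIELD_CODES = {
    "A": "COUNTRY", "B": "ADDRESS", "C": "LOCALITY", "D": "ADDR_OTHER",
    "E": "CITY", "F": "COUNTY", "G": "REGION", "H": "STATE",
    "I": "ZIP", "0": "NOWHERE",
}
ROLES = ("LIKE", "LIKE NOT", "FIND", "REPLACE")


def decode_operationkind(code):
    """Split a code such as 'BBBDT' into role -> field plus the L/T flag."""
    if len(code) not in (4, 5):
        raise ValueError("expected 4 field codes plus an optional L or T")
    decoded = {role: FIELD_CODES[ch] for role, ch in zip(ROLES, code[:4])}
    position = code[4] if len(code) == 5 else None
    if position not in (None, "L", "T"):
        raise ValueError("fifth character must be L (leading) or T (trailing)")
    decoded["POSITION"] = position
    return decoded


print(decode_operationkind("BCEF"))
print(decode_operationkind("BBBDT"))
```

Validating the code up front keeps malformed rows out of the shared pattern table before any query is generated from them.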

Interchange data structure proposal: endogenous data
This is the list of the fields needed; where not indicated, the meaning of the field is explained in previous slides.
APPLICATION AUTHORITY | 2 char string | % may indicate valid for all
COUNTRY CODE          | 2 char string | % may indicate any country
DATE FROM             | date          | [optional] empty means no exclusion
DATE TO               | date          | [optional] empty means no exclusion
PATSTATFROM           | MMYYYY        | [optional] empty means no exclusion
PATSTATTO             | MMYYYY        | [optional] empty means no exclusion
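
The filters above could be evaluated along these lines. This is a Python sketch under assumptions: the function names, the dict keys, and the idea of comparing a per-record date and Patstat edition against the bounds are ours; the slide only defines the fields and the meaning of % and of empty values.

```python
from datetime import date


def edition_key(mmyyyy):
    """Turn a Patstat edition written as MMYYYY into a sortable (year, month)."""
    return int(mmyyyy[2:]), int(mmyyyy[:2])


def rule_applies(rule, authority, country, record_date, edition):
    """True when a record passes the endogenous-data filters of one rule."""
    def code_ok(pattern, value):
        return pattern == "%" or pattern == value    # % matches anything

    def in_range(value, low, high, key=lambda v: v):
        if low and key(value) < key(low):            # empty bound: no exclusion
            return False
        if high and key(value) > key(high):
            return False
        return True

    return (code_ok(rule["APPLICATION AUTHORITY"], authority)
            and code_ok(rule["COUNTRY CODE"], country)
            and in_range(record_date, rule.get("DATE FROM"), rule.get("DATE TO"))
            and in_range(edition, rule.get("PATSTATFROM"), rule.get("PATSTATTO"),
                         key=edition_key))


# Hypothetical record values; the rule has no date or edition bounds.
rule = {"APPLICATION AUTHORITY": "EP", "COUNTRY CODE": "US"}
print(rule_applies(rule, "EP", "US", date(2008, 4, 1), "042009"))
```

Editions are compared as (year, month) because MMYYYY strings do not sort chronologically.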

Interchange data structure proposal: match patterns (I)
Where not indicated, the meaning of the field is explained in previous slides.
ALIKE                | string  | (not called LIKE because that name may cause errors in some SQL dialects)
LIKE NOT [OPTIONAL]  | string  |
FIND                 | string  |
FIND2                | string  | used when a literal FIND does not work and a fixed-length pattern is needed
REPLACE              | string  |
POSFROM              | integer | start of the range of positions where the string can be
POSTO                | integer | end of the range of positions where the string can be
SQLSTANDARD          | string  |

Interchange data structure proposal: match patterns (II)
Note: some combinations of (POSFROM, POSTO) may have particular meanings:
(1, 1)        start position
(9999, 9999)  trailing position
(2, 9999)     everywhere but at the beginning
Finally, a field containing a description of the operation is needed:
DESCRIPTION  text
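
The match conditions can be tried out with the following Python sketch. Note the assumptions: Python's `re` module stands in for the SQL REGEXP operator, the function name `pattern_matches` is hypothetical, positions are 1-based as with SQL INSTR, and (9999, 9999) is read as "the field ends with FIND".

```python
import re


def pattern_matches(text, alike, like_not, find, posfrom, posto):
    """Check ALIKE / LIKE NOT and the POSFROM..POSTO window for one field."""
    if not re.search(alike, text):
        return False                         # inclusion pattern missing
    if like_not and re.search(like_not, text):
        return False                         # exclusion pattern present
    pos = text.find(find) + 1                # 1-based; 0 when FIND is absent
    if pos == 0:
        return False
    if (posfrom, posto) == (9999, 9999):
        return text.endswith(find)           # trailing position
    return posfrom <= pos <= posto


print(pattern_matches("WAGNER STRASSE 3 BIS 12",
                      "[0-9] BIS [0-9]", " BIS.+ BIS ", " BIS ", 2, 9999))
```

A real exchange format would still need the SQLSTANDARD field, since regular-expression syntax differs between SQL dialects and from Python.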

Some examples
(in the last column, values separated by ';' belong to IDs 99, 100 and 106 respectively)
ID                    | 1 | 2 | 99 ; 100 ; 106
OPERATIONKIND         | EEED | EEEE | BBBB
APPLICATION AUTHORITY | EP | % | %
COUNTRY CODE          | US | % | %
LIKE                  | PO BOX [0-9][0-9][0-9][0-9]% | ,,% | '[0-9] - [0-9]' ; '[0-9] BIS [0-9]' ; '[0-9] A [0-9]'
LIKE NOT              | | | ' -.+ - ' ; ' BIS.+ BIS ' ; ' A.+ A '
FIND                  | PO BOX | ,, | ' - ' ; ' BIS ' ; ' A '
FIND2                 | PO BOX #### | |
REPLACE               | | , | '-'
POSFROM               | 1 | 1 | 2 ; 2 ; 2
POSTO                 | 1 | 9999 | 9999
SQLSTANDARD           | MYSQL50
DATE FROM             | | |
DATE TO               | | |
PATSTATFROM           | | |
PATSTATTO             | | |
DESCRIPTION           | moves PO BOX from city to addr_other | removes double comma in city | these are different formats aiming to set multiple street numbers in address to the format #-#

Deep into one pattern (I)
Let's see how the query would work for one of the examples (#100, the highlighted one).
We suppose we have an intermediate table called address, whose fields are structured according to the metadata structure proposal (see above). Our patterns table is called corrections here.
We run it on a record with ADDRESS = WAGNER STRASSE 3 BIS 12

Deep into one pattern (II): WAGNER STRASSE 3 BIS 12 VS BIS
    update address a, corrections b
    set a.address = trim(concat(
            left(a.address, instr(a.address, b.find) - 1),
            b.replace,
            right(a.address, length(a.address) - length(b.find) - instr(a.address, b.find) + 1)))
        -- new address: trimmed concatenation of what was before the match (WAGNER STRASSE 3), the replacement (-) and what came after it (12)
    where b.operationkind = 'BBBB'
      and instr(a.address, b.find) >= b.posfrom    -- the match starts at position 2 or later
      and instr(a.address, b.find) <= b.posto      -- the match starts before position 9999
      and a.address regexp b.alike                 -- address contains reg. expr. '[0-9] BIS [0-9]'
      and a.address not regexp b.likenot           -- address does not contain ' BIS.+ BIS ', i.e. BIS twice
      and b.datefrom is null and b.dateto is null  -- no criteria on date
      and b.patstatfrom is null and b.patstatto is null;  -- no criteria on Patstat edition
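
The string arithmetic of the SET clause can be checked outside the database. The Python sketch below mirrors the LEFT / RIGHT / INSTR expression step by step; the function name `splice` is hypothetical, and Python slicing stands in for the SQL functions.

```python
def splice(address, find, replace):
    """Python equivalent of TRIM(CONCAT(LEFT(...), replace, RIGHT(...)))."""
    pos = address.find(find) + 1                   # SQL INSTR is 1-based
    if pos == 0:
        return address                             # FIND not present
    left = address[:pos - 1]                       # LEFT(address, INSTR - 1)
    n_right = len(address) - len(find) - pos + 1   # length passed to RIGHT
    right = address[len(address) - n_right:]       # RIGHT(address, n_right)
    return (left + replace + right).strip(" ")     # TRIM removes spaces


print(splice("WAGNER STRASSE 3 BIS 12", " BIS ", "-"))   # WAGNER STRASSE 3-12
```

Here ' BIS ' starts at position 17 of a 23-character string, so LEFT keeps the first 16 characters and RIGHT keeps the last 2.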

Open issues (I)
Some issues are still pending:
Define a standard address
Since cleaning patterns rely on backward logic, people sharing these data should have a common target in data standardization. It is proposed to use local post office standards, but such standards may be unavailable or may not fit.
Company names standardization
It may be possible to add company names, benefiting from national experience in standardizing the legal form (i.e. CO., LTD, GMBH…) of company names.
Automatic query generation
Users would greatly benefit from exchanging patterns if a query-generating tool could create SQL files from the pattern table.
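
As a rough idea of what such a query-generating tool could look like, the Python sketch below renders one UPDATE statement from one pattern row. Everything here is an assumption made for illustration: the function names, the lowercase column names, the `address` table name taken from the walkthrough, and the restriction to rules where all four OPERATIONKIND digits point to the same field. The generated SQL uses only standard MySQL string functions.

```python
COLUMNS = {"B": "address", "C": "locality", "D": "addr_other", "E": "city",
           "F": "county", "G": "region", "H": "state", "I": "zip_code"}


def sql_literal(value):
    """Quote a Python string as a MySQL string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_update(rule, table="address"):
    """Render an UPDATE for a rule whose four digits name the same field."""
    kind = rule["OPERATIONKIND"]
    if len(set(kind[:4])) != 1:
        raise ValueError("this sketch only handles single-field rules")
    col = COLUMNS[kind[0]]
    find, repl = sql_literal(rule["FIND"]), sql_literal(rule["REPLACE"])
    pos = f"INSTR({col}, {find})"
    where = [f"{pos} BETWEEN {int(rule['POSFROM'])} AND {int(rule['POSTO'])}",
             f"{col} REGEXP {sql_literal(rule['ALIKE'])}"]
    if rule.get("LIKE NOT"):
        where.append(f"{col} NOT REGEXP {sql_literal(rule['LIKE NOT'])}")
    return (f"UPDATE {table} SET {col} = TRIM(CONCAT("
            f"LEFT({col}, {pos} - 1), {repl}, "
            f"SUBSTRING({col}, {pos} + CHAR_LENGTH({find}))))\n"
            f"WHERE " + "\n  AND ".join(where) + ";")


rule_100 = {"OPERATIONKIND": "BBBB", "ALIKE": "[0-9] BIS [0-9]",
            "LIKE NOT": " BIS.+ BIS ", "FIND": " BIS ", "REPLACE": "-",
            "POSFROM": 2, "POSTO": 9999}
print(build_update(rule_100))
```

Inlining the pattern values, instead of joining against the corrections table, yields one self-contained statement per rule that can be written to an SQL file.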

Open issues (II)
High correlation & chronology
The quality and results of data cleaning may depend on the order in which the steps are run (e.g. if I do not remove PO BOX numbers from addresses before cleaning street numbers, I may get wrong results).
Above all, some patterns must be run recursively, and in some cases groups of patterns should be run recursively (e.g. MOVE PO BOX, CITY, ZIP from address and REMOVE COMMA: since I do not know the order in which the elements appear in ADDRESS, I should run the group of queries 4 times to be sure).
A partial solution may be to add fields indicating the ID of the previous query, the ID of the following query and the number of repetitions.
The issue remains open of how to manage groups of repetitions and cleaning patterns that need to loop until no match is found.
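
One possible way to handle "loop until no match is found" is to rerun a whole group of patterns until the value stops changing. The Python sketch below illustrates that idea only; the function name `run_until_stable`, the `max_rounds` safety limit and the sample step are our assumptions, not part of the proposal.

```python
def run_until_stable(value, steps, max_rounds=10):
    """Apply every step in order, repeating the group until nothing changes.

    Returns the final value and the number of rounds that were run.
    """
    for rounds in range(1, max_rounds + 1):
        before = value
        for step in steps:
            value = step(value)
        if value == before:
            return value, rounds
    raise RuntimeError("patterns did not converge")


def remove_double_comma(text):
    """One cleaning step: fix the first double comma only."""
    return text.replace(",,", ",", 1)


print(run_until_stable("X,,,Y", [remove_double_comma]))   # ('X,Y', 3)
```

The last round is always a no-op check, so a group that needs two effective passes reports three rounds; the limit guards against patterns that keep rewriting each other.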