1 The EP-INV-Patstat db and preliminary results
Andrea Maurino, DISCo - Dip. di Informatica, Sistemistica e Comunicazione, Università di Milano Bicocca, viale Sarca 336/14, Milano (Italy)
2 Index
- APE-INV project
- EP-INV-PatStat
- Feedback Web application
- Preliminary results
- Ongoing work
••• ITIS Lab •••
3 A preliminary truth
The world is dirty, and real-world data are dirty!
A mandatory preliminary task before carrying out any analysis or statistics is: clean your data.
4 Disambiguation of academic inventors: ESF-APE-INV
Project chair: Francesco Lissoni (uniBocconi)
Technical Manager: Andrea Maurino (uniMiB)
Project steps:
- Reclassification of all patents by inventor (INV)
- Matching between inventors and academic scientists (APE)
Results expected:
- To produce a freely available database of “Academic Patenting in Europe”
6 Which part of PatStat is affected by disambiguation?
Users should not use these tables: substitute tables with disambiguated inventors and inventor information are provided by the APE-INV project.
Source: PatStat documentation
7 INVENTORS_INFO
INVENTORS_INFO table fields:
- CODINV2
- NAME-SURNAME
- COUNTRY / GCOUNTRY
- STATE
- REGION / GREGION
- COUNTY / GCOUNTY
- CITY / GCITY
- STREET / GSTREET
- ZIP / GZIP
- LONGITUDE
- LATITUDE
- GACCURACY
Fields preceded by the letter G (e.g. GCITY) are the result of a Google-based standardization algorithm; all the other fields (e.g. CITY) are cleaned PatStat addresses.
We report Google information only when GACCURACY is greater than or equal to 6 (i.e. the address is available at street level).
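As a minimal sketch of how a user might apply the GACCURACY rule when reading INVENTORS_INFO rows, the snippet below prefers the Google-standardized field when GACCURACY is at least 6 and falls back to the cleaned PatStat field otherwise. The function names, the dictionary-based row layout and the sample values are assumptions made for illustration; they are not part of the APE-INV distribution.

```python
from typing import Dict, Optional

G_THRESHOLD = 6  # from the slide: Google fields are reported only when GACCURACY >= 6

# Fields that have both a cleaned PatStat version and a Google ("G") version.
PAIRED_FIELDS = ["COUNTRY", "REGION", "COUNTY", "CITY", "STREET", "ZIP"]


def google_fields_usable(row: Dict[str, Optional[str]]) -> bool:
    """Return True when GACCURACY is present and at least the threshold."""
    try:
        return int(row.get("GACCURACY")) >= G_THRESHOLD
    except (TypeError, ValueError):
        # Missing or non-numeric GACCURACY: treat Google fields as unavailable.
        return False


def preferred_address(row: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Pick the G-field when usable and non-empty, else the cleaned PatStat field."""
    use_google = google_fields_usable(row)
    result = {}
    for field in PAIRED_FIELDS:
        g_value = row.get("G" + field) if use_google else None
        result[field] = g_value or row.get(field)
    return result


# Hypothetical rows, for illustration only.
row_street_level = {"CITY": "MILAN", "GCITY": "Milano", "GACCURACY": "8"}
row_coarse = {"CITY": "MILAN", "GCITY": "", "GACCURACY": "4"}
```

With these hypothetical rows, `preferred_address(row_street_level)["CITY"]` yields the Google value and `preferred_address(row_coarse)["CITY"]` yields the cleaned PatStat value.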
8 From APE-INV to PatStat: PATSTAT_PUBL_NR and PATSTAT_APPL_ID
To connect the DISAMBIGUATION and INVENTORS_INFO tables with the PatStat dataset, the repository includes two further tables:
- PATSTAT_PUBL_NR links each inventor (as identified by the CODINV2 code in the APE-INV dataset) to her granted patents (PUBLN_NR).
- PATSTAT_APPL_ID identifies the APPLN_ID corresponding to each PUBLN_NR (NB: in the specific case of EP patents there is a one-to-one correspondence between APPLN_ID and PUBLN_NR). The table also reports the PatStat edition the APPLN_ID refers to.

PATSTAT_PUBL_NR
CODINV2 | PUBLN_NR
100     | 1
101     | 2
102     | 3
115     | 4

PATSTAT_APPL_ID
PUBLN_NR | APPLN_ID | PEDITION
1542011263748
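The two-step link described on this slide (CODINV2 to PUBLN_NR, then PUBLN_NR to APPLN_ID) can be sketched as a small in-memory join. This is only an illustration under stated assumptions: the tuple layout, the function name `applications_for_inventor`, and all APPLN_ID and PEDITION values are made up and do not come from the APE-INV files.

```python
from typing import List, Tuple

# (CODINV2, PUBLN_NR) rows; values are illustrative only.
PATSTAT_PUBL_NR: List[Tuple[int, int]] = [(100, 1), (101, 2), (102, 3), (115, 4)]

# (PUBLN_NR, APPLN_ID, PEDITION) rows; APPLN_ID and PEDITION are invented placeholders.
PATSTAT_APPL_ID: List[Tuple[int, int, str]] = [
    (1, 9001, "edition-A"),
    (2, 9002, "edition-A"),
    (3, 9003, "edition-B"),
    (4, 9004, "edition-B"),
]


def applications_for_inventor(
    codinv2: int,
    publ_rows: List[Tuple[int, int]],
    appl_rows: List[Tuple[int, int, str]],
) -> List[Tuple[int, int, str]]:
    """Return (PUBLN_NR, APPLN_ID, PEDITION) for every patent linked to one CODINV2."""
    # For EP patents the slide states a one-to-one PUBLN_NR/APPLN_ID correspondence,
    # so a plain dictionary keyed by PUBLN_NR is enough for this sketch.
    appl_by_publ = {publn_nr: (appln_id, pedition) for publn_nr, appln_id, pedition in appl_rows}
    result = []
    for row_codinv2, publn_nr in publ_rows:
        if row_codinv2 == codinv2 and publn_nr in appl_by_publ:
            appln_id, pedition = appl_by_publ[publn_nr]
            result.append((publn_nr, appln_id, pedition))
    return result
```

Calling `applications_for_inventor(101, PATSTAT_PUBL_NR, PATSTAT_APPL_ID)` returns the application rows reachable from CODINV2 101; an unknown CODINV2 gives an empty list.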
9 DISAMBIGUATION.txt
DISAMBIGUATION table:
- CODINV2: a stable key generated within the APE-INV project. It uniquely identifies each distinct combination of inventor and address.
- CODINV: a code associated with each CODINV2 after applying the disambiguation procedure. If two or more distinct CODINV2s are found to be the same person, they are assigned the same CODINV.

CODINV | CODINV2
1      | 100
1      | 101
2      | 102
3      | 115

Say here that in the future we hope to move from the CODINV and CODINV2 codes (which are idiosyncratic to us) to PERSON_ID, if PatStat manages to create a stable one. Also say that for patents after 2000 we do have a CODINV2-PERSON_ID conversion table downloadable from the website, at the address:
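As an illustration of the CODINV / CODINV2 relationship defined on this slide, the sketch below groups CODINV2 keys by the person code CODINV and checks whether two CODINV2 keys were assigned to the same person. The row layout, the function names and the sample values are assumptions for illustration, not the actual format of DISAMBIGUATION.txt.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (CODINV, CODINV2) rows; illustrative values only.
DISAMBIGUATION: List[Tuple[int, int]] = [(1, 100), (1, 101), (2, 102), (3, 115)]


def records_by_person(rows: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each CODINV (disambiguated person) to its CODINV2 keys."""
    groups = defaultdict(list)
    for codinv, codinv2 in rows:
        groups[codinv].append(codinv2)
    return dict(groups)


def same_person(rows: List[Tuple[int, int]], codinv2_a: int, codinv2_b: int) -> bool:
    """True when both CODINV2 keys are known and share the same CODINV."""
    person_of = {codinv2: codinv for codinv, codinv2 in rows}
    if codinv2_a not in person_of or codinv2_b not in person_of:
        return False
    return person_of[codinv2_a] == person_of[codinv2_b]
```

In this hypothetical data, CODINV2 100 and 101 share CODINV 1, so `same_person(DISAMBIGUATION, 100, 101)` is true, while 100 and 102 belong to different persons.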
11 Why sharing data
Instead of looking for one golden algorithm, APE-INV proposes data dissemination and the recording of users’ feedback.
Two kinds of users:
- Take the data and run (dissemination only): they use the data in their studies uncritically. No benefit for the project, and risky for them (data are disambiguated according to the state of the art of disambiguation techniques, but we can always do better).
- Critical users (dissemination + feedback): they use the data, usually sub-samples of the whole dataset, and have the possibility to increase the data quality through:
  - Hand-checked data and survey work on smaller samples
  - Algorithms that better fit sub-sample specificities (e.g. country, firm, technological field)
  - Data sources external to PatStat that help the disambiguation effort
12 How does data dissemination work?
- Access with id and password
- Choose the country or countries of the inventors you need (e.g. my research is on Italian inventors)
- Get the EP-INV dataset and the CONTROVERSY.txt
- Query results are in txt format.
18 Temporal record linkage
“Panta rei” (Heraclitus): everything flows, everything is constantly changing.
Databases may keep track of these never-ending changes.
Examples:
- People change names: Xin Dong → Xin Luna Dong
- People change jobs: Halevy moves from Univ. of Wa. to Google
- Nations change: YUGOSLAVIA → Serbia-Montenegro → Serbia → Kosovo
Based on the paper: P. Li, X. L. Dong, A. Maurino, D. Srivastava, “Linking Temporal Records”, VLDB 2011
19 An example
person_id | person_name       | person_address                             | appln_filing_date
110670    | ABELE, MANLIO, G. | 5 EAST 22ND STREET;NEW YORK, NY 10016      | 18/10/1990, 06/04/1992
110671    |                   | 5 EAST 22ND STREET, 205;NEW YORK, NY 10010 | 20/02/1991
110672    | ABELE, Manlio, G. | 5 East 22nd Street, 205,New York, NY 10010 |
110674    |                   | 5 East 22nd Street,New York, NY 10016      | 12/04/1995, 12/03/1996
110675    | Abele, Manlio     | 250 East 54th St.,New York, NY 10022       | 19/03/2004
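The example suggests why time matters when linking records: an address mismatch between two filings a few months apart is stronger evidence of two different people than the same mismatch between filings more than a decade apart. The snippet below is a minimal sketch of that intuition only. It is not the method of the cited paper; the exponential form, the 5-year half-life, the crude normalizations and all function names are assumptions chosen for illustration.

```python
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Parse a dd/mm/yyyy date as written in the example table."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def name_key(name: str) -> tuple:
    """Surname and first name, lowercased; later parts (e.g. initials) are ignored."""
    parts = [p.strip(" .").lower() for p in name.split(",")]
    return tuple(parts[:2])


def address_key(address: str) -> str:
    """Crude normalization: lowercase and keep only letters and digits."""
    return "".join(ch for ch in address.lower() if ch.isalnum())


def disagreement_weight(gap_years: float, half_life_years: float = 5.0) -> float:
    """How much an address mismatch counts; halves every half_life_years (assumed)."""
    return 0.5 ** (gap_years / half_life_years)


def match_score(rec_a: dict, rec_b: dict, half_life_years: float = 5.0) -> float:
    """Score in [0, 1]; higher means the two records more plausibly share a person."""
    if name_key(rec_a["name"]) != name_key(rec_b["name"]):
        return 0.0
    if address_key(rec_a["address"]) == address_key(rec_b["address"]):
        return 1.0
    gap_days = abs((parse_date(rec_a["date"]) - parse_date(rec_b["date"])).days)
    # Different addresses: the penalty shrinks as the time gap grows.
    return 1.0 - disagreement_weight(gap_days / 365.25, half_life_years)


# Two records taken from the example table.
rec_1990 = {"name": "ABELE, MANLIO, G.",
            "address": "5 EAST 22ND STREET;NEW YORK, NY 10016", "date": "18/10/1990"}
rec_2004 = {"name": "Abele, Manlio",
            "address": "250 East 54th St.,New York, NY 10022", "date": "19/03/2004"}
# Hypothetical record (not in the table): same new address, but filed in 1991.
rec_hypothetical = dict(rec_2004, date="20/02/1991")
```

Under these assumptions, `match_score(rec_1990, rec_2004)` is much higher than `match_score(rec_1990, rec_hypothetical)`, because the same address change is far more plausible over thirteen years than over a few months.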