Name matching for PATSTAT data Gianluca Tarasconi KITeS Database Administrator 1 Website:

1 Name matching for PATSTAT data
Gianluca Tarasconi, KITeS Database Administrator
Website:

2 KITeS: Knowledge, Internationalization and Technology Studies
KITeS's mission is to understand the relationship between innovation, technology management, firms' competitiveness and economic growth in the global economy. KITeS research aims to be rigorous, relevant and interdisciplinary. It focuses on three main areas: innovation, technology management and trade.

3 KITeS: The centre … who's who
KITeS was founded in 2008, building upon the experience of research centres such as CESPRI and CRITOM. It is hosted at Bocconi University.
KITeS is an inter-departmental research centre, integrating researchers from the Economics Department, the Management Department and the Institutional Analysis Department.
KITeS researchers hold doctoral degrees from Yale, Stanford, London School of Economics, Bocconi, Manchester, Leuven, Sussex, Maastricht, and others.
Patent statistics have been widely used at KITeS for many years, dating back to CESPRI's early research in industrial dynamics. This tradition has led to the cumulative creation and updating of a large database, known as EP-CESPRI. The inventor data used so far are organized in a sub-section of that database, known as EP-INV.

4 The EP-CESPRI Database (i)
The EP-CESPRI database contains information on patents applied for at the European Patent Office (EPO) from 1978 to October 2009.
The database was first created from information downloaded regularly from EPO Bulletins. Since October 2007 it has been based on the applications that EPO publishes regularly in PATSTAT; at present it contains about 2,090,000 patent applications.
A beta version for USPTO was released in 2009, and a version for SIPO (the Chinese patent office) is planned for 2010.

5 The EP-CESPRI Database (ii)
EP-CESPRI data fall into three broad categories:
1. Patent data, such as the patent's publication number, its priority/application date, and its main/secondary technological class (IPC, 12 digits).
2. Applicant data, such as a unique code assigned by KITeS to each applicant after cleaning the applicants' data, plus the applicant's name and address.
3. Inventor data, such as name, surname, address and a unique code (CODINV) assigned by KITeS to all inventors found to be the same person. This section of EP-CESPRI is also known as EP-INV, and it is the one of major interest to today's seminar.

6 EP-INV: From raw data to structured data
Data coming from PATSTAT are cleaned, standardized and restructured (result: CODINV2 code).
Subsequently, a similarity score is calculated for pairs of inventors who have the same name and surname but different addresses (result: CODINV code).

7 Standardization of inventors' names and addresses
Original EPO data on inventors come from PATSTAT table TLS206_ASCII, where data are only partially parsed for names, address, city and zip codes. Further steps are as follows:
1. Cleaning of address data (CODINV2 codes)
2. Cleaning of names (CODINV2 codes)
3. Computation of similarity scores (CODINV codes)

8 Cleaning of address data
Parsed data are given a unique code (CODINV2) and iteratively cleaned by:
- shifting information contained in the wrong fields (such as zip code, county…);
- standardizing city names or parts of names (e.g. Saint is turned into St.);
- fixing mistakes in zip codes, according to national post office tables.
In the 10/2007 data there were 2,381,991 CODINV2 codes in the EP-INV database out of 3,278,486 PATSTAT person_id values (28% fewer).

9 Example of city cleaning

| Step | City | Zip |
|---|---|---|
| Original | DDR-4203 Bad Dürrenberg | |
| Zip parsed | Bad Dürrenberg | 4203 |
| City cleaned | BAD DURRENBERG | 4203 |
| Zip lookup | BAD DURRENBERG | 06231 |
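
The cleaning steps in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original work was done in MySQL with lookup tables, and the names `ZIP_LOOKUP`, `CITY_ABBREVIATIONS`, `parse_city_zip`, `clean_city` and `standardize` are hypothetical, as are the regular expression and the one-row lookup table.

```python
import re
import unicodedata

# Hypothetical stand-in for a national post office table: (city, old zip) -> current zip.
ZIP_LOOKUP = {("BAD DURRENBERG", "4203"): "06231"}

# Hypothetical abbreviation rules; the slides mention only Saint -> St.
CITY_ABBREVIATIONS = {"SAINT": "ST."}


def parse_city_zip(raw):
    """Split a raw field such as 'DDR-4203 Bad Dürrenberg' into (city, zip)."""
    match = re.match(r"^(?:[A-Z]{1,3}-)?(\d{4,5})\s+(.+)$", raw.strip())
    if match:
        return match.group(2), match.group(1)
    return raw.strip(), ""  # no leading zip found: leave the field as a city


def clean_city(city):
    """Uppercase the city, strip diacritics and apply the abbreviation rules."""
    decomposed = unicodedata.normalize("NFKD", city)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = [CITY_ABBREVIATIONS.get(word, word) for word in plain.upper().split()]
    return " ".join(words)


def standardize(raw):
    """Run parse -> clean -> zip lookup on one raw city/zip field."""
    city, zip_code = parse_city_zip(raw)
    city = clean_city(city)
    zip_code = ZIP_LOOKUP.get((city, zip_code), zip_code)
    return city, zip_code


print(standardize("DDR-4203 Bad Dürrenberg"))  # ('BAD DURRENBERG', '06231')
```

A real lookup table would cover every country's postal codes; the single entry here only reproduces the Bad Dürrenberg row.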

10 Cleaning of names
The name+surname field was parsed into the following fields: first, second and third name, extension (e.g. Jr, Sr, III), surname, and academic title (e.g. Dr., Prof., Ing., …).
This operation was mainly based on two iterative steps:
- Pairs of inventors with the same address and equal first name, surname, extension and initial of second or third name are corrected for the third name (e.g. Rossi Giovanni Paolo is turned into Rossi Giovanni P.).
- Pairs of inventor records where two out of the three fields city, address and name are the same, and the remaining one has a low edit distance (Levenshtein/alphanumeric), are updated with the data of the inventor who has the higher number of patents.
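
The second step relies on edit distance. As a rough sketch, assuming a plain Levenshtein distance and a hypothetical cut-off of 2 edits (the slides do not state the threshold actually used), the "two fields equal, one field close" test could look like this. The function names and the record layout are illustrative only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


FIELDS = ("name", "address", "city")


def near_duplicates(rec_a, rec_b, max_distance=2):
    """True when exactly one of the three fields differs, and only by a few edits.

    max_distance=2 is an assumed value, not taken from the slides.
    """
    differing = [f for f in FIELDS if rec_a[f] != rec_b[f]]
    if len(differing) != 1:
        return False
    field = differing[0]
    return levenshtein(rec_a[field], rec_b[field]) <= max_distance


a = {"name": "TARASCONI GIANLUCA", "address": "VIA MASPERO 24", "city": "MILAN"}
b = {"name": "TARASCONI GIANLUCA", "address": "VIA MASPERO 24", "city": "MILANO"}
print(near_duplicates(a, b))  # True: only the city differs, by one edit
```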

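11 An example

Before cleaning:

| Name | Address | City | Zip | codinv2 |
|---|---|---|---|---|
| Tarasconi, Gianluca | Via P. Maspero, 24 | Milan | | 1 |
| Tarasconi, Gianluca | Via Maspero, 24 | IT-20137 Milan | | 2 |
| Tarasconi, G. | c/o university bocconi | Milano | 20136 | 3 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 4 |
| Tarasconi, Gianluca | 35, Via Tertulliano | Milan | | 5 |

After cleaning:

| Name | Address | City | Zip | codinv2 |
|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 |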

12 Further info on cleaning names and addresses
- Cleaning of names and addresses was implemented in MySQL.
- The SQL code is based on 25 lookup tables and 950 recursive queries.
- The aggregation algorithm was quite conservative (to allow new entries to be quickly linked).

13 Computation of similarity score
Inventor data are restructured to distinguish the person (CODINV) from the person@location (CODINV2).
All inventor records that share name and surname but differ in anything else are compared in pairs by the Massacrator SQL routine.
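
The pairwise comparison starts from the set of candidate pairs. The sketch below is a Python illustration, not the Massacrator SQL routine itself: it groups person@location records by name and surname and yields every pair within a group. The function name `candidate_pairs` and the record layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Yield pairs of codinv2 values whose records share the same name and surname."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[(rec["surname"], rec["name"])].append(rec["codinv2"])
    for codes in by_name.values():
        # Every unordered pair inside one name group is a candidate for scoring.
        for pair in combinations(sorted(codes), 2):
            yield pair


records = [
    {"surname": "TARASCONI", "name": "GIANLUCA", "codinv2": 1},
    {"surname": "TARASCONI", "name": "GIANLUCA", "codinv2": 3},
    {"surname": "TARASCONI", "name": "GIANLUCA", "codinv2": 5},
    {"surname": "ROSSI", "name": "GIOVANNI", "codinv2": 7},
]
print(list(candidate_pairs(records)))  # [(1, 3), (1, 5), (3, 5)]
```

Grouping first keeps the number of comparisons proportional to the size of each name group rather than to the whole table.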

14 Introduction of CODINV

| Name | Address | City | Zip | codinv2 | Codinv |
|---|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 | 2 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 | 3 |

15 Similarity score
Computation of the similarity score draws on:
- Workplace: same applicant/company/group
- Social networks: coinventors in common, 3 degrees of distance in coinventorship
- Toponymic permanence: same address, town, county…
- Citation linkages: (self-)citing or cited
- Time lag: how long since the last patent?
- IPC: patenting in the same tech fields

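16 Scores by category

| Category | Criterion | Score |
|---|---|---|
| Workplace | Same applicant | 5 |
| Workplace | Same applicant (the applicant has <50 inventors) | 5 |
| Workplace | Same group (if available) | 5 |
| IPC | Same IPC code (4 digits) | 5 |
| IPC | Same IPC code (6 digits) | 5 |
| IPC | Same IPC code (12 digits) | 10 |
| Toponymic permanence | Same city | 5 |
| Toponymic permanence | Same province | 5 |
| Toponymic permanence | Same region | 5 |
| Toponymic permanence | Same state (US) | 5 |
| Toponymic permanence | Same address (in different cities; it may indicate misspellings in the city field) | 5 |
| Time lag | Priority dates differ by >20 years | -5 |
| Citation linkages | Inventor 1 cites inventor 2 | 5 |
| Citation linkages | Inventor 1 is cited by inventor 2 | 5 |
| Social networks | Same coinventor | 10 |
| Social networks | 3 degrees of separation | 10 |
| Other | Widespread surname | -5 |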
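
A minimal Python sketch of how such a score table could be applied, assuming that the total score of a pair is simply the sum of the points of every criterion it meets. The criterion keys and the function `similarity_score` are hypothetical names; whether nested criteria (for example IPC at 4, 6 and 12 digits) are meant to accumulate is also an assumption.

```python
# Points per criterion, copied from the slide; the keys are illustrative names.
SCORES = {
    "same_applicant": 5,
    "same_applicant_small": 5,   # the applicant has <50 inventors
    "same_group": 5,
    "same_ipc_4": 5,
    "same_ipc_6": 5,
    "same_ipc_12": 10,
    "same_city": 5,
    "same_province": 5,
    "same_region": 5,
    "same_state_us": 5,
    "same_address": 5,
    "priority_gap_over_20y": -5,
    "cites": 5,
    "cited_by": 5,
    "same_coinventor": 10,
    "three_degrees": 10,
    "widespread_surname": -5,
}


def similarity_score(matched):
    """Sum the points of the criteria met by a pair (assumed additive)."""
    criteria = set(matched)
    unknown = criteria - set(SCORES)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(SCORES[c] for c in criteria)


print(similarity_score(["same_applicant", "same_city", "same_ipc_4"]))  # 15
print(similarity_score(["same_city", "widespread_surname"]))            # 0
```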

17 Update of CODINV using similarity score

| Name | Address | City | Zip | codinv2 | Codinv |
|---|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 | 2 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 | 3 |

Updated codinv values: 1, 1, 3 and then 1, 1, 1.

Intuitively, high similarity scores can be taken as an indication of a high probability that the two inventors in a pair are the same person.
Whenever two inventors in a pair are found to be the same person, the lowest CODINV code is assigned to both.
The algorithm should be run recursively.
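
The "lowest CODINV wins, repeat until stable" rule can be sketched with a union-find structure. This is a substitute technique chosen for illustration, not the recursive SQL used in the original work: it reaches the same end state in a single pass. The function `merge_codinv` and the threshold default of 15 (the value quoted for the KEINS data) are assumptions of this sketch.

```python
def merge_codinv(codes, scored_pairs, threshold=15):
    """Return {original codinv: merged codinv}, keeping the lowest code of each group."""
    parent = {c: c for c in codes}

    def find(c):
        # Follow parent links up to the group representative, shortening the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                low, high = sorted((root_a, root_b))
                parent[high] = low  # the lowest code becomes the representative

    return {c: find(c) for c in codes}


# Only the pair (1, 2) passes the threshold: codes become 1, 1, 3.
print(merge_codinv([1, 2, 3], [(1, 2, 20), (2, 3, 10)]))  # {1: 1, 2: 1, 3: 3}
# Both pairs pass: all three records end up with code 1.
print(merge_codinv([1, 2, 3], [(2, 3, 20), (1, 2, 15)]))  # {1: 1, 2: 1, 3: 1}
```

Because the representative of each group is always its smallest code, chains such as 3 -> 2 -> 1 collapse to 1 without re-running the comparison.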

18 Finding a threshold value (I)
Manual checking of EP-INV records suggests that a large number of paired inventors with a total score higher than 20 are indeed the same person.
Percentages vary across countries, largely because of the different distribution of frequent surnames. Therefore, no automatic reassignment of CODINV codes has been performed so far.
In the KEINS research, data have been extensively checked for IT, FR and SE; the threshold value of the similarity score was set at 15 (the median value): inventors in pairs with score >= 15 are presumed to be the same person and are assigned the same CODINV code.

19 Finding a threshold value (II)
Manual checking suggests that:
- no Type 2 errors (false positives) are introduced with this choice, i.e. no pair of inventors is erroneously assigned the same CODINV code;
- several Type 1 errors (false negatives) remain, i.e. pairs of inventors who are indeed the same person have scores <15 and are not given the same CODINV code.
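
Given a manually checked sample, the two kinds of error at a given threshold can be counted as follows. This is a small Python sketch; the function `error_counts` is a hypothetical name and the sample data are invented purely to show the mechanics, not taken from EP-INV.

```python
def error_counts(checked_pairs, threshold=15):
    """Count (false positives, false negatives) for manually checked pairs.

    checked_pairs holds (score, same_person) tuples, where same_person is
    the outcome of the manual check.
    """
    pairs = list(checked_pairs)
    false_positives = sum(1 for score, same in pairs if score >= threshold and not same)
    false_negatives = sum(1 for score, same in pairs if score < threshold and same)
    return false_positives, false_negatives


# Invented sample: one true match falls below the threshold.
sample = [(25, True), (15, True), (10, True), (5, False), (0, False)]
print(error_counts(sample))                # (0, 1)
print(error_counts(sample, threshold=5))   # (1, 0)
```

Lowering the threshold trades false negatives for false positives, which is why the choice is checked manually country by country.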

20 Applying Massacrator to all EPO (I)
As of 10/2007 we get 2,672,671 pairs out of 2,363,501 inventors.
The mode is 0 pts (764,946 pairs), but 758,471 pairs have >= 15 pts.

21 Applying Massacrator to all EPO (II)
16.78% of pairs are >= 20 pts
22.72% of pairs are >= 15 pts

22 Applying Massacrator to all EPO (III)
A rough version of the algorithm, used as a proxy for the possible reductions, could treat two inventors as the same person when any of the following holds:
- same IPC (12 digits), OR
- same applicant, OR
- same address, OR
- 3 degrees of distance, OR
- 1 coinventor in common, OR
- citation linkage, OR
- same IPC (6 digits) and same country.
This compresses 571,970 CODINVs out of 2,363,501 (-24%).
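
Written as a boolean rule, the rough version might look like the Python sketch below. The dictionary keys and function names are hypothetical, and the sketch assumes that "same IPC (6 digits) and same country" is a single combined condition, with AND binding tighter than OR.

```python
def raw_same_person(link):
    """Rough OR rule: any single strong link is enough to merge a pair."""
    return bool(
        link.get("same_ipc_12")
        or link.get("same_applicant")
        or link.get("same_address")
        or link.get("three_degrees")
        or link.get("coinventor_in_common")
        or link.get("citation_linkage")
        # Assumed grouping: the 6-digit IPC match only counts within the same country.
        or (link.get("same_ipc_6") and link.get("same_country"))
    )


def compression_rate(merged, total):
    """Share of CODINV codes removed by merging, as a percentage."""
    return 100 * merged / total


print(raw_same_person({"same_ipc_6": True}))                        # False
print(raw_same_person({"same_ipc_6": True, "same_country": True}))  # True
print(round(compression_rate(571970, 2363501)))                     # 24
```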

23 Some publications using the EP-INV data
- Lissoni F., Llerena P., McKelvey M., Sanditov B. Academic Patenting in Europe: New Evidence from the KEINS Database. Research Evaluation, 17(2), pp. 87-102.
- Bacchiocchi E., Montobbio F. (2009). Knowledge Diffusion from University and Public Research. A Comparison between US, Japan and Europe using Patent Citations. Journal of Technology Transfer, 34(2), pp. 169-181.
- Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management, 5(2), pp. 91-109.
- Corrocher N., Malerba F., Montobbio F. (2007). Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy, 36, pp. 418-432.
- Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity of Academic Inventors: New Evidence from Italian Data. Economics of Innovation and New Technology, 16(2), pp. 101-118.
- Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attività brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, 2, pp. 43-70.
- Montobbio F. (2008). Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC), "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming.

24 Future uses of the algorithm (I)
- Cross patent-office match: is J. Smith at the EPO the same person as J. Smith at the USPTO?
- Decompression: where toponymic data are scarce (USPTO data, for instance), mere data cleaning would group inventors who are not the same person; the algorithm could help to avoid Type 2 errors.

25 Future uses of the algorithm (II)
- Companies match: identify applicants with similar company names as the same entity.
- NPL match: help to deduplicate authors/affiliations.
