The EP-INV-Patstat db and preliminary results

Presentation transcript:

The EP-INV-Patstat db and preliminary results
Andrea Maurino
DISCo - Dip. di Informatica, Sistemistica e Comunicazione
Università di Milano Bicocca
viale Sarca 336/14, 20124, Milano (Italy)

Index
  APE-INV project
  EP-INV-PatStat
  Feedback web application
  Preliminary results
  Ongoing work
••• ITIS Lab ••• http://www.itis.disco.unimib.it

A preliminary truth
The world is dirty, and real-world data are dirty!
A mandatory preliminary task before carrying out any analysis or statistics: clean your data.

Disambiguation of academic inventors: ESF-APE-INV
www.academicpatenting.eu
Project chair: Francesco Lissoni (uniBocconi)
Technical manager: Andrea Maurino (uniMiB)
Project steps:
  Reclassification of all patents by inventor (INV)
  Matching between inventors and academic scientists (APE)
Expected result: a freely available database of “Academic Patenting in Europe”

EP-INV-PatStat
Tables: PATSTAT_PUBL_NR, PATSTAT_APPL_ID, INVENTORS_INFO, DISAMBIGUATION

Which part of PatStat is affected by disambiguation?
Users should not consider these tables: substitute tables with disambiguated inventors and inventor information are provided by the APE-INV project.
Source: PatStat documentation

INVENTORS_INFO table
Fields:
  CODINV2
  NAME-SURNAME
  COUNTRY / GCOUNTRY
  STATE
  REGION / GREGION
  COUNTY / GCOUNTY
  CITY / GCITY
  STREET / GSTREET
  ZIP / GZIP
  LONGITUDE
  LATITUDE
  GACCURACY
Fields preceded by the letter G (e.g. GCITY) are the result of a Google-based standardization algorithm; all the other fields (e.g. CITY) are cleaned PatStat addresses.
We report Google information only when GACCURACY is greater than or equal to 6 (i.e. the address is available at street level).
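The reporting rule for the Google-standardized fields can be sketched in Python as follows. This is only an illustration: the dict-based record layout, the sample values, and the helper name `google_fields` are assumptions and not part of the APE-INV distribution. Only the field names and the GACCURACY threshold of 6 come from the slide.

```python
# Hypothetical sketch: expose the G-prefixed (Google-standardized) fields only
# when GACCURACY >= 6, as stated on the slide. The sample records are invented.
G_FIELDS = ["GCOUNTRY", "GREGION", "GCOUNTY", "GCITY", "GSTREET", "GZIP"]
MIN_GACCURACY = 6  # threshold from the slide (street-level address)


def google_fields(record):
    """Return the G-prefixed fields, each set to None when accuracy is too low."""
    accuracy = record.get("GACCURACY")
    accurate = accuracy is not None and accuracy >= MIN_GACCURACY
    return {f: (record.get(f) if accurate else None) for f in G_FIELDS}


records = [
    {"CODINV2": 100, "CITY": "MILANO", "GCITY": "Milan", "GACCURACY": 8},
    {"CODINV2": 101, "CITY": "MILAN", "GCITY": "Milan", "GACCURACY": 4},
]

# Map each CODINV2 to the GCITY value that would be reported for it.
reported = {r["CODINV2"]: google_fields(r)["GCITY"] for r in records}
```

In this sketch the cleaned PatStat fields (such as CITY) are always kept, and only the Google-derived values are suppressed below the threshold.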

From APE-INV to PatStat: PATSTAT_PUBL_NR and PATSTAT_APPL_ID
To connect the DISAMBIGUATION and INVENTORS_INFO tables with the PatStat dataset, the repository includes two further tables:
  PATSTAT_PUBL_NR links each inventor (as identified by the CODINV2 code in the APE-INV dataset) to her granted patents (PUBLN_NR).
  PATSTAT_APPL_ID identifies the APPLN_ID corresponding to each PUBLN_NR (NB: in the specific case of EP patents there is a one-to-one correspondence between APPLN_ID and PUBLN_NR). The table also reports the PatStat edition the APPLN_ID refers to.
PATSTAT_PUBL_NR
  CODINV2  PUBLN_NR
  100      1
  101      2
  102      3
  115      4
PATSTAT_APPL_ID
  PUBLN_NR  APPLN_ID  PEDITION
  1         5         42011
  2         6
  3         7
  4         8
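The linkage from inventors to PatStat applications can be sketched as a two-table join. The example below uses Python's standard `sqlite3` module with an in-memory database holding the sample rows from the slide. Storing the tables in SQLite, the integer column types, and the omission of the PEDITION column are assumptions made for brevity; the distributed files may be organised differently.

```python
import sqlite3

# Illustrative only: in-memory tables filled with the sample rows from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PATSTAT_PUBL_NR (CODINV2 INTEGER, PUBLN_NR INTEGER);
    CREATE TABLE PATSTAT_APPL_ID (PUBLN_NR INTEGER, APPLN_ID INTEGER);
    INSERT INTO PATSTAT_PUBL_NR VALUES (100, 1), (101, 2), (102, 3), (115, 4);
    INSERT INTO PATSTAT_APPL_ID VALUES (1, 5), (2, 6), (3, 7), (4, 8);
""")

# For each inventor code, follow PUBLN_NR to the corresponding APPLN_ID.
rows = conn.execute("""
    SELECT p.CODINV2, p.PUBLN_NR, a.APPLN_ID
    FROM PATSTAT_PUBL_NR AS p
    JOIN PATSTAT_APPL_ID AS a ON a.PUBLN_NR = p.PUBLN_NR
    ORDER BY p.CODINV2
""").fetchall()
conn.close()
```

The resulting APPLN_ID values are what would then be matched against the application tables of the relevant PatStat edition.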

DISAMBIGUATION.txt: the DISAMBIGUATION table
  CODINV2: a stable key generated within the APE-INV project. It uniquely identifies each distinct combination of inventor and address.
  CODINV: a code associated with each CODINV2 after applying the disambiguation procedure. If two or more distinct CODINV2s are found to be the same person, they are assigned the same CODINV.
  CODINV  CODINV2
  1       100
  1       101
  2       102
  3       115
Mention here that in the future we hope to move from the CODINV and CODINV2 codes (which are idiosyncratic to our project) to PERSON_ID, if PatStat manages to create a stable one. Also mention that for patents filed after 2000 we already have a CODINV2-PERSON_ID conversion table, downloadable from the website at: http://www.esf-ape-inv.eu/download/personidcodinv2.csv
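As a rough illustration of how the DISAMBIGUATION table is meant to be used, the sketch below collapses CODINV2 codes into CODINV persons and collects each person's patents. It uses the sample rows shown on the slides; holding the rows as Python tuples and the variable names are assumptions made for the example.

```python
from collections import defaultdict

# Sample rows from the slides: (CODINV, CODINV2) and (CODINV2, PUBLN_NR).
disambiguation = [(1, 100), (1, 101), (2, 102), (3, 115)]
patstat_publ_nr = [(100, 1), (101, 2), (102, 3), (115, 4)]

# Each CODINV2 (inventor + address combination) maps to exactly one CODINV (person).
codinv_of = {codinv2: codinv for codinv, codinv2 in disambiguation}

# Collect the patents of every disambiguated person.
patents = defaultdict(set)
for codinv2, publn_nr in patstat_publ_nr:
    patents[codinv_of[codinv2]].add(publn_nr)

patents_by_person = {codinv: sorted(nrs) for codinv, nrs in patents.items()}
```

Here CODINV2 codes 100 and 101 share CODINV 1, so their two patents end up attributed to the same person.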

Feedback web application

Why share data?
Instead of looking for one golden algorithm, APE-INV proposes data dissemination and the recording of users’ feedback.
Two kinds of users:
  Take the data and run (dissemination only): they use the data in their studies uncritically. No benefit for the project, and risky for them (the data are disambiguated according to state-of-the-art disambiguation techniques, but we can always do better).
  Critical users (dissemination + feedback): they use the data, usually sub-samples of the whole dataset, and can improve data quality through:
    Hand-checked data and survey work on smaller samples
    Algorithms that better fit sub-sample specificities (e.g. country, firm, technological field)
    Data sources external to PatStat that help the disambiguation effort

How does data dissemination work?
  1. Access http://www.ape-inv.disco.unimib.it/ with id and password
  2. Choose the country or countries of the inventors you need (e.g. my research is on Italian inventors)
  3. Get the EP-INV dataset and the CONTROVERSY.txt file
Query results are in txt format.

Some results

Number of academic patents, 1996-2006

Ownership distribution of academic patents, lower bound estimates

Ownership distribution of academic patents, upper bound estimates

Ongoing work

Temporal record linkage
“Panta rei” (Heraclitus): everything flows, everything is constantly changing.
Databases may keep track of these never-ending changes.
Examples:
  People change names: Xin Dong → Xin Luna Dong
  People change jobs: Halevy moves from Univ. of Washington to Google
  Nations change: Yugoslavia → Serbia-Montenegro, Serbia, Kosovo
Based on the paper: P. Li, X. L. Dong, A. Maurino, D. Srivastava, Linking Temporal Records, VLDB 2011

An example
  person_id  person_name        person_address                              appln_filing_date
  110670     ABELE, MANLIO, G.  5 EAST 22ND STREET;NEW YORK, NY 10016       18/10/1990, 06/04/1992
  110671                        5 EAST 22ND STREET, 205;NEW YORK, NY 10010  20/02/1991
  110672     ABELE, Manlio, G.  5 East 22nd Street, 205,New York, NY 10010
  110674                        5 East 22nd Street,New York, NY 10016       12/04/1995, 12/03/1996
  110675     Abele, Manlio      250 East 54th St.,New York, NY 10022        19/03/2004

Experimental evaluation
Effectiveness test:
  Data set: patent data, 1871 records, 359 entities, 1978-2003
  Comparison: three existing algorithms, w./o. decayed similarity

Thanks! Questions?