Disambiguation of USPTO Inventors


1 Disambiguation of USPTO Inventors
Name Game Workshop, Madrid, 9-10 December 2010
Presenter: Amy Yu
Coauthors: Ronald Lai, Alex D'Amour, Lee Fleming
Technical Collaborator: Edward Sun
We would like to thank the NSF for supporting this research. Errors and omissions remain ours (though we ask that you bring them to our attention).
The Institute for Quantitative Social Science at Harvard University

2 Agenda
Introduction
Methodology
- Torvik-Smalheiser Algorithm (PubMed)
Results and Analysis
- Descriptive Statistics
- DVN platform

3 Introduction

4 Background
Patent data made available by the USPTO enables further research into technology and innovation
The NBER database includes authorship, firm, and state-level data, but the effort to disambiguate unique inventors has not been completed (Hall, Trajtenberg, and Jaffe, 2001)
Inventor disambiguation is non-trivial: the USPTO does not require consistent and unique identifiers for inventors

5 Motivation
Inventor disambiguation allows for construction of inventor collaboration networks
Opens new avenues of study:
- Which inventors are most central in their field?
- How does connectedness affect inventor productivity?
- What corporate structures are conducive to innovation?
- How do legal changes impact idea flow?
Build a scalable, automated system for tracking and analyzing developments in the inventor community

6 Methodology

7 Overview
Previous methodology (2008)
- Linear, unsupervised; more intuitive
- Similarity between records is a weighted average of element-wise similarity scores
- Weights are not optimized
- Strong results for US: (Lai, D'Amour, Fleming 2008) showed recall of 97.3% and precision of 96.1%
Current methodology (2010)
- Variation of the Torvik-Smalheiser algorithm (Torvik et al., 2005; Torvik and Smalheiser, 2009)
- Multi-dimensional similarity profiles
- Semi-supervised, with automatically generated training sets
- Optimal weighting, non-linear interactions
- Easier to scale

8 Disambiguation Process
Weekly USPTO patent data (1998-2010) from public databases feeds the HBS data-preparation scripts (load and validate; clean and format; generate datasets), which produce the primary datasets (Assignee, Inventors, Classes, Patents) and a consolidated inventor dataset. The inventor disambiguation algorithm then produces the consolidated inventor matched dataset.

9 Data preparation
Create inventor, assignee, patent and classification datasets from primary and secondary data sources
- USPTO: weekly patent data in XML files
- NBER Patent Data Project: assignee data
- National Geospatial-Intelligence Agency: location data
Standardize and reformat
- Removal of excess whitespace, removal of tags, and translation of Unicode characters
Construct the inventor-patent database
- Consolidate inventor, assignee, patent, and classification datasets
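The standardization step can be sketched as follows (the function name and exact rules are our illustration, not the actual HBS scripts):

```python
import re
import unicodedata

def clean_field(raw: str) -> str:
    """Standardize a raw XML field: strip tags, translate Unicode
    characters toward ASCII, and collapse excess whitespace.
    Illustrative sketch only; the real scripts are not shown."""
    text = re.sub(r"<[^>]+>", "", raw)                    # remove markup tags
    text = unicodedata.normalize("NFKD", text)            # decompose accented characters
    text = text.encode("ascii", "ignore").decode("ascii") # drop what cannot be translated
    text = re.sub(r"\s+", " ", text).strip()              # collapse excess whitespace
    return text.upper()

print(clean_field("  <i>Garci\u0301a</i>,  Jose\u0301 "))   # GARCIA, JOSE
```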

10 Patent Data: Base Datasets
INVENTOR
- Invnum_N*: Disambiguated inventor number
- Invnum: Initial inventor number (Patent + InvSeq)
- Firstname: Inventor first name
- Lastname: Inventor last name
- InvSeq: Inventor number on patent
- Street: Inventor's street address
- City: Inventor's city
- State: State (US only)
- Zipcode: Zipcode (US only)
- Lat: Latitude
- Long: Longitude
PATENT
- Patent: USPTO-assigned patent number
- AppDate: Patent application date
- GDate: Patent grant date
- AppYear: Patent application year
ASSIGNEE
- Assignee: Primary firm associated with patent
- Asgnum: Generated assignee number
CLASSES
- Class: Main patent classification
- Subclass: Patent subclassification
* HBS algorithm generated variables.

11 Consolidated Dataset
The consolidated inventor dataset joins, per inventor-patent record: inventor first and last name; location data; patent number with application and grant dates; assignee data; and patent class.
Fields: Invnum, Invnum_N, Firstname, Lastname, City, State, Country, Zipcode, Lat, Lng, InvSeq, Patent, AppYear, GYear, AppDate, Assignee, AsgNum, Class
[Table of sample rows, e.g. GAROLD LEE FLEMING | NEWTON | KS | US | 67117 | 38.13 | -97.32 | applied 1977, granted 1978 | HESSTON CORPORATION]

12 Disambiguation Algorithm
Pipeline stages: Blocking → Training Sets → Ratios → Disambiguation → Consolidation

13 Inventor disambiguation algorithm
Blocking: partition the records so that only records sharing a blocking key are compared.
Run 1 (Consolidated): first name / last name
Run 2: first 5 characters of first name / first 8 characters of last name
Run 3: first 3 characters of first name / first 5 characters of last name
Run 4: initials of first and middle names
Run 5: first initial
Run 6: last 5 characters of last name, reversed
Run 7:
Example records in one block: GAROLD LEE FLEMING (NEWTON, KS, US, 67117); LEE (FREMONT, CA, 94555); LEE O; CATHLEEN M (FOREST HILL, MD, 21050); EILEEN (SANDIA, TX, 78383/78382)
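The blocking keys for the fully specified runs above can be sketched like this (Python throughout; the actual HBS scripts are not shown, and runs whose criteria are elided on the slide are omitted):

```python
def block_keys(firstname: str, lastname: str) -> dict:
    """Blocking keys for successive passes, per the run table above."""
    return {
        1: (firstname, lastname),           # run 1: full first + last name
        2: (firstname[:5], lastname[:8]),   # run 2: first 5 / first 8 characters
        3: (firstname[:3], lastname[:5]),   # run 3: first 3 / first 5 characters
        6: (lastname[-5:][::-1],),          # run 6: last 5 chars of last name, reversed
    }

keys = block_keys("GAROLD", "FLEMING")
print(keys[2])   # ('GAROL', 'FLEMING')
print(keys[6])   # ('GNIME',)
```

Records that share a key for the current run land in the same block; later, looser runs (shorter prefixes) merge records that earlier, stricter runs kept apart.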

14 Inventor disambiguation algorithm
Training Sets: a similarity profile x = [x1, x2, x3, x4, x5, x6, x7] splits into name attributes (α) and patent attributes (β). For each part, the training sets give conditional probabilities:
Match: P(α|M), P(β|M)
Nonmatch: P(α|N), P(β|N)
P(α|M) × P(β|M) = P(x|M) = probability of seeing similarity profile x given a match
P(α|N) × P(β|N) = P(x|N) = probability of seeing similarity profile x given a nonmatch

15 Match Probability P(M|x)
Ratios: the likelihood ratio r = P(x|M)/P(x|N) is generated from the training sets. The probability of a match given similarity profile x, P(M|x), follows from r and the empirically determined prior P(M). Smoothing enforces monotonicity, and r is interpolated/extrapolated for unobserved profiles x.
Example similarity profiles, paired on the slide with match probabilities (approximated, for demonstration):
[2, 4, 3, 4, 2, 1, 4]
[3, 4, 3, 5, 3, 2, 5]
[4, 5, 3, 7, 3, 4, 6]
[6, 6, 4, 8, 3, 8, 7]
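Combining r with the prior gives the posterior by Bayes' rule; a minimal sketch (the closed form is our reconstruction from the definitions on the slide):

```python
def match_probability(r: float, p_m: float) -> float:
    """Posterior match probability from the likelihood ratio
    r = P(x|M)/P(x|N) and the empirically determined prior P(M):
        P(M|x) = r*P(M) / (r*P(M) + 1 - P(M))
    which follows from Bayes' rule after dividing through by P(x|N)."""
    return r * p_m / (r * p_m + (1.0 - p_m))

# a profile far more likely under "match" pushes the posterior toward 1
print(round(match_probability(r=50.0, p_m=0.1), 3))   # 0.847
```

Note the sanity checks: r = 1 leaves the posterior at the prior, and the posterior rises monotonically in r, which is what the monotonicity smoothing preserves.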

16 Inventor disambiguation algorithm
Disambiguation: each record (e.g. EILEEN FLEMING, SANDIA, TX, US, 78383/78382) is compared against the other records in its block (GAROLD LEE FLEMING; LEE; LEE O; CATHLEEN M; EILEEN), producing a similarity profile such as [6, 6, 4, 8, 3, 8, 7]. If the profile's match probability exceeds the threshold (> 0.95), the pair is declared a match.

17 Inventor disambiguation algorithm
Consolidation: matched records are merged into a single consolidated record whose field values carry frequency annotations, e.g. EILEEN~2, FLEMING~2, SANDIA~2, TX~2, US~2, 78383~1/78382~1 for the two EILEEN records above.
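The VALUE~COUNT notation (with "/" joining alternative values) is our reading of the consolidated output shown on the slide; a sketch of the merge:

```python
from collections import Counter

def consolidate(records: list) -> dict:
    """Merge records judged to be the same inventor into one consolidated
    record, annotating each field value with its frequency and joining
    alternative values with '/'. Illustrative sketch of the slide's format."""
    merged = {}
    for field in records[0]:
        counts = Counter(r[field] for r in records if r[field])
        merged[field] = "/".join(f"{v}~{n}" for v, n in counts.most_common())
    return merged

rows = [
    {"Firstname": "EILEEN", "Zipcode": "78383"},
    {"Firstname": "EILEEN", "Zipcode": "78382"},
]
print(consolidate(rows))   # {'Firstname': 'EILEEN~2', 'Zipcode': '78383~1/78382~1'}
```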

18 Process Map: Consolidated Steps
For each of the seven consolidated passes over the Inventor-Patent Dataset: build a training set (tsetC1 … tsetC7), derive a ratio database (ratio1 … ratio7), disambiguate (D1 … D7), and consolidate (C1 … C7), with each pass feeding the next. The output of pass 7 is the lower bound result.

19 Inventor disambiguation algorithm
Final Step: Splitting. Starting from the pass-7 invnum_N on the Inventor-Patent Dataset, a further blocking and disambiguation pass (reusing ratio7) produces D8, the upper bound result.

20 Results and Analysis

21 Patents and Inventors* 1975 – 2010
* excluding East Asian inventors

22 * based on lower bound disambiguation

23 Top 10 Inventors
Firstname Lastname | Country | Assignee | Number of Patents
KIA SILVERBROOK | AU | SILVERBROOK RESEARCH PTY LTD | 3382
DONALD E WEDER | US | WANDA M WEDER AND WILLIAM F STAETER | 1001
LEONARD FORBES | | MICRON TECHNOLOGY INC | 925
GURTEJ S SANDHU | | | 832
PAUL LAPSTUN | | | 803
WARREN M FARNWORTH | | | 729
GEORGE SPECTOR | | THE RUIZ LAW FIRM | 715
SALMAN AKRAM | | | 670
WILLIAM I WOOD | | GENENTECH INC | 646
AUSTIN L GURNEY | | | 618
* based on lower bound disambiguation, excluding East Asian inventors

24 Unique Coauthors by Patent Grant Year
Increasing collaboration over time; data before 1990 are skewed due to data sparsity.

25 Largest Component per Year

26 Analysis
Benchmark dataset from Jerry Marschke, NBER: manually edited, derived from inventor CVs; patent history of ~100 US inventors, mainly research scientists in university engineering and biochemistry departments.
Verification Measures:

27 Verification statistics
Run # | Type | # of records | Underclumping | Overclumping | Recall | Precision
Base Dataset | | 9.17 million | n/a | n/a | |
1 | Consolidated | 4.61 million | 74.6% | 1.7% | 25.40% | 93.73%
2 | | 2.20 million | 12.3% | 4.8% | 87.70% | 94.81%
3 | | 2.08 million | 6.8% | 10.1% | 93.20% | 90.22%
4 | | 2.05 million | 4.6% | 10.3% | 95.40% | 90.26%
5 | | 2.02 million | 4.1% | | 95.90% | 90.30%
6 | | 2.01 million | 2.8% | 19.2%** | 97.20% | 83.51%
7 | | 1.99 million | 2.7% | | 97.30% | 83.52%
8 | Splitting | 2.26 million | 15.9% | 15.3% | 84.10% | 84.61%
** due to "blackhole" names
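Recall and precision here are the standard pairwise measures (note that recall and underclumping sum to 100% in every row, i.e. underclumping appears to be the share of true inventor matches missed). A generic sketch, not the HBS evaluation code:

```python
def pairwise_metrics(predicted: set, truth: set) -> tuple:
    """Pairwise recall and precision against a hand-checked benchmark:
    recall    = true matched pairs recovered / all true pairs
    precision = true matched pairs recovered / all predicted pairs"""
    hits = len(predicted & truth)
    recall = hits / len(truth)
    precision = hits / len(predicted)
    return recall, precision

# toy benchmark: three true pairs, three predicted pairs, two correct
truth = {("a", "b"), ("a", "c"), ("b", "c")}
pred = {("a", "b"), ("a", "c"), ("c", "d")}
r, p = pairwise_metrics(pred, truth)
print(round(r, 3), round(p, 3))   # 0.667 0.667
```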

28 Encouraging results

29 Challenges and Improvements
Disambiguation of East Asian names is difficult; the current algorithm is well-suited for European names
Systematic improvements required to handle correlations between fields
Overclumping for common names: frequency adjustment using a stop list (removing David Johnson, Eric Anderson, and Stephen Smith from our analysis improves the overclumping metric from 19.2% to 5.1% for the last two consolidated runs)
Computation time vs. algorithmic accuracy
Benchmark datasets for results analysis

30 Research applications
Origin of breakthroughs Impact of legislation on innovation Organizational influence on innovation Inventor careers and collaboration networks

31 Dataverse Network Platform

32

33 Questions?

34 Appendix

35 Patent Data
Example: Prof. Fleming, Amy, and Ron collaborate on patent 9999999.
Consolidated inventor dataset (sample rows):
Invnum_N | Name | Patent | Assignee | City | State
12345 | Fleming, Lee | | HP | Fremont | CA
 | | | Harvard | Cambridge | MA
45678 | Yu, Amy | | | Boston |
67890 | Lai, Ronald | | | Randolph |
Data are organized in unique inventor-patent pairs.
Invnum = Patent Num + inventor sequence
Invnum_N = disambiguated inventor identifier: a unique inventor number from the HBS disambiguation algorithm, constant between patents.
A patent is assigned to one entity (usually the inventors' employer, or self if blank), constant over a patent.
Location data are personal addresses (at the city level) of inventors, and vary over a patent.

36 Disambiguation Algorithm
Blocking: partition the inventor-patent dataset based on seven different criteria.
Training Sets: build a training set for each set of block criteria. Each set is a database containing four tables, each with ~10 million pairs of record ids.
Ratios: one ratio database is created for each training set; similarity profiles are paired with match probabilities.
Disambiguation: starts from invpat or a previously disambiguated and consolidated database; within each block, each pair of records is compared. Output is invnum_N.
Consolidation: based on the disambiguated invnum_N, update invnum_N within invpat and consolidate records with the same invnum_N.

37 Inventor disambiguation algorithm
Summary of Data Passes
Run 1 (Consolidated): first name / last name
Run 2: first 5 characters of first name / first 8 characters of last name
Run 3: first 3 characters of first name / first 5 characters of last name
Run 4: initials of first and middle names
Run 5: first initial
Run 6: last 5 characters of last name, reversed
Run 7:
Run 8 (Splitting): Invnum_N from step 7

38 Patent Similarity Profiles
Seven-dimensional. Fields used: name attributes (first name, middle initials, and last name) and patent attributes (author address, assignee, technology class, and coauthors). Each element is a discrete similarity score determined by a fieldwise comparison between two records, using Jaro-Winkler string comparison. Monotonicity assumption: if one profile dominates another (that is, each of its elements is greater than or equal to the corresponding element of the other similarity profile), then it must map to a higher match probability.
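The fieldwise comparisons rely on Jaro-Winkler similarity; a self-contained sketch of that standard metric (not the HBS implementation):

```python
def jaro(s: str, t: str) -> float:
    """Plain Jaro similarity (standard algorithm)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags, t_flags = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):                      # count matching characters
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, j = 0, 0
    for i in range(len(s)):                         # count transposed matches
        if s_flags[i]:
            while not t_flags[j]:
                j += 1
            if s[i] != t[j]:
                transpositions += 1
            j += 1
    half = transpositions // 2
    return (matches / len(s) + matches / len(t) + (matches - half) / matches) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix (up to 4 chars)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)

# near-identical last names from the scoring examples score high
print(round(jaro_winkler("JOHNSTON", "JOHNSON"), 3))   # 0.975
```

The continuous score is then discretized into the 0-6 (or 0-8) fieldwise scales listed on the next slide.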

39 Inventor disambiguation algorithm
Similarity Scores
Comparison function scoring (LEFT/RIGHT = left record vs. right record):
1) Firstname: 0-6. Factors: number of tokens and similarity between tokens.
   0: totally different: THOMAS ERIC / RICHARD JACK EVAN
   1: one name missing: THOMAS ERIC / (none)
   2: THOMAS ERIC / THOMAS JOHN ALEX
   3: LEE RON ERIC / LEE ALEX ERIC
   4: names match without spaces but raw names don't: JOHNERIC / JOHN ERIC; short name vs. long name: ERIC / ERIC THOMAS
   5: ALEX NICHOLAS / ALEX NICHOLAS TAKASHI
   6: ALEX NICHOLAS / ALEX NICHOLA (may not be exactly the same but identified as the same by Jaro-Winkler)
2) Lastname: 0-6. Factors: number of tokens and similarity between tokens.
   0: totally different: ANDERSON / DAVIDSON
   1: one name missing: ANDERSON / (none)
   2: first part non-match: DE AMOUR / DA AMOUR
   3: VAN DE WAALS / VAN DES WAALS
   4: DE AMOUR / DEAMOUR
   5: JOHNSTON / JOHNSON
   6: DE AMOUR / DE AMOURS
3) Midname: 0-4 (the following examples are from the Firstname column, so the first name is included)
   0: THOMAS ERIC / JOHN THOMAS
   1: JOHN ERIC / JOHN (missing)
   2: THOMAS ERIC ALEX / JACK ERIC RONALD
   3: THOMAS ERIC RON ALEX EDWARD / JACK ERIC RON ALEX LEE
   4: THOMAS ERIC / THOMAS ERIC LEE
4) Assignee: 0-8
   0: different Asgnum, totally different names (no single common word)
   1: different Asgnum, one name missing
   2: different Asgnum: Harvard University Longwood Medical School / Dartmouth Hitchcock Medical Center
   3: different Asgnum: Harvard University President and Fellows / Presidents and Fellow of Harvard
   4: different Asgnum: Harvard University / Harvard University Medical School
   5: different Asgnum: Microsoft Corporation / Microsoft Corporated
   6: same Asgnum, company size > 1000
   7: same Asgnum, 100 < size < 1000
   8: same Asgnum, size < 100
5) Class: 0-4. Number of common classes (missing = 1).
6) Coauthors: 0-10. Number of common coauthors.
7) Distance: 0-7. Factors: longitude/latitude, street address.
   0: totally different
   1: one is missing
   2: 75 km < distance < 100 km
   3: 50 km < distance < 75 km
   4: 10 km < distance < 50 km
   5: distance < 10 km
   6: distance < 10 km and street match but not in US, or distance < 10 km and in US but street does not match
   7: street match and in US
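The distance buckets can be computed from latitude/longitude with the haversine formula; a sketch covering scores 0-5 (the street-address cases 6-7 need address data, and treating 100 km or more as "totally different" is our assumption):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points (haversine)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def distance_score(lat1, lon1, lat2, lon2):
    """Map distance to the 0-5 portion of the slide's 0-7 distance scale."""
    d = haversine_km(lat1, lon1, lat2, lon2)
    if d < 10:
        return 5
    if d < 50:
        return 4
    if d < 75:
        return 3
    if d < 100:
        return 2
    return 0   # >= 100 km: treated as "totally different" (our assumption)

# two points ~22 km apart fall in the 10-50 km bucket
print(distance_score(38.0, -97.0, 38.2, -97.0))   # 4
```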

40 Probabilistic Matching Model
Name and patent attributes are assumed to be independent. Unbiased training sets are created by conditioning on one set of features to produce a sample of obvious matches or nonmatches, which is then used to learn about the other set of features without bias. The frequency of each similarity profile x in the match and nonmatch sets is counted to calculate P(x|M) and P(x|N).
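The counting step can be sketched as a simple frequency estimate (the slide does not show the exact estimator):

```python
from collections import Counter

def conditional_probs(profiles):
    """Estimate P(x|M) or P(x|N) by counting how often each similarity
    profile occurs in the match or nonmatch training set."""
    counts = Counter(map(tuple, profiles))
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}

# toy match set with three observed profiles
match_set = [[6, 6, 4, 8, 3, 8, 7], [6, 6, 4, 8, 3, 8, 7], [4, 5, 3, 7, 3, 4, 6]]
p_x_given_m = conditional_probs(match_set)
print(p_x_given_m[(6, 6, 4, 8, 3, 8, 7)])   # 2/3
```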

41 Training Set Criteria
Condition on patent attributes to train name attributes; condition on name attributes to train patent attributes.
Training name attributes (condition on patent attributes):
- Match: choose all record pairs that have at least two common coauthors within each predefined block.
- Nonmatch: choose all record pairs that have the same AppYear, different assignees, no common coauthors, and no common classes within each predefined block.
Training patent attributes (condition on name attributes):
- Match: choose all record pairs that share the same rare name (calculate statistics on unique full names; choose those whose first or last name appears only once). Not necessary to check each block.
- Nonmatch: choose all record pairs that have different last names, from a subset of the whole database in which the number of records is proportional to the original in terms of grant year.
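The coauthor-based match criterion can be sketched as follows (the `id` and `coauthors` fields are illustrative, not the actual schema):

```python
from itertools import combinations

def match_pairs_by_coauthor(block):
    """Within one block, auto-label record pairs sharing at least two
    coauthors as matches, per the criterion above. Each record is a dict
    with an 'id' and a set of 'coauthors'."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(block, 2)
        if len(a["coauthors"] & b["coauthors"]) >= 2
    ]

block = [
    {"id": 1, "coauthors": {"YU", "LAI"}},
    {"id": 2, "coauthors": {"YU", "LAI", "SUN"}},
    {"id": 3, "coauthors": {"SUN"}},
]
print(match_pairs_by_coauthor(block))   # [(1, 2)]
```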

42 Probabilistic Matching Model
Likelihood ratio: r = P(x|M)/P(x|N)
Probability of match given similarity profile x (by Bayes' rule):
P(M|x) = r·P(M) / (r·P(M) + 1 − P(M))
where P(M) is empirically determined.
Smoothing: enforce monotonicity; r is interpolated/extrapolated for unobserved profiles x.
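The monotonicity-based smoothing can be read as bounding r for an unobserved profile between the ratios of observed profiles it dominates and those that dominate it; a sketch (how a single value is then interpolated between the bounds is not specified on the slide):

```python
def dominates(x, y):
    """Profile x dominates y when every element is >= the corresponding one."""
    return all(a >= b for a, b in zip(x, y))

def bound_ratio(x, observed):
    """Bound the likelihood ratio of an unobserved profile x, using the
    monotonicity assumption. `observed` maps profiles to their ratios."""
    lower = max((r for p, r in observed.items() if dominates(x, p)), default=0.0)
    upper = min((r for p, r in observed.items() if dominates(p, x)), default=float("inf"))
    return lower, upper

observed = {(2, 2): 1.0, (4, 4): 9.0}
print(bound_ratio((3, 3), observed))   # (1.0, 9.0)
```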

43 Disambiguation & Consolidation
Generate a similarity profile for each record pair within each block.
Look up the similarity profile in the ratio database to find the match probability.
Based on a given probability threshold, determine whether invnum_N (the algorithmically generated unique inventor identifier) should be updated.
Records with the same invnum_N are consolidated, which improves algorithm efficiency for subsequent runs.
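One standard way to propagate invnum_N updates across matched pairs is union-find; a sketch (record ids are hypothetical, and the slides do not specify the data structure actually used):

```python
class UnionFind:
    """Merge inventor records that any pass declares the same person,
    so all of them end up sharing one representative identifier."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:          # walk to the root, compressing the path
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
# hypothetical inventor-patent record ids matched by two separate pairs
for a, b in [("rec1", "rec2"), ("rec2", "rec3")]:
    uf.union(a, b)
# all three records now resolve to one representative invnum_N
print(uf.find("rec1") == uf.find("rec3"))   # True
```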

44 Verification Measures

45 References
Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001). "The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools." NBER.
Torvik, V., M. Weeber, D. Swanson, and N. Smalheiser (2005). "A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation." Journal of the American Society for Information Science and Technology, 56(2): 140-158.
Torvik, V. and N. Smalheiser (2009). "Author Name Disambiguation in MEDLINE." ACM Transactions on Knowledge Discovery from Data, 3(3), Article 11.

