Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May.


1 Record Linkage at the Minnesota Population Center
Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick
RecordLink Workshop 2010, University of Guelph, May 24th, 2010

2 Introduction
Overview of linkage process
  – Prelims vs. final releases
Name commonness scores
Error rate estimation
Weights
Looking ahead

3 Historical Record Linkage – U.S.
1850 1% sample
1860 1% sample
1870 1% sample
1880 complete-count
1900 1% sample
1910 1% sample
1920 1% sample
1930 1% sample

4 Historical Record Linkage at the MPC
Primary goals are to create linked sets that are
  – Representative
  – Accurate

5 Historical Record Linkage at the MPC
Representative links
  – We use a very limited set of variables to predict links, to avoid linkage bias:
      Block by birthplace, sex, and race
      Given (first) name
      Surname (last name)
      Age
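A minimal sketch of the blocking step described on this slide: records are compared only when they share birthplace, sex, and race. The record layout (dictionaries with `birthplace`, `sex`, `race` keys) and the sample records are hypothetical, made up for illustration, and are not the MPC's actual data structures.

```python
from collections import defaultdict

def block_key(record):
    # Blocking variables named on the slide: birthplace, sex, race.
    return (record["birthplace"], record["sex"], record["race"])

def candidate_pairs(records_a, records_b):
    """Yield only the (a, b) pairs whose block keys are identical."""
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[block_key(rec)].append(rec)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b

# Made-up records, for illustration only.
earlier = [
    {"id": "a1", "birthplace": "Ohio", "sex": "M", "race": "W"},
]
later = [
    {"id": "b1", "birthplace": "Ohio", "sex": "M", "race": "W"},
    {"id": "b2", "birthplace": "Iowa", "sex": "M", "race": "W"},
]

pair_ids = [(a["id"], b["id"]) for a, b in candidate_pairs(earlier, later)]
```

Only the Ohio record in `later` is paired with `a1`; the Iowa record is never compared, which is the point of blocking.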

6 Historical Record Linkage at the MPC
Accurate links
  – If there is more than one ‘potential’ link for a given person, we exclude them all
  – We throw away a lot of potential links
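The exclusion rule can be sketched as a filter over potential links. This sketch assumes that ambiguity is checked in both directions (a person in either census may appear in only one potential link); the slide does not say whether the check is one-sided or two-sided, and the ids are invented.

```python
from collections import Counter

def keep_unambiguous(potential_links):
    """Keep a pair (id_a, id_b) only if neither id appears in any other pair."""
    count_a = Counter(a for a, _ in potential_links)
    count_b = Counter(b for _, b in potential_links)
    return [
        (a, b)
        for a, b in potential_links
        if count_a[a] == 1 and count_b[b] == 1
    ]

# Hypothetical potential links: a2 has two candidates, and b4 is claimed twice.
links = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4"), ("a4", "b4")]
kept = keep_unambiguous(links)
```

Four of the five potential links are discarded here, which illustrates why the approach throws away so many candidates.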

7 Historical Record Linkage at the MPC
Create given name, surname, and age similarity scores
  – Jaro-Winkler string similarity algorithm
  – 20% age difference score
We apply name and age similarity thresholds to limit output of potential links
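For readers unfamiliar with Jaro-Winkler, here is a compact pure-Python sketch of the standard algorithm, using the conventional prefix scale of 0.1 and a maximum prefix of 4. These parameter values are the textbook defaults and are an assumption; the slides do not state which implementation or settings the MPC used.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed default settings)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)

score_textbook = jaro_winkler("MARTHA", "MARHTA")
score_given = jaro_winkler("IRVING", "ERVIN")
```

Because the Winkler boost rewards only a shared prefix, a pair such as IRVING and ERVIN, which differ in the first letter, gets no boost and scores the same as plain Jaro.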

8 Additional Variables Based on Age
Age
  – Age difference (absolute value, normalized)*
  – Age categories, in five-year groups*
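The two age variables might be computed along these lines. The normalization is an assumption: the sketch divides the absolute gap between the reported and expected later age by the expected age, but the slide does not say what the difference is normalized by. The ten-year default reflects adjacent decennial censuses.

```python
def age_difference(age_earlier, age_later, years_between=10):
    """Absolute gap between reported and expected later age, divided by
    the expected age (this normalization is an assumption)."""
    expected = age_earlier + years_between
    return abs(age_later - expected) / expected

def five_year_group(age):
    """Lower bound of the five-year age group, e.g. 13 falls in the 10-14 group."""
    return (age // 5) * 5

diff = age_difference(30, 42)   # expected age is 40, reported age is 42
group = five_year_group(13)
```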

9 Additional Variables Based on Name
Phonetic match (binary)
  – Double Metaphone
  – NYSIIS*
Middle initials (if present) must not conflict (binary)*

10 Additional Variables Based on Name
Name Commonness Scores*
  – Our answer to incorporating probabilistic information into the process without complete standardization of all name strings
  – Proportion of records (by race, birthplace, and sex) in the 1880 data with a Jaro-Winkler score greater than 0.9
  – The name commonness score works in tandem with a birthplace density measure, which is the proportion of 1880 records for specific birthplaces (by race and sex)
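A rough sketch of the two measures, under stated assumptions. The similarity function is a parameter; the default here is a stand-in built on Python's `difflib`, not Jaro-Winkler, so that the example stays self-contained. In practice a Jaro-Winkler function would be passed in. The name lists and counts are invented.

```python
from difflib import SequenceMatcher

def stand_in_similarity(a, b):
    # Stand-in only: the slides use Jaro-Winkler, not difflib's ratio.
    return SequenceMatcher(None, a, b).ratio()

def name_commonness(name, block_names, similarity=stand_in_similarity, threshold=0.9):
    """Share of names in one race/birthplace/sex block that score above
    the threshold against `name`."""
    if not block_names:
        return 0.0
    similar = sum(1 for other in block_names if similarity(name, other) > threshold)
    return similar / len(block_names)

def birthplace_density(block_size, total_for_race_and_sex):
    """Share of records (for a given race and sex) born in one birthplace."""
    return block_size / total_for_race_and_sex

# Made-up block of surnames, for illustration only.
block = ["SMITH", "SMITH", "SMYTH", "JONES"]
common = name_commonness("SMITH", block)
rare = name_commonness("JONES", block)
density = birthplace_density(len(block), 40)
```

A high commonness score signals that many records in the block could plausibly match the name, so a link on that name carries less evidence than a link on a rare name.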

11 Classification of links
Comparisons that beat the thresholds become ‘potential links’, which two SVM models classify as ‘true’ or ‘false’ links
  – One model includes age variables, the other does not*
A link is accepted if both models call it a ‘true’ link and there are no conflicts
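The acceptance rule can be written as a small decision function. In this sketch the two classifiers are passed in as plain callables; the threshold functions used in the example are hypothetical stand-ins, not trained SVMs, and the feature names are invented.

```python
def accept_link(features, model_with_age, model_without_age, has_conflict):
    """Accept only if both models say 'true' and the link has no conflicts."""
    return bool(
        model_with_age(features)
        and model_without_age(features)
        and not has_conflict
    )

# Hypothetical stand-ins for the two trained models.
def with_age(f):
    return f["name_sim"] > 0.9 and f["age_diff"] < 0.1

def without_age(f):
    return f["name_sim"] > 0.9

good = {"name_sim": 0.97, "age_diff": 0.02}
bad_age = {"name_sim": 0.97, "age_diff": 0.30}

accepted = accept_link(good, with_age, without_age, has_conflict=False)
rejected_by_model = accept_link(bad_age, with_age, without_age, has_conflict=False)
rejected_by_conflict = accept_link(good, with_age, without_age, has_conflict=True)
```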

12 Name Commonness Table 6. Distribution of 1870 Records (Males) by Name Commonness Scores

13 Linkage Rate by Name Commonness

14 Linkage Rates by Name Commonness and Birthplace Population Size

15 Table 8. Linkage Rate for Native-Born 1870 Males by Birthplace Rank (number of males by birthplace) and Name Commonness Scores

16 Occupational Scores and Name Commonness

17 Estimating error rates
Calculate migration rates by different slices of data, e.g. five-year age categories, age difference
Split brothers
Compare link made in one dataset to link made in another for the same group of people
Compare to linked set made by another independent source: Pleiades

18 Selected Linked Household – 1870-1880
LINKTYPE LAST70 FIRST70 LAST80 FIRST80 RELATE70 RELATE80 AGE70 AGE80
household UNDERWOOD NORMAN Head 52
household UNDERWOOD MARY UNDERWOOD MARY Spouse Head 33 49
household UNDERWOOD LUTHER Son 11
household UNDERWOOD IRVING UNDERWOOD ERVIN Son 3 13
primary UNDERWOOD VANDER UNDERWOOD VANDER Son 1 10
household UNDERWOOD ROSA Daughtr 18
household UNDERWOOD CHARLES Son 8
household UNDERWOOD ADDI Daughtr 5

19 Weights
The weights are based on the linkable population, which is always based on the terminal census year data
Based on an iterative process
We capped weight minimums and maximums (min is 1/5 the avg. weight for the subgroup; max is 4 times the avg. weight for the subgroup)
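The capping rule alone can be sketched as a single clipping pass over one subgroup's weights. The slide mentions an iterative process but gives no details, so this sketch shows only one pass; how the caps interact with the iteration is not covered here. The example weights are invented.

```python
def cap_weights(weights, min_factor=1 / 5, max_factor=4.0):
    """Clip each weight to [min_factor * avg, max_factor * avg], where avg is
    the subgroup's average weight before capping (single pass only)."""
    avg = sum(weights) / len(weights)
    lo, hi = min_factor * avg, max_factor * avg
    return [min(max(w, lo), hi) for w in weights]

# Made-up subgroup: average weight is 10, so the caps are 2 and 40.
subgroup = [1] * 9 + [91]
capped = cap_weights(subgroup)
```

Note that clipping changes the subgroup's average, which may be one reason the weights were derived iteratively.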

20 Final Release Data Set Size, Males
MALE: nat-b white, for-b white, af-am
1850 701329982
1860 10426634235
1870 177258792180
1900 1859615151334
1910 14855995791
1920 10050511504
1930 9018336352

21 Final Release Data Set, Females
FEMALE nat-b white: married, single, Formerly
1850 1077468798
1860 22531495843
1870 425467001134
1900 327442411162
1910 189114071124
1920 793849894
1930 221700545

22 Final Release Data Set Size, Couples
COUPLE: nat-b white, for-b white, af-am
1850 21352276
1860 453893219
1870 88622267407
1900 77452132180
1910 46501101102
1920 210741626
1930 6121057

23 Looking Ahead
Hope to alleviate the small-N problem in the future
  – Link 1900 and 1930 5% samples to the 1880 complete count
  – 1850 complete-count database currently under construction
  – Hope to have complete-count data for 1860, 1870, and 1900 in the future

