An Automated Record Linkage System for the Canadian Census, 1871-1881 L. Antonie (University of Guelph) P. Baskerville (Universities of Alberta and Victoria)


1 An Automated Record Linkage System for the Canadian Census, 1871-1881
L. Antonie (University of Guelph)
P. Baskerville (Universities of Alberta and Victoria)
K. Inwood (University of Guelph)
J. A. Ross (University of Guelph)
Record Linkage Workshop, May 24th-25th, 2010, University of Guelph

2 What we are working towards
‘Unbiased’ links connecting individuals/households over several census years
A comprehensive infrastructure of longitudinal data
Censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900

3 Current Work
Automatic linking: 100% of 1871 Census (3,601,663 records) to 100% of 1881 Census (4,277,807 records)
Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta

4 Existing (True) Links
Ontario Industrial Proprietors – 8429 links
Logan Township – 1760 links
St. James Church, Toronto – 232 links
Quebec City Boys – 1403 links
Bias
  –family context
  –others?
Map labels: Logan Twp, Guelph

5 Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Age – number
Birthplace – number
Marital status – single, married, divorced, widowed, unknown

6 Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system:

7 Data Cleaning and Standardization
Cleaning
  –Names – remove non-alphanumeric characters; remove titles
  –Age – transform non-numerical representations (e.g. 3 months) to the corresponding numbers
  –All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
Standardization
  –Birthplace codes and granularity
  –Marital status
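The cleaning steps on this slide can be illustrated with a small Python sketch. It is not the authors' code: the title list, the English/French lookup tables, and the function names are hypothetical, since the slides give only one or two examples per attribute. Ages given in days, weeks, or months are converted to a fraction of a year, which is one possible reading of "corresponding numbers".

```python
import re
import unicodedata

# Hypothetical lookup tables: the slides do not list the actual titles or
# English/French vocabulary handled by the system.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir", "mme", "mlle"}
UNITS_PER_YEAR = {"days": 365, "jours": 365, "weeks": 52, "semaines": 52,
                  "months": 12, "mois": 12}
MARITAL = {"married": "married", "marie": "married", "mariee": "married",
           "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
           "single": "single", "celibataire": "single",
           "divorced": "divorced", "divorce": "divorced",
           "divorcee": "divorced"}


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_name(raw):
    """Lower-case, drop non-alphanumeric characters, and remove titles."""
    tokens = strip_accents(raw).lower().split()
    tokens = [re.sub(r"[^a-z0-9]", "", tok) for tok in tokens]
    return " ".join(tok for tok in tokens if tok and tok not in TITLES)


def clean_age(raw):
    """Return the age in years as a float, or None if it cannot be parsed."""
    text = strip_accents(raw).strip().lower()
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", text)
    if not match:
        return None
    value, unit = int(match.group(1)), match.group(2)
    if unit in ("", "years", "ans"):
        return float(value)
    if unit in UNITS_PER_YEAR:
        return value / UNITS_PER_YEAR[unit]
    return None


def normalize_marital(raw):
    """Map an English or French marital status to one standard label."""
    return MARITAL.get(strip_accents(raw).strip().lower(), "unknown")
```

For example, `clean_name("Mr. John O'Neil")` gives `"john oneil"`, `clean_age("3 months")` gives `0.25`, and `normalize_marital("Mariée")` gives `"married"`.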

8 Computational Expense
Very expensive to compare all the possible pairs of records
Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)
Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = about 81 days (about 40.5 days per attribute compared).
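The run-time estimate can be checked by evaluating the expression directly. The sketch below uses only the figures on the slide; the function and variable names are hypothetical. Evaluated as written, with two attributes, the expression comes to about 81 days; the 40.5-day figure corresponds to a single attribute comparison per record pair.

```python
def runtime_days(n_pairs, n_attributes, comparisons_per_second):
    """Days needed to compare every attribute of every record pair."""
    seconds = n_pairs * n_attributes / comparisons_per_second
    return seconds / 60 / 60 / 24


records_1871 = 3_500_000
records_1881 = 4_000_000
pairs = records_1871 * records_1881  # all possible cross-census pairs

days_one_attribute = runtime_days(pairs, 1, 4_000_000)
days_two_attributes = runtime_days(pairs, 2, 4_000_000)

print(f"1 attribute:  {days_one_attribute:.1f} days")
print(f"2 attributes: {days_two_attributes:.1f} days")
```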

9 Managing Computational Expense
Blocking
  –By first letter of last name
  –By birthplace
Using HPC
  –Running the system on multiple processors
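Blocking can be sketched as grouping records by a key and comparing only records that share it. The slide does not say whether the two blocking criteria are applied together or separately; the sketch below assumes a combined key (first letter of last name plus birthplace code), and the record field names are hypothetical.

```python
from collections import defaultdict


def block_key(record):
    """Assumed combined key: first letter of last name and birthplace code."""
    return (record["last_name"][:1], record["birthplace"])


def candidate_pairs(records_a, records_b):
    """Yield only the cross-census pairs whose records share a block key."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


census_1871 = [
    {"last_name": "smith", "birthplace": 15030},
    {"last_name": "tremblay", "birthplace": 15081},
]
census_1881 = [
    {"last_name": "smyth", "birthplace": 15030},
    {"last_name": "stone", "birthplace": 15081},
    {"last_name": "tremblay", "birthplace": 15081},
]
pairs = list(candidate_pairs(census_1871, census_1881))
```

In this toy example only 2 of the 6 possible pairs are compared. Because blocks are independent of one another, they are also a natural unit for distributing work across processors.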

10 Record Comparison
Comparing Strings
  –Jaro-Winkler
  –Edit Distance
  –Double Metaphone
Age
  –+/- 2 years
Exact matches
  –Gender
  –Birthplace
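A comparison step along these lines can be sketched in plain Python. The Jaro-Winkler and edit-distance functions below are textbook versions, not the authors' implementation, and Double Metaphone is omitted. The record fields and the `compare` function are hypothetical, and the age rule assumes the +/- 2 years applies around the expected 10-year gap between the two censuses, which the slide does not state.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def levenshtein(s1, s2):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (a != b)))  # substitution
        previous = current
    return previous[-1]


def compare(rec_1871, rec_1881, census_gap=10, age_tolerance=2):
    """Build a comparison vector for one candidate pair (assumed fields)."""
    age_error = abs(rec_1881["age"] - rec_1871["age"] - census_gap)
    return {
        "last_name_jw": jaro_winkler(rec_1871["last_name"], rec_1881["last_name"]),
        "first_name_jw": jaro_winkler(rec_1871["first_name"], rec_1881["first_name"]),
        "last_name_edit": levenshtein(rec_1871["last_name"], rec_1881["last_name"]),
        "age_ok": age_error <= age_tolerance,
        "gender_match": rec_1871["gender"] == rec_1881["gender"],
        "birthplace_match": rec_1871["birthplace"] == rec_1881["birthplace"],
    }
```

A vector of this kind, one per candidate pair, is the sort of input a match/non-match classifier would consume.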

11 Classification
Classifier
  –Support Vector Machines
  –5-fold cross validation
Training Data
  –True links found by experts
  –Ontario proprietors
Classes
  –Match
  –Non-match
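The 5-fold cross-validation procedure can be sketched independently of the classifier. The code below shows only the fold logic; the classifier plugged in is a stand-in threshold rule on a single similarity score, not a support vector machine, and all names and the toy data are hypothetical.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def cross_validate(features, labels, fit, predict, k=5):
    """Return the accuracy on each held-out fold (assumes n_items >= k)."""
    accuracies = []
    for train, test in k_fold_indices(len(features), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies


# Stand-in classifier (NOT an SVM): a threshold halfway between the mean
# score of the matches and the mean score of the non-matches.
def fit_threshold(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, score):
    return 1 if score >= threshold else 0


# Toy data: 1 = match, 0 = non-match.
scores = [0.95, 0.20, 0.90, 0.30, 0.85, 0.10, 0.92, 0.25, 0.88, 0.15]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
fold_accuracies = cross_validate(scores, labels, fit_threshold, predict_threshold)
```

Each record pair is used for testing exactly once and for training in the other four folds, so the averaged accuracy is less dependent on one particular train/test split.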

12 Linkage Results
Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45

13 Linkage Results - Evaluation
True Links Set   Total   TP (%)   FP (%)
Ontario_Props    1647    21.59    9.28
Logan            1760    21.64    8.85
St_James         232     24.72    7.12
Les_Boys         1403    17.99    11.41

Province        TP   FP   Possible   Unsure
New Brunswick   66   27   6          1
Nova Scotia     70   22   5          -
Ontario         53   40   5          2
Quebec          42   52   6          -

14 Linkage Results - Evaluation
Attribute            ON71    QC71    CAN81   ON_Props   Linked(ON)   Linked(QC)
Gender Distribution
  Female             47.46   49.83   49.35   48.63      45.26        43.50
  Male               49.69   50.00   50.64   51.33      54.74        56.50
Age
  0-15               42.20   41.84   38.68   60.28      40.96        43.24
  15-25              20.12   20.72   21.22   9.44       20.70        22.56
  25-50              26.42   25.78   27.68   31.35      26.95        23.07
  >50                11.26   11.66   12.42   8.93       11.39        11.13
Birthplace
  ON (15030)         67.29   0.57    34.04   73.24      66.30        0.48
  QC (15081)         2.45    91.71   30.70   2.40       2.57         92.08
  ENG (41000)        7.44    1.11    4.02    6.74       10.00        1.37
  IRE (41100)        5.48    0.98    2.75    5.84       5.40         0.94
  SCO (41400)        9.35    3.17    4.45    7.33       8.57         2.83
  GER (45300)        1.23    0.06    0.56    1.12       2.10         0.07
  USA (9900)         2.59    1.23    1.77    2.19       3.96         1.72
Marital Status
  Married (1)        30.36   30.22   31.78   39.75      29.11        23.13
  Widowed (5)        3.21    3.02    3.66    0.86       4.07         3.64
  Single (6)         66.43   66.75   64.52   59.39      66.82        73.24

15 Directions to Improve
Common patterns in incorrect links
  –Big age difference
  –Change in marital status for females
  –First name change
Probability estimate score of the classifier
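Both directions can be read as post-filters on the links proposed by the classifier. The sketch below is one possible interpretation, not the authors' procedure: the link fields, the flag names, and the default thresholds are hypothetical, and the age rule again assumes a 10-year gap between the censuses with a 2-year tolerance.

```python
def suspicious_patterns(link, census_gap=10, max_age_error=2,
                        min_first_name_sim=0.85):
    """Return the names of the error patterns that a proposed link shows."""
    flags = []
    age_error = abs(link["age_1881"] - link["age_1871"] - census_gap)
    if age_error > max_age_error:
        flags.append("age_difference")
    if link["gender"] == "F" and link["marital_1871"] != link["marital_1881"]:
        flags.append("female_marital_change")
    if link["first_name_sim"] < min_first_name_sim:
        flags.append("first_name_change")
    return flags


def filter_links(links, min_score=0.9):
    """Keep links with a high classifier score and no suspicious pattern."""
    return [link for link in links
            if link["score"] >= min_score and not suspicious_patterns(link)]


proposed = [
    {"score": 0.95, "gender": "M", "age_1871": 30, "age_1881": 41,
     "marital_1871": "single", "marital_1881": "married",
     "first_name_sim": 1.0},
    {"score": 0.97, "gender": "F", "age_1871": 20, "age_1881": 30,
     "marital_1871": "single", "marital_1881": "married",
     "first_name_sim": 1.0},
    {"score": 0.82, "gender": "M", "age_1871": 45, "age_1881": 55,
     "marital_1871": "married", "marital_1881": "married",
     "first_name_sim": 0.95},
]
kept = filter_links(proposed)
```

Raising `min_score` or tightening the pattern rules trades linkage rate for fewer false positives, so the thresholds would need to be tuned against sets of known true links.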

16 Results – Common Patterns
Before
Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45

After
Province   Linkage Rate (%)   Diff.
NB         22.24              -2.21
NS         18.72              -2.78
ON         15.68              -2.68
QC         14.82              -2.63

17 Results – Common Patterns
Before
True Links Set   Total   TP (%)   FP (%)
Ontario_Props    1647    21.59    9.28
Logan            1760    21.64    8.85
St_James         232     24.72    7.12
Les_Boys         1403    17.99    11.41

After
Set     TP (%)   TP Diff.   FP (%)   FP Diff.
O_P     20.48    -1.11      7.32     -1.96
L       20.36    -1.28      7.25     -1.6
St_J    23       -1.72      5.92     -1.2
L_B     16.66    -1.33      10.36    -1.05

18 Results – Classification Scores
Score: 0.8
True Links Set   Total   TP (%)   FP (%)
Logan            1760    19.37    4.86
St_James         232     22.06    3.43
Les_Boys         1403    15.25    5.94

Score: 0.85
True Links Set   Total   TP (%)   FP (%)
Logan            1760    18.97    4.61
St_James         232     22.06    3
Les_Boys         1403    14.64    5.31

Score: 0.9
True Links Set   Total   TP (%)   FP (%)
Logan            1760    18.125   3.78
St_James         232     21.63    2.4
Les_Boys         1403    13.94    3.97

19 Conclusions
Linking people across 1871-1881 Canadian censuses
Preliminary automated linkage system
More evaluation and experimentation is needed

20 Acknowledgements
University of Guelph
Ontario Ministry of Research and Innovation
SHARCNET
FamilySearch, Church of Latter Day Saints
Minnesota Population Center
University of Alberta
Université de Montréal/PRDH
Université Laval/CIEQ

