An Automated Record Linkage System for the Canadian Census, 1871-1881 L. Antonie (University of Guelph) P. Baskerville (Universities of Alberta and Victoria)

Presentation transcript:

An Automated Record Linkage System for the Canadian Census, 1871-1881
L. Antonie (University of Guelph), P. Baskerville (Universities of Alberta and Victoria), K. Inwood (University of Guelph), J. A. Ross (University of Guelph)
Record Linkage Workshop, May 24th-25th, 2010, University of Guelph

What we are working towards
‘Unbiased’ links connecting individuals/households over several census years
A comprehensive infrastructure of longitudinal data
Censuses shown: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916; US 1880, US 1900

Current Work
Automatic linking: 100% of 1871 Census (3,601,663 records) to 100% of 1881 Census (4,277,807 records)
Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta

Existing (True) Links
Ontario Industrial Proprietors – 8429 links
Logan Township – 1760 links
St. James Church, Toronto – 232 links
Quebec City Boys – 1403 links
Bias
  – family context
  – others?
Map labels: Logan Twp, Guelph

Attributes for Automatic Linking
Last Name – string
First Name – string
Gender – binary
Age – number
Birthplace – number
Marital status – single, married, divorced, widowed, unknown

Automatic Linkage
The challenges:
1) Identify the same person
2) Deal with attribute characteristics
3) Manage computational expense
The system: [diagram not reproduced in the transcript]

Data Cleaning and Standardization
Cleaning
  – Names: remove non-alphanumeric characters; remove titles
  – Age: transform non-numerical representations (e.g. 3 months) to the corresponding numbers
  – All attributes: deal with English/French notations (e.g. days/jours, married/mariee)
Standardization
  – Birthplace codes and granularity
  – Marital status
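The cleaning rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the title list, the unit table, the marital-status vocabulary and all function names are hypothetical, since the slides do not give the vocabularies the system actually used.

```python
import re
import unicodedata

# Hypothetical vocabularies: the slides do not list the real ones.
TITLES = {"mr", "mrs", "miss", "dr", "rev"}
UNITS_PER_YEAR = {
    "day": 365, "days": 365, "jour": 365, "jours": 365,
    "week": 52, "weeks": 52, "semaine": 52, "semaines": 52,
    "month": 12, "months": 12, "mois": 12,
}
YEAR_UNITS = {"", "year", "years", "an", "ans"}
MARITAL = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced", "divorce": "divorced", "divorcee": "divorced",
}


def fold(text):
    """Lower-case the text and strip accents (e.g. 'Mariée' becomes 'mariee')."""
    decomposed = unicodedata.normalize("NFKD", str(text))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def clean_name(raw):
    """Remove non-alphanumeric characters, then drop title tokens."""
    text = re.sub(r"[^a-z0-9 ]", "", fold(raw))
    return " ".join(t for t in text.split() if t not in TITLES)


def clean_age(raw):
    """Return the age in years, or None when it cannot be parsed."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", fold(raw).strip())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    if unit in YEAR_UNITS:
        return float(value)
    if unit in UNITS_PER_YEAR:
        return value / UNITS_PER_YEAR[unit]
    return None


def clean_marital(raw):
    """Map English/French marital-status notations onto one vocabulary."""
    return MARITAL.get(fold(raw).strip(), "unknown")
```

For example, `clean_age("3 months")` gives a fractional age in years, and `clean_marital("Mariée")` and `clean_marital("married")` both map to the same standardized value.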

Computational Expense
Comparing all possible pairs of records is very expensive
Computing similarity between the 3.5 million records of the 1871 census and the 4 million records of the 1881 census
Run-time estimate: (3.5M x 4M) record pairs / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) = 40.5 days per attribute compared, so about 81 days for 2 attributes
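The run-time arithmetic can be checked directly. The sketch below uses only the figures quoted on the slide (3.5M and 4M records, 4M comparisons per second); the constant and function names are illustrative. It shows that 40.5 days corresponds to one attribute comparison per record pair, while two attribute comparisons per pair come to about 81 days.

```python
RECORDS_1871 = 3.5e6            # records in the 1871 census (rounded, from the slide)
RECORDS_1881 = 4.0e6            # records in the 1881 census (rounded, from the slide)
COMPARISONS_PER_SECOND = 4.0e6  # throughput assumed on the slide
SECONDS_PER_DAY = 60 * 60 * 24


def runtime_days(attributes_per_pair):
    """Days needed to compare every 1871 record with every 1881 record."""
    comparisons = RECORDS_1871 * RECORDS_1881 * attributes_per_pair
    return comparisons / COMPARISONS_PER_SECOND / SECONDS_PER_DAY


print(round(runtime_days(1), 1))  # 40.5
print(round(runtime_days(2), 1))  # 81.0
```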

Managing Computational Expense
Blocking
  – By first letter of last name
  – By birthplace
Using HPC
  – Running the system on multiple processors
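Blocking restricts comparison to record pairs that share a cheap key. A minimal sketch follows, assuming the two blocking criteria are combined into a single key (the slide does not say whether they were combined or applied separately) and that records are dictionaries with hypothetical `last_name` and `birthplace` fields.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].lower(), record["birthplace"])


def candidate_pairs(records_a, records_b):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b
```

Only the pairs produced by `candidate_pairs` would then go through the more expensive string comparisons. The trade-off is that a true match whose surname initial or birthplace was recorded differently in the two censuses is never compared.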

Record Comparison
Comparing strings
  – Jaro-Winkler
  – Edit Distance
  – Double Metaphone
Age
  – +/- 2 years
Exact matches
  – Gender
  – Birthplace
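A comparison step of this kind might turn each candidate pair into a small feature vector. The sketch below implements only the edit-distance measure (Levenshtein distance), not Jaro-Winkler or Double Metaphone. It also assumes that the +/- 2 year tolerance applies after allowing for the 10 years between the two censuses. Field names and the normalization to a 0-1 similarity are illustrative choices, not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def compare(rec_1871, rec_1881, census_gap=10, age_tolerance=2):
    """Build a feature vector for one candidate pair of records."""
    age_error = rec_1881["age"] - rec_1871["age"] - census_gap
    return {
        "last_name_sim": name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        "first_name_sim": name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        "age_ok": abs(age_error) <= age_tolerance,
        "gender_match": rec_1871["gender"] == rec_1881["gender"],
        "birthplace_match": rec_1871["birthplace"] == rec_1881["birthplace"],
    }
```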

Classification
Classifier
  – Support Vector Machines
  – 5-fold cross validation
Training data
  – True links found by experts
  – Ontario proprietors
Classes
  – Match
  – Non-match
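The 5-fold cross-validation protocol can be illustrated independently of the classifier. The sketch below only generates the train/test index splits; the SVM itself is not shown, and the function name, shuffling and seed are assumptions rather than details from the slides.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx
```

Each labelled pair (match or non-match) is held out exactly once, so the classifier is always scored on pairs it did not see during training.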

Linkage Results
Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45

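Linkage Results - Evaluation
Table 1 columns: True Links Set | Total | TP (%) | FP (%)
Table 1 rows: Ontario_Props, Logan, St_James, Les_Boys (values not preserved in the transcript)
Table 2 columns: Province | TP | FP | Possible | Unsure
Table 2 rows: New Brunswick, Nova Scotia, Ontario, Quebec (only the fused fragment "42526-" survives, for Quebec)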

Linkage Results - Evaluation
Columns: Attribute | ON71 | QC71 | CAN81 | ON_Props | Linked(ON) | Linked(QC)
Rows (values not preserved in the transcript):
Gender Distribution: Female, Male
Age >
Birthplace: ON (15030), QC (15081), ENG (41000), IRE (41100), SCO (41400), GER (45300), USA (9900)
Marital Status: Married (1), Widowed (5), Single (6)

Directions to Improve
Common patterns in incorrect links
  – Big age difference
  – Change in marital status for females
  – First name change
Probability estimate score of the classifier
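One way to act on these patterns is a rule-based check applied to each proposed link. The sketch below is a guess at what such a check could look like, not the authors' procedure: the record fields, the "F" gender code, the 10-year census gap and the age threshold are all assumptions.

```python
def suspicious_reasons(rec_1871, rec_1881, census_gap=10, max_age_error=2):
    """Return the list of incorrect-link patterns that a proposed link exhibits."""
    reasons = []
    age_error = abs(rec_1881["age"] - rec_1871["age"] - census_gap)
    if age_error > max_age_error:
        reasons.append("big age difference")
    if (rec_1871["gender"] == "F"
            and rec_1871["marital_status"] != rec_1881["marital_status"]):
        reasons.append("change in marital status (female)")
    if rec_1871["first_name"] != rec_1881["first_name"]:
        reasons.append("first name change")
    return reasons
```

A link with an empty list would be kept as is. A link with one or more reasons could be dropped or sent for review, possibly in combination with a threshold on the classifier's probability estimate.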

Results – Common Patterns
Before
Province        Linkage Rate (%)
New Brunswick   24.45
Nova Scotia     21.50
Ontario         18.36
Quebec          17.45
After
Columns: Province | Linkage Rate (%) | Diff.
Rows: NB, NS, ON, QC (values not preserved in the transcript)

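Results – Common Patterns
Before
Columns: True Links Set | Total | TP (%) | FP (%)
Rows: Ontario_Props, Logan, St_James, Les_Boys (values not preserved in the transcript)
After
Columns: Set | TP (%) | TPDiff. | FP (%) | FPDiff.
Rows: O_P, L, St_J, L_B (values not preserved in the transcript)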

Results – Classification Scores
Three tables, each with columns: True Links Set | Total | TP (%) | FP (%)
Rows in each table: Logan, St_James, Les_Boys (values not preserved in the transcript)

Conclusions
Linking people across Canadian censuses
Preliminary automated linkage system
More evaluation and experimentation are needed

Acknowledgements
University of Guelph
Ontario Ministry of Research and Innovation
SHARCNET
FamilySearch, Church of Latter Day Saints
Minnesota Population Center
University of Alberta
Université de Montréal/PRDH
Université Laval/CIEQ