Record Linkage at the Minnesota Population Center Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick RecordLink Workshop, 2010 University of Guelph, May.

Presentation transcript:

Record Linkage at the Minnesota Population Center
Ron Goeken, Lap Huynh, Tom Lenius, and Rebecca Vick
RecordLink Workshop, 2010
University of Guelph, May 24th, 2010

Introduction
Overview of linkage process
– Prelims vs. final releases
Name commonness scores
Error rate estimation
Weights
Looking ahead

Historical Record Linkage – U.S % sample % sample % sample 1880 complete-count % sample % sample % sample % sample

Historical Record Linkage at the MPC
Primary goals are to create linked sets that are
– Representative
– Accurate

Historical Record Linkage at the MPC
Representative links
– We use a very limited set of variables to predict links, to avoid linkage bias:
  Block by birthplace, sex, and race
  Given (first) name
  Surname (last name)
  Age
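Blocking on birthplace, sex, and race can be sketched as follows. This is a minimal illustration, not the MPC code: the record layout (dicts with `birthplace`, `sex`, `race` keys), the function names, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def block_key(record):
    # Blocking variables named on the slide: birthplace, sex, race.
    return (record["birthplace"], record["sex"], record["race"])


def candidate_pairs(records_first, records_second):
    """Return (first, second) record pairs that share a blocking key.

    Only records in the same block are ever compared, which keeps the
    number of comparisons manageable.
    """
    blocks = defaultdict(list)
    for rec in records_second:
        blocks[block_key(rec)].append(rec)
    pairs = []
    for rec in records_first:
        for other in blocks.get(block_key(rec), []):
            pairs.append((rec, other))
    return pairs


# Hypothetical records, for illustration only.
first_census = [
    {"id": "a1", "birthplace": "New York", "sex": "M", "race": "W"},
    {"id": "a2", "birthplace": "Ohio", "sex": "F", "race": "W"},
]
second_census = [
    {"id": "b1", "birthplace": "New York", "sex": "M", "race": "W"},
    {"id": "b2", "birthplace": "New York", "sex": "F", "race": "W"},
    {"id": "b3", "birthplace": "Ohio", "sex": "F", "race": "W"},
]

pairs = candidate_pairs(first_census, second_census)
pair_ids = [(a["id"], b["id"]) for a, b in pairs]
```

Names and ages are then compared only within these candidate pairs.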

Historical Record Linkage at the MPC
Accurate links
– If there is more than one ‘potential’ link for a given person we exclude them all
– We throw away a lot of potential links

Historical Record Linkage at the MPC
Create given name, surname, and age similarity scores
– Jaro-Winkler string similarity algorithm
– 20% age difference score
We apply name and age similarity thresholds to limit output of potential links
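For readers unfamiliar with Jaro-Winkler, the sketch below is a plain-Python version of the standard algorithm (prefix scale 0.1, prefix capped at four characters). It is not the MPC implementation, and the threshold values in `passes_name_thresholds` are placeholders chosen for the example; the slides do not state the actual cutoffs.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def passes_name_thresholds(given_a, given_b, sur_a, sur_b,
                           given_cutoff=0.8, surname_cutoff=0.8):
    # Cutoffs are hypothetical placeholders, not the values used at the MPC.
    return (jaro_winkler(given_a, given_b) >= given_cutoff
            and jaro_winkler(sur_a, sur_b) >= surname_cutoff)
```

With these placeholder cutoffs, a pair such as IRVING UNDERWOOD and ERVIN UNDERWOOD passes, because the given names still share four of their letters in order even though the first letter differs.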

Additional Variables Based on Age
Age
– Age difference (absolute value, normalized)*
– Age categories, in five-year groups*
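One way these age variables might be computed is sketched below. The slides do not say how the age difference is normalized, so dividing by the expected age is purely an assumption here, as are the function name, the ten-year census gap default, and the choice to categorize on the earlier age.

```python
def age_features(age_first, age_second, years_between=10):
    """Age-based comparison variables for one candidate pair.

    age_first / age_second are the reported ages in the earlier and later
    census. Normalizing by the expected age is an assumption.
    """
    expected = age_first + years_between
    diff = abs(age_second - expected)
    normalized = diff / max(expected, 1)
    # Five-year group containing the earlier age, e.g. 33 -> (30, 34).
    start = (age_first // 5) * 5
    return {
        "age_diff": diff,
        "age_diff_normalized": normalized,
        "age_category": (start, start + 4),
    }


# Illustrative values: reported ages 33 and 49 in censuses ten years apart.
features = age_features(33, 49)
```

Normalizing means that a six-year discrepancy counts for less in an adult than the same discrepancy would in a child.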

Additional Variables Based on Name
Phonetic Match (binary)
– Double Metaphone
– NYSIIS*
Middle initials (if present) must not conflict (binary)*
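The middle-initial rule can be sketched as a small binary check. The example assumes the middle initial is the first letter of the second token in the given-name string, which may not match how the source data store names; the function names are hypothetical.

```python
def middle_initial(given_name):
    """Return the middle initial, or None if the name has no second token."""
    parts = given_name.split()
    if len(parts) > 1:
        return parts[1][0].upper()
    return None


def middle_initials_compatible(given_a, given_b):
    """True unless both names carry a middle initial and the two differ."""
    a = middle_initial(given_a)
    b = middle_initial(given_b)
    return a is None or b is None or a == b
```

A missing initial on either side never counts as a conflict, which follows the "if present" wording on the slide.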

Additional Variables Based on Name
Name Commonness Scores*
– Our answer to incorporating probabilistic information into the process without complete standardization of all name strings.
– Proportion of records (by race, birthplace, and sex) in the 1880 data with a Jaro-Winkler score greater than 0.9
– Name commonness score works in tandem with a birthplace density measure, which is the proportion of 1880 records for specific birthplaces (by race and sex)
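A rough sketch of both measures follows. To stay dependency-free it uses `difflib.SequenceMatcher` as a stand-in similarity, which is not Jaro-Winkler and will score some pairs differently; any Jaro-Winkler function can be passed in through `sim`. The function names, record layout, and sample names are assumptions for the example.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    # Stand-in only: the slides use Jaro-Winkler, not difflib's ratio.
    return SequenceMatcher(None, a, b).ratio()


def name_commonness(name, block_names, sim=stand_in_similarity, cutoff=0.9):
    """Share of names in one race/birthplace/sex block that are similar to
    `name`, where similar means a score strictly greater than `cutoff`."""
    if not block_names:
        return 0.0
    similar = sum(1 for other in block_names if sim(name, other) > cutoff)
    return similar / len(block_names)


def birthplace_density(records, birthplace, race, sex):
    """Share of records of a given race and sex born in `birthplace`."""
    group = [r for r in records if r["race"] == race and r["sex"] == sex]
    if not group:
        return 0.0
    return sum(1 for r in group if r["birthplace"] == birthplace) / len(group)


# Hypothetical block of given names from the reference census.
block = ["JOHN", "JOHN", "JON", "VANDER"]
common = name_commonness("JOHN", block)
rare = name_commonness("VANDER", block)
```

A high score flags a name for which a close string match is weak evidence, because many other people in the same block have nearly the same name.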

Classification of links
Comparisons that beat the thresholds become ‘potential links’ that are classified as ‘true’ or ‘false’ links by two SVM models
– One model includes age variables, the other does not*
Link is accepted if both models call it a ‘true’ link and there are no conflicts
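The acceptance rule can be sketched independently of the SVM models themselves. In the example below the two model decisions are supplied as precomputed booleans, and a "conflict" is taken to mean that a record on either side appears in more than one link that both models call true. That reading of "conflicts" is an assumption, as are the field names.

```python
from collections import Counter


def accept_links(potential_links):
    """Keep links that both models call true and that are unambiguous.

    Each potential link is a dict with keys `id_first`, `id_second`,
    `with_age` and `without_age` (the two model decisions, as booleans).
    """
    agreed = [l for l in potential_links
              if l["with_age"] and l["without_age"]]
    count_first = Counter(l["id_first"] for l in agreed)
    count_second = Counter(l["id_second"] for l in agreed)
    # Any record involved in more than one agreed link loses all of them.
    return [(l["id_first"], l["id_second"]) for l in agreed
            if count_first[l["id_first"]] == 1
            and count_second[l["id_second"]] == 1]


# Hypothetical model outputs.
links = [
    {"id_first": "a1", "id_second": "b1", "with_age": True, "without_age": True},
    {"id_first": "a2", "id_second": "b2", "with_age": True, "without_age": False},
    {"id_first": "a3", "id_second": "b3", "with_age": True, "without_age": True},
    {"id_first": "a3", "id_second": "b4", "with_age": True, "without_age": True},
    {"id_first": "a5", "id_second": "b5", "with_age": True, "without_age": True},
]
accepted = accept_links(links)
```

Here `a2` is dropped because the models disagree, and both `a3` links are dropped because they conflict with each other.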

Name Commonness Table 6. Distribution of 1870 Records (Males) by Name Commonness Scores

Linkage Rate by Name Commonness

Linkage Rates by Name Commonness and Birthplace Population Size

Table 8. Linkage Rate for Native-Born 1870 Males by Birthplace Rank (number of males by birthplace) and Name Commonness Scores

Occupational Scores and Name Commonness

Estimating error rates
Calculate migration rates by different slices of data, e.g. five-year age categories, age difference
Split brothers
Compare link made in one dataset to link made in another for same group of people
Compare to linked set made by another independent source: Pleiades
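Comparing two linked sets for the same people can be sketched as below. The representation (a dict from earlier-census ID to later-census ID) and the function name are assumptions, and the disagreement rate is only a proxy for error: when two sources disagree at least one link is wrong, but agreement does not prove both are right.

```python
def compare_linked_sets(links_a, links_b):
    """Compare two linkages given as {first_census_id: second_census_id}.

    Only people linked in both sets can be compared.
    """
    shared = links_a.keys() & links_b.keys()
    disagreements = sum(1 for k in shared if links_a[k] != links_b[k])
    rate = disagreements / len(shared) if shared else None
    return {
        "shared": len(shared),
        "disagreements": disagreements,
        "disagreement_rate": rate,
    }


# Hypothetical linked sets from two independent sources.
ours = {"p1": "x", "p2": "y", "p3": "z"}
theirs = {"p1": "x", "p2": "q", "p4": "w"}
summary = compare_linked_sets(ours, theirs)
```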

Selected Linked Household –
LINKTYPE LAST70 FIRST70 LAST80 FIRST80 RELATE70 RELATE80 AGE70 AGE80
household UNDERWOOD NORMAN Head 52
household UNDERWOOD MARY UNDERWOOD MARY Spouse Head 33 49
household UNDERWOOD LUTHER Son 11
household UNDERWOOD IRVING UNDERWOOD ERVIN Son 3 13
primary UNDERWOOD VANDER UNDERWOOD VANDER Son 1 10
household UNDERWOOD ROSA Daughtr 18
household UNDERWOOD CHARLES Son 8
household UNDERWOOD ADDI Daughtr 5

Weights
The weights are based on the linkable population, which is always based on the terminal census year data.
Based on an iterative process
We capped weight minimums and maximums (min is 1/5 the avg. weight for the subgroup; max is 4 times the avg. weight for the subgroup)
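The capping rule can be illustrated with a single pass over one subgroup's weights. This sketch covers only the caps; the iterative weighting process itself is not described in the slides and is not reproduced here. The function and parameter names are assumptions.

```python
def cap_weights(weights, min_divisor=5, max_factor=4):
    """Clamp one subgroup's weights to [avg / 5, 4 * avg].

    The average is taken over the uncapped weights. Because capping
    changes the average, a real procedure might recompute and repeat.
    """
    if not weights:
        return []
    avg = sum(weights) / len(weights)
    lowest = avg / min_divisor
    highest = avg * max_factor
    return [min(max(w, lowest), highest) for w in weights]


# Hypothetical subgroup: nine small weights and one extreme weight.
raw = [1] * 9 + [91]
capped = cap_weights(raw)
```

With an average weight of 10, the small weights are raised to 2 and the extreme weight is cut to 40, so no single linked record dominates estimates for the subgroup.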

Final Release Data Set Size, Males
MALE: nat-b white, for-b white, af-am

Final Release Data Set, Females
FEMALE nat-b white: married, single, Formerly

Final Release Data Set Size, Couples
COUPLE: nat-b white, for-b white, af-am

Looking Ahead
Hope to alleviate small N problem in the future
– Link 1900 and % samples to 1880 complete count
– 1850 complete count database currently under construction
– Hope to have complete count data for 1860, 1870, and 1900 in the future