UNSD Regional Workshop on Census Data Processing for the English speaking African Countries: Contemporary technologies for data capture, methodology and practice of data editing
Dar es Salaam, Tanzania, 9-13 June 2008

Census Data Editing: Structure and Within Record Editing

Part I: Structure Editing

Summary: Part I: Structure Edits
- What are structure edits?
- Geography edits
- Hierarchy of records
- Correspondence between housing and population records
- Editing relationships in a household
- Family nuclei

What are structure edits?
- Structure edits check coverage and relationships between different units: persons, households, housing units, enumeration areas, etc.
- Specifically, they check that:
  - all household and collective quarters records within an enumeration area are present and in the proper order;
  - all occupied housing units have person records, and vacant units have none;
  - households have neither duplicate nor missing person records;
  - enumeration areas have neither duplicate nor missing housing records.

Geography edits
- Each EA must have the right geographic codes (city, province, region, ...)
- Every housing unit in an EA should be entered, and every record must have a valid EA code
- The capture process must check this before data editing commences
- If errors remain, it is best to find the right code, for example by returning to the enumeration documents and correcting the record manually

Hierarchy of records
1_EA
  2_Housing unit
    4_Individual
  2_Housing unit
  3_Collective living quarter
    4_Individual
1_EA

Hierarchy of records
- Type 1 (EA) is followed by a new Type 1 (if the original EA is empty), or by Type 2 (Housing Unit) or Type 3 (Collective Living Quarter)
  - Particular case of homeless people: create a dummy housing record to make structural checking easier
- Type 2 (Housing Unit) is followed by Type 1, 2 or 3 (if the original dwelling is vacant) or by Type 4 (if the original dwelling is occupied)
- Type 3 (Collective Living Quarter) is followed by Type 4 (Individual)
  - If not occupied, is an empty CLQ allowed?
- Type 4 (Individual) is followed by Type 4 (another individual in the same dwelling or collective living quarter), or Type 2 or 3 (another dwelling or CLQ), or Type 1 (new EA)
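The succession rules above can be sketched as a lookup table of allowed "next" record types. This is a minimal illustration, not the workshop's own software: the names `ALLOWED_NEXT` and `check_sequence` are hypothetical, the sketch assumes an empty collective living quarter is not allowed (the slide leaves this open), and it assumes a file must begin with an EA record.

```python
# Allowed successors for each record type, following the rules listed above.
# Assumption: an empty collective living quarter (type 3) is NOT allowed,
# so a type 3 record must be followed by an individual (type 4).
ALLOWED_NEXT = {
    1: {1, 2, 3},      # EA -> new EA, housing unit, or CLQ
    2: {1, 2, 3, 4},   # housing unit -> anything (vacant or occupied)
    3: {4},            # CLQ -> individual
    4: {1, 2, 3, 4},   # individual -> anything
}


def check_sequence(record_types):
    """Return a list of (position, message) for every structural violation."""
    errors = []
    # Assumption: the file starts with an EA record.
    if record_types and record_types[0] != 1:
        errors.append((0, "file must start with an EA record (type 1)"))
    for i in range(len(record_types) - 1):
        current, following = record_types[i], record_types[i + 1]
        if following not in ALLOWED_NEXT.get(current, set()):
            errors.append((i + 1, f"type {following} cannot follow type {current}"))
    return errors


print(check_sequence([1, 2, 4, 2, 3, 4, 1]))  # valid sequence, no errors
print(check_sequence([1, 4]))                 # individual directly after EA
```

A pure type-sequence check cannot tell whether a housing unit followed by another housing unit was really vacant; that test needs the occupancy category on the housing record.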

Correspondence between housing and population records
- An occupied unit should have at least one person and a vacant unit should have no people: if a Type 2 (Housing Unit) record with category "vacant" is followed by a Type 4 (Individual) record, then change the category to "occupied"
- The number of occupants recorded on the Housing Unit form should equal the number of individual records in the household. If not, change the number on the Housing Unit form
- Population records should be sequenced (numbered)
- If a Type 3 (CLQ) record with category “Hospital” is followed by multiple Type 4 (Individual) records of category “Retirement home”, then change the category of the CLQ to “Retirement home”
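The first two rules above could be applied roughly as follows. This is a sketch under stated assumptions: the field names `status` and `n_occupants` and the function `reconcile_housing_unit` are hypothetical, and an occupied unit with no person records is only flagged for review because the slide gives no automatic fix for that case.

```python
def reconcile_housing_unit(unit, persons):
    """Return an edited copy of a housing record plus a list of changes.

    unit: dict with hypothetical keys 'status' ('occupied' or 'vacant')
          and 'n_occupants' (count written on the housing form).
    persons: list of person records that follow the housing record.
    """
    edited = dict(unit)
    changes = []
    # No automatic fix is given for this case, so only flag it.
    if not persons and edited["status"] == "occupied":
        changes.append("REVIEW: occupied unit has no person records")
        return edited, changes
    # A "vacant" unit followed by person records is recoded as occupied.
    if persons and edited["status"] == "vacant":
        edited["status"] = "occupied"
        changes.append("status: vacant -> occupied")
    # The person records win over the count on the housing form.
    if edited["n_occupants"] != len(persons):
        changes.append(f"n_occupants: {edited['n_occupants']} -> {len(persons)}")
        edited["n_occupants"] = len(persons)
    return edited, changes


unit = {"status": "vacant", "n_occupants": 0}
print(reconcile_housing_unit(unit, [{"id": 1}, {"id": 2}]))
```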

Editing relationships in a household
- Each individual has a relation to the first person:
  - 1st person (or Head, or reference person)
  - Spouse
  - Child of the 1st person or of his/her spouse
  - Parent
  - Other relative
  - Friend
  - Lodger
  - ...

Editing relationships in a household
[Figure: household with potential inconsistencies in age reporting]

Family nuclei
- Father: sex should be male and age should be above a minimum age
- Mother: sex should be female and age should be above a minimum age
- Child: age under a maximum limit?

Part II: Within Record Editing

Summary: Part II: Within Record Edits
- Validity and Consistency Checks
- Top-down Editing versus Multiple-variable Editing
- Example of Multiple-Variable Editing
- Methods of Correcting and Imputing Data
- Example of Hot Deck for Sample Household (Sex Only)
- Example of Hot Deck for Sample Household (Sex and Age)
- Issues Related to Hot Deck
- Methods of Correcting and Imputing Data: General Principles
- Edit Trails and the Use of Imputation Flags

Validity and Consistency Checks
- Validity checks are performed to see whether the values of individual variables are plausible or lie within a reasonable range. Examples:
  - 0 <= AGE <= 110
  - SEX = Female or SEX = Male
- Consistency checks are performed to ensure that there is coherence between two or more variables. Examples:
  - Head of household should have AGE >= 15
  - A child should be younger than the head of household
  - A person with AGE < 15 should have marital status "never married"
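The example checks above translate directly into small predicate functions. The sketch below is illustrative only; the field names (`age`, `sex`, `relationship`, `marital_status`) and the function names are assumptions, not part of the presentation.

```python
def validity_errors(person):
    """Single-variable checks: is each value inside its allowed range?"""
    errors = []
    if not 0 <= person["age"] <= 110:
        errors.append("AGE outside 0-110")
    if person["sex"] not in ("Male", "Female"):
        errors.append("SEX not Male or Female")
    return errors


def consistency_errors(person, head_age):
    """Multi-variable checks: do the values agree with each other?"""
    errors = []
    if person["relationship"] == "Head" and person["age"] < 15:
        errors.append("head of household younger than 15")
    if person["relationship"] == "Child" and person["age"] >= head_age:
        errors.append("child not younger than head of household")
    if person["age"] < 15 and person["marital_status"] != "Never married":
        errors.append("person under 15 not never married")
    return errors


child = {"age": 12, "sex": "Male", "relationship": "Child",
         "marital_status": "Married"}
print(validity_errors(child))                   # every value is valid on its own
print(consistency_errors(child, head_age=40))   # but age and marital status clash
```

The sample record shows why both kinds of check are needed: each value passes its validity check, yet the combination fails a consistency check.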

Top-down Editing versus Multiple-Variable Editing
- The top-down editing approach starts with the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority
- During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset
  - Example: a child’s age is first imputed on the basis of the mother’s age. Later, the child’s age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother’s age
  - In this case, the child’s age should keep being re-imputed until it is consistent
- It is important to avoid circular editing!

Top-down Editing versus Multiple-Variable Editing
- The multiple-variable editing approach uses a set of rules that state the relationships between variables
- Each statement is tested against the data to see whether it is true
- The edit system keeps track of all false statements relating to invalid entries or inconsistencies
- An assessment is then made of how to change the record so that it passes all edits, and a decision is taken
- The Fellegi-Holt principle of “minimum change” should be used

Example of Multiple-Variable Editing
Head of household and spouse have same sex

Unedited data
Person | Relationship | Sex | Children ever born
1 | Head of household | Male | 3
2 | Spouse | Male | BLANK

Data after editing for sex
Person | Relationship | Sex | Children ever born
1 | Head of household | Female | 3
2 | Spouse | Male | BLANK

Example of Multiple-Variable Editing
Head of household and spouse have same sex

No. | Rule | Relationship | Sex | Age | Marital status | Fertility
1 | Head of household should be 15 years or older | | | | |
2 | Spouse should be 15 years or older | | | | |
3 | A spouse should be married | | | | |
4 | If spouse present, head of household and spouse should be opposite sex | 1 | 1 | | |
5 | Person less than 15 years old should be never married | | | | |
6 | Male should have no fertility | | 1 | | | 1
7 | For female 15 years or older, fertility entry should not be blank | | | | |
Totals | | 1 | 2 | | | 1
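The tally in the table above can be reproduced with a short sketch. Several things here are assumptions for illustration: the function names, the ages and marital statuses of the two persons (the slide does not show them), and the variables attached to rules other than 4 and 6. Only rules 4 and 6 fail for this household, so only their variable lists affect the result.

```python
from collections import Counter


def failed_rules(head, spouse):
    """Return (rule number, variables involved) for every rule that fails."""
    failures = []
    if head["age"] < 15:
        failures.append((1, ("relationship", "age")))
    if spouse["age"] < 15:
        failures.append((2, ("relationship", "age")))
    if spouse["marital"] != "married":
        failures.append((3, ("relationship", "marital")))
    if head["sex"] == spouse["sex"]:
        failures.append((4, ("relationship", "sex")))
    for person in (head, spouse):
        if person["age"] < 15 and person["marital"] != "never married":
            failures.append((5, ("age", "marital")))
        if person["sex"] == "Male" and person["fertility"] is not None:
            failures.append((6, ("sex", "fertility")))
        if (person["sex"] == "Female" and person["age"] >= 15
                and person["fertility"] is None):
            failures.append((7, ("sex", "age", "fertility")))
    return failures


def tally_by_variable(failures):
    """Count how many failed rules involve each variable."""
    counts = Counter()
    for _rule, variables in failures:
        counts.update(variables)
    return counts


# Ages and marital statuses are assumed values; sex and fertility follow the slide.
head = {"sex": "Male", "age": 40, "marital": "married", "fertility": 3}
spouse = {"sex": "Male", "age": 38, "marital": "married", "fertility": None}

counts = tally_by_variable(failed_rules(head, spouse))
print(counts)  # sex is involved in the most failed rules

# Changing the single most-implicated value (the head's sex) clears all failures.
edited_head = dict(head, sex="Female")
print(failed_rules(edited_head, spouse))
```

Sex appears in the most failed edits, so changing the head's sex is the smallest change that lets the record pass every rule, which matches the edited data in the example.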

Methods of Correcting and Imputing Data
- The process of imputation changes one or more responses or missing values in a record, or in several records, so that the resulting records are internally coherent
- Before using any imputation method, the best strategy is to start with a manual study of the responses; imputation can then handle the remaining unresolved edit failures
- Two methods of imputation: Cold Deck and Hot Deck
- Cold Deck Imputation:
  - Used mainly for missing or unknown values (not for inconsistent/invalid values)
  - Values are imputed on a proportional basis from a distribution of valid responses (e.g., from the previous census)
  - In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values

Methods of Correcting and Imputing Data
- Hot Deck or Dynamic Imputation:
  - Used for both missing data and inconsistent/invalid items
  - Uses one or more variables to estimate the likely response, based on data about individuals with similar characteristics
  - The “donor set” (or imputation matrix) constantly changes through updating; therefore, imputations change dynamically during the process of editing all the records
  - Thus, hot deck draws from a distribution that changes with each imputation and eventually (through modifications) “approaches” the distribution of the current data set
  - Caution: if several items in a particular record have unknown values, hot deck may not use the same “donor” to impute each of them; in this case, it is preferable to use the same donor for all of these items

Example of Hot Deck for Sample Household (Sex Only)

ID number | Relationship | Sex | Age | Dynamic Imputation Matrix
1 | 1 | 1 | 39 | 1
2 | 2 | 2 | 35 | 2
3 | 3 | 1 | 13 | 1
4 | 3 | 9 (imputed: 1) | 10 | 1
5 | 4 | 2 | 40 | 2
6 | 4 | 1 | 99* | 1
7 | 4 | 2 | 13 | 2
8 | 5 | 9 (imputed: 2) | 99* | 2
9 | 5 | 1 | 44 | 1
10 | 5 | 2 | 36 | 2

Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female
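The single-cell hot deck in this example can be sketched as below: the matrix holds the most recently reported sex, and a missing value (code 9) takes whatever is stored at that moment. The function name `hot_deck_sex` and the starting donor value are assumptions; the slide does not state an initial value, and none is needed here because the first record is reported.

```python
MISSING_SEX = 9


def hot_deck_sex(sexes, initial_donor=1):
    """Impute missing sex codes from the most recently reported value.

    initial_donor is a hypothetical starting value for the one-cell matrix.
    """
    donor = initial_donor
    edited = []
    for sex in sexes:
        if sex == MISSING_SEX:
            sex = donor      # missing: take the stored donor value
        else:
            donor = sex      # reported: this value becomes the new donor
        edited.append(sex)
    return edited


# Sex codes of the ten persons in the sample household, in file order.
reported = [1, 2, 1, 9, 2, 1, 2, 9, 1, 2]
print(hot_deck_sex(reported))  # -> [1, 2, 1, 1, 2, 1, 2, 2, 1, 2]
```

Person 4 receives the value donated by person 3, and person 8 the value donated by person 7, as in the table.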

Example of Hot Deck for Age (Sex and Relationship)

Initial Imputation Matrix For Age Based on Sex and Relationship
Sex | Head of Household (1) / Spouse (2) | Son/Daughter (3) | Other Relative (4) / Non-Relative (5)
Male (1) | 35 | 12 | 40
Female (2) | 32 | 12 | 37

Example of Hot Deck for Age (Sex and Relationship)

ID number | Relationship | Sex | Age
1 | 1 | 1 | 39
2 | 2 | 2 | 35
3 | 3 | 1 | 13
4 | 3 | 9 (imputed: 1) | 10
5 | 4 | 2 | 40
6 | 4 | 1 | 99 (imputed: 40)
7 | 4 | 2 | 13
8 | 5 | 9 (imputed: 2) | 99 (imputed: 37)
9 | 5 | 1 | 44
10 | 5 | 2 | 36

Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female

Example of Hot Deck for Age (Sex and Relationship)

Initial Imputation Matrix For Age Based on Sex and Relationship
Sex | Head of Household (1) / Spouse (2) | Son/Daughter (3) | Other Relative (4) / Non-Relative (5)
Male (1) | 35 | 12 | 40
Female (2) | 32 | 12 | 37

Dynamic Imputation Matrix After Multiple Changes
Sex | Head of Household (1) | Spouse (2) | Son/Daughter (3) | Other Relative (4) | Non-Relative (5)
Male (1) | 39* | 35 | 13* | 40 | 44*
Female (2) | 32 | 35* | 12 | 13* | 36*
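A two-dimensional hot deck for age, keyed by sex and relationship, might look like the following sketch. It rests on two assumptions that are not stated on the slides: that each of the three initial values per row covers the relationship groups Head/Spouse, Son/Daughter, and Other Relative/Non-Relative, and that a record whose sex was itself imputed does not update the matrix. The names `hot_deck_age` and `initial_matrix` are hypothetical.

```python
MISSING_AGE = 99


def hot_deck_age(records, matrix):
    """Impute missing ages from a matrix keyed by (sex, relationship)."""
    matrix = dict(matrix)  # work on a copy so the initial matrix is preserved
    edited = []
    for rec in records:
        rec = dict(rec)
        key = (rec["sex"], rec["rel"])
        if rec["age"] == MISSING_AGE:
            rec["age"] = matrix[key]      # take the current donor value
            rec["age_imputed"] = True
        else:
            rec["age_imputed"] = False
            # Assumption: only records with a reported sex act as donors.
            if not rec["sex_imputed"]:
                matrix[key] = rec["age"]
        edited.append(rec)
    return edited, matrix


# Assumed reading of the initial matrix: one value per relationship group.
initial_matrix = {}
for sex, (head_spouse, child, other) in {1: (35, 12, 40), 2: (32, 12, 37)}.items():
    for rel, age in zip((1, 2, 3, 4, 5),
                        (head_spouse, head_spouse, child, other, other)):
        initial_matrix[(sex, rel)] = age

# (relationship, sex after sex imputation, sex imputed?, reported age)
rows = [(1, 1, False, 39), (2, 2, False, 35), (3, 1, False, 13),
        (3, 1, True, 10), (4, 2, False, 40), (4, 1, False, 99),
        (4, 2, False, 13), (5, 2, True, 99), (5, 1, False, 44),
        (5, 2, False, 36)]
records = [{"id": i + 1, "rel": r, "sex": s, "sex_imputed": imp, "age": a}
           for i, (r, s, imp, a) in enumerate(rows)]

edited, final_matrix = hot_deck_age(records, initial_matrix)
print([rec["age"] for rec in edited])
print([final_matrix[(1, rel)] for rel in range(1, 6)])  # male row
print([final_matrix[(2, rel)] for rel in range(1, 6)])  # female row
```

Under these assumptions, persons 6 and 8 receive ages 40 and 37, and the final matrix agrees with the dynamic matrix shown above.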

Issues Related to Hot Deck
- Devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas
- Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may assign a value from outside the neighborhood
- Before using dynamic imputation, an effort should be made to use related items instead. For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value “married” should be assigned
- One should edit key items such as age and sex first, so that these can be used in other imputation matrices for lower-priority items

Issues Related to Hot Deck
- Construct imputation matrices based on research from administrative sources or previous censuses and surveys
- Standardized imputation matrices (i.e., matrices with standard dimensions such as age and sex, used e.g. for language) can streamline the process, since they can be tested and applied quickly
- BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation
- Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household, all with unknown ages, different values will be assigned

Issues Related to Hot Deck
- Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies
- Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated
- Some items, such as occupation and industry, are notoriously difficult to edit, since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use “not stated”

Methods of Correcting and Imputing Data: General Principles
- The imputed record should closely resemble the failed edit record; impute for a minimum number of variables
- The imputed record should satisfy all edits
- All imputed values should be flagged, and the methods and sources of imputation should be clearly specified
- Both un-imputed and imputed values should be stored to allow evaluation of the degree and effects of imputation

Edit Trails and the Use of Imputation Flags
- It is important to generate an edit trail showing all data changes and substituted values, with their tallies
- Counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item; ...
- Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; flags should be added to each item that is imputed
- Although a separate file with imputation flags takes up considerable space, this information is critical for planning future censuses; e.g., as a means to investigate the age threshold below which a female with “child ever born” triggers a query edit, and to decide whether the threshold should be modified for future rounds
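One way to combine per-item flags, an edit trail, and imputation-rate counters is sketched below. The structure is an assumption for illustration (the presentation does not prescribe a file layout); the names `impute`, `imputation_rates`, `flags` and `trail` are hypothetical, and the rate calculation assumes each item is imputed at most once per record.

```python
from collections import Counter


def impute(record, flags, trail, item, new_value, reason):
    """Replace one item, set its flag to 1, and log old and new values."""
    old_value = record[item]
    if old_value == new_value:
        return  # nothing changed, so the flag stays at 0
    record[item] = new_value
    flags[item] = 1
    trail.append({"id": record["id"], "item": item,
                  "old": old_value, "new": new_value, "reason": reason})


def imputation_rates(trail, n_records):
    """Share of records in which each item was imputed."""
    counts = Counter(entry["item"] for entry in trail)
    return {item: count / n_records for item, count in counts.items()}


record = {"id": 8, "sex": 9, "age": 99}
flags = {item: 0 for item in record}  # every flag starts at 0
trail = []

impute(record, flags, trail, "sex", 2, "hot deck")
impute(record, flags, trail, "age", 37, "hot deck")

print(record)
print(flags)
print(imputation_rates(trail, n_records=10))
```

Because the trail keeps the original value next to the substituted one, both un-imputed and imputed data remain available for later evaluation.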

THANK YOU!
