
IPUMS-International Integration Process Matt Sobek Minnesota Population Center


1 IPUMS-International Integration Process Matt Sobek Minnesota Population Center sobek@pop.umn.edu

2 [Flowchart: data and metadata tracks across four stages]
Stage 1, Input material. Data: data files. Metadata: data dictionary, enumeration forms, enumeration instructions, sample information.
Stage 2, Pre-processing. Data: batch samples, reformat data, donation, draw sample, confidentiality A. Metadata: translate to English, images to editable files.
Stage 3, Standardization. Data: code clean-up, verify data, confidentiality B. Metadata: IPUMS data dictionary, tag enumeration text, document unharmonized variables.
Stage 4, Integration. Data: harmonize codes, variable programming, constructed variables. Metadata: variable descriptions, sample design.

3 End Matt Sobek Minnesota Population Center sobek@pop.umn.edu

4 Batch Samples In spring we identify the samples to integrate the following year. Samples are processed as a group – one per year. The entire batch of samples is processed through each stage before we proceed to the next step. There is little flexibility in the work process. If a sample is not available for processing during the earliest stages of integration, it cannot be included in the data release for that year.

5 Original Input Data Some examples of differing file formats: SPSS and SAS system files; Redatam format; IMPS format; records that combine household and person characteristics; separate files for persons, households (and dwellings, buildings); different types of records (mortality or migration); separate files for different administrative units.

6 Reformatting: Original Data File

7 Reformatting: Data File after Reformatting

8 Reformatting: Rectangular Sample (Brazil 1980) Person records only; household data duplicated on person records. [Diagram: in the rectangular layout every person record (head, spouse, child) carries its own copy of the geography and housing fields; in the hierarchical layout one geography-and-housing record is followed by that household's person records.]

9 Reformatting: Dwelling-Household-Person Sample (Chile 1992) Separate dwelling and household records. [Diagram: in one layout a dwelling record is followed by one or more household records, each followed by its person records (head, spouse, child); in the other, each household has a single combined dwelling-household record followed by its person records.]

10 Reformatting: Merge Household and Person Files (Brazil 2000)
Household File: serial 001 geog & housing; serial 002 geog & housing; serial 003 geog & housing.
Person File: serial 001 head; serial 001 spouse; serial 002 head; serial 002 child; serial 003 head.
Merged file: serial 001 household, serial 001 head, serial 001 spouse; serial 002 household, serial 002 head, serial 002 child; serial 003 household, serial 003 head.
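The merge on this slide can be sketched in a few lines. This is only an illustration under assumptions: the record layout (dicts with a `serial` key), the `rectype` marker, and the function name `merge_household_person` are hypothetical and do not come from the presentation.

```python
from collections import defaultdict


def merge_household_person(households, persons):
    """Interleave household and person records that share a serial number.

    Returns (merged, unmatched): merged is a flat list in which every
    household record is followed by its person records; unmatched lists
    the serials of person records that have no household record.
    """
    persons_by_serial = defaultdict(list)
    for person in persons:
        persons_by_serial[person["serial"]].append(person)

    merged = []
    for household in sorted(households, key=lambda rec: rec["serial"]):
        merged.append({"rectype": "H", **household})
        # pop() removes matched serials so that only orphans remain
        for person in persons_by_serial.pop(household["serial"], []):
            merged.append({"rectype": "P", **person})

    unmatched = sorted(persons_by_serial)
    return merged, unmatched


# The three households shown on the slide
households = [{"serial": s, "data": "geog & housing"} for s in ("001", "002", "003")]
persons = [
    {"serial": "001", "relate": "head"},
    {"serial": "001", "relate": "spouse"},
    {"serial": "002", "relate": "head"},
    {"serial": "002", "relate": "child"},
    {"serial": "003", "relate": "head"},
]
merged, unmatched = merge_household_person(households, persons)
for rec in merged:
    print(rec["rectype"], rec["serial"], rec.get("relate", "household"))
```

Returning the unmatched serials separately makes structural problems (person records without a household) visible instead of silently dropping them.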

11 Reformatting: Persons not Organized in Households (Mexico 1960) Individuals only; not organized in households. [Diagram: the file consists of individual records that combine geography, person, and housing fields, with no household grouping.]

12 Donation and Error Correction Data are tested for errors that affect structural integrity, such as merged households, unmatched person and household records, and corrupted records. Such errors often do not affect tabulations, but they create inconsistencies across records within households that affect sophisticated analyses. Some problems can be resolved with custom programming. Other problems are resolved by donation: substituting a donor household for the corrupted one. Households are divided into strata based on predictor variables. Donors are drawn from the same stratum as the corrupted household, ensuring they share key characteristics. If a sample is drawn from the full census, a substitute donor record is used; if we are already starting with a sample, the donor record is duplicated. A flag indicates that a record was duplicated.
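A minimal sketch of stratified donation, covering only the case where the donor record is duplicated. The field names (`id`, `donated`, `donor_id`), the function name, and the random choice among eligible donors are assumptions for illustration; the slides do not say how a donor is picked within a stratum.

```python
import random


def substitute_donors(households, corrupted_ids, strata_keys, seed=0):
    """Replace corrupted households with a copy of a donor from the same stratum.

    households: list of dicts, each with an "id" and the predictor variables
    corrupted_ids: set of ids flagged as structurally corrupted
    strata_keys: names of the predictor variables that define a stratum
    """
    rng = random.Random(seed)

    # Pool the intact households by stratum
    pools = {}
    for hh in households:
        if hh["id"] not in corrupted_ids:
            stratum = tuple(hh[k] for k in strata_keys)
            pools.setdefault(stratum, []).append(hh)

    result = []
    for hh in households:
        if hh["id"] not in corrupted_ids:
            result.append({**hh, "donated": 0})
            continue
        stratum = tuple(hh[k] for k in strata_keys)
        donors = pools.get(stratum)
        if not donors:
            raise ValueError(f"no donor available in stratum {stratum}")
        donor = rng.choice(donors)
        # Duplicate the donor, keep the original id, and flag the record
        result.append({**donor, "id": hh["id"], "donated": 1, "donor_id": donor["id"]})
    return result
```

The sketch assumes the stratifying variables of the corrupted household are still readable; if they are not, the stratum would have to be derived some other way.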

13 Drawing a Sample About one-third of IPUMS samples are drawn from full-count data. After reformatting, we draw a systematic sample of every Nth dwelling to yield the desired sample density – typically 10%. If the input data are not full-count (for example, they include only the long-form records), the sample design might have to account for differing sample densities between areas. Very large dwelling units (over 30 persons) are sampled at the individual level – not as intact units – in order to reduce sampling error. Every Nth individual is taken.
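The sampling rule above can be sketched as follows. The random start, the pooling of persons across all large dwellings, and the names `systematic_sample`, `interval`, and `large_threshold` are assumptions; the slide states only that every Nth dwelling is taken and that dwellings of over 30 persons are sampled at the individual level.

```python
import random


def systematic_sample(dwellings, interval=10, large_threshold=30, seed=0):
    """Draw every Nth dwelling; sample very large dwellings person by person.

    dwellings: list of dicts with "id" and "persons" (a list of person records)
    Returns (sampled_dwellings, sampled_persons), where sampled_persons is a
    list of (dwelling_id, person) pairs drawn from the large dwellings.
    """
    rng = random.Random(seed)
    small = [d for d in dwellings if len(d["persons"]) <= large_threshold]
    large = [d for d in dwellings if len(d["persons"]) > large_threshold]

    # Intact units: random start, then every Nth dwelling
    start = rng.randrange(interval)
    sampled_dwellings = small[start::interval]

    # Large units: every Nth individual rather than whole dwellings
    large_persons = [(d["id"], p) for d in large for p in d["persons"]]
    person_start = rng.randrange(interval)
    sampled_persons = large_persons[person_start::interval]

    return sampled_dwellings, sampled_persons
```

With `interval=10` this yields roughly the 10% density mentioned on the slide; differing densities between areas would need a per-area interval, which this sketch does not attempt.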

14 Confidentiality Measures: A Swap a small percentage of cases between geographic areas. Reorder households within geographic areas. Suppress low-level geographic variables. Suppress any variable deemed too sensitive by the National Statistical Office. Encrypt all versions of the data that predate the application of these confidentiality measures.

15 Code Clean-Up: Recoding Unharmonized Variables Recode the input variables to conform to some basic standards for treatment of missing values, etc. Recode stray values into a consolidated missing category as appropriate. Convert non-numeric characters to numeric. Most recoding is performed using a data translation matrix like the one below for Marital Status in 1984 Costa Rica. If the recoding requires more complex logic, use custom programming.
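A table-driven recode of this kind might look like the sketch below. The codes in `MARST_TABLE` and the missing value `9` are invented for illustration; they are not the actual Costa Rica 1984 translation matrix, which appears only as an image in the original slide.

```python
# Hypothetical translation table: raw input code -> cleaned numeric code
MARST_TABLE = {
    "1": 1,
    "2": 2,
    "3": 3,
    "A": 4,  # non-numeric input character converted to a numeric code
}
MISSING = 9  # consolidated missing category (assumed value)


def recode(value, table, missing=MISSING):
    """Look up a raw value; stray or blank values fall into the missing category."""
    return table.get(str(value).strip(), missing)


def recode_column(values, table, missing=MISSING):
    """Apply the table to every value in a column."""
    return [recode(v, table, missing) for v in values]


print(recode_column(["1", "A", "?", "", " 2"], MARST_TABLE))
```

Keeping the mapping in a table rather than in code means most variables need no programming at all, which matches the division of labor the slide describes.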

16 Verify Data: Unharmonized Variables Examine the marginal frequencies of every input variable. Analyze the data universe for each variable – the population at risk of having a response. Determine the theoretical universe from enumeration materials or other documentation, then empirically determine any discrepancies from that universe. Document the universe for each variable and any other observations.
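A universe check compares the theoretical universe with what the data actually contain. In the sketch below, the person records use the values of the simple Colombia 1985 household that appears later in this deck; the `NIU` marker, the function name, and the candidate universe rules are assumptions.

```python
NIU = None  # "not in universe" marker (assumed representation of n/a)


def check_universe(records, variable, in_universe):
    """Return (missing_inside, valid_outside) as lists of person numbers.

    missing_inside: persons in the theoretical universe without a response
    valid_outside: persons outside the universe who nevertheless have a value
    """
    missing_inside = [
        r["pernum"] for r in records if in_universe(r) and r[variable] is NIU
    ]
    valid_outside = [
        r["pernum"] for r in records if not in_universe(r) and r[variable] is not NIU
    ]
    return missing_inside, valid_outside


household = [
    {"pernum": 1, "age": 46, "sex": "male", "chborn": NIU},
    {"pernum": 2, "age": 44, "sex": "female", "chborn": 3},
    {"pernum": 3, "age": 77, "sex": "female", "chborn": 7},
    {"pernum": 4, "age": 15, "sex": "female", "chborn": 0},
    {"pernum": 5, "age": 13, "sex": "female", "chborn": NIU},
    {"pernum": 6, "age": 11, "sex": "male", "chborn": NIU},
]

# Candidate universe: females aged 15 and over (a hypothetical rule)
print(check_universe(household, "chborn", lambda r: r["sex"] == "female" and r["age"] >= 15))
```

Both lists empty suggests the candidate rule fits the data; a non-empty list points to either a wrong rule or a data problem worth documenting.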

17 Confidentiality Measures: B Recode geographic units to ensure small localities cannot be identified (typically those with fewer than 20,000 persons). For recent censuses: Identify cells that represent very small numbers of persons in the population. Code them to a residual category or combine them. Top- or bottom-code continuous variables that have a long tail that could identify small subpopulations. Suppress specific categories of variables as requested by the National Statistical Office.
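Two of these measures, top-coding and collapsing small cells, are simple enough to sketch. The thresholds and the residual label are placeholders; the slide gives no specific cut-offs for these two steps.

```python
from collections import Counter


def top_code(values, cap):
    """Replace every value above the cap with the cap itself."""
    return [min(v, cap) for v in values]


def collapse_small_cells(codes, min_count, residual):
    """Move categories with fewer than min_count cases into a residual category."""
    counts = Counter(codes)
    return [c if counts[c] >= min_count else residual for c in codes]


print(top_code([1, 5, 12, 99], cap=10))
print(collapse_small_cells(["a", "a", "a", "b", "c", "c"], min_count=2, residual="other"))
```

Bottom-coding is the mirror image, using `max(v, floor)` in place of `min(v, cap)`.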

18 Harmonize Codes: Translation Matrix for Marital Status [Table with one column of input codes per sample: China 1982, Colombia 1973, Kenya 1989, Mexico 1970, U.S.A. 1990]

19 Variable Programming Some variable manipulations are too complex to be handled using the translation matrix tables. Typically these involve continuous variables or recoding logic that refers to multiple variables. This programming is written in C++.

20 Constructed “Pointer” Variables (Colombia 1985) (Simple household)
Pernum  Relate  Age  Sex     Marst    Chborn  Spouse's location  Mother's location  Father's location
1       head    46   male    married  n/a     2                  0                  0
2       spouse  44   female  married  3       1                  0                  0
3       aunt    77   female  widow    7       0                  0                  0
4       child   15   female  single   0       0                  2                  1
5       child   13   female  single   n/a     0                  2                  1
6       child   11   male    single   n/a     0                  2                  1
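For the simple household above, the pointers follow directly from the relationship-to-head codes. The sketch below handles only heads, spouses, and children of the head. The names `spouse_loc`, `mother_loc`, and `father_loc` are hypothetical, and the rules are a simplification, not the project's actual algorithm, which must also resolve complex households.

```python
def simple_pointers(persons):
    """Derive spouse, mother, and father locations for a simple household.

    persons: list of dicts with "pernum", "relate", and "sex".
    Returns {pernum: {"spouse_loc": n, "mother_loc": n, "father_loc": n}},
    where 0 means no such person was identified in the household.
    """
    head = next((p for p in persons if p["relate"] == "head"), None)
    spouse = next((p for p in persons if p["relate"] == "spouse"), None)

    pointers = {
        p["pernum"]: {"spouse_loc": 0, "mother_loc": 0, "father_loc": 0}
        for p in persons
    }

    # Head and spouse point at each other
    if head and spouse:
        pointers[head["pernum"]]["spouse_loc"] = spouse["pernum"]
        pointers[spouse["pernum"]]["spouse_loc"] = head["pernum"]

    # Children of the head point at the head and the head's spouse, by sex
    for p in persons:
        if p["relate"] != "child":
            continue
        for parent in (head, spouse):
            if parent is None:
                continue
            key = "mother_loc" if parent["sex"] == "female" else "father_loc"
            pointers[p["pernum"]][key] = parent["pernum"]
    return pointers


household = [
    {"pernum": 1, "relate": "head", "sex": "male"},
    {"pernum": 2, "relate": "spouse", "sex": "female"},
    {"pernum": 3, "relate": "aunt", "sex": "female"},
    {"pernum": 4, "relate": "child", "sex": "female"},
    {"pernum": 5, "relate": "child", "sex": "female"},
    {"pernum": 6, "relate": "child", "sex": "male"},
]
for pernum, locs in simple_pointers(household).items():
    print(pernum, locs)
```

Links such as grandchild to parent or non-relative child to non-relative mother need further evidence (age gaps, children ever born, record order), which is why they are left out here.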

21 Constructed “Pointer” Variables (Colombia 1985) (Complex household)
Pernum  Relationship  Age  Sex     Marst      Chborn  Spouse's location  Mother's location  Father's location
1       head          53   female  separated  6       0                  0                  0
2       child         28   male    single     n/a     0                  1                  0
3       child         22   male    single     n/a     0                  1                  0
4       child         21   male    single     n/a     0                  1                  0
5       child         25   female  married    2       6                  1                  0
6       child-in-law  28   male    married    n/a     5                  0                  0
7       grandchild    3    male    single     n/a     0                  5                  6
8       grandchild    1    male    single     n/a     0                  5                  6
9       non-relative  32   female  separated  2       0                  0                  0
10      non-relative  10   male    single     n/a     0                  9                  0
11      non-relative  5    female  single     n/a     0                  9                  0

22 Original Data Dictionary – Kenya 1989

23 Original Data Dictionary – Romania 1992

24 Original Data Dictionary – China 1982

25 Original Data Dictionary – Mexico 1990

26 Enumeration Form: Original File

27 Enumeration Instructions: Original File (Mexico 1990)

28 Sample Information – from Statistical Office Sample information is difficult for the IPUMS project to collect. Often only limited information can be gleaned from available documentation. It is extremely helpful when countries collate the information themselves, as was done below by the Netherlands:

29 Translate Documents to English Many countries provide their census documentation in English. For those that do not, the IPUMS project hires translators from around the world. Often these are persons currently or formerly associated with National Statistical Offices. Some common languages are translated by staff in Minnesota.

30 Editable Enumeration Form – In English

31 IPUMS Data Dictionary

32 XML-Tagged Enumeration Form

33 Document Unharmonized Variables The enumeration form and instruction text provide most of the documentation for the unharmonized input variables. Other documentation is written as needed to clarify the interpretation of the variable for users. We also empirically determine the universe of persons or households with valid values for each variable.

34 Variable Description (Literacy)

35 Sample Design

