Editing and Imputing Income Data in the 2008 Integrated Census, prepared by Yael Klejman, Israel Central Bureau of Statistics

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE, CONFERENCE OF EUROPEAN STATISTICIANS
Work Session on Statistical Data Editing, The Hague, April 2017

Good afternoon, my name is Yael Klejman. I am with the Israel Central Bureau of Statistics, where I work in the Census Planning and Development Sector. I am involved in all aspects of the planning of the next census…

I am going to present the methodology for editing and imputing income data that was applied in the 2008 population census, and I will give a short review of what is planned for the coming 2020 census.

The 2008 Population Census

Integrated census:
Use of administrative files for demographic data, mainly the Central Population Registry (CPR)
Large sample survey (17% of households):
    Statistical correction of the addresses provided by the CPR
    Collecting socio-economic information

The 2008 population census was an integrated census: it combined administrative sources with a large field survey. The main administrative source was the Central Population Registry (CPR), which supplied the demographic data for 100% of the census population. The sample field survey had two purposes: to correct the addresses provided by the CPR, and to collect socio-economic data not available in administrative sources. The socio-economic topics covered were education, labor force characteristics, household typology, housing, ownership of valuable goods and disability.

Income Data

Administrative sources:
Income Tax Authority for work income
National Insurance Institute for allowances
Advantages:
Reduced burden on population
Improved accuracy of data

In past censuses, respondents in the field survey were asked about their work and income. After deficiencies were identified in the reporting and editing of income in the 1995 census, it was decided ahead of the 2008 Population Census to use administrative files as the source of income data. The administrative sources are the Income Tax Authority file for work income and the National Insurance Institute (NII) for allowances. Using administrative files for work income has two advantages: a reduced response burden on the population and higher data quality. It should be noted that the field survey did include questions on non-work income, such as income from pensions and income from rent (low response rate).

Census (Socio Economic File)
The first step was to map out the population involved. This resulted in the following scheme: the Socio Economic Census File includes the answers of all the individuals who responded: …… The discrepancies that required further investigation were the following two groups: Group A and Group B.
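The mapping step can be sketched as a comparison of two sets of person identifiers. This is a minimal illustration, not the CBS procedure: the function name, the dictionary layout and the sample IDs are assumptions, and the group definitions are taken from the Group A and Group B slide titles later in the talk.

```python
def map_population(census_workforce, tax_ids):
    """Split person IDs by census workforce status and Tax File presence.

    census_workforce: dict of person ID -> True if the person reported being
                      in the annual workforce (hypothetical layout)
    tax_ids:          set of person IDs found in the Tax File
    """
    groups = {"matched": set(), "A": set(), "B": set(), "neither": set()}
    for pid, in_workforce in census_workforce.items():
        in_tax = pid in tax_ids
        if in_workforce and in_tax:
            groups["matched"].add(pid)
        elif in_workforce:
            groups["A"].add(pid)  # in workforce per census, missing from Tax File
        elif in_tax:
            groups["B"].add(pid)  # not in workforce per census, found in Tax File
        else:
            groups["neither"].add(pid)
    return groups


# Illustrative IDs only.
census = {1: True, 2: True, 3: False, 4: False}
tax_file = {1, 3}
result = map_population(census, tax_file)
print(sorted(result["A"]), sorted(result["B"]))
```

Only groups A and B need further treatment; the other two cells are consistent between the census and the Tax File.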

Group A: reported in census as being in annual workforce but not found in Tax File
Missing employers
Missing occupations
Rest of population

Analysing group A, the following cases were found: … I am going to elaborate on these groups.

Group A (not found in the Tax File): 86% employees, 13% self-employed, 1% unpaid workers or kibbutz members
Reason: late reporting by the employer or the self-employed individual
Decision: impute for these two sub-groups (employees and self-employed)

It was decided not to impute for the unpaid workers and the kibbutz members, because their work arrangements are very particular.

Imputation for Missing Employer
Condition:
Employer is completely missing from 2008 Tax File
Employee was employed by same employer in 2007
Process:
Find employee in 2007 Tax File
Adjust salary from 2007 to 2008 by industry
Result:
Insert into Socio Economic File

The first group included the employees whose employers were entirely missing from the 2008 Tax File. We found both large and small employers who had not yet filed tax reports for 2008. We searched for those employers in the 2007 Tax File. If we found that the employee had been employed by the same employer, we imputed his or her 2007 salary after adjusting it to 2008 by an index. So basically, it is a cold-deck imputation with an adjustment for inflation. The imputation was based on the number of work months reported by the individual in the 2008 census.
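As a rough sketch of this cold-deck step: take the 2007 salary, scale it by an industry index, and multiply by the months worked in 2008. The talk does not give the index values or say whether the 2007 salary was monthly or annual, so the function name, the monthly-salary assumption and the index figures below are all hypothetical.

```python
# Hypothetical 2007-to-2008 wage indices by industry (illustrative values only).
INDEX_2007_TO_2008 = {"manufacturing": 1.04, "education": 1.02}


def impute_from_2007(monthly_salary_2007, industry, months_worked_2008,
                     index_by_industry=INDEX_2007_TO_2008):
    """Cold-deck imputation: carry a 2007 monthly salary forward to 2008.

    Assumes a monthly 2007 salary is available and that the number of work
    months comes from the 2008 census questionnaire.
    """
    adjusted_monthly = monthly_salary_2007 * index_by_industry[industry]
    return adjusted_monthly * months_worked_2008


# An employee earning 5000 per month in 2007 who reported 6 work months in 2008.
annual_2008 = impute_from_2007(5000.0, "manufacturing", 6)
print(round(annual_2008, 2))
```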

Missing occupations

Caretakers
Career military personnel
Alternative chosen for both: statistical imputation based on the Income Survey

The second group includes individuals who reported working as caregivers, nannies, babysitters or cleaners of private homes. These individuals are missing from the Tax File because, owing to their low earnings, they do not report to the tax authority. Military personnel are removed from the Tax File received by CBS for security reasons. The Income Survey is a yearly survey carried out by CBS. It has a yearly sample of 15,000 households and focuses on household income.

Statistical imputation for missing occupations
Job extent, occupation
Job extent, occupation, (industry)
Job extent, occupation, (industry), age group

The base for this imputation was the Income Survey. The imputation was applied in stages. At the first stage, the average income (employees and self-employed together) was calculated from the Income Survey sample by the four variables (job extent, occupation, industry, age group). For every missing record of a caretaker, a record was randomly selected from the Income Survey by these four variables. If no fit was found, the wages were imputed by the weighted average income for the three variables. If there was still no fit, the wages were imputed by the weighted average income for the two variables.

Employees and self-employed were treated together: it was found that there is no correlation between employment type (employee or self-employed) and wages. The average wage difference was explained by differences in the age distribution of self-employed versus employees.

It was important to maintain a similar distribution in the imputed records compared to the Income Survey. In order to achieve similar variance, for every missing record of a caretaker, a record was randomly selected from the Income Survey.
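The staged fallback described above can be sketched as follows: try a random donor that matches on all four variables, then fall back to a weighted mean over three variables, then over two. This is an illustration under assumptions, not the CBS code: the field names, the `weight` field for survey weights and the sample records are hypothetical.

```python
import random

KEYS = ("job_extent", "occupation", "industry", "age_group")


def impute_wage(record, survey, rng):
    """Return (wage, method) for a census record with a missing wage."""
    # Stage 1: random donor among survey records matching all four variables.
    full_key = tuple(record[k] for k in KEYS)
    donors = [s for s in survey if tuple(s[k] for k in KEYS) == full_key]
    if donors:
        return rng.choice(donors)["wage"], "random donor (4 variables)"
    # Stages 2 and 3: weighted mean over the first three, then two variables.
    for n in (3, 2):
        key = tuple(record[k] for k in KEYS[:n])
        cell = [s for s in survey if tuple(s[k] for k in KEYS[:n]) == key]
        if cell:
            total_weight = sum(s["weight"] for s in cell)
            mean = sum(s["wage"] * s["weight"] for s in cell) / total_weight
            return mean, f"weighted mean ({n} variables)"
    return None, "no match"


# Illustrative survey records only.
survey = [
    {"job_extent": "part", "occupation": "caretaker", "industry": "households",
     "age_group": "30-39", "wage": 3000, "weight": 2.0},
    {"job_extent": "part", "occupation": "caretaker", "industry": "households",
     "age_group": "40-49", "wage": 4000, "weight": 1.0},
    {"job_extent": "full", "occupation": "caretaker", "industry": "health",
     "age_group": "30-39", "wage": 6000, "weight": 1.0},
]
missing = {"job_extent": "part", "occupation": "caretaker",
           "industry": "households", "age_group": "50-59"}
print(impute_wage(missing, survey, random.Random(0)))
```

Drawing a real donor record at the first stage, rather than always using a cell mean, is what keeps the variance of the imputed wages close to that of the survey.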

Rest of Population

“Nearest neighbor” method using the CANCEIS program developed in Canada
At individual level
Socio-economic variables: job extent, occupation, industry, highest education degree, gender, age group, residence locality, number of children in household, marital status
Separate imputation for institution residents

For the rest of the population, nearest-neighbor imputation was applied, using the CANCEIS program developed in Canada. The imputation was done at the individual level, taking into account the socio-economic characteristics of the individual (the household level would have provided fewer donors). A separate income imputation was conducted for residents of institutions and included an extra variable, “type of institution”.

CANCEIS is a complex program that allows for hot-deck probability imputation. It uses many parameters and allows logical rules to be set. It calculates the distance between records in order to determine how close they are when searching for a donor to contribute a missing value. Two main constraints were introduced into the process of choosing the nearest neighbor:
In calculating the distance between a record with missing income and a record with income, a higher weight was given to the variables occupation, highest degree, gender and age group (after examination, these were found to have the highest correlation with income).
A donor can be used only once (in order to maintain variance).
The program performs random sampling from a group of the nearest neighbors.
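The two constraints and the random draw can be illustrated with a small stand-alone sketch. It is not CANCEIS and does not use its distance function: the weighted mismatch count, the weight values, the field names and the parameter `k` are all assumptions chosen for illustration.

```python
import random

# Hypothetical weights: the four variables named in the talk count more.
WEIGHTS = {"occupation": 3.0, "highest_degree": 3.0, "gender": 3.0,
           "age_group": 3.0, "job_extent": 1.0, "industry": 1.0}


def distance(a, b):
    """Weighted count of variables on which two records disagree."""
    return sum(w for var, w in WEIGHTS.items() if a[var] != b[var])


def impute_nearest_neighbor(recipients, donors, k=2, seed=0):
    """Give each recipient the income of a random one of its k nearest donors.

    Each donor is used at most once; recipients left without an available
    donor are not returned.
    """
    rng = random.Random(seed)
    available = list(donors)
    imputed = []
    for rec in recipients:
        if not available:
            break
        nearest = sorted(available, key=lambda d: distance(rec, d))[:k]
        donor = rng.choice(nearest)
        available.remove(donor)  # a donor can be used only once
        imputed.append({**rec, "income": donor["income"], "donor_id": donor["id"]})
    return imputed


# Illustrative records only.
base = {"occupation": "clerk", "highest_degree": "BA", "gender": "F",
        "age_group": "30-39", "job_extent": "full", "industry": "finance"}
donors = [dict(base, id=1, income=9000),
          dict(base, id=2, income=7000, occupation="driver"),
          dict(base, id=3, income=8000, industry="retail")]
print(impute_nearest_neighbor([base, base], donors, k=2, seed=1))
```

Sampling among several close donors, instead of always taking the single closest one, spreads the imputed values and works together with the use-once rule to preserve variance.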

Donor population

Donor population: individuals in yearly workforce.
NOT included:
Kibbutz members
Imputed occupations
Individuals earning in the highest income percentile of each occupation (2 digits of the international classification of occupations)
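A possible way to build such a donor pool is sketched below. The talk does not define the percentile cutoff precisely, so this sketch assumes it means incomes above the 99th percentile within each 2-digit occupation; the field names and flags are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def build_donor_pool(records):
    """Keep workforce records that are eligible to donate income."""
    eligible = [r for r in records
                if r["in_workforce"]
                and not r["kibbutz_member"]
                and not r["imputed_occupation"]]
    incomes_by_occ = defaultdict(list)
    for r in eligible:
        incomes_by_occ[r["occupation_2digit"]].append(r["income"])
    # Assumed cutoff: the 99th percentile of income within each occupation.
    cutoffs = {}
    for occ, incomes in incomes_by_occ.items():
        if len(incomes) >= 2:  # quantiles() needs at least two values
            cutoffs[occ] = quantiles(incomes, n=100, method="inclusive")[98]
    return [r for r in eligible
            if r["occupation_2digit"] not in cutoffs
            or r["income"] <= cutoffs[r["occupation_2digit"]]]


# Illustrative data: incomes 1..101 in one occupation, plus a kibbutz member.
records = [{"in_workforce": True, "kibbutz_member": False,
            "imputed_occupation": False, "occupation_2digit": "23",
            "income": i} for i in range(1, 102)]
records.append({"in_workforce": True, "kibbutz_member": True,
                "imputed_occupation": False, "occupation_2digit": "23",
                "income": 50})
pool = build_donor_pool(records)
print(len(pool), max(r["income"] for r in pool))
```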

Group B: reported as not in annual workforce but found in Tax File
73% income from salary
23% non-work related income
Reason: irregular employment pattern or response by proxy
Decision: include work income
Status added: “has income from work but reported as not in annual workforce”

It was decided to use the Tax Authority File as the administrative source for this variable, but also to add a special status, so that users can decide whether to use this data or not.

Topcoding procedure

Define threshold:
Calculate interquartile range at locality level
Multiply by factor (urban=4, rural=3)
ID records:
Identify records above threshold
Minimum 3 records per locality
Edit:
Calculate average of all top earners
Replace income for those records

In order to avoid identifying individuals with exceptionally high income, it was decided to edit the income data for those individuals. Several alternatives were examined, after which the following procedure was applied. First, a threshold was defined at the locality level by calculating the interquartile range and multiplying it by a factor: 4 for urban localities, 3 for rural authorities. The second step was to identify the records above the threshold. For each locality in which at least one record was identified, the procedure was applied to at least 3 records. Finally, the average of those records was calculated and used in all of those records. This method maintained the average and distribution of the income data at the locality level.
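The three steps can be sketched for a single locality as follows. The threshold is taken literally from the talk as factor × interquartile range; whether the actual rule added this to an upper quartile is not stated, so treat that as an assumption. The function name and the sample incomes are hypothetical.

```python
from statistics import mean, quantiles


def topcode_locality(incomes, factor, min_records=3):
    """Replace the top incomes of one locality by their common average."""
    q1, _, q3 = quantiles(incomes, n=4, method="inclusive")
    threshold = factor * (q3 - q1)  # literal reading of the talk (assumption)
    n_above = sum(1 for x in incomes if x > threshold)
    if n_above == 0:
        return list(incomes)  # nothing to edit in this locality
    # Edit at least `min_records` top earners once any record exceeds the threshold.
    order = sorted(range(len(incomes)), key=lambda i: incomes[i], reverse=True)
    top = order[:max(n_above, min_records)]
    top_mean = mean(incomes[i] for i in top)
    edited = list(incomes)
    for i in top:
        edited[i] = top_mean
    return edited


# Illustrative urban locality (factor 4) with one extreme income.
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 1000]
edited = topcode_locality(incomes, factor=4)
print(edited)
```

Because the edited records all receive the average of the group, the locality total, and therefore the locality mean, is unchanged.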

Allowances

Based on Personal Identification Number, allowances were received from National Insurance Institute
Eight types of allowances
Number of months received
Side file used to calculate variables in SEF at individual and household level

Just to complete the picture, the work income data was complemented by the allowance data in order to calculate the income available to the household and the individual. The administrative source was the NII (National Insurance Institute). The data was taken as is; no editing or imputation methods were used. [Speaker's note: did we check the data somehow?]

Results

Percent   Imputation type
84.9%     Records in workforce with income in income file
5.3%      Records with income in 2007 Income File
2.7%      Imputation based on Income Survey for missing occupations
6.2%      Nearest neighbor imputation (CANCEIS)
1.0%      Nearest neighbor imputation for institution residents (CANCEIS)
100.0%    Records in workforce

The results were as follows: almost 85% of the records were found in the Tax Authority file and were taken as is. The rest were taken as I explained: …

Evaluation

Average income by age group: imputed records vs. records from Tax Authority

Difference   Tax Authority records   Imputed records   Age group
29%          2167                    2785              Under 20
11%          4870                    5400              20-29
-2%          9185                    8962              30-39
-2%          10493                   10312             40-49
-5%          11164                   10623             50-59
-8%          10529                   9636              60+

The evaluation process included comparing the average income of the imputed records with the average income in the Tax File by several variables: highest degree, age group, occupation. All came back similar, without great discrepancies. Here you can see the comparison by age group.
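The Difference column is not defined in the talk, but the figures are consistent with the relative difference (imputed − Tax Authority) / Tax Authority, rounded to a whole percent. A short check of that reading, using the averages from the slide (the function name is an assumption):

```python
def relative_difference(tax_mean, imputed_mean):
    """Percent difference of the imputed average relative to the Tax File average."""
    return round(100 * (imputed_mean - tax_mean) / tax_mean)


# (Tax Authority average, imputed average) by age group, from the slide.
by_age = {
    "Under 20": (2167, 2785),
    "20-29": (4870, 5400),
    "30-39": (9185, 8962),
    "40-49": (10493, 10312),
    "50-59": (11164, 10623),
    "60+": (10529, 9636),
}
differences = {age: relative_difference(tax, imp) for age, (tax, imp) in by_age.items()}
print(differences)
```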

Average income by occupation: imputed records vs. records from Tax Authority file

Difference    Tax Authority records   Imputed records   Occupation
-7%           14416                   13457             Academic professionals
-2%           8280                    8077              Associate professionals and technicians
1%            18927                   19051             Managers
7%            7255                    7755              Clerical workers
19%           5478                    6531              Agents, sale workers and services workers
5%            7541                    7945              Skilled agricultural workers
-4%           7913                    7561              Mechanics, electricians
(not shown)   6404                    6257              Painters, tailors, printing workers, workers in food processing
-10%          6864                    6171              Drivers, ship deck crews, packaging machine operators, potters and glass makers
(not shown)   4703                    4745              Unskilled workers

Maintain distribution
Maintain statistics

Here the comparison is displayed as average income by occupation: imputed records vs. records in the Tax Authority file.

Evaluation (cont.)

Records from Tax Authority   Imputed records   Statistic
8696                         8845              Mean
6031                         6185              Median
4000 (only one value shown)                    Mode
99                           97                CV
8626                         8618              Standard deviation
3.3                          3.2               Skewness

Maintain distribution
Maintain statistics

The evaluation also included a comparison of basic statistics on the income distribution of imputed records vs. records from the Tax Authority File. The results were satisfactory: the imputation methods maintained the distribution and maintained the statistics.
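The CV row is consistent with the usual coefficient of variation, standard deviation divided by mean, expressed in percent. The talk does not state the formula, so this is an inferred reading; the check below uses the means and standard deviations from the slide.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Coefficient of variation in percent, rounded to a whole number."""
    return round(100 * std_dev / mean_value)


cv_tax = coefficient_of_variation(8626, 8696)      # records from Tax Authority
cv_imputed = coefficient_of_variation(8618, 8845)  # imputed records
print(cv_tax, cv_imputed)
```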

Future Plans

Examination of Multiple Imputation method (MI): simultaneous imputation on several variables. The method maintains distribution.
Foreigners living in Israel: inquiring administrative sources and developing models for income data.

Our methodologists are currently examining the option of using MI for imputing the education variables. MI allows simultaneous imputation of several variables and maintains the distribution of all variables in the file. The program assumes that the missing values are Missing at Random.

MI compared with nearest neighbor: this is under examination in our methodology department. They are testing it against certain variables and have tested imputation of the education variables. The results are no worse than those of CANCEIS, and the ability to preserve the distribution is better than with CANCEIS, but we do not know the truth.

Foreigners are not included in the Tax File. We are examining enumeration methods to include them in the next census, and inquiring into administrative sources for income data.

Thank you!
