Progress on the SDC Strategy for the 2011 Census 23rd June 2008 Keith Spicer and Caroline Young.


1 Progress on the SDC Strategy for the 2011 Census 23rd June 2008 Keith Spicer and Caroline Young

2 Outline
Context
Work plan
Description of the short-listed methods
Quantitative evaluation: some results
Conclusions and further work

3 Context
SDC for 2011 Census outputs is a major concern for users
Different SDC methodologies were adopted for tabular 2001 Census outputs across the UK
Late addition of small cell adjustment by ONS/NISRA resulted in a high level of user confusion and dissatisfaction
Publicised commitment to aim for a common UK SDC methodology for all 2011 Census outputs

4 Workplan
Phase 1 (March ’06 – Jan ’07)
–UK agreement of key SDC policy issues
Phase 2 (Jan ’07 – Sept ’08)
–Evaluation of all methods complying with agreed SDC policy position in terms of risk/utility framework and feasibility of implementation
Phase 3 (Sept ’08 – Spring/Summer ’09)
–Recommendations and UK agreement of SDC methodologies for 2011 Census tabular outputs
Phase 4 (Feb ’09 onwards)
–Evaluate and develop SDC methods for microdata, future work on output specification, system specification, development and testing

5 Progress
Development of SDC Strategy
–UK SDC working group established to take forward methodological work, consisting of representatives from Wales, Northern Ireland and Scotland
–UKCDMAC subgroup set up to QA work
Methodological research:
–Determine the short-list of SDC methods (Aug ’07)
–Quantitative evaluation of short-list (due for completion Sep ’08)
Focus on tabular outputs whilst considering impact on other outputs (e.g. microdata)

6 Quantitative Evaluation
Examine how methods protect and manage risk and how they impact on data utility
Using a range of 2001 Census tables, varying parameters, different geographies
Information Loss software used to evaluate each short-listed method

7 Short-listed Methods
Being considered for 2011 Census data; applied so that ‘safe’ tabular outputs can be released:
–Record Swapping
–Over-imputation
–ABS Cell Perturbation (developed by the Australian Bureau of Statistics)
2001 Census SDC methods used as a baseline for comparison: Record Swapping and Small Cell Adjustment (SCA)

8 Short-listed SDC methods
Record Swapping and Over-imputation: pre-tabular (applied directly to the microdata)
ABS Cell Perturbation: post-tabular (applied to tables)
SCA (a type of rounding) is also a post-tabular method

9 Record Swapping
Swap the geographical location of a small number of households
Households are paired according to similar characteristics (to avoid too much data distortion)
Creates uncertainty in the data
Can swap unique records only (those at greater risk)

10 Record Swapping
Record in Area A. Characteristics: Age: 22, Sex: Male, Marital Status: Married, No. of Cars: 3, Region: Area A. Unique as the only person with 3 cars in Area A.
Record in Area B. Characteristics: Age: 22, Sex: Male, Marital Status: Married, No. of Cars: 1, Region: Area B. Matches all variables except No. of Cars.
Treatment:
–Find a different geographical area
–Identify another individual in a different area with virtually all the same characteristics
–Swap the two records
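
The treatment on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the ONS implementation: the function name, the field names (`area`, `cars`, and so on) and the rule "swap only records that are unique on one risk variable within their area" are assumptions made for the example.

```python
import random

def swap_unique_records(records, match_keys, risk_key, seed=0):
    """Swap the area of each record that is unique on `risk_key` within its
    area with a record from another area that agrees on all `match_keys`.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    records = [dict(r) for r in records]  # work on a copy of the input

    # Count how often each (area, risk value) combination occurs.
    counts = {}
    for r in records:
        key = (r["area"], r[risk_key])
        counts[key] = counts.get(key, 0) + 1

    # Records at greater risk: the only one with this value in their area.
    unique = [i for i, r in enumerate(records)
              if counts[(r["area"], r[risk_key])] == 1]

    swapped = set()
    for i in unique:
        if i in swapped:
            continue
        r = records[i]
        # Candidate partners: a different area, same matching characteristics.
        candidates = [j for j, s in enumerate(records)
                      if j not in swapped and s["area"] != r["area"]
                      and all(s[k] == r[k] for k in match_keys)]
        if not candidates:
            continue  # no suitable partner, leave the record where it is
        j = rng.choice(candidates)
        records[i]["area"], records[j]["area"] = records[j]["area"], records[i]["area"]
        swapped.update({i, j})
    return records

people = [
    {"age": 22, "sex": "M", "marital": "Married", "cars": 3, "area": "A"},
    {"age": 35, "sex": "F", "marital": "Single", "cars": 1, "area": "A"},
    {"age": 36, "sex": "F", "marital": "Single", "cars": 1, "area": "A"},
    {"age": 22, "sex": "M", "marital": "Married", "cars": 1, "area": "B"},
    {"age": 70, "sex": "F", "marital": "Widowed", "cars": 1, "area": "B"},
]
result = swap_unique_records(people, ["age", "sex", "marital"], "cars")
```

In this toy data the 22-year-old with 3 cars is the only unique record, and the only matching partner is the 22-year-old in Area B, so the two exchange areas while the number of records per area stays the same.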

11 Over-Imputation
Imputation is a standard procedure for census data, used to insert plausible values for those missing due to non-response
Since it is not known whether imputed records are true or false, imputation can also be used for SDC
Carried out by the Edit and Imputation team at ONS using CANCEIS
Algorithm: distance-based nearest neighbour, used as a donor based on a set of matching variables

12 Over-Imputation
1) Blank out values for certain records in the data
2) Replace blanked-out values with ‘imputed values’ using a nearest neighbour donor
Example records: 25, male, single, 6 people in hhld, 0 cars, student; and 21, male, single, 6 people in hhld, 0 cars, student
–Blank out age from record
–Find a donor to impute age
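
The two steps can be sketched as follows. This is a simplified stand-in for CANCEIS, not its actual algorithm: the distance is just a count of mismatching variables, and the function and field names are hypothetical. The example assumes the second record on the slide (age 21) is the donor.

```python
def over_impute(records, target, match_keys, index):
    """Blank out `target` in records[index] and refill it from the nearest
    neighbour donor. Hypothetical helper; the distance used here is simply
    the number of matching variables on which donor and recipient differ."""
    recipient = dict(records[index])
    recipient[target] = None  # step 1: blank out the value

    def distance(donor):
        return sum(donor[k] != recipient[k] for k in match_keys)

    # Step 2: pick the closest other record that has a value to donate.
    donors = [r for i, r in enumerate(records)
              if i != index and r[target] is not None]
    donor = min(donors, key=distance)
    recipient[target] = donor[target]
    return recipient

people = [
    {"age": 25, "sex": "male", "marital": "single", "hh_size": 6,
     "cars": 0, "activity": "student"},
    {"age": 21, "sex": "male", "marital": "single", "hh_size": 6,
     "cars": 0, "activity": "student"},
    {"age": 67, "sex": "female", "marital": "widowed", "hh_size": 1,
     "cars": 1, "activity": "retired"},
]
keys = ["sex", "marital", "hh_size", "cars", "activity"]
imputed = over_impute(people, target="age", match_keys=keys, index=0)
```

Here the first record loses its age of 25 and receives 21 from the donor that matches it on every other variable.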

13 Which variables to impute?
Risky variables? Ethnicity, elderly, other minority populations
CANCEIS may impute exactly the original value if using a nearest neighbour donor
Impute age (all donors) and small area geography (use only donors within the same local authority): get a small margin of error

14 (ABS) Cell Perturbation
Developed by the Australian Bureau of Statistics (ABS)
Perturb each cell value in a table to create uncertainty around the true value
Two-stage method:
–Stage 1: Adding Perturbation
–Stage 2: Restoring Additivity

15 (ABS) Cell Perturbation
Stage 1: Each cell is always perturbed in the same way using microdata keys – CONSISTENCY
Stage 2: Restoring ADDITIVITY means consistency is lost slightly
An improved approach is being developed in collaboration with Southampton University: optimise consistency and additivity – INVARIANT cell perturbation
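
The consistency property of Stage 1 can be illustrated with a small sketch. It assumes each microdata record carries a fixed random key and that a cell's perturbation is derived from the sum of the keys of the records in it; the modulus, the shift rule and the function names are hypothetical choices for the example, not the ABS parameters. Stage 2 (restoring additivity) is not shown.

```python
import random

BIG = 2 ** 16  # hypothetical modulus for record and cell keys

def assign_record_keys(n_records, seed=0):
    """Give every microdata record a fixed random key."""
    rng = random.Random(seed)
    return [rng.randrange(BIG) for _ in range(n_records)]

def perturb_cell(record_ids, keys, max_shift=2):
    """Return a perturbed count for the cell containing `record_ids`.
    The shift depends only on the cell key, so the same set of records
    always receives the same perturbation, whichever table it appears in."""
    count = len(record_ids)
    if count == 0:
        return 0  # empty cells are left at zero in this sketch
    cell_key = sum(keys[i] for i in record_ids) % BIG
    shift = cell_key % (2 * max_shift + 1) - max_shift  # in [-max_shift, max_shift]
    return max(0, count + shift)

keys = assign_record_keys(10)
table_1_cell = perturb_cell([1, 4, 7], keys)
table_2_cell = perturb_cell([7, 1, 4], keys)  # same records, another table
```

Because both cells contain the same three records, they share a cell key and so publish the same perturbed value.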

16 Results
What is the effect on statistical quality of the data?
–Tendency to increase correlations?
–Tendency to distort distance metrics?
–etc. (many ways to measure information loss)
Impact on disclosure risk
Examine different types of data

17 Results Only Over-Imputation, Record Swapping and Record Swapping with SCA have been evaluated so far. Both targeted and random approaches are being looked at. Note there are different ways of carrying out swapping and imputation, so interpretation of the results should take this into account.

18 Data for Analysis
SJ EA; approx. 200,000 households and 500,000 persons
Four census tables so far:
(1) Country of birth by religion by sex: individuals at ward level
(2) Number of persons by accommodation type: households at OA and ED level
(3) Age by religion by gender: individuals at OA and ED level
(4) Origin-destination table: flows between home and travel-to-work location

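19 Measures of Quality
Impact on Tests for Independence: Cramér’s V measure of association: $V = \sqrt{\dfrac{\chi^2}{n \, \min(R-1,\, C-1)}}$, where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $R$ and $C$ are the numbers of rows and columns (standard definition)
Also, the same measure for entropy and the Pearson statistic
Variance of Cell Counts: computed for each row [formulas lost in extraction]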
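
As a worked illustration of the first measure, the sketch below computes Cramér's V from its standard definition for an original and a perturbed table. The percentage-change comparison and the toy tables are assumptions for the example; the slide's own comparison formula did not survive in the transcript.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic and grand total of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * min(R - 1, C - 1)))."""
    chi2, n = pearson_chi2(table)
    n_rows, n_cols = len(table), len(table[0])
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

def pct_change(original, perturbed):
    """Hypothetical summary: percentage change in V after perturbation."""
    v_orig, v_pert = cramers_v(original), cramers_v(perturbed)
    return 100.0 * (v_pert - v_orig) / v_orig

original = [[10, 0], [0, 10]]   # toy table with perfect association
perturbed = [[9, 1], [1, 9]]    # same table after a small perturbation
loss = pct_change(original, perturbed)
```

A negative value indicates that the perturbation has weakened the measured association between the row and column variables.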

20 Measures of Utility
Impact on Rank Correlations: sort original cell counts and define deciles; repeat on perturbed cell counts [formula lost in extraction; it uses the indicator function I and the number of rows]
Log Linear Analysis: ratio of the deviance (likelihood ratio test statistic) between perturbed table and original table for a given model [formula lost in extraction]

21 Impact on Disclosure Risk

22 Quality Measures


24 Changes to Totals / Subtotals
Swapping does not change the overall set of household locations, so totals and subtotals by geography are preserved
Over-Imputation does change the set of locations, so totals and subtotals by geography are not preserved
Swapping has no impact on Origin-Destination total flows – NO PROTECTION
Over-Imputation does not preserve O/D total flows – POOR QUALITY
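
The contrast in geographic totals can be checked on a toy example. The three households and the choice of which ones are swapped or imputed are made up for illustration.

```python
from collections import Counter

def area_totals(households):
    """Number of households in each area."""
    return Counter(h["area"] for h in households)

households = [
    {"id": 1, "area": "A"},
    {"id": 2, "area": "A"},
    {"id": 3, "area": "B"},
]

# Record swapping: households 2 and 3 exchange areas.
swapped = [dict(h) for h in households]
swapped[1]["area"], swapped[2]["area"] = swapped[2]["area"], swapped[1]["area"]

# Over-imputation of geography: household 2's area is blanked out and
# replaced with the area of a donor (household 3); the donor is unchanged.
imputed = [dict(h) for h in households]
imputed[1]["area"] = imputed[2]["area"]

original_totals = area_totals(households)  # A: 2, B: 1
swapped_totals = area_totals(swapped)      # A: 2, B: 1
imputed_totals = area_totals(imputed)      # A: 1, B: 2
```

Swapping only exchanges locations between records, so the area totals are unchanged, whereas imputing geography moves a household into the donor's area and shifts the totals.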

25 Conclusions
Decide whether to drop over-imputation: test on another EA?
Quantitative Evaluation to be finished by September ’08
ABS cell perturbation method currently being evaluated – results are looking good

26 Further Work
Setting of parameter values for final method, e.g. level of perturbation
Protection of microdata samples
Communal establishments
Output specification / geography
System specification, development and testing

27 Contact Details Useful links: a/outputconfidentiality.asp
