Creating Open Data whilst maintaining confidentiality
Philip Lowthian, Caroline Tudor
Office for National Statistics

Main points of presentation
- What are Open Data
- Why the need for Open Data
- Steps to protect the data prior to release
- Privacy Concerns
- An Example
- The Future for Open Data

Definition of Open Data
“Open Data simply encompasses data that are made available by organisations, businesses and individuals for anyone to access, use and share no matter where they are and what they want to do with the data.”
Open Data Institute: Guides – What are Open Data

Why there is interest in Open Data
- Current release options were considered too restrictive and did not allow data to be used to full capacity
- Government commitment to the open data agenda resulted in the Open Government Licence (OGL)
  - no restrictions on use, no registration required
  - disclosure risk must be reduced to low or negligible

Open data – Initial steps
- Different to other microdata releases: negligible possibility of identification
- All direct identifiers removed
- Sensitive variables may be recoded or removed
- Possibly of limited use to researchers, but can be used as teaching / training datasets

Open data – Initial steps
- Users are allowed to publish, adapt and combine with other data as long as the information is not personal data
- Difficult balancing act between producing Open Data which are of some use and which remain protected to a reasonable level, even when combined with other data sources

Risk – Utility
[Figure: The Risk Utility Relationship. Axes: Disclosure Risk (information about confidential units) against Data Utility (information about legitimate items), each running from Low to High.]

1. Assess dataset background and context
User requirements
- If possible, discuss with potential users.
- Think about the following:
  - Variables
  - Level of Detail
  - Geography

1. Assess dataset background and context
Dataset details
- Is the original dataset a survey sample, an administrative dataset or a census?
- Sample survey: there is doubt as to whether a given member of the population is in the sample.
- Administrative dataset or census: could release a sample from the complete data.
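
Releasing a sample from a complete dataset can be sketched as a simple random draw without replacement. This is a minimal illustration, not the procedure used by the authors: the function name `draw_sample`, the 1% fraction and the toy records are all assumptions made for the example.

```python
import random

def draw_sample(records, fraction, seed=0):
    """Simple random sample without replacement (illustrative only)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)

# Hypothetical complete dataset of 1,000 records with no direct identifiers.
population = [{"row": i, "age": 18 + i % 60} for i in range(1000)]
sample = draw_sample(population, fraction=0.01)
print(len(sample))  # 10
```

Because only a fraction of the records is released, an intruder cannot be sure that a known individual is in the file at all, which is the same doubt that protects a sample survey.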

2. Intruder Scenarios
Why might an intruder want to discover confidential information?
- Identity theft
- Gain against commercial competitors
- Self identification
- Journalist after a good ‘public interest’ story
- Sensitive information about people (salary, health)
- Discredit government or GSS
- Nosy neighbour
- Database enhancement

2. Intruder Scenarios
- Nosy neighbour: would know certain facts in the dataset about a neighbour or colleague and could use these to discover private details.
- Journalist: could use the data to find out personal information about an individual who is unique on a set of variables.

3. Determine the Key Variables
Variables most likely to lead to confidential information being found in a dataset. Either:
- Visible variables: those that an intruder might possibly know through observation
Or:
- Sensitive variables: those that, if known by an intruder, would be likely to assist in an immediate identification

3. Determine the Key Variables
The choice of key variables will depend on the dataset. Typical key variables are:
- Age (individual or grouped)
- Sex
- Health indicator (more likely to be a key variable if a specific condition)
- Size / composition of household
- Income (household or individual)
- Occupation / Industry
- Ethnic Group
- Religion
- Country of birth
- Marital Status

3. Determine the Key Variables
Key variables unique to particular datasets:
- Dwelling characteristics
- Household structure
- Education variables
- ‘Response’ variable for each record: an outcome that relates to the specific purpose of the collection
  - Income
  - Expenditure
  - Gas / Electricity Consumption

4. Outputs from variable combinations
- Select combinations of key variables
- Look for rare combinations or uniques in these combinations
- Protect data by methods such as removal of variables or records, recoding or record swapping
- Carry out intruder testing (internal and external)
- Repeat the above steps
- Publish data under an Open Licence
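
The search for rare combinations can be sketched as a frequency count over every chosen combination of key variables. This is a minimal sketch under stated assumptions: the function name `rare_cell_report`, the threshold of 2 (which flags uniques only) and the toy records are hypothetical and do not come from the presentation.

```python
from collections import Counter
from itertools import combinations

def rare_cell_report(records, key_vars, max_size=2, threshold=2):
    """For each combination of key variables, count the records that fall
    in cells with fewer than `threshold` members (threshold=2 means uniques)."""
    report = {}
    for size in range(1, max_size + 1):
        for combo in combinations(key_vars, size):
            counts = Counter(tuple(r[v] for v in combo) for r in records)
            report[combo] = sum(n for n in counts.values() if n < threshold)
    return report

# Hypothetical records, already stripped of direct identifiers.
records = [
    {"age_band": "16-24", "sex": "F", "region": "Wales"},
    {"age_band": "16-24", "sex": "F", "region": "Wales"},
    {"age_band": "16-24", "sex": "M", "region": "London"},
    {"age_band": "75+", "sex": "M", "region": "London"},
]
report = rare_cell_report(records, ["age_band", "sex", "region"])
for combo, n_rare in report.items():
    print(combo, n_rare)
```

Combinations with a non-zero count are candidates for recoding, removal or swapping before the check is run again.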

What is Intruder Testing?
- Use ‘Friendly Intruders’ (usually internal, for example ONS staff) to see if they can re-identify anyone in the dataset
- Discover what additional information is used by the intruders
- Discover which variables are used by intruders when attempting identification
- Determine the level of disclosure risk in the dataset empirically

Privacy concerns
Data released under OGL, although protected, will have a residual risk due to:
- The ‘mosaic’ effect: linking different similar datasets may help identify a record in the data. The possibility of this will increase with greater computer power and improved matching software.
- Access cannot be withdrawn from an open dataset
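
The mosaic effect can be illustrated by linking an open file to a second source on the variables they share. The sketch below is hypothetical throughout: the function names, the variable names and both toy datasets are assumptions, and real matching software is considerably more sophisticated.

```python
from collections import defaultdict

def group_by_key(records, shared_vars):
    """Group records by their values on the shared variables."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[v] for v in shared_vars)].append(record)
    return groups

def one_to_one_links(open_data, external, shared_vars):
    """Return (external, open) pairs whose key is unique in both files."""
    open_groups = group_by_key(open_data, shared_vars)
    ext_groups = group_by_key(external, shared_vars)
    links = []
    for key, open_records in open_groups.items():
        ext_records = ext_groups.get(key, [])
        if len(open_records) == 1 and len(ext_records) == 1:
            links.append((ext_records[0], open_records[0]))
    return links

shared = ["age_band", "sex", "region"]
# Hypothetical open file: no names, but a sensitive variable is present.
open_data = [
    {"age_band": "75+", "sex": "M", "region": "Wales", "health": "Bad"},
    {"age_band": "16-24", "sex": "F", "region": "London", "health": "Good"},
    {"age_band": "16-24", "sex": "F", "region": "London", "health": "Fair"},
]
# Hypothetical second source that carries names.
external = [
    {"name": "Person A", "age_band": "75+", "sex": "M", "region": "Wales"},
    {"name": "Person B", "age_band": "16-24", "sex": "F", "region": "London"},
]
links = one_to_one_links(open_data, external, shared)
for known, released in links:
    print(known["name"], "->", released["health"])  # Person A -> Bad
```

Only the record that is unique in both files links one to one; the two London records remain ambiguous, which is why reducing uniques lowers the risk from linkage.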

Example: 2011 Census teaching dataset
- Teaching dataset introduced in addition to Safeguarded and Secure Access data
- Also acts as a taster for more detailed datasets
- Approx 500,000 records for England and Wales
- Protection is given to members of this dataset by:
  - Small sample size: small likelihood of an individual being in the sample
  - Record swapping: geographic perturbation
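
Record swapping as geographic perturbation can be sketched as exchanging the geography of a risky record with that of a record from a different area. This is a deliberately simplified illustration and not the swapping method used for the Census: the function name `swap_regions`, the random choice of partner and the toy records are all assumptions.

```python
import random

def swap_regions(records, risky_indices, seed=0):
    """Swap the 'region' of each risky record with a record from another region.
    Returns new records; the input list is left unchanged."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]
    used = set()  # each record takes part in at most one swap
    for i in risky_indices:
        if i in used:
            continue
        candidates = [
            j for j, r in enumerate(swapped)
            if j != i and j not in used and r["region"] != swapped[i]["region"]
        ]
        if not candidates:
            continue  # no partner available in a different region
        j = rng.choice(candidates)
        swapped[i]["region"], swapped[j]["region"] = (
            swapped[j]["region"], swapped[i]["region"])
        used.update((i, j))
    return swapped

# Hypothetical records; index 0 is assumed to have been flagged as risky.
records = [
    {"age_band": "75+", "region": "North East"},
    {"age_band": "16-24", "region": "North East"},
    {"age_band": "25-34", "region": "North East"},
    {"age_band": "16-24", "region": "Wales"},
    {"age_band": "25-34", "region": "Wales"},
    {"age_band": "35-44", "region": "Wales"},
]
swapped = swap_regions(records, risky_indices=[0])
print(swapped[0]["region"])  # Wales
```

The number of records in each region is unchanged by a swap, but an intruder can no longer be certain that the geography shown for any one record is the true one.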

Steps to follow: 2011 Census teaching dataset
- Remove all direct personal identifiers such as Name, Address and Date of Birth
- There are a large number of variables in the original data. Decide on the variables to include in the teaching dataset, to include geography (Region) and basic demographic information
- Identify Key variables
- Create combinations of Key variables from both sample and population
- Look for unique or rare combinations
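
Comparing combinations in the sample with the same combinations in the population can be sketched as follows. This is a minimal sketch: the function names, the two key variables and the toy data are assumptions made for illustration.

```python
from collections import Counter

def cell_counts(records, key_vars):
    """Count how many records share each combination of key-variable values."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)

def count_uniques(sample, population, key_vars):
    """Return (number of sample uniques, how many are also population uniques)."""
    sample_counts = cell_counts(sample, key_vars)
    population_counts = cell_counts(population, key_vars)
    sample_uniques = [k for k, n in sample_counts.items() if n == 1]
    population_uniques = [k for k in sample_uniques if population_counts[k] == 1]
    return len(sample_uniques), len(population_uniques)

key_vars = ["age_band", "sex"]
# Hypothetical population of six records and a sample of three drawn from it.
population = (
    [{"age_band": "16-24", "sex": "F"}] * 3
    + [{"age_band": "25-34", "sex": "M"}] * 2
    + [{"age_band": "75+", "sex": "M"}]
)
sample = [population[0], population[3], population[5]]
print(count_uniques(sample, population, key_vars))  # (3, 1)
```

A record that is unique in the sample but shared by several people in the population carries much less risk than one that is unique in both, so the second count is the one that matters most.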

Steps to follow: 2011 Census teaching dataset
- Recode some of the most identifying variables
  - Recode Age into 8 categories
  - Recode Ethnic group from 16 to 5 categories
  - Additional recodes for Industry, Economic Activity and Religion
- Recreate the variable combinations: many sample uniques but very few population uniques
- Carry out some additional record swapping: swap a small number of the most risky records at Region level
- Release the data under OGL
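
Recoding a detailed variable into broader categories can be sketched with age bands. The slides state only that Age was recoded into 8 categories; the band boundaries and labels below are an assumption made for illustration, as is the function name `recode_age`.

```python
from bisect import bisect_right

# Assumed lower bounds and labels for 8 age bands (not taken from the slides).
AGE_BAND_STARTS = [0, 16, 25, 35, 45, 55, 65, 75]
AGE_BAND_LABELS = ["0-15", "16-24", "25-34", "35-44",
                   "45-54", "55-64", "65-74", "75+"]

def recode_age(age):
    """Map a single year of age to its band label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the band starts that are <= age.
    return AGE_BAND_LABELS[bisect_right(AGE_BAND_STARTS, age) - 1]

ages = [3, 16, 24, 47, 75, 90]
bands = [recode_age(a) for a in ages]
print(bands)  # ['0-15', '16-24', '16-24', '45-54', '75+', '75+']
```

Coarser bands put more records into each cell, so combinations that were unique on single year of age are often no longer unique after recoding.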

The Future
- Currently Open Data are often released alongside other (more detailed) licensed data.
- Confidentiality of data released under OGL is protected by law.
- They will be of limited use for complex research projects and will be used mainly for training / teaching purposes.
- Will the data be used by more than a small group of people?

Any Questions?
