- National Institute for Demographic Studies


1 - National Institute for Demographic Studies
Anonymisation, data protection and confidentiality: the contribution of INED’s Survey Department, illustrated through several examples.
Survey Department, National Institute for Demographic Studies. Céline Dauplait, Raphaël Laurent

2 Structure of this presentation
Introducing the Survey Department.
Who owns the identity data? Examples: the ‘Adoption’ and ‘ERFI’ surveys.
The work on identifier variables. Examples: the ‘MGIS’, ‘EHF’ and ‘EFE’ surveys.
Conclusion: questions about the limits of data dissemination.

3 Presentation of the Survey Department
A specialised survey department that oversees data collection for projects conducted by INED’s researchers. Member of the « Réseau Quetelet ». Three missions: 1- Survey production; 2- Methodological research; 3- Survey archiving and data dissemination.
The Survey Department of the National Institute for Demographic Studies is a specialised survey department that oversees data collection for projects conducted by the Institute’s researchers and in partnership with other institutions. The department is a member of the Réseau Quetelet. It has three main missions. The first is quantitative and qualitative data collection, from questionnaire design to final data presentation. The second is methodological research and the assessment of innovative survey techniques and protocols in the social sciences. The third is to manage the survey archives and make survey data available to users.

4 Owning the data: a precaution or a constraint?
INED uses several procedures to ensure that confidentiality, anonymisation and data protection are respected. The first is not to hold the respondents’ identity data.

5 “Adoption in France” Survey
INED does not hold the contact details. Survey carried out with the assistance of the child welfare services (CWS), which collected the data.
Diagram labels: candidates, CWS, questionnaires, anonymous data file, INED.
Take, for example, the ‘Adoption in France’ survey, conducted in 2003 and 2004 in 10 French départements. Everyone wishing to adopt applies to the child welfare services of their département. The first data were taken from the files of all candidates who applied for approval to adopt a child (1,857 files). Further information was obtained through a short questionnaire (40% replied). The data from both the files and the questionnaires were anonymous. For this survey, the choice was made not to contact the respondents directly: the child welfare services handled the first data collection and handed out the questionnaires. INED therefore had no contact with the respondents and did not hold their contact details, which preserves anonymity. The drawback is that this procedure makes reminders, and hence any improvement of the response rate, impossible (only 40% responded, mostly people who had obtained approval).

6 Study of family and inter-generational relationships
Sample drawn from the 1999 population census. Follow-up and authorisation forms. Diagram labels: INSEE’s sample, INSEE, INED, 2008.
The longitudinal Study of Family and Inter-generational Relationships (ERFI) was conducted in 2005. It is an example of how INED builds an address file. The sample was drawn from the 1999 population census carried out by the National Institute for Statistics and Economic Studies (INSEE). INSEE holds the address data and does not pass on people’s contact details. So that the sample could be followed up, respondents were asked for their address during the interview (follow-up form). In this way, respondents who agree to take part in the second wave can be contacted again by INED in 2008; 88.8% of those interviewed agreed to be contacted again. In addition, a second form was presented to respondents: an authorisation form for questions that the CNIL (the French National Commission for Data Protection and Liberties) judged to be sensitive (religion, questions implying sexual orientation). This last precaution prompted many questions from respondents: confidentiality has been guaranteed throughout the interview, so why do I have to sign a form?
To conclude on holding address data: these two examples show that having a third party own the identity data both guarantees anonymity and makes it harder to follow up with respondents. As for the dataset itself, several operations were carried out to ensure anonymisation; for example, some variables were recoded or removed. For the ERFI survey, the main recoding work concerned the socio-professional variables (recoded to two digits).

7 Working on the dataset: how far? Principles and particular cases

8 Identifiers and anonymisation : dealing with confidentiality
Direct identifiers: names, addresses (including postcode information), telephone numbers, etc. Indirect identifiers: variables that, when linked with other sources, could result in a breach of confidentiality. Procedure: remove the identifiers, aggregate or reduce the precision, bracket a coded variable, generalise the meaning of a nominal variable, restrict the upper or lower ranges.
Direct and indirect identifier variables have to disappear from the dataset. Direct identifiers are names, addresses, telephone numbers, etc. Indirect identifiers are variables containing information that, when linked with other publicly available sources, could result in a breach of confidentiality. This could include geographical information, workplace or organisation, educational institution or occupation. Each survey therefore has its own potential sources of confidentiality breach, depending on its specifications. The main techniques used for quantitative data are:
* Remove the identifier from the dataset: remove any detailed personal information that is a direct identifier and is not generally relevant for legitimate secondary research. Example: remove respondents' names, addresses, postcode information, institution and telephone numbers.
* Aggregate or reduce the precision of a variable: reduce the precision of potentially revealing socio-demographic characteristics, such as the respondent's age and place of residence. As a general rule, report the lowest level of geo-referencing that will not potentially breach respondent confidentiality. The exact scale depends on the sort of data collected, but very detailed geo-references such as full postcodes, wards, or names of small towns or villages are always likely to be problematic. Example: record the year of birth rather than the day, month and year; record postcode sectors (first 3 or 4 digits) rather than full postcodes.
* Bracket a coded variable: a term applied to the aggregation of coded (categorical) variables. Any coded variable that potentially identifies the respondent should be aggregated into broader codes. For variables that use standard hierarchical codes (like the standard occupational classification), such aggregation can be automated. Example: if detailed 'unit group' standard occupational classification (SOC) employment codes might indirectly identify individuals, they can be aggregated up to 'minor group' codes by removing the terminal digit.
* Generalise the meaning of a nominal (string) variable: another form of aggregation, but one that generally relates to nominal rather than coded variables. Nominal variables have often been directly transcribed from a free-text response, and may require subsequent manual editing to ensure they no longer directly or indirectly identify individual respondents. Example: a study of doctors' expertise contains several nominal (string) variables that list detailed areas of medical expertise that could indirectly identify a doctor. These variables could be replaced by more general text, or even coded into generic responses such as 'one area of medical speciality', 'two or more areas of medical speciality', etc.
* Restrict the upper or lower ranges of a continuous variable: continuous variables, i.e. ones that record the value of an actual quantity, may indirectly identify a respondent if the values are unusual or atypical of the respondents surveyed. In such circumstances the unusually large or small values might be collapsed into a single code, even if the other responses are kept as actual quantities (alternatively, one might code all responses). Example: annual salary could be 'top-coded' to avoid identifying highly paid individuals, e.g. a top code of £100,000 or more could be applied (even if lower incomes are not coded into groups).
We now examine several surveys as examples of the anonymisation work carried out.
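The automatable techniques listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`name`, `birth_date`, `postcode`, `soc_code`, `salary`), the function name and the default thresholds are hypothetical and do not come from any INED file. Generalising free-text variables is left out, since the slide notes that it usually requires manual editing.

```python
from typing import Any

# Hypothetical column names; a real survey file would use its own.
DIRECT_IDENTIFIERS = {"name", "address", "telephone"}


def anonymise_record(record: dict[str, Any],
                     salary_cap: int = 100_000,
                     postcode_digits: int = 3) -> dict[str, Any]:
    """Return an anonymised copy of one respondent record."""
    # 1. Remove direct identifiers.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # 2. Reduce precision: keep only the year of an ISO date (YYYY-MM-DD)
    #    and only the leading characters of the postcode.
    if "birth_date" in out:
        out["birth_year"] = int(str(out.pop("birth_date"))[:4])
    if "postcode" in out:
        out["postcode_sector"] = str(out.pop("postcode"))[:postcode_digits]

    # 3. Bracket a hierarchical code: drop the terminal digit.
    if "soc_code" in out:
        out["soc_minor_group"] = str(out.pop("soc_code"))[:-1]

    # 4. Top-code a continuous variable and flag the capped values.
    if "salary" in out:
        out["salary_top_coded"] = out["salary"] >= salary_cap
        out["salary"] = min(out["salary"], salary_cap)

    return out
```

Variables that are not mentioned (for instance the actual questionnaire answers) pass through unchanged. Keeping a separate `salary_top_coded` flag is one possible design choice: it lets secondary users tell a capped value from a salary that really equals the cap.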

9 MGIS: Geographical Mobility and Social Insertion survey
Conducted in 1992, in collaboration with INSEE. Anonymisation work on the dataset (variables removed or recoded) by the Survey Department. Analysis carried out by the researchers who designed the survey. A loss of quality and precision?
The Geographical Mobility and Social Insertion survey was conducted in 1992. It covers the living conditions of immigrants and their children. The sample was drawn from INSEE’s population census. Some of the variables in the dataset were judged potentially dangerous: those containing direct identifiers (children’s names), those containing free-text answers, dates judged too precise (month of birth), and language or country variables whose nomenclature seemed too detailed. Variables were recoded where possible; otherwise they were removed from the dataset. For example, the geographical variables were recoded into three variables (a regional code, a random identifying code and a code for the type of urban unit), which makes it possible to check whether people moved without making them identifiable by cross-referencing several sources. For the language variables, the choice was made to group the categories into broader codes so as not to be too precise (local dialects do not appear). This raises a question: given the essential qualities and aims of the survey, does the dataset lose its interest through recoding? (The original data remain accessible for well-justified requests, on signature of a pledge form.) By way of contrast, another survey chose not to lose the richness of its data.
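The three-variable geographical recoding described on this slide could look roughly like the sketch below. It is an assumption-laden illustration, not the procedure actually used for MGIS: the field names (`commune`, `region`, `place_id`, `urban_unit_type`), the lookup tables and the function name are hypothetical. The idea is that each locality receives a random token, so two records can be compared for a change of place while the locality itself is no longer named.

```python
import secrets


def recode_geography(records, commune_to_region, commune_to_urban_type):
    """Replace a detailed locality code by three coarser variables."""
    place_ids = {}  # locality code -> random token, never released
    used = set()
    recoded = []
    for rec in records:
        commune = rec["commune"]
        if commune not in place_ids:
            token = secrets.token_hex(4)
            while token in used:  # guarantee one distinct token per locality
                token = secrets.token_hex(4)
            used.add(token)
            place_ids[commune] = token
        new = {k: v for k, v in rec.items() if k != "commune"}
        new["region"] = commune_to_region[commune]
        new["place_id"] = place_ids[commune]
        new["urban_unit_type"] = commune_to_urban_type[commune]
        recoded.append(new)
    return recoded


# Usage with made-up codes: person 1 moved between 1980 and 1992.
records = [
    {"person_id": 1, "year": 1980, "commune": "C1"},
    {"person_id": 1, "year": 1992, "commune": "C2"},
    {"person_id": 2, "year": 1992, "commune": "C1"},
]
regions = {"C1": "R1", "C2": "R1"}
urban_types = {"C1": "rural", "C2": "large city"}
recoded = recode_geography(records, regions, urban_types)
moved = recoded[0]["place_id"] != recoded[1]["place_id"]
```

Because the token table is discarded after recoding, the released file still shows that person 1 changed locality within region R1, but not which localities were involved.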

10 EHF: the Study of Family History, France's 1999 family survey
The family survey: coupled with the population census since 1954. The subject chosen for 1999: inter-generational transmission of languages and dialects. A survey that makes it possible to study exceptional events or particular populations. INED: one of the depositaries of the dataset.
The family survey has been coupled with the population census since 1954, making it one of INSEE's oldest sample surveys. In 1999, the subject chosen for the family survey’s questionnaire was the inter-generational transmission of languages and dialects. For anonymisation, the usual procedure was adopted: names and addresses removed, country-of-birth codes aggregated, etc. The size of this survey makes it possible to study exceptional events or particular populations (adoption, population…), hence the need for a raw, detailed data file. INED therefore became one of the depositaries of the dataset, together with INSEE and the French Archives. To obtain the detailed data, researchers have to sign a pledge form.

11 EFE: The “Family and Employers” survey
A matched survey on reconciling family and working life: a household survey and an establishment survey. www-efe.ined.fr
Diagram figures: 11,759 dwellings; >9,400 respondents (household survey); 4,600 respondents in establishments of 20 persons or more; 4,600 establishments; >3,000 respondents (establishment survey).
From a sample of 11,759 dwellings (INSEE), a face-to-face survey was conducted by INSEE interviewers among 9,400 individuals. During the interview, respondents were asked to give their employer’s name, address and identification number, as well as its size and sector, and to indicate whether the establishment had a personnel department (so that the postal mailing could be targeted). The employers of the 4,641 respondents working in an establishment of 20 persons or more were contacted through a self-administered postal and internet survey: around 4,500 establishments, of which more than 3,000 took part.

12 EFE: Data protection
Specificity: protect the identities of both employees and establishments. Specific data. INSEE owns the households’ identifying data.
On data protection for this survey: to protect the identity of employees and establishments, it was first decided not to survey establishments employing fewer than 20 persons, so that the employee who answered the household survey cannot be identified. In addition, identifier variables were removed (names, addresses, SIRET number) and some variables were recoded; for example, the activity nomenclature was coded into broader fields. Recoding the workforce size (effectif) or the industry sector was considered, but these variables will not be changed. The names and contact details of the establishments have been kept by INED so that they can be contacted again. The individual-level dataset, however, is owned by INSEE. This made it difficult for some establishments to answer: establishments with several categories of staff (hospitals, transport, etc.) did not know which category to answer for, and INED had no means of obtaining that information. Employers were therefore instructed to answer for the majority category of employees. This procedure introduces a bias in the employers’ dataset.
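The EFE protection steps described on this slide (a size threshold for the establishment survey, removal of identifiers, coarser activity codes) can be sketched as follows. The field names, the two function names and the truncation of the activity code to its first two characters are assumptions made for illustration; the slide only says that the nomenclature was coded into broader fields, not how.

```python
MIN_STAFF = 20  # threshold stated on the slide
ESTABLISHMENT_IDENTIFIERS = {"name", "address", "siret"}


def select_for_employer_survey(establishments):
    """Keep only establishments large enough to hide a single employee."""
    return [e for e in establishments if e["staff"] >= MIN_STAFF]


def build_release_file(establishments, activity_digits=2):
    """Drop identifiers and coarsen the activity code before dissemination."""
    released = []
    for est in establishments:
        row = {k: v for k, v in est.items()
               if k not in ESTABLISHMENT_IDENTIFIERS}
        # Assumes a hierarchical activity code whose leading characters
        # designate the broader field.
        row["activity"] = str(row["activity"])[:activity_digits]
        released.append(row)
    return released


# Usage with made-up establishments.
sample = [
    {"name": "Clinic X", "address": "1 rue A", "siret": "00000000000000",
     "staff": 250, "activity": "851A"},
    {"name": "Shop Y", "address": "2 rue B", "siret": "11111111111111",
     "staff": 6, "activity": "524C"},
]
contacted = select_for_employer_survey(sample)
release = build_release_file(contacted)
```

Separating the two functions mirrors the text: the size threshold applies when deciding whom to contact, while identifier removal and recoding apply to the file that is disseminated. The contact list itself is kept apart so that establishments can be contacted again.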

13 Anonymisation: for whom? For what purpose? To be continued…
Dissemination tools: a search engine and survey websites (with information about anonymity). Specificities of the dissemination setting: who are the users? What are the data destined for?
The Survey Department takes part in distributing data and in ensuring that confidentiality is respected. Websites have been developed for many surveys. They can provide information about the anonymity of the survey: what the data are used for, the relevant legislation, and an explanation of the oversight exercised by the CNIL (the French National Commission for Informatics and Liberties) as a guarantee. In addition, for the distribution of archived surveys, the department has developed a search engine for finding a survey’s documentation. The questionnaires and the presentations of the surveys are in open access. To obtain a dataset, potential users (usually university members: students, researchers, etc.) have to complete a form describing their project. INED examines the request and grants or refuses approval. Recently, the department was informed that the questionnaire of the “Sexual context in France (CSF)” survey had been used by someone. Work is also in progress to give access to the data in English as well as in French, which raises questions about how far the translation work should go.
To conclude, questions of anonymity and data protection depend on the characteristics of the distribution circle. Who are the users? Are they researchers from our own institute, or are the data distributed to anyone who wants them? Where are the data going? What are they destined for? Should these characteristics determine a security level? Which resources can be in open access?

14 Thank you for your attention.
Contacts: celine.dauplait@ined.fr
