Protecting Confidentiality in a Virtual Data Centre
COMPUTATIONAL INFORMATICS
Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO


1 Protecting Confidentiality in a Virtual Data Centre
COMPUTATIONAL INFORMATICS
Christine O’Keefe, Mark Westcott, Adrien Ickowicz, Maree O’Sullivan, CSIRO
Tim Churches, Sax Institute
28 October 2012

2 Overview
• Introduction to the problem
• Virtual Data Centres
• Proposed solution
Confidentiality in Virtual Data Centres | Christine O’Keefe

3 Population Health Research Network*
Provides access to linkable de-identified health data for research
• Improving outcomes
• Improving policy
Traditionally
• Supplies linkable de-identified health data directly to researchers
Loss of control over data heightens risk of:
• External attack on datasets
• Accidental or inadvertent actions by researcher
• Deliberate attack by trusted researcher
*www.phrn.org.au

4 Secure Unified Research Environment*
Secure remote access to virtual workstations and network in a data centre
*Sax Institute SURE User Guide v1.2

5 Confidentiality Protection for Health Data
Governance
• Comply with privacy legislation and regulation
• Honour assurances to data providers
Restrict access to approved researchers
• Information security measures
Restrict amount and detail of data available
• Apply statistical disclosure control methods before releasing data to researcher
  – No further confidentiality measures
• Enable access via secure on-line system
  – Manual checking for confidentiality issues in statistical analysis outputs
  – “…developing valid output checking processes that are automated is an open research question” (Duncan, Elliot, Salazar-González 2012)

6 Conceptual Model for online access
Remote Analysis (RA)
• Researcher cannot see data itself, only “Output for publication”
Virtual Data Centre (VDC)
• Researcher authorised to see data and “Output” as well as “Output for publication”

7 Virtual Data Centre
Assumptions
• Custodian prepares data to comply with legislation, regulation and assurances
• Researcher complies with applicable researcher agreements
• Researcher authorised to see data itself
  – Do not need to protect dataset records from researcher
  – Do not need to protect against malicious attacks by researcher
  – Data transformations and analyses are unrestricted
  – Confidentiality issues with respect to readers of academic literature
  – Confidentiality issues with respect to outputs of genuine queries

8 Main Disclosure Risks in Statistical Output
• Individual values
• Small cells/samples … threshold
• Dominance
• Differencing
• Linear or other algebraic relationships in data
• Precision

9 Confidentiality Protection in a Virtual Data Centre – two stage process
1. Dataset preparation – by Custodian
2. Confidentialisation of statistical analysis output for publication – by Researcher
Similarities to:
• ESSNet SDC Guidelines for checking output based on microdata research … Hundepool, Domingo-Ferrer, Franconi, Giessing, Nordholt, Spicer, de Wolf 2012
• Statistics New Zealand Data Lab Output Guide

10 Dataset preparation – by Custodian
Custodian
1. Removes obvious identifiers
2. Ensures dataset has sufficient records
3. Ensures published datasets differ by sufficiently many records
4. Ensures variables and combinations of variables have sufficiently many records
5. Reduces detail in data using aggregation (especially dates, locations)
6. Other measures as needed – statistical disclosure control
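Steps 4 and 5 of the custodian's list can be illustrated with a small sketch. It is not the procedure of any particular custodian: the record layout, the field names (`birth_date`, `postcode`, `sex`), the choice of year-level and three-digit coarsening, and the minimum count of 2 are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date

def coarsen_record(record):
    """Reduce detail (step 5): keep year of birth and a 3-digit postcode area only.
    The field names are hypothetical."""
    return {
        "birth_year": record["birth_date"].year,
        "postcode_area": record["postcode"][:3],
        "sex": record["sex"],
    }

def rare_combinations(records, keys, min_count):
    """Step 4: return each combination of `keys` that occurs in fewer than
    `min_count` records, together with its count."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {combo: c for combo, c in counts.items() if c < min_count}

# Made-up records, for illustration only.
records = [
    {"birth_date": date(1950, 3, 1), "postcode": "2601", "sex": "F"},
    {"birth_date": date(1950, 11, 20), "postcode": "2602", "sex": "F"},
    {"birth_date": date(1950, 7, 4), "postcode": "2600", "sex": "F"},
    {"birth_date": date(1972, 1, 15), "postcode": "2000", "sex": "M"},
]

coarse = [coarsen_record(r) for r in records]
rare = rare_combinations(coarse, ["birth_year", "postcode_area", "sex"], min_count=2)
print(rare)  # combinations that would need further aggregation or other treatment
```

Any combination the check reports is a candidate for further aggregation or one of the other statistical disclosure control measures in step 6.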

11 Confidentialisation of statistical analysis output for publication – by Researcher
Researcher
1. Uses Checklist of tests to identify outputs that fail one or more tests
2. Considers context and interactions of outputs to identify potential disclosure risks
3. Applies treatments from Checklist to reduce potential disclosure risk

12 Checklist of Tests
• Individual value: An individual data value is directly revealed
• Threshold n: A cell or statistic is calculated on fewer than n data values
• Threshold p%: A cell contains more than p% of the values in a table margin
• Dominance (n,k): Amongst the records used to calculate a cell value or statistic, the n largest account for at least k% of the value
• Dominance p%: Amongst the records used to calculate a cell value or statistic, the total minus the two largest values is less than p% of the largest value
• Differencing: A statistic is calculated on populations that differ in fewer than n records
• Relationships: The statistic involves linear or other algebraic relationships
• Precision: The output involves a high level of precision in terms of significant figures and/or decimal places
• Degrees of Freedom: The model output has fewer than n degrees of freedom
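Several of the tests above are simple enough to express directly. The following is a minimal sketch of four of them, written from the definitions on this slide rather than taken from any tool the presenters used. The function names, the handling of empty or non-positive totals, and the decision to ignore identical populations in the differencing test are assumptions.

```python
def fails_threshold_n(values, n):
    """Threshold n: the statistic is calculated on fewer than n data values."""
    return len(values) < n

def fails_dominance_nk(values, n, k):
    """Dominance (n,k): the n largest values account for at least k% of the total."""
    total = sum(values)
    if total <= 0:
        return False  # assumption: only applied to positive totals
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100.0 * top_n / total >= k

def fails_dominance_p(values, p):
    """Dominance p%: total minus the two largest values is less than p% of the largest."""
    if not values:
        return False  # assumption: nothing to test on an empty cell
    ordered = sorted(values, reverse=True)
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * ordered[0]

def fails_differencing(ids_a, ids_b, n):
    """Differencing: the two populations differ in fewer than n records.
    Assumption: identical populations (zero differing records) are not flagged."""
    differing = len(set(ids_a) ^ set(ids_b))
    return 0 < differing < n

contributions = [50, 30, 10, 5, 5]  # made-up contributions to one cell
print(fails_threshold_n(contributions, n=10))
print(fails_dominance_nk(contributions, n=2, k=75))
print(fails_dominance_p(contributions, p=10))
print(fails_differencing(range(100), range(98), n=5))
```

The parameters n, k and p are not specified in the presentation; the values used here are placeholders.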

13 Checklist - examples
Columns: Statistic; Confidentiality Test; Treatment; Notes

Number (e.g. sample size)
• Threshold n: try to get more data; suppress value
Notes:
• If this test is failed, the study is probably unreliable due to the small sample size.

Mean
• Threshold n: recode variable; round reported value; suppress denominator; suppress value
• Dominance (n,k)
• Dominance p%
• Differencing: redefine one or both populations
Notes:
• The tests and treatments are only necessary if the denominator is known so the sum can be inferred
• The mean has a strong algebraic relationship with the sum so is potentially disclosive

Ratios and percentages
• Individual values: suppress individual values
• Threshold n: recode variables; round reported values; suppress values
• Threshold p%
• Dominance (n,k)
• Dominance p%
• Differencing: redefine one or both populations
• Relationships: round reported values
• Precision: round reported values
Notes:
• For a ratio, the tests and treatments are only necessary if one of the terms is known so the other can be inferred (this is an example of the relationship test)
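Two of the treatments named in this table, rounding and suppression, can be sketched as follows. This is an illustration under stated assumptions, not the presenters' implementation: the rounding base of 5, the number of significant figures, the threshold of 5 and the `SUPPRESSED` marker are all hypothetical choices.

```python
SUPPRESSED = "suppressed"  # hypothetical marker for a withheld value

def round_to_base(value, base=5):
    """Round a reported value to the nearest multiple of `base`."""
    return base * round(value / base)

def round_sig(value, sig=2):
    """Reduce precision by keeping only `sig` significant figures."""
    return float(f"{value:.{sig}g}")

def suppress_small_cells(cells, n):
    """Replace every count below the threshold n with the suppression marker."""
    return {label: (count if count >= n else SUPPRESSED)
            for label, count in cells.items()}

cells = {"A": 12, "B": 3, "C": 48}  # made-up cell counts
treated = suppress_small_cells(cells, n=5)
print(treated)
print(round_to_base(13), round_sig(0.012345))
```

Suppressing a single cell is not sufficient if a published total allows it to be recovered by subtraction, which is the kind of algebraic relationship the Relationships test is meant to catch.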

14 Checklist - examples
Columns: Statistic; Confidentiality Test; Treatment; Notes

• Precision: round reported values

Relative risk
• Threshold n: recode variables
• Precision: round reported value
Notes:
• In some cases data might be reconstructed from sample size and relative risk value alone. If so, the data would need to be checked for disclosure risk, and treatments applied if necessary.

Confidence interval
• Degrees of freedom: change model or data groups to increase degrees of freedom
• Threshold n: recode variables
• Precision: round reported values
Notes:
• A confidence interval based on a normal distribution reveals a mean and standard error. These might be disclosive; see the tests and treatments under Summary Statistics.
• Note that in a regression context it is claimed they can be used to reconstruct the fitted values

p value of a test
• Precision: round reported value
Notes:
• A p value can reveal the value of a test statistic which might be disclosive in combination with other reported information; see the first note on Confidence Intervals

Kaplan-Meier plot; other cumulative distribution plots
• Individual value: do not show individual values
• Threshold n (only relevant if data already grouped in plot): recode variables
• Threshold p%
• Dominance (n,k)
• Dominance p%
Notes:
• This can be done by either smoothing the plot or recoding variables
• There exists software that can read data values from a pdf version of a plot
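The notes on confidence intervals and p values can be made concrete with a short sketch. It assumes a symmetric confidence interval of the form mean ± z × standard error based on the normal distribution, and a two-sided z test; the function names are hypothetical, and other interval or test types would need different formulas.

```python
from statistics import NormalDist

def mean_and_se_from_ci(lower, upper, level=0.95):
    """Recover the mean and standard error from a normal-based interval
    [mean - z*se, mean + z*se] at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    mean = (lower + upper) / 2
    se = (upper - lower) / (2 * z)
    return mean, se

def abs_z_from_two_sided_p(p):
    """Recover |z| from the p value of a two-sided z test."""
    return NormalDist().inv_cdf(1 - p / 2)

# Made-up published interval, for illustration only.
mean, se = mean_and_se_from_ci(6.08, 13.92)
print(round(mean, 2), round(se, 2))
print(round(abs_z_from_two_sided_p(0.05), 2))
```

Rounding the reported interval or p value, as the Precision treatment suggests, limits how accurately these quantities can be recovered.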

15 Summary
Virtual Data Centres
• Becoming more popular
• Manual checking of outputs for confidentiality risk not sustainable
• Automated methods for confidentiality protection in statistical analysis outputs still under development
Interim Solution
1. Dataset preparation by Custodian
2. Researchers confidentialise their own outputs for publication
  – Training
  – Checklist of tests and confidentiality treatments

16 Thank you
Computational Informatics
Dr Christine O’Keefe
Research Program Leader, Decision and User Science
t +61 2 6216 7021
e Christine.OKeefe@csiro.au
w www.csiro.au

