JSM, Boston, August 8, 2014 Privacy, Big Data and The Public Good: Statistical Framework Stefan Bender (IAB)


Water consumption in Berlin during the Final

Content

Key themes
• Importance of valid inference, and the role of statisticians
• New analytical framework: differential privacy
• Inadequacy of current statistical disclosure limitation approaches
• Possibilities for accessing big data (without harming privacy)

Extracting Information from Big Data (Kreuter/Peng)
The challenges of extracting (meaningful) information from big data are similar to those of surveys.
Two main concerns when it comes to extracting information from data:
• Measurement
• Inference

Extracting Information from Big Data (Kreuter/Peng)
• Knowledge of the data-generating process is needed (Total Survey Error framework).
  • Good starting point
  • Need for development
• It is the difference between designed and organic data (Bob Groves) that poses challenges to the extraction of information.
• Solutions and new challenges: data linkage and information integration.

Access and Linkage (Kreuter/Peng)
Essential to understand data quality and breakdowns
Challenged by ...
• Different privacy requirements
• Open issues of ownership
• Lack of trusted third parties
However ... likely leads to
• Good data documentation
• Reproducible research
• Transparency

The Need for a Measure for Privacy (Dwork)
• Big data mandates a mathematically rigorous theory of privacy, a theory amenable to measuring, and minimizing, cumulative privacy loss as data are analyzed, re-analyzed, shared, and linked.
• Nothing is absolutely safe/secure.

Differential Privacy (Dwork)
• A definition of privacy has to take into account that we want to learn useful facts from the data. It does not matter whether you are in the database, because the generalized result affects you either way: differential privacy.
• Data usage should be accompanied by publication of the amount of privacy loss, that is, its privacy 'price'.
• The chosen statistics should be published using differential privacy, together with the privacy losses.
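
A minimal sketch of what publishing a statistic together with its privacy 'price' could look like. It assumes the Laplace mechanism for counting queries and simple additive composition of the privacy parameter epsilon; the slides name neither technique, and the names PrivateCounter, noisy_count and spent_epsilon are hypothetical.

```python
import random


class PrivateCounter:
    """Answers counting queries with Laplace noise and tracks cumulative privacy loss.

    Assumption: each query is a count (sensitivity 1), and the total privacy
    loss is the sum of the epsilons spent (basic sequential composition).
    """

    def __init__(self, records, seed=None):
        self._records = list(records)
        self._rng = random.Random(seed)
        self.spent_epsilon = 0.0  # cumulative privacy loss published so far

    def _laplace(self, scale):
        # The difference of two i.i.d. exponentials with mean `scale`
        # follows a Laplace(0, scale) distribution.
        return self._rng.expovariate(1.0 / scale) - self._rng.expovariate(1.0 / scale)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        true_count = sum(1 for r in self._records if predicate(r))
        self.spent_epsilon += epsilon
        # A count changes by at most 1 when one record is added or removed,
        # so the noise scale is 1 / epsilon.
        return true_count + self._laplace(1.0 / epsilon)


# Usage: two releases, each accompanied by the privacy loss it caused.
counter = PrivateCounter(range(100), seed=1)
first = counter.noisy_count(lambda r: r < 30, epsilon=0.5)
second = counter.noisy_count(lambda r: r % 2 == 0, epsilon=0.25)
print(first, second, "total epsilon spent:", counter.spent_epsilon)
```

A smaller epsilon means more noise and less privacy loss per release; the running total is the figure a data steward would publish alongside the statistics.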

Releasing Record-level Data (Karr/Reiter)
• Risky for data subjects and stewards
• Data often from administrative sources, hence available to others.
• Large number of variables means everyone is a population unique.
Facing the Future
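
The point about population uniques can be illustrated with a small simulation. This is a sketch under invented assumptions: a made-up population of 10,000 records with ten independent categorical variables of five levels each; share_unique is a hypothetical helper, not something from the slides.

```python
import random
from collections import Counter


def share_unique(records, n_vars):
    """Share of records that are unique on their first n_vars variables."""
    keys = [tuple(r[:n_vars]) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical population: 10,000 records, 10 variables with 5 levels each.
rng = random.Random(0)
population = [tuple(rng.randrange(5) for _ in range(10)) for _ in range(10_000)]

# Uniqueness can only grow as more variables are released, because every
# additional variable splits the existing groups further.
shares = [share_unique(population, k) for k in range(1, 11)]
for k, share in enumerate(shares, start=1):
    print(f"{k:2d} variables: {share:.3f} of records are unique")
```

With only a few variables almost nobody is unique; once enough variables are combined, nearly every record is.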

Might typical disclosure control methods provide an answer? (Karr/Reiter)
Many data stewards alter data before releasing them
• Aggregate data, swap records, add noise ...
• Usually minor perturbations for quality reasons
Typical methods not likely to be effective
• Low-intensity perturbations not protective
• High-intensity perturbations destroy quality
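
The trade-off between low- and high-intensity perturbation can be sketched with additive noise on two simulated variables. Everything here is an assumption made for illustration: the Gaussian data, the nearest-record linkage used as a stand-in for disclosure risk, and the change in correlation used as a stand-in for quality loss. The function names pearson and evaluate are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5


def evaluate(noise_sd, n=300, seed=0):
    """Return (linkage risk, quality loss) for additive Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    x_rel = [xi + rng.gauss(0, noise_sd) for xi in x]
    y_rel = [yi + rng.gauss(0, noise_sd) for yi in y]

    # Risk proxy: share of released records whose nearest original record
    # is the record they were derived from.
    hits = 0
    for i in range(n):
        nearest = min(
            range(n),
            key=lambda j: (x_rel[i] - x[j]) ** 2 + (y_rel[i] - y[j]) ** 2,
        )
        hits += nearest == i
    risk = hits / n

    # Quality proxy: how far the released correlation drifts from the original.
    loss = abs(pearson(x_rel, y_rel) - pearson(x, y))
    return risk, loss


for sd in (0.01, 3.0):
    risk, loss = evaluate(sd)
    print(f"noise sd {sd}: linkage risk {risk:.2f}, correlation error {loss:.2f}")
```

Small noise leaves most records linkable to their source, while large noise protects them only by distorting the relationship an analyst would want to estimate.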

A Potential Path Forward (Karr/Reiter)
An integrated system including
• unrestricted access to highly redacted data (synthetic data), followed by
• means for approved researchers to access the confidential data via remote access solutions, glued together by
• verification servers that allow users to assess the quality of their inferences with the redacted data.

We Have the Building Blocks (Karr/Reiter)
Synthetic data
• Synthetic Longitudinal Business Database.
• Automated methods based on machine learning.
Remote access solutions
• NORC virtual data enclave.
• Virtual machines and protected data networks.
Verification servers
• Not built yet, but we have ideas for quality measures.

Data Access for Research to Big Data
• Data access and combination of data sources are needed (Kreuter/Peng).
• Ideal scenario: data are held by a trusted or trustworthy curator; the data remain secret, the responses are published. Cryptography helps to get close to the ideal scenario (Dwork).
• Walled gardens (Stodden).
• "The New Deal on Data" (Greenwood et al.).
Facing the Future 2013

My Conclusion
• Blend big data and survey-based/official data.
• Use RDC structure for access to big data or combined data.
• No longer hands-on work with data.
• Discussion of many topics needed: informed consent, non-participation, inference, privacy ...
• Main issues: data protection, access and trust.
• We have to be more active in the public discussion, because big data is affecting our daily work!

Stefan Bender