Published by Bryson Hinchcliffe. Modified over 2 years ago.

1
Privacy-preserving Data Mining for the Internet of Things: State of the Art
Yee Wei Law
wsnlabs.com

2
Speaker's brief bio
Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES
Research Fellowship on WSNs from The University of Melbourne
– ARC projects Trustworthy sensor networks: theory and implementation, BigNet
– EU FP7 projects SENSEI, SmartSantander, IoT-i, SocIoTal
– IBES seed projects on participatory sensing, smart grids
– Taught Masters course Sensor Systems
Professional membership:
– Associate of (ISC)² (junior CISSP)
– Smart Grid Australia Research Working Group
Current research interests:
– Privacy-preserving data mining
– Secure/resilient control
– Applications of the above to the IoT and smart grid
Current research orientation:
– Mixed basic/applied research in data science or network science
– Research involving probabilistic/statistical, combinatorial, and matrix analysis

3
Agenda
The IoT and its research priorities
– Participatory sensing (PS)
– Collaborative learning (CL)
Introduction to privacy-preserving data mining
Schemes suitable for PS and CL
Research opportunities and challenges
If time permits, SOCIOTAL

4
A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual things have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network.
Source: H. Sundmaeker et al., Vision and Challenges for Realising the Internet of Things, Cluster of European Research Projects on the Internet of Things, Mar

5
Evidence of the Internet of Things
[Figures: Nissan EPORO robot cars; Smart grid]

6
Research priorities
ITU-T: Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.
Among research priorities:
– Mathematical models and algorithms for inventory management, production scheduling, and data mining
– Privacy-aware data processing
Smart transport, smart grid, smart water, smart whatever

7
Shifting priorities
[Graphic labels: ARPAnet; Machine-to-machine communications. Some graphics from Sabina Jeschke]
We have enough tech to hook things up; now we should find better ways of capturing and analyzing data.
Introducing participatory sensing and collaborative learning...

8
Participatory sensing
A process whereby individuals and communities use evermore-capable mobile phones and cloud services to collect and analyze systematic data for use in discovery. Source: Estrin et al.
Citizen-provided data can improve governance, with benefits including:
– Increased public safety
– Increased social inclusion and awareness
– Increased resource efficiency for sustainable communities
– Increased public accountability

9
Data sharing scenarios
Lindell and Pinkas [2000]: privacy-preserving data mining refers to privacy-preserving distributed data mining

10
Data sharing scenarios (cont'd)
Collaborative learning: multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner.
Agrawal and Srikant [2000] coined the term privacy-preserving data mining to refer to privacy-preserving collaborative learning.
Encrypting the data sent to the data miner is inadequate; the data should be masked at a point that balances accuracy and privacy.

11
Privacy-preserving collaborative learning
Requirement imposed by participatory sensing:
– online data submission, offline data processing
Design space:
– Data type: continuous or categorical; voice, images, videos, etc.
– Data structure: relational or time series; for relational data: horizontally or vertically partitioned
– Data mining operation
[Diagram labels: Adversarial models; Privacy criterion (Semantic, Syntactic, Proposed criterion); SMC; Differential privacy; Randomization (Linear: Additive, Multiplicative; Nonlinear)]

12
Adversarial models
Semi-honest (honest but curious)
– Passive attacker tries to learn the private states of other parties, without deviating from protocol
– By definition, semi-honest parties do not collude
Malicious
– Active attacker tries to learn the private states of other parties, and deviates arbitrarily from protocol
Common approach: design in the semi-honest model, enhance it for the malicious model
– General method: zero-knowledge proofs, often not practical
– Semi-honest model often realistic enough

13
Syntactic privacy criteria
To prevent syntactic attacks, e.g., table linkage:
– Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
– Attacker can infer the presence of its target's record in the anonymous table from the target's record in the nonanonymous table
Relevant for relational data, not time series data
Example:
– k-anonymity
Semantic privacy criteria
To minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database
General enough for most data types, relational or time series
Example:
– Cryptographic privacy
– Differential privacy
[Diagram labels: Cryptographic privacy; Differential privacy; Secure Multiparty Computation; Randomization]

14
Secure multiparty computation
[Diagram: inputs x1 and x2; output f(x1, x2)]
Metaphor: Yao's millionaires' problem [1982]
Garbled circuits for arbitrary functions [Beaver et al. 1990]
Homomorphic encryption can be used in the semi-honest model
Building block: oblivious transfer
– Introduced by Rabin [1981]
– Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
– Naor et al. [2001] reduce the amortized overhead of oblivious transfer to one exponentiation per logarithmic number of oblivious transfers
– 1-out-of-n oblivious transfer: the sender holds n values, the receiver chooses one value, and the sender doesn't know which
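The message flow of 1-out-of-n oblivious transfer can be sketched with a toy RSA-based protocol in the style of Even, Goldreich and Lempel. This is an illustration only: the tiny hard-coded key, the function name `oblivious_transfer`, and running both parties inside one function are assumptions made here, and this is not the Naor et al. scheme cited on the slide, nor is it secure at this key size.

```python
import random

# Toy RSA key for the sender (textbook values, far too small for real use).
P, Q = 61, 53
N = P * Q          # modulus 3233
E = 17             # public exponent
D = 2753           # private exponent: (E * D) % ((P - 1) * (Q - 1)) == 1


def oblivious_transfer(messages, choice, rng):
    """Simulate a toy 1-out-of-n oblivious transfer; return the receiver's output.

    messages: the sender's integers, each in [0, N)
    choice:   the receiver's index into messages
    """
    # Sender: publish one random value per message.
    xs = [rng.randrange(N) for _ in messages]

    # Receiver: blind the chosen random value with an RSA-encrypted secret k.
    k = rng.randrange(N)
    v = (xs[choice] + pow(k, E, N)) % N

    # Sender: derive one candidate key per message. The key at the receiver's
    # index equals k, but v alone does not tell the sender which index that is.
    keys = [pow((v - x) % N, D, N) for x in xs]
    masked = [(m + key) % N for m, key in zip(messages, keys)]

    # Receiver: unmask the chosen message; the other entries stay masked.
    return (masked[choice] - k) % N


rng = random.Random(1)
secrets = [42, 99, 7, 1234]
received = [oblivious_transfer(secrets, i, rng) for i in range(len(secrets))]
print(received)
```

In a real deployment the two roles run on separate machines and only `xs`, `v`, and `masked` cross the network.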

15
Differential privacy
In cryptography, semantic security: whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext
– Useless for PPDM: a database satisfying the above has no utility
Dwork [2006] proposed differential privacy for statistical disclosure control: add noise to query results

16
Differential privacy (cont'd)
Theoretical basis for answering sum queries
Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc.
[Equations lost in extraction; labels: row index, row, differential privacy, sensitivity, Laplace noise, noisy sum queries]
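A minimal sketch of a noisy sum query using the Laplace mechanism, which adds Laplace noise with scale sensitivity/ε to the true answer. The function names, the 0/1 row values (giving a sensitivity of 1), and ε = 0.5 are assumptions chosen for illustration; the slide's own equations were not recoverable.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_sum(values, sensitivity, epsilon, rng):
    """Answer a sum query with Laplace noise of scale sensitivity / epsilon."""
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)

# Hypothetical database: each row contributes 0 or 1, so changing one row
# changes the sum by at most 1 (sensitivity = 1).
rows = [rng.randint(0, 1) for _ in range(200)]
true_sum = sum(rows)

# A single release is noisy; averaging many independent releases shows
# that the noise is centered on the true answer.
answers = [noisy_sum(rows, sensitivity=1.0, epsilon=0.5, rng=rng)
           for _ in range(5000)]
mean_answer = sum(answers) / len(answers)
print(true_sum, round(answers[0], 2), round(mean_answer, 2))
```

A smaller ε gives stronger privacy and a larger noise scale, which is the accuracy/privacy trade-off in concrete form.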

17
Taxonomy of attacks against randomization-based approaches
Known input/sample attack:
– The attacker has some input samples and all output samples but does not know which input sample corresponds to which output sample
– Typically begins with establishing correspondences between the input samples and the output samples
Known input-output attack:
– The attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample
Proposed privacy criterion:
– The distance between f(X) and estimated f(X) is kept above a specified threshold under known attacks

18
Randomization
Randomized distortion or perturbation of data (time series data, relational data)
[Diagram labels: Randomization (Linear: Additive perturbation, Multiplicative perturbation; Nonlinear)]
Additive perturbation: adds noise to data
i.i.d. noise is susceptible to:
– Spectral filtering attack by Kargupta et al. [2003]
– PCA attack by Huang et al. [2005]:
  – Estimate covariance matrix of original data
  – Find eigenvalues and eigenvectors of covariance matrix through PCA
  – Bayesian estimation may not have analytic form
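Why i.i.d. additive noise is fragile on correlated data can be sketched with a simplified PCA-style filter: project the perturbed rows onto the dominant eigenvector of their covariance matrix, which keeps most of the signal and discards most of the noise. This is a simplified illustration under stated assumptions (four strongly correlated attributes, one principal component, power iteration instead of a full eigendecomposition), not a reproduction of the attack by Huang et al.

```python
import random


def mse(a, b):
    """Mean squared error between two equally shaped lists of rows."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def pca_filter(perturbed, iters=50):
    """Estimate the original rows by keeping only the top principal component."""
    n, d = len(perturbed), len(perturbed[0])
    means = [sum(row[j] for row in perturbed) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in perturbed]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration for the dominant eigenvector of the covariance matrix.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project each centered row onto that eigenvector and map it back.
    estimate = []
    for r in centered:
        score = sum(r[j] * v[j] for j in range(d))
        estimate.append([means[j] + score * v[j] for j in range(d)])
    return estimate


rng = random.Random(0)
n, d = 500, 4
original = []
for _ in range(n):
    latent = rng.gauss(0, 5)  # hypothetical shared factor behind all attributes
    original.append([latent + rng.gauss(0, 0.1) for _ in range(d)])
perturbed = [[x + rng.gauss(0, 1) for x in row] for row in original]

estimate = pca_filter(perturbed)
error_perturbed = mse(perturbed, original)
error_filtered = mse(estimate, original)
print(round(error_perturbed, 3), round(error_filtered, 3))
```

The filtered error comes out well below the error of the perturbed data, so the attacker ends up closer to the original values than the noise level was meant to allow.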

19
Collaborative learning using additive perturbation
Compared to multiplicative perturbation, it is easier to recover the source data distribution f_X(x) from the perturbed data distribution and the noise distribution
– Estimated with kernel density estimation
– Solved through deconvolution
Against attacks: noise should be correlated with the data and participant-specific
PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model
Attack: with a common noise model, a participant (i) can reconstruct another participant's (j) data from the perturbed data [equation lost in extraction]

20
Collaborative learning using additive perturbation
Zhang et al. [2012]
– Data-dependence
– Participant-dependence
– PDF reconstructed by data miner based on PDF of y and noise
Catches:
– The data miner has to know the participants' parameters, so the system is not resilient to collusion
– Data correlation between participants exposes them to attacks (recall the PCA-based attack?)

21
Multiplicative perturbation
Multiplies data with noise
[Diagram: input x, perturbation, output y]
Rotation perturbation [Chen et al. 2005]
– Noise matrix is an orthogonal matrix with orthonormal rows and columns
– Enhanced version: geometric perturbation
Giannella et al.'s [2013] attack can estimate the original data using all perturbed data and a small amount of original data
– Attack stage 1: find the maximally unique map β that satisfies [condition lost in extraction]; then we know which x_i is mapped to which y_i
– Attack stage 2: find [quantity lost in extraction] that maximizes [objective lost in extraction]
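A small two-dimensional sketch of rotation perturbation: the orthogonal matrix preserves inter-point distances, which is why distance-based mining still works, but a known input-output attack with two linearly independent pairs recovers the matrix exactly. This is a deliberately simpler attack than the one by Giannella et al., which does not assume known correspondences; the data points and the angle are made up for illustration.

```python
import math


def rotate(points, theta):
    """Perturb 2-D points with the rotation matrix R(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def recover_rotation(known_inputs, known_outputs):
    """Solve Y = R X for the 2x2 matrix R from two known input-output pairs."""
    (x1, y1), (x2, y2) = known_inputs    # columns of X
    (u1, v1), (u2, v2) = known_outputs   # columns of Y
    det = x1 * y2 - x2 * y1
    inv = [[y2 / det, -x2 / det], [-y1 / det, x1 / det]]  # X^-1
    return [
        [u1 * inv[0][0] + u2 * inv[1][0], u1 * inv[0][1] + u2 * inv[1][1]],
        [v1 * inv[0][0] + v2 * inv[1][0], v1 * inv[0][1] + v2 * inv[1][1]],
    ]


def apply_transpose(matrix, points):
    """Undo an orthogonal perturbation: x = R^T y."""
    (a, b), (c, d) = matrix
    return [(a * x + c * y, b * x + d * y) for x, y in points]


data = [(1.0, 2.0), (3.0, -1.0), (-2.0, 0.5), (4.0, 4.0)]
secret_theta = 0.7  # hypothetical secret rotation angle
perturbed = rotate(data, secret_theta)

# Distances between points are unchanged by the perturbation.
same_distance = math.isclose(dist(data[0], data[3]), dist(perturbed[0], perturbed[3]))

# Attacker knows the first two original points and their perturbed versions.
r_est = recover_rotation(data[:2], perturbed[:2])
recovered = apply_transpose(r_est, perturbed)
print(same_distance, [(round(x, 6), round(y, 6)) for x, y in recovered])
```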

22
Multiplicative projection: random projection
Projection by Gaussian random matrix
– Statistically orthogonal
– Essentially a Johnson-Lindenstrauss transform
Other Johnson-Lindenstrauss transforms: [list lost in extraction]
Attack against orthogonal transform adaptable for this?
[Diagram: vectors of d dimensions are projected to perturbed vectors of k dimensions; inter-point distances change by a factor of (1±ε) as long as k ≥ O(ε^-2 log n)]
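The distance-preservation property can be checked numerically with a small sketch. The dimensions (d = 500, k = 200), the five random points, and the 1/sqrt(k) scaling of the Gaussian entries are assumptions chosen here so that the example runs quickly with the standard library alone.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vector):
    return [sum(r * x for r, x in zip(row, vector)) for row in matrix]


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
d, k = 500, 200
points = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
R = random_projection_matrix(k, d, rng)
projected = [project(R, p) for p in points]

# Ratio of projected distance to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(len(points)) for j in range(i + 1, len(points))]
print([round(r, 3) for r in ratios])
```

The ratios cluster around 1, and the spread shrinks as k grows, matching the (1±ε) statement on the slide.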

23
Collaborative learning using multiplicative perturbation
Goal is to use a different perturbation matrix for each participant
Liu et al. [2012]:
– mean, covariance; synthesized data matrix Z
– Learn, in approximation, an inverse of R_u and R_v
– Data miner then gets an estimation of X_u and X_v!
What about the privacy criterion?

24
Nonlinear perturbation
Nonlinear + linear perturbation: [equation lost in extraction; labels: random matrices, nonlinear function]
– Relies on linear perturbation to achieve projection
– Near-many-to-one mapping provides privacy property
– Many-to-one mapping extended to the normal part of the curve?
[Plot of tanh(x) over normalized values: extreme values (potential outliers) are squashed]
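The squashing behaviour of tanh can be seen with a few sample values. This sketch covers only the nonlinear step, not the combined nonlinear + linear scheme, whose equation did not survive extraction; the sample values are made up for illustration.

```python
import math


def squash(values):
    """Apply the tanh nonlinearity element-wise."""
    return [math.tanh(v) for v in values]


typical = [-0.2, 0.0, 0.1]   # normalized values near the center of the curve
extreme = [5.0, 8.0]         # potential outliers

squashed_typical = squash(typical)
squashed_extreme = squash(extreme)

# Near zero the curve is almost the identity, so typical values stay distinguishable.
print([round(v, 4) for v in squashed_typical])
# Far from zero the curve is flat, so distinct outliers map to nearly the same output.
print([round(v, 6) for v in squashed_extreme])
```

The flat tails are what the slide calls a near-many-to-one mapping: an attacker who sees the output cannot tell the outliers apart.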

25
Bayesian estimation attacks against multiplicative perturbation
Solve the underdetermined system Y = RX for X
Maximum a posteriori estimation (why?)
If R is known:
– Gaussian original data obviously simplifies the attacker's problem
If R is not known:
– Difficult optimization problem, although Gaussian data simplifies the problem
– Choice of p(R) matters
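One way to see why Gaussian data simplifies the known-R case is that the MAP estimate then has a closed form. The derivation below is a sketch under assumptions not stated on the slide: each column x of X is drawn from N(μ, Σ) with Σ positive definite, the perturbation is noiseless (y = Rx), and R has full row rank.

```latex
\begin{aligned}
\hat{x}_{\mathrm{MAP}}
  &= \arg\max_{x \,:\, Rx = y} \; p(x) \\
  &= \arg\min_{x \,:\, Rx = y} \; (x-\mu)^{\top}\Sigma^{-1}(x-\mu) \\
  &= \mu + \Sigma R^{\top}\left(R\,\Sigma\,R^{\top}\right)^{-1}\left(y - R\mu\right)
\end{aligned}
```

The last line follows from the Lagrangian condition Σ⁻¹(x − μ) = Rᵀλ combined with Rx = y. For μ = 0 and Σ = σ²I it reduces to the minimum-norm solution Rᵀ(RRᵀ)⁻¹y, so the attacker only needs linear algebra rather than a general optimization.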

26
Independent component analysis against multiplicative perturbation
Blind source separation: perturbation matrix treated as mixing matrix
Prerequisites for attacker:
– independence
– at most one Gaussian component
– sparseness (Laplace)
– m [relation lost in extraction] (n+1)/2
Steps:
– estimate R
– estimate X
– resolve permutation and scaling ambiguity

27
Research opportunities and challenges
Commercial interest?
Large design space: effectiveness depends as much on the nature of the data as on the data mining algorithms
Challenging multidisciplinary problems necessitate a broad range of tools:
– Scenario-dependent privacy criteria
– Defenses and attacks evolve side-by-side
– Role of dimensionality reduction?
– Steganography for traitor tracing?
– Many more from syntactic privacy, SMC, etc.
Tools: statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing
[Diagram labels: Participants' data; Multiplicative perturbation; Nonlinear perturbation; Perturbed data; Bayesian estimation attacks; ICA attacks; Data mining algorithms]

28
What is Big Data? Unsupervised learning of Big Data, e.g., Deep Learning

29
Vision: Business-centric Internet of Things; Citizen-centric Internet of Things
Main non-technical aim: create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging creation of services of high socio-economic value.
Main technical aims:
– Reliable and secure communications
– Trustworthy data collection
– Privacy-preserving data mining
Motivating use cases:
– Alice's sensor network monitoring her house
– Alice's friend Bob granted access to Alice's network while Alice's on vacation
– Sensor network monitoring community microgrid, feeding data to stakeholders

30
Duration: Sep [year lost in extraction] to Aug 2016
Funding scheme: STREP
Total cost: 3.69 m
EC contribution: 2.81 m
Contract number: CNECT-ICT

31
Conclusion
Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM
Technological development expands the design space and invites multidisciplinary input
Socio-economic development plays a critical role
[Taxonomy diagram repeated from slide 11]
Source: Cisco IBSG, April 2011

33
Syntactic privacy criteria/definitions
To prevent syntactic attacks:
Table linkage:
– Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
– Attacker can infer the presence of its target's record in the anonymous table from the target's record in the nonanonymous table
Record linkage:
– Attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables
– Attacker can uniquely identify the target's record in the anonymous table from the target's record in the nonanonymous table
Attribute linkage:
– Attacker has access to an anonymous table, and the knowledge that its target is represented in the table
– Attacker can infer the value(s) of its target's sensitive attribute(s) from the group (e.g., [age lost in extraction]-year-old females) the target belongs to
Examples: k-anonymity
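A minimal check for k-anonymity: every combination of quasi-identifier values must be shared by at least k records. The table, the attribute names (`age_range`, `sex`, `condition`), and the function name are hypothetical and chosen only to illustrate the definition.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical anonymized table: age is generalized to a range,
# and "condition" is the sensitive attribute.
table = [
    {"age_range": "30-39", "sex": "F", "condition": "flu"},
    {"age_range": "30-39", "sex": "F", "condition": "asthma"},
    {"age_range": "40-49", "sex": "M", "condition": "flu"},
    {"age_range": "40-49", "sex": "M", "condition": "flu"},
]

two_anonymous = is_k_anonymous(table, ["age_range", "sex"], k=2)
three_anonymous = is_k_anonymous(table, ["age_range", "sex"], k=3)
print(two_anonymous, three_anonymous)
```

The table passes for k = 2, yet both records in the second group share the same sensitive value, so attribute linkage as described above is still possible; k-anonymity alone guards against record linkage, not attribute linkage.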


© 2016 SlidePlayer.com Inc.

All rights reserved.
