# Yee Wei Law (罗裔纬) wsnlabs.com

## Presentation on theme: "Yee Wei Law (罗裔纬) wsnlabs.com"— Presentation transcript:

Privacy-preserving Data Mining for the Internet of Things: State of the Art Yee Wei Law (罗裔纬) wsnlabs.com

Speaker’s brief bio
Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES.

Research Fellowship on WSNs from The University of Melbourne. ARC projects “Trustworthy sensor networks: theory and implementation” and “BigNet”. EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, and “SocIoTal”. IBES seed projects on participatory sensing and smart grids. Taught Master’s course “Sensor Systems”.

Professional membership: Associate of (ISC)² (junior CISSP); Smart Grid Australia Research Working Group.

Current research interests: privacy-preserving data mining; secure/resilient control; applications of the above to the IoT and smart grid.

Current research orientation: mixed basic/applied research in data science or network science; research involving probabilistic/statistical, combinatorial, and matrix analysis.

Agenda
The IoT and its research priorities. Participatory sensing (PS). Collaborative learning (CL). Introduction to privacy-preserving data mining. Schemes suitable for PS and CL. Research opportunities and challenges. If time permits, SOCIOTAL.

A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual “things” have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network. H. Sundmaeker et al., “Vision and Challenges for Realising the Internet of Things,” Cluster of European Research Projects on the Internet of Things, Mar

Evidence of the Internet of Things
Nissan EPORO robot cars Smart grid

Research priorities
ITU-T: “Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.” Applications: smart transport, smart grid, smart water, smart whatever. Among research priorities: mathematical models and algorithms for inventory management, production scheduling, and data mining; privacy-aware data processing.

Shifting priorities
ARPAnet. Machine-to-machine communications. We have enough tech to hook things up; now we should find better ways of capturing and analyzing data. Introducing participatory sensing and collaborative learning... Some graphics from Sabina Jeschke.

Participatory sensing
A process whereby individuals and communities use ever more capable mobile phones and cloud services to collect and analyze systematic data for use in discovery. Source: Estrin et al. Citizen-provided data can improve governance, with benefits including increased public safety, increased social inclusion and awareness, increased resource efficiency for sustainable communities, and increased public accountability.

Data sharing scenarios
Data publishing: a data curator publishes data records (not aggregate statistics or any data mining results) about individuals.

Statistical disclosure: a data curator releases aggregate statistics (e.g., sample mean and count) about a group of individuals represented in the database to a third party.

Computation outsourcing: a data curator, for lack of local resources, outsources its data mining operations to a third-party data miner with more resources. If the data miner is allowed to learn neither the data nor the data mining result, and the data curator must be able to verify the result, the problem is called verifiable computing [16]. If the data miner is allowed to learn neither the data nor the data mining result, but the result need not be verified, the problem is called secure outsourced computation [17] or private single-client computing [18].

Distributed data mining: multiple data curators collaboratively analyze the union of their data without the involvement of a third party. For Lindell and Pinkas [2000], “privacy-preserving data mining” refers to privacy-preserving distributed data mining.

Data sharing scenarios (cont’d)
Collaborative learning: multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner. Equivalently, it is a process where multiple participants contribute individually collected training samples so as to collaboratively construct statistical models for tasks in pattern recognition. Agrawal and Srikant [2000] coined the term “privacy-preserving data mining” to refer to privacy-preserving collaborative learning. Encrypting the data sent to the data miner is inadequate; the data should be masked, at a balance point between accuracy and privacy.

Privacy-preserving collaborative learning
Design considerations: adversarial models; privacy criterion; requirement imposed by participatory sensing (online data submission, offline data processing).

Design space: data type (continuous or categorical; voice, images, videos, etc.); data structure (relational or time series; for relational data, horizontally or vertically partitioned); data mining operation.

Diagram labels: privacy criterion (semantic, syntactic); SMC; randomization (linear: additive, multiplicative; nonlinear); differential privacy; proposed criterion. Noise model: $n[k]=g(\theta,u)[k]$

Semi-honest (honest but curious): a passive attacker tries to learn the private states of other parties, without deviating from the protocol. By definition, semi-honest parties do not collude.

Malicious: an active attacker tries to learn the private states of other parties, and deviates arbitrarily from the protocol.

Common approach: design in the semi-honest model, then enhance the design for the malicious model. The general method, zero-knowledge proofs, is often not practical. The semi-honest model is often realistic enough.

Syntactic privacy criteria and semantic privacy criteria
Syntactic privacy criteria: to prevent syntactic attacks, e.g., table linkage. The attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table. The attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table. Relevant for relational data, not time series data. Example: k-anonymity.

Semantic privacy criteria: to minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database. General enough for most data types, relational or time series. Examples: cryptographic privacy (secure multiparty computation) and differential privacy (randomization).

Secure multiparty computation
SMC = distributed computation of a publicly known function f(x1,...,xn) by n parties with respective inputs x1,...,xn such that at the end of the computation, each party only learns its own input and the function’s output. Metaphor: Yao’s millionaire problem [1982]. Any polynomial-time function can be expressed as a combinatorial circuit of polynomial size.

Building blocks: oblivious transfer. In 1-out-of-n oblivious transfer, the sender holds n values, the receiver chooses a value, and the sender doesn’t know which. Rabin [1981] introduced oblivious transfer. Even et al. [1985] used 1-out-of-2 oblivious transfer for SMC. Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation. Each invocation of oblivious transfer typically requires a constant number of public-key operations (typically exponentiations). Naor et al. [2001] reduced the amortized overhead of oblivious transfer to one exponentiation per a log number of oblivious transfers.

Homomorphic encryption can be used in the semi-honest model. Garbled circuits handle arbitrary functions [Beaver et al. 1990].

Figure labels: inputs x1 and x2, output f(x1,x2).

Differential privacy
In cryptography, semantic security means that whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext. This notion is useless for PPDM: a database satisfying it has no utility. Dwork [2006] proposed “differential privacy” for statistical disclosure control: add noise to query results.

Differential privacy (cont’d)
Theoretical basis for answering “sum queries” of the form $\sum_ig(i,x_i)$, where $i$ is the row index and $x_i$ is the row. Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc. Sensitivity imposes a Lipschitz condition on the query function f. Key ingredients: differential privacy, sensitivity, Laplace noise, noisy sum queries.
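The noisy sum query mentioned on this slide can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes the standard Laplace mechanism (noise scale equal to sensitivity divided by the privacy parameter ε), and the function name `noisy_sum`, the sample `ages`, and the predicate `is_over_30` are hypothetical.

```python
import numpy as np

def noisy_sum(values, g, sensitivity, epsilon, rng):
    """Answer the sum query sum_i g(i, x_i), then add Laplace noise
    with scale sensitivity / epsilon (the standard Laplace mechanism)."""
    true_sum = sum(g(i, x) for i, x in enumerate(values))
    return true_sum + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
ages = [23, 35, 41, 29, 52, 38]          # hypothetical database rows

# Counting query: g returns 0 or 1, so changing one row changes
# the sum by at most 1, i.e., the sensitivity is 1.
is_over_30 = lambda i, x: 1.0 if x > 30 else 0.0

answer = noisy_sum(ages, is_over_30, sensitivity=1.0, epsilon=0.5, rng=rng)
print(answer)   # the true count (4) plus Laplace noise of scale 2
```

A smaller ε gives a larger noise scale, so privacy improves while accuracy degrades.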

Taxonomy of attacks against randomization-based approaches
Known input/sample attack: the attacker has some input samples (i.e., some samples of x) and all output samples (i.e., all samples of y), but does not know which input sample corresponds to which output sample. A known input attack typically begins with establishing correspondences between the input samples and the output samples, thereby converting itself to a known input-output attack.

Known input-output attack: the attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample.

Proposed privacy criterion: the distance between f(X) and the estimated f(X) is kept above a specified threshold under known attacks.

Randomization
Randomized distortion or perturbation of data, applicable to time series data and relational data. Randomization is either linear (additive perturbation, multiplicative perturbation) or nonlinear.

Additive perturbation: adds noise to the data. iid noise is susceptible to the spectral filtering attack by Kargupta et al. [2003] and the PCA attack by Huang et al. [2005]. PCA attack: estimate the covariance matrix of the original data; find the eigenvalues and eigenvectors of the covariance matrix through PCA; form $\hat{X}=Y\hat{Q}\hat{Q}^T$, where $\hat{Q}$ holds eigenvectors of the estimated covariance matrix. Bayesian estimation may not have analytic form.

H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, “On the privacy preserving properties of random data perturbation techniques,” in Third IEEE International Conference on Data Mining (ICDM 2003), 2003, pp. 99–106.

Z. Huang, W. Du, and B. Chen, “Deriving private information from randomized data,” in Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005, pp. 37–48.
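The PCA attack steps and the reconstruction $\hat{X}=Y\hat{Q}\hat{Q}^T$ can be sketched on synthetic data. The sketch assumes that rows are records, that the data is zero-mean and low-rank, that the noise is iid Gaussian, and that the attacker knows the noise variance; all sizes and variable names are illustrative and not taken from Huang et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 2000, 10, 2      # records, attributes, principal components kept
sigma = 1.0                # noise standard deviation, assumed known to the attacker

# Correlated original data: a rank-p signal spread over d attributes.
latent = rng.normal(scale=5.0, size=(n, p))
mixing = rng.normal(size=(p, d))
X = latent @ mixing
Y = X + rng.normal(scale=sigma, size=(n, d))   # additive iid perturbation

# Step 1: estimate the covariance of X, using cov(Y) = cov(X) + sigma^2 I.
cov_x_hat = np.cov(Y, rowvar=False) - sigma**2 * np.eye(d)

# Step 2: eigen-decompose and keep the p dominant eigenvectors as Q_hat.
eigvals, eigvecs = np.linalg.eigh(cov_x_hat)
Q_hat = eigvecs[:, np.argsort(eigvals)[::-1][:p]]

# Step 3: project the perturbed data onto the principal subspace.
X_hat = Y @ Q_hat @ Q_hat.T

err_perturbed = np.mean((Y - X) ** 2)    # error of using Y directly
err_attack = np.mean((X_hat - X) ** 2)   # error after filtering
print(err_perturbed, err_attack)
```

The projection keeps the correlated signal but discards the noise that falls outside the principal subspace, which is why iid noise protects correlated data poorly.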

Attack: compared to multiplicative perturbation, it is easier to recover the source data distribution $f_X(x)$ from the perturbed data distribution and the noise distribution.

Against attacks: noise is to be correlated with the data and participant-specific. PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model: $n[t]=g(\theta,u)[t]$, with $y_j[t]=x_j[t]+n_j[t]$. With a common noise model, a participant (i) can reconstruct another participant’s (j) data from the perturbed data: $f_{Y_j}(y_j)[t]=f_{X_j}(x_j)[t] * f_{N_i}(n_i)[t]$, where $*$ denotes convolution. The perturbed data distribution is estimated with kernel density estimation, and the source data distribution is solved for through deconvolution. Papadimitriou et al. [2007] add data-dependent noise to significant frequency components only.

Zhang et al. [2012]
Data-dependence and participant-dependence:
$n_i[t]=x_i[t]\eta_i[t]+c_i[t]$, where $\eta_i[t]\sim N(\mu_i,\sigma_i^2)$
$\therefore x_i[t]=\frac{y_i[t]-c_i[t]}{1+\eta_i[t]}$
PDF reconstructed by the data miner based on the PDF of y and the noise.

Catches: the data miner has to know the participants’ parameters, so the system is not resilient to collusion. Data correlation between participants exposes them to attacks (recall the PCA-based attack?).
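The inversion formula follows from substituting the noise into the additive model $y_i[t]=x_i[t]+n_i[t]$, which gives $y_i[t]=x_i[t](1+\eta_i[t])+c_i[t]$. A small numeric check, assuming that additive model and hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5
x = rng.uniform(10.0, 20.0, size=T)    # private series x_i[t] (hypothetical values)
mu, sigma = 0.0, 0.2                   # participant-specific parameters (assumed)
eta = rng.normal(mu, sigma, size=T)    # eta_i[t] ~ N(mu_i, sigma_i^2)
c = rng.uniform(-1.0, 1.0, size=T)     # offset term c_i[t]

n = x * eta + c                        # data-dependent, participant-dependent noise
y = x + n                              # perturbed series: y = x (1 + eta) + c

# Anyone who knows the realized eta and c can undo the perturbation exactly.
x_recovered = (y - c) / (1.0 + eta)
print(np.max(np.abs(x_recovered - x)))
```

The check recovers x only because the realized noise terms are known here; a party that knows just the parameters $\mu_i$ and $\sigma_i$ can reconstruct the distribution of x rather than each sample.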

Multiplicative perturbation
Multiplies data with noise. Rotation perturbation [Chen et al. 2005]: the noise matrix is an orthogonal matrix with orthonormal rows and columns. Enhanced version: geometric perturbation. Figure labels: input x, perturbation, output y.

Giannella et al.’s [2013] attack can estimate the original data using all perturbed data and a small amount of original data. Attack stage 1: find the maximally unique map β that satisfies [condition missing in source]; then we know which xi is mapped to which yi. Attack stage 2: find [estimate missing in source] that maximizes [objective missing in source].
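A minimal sketch of rotation perturbation, assuming records are stored as columns and the orthogonal matrix is drawn by QR decomposition of a Gaussian matrix (one common construction, not necessarily the one used by Chen et al.). It checks that pairwise Euclidean distances survive the perturbation, which is what keeps distance-based mining usable on the perturbed data.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 6
X = rng.normal(size=(d, n))                    # columns are records (hypothetical data)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = Q @ X                                      # rotation perturbation y = Qx

def pairwise_distances(M):
    """Euclidean distance between every pair of columns of M."""
    diff = M[:, :, None] - M[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

print(np.allclose(Q.T @ Q, np.eye(d)))                               # True
print(np.allclose(pairwise_distances(X), pairwise_distances(Y)))     # True
```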

Multiplicative projection: random projection
Projection by Gaussian random matrix: statistically orthogonal, essentially a Johnson-Lindenstrauss transform. The transform embeds n points from d-dimensional space in k-dimensional space such that inter-point distances change by a factor of (1±ε), as long as $k\ge O(\varepsilon^{-2}\log n)$. Other Johnson-Lindenstrauss transforms exist. Is the attack against orthogonal transforms adaptable for this?
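The distance-preservation claim can be checked empirically. The sketch below assumes a Gaussian matrix scaled by $1/\sqrt{k}$ so that squared lengths are preserved in expectation; the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 50, 1000, 400
X = rng.normal(size=(d, n))                    # n points in d dimensions (columns)

# Gaussian random projection, scaled so that E[||Rx||^2] = ||x||^2.
R = rng.normal(size=(k, d)) / np.sqrt(k)
Y = R @ X                                      # n points in k dimensions

# Ratio of projected to original distance for every pair of points.
i, j = np.triu_indices(n, 1)
ratios = (np.linalg.norm(Y[:, i] - Y[:, j], axis=0)
          / np.linalg.norm(X[:, i] - X[:, j], axis=0))
print(ratios.min(), ratios.max())              # both close to 1
```

Unlike rotation, the projection is not invertible because k < d, so distances are preserved only approximately.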

Collaborative learning using multiplicative perturbation
Goal is to use a different perturbation matrix for each participant. Liu et al. [2012]: participants u and v submit $Z_u=R_uX_u$ and $Z_v=R_vX_v$. A synthesized data matrix Z (mean, covariance) is perturbed as $Z_u=R_u(Z+\epsilon_u)$ and $Z_v=R_v(Z+\epsilon_v)$, from which an approximate inverse of $R_u$ and $R_v$ is learned. The data miner then gets an estimate of $X_u$ and $X_v$! What about the privacy criterion?

B. Liu, Y. Jiang, F. Sha, and R. Govindan, “Cloud-enabled privacy-preserving collaborative learning for mobile sensing,” in Proceedings of the 10th ACM Conference on Embedded Network Sensor Systems, ser. SenSys ’12. ACM, 2012, pp. 57–70.

Nonlinear perturbation
Nonlinear + linear perturbation: the nonlinear function tanh(x) is applied to normalized values and combined with random matrices. Relies on linear perturbation to achieve projection. Near-many-to-one mapping provides the privacy property: extreme values (potential outliers) are “squashed”. Many-to-one mapping extended to the “normal” part of the curve?

Bayesian estimation attacks against multiplicative perturbation
Solve the underdetermined system Y=RX for X by maximum a posteriori estimation (why?).

If R is known: Gaussian original data obviously simplifies the attacker’s problem.
$\max_X p(X|Y,R) = \max_{X\in\mathcal{X}} p(X)$, where $\mathcal{X}=\{X|Y=RX\}$

If R is not known: difficult optimization problem, although Gaussian data simplifies the problem. Choice of p(R) matters.
$\max_{R,X} p(R,X|Y) = \max_{(R,X)\in\Theta} p(R)p(X)$, where $\Theta=\{(R,X)|Y=RX\}$
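For the known-R case, one concrete instance is easy to work out. If the prior on X is assumed to be a zero-mean Gaussian with identity covariance (an assumption made here for illustration), maximizing p(X) subject to Y=RX is the same as finding the minimum-norm solution, which the Moore-Penrose pseudoinverse gives directly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, N = 6, 3, 5                      # original dim, projected dim, number of records
X = rng.normal(size=(n, N))            # original data, standard Gaussian (assumed prior)
R = rng.normal(size=(m, n))            # perturbation matrix, known to the attacker
Y = R @ X                              # perturbed data; m < n makes Y = RX underdetermined

# MAP estimate under the N(0, I) prior: the minimum-norm solution of Y = RX.
X_map = np.linalg.pinv(R) @ Y

print(np.allclose(R @ X_map, Y))                   # the estimate is consistent with Y
print(np.linalg.norm(X_map), np.linalg.norm(X))    # and has no larger norm than X
```

The estimate matches the observations exactly but generally differs from the true X, since the component of X in the null space of R is lost.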

Independent component analysis against multiplicative perturbation
Perturbation matrix treated as mixing matrix (blind source separation): $Y=R_{m\times n}X$. Whereas PCA extracts a set of uncorrelated signals from a set of mixtures (this process is called whitening or sphering), ICA extracts a set of independent signals from the mixture set.

Prerequisites for attacker: independence; at most one Gaussian component; sparseness (Laplace); m≥(n+1)/2.

Steps: estimate R; estimate X; resolve the permutation and scaling ambiguity, $Y=RX=(R\Lambda P)(P^{-1}\Lambda^{-1}X)$.

Cases: m=n; m>n; m<n (overcomplete/underdetermined ICA). Related techniques: sparse representation, nonnegative matrix factorization.
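The permutation and scaling ambiguity in $Y=RX=(R\Lambda P)(P^{-1}\Lambda^{-1}X)$ can be verified numerically. In this sketch the diagonal scaling `Lam` and the permutation `P` are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
R = rng.normal(size=(n, n))            # mixing (perturbation) matrix
X = rng.laplace(size=(n, 8))           # independent, non-Gaussian sources
Y = R @ X

Lam = np.diag([2.0, -0.5, 3.0])        # arbitrary nonzero scaling
P = np.eye(n)[:, [2, 0, 1]]            # arbitrary permutation matrix

R_alt = R @ Lam @ P
X_alt = np.linalg.inv(P) @ np.linalg.inv(Lam) @ X

# A different (mixing matrix, sources) pair explains the same observations.
print(np.allclose(Y, R_alt @ X_alt))   # True
print(np.allclose(X, X_alt))           # False
```

This is why the attacker's last step is needed: ICA alone identifies the sources only up to order and scale.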

Research opportunities and challenges
Large design space: effectiveness depends as much on the nature of data as the data mining algorithms. Diagram labels: participants’ data, multiplicative perturbation, nonlinear perturbation, perturbed data, data mining algorithms, Bayesian estimation attacks, ICA attacks. Commercial interest?

Challenging multidisciplinary problems necessitate a broad range of tools: statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing.

Open issues: scenario-dependent privacy criteria; defenses and attacks evolve side-by-side; role of dimensionality reduction? Steganography for “traitor tracing”? Many more from syntactic privacy, SMC, etc.

Unsupervised learning of Big Data, e.g., Deep Learning
What is Big Data? Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

Vision: from business-centric Internet of Things to citizen-centric Internet of Things.

Main non-technical aim: create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging creation of services of high socio-economic value.

Main technical aims: reliable and secure communications; trustworthy data collection; privacy-preserving data mining.

Motivating use cases: Alice’s sensor network monitoring her house; Alice’s friend Bob granted access to Alice’s network while Alice is on vacation; a sensor network monitoring a community microgrid and feeding data to stakeholders.

Duration: Sep Aug 2016 Funding scheme: STREP Total Cost: €3.69 m EC Contribution: €2.81m Contract Number: CNECT-ICT

Source: Cisco IBSG, April 2011
Conclusion
Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM. Technological development expands the design space and invites multidisciplinary input. Socio-economic development plays a critical role.

Recap of diagram labels: adversarial models; privacy criterion (semantic, syntactic); SMC; randomization (linear: additive, multiplicative; nonlinear); differential privacy; proposed criterion.

Syntactic privacy criteria/definitions
To prevent syntactic attacks:

Table linkage: the attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table. The attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table.

Record linkage: the attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables. The attacker can uniquely identify the target’s record in the anonymous table from the target’s record in the nonanonymous table.

Attribute linkage: the attacker has access to an anonymous table, and the knowledge that its target is represented in the table. The attacker can infer the value(s) of its target’s sensitive attribute(s) from the group (e.g., year-old females) the target belongs to.

Examples: k-anonymity
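As an illustration of the k-anonymity example, the sketch below checks the usual definition: every combination of quasi-identifier values must occur in at least k records. The function name, the table, and its attribute names are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return all(size >= k for size in groups.values())

# Hypothetical anonymized table: age is generalized to a range.
table = [
    {"age_range": "30-39", "sex": "F", "diagnosis": "flu"},
    {"age_range": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_range": "40-49", "sex": "M", "diagnosis": "flu"},
    {"age_range": "40-49", "sex": "M", "diagnosis": "diabetes"},
]

print(is_k_anonymous(table, ["age_range", "sex"], k=2))   # True
print(is_k_anonymous(table, ["age_range", "sex"], k=3))   # False
```

With k=2, an attacker who knows a target's age range and sex cannot single out one record, which is the record linkage case described above.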