Different distributions, but the same moments and estimates of regression coefficients. How about making D3 have the same mean and covariance?
1. Linear Transformation Let D1 be the original p-dimensional data with mean E1 and covariance matrix S1. Let D2 be the post-microaggregation p-dimensional data with mean E2 and covariance matrix S2. Transform D2 into T(D2) such that E[T(D2)] = E1 and S[T(D2)] = S1.
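How to compute A and b? Writing the transformation as T(D2) = A D2 + b, the conditions E[T(D2)] = E1 and S[T(D2)] = S1 require b = E1 - A E2 and A S2 A^T = S1. A can be calculated using the singular value decomposition (SVD).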
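A minimal sketch of one way to obtain A and b with NumPy follows. It assumes the particular solution A = S1^(1/2) S2^(-1/2), with the symmetric square roots computed by SVD; the slide's own formula is not reproduced here, so this choice is an assumption, as are the function names `sym_sqrt`, `fit_linear_transform` and `apply_transform`. It also assumes that records are stored as rows and that S2 is non-singular.

```python
import numpy as np

def sym_sqrt(S, inverse=False):
    """Symmetric square root (or inverse square root) of a symmetric
    positive-definite matrix, computed from its SVD."""
    U, s, _ = np.linalg.svd(S)
    power = -0.5 if inverse else 0.5
    # U diag(s**power) U^T; for a symmetric positive-definite S the left and
    # right singular vectors coincide, so this is a valid matrix power.
    return (U * s**power) @ U.T

def fit_linear_transform(D1, D2):
    """Return (A, b) so that y = A x + b maps the rows of D2 onto data with
    the sample mean and sample covariance of D1 (assumed construction)."""
    E1, E2 = D1.mean(axis=0), D2.mean(axis=0)
    S1 = np.cov(D1, rowvar=False)
    S2 = np.cov(D2, rowvar=False)
    A = sym_sqrt(S1) @ sym_sqrt(S2, inverse=True)  # A S2 A^T = S1
    b = E1 - A @ E2                                # A E2 + b = E1
    return A, b

def apply_transform(D, A, b):
    """Apply y = A x + b to every row of D."""
    return D @ A.T + b
```

Calling `A, b = fit_linear_transform(D1, D2)` and then `apply_transform(D2, A, b)` should give data whose sample mean and covariance match those of D1 up to floating-point error. Other choices of A that satisfy A S2 A^T = S1 (for example, ones built from Cholesky factors) would work equally well.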
NOTES Linearly transformed masked data yield the same results as the original data for any analysis based only on the mean and covariance. How about higher moments? There is no clear answer: higher moments depend on the distribution of the data, not only on A, b, the mean, and the covariance. We need data utility measures. Linear transformation does not preserve positivity. Can we improve the data utility of other SDL methods through linear transformation?
Linear transformation with a positivity constraint. Partition X into X1 and X2. Transform X2 but not X1. After transforming X2, replace any remaining negative values with the minimum of the original data or with zero. The result lies between the non-transformed and the transformed microaggregated data. The utility of this method depends on how many negative values the transformed microaggregated data contain.
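The final replacement step might look like the sketch below. The function name `replace_negatives` is hypothetical, and taking the minimum separately for each variable (column) is an assumption; the slide does not say whether the minimum is per variable or over the whole data set.

```python
import numpy as np

def replace_negatives(Z, original, use_zero=False):
    """Replace negative entries of Z with zero or with the per-variable
    minimum of the original data (per-variable minimum is an assumption)."""
    floor = 0.0 if use_zero else original.min(axis=0)
    # floor is either a scalar or a length-p vector; both broadcast over rows.
    return np.where(Z < 0, floor, Z)
```

Replacing values in this way changes the mean and covariance slightly, which is consistent with the remark that the utility depends on how many negative values need replacing.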
How to partition X?
1. Initially, transform X into Y = AX + b.
2. Sort Y in descending order.
3. Count the number of negative records, n'.
4. Partition Y into Y1 and Y2, where Y1 contains the 1st to (n' + n*p)-th observations of Y and Y2 contains the rest.
5. Partition X into X1 and X2 corresponding to Y1 and Y2.
More observations than the n' negative ones are placed in Y1 in order to reduce the possibility of getting negative values after transforming X2.
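The partitioning steps can be sketched as follows, under several assumptions that the slide leaves open. A record counts as negative if any of its coordinates in Y is negative; records are ordered by their smallest coordinate so that the negative ones come first and end up in Y1 (the slide says "descending", so this ordering key and direction are a guess at the intent); and the `p` in n*p is read as a proportion, here the hypothetical parameter `frac`. The function name `partition_for_positivity` is also hypothetical.

```python
import numpy as np

def partition_for_positivity(X, A, b, frac=0.05):
    """Return row indices (idx1, idx2): X[idx1] is left untransformed and
    X[idx2] is the part to be transformed."""
    Y = X @ A.T + b                           # step 1: Y = AX + b, row-wise
    worst = Y.min(axis=1)                     # smallest coordinate per record
    order = np.argsort(worst, kind="stable")  # step 2: negative records first (assumed)
    n = X.shape[0]
    n_neg = int((worst < 0).sum())            # step 3: n'
    n1 = min(n, n_neg + int(round(n * frac))) # step 4: size of Y1, capped at n
    return order[:n1], order[n1:]             # step 5: indices for X1 and X2
```

For example, `idx1, idx2 = partition_for_positivity(X, A, b, frac=0.1)` gives `X1 = X[idx1]` and `X2 = X[idx2]`. How X2 is then transformed is not specified on the slide, so it is not sketched here.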
Example Here are eight different types of data. For most of the data sets with sign violations, the procedure above improves utility. Since the result lies between the non-transformed and the transformed microaggregated data, it does not always improve the three data utility measures compared with the transformed microaggregated data. The improvement is largest for Non-symmetric Low Positive, next largest for Non-symmetric High Positive, and smallest for Non-symmetric Low Negative.