Journal Club Journal of Chemometrics May 2010 August 23, 2010.


1 Journal Club Journal of Chemometrics May 2010 August 23, 2010

2 An efficient nonlinear programming strategy for PCA models with incomplete data sets. Rodrigo López-Negrete de la Fuente, Salvador García-Muñoz and Lorenz T. Biegler. J. Chemometrics 2010; 24: 301–311

3 Questions addressed: – How to obtain the parameters of PCA models from incomplete data sets using a nonlinear programming strategy. – Why the nonlinear programming approach is better suited when there are large amounts of missing values.

4 Methods PCA with full data set: – Given the PCA model of the data matrix X, where T: scores (projections), P: loadings, Rx: residuals. Problem 1 (maximization): Solution: given by the eigenvector associated with the largest eigenvalue of the covariance matrix of X. Largest eigenvalue of XX′: largest variance. Problem 2 (minimization): Solution: has the same form as the solution of the maximization problem.
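The equations on this slide did not survive in the transcript. As a reading aid, here is the textbook statement of the PCA model and of the two problems for the first component. The notation is an assumption and may differ from the paper's own equations.

```latex
\[
X = T P^{\top} + R_x
\]
\[
\text{Problem 1:}\quad \max_{p}\; p^{\top} X^{\top} X\, p
\quad \text{s.t.}\quad p^{\top} p = 1
\]
\[
\text{Problem 2:}\quad \min_{t,\,p}\; \lVert X - t\,p^{\top} \rVert_F^{2}
\quad \text{s.t.}\quad p^{\top} p = 1
\]
% Stationarity of the Lagrangian of Problem 1 gives the eigenvalue problem
\[
X^{\top} X\, p = \lambda\, p , \qquad t = X p
\]
```

Under this statement, the optimal p is the eigenvector of X'X with the largest eigenvalue, and the same p (with t = Xp) minimizes the residual in Problem 2.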

5 Methods PCA with full data set: Using SVD: the two problems have the same solution.
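A minimal sketch of the SVD route, assuming NumPy and mean-centered data. The function name `pca_svd` and the demo data are illustrative and not taken from the paper. It also checks numerically that the first loading matches the leading eigenvector of X'X.

```python
import numpy as np

def pca_svd(X, n_components):
    """Scores T and loadings P of the mean-centered data via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores
    P = Vt[:n_components].T                      # orthonormal loadings
    return T, P

# Illustrative data (not from the paper)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
T, P = pca_svd(X, 2)

# Eigen-decomposition of X'X for comparison (eigh sorts eigenvalues ascending)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
alignment = abs(P[:, 0] @ eigvecs[:, -1])        # 1.0 up to rounding (sign is arbitrary)
```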

6 Methods PCA with full data-set: Principal Components via NIPALS algorithm:
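The algorithm listing on this slide is missing from the transcript. The following is a sketch of the classical NIPALS iteration for complete data, assuming NumPy. The starting vector, the tolerance, and the demo data are illustrative choices, not the paper's.

```python
import numpy as np

def nipals(X, n_components, tol=1e-10, max_iter=500):
    """Classical NIPALS for complete data: one component at a time, with deflation."""
    X = np.asarray(X, dtype=float)
    R = X - X.mean(axis=0)                 # residual matrix, starts as centered data
    n, m = R.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = R[:, np.argmax(R.var(axis=0))].copy()   # start from the highest-variance column
        for _ in range(max_iter):
            p = R.T @ t / (t @ t)          # regress columns of R on t
            p /= np.linalg.norm(p)         # normalize the loading
            t_new = R @ p                  # regress rows of R on p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        R = R - np.outer(t, p)             # deflate before the next component
    return T, P

# Illustrative data with well-separated components (not from the paper)
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
loadings = np.linalg.qr(rng.normal(size=(6, 3)))[0]
X_demo = scores @ loadings.T + 0.01 * rng.normal(size=(200, 6))
T, P = nipals(X_demo, 2)
```

With complete data, the deflation step keeps the loadings orthonormal and, at convergence, the score vectors orthogonal.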

7 Methods PCA with incomplete data set: Taking the gradient with respect to t and p
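The derivation itself is not in the transcript. A plausible reconstruction for one component is given here, where Ω denotes the set of observed entries, Ω_i the observed columns in row i, and Ω^j the observed rows in column j. These symbols are assumptions introduced for illustration.

```latex
\[
\min_{t,\,p}\; \phi(t,p) = \sum_{(i,j)\in\Omega} \left( x_{ij} - t_i\, p_j \right)^2
\]
\[
\frac{\partial \phi}{\partial t_i}
 = -2 \sum_{j\in\Omega_i} \left( x_{ij} - t_i\, p_j \right) p_j = 0
 \;\;\Longrightarrow\;\;
 t_i = \frac{\sum_{j\in\Omega_i} x_{ij}\, p_j}{\sum_{j\in\Omega_i} p_j^{2}}
\]
\[
\frac{\partial \phi}{\partial p_j}
 = -2 \sum_{i\in\Omega^j} \left( x_{ij} - t_i\, p_j \right) t_i = 0
 \;\;\Longrightarrow\;\;
 p_j = \frac{\sum_{i\in\Omega^j} x_{ij}\, t_i}{\sum_{i\in\Omega^j} t_i^{2}}
\]
```

Each update is a least-squares regression restricted to the observed entries, which is the usual basis for NIPALS variants that skip missing values.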

8 Methods PCA with incomplete data-set: Principal Components via modified NIPALS algorithm:
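The modified algorithm is not reproduced in the transcript. This sketch shows one common way to adapt NIPALS to missing data, assuming NumPy, with missing entries encoded as NaN and skipped in every regression. The function name, the NaN encoding, and the demo data are assumptions, and details may differ from the variant used in the paper.

```python
import numpy as np

def nipals_missing(X, n_components, tol=1e-8, max_iter=1000):
    """NIPALS that skips missing (NaN) entries in each regression step.
    Assumes every row and every column has at least one observed value."""
    X = np.asarray(X, dtype=float)
    M = (~np.isnan(X)).astype(float)                 # 1 where observed, 0 where missing
    R = np.where(M > 0, X - np.nanmean(X, axis=0), 0.0)  # centered, missing zeroed
    n, m = R.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = R[:, np.argmax((R ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            p = (R.T @ t) / (M.T @ (t ** 2))         # sums run over observed rows only
            p /= np.linalg.norm(p)
            t_new = (R @ p) / (M @ (p ** 2))         # sums run over observed columns only
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        R = (R - np.outer(t, p)) * M                 # deflate on observed entries only
    return T, P

# Illustrative data with about 20% of the entries removed at random
rng = np.random.default_rng(1)
scores = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
loadings = np.linalg.qr(rng.normal(size=(10, 2)))[0]
X_full = scores @ loadings.T + 0.05 * rng.normal(size=(200, 10))
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan
T, P = nipals_missing(X_miss, 2)
gram_P = P.T @ P    # inspect: off-diagonal entries need not be zero with missing data
```

Because each regression uses a different subset of entries, nothing in this scheme enforces orthogonality between components; inspecting `gram_P` shows how far the loadings are from orthonormal.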

9 Methods PCA with incomplete data set: - X is the data matrix with the missing elements zeroed out. - Constraint (20b) forces the loadings to be orthonormal. - Constraint (20c) makes the score vectors orthogonal to each other. - Constraint (20d) forces the scores to have zero mean. - If there are no missing values, problem (20) reduces to problem (4) (the minimization problem) for the first a principal components. Let Yi,j = Xi,j + Zi,j, where Xi,j are the data values, set to zero for the missing elements, and Zi,j are the imputed values, which are zero for the nonmissing elements.
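Problem (20) itself is not shown in the transcript, so the following is only a sketch of how the quantities described above could be evaluated for a candidate (T, P), assuming NumPy and NaN-coded missing values. The objective (squared error over observed entries) and the choice of imputing Z from the model prediction are assumptions, not the paper's exact formulation.

```python
import numpy as np

def pca_nlp_diagnostics(X, T, P):
    """Evaluate an assumed masked objective and the residuals of constraints (20b)-(20d)."""
    M = ~np.isnan(X)
    X0 = np.where(M, X, 0.0)                 # missing elements zeroed out
    fit = T @ P.T
    Z = np.where(M, 0.0, fit)                # imputed values, zero where data are observed
    Y = X0 + Z                               # completed matrix, Y = X + Z
    a = P.shape[1]
    G = T.T @ T
    return {
        "objective": np.sum((Y - fit) ** 2),                       # only observed entries contribute
        "loadings_orthonormal": np.linalg.norm(P.T @ P - np.eye(a)),   # (20b)
        "scores_orthogonal": np.linalg.norm(G - np.diag(np.diag(G))),  # (20c)
        "scores_zero_mean": np.abs(T.mean(axis=0)).max(),              # (20d)
    }

# Complete-data check: the SVD solution should satisfy all three constraints
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T, P = U[:, :2] * s[:2], Vt[:2].T
diag = pca_nlp_diagnostics(X, T, P)
```

With no missing values the objective reduces to the ordinary PCA residual, here the sum of the squared discarded singular values.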

10 Methods PCA with incomplete data set: Because the constrained problem is solved directly, the scores and loadings obtained with the NLP are orthogonal, as required by the PCA model; this is not true for the modified NIPALS.

11 RESULTS Numerical simulations were done by generating a data set with 1000 rows and 100 columns from a known four-dimensional latent space with added random Gaussian noise. Values were then removed to generate data sets with missing value percentages ranging from 1 to 70%.
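A sketch of how such a test set could be generated, assuming NumPy. The noise level, the random seed, and the removal of entries uniformly at random are assumptions; the slide only gives the matrix size, the latent dimension, and the range of missing percentages.

```python
import numpy as np

def simulate_incomplete_data(n_rows=1000, n_cols=100, n_latent=4,
                             noise_std=0.1, missing_fraction=0.3, seed=0):
    """Low-rank data plus Gaussian noise, with a fraction of entries set to NaN."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n_rows, n_latent))                   # latent scores
    P = np.linalg.qr(rng.normal(size=(n_cols, n_latent)))[0]  # orthonormal loadings
    X = T @ P.T + noise_std * rng.normal(size=(n_rows, n_cols))
    n_missing = int(round(missing_fraction * X.size))
    idx = rng.choice(X.size, size=n_missing, replace=False)   # flat indices to remove
    X_missing = X.copy()
    np.put(X_missing, idx, np.nan)
    return X, X_missing

# Data sets spanning the range of missing percentages mentioned on the slide
fractions = (0.01, 0.10, 0.30, 0.50, 0.70)
datasets = {f: simulate_incomplete_data(missing_fraction=f)[1] for f in fractions}
```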

12 RESULTS

13 Industrial Example: - Data from 76 common pharmaceutical materials were made available (Pfizer Inc.); the data span more than 10 years of testing. Due to the reasons outlined above, approximately 61% of the data were missing. - For this example, three principal components were used in all models because of a sudden drop in the eigenvalue for the fourth component (from 13 to 2) in the NIPALS model and the very low percentage of the total variance explained by the fourth component (1.4%) in the NLP model.

14 Conclusion The NLP solutions take less time and fewer iterations than the current state-of-the-art algorithms, while still satisfying the constraints of the PCA model. The current platform allows the potential inclusion of a large number of observations that would otherwise be excluded from the model-building exercise, still yielding a robust model with desirable properties. In the presence of large amounts of missing data, this method reduces the computational time (and number of iterations) required to calculate the model parameters.

