CLASSIFICATION. Periodic Table of Elements: 1789 Lavoisier, 1869 Mendeleev.

1 CLASSIFICATION

2 Periodic Table of Elements

3 1789 Lavoisier, 1869 Mendeleev

4 Measures of similarity: i) distance, ii) angular (correlation)

5 Two objects x_k and x_l plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance d_kl = ||x'_k - x'_l||; the angle between the vectors gives the angular (correlation) measure.

6 Measuring similarity. Distance: i) Euclidean, ii) Minkowski (“Manhattan”, “taxi”), iii) Mahalanobis (for correlated variables)

7 Distance between points p1 and p2 in the (X1, X2) plane. Euclidean: the straight-line distance. Manhattan: the sum of absolute coordinate differences.
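The three distance measures can be sketched as below; `euclidean`, `manhattan`, and `mahalanobis` are hypothetical helper names used for illustration, not part of any package:

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance between two points."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(np.sqrt(d @ d))

def manhattan(p, q):
    """City-block ("taxi") distance: sum of absolute coordinate differences."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(np.sum(np.abs(d)))

def mahalanobis(p, q, cov):
    """Distance that accounts for correlated variables via the covariance matrix;
    with an identity covariance it reduces to the Euclidean distance."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

With uncorrelated, equal-variance variables all three orderings of "nearest" agree; Mahalanobis matters when cov(X1, X2) is far from diagonal.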

8 Classification using distance: the nearest neighbor(s) define the membership of an object. KNN (K nearest neighbors), e.g. K = 1 or K = 3.
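A minimal sketch of the KNN rule above (majority vote among the K nearest training objects); `knn_classify` is an illustrative name:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the majority class among its k nearest training objects
    (Euclidean distance in the raw variable space)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training object
    nearest = np.argsort(dists)[:k]               # indices of the k closest objects
    votes = Counter(y_train[i] for i in nearest)  # count class labels among them
    return votes.most_common(1)[0][0]
```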

9 Classification: X1 and X2 are uncorrelated, cov(X1, X2) = 0, for both subsets (classes) => KNN can be used to measure similarity.

10 Classification: univariate classification can NOT separate class 1 from class 2, but bivariate classification (KNN) can. For class 3 and class 4, PC analysis provides excellent separation along PC2.

11 Classification: X1 and X2 are correlated, cov(X1, X2) ≠ 0, for both “classes” (high X1 => high X2). KNN fails, but PC analysis provides the correct classification.

12 Classification: cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances. Drawback: no separation of noise from information. Cure: use the scores from the major PCs.
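The "cure" above, computing distances on scores from the major PCs instead of on the raw variables, might be sketched as follows; `pca_scores` and `project` are hypothetical names, and the SVD is one common way to obtain the components:

```python
import numpy as np

def pca_scores(X, A):
    """Project mean-centred data onto the A leading principal components.
    The discarded minor components mostly carry noise."""
    xbar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    P = Vt[:A].T                      # loadings (M x A)
    return (X - xbar) @ P, P, xbar    # scores (N x A), loadings, class mean

def project(x_new, P, xbar):
    """Map a new object into the same score space before computing distances."""
    return (x_new - xbar) @ P
```

KNN can then be run on the score matrix exactly as before, only in A dimensions.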

13 VARIABLE CORRELATION AND SIMILARITY BETWEEN OBJECTS

14 CORRELATION & SIMILARITY: objects plotted in the variable space (Var 1, Var 2).

15 CORRELATION & SIMILARITY: a separate PC model fitted to each class (PC class 1, PC class 2) in variable space. SUPERVISED COMPARISON (SIMCA).

16 CORRELATION & SIMILARITY: one common PC model (PC1, PC2) fitted to all objects in variable space. UNSUPERVISED COMPARISON (PCA).

17 CORRELATION & SIMILARITY: in variable space, the object vector x'_k is decomposed into its projection on the class model, x'_c, and the residual e'_k.

18 CORRELATION & SIMILARITY. Unsupervised: PCA score plot, fuzzy clustering. Supervised: SIMCA.

19 CORRELATION & SIMILARITY: characterisation and correlation of crude oils, Kvalheim et al. (1985), Anal. Chem.

20 CORRELATION&SIMILARITY Sample 1 Sample 2 Sample N

21 CORRELATION & SIMILARITY: score plot of samples 1–14 in the (t1, t2) plane (PC1 vs PC2).

22 Soft Independent Modelling of Class Analogies (SIMCA)

23 SIMCA: Data (variance) = Model (covariance pattern; angular correlation) + Residuals (unique variance, noise; distance).

24 SIMCA. Data matrix X_ki: rows are objects 1, 2, 3, …, N, N+1, …, N+N'; columns are variables 1, 2, …, M. Objects 1…N form the training set (reference set), grouped into Class 1, Class 2, …, Class Q; objects N+1…N+N' are unassigned and form the test set. Class: a group of similar objects. Object: a sample, an individual. Variable: a feature, characteristic, attribute.

25 SIMCA. Example: a data matrix of peak areas X_ki from chromatograms; rows 1…N are samples grouped by Oil field 1, Oil field 2, …, Oil field Q (training/reference set), rows N+1…N+N' are new samples (test set), and the columns are the M peak areas.

26 PC MODELS. Zero-component model (class mean only): x_ki = x̄_i + e_ki, or in vector form x'_k = x̄' + e'_k. One-component model (mean plus one PC with loading p1): x_ki = x̄_i + t_k p'_i + e_ki, or x'_k = x̄' + t_k p' + e'_k.

27 PC MODELS. Two-component model (loadings p1 and p2): x_ki = x̄_i + Σ_a t_ka p'_ai + e_ki, or x'_k = x̄' + t_k1 p'_1 + t_k2 p'_2 + e'_k.

28 PRINCIPAL COMPONENT CLASS MODEL: X_c = X̄_c + T_c P'_c + E_c, where X̄_c + T_c P'_c is the information (structure) and E_c is the noise. Indices: k = 1, 2, …, N (object, sample); i = 1, 2, …, M (variable); a = 1, 2, …, A (principal component); c = 1, 2, …, C (class).
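The class model X_c = X̄_c + T_c P'_c + E_c can be illustrated with an SVD-based sketch (one common way to compute a PC model; `pc_class_model` is an illustrative name):

```python
import numpy as np

def pc_class_model(X, A):
    """Decompose X into class mean + structure (A components) + residual noise:
    X = xbar + T P' + E."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T           # loadings, M x A
    T = Xc @ P             # scores,   N x A
    E = Xc - T @ P.T       # residuals (the noise part)
    return xbar, T, P, E
```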

29 PC MODELS: deletion pattern for the elements of the data matrix in the leave-out-one-group-of-elements-at-a-time cross-validation procedure developed by Wold; the elements are deleted in diagonal groups so that every element is left out exactly once.

30 CROSS VALIDATING PC MODELS. i) Calculate scores and loadings for PC a+1 (t_a+1 and p'_a+1), excluding the elements in one group. ii) Predict values for the excluded elements: ê_ki,a+1 = t_k,a+1 p'_a+1,i. iii) Sum the squared prediction errors over the excluded elements. iv) Repeat i)–iii) for all the other groups of elements. v) Compare the total prediction error (PRESS) with the residual sum of squares of the a-component model, adjusting for degrees of freedom; keep PC a+1 only if it improves prediction.
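A simplified illustration of the idea: the full Wold procedure deletes diagonal groups of matrix *elements*, but a leave-one-object-out version already shows how PRESS is accumulated and compared across component counts. `press_for_component` is a hypothetical name:

```python
import numpy as np

def press_for_component(X, a):
    """Leave-one-object-out PRESS for an a-component PC model (a simplification
    of Wold's element-wise cross-validation; degrees-of-freedom adjustment omitted)."""
    press = 0.0
    for k in range(len(X)):
        Xtrain = np.delete(X, k, axis=0)       # exclude object k
        xbar = Xtrain.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xtrain - xbar, full_matrices=False)
        P = Vt[:a].T                           # loadings from the reduced data
        xk = X[k] - xbar
        e = xk - P @ (P.T @ xk)                # residual after fitting k to the model
        press += float(e @ e)
    return press
```

For data with genuine one-component structure, PRESS should drop sharply from a = 0 to a = 1 and then level off.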

31 One-component PC model: residual-distance limits S_max at p = 0.05 and p = 0.01 drawn around PC1.

32 Residual Standard Deviation (RSD). Mean RSD of the class: s_0² = Σ_k Σ_i e_ki² / ((N − A − 1)(M − A)). RSD of object k: s_k² = Σ_i e_ki² / (M − A). S_max marks the critical residual limit around PC1.

33 Score limits along PC1: t_min and t_max extended by ½ s_t give t_lower and t_upper; s_max bounds the residual distance.

34 CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model. ii) Compare the residual distance of the object to the class model with the average residual distance of the objects used to build the class (F-test).

35 CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model: for a = 1, 2, …, A calculate the scores of object k, then the residuals e_ki of the object after the fit, which define its residual distance s_k. ii) Compare the residual distance of the object with the average residual distance of the objects used to build the class (F-test): s_k²/s_0² > F_critical => k ∉ class q; otherwise k ∈ class q.
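Steps i) and ii) might be sketched as follows; the degrees-of-freedom convention and the helper names (`residual_distance`, `in_class`) are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def residual_distance(x_new, xbar, P):
    """Fit a new object to a class PC model (mean xbar, loadings P) and
    return its residual RSD, here with M - A degrees of freedom."""
    r = x_new - xbar
    t = P.T @ r                # scores of the new object on the class PCs
    e = r - P @ t              # residual after the fit
    dof = len(e) - P.shape[1]  # M - A
    return float(np.sqrt(e @ e / dof))

def in_class(s_new, s0, F_crit):
    """F-test style decision: membership is rejected if the object's residual
    variance exceeds F_crit times the class mean residual variance s0**2."""
    return (s_new / s0) ** 2 <= F_crit
```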

36 Score limits along PC1 (t_lower, t_upper, s_max, as in slide 33), with objects falling outside the model marked.

37 Detection of atypical objects. Object k: s_k > RSD_max => k is outside the class. Object l: its score t_l is outside the “normal area” {t_min − ½s_t, t_max + ½s_t} => calculate the residual distance s_l to the extreme point; s_l > RSD_max => l is outside the class.

38 Detection of outliers. 1. Score plots. 2. Dixon tests on each latent variable. 3. Normal plots of the scores for each latent variable. 4. Test of residuals, F-test (class model).

39 MODELLING POWER & DISCRIMINATION POWER

40 MODELLING POWER: the variable's contribution to the class model q (intra-class variation): MP_i^q = 1 − s_i,A^q / s_i,0^q. MP_i = 1.0 => variable i is completely explained by the class model; MP_i = 0.0 => variable i does NOT contribute to the class model.

41 DISCRIMINATION POWER: the variable's ability to separate two class models (inter-class variation). DP_i^(r,q) = 1.0 => no discrimination power; DP_i^(r,q) > 3-4 => “good” discrimination power.
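A sketch of the modelling-power formula from slide 40; the degrees-of-freedom corrections are one common convention rather than something stated on the slides, and `modelling_power` is an illustrative name:

```python
import numpy as np

def modelling_power(X, A):
    """MP_i = 1 - s_iA / s_i0 per variable: residual SD of variable i after an
    A-component class model, relative to its SD about the class mean."""
    N, M = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T
    E = Xc - (Xc @ P) @ P.T                           # per-variable residuals
    s_iA = np.sqrt((E ** 2).sum(axis=0) / (N - A - 1))
    s_i0 = np.sqrt((Xc ** 2).sum(axis=0) / (N - 1))
    return 1.0 - s_iA / s_i0
```

Variables that load on the class PCs score near 1; a pure-noise variable scores near 0 (or slightly below, from the degrees-of-freedom correction).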

42 SEPARATION BETWEEN CLASSES: each object k in class q and each object l in class r has residual distances to both class models, s_k(q), s_k(r) and s_l(q), s_l(r). The worst ratio s_l(q)/s_l(r), l ∈ r, and the class distance computed from these residuals indicate how well the classes separate; a class distance well above 1 => “good separation”.

43 POLISHED CLASSES: 1) Remove “outliers”. 2) Remove variables with both low MP (< ≈0.3-0.4) and low DP (< ≈2-3).

44 How does SIMCA differ from other multivariate methods? i) It models the systematic intra-class variation (angular correlation). ii) Assuming a normally distributed population, the residuals can be used to decide class membership (F-test). iii) “Closed” models. iv) It takes the correlation between variables into account, which is important for large data sets. v) It separates the noise from the systematic (predictive) variation in each class.

45 Latent Data Analysis (LDA): a separating surface between classes. New classes? Outliers? The asymmetric case? Looking for dissimilarities.

46 MISSING DATA: the class models f1(x1, x2) and f2(x1, x2) can be used to predict the values of missing elements (?).

47 WHEN DOES SIMCA WORK? 1. Similarity between objects in the same class (homogeneous data). 2. Some variables relevant to the problem in question (MP, DP). 3. At least 5 objects and 3 variables.

48 ALGORITHM FOR SIMCA MODELLING: Read raw data → Pretreatment of data (square root, normalise, standardise, and more) → Select subset/class → Cross-validated PC model → Outliers? (remove and remodel if yes) → More classes? → Evaluation of subsets: variable weighting; eliminate variables with low modelling and discrimination power → “Polished” subsets → Fit new objects.

