1 Classification: supervised and unsupervised. Tormod Næs, Matforsk and University of Oslo

2 Classification

Unsupervised (cluster analysis)
– Searching for groups in the data, out of suspicion or for general exploration
– Hierarchical methods, partitioning methods

Supervised (discriminant analysis)
– Groups determined by other information (external or from a cluster analysis)
– Understand differences between groups
– Allocate new objects to the groups (scoring, finding degree of membership)

3 [Figure] Two groups (Group 1, Group 2) and a new object x: what is the difference between the groups, and where does the new object belong?

4 Why supervised classification?

Authenticity studies
– Adulteration, impurities, different origin, species, etc.
– Raw materials, consumer products according to specification
When quality classes are more important than chemical values
– Raw materials acceptable or not
– Raw materials for different products

5 Flow chart for discriminant analysis

6 Main problems

Selectivity
– Multivariate methods are needed
Collinearity
– Data compression is needed
Complex group structures
– Ellipses, squares or "bananas"?

7 The selectivity problem [Figure: authentic vs. adulterated samples plotted against variables X1 and X2]

8 Solving the selectivity problem

Using several measurements at the same time: the information is there!
Multivariate methods: these methods combine several instrumental NIR variables in order to determine the property of interest.
Mathematical "purification" instead of wet chemical analysis.

9 Multivariate methods

Too many variables can also sometimes create problems:
– Interpretation
– Computations: time and numerical stability
– Simple and difficult regions (nonlinearity)
– Overfitting is easier (dependent on the method used)
Sometimes it is important to find good compromises (variable selection).

10 Conflict between flexibility and stability [Figure: trade-off between estimation error and model error]

11 Some main classes of methods

Classical Bayes classification
– LDA, QDA
Variants and modifications used to solve the collinearity problem
– RDA, DASCO, SIMCA
Classification based on regression analysis
– DPLS, DPCR
KNN methods, flexible with respect to the shape of the groups

12 Bayes classification

Assume prior probabilities p_j for the groups
– If unknown, fix them to be p_j = 1/C, or equal to the proportions in the dataset
Assume a known probability model within each class, f_j(x)
– Estimated from the data, usually covariance matrices and means
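
A minimal sketch of this setup, assuming scikit-learn and hypothetical arrays X (training spectra), y (group labels) and X_new (new objects) that are not part of the original slides:

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Equal priors p_j = 1/C; alternatively pass priors=None to use the
# proportions in the training set.
n_groups = len(np.unique(y))
qda = QuadraticDiscriminantAnalysis(priors=np.full(n_groups, 1.0 / n_groups))
qda.fit(X, y)                          # estimates group means and covariance matrices
posterior = qda.predict_proba(X_new)   # posterior probabilities for each group
assigned = qda.predict(X_new)          # allocate to the group with highest posterior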

13 Bayes classification: pros and cons

+ Well understood, much used, often good properties, easy to validate
+ Easy to modify for collinear data
+ Easy to update (covariances)
+ Can be modified for costs
+ Outlier diagnostics (not directly, but can be done, e.g. via Mahalanobis distance)

− Cannot handle too complex group structures; designed for elliptic structures
− Not so easy to interpret directly; often followed by Fisher's linear discriminant analysis, which is directly related to interpreting differences between groups

14 Bayes rule

Maximise the posterior probability. For normal data this means allocating x to the group j that minimises

d_j(x) = (x − μ_j)' Σ_j⁻¹ (x − μ_j) + ln|Σ_j| − 2 ln p_j

i.e. the Mahalanobis distance plus a log-determinant term minus a prior-probability term. The model parameters (group means μ_j and covariance matrices Σ_j) are estimated from the data.
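
To make the reconstructed rule concrete, here is a small sketch that computes this score directly from estimated group means, covariance matrices and priors; all variable names are hypothetical:

import numpy as np

def qda_score(x, mean, cov, prior):
    """Quadratic discriminant score: Mahalanobis distance plus log-determinant
    minus twice the log prior. The new object is allocated to the group with
    the smallest score."""
    diff = x - mean
    maha = diff @ np.linalg.solve(cov, diff)          # (x - mu)' Sigma^-1 (x - mu)
    return maha + np.log(np.linalg.det(cov)) - 2.0 * np.log(prior)

# Allocation: j_hat = argmin_j qda_score(x, means[j], covs[j], priors[j])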

15 Different covariance structures

16 Mahalanobis distance is constant on ellipsoids

17 Best known members

Equal covariance matrix for each group
– LDA
Unequal covariance matrices
– QDA
Collinear data: unstable inverted covariance matrix (see the equation on slide 14)
– Use principal components (or PLS components)
– RDA and DASCO estimate stable inverse covariance matrices
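
As an illustration of the "compress first, then classify" idea in the last point (a sketch using scikit-learn rather than the RDA/DASCO estimators themselves; X_train, y_train, X_test, y_test and the component count are assumptions):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Compress the collinear spectra to a few PC scores, then run LDA on the scores.
pc_lda = make_pipeline(StandardScaler(), PCA(n_components=10),
                       LinearDiscriminantAnalysis())
pc_lda.fit(X_train, y_train)
print(pc_lda.score(X_test, y_test))   # classification rate on a validation set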

18 Classification by regression

0/1 dummy variables for each group
Run PLS-2 (or PCR), or any other method which solves the collinearity
Predict class membership
– The class with the highest predicted value gets the vote
All regular interpretation tools are available: variable selection, plotting, outlier diagnostics, etc.
Linear borders between subgroups, so not too complicated group shapes. Related to LDA (not covered here).
For large data sets we can use more flexible methods.
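
A DPLS-style sketch along these lines, assuming scikit-learn, more than two groups, and hypothetical training/test arrays; the number of components is illustrative only:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
Y_dummy = lb.fit_transform(y_train)          # one 0/1 dummy column per group
pls2 = PLSRegression(n_components=10).fit(X_train, Y_dummy)

Y_pred = pls2.predict(X_test)                # continuous prediction for each group
winners = lb.classes_[np.argmax(Y_pred, axis=1)]   # highest value gets the vote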

19 Example: classification of mayonnaise based on different oils

The oils were soybean, sunflower, canola, olive, corn and grapeseed; 16 samples in each group. Feasibility study, authenticity. Indahl et al. (1999), Chemolab.

20 [Figure] Classification properties of QDA, LDA and regression (start out low)

21 Comparison

LDA and QDA gave almost identical results.
It was substantially better to use LDA/QDA based on PLS/PCA components than to use PLS directly.

22 Fisher's linear discriminant analysis

Closely related to LDA
Focuses on interpretation
– Use "spectral loadings" or group averages
Finds the directions in space which distinguish the groups the most
– The directions are uncorrelated
Sensitive to overfitting, so use PCs first
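
One possible way to compute such directions in practice (a sketch that uses scikit-learn's LDA transform on PC scores as a stand-in for Fisher's canonical variates; X, y and the number of components are assumptions):

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

scores = PCA(n_components=10).fit_transform(X)      # compress first to limit overfitting
lda = LinearDiscriminantAnalysis()
canonical = lda.fit(scores, y).transform(scores)    # at most C-1 canonical variates

# Plotting canonical[:, 0] against canonical[:, 1] shows the separation between
# groups, as in the mayonnaise example on slide 25.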

23 Fisher’s method. Næs, Isaksson, Fearn and Davies (2001). A user friendly guide to cal. and class.

24 [Figure] Plot of PC1 vs. PC2: not possible to distinguish the groups from each other

25 [Figure] Canonical variates based on PCs: mayonnaise data, clear separation between the groups

26 [Figure: PCA vs. Fisher's method] Forina et al. (1986), Vitis. Italian wines from the same region but based on different cultivars (Barolo, Grignolino, Barbera); 27 chromatic and chemical variables.

27 Correct classification rates (properly validated)

LDA
– Barolo 100%, Grignolino 97.7%, Barbera 100%
QDA
– Barolo 100%, Grignolino 100%, Barbera 100%

28 KNN methods

No model assumptions
– Therefore needs data from "everywhere" and many data points
Flexible, handles complex data structures
Sensitive to overfitting, so use PCs
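
A minimal KNN sketch in the same spirit (scikit-learn, PC scores first, K = 3 as in the figure on the next slide; the arrays and component count are assumptions):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Compress to PC scores first (see the overfitting remark above), then 3-NN.
knn = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print(knn.predict(X_new))   # majority vote among the 3 closest training samples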

29 [Figure] KNN finds the K training samples which are closest to the new sample; in this case the 3 closest samples.

30 Cluster analysis

Unsupervised classification
Identifying groups in the data
– Explorative

31 Examples of use

Forina et al. (1982). Olive oil from different regions (fatty acid composition). Ann. Chim.
Armanino et al. (1989). Olive oils from different Tuscan provinces (acids, sterols, alcohols). Chemolab.

32 Methods

PCA (informal/graphical)
– Look for structures in scores plots
– Interpretation of subgroups using loadings plots
Hierarchical methods (more formal)
– Based on distances between objects (Euclidean or Mahalanobis)
– Join the two most similar objects/clusters at each step
– Interpret the dendrogram
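
A short sketch of both approaches (an informal PCA scores plot and a hierarchical dendrogram), using scikit-learn, SciPy and matplotlib with an assumed data matrix X; Ward linkage on Euclidean distances is one common choice, not necessarily the one used in the cited studies:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram

# Informal/graphical: look for structure in the PCA scores plot.
scores = PCA(n_components=2).fit_transform(X)
plt.figure(); plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2")

# More formal: hierarchical clustering on Euclidean distances, then a dendrogram.
Z = linkage(X, method="ward", metric="euclidean")   # repeatedly join the most similar
plt.figure(); dendrogram(Z)                         # interpret the dendrogram
plt.show()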

33 Armanino et al. (1989), Chemom. Intell. Lab. Syst. 120 olive oils from one region in Italy, 29 variables (fatty acids, sterols, etc.).

