
1 PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION Richard Brereton r.g.brereton@bris.ac.uk

2 CLUSTER ANALYSIS - UNSUPERVISED PATTERN RECOGNITION Grouping of objects according to similarity. No predefined classes

3 TAXONOMY

4 CHEMICAL TAXONOMY Grouping organisms according to similarity from chemical fingerprints: DNA base pairs, proteins; NMR and pyrolysis of extracts; NIR spectra.

5 SIMILAR PRINCIPLES IN ALL TYPES OF CHEMISTRY Chemical archaeology Environmental samples Food

6 STEPS IN CLUSTER ANALYSIS Similarity measures: calculate the similarity between every pair of objects. Example follows.

7 Correlation coefficient : higher, more similar. Euclidean distance : smaller, more similar.

8 Manhattan distance : smaller, more similar.
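A minimal sketch of these three similarity measures in Python (NumPy assumed; the two example vectors are made up purely for illustration):

```python
import numpy as np

# Two hypothetical objects, each described by five measurements.
x = np.array([1.0, 2.5, 3.1, 0.7, 4.2])
y = np.array([0.9, 2.7, 2.8, 1.1, 3.9])

# Correlation coefficient: the higher, the more similar.
r = np.corrcoef(x, y)[0, 1]

# Euclidean distance: the smaller, the more similar.
d_euclid = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (city-block) distance: the smaller, the more similar.
d_manhattan = np.sum(np.abs(x - y))

print(r, d_euclid, d_manhattan)
```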

9 Use correlations for illustration. Group samples. 1. Find most similar, highest correlation. Objects 2 and 5. 2. Combine them. 3. Work out new correlation of the new object 2&5 with the other objects (1,3,4,6).

10 Linkage methods – determination of new similarity measures for groups. Several methods: nearest neighbour uses the highest correlation; furthest neighbour uses the lowest correlation; average linkage uses an average. Illustrate with nearest neighbour.
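The slides work the linkage step through by hand; in practice the whole procedure (similarity matrix, linkage, dendrogram) is usually automated. A sketch using SciPy with a small made-up data matrix (one row per object); SciPy expects distances, so the correlation is converted to the distance 1 − r:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical data: 6 objects (rows) x 5 measurements (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

# Correlation-based distance (1 - r): smaller means more similar.
d = pdist(X, metric="correlation")

# Linkage: 'single' = nearest neighbour, 'complete' = furthest neighbour,
# 'average' = average linkage.
Z = linkage(d, method="single")

dendrogram(Z, labels=[f"obj {i + 1}" for i in range(6)])
plt.show()
```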

11

12 Dendrograms

13 CLUSTER ANALYSIS : SUMMARY Similarity measures Linkage methods Dendrogram

14 CLASSIFICATION Many methods. CONVENTIONAL: LDA (linear discriminant analysis). Rooted in classical statistics: projections.

15 Examples Orange juices: can we classify them by origin, and can we detect adulteration, from NIR spectra? Class modelling of mussels: can we tell which come from a polluted site from GC data? Detailed mathematical models.

16 PRINCIPLES : BIVARIATE EXAMPLE

17 Often an exact cut-off is impossible.

18 Class distance plots

19 Multivariate data : several measurements per object. Example – Fisher Iris data – four measurements per iris: petal width, petal length, sepal width, sepal length. 150 irises, divided into 50 of each species: I. setosa, I. versicolor, I. virginica.

20 SPECIAL DISTANCES USED. Linear discriminant function between classes A and B. The first term is simply the difference between the centres of each class – so a more positive value indicates class A. The middle term is the inverse of the "pooled" variance–covariance matrix. What does this mean? Sometimes measurements are correlated; sometimes classes are more dispersed. This puts distances on a common scale. The final term is the measurement vector for the object.
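A reconstruction of the function, consistent with the term-by-term description above (a sketch of the standard Fisher form, not necessarily the exact notation of the original slide):

$$W_{AB}(\mathbf{x}) = (\bar{\mathbf{x}}_A - \bar{\mathbf{x}}_B)\,\mathbf{S}_p^{-1}\,\mathbf{x}^{\mathsf{T}}$$

where $\bar{\mathbf{x}}_A$ and $\bar{\mathbf{x}}_B$ are the class centres, $\mathbf{S}_p$ is the pooled variance–covariance matrix of classes A and B, and $\mathbf{x}$ is the row vector of measurements for the object.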

21

22 Can shift the scale so that a positive score (W_AB) indicates probably class A and a negative score probably class B. Note some ambiguities.
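A sketch of this two-class discriminant score in Python, using two of the Fisher Iris classes from slide 19 as stand-in data (scikit-learn is assumed only for loading the data; subtracting w·(x̄_A + x̄_B)/2 is the shift that centres the decision boundary at zero):

```python
import numpy as np
from sklearn.datasets import load_iris

# Two classes from the Fisher Iris data: class A = setosa (0), class B = versicolor (1).
iris = load_iris()
X, y = iris.data, iris.target
XA, XB = X[y == 0], X[y == 1]

# Class centres and pooled variance-covariance matrix.
mA, mB = XA.mean(axis=0), XB.mean(axis=0)
nA, nB = len(XA), len(XB)
Sp = ((nA - 1) * np.cov(XA, rowvar=False) +
      (nB - 1) * np.cov(XB, rowvar=False)) / (nA + nB - 2)

# Discriminant direction and shifted score: positive -> probably A, negative -> probably B.
w = (mA - mB) @ np.linalg.inv(Sp)
shift = w @ (mA + mB) / 2.0
scores = X[y < 2] @ w - shift

# Fraction of each class falling on its own side of zero.
print((scores[:nA] > 0).mean(), (scores[nA:] < 0).mean())
```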

23 Extending to more than 2 classes. Three classes – use 2 of the 3 possible discriminant functions. If we have 3 classes and choose W_AB and W_AC as the functions, an object belongs to class A if W_AB and W_AC are both positive; to class B if W_AB is negative and W_AC is greater than W_AB; and to class C if W_AC is negative and W_AB is greater than W_AC.
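A direct transcription of this three-class rule as a small illustrative helper (not from the slides), assuming the two scores W_AB and W_AC have already been computed as in the sketch above:

```python
def assign_class(w_ab: float, w_ac: float) -> str:
    """Assign an object to A, B or C from the two discriminant scores."""
    if w_ab > 0 and w_ac > 0:
        return "A"
    if w_ab < 0 and w_ac > w_ab:
        return "B"
    return "C"  # remaining case: w_ac < 0 and w_ab > w_ac

print(assign_class(1.2, 0.8))   # -> "A"
print(assign_class(-0.5, 0.3))  # -> "B"
print(assign_class(0.4, -1.0))  # -> "C"
```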

24

25 Mahalanobis distance. Similar idea to the Euclidean distance, i.e. the distance to the centre of a class, but using the variance–covariance matrix for scaling.
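A sketch of the Mahalanobis distance to a class centre, with made-up data for one class; SciPy's mahalanobis helper expects the inverse of the variance–covariance matrix, and the Euclidean distance is printed alongside for comparison:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical class of 30 objects with 4 measurements each.
rng = np.random.default_rng(1)
class_A = rng.normal(loc=[5, 3, 1, 0.2], scale=[0.5, 0.4, 0.3, 0.1], size=(30, 4))

centre = class_A.mean(axis=0)
VI = np.linalg.inv(np.cov(class_A, rowvar=False))  # inverse variance-covariance matrix

x_new = np.array([5.1, 3.2, 1.3, 0.25])

d_mahal = mahalanobis(x_new, centre, VI)
d_euclid = np.linalg.norm(x_new - centre)
print(d_mahal, d_euclid)
```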

26

27

28

29

30 Many classical statistical methods were developed first in biology. Problem for chemists: the Mahalanobis distance requires more samples than variables, otherwise the variance–covariance matrix cannot be inverted. Spectroscopy, chromatography : often a huge number of measurements (variables) per sample.

31 Solutions Variable selection PCA prior to performing classification
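A sketch of the second solution (PCA scores fed into a classifier) using scikit-learn, with the Iris data standing in for a spectroscopic data set; the number of components retained here is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Reduce to a few PCs first, then classify the scores with LDA.
model = make_pipeline(PCA(n_components=2), LinearDiscriminantAnalysis())
model.fit(X, y)

# Resubstitution accuracy (optimistic; see the validation slides below).
print(model.score(X, y))
```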

32 Many diagnostics Modelling power of variables Discriminatory power of variables Quality of class model Probabilities of class membership Ambiguous classification : is analytical data good enough?

33 MANY SOPHISTICATIONS Large number of methods for classification based on LDA. Bayesian methods – based on prior probabilities. Methods that try to find optimal groupings before class modelling.

34 LOTS OF INFORMATION Class membership. Outliers. Whether a sample belongs to a new class. Is a class well defined, or are there subclasses, e.g. subspecies or species from different environments? What measurements are most useful for discrimination – can we reduce the number of measurements? Are there ambiguous samples, and if so do we need more or better measurements? Analysis of replicates: is our method repeatable enough, e.g. for clinical diagnostics?

35 SIMCA – sometimes used in chemometrics as an alternative. Soft Independent Modelling of Class Analogy.

36 Use PCA models

37 Use PCA to model each class independently. Choose the optimal number of PCs. Use the distance from the PC model as an indicator of class distance.
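A much-simplified sketch of the SIMCA idea (one PCA model per class, assignment by residual distance to each model); real SIMCA implementations add scaling and significance tests not shown here, and the number of PCs per class is chosen arbitrarily:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Fit one PCA model per class.
models = {c: PCA(n_components=2).fit(X[y == c]) for c in np.unique(y)}

def residual_distance(model, x):
    """Distance from the PC model: norm of the reconstruction residual."""
    x = x.reshape(1, -1)
    x_hat = model.inverse_transform(model.transform(x))
    return np.linalg.norm(x - x_hat)

x_new = X[60]  # a versicolor sample treated as a 'new' object
distances = {c: residual_distance(m, x_new) for c, m in models.items()}
print(distances, "-> assigned to class", min(distances, key=distances.get))
```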

38 VALIDATION OF A CLASS MODEL Procedure: establish a training set; assess the model with a test set; use the model on real data. Information: graphical – e.g. diagrams; quantitative – class distances; quantitative – probability of membership of a given class.
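A sketch of the training/test procedure with scikit-learn, reusing the PCA + LDA pipeline from the slide 31 sketch; the split proportion is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Training set to build the class model, test set to assess it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = make_pipeline(PCA(n_components=2), LinearDiscriminantAnalysis())
model.fit(X_train, y_train)
print("test-set accuracy:", model.score(X_test, y_test))
```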

39 Training set Test set

40 SUMMARY Cluster analysis – unsupervised pattern recognition Similarity measures Linkage Dendrograms Classification – supervised pattern recognition Class models Class distances Graphical methods

