
1 The K-Means Method in Cluster Analysis and Its Intelligentization. B.G. Mirkin, Professor, Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics, Moscow, Russia; Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK

2 Outline:
- Clustering as empirical classification
- K-Means and its issues: (1) determining K and initialization; (2) weighting variables
- Addressing (1): data recovery clustering and K-Means (Mirkin 1987, 1990); one-by-one clustering: Anomalous Patterns and iK-Means; other approaches; computational experiment
- Addressing (2): three-stage K-Means; Minkowski K-Means; computational experiment
- Conclusion

3 WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

4 Referred recent work:
B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41.
B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260.
B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075.

5 What is clustering? Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis.

6 Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996). Pluto does not fit into either of the two clusters of planets: it originated another cluster (September 2006).

7 Example: a few clusters. Clustering interface to WEB search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster  # sites  Interpretation
1        24       Society, religion (Israel and Judaism; Judaica collection)
2        12       Middle East, war, history (The state of Israel; Arabs and Palestinians)
3        31       Economy, travel (Israel Hotel Association; Electronics in Israel)

8 Clustering algorithms: nearest neighbour, agglomerative clustering, divisive clustering, conceptual clustering, K-Means, Kohonen SOM, spectral clustering, …

9 Batch K-Means: a generic clustering method. Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the thus obtained clusters
3. Iterate 1 and 2 until convergence
[Figure: points (*) with K = 3 hypothetical centroids (@)]


12 K-Means: a generic clustering method. Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the thus obtained clusters
3. Iterate 1 and 2 until convergence
4. Output final centroids and clusters
[Figure: final centroids (@) and their clusters]
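A minimal NumPy sketch of the batch procedure just described; the data, the random seeding, and the function name batch_kmeans are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def batch_kmeans(Y, centroids, max_iter=100):
    """Y: (N, V) data matrix; centroids: (K, V) initial seeds."""
    for _ in range(max_iter):
        # 1. Minimum distance rule: assign each point to its nearest centroid.
        d2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Move centroids to the gravity centres of the obtained clusters
        #    (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if (labels == k).any() else centroids[k]
            for k in range(len(centroids))
        ])
        # 3. Iterate until convergence (no centroid moves).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output final centroids and clusters.
    return centroids, labels

rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 2))
c0 = Y[rng.choice(len(Y), size=3, replace=False)]  # K = 3 hypothetical seeds
centroids, labels = batch_kmeans(Y, c0)
```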

13 K-Means criterion: summary distance to cluster centroids. Minimize
W(S, c) = Σ_k Σ_{i∈S_k} d(y_i, c_k),
where d is the squared Euclidean distance, S = {S_1, …, S_K} the partition, and c_k the centroid of S_k. [Figure: clusters with their centroids (@).]
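The criterion can be computed directly; a short sketch (the function name kmeans_criterion is an assumption) using squared Euclidean distances:

```python
import numpy as np

def kmeans_criterion(Y, labels, centroids):
    """W(S, c): sum over clusters k and entities i in S_k of d(y_i, c_k)."""
    return sum(((Y[labels == k] - c) ** 2).sum()
               for k, c in enumerate(centroids))
```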

14 Advantages of K-Means:
- models typology building
- simple "data recovery" criterion
- computationally efficient
- can be utilised incrementally, 'on-line'
Shortcomings of K-Means:
- initialisation: no advice on K or on initial centroids
- convergence only to a local, not necessarily deep, minimum
- no defence against irrelevant features

15 Initial centroids: correct (two-cluster case). [Figure only.]

16 Initial centroids: correct. [Figures: initial and final positions.]

17 Different initial centroids. [Figure only.]

18 Different initial centroids: wrong. [Figures: initial and final positions.]

19 (1) To address:
- the number of clusters K (issue: the criterion always decreases with K, W_K < W_{K-1}, so it cannot choose K by itself)
- the initial setting
- a deeper minimum
The latter two are interrelated: a good initial setting leads to a deeper minimum.

20 Number K: conventional approach. Take a range RK of K, say K = 3, 4, …, 15. For each K ∈ RK, run K-Means 100-200 times from randomly chosen initial centroids and take the best result, W(S, c) = W_K. Compare W_K over all K ∈ RK in a special way and choose the best K; for instance:
- Gap statistic (2001)
- Jump statistic (2003)
- Hartigan (1975): in the ascending order of K, pick the first K at which H_K = (W_K / W_{K+1} - 1)(N - K - 1) ≤ 10
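A sketch of this conventional procedure with Hartigan's rule, reusing batch_kmeans and kmeans_criterion from the sketches above; the restart count follows the slide, the range of K and the helper names are assumptions (N must exceed the largest K tried).

```python
import numpy as np

def best_WK(Y, K, restarts=100, seed=0):
    """Best (smallest) W_K over random initialisations."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(restarts):
        c0 = Y[rng.choice(len(Y), size=K, replace=False)]
        centroids, labels = batch_kmeans(Y, c0)
        best = min(best, kmeans_criterion(Y, labels, centroids))
    return best

def hartigan_K(Y, K_range=range(3, 16)):
    N = len(Y)
    # W_K is needed for one K beyond the range, for the last ratio.
    W = {K: best_WK(Y, K) for K in list(K_range) + [max(K_range) + 1]}
    for K in K_range:
        H = (W[K] / W[K + 1] - 1.0) * (N - K - 1)
        if H <= 10:            # first K at which H_K <= 10
            return K
    return max(K_range)
```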

21 (1) Addressing:
- the number of clusters
- the initial setting
with a PCA-like method in the data recovery approach.

22 Representing a partition. Cluster k: centroid c_kv (v a feature); binary 1/0 membership z_ik (i an entity).

23 Basic equations (same as for PCA, but the score vectors z_k are constrained to be binary):
y_iv = Σ_k c_kv z_ik + e_iv,
where y is a data entry, z a 1/0 membership (not a score), c a cluster centroid, N a cardinality; i an entity, v a feature/category, k a cluster.

24 Quadratic data scatter decomposition (Pythagorean):
Σ_{i,v} y_iv² = Σ_k N_k Σ_v c_kv² + Σ_{i,v} e_iv².
K-Means is an alternating least-squares minimisation of the residual term (y data entry, z 1/0 membership, c cluster centroid, N_k cluster cardinality; i entity, v feature/category, k cluster).
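A quick numeric check of the decomposition on made-up data: with centroids at the cluster means, the data scatter splits exactly into the explained and unexplained parts.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(20, 3))
labels = np.repeat([0, 1], 10)                       # two arbitrary clusters
centroids = np.array([Y[labels == k].mean(axis=0) for k in range(2)])

T = (Y ** 2).sum()                                   # data scatter
B = sum((labels == k).sum() * (c ** 2).sum()         # explained: sum_k N_k |c_k|^2
        for k, c in enumerate(centroids))
W = sum(((Y[labels == k] - c) ** 2).sum()            # unexplained: W(S, c)
        for k, c in enumerate(centroids))
assert np.isclose(T, B + W)                          # Pythagorean decomposition
```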

25 Equivalent criteria (1):
A. Bilinear residuals squared: minimise Σ_{i,v} e_iv² — minimising the difference between the data and the cluster structure.
B. Distance-to-centre squared: minimise Σ_k Σ_{i∈S_k} d(y_i, c_k) — the same difference, expressed through distances.

26 Equivalent criteria (2):
C. Within-group error squared: minimise Σ_k Σ_{i∈S_k} Σ_v (y_iv − c_kv)² — minimising the difference between the data and the cluster structure.
D. Within-group variance: minimise Σ_k N_k σ_k² — minimising the within-cluster variance.

27 Equivalent criteria (3):
E. Semi-averaged within-cluster distance squared: minimise Σ_k Σ_{i,j∈S_k} d(y_i, y_j) / (2N_k) — minimising dissimilarities within clusters.
F. Semi-averaged within-cluster similarity: maximise Σ_k Σ_{i,j∈S_k} <y_i, y_j> / N_k — maximising similarities within clusters.
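A numeric check, on made-up data, that criteria B and E coincide: the summary squared distance to cluster centroids equals the semi-averaged sum of within-cluster pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(15, 2))
labels = np.arange(15) % 3                     # three clusters of five points

B = E = 0.0
for k in range(3):
    Yk = Y[labels == k]
    B += ((Yk - Yk.mean(axis=0)) ** 2).sum()   # criterion B: distances to centroid
    pair = ((Yk[:, None, :] - Yk[None, :, :]) ** 2).sum(axis=2)
    E += pair.sum() / (2 * len(Yk))            # criterion E: semi-averaged pairwise
assert np.isclose(B, E)
```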

28 Equivalent criteria (4):
G. Distant centroids: maximise Σ_k N_k <c_k, c_k> — finding anomalous types.
H. Consensus partition: maximise the summary correlation between the sought partition and the given variables.

29 Equivalent criteria (5):
I. Spectral clusters: maximise the summary Rayleigh quotient Σ_k z_k' Y Y' z_k / (z_k' z_k) over binary vectors z_k.

30 PCA-inspired Anomalous Pattern clustering:
y_iv = c_v z_i + e_iv, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S.
With the squared Euclidean distance, the least-squares fit makes N_S·<c, c> maximal, pushing the centroid c of S away from the reference point: S must be anomalous, that is, interesting.
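A sketch of extracting one Anomalous Pattern under this model, assuming the data have been centred so that the reference point is the origin; keeping the seed inside the pattern is a simplification added here to guarantee a non-empty cluster.

```python
import numpy as np

def anomalous_pattern(Y):
    """One Anomalous Pattern: index set S and its centroid c.
    Y is assumed centred, so the reference point is the origin."""
    seed = (Y ** 2).sum(axis=1).argmax()        # entity farthest from the origin
    c = Y[seed].copy()
    while True:
        # Minimum distance rule between the pattern centroid and the origin.
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        in_S[seed] = True                       # keep the pattern non-empty
        new_c = Y[in_S].mean(axis=0)            # update the centroid
        if np.allclose(new_c, c):
            return np.flatnonzero(in_S), new_c
        c = new_c
```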

31 Initial setting with an Anomalous Pattern cluster. [Figure: "Tom Sawyer" illustration.]

32 Anomalous Pattern clusters: iterate. [Figure: "Tom Sawyer" illustration with clusters 1, 2, 3 extracted around reference point 0.]

33 iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (but how can one know that 2 is right?). [Figure: final clustering.]

34 iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering:
- find all Anomalous Pattern clusters;
- remove the smaller (e.g., singleton) clusters;
- set K to the number of remaining clusters and initialise K-Means with their centres.
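A sketch combining these steps, built on the anomalous_pattern and batch_kmeans sketches above; the size threshold min_size and the centring by the grand mean are assumed details.

```python
import numpy as np

def ik_means(Y, min_size=2):
    """iK-Means sketch: iterated Anomalous Patterns, then K-Means."""
    Y = Y - Y.mean(axis=0)              # reference point: the grand mean
    remaining = np.arange(len(Y))
    centres = []
    while len(remaining) > 0:
        S, c = anomalous_pattern(Y[remaining])
        if len(S) >= min_size:          # remove smaller (e.g. singleton) clusters
            centres.append(c)
        remaining = np.delete(remaining, S)
    # K is the number of remaining clusters; their centres initialise K-Means.
    return batch_kmeans(Y, np.array(centres))
```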

35 Study of eight number-of-clusters methods (joint work with Mark Chiang):
- variance based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS)
- structure based: Silhouette Width (SW)
- consensus based: Consensus Distribution area (CD), Consensus Distribution mean (DD)
- sequential extraction of APs (iK-Means): Least Squares (LS), Least Moduli (LM)

36 Experimental results for 9 Gaussian clusters (3 spread patterns), data size 1000 x 15. [Table: for each method (HK, CH, JS, SW, CD, DD, LS, LM), the estimated number of clusters and the Adjusted Rand Index under large and small spreads; 1-, 2- and 3-time winners are marked, with two winners counted each time; numeric entries lost in the transcript.]

37 [Figure-only slide; no text content.]

38 (2) To address: weighting features according to relevance, with w the feature weights = scale factors. Three-step K-Means:
- given s and c, find w (weights);
- given w and c, find s (clusters);
- given s and w, find c (centroids);
- repeat until convergence.
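A sketch of this three-step loop. The weight update w_v = 1 / Σ_u (D_v/D_u)^{1/(β−1)}, with D_v the within-cluster dispersion of feature v, follows the weighted K-Means literature (cf. the Mirkin-Amorim 2012 reference) rather than these slides, so treat it as an assumption; β > 1 is required.

```python
import numpy as np

def three_step_kmeans(Y, centroids, beta=2.0, max_iter=100):
    N, V = Y.shape
    K = len(centroids)
    w = np.full(V, 1.0 / V)                      # start from equal weights
    labels = np.zeros(N, dtype=int)
    for _ in range(max_iter):
        # Given w and c, find s: weighted minimum distance rule.
        d = ((w[None, None, :] ** beta) *
             np.abs(Y[:, None, :] - centroids[None, :, :]) ** beta).sum(axis=2)
        new_labels = d.argmin(axis=1)
        # Given s and w, find c: cluster means (exact centres for beta = 2).
        centroids = np.array([
            Y[new_labels == k].mean(axis=0) if (new_labels == k).any()
            else centroids[k] for k in range(K)
        ])
        # Given s and c, find w: features with small dispersion get big weights.
        D = np.array([sum(np.abs(Y[new_labels == k, v] - centroids[k, v]) ** beta
                          for k in range(K)) for v in range(V)]) + 1e-12
        w = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
        if (new_labels == labels).all():         # till convergence
            break
        labels = new_labels
    return labels, centroids, w
```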

39 Minkowski centres: for each feature, minimise over c
d(c) = Σ_{i∈S} |y_i − c|^β.
At β > 1, d(c) is convex, so a gradient method applies.
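A sketch of computing a Minkowski centre for a single feature: since d(c) is convex for β > 1 and its minimiser lies within the data range, any one-dimensional convex minimiser works; a ternary search is used here instead of the gradient method named on the slide.

```python
import numpy as np

def minkowski_centre(y, beta=1.5, tol=1e-9):
    """Minimise d(c) = sum_i |y_i - c|**beta over c (beta > 1)."""
    lo, hi = y.min(), y.max()
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if (np.abs(y - m1) ** beta).sum() < (np.abs(y - m2) ** beta).sum():
            hi = m2                    # minimum lies left of m2
        else:
            lo = m1                    # minimum lies right of m1
    return (lo + hi) / 2

y = np.array([0.0, 1.0, 1.2, 1.4, 5.0])
print(minkowski_centre(y, beta=1.5))   # between the median (beta -> 1) and the mean (beta = 2)
```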

40 Minkowski metric effects:
- the more uniform the distribution of the entities over a feature, the smaller its weight; a uniform distribution gives w = 0;
- the best Minkowski power β is data dependent;
- the best β can be learnt from the data in a semi-supervised manner (with clustering of all objects).
Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record).

41 Conclusion: the data-recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.
Further work:
- extending the approach to other data types: text, sequence, image, web page;
- upgrading K-Means to address the interpretation of the results.
[Diagram: Data → Coder → Clusters (model) → Decoder → Data recovery.]

42 HEFCE survey of students' satisfaction. HEFCE method: ALL: 93 of the highest mark; STRATA: 43 best, ranging from 71.8 to 84.6.

