1 Data clustering: Topics of Current Interest
Boris Mirkin (1,2)
1 National Research University Higher School of Economics, Moscow, RF
2 Birkbeck University of London, UK
Supported by:
- “Teacher-Student” grants from the Research Fund of NRU HSE, Moscow (2011-2013)
- International Lab for Decision Analysis and Choice, NRU HSE, Moscow (2008-present)
- Laboratory of Algorithms and Technologies for Networks Analysis, NRU HSE, Nizhniy Novgorod, Russia (2010-present)

2 Data clustering: Topics of Current Interest
1. K-Means clustering and two issues
   1. Finding the right number of clusters
      1. Before clustering (anomalous)
      2. While clustering (divisive; no minima of the density function)
   2. Weighting features (3-step iterations)
2. K-Means at similarity clustering (kernel K-Means)
3. Semi-average similarity clustering
4. Consensus clustering
5. Spectral clustering, Threshold clustering and Modularity clustering
6. Laplacian pseudo-inverse transformation
7. Conclusion

3 Batch K-Means: a generic clustering method
Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1. and 2. until convergence.
[Figure: scatter of points (*) with K = 3 hypothetical centroids (@)]
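The procedure above, as a minimal NumPy sketch; the function name and the random choice of seed entities are illustrative assumptions, not part of the talk.

```python
import numpy as np

def batch_kmeans(Y, K, max_iter=100, seed=0):
    """Batch K-Means: assign points by the minimum-distance rule, move centroids
    to the gravity centres, repeat until convergence."""
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=K, replace=False)]  # step 0: K seeds
    for _ in range(max_iter):
        # step 1: minimum-distance rule
        dist = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 2: centroids at the gravity centres of the obtained clusters
        new_centroids = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # step 3: iterate until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```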

6 K-Means: a generic clustering method
Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1. and 2. until convergence.
4. Output the final centroids and clusters.
[Figure: the converged clusters around the final centroids (@)]

7 K-Means criterion: the summary distance to cluster centroids, to be minimized.
[Figure: clusters with their centroids (@)]
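The minimized quantity was shown as a formula image on the slide; in the notation used elsewhere in the talk (entities y_i, clusters S_k, centroids c_k), the standard summary-distance K-Means criterion reads:

$$ W(S, c) \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k) \;\longrightarrow\; \min_{S,\, c}. $$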

8 Advantages of K-Means:
- Models typology building
- Simple “data recovery” criterion
- Computationally effective
- Can be utilised incrementally, ‘on-line’
Shortcomings of K-Means:
- Initialisation: no advice on K or on the initial centroids
- No guarantee of reaching deep minima (only local ones)
- No defence against irrelevant features

11 Preprocess the data by centring at a reference point, typically the grand mean; after centring, 0 is the grand mean. Build just one Anomalous cluster.

12 Preprocess the data by centring at a reference point, typically the grand mean; after centring, 0 is the grand mean. Build the Anomalous cluster:
1. Initial centre c is the entity farthest away from 0.
2. Cluster update: if d(y_i, c) < d(y_i, 0), assign y_i to S.
3. Centroid update: within-S mean c'; if c' ≠ c, go to 2 with c ← c'. Otherwise, halt.
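A minimal sketch of the Anomalous Cluster extraction just described, assuming the data matrix Y is already centred at the grand mean (so the reference point is 0); names are illustrative.

```python
import numpy as np

def anomalous_cluster(Y):
    """Extract one Anomalous Cluster from data centred at the reference point 0."""
    d0 = np.linalg.norm(Y, axis=1)        # d(y_i, 0) for all entities
    c = Y[d0.argmax()].copy()             # 1. initial centre: the entity farthest from 0
    while True:
        in_S = np.linalg.norm(Y - c, axis=1) < d0   # 2. assign i to S if d(y_i, c) < d(y_i, 0)
        c_new = Y[in_S].mean(axis=0)                # 3. within-S mean
        if np.allclose(c_new, c):                   # centre unchanged: halt
            return in_S, c
        c = c_new                                   # otherwise go to 2 with the new centre
```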

13 Anomalous Cluster is (almost) K-Means, up to:
(i) the number of clusters is K = 2: the “anomalous” one and the “main body” of entities around 0;
(ii) the centre of the “main body” cluster is forcibly kept at 0;
(iii) the entity farthest away from 0 initialises the anomalous cluster.

14 Anomalous Cluster → iK-Means is superior to competing methods for choosing the number of clusters (Chiang, Mirkin, 2010).
[The comparison results were shown as a table on the slide]

15 Issue: weighting features according to relevance, and the Minkowski β-distance (Amorim, Mirkin, 2012)
w: feature weights = scale factors.
3-step K-Means (the distance and weight-update formulas are sketched below):
- Given s, c, find w (weights)
- Given w, c, find s (clusters)
- Given s, w, find c (centroids)
- iterate until convergence
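For reference, one common way to write the weighted Minkowski distance and the closed-form weight update used in such three-step iterations, in the spirit of Amorim and Mirkin (2012); the exact exponents below are an assumption rather than a quotation from the slides:

$$ d_\beta(y_i, c_k) = \sum_v w_{kv}^{\beta}\,|y_{iv} - c_{kv}|^{\beta}, \qquad w_{kv} = \Bigl[\sum_u (D_{kv}/D_{ku})^{1/(\beta-1)}\Bigr]^{-1}, \qquad D_{kv} = \sum_{i \in S_k} |y_{iv} - c_{kv}|^{\beta}. $$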

16 Issue: weighting features according to relevance, and the Minkowski β-distance (2)
Minkowski centres: minimize d(c) over c.
At β > 1, d(c) is convex, so a gradient method applies (a sketch follows below).
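A minimal sketch of computing a Minkowski centre of one cluster by gradient descent, exploiting the convexity of d(c) = Σ_i Σ_v |x_iv - c_v|^β at β > 1; the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def minkowski_center(X, beta=1.5, lr=0.01, n_steps=2000):
    """Minimize d(c) = sum_i sum_v |x_iv - c_v|**beta over c by gradient descent
    (d is convex for beta > 1)."""
    c = X.mean(axis=0)                    # start from the ordinary mean
    for _ in range(n_steps):
        diff = c - X                      # shape (n, p)
        grad = beta * np.sum(np.sign(diff) * np.abs(diff) ** (beta - 1), axis=0)
        c -= lr * grad
    return c
```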

17 Issue: weighting features according to relevance, and the Minkowski β-distance (3)
Minkowski metric effects:
- The more uniform the distribution of the entities over a feature, the smaller its weight; a uniform distribution gives w = 0.
- The best Minkowski power β is data dependent.
- The best β can be learnt from data in a semi-supervised manner (with clustering of all objects).
- Example: on Fisher’s Iris data, iMWK-Means makes only 5 errors (a record).

18 K-Means kernelized (1)

19 K-Means kernelized (2)
K-Means equivalent criterion: find the partition {S_1, ..., S_K} to maximize
G(S_1, ..., S_K) = Σ_k a(S_k)·|S_k|, where a(S_k) is the within-cluster mean similarity.
Mirkin (1976, 1996, 2012): build the partition {S_1, ..., S_K} by finding one cluster at a time.

20 K-Means kernelized (3)
K-Means equivalent criterion, one cluster S at a time: maximize g(S) = a(S)·|S|, where a(S) is the within-cluster mean similarity.
AddRemAdd(i): a local-search algorithm that builds S by adding/removing one entity at a time (a simplified sketch follows below).
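A hedged sketch of the add/remove local search for g(S) = a(S)·|S|; this is a simplified version of the AddRemAdd(i) idea (grow S from the single entity i, flipping one entity at a time while the criterion improves), not the exact procedure of the talk.

```python
import numpy as np

def add_rem(A, i):
    """Greedy local search for the semi-average criterion g(S) = a(S)*|S|,
    started from S = {i}; flips one entity in or out while g improves."""
    n = A.shape[0]
    S = np.zeros(n, dtype=bool)
    S[i] = True

    def g(mask):
        idx = np.flatnonzero(mask)
        if len(idx) < 2:
            return 0.0
        sub = A[np.ix_(idx, idx)]
        a = (sub.sum() - np.trace(sub)) / (len(idx) * (len(idx) - 1))  # within-S mean similarity
        return a * len(idx)

    best, improved = g(S), True
    while improved:
        improved = False
        for j in range(n):
            trial = S.copy()
            trial[j] = not trial[j]           # add j if outside S, remove it if inside
            if trial[i] and g(trial) > best:  # keep the seed entity i in S
                S, best, improved = trial, g(trial), True
    return np.flatnonzero(S), best
```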

21 K-Means kernelized (4)
Semi-average criterion: max g(S) = a(S)·|S|, where a(S) is the within-cluster mean similarity, with AddRemAdd(i).
(1) Spectral: max s^T A s / (s^T s) over a relaxed indicator vector s (a Rayleigh quotient).
(2) Tight: the average similarity between S and j is
    > a(S)/2 if j ∈ S,
    < a(S)/2 if j ∉ S.

22 Three extensions to the entire data set (a sketch of the Additive extension follows below)
Partitional: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I;
  3. Remove S from I; if I is not empty, go to 1; else halt.
Additive: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i ∈ I;
  3. Subtract a(S)·ss^T from A; if the stop condition does not hold, go to 1; else halt.
Explorative: take the set of all entities I.
  1. Compute S(i) = AddRem(i) for all i ∈ I;
  2. Leave those S(i) that do not overlap much.
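A sketch of the Additive extension above, reusing the add_rem routine sketched earlier; stopping after a fixed number of extracted clusters is an assumed stop condition.

```python
import numpy as np

def additive_extract(A, n_clusters=3):
    """Repeatedly take the best one-cluster solution over all starting entities
    and subtract its contribution a(S)*s*s^T from the similarity matrix."""
    A = A.astype(float).copy()
    clusters = []
    for _ in range(n_clusters):
        candidates = [add_rem(A, i) for i in range(A.shape[0])]   # steps 1-2
        idx, score = max(candidates, key=lambda c: c[1])
        s = np.zeros(A.shape[0])
        s[idx] = 1.0
        a = score / len(idx)                 # a(S), since the score is a(S)*|S|
        A -= a * np.outer(s, s)              # step 3: subtract a(S)*s*s^T
        clusters.append(idx)
    return clusters
```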

23 Consensus partition (1): given partitions R1, R2, ..., Rn, find an “average” R.

24 Consensus partition (2): given partitions R1, R2, ..., Rn, find an “average” R.
This is equivalent to maximizing:
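The maximized criterion was a formula image on the slide. One plausible reading, given the semi-average machinery used throughout the talk, is in terms of the co-association (consensus) matrix a(i, j), the number of given partitions in which i and j fall into the same class; the exact normalization here is an assumption:

$$ \max_{S_1,\dots,S_K} \; \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i,\, j \in S_k} a(i, j), \qquad a(i, j) = \bigl|\{\, m : i \text{ and } j \text{ are in the same class of } R_m \,\}\bigr|. $$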

25 Consensus partition (3): given partitions R1, R2, ..., Rn, find an “average” R.
Mirkin, Shestakov (2013):
(1) This is superior to a number of contemporary consensus clustering approaches.
(2) Consensus clustering of the results of multiple runs of K-Means recovers clusters better than the best single K-Means run.

26 Additive clustering (1)
Given a similarity matrix A = (A(i,j)), find clusters u^1 = (u_i^1), u^2 = (u_i^2), ..., u^K = (u_i^K):
- u_i^k either 1 or 0: crisp clusters;
- 0 ≤ u_i^k ≤ 1: fuzzy clusters;
- λ_1 u^1, λ_2 u^2, ..., λ_K u^K: intensities.
Additive model: A(i,j) = λ_1^2 u_i^1 u_j^1 + ... + λ_K^2 u_i^K u_j^K + E(i,j); minimize ||E||^2.
Shepard, Arabie 1979 (presented 1973); Mirkin 1987 (1976 in Russian).

27 Additive clustering (2)

28 Additive clustering (3)

29 Different criteria (1)
Summary uniform (Mirkin 1976, in Russian): maximize the within-S sum of A(i,j) - π, for a similarity threshold π. Relates to the criteria considered above.
Summary modular (Newman 2004): maximize the within-S sum of A(i,j) - B(i,j), where B(i,j) = A(i,+)A(+,j)/A(+,+).
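A small sketch computing both within-cluster scores from this slide for a candidate cluster S (given as a boolean mask or an index array); function names are illustrative.

```python
import numpy as np

def uniform_score(A, S, pi):
    """Summary uniform criterion: within-S sum of A(i,j) - pi."""
    sub = A[np.ix_(S, S)]
    return float((sub - pi).sum())

def modular_score(A, S):
    """Summary modular criterion: within-S sum of A(i,j) - B(i,j),
    with the null model B(i,j) = A(i,+)*A(+,j)/A(+,+)."""
    B = np.outer(A.sum(axis=1), A.sum(axis=0)) / A.sum()
    return float((A - B)[np.ix_(S, S)].sum())
```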

30 Different criteria (2)

31 FADDIS: Fuzzy Additive Spectral Clustering
Spectral: B = pseudo-inverse Laplacian of A; one cluster at a time.
Min ||B - λ^2 u u^T||^2 (one cluster to find).
Residual similarity: B ← B - λ^2 u u^T.
Stopping conditions.
Equivalent: Rayleigh quotient to maximize, max u^T B u / u^T u [this follows from the model, in contrast to the very popular, yet purely heuristic, approach by Shi and Malik 2000].
Experimentally demonstrated: competitive over
- ordinary graphs for community detection;
- conventional (dis)similarity data;
- affinity data (kernel transformations of feature-space data);
- in-house synthetic data generators.
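A minimal sketch of one relaxed extraction step in the spirit of FADDIS: the leading eigenvector of B maximizes the Rayleigh quotient u^T B u / u^T u; clipping it to nonnegative values to obtain a fuzzy membership is a simplification here, not the exact FADDIS rule.

```python
import numpy as np

def one_cluster_spectral_step(B):
    """Take the leading eigenvector of a symmetric B as a relaxed membership vector,
    clip it to be nonnegative, and return the residual similarity B - lam2 * u u^T."""
    vals, vecs = np.linalg.eigh(B)
    u = vecs[:, -1]                       # eigenvector of the largest eigenvalue
    if u.sum() < 0:                       # resolve the sign ambiguity
        u = -u
    u = np.clip(u, 0.0, None)
    u /= np.linalg.norm(u)                # unit-norm fuzzy membership
    lam2 = float(u @ B @ u)               # intensity via the Rayleigh quotient
    residual = B - lam2 * np.outer(u, u)  # B <- B - lam2 * u u^T
    return u, lam2, residual
```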

32 Competitive at:
- community detection in ordinary graphs;
- conventional similarity data;
- affinity similarity data;
- Lapin-transformed similarity data: D = diag(B·1_N), L = I - D^{-1/2} B D^{-1/2}, L^+ = pinv(L).
There are examples on which Lapin does not work.
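A NumPy sketch of the Lapin (Laplacian pseudo-inverse) transformation exactly as written on the slide; the function name is illustrative.

```python
import numpy as np

def lapin(B):
    """D = diag(B*1_N), L = I - D^{-1/2} B D^{-1/2}, return pinv(L)."""
    d = B.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(B.shape[0]) - d_inv_sqrt @ B @ d_inv_sqrt
    return np.linalg.pinv(L)
```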

33 An example on which Lapin does work, but the square-error criterion does not.

34 Conclusion
Clustering is still far from a mathematical theory; however, it is getting meaty:
+ Gaussian kernels, bringing in distributions;
+ the Laplacian transformation, bringing in dynamics.
To make it into a theory, there is a way to go:
- modelling dynamics;
- compatibility across multiple data and metadata;
- interpretation.

