Clustering. Haim Kaplan and Uri Zwick, Algorithms in Action, Tel Aviv University, 2016. Last updated: April 9, 2016
Metric space: a set X and a function d : X × X → ℝ≥0 such that
d(x,y) = 0 ⇔ x = y
d(x,y) = d(y,x)
d(x,y) ≤ d(x,z) + d(z,y)
Examples:
L₂: d(x,y) = √((x₁−y₁)² + (x₂−y₂)²)
L₁: d(x,y) = |x₁−y₁| + |x₂−y₂|
L∞: d(x,y) = max(|x₁−y₁|, |x₂−y₂|)
(Finite) metric space: a complete weighted graph satisfying the triangle inequality d(v,w) ≤ d(v,u) + d(u,w)
(Discrete) metric space: a complete weighted graph satisfying the triangle inequality d(v,w) ≤ d(v,u) + d(u,w). Could be a set of points in ℝᵈ with Euclidean distances. Could be the vertices of a graph with the lengths of the shortest paths between them.
k-centers: Given a set A of n points of some metric space X, find a set C of k points of X minimizing max_{x∈A} d(x,C), where d(x,C) = min_{c∈C} d(x,c). Suppose k = 2.
k-centers (alternative formulation): Given a set A of n points of some metric space X, find a set C of k congruent disks of minimum radius r, centered at points of X, that cover A.
k-centers: NP-hard to approximate within a factor of 2−ε for any ε > 0 (simple reduction from dominating set). For the (planar) Euclidean metric it is also hard to approximate within a factor better than 1.822.
Farthest fit Pick an arbitrary point 𝑥 1 as the first center Pick the point farthest away from 𝑥 1 as the second center 𝑥 2
Farthest fit: Pick an arbitrary point x₁ as the first center. For j = 2,…,k pick xⱼ to be the point farthest away from x₁,…,x_{j−1}, i.e. the point maximizing min_{i<j} d(x, xᵢ).
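The procedure above (often called farthest-first traversal) can be sketched as follows for Euclidean points; the function name and the tuple representation are our own choices, not from the slides.

```python
import math

def farthest_first(points, k):
    # Pick an arbitrary first center, then repeatedly pick the point
    # farthest from all centers chosen so far.
    centers = [points[0]]
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        j = max(range(len(points)), key=lambda i: dist[i])
        centers.append(points[j])
        dist = [min(dist[i], math.dist(p, points[j]))
                for i, p in enumerate(points)]
    return centers
```

After the loop, max(dist) is the radius r of the returned solution, which the next slides analyze.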
Example
Example: r is the largest distance from a point to its closest chosen center.
What can we say about this? Theorem: OPT ≥ r/2, where r is the radius achieved by farthest fit, so the algorithm is a 2-approximation.
Proof (Theorem: OPT ≥ r/2): the k chosen centers, together with the farthest remaining point, are pairwise at distance ≥ r.
Proof: We have k+1 points (the k centers plus the farthest point), each pair at distance ≥ r. In OPT, by the pigeonhole principle, at least 2 of these points are assigned to the same center. By the triangle inequality, this center must be at distance ≥ r/2 from at least one of them.
k-medians: Given a set A of n points of some metric space X, find a set C of k points of X minimizing ∑_{x∈A} d(x,C). Suppose k = 2.
1-median on the line: Where is the point that minimizes the sum of the distances? The median.
1-median on the line: all points here are 1-medians. Where is the point that minimizes the sum of the distances? In higher dimensions it is no longer related to the median, but we still use the name “k-median”.
k-medians: We’ll see a local search algorithm that guarantees an approximation ratio of 5 (neighborhood size O(nk)). This can be improved to 3+ε (neighborhood size O((nk)^{2/ε})). Using different techniques one can get a ratio of 1+√3+ε (in n^{O(1/ε²)} time). It is NP-hard to get a ratio better than 1.736.
Local search for k-medians: Start with an arbitrary set of k centers. While swapping a center with some non-center point decreases the sum of the distances, perform the swap.
Arya, Garg, Khandekar, Meyerson, Munagala, Pandit, Local search heuristics for k-median and facility location problems, SICOMP 2004.
Gupta, Tangwongsan, Simpler analysis of local search algorithms for facility location, 2008.
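A minimal sketch of the single-swap local search, assuming Euclidean points given as coordinate tuples; the function names are our own, and a practical version would only accept swaps that improve the cost by a (1−ε) factor to bound the number of iterations.

```python
import math
from itertools import product

def cost(points, centers):
    # k-medians objective: sum of distances to the nearest center
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def local_search_kmedian(points, k):
    centers = list(points[:k])              # arbitrary initial centers
    improved = True
    while improved:
        improved = False
        cur = cost(points, centers)
        # try every (center, non-center point) swap: O(nk) neighborhood
        for i, p in product(range(k), points):
            if p in centers:
                continue
            cand = centers[:i] + [p] + centers[i + 1:]   # swap center i for p
            if cost(points, cand) < cur:
                centers, improved = cand, True
                break
    return centers
```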
Analysis: map each facility of OPT (o₁, o₂, o₃, o₄ in the figure) to the closest facility in the local solution L.
Analysis: Let’s assume for simplicity that this mapping is a matching.
Analysis: Consider the swaps defined by this matching.
Analysis: consider the swap that removes ℓ₂, the local center matched to o₂, and adds o₂. Since L is locally optimal, 0 ≤ cost(L − ℓ₂ + o₂) − cost(L) ≤ ?
Analysis: 0 ≤ cost(L − ℓ₂ + o₂) − cost(L) ≤ cost_OPT(O₂) − cost_L(O₂) + …, where ℓ₂ is the local center matched to o₂, O₂ is the cluster of o₂ in OPT, cost_OPT(O₂) is the cost of the points of O₂ in OPT, and cost_L(O₂) is their cost in L. (Reassign the points of O₂ to o₂; the remaining points of ℓ₂ still need a center.)
Analysis: consider a point p served by the removed center ℓ₂ whose OPT center o is not o₂; reassign p to the local center ℓ closest to o. Let A = d(p, ℓ₂), B = d(p, o), C = d(o, ℓ) and D = d(o, ℓ₂). The increase N − A in p’s cost satisfies N − A ≤ B + C − A ≤ B + D − A ≤ B + (A + B) − A = 2B, using C ≤ D (ℓ is closest to o) and the triangle inequality D ≤ B + A.
Analysis: 0 ≤ cost(L − ℓ₂ + o₂) − cost(L) ≤ cost_OPT(O₂) − cost_L(O₂) + 2·cost_OPT(N_{ℓ₂} ∖ O₂), where N_{ℓ₂} is the set of points served by ℓ₂ in L: each such point pays at most 2B extra, B being its cost in OPT.
Analysis: similarly, for the swap removing ℓ₁, the local center matched to o₁, and adding o₁: 0 ≤ cost(L − ℓ₁ + o₁) − cost(L) ≤ ?
Analysis: 0 ≤ cost(L − ℓ₁ + o₁) − cost(L) ≤ cost_OPT(O₁) − cost_L(O₁) + 2·cost_OPT(N_{ℓ₁} ∖ O₁).
Analysis: sum the k inequalities, one per swap:
0 ≤ cost(L − ℓᵢ + oᵢ) − cost(L) ≤ cost_OPT(Oᵢ) − cost_L(Oᵢ) + 2·cost_OPT(N_{ℓᵢ} ∖ Oᵢ), for i = 1,…,k.
The terms cost_OPT(Oᵢ) sum to OPT, the terms cost_L(Oᵢ) sum to cost(L), and since the mapping is a matching the sets N_{ℓᵢ} ∖ Oᵢ are disjoint, so the last terms sum to at most 2·OPT. Hence 0 ≤ OPT − cost(L) + 2·OPT, i.e. cost(L) ≤ 3·OPT.
Analysis: What happens if this is not a matching?
Analysis: Which swaps do we consider?
Analysis: we can always define a set of swaps such that local centers with in-degree ≥ 2 (under the closest-facility mapping) do not participate, and each local center participates in at most 2 swaps.
Analysis: summing the inequalities
0 ≤ cost(L − ℓ + oᵢ) − cost(L) ≤ cost_OPT(Oᵢ) − cost_L(Oᵢ) + 2·cost_OPT(N_ℓ ∖ Oᵢ)
over the swaps for o₁, o₂, o₃, o₄: the cost_OPT(Oᵢ) terms sum to OPT and the cost_L(Oᵢ) terms to cost(L), as before, but now each local center appears in at most 2 swaps, so the last terms sum to at most 2·2·OPT = 4·OPT. Hence cost(L) ≤ 5·OPT.
Summary: To get a better result we swap more facilities in a single step. If we swap up to p centers at a time, the approximation ratio becomes 3 + 2/p.
k-means: Given a set A of n points of some metric space X, find a set C of k points of X minimizing ∑_{x∈A} d²(x,C).
k-means: Can we use the previous algorithm? We can, but the analysis breaks: d²(x,y) is not a metric. For three collinear points x, y, z with d(x,y) = d(y,z) = 1 we get d²(x,z) = 4 > d²(x,y) + d²(y,z) = 2, violating the triangle inequality.
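A quick numeric check of this counterexample on the line:

```python
# d(x, y) = |x - y| is a metric on the line, but d^2 is not:
# with x = 0, y = 1, z = 2 the triangle inequality fails for d^2.
def d2(a, b):
    return (a - b) ** 2

x, y, z = 0, 1, 2
assert d2(x, z) == 4 and d2(x, y) + d2(y, z) == 2
assert d2(x, z) > d2(x, y) + d2(y, z)   # triangle inequality violated
```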
Local search: The analysis of the “switching” algorithm generalizes (with some difficulties). We get an approximation ratio of 25 for a single switch, and (3 + 2/p)² for a switch of up to p centers.
1-mean on the line: Where is the point A that minimizes the sum of the squared distances, min_A (x₁−A)² + (x₂−A)² + … + (x₇−A)²? Setting the derivative to zero gives A = (x₁ + x₂ + … + x₇)/7, the mean.
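A small numeric sanity check of this fact; the sample values below are made up for illustration.

```python
# The mean minimizes the sum of squared distances on the line.
xs = [1.0, 2.0, 4.0, 8.0, 9.0, 11.0, 14.0]
A = sum(xs) / len(xs)   # the mean, here 7.0

def sq_cost(a):
    return sum((x - a) ** 2 for x in xs)

# any other candidate has at least the cost of the mean
for a in [A - 1.0, A + 0.5, min(xs), max(xs)]:
    assert sq_cost(A) <= sq_cost(a)
```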
1-mean in Euclidean space of higher dimension: it is the center of mass (the mean). From here on we focus on the Euclidean metric.
2-means in the plane: Fix the partition; to minimize the sum of squared distances, each center must be the mean of the points in its cluster.
Lloyd’s algorithm: The most frequently used clustering algorithm. Related to the EM (Expectation Maximization) algorithm for learning Gaussian Mixture Models (GMMs).
Lloyd’s algorithm: Start with some arbitrary set of k centers. Iterate: (1) assign each point to its closest center; (2) recalculate centers: each new center is the mean of the points in its cluster.
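The two-step iteration above can be sketched compactly for points in the plane, assuming Euclidean distance; the function and variable names are our own.

```python
import math
from statistics import mean

def lloyd(points, centers, iters=100):
    # points and centers: lists of (x, y) tuples
    for _ in range(iters):
        # 1. assign each point to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # 2. replace each center by the mean of its cluster
        #    (an empty cluster keeps its old center)
        new = [(mean(x for x, _ in cl), mean(y for _, y in cl)) if cl else c
               for cl, c in zip(clusters, centers)]
        if new == centers:   # no change: terminate
            break
        centers = new
    return centers
```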
Example (k=3)
Pick initial centers
Assign each point to its closest center
Replace centers by clusters’ means
Assign each point to its closest center
Replace centers by clusters’ means
Assign each point to its closest center
Replace centers by clusters’ means. No changes: terminate.
Properties Very easy to implement Sum of squared distances always decreases (like local search)
Quality of the local optimum? For k = 3 (figure: groups of points at scales x, y, z) Lloyd’s algorithm can converge to a local optimum L with L/OPT = (2·y²/4)/(2·x²/4) = y²/x², which can be made as large as we want.
Running time: In each step we have a partition of the points, by closest center. We cannot repeat a partition in 2 different iterations, so the number of iterations is bounded by the number of possible partitions of n points into k clusters: kⁿ. Is this tight? Say, for k = 2?
Voronoi diagram: The Voronoi diagram of a set of points p₁, p₂, …, pₙ is a partition of the plane into n cells; cell i contains all points closest to pᵢ.
Voronoi diagram
Voronoi partition: After each point picks its closest center, the partition is consistent with the Voronoi diagram of the centers (a Voronoi partition): each point lies in the cell of its center.
Assign each point to its closest center
Voronoi partitions of 2-centers Is this a Voronoi partition ? 𝑐 2 𝑐 1
Voronoi partitions of 2-centers This is not a Voronoi partition
Voronoi partition: We cannot have the same Voronoi partition in different iterations, so the total number of Voronoi partitions (with respect to every possible set of k centers) is an upper bound on the number of iterations. How many partitions are consistent with a Voronoi diagram of k points?
Voronoi partitions of 2-centers How many partitions are Voronoi partitions of some 2 centers 𝑐 1 , 𝑐 2 ? 𝑐 1 𝑐 2
Voronoi partitions: Define two pairs of 2-centers, (c₁, c₂) and (c₃, c₄), as equivalent if they induce the same partition.
Voronoi partitions: Here is a pair of equivalent 2-centers: (c₁, c₂) and (c₃, c₄).
Counting Voronoi partitions: The number of equivalence classes of this relation equals the number of Voronoi partitions, so we want an upper bound on the number of equivalence classes.
2 centers: A Voronoi partition corresponds to a line (hyperplane) separating the blue points from the red. We may assume the line touches 2 input points, so there are O(n²) such lines.
General technique: We model 3 centers (x₁,y₁), (x₂,y₂), (x₃,y₃) as a single point (x₁,y₁,x₂,y₂,x₃,y₃) in ℝ⁶.
Counting Voronoi partitions: Each input point p and pair of centers c₁, c₂ define a surface S^p_{c₁,c₂} containing all triples of centers whose first two centers are equidistant from p. S^p_{c₁,c₂} consists of all points (x₁,y₁,x₂,y₂,x₃,y₃) in ℝ⁶ that satisfy (p_x − x₁)² + (p_y − y₁)² = (p_x − x₂)² + (p_y − y₂)². On one side of the surface (x₁,y₁) is closer to p than (x₂,y₂), and on the other side (x₂,y₂) is closer to p than (x₁,y₁).
Counting Voronoi partitions: We get n·C(3,2) = 3n surfaces, one for each input point and pair of centers. These constant-degree surfaces partition ℝ⁶ into O(n⁶) cells. Within a cell, the order of the distances from each input point to the centers is fixed, so all points of a cell correspond to equivalent centers. This gives an upper bound of O(n⁶) on the number of equivalence classes, and thereby on the number of iterations.
Counting Voronoi partitions: the surfaces S^p_{c₁,c₂}, S^p_{c₁,c₃}, S^p_{c₂,c₃}; for all triples of centers in the marked cell, p will choose c₁.
Counting Voronoi partitions: the surfaces S^q_{c₁,c₂}, S^q_{c₁,c₃}, S^q_{c₂,c₃} for a second input point q.
Counting Voronoi partitions
Voronoi partitions: This argument works for any k and d. We get that the number of Voronoi partitions, and hence the number of iterations, is O((nk²)^{kd}).
Summary: Very powerful in practice, one of the most common clustering algorithms. A lot of effort has been made to speed it up.
Speeding up using the triangle inequality: In each iteration we compute nk distances (every point to every center). How do we reduce the number of distances that we compute?
Speeding up using the triangle inequality (ver 1): At the beginning of an iteration compute all distances between centers. By the triangle inequality, d(c₁,c₂) ≤ d(c₁,p) + d(p,c₂), so d(c₁,c₂) − d(c₁,p) ≤ d(p,c₂). Hence if d(c₁,c₂) ≥ 2·d(c₁,p) then d(c₁,p) ≤ d(p,c₂), and we can skip the computation of d(p,c₂).
Speeding up using the triangle inequality (ver 2): At the beginning of an iteration compute all distances between centers and sort each row of this distance matrix. For a point p previously assigned to center c, check the centers in the order they appear in the row of c. Stop when you reach a center c′ such that d(c,c′) ≥ 2·d(p,c).
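A sketch of the ver-2 pruning, assuming Euclidean points as tuples; the early break relies on the fact that d(c,c′) ≥ 2·d(p,c) implies d(p,c′) ≥ d(p,c). Function names are our own.

```python
import math

def assign_with_pruning(points, centers, prev_assign):
    # Precompute center-to-center distances and sort each row.
    k = len(centers)
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)]
          for i in range(k)]
    order = [sorted(range(k), key=lambda j: cc[i][j]) for i in range(k)]
    assign = []
    for p, c in zip(points, prev_assign):
        d_pc = math.dist(p, centers[c])     # distance to the previous center
        best, best_d = c, d_pc
        for j in order[c]:
            if cc[c][j] >= 2 * d_pc:
                # every remaining center c' has d(c, c') >= 2 d(p, c),
                # hence d(p, c') >= d(p, c): it cannot beat the best so far
                break
            d = math.dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        assign.append(best)
    return assign
```

The result matches a brute-force nearest-center assignment while typically computing far fewer point-to-center distances.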
Results: A data set from a satellite image: 2,370,686 points, each with 6 brightness values.
Results Total running time
Results: Average number of comparisons per point, over all iterations and in the last iteration.