Haim Kaplan and Uri Zwick

Clustering
Haim Kaplan and Uri Zwick
Algorithms in Action, Tel Aviv University, 2016
Last updated: April 9, 2016

Metric space
A set 𝑋 and a function 𝑑: 𝑋×𝑋 → ℝ≥0 such that:
𝑑(𝑥,𝑦) = 0 ⇔ 𝑥 = 𝑦
𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑥)
𝑑(𝑥,𝑦) ≤ 𝑑(𝑥,𝑧) + 𝑑(𝑧,𝑦)

Examples
𝐿2: 𝑑(𝑥,𝑦) = √((𝑥1 − 𝑦1)² + (𝑥2 − 𝑦2)²)
𝐿1: 𝑑(𝑥,𝑦) = |𝑥1 − 𝑦1| + |𝑥2 − 𝑦2|
𝐿∞: 𝑑(𝑥,𝑦) = max(|𝑥1 − 𝑦1|, |𝑥2 − 𝑦2|)
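These three example metrics are easy to state in code; a minimal Python sketch (the names l1, l2, linf are ours, for points given as coordinate pairs):

```python
import math

# The three example metrics on points in the plane.
def l2(x, y):
    """Euclidean (L2) distance."""
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

def l1(x, y):
    """Manhattan (L1) distance."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def linf(x, y):
    """Chebyshev (L-infinity) distance."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))
```

For example, for 𝑥 = (0,0) and 𝑦 = (3,4) these give 5.0, 7 and 4 respectively.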

(Finite) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤)

(Discrete) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤)
Could be a set of points in ℝ^𝑑 with Euclidean distances
Could be the vertices of a graph with the lengths of the shortest paths between them

k-centers
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes max_{𝑥∈𝐴} 𝑑(𝑥,𝐶)
Suppose 𝑘 = 2

k-centers (alternative formulation)
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 congruent disks (centered at points of 𝑋) of minimum radius 𝑟 that cover 𝐴

k-centers
NP-hard to approximate to within a factor of 2−𝜖 for any 𝜖 > 0 (simple reduction from Dominating Set)
For the (planar) Euclidean metric it is also NP-hard to approximate to within any factor smaller than 1.822

Farthest fit
Pick an arbitrary point 𝑥1 as the first center
Pick the point farthest away from 𝑥1 as the second center 𝑥2

Farthest fit
Pick an arbitrary point 𝑥1 as the first center
For 𝑗 = 2, …, 𝑘 pick 𝑥𝑗 to be the point farthest away from {𝑥1, …, 𝑥𝑗−1}
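The farthest-fit rule above can be sketched in a few lines of Python; a minimal sketch (the function name is ours), for any metric supplied as a dist function:

```python
def farthest_fit(points, k, dist):
    """Farthest-fit for k-centers: start from an arbitrary point, then
    repeatedly pick the point farthest from the centers chosen so far."""
    centers = [points[0]]                      # arbitrary first center
    # d[j] = distance from points[j] to its closest chosen center
    d = [dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        far = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[far])
        # Each new center can only shrink a point's distance to the centers.
        d = [min(d[j], dist(points[j], points[far])) for j in range(len(points))]
    return centers
```

On the line with points 0, 1, 10, 11 and 𝑘 = 2, this picks 0 as the arbitrary first center and then 11 as the farthest point from it.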

Example

Example
Let 𝑟 be the largest distance from a point to its closest of the 𝑘 chosen centers

What can we say about this?
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2

Proof
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2
We have 𝑘+1 points (the 𝑘 chosen centers and the farthest remaining point), each pair at distance ≥ 𝑟
In 𝑂𝑃𝑇, at least 2 of these points are assigned to the same center
This center must be at distance ≥ 𝑟/2 from at least one of them

k-medians
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes Σ_{𝑥∈𝐴} 𝑑(𝑥,𝐶)
Suppose 𝑘 = 2

1-median on the line
Where is the point that minimizes the sum of the distances? The median

1-median on the line
With an even number of points, every point between the two middle points is a 1-median
In higher dimensions it is not related to the median anymore, but we still use the name "k-median"

k-medians
We'll see a local search algorithm that guarantees an approximation ratio of 5 (with an 𝑂(𝑛𝑘) neighborhood size)
Can be improved to an approximation ratio of 3+𝜖 (with an 𝑂((𝑛𝑘)^{2/𝜖}) neighborhood size)
Using different techniques one can get a ratio of 1+√3+𝜖 (in 𝑛^𝑂(1/𝜖²) time)
NP-hard to get a ratio better than 1.736

Local search for k-medians
Start with an arbitrary set of 𝑘 centers
Swap a center with some point which is not a center if the sum of the distances decreases
Arya, Garg, Khandekar, Meyerson, Munagala, Pandit, Local search heuristics for k-median and facility location problems, SICOMP 2004
Gupta, Tangwongsan, Simpler analyses of local search algorithms for facility location, 2008
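A minimal Python sketch of this single-swap local search (naming is ours; it is deliberately naive and recomputes the full cost for every candidate swap):

```python
from itertools import product

def kmedian_cost(points, centers, dist):
    """Sum over all points of the distance to the closest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def local_search_kmedian(points, k, dist):
    """Single-swap local search for k-median: start with arbitrary centers
    and repeatedly swap one center for one non-center while the cost drops."""
    centers = list(points[:k])                 # arbitrary initial centers
    cost = kmedian_cost(points, centers, dist)
    improved = True
    while improved:
        improved = False
        for c, p in product(list(centers), points):
            if p in centers:
                continue                       # only swap in non-centers
            trial = [p if x == c else x for x in centers]
            trial_cost = kmedian_cost(points, trial, dist)
            if trial_cost < cost:              # accept any improving swap
                centers, cost, improved = trial, trial_cost, True
                break
    return centers, cost
```

On the line with points 0, 1, 2, 10, 11, 12 and 𝑘 = 2 it ends at centers 1 and 11 with cost 4, which here is optimal (the theorem only promises a factor of 5).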

Analysis
For each facility 𝑜1, 𝑜2, 𝑜3, 𝑜4 of 𝑂𝑃𝑇, consider the closest facility in the local solution 𝐿

Analysis
Let's assume for simplicity that this mapping is a matching

Analysis
Consider the swaps defined by this matching

Analysis
Consider the swap that brings in 𝑜2 and removes its matched local center (call it ℓ2):
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ2 + 𝑜2) − 𝑐𝑜𝑠𝑡(𝐿) ≤ ?

Analysis
Let 𝑂2 be the cluster of 𝑜2 in 𝑂𝑃𝑇, and let ℓ2 be the local center matched to 𝑜2. Then
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ2 + 𝑜2) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂2) − 𝑐𝑜𝑠𝑡_𝐿(𝑂2) + …

Analysis
For a point outside 𝑂2 that was served by the removed center, the figure labels its old service distance 𝐴, its new service distance 𝑁, and its distance 𝐵 to its own optimal center; 𝐶 and 𝐷 are the auxiliary distances in the figure. Chaining the triangle inequality:
𝑁 − 𝐴 ≤ 𝐵 + 𝐶 − 𝐴 ≤ 𝐵 + 𝐷 − 𝐴 ≤ 2𝐵
So each such point's cost increases by at most twice its 𝑂𝑃𝑇 cost

Analysis
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ2 + 𝑜2) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂2) − 𝑐𝑜𝑠𝑡_𝐿(𝑂2) + 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( )
(the elided set, highlighted in the figure, consists of the points outside 𝑂2 that were served by the removed center)

Analysis
Similarly for the swap that brings in 𝑜1 and removes its matched local center ℓ1:
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ1 + 𝑜1) − 𝑐𝑜𝑠𝑡(𝐿) ≤ ?

Analysis
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ1 + 𝑜1) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂1) − 𝑐𝑜𝑠𝑡_𝐿(𝑂1) + …

Analysis
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ℓ1 + 𝑜1) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂1) − 𝑐𝑜𝑠𝑡_𝐿(𝑂1) + 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( )

Analysis
Summing the inequalities over all 𝑘 swaps: the 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂𝑖) terms add up to 𝑂𝑃𝑇, the 𝑐𝑜𝑠𝑡_𝐿(𝑂𝑖) terms add up to 𝐿, and since the matching makes the elided sets disjoint, the 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( ) terms add up to at most 2·𝑂𝑃𝑇:
0 ≤ 𝑂𝑃𝑇 − 𝐿 + 2·𝑂𝑃𝑇, i.e. 𝐿 ≤ 3·𝑂𝑃𝑇

Analysis
What happens if this mapping is not a matching?

Analysis
Which swaps do we consider?

Analysis
We can always define a set of swaps such that:
Vertices of 𝐿 with in-degree ≥ 2 do not participate
Each vertex of 𝐿 participates in at most 2 swaps

Analysis
The same bound holds for each chosen swap (⋅ denotes the swapped-out local center in each case):
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ⋅ + 𝑜1) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂1) − 𝑐𝑜𝑠𝑡_𝐿(𝑂1) + 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( )
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ⋅ + 𝑜2) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂2) − 𝑐𝑜𝑠𝑡_𝐿(𝑂2) + 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( )
0 ≤ 𝑐𝑜𝑠𝑡(𝐿 − ⋅ + 𝑜3) − 𝑐𝑜𝑠𝑡(𝐿) ≤ 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂3) − 𝑐𝑜𝑠𝑡_𝐿(𝑂3) + 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( )

Analysis
Adding the inequality for 𝑜4 and summing: the 𝑐𝑜𝑠𝑡_𝑂𝑃𝑇(𝑂𝑖) terms add up to 𝑂𝑃𝑇 and the 𝑐𝑜𝑠𝑡_𝐿(𝑂𝑖) terms to 𝐿; since each vertex of 𝐿 participates in at most 2 swaps, each point is counted at most twice in the 2𝑐𝑜𝑠𝑡_𝑂𝑃𝑇( ) terms, so they add up to at most 4·𝑂𝑃𝑇:
0 ≤ 𝑂𝑃𝑇 − 𝐿 + 4·𝑂𝑃𝑇, i.e. 𝐿 ≤ 5·𝑂𝑃𝑇

Summary
To get a better result we replace more facilities in a single step
If we swap up to 𝑝 centers at a time, the approximation ratio is 3 + 2/𝑝

k-means
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes Σ_{𝑥∈𝐴} 𝑑²(𝑥,𝐶)

k-means
Can we use the previous algorithm? We can, but the analysis breaks: 𝑑²(𝑥,𝑦) is not a metric
For example, for points 𝑥, 𝑦, 𝑧 on a line with 𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑧) = 1 we get 𝑑²(𝑥,𝑧) = 4 > 𝑑²(𝑥,𝑦) + 𝑑²(𝑦,𝑧) = 2

Local search
The analysis of the "switching" algorithm generalizes (with some difficulties)
We get an approximation ratio of (3 + 2/𝑝)² for swaps of up to 𝑝 centers, which is 25 for a single swap

1-mean on the line
Where is the point 𝐴 that minimizes the sum of the squared distances
min_𝐴 (𝑥1 − 𝐴)² + (𝑥2 − 𝐴)² + … + (𝑥7 − 𝐴)² ?
𝐴 = (𝑥1 + 𝑥2 + … + 𝑥7) / 7
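Why the mean: the objective is a convex quadratic in 𝐴, so setting its derivative to zero gives the slide's answer:

```latex
f(A) = \sum_{i=1}^{7} (x_i - A)^2 ,\qquad
f'(A) = -2 \sum_{i=1}^{7} (x_i - A) = 0
\;\Longrightarrow\;
A = \frac{x_1 + x_2 + \dots + x_7}{7}.
```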

1-mean in higher-dimensional Euclidean space
It is the center of mass (the mean)
We will focus on the Euclidean metric

2-means in the plane
Fix the partition: to minimize the sum of squared distances, each center must be the mean of the points in its cluster

Lloyd's algorithm
Most frequently used clustering algorithm
Related to the EM (Expectation Maximization) algorithm for learning Gaussian Mixture Models (GMMs)

Lloyd's algorithm
Start with some arbitrary set of 𝑘 centers
Iterate:
Assign each point to its closest center
Recalculate centers: each new center is the mean of the points in a cluster
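The two iterated steps can be sketched as follows; a one-dimensional toy version in Python (naming is ours), using absolute difference as the distance:

```python
def lloyd(points, centers, iters=100):
    """Lloyd's algorithm in 1-D: alternately assign each point to its
    closest center and move each center to the mean of its cluster."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]
        if new == centers:      # no change: local optimum, terminate
            break
        centers = new
    return centers
```

Starting from centers 0 and 12 on the points 0, 1, 2, 10, 11, 12, it converges to 1.0 and 11.0 after one update.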

Example (k=3)

Pick initial centers

Assign each point to its closest center

Replace centers by clusters’ means

Assign each point to its closest center

Replace centers by clusters’ means

Assign each point to its closest center

Replace centers by clusters' means
No changes ⇒ terminate

Properties Very easy to implement Sum of squared distances always decreases (like local search)

Quality of the local optimum?
𝑘 = 3. In the figure, the local solution costs 𝐿 = 2𝑦²/4 while 𝑂𝑃𝑇 = 2𝑥²/4, so 𝐿/𝑂𝑃𝑇 = 𝑦²/𝑥²
This can be made as large as we want

Running time
In each step we have a partition of the points (each point goes with its closest center)
We cannot repeat a partition in 2 different iterations
So the number of iterations is bounded by the number of possible partitions of 𝑛 points into 𝑘 clusters: 𝑘^𝑛
Is this tight? Say for 𝑘 = 2?

Voronoi diagram
The Voronoi diagram of a set of points 𝑝1, 𝑝2, …, 𝑝𝑛 is a partition of the plane into 𝑛 cells; cell 𝑖 contains all points closest to 𝑝𝑖

Voronoi diagram

Voronoi partition
After each point picks its closest center, the partition is consistent with the Voronoi diagram of the centers (a Voronoi partition): each point is in the cell of its center

Assign each point to its closest center

Voronoi partitions of 2 centers
Is this a Voronoi partition?

Voronoi partitions of 2-centers This is not a Voronoi partition

Voronoi partition
We cannot have the same Voronoi partition in different iterations
So the total number of Voronoi partitions (with respect to every possible set of 𝑘 centers) is an upper bound on the number of iterations
How many partitions are consistent with a Voronoi diagram of 𝑘 points?

Voronoi partitions of 2 centers
How many partitions are Voronoi partitions of some 2 centers 𝑐1, 𝑐2?

Voronoi partitions
Define two 2-center sets 𝑐1, 𝑐2 and 𝑐3, 𝑐4 as equivalent if they induce the same partition

Voronoi partitions
Here is a pair of equivalent 2-center sets: 𝑐1, 𝑐2 and 𝑐3, 𝑐4

Counting Voronoi partitions
The number of equivalence classes of this relation equals the number of Voronoi partitions
So we want an upper bound on the number of equivalence classes of this relation

2 centers
A Voronoi partition corresponds to a line (hyperplane) separating the blue points from the red ones
We may assume the line touches 2 input points, so there are 𝑂(𝑛²) such lines

General technique
We model 3 centers (𝑥1,𝑦1), (𝑥2,𝑦2), (𝑥3,𝑦3) as a point (𝑥1,𝑦1,𝑥2,𝑦2,𝑥3,𝑦3) in ℝ⁶

Counting Voronoi partitions
Each input point 𝑝 and pair of centers 𝑐1, 𝑐2 define a surface 𝑆^𝑝_{𝑐1,𝑐2} containing all triples of centers in which the first 2 centers are equidistant from 𝑝
𝑆^𝑝_{𝑐1,𝑐2} consists of all points (𝑥1,𝑦1,𝑥2,𝑦2,𝑥3,𝑦3) in ℝ⁶ that satisfy:
(𝑝𝑥 − 𝑥1)² + (𝑝𝑦 − 𝑦1)² = (𝑝𝑥 − 𝑥2)² + (𝑝𝑦 − 𝑦2)²
On one side of the surface (𝑥1,𝑦1) is closer to 𝑝 than (𝑥2,𝑦2), and on the other side (𝑥2,𝑦2) is closer to 𝑝 than (𝑥1,𝑦1)

Counting Voronoi partitions
We get 𝑛·(3 choose 2) = 3𝑛 surfaces: one per input point and pair of centers
These surfaces partition ℝ⁶ into 𝑂(𝑛⁶) cells
Within a cell, the order of the distances from each input point to the centers is fixed
So all points in a cell correspond to equivalent sets of centers
This gives an upper bound of 𝑂(𝑛⁶) on the number of equivalence classes, and thereby on the number of iterations

Counting Voronoi partitions
(Figure: the surfaces 𝑆^𝑝_{𝑐1,𝑐2}, 𝑆^𝑝_{𝑐1,𝑐3}, 𝑆^𝑝_{𝑐2,𝑐3}; for all triples of centers in the marked region, 𝑝 will choose 𝑐1)

Counting Voronoi partitions
(Figure: the corresponding surfaces 𝑆^𝑞_{𝑐1,𝑐2}, 𝑆^𝑞_{𝑐1,𝑐3}, 𝑆^𝑞_{𝑐2,𝑐3} for a second point 𝑞)

Voronoi partitions
This argument works for any 𝑘 and 𝑑
We get that the number of iterations is 𝑂((𝑛𝑘²)^{𝑘𝑑})

Summary Very powerful in practice – one of the most common clustering algorithms A lot of effort has been made to speed it up

Speeding up using the triangle inequality
In each iteration we compute 𝑛𝑘 distances
How do we reduce the number of distances that we compute?

Speeding up using the triangle inequality (ver 1)
At the beginning of an iteration compute all distances between centers
If 𝑑(𝑐1,𝑐2) ≥ 2𝑑(𝑐1,𝑝) then 𝑑(𝑐1,𝑝) ≤ 𝑑(𝑝,𝑐2), so we can save the computation of 𝑑(𝑝,𝑐2):
𝑑(𝑐1,𝑐2) ≤ 𝑑(𝑐1,𝑝) + 𝑑(𝑝,𝑐2)
𝑑(𝑐1,𝑐2) − 𝑑(𝑐1,𝑝) ≤ 𝑑(𝑝,𝑐2)
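A sketch of the ver-1 filter inside the assignment step, in Python (naming is ours; dist is any metric). The counter shows how many distance computations the test saves:

```python
def assign_with_pruning(points, centers, dist):
    """Assignment step with the ver-1 triangle-inequality filter: if
    d(c1, c2) >= 2*d(c1, p) for p's current best center c1, then c2
    cannot be closer to p, so d(p, c2) need not be computed."""
    cc = [[dist(a, b) for b in centers] for a in centers]  # center-to-center
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, len(centers)):
            if cc[best][j] >= 2 * best_d:
                skipped += 1          # pruned: distance not computed
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped
```

For the line points 0, 1, 10, 11 with centers 0 and 10, the two points near 0 never have their distance to the far center computed.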

Speeding up using the triangle inequality (ver 2)
At the beginning of an iteration compute all distances between centers
Sort each row of this distance matrix
For a point 𝑝 previously assigned to center 𝑐, check the centers in the order they appear in the row of 𝑐
Stop when you reach a center 𝑐′ such that 𝑑(𝑐,𝑐′) ≥ 2𝑑(𝑝,𝑐)
(every later center 𝑐″ has 𝑑(𝑐,𝑐″) ≥ 𝑑(𝑐,𝑐′) ≥ 2𝑑(𝑝,𝑐), so by the ver-1 bound it cannot be closer to 𝑝 than 𝑐)

Results
A data set from a satellite image: 2,370,686 points, each with 6 brightness values

Results Total running time

Results Average # of comparisons per point in all iterations and in the last iteration