1 Clustering
Haim Kaplan and Uri Zwick, Algorithms in Action, Tel Aviv University, 2016. Last updated: April.

2 Metric space
A set 𝑋 and a function 𝑑:𝑋×𝑋→ℝ≥0 such that:
𝑑(𝑥,𝑦) = 0 ⇔ 𝑥 = 𝑦
𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑥)
𝑑(𝑥,𝑦) ≤ 𝑑(𝑥,𝑧) + 𝑑(𝑧,𝑦)

3 Examples
𝐿2: 𝑑(𝑥,𝑦) = √((𝑥1−𝑦1)² + (𝑥2−𝑦2)²)
𝐿1: 𝑑(𝑥,𝑦) = |𝑥1−𝑦1| + |𝑥2−𝑦2|
𝐿∞: 𝑑(𝑥,𝑦) = max(|𝑥1−𝑦1|, |𝑥2−𝑦2|)
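For concreteness, the three planar metrics above can be written as plain Python functions (a minimal sketch; points are (x, y) pairs):

```python
import math

def l2(x, y):
    """Euclidean distance."""
    return math.sqrt((x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2)

def l1(x, y):
    """Manhattan distance."""
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def linf(x, y):
    """Chebyshev distance."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

p, q = (0, 0), (3, 4)
print(l2(p, q), l1(p, q), linf(p, q))  # 5.0 7 4
```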

4 (Finite) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤).

5 (Discrete) Metric space
A complete weighted graph satisfying the triangle inequality 𝑑(𝑣,𝑤) ≤ 𝑑(𝑣,𝑢) + 𝑑(𝑢,𝑤). Could be a set of points in ℝ^𝑑 with Euclidean distances, or vertices of a graph with the lengths of the shortest paths between them as distances.

6 k-centers
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes max_{𝑥∈𝐴} 𝑑(𝑥,𝐶). Suppose 𝑘 = 2.

9 k-centers (alt. formulation)
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 congruent disks (centered at points of 𝑋) of minimum radius 𝑟 that cover 𝐴.

10 k-centers
NP-hard to approximate to within a factor of 2−𝜖 for any 𝜖 > 0 (simple reduction from dominating set). For the (planar) Euclidean metric it is also hard to approximate to within a factor better than 1.822.

11 Farthest fit
Pick an arbitrary point 𝑥1 as the first center.
Pick the point farthest away from 𝑥1 as the second center 𝑥2.

12 Farthest fit
Pick an arbitrary point 𝑥1 as the first center.
For 𝑗 = 2…𝑘, pick 𝑥𝑗 to be the point farthest away from 𝑥1,…,𝑥𝑗−1.
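The procedure above in Python (a minimal sketch; the point set and the choice of math.dist as the metric are placeholders):

```python
import math

def farthest_fit(points, k, dist):
    """Greedy farthest-fit for k-centers: pick an arbitrary first center,
    then repeatedly pick the point farthest from the centers chosen so far."""
    centers = [points[0]]                       # arbitrary first center
    d = [dist(p, centers[0]) for p in points]   # distance to nearest center
    for _ in range(1, k):
        j = max(range(len(points)), key=lambda i: d[i])
        centers.append(points[j])
        d = [min(d[i], dist(points[i], points[j])) for i in range(len(points))]
    return centers

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(farthest_fit(pts, 2, math.dist))  # [(0, 0), (10, 1)]
```

Maintaining, for every point, its distance to the nearest chosen center makes each of the 𝑘 rounds a single pass over the points.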

13–18 Example
(Figures: farthest fit run on an example point set, picking the centers one by one; 𝑟 denotes the largest distance from a point to its nearest chosen center, i.e. the cost of the computed solution.)

19 What can we say about this?
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2.

20–23 Proof
(Figures: each center was picked as the farthest point from the previous centers, so each pick is at distance ≥ 𝑟 from all earlier ones; together with the farthest remaining point this gives 𝑘+1 points that are pairwise at distance ≥ 𝑟.)

24 Proof
Theorem: 𝑂𝑃𝑇 ≥ 𝑟/2.
We have 𝑘+1 points, each pair at distance ≥ 𝑟. In 𝑂𝑃𝑇, at least 2 of these points are assigned to the same center. This center must be at distance ≥ 𝑟/2 from at least one of them. Since the solution computed by farthest fit has cost 𝑟, farthest fit is a 2-approximation.

25 k-medians
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes Σ_{𝑥∈𝐴} 𝑑(𝑥,𝐶). Suppose 𝑘 = 2.

29 1-median on the line
Where is the point that minimizes the sum of the distances? The median.

30 1-median on the line
All points here are 1-medians: with an even number of points, every point between the two middle ones minimizes the sum of distances. In higher dimensions the minimizer is no longer related to the median, but we still use the name “k-median”.
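A quick numerical check that the median minimizes the sum of distances on the line (a toy example; the sample values are arbitrary):

```python
def sum_of_distances(points, a):
    """Total distance from candidate center a to all points."""
    return sum(abs(x - a) for x in points)

pts = [1, 2, 3, 10, 20]
median = sorted(pts)[len(pts) // 2]
best = min(range(0, 21), key=lambda a: sum_of_distances(pts, a))
print(median, best)  # 3 3 -- the median is the minimizer
```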

31 k-medians
We’ll see a local search algorithm that guarantees an approximation ratio of 5 (neighborhood size 𝑂(𝑛𝑘)).
This can be improved to an approximation ratio of 3+𝜖 (neighborhood size 𝑂((𝑛𝑘)^{2/𝜖})).
Using different techniques one can get an even better ratio (in 𝑛^{𝑂(1/𝜖)} time).
NP-hard to get a ratio better than 1.736.

32 Local search for k-medians
Start with an arbitrary set of 𝑘 centers. Swap a center with some point which is not a center if the sum of the distances decreases.
Arya, Garg, Khandekar, Meyerson, Munagala, Pandit, Local search heuristics for facility location problems, SICOMP 2004.
Gupta, Tangwongsan, Simpler analysis of local search algorithms, 2008.
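A direct sketch of this local search in Python (for illustration only: exhaustive single swaps on a toy instance, not the implementations from the cited papers):

```python
import itertools

def cost(points, centers, dist):
    """k-median objective: sum of distances to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def local_search_kmedians(points, k, dist):
    """Single-swap local search: swap a center with a non-center
    while the sum of distances decreases."""
    centers = list(points[:k])            # arbitrary initial centers
    improved = True
    while improved:
        improved = False
        for c, p in itertools.product(list(centers), points):
            if p in centers:
                continue
            candidate = [p if x == c else x for x in centers]
            if cost(points, candidate, dist) < cost(points, centers, dist):
                centers = candidate
                improved = True
    return centers

pts = [0, 1, 2, 100, 101, 102]
print(sorted(local_search_kmedians(pts, 2, lambda a, b: abs(a - b))))  # [1, 101]
```

On this instance the only local optimum is {1, 101}, which is also the global optimum; in general the theorem above only guarantees a factor of 5.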

33 Analysis
For each facility in 𝑂𝑃𝑇 (𝑜1, 𝑜2, 𝑜3, 𝑜4 in the figure), consider the closest facility in the local solution 𝐿.

34 Analysis
Let’s assume for simplicity that this mapping is a matching.

35 Analysis
Consider the swaps defined by this matching.

36–40 Analysis
By local optimality, 0 ≤ cost(𝐿 − ℓ + 𝑜2) − cost(𝐿), where ℓ is the local center matched with 𝑜2 (shown only graphically in the slides). How large can cost(𝐿 − ℓ + 𝑜2) − cost(𝐿) be?

41 Analysis
0 ≤ cost(𝐿 − ℓ + 𝑜2) − cost(𝐿) ≤ cost_𝑂𝑃𝑇(𝑂2) − cost_𝐿(𝑂2) + …
Here 𝑂2 is the cluster of 𝑜2 in 𝑂𝑃𝑇: after the swap its points can be served by 𝑜2, changing their cost by at most cost_𝑂𝑃𝑇(𝑂2) − cost_𝐿(𝑂2); the remaining clients of ℓ must be rerouted (bounded next).

42–45 Analysis
Consider a client 𝑝 of ℓ whose 𝑂𝑃𝑇 center is some 𝑜1 ≠ 𝑜2. Reroute 𝑝 to the local center closest to 𝑜1 (under the matching assumption this center is not the removed ℓ). Let 𝐴 = 𝑑(𝑝, ℓ), 𝐵 = 𝑑(𝑝, 𝑜1), 𝐶 = the distance from 𝑜1 to its closest local center, 𝐷 = 𝑑(𝑜1, ℓ), and 𝑁 = the new cost of 𝑝. Then
𝑁 − 𝐴 ≤ 𝐵 + 𝐶 − 𝐴 ≤ 𝐵 + 𝐷 − 𝐴 ≤ 2𝐵,
since 𝐶 ≤ 𝐷 by the choice of the closest local center, and 𝐷 ≤ 𝐴 + 𝐵 by the triangle inequality.

46 Analysis
0 ≤ cost(𝐿 − ℓ + 𝑜2) − cost(𝐿) ≤ cost_𝑂𝑃𝑇(𝑂2) − cost_𝐿(𝑂2) + 2·cost_𝑂𝑃𝑇(clients of ℓ outside 𝑂2).

47–53 Analysis
The same argument applied to the swap that inserts 𝑜1 gives
0 ≤ cost(𝐿 − ℓ1 + 𝑜1) − cost(𝐿) ≤ cost_𝑂𝑃𝑇(𝑂1) − cost_𝐿(𝑂1) + 2·cost_𝑂𝑃𝑇(clients of ℓ1 outside 𝑂1),
where ℓ1 is the local center matched with 𝑜1.

54 Analysis
Summing the inequality
0 ≤ cost(𝐿 − ℓ𝑖 + 𝑜𝑖) − cost(𝐿) ≤ cost_𝑂𝑃𝑇(𝑂𝑖) − cost_𝐿(𝑂𝑖) + 2·cost_𝑂𝑃𝑇(clients of ℓ𝑖 outside 𝑂𝑖)
over all 𝑘 swaps: the first terms sum to 𝑂𝑃𝑇 − cost(𝐿), and since each client is rerouted at most once the rerouting terms sum to at most 2·𝑂𝑃𝑇. Hence 0 ≤ 𝑂𝑃𝑇 − cost(𝐿) + 2·𝑂𝑃𝑇, i.e. cost(𝐿) ≤ 3·𝑂𝑃𝑇.

55 Analysis
What happens if this is not a matching?

56 Analysis
Which swaps do we consider?

57 Analysis
We can always define a set of swaps such that:
Vertices of 𝐿 with in-degree ≥ 2 do not participate.
Each vertex of 𝐿 participates in at most 2 swaps.

58 Analysis
For each chosen swap we get, as before,
0 ≤ cost(𝐿 − ℓ + 𝑜𝑖) − cost(𝐿) ≤ cost_𝑂𝑃𝑇(𝑂𝑖) − cost_𝐿(𝑂𝑖) + 2·cost_𝑂𝑃𝑇(clients of ℓ outside 𝑂𝑖).

59 Analysis
Summing over all the inequalities (𝑖 = 1, …, 4 in the figure): the first terms again sum to 𝑂𝑃𝑇 − cost(𝐿), but now each client may be rerouted up to twice, so the rerouting terms sum to at most 4·𝑂𝑃𝑇. Hence cost(𝐿) ≤ 5·𝑂𝑃𝑇.

60 Summary
To get a better result we replace more facilities in a single step. If we swap up to 𝑝 centers at a time, the approximation ratio is 3 + 2/𝑝.

61 k-means
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes Σ_{𝑥∈𝐴} 𝑑²(𝑥,𝐶).

62 k-means
Given a set of 𝑛 points 𝐴 of some metric space 𝑋, find a set 𝐶 of 𝑘 points in 𝑋 that minimizes Σ_{𝑥∈𝐴} 𝑑²(𝑥,𝐶). Can we use the previous algorithm? We can, but the analysis breaks: 𝑑²(𝑥,𝑦) is not a metric. For example, for three collinear points 𝑥, 𝑦, 𝑧 with 𝑑(𝑥,𝑦) = 𝑑(𝑦,𝑧) = 1 we get 𝑑²(𝑥,𝑧) = 4 > 𝑑²(𝑥,𝑦) + 𝑑²(𝑦,𝑧) = 2.

63 Local search
The analysis of the “switching” algorithm generalizes (with some difficulties). We get an approximation ratio of 25 for a single switch, and (3 + 2/𝑝)² for a switch of up to 𝑝 centers.

64 1-mean on the line
Where is the point 𝐴 that minimizes the sum of the squared distances, min_𝐴 (𝑥1−𝐴)² + (𝑥2−𝐴)² + … + (𝑥7−𝐴)²? The mean: 𝐴 = (𝑥1 + 𝑥2 + … + 𝑥7)/7. (Setting the derivative to zero: −2·Σ𝑖(𝑥𝑖−𝐴) = 0 ⇒ 𝐴 = (1/7)·Σ𝑖 𝑥𝑖.)

65 1-mean in Euclidean space of higher dimension
It is the center of mass (the mean). We will focus on the Euclidean metric.

66 2-means in the plane
Fix the partition: to minimize the sum of squared distances, each center must be the mean of the points in its cluster.

67 Lloyd’s algorithm
The most frequently used clustering algorithm. Related to the EM (Expectation Maximization) algorithm for learning Gaussian Mixture Models (GMMs).

68 Lloyd’s algorithm
Start with some arbitrary set of 𝑘 centers. Iterate:
Assign each point to its closest center.
Recalculate centers: each new center is the mean of the points in its cluster.
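The two alternating steps above, sketched in Python (a minimal version; initializing with a random sample of the input points is an arbitrary choice, not part of the algorithm's definition):

```python
import math
import random

def lloyd(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment step and the mean step
    until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # arbitrary initial centers
    for _ in range(iters):
        # Assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Replace each center by the mean of the points in its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:             # no changes: terminate
            break
        centers = new_centers
    return centers
```

On a well-separated instance such as `[(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]` with 𝑘 = 2, this converges to the two cluster means.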

69 Example (k=3)

70 Pick initial centers

71 Assign each point to its closest center

72 Replace centers by clusters’ means

73 Assign each point to its closest center

74 Replace centers by clusters’ means

75 Assign each point to its closest center

76 Replace centers by clusters’ means
No changes → terminate.

77 Properties Very easy to implement
Sum of squared distances always decreases (like local search)

78 Quality of the local opt?
(Figure: an instance with 𝑘 = 3 and two spacings 𝑥 ≪ 𝑦 for which Lloyd’s algorithm can converge to a local optimum whose cost exceeds 𝑂𝑃𝑇 by a factor on the order of 𝑦²/𝑥², which can be made as large as we want.)

79 Running time
In each step we have a partition of the points, by closest center. We cannot repeat a partition in 2 different iterations, since the cost strictly decreases. So the running time is bounded by the number of possible partitions of 𝑛 points into 𝑘 clusters: 𝑘^𝑛. Is this tight? Say for 𝑘 = 2?

80 Voronoi diagram
The Voronoi diagram of a set of points 𝑝1, 𝑝2, …, 𝑝𝑛 is a partition of the plane into 𝑛 cells, where cell 𝑖 contains all points closest to 𝑝𝑖.

81 Voronoi diagram

82 Voronoi partition After each point picks its closest center the partition is consistent with the Voronoi diagram of the centers (Voronoi partition) Each point is in the cell of its center

83 Assign each point to its closest center

84 Voronoi partitions of 2-centers
Is this a Voronoi partition of the centers 𝑐1, 𝑐2?

85 Voronoi partitions of 2-centers
This is not a Voronoi partition

86 Voronoi partition We cannot have the same Voronoi partition in different iterations So the total # of Voronoi partitions (with respect to every possible set of 𝑘 centers) is an upper bound on the running time How many partitions are consistent with a Voronoi diagram of 𝑘 points ?

87 Voronoi partitions of 2-centers
How many partitions are Voronoi partitions of some 2 centers 𝑐1, 𝑐2?

88 Voronoi partitions
Define 2-centers (𝑐1, 𝑐2) and (𝑐3, 𝑐4) as equivalent if they induce the same partition.

89 Voronoi partitions
Here is a pair of equivalent 2-centers: (𝑐1, 𝑐2) and (𝑐3, 𝑐4).

90 Counting Voronoi partitions
The number of equivalence classes of this relation equals the number of Voronoi partitions, so we want an upper bound on the number of equivalence classes.

91 2 centers
A Voronoi partition corresponds to a line (hyperplane) separating the blue points from the red points. We may assume the line touches 2 input points, so there are 𝑂(𝑛²) such lines.

92 General technique
We model 3 centers (𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3) as a point (𝑥1, 𝑦1, 𝑥2, 𝑦2, 𝑥3, 𝑦3) in ℝ⁶.

93 Counting Voronoi partitions
Each point 𝑝 of the input and 2 centers 𝑐1, 𝑐2 define a surface 𝑆^𝑝_{𝑐1,𝑐2} containing all triples of centers in which the first 2 centers are equidistant from 𝑝. That is, 𝑆^𝑝_{𝑐1,𝑐2} consists of all points (𝑥1, 𝑦1, 𝑥2, 𝑦2, 𝑥3, 𝑦3) in ℝ⁶ that satisfy
(𝑝𝑥 − 𝑥1)² + (𝑝𝑦 − 𝑦1)² = (𝑝𝑥 − 𝑥2)² + (𝑝𝑦 − 𝑦2)².
On one side of the surface (𝑥1, 𝑦1) is closer to 𝑝 than (𝑥2, 𝑦2), and on the other side (𝑥2, 𝑦2) is closer to 𝑝 than (𝑥1, 𝑦1).

94 Counting Voronoi partitions
We get 𝑂(𝑛) surfaces, one for each input point and pair of centers. These surfaces partition ℝ⁶ into 𝑂(𝑛⁶) cells. Within a cell, the order of the centers by distance from each input point is fixed, so all points in a cell correspond to equivalent sets of centers. This gives an upper bound of 𝑂(𝑛⁶) on the number of equivalence classes, and thereby on the running time.

95 Counting Voronoi partitions
(Figure: the surfaces 𝑆^𝑝_{𝑐1,𝑐2}, 𝑆^𝑝_{𝑐1,𝑐3}, 𝑆^𝑝_{𝑐2,𝑐3}; for all triples of centers in the marked region, 𝑝 will choose 𝑐1.)

96 Counting Voronoi partitions
(Figure: the corresponding surfaces 𝑆^𝑞_{𝑐1,𝑐2}, 𝑆^𝑞_{𝑐1,𝑐3}, 𝑆^𝑞_{𝑐2,𝑐3} for another input point 𝑞.)

97 Counting Voronoi partitions

98 Voronoi partitions
This argument works for any 𝑘 and 𝑑: we get that the complexity is 𝑂((𝑛𝑘²)^{𝑘𝑑}).

99 Summary
Very powerful in practice – one of the most common clustering algorithms. A lot of effort has been made to speed it up.

100 Speeding up using the triangle inequality
In each iteration we compute 𝑛𝑘 distances. How do we reduce the number of distances that we compute?

101 Speeding up using the triangle inequality (ver 1)
At the beginning of an iteration, compute all distances between centers. If 𝑑(𝑐1, 𝑐2) ≥ 2𝑑(𝑐1, 𝑝) then 𝑑(𝑐1, 𝑝) ≤ 𝑑(𝑝, 𝑐2), so we can save the computation of 𝑑(𝑝, 𝑐2). Indeed, 𝑑(𝑐1, 𝑐2) ≤ 𝑑(𝑐1, 𝑝) + 𝑑(𝑝, 𝑐2), so 𝑑(𝑝, 𝑐2) ≥ 𝑑(𝑐1, 𝑐2) − 𝑑(𝑐1, 𝑝) ≥ 2𝑑(𝑐1, 𝑝) − 𝑑(𝑐1, 𝑝) = 𝑑(𝑐1, 𝑝).

102 Speeding up using the triangle inequality (ver 2)
At the beginning of an iteration, compute all distances between centers and sort each row of this distance matrix. For a point 𝑝 previously assigned to center 𝑐, check the centers in the order they appear in the row of 𝑐. Stop when you reach a center 𝑐′ such that 𝑑(𝑐, 𝑐′) ≥ 2𝑑(𝑝, 𝑐).
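A sketch of the ver 2 assignment step (illustrative only; math.dist as the metric and the data layout are my choices, not from the slides):

```python
import math

def assign_with_pruning(points, centers, prev_assign):
    """One assignment step of Lloyd's algorithm using the rule above:
    for a point p previously assigned to center c, scan the other centers
    by increasing d(c, .) and stop once d(c, c') >= 2 d(p, c) -- by the
    triangle inequality, every remaining center is at least d(p, c) away."""
    k = len(centers)
    # Distances between centers, and each row's centers sorted by distance.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    order = [sorted(range(k), key=lambda j: cc[i][j]) for i in range(k)]

    assign = []
    for p, c in zip(points, prev_assign):
        pc = math.dist(p, centers[c])      # distance to the previous center
        best, best_d = c, pc
        for j in order[c]:                 # centers in increasing d(c, .)
            if cc[c][j] >= 2 * pc:         # no remaining center can beat pc
                break
            dj = math.dist(p, centers[j])
            if dj < best_d:
                best, best_d = j, dj
        assign.append(best)
    return assign

pts = [(0, 0), (5, 5), (9, 9)]
cents = [(0, 0), (10, 10), (4, 4)]
print(assign_with_pruning(pts, cents, [0, 0, 0]))  # [0, 2, 1]
```

The pruning is safe for any previous assignment: every skipped center is provably no closer than the previous one, so the result matches a full scan.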

103 Results
A data set from a satellite image: points, each with 6 brightness levels.

104 Results Total running time

105 Results Average # of comparisons per point in all iterations and in the last iteration

