
Published by Matthew Baldwin. Modified over 2 years ago.

1
On the Parallel Complexity of Hierarchical Clustering and CC-Complete Problems Dr. Raymond Greenlaw School of Computing Armstrong Atlantic State University and Dr. Sanpawat Kantabutra Department of Computer Science Chiang Mai University

2
On the Parallel Complexity of Hierarchical Clustering and CC-Complete Problems Greenlaw and Kantabutra 2 Outline Introduction Preliminaries Algorithms for Hierarchical Clustering Complexity of Hierarchical Clustering CC-Complete Problems Conclusions and Open Problems References Acknowledgments


Introduction
Clustering is the division of data into groups of similar objects, where each group is given a more compact representation. It is used to model very large data sets. Points are more similar to the points in their own cluster than to points in other clusters.

Clustering is a useful tool in data mining, where immense data sets that are difficult to store and manipulate are involved. This work studies the parallel complexity of the hierarchical clustering problem. Hierarchical clustering builds a tree of clusters. Sibling clusters in this tree partition the points associated with their parent, so the data can be explored at various levels of granularity.

Two widely studied models:
– Bottom-Up: starts with single-point clusters and then recursively merges two or more of the most appropriate clusters.
– Top-Down: starts with one large cluster consisting of all the data points and then recursively splits the most appropriate cluster.
In both methods, the process continues until a desired stopping condition is met, such as a required number of clusters or a bound on the diameter of the largest cluster.

A variety of sequential hierarchical-clustering methods have been studied:
– Cure (Guha et al.): Bottom-Up, good for clusters having arbitrary shapes or outliers.
– Chameleon (Karypis et al.): Bottom-Up, relies heavily on graph partitioning.
– Principal Direction Divisive Partitioning (Boley): Top-Down, good for document collections.
– Hierarchical Divisive Bisecting k-means (Steinbach et al.): Top-Down.

Address the parallel complexity of hierarchical clustering. Describe known sequential algorithms for top-down and bottom-up hierarchical clustering. Parallelize the top-down algorithm: when $n$ points are to be clustered, provide an $O(\log n)$-time, $n^2$-processor CREW-PRAM algorithm that computes the same output as the corresponding sequential algorithm.

Define a natural decision problem based on bottom-up hierarchical clustering and add this Hierarchical Clustering Problem (HCP) to the list of CC-complete problems, the first data mining problem on that list. Show that HCP is one of the computationally most difficult problems in the Comparator Circuit Value Problem (CCVP) class.

Demonstrate that the HCP is very unlikely to have an NC algorithm. In sharp contrast, give an NC algorithm for the top-down sequential approach. The parallel complexities of the top-down and bottom-up approaches are different, unless CC equals NC.


Preliminaries
The goal is to relate the complexity of hierarchical clustering to that of a problem involving Boolean circuits containing comparator gates. A comparator gate has two output wires, the first outputting the minimum and the second outputting the maximum of its two inputs. Each output has a maximum fanout of one.

The comparator gate is the basis for an entire complexity class.
Comparator Circuit Value Problem (CCVP)
Given: An encoding of a Boolean circuit composed of comparator gates, inputs $x_1, \ldots, x_n$, and a designated output $y$.
Problem: Is output $y$ of the circuit TRUE on input $x_1, \ldots, x_n$?
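The CCVP statement above can be made concrete with a small evaluator. This is a minimal sketch, not the encoding used in the talk: it assumes a wire-based representation in which each gate is a pair of wire indices `(i, j)`, and the names `eval_comparator_circuit`, `gates`, and `output_wire` are hypothetical. On Boolean values the minimum is AND and the maximum is OR, and overwriting the two wires in place respects the fanout-one restriction.

```python
from typing import List, Tuple


def eval_comparator_circuit(
    inputs: List[bool], gates: List[Tuple[int, int]], output_wire: int
) -> bool:
    """Evaluate a comparator circuit given as a list of gates over wires.

    Assumed encoding (illustrative only): gate (i, j) reads wires i and j,
    then writes their minimum back to wire i and their maximum to wire j.
    """
    wires = list(inputs)
    for i, j in gates:
        a, b = wires[i], wires[j]
        wires[i] = a and b  # minimum of the two Boolean inputs
        wires[j] = a or b   # maximum of the two Boolean inputs
    return wires[output_wire]
```

Evaluating the gates one after another is inherently sequential, which is the behaviour the CC class is meant to capture.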

Let P denote the class of all languages decidable in polynomial time.
Let NC denote the class of all languages decidable in poly-logarithmic time using a polynomial number of processors on a PRAM.
Let RNC denote the randomized version of NC.
Let NLOG denote the class of languages decidable in non-deterministic logarithmic space.
Let CC denote the class of problems that are NC many-one reducible to CCVP.


Algorithms for Hierarchical Clustering
Sequential Algorithms – Bottom-Up
Input: a set of points, a distance function, a bound $B$, and the desired number of clusters $k$.
Output: a set of clusters.
Pair up all points, starting with the two closest ones, then the two closest remaining ones, and so on, until all are paired. Next, repeatedly merge the sets of points $X$ and $Y$ minimizing $d_{\min}(X, Y)$ over all remaining sets, until only $k$ sets remain.

Sequential Algorithms – Bottom-Up (cont.)
It is assumed that the number of input points is even. No restrictions are placed on the distance function. In the first phase of the algorithm, points whose distance is less than or equal to $B$ are clustered. The algorithm operates in polynomial time.
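The two phases of the bottom-up algorithm can be sketched as follows. This is an illustrative reading of the slides rather than the authors' pseudocode: the function names are hypothetical, ties are broken by point order (an assumption), and points with no partner within distance $B$ are left as singleton clusters (also an assumption). Here `d_min` is taken to be the smallest distance between a point of one set and a point of the other.

```python
from itertools import combinations


def first_phase(points, dist, B):
    """Greedily pair the closest remaining points whose distance is <= B."""
    candidates = sorted(
        (dist(x, y), x, y)
        for x, y in combinations(sorted(points), 2)
        if dist(x, y) <= B
    )
    unpaired = set(points)
    clusters = []
    for _, x, y in candidates:
        # Scanning pairs by increasing distance is the same as repeatedly
        # picking the closest pair among the points still unpaired.
        if x in unpaired and y in unpaired:
            clusters.append({x, y})
            unpaired -= {x, y}
    # Assumption: points left without a partner stay as singletons.
    clusters.extend({p} for p in sorted(unpaired))
    return clusters


def second_phase(clusters, dist, k):
    """Merge the two sets minimizing d_min until only k sets remain."""
    clusters = [set(c) for c in clusters]

    def d_min(X, Y):
        return min(dist(x, y) for x in X for y in Y)

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: d_min(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters
```

Each pairing decision in `first_phase` depends on all earlier ones, which is the sequential dependency that the later CC-completeness result exploits.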

Sequential Algorithms – Top-Down
The function $v(G)$ takes a graph as its argument and returns the set of vertices of $G$.
Input: a set of points, a distance function, and the desired number of clusters $k$.
Output: a set of clusters.
All points start in the same cluster.

Compute a minimum-cost spanning tree of the graph corresponding to the initial set of points with respect to the distance function. Form clusters by repeatedly removing the highest-cost edge from what remains of this tree, until exactly $k$ sets have been formed. The algorithm runs in polynomial time.
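A sequential sketch of the top-down procedure is given below. It is an illustration under stated assumptions, not the parallel algorithm from the talk: the spanning tree is built with Kruskal's algorithm (the slides do not name a particular method), the function name `top_down_clusters` is hypothetical, and ties between equal-cost edges are broken by point order. Removing the $k - 1$ highest-cost tree edges is implemented by keeping the $n - k$ cheapest ones.

```python
from itertools import combinations


def top_down_clusters(points, dist, k):
    """Cluster by deleting the k - 1 costliest edges of a minimum spanning tree."""
    points = list(points)
    parent = {p: p for p in points}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Kruskal's algorithm on the complete graph; edges join the tree in
    # nondecreasing order of cost.
    mst = []
    for _, x, y in sorted((dist(x, y), x, y) for x, y in combinations(points, 2)):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
            mst.append((x, y))

    # Keeping the n - k cheapest tree edges removes the k - 1 costliest.
    kept = mst[: len(points) - k]

    # Recompute connected components using only the kept edges.
    for p in points:
        parent[p] = p
    for x, y in kept:
        parent[find(x)] = find(y)
    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())
```

Unlike the bottom-up pairing, the edges to remove can be identified from the tree all at once, which is consistent with the claim that this approach parallelizes well.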

Top-Down and Bottom-Up have different parallel complexities, unless CC equals NC. Exactly the same clusters as produced by the Sequential (Top-Down) Hierarchical Clustering Algorithm can be computed in NC. A natural decision problem based on the Sequential (Bottom-Up) Hierarchical Clustering Algorithm is CC-complete.

Since a CC-complete problem is very unlikely to have an NC algorithm and a problem with an NC algorithm is very unlikely to be CC-complete, the parallel complexities of these two sequential algorithms are different unless CC equals NC. A fast parallel algorithm for hierarchical clustering should therefore be based on the Top-Down approach.

Theorem: Let $n$ denote the number of points to be clustered. The Parallel (Top-Down) Hierarchical Clustering Algorithm can be implemented in $O(\log n)$ time using $n^2$ processors on the CREW PRAM.
This algorithm is an NC algorithm, which means that the clusters can be computed very fast in parallel. Any reasonable decision problem based on this algorithm will be in NC.


Complexity of Hierarchical Clustering
Hierarchical Clustering Problem (HCP)
Given: A set $S$ of $n$ points in $\mathbb{R}^d$, a distance function $d_S : S \times S \to \mathbb{N}$, the number of clusters $k \le n/2$ with $k \in \mathbb{N}$, a distance bound $B$, and two points $x, y \in S$.
Problem: Are $x$ and $y$ with $d_S(x, y) \le B$ in the same cluster $C$ after the first phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm?

No restrictions are placed on the properties the distance function must satisfy, but the distances themselves must be natural numbers. This version of the problem easily reduces to the version where the weights come from $\mathbb{R}^+$. The distance between a point and itself is of no concern. The parameter $k$ is the number of clusters to be formed. The points $x$ and $y$ are required to be no further apart than the distance bound $B$.

Lexicographically First Maximal Matching Problem (LFMMP)
Given: An undirected graph $G = (V, E)$ with an ordering on its edges plus a distinguished edge $e \in E$.
Problem: Is $e$ in the lexicographically first maximal matching of $G$?
A matching is maximal if it cannot be extended.
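The lexicographically first maximal matching is the one produced by the standard greedy scan over the edges in their given order. The sketch below illustrates this; the function names are hypothetical and the edge ordering is assumed to be the order of the input list.

```python
def lex_first_maximal_matching(ordered_edges):
    """Greedy scan: take each edge whose endpoints are both still unmatched."""
    matched = set()
    matching = []
    for u, v in ordered_edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def edge_in_lfmm(ordered_edges, e):
    """Decide the LFMMP question for a distinguished edge e."""
    target = frozenset(e)
    return any(frozenset(edge) == target
               for edge in lex_first_maximal_matching(ordered_edges))
```

For the path with edges ordered `(1, 2), (2, 3), (3, 4)`, the scan takes `(1, 2)`, skips `(2, 3)` because vertex 2 is already matched, and takes `(3, 4)`.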

LFMMP is CC-complete [Cook 1982, Mayr and Subramanian 1992].
Theorem: The Hierarchical Clustering Problem is NC many-one reducible to the Lexicographically First Maximal Matching Problem, that is, $\mathrm{HCP} \le_m^{NC} \mathrm{LFMMP}$. Hence HCP is in CC.

Theorem: The Lexicographically First Maximal Matching Problem is NC many-one reducible to the Hierarchical Clustering Problem, that is, $\mathrm{LFMMP} \le_m^{NC} \mathrm{HCP}$.
Proof Sketch: Let $G = (V = \{1, \ldots, n\}, E)$, let $\phi : E \to \{1, \ldots, |E|\}$ be an ordering on $E$, and let $e = \{u, v\} \in E$ be an instance of the LFMMP. Construct an instance of HCP: a set $S$ of $n$ points $p_1, \ldots, p_n$ in $\mathbb{R}^d$, a distance function $d_S : S \times S \to \mathbb{N}$, a number of clusters $k \le n/2$ with $k \in \mathbb{N}$, a bound $B$, and points $x, y \in S$.

Proof (cont.): Let $S = \{1, \ldots, n, n+1, \ldots, 2n\}$ and let $V' = S - V$. Let $B = |E|$, $k = n$, and take $u$ and $v$ as our points $x$ and $y$. Define the distance function between each pair of points in $S$ as follows:

$$
d_S(x, y) =
\begin{cases}
\phi(\{x, y\}) & \text{if } \{x, y\} \in E, \\
2|E| & \text{if } x \in V \text{ and } y \in V', \text{ or vice versa}, \\
3|E| & \text{if } x \in V',\ y \in V', \text{ and } x \ne y, \\
4|E| & \text{if } x \le n,\ y \le n,\ x \ne y, \text{ and } \{x, y\} \notin E.
\end{cases}
$$
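The construction can be checked on small instances with the sketch below. It rests on one reading of the distance function, stated here as an assumption: edges of $G$ get their rank in the ordering, a pair with one point in $\{1, \ldots, n\}$ and one in $\{n+1, \ldots, 2n\}$ gets $2|E|$, two distinct points above $n$ get $3|E|$, and two non-adjacent points at most $n$ get $4|E|$. The function names are hypothetical. Because only edges of $G$ have distance at most $B = |E|$, the greedy first-phase pairing follows the edge ordering.

```python
from itertools import combinations


def build_hcp_instance(n, ordered_edges, e):
    """Map an LFMMP instance to (S, d_S, k, B, x, y) under the assumed reading."""
    m = len(ordered_edges)
    phi = {frozenset(edge): rank
           for rank, edge in enumerate(ordered_edges, start=1)}

    def d_S(x, y):
        pair = frozenset((x, y))
        if pair in phi:
            return phi[pair]      # edge of G: its rank in the ordering
        if (x <= n) != (y <= n):
            return 2 * m          # one original point, one added point
        if x > n and y > n:
            return 3 * m          # two added points
        return 4 * m              # two original points that are not adjacent

    S = list(range(1, 2 * n + 1))
    u, v = e
    return S, d_S, n, m, u, v


def same_cluster_after_first_phase(S, d_S, B, x, y):
    """Greedy closest-pair matching restricted to pairs at distance <= B."""
    candidates = sorted(
        (d_S(a, b), a, b) for a, b in combinations(S, 2) if d_S(a, b) <= B
    )
    paired = set()
    for _, a, b in candidates:
        if a not in paired and b not in paired:
            paired.update((a, b))
            if {a, b} == {x, y}:
                return True
    return False
```

On the path with edges ordered `(1, 2), (2, 3), (3, 4)`, the pair `(3, 4)` ends up in one cluster and `(2, 3)` does not, matching the lexicographically first maximal matching of that graph.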

Theorem: The Hierarchical Clustering Problem is CC-complete.
This expands the list of CC-complete problems and adds the first clustering/data mining problem to the class.


CC-Complete Problems
Comparator Circuit Value Problem (CCVP)
Given: An encoding of a Boolean circuit composed of comparator gates, inputs $x_1, \ldots, x_n$, and a designated output $y$.
Problem: Is output $y$ of the circuit TRUE on input $x_1, \ldots, x_n$?
References: [Cook 1982, Mayr and Subramanian 1992]

Lexicographically First Maximal Matching Problem (LFMMP)
Given: An undirected graph $G = (V, E)$ with an ordering on its edges plus a distinguished edge $e \in E$.
Problem: Is $e$ in the lexicographically first maximal matching of $G$?
References: [Cook 1982, Mayr and Subramanian 1992]
Remarks: Resembles the Lexicographically First Maximal Independent Set Problem, which is P-complete.

Stable Marriage Problem (SMP)
Given: A set of $n$ men and a set of $n$ women and, for each person, a ranking of the members of the opposite sex according to their preference for a marriage partner.
Problem: Does the given instance of the problem have a set of marriages that is stable? The set is stable if there is no unmatched pair $\{m, w\}$ such that both $m$ and $w$ prefer each other to their current partners.

References: [Mayr and Subramanian 1992, Subramanian 1989]
Remarks: If the preference lists are complete, there is always a solution. Several variations of the SMP are also known to be equivalent to the CCVP. The Male-Optimal Stable Marriage Problem finds a matching in which no man could do any better in a stable marriage.
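The stability condition and the male-optimal matching can be illustrated with the sketch below. The talk does not name an algorithm; the code uses the well-known men-proposing deferred-acceptance (Gale-Shapley) procedure, which yields the male-optimal stable marriage when preference lists are complete. The function names and the dictionary encoding of preferences (most preferred first) are assumptions made for this example.

```python
def male_optimal_stable_marriage(men_prefs, women_prefs):
    """Men-proposing deferred acceptance; assumes complete preference lists."""
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of next woman to propose to
    fiance = {}                              # woman -> man currently engaged
    free_men = list(men_prefs)
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = fiance.get(w)
        if current is None:
            fiance[w] = m
        elif rank[w][m] < rank[w][current]:
            fiance[w] = m                    # w trades up; her old partner is free
            free_men.append(current)
        else:
            free_men.append(m)               # w rejects m; he proposes again later
    return {m: w for w, m in fiance.items()}


def is_stable(matching, men_prefs, women_prefs):
    """Return False if some pair prefer each other to their current partners."""
    husband = {w: m for m, w in matching.items()}
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == matching[m]:
                break  # every later woman is less preferred than m's partner
            if rank[w][m] < rank[w][husband[w]]:
                return False  # m and w form a blocking pair
    return True
```

With two men who both rank `x` first and a woman `x` who prefers `b`, the procedure matches `b` with `x` and `a` with `y`, and swapping the partners produces a blocking pair.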

Stable Marriage Stable Pair Problem (SMSPP)
Given: A set of $n$ men and $n$ women, for each person a ranking of the members of the opposite sex according to their preference for a marriage partner, and a designated couple Alice and Bob.
Problem: Are Alice and Bob a stable pair for the given instance of the problem? That is, are Alice and Bob married to each other in some stable marriage?
References: [Mayr and Subramanian 1992, Subramanian 1989]

Stable Marriage Minimum Regret Problem (SMMRP)
Given: A set of $n$ men and $n$ women, for each person a ranking of the members of the opposite sex according to their preference for a marriage partner, and a natural number $k$, $1 \le k \le n$.
Problem: Is there a stable marriage in which every person has regret at most $k$? The regret of a person in a stable marriage is the position of her mate on her preference list.

References: [Mayr and Subramanian 1992, Subramanian 1989]
Remarks: The goal in this problem is to minimize the maximum regret of any person.

Telephone Connection Problem (TCP)
Given: A telephone line with a fixed channel capacity $k$, a natural number $l$, and a sequence of calls $(s_1, f_1), \ldots, (s_n, f_n)$, where $s_i$ ($f_i$) denotes the starting (respectively, finishing) time of the $i$-th call. The $i$-th call can be serviced at time $s_i$ if the number of calls being served at that time is less than $k$. If the call cannot be served, it is discarded. When a call is completed, the channel is freed up.

Problem: Is the $l$-th call serviced?
References: [Ramachandran and Wang 1991]
Remarks: There is an O(min(,k) log n)-time EREW-PRAM algorithm that uses $n$ processors for solving the TCP.
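A direct sequential simulation of the telephone line answers the TCP question. This is a minimal sketch, not the parallel algorithm cited in the remarks; it assumes calls are handled in order of starting time (ties by position in the sequence) and that a call finishing at time $t$ frees its channel before a call starting at time $t$ is considered. The function name `is_call_serviced` is hypothetical.

```python
import heapq


def is_call_serviced(k, calls, l):
    """Simulate a line with capacity k; report whether the l-th call is served.

    calls is a list of (start, finish) pairs and l is 1-based.
    """
    in_service = []  # min-heap of finish times of calls currently served
    order = sorted(range(len(calls)), key=lambda i: (calls[i][0], i))
    for i in order:
        start, finish = calls[i]
        # Assumption: channels of calls finishing by `start` are already free.
        while in_service and in_service[0] <= start:
            heapq.heappop(in_service)
        served = len(in_service) < k
        if served:
            heapq.heappush(in_service, finish)
        if i == l - 1:
            return served
    raise ValueError("l must be between 1 and the number of calls")
```

Whether a call is served depends on which earlier calls were discarded, so the outcome for call $l$ is determined by a chain of earlier decisions.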

Internal Diffusion Limited Aggregation Prediction Problem (IDLAPP)
Given: A time $T$ and a list of moves $(t, i, s)$, one for each time $t$ between 0 and $T$, indicating that at time $t$ particle $i$, if still active, will visit site $s$, plus a designated site $d$ and a designated particle $p$. A particle is active if it is still moving within the cluster, that is, the particle has not yet stuck to the cluster because all of the sites that it has visited so far were occupied already.

Problem: Is site $d$ occupied and is particle $p$ active at time $T$?
References: [Moore and Machta 2000]

Internal Diffusion Limited Aggregation Prediction Square Lattice Problem
Given: A time $T$ and a list of moves $(t, i, s)$ on a square lattice, one for each time $t$ between 0 and $T$, indicating that at time $t$ particle $i$, if still active, will visit site $s$, plus a designated site $d$ and a designated particle $p$.
Problem: Is site $d$ on the square lattice occupied and is particle $p$ active at time $T$?
References: [Moore and Machta 2000]

Hierarchical Clustering Problem (HCP)
Given: A set $S$ of $n$ points in $\mathbb{R}^d$, a distance function $d_S : S \times S \to \mathbb{N}$, the number of clusters $k \le n/2$ with $k \in \mathbb{N}$, a distance bound $B$, and two points $x, y \in S$.
Problem: Are $x$ and $y$ with $d_S(x, y) \le B$ in the same cluster $C$ after the first phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm?
Reference: [This work 2006]


Conclusions
A natural decision problem based on bottom-up hierarchical clustering is CC-complete. Top-down hierarchical clustering is in NC. This brings the number of known CC-complete problems to ten and shows that the HCP is unlikely to have an NC algorithm. Fast parallel algorithms for hierarchical clustering should be based on a top-down approach.

Open Problems
Is Euclidean HCP CC-complete? (It is in CC.)
Determine the complexity of the second phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm.
Add new problems to the class of CC-complete problems.


References
[Blumenthal 1953] Theory and Applications of Distance Geometry. Oxford University Press.
[Boley 1998] Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4).
[Chong, Han, and Lam 2001] Concurrent threads and optimal parallel minimum spanning tree algorithms. Journal of the ACM, 48(2).
[Cole 1988] Parallel merge sort. SIAM Journal on Computing, 17(4).
[Cook 1985] A taxonomy of problems with fast parallel algorithms. Information and Control, 64(1-3):2-22.
[Dash, Petrutiu, and Scheuermann 2004] Efficient parallel hierarchical clustering. Lecture Notes in Computer Science, 3149.
[Feder 1992] A new fixed point approach to stable networks and stable marriages. Journal of Computer and System Sciences, 45(2).
[Gibbons 1985] Algorithmic Graph Theory. Cambridge University Press.
[Greenlaw 1992] A model classifying algorithms as inherently sequential with applications to graph searching. Information and Computation, 97(2).

[Greenlaw, Hoover, and Ruzzo 1995] Limits to Parallel Computation: P-Completeness Theory. Oxford University Press.
[Guha, Rastogi, and Shim 1998] Cure: An efficient clustering algorithm for large databases. In ACM SIGMOD, Seattle, WA. Association for Computing Machinery.
[Jain and Dubes 1988] Algorithms for Clustering Data. Prentice-Hall.
[Karypis, Han, and Kumar 1999] CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75.
[Kaufman and Rousseeuw 1990] Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons.
[Li 1990] Parallel algorithms for hierarchical clustering and clustering validity. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(11).
[Li and Fang 1989] Parallel clustering algorithms. Parallel Computing, 11(3).
[Mayr and Subramanian 1992] The complexity of circuit value and network stability. Journal of Computer and System Sciences, 44(2).

[Moore and Machta 2000] Internal diffusion-limited aggregation: Parallel algorithms and complexity. Journal of Statistical Physics, 99(3/4).
[Olson 1995] Parallel algorithms for hierarchical clustering. Parallel Computing, 21(8).
[Pólya, Tarjan, and Woods 1983] Notes on Introductory Combinatorics. Birkhäuser, Boston.
[Rajasekaran 2005] Efficient parallel hierarchical clustering algorithms. IEEE Transactions on Parallel and Distributed Systems, 16(6).
[Ramachandran and Wang 1991] Parallel algorithms and complexity results for telephone link simulation. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, Dallas, TX, December. IEEE.
[Reif (ed) 1993] Synthesis of Parallel Algorithms. Morgan Kaufmann.
[Sairam, Vitter, and Tamassia 1993] A complexity theoretic approach to incremental computation. In Finkel, Enjalbert, and Wagner, editors, STACS 93: 10th Annual Symposium on Theoretical Aspects of Computer Science, volume 665 of Lecture Notes in Computer Science, Würzburg, Germany, February. Springer-Verlag.

[Steinbach, Karypis, and Kumar 2000] A comparison of document clustering techniques. In 6th ACM SIGKDD World Text Mining Conference, Boston, MA. Association for Computing Machinery.
[Subramanian 1989] A new approach to stable matching problems. Technical Report STAN-CS, Stanford University, Department of Computer Science.
[Subramanian 1990] The Computational Complexity of the Circuit Value and Network Stability Problems. PhD thesis, Stanford University. Department of Computer Science Technical Report, STAN-CS.
[Tsai, Horng, Lee, Tsai, and Kao 1997] Parallel hierarchical clustering algorithms on processor arrays with a reconfigurable bus system. Pattern Recognition, 30(5).
[Wu, Horng, and Tsai 2000] Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. Journal of Parallel and Distributed Computing, 60(9).


Acknowledgements
Computer Science Department at Chiang Mai University, Thailand
Fulbright Commissions of Thailand and the United States
Jim Hoover and Larry Ruzzo for material from [Greenlaw, Hoover, and Ruzzo 1995]
