Bin Fu, Eugene Fink, Julio López, Garth Gibson. Carnegie Mellon University. Astronomy application of Map-Reduce: Friends-of-Friends algorithm. A distributed.

1 Astronomy application of Map-Reduce: Friends-of-Friends algorithm. A distributed
Bin Fu, Eugene Fink, Julio López, Garth Gibson
Carnegie Mellon University

2 Motivation
- Future science will be increasingly driven by huge data sets
- Analyzing them requires effective parallel tools for large-scale data, such as MapReduce
- We aim to help domain scientists by solving the problems they care about more effectively
- Astronomy is our first step
Bin Fu © November 2009 http://www.pdl.cmu.edu/

3 Astronomy Dataset
Sky surveys:
- Sloan Digital Sky Survey (2000–2008): 230 million objects, 50 TByte
- Pan-STARRS (just started): order of magnitude larger
- Large Synoptic Survey Telescope (2016): two orders of magnitude larger
Simulations:
- McWilliams Center at CMU: black holes and dark matter, 15B particles, 14 TByte / run
- LANL Coyote universe: 1B particles, 1 TByte / run, 30 runs

4 Friends of Friends (FoF) technique
- Two galaxies are “friends” if the distance between them is below a given threshold
- We analyze an undirected graph in which galaxies are vertices and their “friendships” are edges
- We need to identify the connected components
- From astronomers: the number of connected components and their sizes reflect properties of the universe

5 Friends of Friends (FoF) technique
Sequential algorithms:
- Exact: O((n ∙ log n)^1.5)
- Approximate: O(n)
When n is VERY large, parallel processing is required
- Input: (id, x, y, z) for each object
- Output: (id, group-id) for each object
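The input and output formats above can be illustrated with a minimal sequential sketch. It is not either of the algorithms named on the slide: it checks every pair of objects, so it costs O(n²) and only suits tiny inputs. The names `friends_of_friends` and `linking_length`, and the choice of the smallest member id as group-id, are assumptions made for this illustration.

```python
from itertools import combinations
from math import dist


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def friends_of_friends(objects, linking_length):
    """Brute-force FoF (illustrative only, O(n^2) pair checks).

    objects: list of (id, x, y, z) tuples.
    Returns {id: group_id}, where group_id is the smallest id in the group
    (an assumed convention, not taken from the slides).
    """
    parent = {obj[0]: obj[0] for obj in objects}
    for p, q in combinations(objects, 2):
        # Two objects are "friends" if they lie within the linking length.
        if dist(p[1:], q[1:]) <= linking_length:
            rp, rq = find(parent, p[0]), find(parent, q[0])
            if rp != rq:
                parent[max(rp, rq)] = min(rp, rq)  # smaller id stays root
    return {obj_id: find(parent, obj_id) for obj_id in parent}


if __name__ == "__main__":
    sample = [(1, 0.0, 0.0, 0.0), (2, 0.5, 0.0, 0.0),
              (3, 1.0, 0.0, 0.0), (4, 5.0, 5.0, 5.0)]
    print(friends_of_friends(sample, linking_length=0.6))
    # {1: 1, 2: 1, 3: 1, 4: 4}
```

Objects 1 and 3 are not direct friends, but both are friends of object 2, so all three land in the same group.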

6 Distributed Friends of Friends (dFoF)
Traditionally (HPC community):
- Divide the space into cubes
- But for this specific application, that requires more work (e.g. communication) at the later merge step

7 Distributed Friends of Friends (dFoF)
Instead, divide the space into “slightly overlapping” cubes:
- Randomly select a subset of objects
- Apply the kd-tree construction
- Send each object to the corresponding cubes
Distributed computation: apply a sequential FoF algorithm to find the “local groups” within each cube:
- Allocate different processors to cubes
Identify cross-cube edges and merge the respective “local groups”:
- Apply the Union-Find algorithm to the galaxies in the cube overlaps
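The partition, local computation, and merge steps can be sketched in a single process as follows. This is a simplified illustration under stated assumptions, not the authors' Hadoop code: fixed split planes along x stand in for the sampled kd-tree, each region is widened by one linking length on both sides, a brute-force routine stands in for the pluggable sequential FoF, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist, inf


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the sets of a and b; the smaller id stays root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def local_fof(objects, linking_length):
    """Stand-in for the pluggable sequential FoF (brute force, O(n^2))."""
    parent = {o[0]: o[0] for o in objects}
    for p, q in combinations(objects, 2):
        if dist(p[1:], q[1:]) <= linking_length:
            union(parent, p[0], q[0])
    return {i: find(parent, i) for i in parent}


def partition(objects, splits, margin):
    """Assumed partitioner: slabs along x, each widened by `margin`."""
    edges = [-inf, *splits, inf]
    cubes = [[] for _ in range(len(edges) - 1)]
    for obj in objects:
        for c, (lo, hi) in enumerate(zip(edges, edges[1:])):
            if lo - margin <= obj[1] < hi + margin:
                cubes[c].append(obj)  # overlap objects go to several cubes
    return cubes


def distributed_fof(objects, splits, linking_length):
    cubes = partition(objects, splits, margin=linking_length)
    # Each cube could run on a different processor; here it is a plain loop.
    local = [local_fof(cube, linking_length) for cube in cubes]
    merged = {}  # Union-Find over local group ids
    label = {}   # object id -> one of its local group ids
    for groups in local:
        for obj_id, gid in groups.items():
            merged.setdefault(gid, gid)
            if obj_id in label:  # object lies in a cube overlap
                union(merged, label[obj_id], gid)
            else:
                label[obj_id] = gid
    return {obj_id: find(merged, gid) for obj_id, gid in label.items()}


if __name__ == "__main__":
    chain = [(i, float(i - 1), 0.0, 0.0) for i in range(1, 6)]
    chain.append((6, 10.0, 0.0, 0.0))
    print(distributed_fof(chain, splits=[2.5], linking_length=1.0))
    # {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 6}
```

In the example, objects 3 and 4 fall into both slabs, so the local group found on the left (rooted at 1) and the one found on the right (rooted at 3) are merged without the two slabs exchanging any other data.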

8 Distributed Friends of Friends (dFoF)
- Pluggable: the sequential FoF can easily be replaced by another similar group-finding algorithm
- Avoids explicit cross-processor communication
- Scalable
- Applicable to other problems

9 Implementation Properties
- Avoids explicit communication
- Optional out-of-core processing if resources are insufficient
- Fits Hadoop’s structure naturally, using 3 Map phases and 2 Reduce phases

10 Disc Cloud Cluster
- 64 nodes
- 8 cores per node, 2.83 GHz
- 16 GByte memory per node
- 10 GBit / second network

11 Strong Scalability Experiments
- Input data constant
- Change the number of nodes
- Log-log scale
- Ideally: straight line
[Plot: Time (min) versus Number of nodes (1 to 32); curves labeled 0.5 bln, 0.9 bln*, 1 bln, and 15 bln]
* University of Washington
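The "ideally: straight line" remark follows from the usual definition of ideal strong scaling. In the short derivation below, the symbols T(p) for the running time on p nodes and T(1) for the single-node time are notation assumed here, not taken from the slides.

```latex
% Ideal strong scaling: fixed input, time inversely proportional to node count.
T(p) = \frac{T(1)}{p}
\quad\Longrightarrow\quad
\log T(p) = \log T(1) - \log p
```

On log-log axes this is a straight line with slope -1, so any curve that flattens out as p grows indicates overhead that does not shrink with more nodes.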

12 Weak Scalability Experiments
- Proportionally change the input size and # nodes
- Constant workload for each node
- Ideally: flat line
[Plot: Time (min) versus Number of nodes (1 to 32); curves labeled 32 mln / Node and 64 mln / Node]
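The "ideally: flat line" remark can be written down in the same way. Here T(p) denotes the running time on p nodes when each node holds a fixed share of the input, and E(p) is the weak-scaling efficiency; both symbols are assumed notation for this sketch.

```latex
% Ideal weak scaling: per-node workload fixed, so the running time stays constant.
T(p) = T(1) \quad \text{for all } p,
\qquad
E(p) = \frac{T(1)}{T(p)} = 1
```

A measured E(p) below 1 quantifies how far a curve rises above the flat line.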

13 Conclusion
- A distributed astronomical group-finding algorithm
- Hadoop implementation
- Good scalability across a series of experiments

14 Future Work
General-purpose astronomy toolkit:
- Distributed computation for other standard astronomy problems: correlation functions, spatial matching, density distribution, spectral analysis, ...
- Massive spatial indices of celestial objects integrated with distributed algorithms

15 Thanks!
http://www.cs.cmu.edu/~binf/dFOF

