Large Scale Data Processing Techniques for Astronomical Applications

Presentation transcript:

1 Large Scale Data Processing Techniques for Astronomical Applications
17/9/2018
Shanjiang Tang, School of Computer Science & Technology, Tianjin University
Nov 30th, 2017

2 Astronomical Data is BIG
Lots of data is being collected and warehoused:
Astronomical images
Cosmological simulations (N-body cosmological simulations)
Data volumes of, e.g., 100 TB to 10 PB
Instruments: the FAST radio telescope, the LAMOST telescope, and the Very Large Array in New Mexico, USA
Many astronomical applications do not scale well to such large amounts of data.

3 Problems of Existing Applications
Many astronomical applications are NOT suitable for big data:
They do not SCALE well to large data volumes (e.g., memory-bounded applications such as Early Black Holes in Cosmological Simulations: Luminosity Functions and Clustering Behaviour).
They are not FAST enough for large data volumes (e.g., the astronomical correlation function, as used for Baryon Acoustic Oscillations).
The correlation function problem builds a histogram of pairwise distances between celestial objects [Peebles, 1980]. This histogram measures the distribution of astronomical objects at given distances and is widely used by astrophysicists. For instance, correlation functions help cosmologists better understand the universe; cosmologists now agree that the universe is not only expanding, but expanding at an accelerating rate.
How can we handle such large-scale data applications?
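To make the correlation function problem concrete, here is a minimal brute-force sketch of the pairwise-distance histogram it describes. The function name and the toy points are illustrative, not from the cited work; a real implementation would avoid the O(n^2) loop, which is exactly why this problem is slow at scale.

```python
import numpy as np

def pair_distance_histogram(points, bins):
    """Histogram of pairwise distances between objects (brute force, O(n^2))."""
    n = len(points)
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(np.linalg.norm(points[i] - points[j]))
    counts, edges = np.histogram(dists, bins=bins)
    return counts, edges

# Toy example: four collinear objects, pairwise distances 1,1,1,2,2,3
pts = np.array([[0.0], [1.0], [2.0], [3.0]])
counts, edges = pair_distance_histogram(pts, bins=[0.5, 1.5, 2.5, 3.5])
# counts is [3, 2, 1]: three pairs at distance ~1, two at ~2, one at ~3
```

With n objects there are n(n-1)/2 pairs, so for billions of objects even this simple histogram requires approximate or distributed algorithms.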

4 Popular Data Processing Frameworks
Lots of frameworks are emerging for different classes of applications:
Batch applications
Streaming applications
Interactive applications
Graph applications
Deep learning applications

5 Hadoop Overview
Open-source implementation of MapReduce
YARN: the second generation of Hadoop
Scales up to 6,000 to 10,000 machines
Support for multi-tenancy

6 MapReduce is a Promising Choice for Big Data Processing
MapReduce, proposed by Google in 2004, is inspired by the map and reduce combinators of Lisp.
Map: (key1, val1) → (key2, val2). The map function takes <key, value> pairs as input and produces a set of zero or more intermediate <key, value> pairs. The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer.
Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key using a binary operation, such as the sum.
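The map/shuffle/reduce contract above can be sketched in plain Python on the classic word-count example. This is a single-machine illustration of the programming model only (the function names `map_fn`, `shuffle`, and `reduce_fn` are ours), not the Hadoop API:

```python
from collections import defaultdict
from functools import reduce

# Map: (doc_id, text) -> list of intermediate (word, 1) pairs
def map_fn(key, value):
    return [(word, 1) for word in value.split()]

# Shuffle: the framework groups intermediate values by key
def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Reduce: (word, [1, 1, ...]) -> (word, count), via a binary sum
def reduce_fn(key, values):
    return key, reduce(lambda a, b: a + b, values)

docs = [("d1", "big data big science"), ("d2", "big telescope")]
intermediate = [p for k, v in docs for p in map_fn(k, v)]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())
# result: {'big': 3, 'data': 1, 'science': 1, 'telescope': 1}
```

In a real Hadoop job the map and reduce tasks run in parallel on different machines, and the shuffle moves data over the network; only the two user-supplied functions change per application.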

7 Hadoop-based Astronomy Applications
Bin Fu, et al. DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics. In HPDC'10.
Bin Fu, et al. Exact and Approximate Computation of a Histogram of Pairwise Distances between Astronomical Objects. In AstroHPC 2012.
H. Willey, et al. Astronomy in the Cloud: Using MapReduce for Image Coaddition. In arXiv, 2012.
C.C. Mi, et al. An Efficient Cross-Match Implementation Based on Directed Join Algorithm in MapReduce. In UCC'11.
K.S. Lee, et al. An Efficient Astronomical Cross-matching Model Based on MapReduce Mechanism. In ASE BD&SI'15.
DiscFinder [Fu et al., 2010] is a scalable, distributed, data-intensive group finder for analyzing observational and simulation astrophysics datasets. Group finding is a form of clustering used in astrophysics for identifying large-scale structures such as clusters and superclusters of galaxies. DiscFinder runs on commodity compute clusters and scales to datasets with billions of particles; it is designed to operate on datasets much larger than the aggregate memory of the machines it runs on. As a proof of concept, its authors implemented DiscFinder on top of the Hadoop framework, and it has been used to cluster the largest open-science cosmology simulation datasets, containing as many as 14.7 billion particles.

8 DiscFinder: A Distributed Version of the Friends-of-Friends Technique
Friends-of-Friends algorithm for galaxy clustering:
Two galaxies are "friends" if they are close to each other.
Vertices denote galaxies, and their friendships are edges; clusters are the connected components.
Time complexity is O((n · log n)^1.5) for the exact computation, and O(n) for an approximate algorithm.
Galaxy clusters and space partitioning
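A minimal sequential sketch of friends-of-friends, assuming a fixed linking length and using union-find to track connected components. This is the O(n^2) textbook version for illustration, not DiscFinder's optimized implementation (which uses spatial indexing to reach the complexities quoted above):

```python
def friends_of_friends(points, linking_length):
    """Label each galaxy with its cluster: two galaxies are 'friends'
    (joined) whenever their distance is below the linking length."""
    n = len(points)
    parent = list(range(n))  # union-find forest over galaxies

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Brute-force pair check; real implementations prune with a spatial index
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if d < linking_length:
                union(i, j)
    return [find(i) for i in range(n)]

labels = friends_of_friends([(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.4, 5.0)], 1.0)
# first two points share a label (one cluster), last two share another
```

Note that clusters can grow by chains of friendships: two galaxies far apart end up in the same cluster if a chain of close pairs connects them, which is exactly the connected-components view above.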

9 MapReduce-based Approach
A MapReduce "wrapper" distributes the friends-of-friends computation across nodes:
Divide the space into cubes.
Apply a sequential friends-of-friends procedure within each cube.
Identify cross-cube "friendships" and merge the respective clusters.

10 MapReduce-based Approach (cont'd)
The partitioning and clustering stages of DiscFinder are based on MapReduce.

11 Spark Overview
A fast and general large-scale data processing system:
Much faster than Hadoop, due to in-memory computing
Fault-tolerance support
Supports graph, streaming, interactive, and machine learning workloads
Spark ecosystem
Logistic regression in Hadoop and Spark

12 Spark-based Astronomy Applications
Zhao Zhang, et al. Scientific Computing Meets Big Data Technology: An Astronomy Use Case. In IEEE Big Data, 2015.
Zhao Zhang, et al. Kira: Processing Astronomy Imagery Using Big Data Technology. In IEEE TBD, 2016.
Panos Labropoulos, et al. Distributed Data Processing Using Spark in Radio Astronomy. In TERATEC 2016.
Mariem Brahem, et al. AstroSpark: Towards a Distributed Data Server for Big Data in Astronomy. In SIGSPATIAL Workshop, 2016.

13 Kira: An Astronomy Image Processing Toolkit Built on Spark
A typical supernova detection pipeline:
Images → Source Extraction → Point Spread Function Estimation → Image Reprojection → Image Coaddition → Object Classification → Catalogs

14 Source Extractor Steps
1. Background estimation
2. Background subtraction
3. Object detection through convolution
4. Object statistics evaluation
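The four steps above can be sketched on a toy image with NumPy. This is our own simplified illustration, not Kira's or SExtractor's actual code: the background is a global median rather than a local mesh, and detection uses plain sigma-thresholding in place of the convolution-based detection named in step 3.

```python
import numpy as np

def extract_sources(image, k=3.0):
    """Minimal source-extraction sketch following the four steps above."""
    # 1. Background estimation: robust global level via the median
    background = np.median(image)
    # 2. Background subtraction
    residual = image - background
    # 3. Object detection: pixels brighter than k * sigma of the residual
    #    (a real extractor convolves with a detection kernel first)
    sigma = np.std(residual)
    mask = residual > k * sigma
    # 4. Object statistics: number of detected pixels and peak flux
    return int(mask.sum()), float(residual.max())

rng = np.random.default_rng(0)
img = rng.normal(100.0, 1.0, size=(64, 64))  # flat sky + Gaussian noise
img[32, 32] += 50.0                          # inject one bright point source
n_pix, peak = extract_sources(img)
# the injected source is detected: n_pix >= 1 and peak is close to 50
```

Each step is an independent per-pixel or per-region pass over the image, which is what makes this pipeline a good fit for data-parallel execution across many images in Spark.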

15 Kira Source Extractor Architecture

16 Experimental Results
Performance of Kira SE vs. the C implementation on a 1 TB dataset

17 Conclusion
The huge volume and rapid growth of datasets in scientific computing, such as astronomy, demand a fast and scalable data processing system.
Leveraging a big data platform such as Hadoop/Spark enables scientists to benefit from the rapid pace of innovation and the large range of systems driven by widespread interest in big data analytics.
Finally, in the era of big data, astronomical informatics is a must, and our team has done much big data processing work in this area.

18 Thanks! Questions?

