1
Spark & MongoDb for LSST
Christian Arnault (LAL), Réza Ansari (LAL), Fabrice Jammes (LPC Clermont), Osman Aidel (CCIN2P3), César Richard (U-PSud)
June, LSST Workshop - CCIN2P3
2
Topics: Spark, MongoDb, Spark (again)

Spark:
- How to consider parallelism & distribution in the processing workflows
- How to cope with intermediate data: managing the steps in the workflow, producing the final data (catalogues)
- How to distribute data (data formats): Avro/Parquet (converting the FITS format; see the sketch below)

MongoDb:
- To understand whether Mongo might offer similar features to QServ

Spark (again):
- Same question, but using the Spark DataFrame technology
- Combined with the GeoSpark module for 2D indexing
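To fix ideas on the FITS to Avro/Parquet conversion, here is a minimal sketch using astropy and PySpark. The input file name, the header keywords and the flat one-row-per-CCD layout are assumptions for the example, not the project's actual conversion code:

from astropy.io import fits
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("fits-to-parquet").getOrCreate()

# Read one FITS image (one CCD) and keep its pixels plus a few header keywords
with fits.open("ccd_example.fits") as hdul:        # hypothetical input file
    header = hdul[0].header
    pixels = hdul[0].data.astype(float).tolist()   # 2D pixel array as nested lists

row = Row(run=int(header.get("RUN", 0)),
          ra=float(header.get("RA", 0.0)),
          dec=float(header.get("DEC", 0.0)),
          image=pixels)

# One row per CCD image; Parquet (or Avro) then becomes the distributed exchange format
df = spark.createDataFrame([row])
df.write.mode("overwrite").parquet("./images_parquet")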
3
Spark: the simplified process
Simulation / Observation → Images → Calibration → Object detection (sky background, reference catalogues) → Objects {x, y, flux} → Photometry, photo-z, astrometry → Measured objects {RA, DEC, flux, magnitude, Z} → Catalogues
4
Typical numbers
- Camera: 3.2 Gpixels; 189 CCDs / 6 filters; 15 TB per night (× 10 years); 3 GB/s
- Image: diameter 3.5° / 64 cm → 9.6 deg² (Moon = 0.5°, ~ ×6 CCD images)
- CCD: 16 Mpixels (= 1 FITS file); 16 cm²; 0.05 deg²
- Pixels: 10 µm; 0.2 arcsec; 2 bytes
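As a rough consistency check on these numbers, a few lines of Python (all values are taken from the slide; the per-night image count is a back-of-the-envelope estimate, not an LSST specification):

# Quick consistency check on the slide's numbers
ccds = 189
pixels_per_ccd = 16e6                 # 16 Mpixels per CCD
bytes_per_pixel = 2

pixels_per_image = ccds * pixels_per_ccd               # ~3.0 Gpixels (slide quotes 3.2 Gpixels)
bytes_per_image = pixels_per_image * bytes_per_pixel   # ~6 GB of raw pixels per image

night_volume = 15e12                  # 15 TB per night
images_per_night = night_volume / bytes_per_image      # rough estimate

print(f"{pixels_per_image / 1e9:.1f} Gpixels, {bytes_per_image / 1e9:.1f} GB per image, "
      f"~{images_per_night:.0f} images per night")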
5
Algorithms
Simulation:
- Apply a Gaussian pattern with a common width (i.e. we only consider atmosphere and optical aberrations) + some noise
Detection (see the sketch below):
- Convolution with a Gaussian pattern for the PSF
- Handle an overlap margin for objects close to the image border
Identification:
- Search for geo-2D coordinates in the reference catalogues
- Handling a large number of data files, based on multiple indexing keys (run, filter, ra, dec, …), aka the 'data butler'
- Studying the transfer mechanisms: throughput, serialization
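A minimal sketch of this simulation/detection idea with NumPy/SciPy (not the actual LSST code; image size, source position and the 5-sigma threshold are made-up example values):

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Simulation: sky background + noise, plus one Gaussian-shaped object
image = rng.normal(loc=100.0, scale=5.0, size=(200, 200))
y, x = np.mgrid[0:200, 0:200]
image += 500.0 * np.exp(-((x - 80) ** 2 + (y - 120) ** 2) / (2 * 2.0 ** 2))

# Detection: convolve with a Gaussian PSF model, then threshold above the background
smoothed = gaussian_filter(image, sigma=2.0)
background = np.median(smoothed)
noise = smoothed.std()
detections = np.argwhere(smoothed > background + 5 * noise)   # 5-sigma threshold

print(f"{len(detections)} pixels above threshold around the injected object")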
7
Images creation (Spark)
Declare a schema for:
- Serialization of images
- Data partitioning & indexing

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, ArrayType

def make_schema():
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("run", IntegerType(), True),
        StructField("ra", DoubleType(), True),
        StructField("dec", DoubleType(), True),
        StructField("image", ArrayType(ArrayType(DoubleType()), True))])
    return schema

def create_image(spark):
    runs = ...
    rows = 3; cols = 3; region_size = 4000
    images = []; image_id = 0
    # initialize image descriptors
    for run in range(runs):
        for r in range(rows):
            for c in range(cols):
                ra = ...; dec = ...
                images.append((image_id, run, ra, dec))
                image_id += 1
    # fill the pixel data in parallel, then write as Avro
    rdd = spark.sparkContext.parallelize(images).map(lambda x: fill_image(x))
    df = spark.createDataFrame(rdd, make_schema())
    df.write.format("com.databricks.spark.avro") \
        .mode("overwrite") \
        .save("./images")

def fill_image(image):
    # fill the pixel array for one image descriptor (elided on the slide)
    filled = ...
    return filled
8
Working on images using RDDs
- Structured data
- Selection via map / filter operations
- The User Defined Functions (UDF) may be written in any language, e.g. in C++ and interfaced using PyBind

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    df = spark.read.format("com.databricks.spark.avro").load("./images")
    rdd = (df.rdd
           .filter(lambda x: x[1] == 3)      # select a data subset (run == 3)
           .map(lambda x: analyze(x)))
    result = rdd.collect()
    print(result)
9
Working on images using DataFrames
- Appears like rows and columns
- Image indexing by run/patch/ra/dec/filter…

from pyspark.sql import functions

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    # wrap analyze() as a UDF; <type> is the UDF return type, left as a placeholder on the slide
    analyze_udf = functions.udf(lambda m: analyze(m), <type>)
    df = spark.read.load("./images")
    df = (df.filter(df.run == 3)
            .select(df.run, analyze_udf(df.image).alias('image')))
    df.show()
11
Using MongoDB for ref. catalog
Object ingestion:

import pymongo

client = pymongo.MongoClient(MONGO_URL)
lsst = client.lsst
stars = lsst.stars
for o_id in objects:
    o = objects[o_id]
    obj = o.to_db()                                   # conversion to BSON-compatible document
    obj['center'] = {'type': 'Point',
                     'coordinates': [o.ra, o.dec]}
    stars.insert_one(obj)
stars.create_index([('center', '2dsphere')])          # add 2D indexing

Object finding:

center = [cluster.ra(), cluster.dec()]
for o in stars.find({'center': {'$geoWithin': {'$centerSphere': [center, radius]}}},
                    {'_id': 0, 'where': 1, 'center': 1}):
    print('identified object')
12
The Spark platform at LAL
- Operated in the context of VirtualData and the mutualisation project ERM/MRM (Université Paris-Sud)
- This project groups several research teams at U-PSud (genomics, bio-informatics, LSST), all studying the Spark technology
- We held a Spark school in March 2017 (with the help of an expert from Databricks)
13
U-PSud: OpenStack, CentOS 7
- Master: 18 cores, 32 GB RAM, 4 TB
- Workers (LSST): 2 TB HDFS each, Mongo
- Total: 108 cores, 192 GB RAM, 12 TB
- Software: Hadoop 2.6.5, Spark 2.1.0, Java 1.8, Python 3.5, Mongo 3.4
14
MongoDb
Several functional characteristics of the QServ system seem achievable with the MongoDb tool, among which we may quote:
- Ability to distribute both the database and the server through the intrinsic sharding mechanism (see the sketch below)
- Indexing against the 2D coordinates of the objects
- Indexing against a splitting of the sky in chunks (so as to drive the sharding)
Thus, the purpose of the study is to evaluate whether:
- the MongoDb database natively offers comparable or equivalent functionality
- the performances are comparable
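A minimal sketch of how these two mechanisms could be set up with pymongo, assuming a sharded cluster reachable at MONGO_URL and an Object collection carrying a chunkId field and a GeoJSON loc field (names reused from the other slides; this is illustrative, not the actual ingestion code):

import pymongo

client = pymongo.MongoClient(MONGO_URL)   # MONGO_URL as in the ingestion slide

# Shard the Object collection on the sky-chunk identifier,
# so that the chunk-based sky splitting drives the data distribution
client.admin.command('enableSharding', 'lsst')
client.admin.command('shardCollection', 'lsst.Object', key={'chunkId': 1})

# 2D indexing on the object coordinates (GeoJSON point stored in 'loc')
client.lsst.Object.create_index([('loc', pymongo.GEOSPHERE)])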
15
MongoDb in the Galactica cluster
One single server:
- Name: MongoServer_1
- Flavor: C1.large
- RAM: 4 GB
- VCPUs: 8
- Disk: 40 GB

The tests are operated upon a dataset of 1.9 TB:
- Object (… documents)
- Source (… documents)
- ForcedSource (… documents)
- ObjectFullOverlap (… documents)

These catalogues are partitioned into sky regions (identified by a chunkId); 324 sky regions are available for each of the 4 catalogue types.
16
Operations
Ingestion:
- Translating the SQL schema into a MongoDb schema (i.e. selecting the data types)
- Ingesting the CSV lines
- Automatic creation of the indexes from the SQL keys described in the SQL schema

Testing simple queries (a pymongo sketch follows below):
- But… these measures were done with indexes on the queried quantities…
- We don't want to index all 300 parameters
- Better to structure the parameter space and index over groups of parameters

Timings of the simple queries:
- select count(*) from Object: … seconds
- select count(*) from ForcedSource: … seconds
- SELECT ra, decl FROM Object WHERE deepSourceId = …; : … seconds
- SELECT ra, decl FROM Object WHERE qserv_areaspec_box(…); : … seconds
- select count(*) from Object where y_instFlux > 5; : … seconds
- select min(ra), max(ra), min(decl), max(decl) from Object; : … seconds
- select count(*) from Source where flux_sinc between 1 and 2; : … seconds
- select count(*) from Source where flux_sinc between 2 and 3; : … seconds
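For illustration, a few of the simple queries above might translate into pymongo roughly as follows (collection and field names taken from the SQL statements on the slide; some_id stands for the elided deepSourceId value; count_documents assumes a recent pymongo; this is a sketch, not the benchmark code):

import pymongo

client = pymongo.MongoClient(MONGO_URL)           # MONGO_URL as in the ingestion slide
db = client.lsst

# select count(*) from Object
n_objects = db.Object.count_documents({})

# SELECT ra, decl FROM Object WHERE deepSourceId = <id>
doc = db.Object.find_one({'deepSourceId': some_id}, {'ra': 1, 'decl': 1, '_id': 0})

# select count(*) from Object where y_instFlux > 5
n_bright = db.Object.count_documents({'y_instFlux': {'$gt': 5}})

# select min(ra), max(ra), min(decl), max(decl) from Object
bounds = list(db.Object.aggregate([
    {'$group': {'_id': None,
                'min_ra': {'$min': '$ra'}, 'max_ra': {'$max': '$ra'},
                'min_decl': {'$min': '$decl'}, 'max_decl': {'$max': '$decl'}}}
]))

# select count(*) from Source where flux_sinc between 1 and 2
n_sources = db.Source.count_documents({'flux_sinc': {'$gte': 1, '$lte': 2}})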
17
Joins, Aggregations
- Mongo operates complex queries using an aggregation pipeline of map-reduce-like stages (based on iterators)
- Example: finding all neighbours with distance < Dmax within a region:
  - select a sky region around a reference point
  - build a self-join so as to obtain a list of object pairs
  - compute the distance between the objects of every pair
  - keep all computed distances lower than a maximum value
18
Aggregation

result = lsst.Object.aggregate([
    # Select objects in a region around (ra0, dec0)
    {'$geoNear': {
        'near': [ra0, dec0],
        'query': {'loc': {'$geoWithin': {'$box': [bottomleft, topright]}}},
        'distanceField': 'dist',
    }},
    # Construct all pairs within the region (self-join)
    {'$lookup': {'from': 'Object',
                 'localField': 'Object.loc',
                 'foreignField': 'Object.loc',
                 'as': 'neighbours'}},
    # Flatten the list of pairs
    {'$unwind': '$neighbours'},
    # Remove the duplication (an object paired with itself)
    {'$redact': {'$cond': [{'$eq': ["$_id", "$neighbours._id"]}, "$$PRUNE", "$$KEEP"]}},
    # Compute the distance between the two objects of a pair
    # ('dist' is a Python variable holding the distance expression, not shown on the slide)
    {'$addFields': {'dist': dist}},
    # Filter on the maximum distance
    {'$match': {'dist': {'$lt': 1}}},
    # Final projection
    {'$project': {'_id': 0, 'loc': 1, 'neighbours.loc': 1, 'dist': 1}},
])
19
Spark/DataFrames
Context:
- Same dataset, same objective
- VirtualData LAL
- Ingest the dataset using the CSV connector to DataFrames
- Operate the SQL-like API to query
- Use GeoSpark for 2D navigation, filtering, indexing:
  - Objects: Point, Rectangle, Polygon, LineString
  - Spatial index: R-Tree and Quad-Tree
  - Geometrical operations: Minimum Bounding Rectangle, PolygonUnion, and Overlap/Inside (self-join)
  - Spatial query operations: spatial range query, spatial join query and spatial KNN query
  - Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of the IEEE International Conference on Data Engineering (ICDE 2016), Helsinki, Finland, May 2016
20
CSV ingestion to Spark
Get the SQL schema & produce the Spark representation of this schema:

import subprocess
from pyspark.sql import SparkSession, SQLContext

# helpers from the author's catalogue-handling code (not shown on the slide)
catalog.read_schema()
set_schema_structures()

spark = SparkSession.builder.appName("StoreCatalog").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Get the CSV files from HDFS
cat = subprocess.Popen(["hadoop", "fs", "-ls", "/user/christian.arnault/swift"],
                       stdout=subprocess.PIPE)

for line in cat.stdout:
    file_name = line.split('/')[-1].strip()
    # Get the Spark schema for this catalogue
    schema = read_data(file_name)
    # Read the CSV file
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', delimiter=';') \
        .load('swift/' + file_name, schema=schema.structure)
    # Append the data into the dataframe store, partitioned by chunkId
    df.write.format("com.databricks.spark.avro") \
        .mode(write_mode).partitionBy('chunkId').save("./lsstdb")
21
Read the dataframe and query
Scala:

val conf = new SparkConf().setAppName("DF")
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .appName("Read Dataset")
  .getOrCreate()
val sqlContext = new SQLContext(sc)

// Read the dataframe from HDFS using the Avro serializer
// (time() is the author's helper measuring elapsed time, not shown)
var df = time("Load db", sqlContext.
  read.
  format("com.databricks.spark.avro").
  load("./lsstdb"))

// Perform queries
val sorted = time("sort", df.select("ra", "decl", "chunkId").sort("ra"))
val seq = time("collect", sorted.rdd.take(10))
println(seq)
22
Conclusion
Spark is a rich and promising eco-system:
- But it requires understanding the configuration: memory (RAM), data partitioning (throughput) (see the sketch below)
- Building the pipeline (as a DAG of processes)
- Understanding the monitoring tools (e.g. Ganglia)

MongoDb:
- Powerful, but based on a very different paradigm than SQL (map-reduce based)
- I observed strange performance results that need to be understood

Spark for catalogues:
- Migrating to Spark/DataFrames seems really encouraging and should not show the same limitations…
- Preliminary results are at least better than Mongo (especially at the ingestion step)
- GeoSpark is powerful and meant to support very large datasets
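As an illustration of the memory and partitioning knobs mentioned above, a minimal PySpark configuration sketch (the values and application name are arbitrary examples, not the settings used on the LAL cluster):

from pyspark.sql import SparkSession

# Example resource settings; tune per cluster (values here are illustrative only)
spark = (SparkSession.builder
         .appName("lsst-config-example")
         .config("spark.executor.memory", "8g")          # RAM per executor
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "216")  # partitions used by shuffles
         .getOrCreate())

# Repartitioning the catalogue by chunkId controls the parallelism / throughput
df = spark.read.format("com.databricks.spark.avro").load("./lsstdb")
df = df.repartition(216, "chunkId")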