1
Spark & MongoDb for LSST
Christian Arnault (LAL), Réza Ansari (LAL), Fabrice Jammes (LPC Clermont), Osman Aidel (CCIN2P3), César Richard (U-PSud)
June, LSST Workshop - CCIN2P3
2
Topics: Spark, MongoDb, Spark (again)

Spark:
- How to consider parallelism & distribution in the processing workflows
- How to cope with intermediate data: managing the steps in the workflow, producing the final data (catalogues)
- How to distribute data (data formats): Avro/Parquet (converting the FITS format; see the sketch below)

MongoDb:
- To understand whether Mongo might offer similar features to QServ

Spark (again):
- Same question, but using the Spark DataFrame technology
- Combined with the GeoSpark module for 2D indexing
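To fix ideas on the FITS to Avro/Parquet conversion, here is a minimal sketch using astropy and PySpark. The input file name, the header keywords and the flat one-row-per-CCD layout are assumptions for the example, not the project's actual conversion code:

from astropy.io import fits
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("fits-to-parquet").getOrCreate()

# Read one FITS image (one CCD) and keep its pixels plus a few header keywords
with fits.open("ccd_example.fits") as hdul:        # hypothetical input file
    header = hdul[0].header
    pixels = hdul[0].data.astype(float).tolist()   # 2D pixel array as nested lists

row = Row(run=int(header.get("RUN", 0)),
          ra=float(header.get("RA", 0.0)),
          dec=float(header.get("DEC", 0.0)),
          image=pixels)

# One row per CCD image; Parquet (or Avro) then becomes the distributed exchange format
df = spark.createDataFrame([row])
df.write.mode("overwrite").parquet("./images_parquet")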
3
Spark: the simplified process
Simulation / Observation → Images → Calibration → Object detection (sky background, reference catalogues) → Objects {x, y, flux} → Photometry, photo-z, astrometry → Measured objects {RA, DEC, flux, magnitude, Z} → Catalogues
4
Typical numbers
- Camera: 3.2 Gpixels; 189 CCDs / 6 filters; 15 TB per night (× 10 years); 3 GB/s
- Image: diameter 3.5° / 64 cm → 9.6 deg² (Moon = 0.5°, ~ ×6 CCD images)
- CCD: 16 Mpixels (= 1 FITS file); 16 cm²; 0.05 deg²
- Pixels: 10 µm; 0.2 arcsec; 2 bytes
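As a rough consistency check on these numbers, a few lines of Python (all values are taken from the slide; the per-night image count is a back-of-the-envelope estimate, not an LSST specification):

# Quick consistency check on the slide's numbers
ccds = 189
pixels_per_ccd = 16e6                 # 16 Mpixels per CCD
bytes_per_pixel = 2

pixels_per_image = ccds * pixels_per_ccd               # ~3.0 Gpixels (slide quotes 3.2 Gpixels)
bytes_per_image = pixels_per_image * bytes_per_pixel   # ~6 GB of raw pixels per image

night_volume = 15e12                  # 15 TB per night
images_per_night = night_volume / bytes_per_image      # rough estimate

print(f"{pixels_per_image / 1e9:.1f} Gpixels, {bytes_per_image / 1e9:.1f} GB per image, "
      f"~{images_per_night:.0f} images per night")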
5
Algorithms
Simulation:
- Apply a Gaussian pattern with a common width (i.e. we only consider atmosphere and optical aberrations) + some noise
Detection (see the sketch below):
- Convolution with a Gaussian pattern for the PSF
- Handle an overlap margin for objects close to the image border
Identification:
- Search for geo-2D coordinates in the reference catalogues
- Handling a large number of data files, based on multiple indexing keys (run, filter, ra, dec, …), aka the 'data butler'
- Studying the transfer mechanisms: throughput, serialization
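A minimal sketch of this simulation/detection idea with NumPy/SciPy (not the actual LSST code; image size, source position and the 5-sigma threshold are made-up example values):

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Simulation: sky background + noise, plus one Gaussian-shaped object
image = rng.normal(loc=100.0, scale=5.0, size=(200, 200))
y, x = np.mgrid[0:200, 0:200]
image += 500.0 * np.exp(-((x - 80) ** 2 + (y - 120) ** 2) / (2 * 2.0 ** 2))

# Detection: convolve with a Gaussian PSF model, then threshold above the background
smoothed = gaussian_filter(image, sigma=2.0)
background = np.median(smoothed)
noise = smoothed.std()
detections = np.argwhere(smoothed > background + 5 * noise)   # 5-sigma threshold

print(f"{len(detections)} pixels above threshold around the injected object")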
7
Images creation (Spark)
Declare a schema for:
- Serialization of images
- Data partitioning & indexing

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, ArrayType

def make_schema():
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("run", IntegerType(), True),
        StructField("ra", DoubleType(), True),
        StructField("dec", DoubleType(), True),
        StructField("image", ArrayType(ArrayType(DoubleType()), True))])
    return schema

def create_image(spark):
    runs = ...
    rows = 3; cols = 3; region_size = 4000
    images = []; image_id = 0
    # initialize image descriptors
    for run in range(runs):
        for r in range(rows):
            for c in range(cols):
                ra = ...; dec = ...
                images.append((image_id, run, ra, dec))
                image_id += 1
    # fill the pixel data in parallel, then write as Avro
    rdd = spark.sparkContext.parallelize(images).map(lambda x: fill_image(x))
    df = spark.createDataFrame(rdd, make_schema())
    df.write.format("com.databricks.spark.avro") \
        .mode("overwrite") \
        .save("./images")

def fill_image(image):
    # fill the pixel array for one image descriptor (elided on the slide)
    filled = ...
    return filled
8
Working on images using RDDs
- Structured data
- Selection via map / filter operations
- The User Defined Functions (UDF) may be written in any language, e.g. in C++ and interfaced using PyBind

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    df = spark.read.format("com.databricks.spark.avro").load("./images")
    rdd = (df.rdd
           .filter(lambda x: x[1] == 3)      # select a data subset (run == 3)
           .map(lambda x: analyze(x)))
    result = rdd.collect()
    print(result)
9
Working on images using DataFrames
- Appears like rows and columns
- Image indexing by run/patch/ra/dec/filter…

from pyspark.sql import functions

def analyze(x):
    return 'analyze image', x[0]

def read_images(spark):
    # wrap analyze() as a UDF; <type> is the UDF return type, left as a placeholder on the slide
    analyze_udf = functions.udf(lambda m: analyze(m), <type>)
    df = spark.read.load("./images")
    df = (df.filter(df.run == 3)
            .select(df.run, analyze_udf(df.image).alias('image')))
    df.show()
11
Using MongoDB for ref. catalog
Object ingestion:

import pymongo

client = pymongo.MongoClient(MONGO_URL)
lsst = client.lsst
stars = lsst.stars
for o_id in objects:
    o = objects[o_id]
    obj = o.to_db()                                   # conversion to BSON-compatible document
    obj['center'] = {'type': 'Point',
                     'coordinates': [o.ra, o.dec]}
    stars.insert_one(obj)
stars.create_index([('center', '2dsphere')])          # add 2D indexing

Object finding:

center = [cluster.ra(), cluster.dec()]
for o in stars.find({'center': {'$geoWithin': {'$centerSphere': [center, radius]}}},
                    {'_id': 0, 'where': 1, 'center': 1}):
    print('identified object')
12
The Spark platform at LAL
- Operated in the context of VirtualData and the mutualisation project ERM/MRM (Université Paris-Sud)
- This project groups several research teams at U-PSud (genomics, bio-informatics, LSST), all studying the Spark technology
- We held a Spark school in March 2017 (with the help of an expert from Databricks)
13
U-PSud: OpenStack, CentOS 7
- Master: 18 cores, 32 GB RAM, 4 TB
- Workers (LSST): 2 TB HDFS each, Mongo
- Total: 108 cores, 192 GB RAM, 12 TB
- Software: Hadoop 2.6.5, Spark 2.1.0, Java 1.8, Python 3.5, Mongo 3.4
14
MongoDb
Several functional characteristics of the QServ system seem achievable with the MongoDb tool, among which we may quote:
- Ability to distribute both the database and the server through the intrinsic sharding mechanism (see the sketch below)
- Indexing against the 2D coordinates of the objects
- Indexing against a splitting of the sky in chunks (so as to drive the sharding)
Thus, the purpose of the study is to evaluate whether:
- the MongoDb database natively offers comparable or equivalent functionality
- the performances are comparable
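A minimal sketch of how these two mechanisms could be set up with pymongo, assuming a sharded cluster reachable at MONGO_URL and an Object collection carrying a chunkId field and a GeoJSON loc field (names reused from the other slides; this is illustrative, not the actual ingestion code):

import pymongo

client = pymongo.MongoClient(MONGO_URL)   # MONGO_URL as in the ingestion slide

# Shard the Object collection on the sky-chunk identifier,
# so that the chunk-based sky splitting drives the data distribution
client.admin.command('enableSharding', 'lsst')
client.admin.command('shardCollection', 'lsst.Object', key={'chunkId': 1})

# 2D indexing on the object coordinates (GeoJSON point stored in 'loc')
client.lsst.Object.create_index([('loc', pymongo.GEOSPHERE)])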
15
MongoDb in the Galactica cluster
One single server:
- Name: MongoServer_1
- Flavor: C1.large
- RAM: 4 GB
- VCPUs: 8
- Disk: 40 GB

The tests are operated upon a dataset of 1.9 TB:
- Object (… documents)
- Source (… documents)
- ForcedSource (… documents)
- ObjectFullOverlap (… documents)

These catalogues are partitioned into sky regions (identified by a chunkId); 324 sky regions are available for each of the 4 catalogue types.
16
Operations
Ingestion:
- Translating the SQL schema into a MongoDb schema (i.e. selecting the data types)
- Ingesting the CSV lines
- Automatic creation of the indexes from the SQL keys described in the SQL schema

Testing simple queries (a pymongo sketch follows below):
- But… these measures were done with indexes on the queried quantities…
- We don't want to index all 300 parameters
- Better to structure the parameter space and index over groups of parameters

Timings of the simple queries:
- select count(*) from Object: … seconds
- select count(*) from ForcedSource: … seconds
- SELECT ra, decl FROM Object WHERE deepSourceId = …; : … seconds
- SELECT ra, decl FROM Object WHERE qserv_areaspec_box(…); : … seconds
- select count(*) from Object where y_instFlux > 5; : … seconds
- select min(ra), max(ra), min(decl), max(decl) from Object; : … seconds
- select count(*) from Source where flux_sinc between 1 and 2; : … seconds
- select count(*) from Source where flux_sinc between 2 and 3; : … seconds
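For illustration, a few of the simple queries above might translate into pymongo roughly as follows (collection and field names taken from the SQL statements on the slide; some_id stands for the elided deepSourceId value; count_documents assumes a recent pymongo; this is a sketch, not the benchmark code):

import pymongo

client = pymongo.MongoClient(MONGO_URL)           # MONGO_URL as in the ingestion slide
db = client.lsst

# select count(*) from Object
n_objects = db.Object.count_documents({})

# SELECT ra, decl FROM Object WHERE deepSourceId = <id>
doc = db.Object.find_one({'deepSourceId': some_id}, {'ra': 1, 'decl': 1, '_id': 0})

# select count(*) from Object where y_instFlux > 5
n_bright = db.Object.count_documents({'y_instFlux': {'$gt': 5}})

# select min(ra), max(ra), min(decl), max(decl) from Object
bounds = list(db.Object.aggregate([
    {'$group': {'_id': None,
                'min_ra': {'$min': '$ra'}, 'max_ra': {'$max': '$ra'},
                'min_decl': {'$min': '$decl'}, 'max_decl': {'$max': '$decl'}}}
]))

# select count(*) from Source where flux_sinc between 1 and 2
n_sources = db.Source.count_documents({'flux_sinc': {'$gte': 1, '$lte': 2}})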
17
Joins, Aggregations
- Mongo operates complex queries using an aggregation pipeline of map-reduce-like stages (based on iterators)
- Example: finding all neighbours with distance < Dmax within a region:
  - select a sky region around a reference point
  - build a self-join so as to obtain a list of object pairs
  - compute the distance between the objects of every pair
  - keep all computed distances lower than a maximum value
18
Aggregation

result = lsst.Object.aggregate([
    # Select objects in a region around (ra0, dec0)
    {'$geoNear': {
        'near': [ra0, dec0],
        'query': {'loc': {'$geoWithin': {'$box': [bottomleft, topright]}}},
        'distanceField': 'dist',
    }},
    # Construct all pairs within the region (self-join)
    {'$lookup': {'from': 'Object',
                 'localField': 'Object.loc',
                 'foreignField': 'Object.loc',
                 'as': 'neighbours'}},
    # Flatten the list of pairs
    {'$unwind': '$neighbours'},
    # Remove the duplication (an object paired with itself)
    {'$redact': {'$cond': [{'$eq': ["$_id", "$neighbours._id"]}, "$$PRUNE", "$$KEEP"]}},
    # Compute the distance between the two objects of a pair
    # ('dist' is a Python variable holding the distance expression, not shown on the slide)
    {'$addFields': {'dist': dist}},
    # Filter on the maximum distance
    {'$match': {'dist': {'$lt': 1}}},
    # Final projection
    {'$project': {'_id': 0, 'loc': 1, 'neighbours.loc': 1, 'dist': 1}},
])
19
Spark/DataFrames
Context:
- Same dataset, same objective
- VirtualData LAL
- Ingest the dataset using the CSV connector to DataFrames
- Operate the SQL-like API to query
- Use GeoSpark for 2D navigation, filtering, indexing:
  - Objects: Point, Rectangle, Polygon, LineString
  - Spatial index: R-Tree and Quad-Tree
  - Geometrical operations: Minimum Bounding Rectangle, PolygonUnion, and Overlap/Inside (self-join)
  - Spatial query operations: spatial range query, spatial join query and spatial KNN query
  - Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of the IEEE International Conference on Data Engineering (ICDE 2016), Helsinki, Finland, May 2016
20
CSV ingestion to Spark
Get the SQL schema & produce the Spark representation of this schema:

import subprocess
from pyspark.sql import SparkSession, SQLContext

# helpers from the author's catalogue-handling code (not shown on the slide)
catalog.read_schema()
set_schema_structures()

spark = SparkSession.builder.appName("StoreCatalog").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Get the CSV files from HDFS
cat = subprocess.Popen(["hadoop", "fs", "-ls", "/user/christian.arnault/swift"],
                       stdout=subprocess.PIPE)

for line in cat.stdout:
    file_name = line.split('/')[-1].strip()
    # Get the Spark schema for this catalogue
    schema = read_data(file_name)
    # Read the CSV file
    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', delimiter=';') \
        .load('swift/' + file_name, schema=schema.structure)
    # Append the data into the dataframe store, partitioned by chunkId
    df.write.format("com.databricks.spark.avro") \
        .mode(write_mode).partitionBy('chunkId').save("./lsstdb")
21
Read the dataframe and query
Scala:

val conf = new SparkConf().setAppName("DF")
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .appName("Read Dataset")
  .getOrCreate()
val sqlContext = new SQLContext(sc)

// Read the dataframe from HDFS using the Avro serializer
// (time() is the author's helper measuring elapsed time, not shown)
var df = time("Load db", sqlContext.
  read.
  format("com.databricks.spark.avro").
  load("./lsstdb"))

// Perform queries
val sorted = time("sort", df.select("ra", "decl", "chunkId").sort("ra"))
val seq = time("collect", sorted.rdd.take(10))
println(seq)
22
Conclusion
Spark is a rich and promising eco-system:
- But it requires understanding the configuration: memory (RAM), data partitioning (throughput) (see the sketch below)
- Building the pipeline (as a DAG of processes)
- Understanding the monitoring tools (e.g. Ganglia)

MongoDb:
- Powerful, but based on a very different paradigm than SQL (map-reduce based)
- I observed strange performance results that need to be understood

Spark for catalogues:
- Migrating to Spark/DataFrames seems really encouraging and should not show the same limitations…
- Preliminary results are at least better than Mongo (especially at the ingestion step)
- GeoSpark is powerful and meant to support very large datasets
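As an illustration of the memory and partitioning knobs mentioned above, a minimal PySpark configuration sketch (the values and application name are arbitrary examples, not the settings used on the LAL cluster):

from pyspark.sql import SparkSession

# Example resource settings; tune per cluster (values here are illustrative only)
spark = (SparkSession.builder
         .appName("lsst-config-example")
         .config("spark.executor.memory", "8g")          # RAM per executor
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "216")  # partitions used by shuffles
         .getOrCreate())

# Repartitioning the catalogue by chunkId controls the parallelism / throughput
df = spark.read.format("com.databricks.spark.avro").load("./lsstdb")
df = df.repartition(216, "chunkId")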