High-Resolution Solar Physics: Past, Present, Future
Alisdair Davey (adavey@nso.edu), NSO DKIST Data Center
Steven Berukoff, Tony Hays, Kevin Reardon, DJ Spiess, Fraser Watson and Scott Wiant
Data Center Roles
- The First Light Data Center
- Data Curation
- Data Calibration and Processing
Data Center Roles: Data Curation
Data curation is the management of data throughout its lifecycle, from creation and initial storage to the time when it is archived for posterity or becomes obsolete and is deleted. Its main purpose is to ensure that data remain reliably retrievable for future research or reuse.
- Long-term storage management
- Streamlined, automated data and metadata ingest
- Effective query and data retrieval
- A flexible system that accommodates science, instrument and technological changes without the need for frequent major redesigns
- Planned 44-year lifetime (2 solar cycles)
Solar Physics Mission Data Sizes
The state of solar data storage and distribution (courtesy K. Reardon)
Solar Physics Mission Data Sizes
SDO – the 800 lb (362 kg) data gorilla in the room
[Chart: relative mission data volumes – SDO, other missions, DKIST]
Data Transfer – Maui -> Boulder
Challenge: move data from the telescope to Boulder. And what about 60 TB/day peaks?
- Mean 9 TB/day over a shared 10 Gbps network, within an 8-hour window (see the back-of-the-envelope check below)
- Extensive testing to identify and fix bottlenecks
- Partner engagement: upgrade to 40 Gbps by 2019
- Use mature high-bandwidth tools (Globus)
- Leverage several existing networking providers, including U. Hawaii, U. Colorado and Internet2
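As a sanity check on those numbers: moving a mean of 9 TB/day within an 8-hour window implies a sustained rate of about 2.5 Gbps, which fits comfortably on a shared 10 Gbps link, while a 60 TB/day peak does not. A minimal sketch of the arithmetic (the function and values are illustrative, not from the Data Center's design documents):

```python
# Back-of-the-envelope bandwidth check for the Maui -> Boulder link.
def required_gbps(tb_per_day: float, transfer_hours: float) -> float:
    """Sustained rate (Gbps) needed to move tb_per_day within transfer_hours."""
    bits = tb_per_day * 1e12 * 8          # decimal terabytes -> bits
    seconds = transfer_hours * 3600
    return bits / seconds / 1e9

print(required_gbps(9, 8))    # mean case: ~2.5 Gbps on a shared 10 Gbps link
print(required_gbps(60, 8))   # peak case: ~16.7 Gbps -> motivates the 40 Gbps upgrade
```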
High-Speed Undersea Cables Connecting Hawaii
DKIST Data Transportation (Alternate Models)
Never underestimate the bandwidth of a man with a van load of tapes!
Getting DKIST Data to Boulder: West Coast DolphinNet
Alternative Route for Getting Data off the Mountain: Grad Student
Data Content Management
Current plan:
- Receive FITS files from the summit; run the calibrations process
- Parse FITS, store metadata in an "Inventory" (as sketched below)
- Store serialized FITS in an Object Store
- Maintain an offsite partial replica
- Retain everything until science-driven QA/C is done (after > 6 mos.)
Object storage – common adoption:
- Commercial cloud storage (S3, Azure, Google Cloud) – also Facebook, Spotify, Dropbox
- Industry initiatives (OpenStack Swift)
- Ceph implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level.
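A minimal sketch of the ingest step in the plan above, assuming an S3-compatible object store and using astropy to lift the FITS header into an inventory record. The bucket name, key scheme and record layout are placeholders, not the Data Center's actual schema:

```python
# Hedged sketch: parse a FITS file's header into an "inventory" record,
# then store the serialized file in an S3-compatible object store.
import boto3
from astropy.io import fits

s3 = boto3.client("s3")          # or any S3-compatible endpoint (Ceph, Swift, ...)
BUCKET = "dkist-frames"          # placeholder bucket name

def ingest(path: str, key: str) -> dict:
    # Parse the FITS header into a metadata record for the inventory.
    with fits.open(path) as hdul:
        header = dict(hdul[0].header)
    inventory_record = {"object_key": key, "header": header}

    # Store the serialized FITS file as an opaque object.
    with open(path, "rb") as f:
        s3.put_object(Bucket=BUCKET, Key=key, Body=f.read())
    return inventory_record
```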
Why not send it to the cloud?
We can afford to send it to the cloud. We may even be able to afford the cloud storage… but we couldn't afford to give the data out!
But wait! Amazon's cloud now has modules you can use to charge people to download your data!! That's OK, right?!
The Devils They Know!!
GridFTP: Hα and Ca II K data from ChroTel (KIS) on Tenerife
GridFTP
GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. It extends the File Transfer Protocol (FTP) for grid computing and was defined within the GridFTP working group of the Open Grid Forum. There are multiple implementations of the protocol; the most widely used is the one provided by the Globus Toolkit. You will have to register for an account.
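For illustration, a minimal Globus SDK sketch of the kind of transfer this enables. The access token and endpoint UUIDs are placeholders (obtaining them requires the registered account mentioned above), and the paths are invented:

```python
# Minimal sketch of a Globus transfer, assuming a valid transfer token
# and endpoint UUIDs are already in hand.
import globus_sdk

TRANSFER_TOKEN = "..."                 # obtained via the Globus auth flow
SRC_ENDPOINT = "source-endpoint-uuid"  # e.g. the data provider's collection
DST_ENDPOINT = "dest-endpoint-uuid"    # e.g. your institution's endpoint

authorizer = globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
tc = globus_sdk.TransferClient(authorizer=authorizer)

tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="solar data transfer", sync_level="checksum",
)
tdata.add_item("/data/fits/", "/archive/fits/", recursive=True)

task = tc.submit_transfer(tdata)       # Globus manages retries and integrity
print("task id:", task["task_id"])
```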
Critical Science Plans
- 40 TiB average / 14 TiB median project data-set size
- Plans as large as ~240 TB
- Room on your desktop?
- Data transfer times to you are in days … weeks! (see the estimate below)
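To see why "days … weeks" is no exaggeration, a quick estimate of download time for the average data set at typical end-user rates (the rates are illustrative assumptions):

```python
# Transfer time for a data set of `tib` tebibytes at `mbps` megabits/second.
def transfer_days(tib: float, mbps: float) -> float:
    bits = tib * 2**40 * 8
    seconds = bits / (mbps * 1e6)
    return seconds / 86400

print(transfer_days(40, 100))    # ~41 days for the average plan at 100 Mbps
print(transfer_days(240, 1000))  # ~24 days for the largest plan, even at 1 Gbps
```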
Ground-Based Solar Data Calibration / Analysis
Data Calibration & Processing
Complex observations, complex hardware:
- Ground-based (atmosphere!), high-resolution, high-cadence, small field-of-view observations
- Sophisticated instruments (multiple modes, large data rates), broad bandwidth support
- Multiple external partners / instrument providers – need to coordinate the calibration definition/development process
Complex facility support:
- Nine optical assemblies (excluding instruments)
- AO, thermal, enclosure, polarization
Incomplete prior instrument calibration experience:
- Existing instruments are not the same as the DKIST instruments
- No comparable ground-based data processing systems in solar physics
- Invent "some" of the wheel… but not all!!
- New for solar physics, but not necessarily new for big data
Conclusion: this is novel and complex!
CRISP Data Pipeline
Data Calibration & Processing
Asynchronous event-driven pipeline (Python <-> IDL bridge):
1. New data arrives…
2. …and is ID'd for calibration
3. An automated calibration task is scheduled…
4. …and executed or queued
5. If it completes, results are written to the data store
6. If not, DC staff are notified to fix or manually execute
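A toy asyncio sketch of that event-driven flow. The queue, the calibrate step and the notification hook are all invented for illustration; the real pipeline (including its Python <-> IDL bridge) is far more involved:

```python
# Toy sketch of an asynchronous, event-driven calibration pipeline:
# new-data events are queued, calibrated, and either stored or escalated.
import asyncio

async def calibrate(frame_id: str) -> bool:
    """Placeholder calibration task; returns True on success."""
    await asyncio.sleep(0)   # real work (possibly via an IDL bridge) goes here
    return True

async def worker(queue: asyncio.Queue) -> None:
    while True:
        frame_id = await queue.get()        # new data is ID'd for calibration
        ok = await calibrate(frame_id)      # task is scheduled and executed
        if ok:
            print(f"{frame_id}: results written to data store")
        else:
            print(f"{frame_id}: notifying DC staff to fix / run manually")
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    _worker = asyncio.create_task(worker(queue))
    for frame_id in ("vbi_0001", "vbi_0002"):  # stand-in for arriving files
        await queue.put(frame_id)
    await queue.join()                         # wait for all events to drain

asyncio.run(main())
```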
Visible Broadband Imager (VBI): a "Straightforward" Calibration
VBI will record images from the DKIST telescope at the highest possible spatial and temporal resolution at a number of specified wavelengths in the range 390 nm to 860 nm. It will provide high-quality imaging through filters with relatively broad pass-bands to optimize throughput; its high cadence and short exposure times come at the expense of information in the spectral domain. The VBI design allows exposure times short enough (at 30 frames/s) to effectively "freeze" the atmospheric turbulence and apply speckle-interferometric image reconstruction techniques.
Calibration – First Light Data Center
- Define core calibrations of each detector/instrument
- Leverage known calibration(s) from existing/previous instruments
- Instrument Calibration Plans are a joint effort between the DC & instrument partners/providers
- Build & publish community-contributed Python/IDL code – and not forgetting the algorithm documentation!
- Expect significant iteration + refinement in early operations
- Build in revision & change control processes, and "sandbox" computing to support ongoing development
- Avoid the pipe dream of totally automated calibrations (at least to begin with)
- Must plan for flexible processing (automatic + user-directed)
- Create opportunities for others to do big data analytics
Managing Expectations
"But I just want to run dkist_prep on all the data! What? You want me to write the paper too?" – the space-based solar physicist version.
Managing Expectations
"But I want you to invert *ALL* the lines for me! Let me guess, ME isn't good enough for you either!!" – the ground-based solar physicist version.
Managing Expectations
- DKIST will be ready for science operations (summit & data center) in Jan. 2020!
- The community will be enabled to do science with DKIST! We welcome community help!
- The Data Center won't do everything that you want it to do right from the start!
- There are significant big data challenges – but we think we have them managed!
And then …
You will have the data you can do big solar data analytics with, albeit you will be left with a few scientific decisions, such as which inversion routines you want to run! The First Light Data Center is the beginning, not the end product. dkist_paper, [/apj], [/science], [/nature] etc. coming in 2024. (Thanks to Tom Schad for the suggestion!)