DKIST Data Center Alisdair Davey NSO


DKIST Data Center
Alisdair Davey (adavey@nso.edu), NSO
with Steven Berukoff, Tony Hays, Kevin Reardon, DJ Spiess, Fraser Watson and Scott Wiant
High-Resolution Solar Physics: Past, Present, Future

Data Center Roles
- The First Light Data Center
- Data Curation
- Data Calibration and Processing

Data Center Roles: Data Curation
Data curation is the management of data throughout its lifecycle, from creation and initial storage to the time when it is archived for posterity or becomes obsolete and is deleted. Its main purpose is to ensure that data remains reliably retrievable for future research and reuse.
- Long-term storage management
- Streamlined, automated data and metadata ingest
- Effective query and data retrieval
- A flexible system that accommodates science, instrument, and technological changes without the need for frequent major redesign
- Planned 44-year lifetime (2 solar cycles)
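
The "ingest once, query for decades" idea behind data curation can be sketched as follows: searchable metadata is kept in a small, fast inventory, separate from the bulk data. This is only an illustrative stand-in; the table and column names are hypothetical, not the DKIST schema.

```python
# Minimal sketch of a metadata "inventory" for curated data sets.
# Schema and names are illustrative assumptions, not the DKIST design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory (
        dataset_id TEXT PRIMARY KEY,
        instrument TEXT,
        date_obs   TEXT,   -- ISO 8601 observation time
        wavelength REAL,   -- nm
        location   TEXT    -- pointer into bulk object storage
    )""")

def ingest(record):
    """Register one dataset's metadata; the bulk FITS data lives elsewhere."""
    conn.execute("INSERT INTO inventory VALUES (:dataset_id, :instrument, "
                 ":date_obs, :wavelength, :location)", record)

ingest({"dataset_id": "DS-0001", "instrument": "VBI",
        "date_obs": "2020-01-29T18:00:00", "wavelength": 430.5,
        "location": "objstore://bucket/DS-0001"})

# Queries stay fast as the archive grows, because they touch only the
# compact metadata table, never the bulk files.
rows = conn.execute(
    "SELECT dataset_id FROM inventory WHERE instrument = 'VBI'").fetchall()
print(rows)  # [('DS-0001',)]
```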

Solar Physics Mission Data Sizes: The State of Solar Data Storage and Distribution (courtesy K. Reardon)

Solar Physics Mission Data Sizes: SDO, the 800 lb (362 kg) data gorilla in the room

Solar Physics Mission Data Sizes (chart comparing data volumes of SDO, other missions, and DKIST)

Data Transfer: Maui -> Boulder
Challenge: move data from the telescope to Boulder. (And what about 60 TB/day?)
- A mean of 9 TB/day over a shared 10 Gbps network takes ~8 hours
- Extensive testing to identify and fix bottlenecks
- Partner engagement: upgrade to 40 Gbps by 2019
- Use mature high-bandwidth tools (Globus)
- Leverage several existing networking providers, including U. Hawaii, U. Colorado and Internet2
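
A quick back-of-envelope check of the numbers above: moving 9 TB in 8 hours implies an effective throughput of about 2.5 Gbps, i.e. roughly a quarter of the shared 10 Gbps link, which is plausible on a contended research network.

```python
# Effective throughput needed to move 9 TB in 8 hours.
tb_per_day = 9
bits = tb_per_day * 1e12 * 8          # 9 TB expressed in bits
hours = 8
throughput_gbps = bits / (hours * 3600) / 1e9
print(round(throughput_gbps, 2))      # 2.5
```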

Hi-Speed Undersea Cables Connecting Hawaii

DKIST Data Transportation (Alternate Models): never underestimate the bandwidth of a man with a van load of tapes!
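
The van-of-tapes quip is the classic "station wagon full of tapes" estimate, and it holds up arithmetically. The figures below are illustrative assumptions (LTO-8 media at ~12 TB native each, 1000 cartridges, a 3-day shipment), not a DKIST logistics plan.

```python
# Effective bandwidth of a van load of tapes (all figures are assumptions).
tapes, tb_per_tape, days = 1000, 12, 3
total_bits = tapes * tb_per_tape * 1e12 * 8
gbps = total_bits / (days * 86400) / 1e9
print(round(gbps))  # 370 -- far beyond the shared 10 Gbps network link
```

The catch, of course, is latency: nothing arrives until the van does.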

Getting DKIST Data to Boulder: West Coast DolphinNet

Alternative Route for Getting Data off the Mountain: the Grad Student

Data Content Management
Current plan:
- Receive FITS files from the summit and the calibration process
- Parse FITS, store metadata in an "Inventory"
- Store serialized FITS in an Object Store
- Maintain an offsite partial replica
- Retain everything until science-driven QA/QC is done (after > 6 months)
Object storage has seen common adoption:
- Commercial cloud storage (S3, Azure, Google Cloud); also Facebook, Spotify, Dropbox
- Industry initiatives (OpenStack Swift)
- Ceph implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level.
So why not send it to the cloud?
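
The "serialized FITS into an Object Store" step boils down to a flat key-to-blob interface, which is what S3, Swift and Ceph all expose. This in-memory stand-in is a sketch of the API shape only, not any real backend; the class and key names are hypothetical.

```python
# Toy object store illustrating the flat key -> blob interface
# (put/get plus a checksum for replica-integrity checks).
import hashlib

class ObjectStore:
    def __init__(self):
        self._blobs = {}

    def put(self, key, data: bytes) -> str:
        """Store a blob; return its checksum for later integrity checks."""
        self._blobs[key] = data
        return hashlib.sha256(data).hexdigest()

    def get(self, key) -> bytes:
        return self._blobs[key]

store = ObjectStore()
fits_bytes = b"SIMPLE  =                    T"   # stand-in for a FITS file
checksum = store.put("DS-0001/frame_0001.fits", fits_bytes)

# On retrieval (e.g. from the offsite partial replica), the checksum
# lets the archive verify the copy matches what was ingested.
assert hashlib.sha256(
    store.get("DS-0001/frame_0001.fits")).hexdigest() == checksum
```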

Why not send it to the cloud? We can afford to send the data to the cloud, and we may even be able to afford the cloud storage… but we couldn't afford the egress costs of giving the data out! But wait! Amazon's cloud now has modules you can use to charge people to download your data!! That's OK, right?!

The Devils They Know!! GridFTP: Hα and Ca II K data from ChroTel (KIS) on Tenerife

GridFTP GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. GridFTP is an extension of the File Transfer Protocol (FTP) for grid computing. The protocol was defined within the GridFTP working group of the Open Grid Forum. There are multiple implementations of the protocol; the most widely used is that provided by the Globus Toolkit. You will have to register for an account.

Critical Science Plans
- 40 TiB / 14 TiB: average / median project data-set size
- Plans as large as ~240 TB
- Room on your desktop?
- Data transfer times to you are in days … weeks!
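
The "days … weeks" claim checks out for the 40 TiB average project size quoted above. The link speeds below are illustrative assumptions (a 1 Gbps campus connection and a 100 Mbps home connection).

```python
# Transfer time for a 40 TiB project data set at two assumed link speeds.
tib = 40
bits = tib * 2**40 * 8                        # 40 TiB expressed in bits
for gbps in (1.0, 0.1):                       # 1 Gbps campus, 100 Mbps home
    days = bits / (gbps * 1e9) / 86400
    print(f"{gbps * 1000:.0f} Mbps -> {days:.1f} days")
```

At 1 Gbps sustained this is roughly 4 days; at 100 Mbps it is over a month, and that is before any protocol overhead or shared-link contention.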

Ground-Based Solar Data Calibration / Analysis

Data Calibration & Processing
Complex observations, complex hardware:
- Ground-based (atmosphere!), high-resolution, high-cadence, small field-of-view observations
- Sophisticated instruments (multiple modes, large data rates), broad bandwidth support
- Multiple external partners / instrument providers; need to coordinate the calibration definition/development process
Complex facility support:
- Nine optical assemblies (excluding instruments)
- AO, thermal, enclosure, polarization
Incomplete prior instrument calibration experience:
- Existing instruments are not the same as the DKIST instruments
- No comparable ground-based data processing systems in solar physics
- Invent "some" of the wheel … but not all!!
- New for solar physics, but not necessarily new for big data
Conclusion: this is novel and complex!

CRISP Data Pipeline

Data Calibration & Processing: Asynchronous Event-Driven Pipeline
1. New data arrives …
2. … and is ID'd for calibration
3. An automated calibration task is scheduled … and executed or queued (Python <-> IDL bridge)
4. If it completes, results are written to the data store
5. If not, DC staff are notified to fix or manually execute
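
The event-driven flow above can be sketched in a few lines: an arrival event triggers a calibration task, success writes to the data store, failure raises an alert for staff. Every function and queue name here is illustrative, not part of the DKIST system.

```python
# Toy asynchronous event-driven calibration pipeline (names are hypothetical).
import queue

events = queue.Queue()

def calibrate(dataset):
    """Stand-in for an automated calibration task; may fail on bad data."""
    if dataset.get("corrupt"):
        raise ValueError("calibration failed")
    return {**dataset, "calibrated": True}

def on_new_data(dataset, data_store, alerts):
    """Event handler: run calibration, store the result or flag for staff."""
    try:
        data_store.append(calibrate(dataset))
    except ValueError:
        alerts.append(dataset["id"])   # notify DC staff to fix / run manually

data_store, alerts = [], []
events.put({"id": "DS-0001"})
events.put({"id": "DS-0002", "corrupt": True})
while not events.empty():
    on_new_data(events.get(), data_store, alerts)

print([d["id"] for d in data_store], alerts)  # ['DS-0001'] ['DS-0002']
```

In the real system the "execute or queue" step would dispatch work to a scheduler (and cross the Python <-> IDL bridge for IDL-based algorithms) rather than run inline.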

Visible Broadband Imager (VBI): "Straightforward" Calibration
VBI will record images from the DKIST telescope at the highest possible spatial and temporal resolution at a number of specified wavelengths in the range from 390 nm to 860 nm. VBI will provide high-quality imaging through filters with relatively broad pass-bands to optimize throughput. Its high cadence and short exposure times come at the expense of information in the spectral domain. The VBI design allows exposure times short enough (30 frames/s) to effectively "freeze" atmospheric turbulence and apply speckle-interferometric image-reconstruction techniques.

Calibration - First Light Data Center
- Define core calibrations of detector/instrument
- Leverage known calibration(s) from existing/previous instruments
- Instrument Calibration Plans are a joint effort between the DC and instrument partners/providers
- Build & publish community-contributed Python/IDL code, not forgetting the algorithm documentation!
- Expect significant iteration and refinement in early operations
- Build in revision & change-control processes, and "sandbox" computing to support ongoing development
- Avoid the pipe dream of totally automated calibrations (at least to begin with); plan for flexible processing (automatic + user-directed)
- Create opportunities for others to do big-data analytics

Managing Expectations (space-based solar physicist version)
"But I just want to run dkist_prep on all the data!"
"What? You want me to write the paper too?"

Managing Expectations (ground-based solar physicist version)
"But I want you to invert *ALL* the lines for me!"
"Let me guess, ME isn't good enough for you either!!"

Managing Expectations
- DKIST will be ready for science operations (summit & data center) in Jan. 2020!
- The community will be enabled to do science with DKIST, and we welcome community help!
- The Data Center won't do everything that you want it to do right from the start!
- There are significant big-data challenges, but we think we have them managed!

And then … you will have data you can do big solar-data analytics with, albeit with a few scientific decisions left to you, such as which inversion routines to run! The First Light Data Center is the beginning, not the end product. dkist_paper, [/apj], [/science], [/nature] etc. coming in 2024. (Thanks to Tom Schad for the suggestion!)