Components of a Data Analysis System Scientific Drivers in the Design of an Analysis System
Data Import
Format
–Either widely used/accepted, or
–Can be converted easily from something widely used
–User need not know the details of the format
–Well documented (e.g., which flavor of latitude)
Fast Access
–Disk I/O speeds do not follow Moore’s law
–Read speed is more important than write speed
–Caching
–File size is only important to keep access times low
Content must represent the details of the data
E2E - Full intent of the observer must be embedded
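The caching bullet above can be made concrete. A minimal sketch in Python (the deck's target language is IDL; this is only an illustration): repeated reads of the same scan are served from memory so interactive re-analysis does not pay the disk-I/O cost twice. The reader function and scan numbers are invented stand-ins, not a real data format.

```python
# Sketch: cache repeated scan reads to hide slow disk I/O.
# read_scan is a stand-in for a real format reader.
from functools import lru_cache

READS = {"count": 0}   # instrument the fake "disk" read


@lru_cache(maxsize=128)
def read_scan(scan_number):
    """Pretend disk read; cached after the first access."""
    READS["count"] += 1
    return [float(scan_number)] * 4   # dummy spectrum

a = read_scan(101)
b = read_scan(101)   # second call served from the cache
```

The same idea applies regardless of language: the cache keeps the most recently used scans resident, which matters precisely because read speed dominates interactive use.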
Data Export
Format
–Either widely used/accepted, or
–Can be converted easily into something widely used
–User need not know the details of the format
–Well documented (e.g., which flavor of latitude)
You can read what you write
–Import format == Export format
Fast Access
–Disk I/O speeds do not follow Moore’s law
–Read speed is more important than write speed
Content must represent the details of the data
E2E - Full intent of the observer must be embedded; includes user annotation/comments
Data Base System
Ability to work with more than one data set
Database for both export and import files
Large data volumes
–Access using scan numbers is no longer sufficient
–Require the ability to select subsets of data via sophisticated database queries
–Moderate number of columns in the database index
–‘Index’ to data kept in memory to speed data access
–File summaries at various levels of detail
Various levels of ‘granularity’
Calibrated and raw data
E2E - User can add annotation/comments
Security – only the observer can access the data
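The "select subsets via queries" requirement can be sketched with an in-memory SQL index. The table layout below (source, procedure, system temperature) is purely illustrative, not the actual index schema; the point is that the user selects by science criteria rather than by remembering scan numbers.

```python
# Sketch: a scan index queried with SQL instead of scan numbers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE scan_index (
    scan INTEGER, source TEXT, procedure TEXT,
    restfreq_ghz REAL, tsys_k REAL)""")
rows = [
    (101, "W3OH",   "OnOff", 1.665, 22.0),
    (102, "W3OH",   "OnOff", 1.665, 23.5),
    (103, "OrionA", "Nod",  23.694, 45.0),
    (104, "W3OH",   "Track", 1.667, 21.0),
]
conn.executemany("INSERT INTO scan_index VALUES (?,?,?,?,?)", rows)

# Select a subset by source, procedure, and data quality.
selected = conn.execute(
    "SELECT scan FROM scan_index "
    "WHERE source = ? AND procedure = ? AND tsys_k < ?",
    ("W3OH", "OnOff", 23.0)).fetchall()
scans = [s for (s,) in selected]
```

Keeping this index table small (a moderate number of columns) and resident in memory is what keeps such queries fast, per the bullets above.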
Data Archive
Write speed is more important than read speed
File size is very important
Cannot anticipate types of user queries
–Large number of columns in the database index
–Very sophisticated/fast RDBMS
Storage need not be a widely used data format
–Format can be very different from that used by the analysis system
Export format should be a widely used data format
Interactive On-Line Data Analysis
The ability to access data ASAP
–Import file updates automatically as observations proceed (real-time “filler”)
–Index to the file updates automatically
–Updates happen per ‘integration’ (spectral line) or per N seconds (continuum)
–Minimum integration time ~ a few times the minimum time of the real-time “filler”
–Analysis system is automatically aware of the updated index
–Read-protect online/filled data?
User should be able to ‘see’ the data within an ‘integration’ of when it was taken (or N seconds)
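One way the analysis side can stay aware of filler updates is to track how much of the index it has already seen and pick up only the new rows. A minimal sketch (the in-memory list stands in for the on-disk index file; the row fields are invented):

```python
# Sketch: an analysis session notices filler updates by remembering
# the last index size it saw and returning only newly appended rows.

class IndexWatcher:
    """Tracks which integrations have already been seen."""

    def __init__(self):
        self.seen = 0

    def new_rows(self, index):
        """Return rows appended since the last call."""
        fresh = index[self.seen:]
        self.seen = len(index)
        return fresh

index = []            # the filler appends one entry per integration
watcher = IndexWatcher()

index.append({"scan": 1, "integration": 0})
index.append({"scan": 1, "integration": 1})
first = watcher.new_rows(index)    # two new integrations

index.append({"scan": 1, "integration": 2})
second = watcher.new_rows(index)   # one new integration
```

Polling this way once per integration (or per N seconds for continuum) gives the "see the data within an integration" behavior without the filler and the analysis system sharing any locks.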
User Interface
Command line
–A familiar syntax is better than a good syntax
–Procedural, with byte-code compilation (performance)
–History, min-match or command completion
–Useful error messages
–Interruptible
–Error trapping and exception handling
–Ability to “Undo”
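Minimum-match command lookup, mentioned above, is simple to state precisely: a prefix resolves only if exactly one command matches it. A sketch (the command names are invented examples, not the system's actual command set):

```python
# Sketch: minimum-match command resolution. A unique prefix resolves
# to its command; an ambiguous or unknown prefix resolves to None.
COMMANDS = ["baseline", "boxcar", "gaussfit", "getscan", "hanning"]


def min_match(prefix):
    """Return the unique command matching prefix, else None."""
    hits = [c for c in COMMANDS if c.startswith(prefix)]
    return hits[0] if len(hits) == 1 else None

r1 = min_match("han")   # unique -> "hanning"
r2 = min_match("b")     # ambiguous (baseline, boxcar) -> None
```

A real implementation would report the ambiguous candidates in its error message, which is exactly the "useful error messages" bullet.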
User Interface
GUIs are best for:
–Interacting with data visualizations
–Filling in forms
  database queries
  options for data pipelines
–Browsing for data files
–Defining E2E data flow (à la LabVIEW)
Imaging Tools
Visualization
–Shouldn’t try to recreate what is already available in another package – export instead
Data Flagging
–Pick a system that works
Graphics
–Traditional capabilities (zoom in/out, scroll, print, save, …)
–Data volume requires great performance and smart libraries (screen resolution << # data pts)
–Interactive feedback (e.g., defining baseline regions)
Publishable plots, or export into something else?
–Default plot style
–Ability to tweak everything (label formats; character sizes; add, remove, move annotation; tick mark size; major/minor ticks; full box; grid; multiple X and Y axes; …)
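Once the user has marked baseline regions interactively, the computation behind that feedback loop reduces to a masked fit. A minimal sketch, assuming a first-order (linear) baseline and inclusive channel ranges; all names and numbers are illustrative:

```python
# Sketch: fit a linear baseline over user-selected channel regions
# and subtract it from the whole spectrum.

def fit_baseline(channels, data, regions):
    """Least-squares line through channels inside `regions`
    (list of (lo, hi) inclusive channel ranges); subtract it."""
    pts = [(c, y) for c, y in zip(channels, data)
           if any(lo <= c <= hi for lo, hi in regions)]
    n = len(pts)
    sx = sum(c for c, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(c * c for c, _ in pts)
    sxy = sum(c * y for c, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return [y - (slope * c + intercept)
            for c, y in zip(channels, data)]

channels = list(range(10))
data = [2.0] * 10          # flat 2 K baseline
data[5] += 5.0             # a "line" in channel 5
residual = fit_baseline(channels, data, regions=[(0, 3), (7, 9)])
```

The interactive part of the loop is just redrawing `residual` every time the user drags a region boundary, which is why plotting performance matters at these data volumes.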
Analysis Algorithms
Algorithms well documented
Study what exists in other packages
Robustness is very important, but so is speed
–Provide less robust but faster alternatives
Developers should not force an algorithm on users
Developers should provide ‘defaults’ only
Building blocks are better than a do-all algorithm
Ability to use and modify ‘header’ information as well as data
E2E – do-alls are built out of the same building blocks
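The "building blocks over do-alls" principle can be sketched directly: each step is a small function, and the convenience routine is nothing more than their composition, so users can re-order, replace, or call the pieces individually. The function names and the toy calibration scaling are invented for illustration:

```python
# Sketch: a 'do-all' reduction built from the same small blocks
# that users can call or swap out themselves.

def calibrate(counts, tsys):
    """Toy conversion of raw counts to antenna temperature."""
    return [c * tsys for c in counts]


def average(spectra):
    """Channel-by-channel mean of equal-length spectra."""
    n = len(spectra)
    return [sum(ch) / n for ch in zip(*spectra)]


def smooth(spectrum):
    """3-point boxcar; edge channels passed through unchanged."""
    out = list(spectrum)
    for i in range(1, len(spectrum) - 1):
        out[i] = sum(spectrum[i - 1:i + 2]) / 3.0
    return out


def reduce_all(raw_scans, tsys):
    """The do-all is just the blocks chained together."""
    return smooth(average([calibrate(s, tsys) for s in raw_scans]))

result = reduce_all([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]], tsys=10.0)
```

Because `reduce_all` has no logic of its own, documenting the blocks documents the do-all too, and the E2E pipeline can reuse the identical pieces.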
Documentation
On-line and hardcopy
–Tutorials/Quick Guides
–Cookbook
  Based on observing types
–Reference Manuals
  Full, gory details
  Data formats
  Algorithms
–Searchable by keywords
Quick, interactive command help from within the system
Never release until these are in place
User Support/Feedback
A familiar system minimizes staff support
Easily accessed, on-line “help desk” and “suggestion” box
Automatic generation of “bug” reports
Observers of observers
Marketing
A familiar system already has a market
Don’t be another cereal on the supermarket shelf
Workshops are better than papers
Create a User Community
Responsive feedback from developers
Independent beta testers
Reputation & first experiences are everything
User Community
User forums
Newsletters
Accept user contributions/additions
–SourceForge-like system
–NRAO seal of approval
NRAO moderator
Real-Time Data Display
To guarantee data quality
–Product is not stored (except for hardcopy)
–Sequential processing – different from E2E/data pipeline
–Fast is more important than accurate
–Few bells and whistles – must avoid the RTD black hole
–A simple display for all observation types is more important than sophisticated displays for a few data types
Display happens within an ‘integration’ of when the data were taken – tied to the real-time filler
GUI based – underlying language is unimportant
Output understandable by an operator
Real-Time Data Analysis
Pointing/Focus/Tipping/… are different from RTD
–Results should be stored (database)
–Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
–Accuracy is as important as speed
–More bells, whistles, user options
–Sequential processing (non-E2E/data pipeline)
–Only a few observation types are handled
Analysis happens within an ‘integration’ of when the data were taken
GUI based – underlying language is unimportant
Output understandable by an operator
IDL Work Package
SDFITS
–Interim solution for data import/export
–CLASS/IDL specific; soon AIPS++/AIPS/UniPOPS?
–MD/BDFITS next generation (keywords, incompleteness of contents, versatility, …)
IDL – Tom Bania
–Uses UniPOPS as a ‘model’ – familiar to many
–Very good reproduction
–Bania-centric – needs to be generalized
IDL Work Package
Glen Langston
–Assess whether IDL will meet performance, extensibility, usability, … goals
–Generalization to other observing types
–Real-time data access and display
–Developed on top of and in parallel with Tom’s work (so implementations have diverged)
–Works well for Glen’s own experiments
IDL Work Package
Institutionalize what Tom and Glen have done
–Code management
–Code review
–Combine Tom’s and Glen’s branches
–Generalize code
–Provide ways for Tom and Glen to contribute within the same revision-control branch
Develop ‘institutionalized’ code
–Improve performance, usability, maintenance
–Add/replace I/O components with better CS methods
Calibration Work Package
User-tunable algorithms
–Options for the ‘real-time filler’ – sequential
–Options for the E2E pipeline – non-sequential
–Options for interactive data reduction
Default algorithms for all observing cases
Extensible as new algorithms are developed
User-defined/tweaked algorithms
Robust and not-so-robust algorithms
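A user-tunable default can be sketched with the standard position-switched calibration, Ta = Tsys·(on − off)/off, wired in as a replaceable callable. The entry-point name and signature are invented for illustration; the formula itself is the conventional single-dish default:

```python
# Sketch: a default calibration algorithm that users can replace
# with any callable of the same signature.

def position_switch(on, off, tsys):
    """Default (on - off)/off calibration, channel by channel."""
    return [tsys * (a - b) / b for a, b in zip(on, off)]


def reduce(on, off, tsys, algorithm=position_switch):
    """User-tunable entry point: the default is just one choice."""
    return algorithm(on, off, tsys)

ta = reduce(on=[1.1, 1.2, 1.1], off=[1.0, 1.0, 1.0], tsys=20.0)
```

The same hook serves all three contexts above: the real-time filler, the E2E pipeline, and interactive reduction each pass their own `algorithm`, robust or fast as appropriate.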
Calibration Work Package
Opacity/atmosphere model
Output units
Efficiencies
–Source size
–Telescope model
Tsys(f) estimates
Differencing schemes
Non-linearities/template fitting/…
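The opacity/atmosphere item can be illustrated with the standard correction Ta′ = Ta·exp(τ·A), using a plane-parallel airmass A = 1/sin(el). The zenith opacity value below is illustrative; a real model would supply τ as a function of frequency and weather:

```python
# Sketch: correct antenna temperature for atmospheric opacity,
# Ta' = Ta * exp(tau_zenith * airmass).
import math


def airmass(elevation_deg):
    """Plane-parallel approximation, fine away from the horizon."""
    return 1.0 / math.sin(math.radians(elevation_deg))


def correct_opacity(ta, tau_zenith, elevation_deg):
    """Scale Ta to its above-the-atmosphere value."""
    return ta * math.exp(tau_zenith * airmass(elevation_deg))

ta_prime = correct_opacity(ta=1.0, tau_zenith=0.05, elevation_deg=30.0)
```

Output-unit conversion (Ta′ to flux density, say) and the efficiency terms in the list above would be further multiplicative blocks of exactly this shape.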