A Tale of Two Workflows. Roger Barga, Microsoft Research (MSR); Nelson Araujo, Dean Guo, Jared Jackson, Microsoft Research. With the creative input of the Trident MSR Summer '08 interns.
MSR (Trident) Summer '08 Interns: Satya Sahoo (Wright State University), David Koop (University of Utah), Matt Valerio (Ohio State University), Eran Chinthaka (Indiana University).
Demonstrate that a commercial workflow management system can be used to implement scientific workflows, and offer this system as an open-source accelerator:
- Write once, deploy and run anywhere;
- Abstract parallelism (HPC and many-core);
- Automatic provenance capture, for both workflow and results;
- Costing model for estimating resources required;
- Integrated data storage and access, in particular cloud computing;
- Reproducible research.
Develop this in the context of real eScience applications: make sure we solve a real problem for actual project(s). And this is where things got really interesting...
Technical Computing / eScience
Workflow is a bridge between the underwater sensor array (instrument) and the end users.
Mandate:
- Make data available to researchers in (near-) real time;
- Store data for long-term time-series studies.
Features:
- Allow human interaction with instruments;
- Deployed instruments will change regularly, as will the analysis;
- Facilitate automated, routine "survey campaigns";
- Support automated event detection and reaction;
- Users able to access through the web (or custom client software);
- Best effort for most workflows is acceptable.
[Slide table: survey power of sky-survey telescopes, with columns for telescope diameter (m), effective collecting area (m²) [A], solid angle subtended by field of view (deg²) [D], nominal image quality (arcsec) [Q], survey power [AD/Q²], and status. Rows: LINEAR (active), Spacewatch (active), UH 2.2-m/PFCam (2004+), Palomar/QUEST (2003+), CFHT/Megacam (active), Subaru/Suprime-Cam (active), Pan-STARRS (2007+), DMT/LSST (2012+); the numeric values are garbled in this transcript.]
Haleakala Observatory, Maui, Hawaii!
- One of the largest visible-light telescopes;
- 4 unit telescopes acting as one;
- 1 gigapixel per telescope;
- Surveys the entire visible sky in 1 week;
- Catalogs the solar system, moving objects/asteroids;
- ps1sc.org: UHawaii, Johns Hopkins, ...
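The table's figure of merit is simple enough to compute directly: survey power is AD/Q², i.e. collecting area A (m²) times field of view D (deg²) over the square of the image quality Q (arcsec). A minimal sketch, using illustrative placeholder parameters rather than the (garbled) slide values:

```python
def survey_power(area_m2: float, fov_deg2: float, quality_arcsec: float) -> float:
    """Survey power AD/Q^2: the figure of merit the slide uses to rank telescopes."""
    return area_m2 * fov_deg2 / quality_arcsec ** 2

# A wide field with sharp images can beat raw aperture: a modest mirror
# covering 3 deg^2 at 0.6 arcsec outranks a large mirror with a narrow field.
wide = survey_power(area_m2=5.0, fov_deg2=3.0, quality_arcsec=0.6)    # ~41.7
narrow = survey_power(area_m2=50.0, fov_deg2=0.2, quality_arcsec=1.0)  # 10.0
```

This is why Pan-STARRS's four wide-field gigapixel units rank so highly despite modest individual apertures.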
- 30 TB of processed data/year; ~1 PB of raw data;
- 5 billion objects; 100 million detections/week, updated every week;
- SQL Server 2008 for storing detections: a distributed view over spatially partitioned databases, replicated for fault tolerance;
- Windows 2008 HPC Cluster schedules workflows and monitors the system.
[Slide diagram: the load-merge pipeline. CSV files from the IPP land in a shared data store; load-merge stages 1-6 feed sixteen spatially partitioned slice databases (Slice 1-16), each kept in hot and warm copies (L1/L2) behind the main distributed view.]
Supporting Provenance for the Scientist & the Data Valet
[Slide diagram: the Pan-STARRS Science Cloud. Data creators: the telescope and the Image Processing Pipeline (IPP) emit CSV files. Behind the cloud, data-valet workflows on admin & load-merge machines run the Load workflow into cold slice DBs 1-2, the Merge workflow into warm slice DBs 1-2, and the Flip workflow into hot slice DBs 1-2 behind the distributed view, with validation, exception notification, and a slice fault-recovery workflow. User-facing services on production machines expose the CASJobs query service and MyDB to astronomers (data consumers) for queries and workflows. Data flows in one direction, except for error recovery.]
Workflow is just a member of the orchestra
Workflow carries out the data loading and merging. Features:
- Support scheduling of workflows for nightly load and merge;
- Offer only controlled (protected) access to the workflow system;
- Workflows are tested, hardened and seldom change; not a unit of reuse or knowledge sharing;
- Fault tolerance: ensure recovery and cleanup from faults; assign clean-up workflows to undo state changes;
- Provenance as a record of state changes (system management);
- Performance monitoring and logging for diagnostics;
- Must "play well" in a distributed system;
- Provide ground truth for the state of the system.
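The fault-tolerance bullet above, clean-up workflows that undo state changes, is essentially compensation. A minimal sketch of the idea (illustrative only, not the Trident implementation): each successful step registers a compensating action, and a fault runs the registered compensations in reverse order.

```python
from typing import Callable, List

class Workflow:
    """Toy load/merge workflow with compensating clean-up actions."""

    def __init__(self) -> None:
        self._cleanups: List[Callable[[], None]] = []
        self.log: List[str] = []

    def step(self, name: str, action: Callable[[], None],
             cleanup: Callable[[], None]) -> None:
        action()
        self.log.append(f"done:{name}")
        self._cleanups.append(cleanup)  # registered only after success

    def run(self, steps) -> bool:
        try:
            for name, action, cleanup in steps:
                self.step(name, action, cleanup)
            return True
        except Exception:
            # Fault: undo the completed steps in reverse order.
            for cleanup in reversed(self._cleanups):
                cleanup()
            return False

def fail():
    raise RuntimeError("disk full")

state = []
wf = Workflow()
ok = wf.run([
    ("load", lambda: state.append("loaded"), lambda: state.remove("loaded")),
    ("merge", fail, lambda: None),
])
# ok is False, and the partial load has been undone: state == []
```

Because the compensation log doubles as a record of state changes, it also serves the "provenance as a record of state changes" role the slide mentions.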
- I want to do this more than once and get exactly the same answer.
- I want to do this more than once, but don't care if I get exactly the same answer.
- I'm only going to do this once and don't care about keeping the data or the results long term (but I need to remember the inputs).
- I want to store the data in
- I want full provenance to validate a result, OPM compliant.
- I want to use my own provenance management system.
- Each group may wish a different UI (no WF), or authoring tool.
- I want any data from any agency or investigator even if the measurement sites aren't quite co-located; I'll deal with it later.
- I only want NCAR, MBARI, etc. data because I trust it.
- I know that Jon really wants my results to drive his model, and I want to share my workflow and executables.
Each of these potentially impacts the technology, user interface, and API design.
Divide and conquer:
- You can see all of the application components;
- Different components share interfaces;
- Components developed by different people work together, even if someone else implements them.
Go from working to working:
- Change one component, the rest keep working;
- Scale up or down over time;
- Testing components independently is possible.
Full design, incremental implementation:
- Build what you need as you go;
- Integrate new data sources, data types, and analysis tools by leveraging the stable interfaces.
Plug and play...
It's hard:
- You have to accumulate user scenarios, map them to the technical components, and then understand the implications. What are the dimensions of change/flexibility?
It doesn't feel like you're making progress:
- You spend a lot of time discovering what you already know; user scenarios often contain the same technical requirements again and again.
It's not fun:
- You have to keep your interfaces stable longer (because you have dependencies on them), so that great idea has to wait for the next release;
- The design discussions can be rather "energetic".
It takes a team commitment.
Drive workflow development with 20 queries (workflows): representative of the science, diverse enough to drive the design.
Introduce a registry as single ground truth for all state and objects.
Trident Registry: Registry Management
The registry:
- Provides ground-truth state for Trident;
- Captures provenance for workflows;
- Records information on running jobs;
- Holds metadata for all objects in Trident.
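The registry's role, one store that everything else queries instead of querying each other, can be sketched in a few lines. This is a hypothetical illustration of the pattern, not Trident's actual schema or API:

```python
class Registry:
    """Single ground-truth store: object metadata, job state, provenance log."""

    def __init__(self):
        self.objects = {}      # object id -> metadata
        self.jobs = {}         # job id -> current status (ground truth)
        self.provenance = []   # append-only record of state changes

    def register(self, obj_id: str, metadata: dict) -> None:
        self.objects[obj_id] = metadata

    def set_job_status(self, job_id: str, status: str) -> None:
        self.jobs[job_id] = status
        self.provenance.append((job_id, status))  # every change is recorded

    def history(self, job_id: str) -> list:
        return [s for j, s in self.provenance if j == job_id]

reg = Registry()
reg.register("wf-load", {"type": "workflow", "author": "data valet"})
reg.set_job_status("job-1", "Started")
reg.set_job_status("job-1", "Completed")
# reg.jobs holds current state; reg.history("job-1") replays how it got there
```

Keeping the provenance log append-only is what lets the same store answer both "what is the state now?" and "how did it get that way?".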
Introduce an event blackboard for service communication.
Trident Blackboard Overview (Matt Valerio, Satya Sahoo, Jared Jackson)
[Slide diagram: publishers for logging, monitoring, provenance tracking, design tracking, resource usage, user-defined tracking, design data, and other publishers post BlackboardMessages (concept-value pairs drawn from a shared ontology) to the blackboard; subscription profiles, a subscription store, and a publisher store route messages to subscribers.]
Workflow Tracking
- Workflow events: Aborted, Changed, Completed, Created, Idle, Loaded, Persisted, Resumed, Started, Suspended, Terminated, Unloaded;
- Activity events: Cancelling, Closed, Compensating, Executing, Faulting, Initialized;
- User events: user-defined concept-value pairs;
- Tracking data (instance ID, activity type, activity name, timestamp, ...) is mapped through the ontology into BlackboardMessage concept-value pairs, filtered against an aggregate subscription profile, and sent to the blackboard.
Why filter at the publisher? Minimize network usage and optimize performance (more messages/sec).
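The publish/subscribe flow above, concept-value messages matched against subscription profiles, with filtering done on the publisher side so only wanted concepts cross the network, can be sketched as follows. Names here are illustrative, not the Trident blackboard API:

```python
from collections import defaultdict
from typing import Dict, List

class Blackboard:
    """Toy blackboard: concept-value messages filtered at the publisher."""

    def __init__(self):
        self.subscriptions: Dict[str, set] = {}            # subscriber -> wanted concepts
        self.delivered: Dict[str, List[dict]] = defaultdict(list)

    def subscribe(self, subscriber: str, concepts: set) -> None:
        self.subscriptions[subscriber] = concepts

    def publish(self, message: Dict[str, object]) -> None:
        for subscriber, wanted in self.subscriptions.items():
            # Filter before sending: only the requested concepts travel,
            # which is the slide's argument for filtering at the publisher.
            filtered = {c: v for c, v in message.items() if c in wanted}
            if filtered:
                self.delivered[subscriber].append(filtered)

bb = Blackboard()
bb.subscribe("provenance", {"ActivityName", "Timestamp"})
bb.subscribe("monitor", {"CpuPercent"})
bb.publish({"ActivityName": "LoadCsv", "Timestamp": 1, "CpuPercent": 40.0})
# "provenance" receives only the tracking concepts; "monitor" only the CPU one
```

In the real system the aggregate subscription profile plays the role of the union of `wanted` sets, so a publisher can drop a message entirely when no subscriber wants any of its concepts.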
Workflow Monitoring
[Slide chart: CPU usage (0-100%) over time for SequenceActivity1, CpuIntensiveActivity1, MemoryIntensiveActivity1, CpuIntensiveActivity2.]
Goals:
- Real-time resource usage graphs (e.g. Silverlight), subscriber-initiated or activity-initiated;
- Creation of cost models for each type of activity.
Implementation:
- Subscribers listen for a specific resource concept;
- The monitoring service polls a resource monitor (e.g. a CPU monitor) at regular intervals;
- The results are sent to the blackboard.
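The polling loop in the implementation bullets is straightforward; a minimal sketch, with the CPU reader and the blackboard send injected as callables (both hypothetical stand-ins, not Trident's monitoring service):

```python
import time
from typing import Callable, List

def poll_resource(read_cpu: Callable[[], float],
                  send: Callable[[dict], None],
                  samples: int, interval_s: float = 0.0) -> None:
    """Poll a resource monitor `samples` times, publishing each reading
    to the blackboard as a concept-value message."""
    for _ in range(samples):
        send({"concept": "cpu_percent",
              "value": read_cpu(),
              "timestamp": time.time()})
        time.sleep(interval_s)

messages: List[dict] = []
readings = iter([12.5, 80.0, 45.0])   # stand-in for a real CPU counter
poll_resource(lambda: next(readings), messages.append, samples=3)
# messages now holds three cpu_percent samples, ready for a usage graph
```

Accumulated samples per activity type are exactly what the cost-model goal needs: average and peak usage per activity become the model's inputs.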
Illustration of Monitoring in Action
Choose specific interfaces between components and stick to them: APIs, object models, browser user screens and forms. Everything can be replaced and/or augmented.
Trident Registry Provider API (Eran Chinthaka and Nelson Araujo)
[Slide diagram: native, managed, and web-service API layers on either side of the Trident registry.]
Separate the user interface to solve specific tasks:
- Separate the authoring UI from the runtime;
- Separate the execution UI from the runtime. It's a workflow: what parameters do you want to set? What parts do you want to pause? Do over? Never do again?
- Some things only work on the desktop; some things work best in the cloud. Enable users to select at runtime.
Workflow Selection (David Koop, Nelson Araujo)
Show me the workflows that:
- Process these data sets (sensor types);
- Produce this kind of result (type of visualization, analysis);
- Order these workflows by the time each was last used;
- Now apply this workflow to "this" area of the ocean.
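These selection queries run over workflow metadata, which is another payoff of the registry. A sketch over a hypothetical catalog (the workflow names and fields are invented for illustration):

```python
from datetime import date

# Hypothetical workflow catalog, as the registry might expose it.
catalog = [
    {"name": "CTD profile plot", "inputs": {"ctd"}, "output": "plot",
     "last_used": date(2008, 7, 1)},
    {"name": "Hydrophone event detect", "inputs": {"hydrophone"},
     "output": "events", "last_used": date(2008, 8, 2)},
    {"name": "CTD salinity map", "inputs": {"ctd"}, "output": "map",
     "last_used": date(2008, 8, 9)},
]

def select_workflows(sensor, output=None):
    """Filter by sensor type (and optionally result type), most recent first."""
    hits = [w for w in catalog
            if sensor in w["inputs"] and (output is None or w["output"] == output)]
    return sorted(hits, key=lambda w: w["last_used"], reverse=True)

ctd_flows = select_workflows("ctd")              # both CTD workflows, newest first
ctd_plots = select_workflows("ctd", "plot")      # narrowed by result type
```

Applying the chosen workflow to "this" area of the ocean would then just mean binding its region parameter at execution time, per the separation of authoring and execution UIs above.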
Questions?
Scientific workflows for streamlining the data pipeline:
- Data acquisition: field sensor deployments and operations; field campaigns measuring site properties.
- Data assembly: "raw" data includes sensor output, data downloaded from agency or collaboration web sites, and papers (especially for ancillary data).
- Discovery and browsing: "raw" data browsing for discovery (do I have enough data in the right places?), cleaning (does the data look obviously wrong?), and lightweight science via browsing.
- Science exploration: "science variables" and data summaries for hypothesis testing and early exploration; like discovery and browsing, but variables are computed via gap filling, unit conversions, or simple equations.
- Domain-specific analyses: "science variables" combined with models, other specialized code, or statistics for deep science understanding.
- Scientific output: results via packages such as MATLAB or R; special rendering packages such as ArcGIS; paper preparation.
- Archive: data and analysis methodology stored for data reuse, or for repeating an analysis.