Introduction to Scientific Workflows and the KEPLER System. Instructors: Bertram Ludaescher, Ilkay Altintas


1 Introduction to Scientific Workflows and the KEPLER System
Instructors: Bertram Ludaescher, Ilkay Altintas

2 Scientific Workflows, B. Ludaescher & I. Altintas
Overview
– 10:30-11:15 Introduction to Scientific Workflows
– 11:15-12:00 Scientific Workflows in KEPLER: live demo, brains-on session
… but first, one more time … (déjà déjà vu™)

4 Information Integration Challenges: S⁴ Heterogeneities
Systems Integration
– platforms, devices, data & service distribution, APIs, protocols, …
→ Grid middleware technologies
  + e.g. single sign-on, platform independence, transparent use of remote resources, …
Syntax & Structure
– heterogeneous data formats (one for each tool...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
– heterogeneous schemas (one for each DB...)
→ Database mediation technologies
  + XML-based data exchange, integrated views, transparent query rewriting, …
Semantics
– fuzzy metadata, terminology, “hidden” semantics, implicit assumptions, …
→ Knowledge representation & semantic mediation technologies
  + “smart” data discovery & integration
  + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!

5 Information Integration Challenges: S⁵ Heterogeneities
Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”
– How to make use of these wonderful things & put them together to solve a scientist’s problem?
→ Scientific Problem Solving Environments (PSEs)
→ GEON Portal and Workbench (“scientist’s view”)
  + ontology-enhanced data registration, discovery, manipulation
  + creation and registration of new data products from existing ones, …
→ GEON Scientific Workflow System (“engineer’s view”)
  + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools …
  + e.g., creation of new datasets from existing ones, dataset registration, …

6 What is a Scientific Workflow (SWF)?
Goals:
– automate a scientist’s repetitive data management and analysis tasks
– typical phases: data access, scheduling, generation, transformation, aggregation, analysis, visualization
→ design, test, share, deploy, execute, reuse, … SWFs
Typical requirements/characteristics:
– data-intensive and/or compute-intensive
– plumbing-intensive
– dataflow-oriented
– distributed (data, processing)
– user interaction “in the middle”, …
– … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
– advanced programming constructs (map(f), zip, takewhile, …)
– logging, provenance, “registering back” (intermediate) products, …
… easy to recognize a SWF when you see one!
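The “advanced programming constructs” named on this slide are ordinary higher-order list operations. A minimal Python sketch follows; the sample readings and the `calibrate` step are made up for illustration and do not come from the slides.

```python
from itertools import takewhile


def calibrate(reading):
    """Hypothetical per-item analysis step (illustrative only)."""
    return reading * 2


readings = [1, 2, 3, 10, 4]          # made-up sample data
labels = ["a", "b", "c", "d", "e"]   # made-up sample labels

# map(f): apply one step to every item of a collection
calibrated = list(map(calibrate, readings))

# zip: pair up two streams item by item
labelled = list(zip(labels, calibrated))

# takewhile: consume a stream only while a condition holds
below_limit = list(takewhile(lambda r: r < 10, calibrated))

print(calibrated, labelled, below_limit)
```

In a workflow system each of these would be an actor that wraps another actor or sub-workflow, rather than a function call.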

7 Promoter Identification Workflow
Source: Matt Coleman (LLNL)

8 Source: NIH BIRN (Jeffrey Grethe, UCSD)

9 Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: pipeline diagram.
Steps: EcoGrid Query, Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive To Ecogrid; sources: Registered Ecogrid Database; further labels: +A1, +A2, +A3.
Data products: (a) species presence & absence points (native range; invasion area); (b) environmental layers (native range; invasion area); (c) integrated layers (native range; invasion area); (d) training sample, test sample; (e) GARP rule set; (f) native range prediction map, invasion area prediction map; (g) model quality parameter; (h) selected prediction maps.]
Source: NSF SEEK (Deana Pennington et al., UNM)

11 Digression: (Business) Workflows and Systems
or: what you need to know when someone wants to sell you one ;-)
or: the remote relatives (2nd-3rd cousins?) of scientific workflows

12 What is a (Business) Workflow?
Workflow management (also called Business Process Management) is the coordination of work processes through software. A workflow management system routes pending activities to process participants according to a model of the process.
WF management systems have been around since the late 1970s (e.g. Officetalk, Xerox PARC)
– marketing waves: Office Automation (’70s-’80s), Business Process Reengineering (’90s), Web Services Choreography (’00s)
– roots/related: document management apps, email system apps, database apps (active DBMSs, federated DBMSs)
– Meanwhile (’69-’71) elsewhere: Flow-based programming (J. Paul Morrison)
– … not quite workflow but rather dataflow … (we’ll come to that…)
Src/cf: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

13 Some History: Commercial Workflow Systems
Source: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

14 Some History: Commercial Workflow Systems
Source: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

15 Play Time @ Petri Nets World
Petri Nets are the underlying abstract model of many B-WfMS’s (who said I can’t do bad acronyms, too? ;-)
http://www.daimi.au.dk/PetriNets/
http://www.daimi.au.dk/PetriNets/introductions/aalst/
Let’s see the basic ideas first …

16 Formal Basis: Petri Nets
Mathematical model of discrete distributed systems (named after Carl Adam Petri, 1960s)
Provides a modeling language w/ rich theory, analysis tools, …
A Petri net consists of places (P), transitions (T) and directed arcs (P → T or T → P).
Places can hold tokens.
A transition is enabled if each of its input places contains at least one token.
An enabled transition can fire, removing input tokens and producing output tokens.
[Figure: Petri net with places P1, P2, P3, P4 and transitions T1, T2; labels “Enabled” and “not enabled”]
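The enabling and firing rules above can be sketched in a few lines of Python. The class name `PetriNet` and the wiring of P1-P4 to T1 and T2 are assumptions made for illustration; the slide's figure is not recoverable from the transcript, and arc weights are fixed at one.

```python
class PetriNet:
    """Minimal place/transition net with unit arc weights (a sketch)."""

    def __init__(self, marking, transitions):
        # marking: place name -> number of tokens
        # transitions: transition name -> (input places, output places)
        self.marking = dict(marking)
        self.transitions = transitions

    def enabled(self, t):
        """A transition is enabled if every input place holds a token."""
        inputs, _ = self.transitions[t]
        return all(self.marking.get(p, 0) >= 1 for p in inputs)

    def fire(self, t):
        """Remove one token per input place, add one per output place."""
        if not self.enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        inputs, outputs = self.transitions[t]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1


# Assumed wiring: T1 consumes from P1 and P2; T2 consumes from P3 and P4.
net = PetriNet(
    marking={"P1": 1, "P2": 1, "P3": 0, "P4": 0},
    transitions={
        "T1": (["P1", "P2"], ["P3"]),
        "T2": (["P3", "P4"], ["P1"]),
    },
)
print(net.enabled("T1"), net.enabled("T2"))
net.fire("T1")
print(net.marking)
```

After T1 fires, P3 holds a token but T2 stays disabled, because P4 is still empty.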

18 Why Petri Nets
Modeling and designing concurrent systems w/ competing resources (dining philosophers), …
Lots of analysis techniques, tools, theory
– boundedness (state space),
– liveness (good things do happen),
– safety (bad things do not happen),
– reversibility,
– deadlock(-freeness),
– reachability (of certain states),
– …
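For a bounded net, several of these properties can be checked by enumerating the reachable markings. The sketch below does a breadth-first search over markings and collects deadlocked ones; the function names and the two-process mutual-exclusion net are assumptions chosen for illustration, not taken from the slides.

```python
from collections import deque


def successors(marking, transitions):
    """Yield (transition, next_marking) for every enabled transition."""
    for name, (inputs, outputs) in transitions.items():
        if all(marking.get(p, 0) >= 1 for p in inputs):
            nxt = dict(marking)
            for p in inputs:
                nxt[p] -= 1
            for p in outputs:
                nxt[p] = nxt.get(p, 0) + 1
            yield name, nxt


def freeze(marking):
    """Hashable form of a marking, usable as a set member."""
    return tuple(sorted(marking.items()))


def explore(initial, transitions, limit=10_000):
    """Breadth-first search of reachable markings (bounded nets only)."""
    seen = {freeze(initial)}
    deadlocks = []
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        nexts = list(successors(marking, transitions))
        if not nexts:
            deadlocks.append(marking)  # no transition enabled here
        for _, nxt in nexts:
            if freeze(nxt) not in seen:
                if len(seen) >= limit:
                    raise RuntimeError("too many states; net may be unbounded")
                seen.add(freeze(nxt))
                queue.append(nxt)
    return seen, deadlocks


# Two processes competing for one shared resource ("free").
mutex_net = {
    "a_start": (["a_wait", "free"], ["a_run"]),
    "a_stop": (["a_run"], ["a_wait", "free"]),
    "b_start": (["b_wait", "free"], ["b_run"]),
    "b_stop": (["b_run"], ["b_wait", "free"]),
}
start = {"free": 1, "a_wait": 1, "a_run": 0, "b_wait": 1, "b_run": 0}
reachable, deadlocks = explore(start, mutex_net)

# Safety: no reachable marking has both processes running at once.
safe = all(not (dict(m)["a_run"] and dict(m)["b_run"]) for m in reachable)
print(len(reachable), deadlocks, safe)
```

The `limit` guard is a crude stand-in for a real boundedness analysis: an unbounded net has infinitely many markings, so plain enumeration would never terminate.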

19 In a Flux: WS-XX-“Standards”
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html

20 Everything Flows? But what exactly?
Dataflow
– Data flows through operations (zoom into your CPU…)
– Activity diagrams: data flows through actions
– Process networks: data flows between processes
Control-flow
– Nodes are control-flow operations that start other operations on a state
Mixed approaches
– Statecharts: events trigger state transitions
– Petri nets: tokens mark control and dataflow
– Workflow languages: mix control and dataflow
– … many others …

21 Scientific “Workflows” vs Business Workflows
Business Workflows (BPEL4WS* …)
– Task-orientation: travel reservations; credit approval; BPM; …
– Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout
– Complex control flow, complex process composition (danger of control flow/dataflow “spaghetti”)
→ Dataflow and control-flow are often divorced!
Scientific “Workflows”
– Dataflow and data transformations
– Data problems: volume, complexity, heterogeneity
– Grid-aspects
  – Distributed computation
  – Distributed data
– User-interactions/WF steering
– Data, tool, and analysis integration
→ Dataflow and control-flow are often married! (can be a happy marriage… at times…)
*Business Process Execution Language for Web Services (in case you wondered)

22 Scientific “Workflows”: Some Findings
More dataflow than (business control-/) workflow
– DiscoveryNet, Kepler, SCIRun, Scitegic, Triana, Taverna, …
Need for “programming extensions”
– Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
Need for abstraction and nested workflows
Need for data transformations (WS1 → DT → WS2)
Need for rich user interaction & workflow steering:
– pause / revise / resume
– select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
Need for high-throughput data transfers and CPU cycles: “(Data-)Grid-enabling”, “streaming”
Need for persistence of intermediate products and provenance

23 Perspectives on Systems
Source: Workflow-based Process Controlling, Michael zur Muehlen, 2003 / Dataflow View

24 A Dataflow Component (“Actor”)
[Figure: an “actor” / component with input channels, output channels, ports, and parameters $1, $2, …]

25 Actor-Oriented Design
Object orientation:
[Figure: class with name, data, methods; call and return arrows]
What flows through an object is sequential control (cf. CCA, MPI)
Actor/Dataflow orientation:
[Figure: actor with name, data (state), ports, parameters; input data and output data arrows]
What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!)
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

26 Object-Oriented vs. Actor-Oriented Interfaces
Object Oriented
TextToSpeech
  initialize(): void
  notify(): void
  isReady(): boolean
  getSpeech(): double[]
OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.
Actor/Dataflow Oriented
AO interface definition says “Give me text and I’ll give you speech”
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
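The contrast can be made concrete with a small Python sketch. Both classes are hypothetical illustrations, not Ptolemy II or Kepler code; `synthesize` is a stand-in that fakes one sample per character, and `notify` takes the text as an argument here (the slide shows no parameters), which is an assumption.

```python
from queue import Queue


def synthesize(text):
    """Stand-in for a real synthesizer: one fake sample per character."""
    return [float(ord(ch)) for ch in text]


class TextToSpeechObject:
    """Object-oriented style: the caller must know the legal call order."""

    def __init__(self):
        self._ready = False
        self._speech = None

    def initialize(self):
        self._ready = False
        self._speech = None

    def notify(self, text):
        self._speech = synthesize(text)
        self._ready = True

    def is_ready(self):
        return self._ready

    def get_speech(self):
        # The ordering rule lives in the implementation, not the interface.
        if not self._ready:
            raise RuntimeError("get_speech() called before notify()")
        return self._speech


class TextToSpeechActor:
    """Actor-oriented style: the interface is just 'text in, speech out'."""

    def __init__(self):
        self.text_in = Queue()     # input port
        self.speech_out = Queue()  # output port

    def fire(self):
        self.speech_out.put(synthesize(self.text_in.get()))


actor = TextToSpeechActor()
actor.text_in.put("hi")
actor.fire()
samples = actor.speech_out.get()
print(samples)
```

With the object, a wrong call order only shows up as a runtime error. With the actor there is no call order to get wrong: whoever schedules `fire` decides when the actor runs.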

27 Ptolemy II
see! try! read!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

28 History
Gabriel (1986-1991)
– Written in Lisp
– Aimed at signal processing
– Synchronous dataflow (SDF) block diagrams
– Parallel schedulers
– Code generators for DSPs
– Hardware/software co-simulators
Ptolemy Classic (1990-1997)
– Written in C++
– Multiple models of computation
– Hierarchical heterogeneity
– Dataflow variants: BDF, DDF, PN
– C/VHDL/DSP code generators
– Optimizing SDF schedulers
– Higher-order components
Ptolemy II (1996-2022)
– Written in Java
– Domain polymorphism
– Multithreaded
– Network integrated
– Modal models
– Sophisticated type system
– CT, HDF, CI, GR, etc.
PtPlot (1997-??)
– Java plotting package
Tycho (1996-1998)
– Itcl/Tk GUI framework
Diva (1998-2000)
– Java GUI framework
Copernicus (code generator)
KEPLER (2003-2028)
– scientific workflow extensions
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows
KEPLER = “Ptolemy II + X” for Scientific Workflows
Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

29 An “early” example: Promoter Identification (SSDBM, AD 2003)
– Scientist models application as a “workflow” of connected components (“actors”)
– If all components exist, the workflow can be automated/executed
– Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
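“Pipelined” execution in the process-network style can be approximated in plain Python: one thread per actor, connected by FIFO channels, with reads that block on empty input. This is only a sketch of the execution model, not the Kepler PN director; the `DONE` end-of-stream marker and all function names are assumptions of this example.

```python
import threading
from queue import Queue

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(out, items):
    """Producer process: emit every item, then signal end of stream."""
    for item in items:
        out.put(item)
    out.put(DONE)


def stage(inp, out, f):
    """Worker process: read, transform, write, until the stream ends."""
    while True:
        token = inp.get()  # blocks while the input channel is empty
        if token is DONE:
            out.put(DONE)
            return
        out.put(f(token))


def run_pipeline(items, functions):
    """Run one thread per stage and collect the final output stream."""
    channels = [Queue() for _ in range(len(functions) + 1)]
    threads = [threading.Thread(target=source, args=(channels[0], items))]
    for i, f in enumerate(functions):
        threads.append(
            threading.Thread(target=stage, args=(channels[i], channels[i + 1], f))
        )
    for t in threads:
        t.start()
    results = []
    while True:
        token = channels[-1].get()
        if token is DONE:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
```

Because each stage starts work as soon as its first token arrives, later stages overlap with earlier ones, which is the streaming behaviour the slide refers to.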

30 Why Ptolemy II (and thus KEPLER)?
Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
User-Orientation
– Workflow design & exec console (Vergil GUI)
– “Application/Glue-Ware”
  – excellent modeling and design support
  – run-time support, monitoring, …
  – not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
  – but middle-/underware is conveniently accessible through actors!
Pragmatics
– Ptolemy II is mature, continuously extended & improved, well-documented (500+ pp)
– open source system
– Ptolemy II folks actively participate in KEPLER

31 The KEPLER/Ptolemy II GUI (Vergil)
– “Directors” define the component interaction & execution semantics
– Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

32 Ptolemy II: Actor-Oriented Modeling
– Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director”
– Different directors for different modeling and execution needs (… can even be combined!)
→ Better abstraction, modeling, component reuse, …

33 Behavioral Polymorphism in Ptolemy
Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.
These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, [G]BPL4??, … !)
[Figure: producer actor and consumer actor connected through IOPort and Receiver, with a Director]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
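The idea that the director, not the component, supplies the receiver can be sketched as follows. All class names here are hypothetical Python stand-ins written for this example; they are not the Ptolemy II Java API. The same `Scale` actor runs unchanged under two directors with different communication semantics.

```python
from collections import deque


class FifoReceiver:
    """Dataflow-style receiver: buffers every token in arrival order."""

    def __init__(self):
        self._buf = deque()

    def put(self, token):
        self._buf.append(token)

    def has_token(self):
        return bool(self._buf)

    def get(self):
        return self._buf.popleft()


class MailboxReceiver:
    """Single-slot receiver: a new token overwrites an unread one."""

    def __init__(self):
        self._slot = []

    def put(self, token):
        self._slot = [token]

    def has_token(self):
        return bool(self._slot)

    def get(self):
        return self._slot.pop()


class Director:
    """Chooses the communication semantics by choosing the receiver."""

    def __init__(self, receiver_class):
        self.receiver_class = receiver_class

    def new_receiver(self):
        return self.receiver_class()


class InputPort:
    def __init__(self, director):
        self.receiver = director.new_receiver()  # supplied by the director


class Scale:
    """Consumer actor: knows nothing about how tokens are delivered."""

    def __init__(self, director, factor):
        self.input = InputPort(director)
        self.factor = factor

    def fire(self):
        return self.input.receiver.get() * self.factor


def run(director, tokens):
    """Producer sends all tokens first; then the actor fires until empty."""
    actor = Scale(director, 10)
    for t in tokens:
        actor.input.receiver.put(t)
    out = []
    while actor.input.receiver.has_token():
        out.append(actor.fire())
    return out


print(run(Director(FifoReceiver), [1, 2, 3]))
print(run(Director(MailboxReceiver), [1, 2, 3]))
```

Under the buffering director every token is processed. Under the mailbox director only the latest value survives, yet `Scale` itself did not change.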

34 Domains and Directors: Semantics for Component Interaction
CI – Push/pull component interaction
CSP – concurrent threads with rendezvous
CT – continuous-time modeling
DE – discrete-event systems
DDE – distributed discrete events
FSM – finite state machines
DT – discrete time (cycle driven)
Giotto – synchronous periodic
GR – 2-D and 3-D graphics
PN – process networks
SDF – synchronous dataflow
SR – synchronous/reactive
TM – timed multitasking
[Annotations: “For (coarse grained) Scientific Workflows!”; “For (finer-grained) concurrent jobs!?”]
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

35 Polymorphic Actor Components Working Across Data Types and Domains
Actor Data Polymorphism:
– Add numbers (int, float, double, Complex)
– Add strings (concatenation)
– Add complex types (arrays, records, matrices)
– Add user-defined types
Actor Behavioral Polymorphism:
– In dataflow, add when all connected inputs have data
– In a time-triggered model, add when the clock ticks
– In discrete-event, add when any connected input has data, and add in zero time
– In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
– In CSP, execute an infinite loop that performs rendezvous on input or output
– In push/pull, ports are push or pull (declared or inferred) and behave accordingly
– In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component reusability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
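Both kinds of polymorphism can be illustrated with a toy “Add” actor in Python. This is a sketch under simplifying assumptions: input ports are plain lists used as queues, only two of the firing rules from the slide are modelled, and all function names are invented for this example.

```python
from functools import reduce
import operator


def add_tokens(tokens):
    """Data polymorphism: '+' means whatever the token type defines."""
    return reduce(operator.add, tokens)


def dataflow_ready(inputs):
    """Dataflow rule: fire only when every connected input has data."""
    return all(len(q) > 0 for q in inputs)


def discrete_event_ready(inputs):
    """Discrete-event rule: fire as soon as any connected input has data."""
    return any(len(q) > 0 for q in inputs)


def fire_add(inputs, ready):
    """Run the same Add logic under whichever firing rule is passed in."""
    if not ready(inputs):
        return None
    available = [q.pop(0) for q in inputs if q]
    return add_tokens(available)


# Same actor logic, different token types.
print(add_tokens([1, 2.5]))
print(add_tokens(["ab", "cd"]))
print(add_tokens([1 + 2j, 3]))

# Same actor logic, different firing rules.
print(fire_add([[1], []], dataflow_ready))
print(fire_add([[1], []], discrete_event_ready))
```

The firing rule is passed in from outside, which mirrors the slide's point: the component does not choose among the execution semantics when it is defined.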

36 Directors and Combining Different Component Interaction Semantics
Possible app. in SWF:
– time-series aware …
– parameter-sweep aware …
– MPI aware …
– XYZ aware …
… execution models
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

37 Component Composition & Interaction
– Components linked via ports
– Dataflow (and msg/ctl-flow)
– Where is the component interaction semantics defined??
  – each component is its own director!
– But still useful for special applications, e.g. parallel programs (MPI, …)
[Figure: components with directors DIR1, DIR2, DIR3, DIR4, ???]
Source: GRIST/SC4DEVO workshop, July 2004, Caltech

38 CCA via special (“look the other way”) Director(s)?
CCA!?
Dataflow in CCA
– a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling
CCA/Message Passing in KEPLER
– Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)

39 Data/Control-Flow Spectrum
Data (tokens) flow
– (almost) no other side effects
– WYSIWYG (usually)
References flow
– token reference type may be “http-get”, “ftp-get”, “hsi put”, …
– generic handling still possible
Application-specific tokens flow
– e.g. current Nimrod job management in Resurgence
– “invisible contract” between components
– Director is unaware of what’s going on … (sounds familiar? ;-)
Specific message-passing protocols (e.g., CSP, MPI)
– for systems of tightly coupled components
[Figure: spectrum around an “actor”, from “clean” data(=ctl)-flow, through special tokens flow, to message passing/control flow]

40 KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-)
– Ilkay Altintas (SDM, Resurgence)
– Kim Baldridge (Resurgence, NMI)
– Chad Berkley (SEEK)
– Shawn Bowers (SEEK)
– Terence Critchlow (SDM)
– Tobin Fricke (ROADNet)
– Jeffrey Grethe (BIRN)
– Christopher H. Brooks (Ptolemy II)
– Zhengang Cheng (SDM)
– Dan Higgins (SEEK)
– Efrat Jaeger (GEON)
– Matt Jones (SEEK)
– Werner Krebs (EOL)
– Edward A. Lee (Ptolemy II)
– Kai Lin (GEON)
– Bertram Ludaescher (SEEK, GEON, SDM, BIRN, ROADNet)
– Mark Miller (EOL)
– Steve Mock (NMI)
– Steve Neuendorffer (Ptolemy II)
– Jing Tao (SEEK)
– Mladen Vouk (SDM)
– Xiaowen Xin (SDM)
– Yang Zhao (Ptolemy II)
– Bing Zhu (SEEK)
Ptolemy II

41 KEPLER: An Open Collaboration
Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects
Open Source (BSD-style license)
Intensive Communications:
– Web-archived mailing lists
– IRC (!)
Co-development:
– via shared CVS repository
– joining as a new co-developer (currently):
  – get a CVS account (read-only)
  – local development + contribution via existing KEPLER member
  – be voted “in” as a member/co-developer
Software & social engineering
– How to better accommodate new groups/communities?
– How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

42 GEON Dataset Generation & Registration (a co-development in KEPLER)
[Figure: workflow annotated with contributors: Xiaowen (SDM); Edward et al. (Ptolemy); Yang (Ptolemy); Efrat (GEON); Ilkay (SDM); Matt, Chad, Dan et al. (SEEK); label: SQL database access (JDBC)]
% Makefile
$> ant run

43 KEPLER then …

44 … and KEPLER today…
[Figure: speech bubbles]
– “… so, you see, scientific workflows need domain- and data-polymorphic actors & must scale to HPC!”
– “What is HPC?”
– “What’s a scientific workflow?”
– “What’s a polymorphic actor?”
BTW: Kepler is NOT a GUI (Vergil is)

45 KEPLER Pedigree (to be determined…)
[Figure: pedigree diagram. Labels: Gabriel, Ptolemy, Ptolemy II, KEPLER; Graphical dataflow environments; Problem solving environments; Grid workflows; SCIRun, Khoros, AVS, openDX, DiscoveryNet, Taverna, Triana, Pegasus, Matrix]

46 A Few Specific Kepler Features

47 Web Services → Actors (WS Harvester)
[Figure: screenshots, steps 1-4]
→ “Minute-made” (MM) WS-based application integration
Similarly: MM workflow design & sharing w/o implemented components

48 Recent Actor Additions

49 Digression: Who are the clients?
Domain scientists
1. C/Perl/Python/Java/WS/DB-enabled ones
2. others (e.g. visually-inclined rest of us?)
Goal: make life better for both!
– Workflow automation
– Plumbing support
– Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)

50 For the Geoscientist: GEON Mineral Classification Workflow

51 … inside the Classifier
BrowserUI actor w/ SVG client display

52 in KEPLER (interactive session)
Source: Dan Higgins, Kepler/SEEK

53 in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK

54 A Closer Look at Dataflow … (or: Do you know what’s going on under your carpet?)
– Dataflow: what you see is what you get (almost…)
– control tokens flow, e.g., from “$”-actor to FileReader and ImageReader actors
– actual dataflow is “under the carpet” and through handles (file system, GridFTP, scp, SRB, …)
→ Need for a general way to handle references!
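One possible shape for “a general way to handle references” is a token type that carries a URL plus a registry of per-scheme fetchers, so downstream actors can treat data tokens and reference tokens alike. The sketch below is an assumption-laden illustration, not Kepler's mechanism: `DataToken`, `RefToken`, `FETCHERS` and the in-memory `mem` scheme are all invented here, and real transports (GridFTP, scp, SRB, …) are deliberately left out.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DataToken:
    """Token that carries the data itself."""
    value: bytes


@dataclass(frozen=True)
class RefToken:
    """Token that carries only a handle to the data."""
    url: str


# In-memory stand-in for remote storage (hypothetical, for the demo only).
FAKE_STORE = {"mem://demo/a.txt": b"hello"}


def fetch_mem(url):
    return FAKE_STORE[url]


# scheme -> fetch function; real transports would be registered here.
FETCHERS = {"mem": fetch_mem}


def resolve(token):
    """Turn any token into data, dereferencing references on demand."""
    if isinstance(token, DataToken):
        return token.value
    scheme = urlparse(token.url).scheme
    if scheme not in FETCHERS:
        raise ValueError(f"no handler registered for scheme {scheme!r}")
    return FETCHERS[scheme](token.url)


print(resolve(DataToken(b"inline")))
print(resolve(RefToken("mem://demo/a.txt")))
```

Keeping the dispatch in one place means a new transport is added by registering a fetcher, without touching any actor that consumes tokens.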

55 GEON Data Registration UI

56 GEON Data Registration in KEPLER

57 Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)

58 Data Analysis: Biodiversity Indices

59 Traffic info for a list of highways: uses the iterate (higher-order “map”) actor to access a highway info web service repeatedly, sending out one email per highway.

62 Re-engineered PIW w/ Iteration Constructs (AD 2004)
map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
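The map(GenbankWS) construct lifts a service that handles one accession ID into one that handles a list. A minimal Python sketch follows; `map_actor` and `genbank_stub` are hypothetical names, and the stub returns a placeholder string instead of calling any real web service or inventing sequence data.

```python
def map_actor(service):
    """Lift a one-item service into an actor that handles a whole list."""
    def mapped(items):
        return [service(item) for item in items]
    return mapped


def genbank_stub(accession):
    """Hypothetical stand-in for the GenBank web-service call (no network)."""
    return f"<sequence for {accession}>"


fetch_all = map_actor(genbank_stub)
sequences = fetch_all(["NM_001924", "NM020375"])
print(sequences)
```

The iteration lives in `map_actor`, so the service itself needs no loop or control-flow wiring.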

63 Streaming Real-time Data
Laser Strainmeter Channels in; Scientific Workflow; Earth-tide signal out
Straightforward Example: Seismic Waveforms

64 ORB

65 Job Management (here: NIMROD)
– Job management infrastructure in place
– Results database: under development
– Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter ’04

66 KEPLER Today
Support for SWF life cycle
– Design, share, prototype, run, monitor, deploy, …
Coarse-grained scientific workflows, e.g.,
– web service actors, grid actors, command-line actors, …
Fine-grained workflows and simulations, e.g.,
– Database access, XSLT transformations, …
Kepler Extensions
– SDM Center/SPA: support for data- and compute-intensive workflows!
– real-time data streaming (ROADNet)
– other special and generic extensions (e.g. GEON, SEEK)
Status
– first release (alpha) was in May 2004
– nightly builds w/ version tests
– “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
– Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)

67 KEPLER Tomorrow
Application-driven extensions:
– access to/integration with other IDMAF components
  – SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
– support for execution of new SWF domains
  – Astrophysics: TSI/Blondin (SPA/NCSU)
  – Nuclear Physics: Swesty (SPA/LLNL)
  – …
Generic extensions:
– addtl. support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …)
– (C-z; bg; fg)-ing (“detach” and reconnect)
– workflow deployment models
Additional “domain awareness” (e.g. via new directors)
– time series, parameter sweeps, job scheduling, …
– hybrid type system with semantic types
Consolidation
– More installers, regular releases, improved documentation, …

68 Desiderata for and Features of Scientific Workflow Automation
SWF design support
– step-wise refinement, component/actor-oriented design, flow-oriented design, sharing (visual) design with others, …
– better component reuse through actor-oriented modeling w/ (largely) independent directors
Rapid prototyping support
– Web service actors and harvester
– Shell/command line actor
– Data transformations (e.g., via Perl, Python, XSLT, … actors)
Workflow “plumbing” support
– data transformation actors, e.g., in Perl, Python, XSLT, …
Runtime support
– Execution monitoring
  – animation for SDF, planned “heartbeat” for PN, …
  – listening to and logging of token flow through ports and control messages of directors
– Pause-inspect-modify-resume cycle

69 FIN
Additional material ahead

70 Research (and Development) Issues
…some challenges and ideas…

71 “Service Composition, Orchestration” and all that stuff
Instead of asking which WS-XXX solves this for you, ask: What is my WF composition problem?
Also: there is a good amount of previous work, most notably from the Ptolemy group itself:
– How do you model systems as interacting components?
– How do you model component interaction?
– How can you make components and interaction patterns as reusable as possible?
– …
→ Check out actor-oriented modeling and design!

72 “Programming Patterns” (Higher-Order FP Constructs)

73 Traffic info for a list of highways: uses the iterate (higher-order “map”) actor to access a highway info web service repeatedly, sending out one email per highway.

76 [Figure annotations:]
– hand-crafted control solution; also: forces sequential execution!
– designed to fit hand-crafted Web-service actor
– Complex backward control-flow
– No data transformations available
[Altintas-et-al-PIW-SSDBM’03]

77 A Scientific Workflow Problem: More Solved (Computer Scientist’s view)
Solution based on declarative, functional dataflow process network (= also a data streaming model!)
Higher-order constructs: map(f)
→ no control-flow spaghetti
→ data-intensive apps
→ free concurrent execution
→ free type checking
→ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure annotations:]
– map(f)-style iterators
– Powerful type checking
– Generic, declarative “programming” constructs
– Generic data transformation actors
– Forward-only, abstractable sub-workflow piw(GeneId)

78 A Scientific Workflow Problem: Even More Solved (domain & CS coming together!)
map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}

79 A Research Problem: Optimization by Rewriting
Example: PIW as a declarative, referentially transparent functional process
→ optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)
Technical report & PIW specification in Haskell
[Figure annotations: “map(f o g) instead of map(f) o map(g)”; “Combination of map and zip”]
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
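The rewrite rule map(f o g) = map(f) o map(g) can be checked on a small example. The Python sketch below uses `str.upper` and `str.strip` as arbitrary side-effect-free stand-ins for workflow steps; the helper names `compose` and `fmap` are assumptions of this example (the slide's own specification is in Haskell).

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def fmap(f):
    """map(f) as a list-to-list function."""
    return lambda xs: [f(x) for x in xs]


f = str.upper   # stand-in for one side-effect-free step
g = str.strip   # stand-in for another

data = [" acgt ", "ttga"]

two_pass = compose(fmap(f), fmap(g))(data)  # map(f) o map(g): two traversals
one_pass = fmap(compose(f, g))(data)        # map(f o g): one traversal

print(two_pass, one_pass, two_pass == one_pass)
```

The two sides agree only because f and g have no side effects, which is the referential transparency the slide asks for. The fused form avoids building the intermediate list.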

80 More Research…
Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f ∘ mapS g ⇒ mapS (f ∘ g)
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

81 KEPLER Today
Support for SWF life cycle
– Design, share, prototype, run, monitor, deploy, …
Coarse-grained scientific workflows, e.g.,
– web service actors, grid actors, command-line actors, …
Fine-grained workflows and simulations, e.g.,
– Database access, XSLT transformations, …
Kepler Extensions
– support for data- and compute-intensive workflows!
– real-time data streaming (ROADNet)
– other special and generic extensions (e.g. GEON, SEEK)
Status
– first release (alpha) was in May 2004
– nightly builds w/ version tests
– “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
– Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)

83 Towards a more concise Presentation Style …
Due to lack of time, some slides will be “by reference” only ;-)
– “… Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work.... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world…”
[The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]

84 References
Kepler: http://kepler-project.org
Ptolemy: http://ptolemy.eecs.berkeley.edu/
Flow-based Programming: http://www.jpaulmorrison.com/fbp/index.shtml
Wiki with links to others: http://www.jpaulmorrison.com/cgi-bin/wiki.pl
– http://c2.com/cgi/wiki?FlowBasedProgramming
– http://c2.com/cgi/wiki?DataflowProgramming
– http://c2.com/cgi/wiki?ActorsModel

