Large-Scale Science Through Workflow Management Ewa Deelman Center for Grid Technologies USC Information Sciences Institute.


1 Large-Scale Science Through Workflow Management Ewa Deelman Center for Grid Technologies USC Information Sciences Institute

2 Acknowledgements
Ewa Deelman, deelman@isi.edu, www.isi.edu/~deelman, pegasus.isi.edu
- Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi (Center for Grid Technologies, ISI)
- James Blythe, Yolanda Gil (Intelligent Systems Division, ISI)
- http://pegasus.isi.edu
- Research funded as part of the NSF GriPhyN, NVO and SCEC projects and EU-funded GridLab

3 Today’s Scientific Applications
- Increasing in the level of complexity
  - Use of individual application components
  - Reuse of individual intermediate data products (files)
  - Description of Data Products using Metadata Attributes
- Execution environment is complex and very dynamic
  - Resources come and go
  - Data is replicated
  - Components can be found at various locations or staged in on demand
- Separation between the application description and the actual execution description

4 Workflow Definitions
- Workflow template: shows the main steps in the scientific analysis and their dependencies without specifying particular data products
- Abstract workflow: depicts the scientific analysis including the data used and generated, but does not include information about the resources needed for execution
- Concrete workflow: an executable workflow that includes details of the execution environment
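
The three definitions above can be illustrated with a minimal Python sketch. It is not the Pegasus data model: the class names, the `instantiate` and `bind` helpers, the site name, the executable path and the URL prefix are all hypothetical, chosen only to show what information each level adds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateStep:
    """Workflow template: a step and its dependencies, no data bound yet."""
    name: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class AbstractJob:
    """Abstract workflow: logical component plus logical file names."""
    transformation: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ConcreteJob:
    """Concrete workflow: the same job bound to a site and physical locations."""
    transformation: str
    site: str
    executable: str
    inputs: tuple
    outputs: tuple


def instantiate(step, inputs, outputs):
    # Template -> abstract: attach the data products used and generated.
    return AbstractJob(step.name, tuple(inputs), tuple(outputs))


def bind(job, site, executable, url_prefix):
    # Abstract -> concrete: attach execution-environment details.
    def to_url(lfn):
        return f"{url_prefix}/{lfn}"

    return ConcreteJob(
        job.transformation,
        site,
        executable,
        tuple(to_url(f) for f in job.inputs),
        tuple(to_url(f) for f in job.outputs),
    )


# Hypothetical example values.
step = TemplateStep("extract")
abstract = instantiate(step, ["raw.dat"], ["features.dat"])
concrete = bind(abstract, "siteA", "/opt/bin/extract",
                "gsiftp://siteA.example.org/data")
```

Note that `abstract` carries no site or path information, so the same abstract job could be bound to a different site without changing the analysis description.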

5

6

7 Concrete Workflow Generation and Mapping

8 Pegasus: Planning for Execution in Grids
- Maps from abstract to concrete workflow
- Algorithmic and AI-based techniques
- Automatically locates physical locations for both workflow components and data
- Finds appropriate resources to execute
- Reuses existing data products where applicable
- Publishes newly derived data products
- Provides provenance information

9 Generating a Concrete Workflow
- Information
  - location of files and component Instances
  - State of the Grid resources
- Select specific
  - Resources
  - Files
- Add jobs required to form a concrete workflow that can be executed in the Grid environment
  - Data movement
  - Data registration
- Each component in the abstract workflow is turned into an executable job
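
As a rough illustration of the steps on this slide, the sketch below turns a list of abstract jobs into a concrete plan by selecting a site and adding data movement and data registration jobs. It is an assumption-laden toy, not Pegasus code: the `concretize` function, the dictionary-based catalogs, the tuple job format and the "pick the alphabetically first site" policy are all hypothetical, and the jobs are assumed to be listed in dependency order.

```python
def concretize(abstract_jobs, replicas, transformations, pick_site=min):
    """Build a concrete plan from abstract jobs.

    abstract_jobs:   list of (name, transformation, inputs, outputs),
                     assumed to be in dependency order
    replicas:        logical file name -> set of sites holding a copy
                     (a raw input missing from this mapping raises KeyError)
    transformations: transformation name -> list of sites where it is installed
    pick_site:       hypothetical site-selection policy
    """
    plan = []
    produced_at = {}  # logical file name -> site where a planned job writes it
    for name, transformation, inputs, outputs in abstract_jobs:
        site = pick_site(transformations[transformation])
        for lfn in inputs:
            if lfn in produced_at:
                source_sites = {produced_at[lfn]}
            else:
                source_sites = replicas[lfn]
            if site in source_sites:
                continue  # data is already where the job will run
            # Data movement job: stage the file in from some existing copy.
            plan.append(("transfer", lfn, sorted(source_sites)[0], site))
        # The abstract component becomes an executable job on the chosen site.
        plan.append(("compute", name, site))
        for lfn in outputs:
            produced_at[lfn] = site
            # Data registration job: record the new data product.
            plan.append(("register", lfn, site))
    return plan


# Hypothetical catalogs and workflow.
replicas = {"a": {"siteB"}}
transformations = {"t1": ["siteA"], "t2": ["siteA", "siteB"]}
jobs = [("j1", "t1", ["a"], ["b"]), ("j2", "t2", ["b"], ["c"])]
plan = concretize(jobs, replicas, transformations)
```

In this example only file "a" needs a transfer, because both jobs end up on the same site and "b" is produced where it is consumed.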

10 Information Components used by Pegasus
- Globus Monitoring and Discovery Service (MDS)
  - Locates available resources
  - Finds resource properties
    - Dynamic: load, queue length
    - Static: location of GridFTP server, RLS, etc.
- Globus Replica Location Service (RLS)
  - Locates data that may be replicated
  - Registers new data products
- Transformation Catalog (TC)
  - Locates installed executables

11 Example Workflow Reduction
- Original abstract workflow
- If “b” already exists (as determined by query to the RLS), the workflow can be reduced
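
The reduction idea can be sketched in a few lines of Python. This is one simple way to do it, not necessarily the algorithm Pegasus uses: the `reduce_workflow` function and the job names are hypothetical, and the `existing` set stands in for the result of a replica catalog query. The sketch walks backwards from the requested files and keeps only the jobs whose outputs are actually missing.

```python
def reduce_workflow(jobs, requested, existing):
    """Return the names of the jobs that still need to run.

    jobs:      job name -> (input file names, output file names)
    requested: files the user ultimately wants
    existing:  files already available (stand-in for a catalog query)
    """
    producer = {lfn: name
                for name, (_, outputs) in jobs.items()
                for lfn in outputs}
    needed = set()
    pending = [lfn for lfn in requested if lfn not in existing]
    while pending:
        lfn = pending.pop()
        job = producer.get(lfn)
        if job is None or job in needed:
            # No job produces this file, or its producer is already scheduled.
            continue
        needed.add(job)
        # The job's own missing inputs must be produced as well.
        pending.extend(f for f in jobs[job][0] if f not in existing)
    return needed


# Hypothetical two-step chain: a -> j1 -> b -> j2 -> c
jobs = {"j1": (["a"], ["b"]), "j2": (["b"], ["c"])}
full = reduce_workflow(jobs, ["c"], existing={"a"})
reduced = reduce_workflow(jobs, ["c"], existing={"a", "b"})
```

When "b" is already available, only the job that consumes it remains in the workflow.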

12 Mapping from abstract to concrete
- Query RLS, MDS, and TC, schedule computation and data movement

13 Pegasus Research
- resource discovery and assessment
- resource selection
- resource provisioning
- workflow restructuring
  - tasks merged together or reordered to improve overall performance
- adaptive computing
  - Workflow refinement adapts to changing execution environment

14 Benefits of the workflow & Pegasus approach
- The workflow exposes
  - the structure of the application
  - maximum parallelism of the application
- Pegasus can take advantage of the structure to
  - Set a planning horizon (how far into the workflow to plan)
  - Cluster a set of workflow nodes to be executed as one (for performance)
- Pegasus shields the user from the Grid details
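
To make "exposed parallelism", "planning horizon" and "clustering" concrete, here is a small Python sketch under stated assumptions. It groups jobs by their depth in the dependency graph (jobs at the same depth do not depend on each other), limits planning to the first few levels, and batches the jobs of one level into fixed-size clusters. The function names, the level-based clustering policy and the example workflow are hypothetical and are not taken from Pegasus.

```python
def levels(deps):
    """Group jobs by depth. deps maps job -> set of jobs it depends on."""
    depth = {}

    def visit(job):
        if job not in depth:
            # Roots get depth 0; otherwise one more than the deepest parent.
            depth[job] = 1 + max((visit(p) for p in deps[job]), default=-1)
        return depth[job]

    for job in deps:
        visit(job)
    grouped = [[] for _ in range(max(depth.values(), default=-1) + 1)]
    for job in sorted(depth):
        grouped[depth[job]].append(job)
    return grouped


def cluster(level, size):
    """Batch the jobs of one level so each batch is submitted as one unit."""
    return [level[i:i + size] for i in range(0, len(level), size)]


# Hypothetical fan-out / fan-in workflow.
deps = {
    "split": set(),
    "p1": {"split"},
    "p2": {"split"},
    "p3": {"split"},
    "merge": {"p1", "p2", "p3"},
}
by_level = levels(deps)
horizon = by_level[:2]               # plan only the first two levels for now
batches = cluster(by_level[1], 2)    # run the middle level as two clusters
```

The widest level gives the maximum parallelism the workflow exposes, here three jobs.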

15 Benefits of the workflow & Pegasus approach
- Pegasus can run the workflow on a variety of resources
- Pegasus can run a single workflow across multiple resources
- Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)
- Pegasus can take advantage of pre-existing intermediate data products
- Pegasus can improve the performance of the application.

16 Mosaic of M42 created on the TeraGrid resources using Pegasus
- Pegasus improved the runtime of this application by 90% over the baseline case
- Bruce Berriman, John Good (Caltech)
- Joe Jacob, Dan Katz (JPL)

17 Future Directions
- Support for workflows with real-time feedback to scientists.
- Providing intermediate analysis results so that the experimental setup can be adjusted while the short-lived samples or human subjects are available.

18 Cognitive Grids: Distributed Intelligent Reasoners that Incrementally Generate the Workflow
[Figure: axes are time and levels of abstraction (application-level knowledge, logical tasks, tasks bound to resources and sent for execution). Labels: User’s Request, Relevant components, Full abstract workflow, Partial execution, Not yet executed. Reasoners: Workflow refinement, Onto-based Matchmaker, Workflow repair, Policy reasoner.]

19 BLAST: set of sequence comparison algorithms that are used to search sequence databases for optimal local alignments to a query
- Led by Veronika Nefedova (ANL) as part of the Paci Data Quest Expedition program
- 2 major runs were performed using Chimera and Pegasus:
  1) 60 genomes (4,000 sequences each), processed in 24 hours
     - Genomes selected from DOE-sponsored sequencing projects
     - 67 CPU-days of processing time delivered
     - ~10,000 Grid jobs
     - >200,000 BLAST executions
     - 50 GB of data generated
  2) 450 genomes processed
- Speedups of 5-20 times were achieved because the compute nodes were used efficiently by keeping the submission of the jobs to the compute cluster constant.

20 Tomography (NIH-funded project)
- Derivation of 3D structure from a series of 2D electron microscopic projection images
- Reconstruction and detailed structural analysis of
  - complex structures like synapses
  - large structures like dendritic spines
- Acquisition and generation of huge amounts of data
- Large amount of state-of-the-art image processing required to segment structures from extraneous background
- Dendrite structure to be rendered by Tomography
- Work performed with Mark Ellisman, Steve Peltier, Abel Lin, Thomas Molina (SDSC)

21 LIGO’s pulsar search at SC 2002
- The pulsar search conducted at SC 2002
  - Used LIGO’s data collected during the first scientific run of the instrument
  - Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
  - Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration
  - Performed using LDAS and compute and storage resources at Caltech, University of Southern California, University of Wisconsin Milwaukee
- ISI people involved: Gaurang Mehta, Sonal Patil, Srividya Rao, Gurmeet Singh, Karan Vahi
- Visualization by Marcus Thiebaux

22 Southern California Earthquake Center
- Southern California Earthquake Center (SCEC), in collaboration with the USC Information Sciences Institute, San Diego Supercomputer Center, the Incorporated Research Institutions for Seismology, and the U.S. Geological Survey, is developing the Southern California Earthquake Center Community Modeling Environment (SCEC/CME).
- Create fully three-dimensional (3D) simulations of fault-system dynamics.
- Physics-based simulations can potentially provide enormous practical benefits for assessing and mitigating earthquake risks through Seismic Hazard Analysis (SHA).
- The SCEC/CME system is an integrated geophysical simulation modeling framework that automates the process of selecting, configuring, and executing models of earthquake systems.
- Acknowledgments: Philip Maechling and Vipin Gupta, University of Southern California

