

Workflows & Tools

Data Analysis  Review of typical data analyses  Reproducibility & provenance  Overview of workflows  Computer-based scientific workflows (SWF)  Benefits of SWF  Examples of SWF and associated tools

Data Analysis  After completing this lesson, the participant will be able to: ◦ Understand a subset of typical analyses used ◦ Define a workflow ◦ Define an SWF ◦ Discuss the benefits of workflows in general and SWF in particular ◦ Locate resources for using SWF

Data Analysis
- Conducted via personal computer, grid, or cloud computing
- Statistics, model runs, parameter estimation, production of graphs/plots, etc.

Data Analysis  Processing: subsetting, merging, manipulating ◦ Reduction: important for high-resolution datasets ◦ Transformation: unit conversions, linear and nonlinear algorithms Datetimeair tempprecip Cmm 11-Jul-075: Jul-076: Jul-077: Jul-078: Jul-079: Jul-0710: Jul-0711: Jul-0712: Recreated from Michener & Brunt (2000)

Data Analysis  Graphical analyses ◦ Visual exploration of data: search for patterns ◦ Quality assurance: outlier detection Box and whisker plot of temperature by month Scatter plot of August Temperatures Strasser, unpub. data

Data Analysis  Statistical analyses Conventional statistics -Traditionally apply to experimental data -Examples: ANOVA, MANOVA, linear and nonlinear regression Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance Descriptive statistics Traditionally apply to observational or descriptive data Examples: diversity indices, cluster analysis, quadrant variance, distance methods, principal component analysis, correspondence analysis Oksanen 2011 Example of Principle Component Analysis

Data Analysis  Statistical analyses (continued) ◦ Temporal analyses: time series ◦ Spatial analyses: for spatial autocorrelation ◦ Nonparametric approaches: useful when conventional assumptions violated or underlying distribution unknown ◦ Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.  Analyses of very large datasets ◦ Data mining & discovery ◦ Online data processing

Data Analysis  Re-analysis of outputs  Final visualizations: charts, graphs, simulations etc. Science is iterative: The process that results in the final product can be complex

Data Analysis  Reproducibility is at the core of scientific method  Complex process = more difficult to reproduce  Good documentation required for reproducibility ◦ Metadata: data about data ◦ Process metadata: data about process used to create, manipulate, and analyze data

Data Analysis  Process metadata is information about the process used to get to the data outputs  Related concept: data provenance ◦ Data provenance is information about the origins of data ◦ Good provenance = able to follow data throughout entire life cycle (collection, organization & quality control, analyses, visualization) ◦ Allows for  Replication & reproducibility  Analysis for potential defects, errors in logic, statistical errors  Evaluation of hypotheses

Data Analysis  A workflow is a formalization of process metadata  Includes precise description of scientific procedure  Includes conceptualized series of data ingestion, transformation, and analytical steps  Three components of a workflow: 1.Inputs: Information or material required 2.Outputs: Information or material produced & potentially used as input in other steps 3.Transformation rules/algorithms (e.g. analyses)

Data Analysis  Simplest form of workflow: flow chart Data import into Excel Analysis: mean, SD Graph production Quality control & data cleaning

Data Analysis  Simplest form of workflow: flow chart Temperature data (T) Salinity data (S) Data import into Excel Analysis: mean, SD Graph production Quality control & data cleaning “Clean” T & S data Inputs & Outputs Summary statistics Data in Excel format Input: Raw T and S data Output: data in Excel format Input: data in Excel format

Data Analysis
- Simplest form of workflow: a flow chart, here annotated with transformation rules
- Transformation rules describe what is done to or with the data to obtain the relevant outputs for publication

[Flow chart: temperature data (T) and salinity data (S) → data import into Excel → quality control & data cleaning → "clean" T & S data → analysis (mean, SD) → summary statistics → graph production]
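The flow chart above translates almost line for line into a script. A minimal sketch, with the spreadsheet import replaced by inline values and a hypothetical -999.0 sensor-error code standing in for the quality-control criterion:

```python
import statistics

# "Import" step: raw temperature (T) and salinity (S) readings,
# with -999.0 as a hypothetical sensor-error code.
raw_t = [12.1, 12.3, -999.0, 12.8, 13.0]
raw_s = [35.1, 35.2, 35.0, -999.0, 35.3]

def quality_control(values, error_code=-999.0):
    """Quality control & data cleaning: drop flagged readings."""
    return [v for v in values if v != error_code]

clean_t, clean_s = quality_control(raw_t), quality_control(raw_s)

# Analysis step: the transformation rule is "compute mean and SD".
summary = {
    "T": (statistics.mean(clean_t), statistics.stdev(clean_t)),
    "S": (statistics.mean(clean_s), statistics.stdev(clean_s)),
}
print(summary)  # the graph-production step would plot these values
```

Each function boundary corresponds to one box in the flow chart, which is what makes the script itself a readable record of the workflow.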

Data Analysis
- Science is becoming more computationally intensive
- Most transformations are done via computer programs
- Sharing workflows benefits science
- A scientific workflow system makes documenting workflows easier

Data Analysis  A scientific workflow is an “analytical pipeline”  Each step can be implemented in different software systems  Each step and its parameters/requirements are formally recorded  This allows reuse of both individual steps and the overall workflow

Data Analysis  Single access point for multiple analyses across software packages  Keeps track of analysis and provenance: enables reproducibility ◦ Each step & its parameters/requirements formally recorded  Workflow can be stored  Allows sharing and reuse of individual steps or overall workflow ◦ Automate repetitive tasks ◦ Use across different disciplines and groups ◦ Can run analyses more quickly since not starting from scratch

Data Analysis  Open-source, free, cross-platform  Drag-and-drop interface for workflow construction  Steps (analyses, manipulations, etc) in workflow represented by an “actor”  Actors connect via inputs and outputs to form a workflow  Possible applications ◦ Theoretical models or observational analyses ◦ Hierarchical modeling ◦ Can have nested workflows ◦ Can access data from web-based sources (e.g. databases)  Downloads and more information at kepler-project.org

Data Analysis

[Screenshot: Kepler interface. Components are dragged & dropped from the actor list into the workflow canvas]

Data Analysis
This model shows the solution to the classic Lotka-Volterra predator-prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one modeling the predator population and one modeling the prey population. The results are plotted as they are calculated, showing both population change over time and a phase diagram of the dynamics.
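The two coupled equations the example refers to are the standard Lotka-Volterra system, dx/dt = ax - bxy for prey and dy/dt = dxy - gy for predators. A minimal numerical sketch with simple Euler stepping and illustrative parameter values (not those of the Kepler model):

```python
# Lotka-Volterra predator-prey model, solved with a basic Euler step.
# Parameter values are illustrative, not taken from the Kepler example.
a, b, d, g = 1.1, 0.4, 0.1, 0.4   # growth, predation, conversion, death rates
prey, predator = 10.0, 10.0        # initial populations
dt, steps = 0.001, 5000            # step size and number of steps

trajectory = [(prey, predator)]
for _ in range(steps):
    d_prey = (a * prey - b * prey * predator) * dt
    d_predator = (d * prey * predator - g * predator) * dt
    prey, predator = prey + d_prey, predator + d_predator
    trajectory.append((prey, predator))

# trajectory holds (prey, predator) pairs: plotting each series against time
# shows population change; plotting prey against predator gives the phase diagram.
```

Kepler's Continuous Time domain uses proper adaptive ODE solvers rather than this fixed-step Euler scheme, which is used here only for brevity.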

Data Analysis

[Screenshot: resulting output plots]

Data Analysis  Open-source  Workflow & provenance management support  Geared toward exploratory computational tasks ◦ Can manage evolving SWF ◦ Maintains detailed history about steps & data  Screenshot example

Data Analysis  Social networking site to support scientists that use SWF  Allows searching for, sharing, reuse of SWF  Can comment on and discuss contributed SWF  Gateway to journals and data repositories 

Data Analysis  Scientists should document workflows used to create results ◦ Data provenance ◦ Analyses and parameters used ◦ Connections between analyses via inputs and outputs  Documentation can be informal (for example, a flowchart) or formal (for example, Kepler software)

Data Analysis  Modern science is computer-intensive ◦ Heterogeneous data, analyses, software  Reproducibility is important  Workflows = process metadata ◦ Necessary for reproducibility, repeatability, validation  There are formal systems for documenting process metadata ◦ Enable storage, sharing, visualization, reuse

Data Analysis  Gil, Y, E Deelman, M Ellisman, T Fahringer, G Fox, D Gannon, C Goble, M Livny, L Moreau, and J Myers. Examining the Challenges of Scientific Workflows. Computer 40:24–32,  Michener, K, J Beach, M Jones, B Ludaescher, D Pennington, R Pereira, A Rajasekar, and M Schildhauer. A knowledge environment for the biodiversity and ecological sciences. Journal of Intelligent Information Systems, 29:111–126, August  Ludäscher, B, I Altintas, S Bowers, J Cummings, T Critchlow, E Deelman, DD Roure, J Freire, C Goble, M Jones, S Klasky, T McPhillips, N. Podhorszki, C Silva, I Taylor, and M Vouk. Scientific Process Automation and Workflow Management. Computational Science Series Ch 13. Chapman & Hall, Boca Raton,  McPhillips, T, S Bowers, D Zinn, B Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems 25: ,  B Ludäscher, I Altintas, C Berkley, D Higgins, E Jaeger-Frank, M Jones, E Lee, J Tao, and Y Zhao. Scientific workflow management and the kepler system. Concurrency and Computation: Practice & Experience, 18,  W Michener and J Brunt, editors. Ecological Data: Design, Management and Processing. Blackwell Science, 180p, 2000.


Data Analysis
You have completed this learning module.
