
1 Workflows & Tools

2 Data Analysis
• Review of typical data analyses
• Reproducibility & provenance
• Overview of workflows
• Computer-based scientific workflows (SWF)
• Benefits of SWF
• Examples of SWF and associated tools

3 Data Analysis
• After completing this lesson, the participant will be able to:
  ◦ Understand a subset of typical analyses used
  ◦ Define a workflow
  ◦ Define an SWF
  ◦ Discuss the benefits of workflows in general and SWF in particular
  ◦ Locate resources for using SWF

4 Data Analysis

5
• Conducted via personal computer, grid, cloud computing
• Statistics, model runs, parameter estimations, production of graphs/plots, etc.

6 Data Analysis
• Processing: subsetting, merging, manipulating
  ◦ Reduction: important for high-resolution datasets
  ◦ Transformation: unit conversions, linear and nonlinear algorithms
Raw records:
0711070500276000
0711070600276000
0711070700277003
0711070800282017
0711070900285000
0711071000293000
0711071100301000
0711071200304000
Transformed table:
Date       Time   Air temp (C)  Precip (mm)
11-Jul-07  5:00   27.6          000
11-Jul-07  6:00   27.6          000
11-Jul-07  7:00   27.7          003
11-Jul-07  8:00   28.2          017
11-Jul-07  9:00   28.5          000
11-Jul-07  10:00  29.3          000
11-Jul-07  11:00  30.1          000
11-Jul-07  12:00  30.4          000
Recreated from Michener & Brunt (2000)
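The transformation from raw records to a readable table can be sketched in Python. This is a minimal illustration, not the method used by Michener & Brunt: the field layout (MMDDYY, HHMM, air temperature in tenths of a degree C, three precipitation digits) is an assumption inferred from comparing the raw records with the table, and the function name `parse_record` is hypothetical.

```python
from datetime import datetime


def parse_record(record: str) -> dict:
    """Split one 16-digit raw record into named fields.

    Assumed layout (inferred from the table, not stated in the source):
    MMDDYY (6) + HHMM (4) + air temperature in tenths of a degree C (3)
    + precipitation digits (3).
    """
    if len(record) != 16 or not record.isdigit():
        raise ValueError(f"unexpected record format: {record!r}")
    timestamp = datetime.strptime(record[:10], "%m%d%y%H%M")
    return {
        "timestamp": timestamp,
        # Transformation step: tenths of a degree -> degrees C.
        "air_temp_c": int(record[10:13]) / 10,
        # Kept as text because the source does not give the scaling.
        "precip_raw": record[13:16],
    }


raw_records = ["0711070500276000", "0711070800282017"]
parsed = [parse_record(r) for r in raw_records]

for row in parsed:
    print(row["timestamp"].strftime("%d-%b-%y %H:%M"),
          row["air_temp_c"], row["precip_raw"])
```

Rejecting malformed records up front, rather than slicing whatever arrives, keeps a processing step from silently producing wrong values.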

7 Data Analysis
• Graphical analyses
  ◦ Visual exploration of data: search for patterns
  ◦ Quality assurance: outlier detection
[Figures: box-and-whisker plot of temperature by month; scatter plot of August temperatures. Strasser, unpub. data]
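Outlier detection of the kind a box-and-whisker plot makes visible can also be done numerically. The sketch below applies the common 1.5 × IQR fence rule, which is one convention among several and is not named in the slides. The function name `tukey_outliers` and the value 99.9 (a made-up sensor glitch) are illustrative assumptions.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Air temperatures from the earlier table plus one hypothetical bad reading.
temps = [27.6, 27.6, 27.7, 28.2, 28.5, 29.3, 30.1, 30.4, 99.9]
flagged = tukey_outliers(temps)
print(flagged)
```

Flagged values are candidates for inspection, not automatic deletion: a quality-assurance step should record why a value was dropped or kept.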

8 Data Analysis
• Statistical analyses
  ◦ Conventional statistics
    - Traditionally applied to experimental data
    - Examples: ANOVA, MANOVA, linear and nonlinear regression
    - Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance
  ◦ Descriptive statistics
    - Traditionally applied to observational or descriptive data
    - Examples: diversity indices, cluster analysis, quadrat variance, distance methods, principal component analysis, correspondence analysis
[Figure: example of principal component analysis. Oksanen 2011]
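As one concrete instance of the descriptive statistics listed above, a diversity index can be computed in a few lines. The sketch uses the Shannon index, H = -Σ p_i ln p_i; the slides mention diversity indices only in general, so the choice of index and the species counts are assumptions made for illustration.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over species with count > 0."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical counts of individuals per species at two sites.
even_site = [10, 10, 10, 10]
uneven_site = [37, 1, 1, 1]
print(shannon_index(even_site), shannon_index(uneven_site))
```

With the same number of species, the evenly distributed site yields the higher index, which is the behavior the index is designed to capture.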

9 Data Analysis
• Statistical analyses (continued)
  ◦ Temporal analyses: time series
  ◦ Spatial analyses: for spatial autocorrelation
  ◦ Nonparametric approaches: useful when conventional assumptions are violated or the underlying distribution is unknown
  ◦ Other misc. analyses: risk assessment, generalized linear models, mixed models, etc.
• Analyses of very large datasets
  ◦ Data mining & discovery
  ◦ Online data processing

10 Data Analysis
• Re-analysis of outputs
• Final visualizations: charts, graphs, simulations, etc.
Science is iterative: the process that results in the final product can be complex.

11 Data Analysis
• Reproducibility is at the core of the scientific method
• Complex process = more difficult to reproduce
• Good documentation required for reproducibility
  ◦ Metadata: data about data
  ◦ Process metadata: data about the process used to create, manipulate, and analyze data

12 Data Analysis
• Process metadata is information about the process used to get to the data outputs
• Related concept: data provenance
  ◦ Data provenance is information about the origins of data
  ◦ Good provenance = able to follow data throughout the entire life cycle (collection, organization & quality control, analyses, visualization)
  ◦ Allows for
    - Replication & reproducibility
    - Analysis for potential defects, errors in logic, statistical errors
    - Evaluation of hypotheses

13 Data Analysis
• A workflow is a formalization of process metadata
• Includes precise description of scientific procedure
• Includes conceptualized series of data ingestion, transformation, and analytical steps
• Three components of a workflow:
  1. Inputs: information or material required
  2. Outputs: information or material produced & potentially used as input in other steps
  3. Transformation rules/algorithms (e.g. analyses)
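The three components listed above map naturally onto a small data structure. The following is a minimal sketch rather than the design of any particular workflow system: the class name `Step`, its fields, and the Celsius-to-Kelvin example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Step:
    """One workflow step: named inputs, named outputs, and a transformation rule."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    rule: Callable[..., Tuple[Any, ...]]

    def run(self, store: Dict[str, Any]) -> None:
        """Read inputs from the store, apply the rule, write outputs back."""
        args = [store[key] for key in self.inputs]
        results = self.rule(*args)
        if len(results) != len(self.outputs):
            raise ValueError(f"{self.name}: rule returned {len(results)} values, "
                             f"expected {len(self.outputs)}")
        store.update(zip(self.outputs, results))


# Hypothetical step: a unit conversion as the transformation rule.
to_kelvin = Step(
    name="celsius_to_kelvin",
    inputs=("temp_c",),
    outputs=("temp_k",),
    rule=lambda temps: ([t + 273.15 for t in temps],),
)

store = {"temp_c": [0.0, 25.0]}
to_kelvin.run(store)
print(store["temp_k"])
```

Because outputs are written back into the same store, the output of one step can serve as the input of another, which is the property the second component describes.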

14 Data Analysis
• Simplest form of workflow: flow chart
[Flow chart: Data import into Excel → Quality control & data cleaning → Analysis: mean, SD → Graph production]

15 Data Analysis
• Simplest form of workflow: flow chart
• Inputs & Outputs
[Flow chart: Temperature data (T), Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production]
[Callouts: Input: raw T and S data; Output: data in Excel format; Input: data in Excel format]

16 Data Analysis
• Simplest form of workflow: flow chart
• Transformation Rules
[Flow chart: Temperature data (T), Salinity data (S) → Data import into Excel → Data in Excel format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production]
Transformation rules describe what is done to/with the data to obtain the relevant outputs for publication.
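The flow chart can also be written as a short script in which each box becomes a function. This is a sketch under stated assumptions: CSV text stands in for the Excel import, the sample values and the -999 missing-data flag are invented, and the graph-production step is left out so the example needs only the standard library.

```python
import csv
import io
import statistics

# Hypothetical raw temperature (T) and salinity (S) data.
RAW_CSV = """temperature,salinity
12.1,35.0
12.4,35.2
-999,35.1
12.9,
13.0,34.9
"""


def import_data(text):
    """Import step: raw text in, list of row dictionaries out."""
    return list(csv.DictReader(io.StringIO(text)))


def quality_control(rows, missing_flag=-999.0):
    """QC step: drop rows with blank, non-numeric, or flagged values."""
    clean = []
    for row in rows:
        try:
            t = float(row["temperature"])
            s = float(row["salinity"])
        except ValueError:
            continue
        if t == missing_flag or s == missing_flag:
            continue
        clean.append({"temperature": t, "salinity": s})
    return clean


def summarize(rows):
    """Analysis step: mean and sample standard deviation per variable."""
    summary = {}
    for var in ("temperature", "salinity"):
        values = [r[var] for r in rows]
        summary[var] = {"mean": statistics.mean(values),
                        "sd": statistics.stdev(values)}
    return summary


rows = import_data(RAW_CSV)
clean_rows = quality_control(rows)
summary = summarize(clean_rows)
print(summary)
```

Each function's return value is the next function's argument, so the inputs and outputs labeled on the flow chart are explicit in the code.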

17 Data Analysis
• Science is becoming more computationally intensive
• Most transformations are done via computer programs
• Sharing workflows benefits science
• Defining a scientific workflow system makes documenting workflows easier

18 Data Analysis
• A scientific workflow is an “analytical pipeline”
• Each step can be implemented in different software systems
• Each step and its parameters/requirements are formally recorded
• This allows reuse of both individual steps and the overall workflow

19 Data Analysis
• Single access point for multiple analyses across software packages
• Keeps track of analysis and provenance: enables reproducibility
  ◦ Each step & its parameters/requirements formally recorded
• Workflow can be stored
• Allows sharing and reuse of individual steps or overall workflow
  ◦ Automate repetitive tasks
  ◦ Use across different disciplines and groups
  ◦ Can run analyses more quickly since not starting from scratch
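What "each step & its parameters formally recorded" might look like in practice can be shown with a small provenance log. This is an illustrative sketch only and does not reproduce how Kepler, VisTrails, or any other system stores provenance. The names `ProvenanceLog`, `fingerprint`, `drop_flagged`, and `scale` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(obj) -> str:
    """Short content digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ProvenanceLog:
    """Runs steps and records name, parameters, and data digests for each."""

    def __init__(self):
        self.records = []

    def run(self, step, data, **params):
        result = step(data, **params)
        self.records.append({
            "step": step.__name__,
            "parameters": params,
            "input_digest": fingerprint(data),
            "output_digest": fingerprint(result),
            "finished_utc": datetime.now(timezone.utc).isoformat(),
        })
        return result


# Two hypothetical steps.
def drop_flagged(values, flag):
    return [v for v in values if v != flag]


def scale(values, factor):
    return [v * factor for v in values]


log = ProvenanceLog()
cleaned = log.run(drop_flagged, [12.1, -999, 12.4], flag=-999)
scaled = log.run(scale, cleaned, factor=2)

# The log is plain data, so it can be stored or shared alongside the results.
print(json.dumps(log.records, indent=2))
```

Matching digests between one step's output and the next step's input let a reader confirm that the recorded steps were actually chained as claimed.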

20 Data Analysis
• Open-source, free, cross-platform
• Drag-and-drop interface for workflow construction
• Steps (analyses, manipulations, etc.) in workflow represented by an “actor”
• Actors connect via inputs and outputs to form a workflow
• Possible applications
  ◦ Theoretical models or observational analyses
  ◦ Hierarchical modeling
  ◦ Can have nested workflows
  ◦ Can access data from web-based sources (e.g. databases)
• Downloads and more information at kepler-project.org

21 Data Analysis
[Screenshot annotations: Drag & drop components from this list; Actors in workflow]

22 Data Analysis
This model shows the solution to the classic Lotka-Volterra predator-prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated, showing both population change and a phase diagram of the dynamics.
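For readers without Kepler at hand, the same pair of coupled equations can be integrated in plain Python. The sketch below is a stand-in, not the Kepler workflow: it uses a fixed-step fourth-order Runge-Kutta integrator instead of Kepler's Continuous Time domain, and the parameter values and initial populations are arbitrary choices for illustration.

```python
import math


def lotka_volterra(state, alpha, beta, delta, gamma):
    """Right-hand side of the classic model.

    d(prey)/dt     = alpha*prey - beta*prey*predator
    d(predator)/dt = delta*prey*predator - gamma*predator
    """
    prey, predator = state
    return (alpha * prey - beta * prey * predator,
            delta * prey * predator - gamma * predator)


def rk4_step(f, state, dt, **params):
    """Advance the state by one fixed step of the classical RK4 scheme."""
    k1 = f(state, **params)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), **params)
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), **params)
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)), **params)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(initial, dt, steps, **params):
    """Return the list of (prey, predator) states, including the initial one."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rk4_step(lotka_volterra, trajectory[-1], dt, **params))
    return trajectory


def conserved_quantity(state, alpha, beta, delta, gamma):
    """Quantity that stays constant along exact solutions; a numerical sanity check."""
    prey, predator = state
    return (delta * prey - gamma * math.log(prey)
            + beta * predator - alpha * math.log(predator))


# Hypothetical parameters and starting populations.
PARAMS = dict(alpha=1.0, beta=0.1, delta=0.075, gamma=1.5)
trajectory = simulate((10.0, 5.0), dt=0.01, steps=2000, **PARAMS)
drift = abs(conserved_quantity(trajectory[-1], **PARAMS)
            - conserved_quantity(trajectory[0], **PARAMS))
print(len(trajectory), drift)
```

Plotting prey against time gives the population-change view, and plotting prey against predator gives the phase diagram mentioned above. The `drift` value indicates how closely the numerical solution follows the closed orbit.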

23 Data Analysis Resulting output

24 Data Analysis
• Open-source
• Workflow & provenance management support
• Geared toward exploratory computational tasks
  ◦ Can manage evolving SWF
  ◦ Maintains detailed history about steps & data
• www.vistrails.org
[Screenshot example]

25 Data Analysis
• Social networking site to support scientists who use SWF
• Allows searching for, sharing, and reuse of SWF
• Can comment on and discuss contributed SWF
• Gateway to journals and data repositories
• www.myexperiment.org

26 Data Analysis
• Scientists should document workflows used to create results
  ◦ Data provenance
  ◦ Analyses and parameters used
  ◦ Connections between analyses via inputs and outputs
• Documentation can be informal (for example, a flowchart) or formal (for example, Kepler software)

27 Data Analysis
• Modern science is computer-intensive
  ◦ Heterogeneous data, analyses, software
• Reproducibility is important
• Workflows = process metadata
  ◦ Necessary for reproducibility, repeatability, validation
• There are formal systems for documenting process metadata
  ◦ Enable storage, sharing, visualization, reuse

28 Data Analysis
• Gil, Y, E Deelman, M Ellisman, T Fahringer, G Fox, D Gannon, C Goble, M Livny, L Moreau, and J Myers. Examining the Challenges of Scientific Workflows. Computer 40:24–32, 2007.
• Michener, K, J Beach, M Jones, B Ludaescher, D Pennington, R Pereira, A Rajasekar, and M Schildhauer. A knowledge environment for the biodiversity and ecological sciences. Journal of Intelligent Information Systems 29:111–126, August 2007.
• Ludäscher, B, I Altintas, S Bowers, J Cummings, T Critchlow, E Deelman, DD Roure, J Freire, C Goble, M Jones, S Klasky, T McPhillips, N Podhorszki, C Silva, I Taylor, and M Vouk. Scientific Process Automation and Workflow Management. Computational Science Series Ch 13. Chapman & Hall, Boca Raton, 2009.
• McPhillips, T, S Bowers, D Zinn, and B Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems 25:541–551, 2009.
• Ludäscher, B, I Altintas, C Berkley, D Higgins, E Jaeger-Frank, M Jones, E Lee, J Tao, and Y Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice & Experience 18, 2006.
• Michener, W, and J Brunt, editors. Ecological Data: Design, Management and Processing. Blackwell Science, 180p, 2000.

29 START QUIZ

30 Data Analysis
• Analyses
• The output which is the information produced
• The input that contains the information
• All of the above

33 Data Analysis
• Data mining and discovery
• Grid computing
• Pattern searching and decision trees
• Spatial analyses

36 Data Analysis
• Scatter plots
• Box-and-whisker plots
• Plots that show you potential data errors
• All graphical formats and analyses

39 Data Analysis
• Reproducibility
• Repeatability
• Validation
• All of the above

42 Data Analysis
• Scientific workflow
• Systematic workflow
• Scientific workforce
• Systematic work information

45 Data Analysis
• Each step can be implemented in different software systems with requirements formally recorded
• Single access point for multiple analyses. Workflow can be stored
• Allows sharing of individual steps.
• All of the above.

48 Data Analysis
• Good organization
• Good data maintenance
• Good provenance
• Good metadata

50 Data Analysis You have completed this learning module.
