Presentation is loading. Please wait.

Presentation is loading. Please wait.

Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr.

Similar presentations

Presentation on theme: "Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr."— Presentation transcript:

1 Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr

2 Analysis and Workflows Review of typical data analyses Reproducibility & provenance Workflows in general Informal workflows Formal workflows CC image by jwalsh on Flickr

3 Analysis and Workflows After completing this lesson, the participant will be able to: o Understand a subset of typical analyses used o Define a workflow o Understand the concepts informal and formal workflows o Discuss the benefits of workflows CC image by cybrarian77 on Flickr

4 Analysis and Workflows Plan Collect AssureDescribePreserveDiscoverIntegrateAnalyze

5 Analysis and Workflows Conducted via personal computer, grid, cloud computing Statistics, model runs, parameter estimations, graphs/plots etc. CC image by hegemonx on Flickr CC image by tai viinikka on Flickr

6 Analysis and Workflows Processing: subsetting, merging, manipulating o Reduction: important for high-resolution datasets o Transformation: unit conversions, linear and nonlinear algorithms 0711070500276000 0711070600276000 0711070700277003 0711070800282017 0711070900285000 0711071000293000 0711071100301000 0711071200304000 Datetimeair tempprecip Cmm 11-Jul-075:0027.6000 11-Jul-076:0027.6000 11-Jul-077:0027.7003 11-Jul-078:0028.2017 11-Jul-079:0028.5000 11-Jul-0710:0029.3000 11-Jul-0711:0030.1000 11-Jul-0712:0030.4000 Recreated from Michener & Brunt (2000)

7 Analysis and Workflows Graphical analyses o Visual exploration of data: search for patterns o Quality assurance: outlier detection Box and whisker plot of temperature by month Scatter plot of August Temperatures Strasser, unpub. data

8 Analysis and Workflows Statistical analyses Conventional statistics o Experimental data o Examples: ANOVA, MANOVA, linear and nonlinear regression o Rely on assumptions: random sampling, random & normally distributed error, independent error terms, homogeneous variance Descriptive statistics o Observational or descriptive data o Examples: diversity indices, cluster analysis, quadrant variance, distance methods, principal component analysis, correspondence analysis From Oksanen (2011) Multivariate Analysis of Ecological Communities in R: vegan tutorial Example of Principle Component Analysis

9 Analysis and Workflows Statistical analyses (continued) o Temporal analyses: time series o Spatial analyses: for spatial autocorrelation o Nonparametric approaches useful when conventional assumptions violated or underlying distribution unknown o Other misc. analyses: risk assessment, generalized linear models, mixed models, etc. Analyses of very large datasets o Data mining & discovery o Online data processing

10 Analysis and Workflows Re-analysis of outputs Final visualizations: charts, graphs, simulations etc. Science is iterative: The process that results in the final product can be complex

11 Analysis and Workflows Reproducibility at core of scientific method Complex process = more difficult to reproduce Good documentation required for reproducibility o Metadata: data about data o Process metadata: data about process used to create, manipulate, and analyze data CC image by Richard Carter on Flickr

12 Analysis and Workflows Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs Related concept: data provenance o Origins of data o Good provenance = able to follow data throughout entire life cycle o Allows for Replication & reproducibility Analysis for potential defects, errors in logic, statistical errors Evaluation of hypotheses

13 Analysis and Workflows Formalization of process metadata Precise description of scientific procedure Conceptualized series of data ingestion, transformation, and analytical steps Three components o Inputs: information or material required o Outputs: information or material produced & potentially used as input in other steps o Transformation rules/algorithms (e.g. analyses) Two types: o Informal o Formal/Executable

14 Analysis and Workflows Workflow diagrams: Some basic building blocks Analytical process Data (input or output) Decision Predefined process (subroutine) Predefined process (subroutine) o Inputs or outputs include data, metadata, or visualizations o Analytical processes include operations that change or manipulate data in some way o Decisions specify conditions that determine the next step in the process o Predefined processes or subroutines specify a fixed multi- step process

15 Analysis and Workflows Output data, visualizations, associated metadata Data ingestion Raw data and associated metadata Data cleaning Analysis Step 1 Analysis Step 2 Output generatio n Workflow diagrams: Simple linear flow chart Conceptualizing analysis as a sequence of steps o arrows indicate flow

16 Analysis and Workflows Flow charts: simplest form of workflow Graph production Analysis: mean, SD Quality control & data cleaning Data import into R

17 Analysis and Workflows Flow charts: simplest form of workflow Transformation Rules Graph production Analysis: mean, SD Quality control & data cleaning Data import into R

18 Analysis and Workflows Flow charts: simplest form of workflow Inputs & Outputs Graph production Analysis: mean, SD Quality control & data cleaning Data import into R Salinity data Temperature data “Clean” T & S data Data in R format Summary statistics

19 Analysis and Workflows Analysis 2 Analysis 1 CONDITION false true CONDITIONAL LOOP IF Workflow diagrams: Adding decision points

20 Analysis and Workflows species name lat/long Data ingestion Data and metadata QA/QC (e.g., check for appropriate data types, outliers) Determine range (calculate convex hull of points) Compile list of unique taxa Generate output data and visualizations for taxa frequencies in stated range Data, visualizations, and metadata FOR counts Calculate taxa frequencies Workflow diagrams: a simple example

21 Analysis and Workflows false true A B Data ingestion Data and metadata Data cleaning Analysis 1B Analysis 1A Output generation Data, visualizations, and metadata Data and metadata FOR Predefined process (subroutine) Predefined process (subroutine) Data ingestion Data cleaning Data integration Data and metadata Data ingestion Data cleaning IF Analysis 2 Data integration Workflow diagrams: a complex example

22 Analysis and Workflows # % $ & Commented scripts: best practices Well-documented code is easier to review, share, enables repeated analysis Add high-level information at the top o Project description, author, date o Script dependencies, inputs, and outputs o Describes parameters and their origins Notice and organize sections o What happens in the section and why o Describe dependencies, inputs, and outputs Construct “end-to-end” script if possible o A complete narrative o Runs without intervention from start to finish

23 Analysis and Workflows Analytical pipeline Each step can be implemented in different software systems Each step & its parameters/requirements formally recorded Allows reuse of both individual steps and overall workflow CC image by AJ Cann on Flickr

24 Analysis and Workflows Benefits: Single access point for multiple analyses across software packages Keeps track of analysis and provenance: enables reproducibility o Each step & its parameters/requirements formally recorded Workflow can be stored Allows sharing and reuse of individual steps or overall workflow o Automate repetitive tasks o Use across different disciplines and groups o Can run analyses more quickly since not starting from scratch

25 Analysis and Workflows Example: Kepler Software Open-source, free, cross-platform Drag-and-drop interface for workflow construction Steps (analyses, manipulations etc) in workflow represented by “actor” Actors connect from a workflow Possible applications o Theoretical models or observational analyses o Hierarchical modeling o Can have nested workflows o Can access data from web-based sources (e.g. databases) Downloads and more information at

26 Analysis and Workflows Drag & drop components from this list Actors in workflow Example: Kepler Software

27 Analysis and Workflows This model shows the solution to the classic Lotka- Volterra predator prey dynamics model. It uses the Continuous Time domain to solve two coupled differential equations, one that models the predator population and one that models the prey population. The results are plotted as they are calculated showing both population change and a phase diagram of the dynamics. Example: Kepler Software

28 Analysis and Workflows Output Example: Kepler Software

29 Analysis and Workflows Example: VisTrails Open-source Workflow & provenance management support Geared toward exploratory computational tasks o Can manage evolving SWF o Maintains detailed history about steps & data Screenshot example

30 Analysis and Workflows Science is becoming more computationally intensive Sharing workflows benefits science o Scientific workflow systems make documenting workflows easier Minimally: document your analysis via informal workflows Emerging workflow applications (formal/executable workflows) will o Link software for executable end-to-end analysis o Provide detailed info about data & analysis o Facilitate re-use & refinement of complex, multi-step analyses o Enable efficient swapping of alternative models & algorithms o Help automate tedious tasks

31 Analysis and Workflows Scientists should document workflows used to create results o Data provenance o Analyses and parameters used o Connections between analyses via inputs and outputs Documentation can be informal (e.g. flowcharts, commented scripts) or formal (e.g. Kepler, VisTrails) CC image geek calendar on Flickr

32 Analysis and Workflows Modern science is computer-intensive o Heterogeneous data, analyses, software Reproducibility is important Workflows = process metadata Use of informal or formal workflows for documenting process metadata ensures reproducibility, repeatability, validation

33 Analysis and Workflows 1. W. Michener and J. Brunt, Eds. Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000).

34 Analysis and Workflows The full slide deck may be downloaded from: Suggested citation: DataONE Education Module: Analysis and Workflows. DataONE. Retrieved Nov12, 2012. From Workflows.pptx Copyright license information: No rights reserved; you may enhance and reuse for your own purposes. We do ask that you provide appropriate citation and attribution to DataONE.

Download ppt "Analysis and Workflows Lesson 10: Analysis and Workflows CC image by wlef70 on Flickr."

Similar presentations

Ads by Google