An Introduction to Designing and Executing Workflows with Taverna
Part 2 – Importing and exporting data
Norman Morrison, University of Manchester
Credits: Aleksandra Pawlik and Katy Wolstencroft
We can add input data to the workflow not only manually but also from a file. Go to the myExperiment group and download the file 03B_species_1.txt. Click Run workflow again, but instead of selecting Set value, select Set file location and navigate to where you saved 03B_species_1.txt.
Instead of downloading the file, we can point the workflow to the file’s URL (if we know it). Let’s run the workflow again, but this time select “Set URL” and paste in: ad/03B_species_1.txt
So far we have used simple text files, but it is also possible to use spreadsheets as sources of input data. To do that, we need to add a Spreadsheet tool to our workflow. From the myExperiment group download the file 03C_species_list_1.xls. Open it on your machine and see what it contains (the list of species names is in cells B3 to B6). From the Service Templates select the Spreadsheet Import tool, right-click on it and add it to the workflow.
In the pop-up window set the correct range for columns and rows: column B, rows 3 to 6 (untick the box “all rows”).
We need to delete the input port of the workflow (right-click on it and select Delete). The Spreadsheet tool expects the URL (or path) of the file as its input. The best way to feed in that URL or path is to add a service called “Text constant”.
Where it says “Add your own value here” enter: nt.org/files/1108/versions/1/download/03C_species_list_1.xls If you prefer, you can insert the full path to your local file instead. Then click Apply and Close.
Connect the Text constant with the Spreadsheet Import tool. Then connect the Spreadsheet Import tool with the input of the GBIF service.
When we run the workflow, we can see four result values, because four species names were read from the spreadsheet. Taverna implicitly iterated over these four input values and processed each of them.
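The implicit iteration described above can be pictured with a short Python sketch. This is only an illustration of the idea, not how Taverna is implemented: `lookup_species` is a hypothetical stand-in for a service that accepts one value at a time, and the species names are placeholders rather than the contents of the spreadsheet.

```python
from typing import Callable, List


def lookup_species(name: str) -> str:
    """Hypothetical stand-in for a service that takes a single value."""
    return f"result for {name}"


def run_with_implicit_iteration(
    service: Callable[[str], str], values: List[str]
) -> List[str]:
    """Call a single-value service once per list element, preserving order."""
    return [service(value) for value in values]


# Placeholder inputs; the real names come from the spreadsheet.
species = ["species A", "species B", "species C", "species D"]
results = run_with_implicit_iteration(lookup_species, species)
print(len(results))  # one result per input value
```

The point of the sketch is that the service itself never sees the list; the surrounding machinery calls it once per element and collects the results.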
Taverna allows you to save results in a variety of formats. It also allows you to save intermediate workflow results, which is very useful when you run a large workflow. You can save all result values at once.
You can also save each single value separately. To save intermediate values, go to the results tab and select the part of the workflow whose values you want to save. The values then appear in the results window, where you can save them.
A shim is a service that doesn’t perform an experimental function, but acts as a connector, or glue, when two experimental services have incompatible outputs and inputs. A shim can be any type of service: WSDL, Soaplab, etc. Many are simple Beanshell scripts. Shims can also be used to preprocess data that are input into the workflow, and we will use one of these shims for this exercise.
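To make the idea of a shim concrete, here is a minimal Python sketch under assumed conditions: one service returns all names as a single block of text, one per line, while the next service expects a list of names. The function name `split_lines_shim` and the scenario are hypothetical; in Taverna such a shim would more typically be a Beanshell script or a local service.

```python
from typing import List


def split_lines_shim(text: str) -> List[str]:
    """Turn a newline-separated block of text into a list of clean values.

    The shim does no analysis of its own; it only reshapes the output of
    one service into the form the next service expects.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example: output of an upstream service, including a stray blank line.
upstream_output = "first name\n\n  second name \nthird name\n"
print(split_lines_shim(upstream_output))
```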
Create a directory called “data”. Copy the files we used in the previous exercise into this directory: 03B_species_1.txt and 03C_species_list_1.xls. From the myExperiment group download the following files to the same directory: 03D_species_2.txt and 03E_species_list_2.xls.
Let’s assume you regularly have to deal with data in different formats, one of which is a spreadsheet (csv or xls). You know that the spreadsheet files always have the species names in column B, starting from row 3 and going up to row 100 (some rows may be empty). You can automate your workflow to pull the species names from all of these spreadsheets in a specified directory at once using a shim service.
Delete the Text constant service in your workflow. From the Available Services select Local Services, then io, then List Files by Extension.
Connect the shim service with the Spreadsheet tool. Right-click on “file extension” and enter xls. Right-click on “directory”, click constant value and enter the path to the directory you just created called “data”.
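What this shim contributes can be approximated in a few lines of Python. This is a sketch of the behaviour described here (given a directory and an extension, return the matching file paths), not the code of the Taverna local service; the function name and the choice to sort the results and ignore letter case are assumptions.

```python
from pathlib import Path
from typing import List


def list_files_by_extension(directory: str, extension: str) -> List[str]:
    """Return the paths of files in `directory` that end in `extension`.

    The extension may be given with or without a leading dot ("xls" or
    ".xls"). Matching ignores case and results are sorted; both choices
    are assumptions made for this sketch.
    """
    wanted = "." + extension.lower().lstrip(".")
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == wanted
    )


# Usage, assuming a local directory called "data" exists:
# print(list_files_by_extension("data", "xls"))
```

With the four files from this exercise in the directory, a call with `"xls"` would return only the two spreadsheets and skip the two text files.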
We need to reconfigure the Spreadsheet service. We’ll set the rows from 3 to 100 and make the service ignore blank rows.
Run the workflow. When we look at the results we can see that Taverna read the species names from both spreadsheets, ignored the text files, and found the values for them using the GBIF service.
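The effect of this configuration (a fixed column, a row range, blank rows skipped) can be sketched in Python. To stay within the standard library the sketch reads a csv file; reading xls files would need a third-party library, which is not shown. The function name `read_column` and its parameters are hypothetical and do not correspond to Taverna settings.

```python
import csv
from typing import List


def read_column(
    path: str, column: int = 1, first_row: int = 3, last_row: int = 100
) -> List[str]:
    """Read one column of a csv file between two row numbers (inclusive).

    `column` is zero-based, so 1 means spreadsheet column B. Row numbers
    are one-based, as in a spreadsheet. Blank cells are skipped.
    """
    values: List[str] = []
    with open(path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if row_number < first_row:
                continue
            if row_number > last_row:
                break
            # Short or empty rows simply have no cell in this column.
            cell = row[column].strip() if len(row) > column else ""
            if cell:
                values.append(cell)
    return values


# Usage, assuming a csv export of one of the spreadsheets exists:
# print(read_column("data/species_list.csv"))
```

Choosing a generous upper bound such as row 100 and dropping blanks means the same settings work for spreadsheets of different lengths.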