A scientific workflow can be defined as the set of computational steps required to execute an in silico experiment. Scientific workflows are similar to laboratory protocols; the main difference is that scientists run their experiments on a computer instead of in a laboratory. These workflows are becoming increasingly important for three reasons:
- They help the scientist running the experiment to reproduce the results and convince colleagues of the validity of the method followed.
- They expose the inputs, intermediate results, outputs and code of the experiment. In this regard they can be considered the log of the research, which is very useful for reviewers of the work.
- They allow other scientists to reproduce (and reuse) the method of the experiment, since they only have to rerun the original experiment or run it with other input data.
On the left of this post there is an example of a workflow for feature selection: each step represents a computational method that performs some operation on the input data received from the previous step. The initial input is a Dataset and a set of words, and the final output is the set of selected features.
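To make the idea of chained steps concrete, here is a minimal Python sketch of such a pipeline. The step names (`tokenize`, `count_words`, `select_features`) and the frequency-based selection rule are hypothetical, chosen only for illustration; they are not the actual components of the workflow in the figure.

```python
from collections import Counter

# Hypothetical steps: each one consumes the output of the previous step.

def tokenize(dataset):
    """Step 1: split each document into lowercase tokens."""
    return [doc.lower().split() for doc in dataset]

def count_words(tokenized, words):
    """Step 2: count how often each word of interest occurs."""
    counts = Counter()
    for tokens in tokenized:
        counts.update(t for t in tokens if t in words)
    return counts

def select_features(counts, k):
    """Step 3: keep the k most frequent words as the selected features."""
    return [word for word, _ in counts.most_common(k)]

def run_workflow(dataset, words, k=2):
    """Run the steps in order, passing intermediate results along."""
    tokenized = tokenize(dataset)
    counts = count_words(tokenized, words)
    return select_features(counts, k)

dataset = ["gene protein gene", "protein cell gene"]
words = {"gene", "protein", "cell"}
selected = run_workflow(dataset, words, k=2)
```

Because every intermediate result (`tokenized`, `counts`) is an explicit value, the same steps can be rerun on other input data or inspected by a reviewer.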
Repositories such as myExperiment, CrowdLabs and Galaxy have been created for scientists to share their scientific workflows, so reasonable questions for a scientist would be: How do these workflows relate to each other? Which are the most popular workflow fragments in a repository? Has there really been reuse among the different workflows?
One way to find these popular workflow fragments is to apply graph matching techniques. If we represent each workflow as a directed acyclic graph, we can apply existing methods to see how a set of workflows overlaps with other similar workflows in the repository. At the UPM we have recently been working on this problem, and the technique we have used is described in our most recent paper accepted at K-CAP. The idea is to obtain the best context-free grammar encoding all the workflows of the repository. The rules of the grammar will be the most common fragments among the workflows, which is exactly what we are looking for. If you want more information and examples, I encourage you to have a look at the paper.
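As a rough illustration of the graph representation, the Python sketch below stores each workflow as a set of directed edges and counts how many workflows share each edge. This is a deliberately simplified stand-in, not the grammar-based method from the paper: it only finds two-step fragments, and the workflow names, step names and the `min_support` threshold are all made up for the example.

```python
from collections import Counter

# Each workflow is a DAG stored as a set of (source_step, target_step) edges.
# All names below are hypothetical.
workflows = {
    "wf1": {("Dataset", "Tokenize"), ("Tokenize", "Count"), ("Count", "Select")},
    "wf2": {("Dataset", "Tokenize"), ("Tokenize", "Count"), ("Count", "Plot")},
    "wf3": {("Dataset", "Normalize"), ("Normalize", "Count"), ("Count", "Select")},
}

def common_fragments(workflows, min_support=2):
    """Return the edges that appear in at least min_support workflows."""
    support = Counter()
    for edges in workflows.values():
        # Edges are stored in a set, so each workflow contributes
        # at most one count per edge.
        support.update(edges)
    return {edge: n for edge, n in support.items() if n >= min_support}

fragments = common_fragments(workflows)
```

Here `fragments` holds the three edges shared by two workflows each. Larger fragments (chains or subgraphs of several steps) require actual subgraph matching or the grammar-based encoding mentioned above.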