Month: April 2013

Finding commonalities in different scientific experiments

A scientific workflow can be defined as the set of computational steps required to execute an in silico experiment. Scientific workflows are similar to laboratory protocols; the main difference is that the scientist runs the experiment as a simulation on a computer instead of in a laboratory. These kinds of workflows are becoming increasingly important for three reasons:

  1. They help the scientists who ran the experiment to repeat their results and convince their colleagues of the validity of the method followed.
  2. They expose the inputs, intermediate results, outputs and code of the experiment. In this regard they can be considered the log of the research, which is very useful for the reviewers of the work.
  3. They allow other scientists to reproduce (and reuse) the method of the experiment, since they just have to rerun the original experiment or run it with other input data.
Feature selection workflow

On the left of this post there is an example of a workflow for feature selection. Each step represents a computational method that performs some operation on the input data received from the previous step. The initial input is a dataset and a set of words, and the final output is the set of selected features.
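
To make the idea of chained steps concrete, here is a minimal Python sketch of such a workflow. The step names (tokenize, filter, count, select), the toy dataset and the word set are all invented for illustration and are not the steps of the workflow in the figure; the only point is that each step consumes the output of the previous one.

```python
from collections import Counter
from functools import reduce


def tokenize(documents):
    """Step 1: split every document into lowercase tokens."""
    return [doc.lower().split() for doc in documents]


def make_filter(vocabulary):
    """Step 2 (factory): keep only the tokens found in the given set of words."""
    def keep_known_words(token_lists):
        return [[t for t in tokens if t in vocabulary] for tokens in token_lists]
    return keep_known_words


def count_document_frequency(token_lists):
    """Step 3: count in how many documents each word appears."""
    return Counter(t for tokens in token_lists for t in set(tokens))


def make_selector(k):
    """Step 4 (factory): select the k most frequent words, ties broken alphabetically."""
    def select_top(frequencies):
        ranked = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
        return [word for word, _ in ranked[:k]]
    return select_top


def run_workflow(steps, data):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, data)


# Hypothetical inputs: a tiny dataset and a set of words.
dataset = [
    "Workflows help reuse",
    "Workflows expose provenance",
    "Provenance helps reviewers",
]
words = {"workflows", "provenance", "reuse"}

steps = [tokenize, make_filter(words), count_document_frequency, make_selector(2)]
selected = run_workflow(steps, dataset)
print(selected)
```

Because the workflow is just an ordered list of steps, rerunning it with other input data or swapping a single step does not require touching the rest, which is what makes this kind of description easy to reuse.
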
Repositories of workflows like myExperiment, CrowdLabs and Galaxy have been created for scientists to share their scientific workflows, so some reasonable questions for a scientist would be: How do these workflows relate to each other? Which are the most popular workflow fragments in a repository? Has there really been reuse among the different workflows?

A possible way to find these popular workflow fragments is to apply graph matching techniques. If we represent each workflow as a directed acyclic graph, we can apply existing methods to see how a set of workflows overlaps with other similar workflows in the repository. At UPM we have recently been working on this problem, and the technique we have used is described in our latest paper accepted at K-CAP. The idea is to obtain the best context-free grammar encoding all the workflows of the repository. The rules of the grammar will be the most common fragments among the workflows, which is exactly what we are looking for. If you want more information and examples, I encourage you to have a look at the paper.
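
As a rough illustration of the graph view, the Python sketch below stores each workflow as a set of directed edges and counts in how many workflows each small fragment appears. This is a much simpler stand-in (plain frequency counting of one-edge and two-edge fragments), not the grammar-based technique of the paper, and the repository contents and step names are made up.

```python
from collections import Counter

# Hypothetical toy repository: each workflow is a DAG given as a set of
# (producer_step, consumer_step) edges. All names are invented.
workflows = {
    "wf1": {("Dataset", "Stemmer"), ("Stemmer", "TermWeighting"),
            ("TermWeighting", "FeatureSelection")},
    "wf2": {("Dataset", "Stemmer"), ("Stemmer", "TermWeighting"),
            ("TermWeighting", "Classifier")},
    "wf3": {("Dataset", "Tokenizer"), ("Tokenizer", "TermWeighting"),
            ("TermWeighting", "FeatureSelection")},
}


def two_edge_paths(edges):
    """Return every fragment a -> b -> c contained in one workflow."""
    return {(a, b, c) for (a, b) in edges for (b2, c) in edges if b == b2}


def fragment_counts(repository):
    """Count in how many workflows each one-edge or two-edge fragment occurs."""
    counts = Counter()
    for edges in repository.values():
        counts.update(edges)                  # fragments with a single edge
        counts.update(two_edge_paths(edges))  # fragments with two chained edges
    return counts


counts = fragment_counts(workflows)

# Fragments that occur in more than one workflow are candidates for reuse.
shared = {fragment: n for fragment, n in counts.items() if n > 1}
for fragment, n in sorted(shared.items(), key=lambda item: (-len(item[0]), item[0])):
    print(" -> ".join(fragment), "appears in", n, "workflows")
```

Counting fixed-size fragments like this does not scale to larger patterns and ignores how fragments nest inside each other, which is part of what a grammar over the whole repository is meant to capture.
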

Research Objects for dummies

Last week a new W3C community and business group was approved: the Research Object for Scholarly Communication Community Group. But what is a Research Object? Why do we need them?

In a nutshell, a Research Object is a wrapper of interconnected resources and metadata bundled together to offer context and materials for some purpose. It could be a set of papers that represent the state of the art for a project proposal, the experiments performed in a scientific study, or even a post about Research Objects together with its associated materials.

The Research Object primer uses examples to explain Workflow-Centric Research Objects, whose focus is to describe an experiment through its workflow specification and its relation to inputs, intermediate results, outputs and final publications. If you want more information about the RO Model, I recommend browsing this documentation.

Do you want to see an example? This RO was created from a myExperiment workflow execution, and it includes a bundle with the provenance information, a sketch and the specification of the workflow. It even has the hypothesis the experiment aims to check.

But why is this useful? Well, having all the associated resources of an experiment scattered across the web makes it difficult for the consumer to understand the complexity of the work. For example, imagine that I have this paper, describing an experiment (link). The paper shows some tables and plots, which are made from the curated results on this web page (link), not linked from the paper. But then I also have an execution of a workflow (link), which shows a diagram of the workflow and provides pointers to all intermediate and final (non-curated) data. The scientists who ran the initial experiment for the paper are different from those who ran the workflow, and the attribution should be noted. Also, the links and dependencies between the data can't be inferred just by accessing the resources. With everything included in a Research Object, one can understand the role of each resource in the experiment and communicate it to other researchers.
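
The scenario above can be sketched as a small data structure. The Python code below is only an illustration of the idea of bundling resources with their roles, attribution and links; the class names, relation names, people and example.org URIs are all hypothetical, and it does not use the vocabulary of the RO Model.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uri: str        # where the resource lives (placeholder URIs below)
    role: str       # what the resource is in the experiment
    creators: list  # who should be credited for it


@dataclass
class ResearchObject:
    title: str
    resources: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (source_uri, relation, target_uri)

    def add(self, resource):
        """Aggregate a resource in the bundle."""
        self.resources[resource.uri] = resource

    def relate(self, source_uri, relation, target_uri):
        """Record an explicit dependency between two aggregated resources."""
        if source_uri not in self.resources or target_uri not in self.resources:
            raise KeyError("both ends of a link must be aggregated resources")
        self.links.append((source_uri, relation, target_uri))

    def describe(self, uri):
        """Summarise the role, attribution and outgoing links of a resource."""
        resource = self.resources[uri]
        outgoing = [(rel, target) for source, rel, target in self.links if source == uri]
        return {"role": resource.role, "creators": resource.creators, "links": outgoing}


# Hypothetical bundle mirroring the paper / curated results / workflow run example.
ro = ResearchObject("Example experiment")
ro.add(Resource("http://example.org/paper", "publication", ["Alice"]))
ro.add(Resource("http://example.org/curated-results", "curated results", ["Alice"]))
ro.add(Resource("http://example.org/workflow-run", "workflow execution", ["Bob"]))

ro.relate("http://example.org/paper", "usesResultsFrom", "http://example.org/curated-results")
ro.relate("http://example.org/curated-results", "curatedFrom", "http://example.org/workflow-run")

print(ro.describe("http://example.org/paper"))
print(ro.describe("http://example.org/workflow-run"))
```

With the bundle in hand, the link from the paper to the curated results and the different authorship of the workflow run are stated explicitly instead of being left for the reader to guess.
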