Since I became a PhD student, many people have asked me about the subject of my thesis and the main ideas behind my research. As a student you always think you have a very clear idea of what you are doing, at least until you actually have to explain it to someone outside your domain. Much of it comes down to using the right terminology. If you say something like “Oh yeah, I am trying to detect abstractions in scientific workflows semi-automatically in order to understand how they can better be reused and related to each other”, people will look at you as if you didn’t belong to this planet. Instead, something like “detecting commonalities in scientific experiments in order to study how we can understand them better” might be more appropriate.
But last week the challenge was slightly different. I was invited to give an overview talk about the work I have been doing as a PhD student. That means explaining not only what I am doing, but why I am doing it and how it all fits together, without going into the details of every step. It may sound like an easy task, but it kept me thinking more than I expected.
A scientific workflow can be defined as the set of computational steps required to execute an in silico experiment. Scientific workflows are similar to laboratory protocols; the main difference is that in scientific workflows scientists run their simulations on a computer instead of in a laboratory. These kinds of workflows are becoming increasingly important for three reasons:
They help the scientist running the experiment to repeat the results and convince his/her colleagues of the validity of the method followed.
They expose the inputs, intermediate results, outputs and code of the experiment. In this regard they can be considered the log of the research, which is very useful for the reviewers of the work.
They allow other scientists to reproduce (and reuse) the method of the experiment, since they just have to rerun the original experiment or feed it other input data.
On the left of this post there is an example of a workflow for feature selection: each of the steps represents a computational method that performs some operation on the input data received from the previous step. The initial inputs are a dataset and a set of words, and the final output is the selected features.
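To make the idea concrete, a workflow like this can be modeled as a directed acyclic graph whose nodes are computational steps and whose edges are data dependencies; an engine then executes the steps in topological order. The sketch below uses Python's standard `graphlib` module, and the step names are invented for illustration (they are not the actual steps of the feature-selection workflow):

```python
from graphlib import TopologicalSorter

# A toy feature-selection workflow as a DAG: each step maps to the
# steps whose outputs it consumes. Step names are hypothetical.
workflow = {
    "load_dataset":   [],
    "load_words":     [],
    "compute_counts": ["load_dataset", "load_words"],
    "rank_features":  ["compute_counts"],
    "select_top_k":   ["rank_features"],
}

# A workflow engine would run the steps in a topological order,
# feeding each step the outputs of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Any valid execution order puts each step after all of its dependencies, so `select_top_k` always comes last.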
Repositories of workflows like myExperiment, CrowdLabs and Galaxy have been created for scientists to share their scientific workflows, so a reasonable question for a scientist would be: how do these workflows relate to each other? What are the most popular workflow fragments in the repository? Has there really been reuse among the different workflows?
A possible way to find these popular workflow fragments is to apply graph matching techniques. If we represent each workflow as a directed acyclic graph, we can apply existing methods to see how a set of workflows overlaps with other similar workflows in the repository. At UPM we have recently been working on this problem, and the technique we used is described in our latest paper accepted at K-CAP. The idea is to obtain the best context-free grammar encoding all the workflows of the repository. The rules of the grammar will be the most common fragments among the workflows, which is exactly what we are looking for. If you want more information and examples, I encourage you to have a look at the paper.
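The grammar-based technique in the paper is more elaborate than this, but the underlying intuition can be illustrated very simply: count how often small fragments recur across the repository. The sketch below uses the simplest possible fragment, a single edge between two steps, and all workflow and step names are made up:

```python
from collections import Counter

# Three toy workflows, each a list of directed edges (step -> next step).
# The step names are invented, just to illustrate the idea.
repository = [
    [("load", "normalize"), ("normalize", "cluster")],
    [("load", "normalize"), ("normalize", "classify")],
    [("fetch", "normalize"), ("normalize", "cluster")],
]

# Count in how many workflows each edge -- the simplest possible
# fragment -- appears (deduplicating edges within a workflow).
fragment_counts = Counter()
for workflow in repository:
    for edge in dict.fromkeys(workflow):
        fragment_counts[edge] += 1

# Frequent fragments are candidates for reusable sub-workflows.
print(fragment_counts.most_common(2))
```

Real fragment mining has to consider multi-step subgraphs, not just edges, which is where grammar induction over the whole repository pays off.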
The PROV family of documents was finally released yesterday (March 12) as a W3C Proposed Recommendation (link to the official post) by the Provenance Working Group. This family consists of 4 recommendations and 8 notes that describe how to model, use and interchange provenance on the Web.
So, where to start? I would recommend having a look at the PROV-Overview Note, which gives a high-level overview of all the documents in the family and how they are connected. If you just want to use the model, then I would recommend taking a look at the Primer Note, which explains the functionality of the PROV model with simple examples. The rest of the documents serve different purposes:
PROV-O, the PROV ontology (Proposed Recommendation), is an OWL2 ontology allowing the mapping of PROV to RDF.
PROV-DM (Proposed Recommendation), the PROV data model for provenance.
PROV-N (Proposed Recommendation), a notation for provenance aimed at human consumption.
PROV-CONSTRAINTS (Proposed Recommendation), a set of constraints applying to the PROV data model.
PROV-XML (Note), an XML schema for the PROV data model.
PROV-AQ (Note), the mechanisms for accessing and querying provenance.
PROV-DICTIONARY (Note) introduces a specific type of collection, consisting of key-entity pairs.
PROV-DC (Note) provides a mapping between PROV and Dublin Core Terms.
PROV-SEM (Note), a declarative specification in terms of first-order logic of the PROV data model.
PROV-LINKS (Note) introduces a mechanism to link across bundles.
These descriptions were prov:wasQuotedFrom the Overview.
I’ll try to write a post in the coming days on how to add simple PROV statements to your web page.
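In the meantime, here is a minimal sketch of what such statements look like. It builds a couple of PROV-O triples in Turtle by hand, using two real PROV-O properties (`prov:wasQuotedFrom` and `prov:wasAttributedTo`); all the `ex:` URIs are invented for illustration:

```python
# Minimal PROV-O statements in Turtle, built by hand. The URIs in the
# ex: namespace are made-up examples, not real resources.
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex:   <http://example.org/> .\n"
)

def quoted_from(quote_uri, source_uri):
    """Turtle triple stating that one resource quotes another."""
    return f"{quote_uri} prov:wasQuotedFrom {source_uri} .\n"

def attributed_to(entity_uri, agent_uri):
    """Turtle triple attributing an entity to an agent."""
    return f"{entity_uri} prov:wasAttributedTo {agent_uri} .\n"

doc = PREFIXES
doc += quoted_from("ex:blog-post", "ex:prov-overview")
doc += attributed_to("ex:blog-post", "ex:daniel")
print(doc)
```

A snippet like this could be served alongside a page (or embedded via RDFa) so that consumers can discover where its content came from.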
This week there was an announcement about the deadline extension for BIGPROV13. Apparently, some authors are preparing new submissions for next week. In previous posts I highlighted how the community has been demanding a provenance benchmark to test different analyses on provenance data, so today I’m going to describe how I have been contributing to the publication of publicly accessible provenance traces from scientific experiments.
It all started last year, when I did an internship at the Information Sciences Institute (ISI) to reproduce the results of the TB-Drugome experiment, led by Phil Bourne’s team in San Diego. They wanted to make the method followed in their experiment accessible so it could be reused by other scientists, which requires publishing sample traces of the experiment, the templates, and every intermediate output and source. As a result, we reproduced the experiment as a workflow using the Wings workflow system, we extended the Open Provenance Model (OPM) to represent the traces as the OPMW profile, and we described here the process necessary to publish the templates and traces of any workflow as Linked Data. Lately we have aligned the previous work with the emerging PROV-O standard, providing serializations in both OPM and PROV for each workflow that is published. You can find the public endpoint here, and an example application that dynamically loads the data of a workflow into a wiki can be seen here.
I have also been working with the Taverna people in the wf4Ever project to create a curated repository of runs from both Taverna and Wings, compatible with PROV (since both systems are similar and extend the standard to describe their workflows). The repository, available here for anyone who wants to use it, has been submitted to the BIGPROV13 call and will hopefully be accepted.
So… now that we have a standard for representing provenance, the big questions are: What do I do with all the provenance I generate? How do I interoperate with other approaches? At what granularity do I record the activities of my website? How do I present provenance information to users? How do I validate provenance? How do I complete it? Many challenges remain to be solved before we can hit Tim Berners-Lee’s “Oh, yeah?” button on every web resource.