Linking Research

Posts Tagged ‘e-Science’

Elevator pitch

Posted by dgarijov on February 16, 2014

While being a PhD student, many people have asked me about the subject of my thesis and the main ideas behind my research. As a student you always think you have very clear what you are doing, at least until you have to actually explain it to someone who is not related to your domain. In fact, it is about using the right terminology. If you say something like “Oh yeah, I am trying to detect abstractions on scientific workflows semi-automatically in order to understand how they can better be reused and related to each other”, people will look at you as if you didn’t belong to this planet. Instead, something like “detecting commonalities in scientific experiments in order to study how we can understand them bettermight be more appropriate.

But last week the challenge was slightly different. I was invited to give an overview talk about the work I have been doing as a PhD student. And that is not only what I am doing, but why am I doing it and how is it all related without going into the details of every step. It may appear as an easy task, but it kept me thinking more than I expected.

As I think some people might be interested in a global overview, I want to share the presentation here as well: http://www.slideshare.net/dgarijo/from-scientific-workflows-to-research-objects-publication-and-abstraction-of-scientific-experiments. Have a look!

Posted in e-Science, Linked Data, Provenance, Research Object, scientific workflows, Taverna, Tutorial, Wings | Tagged: , , , , , , , , , | Leave a Comment »

Quantifying reproducibility and the case of the TB-Drugome (or how did I end up working with biologists)

Posted by dgarijov on January 27, 2013

In a couple of moths from now, the Beyond the PDF 2 meeting will take place in Amsterdam. This meeting is organized by the Force 11 group in order to promote research communications and “e-scholarship”. Basically, it aims to group scientists from different disciplines in order to create discussion and (among other things) promote enhancing, preserving and reusing the work published as scientific publications.

In the previous Beyond the PDF meeting (2011), Philip E. Bourne introduced the TB-Drugome, an experiment that had taken him and his team a couple of years to finish. The experiment took the ligand binding sites of all approved drugs in Europe and USA and compared them against the ligand binding sites of of the M.tb (Tuberculosis) proteins in order to produce a set of candidate drugs that could (as a side effect from their original purpose) cure the disease.

Philip explained that all the results of the experiment were available online, and asked the computer scientists for the means to expose the method and the results appropriately in order to be reused. His purpose was that other people could use this experiment for dealing with other diseases without spending much effort in changing the method they had followed for the TB Drugome.

And that was precisely my objective during my first internship in the ISI. I was not a domain expert in biology, but thanks to the help of the TB-Drugome authors, we finally reproduced the experiment as a workflow in the Wings platflorm. We also exported it as Linked Data, and abstracted the workflow so as to be able to implement any of its steps with different implementations. An example of a run can be seen here.

As it happens in other domains, workflows decay: the input databases change, the tools are updated/changed, etc. I had to add small components to the workflow in order to make it work and preserve it. The results obtained were different, but consistent with the findings of the original experiment. Another interesting fact is that we quantified all the time we took for reproducing all the steps. This quantification effort gives an idea of how much effort must a newcomer put in reproducing a workflow when the authors are helping, just to give an insight of how big this task can be. If I get a travel grant, I’ll share these results in the Beyond the PDF 2 meeting in Amsterdam :).

Posted in Uncategorized | Tagged: , , , , , , , | Leave a Comment »

Provenance Corpus ready!

Posted by dgarijov on December 12, 2012

This week there was an announcement about the deadline extension for BIGPROV13. Apparently, some authors are preparing new submissions for next week. In previous posts I highlighted how the community has been demanding a provenance benchmark to test different analyses on provenance data, so today I’m going to describe how I have been contributing to the publication of public accessible provenance traces from scientific experiments.

It all started last year, when I did an internship in the Information Sciences Institute (ISI) to reproduce the results of the TB-Drugome experiment, led by Phil Bourne’s team in San Diego. They wanted to make accessible the method followed in their experiment in order to be reused by other scientists, for which it is necessary to publish sample traces of the experiment, the templates and every intermediate output and source. As a result, we reproduced the experiment with a workflow using the Wings workflow system, we extended the Open Provenance Model (OPM) to represent the traces as the OPMW profile, and we described here the process necessary in order to publish the templates and traces of any workflow as Linked Data. Lately we have aligned the previous work with the emerging PROV-O standard, providing serializations of both OPM and PROV for each workflow that is published. You can find the public endpoint here, and an exemplar application that loads into a wiki the data of a workflow(dynamically) can be seen  here.

I have also been working with the Taverna people in the wf4Ever project to create a curated repository of runs from both Taverna and Wings, compatible with PROV (since both systems are similar and extend the standard to describe their workflows). The repository, available here for anyone that wants to use it, has been submitted to the BIGPROV13 call and hopefully will get accepted.

So… now that we have a standard for representing provenance the big questions are: What do I do with all the provenance I generate? How do I interoperate with other approaches? At what granularity do I record the activities of my website? How do I present provenance information to the users? How do I validate provenance? How do I complete it? Many challenges remain to be solved until we can hit Tim Berners Lee’s OH Yeah? button of every web resource.

Posted in e-Science, Linked Data, Provenance, scientific workflows, Taverna, Wings | Tagged: , , , , , , , | Leave a Comment »

Late thoughts about e-Science 2012

Posted by dgarijov on November 26, 2012

After a 2 week holiday, I’m finally back to work. Before letting more time pass by, I would like to share here a small summary of the e-Science conference I attended about a month and a half ago in Chicago.

I’ll start with the keynotes. There were four in the 3 days that the conference lasted. Gerhard Klimeck (slides) introduced Nanohub, a platform to publish and use separate components and tools via user-friendly interfaces, showing how they could be used for different purposes like education or research in a scalable way. It has a lot of potential (specially since they try to make things easier through simple interfaces), but I found curious how the notion of workflows doesn’t exist (or they are barely used).

Gregory Wilson (slides) raised a nice issue in e-Science: sometimes the main issue about the products developed by the scientific community is not that they have the wrong functionality, but that users don’t understand what are these products or how to use them. In order to address it, we should first prepare the users and then give them the tools.

The third speaker was Carole Goble (slides), who talked about reproducibility in e-Science and the multiple projects in which she is participating. She mentioned specially the wf4Ever project (where she collaborates with the OEG) and the Research Objects, the data artifacts that myExperiment is starting to adopt in order to preserve workflows and their provenance.

The last keynote was given by Leonard Smith (slides), and unlike the others (which were more computer science oriented), he presented from the point of view of a scientist that is looking for the appropriate tools to keep doing his research successfully. He talked about doing “science in the dark” (predictions over past observations) versus “science in the light” (analysis with empirical evaluations), and showed the example of meteorological predictions. Apparently the Royal Society wanted to drop the weather predictions in the past, but they were forced by users to have them back. Leonard highlighted the importance of never giving a 100% or 0% chance in the forecasts and ended his talk asking how could the e-Science community help this kind of research. I really recommend taking a look at the slides.

As for the panels, I attended the one about operating cities and Big Data. The work presented was very interesting, but I was a bit disappointed. I haven’t been to many panels before, and I thought a panel discussion was more a discussion between the speakers and the audience rather than presentations about the speakers’ work and a longer round of questions. This does not imply that the work was bad at all, just that I missed some debate among the invited speakers.

Regarding the sessions, most of them happened in parallel. The whole program can be seen here, so I will just post those which I enjoyed the most:

  1. Workflow 1: Where Khalid Belhajjame presented the work on decay analyzed by the wf4Ever people in Taverna workflows (slides). Definitely a good first step for those seeking to preserve the workflow functionality and their reprpoducibility. In this session I also talked about our empirical analysis on scientific workflows in order to find common patterns in their functionality (see slides).
  2. Data provenance: Beth Plale’s students (Pend Chen and You-Wei Cheah) introduced their work on temporal representation and quality of the workflow traces; and Sarah Cohen-Boulakia presented her work about workflow rewriting in order to make scalable analyses on the workflow graphs. I liked all the aforementioned presentations, as they where interesting and easy to follow. However they all shared the need on real workflow traces (they had created artifical ones for testing their approaches).
  3. Workflow 2: From this session I found relevant the work presented by Sonja Holl (slides), who talked about the approach they use to find automatically the appropriate parameters for running a workflow. Once again, she was interested for traces o real workflows, specifically from Taverna (since it is the system she had been dealing with).

In conclusion, I was very happy to attend to the conference (my first one if I don’t count workshops!), even if I missed the 3 day workshops from Microsoft that happened earlier in the week. I had the chance to meet new people that I had only seen through e-mail, and I talked to all the thinking heads working close to what I do.

From the sessions also became clear to me that the community is asking for a scientific workflow provenance curated benchmark for testing their different algorithms and methods. Fortunately I have seen a call for paper with this theme: https://sites.google.com/site/bigprov13/. It covers provenance in general, but in the Wf4ever project we are already planning a joint submission with more than 100 executions of different workflows from Taverna and Wings systems. Specifically, the ones from Wings are already online published as Linked Data (see some examples here). Lets see how the call works out!

Some of the presenters at e-Science (from left to right): Sonja Holl, Katherine Wolstencroft, Khalid Belhajjame, Sarah Cohen and me

Posted in Conference, e-Science | Tagged: , , , , , | Leave a Comment »