Linking Research

Posts Tagged ‘provenance’

E-Science 2014: The longest Journey

Posted by dgarijov on October 31, 2014

After a few days back in Madrid, I have finally found some time to write about the eScience 2014 conference, which took place last week in Guarujá, Brazil. The conference lasted 5 days (the first two devoted to workshops) and drew attendees from all over the world. It was especially good to see many young people who could attend thanks to the scholarships awarded by the conference, even when they were not presenting a paper. I found it a bit unorthodox that presenters couldn’t apply for these scholarships (I wanted to!), but I am glad to see this kind of giveaway. Conferences are expensive, and I was able to have interesting discussions about my work thanks to this initiative. I think this is also a reflection of Jim Gray’s will: pushing science into the next generation.

We were housed in a tourist resort in Guarujá, right on the beach. This is what you could see when you got out of the hotel:

Guarujá beach

And the jungle was not far away either. After a 20 minute walk you were able to arrive at something like this…

The jungle was not far from the beach either

…which is pretty amazing. However, the conference schedule was packed with interesting talks from 8:30 to 20:30 most days, and in general we were unable to do any sightseeing. In my opinion they could have dropped one workshop day and relaxed the schedule a little bit. Or at least removed the parallel sessions in the main conference. It always sucks to have to choose between two interesting sessions. That said, I would like to congratulate everyone involved in the organization of the conference. They did an amazing job!

Another thing that surprised me is that I wasn’t expecting to see many Semantic Web people, since the ISWC Conference occurred at the same time in Italy, but I found quite a few. We are everywhere!

I gave two talks at the conference, summarizing the results I achieved during my internship at the Information Sciences Institute earlier this year. First I presented a user survey quantifying the benefits of creating workflows and workflow fragments, and then our approach to automatically detect common workflow fragments, tested in the LONI Pipeline (for more details I encourage you to follow the links to the presentations). The only thing that bothered me a bit was that my presentations were scheduled at strange hours. I had the last slot before dinner for the first one, and then I was the first presenter on the last day, at 8:30 am, for the second one. Here is a picture of the brave attendees who woke up early on the last day; I really appreciated their effort :):

The brave attendees who woke up early to be at my talk at 8:30 am

But let’s get back to the workshops, demos and conference. As I mentioned above, the first 2 days included workshop talks, demos and tutorials. Here are my highlights:

Workshops and demos:

Microsoft is investing in scientific workflows!: I attended the Azure research training workshop, where Mateus Velloso introduced the Azure infrastructure for creating and setting up virtual machines, web services, websites and workflows. It is really impressive how easily you can create and run experiments with their infrastructure, although you are limited to their own library of software components (in this case, a machine learning library). If you want to add your own software, you have to expose it as a web service.

Impressive visualizations using Excel sheets at the Demofest! All the demos belonged to Microsoft (guess who was one of the main sponsors of the conference), although I have to admit that they looked pretty cool. I was impressed by two demos in particular, the SandDance beta and the WorldWide Telescope. The former loads Excel files with large datasets so you can play with the data: select, filter and plot the resources by different facets. It is easy to use and the animations are very fluid. The latter was similar to Google Maps, but you could load your Excel dataset (more than 300K points at the same time) and show it in real time. For example, in the demo you could draw the itineraries of several whales in the sea at different points in time, and show their movement minute by minute.

Microsoft demo session. With caipirinhas!

New provenance use cases are always interesting. Dario Oliveira introduced their approach to extract biographical information from the Brazilian Historical Biographical Dictionary at the Digital Humanities Workshop. This included not only the lives of the people collected in the dictionary, but also each reference that contributed to telling part of the story. Certainly a complex and interesting use case for provenance, which they are currently refining.

Paul Watson was awarded the Jim Gray Award. In his keynote, he talked about social exclusion and the effect of digital technologies. Being unable to get online may stop you from accessing many services, and he presented ongoing work on helping people with accessibility problems (even through scientific workflows). Clouds play an important role too, as they have the potential to deal with the fast growth of applications. However, the people who could benefit the most from the cloud often do not have the resources or skills to do so. He also described e-Science Central, a workflow system for easily creating workflows in your web browser, with provenance recording and exploring capabilities and the possibility to tune and improve the scalability of your workflows with the Azure infrastructure. The keynote ended by highlighting how important it is to make things fun for the user (“gamification” of evaluations, for example), and how important eScience is for computer science research: new challenges are continuously presented, supported by real use cases in application domains with a lot of data behind them.

I liked the three dreams for eScience of the “strategic importance of eScience” panel:

  1. Find and support the misfits, by addressing those people with needs in eScience.
  2. Support cross domain overlap. Many communities base their work on the work done by other communities, although collaboration rarely happens at the moment.
  3. Cross domain collaboration.

First panel of the conference

Conference general highlights:

Great discussion in the “Going native” panel, chaired by Tony Hey, with experts from chemistry, scientific workflows and ornithology (talk about domain diversity). They analyzed the key elements of a successful collaboration, explaining how in their different projects they have a wide range of collaborators. It is crucial to have passionate people, who don’t lose momentum after the grant for the project has been obtained. For example, one of the best databases for accessing chemical descriptions in the UK came out of a personal project initiated by a minority. In general, people like to consume curated data, but very few are willing to contribute. In the end what people want is to have impact. Showing relevance and impact (or reputation, altmetrics, etc.) will attract additional collaborators. Finally, the issue of data interoperability between different communities was brought up for discussion. Data without methods is in many cases not very useful, which encourages part of the work I’ve been doing during the last years.

Awesome keynotes!! The one I liked the most was given by Noshir Contractor, who talked about “Grand Societal Challenges”. The keynote was basically about how to assemble a “dream team” of people for delivering a product/proposal, and all the analyses that have been done to determine which factors are the most influential. He started by talking about the Watson team, who built a machine capable of beating a human on TV, and continued by presenting the tendencies people have when selecting people for their own teams. He also presented a very interesting study of videogames as “leadership online labs”. In videogames very heterogeneous people meet, and they have to collaborate in groups in order to be successful. The takeaway was that diversity in a group can be very successful, but it is also very risky and often ends in failure. That is why people tend to collaborate with people they have already worked with when writing a proposal.

The keynote by Kathleen R. McKeown was also amazing. She presented a high level overview of the NLP work developed in her group on summarization of news, journal articles, blog posts, and even novels! (which IMO has a lot of merit, without going into the details). She presented co-reference detection of events, temporal summarization, sub-event identification and analysis of conversations in literature, depending on the type of text being addressed. Semantics can make a difference!

New workflow systems: I think I haven’t seen an eScience conference without new workflow systems being presented 😀 In this case the focus was more on the efficient execution and distribution of the resources. Dispel4py and Tigres workflow systems were introduced for scientists working in Python.

Cross domain workflows and scientific gateways:

Antonella Galizia presented the DRIHM infrastructure to set up Hydro-Meteorological experiments in minutes. Impressive, as they had to integrate models for meteorology, hydrology, pluviology and hydraulic systems, while reusing existing OGC standards and developing a gateway for citizen scientists. A powerful approach, as they were able to do flooding predictions in certain parts of Italy. According to Antonella, one of the biggest challenges in achieving their results was to create a common vocabulary which could be understood by all the scientists involved. Once again we come back to semantics…

Rosa Filgueira presented another gateway, this time for volcanologists and rock physicists. Scientists often have trouble sharing data among different disciplines, even if they belong to the same domain (geology in this case). This is because every lab often records its data in a different way.

Finally, Silvia Olabarriaga gave an interesting talk about workflow management in astrophysics, heliophysics and biomedicine, distinguishing the conceptual level (user in the science gateway), abstract level (scientific workflow) and concrete level (how the workflow is finally executed on an infrastructure), and how to capture provenance at these different granularities.

Other more specific work that I liked:

  • A tool for understanding copyright in science, presented by Richard Hoskings. A plethora of different licenses coexist in Linked Open Data, and it is often difficult to understand how one can use the different resources exposed on the Web. This tool helps guide the user through the possible consequences of using a given resource or another in their applications. Very useful to detect any incompatibility in your application!
  • An interesting workflow similarity approach by Johannes Starlinger, which improves the current state of the art by matching workflows efficiently. Johannes said they would release a new search engine soon, so I look forward to analyzing their results. They have published a corpus of similar workflows here.
  • Context of scientific experiments: Rudolf Mayer presented the work done in the Timbus project to capture the context of scientific workflows. This includes their dependencies, methods and data at a very fine granularity. Definitely related to Research Objects!
  • An agile annotation of scientific texts to identify and link biomedical entities by Marcus Silva, with the particularity of being capable of loading very large ontologies to do the matching.
  • Workflow ecosystems in Pegasus: Ewa Deelman presented a set of combinable tools for Pegasus able to archive, distribute, simulate and efficiently re-compute workflows. All tested with a huge workflow in astronomy.
  • Provenance is still playing an important role in the conference, with a whole session for related papers. PROV is being reused and extended in different domains, but I have yet to see an interoperable use across different domains that shows its full potential.

Conference dinner and dance with a live band

In summary, I think the conference has been a very positive experience and definitely worth the trip. It is very encouraging to see that collaborations among different communities are really happening thanks to the infrastructure being developed on eScience, although there are still many challenges to address. I think we will see more and more cross domain workflows and workflow ecosystems in the next years, and I hope to be able to contribute with my research.

I also got plenty of new references to add to the state of the art of my thesis, so I think that I also did a good job of talking to people and letting others know about my work. Unfortunately my return flight was delayed and I missed my connection back to Spain, turning my 14 hour trip home into almost 48 hours. Certainly the longest journey from any conference I have attended.

Posted in Conference, e-Science

Provenance Week 2014

Posted by dgarijov on June 20, 2014

Last week I attended Provenance Week in Cologne. For the first time, IPAW and TAPP were held together, even having some overlapping sessions like the poster lightning talks. The clear benefit of having both events at the same time is that a bigger part of the community was actually able to attend, even if some argued that 5 full days of provenance is too long. I got to see many familiar faces, and finally meet some people I had only talked to remotely.

In general, the event was very interesting, definitely worth paying a visit. I was able to gather an overview of the state of the art in provenance in many different domains, and how to protect it, collect it and exploit it for various purposes. Different sessions led to different discussions, but I liked 2 topics in particular:

The “Sexy” application for provenance (Paul Groth). After years of discussions we have a standard for provenance, and many applications are starting to use it and extend it to represent provenance across different domains. But there is no application that uses provenance from different sources to do something meaningful for the final user. Some applications define domain-dependent metrics to assess trust, others like PROV-O viz visualize it to see what is going on in the traces, and others try to use it to explain what kind of things we can find in a particular dataset. But we still don’t have the provenance killer app… will the community be able to find it before the next Provenance Week?

Provenance has been discussed for many years now. How come we are still so irrelevant? (Beth Plale). This was brought up by the keynote speaker and organizer Beth Plale, who talked about different consortiums in the U.S. that are starting to care about provenance (e.g., the HathiTrust publisher or the Research Data Alliance). As some people pointed out, it is true that provenance has gained a lot of importance in recent years, up to the point at which some grants will only be awarded if the researchers guarantee the tracking of provenance. The standard helps, but we are still far from solving the provenance related issues. Authors and researchers have to see the benefit of publishing provenance (e.g., attribution, with something like PROV-Pingback); otherwise it will be very difficult to convince them to do so.

Luc getting prepared for his introductory speech in IPAW

Apart from the pointers I have included above, many other applications and systems were presented during the week. These are my highlights:

Documentation of scientific experiments. A cool application for generating documentation of workflows using a Python notebook and PROV-O viz. Tested with Ducktape’s workflows.

Reconstruction of provenance: Hazeline Asuncion and Tom de Nies both presented their approaches for finding the dependencies among data files when the provenance is lost. I find this very interesting because it could be used (potentially) to label workflow activities automatically (e.g., with our motif list).

Provenance capture: RDataTracker, an intrusive, yet simple way of capturing the provenance of scripts in R. Other approaches like noWorkflow also looked ok, but seemed a little heavier.

Provenance benchmarking: Hugo Firth presented ProvGen, an interesting approach for creating huge synthetic provenance graphs simulating real world properties (e.g., Twitter data). All the new provenance datasets were added to the ProvBench GitHub page, now also in Datahub.

Provenance pingbacks: Tim Lebo and Tom de Nies presented two different implementations (see here and here) for the PROV Pingback mechanism defined in the W3C. Even though security might still be an issue, this is a simple mechanism to provide attribution to the authors. Fantastic first steps!
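To give an idea of how small the mechanism is, here is a minimal sketch of the client side as I understand it from PROV-AQ: a resource advertises a pingback URI in an HTTP Link header using the `http://www.w3.org/ns/prov#pingback` relation, and whoever uses the resource POSTs back a `text/uri-list` pointing to their own provenance. The function names and the example.org URIs are my own placeholders, not taken from either of the two implementations, and the sketch only parses headers and builds the request body (no network calls).

```python
import re

# Link relation used to advertise a pingback URI (from the PROV-AQ note).
PINGBACK_REL = "http://www.w3.org/ns/prov#pingback"


def find_pingback_uri(link_header):
    """Return the first Link target whose rel is the PROV pingback relation.

    Simplified parsing: assumes link parameters contain no commas or
    semicolons inside quoted values.
    """
    pattern = r'<([^>]*)>\s*((?:;\s*[^,;]+)*)'
    for target, params in re.findall(pattern, link_header):
        if re.search(r'rel\s*=\s*"?' + re.escape(PINGBACK_REL) + r'"?', params):
            return target
    return None


def pingback_body(provenance_uris):
    """Build a text/uri-list body: one URI per line, CRLF terminated."""
    return "".join(uri + "\r\n" for uri in provenance_uris)


# Hypothetical header, as a server might return it for some resource.
header = (
    '<http://example.org/prov/item1>; '
    'rel="http://www.w3.org/ns/prov#has_provenance", '
    '<http://example.org/pingback/item1>; '
    'rel="http://www.w3.org/ns/prov#pingback"'
)

pingback_uri = find_pingback_uri(header)
body = pingback_body(["http://example.com/my-paper/provenance.ttl"])
print("POST", pingback_uri)
print("Content-Type: text/uri-list")
print(body)
```

A real client would send that body with any HTTP library; the security concerns mentioned above (anyone can ping anything) are untouched by this sketch.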

Provenance abstraction: Paolo Missier presented a way of simplifying provenance graphs while preserving the prov notation, which helps to understand better what is going on in the provenance trace. Roly Perrera presented an interesting survey on how abstraction is also being used to present different levels of privacy when accessing the data, which will be more and more important as provenance gains a bigger role.

Applications of provenance: One of my favorites was Trusted Tiny Things, which aimed at describing everyday things with provenance descriptions. This would be very useful to know, in a city, how much the government spent on a certain item (like a statue), and who was responsible for buying it. Other interesting applications were Pinar Alper’s approach for labeling workflows, Jun Zhao’s approach for generating queries for exploring provenance datasets and Matthew Gamble’s metric for quantifying the influence of one article on another just by using provenance.

Trusted Tiny Things presentation

The Provenance analytics workshop: I was invited to co-organize this satellite event on the first day. We got 11 submissions (8 accepted) and managed to keep a nice session running plus some discussion at the end. Some ongoing work on applications of provenance to different domains was presented (cloud, geospatial, national climate, crowdsourcing, scientific workflows) and the audience was happy to provide feedback. I wouldn’t mind doing it again 🙂

The prov analytics workshop (pic by Paul Groth)

Posted in Conference, Tutorial, Workshop

Finding commonalities in different scientific experiments

Posted by dgarijov on April 30, 2013

A scientific workflow can be defined as the computational steps required for executing an in silico experiment. Scientific workflows are similar to laboratory protocols. The only difference between them is that in scientific workflows scientists run their simulations with a computer instead of in a laboratory. These kinds of workflows are becoming increasingly important for three reasons:

  1. They help the scientist researching the experiment to repeat the results and convince his/her colleagues of the validity of the method followed.
  2. They expose the inputs, intermediate results, outputs and codes of the experiment. In this regard they can be considered the log of the research, which is very useful for the reviewers of the work.
  3. They allow other scientists to reproduce (and reuse) the method of the experiment, since they just have to rerun the original experiments or use other input data.

Feature selection workflow

The figure accompanying this post shows an example of a workflow for feature selection: each of the steps represents a computational method that performs some operation on the input data received from the previous step. The initial input is a dataset and a set of words, and the final output is the selected features.

Repositories of workflows like myExperiment, CrowdLabs and Galaxy have been created for scientists to share their scientific workflows, so a reasonable question for a scientist would be: how do these workflows relate to each other? Which are the most popular workflow fragments in the repository? Has there really been reuse among the different workflows?

A possible way to find these popular workflow fragments is to apply graph matching techniques. If we represent each workflow as a directed acyclic graph, we can apply existing methods to see how a set of workflows overlaps with other similar workflows in the repository. At UPM we have recently been working on this problem, and the technique we have used is described in our latest paper, accepted at K-CAP. The idea is to obtain the best context free grammar encoding all the workflows of the repository. The rules of the grammar will be the most common fragments among the workflows, which is exactly what we are looking for. If you want more information and examples, I encourage you to have a look at the paper.
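To make the idea of "common fragments" a bit more concrete, here is a toy sketch. Note that it is not the grammar-based technique from the paper: it is a much simpler stand-in that only counts how many workflows share each edge or each chain of two consecutive steps. The workflows and step names are invented for illustration.

```python
from collections import Counter

# Hypothetical toy repository: each workflow is a DAG given as a set of
# (producer_step, consumer_step) edges. Step names are illustrative only.
workflows = {
    "wf1": {("Stemmer", "TermWeighting"),
            ("TermWeighting", "FeatureSelection")},
    "wf2": {("Stemmer", "TermWeighting"),
            ("TermWeighting", "Classifier")},
    "wf3": {("StopWords", "Stemmer"),
            ("Stemmer", "TermWeighting"),
            ("TermWeighting", "FeatureSelection")},
}


def two_step_fragments(edges):
    """Return every chain a -> b -> c contained in the workflow."""
    return {(a, b, c) for (a, b) in edges for (b2, c) in edges if b == b2}


def common_fragments(repo, min_support=2):
    """Count in how many workflows each edge or two-step chain appears."""
    counts = Counter()
    for edges in repo.values():
        # Use a set so a fragment is counted once per workflow.
        counts.update(set(edges) | two_step_fragments(edges))
    return {frag: n for frag, n in counts.items() if n >= min_support}


result = common_fragments(workflows)
for frag, n in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" -> ".join(frag), "appears in", n, "workflows")
```

Counting fixed-size chains like this misses larger or branching fragments, which is precisely why a grammar (or a general frequent-subgraph miner) is a better fit for real repositories.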

Posted in e-Science, Provenance, Wings

The PROV family of specifications is released

Posted by dgarijov on March 13, 2013

The PROV family of documents was finally released yesterday (March 12) as W3C Proposed Recommendations (link to the official post) by the Provenance Working Group. This family of documents consists of 4 recommendations and 8 notes that describe how to model, use and interchange provenance on the Web.

So, where to start? I recommend having a look at the PROV-Overview Note, which gives a high level overview of all the documents in the family and how they are connected. If you just want to use the model, then I recommend taking a look at the Primer Note, which explains the functionality of the PROV model with simple examples. The rest of the documents serve different purposes:

  • PROV-O, the PROV ontology (Proposed Recommendation), is an OWL2 ontology allowing the mapping of PROV to RDF.
  • PROV-DM (Proposed Recommendation), the PROV data model for provenance.
  • PROV-N (Proposed Recommendation), a notation for provenance aimed at human consumption.
  • PROV-CONSTRAINTS (Proposed Recommendation), a set of constraints applying to the PROV data model.
  • PROV-XML (Note), an XML schema for the PROV data model.
  • PROV-AQ (Note), the mechanisms for accessing and querying provenance.
  • PROV-DICTIONARY (Note) introduces a specific type of collection, consisting of key-entity pairs.
  • PROV-DC (Note) provides a mapping between PROV and Dublin Core Terms.
  • PROV-SEM (Note), a declarative specification in terms of first-order logic of the PROV data model.
  • PROV-LINKS (Note) introduces a mechanism to link across bundles.

These descriptions were prov:wasQuotedFrom the Overview.
I’ll try to write a post in the next few days on how to add simple PROV statements to your web page.
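In the meantime, here is a small sketch of what the quotation statement above could look like in PROV-N. In PROV-DM a quotation is a derivation typed with `prov:Quotation` (the `prov:wasQuotedFrom` shortcut is the PROV-O property). The `ex:` identifiers, the example.org namespace and the helper function are placeholders I made up for illustration; the script simply assembles the text of a PROV-N document.

```python
def prov_n_document(prefix, namespace, statements):
    """Wrap PROV-N statements in a document with one declared prefix."""
    lines = ["document", f"  prefix {prefix} <{namespace}>"]
    lines += [f"  {statement}" for statement in statements]
    lines.append("endDocument")
    return "\n".join(lines)


# Hypothetical identifiers: ex:post is this blog post, ex:overview stands
# for the PROV-Overview note, ex:dgarijov for the author of the post.
statements = [
    "entity(ex:post)",
    "entity(ex:overview)",
    "agent(ex:dgarijov)",
    # Quotation is expressed as a typed derivation in PROV-N.
    "wasDerivedFrom(ex:post, ex:overview, [prov:type='prov:Quotation'])",
    "wasAttributedTo(ex:post, ex:dgarijov)",
]

doc = prov_n_document("ex", "http://example.org/", statements)
print(doc)
```

The same five statements could equally be written in RDF with PROV-O; PROV-N is used here only because it is the easiest serialization to read.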

Posted in Provenance

Provenance Corpus ready!

Posted by dgarijov on December 12, 2012

This week there was an announcement about the deadline extension for BIGPROV13. Apparently, some authors are preparing new submissions for next week. In previous posts I highlighted how the community has been demanding a provenance benchmark to test different analyses on provenance data, so today I’m going to describe how I have been contributing to the publication of publicly accessible provenance traces from scientific experiments.

It all started last year, when I did an internship at the Information Sciences Institute (ISI) to reproduce the results of the TB-Drugome experiment, led by Phil Bourne’s team in San Diego. They wanted to make the method followed in their experiment accessible so that other scientists could reuse it, which requires publishing sample traces of the experiment, the templates and every intermediate output and source. As a result, we reproduced the experiment as a workflow using the Wings workflow system, we extended the Open Provenance Model (OPM) to represent the traces as the OPMW profile, and we described here the process necessary to publish the templates and traces of any workflow as Linked Data. Lately we have aligned the previous work with the emerging PROV-O standard, providing serializations of both OPM and PROV for each workflow that is published. You can find the public endpoint here, and an exemplar application that dynamically loads the data of a workflow into a wiki can be seen here.

I have also been working with the Taverna people in the wf4Ever project to create a curated repository of runs from both Taverna and Wings, compatible with PROV (since both systems are similar and extend the standard to describe their workflows). The repository, available here for anyone that wants to use it, has been submitted to the BIGPROV13 call and hopefully will get accepted.

So… now that we have a standard for representing provenance the big questions are: What do I do with all the provenance I generate? How do I interoperate with other approaches? At what granularity do I record the activities of my website? How do I present provenance information to the users? How do I validate provenance? How do I complete it? Many challenges remain to be solved before we can hit Tim Berners-Lee’s “Oh, yeah?” button on every web resource.

Posted in e-Science, Linked Data, Provenance, scientific workflows, Taverna, Wings

Late thoughts about e-Science 2012

Posted by dgarijov on November 26, 2012

After a 2 week holiday, I’m finally back to work. Before letting more time pass by, I would like to share here a small summary of the e-Science conference I attended about a month and a half ago in Chicago.

I’ll start with the keynotes. There were four in the 3 days that the conference lasted. Gerhard Klimeck (slides) introduced nanoHUB, a platform to publish and use separate components and tools via user-friendly interfaces, showing how they could be used for different purposes like education or research in a scalable way. It has a lot of potential (especially since they try to make things easier through simple interfaces), but I found it curious that the notion of workflows doesn’t exist there (or is barely used).

Gregory Wilson (slides) raised a nice issue in e-Science: sometimes the main problem with the products developed by the scientific community is not that they have the wrong functionality, but that users don’t understand what these products are or how to use them. In order to address it, we should first prepare the users and then give them the tools.

The third speaker was Carole Goble (slides), who talked about reproducibility in e-Science and the multiple projects in which she is participating. She specifically mentioned the wf4Ever project (where she collaborates with the OEG) and Research Objects, the data artifacts that myExperiment is starting to adopt in order to preserve workflows and their provenance.

The last keynote was given by Leonard Smith (slides), and unlike the others (which were more computer science oriented), he presented from the point of view of a scientist who is looking for the appropriate tools to keep doing his research successfully. He talked about doing “science in the dark” (predictions over past observations) versus “science in the light” (analysis with empirical evaluations), and showed the example of meteorological predictions. Apparently the Royal Society wanted to drop the weather predictions in the past, but they were forced by users to bring them back. Leonard highlighted the importance of never giving a 100% or 0% chance in the forecasts and ended his talk asking how the e-Science community could help this kind of research. I really recommend taking a look at the slides.

As for the panels, I attended the one about operating cities and Big Data. The work presented was very interesting, but I was a bit disappointed. I haven’t been to many panels before, and I thought a panel discussion was more a discussion between the speakers and the audience rather than presentations about the speakers’ work and a longer round of questions. This does not imply that the work was bad at all, just that I missed some debate among the invited speakers.

Regarding the sessions, most of them happened in parallel. The whole program can be seen here, so I will just post those which I enjoyed the most:

  1. Workflow 1: Khalid Belhajjame presented the analysis of decay in Taverna workflows carried out by the wf4Ever people (slides). Definitely a good first step for those seeking to preserve workflow functionality and reproducibility. In this session I also talked about our empirical analysis of scientific workflows to find common patterns in their functionality (see slides).
  2. Data provenance: Beth Plale’s students (Pend Chen and You-Wei Cheah) introduced their work on temporal representation and quality of workflow traces, and Sarah Cohen-Boulakia presented her work on workflow rewriting in order to make scalable analyses on the workflow graphs. I liked all the aforementioned presentations, as they were interesting and easy to follow. However, they all shared the need for real workflow traces (they had created artificial ones for testing their approaches).
  3. Workflow 2: From this session I found relevant the work presented by Sonja Holl (slides), who talked about the approach they use to automatically find the appropriate parameters for running a workflow. Once again, she was interested in traces of real workflows, specifically from Taverna (since it is the system she had been dealing with).

In conclusion, I was very happy to attend the conference (my first one if I don’t count workshops!), even if I missed the 3 day workshops from Microsoft that happened earlier in the week. I had the chance to meet new people that I had only known through e-mail, and I talked to all the thinking heads working close to what I do.

From the sessions it also became clear to me that the community is asking for a curated benchmark of scientific workflow provenance for testing their different algorithms and methods. Fortunately I have seen a call for papers with this theme: https://sites.google.com/site/bigprov13/. It covers provenance in general, but in the Wf4ever project we are already planning a joint submission with more than 100 executions of different workflows from the Taverna and Wings systems. Specifically, the ones from Wings are already published online as Linked Data (see some examples here). Let’s see how the call works out!

Some of the presenters at e-Science (from left to right): Sonja Holl, Katherine Wolstencroft, Khalid Belhajjame, Sarah Cohen and me

Posted in Conference, e-Science