Last week I attended Provenance Week in Cologne. For the first time, IPAW and TAPP were held together, even sharing some sessions such as the poster lightning talks. The clear benefit of co-locating both events is that a bigger part of the community was actually able to attend, even if some argued that 5 full days of provenance is too long. I got to see many familiar faces, and finally meet some people I had only talked to remotely.
In general, the event was very interesting and definitely worth a visit. I got an overview of the state of the art in provenance across many different domains: how to collect it, protect it and exploit it for various purposes. Different sessions led to different discussions, but two topics stood out for me:
The “Sexy” application for provenance (Paul Groth). After years of discussion we have a standard for provenance, and many applications are starting to use and extend it to represent provenance across different domains. But there is no application yet that uses provenance from different sources to do something meaningful for the end user. Some applications define domain-dependent metrics to assess trust, others like PROV-O-Viz visualize it to show what is going on in the traces, and others try to use it to explain what kind of things can be found in a particular dataset. But we still don’t have the provenance killer app… will the community be able to find it before the next Provenance Week?
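For anyone who has not seen the standard in action: here is a minimal sketch of what a PROV document can look like in the PROV-JSON serialization, built as a plain Python dict. All the names under the `ex` prefix are made up for illustration.

```python
import json

# A tiny PROV-JSON document (one of the W3C PROV serializations):
# a chart was derived from a dataset and is attributed to an author.
# Every "ex:" name here is hypothetical.
doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {"ex:chart": {}, "ex:dataset": {}},
    "agent": {"ex:alice": {}},
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "ex:chart",
                 "prov:usedEntity": "ex:dataset"}},
    "wasAttributedTo": {
        "_:a1": {"prov:entity": "ex:chart", "prov:agent": "ex:alice"}},
}

serialized = json.dumps(doc, indent=2)
```

Even a record this small lets an application answer “who is responsible for this chart, and what was it derived from?”, which is the kind of cross-source question a killer app would need to answer at scale.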
Provenance has been discussed for many years now. How come we are still so irrelevant? (Beth Plale). This was brought up by keynote speaker and organizer Beth Plale, who talked about different consortia in the U.S. that are starting to care about provenance (e.g., HathiTrust or the Research Data Alliance). As some people pointed out, provenance has gained a lot of importance in recent years, to the point that some grants are only awarded if the researchers guarantee the tracking of provenance. The standard helps, but we are still far from solving the provenance-related issues. Authors and researchers have to see the benefit of publishing provenance (e.g., attribution, with something like PROV Pingback); otherwise it will be very difficult to convince them to do so.
Apart from the pointers I have included above, many other applications and systems were presented during the week. These are my highlights:
Documentation of scientific experiments: a cool application for generating documentation of workflows using Python notebooks and PROV-O-Viz, tested with Ducktape’s workflows.
Reconstruction of provenance: Hazeline Asuncion and Tom de Nies both presented their approaches for finding the dependencies among data files when the provenance is lost. I find this very interesting because it could potentially be used to label workflow activities automatically (e.g., with our motif list).
Provenance capture: RDataTracker, an intrusive yet simple way of capturing the provenance of R scripts. Other approaches like noWorkflow also looked solid, but seemed a little heavier.
Provenance benchmarking: Hugo Firth presented ProvGen, an interesting approach for creating huge synthetic provenance graphs that simulate real-world properties (e.g., Twitter data). All the new provenance datasets were added to the ProvBench GitHub page, now also on Datahub.
Provenance pingbacks: Tim Lebo and Tom de Nies presented two different implementations (see here and here) of the PROV pingback mechanism defined by the W3C. Even though security might still be an issue, this is a simple mechanism for providing attribution to authors. Fantastic first steps!
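As a rough illustration of how the pingback mechanism works (per the W3C PROV-AQ note): a resource advertises a pingback endpoint in an HTTP Link header, and anyone who holds provenance about that resource POSTs a `text/uri-list` of provenance-URIs to it. A minimal client-side sketch in Python, with made-up URLs and no actual network call:

```python
import re
from urllib.request import Request

# PROV-AQ: a resource can advertise a pingback endpoint via an HTTP
# Link header with the relation http://www.w3.org/ns/prov#pingback.
PINGBACK_REL = "http://www.w3.org/ns/prov#pingback"

def find_pingback_uri(link_header):
    """Return the pingback target advertised in a Link header, or None."""
    for part in link_header.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if m and m.group(2) == PINGBACK_REL:
            return m.group(1)
    return None

def build_pingback_request(pingback_uri, prov_uris):
    """Build (but do not send) a POST with a text/uri-list of provenance-URIs."""
    body = "\r\n".join(prov_uris).encode("ascii")
    return Request(pingback_uri, data=body,
                   headers={"Content-Type": "text/uri-list"},
                   method="POST")

# Hypothetical response header and provenance URI, for illustration only:
link = '<http://example.org/pingback/data1>; rel="http://www.w3.org/ns/prov#pingback"'
target = find_pingback_uri(link)
req = build_pingback_request(target, ["http://example.org/prov/trace1"])
```

The appeal is exactly this simplicity: one header to advertise, one POST to notify, and the data publisher learns where provenance about their resource lives.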
Provenance abstraction: Paolo Missier presented a way of simplifying provenance graphs while preserving the PROV notation, which makes it easier to understand what is going on in a provenance trace. Roly Perera presented an interesting survey on how abstraction is also being used to provide different levels of privacy when accessing the data, which will become more and more important as provenance gains a bigger role.
Applications of provenance: One of my favorites was Trusted Tiny Things, which aims to describe everyday things with provenance descriptions. In a city, this would be very useful for knowing how much the government spent on a certain item (like a statue) and who was responsible for buying it. Other interesting applications were Pinar Alper’s approach for labeling workflows, Jun Zhao’s approach for generating queries to explore provenance datasets, and Matthew Gamble’s metric for quantifying the influence of one article on another using provenance alone.
The Provenance Analytics workshop: I was invited to co-organize this satellite event on the first day. We got 11 submissions (8 accepted) and managed to keep a nice session running, plus some discussion at the end. Ongoing work on applications of provenance to different domains (cloud, geospatial, national climate, crowdsourcing, scientific workflows) was presented, and the audience was happy to provide feedback. I wouldn’t mind doing it again 🙂