Linking Research

Archive for the ‘e-Science’ Category

EarthCube All Hands Meeting (ECAHM 2018)

Posted by dgarijov on June 13, 2018

Last week I attended the annual EarthCube All Hands Meeting (ECAHM) in Alexandria, Virginia, just outside Washington, D.C. Since it has been a while since my last post, I think it would be interesting to share my notes and highlights here for anyone who missed the event.

ECAHM meetings are usually very enriching experiences, as they bring together researchers from many fields related to the geosciences, ranging from computer scientists to volcanologists and marine biologists. The purpose of the meeting is to gather the community and hear everyone report back on their NSF-funded EarthCube projects, which aim to improve cyberinfrastructure in the geosciences. As a computer scientist, I think this is a great meeting to attend for two main reasons: first, you always learn something new, even if it’s not in your domain. Second, people are extremely grateful for your contributions, as you are helping them become more effective when doing their science.

So, what was I doing at ECAHM 2018?

I attended the meeting to present our latest progress in OntoSoft, a distributed software metadata registry we created at ISI to help scientists describe their software. You can see the poster abstract online (and soon the poster itself). I also participated in a “speed-dating session”, where I spent half an hour discussing how to describe software with a domain scientist; and I stood in for Yolanda Gil on a panel about external partnership opportunities, where I presented the Open Knowledge Network initiative. This effort, led by NITRD, is a great opportunity to create a shared open knowledge graph that both research and industry could use, refine and curate. The idea is that this knowledge graph becomes part of the US infrastructure the same way supercomputers currently are, so anyone could benefit from it and also contribute to it. It looks like the NSF is keen to pursue this objective too.

Two colleagues of mine also presented other initiatives I am involved in. Deborah Khider showcased our efforts towards structuring metadata and creating standards in the paleoclimate sciences, together with a set of tools that a team of paleoclimate scientists has developed to work with that structured data. She also managed to mix Star Wars and Star Trek themes in her poster and presentation, which was well received by the attendees (I think everyone stopped at her poster).

Jo Martin presented the IS-GEO research collaboration network, where we are bringing in experts from geosciences and intelligent systems to foster new collaborations. We hold a monthly meeting where a different researcher talks about their latest work each time! Check it out here:

About the keynotes:

As expected, keynotes at ECAHM are nothing like those at venues such as AAAI or IUI. The first speaker was Dean Pesnell (NASA), who presented the research carried out by his team on the sun and sunspots. Why is this related to geosciences? Because the sun could be considered “our ground truth for the universe”, and anything related to its activity has implications for every field of the geosciences. Their main problem is how to analyze the amount of data that they have. Each of their datasets may contain several hundred million images, so proper metadata is crucial (you don’t want to find out you have downloaded 300 million images for nothing). Dean showed some impressive videos of their observations of the sun, as well as their pipelines to handle “very big data” analyses.

The second speaker was Sarah Stamps, who talked about continental rifting and the Tanzania Volcano Observatory. Apparently, geologists are among the few people in the world who would run towards an erupting volcano instead of away from it. Sarah described the EARS system (East African Rift System) they are setting up, and how they teamed up with CHORDS to enable real-time analysis of the observations they collect in the field. Thanks to her work, they are developing an early warning system for hazard detection! Sarah was departing soon to set up a few more observing stations in the field, so best of luck!!

The third speaker was Caroline S. Wagner, who gave some metrics on the social side of interdisciplinary collaboration. Science has become increasingly collaborative and team based, and the number of international collaborations has doubled in the past years. The number of countries producing 95% of research has gone from 7 to 15, which indicates we are moving in the right direction. However, more than 50% of articles are currently never cited. A few takeaways from this talk are: 1) International collaborations start face to face, so go to different events and meet new people; 2) Diverse teams usually take longer to be productive, as people don’t usually speak the same language. Be patient!!; 3) Work towards a solution, not towards interdisciplinary teams. Interdisciplinarity should be the means to an end, not the end itself.

Other highlights

Below are some additional highlights I found interesting for the EarthCube community.

  • Eva Zanzerkia reported on the NSF 10 Big Ideas, which nicely summarize the agency’s funding interests for the coming years. The report has been out for more than a year, but it’s never too late to catch up!
  • Doug Fils presented their plan for turning P418 into something bigger. In case you don’t know, P418 currently tracks the metadata of datasets exposed as and aggregates it in a search engine (a search engine for scientific data). Future plans are to ingest other types of resources and make the code base stable.
  • Interesting working lunch idea: A napkin drawing exercise. Do you know how to present your idea with a simple sketch?
  • Simon Goring (and Scott Peckham): How do we measure success in a huge program such as EarthCube?
  • PANGEO: Big data in the geosciences (but without reinventing the wheel!)
  • ASSET: Or how to incorporate existing tools into your workflows by drawing sketches! Workflows are important! Two different studies may obtain different results even if the original data is the same:

  • I got an award for community service 🙂 :


Posted in Conference, e-Science, scientific workflows, Workshop

E-Science 2014: The longest Journey

Posted by dgarijov on October 31, 2014

After a few days back in Madrid, I have finally found some time to write about the eScience 2014 conference, which took place last week in Guarujá, Brazil. The conference lasted 5 days (the first two were workshop days), and it drew attendees from all over the world. It was especially good to see many young people who could attend thanks to the scholarships awarded by the conference, even when they were not presenting a paper. I found it a bit unorthodox that presenters couldn’t apply for these scholarships (I wanted to!), but I am glad to see this kind of initiative. Conferences are expensive, and I was able to have interesting discussions about my work thanks to it. I think this is also a reflection of Jim Gray’s will: pushing science into the next generation.

We stayed at a tourist resort in Guarujá, right on the beach. This is what you could see when you got out of the hotel:

Guarujá beach

And the jungle was not far away either. After a 20 minute walk you were able to arrive at something like this…

The jungle was not far from the beach either

…which is pretty amazing. However, the conference schedule was packed with interesting talks from 8:30 to 20:30 on most days, so in general we were unable to do much sightseeing. In my opinion they could have dropped one workshop day and relaxed the schedule a little bit, or at least removed the parallel sessions in the main conference. It always sucks to have to choose between two interesting sessions. That said, I would like to congratulate everyone involved in the organization of the conference. They did an amazing job!

Another thing that surprised me is that I wasn’t expecting to see many Semantic Web people, since the ISWC Conference occurred at the same time in Italy, but I found quite a few. We are everywhere!

I gave two talks at the conference, which summarized the results I achieved during my internship at the Information Sciences Institute earlier this year. First I presented a user survey quantifying the benefits of creating workflows and workflow fragments, and then our approach to automatically detect common workflow fragments, tested in the LONI Pipeline (for more details I encourage you to follow the links to the presentations). The only thing that bothered me a bit was that my presentations were scheduled at strange hours. I had the last slot before dinner for the first one, and then I was the first presenter on the last day, at 8:30 am, for the second one. Here is a picture of the brave attendees who woke up early on the last day. I really appreciated their effort :):

The brave attendees who woke up early to be at my talk at 8:30 am

But let’s get back to the workshops, demos and conference. As I mentioned above, the first 2 days included workshop talks, demos and tutorials. Here are my highlights:

Workshops and demos:

Microsoft is investing in scientific workflows! I attended the Azure research training workshop, where Mateus Velloso introduced the Azure infrastructure for creating and setting up virtual machines, web services, websites and workflows. It is really impressive how easily you are able to create and run experiments with their infrastructure, although you are limited to their own library of software components (in this case, a machine learning library). If you want to add your own software, you have to expose it as a web service.

Impressive visualizations using Excel sheets at the Demofest! All the demos belonged to Microsoft (guess who was one of the main sponsors of the conference), although I have to admit that they looked pretty cool. I was impressed by two demos in particular, the SandDance beta and the WorldWide Telescope. The former loads Excel files with large datasets and lets you play with the data: select, filter and plot the resources by different facets. It is easy to use and the animations are very fluid. The latter was similar to Google Maps, but you were able to load your Excel dataset (more than 300K points at the same time) and show it in real time. For example, in the demo you could draw the itineraries of several whales in the sea at different points in time, and show their movement minute by minute.

Microsoft demo session. With caipirinhas!

New provenance use cases are always interesting. Dario Oliveira introduced their approach to extract biographical information from the Brazilian Historical Biographical Dictionary at the Digital Humanities Workshop. This included not only the lives of the different people collected in the dictionary, but also each reference that contributed to telling part of the story. Certainly a complex and interesting use case for provenance, which they are currently refining.

Paul Watson received the Jim Gray Award. In his keynote, he talked about social exclusion and the effect of digital technologies. Being unable to get online may stop you from accessing many services, and he presented ongoing work on helping people with accessibility problems (even through scientific workflows). Clouds play an important role too, as they have the potential to deal with the fast growth of applications. However, the people who could benefit the most from the cloud often do not have the resources or skills to use it. He also described e-Science Central, a workflow system for easily creating workflows in your web browser, with provenance recording and exploring capabilities and the possibility to tune and improve the scalability of your workflows with the Azure infrastructure. The keynote ended by highlighting how important it is to make things fun for the user (“gamification” of evaluations, for example), and how important eScience is for computer science research: new challenges are continuously presented, supported by real use cases in application domains with a lot of data behind them.

I liked the three dreams for eScience of the “strategic importance of eScience” panel:

  1. Find and support the misfits, by addressing those people with needs in eScience.
  2. Support cross domain overlap. Many communities base their work on the work made by other communities, although the collaboration rarely happens at the moment.
  3. Cross domain collaboration.
First panel of the conference

Conference general highlights:

Great discussion in the “Going native” panel, chaired by Tony Hey, with experts from chemistry, scientific workflows and ornithology (talk about domain diversity). They analyzed the key elements of a successful collaboration, explaining how in their different projects they have a wide range of collaborators. It is crucial to have passionate people, who don’t lose momentum after the grant for the project has been obtained. For example, one of the best databases for accessing chemical descriptions in the UK came out of a personal project initiated by a small group. In general, people like to consume curated data, but very few are willing to contribute. In the end what people want is to have impact. Showing relevance and impact (or reputation, altmetrics, etc.) will attract additional collaborators. Finally, the issue of data interoperability between different communities was brought up for discussion. Data without methods is in many cases not very useful, which is encouraging for part of the work I have been doing in recent years.

Awesome keynotes!! The one I liked the most was given by Noshir Contractor, who talked about “Grand Societal Challenges”. The keynote was basically about how to assemble a “dream team” of people for delivering a product/proposal, and all the analyses that had been done to determine which factors are the most influential. He started by talking about the Watson team, who built a machine capable of beating a human on TV, and continued by presenting the tendencies people have when selecting people for their own teams. He also presented a very interesting study of videogames as “leadership online labs”. In videogames very heterogeneous people meet, and they have to collaborate in groups in order to be successful. The takeaway conclusion was that diversity in a group can be very successful, but it is also very risky and often ends in failure. That is why people tend to collaborate with people they have already worked with when writing a proposal.

The keynote by Kathleen R. McKeown was also amazing. She presented a high level overview of the work in NLP developed in their group concerning summarization of news, journal articles, blog posts, and even novels! (which IMO has a lot of merit without going into the detail). She presented co-reference detection of events, temporal summarization, sub-event identification and analysis of conversations in literature, depending on the type of text being addressed. Semantics can make a difference!

New workflow systems: I think I haven’t seen an eScience conference without new workflow systems being presented 😀 In this case the focus was more on the efficient execution and distribution of the resources. Dispel4py and Tigres workflow systems were introduced for scientists working in Python.

Cross domain workflows and scientific gateways:

Antonella Galizia presented the DRIHM infrastructure to set up hydro-meteorological experiments in minutes. Impressive, as they had to integrate models for meteorology, hydrology, pluviology and hydraulic systems, while reusing existing OGC standards and developing a gateway for citizen scientists. A powerful approach, as they were able to make flood predictions in certain parts of Italy. According to Antonella, one of the biggest challenges in achieving their results was to create a common vocabulary which could be understood by all the scientists involved. Once again we come back to semantics…

Rosa Filgueira presented another gateway, this time for volcanologists and rock physicists. Scientists often have trouble sharing data among different disciplines, even if they belong to the same domain (geology in this case). This is because every lab often records its data in a different way.

Finally, Silvia Olabarriaga gave an interesting talk about workflow management in astrophysics, heliophysics and biomedicine, distinguishing the conceptual level (user in the science gateway), abstract level (scientific workflow) and concrete level (how the workflow is finally executed on an infrastructure), and how to capture provenance at these different granularities.

Other more specific work that I liked:

  • A tool for understanding copyright in science, presented by Richard Hoskings. A plethora of different licenses coexist in Linked Open Data, and it is often difficult to understand how one can use the different resources exposed on the Web. This tool helps guide the user through the possible consequences of using one resource or another in their applications. Very useful for detecting any incompatibility in your application!
  • An interesting workflow similarity approach by Johannes Starlinger, which improves the current state of the art by making efficient matching on workflows. Johannes said they would release a new search engine soon, so I look forward to analyzing their results. They have published a corpus of similar workflows here.
  • Context of scientific experiments: Rudolf Mayer presented the work done in the Timbus project to capture the context of scientific workflows. This includes their dependencies, methods and data at a very fine granularity. Definitely related to Research Objects!
  • An agile annotation of scientific texts to identify and link biomedical entities by Marcus Silva, with the particularity of being capable of loading very large ontologies to do the matching.
  • Workflow ecosystems in Pegasus: Ewa Deelman presented a set of combinable tools for Pegasus able to archive, distribute, simulate and efficiently re-compute workflows. All tested with a huge workflow in astronomy.
  • Provenance is still playing an important role in the conference, with a whole session for related papers. PROV is being reused and extended in different domains, but I still have to see an interoperable use across different domains to show its full potential.
Conference dinner and dance with a live band

In summary, I think the conference has been a very positive experience and definitely worth the trip. It is very encouraging to see that collaborations among different communities are really happening thanks to the infrastructure being developed on eScience, although there are still many challenges to address. I think we will see more and more cross domain workflows and workflow ecosystems in the next years, and I hope to be able to contribute with my research.

I also got plenty of new references to add to the state of the art of my thesis, so I think that I also did a good job by talking to people and letting others know about my work. Unfortunately my return flight was delayed and I missed my connection back to Spain, turning my 14 hour trip home into almost 48 hours. Certainly the longest journey from any conference I have attended.

Posted in Conference, e-Science

Elevator pitch

Posted by dgarijov on February 16, 2014

As a PhD student, I have been asked by many people about the subject of my thesis and the main ideas behind my research. As a student you always think you are very clear about what you are doing, at least until you have to actually explain it to someone who is not related to your domain. In fact, it is about using the right terminology. If you say something like “Oh yeah, I am trying to detect abstractions on scientific workflows semi-automatically in order to understand how they can better be reused and related to each other”, people will look at you as if you didn’t belong to this planet. Instead, something like “detecting commonalities in scientific experiments in order to study how we can understand them better” might be more appropriate.

But last week the challenge was slightly different. I was invited to give an overview talk about the work I have been doing as a PhD student. That means covering not only what I am doing, but also why I am doing it and how it is all related, without going into the details of every step. It may appear to be an easy task, but it kept me thinking more than I expected.

As I think some people might be interested in a global overview, I want to share the presentation here as well: Have a look!

Posted in e-Science, Linked Data, Provenance, Research Object, scientific workflows, Taverna, Tutorial, Wings

Posted by dgarijov on January 10, 2014

For the last 3 years I have been involved in the Wf4Ever project, which has developed the notion of Research Objects and their respective models (previously introduced in another post). Lately I have been exploring new ways of eating my own dog food by associating Research Objects with my papers as HTML web pages (see an example here). These Research Objects are useful, as they serve as a summary of the paper in question, and they have pointers to all the datasets, queries and additional materials that I could not include in the paper.

However, I realized that I spent a lot of time creating and annotating them. Therefore, over last Christmas I created a Research Object Creator tool, which takes a LaTeX file as input and extracts its title and abstract to create an annotated page in RDFa. It also produces a structure of the contents to reference, so you only have to fill in (and annotate if you want) the resources to point to. A sample can be seen in the image below:


A Sample Research Object generated by the tool

The tool is available on GitHub, so if you want to try it out with a LaTeX paper click on the following link:

Finally, I have also created a landing page showing the current catalog of Research Objects. The page is generated automatically: given the URI of a Research Object, it extracts its title and abstract from the RDFa descriptions. If you want to contribute new URIs, modify the Constants file in the GitHub project and I will recreate the landing page. Note that for this project I have used the Semargl RDFa parser, which is a little bit strict when parsing the HTML pages. If your Research Object has any markup mistakes, the parser will fail.
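To make the extraction step more concrete, here is a minimal Python sketch of the idea. It is not the code of the Research Object Creator tool: the function names, the regular expressions and the choice of Dublin Core terms for the annotations are all assumptions made for illustration.

```python
import re
from html import escape
from typing import Tuple


def extract_title_and_abstract(latex: str) -> Tuple[str, str]:
    """Pull the title and abstract out of a LaTeX source string.

    Simplification: the title pattern stops at the first closing brace,
    so titles containing nested braces are not handled.
    """
    title = re.search(r"\\title\{(.*?)\}", latex, re.DOTALL)
    abstract = re.search(
        r"\\begin\{abstract\}(.*?)\\end\{abstract\}", latex, re.DOTALL
    )

    def clean(match) -> str:
        # Collapse line breaks and repeated spaces into single spaces.
        return " ".join(match.group(1).split()) if match else ""

    return clean(title), clean(abstract)


def to_rdfa(title: str, abstract: str, ro_uri: str) -> str:
    """Render a small HTML fragment annotated with RDFa attributes.

    The dcterms properties are an assumption; the real tool may use
    a different vocabulary.
    """
    return (
        f'<div prefix="dcterms: http://purl.org/dc/terms/" about="{escape(ro_uri)}">\n'
        f'  <h1 property="dcterms:title">{escape(title)}</h1>\n'
        f'  <p property="dcterms:abstract">{escape(abstract)}</p>\n'
        "</div>"
    )
```

Calling `to_rdfa(*extract_title_and_abstract(source), "http://example.org/ro/1")` on the contents of a `.tex` file would give a fragment that an RDFa parser can read the title and abstract back from, which is the round trip the landing page relies on.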

Posted in e-Science, Linked Data, Research Object

How to (properly) publish a vocabulary or ontology in the web (part 3 of 6)

Posted by dgarijov on July 7, 2013

This part of the tutorial explains how to design human-readable documentation. When someone browses an ontology, it is very important to provide accurate definitions and examples of how to use it. If these are not provided, the ontology will be very difficult to reuse. Documentation that is easy to navigate, explains every concept and relationship separately, and presents an overview and examples makes the whole ontology easier for other people to understand.

Some people address this step by pointing to a report/deliverable/paper where the ontology is described. Although this helps, it is not easy to navigate and will drive any end user crazy. I don’t recommend it. Furthermore, in my experience, if the ontology is documented only in a paper then the information will be of little use.

Writing proper documentation is difficult and takes time. Fortunately, there are some tools to help you with this task, like LODE, Parrot, OWLDoc, Neologism, Ontospec, etc. I have worked with LODE, Parrot and OWLDoc, so I will only cover these here:

  • LODE: For me it’s the best of the tools I’ve tried. It is a web service that takes an OWL file as input and generates an HTML document. The HTML is W3C-style, with the definition of each of the terms extracted from the domains, ranges and metadata of your OWL file. If you extract the appropriate bits you can automatically create templates to customize your documentation with additional images, explanations and examples (like this one).
  • Parrot: Very similar to LODE, although the styles used are different and you have to clean up some of the properties not defined within the namespace of your ontology (like the ones used to add metadata). It works really well, and my preference for LODE over Parrot is a matter of style.
  • OWLDoc: A NeOn Toolkit plug-in that generates Javadoc-style documentation from your .owl file. I don’t personally like it much, as customizing it is a bit of a pain.

Once you have your html template from one of these tools (with all the concepts of the ontology fully covered), you should add sections describing an overview of the model and examples. My suggestion is to follow the structure of W3C documents, namely:

  1. Title and date of the release.
  2. Metadata: Authors, contributors, version, imported ontologies, license, link to previous version, link to the latest version.
  3. Abstract: small summary of your ontology in 2 lines. I recommend pointing to the owl file here as well.
  4. Table of contents of your html document.
  5. Introduction: provide context to the ontology. What are its goals and the benefits of using it?
  6. Namespace declarations: Namespace URIs of all the vocabularies used within the document (this could be found at the end as well).
  7. Overview of classes and properties: Very small section with the list of classes and properties of the ontology, to make navigation easier for the reader.
  8. Description: Diagram of the ontology concepts, relationships and how they are related to each other. Usage examples might help clarify things as well.
  9. Cross reference section: this is the section automatically generated by the tools covered above. Just copy what they generated :)
  10. References.
  11. Acknowledgements. Especially remember to include the developers of the tools you have used.

Want to see some examples? Check PROV (W3C), some commonly used vocabularies like foaf or Dublin Core (which cover the points listed above with their own structure) or some of the ontologies I’ve been publishing, like p-plan or wf-motifs. Note that the order in which the points of the list appear is not mandatory. Modify it in order to make your ontology easier to use for the end user!
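If you prefer to start from an empty page rather than from a tool’s output, the structure listed above can be scaffolded with a few lines of Python. This is only a sketch under my own assumptions: the section names follow the list, but the function, its parameters and the HTML layout are hypothetical and not what LODE, Parrot or OWLDoc produce.

```python
from html import escape
from typing import Dict

# Sections that follow the title and metadata block, in the suggested order.
SECTIONS = [
    "Abstract",
    "Table of contents",
    "Introduction",
    "Namespace declarations",
    "Overview of classes and properties",
    "Description",
    "Cross reference",
    "References",
    "Acknowledgements",
]


def slug(name: str) -> str:
    """Turn a section name into an anchor id, e.g. 'Cross reference' -> 'cross-reference'."""
    return name.lower().replace(" ", "-")


def build_skeleton(title: str, release_date: str, metadata: Dict[str, str],
                   cross_reference_html: str = "") -> str:
    """Return an HTML skeleton with one heading per suggested section."""
    parts = [f"<h1>{escape(title)}</h1>", f"<p>Release: {escape(release_date)}</p>", "<dl>"]
    for key, value in metadata.items():
        parts.append(f"  <dt>{escape(key)}</dt><dd>{escape(value)}</dd>")
    parts.append("</dl>")
    for name in SECTIONS:
        parts.append(f'<h2 id="{slug(name)}">{escape(name)}</h2>')
        if name == "Table of contents":
            # Link to every other section so the page is easy to navigate.
            parts.append("<ul>")
            parts.extend(
                f'  <li><a href="#{slug(s)}">{escape(s)}</a></li>'
                for s in SECTIONS if s != name
            )
            parts.append("</ul>")
        elif name == "Cross reference":
            # This is where the tool-generated term descriptions would be pasted.
            parts.append(cross_reference_html or "<!-- paste the generated section here -->")
        else:
            parts.append("<!-- TODO: write this section -->")
    return "\n".join(parts)
```

The cross reference section is passed in as ready-made HTML because, as described above, that is the part the documentation tools generate for you; everything else is a placeholder to be written by hand.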

This is part of a tutorial divided into 7 parts:

  1. Overview of the tutorial.
  2. (Reqs addressed A1(partially), A2, A3, A4, P1) Publishing your vocabulary at a stable URI using RDFS/OWL.
  3. (Reqs addressed P2, P3). How to design a human readable documentation.  (this post)
  4. Extra: A tool for creating html readable documentation
  5. (Reqs addressed P4). Dereferencing your vocabulary.
  6. (Reqs addressed A1 (partially)). Dealing with the license. (To appear)
  7. (Reqs addressed A5, P5). Reusing other vocabularies. (To appear)


Posted in e-Science, Linked Data

Finding commonalities in different scientific experiments

Posted by dgarijov on April 30, 2013

A scientific workflow can be defined as the computational steps required for executing an in silico experiment. Scientific workflows are similar to laboratory protocols. The only difference between them is that in scientific workflows scientists run their simulations with a computer instead of in a laboratory. These kinds of workflows are becoming increasingly important for three reasons:

  1. They help the scientist researching the experiment to repeat the results and convince his/her colleagues of the validity of the method followed.
  2. They expose the inputs, intermediate results, outputs and codes of the experiment. In this regard they can be considered the log of the research, which is very useful for the reviewers of the work.
  3. They allow other scientists to reproduce (and reuse) the method of the experiment, since they just have to rerun the original experiments or use other input data.

Feature selection workflow

On the left of this post there is an example of a workflow for feature selection: each of the steps represents a computational method that performs some operation on the input data received from the previous step. The initial input is a dataset and a set of words, and the final output is the selected features.

Repositories of workflows like myExperiment, CrowdLabs and Galaxy have been created for scientists to share their scientific workflows, so a reasonable question for a scientist would be: how do these workflows relate to each other? Which are the most popular workflow fragments in the dataset? Has there really been reuse among the different workflows?

One way to find these popular workflow fragments is to apply graph matching techniques. If we represent each workflow as a directed acyclic graph, we can apply existing methods to see how a set of workflows overlaps with other similar workflows in the repository. At UPM we have recently been working on this problem, and the technique we have used is described in our latest paper accepted at K-CAP. The idea is to obtain the best context-free grammar encoding all the workflows of the repository. The rules of the grammar will be the most common fragments among the workflows, which is exactly what we are looking for. If you want more information and examples, I encourage you to have a look at the paper.
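To give a feeling for what “common fragments” means, here is a deliberately simplified Python sketch. It is not the grammar-based technique from the paper: it only counts single edges (pairs of consecutive steps) shared by several workflows, and the workflow and step names are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]   # (producer step, consumer step)
Workflow = Set[Edge]     # a DAG stored as its set of edges


def common_fragments(workflows: Dict[str, Workflow],
                     min_support: int = 2) -> List[Tuple[Edge, int]]:
    """Return the edges that appear in at least `min_support` workflows.

    Each workflow is a set, so an edge is counted once per workflow.
    Results are sorted by decreasing count, then alphabetically.
    """
    counts = Counter(edge for edges in workflows.values() for edge in edges)
    frequent = [(edge, n) for edge, n in counts.items() if n >= min_support]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Hypothetical repository of three small text-processing workflows.
repository = {
    "wf1": {("Dataset", "StopWords"), ("StopWords", "Stemming"),
            ("Stemming", "FeatureSelection")},
    "wf2": {("Dataset", "StopWords"), ("StopWords", "Stemming"),
            ("Stemming", "Classifier")},
    "wf3": {("Dataset", "Tokenize"), ("Tokenize", "Stemming"),
            ("Stemming", "FeatureSelection")},
}

for edge, support in common_fragments(repository):
    print(f"{edge[0]} -> {edge[1]}: used in {support} workflows")
```

A real fragment miner has to find larger subgraphs than single edges and cope with steps that are named differently but do the same thing, which is where the graph matching and grammar encoding mentioned above come in.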

Posted in e-Science, Provenance, Wings

Research Objects for dummies

Posted by dgarijov on April 14, 2013

Last week a new W3C community and business group was approved: the Research Object for Scholarly Communication Community Group. But what is a Research Object? Why do we need them?

In a nutshell, a Research Object is a wrapper of interconnected resources and metadata bundled together to offer a context and materials for some purpose. It could be a set of papers that represent the state of the art of a project proposal, the experiments performed in a scientific experiment or even a post about Research Objects and its associated materials.

The Research Object primer explains with some examples the Workflow-Centric Research Objects, where the focus is to describe an experiment with its workflow specification and its relation to inputs, intermediate results, outputs and final publications. If you want to browse more information about the RO Model I recommend browsing this documentation.

Do you want to see an example? This RO was created from a myExperiment workflow execution, including a bundle with the provenance information, a sketch and the specification of the workflow. It even has the hypothesis the experiment aims to check.

But why is this useful? Well, having all the associated resources of an experiment scattered across the web makes it difficult for the consumer to understand the complexity of the work. For example, imagine that I have this paper, describing an experiment (link). The paper shows some tables and plots, which are made from the curated results of this web page (link), not linked from the paper. But then I also have an execution of a workflow (link), which shows a diagram of the workflow and provides pointers to all intermediate and final (non-curated) data. The scientists who ran the initial experiment for the paper are different from those who ran the workflow, and the attribution should be noted. Also, the links and dependencies between the data can’t be inferred just by accessing the resources. With everything included in a Research Object, one can understand the role of each resource in the experiment and communicate it to other researchers.
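As a rough illustration of the “wrapper” idea, the paper, the curated results and the workflow execution from the example could be bundled like this in Python. The class and field names are my own simplification and do not follow the vocabulary of the actual RO model; the URIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    uri: str      # where the resource lives on the web
    role: str     # what it is in the experiment, e.g. "paper"
    creator: str  # who should get the attribution


@dataclass
class ResearchObject:
    uri: str
    title: str
    resources: List[Resource] = field(default_factory=list)

    def add(self, uri: str, role: str, creator: str) -> None:
        """Aggregate one more resource together with its role and attribution."""
        self.resources.append(Resource(uri, role, creator))

    def by_role(self) -> Dict[str, List[str]]:
        """Group the aggregated resource URIs by the role they play."""
        grouped: Dict[str, List[str]] = {}
        for resource in self.resources:
            grouped.setdefault(resource.role, []).append(resource.uri)
        return grouped

    def creators(self) -> List[str]:
        """List everyone who contributed a resource, without duplicates."""
        return sorted({resource.creator for resource in self.resources})


ro = ResearchObject("http://example.org/ro/1", "Example experiment")
ro.add("http://example.org/paper.pdf", "paper", "Alice")
ro.add("http://example.org/results.html", "curated results", "Alice")
ro.add("http://example.org/run/42", "workflow execution", "Bob")
```

With the bundle in place, `ro.by_role()` answers “what is each resource for?” and `ro.creators()` answers “who did what?”, the two questions that are hard to answer when the links are scattered across the web.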

Posted in e-Science

The Beyond of the pdf Workshop

Posted by dgarijov on March 26, 2013

The Second Beyond the PDF workshop finally took place last week in Amsterdam (fortunately I got travel support from the organizers, so I was able to attend the full event). If I had to pick a word to describe the workshop, it would be “different”. As Paul Groth (one of the chairs) summarizes in his post, the audience was heterogeneous: there were people from the biomedical, humanities, social sciences and physical sciences domains, belonging to different types of organizations (ranging from academic to governmental). Publishers and editors were also present, and many different tools, visions and ideas were presented to improve the future of scholarly communication. This whole context was a bit different from what one is used to seeing in other conferences, where you find people doing similar things to what you do, and you discuss your research rather than how to communicate it to others. Here people were not afraid to tell publishers and editors why they thought the system was broken, laying out their arguments in an informal, friendly environment.

Another interesting fact was the “second screen” showing the twitter wall live. People were very active, highlighting the interesting quotes from the talks and initiating debates in parallel to all the sessions. Even today the tag #btpdf2 is still active. Congrats to all the organizing staff!

While the speakers were presenting, some artists were drawing the Beyond the PDF wall

Detailed summary and highlights

The program of the workshop is available here. Below you can see the summary and highlights from the different sessions and interesting quotes I wrote down in my notes.

Day 1:

The day started with a Keynote by Kathleen Fitzpatrick, who explained how the book is not dead, although the academic book is kind of dying. The blog could be a replacement, since it is a kind of alternative way to publish the resources. You are able to get comments from the community, feedback suggestions and support. Why couldn’t we be our own publishers?

The current reviewing process has concerns; could it be part of what is broken? Bias and flaws are not unusual, and reviewing requires a great deal of labor for which we normally don’t receive much credit. As an example, she explained how the book she had been writing had more impact in blog form than in its final published format.

Finally, she remarked how important the online communities are. If you build a tool or a service without a community, people will not just come. You have to build a community first. Some interesting quotes: “Publishers will have to focus more on services and less on selling digital objects”. “We need filters, not gatekeepers” (referring to publishers and editors). “The network is not a threat. It helps to reach more people”.

Laura Czerniewicz and Michelle Willmers followed the keynote with a session on context. They highlighted the dangers of complete open access: will it become a flood of content? There is a need for a reward system. What do authors get from open access? Editors are gatekeepers. Another important factor is that in the end only journal articles are considered when judging the validity of a researcher. Tweets, blogs, talks, workshops and conferences are ignored, even when they could have had more impact than the actual journals. In most cases journal articles are the tip of the iceberg.

Next, in the Vision session, Nathan Jenkins introduced Authorea, a very cool tool, built on Ruby on Rails, for writing articles online without having to deal with LaTeX compilation. Mercé Crossas presented Dataverse, a portal for archiving data results for citation purposes, motivated by the volatility of the links in old papers. Amalia S. Levi explained how in historical research a lot of the data already existed, but the links were missing. (This reminds me of some conversations that I’ve had recently about how papers are cited in the scientific community. It turns out that sometimes this is the case nowadays as well). Joost Kircz hit the spot in his speech (in my opinion): Are we going Beyond the pdf or Beyond the essay? An enhanced pdf is still stuck in the page paradigm. Papers represent structured or randomized knowledge that should be browsed, and that is often not possible in a book. I liked his ending statement: “Publishing is not a science, but is a craft”. Lisa Girard followed with StemBook, a portal where authors can keep their findings up to date, allowing the community to review their work in stem cell biology. An interesting thing about it is that people can upload their protocols and annotate them using Domeo, aligned with the Annotation Ontology. Paolo Ciccarese followed with an overview of that ontology, summarizing their efforts and collaboration in the community in order to come up with a highly adopted standard.

As a small comment to this session, I think it is a bit curious that so many finished (or nearly finished) tools were presented in a “Vision” session. It would have been interesting to see how some of the presenters picture the future of publication and how to get there (either by using some of the presented tools or not).

After lunch there was a session on new models for content dissemination, where Theodora Bloom started by stating very clearly what the main current problems for dissemination are:

  1. Access to what you want to read and use
  2. Publication venue as a measure of quality.
  3. Having to repeat the cycle of publication in different journals
  4. Poor links for underlying data.

She also explained how in PLOS ONE research leading to negative results is also published, but hardly anyone submits it. I really liked this; it reminded me of a quote from Thomas Edison: “I have not failed. I’ve just found 10,000 ways that won’t work”. If an idea looks promising but doesn’t work as expected, it’s important to share it with the community so that nobody else repeats the same mistake. Who knows, it might even inspire other people to come up with a better solution.

Brian Hole followed, talking about metajournals and the social contract of science, and combining them in the idea of an Ultrajournal.

The second part of the session was introduced by a lively Jason Priem, who talked about how the printing press had been the first revolution for disseminating content and the Internet the second one. According to him, we should mine the network in order to produce the appropriate filters for the information. Keith Collier followed by introducing Rubriq, an independent peer-review system that aims to decouple peer review from publication. Next, Kaveh Bazargan raised the current concerns about typesetters and argued that we should get rid of them. Instead, XML or blog posts should take over the role of typesetters, giving more freedom to the writer. Finally Peter Bradley talked about, an open source platform for the evaluation of information, and Alf Eaton introduced PeerJ, an open access peer-reviewed journal with metadata for all its papers.

The final session of the day was about the business case, where three representatives explained different business models and three stakeholders plus the audience asked questions about them. Wim van der Stelt argued that at Springer they are not resisting change, and Mark Hahnel argued that authors should be able to receive credit for their data, as happens in FigShare. The discussion brought some interesting topics to the table, such as the fact that scholarly communication per se is not profitable and needs government funding, how to move from the journal impact factor to a metric that is meaningful (and convince the government to support it), or how to share our work with those who can’t afford to pay for it. Another important observation is the number of hours spent by researchers on rejected papers per year, which adds up to 11-16 million!

The day ended with the session on demos and posters. Marco Roos and Aleix Garrido were by my side talking about the wf4ever project, while I spoke a bit about the work done reproducing the TB-Drugome workflow. The slides can be seen here.

Day 2

Carol Teinoir started the day by trying to analyze and understand the needs of scholars. She gave a lot of metrics about the main reasons for scholars not to share their data (“I have not the time”, or “I’m not required to” were among the top five), and how successful researchers turn out to read more. She also gave metrics on who is sharing data versus who is willing to share their data, and analyzed how e-books have influenced printed copies. An interesting fact: in Australia, e-books have almost replaced printed copies.

The “Making it happen” session was next. Asunción Gómez Pérez talked about the SEALS evaluation platform, which allows reproducing the different tests of an experiment automatically. Graeme Hirst spoke about usability, the “neglected dimension”, and how we are “forced” to use systems with poor usability like Word and LaTeX. The gain should be greater than the pain when writing a paper. Rebecca Lawrence followed, talking about data review and how to share data: the requirement of a data sharing plan, how things should be done according to standards, where to find the funding for the previous two, how we should refuse papers whose data is not accessible, and how a reviewer should have access to all the materials in order to properly review the paper.

Asun Gómez Pérez talking about the SEALS platform

The session finished with several short presentations that can be accessed here. Anita de Waard insisted on the need for a new reward system, although no further details were given. I also liked the talk by Melissa Haendel on reproducibility in science, even if she didn’t talk about the role of scientific workflows in reproducibility. Another interesting tool was ORCID, a registry for scholars with author disambiguation. Gully Burns ended the session by analyzing how the different parameters change an experiment.

We broke out into different sessions during lunch. I went to the one on reproducibility, where we shared the different issues that currently exist when trying to store and rerun experiments. However, unlike the data citation group, we didn’t come up with a manifesto.

The next session dealt with new models for the evaluation of research, where the organizer, Carole Goble, proposed a little role play. Each of the 6 participants wore a different hat representing the role of their institution. Phil Bourne was the institutional dean (officer hat); Victoria Stodden (with the typical English bureaucratic hat, on the right of the picture) represented the public funding agencies; Christine Borgman represented the digital libraries (second hand cowboy hat); Jan Reichelt, with the “cool” hat on the left, represented the commercial funders; Scott Edmunds represented the publishers with a top hat (unfortunately he wasn’t wearing it in the picture); and Steve Pettifer represented the academic role (he can’t be seen properly in the picture).

Roles in academic funding and dissemination of research

The summary of the discussion was as follows, for each role:

  • Funding agencies: they are not interested in the evaluation of the academic research. It should be driven by the community.
  • The dean: I’ll quote the acting by Phil:

“Oh, we have produced a 200 page report about the possible changes that we could do to the system.

–  And what are you going to change?

– Very little!”

It’s events like this one that provide the new ideas.

  • Publishers and academia: death to impact factor.
  • Commercial funders: code and methods matter. They should be treated as first-class citizens (I couldn’t agree more).
  • Digital libraries: The standards are problematic. Tools don’t connect, and interoperability is an issue.

The final session, Visions for the future, grouped a set of flash talks from very different people. The most successful ones were given by Carole Goble (winner), who looked at the publication of data from a software engineering perspective and argued that we could do several releases of the data, as happens with software: “Don’t publish, release!”; Stian Haklev, with his proposal to create an alternative to Google Scholar (I liked his answer to Ed Hovy, when he asked what was new in his proposal: “There is nothing new about this, and that is precisely what is new, that we are just able to make it”); Jeffrey Lancaster, with his proposal to change the CSL citation styles; and Kaveh Bazargan, who demanded that publishers release the XML of the papers instead of the pdf. The job of a publisher should be to disseminate content, and not to dictate how we read the papers. He even gave a live demo of a tool that could render the pdf from the XML in several different ways, depending on the user’s preferences.

I also found interesting the proposal by Alejandra Gonzalez-Beltran, who talked about isa-tools, a platform used by pharmaceutical companies for the collection, curation and reuse of datasets; and of course the idea of Olga Giraldo, who wants to provide the means to transform laboratory protocols into nanopublications and provide checklists to organize them properly. Below you can see a picture of the participants in the session:

Participants of the “Make it Happen” session.

And that’s all! In summary, I think it was a nice event with a lot of discussion and claims from academia to editors, publishers and funding agencies. Of course, I guess that part of the motivation of the workshop is for them to take away ideas on how the system could be changed, plus an overview of the state of the art in tools and platforms that they could incorporate into their systems.

Results, next steps?

There was a lot of debate but no session on what the next steps should be. I think this would have been an interesting thing to have, although it is difficult to fit it all in a 2-day event. As for results, some of the people participating in the breakout sessions wrote the “Data citation manifesto”, which I would really like people to follow in order to give credit for data (link here, please share!).

Also, the idea of an open Google Scholar (an open alternative, as OpenStreetMap is to Google Maps) looks promising. I hope it gets implemented!

And finally, some personal thoughts. After attending the event I realized that, as a computer scientist working to enable reproducibility and reusability of other people’s work, sometimes in my own area we don’t follow the reproducibility principles: papers about tools that are not available after a while, published algorithms without an implementation, unstable links, etc. I have always tried to include a reference to the code and evaluations done in my work so that reviewers can access them, but I might start using some of the tools shown in the workshop for the sake of preservation.

The map with the main topics discussed on day 2

Posted in e-Science, Workshop

Provenance Corpus ready!

Posted by dgarijov on December 12, 2012

This week there was an announcement about the deadline extension for BIGPROV13. Apparently, some authors are preparing new submissions for next week. In previous posts I highlighted how the community has been demanding a provenance benchmark to test different analyses on provenance data, so today I’m going to describe how I have been contributing to the publication of publicly accessible provenance traces from scientific experiments.

It all started last year, when I did an internship at the Information Sciences Institute (ISI) to reproduce the results of the TB-Drugome experiment, led by Phil Bourne’s team in San Diego. They wanted to make the method followed in their experiment accessible so that other scientists could reuse it, which requires publishing sample traces of the experiment, the templates and every intermediate output and source. As a result, we reproduced the experiment as a workflow using the Wings workflow system, we extended the Open Provenance Model (OPM) to represent the traces as the OPMW profile, and we described here the process necessary to publish the templates and traces of any workflow as Linked Data. Lately we have aligned the previous work with the emerging PROV-O standard, providing serializations of both OPM and PROV for each workflow that is published. You can find the public endpoint here, and an exemplar application that dynamically loads the data of a workflow into a wiki can be seen here.
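
For readers who have never looked at a provenance trace, here is a small Python sketch of what one boils down to. It is not OPMW and not the data published at the endpoint: the `ex:` identifiers and the two helper functions are invented for the example, and the only vocabulary borrowed from the standard is the PROV-O terms `prov:Activity`, `prov:used`, `prov:wasGeneratedBy` and `prov:wasAssociatedWith`. Each workflow step is recorded as a handful of triples, and a simple lineage question ("what did this result depend on?") becomes a walk over them.

```python
def record_step(triples, activity, agent, used, generated):
    """Append the PROV-style statements for one executed workflow step."""
    triples.append((activity, "rdf:type", "prov:Activity"))
    triples.append((activity, "prov:wasAssociatedWith", agent))
    for entity in used:
        triples.append((activity, "prov:used", entity))
    for entity in generated:
        triples.append((entity, "prov:wasGeneratedBy", activity))


def upstream(triples, entity):
    """Return every entity that the given entity transitively depends on."""
    found = set()
    pending = [entity]
    while pending:
        current = pending.pop()
        # Activities that generated the current entity.
        activities = [o for s, p, o in triples
                      if s == current and p == "prov:wasGeneratedBy"]
        for activity in activities:
            # Everything those activities used is upstream of the entity.
            for s, p, o in triples:
                if s == activity and p == "prov:used" and o not in found:
                    found.add(o)
                    pending.append(o)
    return found


# A made-up two-step run: filter some raw data, then plot the filtered data.
trace = []
record_step(trace, "ex:filterRun", "ex:alice",
            used=["ex:rawData"], generated=["ex:filteredData"])
record_step(trace, "ex:plotRun", "ex:alice",
            used=["ex:filteredData"], generated=["ex:plot"])

print(sorted(upstream(trace, "ex:plot")))  # ['ex:filteredData', 'ex:rawData']
```

A real trace would use full URIs and live in a triple store rather than a Python list, but the shape of the statements is the same.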

I have also been working with the Taverna people in the wf4Ever project to create a curated repository of runs from both Taverna and Wings, compatible with PROV (since both systems are similar and extend the standard to describe their workflows). The repository, available here for anyone that wants to use it, has been submitted to the BIGPROV13 call and hopefully will get accepted.

So… now that we have a standard for representing provenance the big questions are: What do I do with all the provenance I generate? How do I interoperate with other approaches? At what granularity do I record the activities of my website? How do I present provenance information to the users? How do I validate provenance? How do I complete it? Many challenges remain to be solved before we can hit Tim Berners-Lee’s “Oh, yeah?” button on every web resource.

Posted in e-Science, Linked Data, Provenance, scientific workflows, Taverna, Wings

Late thoughts about e-Science 2012

Posted by dgarijov on November 26, 2012

After a 2 week holiday, I’m finally back to work. Before letting more time pass by, I would like to share here a small summary of the e-Science conference I attended about a month and a half ago in Chicago.

I’ll start with the keynotes. There were four in the 3 days that the conference lasted. Gerhard Klimeck (slides) introduced Nanohub, a platform to publish and use separate components and tools via user-friendly interfaces, showing how they could be used for different purposes like education or research in a scalable way. It has a lot of potential (especially since they try to make things easier through simple interfaces), but I found it curious that the notion of workflows doesn’t exist there (or is barely used).

Gregory Wilson (slides) raised a nice issue in e-Science: sometimes the main issue with the products developed by the scientific community is not that they have the wrong functionality, but that users don’t understand what these products are or how to use them. In order to address it, we should first prepare the users and then give them the tools.

The third speaker was Carole Goble (slides), who talked about reproducibility in e-Science and the multiple projects in which she is participating. She especially mentioned the wf4Ever project (where she collaborates with the OEG) and the Research Objects, the data artifacts that myExperiment is starting to adopt in order to preserve workflows and their provenance.

The last keynote was given by Leonard Smith (slides), and unlike the others (which were more computer science oriented), he presented from the point of view of a scientist who is looking for the appropriate tools to keep doing his research successfully. He talked about doing “science in the dark” (predictions over past observations) versus “science in the light” (analysis with empirical evaluations), and showed the example of meteorological predictions. Apparently the Royal Society wanted to drop the weather predictions in the past, but they were forced by users to bring them back. Leonard highlighted the importance of never giving a 100% or 0% chance in the forecasts and ended his talk asking how the e-Science community could help this kind of research. I really recommend taking a look at the slides.

As for the panels, I attended the one about operating cities and Big Data. The work presented was very interesting, but I was a bit disappointed. I haven’t been to many panels before, and I thought a panel discussion was more a discussion between the speakers and the audience rather than presentations about the speakers’ work and a longer round of questions. This does not imply that the work was bad at all, just that I missed some debate among the invited speakers.

Regarding the sessions, most of them happened in parallel. The whole program can be seen here, so I will just post those which I enjoyed the most:

  1. Workflow 1: Khalid Belhajjame presented the analysis of decay in Taverna workflows done by the wf4Ever people (slides). Definitely a good first step for those seeking to preserve workflow functionality and reproducibility. In this session I also talked about our empirical analysis of scientific workflows in order to find common patterns in their functionality (see slides).
  2. Data provenance: Beth Plale’s students (Pend Chen and You-Wei Cheah) introduced their work on temporal representation and quality of workflow traces; and Sarah Cohen-Boulakia presented her work on workflow rewriting in order to make analyses on the workflow graphs scalable. I liked all the aforementioned presentations, as they were interesting and easy to follow. However, they all shared the need for real workflow traces (they had created artificial ones for testing their approaches).
  3. Workflow 2: From this session I found relevant the work presented by Sonja Holl (slides), who talked about the approach they use to automatically find the appropriate parameters for running a workflow. Once again, she was interested in traces of real workflows, specifically from Taverna (since it is the system she had been dealing with).

In conclusion, I was very happy to attend the conference (my first one if I don’t count workshops!), even if I missed the 3-day workshops from Microsoft that happened earlier in the week. I had the chance to meet new people that I had only known through e-mail, and I talked to all the thinking heads working close to what I do.

From the sessions it also became clear to me that the community is asking for a curated benchmark of scientific workflow provenance for testing their different algorithms and methods. Fortunately I have seen a call for papers with this theme. It covers provenance in general, but in the Wf4ever project we are already planning a joint submission with more than 100 executions of different workflows from the Taverna and Wings systems. Specifically, the ones from Wings are already published online as Linked Data (see some examples here). Let’s see how the call works out!

Some of the presenters at e-Science (from left to right): Sonja Holl, Katherine Wolstencroft, Khalid Belhajjame, Sarah Cohen and me

Posted in Conference, e-Science