Category: scientific workflows

Report: International Congress on Environmental Modelling and Software (iEMSs)

Last week I attended the 9th edition of iEMSs in Fort Collins, Colorado. iEMSs is a biennial conference that brings together between 300 and 400 researchers from the software engineering, intelligent systems, environmental modeling and decision making domains (among others). Very few attendees knew about ontologies and the Semantic Web, which made it a unique opportunity to learn about the problems of other communities. Going to these kinds of events (outside of your community of expertise) has been eye opening for me in the past, and I cannot recommend it enough. Get out of your community bubble once in a while 🙂

What was I doing at iEMSs?

I attended the conference to present 3 papers about our Model Integration project (MINT). The papers give an overview of the project, in which we aim to reduce the time required to integrate models from climate, hydrology, agriculture, economics and social sciences. In addition, we introduce a new approach to describe model variables and processes using the OntoSoft software registry, and our plan to integrate Pegasus and Emely for efficient model coupling. More information is available in the conference program (hopefully our papers will soon be available in the conference proceedings as well). Overall, the presentations were well received and I was glad to learn that there is huge interest in some of the problems we are tackling, such as describing models to facilitate their reuse or enabling model coupling.

AWESOME Keynotes

One of the best parts of the conference was the keynotes. Temple Grandin started on Monday with a cry for acceptance of visual thinkers (“I see risk, other people try to measure it!”) together with the need to get closer to the infrastructure we use every day. Get out of the office and get your hands dirty once in a while!

Nick Clinton followed up on Tuesday with an introduction to Google Earth Engine (see slides). It looks like Google has invested a lot into bringing together Earth data (more than 7 PB) and infrastructure to create an environment for scientists to do their science. All for free (for researchers), using JavaScript and Python interfaces and with access to a bunch of machine learning algorithms. It’s also easy to create time lapses of areas of interest, showing how parts of the Earth have evolved over the last 30 years.

The last keynote speaker was Thomas Vilsack, former US Secretary of Agriculture under the Obama administration. This is the first keynote I have seen given by a politician, with no slides and a direct but compelling speech. The speaker tackled several problems related to modeling, from the role of science in different debates (GMOs and climate change) to the need for new sustainable solutions given the increase of population around the globe. How can we make models that convince farmers and policy makers about the long term consequences of their actions? How can models be used to increase the productivity per individual acre? Can we find solutions so we become better consumers of food? How can we reduce and reuse food waste?


Given that many sessions happened in parallel, this is a personal view of the highlights of the talks I attended:

  • Ibrahim Demir’s FloodAI is a very cool approach that mixes science with visual explanations and early detection observations. They have done an impressive amount of work to be able to communicate their results with chat bots. No wonder he won a conference award!
  • Alexei Voinov described surveys, tools and methods for participatory modeling. Remaining challenges are a) people tend to use the tools and models they are more familiar with, rather than experiment with new ones in different contexts; b) failure in method execution is not reported.
  • Ruth Falconer (University of Abertay) and the use of videogames in environmental modeling.

  • Eric Hutton (CSDMS) introduced PYMT, a model coupling framework in Python.
  • RODOS, a European decision support system designed as a consequence of Chernobyl’s nuclear accident. There are so many different processes involved, from wind to soil deposition of contamination.
  • The Nexus tools platform for model comparison. Currently they have 84 models and counting!
  • Sarah Mubareka’s report on integration of models of biomass supply. Creating accurate indicators for estimating biomass in Europe is a real challenge, as every country uses different definitions and metrics.
  • Natalia Villanueva’s interface for scenario simulation in the Rio Grande. I really like the effort they have put into making their results understandable by stakeholders.
  • TMDL, a mechanism to remediate impaired water bodies

See you in Brussels 2020!

EarthCube All Hands Meeting (ECAHM 2018)

Last week I attended the annual EarthCube All Hands Meeting (ECAHM) in Alexandria, Virginia, just outside Washington, DC. Since it’s been a while since my last post, I think it would be interesting to share my notes and highlights here for anyone who missed the event.

ECAHM meetings are usually very enriching experiences, as they bring together a variety of researchers from different fields related to geosciences, ranging from computer scientists to volcanologists or marine biologists. The purpose of the meeting is to gather the community together and hear everyone report back from their EarthCube NSF funded projects, which are targeted towards improving cyber-infrastructure in the geosciences. As a computer scientist, I think this is a great meeting to attend for two main reasons: first, you always learn something new, even if it’s not in your domain. Second, people are extremely grateful for your contributions, as you are helping them become more effective when doing their science.

So, what was I doing at ECAHM 2018?

I attended the meeting to present our latest progress in OntoSoft, a distributed software metadata registry we created at ISI to help scientists describe their software. You can see the poster abstract online (and soon the poster itself). I also participated in a “speed-dating session”, where I got to discuss for half an hour how to describe software with a domain scientist; and I substituted for Yolanda Gil in a panel on external partnership opportunities, where I presented the Open Knowledge Network initiative. This effort, led by NITRD, is a great opportunity to create a shared open knowledge graph that both research and industry would use, refine and curate. The idea is that this knowledge graph becomes part of the US infrastructure the same way supercomputers currently are, so anyone could benefit from it and also contribute to it. It looks like the NSF is keen to pursue this objective too.

Two colleagues of mine also presented other initiatives I am involved in. Deborah Khider showcased our efforts towards structuring metadata and creating standards in the paleoclimate sciences, together with a set of tools that a team of paleoclimate scientists has developed to work with that structured data. She also managed to mix Star Wars and Star Trek themes in her poster and presentation, which was well received by the attendees (I think everyone stopped at her poster).

Jo Martin presented the IS-GEO research collaboration network, where we are bringing in experts from geosciences and intelligent systems to foster new collaborations. We hold a monthly meeting in which a different researcher talks about their latest work each time! Check it out here:

About the keynotes:

As expected, keynotes at ECAHM are nothing like those at venues such as AAAI or IUI. The first speaker was Dean Pesnell (NASA), who presented the research carried out by his team on studying the sun and sunspots. Why is this related to geosciences? Because the sun could be considered “our ground truth for the universe”, and anything related to its activity has many implications for every field of the geosciences. Their main problem is how to analyze the amount of data that they have. Each of their datasets may contain several hundred million images, so proper metadata is crucial (you don’t want to find out you have downloaded 300 million images for nothing). Dean showed some impressive videos of their observations of the sun, as well as their pipelines to handle “very big data” analyses.

The second speaker was Sarah Stamps, and she talked about continental rifting and the Tanzania Volcano Observatory. Apparently, geologists are among the few people in the world who would run towards an erupting volcano, instead of away from it. Sarah described the EARS system (East African Rift System) they are setting up, and how they teamed up with CHORDS to enable real time analysis of the observations they measure in the field. Thanks to her work, they are developing an early warning system for hazard detection! Sarah was departing soon to set up a few more observing stations in the field, so best of luck!!

The third speaker was Caroline S. Wagner, who gave some metrics on the social side of interdisciplinary collaboration. Science has become increasingly collaborative and team based, and the number of international collaborations has doubled in the past years. The number of countries producing 95% of research has gone from 7 to 15, which indicates we are moving in the right direction. However, more than 50% of the articles are currently never cited. A few takeaways from this talk are: 1) International collaborations start face to face, so go to different events and meet new people; 2) Diverse teams usually take longer to be productive, as people don’t usually speak the same language. Be patient!!; 3) Work towards a solution, not towards interdisciplinary teams. Interdisciplinarity should be the means to an end, not the end itself.

Other highlights

Below are some additional highlights I found interesting for the EarthCube community.

  • Eva Zanzerkia reported on the NSF 10 Big Ideas, which nicely summarize the funding interests of the agency for the coming years. The report has been out for more than a year, but it’s never too late to catch up!
  • Doug Fils presented their plan for turning P418 into something bigger. In case you don’t know, P418 currently tracks the metadata that datasets expose and aggregates it in a search engine (a search engine for scientific data). Future plans are to ingest other types of resources and make the code base stable.
  • Interesting working lunch idea: A napkin drawing exercise. Do you know how to present your idea with a simple sketch?
  • Simon Goring (and Scott Peckham): How do we measure success in a huge program such as EarthCube?
  • PANGEO: Big data in the geosciences (but without reinventing the wheel!)
  • ASSET: Or how to incorporate existing tools into your workflows by drawing sketches! Workflows are important! Two different studies may obtain different results even if the original data is the same.

  • I got an award for community service 🙂

Elevator pitch

As a PhD student, many people have asked me about the subject of my thesis and the main ideas behind my research. As a student you always think you are very clear about what you are doing, at least until you have to actually explain it to someone who is not related to your domain. In fact, it is about using the right terminology. If you say something like “Oh yeah, I am trying to detect abstractions on scientific workflows semi-automatically in order to understand how they can better be reused and related to each other”, people will look at you as if you didn’t belong to this planet. Instead, something like “detecting commonalities in scientific experiments in order to study how we can understand them better” might be more appropriate.

But last week the challenge was slightly different. I was invited to give an overview talk about the work I have been doing as a PhD student. That means covering not only what I am doing, but also why I am doing it and how it all fits together, without going into the details of every step. It may seem like an easy task, but it kept me thinking more than I expected.

As I think some people might be interested in a global overview, I want to share the presentation here as well: Have a look!

Provenance Corpus ready!

This week there was an announcement about the deadline extension for BIGPROV13. Apparently, some authors are preparing new submissions for next week. In previous posts I highlighted how the community has been demanding a provenance benchmark to test different analyses on provenance data, so today I’m going to describe how I have been contributing to the publication of publicly accessible provenance traces from scientific experiments.

It all started last year, when I did an internship at the Information Sciences Institute (ISI) to reproduce the results of the TB-Drugome experiment, led by Phil Bourne’s team in San Diego. They wanted to make the method followed in their experiment accessible so that other scientists could reuse it, which requires publishing sample traces of the experiment, the templates and every intermediate output and source. As a result, we reproduced the experiment as a workflow using the Wings workflow system, we extended the Open Provenance Model (OPM) to represent the traces as the OPMW profile, and we described here the process needed to publish the templates and traces of any workflow as Linked Data. Lately we have aligned the previous work with the emerging PROV-O standard, providing serializations in both OPM and PROV for each workflow that is published. You can find the public endpoint here, and an example application that dynamically loads the data of a workflow into a wiki can be seen here.
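
To give a flavour of what such a trace looks like, here is a minimal sketch in plain Python that describes one workflow step with PROV-O terms and prints it as N-Triples. It is only an illustration, not the OPMW profile or the code we actually used: the `http://example.org/run/` namespace, the step name and the file names are all hypothetical, and only the `prov:` terms (Activity, Entity, used, wasGeneratedBy) come from the standard.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/run/"  # hypothetical namespace, for illustration only


def describe_step(step, inputs, outputs):
    """Return (subject, predicate, object) triples for one executed step."""
    triples = [(EX + step, RDF_TYPE, PROV + "Activity")]
    for name in inputs:
        # Each input is an entity that the step used.
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + step, PROV + "used", EX + name))
    for name in outputs:
        # Each output is an entity generated by the step.
        triples.append((EX + name, RDF_TYPE, PROV + "Entity"))
        triples.append((EX + name, PROV + "wasGeneratedBy", EX + step))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical step and artifact names.
trace = describe_step("align_sequences", ["input_sequences"], ["alignment"])
print(to_ntriples(trace))
```

A real trace would also link each execution back to the template step it instantiates, which is the part that OPMW adds on top of the generic provenance vocabulary.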

I have also been working with the Taverna people in the Wf4Ever project to create a curated repository of runs from both Taverna and Wings, compatible with PROV (since both systems are similar and extend the standard to describe their workflows). The repository, available here for anyone who wants to use it, has been submitted to the BIGPROV13 call and will hopefully get accepted.

So… now that we have a standard for representing provenance, the big questions are: What do I do with all the provenance I generate? How do I interoperate with other approaches? At what granularity do I record the activities of my website? How do I present provenance information to the users? How do I validate provenance? How do I complete it? Many challenges remain to be solved before we can hit Tim Berners-Lee’s “Oh yeah?” button on every web resource.