Report: International Congress on Environmental Modelling and Software (iEMSs)

Last week I attended the 9th edition of iEMSs in Fort Collins, Colorado. iEMSs is a biennial conference that brings together between 300 and 400 researchers from software engineering, intelligent systems, environmental modeling and decision making domains (among others). Very few attendees knew about ontologies and the Semantic Web, which made it a unique opportunity to learn about the problems of other communities. Going to this kind of event (outside your community of expertise) has been eye-opening for me in the past, and I cannot recommend it enough. Get out of your community bubble once in a while :)

What was I doing at iEMSs?

I attended the conference to present 3 papers about our Model Integration project (MINT). The papers give an overview of the project, in which we aim to reduce the time required to integrate models from climate, hydrology, agriculture, economics and social sciences. In addition, we introduce a new approach to describe model variables and processes using the OntoSoft software registry, and our plan to integrate Pegasus and EMELI for efficient model coupling. More information is available in the conference program (hopefully our papers will soon be available in the conference proceedings as well). Overall, the presentations were well received and I was glad to learn that there is huge interest in some of the problems we are tackling, such as describing models to facilitate their reuse or enabling model coupling.

AWESOME Keynotes

The keynotes were one of the best parts of the conference. Temple Grandin started on Monday with a call for the acceptance of visual thinkers (“I see risk, other people try to measure it!”), together with the need to get closer to the infrastructure we use every day. Get out of the office and get your hands dirty once in a while!

Nick Clinton followed up on Tuesday with an introduction to Google Earth Engine (see slides). It looks like Google has invested a lot in bringing together Earth data (more than 7 PB) and infrastructure to create an environment for scientists to do their science. All for free (for researchers), using JavaScript and Python interfaces and with access to a bunch of machine learning algorithms. It’s also easy to create time lapses of areas of interest, which show how parts of the Earth have evolved over the last 30 years.

The last keynote speaker was Thomas Vilsack, former US Secretary of Agriculture under the Obama administration. This is the first keynote I have seen given by a politician, with no slides and a direct but compelling speech. The speaker tackled several problems related to modeling, from the role of science in different debates (GMOs and climate change) to the need for new sustainable solutions given the increase of population around the globe. How can we make models that convince farmers and policy makers about the long term consequences of their actions? How can models be used to increase the productivity per individual acre? Can we find solutions so we become better consumers of food? How can we reduce and reuse food waste?


Given that many sessions happened in parallel, this is a personal view with the highlights of the talks I attended:

  • Ibrahim Demir’s FloodAI is a very cool approach that mixes science with visual explanations early detection observations. They have done an impressive amount of work to be able to communicate their results with chat bots. No wonder he won a conference award!
  • Alexei Voinov described surveys, tools and methods for participatory modeling. The remaining challenges are a) people tend to use the tools and models they are most familiar with, rather than experiment with new ones in different contexts; b) failures in method execution are not reported.
  • Ruth Falconer (University of Abertay) and the use of videogames in environmental modeling.

  • Eric Hutton (CSDMS) introduced PYMT, a model coupling framework in Python.
  • RODOS, a European decision support system designed as a consequence of the Chernobyl nuclear accident. There are so many different processes involved, from wind to soil deposition of contamination.
  • The Nexus tools platform for model comparison. Currently they have 84 models and counting!
  • Sarah Mubareka’s report on the integration of models of biomass supply. Creating accurate indicators for estimating biomass in Europe is a real challenge, as every country uses different definitions and metrics.
  • Natalia Villanueva’s interface for scenario simulation in the Rio Grande. I really like the effort they have put into making their results understandable by stakeholders.
  • TMDL, a mechanism to remediate impaired water bodies

See you in Brussels 2020!

EarthCube All Hands Meeting (ECAHM 2018)

Last week I attended the annual EarthCube All Hands Meeting (ECAHM) in Alexandria, Virginia. Since it has been a while since my last post, I think it would be interesting to share my notes and highlights here for anyone who missed the event.

ECAHM meetings are usually very enriching experiences, as they bring together a variety of researchers from different fields related to geosciences, ranging from computer scientists to volcanologists or marine biologists. The purpose of the meeting is to gather the community together and hear everyone report back from their NSF-funded EarthCube projects, which are targeted towards improving cyber-infrastructure in the geosciences. As a computer scientist, I think this is a great meeting to attend for two main reasons: first, you always learn something new, even if it’s not in your domain. Second, people are extremely grateful for your contributions, as you are helping them become more effective when doing their science.

So, what was I doing at ECAHM 2018?

I attended the meeting to present our latest progress in OntoSoft, a distributed software metadata registry we created at ISI to help scientists describe their software. You can see the poster abstract online (and soon the poster itself). I also participated in a “speed-dating session”, where I got to discuss for half an hour how to describe software with a domain scientist; and I stood in for Yolanda Gil on a panel about external partnership opportunities, where I presented the Open Knowledge Network initiative. This effort, led by NITRD, is a great opportunity to create a shared open knowledge graph that both research and industry could use, refine and curate. The idea is that this knowledge graph becomes part of the US infrastructure the same way supercomputers currently are, so anyone could benefit from it and also contribute to it. It looks like the NSF is keen to pursue this objective too.

Two colleagues of mine also presented other initiatives I am involved in. Deborah Khider showcased our efforts towards structuring metadata and creating standards in the paleoclimate sciences, together with a set of tools that a team of paleoclimate scientists has developed to work with that structured data. She also managed to mix Star Wars and Star Trek themes in her poster and presentation, which was well received by the attendees (I think everyone stopped at her poster).

Jo Martin presented the IS-GEO research collaboration network, where we are bringing in experts from geosciences and intelligent systems to foster new collaborations. We hold a monthly meeting where a different researcher talks about their latest work each time! Check it out here:

About the keynotes:

As expected, keynotes at ECAHM are nothing like venues such as AAAI or IUI. The first speaker was Dean Pesnell (NASA) and he presented the research carried out by his team on studying the sun and sun spots. Why is this related to geosciences? Because the sun could be considered “our ground truth for the universe”, and anything related to its activity has many implications in any of the fields of geosciences. Their main problem is how to analyze the amount of data that they have. Each of their datasets may contain several hundred million images, so proper metadata is crucial (you don’t want to find out you have downloaded 300 million images for nothing). Dean showed some impressive videos of their observations of the sun, as well as their pipelines to handle “very big data” analyses.

The second speaker was Sarah Stamps, and she talked about continental rifting and the Tanzania Volcano Observatory. Apparently, geologists are among the few people in the world who would run towards an erupting volcano, instead of away from it. Sarah described the EARS system (East African Rift System) they are setting up, and how they teamed up with CHORDS to enable real time analysis of the observations they measure in the field. Thanks to her work, they are developing an early warning system for hazard detection! Sarah was departing soon to set up a few more observing stations in the field, so best of luck!!

The third speaker was Caroline S. Wagner, who gave some metrics on the social side of interdisciplinary collaboration. Science has become increasingly collaborative and team based, and the number of international collaborations has doubled in the past years. The number of countries producing 95% of research has gone from 7 to 15, which indicates we are moving in the right direction. However, more than 50% of the articles are currently never cited. A few takeaways from this talk are: 1) International collaborations start face to face, so go to different events and meet new people; 2) Diverse teams usually take longer to be productive, as people don’t usually speak the same language. Be patient!!; 3) Work towards a solution, not towards interdisciplinary teams. Interdisciplinarity should be the means to an end, not the end itself.

Other highlights

Below are some additional highlights I found interesting for the EarthCube community.

  • Eva Zanzerkia reported on the NSF 10 Big Ideas, which nicely summarize the funding interests of the agency for the next years. The report has been out for more than a year, but it’s never too late to catch up!
  • Doug Fils presented their plan for turning P418 into something bigger. In case you don’t know, P418 currently tracks the metadata of datasets exposed as and aggregates it in a search engine (a search engine for scientific data). Future plans are to ingest other types of resources and make the code base stable.
  • Interesting working lunch idea: A napkin drawing exercise. Do you know how to present your idea with a simple sketch?
  • Simon Goring (and Scott Peckham): How do we measure success in a huge program such as EarthCube?
  • PANGEO: Big data in the geosciences (but without reinventing the wheel!)
  • ASSET: Or how to incorporate existing tools into your workflows by drawing sketches! Workflows are important! Two different studies may obtain different results even if the original data is the same.

  • I got an award for community service 🙂

Intelligent user interfaces 2017 (IUI2017)

I have just returned from an amazing IUI2017 in Limassol, Cyprus and, as I have done with other conferences, I think it would be useful to share a summary of my notes in this post. This was my first time attending the IUI conference, and I am pleasantly surprised by both the quality of the event and the friendliness of the community. As a Semantic Web researcher, it was also very positive to learn how problems are tackled from a human-computer interaction perspective. I have to admit that this is often overlooked in many Semantic Web applications.

What was I doing at IUI2017?

My role in the conference was to present our paper towards the generation of data narratives, or, in a more ambitious manner, our attempt to write the “methods” section of a paper automatically (see some examples here). The idea is simple: in computational experiments, the inputs, methods (i.e., scientific workflows), intermediate results, outputs and provenance are explicit in the experiment. However, scientists have to process all these data by themselves and summarize it in the paper. By doing so, they may omit important details that are critical for reusing or reproducing the work. Instead, our approach aims to use all the resources that are explicit in the experiment to generate accurate textual descriptions in an automated way.

I wanted to attend the conference in part to receive feedback on our current approach. Although our work was well received, I learned that the problem can get complex really quickly. In fact, I think it can become a whole area of research itself! I hope to see more approaches in the future in this direction. But that is the topic for another post. Let’s continue with the rest of the conference:


The conference lasted three days, with one main keynote opening each of them. The conference opened with Shumin Zhai, from Google, who described their work on modern touchscreen keyboard interfaces. This will ring a bell for anyone reading this post, as the result of their work can be seen on any Android phone nowadays. I am sure they will not have problems finding users to evaluate their approaches.

In particular, the speaker introduced the system that captures gestures to recognize words, as if you were drawing a line. Apparently, before 2004 they had been playing around with different keyboard configurations that helped users write in a more efficient manner. However, people have different finger sizes, and adapting the keyboard to them is still a challenge. Current systems have several user models, and combine them to adapt to different situations. It was in 2004 when they came up with the first prototype of SHARK, a shape writer that used neural networks to decode keyboard movements. They refined their prototype until achieving the result that we see today on every phone.

However, there are still many challenges remaining. Smart watches have a screen that is too small for writing, and new formats without a screen, such as wearable devices or virtual reality, don’t use standard keyboards. Eye tracking solutions have not made significant progress, and while speech recognition has evolved a lot, it is not likely to replace traditional writers any time soon.

The second speaker was George Samaras, who described their work to personalize interfaces based on the emotions shown by the users of a system. The motivation for this need is that currently 80% of the errors of automated systems are due to human mistakes rather than mechanical ones, especially when the interfaces are complex, such as in aviation or nuclear plants. Here cognitive systems are crucial, and adapting the content and navigation to the humans using them becomes a priority.

The speaker presented their framework to classify users based on the relevant factors in interfaces. For example, the verbals prefer textual explanations, while imagers like image explanations for, e.g., browsing results. Another example is how users prefer to explore the results: we have the wholist, who prefers a top down exploration, versus the analyst, who would rather go for a bottom up search. This can become an issue in collaborations, as users that prefer to perceive the information in the same way may collaborate more efficiently together. A study performed over 10 years with more than 1500 users shows that personalized interfaces lead to faster task completion.

Finally, the speaker presented their work for tackling the emotions of users. Recognizing them is important, as depending on their mood, users may be keen to see the interface in one way or the other. They have developed a set of cognitive agents, which aim to personalize services and persuade users to complete certain tasks. Persuasion is more efficient when taking into account emotions as well.


The final keynote was presented by Panos Markopoulos, who introduced their work on HCI design for patient rehabilitation. Having a proper interaction with patients (in exercises for kids and elderly people, arm training for stroke survivors, etc.) is critical for their recovery. However, this interaction has to be meaningful or patients will get bored and not complete their recovery exercises. The speaker described their work with therapists to track patient recovery in exercises such as pouring wine, cleaning windows, etc. The talk ended with a summary of some of the current challenges in this area, such as adapting feedback to patient behavior, sustaining engagement in the long run or personalizing exercises.


  • Recommendation is still a major topic in HCI. Peter Brusilovsky gave a nice overview of their work on personalization in the context of relevance-based visualization, as part of the ESIDA workshop. Personalized visualizations are now gaining more relevance in recommendation, but picking the right visualization for users is still a challenge. In addition, users are starting to ask why certain recommendations are more relevant, so non-symbolic approaches like topic modeling present issues.
  • Semantic web as a means to address curiosity in recommendations. SIRUP uses LOD paths with cosine similarity to find potential connections relevant for users.
  • Most influential paper award: Trust in recommender systems (O’Donovan and Smyth), where they developed a trust model for users, taking into account provenance too. Congrats!
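
To make the cosine-similarity idea mentioned for SIRUP above a bit more concrete, here is a minimal sketch in Python. It is not SIRUP’s actual implementation: I am assuming each item is described by counts of the Linked Data paths that reach it, and the item names, path labels and the `rank_similar` helper are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature vectors (Counters)."""
    # Only features present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical data: each item is described by how often a given
# property path (e.g. item -> director) appears in the graph.
ITEM_PATHS = {
    "film_a": Counter({"director": 1, "genre": 2, "actor": 3}),
    "film_b": Counter({"director": 1, "genre": 2, "actor": 2}),
    "film_c": Counter({"location": 4}),
}


def rank_similar(target, items):
    """Rank every other item by its similarity to `target`, best first."""
    scores = [
        (name, cosine_similarity(items[target], features))
        for name, features in items.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_similar("film_a", ITEM_PATHS):
        print(f"{name}: {score:.3f}")
```

Items that share no paths get a score of 0, so a real system would need richer features (or longer paths) to surface the unexpected connections that make a recommendation feel curious rather than obvious.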



IUI 2017 had 193 participants this year, almost half of them students (86), and an acceptance rate of 23% (27% for full papers). You can check the program for more details. I usually prefer this kind of conference because it is relatively small: you can see most of the presented work without having to choose, and you can talk to everyone very easily. If I can, I will definitely come back.

I also hope to see more influence of Semantic Web techniques in addressing some of the challenges in HCI, as I think there is a lot of potential to help in explanation, trust or personalization. I look forward to attending next year in Tokyo!

AAAI 2017

The Association for the Advancement of Artificial Intelligence conference (AAAI) is held once a year to bring together experts from heterogeneous fields of AI and discuss their latest work. It is also a great venue if you are looking for a new job, as different companies and institutions often announce open positions. Last week, the 31st edition of the conference was held in downtown San Francisco, and I attended the whole event. If you missed the conference and are curious about what was going on, make sure you read the rest of this post.


But first: what was I doing there?

I attended the conference to co-present a tutorial and a poster.

The tutorial was a training session called “The scientific paper of the future”, which introduced a set of best practices on how to describe data, software, metadata, methods and provenance associated with a scientific publication, along with different ways of implementing these practices. Yolanda Gil and I presented, but Gail Clement (lead of AuthorCarpentry at Caltech library) joined us as well to describe how to boost your research impact in 5 simple steps. I found some of her materials so useful that I have finally opened a profile on ImpactStory after her talk. All the materials of our talk are online, so feel free to check them out.

From left to right: Gail Clement, Yolanda Gil and me

The poster I presented described the latest additions of the DISK framework. In a nutshell, we have adapted our system for automating hypothesis analysis and revision to operate on data that is constantly growing. While doing this, we keep a detailed record of the inputs, outputs and workflows needed to do the revision of the hypothesis. Check out our paper for details!


Ok, enough self-promotion! Let’s get started with the conference:


In general, the quality of the keynotes and talks was outstanding. The presenters made a great effort to talk about their topics without getting lost in the details of their field.

Rosalind Picard started the week by talking about AI and emotions, or, using her own terms, “affective computing”. Detecting the emotion of the person interacting with the system is pivotal for decision making. But recognizing these emotions is not trivial (e.g., many people smile when they are frustrated, or even angry). It’s impressive how sometimes just training neural networks with sample data is not enough, as the history of the gestures plays an important role in the detection as well. Rosalind described her work on detecting and predicting emotions like the interest of an audience or stress. Thanks to a smart wristband they are able to predict seizures and breakouts in autistic kids. In the future, they aim to be able to predict your mood and possible depression!

On Tuesday, the morning keynote was given by Steve Young, who talked about speech recognition and human-bot interaction. Their approach is mostly based on neural networks and reinforcement learning. Curiously enough, this approach works better in the field (with real users) than with simulated results (for which other approaches work better). The challenges in this area lie in determining when a dialog is not accurate, as users tend to lie a lot when providing feedback. In fact, maybe the only way of knowing that something went wrong in a dialog is when it’s too late and the dialog has failed. As a person working in the Semantic Web domain, I found it interesting that knowledge bases are uncharted territory in this field at the moment.

Jeremy Frank spoke in the afternoon session for IAAI. He focused on the role of AI in autonomous space missions, where sometimes the communications are interrupted and many anomalies may occur. The challenge in this case is not only to be able to plan what the robot or ship is going to do, but to monitor the plan and explain whether an order or a command did what it was actually supposed to. In this scenario, having new software becomes a risk.

On Wednesday, Dmitri Dolgov was in charge of talking about self-driving cars. More than 10 trillion miles are travelled every year across the world, with over 1.2 million casualties in accidents that are caused by human error 94% of the time. The speaker gave a great overview of the evolution of the field, starting in 2009 when they wanted to understand the problem and created a series of challenges to drive 100 miles in different scenarios. By 2010, they had developed a system good enough for driving a blind man across town, automatically. In 2012, the system was robust enough to drive on freeways. By 2015, they had finally achieved their goal: a completely driverless vehicle, without steering wheel or pedals. A capability of the system that surprised me is that it is able to read and mimic human behavior at intersections or stop signs without any trouble. In order to do this, the sensor data has to be very accurate, so they ended up creating their own sensors and hardware. As in the other talks, deep learning techniques have helped enormously to recognize certain scenarios and operate accordingly. Having the sensor data available has also helped. These cars have more than 1 billion virtual miles of training, and they are failing less and less as time goes by.


The afternoon session was led by Kristen Grauman, an expert in computer vision who analyzed how image recognition works in unlabeled video. The key challenge in this case is to be able to learn from images in a more natural way, as animals do. It turns out that our movement is heavily correlated with our sense of vision, to the point that if we don’t allow an animal to move freely when it’s growing up and viewing the world, it may be damaged permanently. Therefore, maybe machines should learn from images in movement (videos) to better understand the context of an image. The first results in this direction look promising, and the system has so far learned to track relevant moving objects in video, by itself.

The final day opened with a panel that I am going to include in the keynote group, as it has been one of the breakthroughs of this year. An AI has recently beaten all the professional players against whom it has played in Poker (one to one), and two of the lead researchers in the field (Michael Bowling and Tuomas Sandholm) were invited to show us how they did it. Michael started describing DeepStack and why Poker is a particularly interesting challenge for AI: while in other games like chess you have all the information you need at a given state to decide your next move, Poker is an imperfect information game. You may have to remember the history of what has been done in order to proceed with your next decision. This creates a decision tree that is even bigger than complex board games like Chess and Go, so researchers have to abstract and explore the sparse tree. The problem is that, at some point, something may have happened that wasn’t taken into account in the abstraction, and this is where the problems start.

Their approach for addressing this issue is to reason over the possible cards that the opponent thinks the system has (game theory and Nash equilibrium play a crucial role). The previous history determines distributions of the cards, while evaluation functions have different heuristics based on the beliefs of the players in the current game (deep learning is used to choose the winning situation out of the possibilities). While current strategies are very exploitable, DeepStack is one of the least exploitable, being able to make 8 times what a regular player makes while running on a laptop during the competition (the training part takes place beforehand).

Tuomas followed by introducing Libratus, an AI created last year but evolved from previous efforts. Libratus shares some strategies with DeepStack (card abstraction, etc.), as the Poker community has worked together on interoperable solutions. Libratus is the AI that actually played against the Poker professionals and beat them, even when they had a $200K incentive for the winner. The speaker mentioned that instead of trying to exploit the weaknesses of the opponent, Libratus focused on how the opponent exploits the strategies used by the AI. This way, Libratus could learn and fix these holes.

According to the follow-up discussion, Libratus could probably defeat DeepStack, but they haven’t played against each other yet. The next challenges are applying these algorithms to solve similar issues in other domains, and making an AI that can actually be part of a table and join tournaments (this may imply a redefinition of the problem). Both researchers ended by stating how supportive the community has been in providing feedback and useful ideas to improve their respective AIs.

The last keynote speaker was Russ Tedrake (MIT Robot labs), who presented advances in robotics and the lessons learned during the three-year DARPA challenge on robotics. The challenge had a series of heterogeneous tasks (driving, opening a valve, cutting a hole in a wall, opening and traversing a door, etc.). Most of these problems are framed as optimization problems, and planning is a key feature that has to be updated on the go. Robustness is crucial for all the processes. For example, in the challenge, the MIT robot failed due to a human error and an arm broke off. However, thanks to the redundancy functions, the robot could finish the rest of the competition using only the other arm. As a side note, the speaker also explained why the robots always “walk funny”: their center of mass. It simplifies the equations for movement, so researchers have adopted it to avoid more complexity in their solutions.

One of the main challenges for these robots is perception. It has to run constantly to understand the surroundings of the robot (e.g., obstacles), dealing with possibly noisy data or incomplete information. The problem is that, when a new robot has to be trained, most of the data produced with other robots is not usable (different sensors, different means for grabbing and dealing with objects, etc.). Looking at how babies interact with their environment (touching everything and tasting it) might bring new insights into how to address these problems.

My highlights

– The “AI in practice” session that occurred on Sunday was great. The room was packed, and we saw presentations from companies like IBM, LinkedIn or Google.

I liked these talks because they highlighted some of the current challenges faced by AI. For example, Michael Witbrock (IBM) described how despite the advances in Machine Learning applications, the representations used to address a problem can barely be reused. The lack of explanation of deep learning techniques does not help either, specifically in diagnosing diseases: doctors want to know why a certain conclusion is reached. IBM is working towards improving the inference stack, so as to be able to combine symbolic systems with non-symbolic ones.

Another example was Gary Marcus (Uber labs), who explained that although there has been a lot of progress on AI, AGI (artificial general intelligence) has not advanced that much. Perception is more than being able to generalize from a situation, and machines are currently not very good at it. For example, an algorithm may be able to detect that there is a dog in a picture, and that the dog is lifting weights, but it won’t be able to tell you why this picture is unique or rare. The problem with current approaches is that they are incremental. Sometimes, there is a fear of stepping back and looking at how some of our current problems are addressed. Focusing too much on incremental science (i.e., improving the precision of the current algorithms by a small percentage) may lead us to get stuck in local maxima. Sometimes we need to address problems from different angles to make sure we make progress.

– AI in games is a thing! Over the years I have seen some approaches that aim to develop smart players, but attending this tutorial was one of the best experiences in the conference. Julian Togelius gave an excellent overview/tutorial of the state of the art in the field, including how a simple A* algorithm may almost be a perfect player for Mario (if we omit those levels where we need to go back), how games are starting to adapt to players, how to build credible non-player characters and how to automatically create scenarios that are fun to play. Then he introduced other problems that overlap with many of the challenges addressed in the keynotes: 1) How can we produce a general AI that learns how to play any game? And 2) how can we create a game automatically? For the first one, I found it interesting that they have already developed a benchmark of simple games that will test your approach. The second one however is deeper, as the problem is not creating a game, or even a valid game. The real problem in my opinion is creating a game that a player considers fun. At the moment the current advances consist of modifications of existing games. I’ll be looking forward to reading more about this field and its future achievements.


– AI in education: Teaching ethics to researchers is becoming more and more necessary, given the pace at which science evolves. At the moment, this is an area often overlooked in any PhD or research program.

– The current NSF research plan is not moot! Lynne Parker introduced the creation of the AI research and development strategic plan, which is expected to remain untouched even after the results of the latest election. The current focus is on how AI could help with the national priorities: liberty (e.g., security), life (education, medicine, law enforcement, personal services, etc.) and pursuit of happiness (manufacturing, logistics, agriculture, marketing, etc.). Knowledge discovery and transparent and explainable methods will help for this purpose.

– Games night! Great opportunity to socialize and meet part of the community by drawing, playing puzzles and board games.


– Many institutions are hiring. The job fair had plenty of participating companies and institutions, but it was a bit far away from the main events and I didn’t see many people attending. In any case, plenty of companies also had stands next to the main conference, which made it easy to talk to them and see what they were working on.

– Avoid reinventing the wheel! There was a cool panel on the history of expert systems. Sometimes it is good to just take a step back and see how research problems were analyzed in the past. Some of those solutions still apply today.

– Ontologies and the Semantic Web were almost absent from the whole conference. I think I only saw three talks related to the topic: one about the evolution and trust of knowledge bases, one about the detection of redundant concepts in ontologies and one about the LIMES framework. I hope the Semantic Web community is more active in future editions of AAAI.

– Check out the program for more details on the talks and presentations.


Attending AAAI has been a great learning experience. I really recommend it to anyone working in any field of AI, especially if you are a student or you are looking for a job. I also find it very exciting that some of the problems I am working on are identified as important by the rest of the community. In particular, the need to create proper abstractions that facilitate the understanding and shareability of current methods was part of the main topic of my thesis, while the need to explain the results obtained when a certain technique is applied is highly related to what we do for capturing the provenance of scientific workflow results. As described by some of the speakers, “Debugging is a kind of alchemy” at the moment. Let’s turn it into a science.

Getting started with Docker: Modularizing your software in data-oriented experiments

As part of my work at USC, I am always looking for different ways of helping scientists reproduce their computational experiments. In order to facilitate software component deployment, I have been playing this week with Docker, a software wrapper that contains everything you need to execute a software component.

The goal of this tutorial is to show how you can easily get started making your code reproducible. For more extensive tutorials and other Docker capabilities, I recommend the official Docker documentation.

Dockerizing your software: Docker images and containers

Docker handles two main concepts: containers and images. An image indicates how to set up and create an environment. A container is the process in charge of executing an image. For example, try installing Docker on your computer and test the “hello world” image:

docker run hello-world

If everything goes well, you should see a message on your screen telling you that the Docker client contacted the Docker daemon, that the daemon pulled the “hello-world” image from the Docker Hub repository, that a new container was then created, and that finally the output of the container was sent to your Docker client.

Docker has a local repository where it stores the images we create or pull from online repositories, such as the one we just retrieved. When we try to execute an image, Docker tries to find it locally and then online (e.g., on the Docker Hub repository). If the system finds it online, it will download it to our local repository. To browse the images stored in your local repository, run the following command:

docker images

At the moment you should only see the “hello-world” image. Let’s try to do something fancier, like running an Ubuntu image with a Unix command:

docker run ubuntu echo hello world

You should see “hello world” on the screen after the image is downloaded. This is the same output you would obtain when executing that command in a terminal. If you are using popular software in your experiments, it is likely that someone has already created an image and posted it online. For example, let’s say that part of my experiment uses the samtools software, widely used in genomics analysis. In this example we will show how to reuse an image for samtools, which we have used for its mpileup caller function.

The first thing we have to do is look for an image in Docker Hub. In this case, the first result seems to be the appropriate image. The following command:

docker pull comics/samtools

will download the latest version. You can also specify the version by using a tag. For example comics/samtools:v1. Now if we execute the image locally:

docker run comics/samtools samtools mpileup

We will see the usage message of mpileup on screen.


Basically, the program runs, but it is asking for its correct usage (we didn’t invoke it correctly). Since the mpileup software requires three inputs, in this tutorial we are going to choose a simpler function from the samtools software: sort, which sorts an input bam file.

In order to pass the input file to our Docker container, we need to mount a volume, i.e., tell the system that we want to share a folder with the container. This can be done with the “-v” option.

docker run -v PathToFolderYouWantToShare:/out comics/samtools samtools sort -o /out/sorted.bam /out/inputFileToSort.bam

Here “PathToFolderYouWantToShare” is the folder where you have your input file (“inputFileToSort.bam”). The command will produce a sorted version of the input file (“sorted.bam”) in that same folder.
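One practical detail when sharing a folder: if the left-hand side of “-v” is a bare name rather than a path, Docker interprets it as the name of a Docker volume instead of a folder on your machine. The sketch below wraps the command above in a small shell function that resolves the folder to an absolute path first. The function name sort_bam and the DOCKER variable (which lets you swap in “echo” for a dry run) are my own illustrative choices, not part of Docker or samtools.

```shell
# Sketch only: sort_bam and the DOCKER override are hypothetical helper names.
# Usage: sort_bam FOLDER INPUT.bam [OUTPUT.bam]
sort_bam() {
  local shared input output
  # Resolve the shared folder to an absolute path before mounting it
  shared="$(cd "$1" && pwd)" || return 1
  input="$2"
  output="${3:-sorted.bam}"
  # DOCKER defaults to the real client; set DOCKER=echo to print the arguments instead
  ${DOCKER:-docker} run --rm -v "$shared:/out" comics/samtools \
    samtools sort -o "/out/$output" "/out/$input"
}
```

For a dry run, DOCKER=echo sort_bam ./data inputFileToSort.bam prints the arguments that would be passed to docker instead of starting a container. The --rm flag simply removes the container once the command finishes.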

All right, so now we have our component working. If we want anyone else to run it, we just have to tell them which Docker image to download. You may also include your data as part of the Docker image, but for that you will have to create your own Docker file (see below).
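Since reproducing a run depends on everyone using the same image, it can help to name the image with an explicit tag and fetch it only when it is missing. The following is a minimal sketch of the “look locally, then online” behaviour described earlier, done by hand. The function name ensure_image and the DOCKER variable are hypothetical; “docker image inspect” and “docker pull” are the only Docker commands it relies on.

```shell
# Sketch only: ensure_image and the DOCKER override are hypothetical helper names.
# Usage: ensure_image comics/samtools:v1
ensure_image() {
  local image="$1"
  # "docker image inspect" exits with a non-zero status when the image is not stored locally
  if ${DOCKER:-docker} image inspect "$image" >/dev/null 2>&1; then
    echo "$image is already in the local repository"
  else
    echo "$image not found locally, pulling it"
    ${DOCKER:-docker} pull "$image"
  fi
}
```

Sharing the exact name and tag (for example comics/samtools:v1) together with this kind of check is usually enough for a colleague to end up with the same image you used.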

Creating Docker files

OK, so far it’s easy to reuse someone else’s software if there is an image online. But how do I create an image of the scripts/software I have written so that others can reproduce my work? For this we need to create a Docker file, which will tell Docker how to build an image.

The first step is to choose a base image for the software we want to install. In my case, I chose the default Ubuntu image, and then added the steps and dependencies needed to install samtools. My Docker file looks as follows:

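# Start from the default Ubuntu image
FROM ubuntu
LABEL maintainer="add yourself here"
# Install the tools and libraries needed to unpack and compile samtools
RUN apt-get update && apt-get install -y python unzip gcc make bzip2 zlib1g-dev ncurses-dev
# Copy the samtools source archive from the build folder into the image
COPY samtools-1.3.1.tar.bz2 samtools.tar.bz2
# Unpack and compile samtools under /samtools
RUN bunzip2 samtools.tar.bz2 && tar xf samtools.tar && mv samtools-1.3.1 samtools && cd samtools && make
# Make the samtools binary available on the path
ENV PATH=/samtools:$PATH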

The image created by this Docker file extends the Ubuntu image we downloaded before, installing python, unzip, gcc, make, bzip2, zlib1g-dev and ncurses-dev, which are packages used by samtools. Thanks to this, we will have access to those commands from the Linux terminal in our container. The COPY command copies the software we want to install into the image, so the samtools archive must already be downloaded and available in the folder you build from. The following command unpacks and compiles it, and the last line adds “/samtools” to the system path. Note that if we want to copy sample data to the image, this would be another way to do so.

Now we just have to build the file using the following Docker command:

docker build -t youruser/nameOfImage -f pathToDockerFile .

youruser/nameOfImage is just a way to tag the images you create. In my case I named it dgarijo/test:v1. Later, when running the image as a container, we will use this name. The -f option points to the Docker file you want to build as an image. This flag is optional: if you don’t include it, Docker will look for a file called “Dockerfile” in the folder passed as the last argument (“.”, the current folder, in the command above). In some cases there are known issues with this option. If you run into any trouble, just use:

docker build -t dgarijo/test:v1 DIRECTORY

Where the “DIRECTORY” contains a docker file called “Dockerfile”.
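The two ways of calling “docker build” shown above can be folded into one small helper that chooses between them depending on whether it receives a folder or a file. This is only a sketch: build_image and the DOCKER variable are hypothetical names, and using the folder of the Docker file as the build context is my assumption about the most common layout.

```shell
# Sketch only: build_image and the DOCKER override are hypothetical helper names.
# Usage: build_image TAG TARGET   (TARGET is a folder containing "Dockerfile", or a Docker file)
build_image() {
  local tag="$1" target="$2"
  if [ -d "$target" ]; then
    # A folder: Docker looks for a file called "Dockerfile" inside it
    ${DOCKER:-docker} build -t "$tag" "$target"
  else
    # A file: point to it with -f and use its folder as the build context
    ${DOCKER:-docker} build -t "$tag" -f "$target" "$(dirname "$target")"
  fi
}
```

Whichever form you use, the files referenced by COPY (here the samtools archive) have to live inside the folder passed as the last argument, because that folder is all Docker gets to see during the build.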

Now that our image is in our local repository, let’s run it using the -v option to pass the appropriate inputs:

docker run -v PathOfTheFolderWithTheBamFile:/out nameOfYourImage samtools/samtools sort -o /out/sorted.bam /out/canary_test.bam

After a few seconds, you should see that the program ends, and a new file “sorted.bam” has appeared in your shared folder. Now that your image works, you should consider uploading it to the Docker Hub repository (see the tutorial on the Docker site).

And that’s it for today! If you want to see more details on how some of these dockerized components can be used in a scientific workflow system like WINGS, check out this tutorial:

How to (easily) publish your ontology permanently: OnToology and w3id

I have recently realized that I haven’t published any posts for a while, so I can’t think of a better way to start 2017 than with a small tutorial: how to mint w3ids for your ontologies without having to issue pull requests on Github.

In a previous post I described how to publish vocabularies and ontologies in a permanent manner using w3ids. These ids are community maintained and are a very flexible approach, but I have found out that doing pull requests to the w3id repository may be a hurdle for many people. Hence, I have been thinking and working towards lowering this barrier.

Together with some colleagues from the Universidad Politecnica de Madrid, we released a year and a half ago a tool to help document and evaluate ontologies: OnToology. Given a Github repository, OnToology tracks all your updates and issues pull requests with their documentation, diagrams and evaluation. You can see a step by step tutorial to set up and try OnToology with the ontologies of your choice. The rest of the tutorial assumes that your ontology is tracked by OnToology.

So, how can you mint w3ids from OnToology? Simple: go to the “my repositories” tab:


Then expand your repository:


And select “publish” on the ontology for which you want to mint a w3id:


Now OnToology will request a name for your URI, and that’s it! The ontology will be published under the w3id that appears below the ontology you selected. In my case I chose to publish the wgs84 ontology under the “wgstest” name:


As shown in the figure, the ontology will be published under “”

If you update the html in Github and want to see the changes published, click on the “republish” button that now replaces the old “publish” one:


Right now the ontologies are published on the OnToology server, but we will soon enable publication on Github through Github Pages. If you want the w3id to point somewhere else, you can either contact us or issue a pull request to w3id adding your redirection before the 302 redirection in our “def” namespace.

Towards a human readable maintainable ontology documentation

Some time ago, I wrote a small post to guide people on how to easily develop the documentation of an ontology when publishing it on the Web. The ontology documentation is critical for reuse, as it provides an overview of the terms of the ontology with examples, diagrams and their definitions. Many researchers describe their ontologies in associated publications, but in my opinion good documentation is what any potential reuser will browse if they want to include the ontology in their work.

As I pointed out in my previous post, there are several tools to produce proper documentation, like LODE and Parrot. However, these tools focus only on the concepts of the ontology, and when using them I found myself facing three main limitations:

  1. The tools are web services outside my control, and whenever the ontology is larger than a certain size, the web service will not accept it.
  2. Exporting the produced ontology documentation is not straightforward: I have to download a huge html file and its dependencies from the browser.
  3. If I want to edit the ontology documentation to add an introduction, diagrams, etc., I have to edit the huge downloaded html. This is cumbersome, as finding the spot where I want to add new contributions is difficult. Editing the text is normally mandatory, as some of the metadata of the ontology is not annotated within the ontology itself.

In order to address these limitations, I decided to create Widoco, a WIzard for DOCumenting Ontologies, more than a year ago. Widoco is based on LODE and helps you create the documentation in three simple steps: entering the ontology URI or file, completing its metadata and selecting the structure of the document you want to build. You can see a snapshot of the wizard below:

Widoco screenshot

Originally, Widoco produced the documentation offline (with no need for external web services and no limit on the size of your ontology) and the output was divided into different documents, each of them containing a section. That way, it was more manageable to edit each of them. The idea here is to be similar to LaTeX projects, where you include the sections you want in the main document and comment out those you don’t want to include. Ideally, the document would readapt itself dynamically to show only the sections you want.

After some work, I have just released version 1.2.2 of the tool, and I would like to comment on some of its features here.

  • Metadata gathering improvements: Widoco will try to extract metadata from the ontology itself, but that metadata is often incomplete. With Widoco it is now possible to introduce many metadata fields on the fly, if the user wants them to be added to the documentation. Some of the most recently added metadata fields indicate the status of the document and how to properly cite the ontology, including its DOI. In addition, it is possible to save and load the metadata properties as a .properties file, in case the documentation needs to be regenerated in the future. As for the license, if an internet connection is available, Widoco will try to retrieve the license name and metadata from the Licensius web services, which expose an endpoint of licenses.

    Widoco configuration screenshot
  • Access to a particular ontology term: I have changed the anchors in the document to match the URI of the terms. Therefore, if a user dereferences a particular ontology term, they will be redirected to the definition of that term in the document. This is useful because it saves time when looking for the definition of a particular concept.
  • Automatic evaluation: If an internet connection is available, Widoco uses the OOPS! web service to detect common pitfalls in your ontology design. The report can be published along with the documentation.
  • Towards facilitating ontology publication and content negotiation: Widoco now produces a publishing bundle that you can copy and paste to your server. This bundle follows the W3C best practices, and adapts depending on whether your vocabulary uses hash or slash URIs.
  • Multiple serializations: Widoco creates multiple serializations of your ontology and points to them from the ontology document. This helps users download their favorite serialization to work with.
  • Provenance and page markup: The main metadata of the ontology is annotated using RDFa, so that search engines like Google can understand and point to the contents of the ontology easily. In addition, an html page is created with the main provenance statements of the ontology, described using the W3C PROV standard.
  • Multilingual publishing: Ontologies may be described in multiple languages, and I have enabled Widoco to generate the documentation in a multilingual way, linking to other languages on each page. That way you avoid having to run the program several times for generating the documentation in different languages.
  • Multiple styles for your documentation: now I have enabled two different styles for publishing the vocabularies, although I am planning to adapt the new respec style from the W3C.
  • Dynamic sections: For each section added to the document, the user will not have to worry about its numbering, as it will be done automatically. In addition, the table of contents will change according to the sections the user wants to include in the final document.
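A quick way to check the content negotiation and the multiple serializations mentioned in the list above, once a bundle is on your server, is to request the same ontology URI with different Accept headers. The following is a minimal sketch using curl. The function name fetch_as and the CURL variable are hypothetical, the URI in the usage line is a placeholder, and which media types are actually served depends on how the bundle is deployed.

```shell
# Sketch only: fetch_as and the CURL override are hypothetical helper names,
# and the URI in the usage line is a placeholder, not a real ontology.
# Usage: fetch_as https://example.org/myonto text/turtle
fetch_as() {
  local uri="$1" mime="${2:-text/html}"
  # -s hides the progress bar, -L follows the redirects issued by the server,
  # -H sets the Accept header used for content negotiation
  ${CURL:-curl} -sL -H "Accept: $mime" "$uri"
}
```

Asking for text/html should return the documentation page, while asking for an RDF media type such as text/turtle should return the corresponding serialization, assuming the server is configured to offer it.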

Due to the number of requests, I also created a console version of Widoco, with plenty of options to run all the possible combinations of the features listed above. Even though you don’t need an internet connection, you may want one for accessing the Licensius and OOPS! web services. Both the console version and the desktop application are available through the same JAR, accessible on Github.

I built this tool to make my life easier, but it turns out that it can make the life of other people easier too. Do you want to use Widoco? Check out the latest release on Github. If you have any problems, open an issue! Some new features (like an automated changelog) will be included in the next releases.