Category: Miscellaneous

Getting started with Docker: Modularizing your software in data-oriented experiments

As part of my work at USC, I am always looking for different ways of helping scientists reproduce their computational experiments. To make software components easier to deploy, I have been playing this week with Docker, a software wrapper that contains everything you need to execute a software component.

The goal of this tutorial is to show you how you can easily get started with making your code reproducible. For more extensive tutorials and other Docker capabilities, I recommend the official Docker documentation.

Dockerizing your software: Docker images and containers

Docker handles two main concepts: containers and images. An image describes how to set up and create an environment. A container is the process in charge of executing an image. For example, try installing Docker on your computer and test the “hello-world” image:

docker run hello-world

If everything goes well, you should see a message on your screen telling you that the Docker client contacted the Docker daemon, that the daemon pulled the “hello-world” image from the Docker Hub repository, that a new container was then created, and that finally the output of the container was sent to your Docker client.

Docker has a local repository where it stores the images we create or pull from online repositories, such as the one we just retrieved. When we try to execute an image, Docker first tries to find it locally and then online (e.g., on the Docker Hub repository). If it finds the image online, it downloads it to our local repository. To browse the images stored in your local repository, run the following command:

docker images

At the moment you should only see the “hello-world” image. Let’s try something fancier, like running a Unix command in an Ubuntu image:

docker run ubuntu echo hello world

You should see “hello world” on the screen after the image is downloaded. This is the same output you would obtain when executing that command in a terminal. If you are using popular software in your experiments, it is likely that someone has already created an image and posted it online. For example, let’s say that part of my experiment uses samtools, a software package widely used in genomics analysis. In this example we will show how to reuse an image for samtools, which we have used for its mpileup caller function.

The first thing we have to do is look for an image on Docker Hub. In this case, the first result seems to be the appropriate image. The following command:

docker pull comics/samtools

will download the latest version. You can also specify the version by using a tag, for example comics/samtools:v1. Now if we execute the image locally:

docker run comics/samtools samtools mpileup

We will see the usage message of the program on screen.


Basically, the program runs, but it is asking for its correct usage (we didn’t invoke it correctly). Since the mpileup software requires three inputs, in this tutorial we are going to choose a simpler function from the samtools software: sort, which sorts an input bam file.

To pass the input file to our Docker container, we need to mount a volume, i.e., tell the system that we want to share a folder with the container. This can be done with the “-v” option.

docker run -v PathToFolderYouWantToShare:/out comics/samtools samtools sort -o /out/sorted.bam /out/inputFileToSort.bam

Here, PathToFolderYouWantToShare is the folder on your machine that contains your input file (“inputFileToSort.bam”); inside the container it is visible as “/out”. The command writes a sorted copy of the input file (“sorted.bam”) to that same folder.
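
If you launch the container from a script, the command above can be assembled programmatically. The following is only a minimal sketch in Python: the function names are made up for illustration, and it assumes the comics/samtools image and the “/out” mount point used above. It resolves the shared folder to an absolute path first, because Docker treats a bare name in “-v” as a named volume rather than as a folder on your machine.

```python
import subprocess
from pathlib import Path


def samtools_sort_command(shared_dir, input_name, output_name="sorted.bam",
                          image="comics/samtools"):
    """Build the `docker run` argument list that sorts a BAM file.

    Hypothetical helper: it mirrors the command shown in the tutorial.
    """
    # Bind mounts need an absolute host path.
    host_dir = Path(shared_dir).resolve()
    return [
        "docker", "run",
        "-v", f"{host_dir}:/out",        # share the host folder as /out
        image,
        "samtools", "sort",
        "-o", f"/out/{output_name}",     # output lands in the shared folder
        f"/out/{input_name}",
    ]


def run_sort(shared_dir, input_name, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    cmd = samtools_sort_command(shared_dir, input_name)
    print(" ".join(cmd))
    if not dry_run:
        # Requires a working Docker installation.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling run_sort(".", "inputFileToSort.bam") prints the command without executing anything; pass dry_run=False once Docker is installed and the input file is in place.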

All right, so now we have our component working. If we want anyone else to use it, we just have to tell them which Docker image to download. You may also include your data as part of the Docker image, but for that you will have to create your own Dockerfile (see below).

Creating Dockerfiles

OK, so far it’s easy to reuse someone else’s software if there is an image online. But how do I create an image of the scripts/software I have written so that others can reproduce my work? For this we need to create a Dockerfile, which tells Docker how to build an image.

The first step is to choose a base image for the software we want to install. In my case, I chose the default Ubuntu image, and then added the steps and dependencies needed by samtools. My Dockerfile looks as follows:

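# Start from the default Ubuntu image
FROM ubuntu
MAINTAINER add yourself here
# Install the packages needed to build samtools
RUN apt-get update && apt-get install -y python unzip gcc make bzip2 zlib1g-dev ncurses-dev
# Copy the samtools source archive from the build folder into the image
COPY samtools-1.3.1.tar.bz2 samtools.tar.bz2
# Decompress, unpack and compile samtools
RUN bunzip2 samtools.tar.bz2 && tar xf samtools.tar && mv samtools-1.3.1 samtools && cd samtools && make
# Make the samtools binary available on the PATH
ENV PATH /samtools:$PATH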

The image created by this Dockerfile modifies the Ubuntu image we downloaded before, installing python, unzip, gcc, make, bzip2, zlib1g-dev and ncurses-dev, which are packages used by samtools. Thanks to this, we will have access to those commands from the Linux terminal in our container. The COPY command copies the source archive of the software we want to install into the image (you need to download it first and place it next to the Dockerfile). The following RUN command decompresses and compiles it, and the ENV line adds “/samtools” to the system path. Note that if we want to copy sample data to the image, this would be another way to do so.

Now we just have to build the image using the following Docker command:

docker build -t youruser/nameOfImage -f pathToDockerFile .

youruser/nameOfImage is just the name you give to the image you create. In my case I named it dgarijo/test:v1. Later, when running the image as a container, we will use this name. The -f option points to the Dockerfile you want to build as an image. This flag is optional: if you don’t include it, Docker will look for a file called “Dockerfile” in the build folder (the “.” at the end of the command). In some cases there are known issues with this option. If you run into any trouble, just use:

docker build -t dgarijo/test:v1 DIRECTORY

Where “DIRECTORY” is the folder that contains a file called “Dockerfile”.

Now that our image is in our local repository, let’s run it using the -v option to pass the appropriate inputs:

docker run -v PathOfTheFolderWithTheBamFile:/out nameOfYourImage samtools/samtools sort -o /out/sorted.bam /out/canary_test.bam

After a few seconds, you should see that the program ends, and a new file “sorted.bam” has appeared in your shared folder. Now that your image works, you should consider uploading it to the Docker Hub repository (see the tutorial on the Docker site).

And that’s it for today! If you want to see more details on how some of these dockerized components can be used in a scientific workflow system like WINGS, check out this tutorial:


Making robots behave

I normally write about things that are somehow related to what I do, but last week I attended a seminar that I really enjoyed, and I think it is worth a short summary here. The title was “Making Robots Behave”, and it was presented by Leslie Pack Kaelbling (MIT).
As you may have already guessed, the seminar was about Artificial Intelligence and robot behavior. In particular, the problem they wanted to address is how a robot can deal with uncertainty. The robot keeps a state estimate from which it forms its beliefs, and uses the result to pick the next action. Then a planner takes the action into consideration and tells the robot to execute the next planned action. The result of the action generates a new input, and the robot re-estimates its state, starting the cycle again. A simple feedback loop! (For more details I recommend having a look at the talk, available online).
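
To make the loop concrete, here is a toy sketch in Python. It is not the method presented in the seminar: the door scenario, the sensor probabilities, the threshold and all function names are assumptions chosen only for illustration. The robot keeps a belief (the probability that a door is open), updates it with Bayes' rule after each noisy sensor reading, and picks its next action from that belief.

```python
def update_belief(p_open, saw_open, p_hit=0.8, p_false=0.3):
    """Bayes update of P(door is open) after one noisy reading.

    p_hit:   P(sensor says open | door is open)     (assumed value)
    p_false: P(sensor says open | door is closed)   (assumed value)
    """
    if saw_open:
        like_open, like_closed = p_hit, p_false
    else:
        like_open, like_closed = 1 - p_hit, 1 - p_false
    numerator = like_open * p_open
    return numerator / (numerator + like_closed * (1 - p_open))


def choose_action(p_open, threshold=0.9):
    """Act only when the belief is confident enough; otherwise sense again."""
    if p_open >= threshold:
        return "go_through"
    if p_open <= 1 - threshold:
        return "turn_back"
    return "look_again"


def run_loop(readings, p_open=0.5):
    """Estimate, decide, act, and repeat until a decisive action is taken."""
    actions = []
    for saw_open in readings:
        p_open = update_belief(p_open, saw_open)   # re-estimate the state
        action = choose_action(p_open)             # pick the next action
        actions.append(action)
        if action != "look_again":                 # decisive action ends the loop
            break
    return p_open, actions


belief, actions = run_loop([True, True, True])
print(actions)
```

With three consecutive “open” readings the belief rises from 0.5 to about 0.95, so the robot looks twice more before deciding to go through. A different sequence of readings leads to different actions, which is why the behavior is hard to predict from the outside.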
One of the cool things about the demo we saw is that most of the time you didn’t know how the robot was going to act next. It depended on the readings the sensors gave it at each moment, plus the planning algorithm and the feedback from its previous actions (i.e., its own knowledge of the world surrounding it). Sometimes the robot even moved its arms away because they were in the middle of its field of view.
I enjoy this kind of thing because it somehow reminds me of a science fiction tale (“The Last Question” by Isaac Asimov) where humans ask a computer how to lower the entropy of the universe. After thousands of years, when the computer gets the response, it ends up knowing how to answer questions like the meaning of life. In the robot’s particular scenario the universe is the room, which it enters knowing nothing. When it exits, it is able to recognize the environment and the different objects found in its way. Whether the robot can learn from this experience and teach others the actions to take in similar scenarios is something that we will have to wait to see.

How to (properly) publish a vocabulary or ontology in the web (part 3.5 of 6)

This is a short post that expands on the previous part of the tutorial (how to create nice human readable documentation for your vocabulary/ontology). Since I have been releasing some vocabularies lately, I have developed a simple tool that generates the main structure of an html document describing the resource with the 11 parts I introduced in my previous post (title and date, metadata, abstract, table of contents, introduction, namespace declarations, overview of classes and properties, description, cross reference section, references and acknowledgements).

This tool is not intended to replace any of the other tools designed to describe the properties and classes of an ontology. In fact, it acts as a wrapper, using LODE for that very purpose in one of the sections (the cross reference section). So, why should you use it?

  1. It saves time by providing the whole structure of the html document.
  2. It doesn’t require you to add any RDF metadata to the ontology being described. The URI of the ontology itself is optional. All metadata can be configured in the file of the project (see readme for more info).
  3. It automatically adds the metadata as RDF-a annotations to the document, which makes it easier for machines to parse.

I have uploaded the tool to Github, and it’s available here, along with the code I used.

As stated, I have used LODE for one of the sections of the document. I have already added LODE in the acknowledgements. If you use this tool please make sure to acknowledge any tool you use to generate your documentation.

This is part of a tutorial divided into 7 parts:

  1. Overview of the tutorial.
  2. (Reqs addressed A1(partially), A2, A3, A4, P1) Publishing your vocabulary at a stable URI using RDFS/OWL.
  3. (Reqs addressed P2, P3). How to design a human readable documentation.
  4. Extra: A tool for creating html readable documentation (this post)
  5. (Reqs addressed P4). Dereferencing your vocabulary.
  6. (Reqs addressed A1 (partially)). Dealing with the license. (To appear)
  7. (Reqs addressed A5, P5). Reusing other vocabularies. (To appear)

How to (properly) publish a vocabulary or ontology in the web (part 2 of 6)

This part of the tutorial explains how to publish your vocabulary at a stable URI using RDFS/OWL. In order to make things easier, I’ll illustrate each step of this part of the tutorial with an example. The steps to follow are further described below:

1)      Select the name of your vocabulary/ontology. Easy, right? In my case I want to publish an ontology encoding the workflow motif catalogue we describe in this paper, so the name I have chosen is “The workflow motif ontology” (wf-motifs to keep it short).

2)      Select the proper URI to publish your vocabulary. Now that we know how we want to name our vocabulary, things start to get trickier. Which URI do we choose? How do we ensure that it is not going to change?
The URI you choose for your ontology should be permanent and defined in a domain you control. The rationale behind this is simple: imagine that somebody is reusing the concepts defined in your ontology and you change its URI. The person reusing your ontology will no longer know the proper definitions and semantics of the reused term.
Since I assume that most of the people reading this are not willing to pay for a new domain each time a new ontology is published, I recommend defining the URI of your vocabularies/ontologies as a PURL. PURL stands for “persistent uniform resource locator”, and PURLs are widely used to give persistent URIs to resources. Once you register on the PURL site, the process is really simple. You define a new domain, wait for the approval and create the URI for your ontology. In my case it is:

EDIT: If you are interested in having more control on your redirections, w3id is a better alternative to purl. Have a look at my post for more information on how to set it up.

Note 1: If you create the name under the /net/ domain things will go faster, since it is the default domain. Otherwise they’ll have to approve the domain AND the name of your vocabulary/ontology.

Note 2: Someone could argue that by speaking to the system admin of your enterprise/university you can obtain the vocabulary URI as well. However, depending on who you are and the ontology you are working on, the URI they suggest could be something like: This is perfectly fine, but this looks more like the place where my .owl will finally be stored. If my file has to be moved, my URI will change. Using purl ensures the URI will be permanent, and that I have control over it.

3)      Create the ontology in RDF/OWL: There are several editors to create vocabularies/ontologies and their properties according to the W3C standards: Protégé, the NeOn Toolkit, TopBraid Composer, etc. The one I’m most familiar with is Protégé, which is free to install and use (they say that TopBraid is very good, but since the license is quite expensive I haven’t been able to test it). Once you have installed your editor you just have to replace the base URI of the ontology (Ontology IRI in Protégé) with the one you registered as a PURL. Protégé will use a hash (“#”) by default to identify the classes and properties you declare in the vocabulary/ontology. You can use a slash (“/”) for this purpose as well.

Hash versus slash debate: There has been a long discussion regarding the usage of “/” vs “#”. If you are not sure which one is best for your vocabulary/ontology, here is a tip: if your ontology will be huge and will be divided into many different modules, use “/”. Otherwise use “#”. It is easier to set up and will make it easier to point to the right spot in the documentation.
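
The practical difference between the two options can be sketched in a few lines of Python. This is only an illustration: the example.org IRI, the class name and the function names are made up and are not the ones used in this tutorial. The key point is that the part after “#” is a fragment, which a client strips before making the request, so all hash terms are served by the same document. With “/”, every term is a separate URL that the server has to handle.

```python
from urllib.parse import urldefrag


def term_iri(ontology_iri, local_name, separator="#"):
    """Build the IRI of a class or property from the ontology IRI."""
    if separator not in ("#", "/"):
        raise ValueError("separator must be '#' or '/'")
    return ontology_iri.rstrip("#/") + separator + local_name


def requested_document(iri):
    """Return the URL a client actually asks the server for.

    The fragment (everything after '#') is never sent to the server.
    """
    return urldefrag(iri).url


# Hypothetical ontology IRI and class name, for illustration only.
base = "http://example.org/ontology/my-vocab"
hash_term = term_iri(base, "SampleClass", "#")
slash_term = term_iri(base, "SampleClass", "/")

print(requested_document(hash_term))   # same document as the ontology itself
print(requested_document(slash_term))  # a separate URL for this single term
```

This is why “#” is simpler to set up (one redirection covers the whole vocabulary, and the fragment can double as an anchor in the html documentation), while “/” lets a large ontology serve each term or module separately.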

Returning to the example, this is what my ontology IRI looks like:
and a sample class will be

4)      Redirect your permanent URI to your vocabulary/ontology file. Once you are done editing your vocabulary/ontology, you have to host the .owl file somewhere. It is not important where you host it, as long as you know that it won’t be deleted. It’s fine if it gets moved, as long as you know where. In my case, I talked to the system admin and he stored the owl file here:
Finally, we go back to the purl page and we add the basic redirection to the target URL we have just set up.

Now whenever we enter the URI of our ontology, it will be redirected to the OWL file. Congrats!
Note: In my case, the ontology URI will take you to the ontology if you load it in Protégé and to the documentation if you open it in a web browser. I’ll explain how to achieve that in part 4 of the tutorial, so don’t worry about it for the moment.

Note: the steps I propose here are not normative. There may be other ways to achieve what is covered here. This is just a possible way to do it.

This is part of a tutorial divided into 7 parts:

  1. Overview of the tutorial.
  2. (Reqs addressed A1(partially), A2, A3, A4, P1) Publishing your vocabulary at a stable URI using RDFS/OWL. (this post)
  3. (Reqs addressed P2, P3). How to design a human readable documentation.
  4. Extra: A tool for creating html readable documentation.
  5. (Reqs addressed P4). Dereferencing your vocabulary.
  6. (Reqs addressed A1 (partially)). Dealing with the license.
  7. (Reqs addressed A5, P5). Reusing other vocabularies.

Annotating your personal page with RDF-a

A couple of weeks ago some members of the OEG and I organized a small tutorial about RDF-a for the rest of the group (also known as the First OEG RDF-a Collaborative Tripleton). The final goal was to provide an overview and eat our own dog food by annotating our personal web pages with some simple RDF-a statements. The bait, some free pizza:

Participants enjoying their pizza. It always works

People were very engaged, and we discussed some examples during the tutorial. Given that nobody was an expert in RDF-a, I think that the overall experience was very useful for everyone.
Therefore, if you want to annotate your page with some RDF-a statements, I have prepared a small guideline below listing the main steps to take into consideration. The guidelines are based on what we discussed in the tutorial and afterwards:

1)      Distinguish your web page from yourself: Don’t use the URL of your home page as your URI. Instead, create a URI for yourself. For example, if I want to describe my personal page (title, creation date, etc.), I would use the URL of the page. If I want to add some descriptive statements about myself (name, email, phone, etc.), I would use a separate URI that identifies me. This is a recognized good practice, although you can use any identifier for yourself as long as you control the domain where you create it.

2)      Provide at least a minimum set of statements about yourself: If you provide some information in html for users, why not in RDF-a for machines? Add your name, an image, phone, email, a link to your publications, the institutions you are working for, past and present projects, etc.

3)      Use widely used vocabularies like schema and foaf for describing yourself, Dublin Core to describe the document and, if you want to state the provenance of the document itself, you may even use the PROV standard.

4)      Try to use existing authoritative URIs for the resources you are describing. Linking to other resources is always better than creating your own URIs. If you don’t know the URI for an institution or a project you can always create your own and add an owl:sameAs once you know the right one. But first try looking in DBpedia or Sindice for existing ones.

5)      Validate your RDF-a! Before publishing, be sure to test the statements you have produced with an RDF validator like this one.
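
As an illustration of steps 1 to 3, the small Python sketch below generates an html fragment annotated with RDF-a, using the FOAF vocabulary. The person URI, name, page URL and email are placeholders, and the function name is made up. It is just one possible way of writing the annotations, not the markup we produced in the tutorial. Note how the person gets their own URI (here the page URL plus “#me”) instead of reusing the URL of the page.

```python
from html import escape

FOAF = "http://xmlns.com/foaf/0.1/"


def person_rdfa(person_uri, name, homepage, mbox=None):
    """Return an html fragment describing a person with RDF-a attributes.

    vocab    sets the default vocabulary (FOAF),
    resource sets the URI of the thing being described (the person),
    typeof   states its type, and each property adds one statement.
    """
    lines = [
        f'<div vocab="{FOAF}" resource="{escape(person_uri)}" typeof="Person">',
        f'  <span property="name">{escape(name)}</span>',
        f'  <a property="homepage" href="{escape(homepage)}">Home page</a>',
    ]
    if mbox:
        lines.append(
            f'  <a property="mbox" href="mailto:{escape(mbox)}">{escape(mbox)}</a>'
        )
    lines.append("</div>")
    return "\n".join(lines)


# Placeholder values, for illustration only.
snippet = person_rdfa(
    person_uri="http://example.org/people/jane#me",
    name="Jane Doe",
    homepage="http://example.org/people/jane",
    mbox="jane@example.org",
)
print(snippet)
```

You can paste the printed fragment into a validator (step 5) to check which statements are extracted from it.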

Do you want to know more? Check out the RDF-a Primer! It’s full of examples and it is very easy to follow.

Getting Started

Last week our group had a 2 day meeting in Cercedilla, in the famous residence where the Semantic Summer School (SSSW) events have been held for the past 10 years. The main purpose of the meeting was to identify commonalities between the members of the group and perform some exercises to identify future lines of work, mixing it with some social events and a bit of humor. For example, in the next picture some of my colleagues are trying to find out which role they have in the current discussion. The role is written on the shower cap they are wearing (e.g., ignore me, obey me, etc.) so everyone can see it except themselves:

One of the exercises: What role am I having in the discussion?

Part of the meeting was also related to the visibility of the group and its members. Having an appropriate personal page, a Google Scholar/Mendeley/LinkedIn profile or opening a blog for sharing what you are doing with the community were some of the ideas brought up by the participants.

I think the meeting was a good experience, something worth doing at least once a year. Standups and short group meetings are great to get an idea of what your colleagues are doing, but these longer encounters give you an idea of where the whole group is going, which is awesome.

As for me, the idea of creating a blog had been in my head for some time, but I never found the time. Now I have no more excuses, so I have decided to open this blog to talk about the research activities and events that I am involved in (or at least have heard of).