Category: Tutorial

Deploying sensors between a volcano and a hurricane: The IS-GEO 2018 Summer Institute

A few weeks ago I had the pleasure of attending the second IS-GEO summer institute in Hawaii. The meeting was led by Suzanne Pierce and Daniel Fuka, who managed to bring together more than 40 participants with different backgrounds and expertise. From the environmental sciences side, we had researchers specialized in areas such as Hydrology, Ecology or Meteorology. From the intelligent systems side, we had experts in sensor handling and deployment, data analytics, data integration, reproducibility and data visualization. We worked together on assembling and deploying hand-made sensors in the 8 different ecosystems of Hawaii’s Big Island. All while the volcano was still active and hurricane Hector approached from the south-east! But let me go step by step.

What is IS-GEO?

For those who are not familiar with the organization, IS-GEO (https://is-geo.org) is a Research Collaboration Network (RCN) funded by NSF through the EarthCube program that aims to bring together researchers from Intelligent Systems (IS) and Geosciences (GEO). The RCN has had a presence at top conferences like AGU, where it has led a session for the last couple of years. In addition, we organize (yes, I am part of the RCN as well) monthly teleconferences where we invite an expert from either the IS or GEO side to talk about their latest research. Have a look here: https://is-geo.org/resources/research-presentations/

Exploring NEW research, not just exposing previous work

From the very beginning, the objectives of the event were made clear: collaborate, define new challenges, enhance communication among attendees and ultimately lay the groundwork for robust collaborations. These objectives are key in interdisciplinary conversations, as a fruitful collaboration will only happen when both sides are interested in different aspects of the same research problem. The program was structured so we would have a few presentations during the morning and then spend some time crafting sensors and deploying them in the afternoon. The teams had a flexible structure, and we spent quite some time in the vans on the way to the field, which allowed everyone to talk to everyone else about their expertise and interests.

An unexpected guest: Hurricane Hector

Hurricane Hector introduced a change of plans. Instead of deploying just a few initial stations, we decided to prioritize sensor deployment during the first days, and then see if by the end of the week we could visualize actual data measuring the impact of the hurricane. We had a quick introduction to the different types of sensors, and I was amazed at how Arduino and open hardware initiatives have made it easy to integrate and read from them. It makes you want to set up a small sensor station at home!

Daily meteorological reports

One of the perks of being surrounded by scientists is that we had access to the local meteorologist, Harry Halpin, who provided daily reports on Hector's movement before the sessions started. Supercool!

It was also thanks to him that we were able to get close to the lava cam (http://lavacam.org/), which continuously reports on the status of the volcano, and take pictures of the recent lava bursts.

Building and planting sensors in the field:

We managed to deploy several sensor/weather stations in different ecosystems of the island. Some were set up at attendees' houses, such as Dan Fuka's backyard.

Others were deployed at more remote points of the island. For instance, several attendees set up a weather station near a Buddhist temple.

Another one went up on an old lava field near a mountain, next to a dome that serves as a research facility to simulate living conditions on Mars.

Visualizing data:

Once the stations were set up, we connected them to the CHORDS platform and visualized them on a map (http://is-geo.chordsrt.com/sites/map). Mike Daniels explained how all the data is available for download, and how to set up stations that register new data in CHORDS. Unfortunately, we didn't have much time for data analysis, but learning about the data acquisition process was a valuable lesson for all attendees. Collecting data requires hard work, and integrating and visualizing it to make it useful is full of challenges, from sensor calibration to error detection.

Personal collaboration outcomes and takeaways:

I wish there were more events like this one, combining a potential asset to the community (a data product that reflects the impact of Hector and future hurricanes) with hands-on sessions that explain how to create, program and collect data from sensors. I now better appreciate the amount of work that goes into the data collection process. Furthermore, I hadn't built a circuit in a long time, and it's always good to refresh your memory with some hands-on Arduino work.

During the week, I also had the chance to collaborate with Suzanne Pierce and Daniel Hardesty-Lewis on creating workflows for groundwater modeling. In fact, we were able to describe in a machine-readable manner how to invoke Modflow with different recharge files, and to add it to a model registry. With a little more effort, we will also be able to connect this work to data collected by a platform such as CHORDS.

Other highlights:

  • Planet Texas 2050 is going to be big! Created after hurricane Harvey's disaster, this project is going to deploy a whole new cyberinfrastructure to study climate variations, track the impact of pumping and try to predict how to irrigate different regions. I hope we can collaborate within our MINT project (see the last bullet point).
  • The Ronin Institute looks like a great organization to apply for research grants when you can’t change locations.
  • IkeWai, like Planet Texas 2050, will set up its own sensor infrastructure. Many opportunities for data analysis!
  • Privacy and sensor data: There are many open questions on who should own sensor data when the sensors are installed on private property. On the one hand, sensors could give away personal information about the owner. On the other hand, they could be exploited to detect illegal activities, such as water pollution.
  • Grafana is looking great for sensor data visualization. You can even configure alerts!
  • [Self promotion :D] I gave a presentation and demo on our Model INTegration (MINT) project, where we are trying to bring together models from economy, agronomy, hydrology and meteorology to answer important questions about a region. The project is only 6 months old, but so far we are making great progress! See our IEMSs paper for a full description of MINT!
Getting started with Docker: Modularizing your software in data-oriented experiments

As part of my work at USC, I am always looking for ways to help scientists reproduce their computational experiments. To facilitate software component deployment, I have been playing this week with Docker, a container platform that packages everything you need to execute a software component.

The goal of this tutorial is to show you how to easily get started making your code reproducible. For more extensive tutorials and other Docker capabilities, I recommend the official Docker documentation: https://docs.docker.com/engine/getstarted/

Dockerizing your software: Docker images and containers

Docker handles two main concepts: containers and images. An image indicates how to set up and create an environment; a container is the running process that executes an image. For example, try installing Docker on your computer (https://docs.docker.com/engine/installation/) and test the “hello world” image:

docker run hello-world

If everything goes well, you should see a message on your screen telling you that the Docker client contacted the Docker daemon, that the daemon pulled the “hello world” image from the Docker Hub repository, that a new container was then created, and that finally the output of the container was sent to your Docker client.

Docker has a local repository where it stores the images we create or pull from online repositories, such as the one we just retrieved. When we try to execute an image, Docker first looks for it locally and then online (e.g., on the Docker Hub repository). If it finds it online, it downloads it to our local repository. To browse the images stored in your local repository, run the following command:

docker images

At the moment you should only see the “hello-world” image. Let's try something fancier, like running an Ubuntu image with a Unix command:

docker run ubuntu echo hello world

You should see “hello world” on the screen after the image is downloaded. This is the same output you would obtain when executing that command in a terminal. If you are using popular software in your experiments, it is likely that someone has already created an image and posted it online. For example, let's assume that part of my experiment uses samtools, a tool widely used in genomics analysis, and in particular its mpileup caller function. Let's see how to reuse an existing samtools image.

The first thing we have to do is look for an image in Docker hub. In this case, the first result seems to be the appropriate image: https://hub.docker.com/r/comics/samtools/. The following command:

docker pull comics/samtools

will download the latest version. You can also pin a specific version by using a tag, e.g., comics/samtools:v1. Now if we run the image locally:

docker run comics/samtools samtools mpileup

We will see the samtools usage message printed on screen.

Basically, the program runs, but it is asking for its correct usage (we didn't invoke it correctly). Since mpileup requires three inputs, in this tutorial we are going to choose a simpler samtools function: sort, which sorts an input BAM file.

In order to pass the input file to our Docker container, we need to mount a volume, i.e., tell the system that we want to share a folder with the container. This is done with the “-v” option:

docker run -v PathToFolderYouWantToShare:/out comics/samtools samtools sort -o /out/sorted.bam /out/inputFileToSort.bam

Where PathToFolderYouWantToShare is the folder where you have your input file (“inputFileToSort.bam”). This will produce a sorted version (“sorted.bam”) of the input file in that same folder.

All right, so now we have our component working. If we want anyone else to run it, we just have to tell them which Docker image to download. You may also include your data as part of the Docker image, but for that you will have to create your own Dockerfile (see below).

Creating Dockerfiles

OK, so far it's easy to reuse someone else's software if there is an image online. But how do I create an image of the scripts/software I have written, so that others can reproduce my work? For this we need to create a Dockerfile, which tells Docker how to build an image.

The first step is to write the instructions for the software we want to install. In my case, I chose the default Ubuntu image and then added the steps and dependencies needed by samtools. My Dockerfile looks as follows:

# Start from the official Ubuntu base image
FROM ubuntu
MAINTAINER Your Name <emailgoeshere@example.com>
# Install the build tools and libraries that samtools needs
RUN apt-get update && apt-get install -y python unzip gcc make bzip2 zlib1g-dev ncurses-dev
# Copy the samtools sources into the image
COPY samtools-1.3.1.tar.bz2 samtools.tar.bz2
# Unpack and compile samtools
RUN bunzip2 samtools.tar.bz2 && tar xf samtools.tar && mv samtools-1.3.1 samtools && cd samtools && make
# Make the compiled binaries available on the PATH
ENV PATH /samtools:$PATH

The image created by this Dockerfile starts from the Ubuntu image we downloaded before and installs python, unzip, gcc, make, bzip2, zlib1g-dev and ncurses-dev, which are packages needed by samtools. Thanks to this, we will have access to those commands from the Linux terminal in our container. The COPY instruction copies the software we want to install into the image (download it from https://sourceforge.net/projects/samtools/files/samtools/), the next RUN unzips and compiles it, and the ENV instruction adds “/samtools” to the system path. Note that COPY is also the way to go if we want to ship sample data with the image.
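
For example, a hypothetical extra line (the data/ folder and file name below are placeholders, not part of the original setup) that bundles a sample input file with the image could be:

# Hypothetical: ship a sample BAM file inside the image under /data/
COPY data/inputFileToSort.bam /data/inputFileToSort.bam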

Now we just have to build the file using the following Docker command:

docker build -t youruser/nameOfImage -f pathToDockerFile .

youruser/nameOfImage is just a way to tag the images you create. In my case I named it dgarijo/test:v1. Later, when running the image as a container, we will use this name. The -f option points to the Dockerfile you want to build as an image. This flag is optional: if you don't include it, Docker will look for a file called “Dockerfile” in the current folder. In some setups the -f flag has known issues, so if you run into any trouble, just use:

docker build -t dgarijo/test:v1 DIRECTORY

Where “DIRECTORY” is the folder that contains a file called “Dockerfile”.

Now that our image is in our local repository, let's run it using the -v option to pass the appropriate inputs:

docker run -v PathOfTheFolderWithTheBamFile:/out nameOfYourImage samtools sort -o /out/sorted.bam /out/canary_test.bam

After a few seconds the program should finish, and a new file “sorted.bam” will appear in your shared folder. Now that your image works, you should consider uploading it to the Docker Hub repository (see the tutorial on the Docker site).
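
For reference, a minimal sketch of how the upload usually goes (replace youruser/nameOfImage with the tag you used when building; you need a Docker Hub account):

# Log in with your Docker Hub credentials
docker login
# Push the image you built earlier so others can pull it
docker push youruser/nameOfImage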

And that’s it for today! If you want to see more details on how some of these dockerized components can be used in a scientific workflow system like WINGS, check out this tutorial: https://dgarijo.github.io/Materials/Tutorials/stanford5Dec2016/

How to (easily) publish your ontology permanently: OnToology and w3id

I have recently realized that I haven’t published any post for a while, so I don’t think there is a better way to start 2017 than with a small tutorial: how to mint w3ids for your ontologies without having to issue pull requests on Github.

In a previous post I described how to publish vocabularies and ontologies in a permanent manner using w3ids. These ids are community maintained and are a very flexible approach, but I have found out that doing pull requests to the w3id repository may be a hurdle for many people. Hence, I have been thinking and working towards lowering this barrier.

Together with some colleagues from the Universidad Politecnica de Madrid, a year and a half ago we released OnToology, a tool that helps document and evaluate ontologies. Given a Github repository, OnToology tracks all your updates and issues pull requests with the documentation, diagrams and evaluation of your ontologies. You can see a step-by-step tutorial on how to set up and try OnToology with the ontologies of your choice. The rest of this tutorial assumes that your ontology is tracked by OnToology.

So, how can you mint w3ids from OnToology? Simple: go to the “My repositories” tab.

Then expand your repository.

Next, select “publish” on the ontology for which you want to mint a w3id.

Now OnToology will request a name for your URI, and that's it! The ontology will be published under the w3id that appears below the ontology you selected. In my case I chose to publish the wgs84 ontology under the “wgstest” name.

The ontology will be published under “https://w3id.org/def/wgstest”.

If you update the HTML in Github and want to see the changes reflected, click on the “republish” button that now replaces the old “publish” one.

Right now the ontologies are published on the OnToology server, but we will soon enable publication on Github using Github Pages. If you want the w3id to point somewhere else, you can either contact us at ontoology@delicias.dia.fi.upm.es, or issue a pull request to w3id adding your redirection before the 302 redirection in our “def” namespace: https://github.com/perma-id/w3id.org/blob/master/def/.htaccess

Permanent identifiers and vocabulary publication: purl.org and w3id

Some time ago, I wrote a tutorial on common practices for publishing vocabularies/ontologies on the Web. In particular, the second step of the tutorial addressed the guidelines for setting a stable URI for your vocabulary. The tutorial referred to purl.org, a popular service for creating permanent URLs on the web. Purl.org had been working for more than 15 years and was widely used by the community.

However, several months ago purl.org stopped registering new users. Then, only a couple of months ago, the website stopped allowing users to register or edit their permanent URLs. The official response is that there is a problem with the SOLR index, but I am afraid the service is not reliable anymore. The current purl redirects work properly, but I have no idea whether they intend to keep maintaining the service in the future. It's a bit sad, because it was a great infrastructure and service to the community.

Fortunately, other permanent identifier efforts have been successfully launched by the community. In this post I am going to talk a little about w3id.org, an effort started by the W3C Permanent Identifier Community Group that has been adopted by a great part of the community (with more than 10K registered ids). W3id is supported by several companies, and although there is no official commitment from the W3C for maintenance, I think it is currently one of the best options for publishing resources with a permanent id on the web.

Differences with purl.org: w3id is a bit geekier, but way more flexible and powerful when doing content negotiation. In fact, you don't need to talk to your admin to set up the content negotiation, because you can do it yourself! Apart from that, the main difference between purl.org and w3id is that there is no user interface to edit your purls. You do so through Github, by editing the .htaccess files there.

How to use it: let's imagine that I want to create a vocabulary for my domain. In my example, I will use the coil ontology, an extension of the videogame ontology for modeling a particular game. I have already created the ontology and assigned it the URI https://w3id.org/games/spec/coil#. I have produced the documentation and saved the ontology file in both RDF/XML and TTL formats. In this particular case, I have chosen to store everything in one of my repositories on Github: https://github.com/dgarijo/VideoGameOntology/tree/master/GameExtensions/CoilOntology. So, how do you set up the w3id for it?

  1. Go to the w3id repository and fork it. If you don’t have a Github account, you must create one before forking the repository.
  2. Create the folder structure you assigned in the URI of your ontology (I assume you won't be rewriting somebody else's URI; if that were the case, the admins would likely detect it). In my example, I created the folders “games/spec/” (see in repo).
  3. Create the .htaccess file. In my case it can be seen at the following URL: https://github.com/perma-id/w3id.org/blob/master/games/spec/.htaccess. Note that I have included negotiation for three vocabularies in there; a simplified sketch is shown right after this list.
  4. Commit and push your changes to your fork.
  5. Create a pull request to the perma-id repository.
  6. Wait until the admins accept your changes.
  7. You are done! If you want to add more w3id ids, just push them to your fork and create additional pull requests.
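
For illustration, here is a simplified sketch of what the coil-specific rules in that .htaccess could look like (the actual file covers three vocabularies and may differ in its details):

RewriteEngine on

# Browsers asking for HTML are redirected to the documentation
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteRule ^coil$ http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coilDoc/ [R=303,L]

# Semantic Web clients asking for Turtle get the TTL serialization
RewriteCond %{HTTP_ACCEPT} text/turtle
RewriteRule ^coil$ http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coil.ttl [R=303,L]

# Default: serve the RDF/XML serialization
RewriteRule ^coil$ http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coil.owl [R=303,L]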

Now every time somebody accesses the URL https://w3id.org/games/spec/coil#, it will redirect to wherever the .htaccess file points: in my case, http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coilDoc/ for the documentation, http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coil.ttl for TTL and http://dgarijo.github.io/VideoGameOntology/GameExtensions/CoilOntology/coil.owl for RDF/XML. This also works if you just want simple 302 redirections. The w3id administrators are usually very fast to review and accept the changes (so far I haven't had to wait more than a couple of hours before having everything reviewed). The whole process is perhaps slower than what purl.org used to be, but I really like the approach, and you can do negotiations that you were unable to achieve with purl.org.
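
Once the pull request is merged, a quick way to check the negotiation from the command line (assuming the redirections above are in place) is:

# Ask for the Turtle serialization and follow the 303 redirect
curl -sH "Accept: text/turtle" -L https://w3id.org/games/spec/coil

# Ask for the HTML documentation instead
curl -sH "Accept: text/html" -L https://w3id.org/games/spec/coil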

Http vs https: As a final comment, w3id uses https. If you publish something with http, it will be redirected to https. This may look like an unimportant detail, but it is critical in some cases. For example, I have found that some applications cannot negotiate properly if they have to handle a redirect from http to https. An example is Protégé: if you try to load http://w3id.org/games/spec/coil#, the program will raise an error. Using https in your URI works fine with the latest version of the program (Protégé 5).

General guidelines for reviewing a scientific publication

Lately I have been asked to do several reviews for different workshops, conferences and journals. In this post I would like to share a generic template to follow when reviewing a scientific publication. If you have been reviewing for a while you may find it trivial, but I think it might be useful for people who have recently started the reviewing process. At least, when I started, I had to ask my advisor and colleagues for a similar one.

But first, several reasons why you should review papers:

  • Helps you identify whether a scientific work is good or not, and refine your criteria by comparing your assessment with other reviewers'. It also trains you to defend your opinion based on what you read.
  • Helps you refine your own work, by identifying common flaws that you normally don't detect when writing your own papers.
  • It's an opportunity to update your state of the art, or learn a little about other areas.
  • Allows you to contribute to the scientific community and gain public visibility.

A scientific work might be the result of months of work. Even if you think it is trivial, you should be methodical in explaining the reasons why you think it should be accepted or rejected (yes, even if you think the paper should be accepted). A review should not be just an “Accepted” or “Rejected” statement; it should also contain valuable feedback for the authors. Below are the main guidelines for a good review:

  • Start your review with an executive summary of the paper: this will let the authors know the main message you have understood from their work. Don't copy and paste the abstract; try to communicate the summary in your own words. Otherwise they'll just think you didn't pay much attention when reading the paper.
  • Include a paragraph summarizing the following points:
    1. Grammar: Is the paper well written?
    2. Structure: Is the paper easy to follow? Do you think the order should have been different?
    3. Relevance: Is the paper relevant for the target conference/journal/workshop?
    4. Novelty: Is the paper dealing with a novel topic?
    5. Your decision: Do you think the work should be accepted for the target publication? (If you don't, expand your concerns in the following paragraphs.)
  • Major Concerns: Here is where you should say why you disagree with the authors and highlight your main issues. In general, a good research paper should successfully address four main points:
    1. What is the problem the authors are tackling? (Research hypothesis) This point is tricky, because sometimes it is really hard to find! And in some cases the authors omit it and you have to infer it. If you don’t see it, mention it in your review.
    2. Why is this a problem? (Motivation). The authors could have invented a problem with no real motivation. A good research paper is often motivated by a real-world problem, ideally with a user community that would benefit from the outcome.
    3. What is the solution? (Approach). The description of the solution adopted by the authors. This is generally easy to spot on any paper.
    4. Why is it a good solution? (Evaluation). The validation of the research hypothesis described in point one. The evaluation is normally the key part of the paper, and the reason why many research publications are rejected. As my supervisor has told me many times, one does not evaluate an algorithm or an approach; one evaluates whether the proposed algorithm or approach validates the research hypothesis.

When a paper describes the previous four points well, it is generally accepted. Of course, not all papers fall into the category of research papers (e.g., survey or analysis papers), but these four points should cover a wide range of publications.

  • Minor concerns: You can point out minor issues after the big ones have been dealt with. Not mandatory, but it will help the authors polish their work.
  • Typos: Unless there are too many, you should point out the main typos you find in your review, as well as any sentences you think are confusing.

Other advice:

  • Don't be a jerk: many reviews are anonymous, and people tend to be crueler when they know their names won't be shown to the authors. Instead of saying that something “is garbage”, state clearly why you disagree with the authors' proposal and conclusions. Make the facts talk for themselves, not your bias or opinion.
  • Consider the target publication. You can’t use the same criteria for a workshop, conference or journal. Normally people tend to be more permissive at workshops, where the evaluation is not that important if the idea is good, but require a good paper for conferences and journals.
  • Highlight the positive parts of the authors’ work, if any. Normally there is a reason why the authors have spent time on the presented research, even if the idea is not very well implemented.
  • Check the links, prototypes, evaluation files and, in general, all the supplementary material provided by the authors. A scientist should not only review the paper, but also the research described in it.
  • Be constructive. If you disagree with the authors on one point, always mention how they could improve their work. Otherwise they won't know how to address your issue and may simply ignore your review.

If you want to check more guidelines, you can check the ones Elsevier gives to their reviewers, or the ones by PLOS ONE.

Provenance Week 2014

Last week I attended the Provenance Week in Cologne. For the first time, IPAW and TAPP were held together, even sharing some sessions like the poster lightning talks. The clear benefit of having both events at the same time is that a bigger part of the community was actually able to attend, even if some argued that 5 full days of provenance is too long. I got to see many known faces, and finally met some people whom I had only talked to remotely.

In general, the event was very interesting and definitely worth attending. I was able to get an overview of the state of the art in provenance in many different domains, and of how to protect it, collect it and exploit it for various purposes. Different sessions led to different discussions, but I particularly liked two topics:

The “sexy” application for provenance (Paul Groth). After years of discussions we have a standard for provenance, and many applications are starting to use it and extend it to represent provenance across different domains. But there is no application that uses provenance from different sources to do something meaningful for the final user. Some applications define domain-dependent metrics to assess trust, others like PROV-O Viz visualize it to see what is going on in the traces, and others try to use it to explain what kind of things we can find in a particular dataset. But we still don't have the provenance killer app… will the community be able to find it before the next Provenance Week?

Provenance has been discussed for many years now. How come we are still so irrelevant? (Beth Plale). This was brought up by the keynote speaker and organizer Beth Plale, who talked about different consortiums in the U.S. that are starting to care about provenance (e.g., HathiTrust or the Research Data Alliance). As some people pointed out, it is true that provenance has gained a lot of importance in recent years, up to the point that some grants will only be provided if the researchers guarantee the tracking of provenance. The standard helps, but we are still far from solving the provenance-related issues. Authors and researchers have to see the benefit of publishing provenance (e.g., attribution, with something like PROV Pingback); otherwise it will be very difficult to convince them to do so.

Luc getting prepared for his introductory speech in IPAW

Apart from the pointers I have included above, many other applications and systems were presented during the week. These are my highlights:

Documentation of scientific experiments: a cool application for generating documentation of workflows using Python notebooks and PROV-O Viz, tested with Ducktape's workflows.

Reconstruction of provenance: Hazeline Asuncion and Tom de Nies both presented their approaches for finding the dependencies among data files when the provenance is lost. I find this very interesting because it could be used (potentially) to label workflow activities automatically (e.g., with our motif list).

Provenance capture: RDataTracker, an intrusive yet simple way of capturing the provenance of R scripts. Other approaches like noWorkflow also looked OK, but seemed a little heavier.

Provenance benchmarking: Hugo Firth presented ProvGen, an interesting approach for creating huge synthetic provenance graphs that simulate real-world properties (e.g., Twitter data). All the new provenance datasets were added to the ProvBench Github page, now also in Datahub.

Provenance pingbacks: Tim Lebo and Tom de Nies presented two different implementations (see here and here) for the PROV Pingback mechanism defined in the W3C. Even though security might still be an issue, this is a simple mechanism to provide attribution to the authors. Fantastic first steps!

Provenance abstraction: Paolo Missier presented a way of simplifying provenance graphs while preserving the PROV notation, which helps to better understand what is going on in the provenance trace. Roly Perrera presented an interesting survey on how abstraction is also being used to provide different levels of privacy when accessing the data, which will become more and more important as provenance gains a bigger role.

Applications of provenance: One of my favorites was Trusted Tiny Things, which aims at describing everyday things with provenance descriptions. This would be very useful to know, in a city, how much the government spent on a certain item (like a statue) and who was responsible for buying it. Other interesting applications were Pinar Alper's approach for labeling workflows, Jun Zhao's approach for generating queries to explore provenance datasets, and Matthew Gamble's metric for quantifying the influence of one article on another just by using provenance.

Trusted Tiny Things presentation

The Provenance Analytics workshop: I was invited to co-organize this satellite event on the first day. We got 11 submissions (8 accepted) and managed to keep a nice session running, plus some discussion at the end. Ongoing work on applications of provenance to different domains was presented (cloud, geospatial, national climate, crowdsourcing, scientific workflows), and the audience was open to providing feedback. I wouldn't mind doing it again 🙂

The prov analytics workshop (pic by Paul Groth)

Elevator pitch

During my time as a PhD student, many people have asked me about the subject of my thesis and the main ideas behind my research. As a student, you always think you have a very clear idea of what you are doing, at least until you actually have to explain it to someone outside your domain. In fact, it is all about using the right terminology. If you say something like “Oh yeah, I am trying to detect abstractions on scientific workflows semi-automatically in order to understand how they can better be reused and related to each other”, people will look at you as if you didn't belong to this planet. Instead, something like “detecting commonalities in scientific experiments in order to study how we can understand them better” might be more appropriate.

But last week the challenge was slightly different. I was invited to give an overview talk about the work I have been doing as a PhD student. And that is not only what I am doing, but why I am doing it and how it all fits together, without going into the details of every step. It may seem like an easy task, but it kept me thinking more than I expected.

As I think some people might be interested in a global overview, I want to share the presentation here as well: http://www.slideshare.net/dgarijo/from-scientific-workflows-to-research-objects-publication-and-abstraction-of-scientific-experiments. Have a look!

How to (properly) publish a vocabulary or ontology in the web (part 5 of 6)

This week I want to quickly introduce how and why you should include a license in your vocabulary and documentation. Since this subject has already been covered elsewhere, I will mainly provide links to posts describing these matters in detail.

Why should you add a license to your ontologies? Because if others want to reuse your vocabulary or ontology, the license clarifies what they are allowed to do with it according to the law (for instance, whether they have to give attribution to your work). Remember that you are the intellectual author and you have the rights over the resource being published. See more details and types of licenses here.

How can you specify a license? You can add it as a semantic description of the ontology/vocabulary. Two widely used properties are dc:rights and dc:license, from the Dublin Core vocabulary. These properties can be used to describe the OWL file being produced, or added to the documentation itself with RDFa or microdata annotations. See how it can be done here.
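
For illustration, a minimal sketch of such an annotation in Turtle (using the Workflow Motifs ontology URI from previous posts and the Creative Commons license mentioned below purely as examples; dcterms is the Dublin Core terms namespace) could be:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://purl.org/net/wf-motifs>
    a owl:Ontology ;
    dcterms:license <http://creativecommons.org/licenses/by-nc-sa/2.0/> .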

Spend some time analyzing which license is the most appropriate for your work; it may help you and many others in the future! If you are unsure which one to use, this is the license we use for our vocabularies: http://creativecommons.org/licenses/by-nc-sa/2.0/.

This is part of a tutorial divided in 7 parts:

  1. Overview of the tutorial.
  2. (Reqs addressed A1(partially), A2, A3, A4, P1) Publishing your vocabulary at a stable URI using RDFS/OWL.
  3. (Reqs addressed P2, P3). How to design a human readable documentation.
  4. Extra: A tool for creating html readable documentation
  5. (Reqs addressed P4). Dereferencing your vocabulary
  6. (Reqs addressed A1 (partially)). Dealing with the license. (this post)
  7. (Reqs addressed A5, P5). Reusing other vocabularies. (To appear)

How to (properly) publish a vocabulary or ontology in the web (part 4 of 6)

(Update: purl.org seems to have stopped working. I recommend having a look at my latest post for doing content negotiation with w3id.)

After a long summer break in blogging, I’m committed to finishing this tutorial. In this post I’ll explain why and how to dereference your vocabulary when publishing it in the Web.

But first, why should you dereference your vocabulary? In part 2 I showed how to create a permanent URL (purl) and redirect it to the ontology/vocabulary we wanted to publish (in my case it was http://purl.org/net/wf-motifs). If you followed the example, entering the purl in the web browser redirected you to the ontology file. Now, if you enter http://purl.org/net/wf-motifs in a browser, you will be redirected to the HTML documentation of the ontology, while entering the same URL in Protégé will load the ontology file. By dereferencing the motifs vocabulary I am able to choose what to deliver depending on the type of request received by the server on a single resource: RDF files for applications and nice HTML pages for people looking for information about the ontology (structured content for machines, human-readable content for users).

Additionally, if you have used the tools I suggested in previous posts, when you ask for a certain concept the browser will take you to the exact part of the document defining it. For example, if you want to know the exact definition of the concept “FormatTransformation” in the Workflow Motif ontology, you can paste its URI (http://purl.org/net/wf-motifs#FormatTransformation) in the web browser. This makes life easier for users when browsing and reading your ontology.

And now, how do you dereference your vocabulary? First, you should set the purl redirection as a redirection for Semantic Web resources (add a 303 redirection instead of a 302, and add the target URL where you plan to do the redirection). Note that you can only dereference a resource if you control the server where the resources are going to be delivered. The screenshot below shows how it would look in purl for the Workflow Motifs vocabulary. http://vocab.linkeddata.es/motifs is the place where our system admin decided to store the vocabulary.

Purl redirection

Now you should add the redirection itself. For this I always recommend having a look at the W3C documents, which guide you step by step on how to achieve this. In this particular case we followed http://www.w3.org/TR/swbp-vocab-pub/#recipe3, which is a simple redirection for vocabularies with a hash namespace. You have to create an .htaccess file similar to the one pasted below. In my case the index.html file has the documentation of the ontology, while motif-ontology1.1.owl contains the RDF/XML encoding. If a TTL file exists, you can also add the appropriate content negotiation. All the files are located in a folder called motifs-content, in order to avoid an infinite loop when dealing with the redirections of the vocabulary:

# Turn off MultiViews
Options -MultiViews

# Directive to ensure *.rdf files served as appropriate content type,
# if not present in main apache config
AddType application/rdf+xml .rdf
AddType application/rdf+xml .owl
#AddType text/turtle .ttl #<---Add if you have a ttl serialization of the file

# Rewrite engine setup
RewriteEngine On
RewriteBase /def

# Rewrite rule to serve HTML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} !application/rdf\+xml.*(text/html|application/xhtml\+xml)
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^motifs$ motifs-content/index.html

# Rewrite rule to serve RDF/XML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^motifs$ motifs-content/motif-ontology1.1.owl [R=303]

# Rewrite rule to serve turtle content from the vocabulary URI if requested
#RewriteCond %{HTTP_ACCEPT} text/turtle
#RewriteRule ^motifs$ motifs-content/motifs_ontology-ttl.ttl [R=303]

# Choose the default response
# ---------------------------

# Rewrite rule to serve the RDF/XML content from the vocabulary URI by default
RewriteRule ^motifs$ motifs-content/motif-ontology1.1.owl [R=303]

Note the 303 redirections when the OWL file is requested. If you have a slash vocabulary, you will have to follow the aforementioned W3C document for further instructions.

Now it is time to test that everything works. The easiest way is to paste the URI of the ontology in Protégé and in your browser and check that in one case it loads the ontology properly and in the other you can see the documentation. Another possibility is to use curl like this: curl -sH "Accept: application/rdf+xml" -L http://purl.org/net/wf-motifs (to check that the RDF is returned) or curl -sH "Accept: text/html" -L http://purl.org/net/wf-motifs for the HTML.
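
You can also inspect the status codes to confirm that the 303 redirections are in place (a quick sanity check; the exact hops depend on your server setup):

# Print only the response headers of each hop; a 303 should appear before the final 200
curl -sIH "Accept: application/rdf+xml" -L http://purl.org/net/wf-motifs | grep "HTTP/"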

Finally, you may also use the Vapour validator to check that you have done the process correctly. After entering your ontology URL, you should see something like this:

Vapour validation

Congratulations! You have dereferenced your vocabulary successfully 🙂

This is part of a tutorial divided in 7 parts:

  1. Overview of the tutorial.
  2. (Reqs addressed A1(partially), A2, A3, A4, P1) Publishing your vocabulary at a stable URI using RDFS/OWL.
  3. (Reqs addressed P2, P3). How to design a human readable documentation.
  4. Extra: A tool for creating html readable documentation
  5. (Reqs addressed P4). Dereferencing your vocabulary (this post)
  6. (Reqs addressed A1 (partially)). Dealing with the license. (To appear)
  7. (Reqs addressed A5, P5). Reusing other vocabularies. (To appear)