Linking Research

AAAI 2017

Posted by dgarijov on February 16, 2017

The Association for the Advancement of Artificial Intelligence conference (AAAI) is held once a year to bring together experts from heterogeneous fields of AI and discuss their latest work. It is also a great venue if you are looking for a new job, as different companies and institutions often announce open positions. Last week, the 31st edition of the conference was held in downtown San Francisco, and I attended the whole event. If you missed the conference and are curious about what was going on, make sure you read the rest of this post.


But first: what was I doing there?

I attended the conference to co-present a tutorial and a poster.

The tutorial was a training session called “The scientific paper of the future”, which introduced a set of best practices on how to describe the data, software, metadata, methods and provenance associated with a scientific publication, along with different ways of implementing these practices. Yolanda Gil and I presented, and Gail Clement (lead of AuthorCarpentry at Caltech library) joined us to describe how to boost your research impact in 5 simple steps. I found some of her materials so useful that I finally opened a profile on ImpactStory after her talk. All the materials of our talk are online, so feel free to check them out.


From left to right: Gail Clement, Yolanda Gil and me

The poster I presented described the latest additions to the DISK framework. In a nutshell, we have adapted our system for automating hypothesis analysis and revision to operate on data that is constantly growing. While doing this, we keep a detailed record of the inputs, outputs and workflows needed to revise the hypothesis. Check out our paper for details!


Ok, enough self-promotion! Let’s get started with the conference:


In general, the quality of the keynotes and talks was outstanding. The presenters made a real effort to explain their topics without jumping into the details of their field.

Rosalind Picard started the week by talking about AI and emotions, or, using her own terms, “affective computing”. Detecting the emotion of the person interacting with the system is pivotal for decision making. But recognizing these emotions is not trivial (e.g., many people smile when they are frustrated, or even angry). It’s impressive how sometimes just training neural networks with sample data is not enough, as the history of the gestures plays an important role in the detection as well. Rosalind described her work on detecting and predicting emotions like the interest of an audience or stress. Thanks to a smart wristband they are able to predict seizures and breakouts in autistic kids. In the future, they aim to be able to predict your mood and possible depressions!

On Tuesday, the morning keynote was given by Steve Young, who talked about speech recognition and human-bot interaction. Their approach is mostly based on neural networks and reinforcement learning. Curiously enough, this approach works better in the field (with real users) than with simulated results (for which other approaches work better). The challenges in this area lie in determining when a dialog is not accurate, as users tend to lie a lot when providing feedback. In fact, maybe the only way of knowing that something went wrong in a dialog is when it’s too late and the dialog has failed. As a person working in the Semantic Web domain, I found it interesting that knowledge bases are uncharted territory in this field at the moment.

Jeremy Frank spoke in the afternoon session for IAAI. He focused on the role of AI in autonomous space missions, where communications are sometimes interrupted and many anomalies may occur. The challenge in this case is not only to be able to plan what the robot or ship is going to do, but to monitor the plan and explain whether an order or a command did what it was actually supposed to do. In this scenario, having new software becomes a risk.

On Wednesday, Dmitri Dolgov was in charge of talking about self-driving cars. More than 10 trillion miles are travelled every year across the world, with over 1.2 million casualties in accidents, 94% of which are due to human error. The speaker gave a great overview of the evolution of the field, starting in 2009 when they wanted to understand the problem and created a series of challenges to drive 100 miles in different scenarios. By 2010, they had developed a system good enough for driving a blind man across town, automatically. In 2012, the system was robust enough to drive on freeways. By 2015, they had finally achieved their goal: a completely driverless vehicle, without steering wheel or pedals. A capability of the system that surprised me is that it is able to read and mimic human behavior at intersections or stop signs without any trouble. In order to do this, the sensor data has to be very accurate, so they ended up creating their own sensors and hardware. As in the other talks, deep learning techniques have helped enormously to recognize certain scenarios and operate accordingly. Having the sensor data available has also helped. These cars have more than 1 billion virtual miles of training, and they are failing less and less as time goes by.


The afternoon session was led by Kristen Grauman, an expert in computer vision who analyzed how image recognition works in unlabeled video. The key challenge in this case is to be able to learn from images in a more natural way, as animals do. It turns out that our movement is heavily correlated with our sense of vision, to the point that if we don’t allow an animal to move freely while it’s growing up and viewing the world, its vision may be damaged permanently. Therefore, maybe machines should learn from images in movement (videos) to better understand the context of an image. The first results in this direction look promising, and the system has so far learned to track relevant moving objects in video, by itself.

The final day opened with a panel that I am going to include in the keynote group, as it has been one of the breakthroughs of this year. An AI has recently beaten all the professional players against whom it has played Poker (one-on-one), and two of the lead researchers in the field (Michael Bowling and Tuomas Sandholm) were invited to show us how they did it. Michael started by describing DeepStack and why Poker is a particularly interesting challenge for AI: while in other games like chess you have all the information you need at a given state to decide your next move, Poker is an imperfect information game. You may have to remember the history of what has been done in order to proceed with your next decision. This creates a decision tree that is even bigger than those of complex board games like Chess and Go, so researchers have to abstract and explore the sparse tree. The problem is that, at some point, something may have happened that wasn’t taken into account in the abstraction, and this is where the problems start.

Their approach for addressing this issue is to reason over the possible cards that the opponent thinks the system has (game theory and Nash equilibrium play a crucial role). The previous history determines distributions of the cards, while evaluation functions have different heuristics based on the beliefs of the players in the current game (deep learning is used to choose the winning situation out of the possibilities). While current strategies are very exploitable, DeepStack is one of the least exploitable, being able to make 8 times what a regular player makes while running on a laptop during the competition (the training part takes place beforehand).

Tuomas followed by introducing Libratus, an AI created last year but evolved from previous efforts. Libratus shares some strategies with DeepStack (card abstraction, etc.), as the Poker community has worked together on interoperable solutions. Libratus is the AI that actually played against the Poker professionals and beat them, even when they had a $200K incentive for the winner. The speaker mentioned that instead of trying to exploit the weaknesses of the opponent, Libratus focused on how the opponent exploits the strategies used by the AI. This way, Libratus could learn and fix these holes.

According to the follow-up discussion, Libratus could probably defeat DeepStack, but they haven’t played against each other yet. The next challenges are applying these algorithms to solve similar issues in other domains, and making an AI that can actually be part of a table and join tournaments (this may imply a redefinition of the problem). Both researchers ended by stating how supportive the community has been in providing feedback and useful ideas to improve their respective AIs.

The last keynote speaker was Russ Tedrake (MIT Robot labs), who presented advances in robotics and the lessons learned during the three-year DARPA challenge on robotics. The challenge had a series of heterogeneous tasks (driving, opening a valve, cutting a hole in a wall, opening and traversing a door, etc.). Most of these problems are framed as optimization problems, and planning is a key feature that has to be updated on the go. Robustness is crucial for all the processes. For example, in the challenge, the MIT robot failed due to a human error and an arm broke off. However, thanks to the redundancy functions, the robot could finish the rest of the competition using only the other arm. As a side note, the speaker also explained why the robots always “walk funny”: it has to do with their center of mass. Walking that way simplifies the equations of movement, so researchers have adopted it to avoid more complexity in their solutions.

One of the main challenges for these robots is perception. It has to run constantly to understand the surroundings of the robot (e.g., obstacles), dealing with noisy data or incomplete information. The problem is that, when a new robot has to be trained, most of the data produced with other robots is not usable (different sensors, different means for grabbing and dealing with objects, etc.). Looking at how babies interact with their environment (touching everything and tasting it) might bring new insights into how to address these problems.

My highlights

– The “AI in practice” session that occurred on Sunday was great. The room was packed, and we saw presentations from companies like IBM, LinkedIn or Google.

I liked these talks because they highlighted some of the current challenges faced by AI. For example, Michael Witbrock (IBM) described how, despite the advances in Machine Learning applications, the representations used to address a problem can barely be reused. The lack of explanation of deep learning techniques does not help either, especially in diagnosing diseases: doctors want to know why a certain conclusion is reached. IBM is working towards improving the inference stack, so as to be able to combine symbolic systems with non-symbolic ones.

Another example was Gary Marcus (Uber labs), who explained that although there has been a lot of progress on AI, AGI (artificial general intelligence) has not advanced that much. Perception is more than being able to generalize from a situation, and machines are currently not very good at it. For example, an algorithm may be able to detect that there is a dog in a picture, and that the dog is lifting weights, but it won’t be able to tell you why this picture is unique or rare. The problem with current approaches is that they are incremental. Sometimes, there is a fear of stepping back and looking at how some of our current problems are addressed. Focusing too much on incremental science (i.e., improving the precision of the current algorithms by a small percentage) may lead us to get stuck in local maxima. Sometimes we need to address problems from different angles to make sure we make progress.

– AI in games is a thing! Over the years I have seen some approaches that aim to develop smart players, but attending this tutorial was one of the best experiences in the conference. Julian Togelius gave an excellent overview/tutorial of the state of the art in the field, including how a simple A* algorithm may almost be a perfect player for Mario (if we omit those levels where we need to go back), how games are starting to adapt to players, how to build credible non-player characters and how to automatically create scenarios that are fun to play. Then he introduced other problems that overlap with many of the challenges addressed in the keynotes: 1) How can we produce a general AI that learns how to play any game? And 2) how can we create a game automatically? For the first one, I found it interesting that they have already developed a benchmark of simple games that will test your approach. The second one, however, is deeper, as the problem is not creating a game, or even a valid game. The real problem in my opinion is creating a game that a player considers fun. At the moment the current advances consist of modifications of existing games. I’ll be looking forward to reading more about this field and its future achievements.


– AI in education: Teaching ethics to researchers is becoming more and more necessary, given the pace at which science evolves. At the moment, this is an area often overlooked in any PhD or research program.

– The current NSF research plan is not moot! Lynne Parker introduced the creation of the AI research and development strategic plan, which is expected to remain untouched even after the results of the latest election. The current focus is on how AI could help with the national priorities: liberty (e.g., security), life (education, medicine, law enforcement, personal services, etc.) and pursuit of happiness (manufacturing, logistics, agriculture, marketing, etc.). Knowledge discovery and transparent and explainable methods will help toward this purpose.

– Games night! A great opportunity to socialize and meet part of the community by drawing, solving puzzles and playing board games.


– Many institutions are hiring. The Job fair had plenty of participating companies and institutions, but it was a little bit far away from the main events and I didn’t see many people attending. In any case, there were also plenty of companies with stands during the main conference, which made it easy to talk to them and see what they were working on.

– Avoid reinventing the wheel! There was a cool panel on Expert systems history. Sometimes it is good to just take a step back and see how research problems were analyzed in the past. Some of those solutions still apply today.

– Ontologies and the Semantic Web were almost absent from the whole conference. I think I only saw three talks related to the topic, about evolution and trust of knowledge bases, detection of redundant concepts in ontologies and the LIMES framework. I hope the semantic web community is more active in future editions of AAAI.

– Check out the program for more details on the talks and presentations.


Attending AAAI has been a great learning experience. I really recommend it to anyone working in any field of AI, especially if you are a student or you are looking for a job. I also find it very exciting that some of the problems I am working on are also identified as important by the rest of the community. In particular, the need to create proper abstractions to facilitate understanding and shareability of current methods was part of the main topic of my thesis, while the need to explain the result of applying a certain technique is highly related to what we do for capturing the provenance of scientific workflow results. As described by some of the speakers, “Debugging is a kind of alchemy” at the moment. Let’s turn it into a science.

Posted in Conference, Workshop

Getting started with Docker: Modularizing your software in data-oriented experiments

Posted by dgarijov on January 30, 2017

As part of my work at USC, I am always looking for different ways of helping scientists to reproduce their computational experiments. In order to facilitate software component deployment, I have been playing this week with Docker, a software wrapper that contains all the things you need to execute a software component.

The goal of this tutorial is to show you how you can easily get started making your code reproducible. For more extensive tutorials and other Docker capabilities, I recommend the official Docker documentation.

Dockerizing your software: Docker images and containers

Docker handles two main concepts: containers and images. An image indicates how to set up and create an environment. A container is a running instance of an image. For example, try installing Docker on your computer and test the “hello world” image:

docker run hello-world

If everything goes well, you should see a message on your screen telling you that the Docker client contacted the Docker daemon, that the daemon pulled the “hello world” image from the Docker Hub repository, that a new container was then created, and that finally the output of the container was sent to your Docker client.

Docker has a local repository where it stores the images we create or pull from online repositories, such as the one we just retrieved. When we try to execute an image, Docker first looks for it locally and then online (e.g., on the Docker hub repository). If the image is found online, Docker downloads it to our local repository. To browse the images stored in your local repository, run the following command:

docker images

At the moment you should only see the “hello-world” image. Let’s try to do something fancier, like running an Ubuntu image with a Unix command:

docker run ubuntu echo hello world

You should see “hello world” on the screen, after the image is downloaded. This is the same output you would obtain when executing that command in a terminal. If you are using popular software in your experiments, it is likely that someone has created an image and posted it online. For example, suppose that part of my experiment uses the samtools software, widely used in genomics analysis. In this example we will show how to reuse an image for samtools, the software we have used for the mpileup caller function.

The first thing we have to do is look for an image in Docker hub. In this case, the first result seems to be the appropriate image. The following command:

docker pull comics/samtools

will download the latest version. You can also specify the version by using a tag. For example comics/samtools:v1. Now if we execute the image locally:

docker run comics/samtools samtools mpileup

We will see samtools print its usage message on screen.

Basically, the program runs, but it reports how it should be used because we didn’t invoke it correctly. Since the mpileup software requires three inputs, in this tutorial we are going to choose a simpler function from the samtools software: sort, which sorts an input bam file.

In order to be able to pass the input files to our Docker container, we need to mount a volume, i.e., tell the system that we want to share a folder with the container. This can be done with the “-v” option.

docker run -v PathToFolderYouWantToShare:/out comics/samtools samtools sort -o /out/sorted.bam /out/inputFileToSort.bam

Where PathToFolderYouWantToShare is the folder where you have your input file (“inputFileToSort.bam”). This will produce a sorted version (“sorted.bam”) of the input file “inputFileToSort.bam” in the folder “PathToFolderYouWantToShare”.
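One detail worth making explicit is that Docker expects the shared folder to be given as an absolute path on your machine (a bare name is interpreted as a named volume instead). The following is a minimal sketch of a helper that resolves the path and prints the command shown above. The function name samtools_sort_cmd and the --rm flag (which removes the container once it finishes) are my own additions for illustration and are not part of the original tutorial.

```shell
#!/bin/sh
# Sketch: print the "docker run" command that sorts a BAM file in a shared folder.
# Nothing is executed here; the function only prints the command line.
# samtools_sort_cmd is a hypothetical helper name, not part of Docker or samtools.

samtools_sort_cmd() {
    share_dir=$1              # host folder that contains the input file
    input=$2                  # input file name inside that folder
    output=${3:-sorted.bam}   # output file name (default taken from the tutorial)

    # Resolve to an absolute path, since Docker needs one to share a host folder.
    abs_dir=$(CDPATH= cd -- "$share_dir" && pwd) || return 1

    # Fail early if the input is not in the folder that will be mounted.
    if [ ! -f "$abs_dir/$input" ]; then
        echo "input file not found: $abs_dir/$input" >&2
        return 1
    fi

    printf 'docker run --rm -v "%s:/out" comics/samtools samtools sort -o "/out/%s" "/out/%s"\n' \
        "$abs_dir" "$output" "$input"
}
```

To actually run the command, pass the printed line to a shell, for example: samtools_sort_cmd ./data inputFileToSort.bam | sh (here ./data is an assumed folder name).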

All right, so now we have our component working. Now if we want anyone to reproduce this step, we just have to tell them which Docker image to download. You may also include your data as part of the Docker image, but for that you will have to create your own Docker file (see below).

Creating Docker files

OK, so far it’s easy to reuse someone else’s software if there is an image online. But how do I create an image of the scripts/software I have written, so that others can reproduce my work? For this we need to create a Docker file, which will tell Docker how to build an image.

The first step is to choose a base image for the software we want to install. In my case, I chose the Ubuntu default image, and then added the steps and dependencies of the samtools software. My Docker file looks as follows:

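# Start from the default Ubuntu image
FROM ubuntu
MAINTAINER add yourself here
# Install the packages needed to unpack and compile samtools
RUN apt-get update && apt-get install -y python unzip gcc make bzip2 zlib1g-dev ncurses-dev
# Copy the samtools source archive from the build folder into the image
COPY samtools-1.3.1.tar.bz2 samtools.tar.bz2
# Unpack and compile samtools
RUN bunzip2 samtools.tar.bz2 && tar xf samtools.tar && mv samtools-1.3.1 samtools && cd samtools && make
# Make the samtools binary available on the path
ENV PATH /samtools:$PATH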

The image created by this Docker file modifies the Ubuntu image we downloaded before, installing python, unzip, gcc, make, bzip2, zlib1g-dev and ncurses-dev, which are packages used by samtools. Thanks to this, we will have access to those commands from the Linux terminal in our container. The following commands copy the software we want to install into the image (you have to download it first), unpack it and compile it, and add “/samtools” to the system path. Note that if we wanted to include sample data in the image, COPY would be the way to do so.

Now we just have to build the image using the following Docker command:

docker build -t youruser/nameOfImage -f pathToDockerFile .

youruser/nameOfImage is just a way to tag the images you create. In my case I named it dgarijo/test:v1. Later, when running the image as a container, we will use this name. The -f option points to the Docker file you want to build as an image. This flag is optional: if you don’t include it, Docker will look for a file called “Dockerfile” in the folder given as the last argument (the current folder, “.”, in the command above). Also, there are known issues in some cases. If you run into any trouble, drop the -f option and use:

docker build -t dgarijo/test:v1 DIRECTORY

Where “DIRECTORY” is the folder that contains a Docker file called “Dockerfile”.
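A common stumbling block at this point is that the COPY instruction can only read files from the folder passed to docker build, so the samtools archive has to sit next to the Docker file. As a rough sketch, the check below verifies both files before building. The function name check_build_context is hypothetical, and the two file names are simply the ones used in the Docker file above.

```shell
#!/bin/sh
# Sketch: verify that a build folder contains what the Docker file above expects.
# check_build_context is a hypothetical helper name, not a Docker command.

check_build_context() {
    dir=$1
    status=0
    # "Dockerfile" is the default file name docker build looks for;
    # the archive is the file named in the COPY instruction.
    for required in Dockerfile samtools-1.3.1.tar.bz2; do
        if [ ! -f "$dir/$required" ]; then
            echo "missing in build folder: $dir/$required" >&2
            status=1
        fi
    done
    return "$status"
}
```

It can be chained with the build itself, for example: check_build_context DIRECTORY && docker build -t dgarijo/test:v1 DIRECTORY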

Now that our image is in our local repository, let’s run it using the -v option to pass the appropriate inputs:

docker run -v PathOfTheFolderWithTheBamFile:/out nameOfYourImage samtools/samtools sort -o /out/sorted.bam /out/canary_test.bam

After a few seconds, you should see that the program ends, and a new file “sorted.bam” has appeared in your shared folder. Now that your image works, you should consider uploading it to the Docker hub repository (see the tutorial on the Docker site).

And that’s it for today! If you want to see more details on how some of these dockerized components can be used in a scientific workflow system like WINGS, check out this tutorial:

Posted in Miscellaneous, Tutorial

How to (easily) publish your ontology permanently: OnToology and w3id

Posted by dgarijov on January 23, 2017

I have recently realized that I haven’t published any post for a while, so I don’t think there is a better way to start 2017 than with a small tutorial: how to mint w3ids for your ontologies without having to issue pull requests on Github.

In a previous post I described how to publish vocabularies and ontologies in a permanent manner using w3ids. These ids are community maintained and are a very flexible approach, but I have found out that doing pull requests to the w3id repository may be a hurdle for many people. Hence, I have been thinking and working towards lowering this barrier.

Together with some colleagues from the Universidad Politecnica de Madrid, a year and a half ago we released a tool to help document and evaluate ontologies: OnToology. Given a Github repository, OnToology tracks all your updates and issues pull requests with their documentation, diagrams and evaluation. You can see a step by step tutorial to set up and try OnToology with the ontologies of your choice. The rest of the tutorial assumes that your ontology is tracked by OnToology.

So, how can you mint w3ids from OnToology? Simple, go to the “my repositories” tab:


Then expand your repository:


And select “publish” on the ontology for which you want to mint a w3id:


Now OnToology will request a name for your URI, and that’s it! The ontology will be published under the w3id that appears below the ontology you selected. In my case I selected to publish the wgs84 ontology under the “wgstest” name:


As shown in the figure, the ontology will be published under “”

If you update the html in Github and want the published version to reflect the changes, you should click on the “republish” button that now replaces the old “publish” one:


Right now the ontologies are published on the OnToology server, but we will soon enable publication in Github by using Github pages. If you want the w3id to point somewhere else, you can either contact us, or issue a pull request to w3id adding your redirection before the 302 redirection in our “def” namespace.

Posted in Linked Data, ontology, Tutorial

Towards a human readable maintainable ontology documentation

Posted by dgarijov on August 29, 2016

Some time ago, I wrote a small post to guide people on how to easily develop the documentation of an ontology when publishing it on the Web. The ontology documentation is critical for reuse, as it provides an overview of the terms of the ontology with examples, diagrams and their definitions. Many researchers describe their ontologies in associated publications, but in my opinion good documentation is what any potential reuser will browse if they want to include the ontology in their work.

As I pointed out in my previous post, there are several tools to produce proper documentation, like LODE and Parrot. However, these tools focus just on the concepts of the ontology, and when using them I found myself facing three main limitations:

  1. That the tools are in web services external to my control, and whenever the ontology is larger than a certain size, the web service will not admit it.
  2. That whenever I want to export the produced ontology documentation, it’s not straightforward: I have to download a huge html and its dependencies from the browser.
  3. That if I want to edit the ontology documentation adding an introduction, diagrams, etc., I have to edit the huge downloaded html. This is cumbersome, as finding the spot where I want to add new contributions is difficult. Editing the text is usually unavoidable, as some of the metadata of the ontology is not annotated within the ontology itself.

In order to address these limitations, I decided to create Widoco, a WIzard for DOCumenting Ontologies, more than a year ago. Widoco is based on LODE and helps you create the ontology documentation in three simple steps: introducing the ontology URI or file, completing its metadata and selecting the structure of the document you want to build. You can see a snapshot of the wizard below:


Widoco screenshot

Originally, Widoco produced the documentation offline (no need to use external web services and no limit on the size of your ontology) and the output was divided into different documents, each of them containing a new section. That way, it was more manageable to edit each of them. The idea here is to be similar to Latex projects, where you include the sections you desire in the main document and comment out those you don’t want to include. Ideally, the document would readapt itself to show only the sections you want, dynamically.

After some work, I have just released version 1.2.2 of the tool, and I would like to comment on some of its features here.

  • Metadata gathering improvements: Widoco will aim to extract metadata from the ontology itself, but that metadata is often incomplete. With Widoco it is now possible to introduce many metadata fields on the fly, if the user wants them to be added to the documentation. Some of the most recently added metadata fields indicate the status of the document and how to properly cite the ontology, including its DOI. In addition, it is possible to save and load the metadata properties as a .properties file, in case the documentation needs to be regenerated in the future. As for the license, if an internet connection is available, Widoco will aim to retrieve the license name and metadata from the Licensius web services, where an endpoint of licenses is ready for exploitation.


    Widoco configuration screenshot

  • Access to a particular ontology term: I have changed the anchors in the document to match the URI of the terms. Therefore, if a user dereferences a particular ontology term, they will be redirected to the particular definition of that term in the document. This is useful because it saves time when looking for the definition of a particular concept.
  • Automatic evaluation: If an internet connection is available, Widoco uses the OOPS! web service to detect common pitfalls in your ontology design. The report can be published along with the documentation.
  • Towards facilitating ontology publication and content negotiation: Widoco now produces a publishing bundle that you can copy and paste in your server. This bundle is published according to the W3C best practices, and adapts depending on whether your vocabulary is hash or slash.
  • Multiple serialization: Widoco creates multiple serializations of your ontology and points to them from the ontology document. This helps any user to download their favorite serialization to work with.
  • Provenance and page markup: The main metadata of the ontology is annotated using RDFa, so search engines like Google can understand and point to the contents of the ontology easily. In addition, an html page is created with the main provenance statements of the ontology, described using the W3C PROV standard.
  • Multilingual publishing: Ontologies may be described in multiple languages, and I have enabled Widoco to generate the documentation in a multilingual way, linking to other languages on each page. That way you avoid having to run the program several times for generating the documentation in different languages.
  • Multiple styles for your documentation: now I have enabled two different styles for publishing the vocabularies, although I am planning to adapt the new respec style from the W3C.
  • Dynamic sections: For each section added in the document, the user will not have to worry about their numbering, as it will be done automatically. In addition, the table of contents will change according to the sections the user wants to include in the final document.
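Regarding the content negotiation mentioned in the list above: once an ontology is published, a simple way to check it is to request the same URI with different Accept headers and see which serialization comes back. The sketch below only maps a short format name to the corresponding MIME type. The function name accept_for and the set of formats are my own assumptions for illustration; they are not taken from Widoco.

```shell
#!/bin/sh
# Sketch: map a short format name to the MIME type to send in an Accept header.
# accept_for is a hypothetical helper; the list of formats is an assumption.

accept_for() {
    case $1 in
        html)   echo "text/html" ;;
        ttl)    echo "text/turtle" ;;
        rdfxml) echo "application/rdf+xml" ;;
        nt)     echo "application/n-triples" ;;
        *)      echo "unknown format: $1" >&2; return 1 ;;
    esac
}
```

With a placeholder URI such as https://example.org/myontology, a request for the Turtle serialization would then look like: curl -sL -H "Accept: $(accept_for ttl)" https://example.org/myontology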

Due to the number of requests, I also created a console version of Widoco, with plenty of options to be able to run all the possible combinations of the features listed above. Even though you don’t need an internet connection, you may want one for accessing the Licensius and OOPS! web services. Both the console version and desktop application are available through the same JAR, accessible on Github:

I built this tool to make my life easier, but it turns out that it can make the life of other people easier too. Do you want to use Widoco? Check out the latest release on Github. If you have any problems, open an issue! Some new features (like an automated changelog) will be included in the next releases.

Posted in Uncategorized

Dagstuhl seminar report: Reproducibility of Data-Oriented Experiments in e-Science

Posted by dgarijov on February 21, 2016


Dagstuhl Castle, the venue for the seminar

The last week of January I was invited to a Dagstuhl seminar about reproducibility in e-Science, and I think it would be helpful to summarize and highlight some of the results in this blog post. A more detailed report will be published in the next few months; so take this as a sneak peek. If you want to reference any of the figures or tables in the summary, please cite the Dagstuhl report.

So… what are Dagstuhl seminars?

They consist of one-week meetings that bring together researchers of a community to discuss a certain topic. The seminars are held at Schloss Dagstuhl, a center for informatics near Wadern, far from any big city. Basically, the purpose of these seminars is to isolate the participants from the world in order to push forward discussions about the topic at hand.

 What was I doing there?

Discuss, learn, take notes and disseminate the work my colleagues and I have been doing! In the Ontology Engineering Group we have carried out several initiatives to promote the reproducibility of scientific experiments. These range from formalizing protocols in order to detect missing or inconsistent details, to automatically documenting and publishing workflows, conserving their infrastructure, bundling them together with their associated resources into research objects, and handling their intellectual property rights. You can see the slides I presented during the seminar in this link:

 The seminar

The seminar was organized by Andreas Rauber, Norbert Fuhr and Juliana Freire, and I think they did a great job bringing together people from different areas: information retrieval, psychology, bioinformatics, etc. It would have been great to see more people from libraries (who have been in charge of preserving knowledge for centuries), publishers and funding agencies, as in my opinion they are the ones who can really push reproducibility forward by making authors comply with reproducibility guidelines/manifestos. Maybe we can use the outcomes of this seminar to convince them to join us next time.

Avoiding reproducing previous reproducibility outcomes

To be honest, I was a bit afraid that this effort would result in just another manifesto or set of guidelines for enabling reproducibility. Some of the attendees shared the same feeling, and therefore one of the first items on the agenda was a set of summaries of other reproducibility workshops that participants had attended, like the Euro RV3 workshop or the Artifact Evaluation for Publication workshop (also held at Dagstuhl!). This helped shape the agenda a little and move forward.

Tools, state of the art and war stories:

Discussion is the main purpose of a Dagstuhl seminar, but the organizers scheduled a couple of sessions for each participant to introduce what they had been doing to promote reproducibility. This included specific tools for enabling reproducibility (e.g., noWorkflow, ReproZip, YesWorkflow, ROHub, etc.), updates on the state of the art of a particular area (e.g., the work done by the Research Data Alliance, music, earth sciences, bioinformatics, visualization, etc.) and war stories from participants who had attempted to reproduce other people’s work. In general, the presentations I enjoyed the most were the war stories. At the beginning of my PhD I had to reproduce an experiment from a paper, so I know it can involve some frustration and a lot of work. I was amazed by the work done by Martin Potthast (see paper) and Christian Collberg (see paper) to empirically reproduce the work of others. In particular, Christian maintains a list of the papers he and his group have been able to reproduce. Check it out here.

Measuring the information gain

What do we gain by making an experiment reproducible? In an attempt to address this question, we identified the main elements into which a scientific experiment can be decomposed. Then, we analyzed what would happen if each of these components changed, and how each of these changes relates to reproducibility.

The atomic elements of an experiment are the goals of the experiment, the abstract methods (algorithms, steps) used to achieve the goals, the particular method used to implement the abstract algorithm or sketch, the execution environment or infrastructure used to execute the experiment, the input data and parameter values and the scientists involved in the experiment execution. An example is given below:

  • (R) Research Objectives / Goals: Reorder stars by their size.
  • (M) Methods / Algorithms: Quicksort.
  • (I) Implementation / Code / Source-Code: Quicksort in Java.
  • (P) Platform / Execution Environment / Context: OS, JVM, RAM.
  • (D) Data (input data and parameter values): The dataset X from the Virtual Observatory catalog.
  • (A) Actors / Persons: Daniel, who designs and executes the experiment.

Preserving or changing each of these elements of the experiment may affect the obtained results. For example, if we change the input data but keep the rest of the parts the same, we test the robustness of the experiment (new data may identify new corner cases that were not considered before). If we change the platform successfully but preserve the rest, then we improve the portability of the experiment. In the following table you can see a summary of the overall discussion. Due to time constraints we didn’t add columns for all possible scenarios, but we represented the ones that are more likely to happen:

Change? (0 = no change, 1 = change, 0/1 = doesn’t matter)

Involved part      1     2     3     4     5     6     7
Research goal      0     0     0     0     0     0     1
Method             0     0     0     0     0     1     0/1
Implementation     0     0     0     0     1     0/1   0/1
Platform           0     0     0     1     0/1   0/1   0/1
Data parameters    0     1     0/1   0     0     0/1   0/1
Input data         0     0     1     0     0     0     0
Actors             0     0/1   0/1   0/1   0/1   0/1   0/1

Information gain (in column order): Consistency, Robustness\ Generality, Portability\adoption, Portability\adoption, Independent
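To make the idea of "changing one component and keeping the rest" more concrete, here is a minimal Python sketch. It is not part of the seminar material: the `Experiment` class, its field names and the `changed_components` helper are assumptions made for illustration, reusing the star-sorting example from above.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record of the six elements of an experiment."""
    goal: str            # (R) research objective
    method: str          # (M) abstract method / algorithm
    implementation: str  # (I) code implementing the method
    platform: str        # (P) execution environment
    data: str            # (D) input data and parameter values
    actor: str           # (A) person running the experiment


def changed_components(original: Experiment, rerun: Experiment) -> List[str]:
    """Return the names of the components that differ between two runs."""
    return [
        f.name
        for f in fields(Experiment)
        if getattr(original, f.name) != getattr(rerun, f.name)
    ]


original = Experiment(
    goal="Reorder stars by their size",
    method="Quicksort",
    implementation="Quicksort in Java",
    platform="OS, JVM, RAM",
    data="Dataset X from the Virtual Observatory catalog",
    actor="Daniel",
)

# Same experiment, executed on a different (made-up) platform.
ported = replace(original, platform="another OS and JVM")

print(changed_components(original, ported))  # ['platform']
```

A run where only `platform` differs corresponds to the portability scenario described above. A run where only `data` differs would correspond to the robustness scenario.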


 Decomposing reproducibility

There are three main types of actions that you can take in order to improve the reproducibility of your work. These are proactive actions (e.g., data sharing, workflow sharing, metadata documentation, etc.), reactive actions (e.g., a systematic peer review of the components of your experiment, reimplementation studies, etc.) and supportive actions (e.g., corpus construction for reproducibility, libraries of tools that help reproducibility, etc.). These actions affect three different categories: those which involve paper reproducibility (i.e., individual papers), those which involve improving the reproducibility of groups of papers affecting a particular area of interest (like health studies that recommend a solution for a particular problem) and those which involve the creation of benchmarks that ensure that a proposed method can be executed with other state of the art data.

The following figure (extracted from the report draft) summarizes the taxonomy discussion:


A taxonomy for reproducibility

Actors in reproducibility and guidelines for achieving reproducibility.

Another activity I think is worth mentioning in this summary is the analysis that part of the group did of the different types of actors that participate in one way or another in reproducibility, along with the obstacles these actors may find in their path.

There are 6 types of actors in reproducibility: those who create contents (authors, lab directors, research software engineers, etc.), those who consume the contents (readers, users, authors, students, etc.), those who moderate the contents (editors), those who examine the contents (reviewers, examiners, etc.), those who enable the creation of the contents (funders, lab directors, etc.) and those who audit the contents (policy makers, funders).

For each of the actors, the group discussed checklists that guide them on how to fully achieve the reproducibility of their contents at three different levels: sufficient (i.e., the minimum expectation of the actor regarding the demands for reproducibility), better (an additional set of demands which improve on the previous ones) and exemplary (i.e., best practices). An example of these checklists for authors can be seen below (extracted from the report):

Sufficient:

  • Methods section – to a level that allows imitation of the work
  • Appropriate comparison to appropriate benchmark
  • Data accurately described
  • Can re-run the experiment
  • Verify on demand (provide evidence that the work was done as described)
  • Ethical considerations noted, clearances listed
  • Conflicts noted, contributions and responsibilities noted
  • Use of other authors’ reproducibility materials should respect the original work and reflect an attempt to get best-possible results from those materials

Better:

  • Black/white box
  • Code is made available, in the form used for the experiments
  • Accessible or providable data

Exemplary:

  • Open-source software
  • Engineered for re-use
  • Accessible data
  • Published in trustworthy, enduring repository
  • Data recipes, to allow construction of similar data
  • Data properly annotated and curated
  • Executable version of the paper; one-click installation and execution

Making a reproducibility paper publishable

Another cool effort aimed to determine whether reproducibility is a means or an end for a publication. Hence, the group discussed whether an effort to reproduce an actual research paper would be publishable or not, depending on the available resources and the obtained outcome. Generally, when someone intends to reproduce existing work, it is because they want to repurpose it or reuse it in their experiments. But that objective may be affected, for example, if the code that implemented the method to be reproduced is no longer available. The discussion led to the following diagram, which covers a set of possible scenarios:


Can reproducibility help you to publish a paper?

In the figure, the red crosses indicate that the effort would not have much value as a new publication. The pluses indicate the opposite, and the number of pluses would affect the target of the publication (one plus would be a workshop, while four pluses would be a top journal/conference publication). I find the diagram particularly interesting, as it introduces another benefit of trying to reproduce someone else’s experiments.

 Incentives and barriers, or investments and returns?

Incentives are often the main reason why people adopt best practices and guidelines. The problem is that, in the case of reproducibility, each incentive also has an associated cost (e.g., making all the resources available under an open license). If the cost is excessive for its return, then some people might just not consider it worth it.

One of the discussion groups aimed to address this question by categorizing the costs/investments (e.g., artifact preparation, documentation, infrastructure, training, etc.) and returns/benefits (publicity, knowledge transfer, personal satisfaction, etc.) for the different actors identified above (funders, authors, reviewers, etc.). The tables are perhaps too big to include here (you can have a look once we publish the final report), but in my opinion the important message to take home is that we have to be aware of both the cost of reproducibility and its advantages. I have personally experienced how frustrating it is to document in detail the inputs, methods and outputs used in a Research Object that expands on a paper that has already been accepted. But then, I have also seen the benefits of my efforts when I wanted to rerun the evaluations several months later, after I had made additional improvements.

 Defining a Research Agenda: Current challenges in reproducibility

Do you want to start a research topic about reproducibility? Here are a few challenges that may help you get ideas to contribute to the state of the art!

  1. What are the interventions needed to change the behavior of researchers?
  2. Do reproducibility and replicability translate into long-term impact for your work?
  3. How do we set the research environment for enabling reproducibility?
  4. Can we measure the cost of reproducibility/repeatability/documentation? What are the difficulties for newcomers?

Final thoughts:

In conclusion, I think the seminar was a positive experience. I learnt, met new people and discussed a topic that is very close to my research area with experts in the field. I think a couple of things could be improved, like having better synchronization with other reproducibility efforts taking place in Dagstuhl or having more representation from publishers and funding agencies, but I think the organizers will take this into account for future meetings.

Special thanks to Andy, Norbert and Juliana for making the seminar happen. I hope everyone enjoyed it as much as I did. If you want to know more about the seminar and some of its outcomes, have a look at the report!


Participants of the Dagstuhl seminar

Posted in Uncategorized

Permanent identifiers and vocabulary publication: purl.org and w3id

Posted by dgarijov on January 17, 2016

Some time ago, I wrote a tutorial with the common practices for publishing vocabularies/ontologies on the Web. In particular, the second step of the tutorial addressed the guidelines for setting a stable URI for your vocabulary. The tutorial referred to purl.org, a popular service for creating permanent urls on the web. Purl.org had been working for more than 15 years and was widely used by the community.

However, several months ago purl.org stopped registering new users. Then, only a couple of months ago, the website stopped allowing users to register or edit their permanent urls. The official response is that there is a problem with the SOLR index, but I am afraid that the service is not reliable anymore. The current purl redirects work properly, but I have no clue whether they intend to keep maintaining it in the future. It’s a bit sad, because it was a great infrastructure and service to the community.

Fortunately, other permanent identifier efforts have been hatched successfully by the community. In this post I am going to talk a little about w3id, an effort launched by the W3C permanent identifier community group that has been adopted by a great part of the community (with more than 10K registered ids). W3id is supported by several companies, and although there is no official commitment from the W3C for maintenance, I think it is currently one of the best options for publishing resources with a permanent id on the web.

Differences with purl.org: w3id is a bit geekier, but way more flexible and powerful when doing content negotiation. In fact, you don’t need to talk to your admin to do the content negotiation because you can do it yourself! Apart from that, the main difference between purl.org and w3id is that you don’t have a user interface to edit your purls. You do so through Github, by editing the .htaccess files there.

How to use it: let’s imagine that I want to create a vocabulary for my domain. In my example, I will use the coil ontology, an extension of the videogame ontology for modeling a particular game. I have already created the ontology, and assigned it the URI: I have produced the documentation and saved the ontology file in both rdf/xml and TTL formats. In this particular case, I have chosen to store everything in one of my repositories in Github: So, how to set up the w3id for it?

  1. Go to the w3id repository and fork it. If you don’t have a Github account, you must create one before forking the repository.
  2. Create the folder structure you assigned in the URI of your ontology (I assume that you won’t be rewriting somebody else’s URI, as if that is the case, the admins will likely detect it). In my example, I created the folders “games/spec/” (see in repo)
  3. Create the .htaccess. In my case it can be seen in the following url: Note that I have included negotiation for three vocabularies in there.
  4. Push your changes to your fork of the repository.
  5. Create a pull request to the perma-id repository.
  6. Wait until the admins accept your changes.
  7. You are done! If you want to add more w3id ids, just push them to your fork and create additional pull requests.
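Step 3 is the least obvious one, so here is a rough sketch of what such a file can look like. This is not the actual file used for the coil ontology: the path `myonto` and all the `example.org` URLs are hypothetical placeholders, and the helper `htaccess_rules` is made up for this post. `RewriteEngine`, `RewriteCond` and `RewriteRule` are standard Apache mod_rewrite directives. The snippet just prints the rules, so you can adapt them before committing them to your fork.

```python
from typing import Dict


def htaccess_rules(path: str, targets: Dict[str, str], default: str) -> str:
    """Build mod_rewrite rules that redirect `path` depending on the Accept header."""
    lines = ["RewriteEngine on", ""]
    for mime, target in targets.items():
        # Escape the characters that are special in Apache regular expressions.
        pattern = mime.replace(".", r"\.").replace("+", r"\+")
        lines.append(f"RewriteCond %{{HTTP_ACCEPT}} {pattern}")
        lines.append(f"RewriteRule ^{path}$ {target} [R=303,L]")
        lines.append("")
    # Fallback: any other client (e.g., a browser) gets the HTML documentation.
    lines.append(f"RewriteRule ^{path}$ {default} [R=303,L]")
    return "\n".join(lines)


rules = htaccess_rules(
    "myonto",  # hypothetical path under your w3id folder
    {
        "text/turtle": "https://example.org/myonto/ontology.ttl",
        "application/rdf+xml": "https://example.org/myonto/ontology.xml",
    },
    default="https://example.org/myonto/index.html",
)
print(rules)
```

The order matters: the specific serializations are checked first, and the last rule acts as the default when none of the conditions match.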

Now every time somebody accesses the URL, it will redirect to where the htaccess file points. In my case, for the documentation, for TTL and for rdf/xml. This also works if you just want to do simple 302 redirections. W3id administrators are usually very fast to review and accept the changes (so far I haven’t had to wait more than a couple of hours before having everything reviewed). The whole process is perhaps slower than what purl.org used to be, but I really like the approach. And you can do negotiations that you were unable to achieve with purl.org.
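If you want to check that the negotiation behaves as expected once your pull request is merged, a small script can help. The following is a minimal sketch using only the Python standard library. The function names and the example identifier in the comment are assumptions, so replace them with your own w3id URI.

```python
import urllib.request


def negotiation_request(url: str, mime: str) -> urllib.request.Request:
    """Build a GET request that asks for a given serialization via the Accept header."""
    return urllib.request.Request(url, headers={"Accept": mime})


def resolve(url: str, mime: str) -> str:
    """Follow the redirects and return the final URL (needs network access)."""
    request = negotiation_request(url, mime)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.geturl()


def check_all(url: str) -> None:
    """Print where each of the three serializations ends up."""
    for mime in ("text/html", "text/turtle", "application/rdf+xml"):
        print(mime, "->", resolve(url, mime))


# Example call with a hypothetical identifier (uncomment to run it online):
# check_all("https://w3id.org/example/myonto")
```

`urlopen` follows redirects automatically, so the URL printed for each media type should be the one your .htaccess rules point to.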

Http vs https: As a final comment, w3id uses https. If you publish something with http, it will be redirected to https. This may look like an unimportant detail, but it is critical in some cases. For example, I have found that some applications cannot negotiate properly if they have to handle a redirect from http to https. An example is Protégé: if you try to load, the program will raise an error. Using https in your URI works fine with the latest version of the program (Protégé 5).

Posted in Tutorial

EC3 Fieldtrip: 1-7 August, 2015: It’s rock time!

Posted by dgarijov on August 11, 2015

A few months ago my supervisor told me about the opportunity to join a group of geologists on a field trip to Yosemite. The initiative was driven by the Earthcube community, in an effort to bring together experts from different geological domains (tectonics, geochemistry, etc.) and computer scientists. I immediately applied for a place on the trip, and I have just returned to Spain. It has been an amazing experience, so in this post I want to summarize my views and experiences during the whole week.

Travelling and people

For someone travelling from Europe, the trip was exhausting (2 layovers and up to 24 hours of flights + waiting), but I really think it was worth it. I have learnt a lot from the group and about the challenges geologists are facing when collecting and sharing their data, samples and methods. All participants were open and had the patience to clear up any doubts or concerns about the geological terms being used in the exercises and talks. Also, all the attendees were highly motivated and enthusiastic to learn new technologies and methods that could help them solve some of their current issues. I think this was crucial for creating the positive environment of discussion and collaboration we had during the whole experience. I hope this trip helps push forward best practices and recommendations for the community.

Yosemite National Park

There is little I can say about the park and its surroundings that hasn’t already been said. Therefore, I’ll let the pictures speak for themselves:

Yosemite National Park. Nope, that’s not snow, just rocks!

Apparently, some of the most interesting rocks could be found in the middle of the desert

A thousand year old forest

What was the rationale behind the trip?

As I said before, the purpose of the fieldtrip was to bring together computer scientists and geologists. There are two main reasons why this could be interesting for geologists. First, the geologists could show and tell computer scientists how they work and their current struggles with either hardware or software in the field. Second, geologists could connect to other geologists (or computer scientists) in order to foster future collaborations.

From a computer science point of view, I believe this kind of trip is beneficial for raising awareness of current technologies among end users (in many cases we have the technology but we don’t have the users to use it). Also, it always helps to see with one’s own eyes the real issues faced by scientists in a particular domain. It makes them easier to understand.

What was I doing there?

Nobody would believe me when I told them that I was going to travel to Yosemite with geologists to do some “field” work. And, to be honest, one of my main concerns when preparing the trip was that I had no idea how I would make myself useful to the rest of the attendees. I felt like I would learn a lot from all the other people, since some of their problems are likely to be similar to problems in other areas, and I wanted to give something in return. Therefore I talked to everyone and asked a lot of questions. I also gave a 10 minute introductory talk on the Semantic Web (available here), to help them understand the main concepts they had already heard in other talks or project proposals. Finally, I came up with a list of the challenges they face from a computational perspective and proposed extending existing standards to address some of them.

Challenges for geologists

I think it is worth describing here some of the main challenges that these scientists are facing when collecting, accessing, sharing and reusing data:

  1. Sample archival and description: there is no standard way of processing and archiving the metadata related to samples. Sometimes it is very difficult to find the metadata associated with a sample, and a sample with no metadata is worthless. Similarly, it is not trivial to find the samples that were used in a paper. NSF is now demanding a Data Management Plan, but what about the Sample Management Plan? Currently, every scientist is responsible for his/her samples, and some of those might be very expensive to collect (e.g., a sample from an expedition to Mount Everest). If someone retires or changes institutions, the samples are usually lost. Someone told me that the samples used in his work could be found in his parents’ garden, as he didn’t have space for them anymore (at least those could be found 🙂 ).
  2. Repository heterogeneity and redundancy. Some repositories have started collecting sample data (e.g., SESAR), which shows an effort from the community to address the previous issue. Every sample is given a unique identifier, but it is very difficult to determine if a sample already exists in the database (or in other repositories). Similarly, there are currently no applications for exploiting the data of the repository. Domain experts perform SQL queries, which will be different for each repository as well. This makes integrating data from different repositories difficult at the moment.
  3. Licensing: People are not sure about which license they have to attach to their data. This is key for being attributed correctly when someone reuses your results. I have seen this issue in other areas as well. In this link I think they explain everything in great detail:
  4. Sharing and reusing data: Currently if someone wants to reuse some other researcher’s mapping data (i.e., those geological observations they have written down in a map), they would have to contact the authors and ask them for a copy of their original field book. With luck, there will be a scanned copy or a digitized map, which then will have to be compared (manually) to the observations performed by the researcher. There are no approaches for performing such comparison automatically.
  5. Trust: Data from other researchers is often trusted, as there are no means to check whether the observations performed by a scientist are true or not unless one goes into the field.
  6. Sharing methods: I was surprised to hear that the main reason why the methods and workflows followed in an experiment are not shared is that there is no culture of doing it. Apparently the workflows are there because some people use them as a set of instructions for students, but they are not documented in the scientific publications. This is an issue for the reproducibility of the results. Note that here we define workflow as the set of computational steps that are necessary to produce a research output in a paper. Geologists also have manual workflows for collecting observations in the field. These are described in their notebooks.
  7. Reliability: This was brought up by many scientists in the field. Many still think that the applications on their phones are often not reliable. In fact we did some experiments with an iPhone and an iPad and you could see differences in their measurements due to their sensors. Furthermore, I was told that if a rock is magnetic, these applications become useless. Most of the scientists still rely on their compasses to perform their measurements.

Why should geologists share their data?

The vans haven’t been just a vehicle to take us to some beautiful places on this trip; they have also been a useful means to get people to discuss some of the challenges and issues described above. In particular, I would like to recall the conversation we had on one of the last days between Snir, Zach, Basil, Andreas, Cliff and others. After discussing some of the benefits that sharing has for other researchers, Andreas asked about the direct benefit he would obtain from sharing his data. This is crucial in my view: if sharing data is only going to benefit other people and not me, why should I do it (unless I get funding for it)? Below you can find the arguments in favor of adopting this practice as a community, tied to some of the potential benefits (quoting Cliff Joslyn in points 1 and 2).

  1. Meta-analysis: being able to reuse other researchers’ data to analyze and compare new features. This is also beneficial for one’s own research, in case you change your laptop/institution and no longer have access to your previous data.
  2. Using consumer communities to help curate data: apparently, some geophysicists would love to reuse the data produced by geologists. They could be considered as clients and taken into account when applying for a grant as a collaboration.
  3. Credit and attribution: Recently some journals like PLOS or Elsevier have started creating data journals. In there you would just upload your dataset as a publication, so people using it can cite it. Additionally, there are data repositories like FigShare, where just by uploading a file you make it citable. This way someone could cite an intermediate result you obtained during part of your experiments!
  4. Reproducibility: sharing data and methods is a clear sign of transparency. By accessing the data and methods used in a paper, a reviewer would be able to check the intermediate and final results of a paper in order to see if the conclusions hold.

Are these benefits enough to convince geologists to share and annotate their data? In my opinion, the amount of time that one has to spend documenting work is still a barrier for many scientists. The benefits cannot be seen instantly, and in most of the cases people don’t bother after writing the paper. It is an effort that a whole community has to undertake, and make it part of its culture. Obviously, automatic metadata recording will always help.


This trip has proved very useful for bringing together people from different communities. Now, how do we move forward? (Again, I am quoting Cliff Joslyn, who summarized some of the points discussed during the week):

  1. Identify motivated people who are willing to contribute with their data.
  2. Creation of a community database.
    1. Agree on standards to use as a community, using common vocabularies to relate the main concepts on each domain.
    2. Analyze whether there are valuable existing efforts to build on instead of starting from scratch.
    3. Contact computer scientists, ontologists and user interface experts to create a model that is both understandable and easy to consume from.
  3. Exploit the community database. Simple visualization in maps is often useful to compare and get an idea of mapped areas.
  4. Collaborate with computer scientists instead of considering them merely as servants. Computer scientists are interested in challenging real world problems, but they have to be in the loop.

Finally, I would like to thank Matty Mookerjee, Basil Tikoff and all the rest of the people who made this trip possible. I hope it happens again next year. And special thanks to Lisa, our cook. All the food was amazing!

Below I attach a summary of the main activities of the trip by day, in case someone is interested in attending future excursions. Apologies in advance for any incorrect usage of geological terms.

Summary of the trip

Day 1: After a short introduction on how to configure your notebook (your convention, narrative, location, legend, etc.) we learnt how to identify the rock we had in front of us by using the hand lens. Rocks can be igneous, metamorphic or sedimentary, and in this case, as can be seen in the pictures below, we were in front of the igneous type. In particular, granite.

identifying a particular type of rock

Once you know the type of rock you are dealing with and its location, it’s time to sketch, leaving out most details and representing just those that are relevant for your observation. Note that different types of geologists might consider different features relevant. Another interesting detail is that observations are always associated with areas, not points, because of possible measurement error. This might sound trivial, but it makes a huge difference (and adds complexity) when representing the information as a computer scientist.
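To illustrate why areas are harder to handle than points, here is a small Python sketch. It is purely an assumption on my side and not something we built during the trip: each observation carries an error radius, and two observations are considered compatible when their uncertainty circles overlap. The class, the function names and the coordinates are all made up.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


@dataclass(frozen=True)
class Observation:
    label: str
    lat: float      # degrees
    lon: float      # degrees
    error_m: float  # positional uncertainty, as a radius in metres


def distance_m(a: Observation, b: Observation) -> float:
    """Great-circle distance between two observations (haversine formula)."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def may_coincide(a: Observation, b: Observation) -> bool:
    """True if the uncertainty circles overlap, i.e. both notes could describe the same spot."""
    return distance_m(a, b) <= a.error_m + b.error_m


# Made-up coordinates, for illustration only.
first = Observation("granite outcrop", 37.7459, -119.5332, error_m=30.0)
second = Observation("granite outcrop (second notebook)", 37.7459, -119.5330, error_m=10.0)
far_away = Observation("another outcrop", 37.7559, -119.5332, error_m=10.0)

print(may_coincide(first, second))    # True
print(may_coincide(first, far_away))  # False
```

With points, comparing two observations would be a simple equality check. With areas, even this basic question needs a distance computation and a decision about how much overlap is enough.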

The day ended with three short talks: one about the Strabo app for easily handling and mapping your data with your phone, and one about the Fieldmove app (Andrew Bladon) for easily measuring strike and dip, adding annotations and representing them on a map. Shawn Ross wrapped up the session with the third, talking briefly about his collaborations with archaeologists for field data collection.

Day 2: We learnt about cross sections in Sierra Nevada, after a short explanation of the evolution of the area from a geological perspective. Apparently geologists think in terms of time when analyzing a landscape, in order to determine the main changes that were necessary to produce the current result. In this regard, it is like learning about the provenance of the earth, which I think is pretty cool.

Matty’s favourite section had to be explained with a poster because the road was no more

Unfortunately, Matty’s favorite section was not accessible and had to be explained via a poster. Some flooding had destroyed the road, and the section was too far away to be reached on foot. Therefore we were driven to another place in the Sierra where we were asked to draw a cross section ourselves (with the help of a geologist). It was an area with very clear faults, and most of us drew their direction right. The excursion ended when one of the geologists gave a detailed speech on the rationale behind her sketch, so we could compare.

When we arrived at the research center, Jim Bowing gave a short talk on state, and how geologists should be aware of their observations and the value of the attributes described in them. We as computer scientists can only recreate what we are given. We then split into groups and thought about use cases, reporting two to the rest of the groups.

Day 3: It was time to learn about the gear: GPS, tablet and laptop (which can be heavy), all equipped with long-lasting batteries (they could last more than 2 days of fieldwork). We went to the Deep Springs Valley, and after locating ourselves on a topographic map we followed a contact (i.e., the line between two geological units). We experienced some frustration with the devices (the screen was really hard to see), and we poured some acid on the rocks in order to determine whether they were carbonates or not.

Moureen’s notebook: high quality sketching!

Learning how to measure strike and dip on rocks.

The contact finished abruptly in a fault after a few hundred meters (represented as a “v” in a map). We determined its orientation and fault access, which was possible thanks to some of the mobile applications we were using in the field. If done by hand, we would have had to analyze our measurements at home.

After a brief stop at an observatory full of metamorphic rocks, we headed back to the research center. There, Cliff Joslyn and I gave a brief introduction to databases, relational models and the Semantic Web before doing another group activity. In this case, we tried to think about the perfect app for geologists, and what kind of metadata it would need to capture.

Day4: We went to the Caldera, close to a huge crack in the ground, where we learnt a bit more about its formation. There was a volcanic eruption in two phases, which can be distinguished by the materials that surround the pumice stones.

Geologists analyzing the mountain.

We then went to the lakes, where we learnt from Matty how to extract a sample. First you ought to properly identify the rock, annotate it with the appropriate measurements (orientation, strike, dip), label the rock and then extract it. If you use a sample repository like SESAR, you may also ask in advance for identifiers and print stickers for labeling the rock.

Learning how to extract a sample.

We ended the hike with a short presentation by Amanda Vizedom on ontologies and a discussion of the future steps for the community.

Posted in Workshop

Open Research Data Day 2015: Get the credit!

Posted by dgarijov on June 10, 2015

Last week I attended a two-day event on Open Research Data: Implications and Society. The event was held at Warsaw’s University Library, close to the old district, and it took place while all the students were actually studying in the library.

Warsaw’s Palace of culture

Warsaw’s old district

The event was sponsored by the Research Data Alliance and OpenAire among others, with presenters from institutions like CERN, companies that aim at facilitating publishing scientific data like Figshare (or benefit from them like Altmetric) and people from the editorial world like Elsevier and Thomson Reuters. Lidia Stępińska-Ustasiak was the main organizer of the event, and she did a fantastic job. My special thanks to her and her team.

In general, the audience was very friendly and willing to learn more about the problems raised by the presenters. The program was packed with keynotes and presentations, which made it quite a non-stop conference.

What I presented

I attended the event to talk about Research Objects and our approach for their proper preservation by using checklists. Check the slides here. In general, our proposal was well received, even though much work is still necessary to make it happen as a whole. Applications like RODL or MyExperiment are the first step forward towards achieving reproducible publications.

What I liked

The environment, the talks (kept to 10 minutes for the short talks and to 25 for keynotes), people staying to hear others instead of running away after their presentations, and all the discussions that happened during and after the events.

What I missed

Even though I enjoyed the event very much, I missed some innovative incentives for scientists to actually share their methods and their data. Credit and attribution were the main reasons given by everyone to share their data. However, these are long-term benefits. For instance, after sharing the data and methods I have used in several papers as Research Objects, I have noticed that it really takes a lot of time to document everything properly. It pays off in the long term when you (or others) want to reuse your own data, but not immediately. Thus, I can imagine that other scientists may use this as an excuse to avoid publishing their data and workflow when they publish the associated paper. The paper is the documentation, right?

My question is: can we provide a benefit for sharing data/workflows that is immediate? For example: if you publish the workflow, the “Methods” page of your paper will be written automatically, or you will have an interactive drawing that looks supercool on your paper, etc. I haven’t found an answer to this question yet, but I hope to see some advance in this direction in the future.

But enough with my own thoughts, let’s stick to the content. I summarize both days below.

Day 1

After the welcome message, Marek Niezgódka introduced the efforts made in Poland towards open research data. The Polish Digital Library now offers access to all scientific publications for everyone, in order to promote Polish scholarly literature in the scientific world. Since Polish is not an easy language, they are investing in the development of tools and projects like Wordnet and Europeana.

Mark Parsons (Research Data Alliance) followed by describing the problem of replicating scientific results. Before working in RDA, he used to work at the NSIDC, which observes and measures climate change. Apparently, some results were really hard to replicate because different experts understood concepts differently. For example, the term “ice edge” is defined differently in several communities. Open data is not enough: we need to build bridges among different communities of experts, and this is precisely the mission of RDA. With more than 30 working and interest groups integrating people from industry and academia, RDA aims to improve the “data fabric” by building foundational terminologies, enabling discovery among different registries and standardizing methodologies between different communities:

The data fabric

Jean-Claude Burgelman (European Commission) provided a great overview of the open research lifecycle:

Data Publication Lifecycle

The presenter described the current concerns with open access in the European Commission, and how they are proposing a bottom-up approach by enabling a pilot for open research data, which has provided encouraging preliminary results.
Although data is currently being opened only in some areas (see picture below), it is good to see that the European Commission is also focusing on infrastructures, hosting, intellectual property rights and governance. For example, in the open pilot even patents are possible under the open data policy.

Open data by community

The talk ended with an interesting thought: high impact journals are less than 1% of the scientific production.

The next presenter was Kevin Ashley, from the British Digital Curation Center. Kevin started his talk with the benefits of data sharing, both from a selfish view (credit) and the community view (for example, data from archaeology has been used by paleontology experts). Good research needs good data, and what some people consider noise could be a valuable input for other researchers in different areas.
I liked how Kevin provided some numbers regarding the maintenance of an infrastructure for open access to research papers. Assuming that only 1 out of 100 papers is reused, in 5 years we could save up to 3 million per year from buying papers online. Also, linking a publication and its data increases its value. Open data combined with closed software, on the other hand, is a barrier.
The talk ended with the typical reasons people give for not sharing their data, as well as the main problems that actually stop data reuse:

Excuses and responses for not making your data available

What stops data reuse?

The evening was followed by a set of quick presentations.

  • Giulia Ajmone (OECD) introduced open science policy trends by using the “stick and carrot” metaphor: carrots are financial incentives, proper acknowledgement and attribution, while the sticks are the mandatory rules necessary to make them happen. Individual policies exist at the national level in many countries.
  • Magdalena Szuflita (Gdańsk University of Technology) tried to identify additional benefits for data sharing by doing a survey on economics and chemistry (areas where the researchers didn’t share their data).

    Incentives for data sharing

  • Ralf Toepfer (Leibniz centre of economics) provided more details on open research data in economics, where up to 80% of the researchers do not share their data (although the majority of the people think other people should share their data). I personally find this very shocking in an environment where trust and credibility are key, as some of these studies might be the cause of big political changes.
  • Marta Teperek (University of Cambridge) talked about the training activities and workshops for sharing data at the University of Cambridge.
  • Helena Cousijn (Elsevier) described ways for researchers to store, share and discover data. I liked the slide comparing the research initiatives versus the research needs (see below). I also learnt that Elsevier has a data repository where they assign DOIs and up to 2 data journals.

    Initiatives vs research data needs

  • Marcin Kapczyński introduced the data citation index they are developing at Thomson Reuters, which covers 240 high value multidisciplinary repositories. A cool feature is that it can distinguish between datasets and papers.
  • Monica Rogoza (National Library of Poland) presented an approach to connect their digital library to other repositories, providing a set of tools to visualize and detect pictures in texts.

The day ended with some tools and methodologies for opening data in different domains. Daniel Hook, from Figshare, gave the invited talk by appealing to our altruism instead of our selfishness for sharing data. He surveyed the different ages of research: individual research led to the age of enlightenment, institutional research to an age of evaluation, national research to an age of collaboration and international research to an age of impact. Unfortunately, sometimes impact might be a step back from collaboration. Most of the data is still hidden in Dropbox or pendrives, and when institutions share it we find three common cases: 1) they are forced to do it, in which case the budget for accomplishing it is low; 2) they are really excited to do it, but it is not a requirement; and 3) they may not understand the infrastructure, but they aim to provide tools to allow authors to collaborate internationally.

And finally, a manifesto:

Manifesto for sharing data

The short talks can be summarized as follows:

  • Marcin Wichorowsky (University of Warsaw) talked about the GAME project database to integrate oceanographic data repositories and link them to social data.
  • Alexander Nowinsky (University of Warsaw) described COCOs, a cosmological simulation database which aims at storing large scale simulations of the universe (with just 2 datasets they are over 100TB!)
  • Marta Hoffman (University of Warsaw) introduced RepOD, the first repository for open data in Poland, complementary to other platforms like the Open Science Platform. It adapts CKAN and focuses explicitly on research data.
  • Henry Lütke (ETH Zurich) described their publication pipeline for scientific data, which uses OpenBis for data management, electronic notebooks and OAI-PMH to track the metadata. It is integrated with CKAN as well.

Day 2

The second day was packed with presentations as well. Martin Hamilton (Jisc) gave the first keynote by analyzing the role of the pioneer. Assuming that in 2030 there will be tourists on Mars, what are the main causes that could enable it? Who were the pioneers that pushed this effort forward? For example, Tesla Motors will not initiate any lawsuit against someone who, in good faith, wants to use their technology for the greater good. These are the kind of examples we need to see for research data as well. New patrons may arise (e.g., Google, Amazon, etc. give awards as research grants) and there will be a spirit of co-opetition (i.e., groups with opposite interests working together on the same problem), but working together we could address the issue of open access in research data and move towards other challenges like full reproducibility of scientific experiments.

Tim Smith (CERN, Zenodo) followed by describing how we often find ourselves on the shoulders of secluded giants. We build upon the work done by other researchers, but the shareability of data might be a burden in the process: “If you stand on the shoulders of hidden giants, you may not see too far”. Tim argued that researchers participating in the human collective enterprise that pushes research forward often look for their own best interest, and that by fostering feedback one’s own interest may become a collective interest. Of course, this also involves a scientist-centric approach providing access to the tools, data, materials and infrastructure that delivered the results. Given that software is crucial for producing research, Zenodo was presented as an application for collaborative development to publish code as part of the active research process (integrated with Github). The keynote ended by explaining how data is shared in an institution like CERN, where there are PetaBytes of data stored. Since all the data can’t be opened due to its size, only a set of selected data for education and research purposes is made public (currently around 40 TB). The funny thing is how opening data has actually benefitted them: they ran an open challenge asking people to improve their machine learning algorithm on the input data. Machine learning experts, who had no idea about the purpose of the data, won.

Zenodo-Github bridge

A set of short presentations were next:

  • Pawel Krajewski presented the transplant project, a software infrastructure for plant scientists based on checklists for publishing the data. It follows the ISA-TAB format.
  • Cinzia Daraio (Sapienza) described how to link heterogeneous data sources in an interoperable setting with their ontology-based (14 modules!) data management system. The ontology is used to represent indicators on different disciplines and be able to do comparisons (e.g., opportunistic behavior).
  • Kimil Wais (University of Information Technology and Management in Rzeszów) showed how to monitor open data automatically by using an application, Odgar, based on R for visualizing and computing statistics.
  • Me: I presented our approach for preserving Research Objects by using checklists described above.

After the break, Mark Thorley (NERC-UK) gave the last invited talk. He presented CODATA, an international group like RDA that, instead of following a bottom-up approach, follows a top-down one. As described before, a huge problem lies with the knowledge translators, people who know how to talk to experts in different domains about their uses of data. In this regard, the role of the knowledge broker/intermediary is gaining relevance: people who know the data and know how to use it for other people’s needs. Rather than exposing the data, in CODATA they are working towards exposing and exploiting (IP rights) the knowledge behind it.

A series of short talks followed the invited talk:

  • Ben McLeish (Altmetric) described how in their company they look for any research output using text mining: Reddit, Youtube, repositories, blogs, etc. They have come up with a new relevance metric based on donut-shaped graphics which can even show how your institution is doing and how engaging your work is.
  • Krzysztof Siewicz (University of Warsaw) explained from the legal point of view how different data policies could interfere when opening data.
  • Magdalena Rutkowska-Sowa (University of Białystok) finished up by describing the models for commercialization of R&D findings. With Horizon 2020, new policy models and requirements will have to be introduced.

The second day finished with a panel discussion with Tim Smith, Giulia Ajmone, Martin Hamilton, Mark Parsons and Mark Thorley as participants, discussing further some of the issues presented during both days. Although I didn’t take many notes, some of the discussions were about how enterprises could figure out open data models, data privacy, how to build services on top of open data or the value of making data available.

The panel. From left to right: Giulia Ajmone, Mark Thorley, Martin Hamilton, Tim Smith and Mark Parsons

Posted in Conference

WWW2015: Linked Data or DBpedia?

Posted by dgarijov on June 4, 2015

A couple of weeks ago I attended the International World Wide Web (WWW) conference in Florence. This was my first time at WWW, and I was impressed by the number of attendees (apparently, more than 1400). Everyone was willing to talk and discuss their work, so I met new people, talked to some I already knew and left with a very positive experience. I hope to be back in the future.

In this post I summarize my views on the conference. Given its size, I could not attend all the different talks, workshops and tutorials, but if you could not come you might be able to get an idea of the kind of content that was presented. The proceedings can be accessed online here.

The venue

The conference was held in Fortezza da Basso, one of Florence’s largest historical buildings. Although it was packed with talks, tutorials and presentations, more than one attendee managed to skip one or two sessions to do some sightseeing, and I can’t blame them. I didn’t skip any sessions, but I managed to visit the Ponte Vecchio and have a walk around the city after the second day was over :).

Fortezza da Basso (left) and Ponte Vecchio (right)

My contribution: Linked Data Platform and Research Objects

My role in the conference was to present a poster in the Save-SD workshop. We use the Linked Data Platform standards to access Research Objects according to the Linked Data principles, which makes them easy to create, manage, retrieve and edit. You can check our slides here, and we have a live demo prototype here. The poster can be seen in the picture below. We got some nice potential users and feedback from the attendees!

Our poster: Linked Data Platform and Research Objects
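
To give a flavour of what creating a Research Object through the Linked Data Platform could look like, here is a minimal Python sketch. It only builds the HTTP request (nothing is sent), and it is an illustration under assumptions rather than the code of our prototype: the container URL, the slug, the title and the function name are all hypothetical, and I assume the Research Object is described in Turtle using the `ResearchObject` class of the wf4ever RO vocabulary. The `Slug` and `Link` headers are the ones the LDP specification relies on when a client asks a container to create a new resource.

```python
from urllib.request import Request

# Interaction model requested for the new resource (defined in the LDP vocabulary).
LDP_BASIC_CONTAINER = "http://www.w3.org/ns/ldp#BasicContainer"

# Hypothetical description of a Research Object, written in Turtle.
EXAMPLE_TURTLE = """\
<> a <http://purl.org/wf4ever/ro#ResearchObject> ;
   <http://purl.org/dc/terms/title> "My experiment" .
"""


def build_create_request(container_url: str, slug: str, turtle: str) -> Request:
    """Build (but do not send) an LDP-style POST asking a container to create a resource."""
    return Request(
        url=container_url,
        data=turtle.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/turtle",                   # body is Turtle
            "Slug": slug,                                    # hint for the new resource's URL
            "Link": f'<{LDP_BASIC_CONTAINER}>; rel="type"',  # the new resource should be a container
        },
    )


# Hypothetical container URL; a real server would answer with a Location header.
request = build_create_request("http://example.org/ros/", "my-experiment", EXAMPLE_TURTLE)
print(request.get_method(), request.full_url)
```

Retrieving, editing and deleting would follow the same idea with GET, PUT and DELETE on the URL returned by the server.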

The conference keynotes

The keynotes were one of the best parts of the conference. Jeanette Hoffman opened the first day by describing the dilemmas of digitalization, comparing them to the myth of falling between Scylla and Charybdis. She introduced four main dilemmas, which may not have a best solution:

  • The privacy paradox, as we have a lot of “free” services at our disposal, but the currency in which we pay for them is our own private data
  • Bias on free services: For example, Internet.org is an alliance of enterprises that claims to offer local services for free in countries where people cannot afford them. But some protesters claim that they offer a manipulated internet where people can’t decide. Is it better to have something biased for free or an unbiased product for which you have to pay?
  • Data protection versus free access to information: illustrated with the right to be forgotten, celebrated in Germany as a success of the individual over Google, but heavily criticized in other countries like Spain where corrupt politicians use it to look better to the potential voters after the sentence has expired. The process of “being forgotten” is not transparent at all.
  • Big brother is always watching you: how do the security / law enforcement / secret services collect everything about us? (All for the sake of our own protection). National services collect the data on the foreigners to protect the locals. What about data protection? Shall we consider ourselves under constant surveillance?

The second keynote was given by Deborah Estrin, and it discussed what we could do with our small data. We are walking sensors constantly generating data with our mobile devices and “small data is to individuals what big data is to institutions”. However, most people don’t like analyzing their data. They download apps that passively record and use their data to show them useful stuff: healthy purchases based on your diet, decline at an old age, monitoring, etc. The issue of privacy is still there, but “is it creepy when you know what is going on, instead of everybody using this data without you knowing? Why can’t you benefit from your own data as well?”.

Andrei Broder, from Google, was the last keynote presenter. He did a retrospective of the Web, analyzing whether their predictions for the last decade were true or not, and making some additional ones for the future. He introduced the 3 drivers of progress: scaling up with quality, a faster response and higher functionality levels.

The keynote also included some impressive data, from then and now. In 1999 people still had to be told what a web crawler was. Today 20 million pages are crawled every day, and the index is over 100 PetaBytes. Wow. Regarding future predictions, it looks like Google is evolving from a search box to a request box.


Saving scholarly discourse

I attended the full day SAVE-SD workshop, designed for enhancing scholarly data with semantics, analytics and visualization. The workshop was organized by Francesco Osborne, Silvio Peroni and Jun Zhao, and it received a lot of attention (even though the LDOW workshop was running in parallel). One of the features of the workshop was that you could submit your paper in html using the RASH grammar. The paper is then enriched and can be directly converted to other formats demanded by publishers like the ACM’s pdf template.

Paul Groth kicked off the workshop by introducing in his keynote how to increase productivity in scholarship by using knowledge graphs. I liked how Paul quantified productivity with numbers: taking as productivity the amount of stuff we can do in one hour, productivity has risen by up to 30% in places like the US since 1999. Scholarly output has grown by up to 60%, but that doesn’t necessarily translate into a productivity boost. The main reason why we are not productive is “the burden of knowledge”: we need longer times to study and process the amount of research output being produced in our areas of expertise. Even though tools for collaborating among researchers have been created, in order to boost our productivity we need synthesized knowledge, and Knowledge Graphs can help with that. Hopefully we’ll see more apps based on personalized knowledge graphs in the future 🙂

The rest of the workshop covered a variety of domains:

  • Bibliography: with the Semantic Lancet portal, which allows exploring citations as first-class citizens, and Conference Live, a tool for accessing, collecting and exploiting conference information and papers as Linked Data.
  • Licensing, with Elsevier’s copyright model.
  • Enhanced publications, where Bahar Sateli won the best paper award with her approach to create knowledge bases from papers using NLP techniques (pdf) and Hugo Mougard described an approach to align conference video talks to their respective papers.
  • Fostering collaborations: Luigi Di Caro described the impact of collaborators on one’s own research (d-index). I tested it and I am glad to see that I am less and less dependent on my co-authors!

D-index: a tool for testing your trajectory dependence

Linked Data or DBpedia?

I was a bit disappointed to discover that although many different papers claimed to be using/publishing Linked Data, in reality they were just approaches to work with one dataset: DBpedia. Ideally, Linked Data applications should exploit and integrate the links from different distributed sources and datasets, not just a huge centralized dataset like DBpedia. In fact, the only paper I saw that exploited the concept of Linked Data was the one presented by Ilaria Tiddi on using Linked Data to label academic communities (pdf), in which they aimed to explain data patterns detecting communities of research topics by doing link traversal and applying clustering techniques according to the LSA distance.

Web mining and Social Networks: is WWW becoming the conference of the big companies?

After attending the Web mining and Social Network tracks, I wonder whether it is possible to actually have a paper accepted about these topics if Microsoft, IBM, Yahoo or Google is not supporting the work with their data. I think almost all the papers in these tracks had collaborators from one of these companies, and I fear that in the future WWW might become monopolized by them. It is true that having industry involved is good for research. They provide useful real world use cases and data to test them. However, most of the presented work boiled down to presenting a problem solved with a machine learning technique and a lot of training (which carries the risk of overfitting the model). There wasn’t much innovation in the solutions, and the data was not accessible, as in most cases it’s private. A way to overcome this issue could be to require the authors of submitted papers to share their data, which would be consistent with the open data movements we have been seeing in events like Open Research Data Day or Beyond the PDF, and would allow other researchers to test their own methods as well.

Opinions aside, some interesting papers were presented. Wei Song described how to extract patterns from titles for entity recognition with high precision to produce templates of web articles (pdf); I saw automatic tagging of pictures using a 6 level neural network plus the derivation of a three level taxonomy from the tags (although the semantics was a bit naive in my opinion) (pdf); Pei Li (University of Zurich and Google) introduced how to link groups of entities together to identify business chains (pdf); and Gong Cheng described the creation of summaries for effective human-centered entity linking (pdf).

My personal favorites were the methods to detect content abusers in Yahoo answers to help the moderators’ work (pdf), by analyzing the flagged contents of the users; and the approach for detecting early rumors in Twitter (pdf) by Zhe Zhao. According to Zhe, they were able to detect rumors up to 3 hours before anyone else.

Graph and subgraph mining

Since I have been exploring how to use graph mining techniques to find patterns in scientific workflows, I thought that attending these sessions might help me to better understand my problem. Unfortunately none of the presenters described approaches for common sub-graph mining, but I learnt about current hot topics regarding social networks: finding the densest sub-graphs (pdf, pdf and pdf), which I think is important for determining which nodes are the most important to influence/control the network; and discovering knowledge from the graph, useful to derive small communities (pdf) and web discovery (pdf). I deliberately avoid providing details here, as these papers tend to get technical quite quickly.

Semantic Web

Finally, I couldn’t miss the Semantic Web track, since it was the one that could have the most potential overlap with the work my colleagues and I do in Madrid. We had 5 different papers, each one on a different topic:

  • benchmarking: Axel Ngonga presented GERBIL, a general entity annotator benchmark that can compare up to 10 entity annotation systems (pdf).
  • instance matching: Arnab Dutta explained their approach to match instances depending on the schema by using Markov clustering (pdf).
  • provenance: Marcin Wylot described their approach for materializing views for representing the provenance of the information. The paper uses TripleProv as a query execution engine, and claims to be the most efficient way to handle provenance enabled queries (pdf).
  • RDF2RDB: an uncommon topic, as it is usually the other way around. Minh-Duc Pham proposed to obtain a relational schema from an RDF dump in order to exploit the efficiency of typical databases (pdf). However, he recognized that if the model is not static this could present some issues.
  • triplestores: Philip Stutz introduced TripleRush (pdf), a triplestore that uses sampling and random walks to create a special index structure and be more efficient in clustering and ranking RDF data.

Final remarks

  • I liked a paper discussing the gender roles in movies against the actual census (pdf). Gives you an idea of how manipulative the media can be.
  • The microposts workshop was fun, although mainly focused on named entity recognition (e.g., Pinar Kagaroz’s approach). I think that “random walk” is the phrase I have heard the most in the conference.
  • Check Isabel Colluci’s analysis on contemporary social movements.
  • What are the top ten reasons why people lie on the internet? Check out this poster.

Next WWW will be in Montreal, Canada and James Hendler was happy about it. Do you want to see more highlights? Check Paul Groth’s trip report here, Thomas Steiner’s here, Amy Guy’s here and Marcin Wylot’s here.

Posted in Conference

General guidelines for reviewing a scientific publication

Posted by dgarijov on February 15, 2015

Lately I’ve been asked to do several reviews for different workshops, conferences and journals. In this post I would like to share with you a generic template to follow when reviewing a scientific publication. If you have been doing it for a while you may find it trivial, but I think it might be useful for people who have recently started reviewing. At least, when I started, I had to ask my advisor and colleagues for a similar one.

But first, several reasons why you should review papers:

  • It helps you to identify whether a scientific work is good or not, and to refine your criteria by comparing yourself with other reviewers. Also, it trains you to defend your opinion based on what you read.
  • It helps you refine your own work, by identifying common flaws that you normally don’t detect when writing your own papers.
  • It’s an opportunity to update your state of the art, or learn a little about other areas.
  • It allows you to contribute to the scientific community, and to get public visibility.

A scientific work might be the result of months of work. Even if you think it is trivial you should be methodical in explaining the reasons why you think it should be accepted or rejected (yes, even if you think the paper should be accepted). A review should not be just an “Accepted” or “Rejected” statement, but also contain valuable feedback for the authors. Below you can see the main guidelines for a good review:

  • Start your review with an executive summary of the paper: this will let the authors know the main message you have understood from their work. Don’t copy and paste the abstract; try to communicate the summary in your own words. Otherwise they’ll just think you didn’t pay much attention when reading the paper.
  • Include a paragraph summarizing the following points:
    1. Grammar: Is the paper well written?
    2. Structure: is the paper easy to follow? Do you think the order should have been different?
    3. Relevance: Is the paper relevant for the target conference/journal/workshop?
    4. Novelty: Is the paper dealing with a novel topic?
    5. Your decision. Do you think the work should be accepted for the target publication? (If you don’t, expand your concerns in the following paragraphs)
  • Major Concerns: Here is where you should say why you disagree with the authors, and highlight your main issues. In general, a good research paper should successfully describe four main points:
    1. What is the problem the authors are tackling? (Research hypothesis) This point is tricky, because sometimes it is really hard to find! And in some cases the authors omit it and you have to infer it. If you don’t see it, mention it in your review.
    2. Why is this a problem? (Motivation). The authors could have invented a problem that has no motivation. A good research paper is often motivated by a real-world problem, potentially with a user community behind it that benefits from the outcome.
    3. What is the solution? (Approach). The description of the solution adopted by the authors. This is generally easy to spot in any paper.
    4. Why is it a good solution? (Evaluation). The validation of the research hypothesis described in point one. The evaluation is normally the key part of the paper, and the reason why many research publications are rejected. As my supervisor has told me many times, one does not evaluate an algorithm or an approach; one has to evaluate whether the proposed algorithm or approach validates the research hypothesis.

When a paper describes the previous four points well, it is generally accepted. Of course, not all papers fall into the category of research papers (survey papers and analysis papers, for example, do not). But the four previous points should cover a wide range of publications.

  • Minor concerns: You can point out minor issues after the big ones have been dealt with. This is not mandatory, but it will help the authors to polish their work.
  • Typos: Unless there are too many, you should point out the main typos you find in your review, along with any sentences you think are confusing.

Other advice:

  • Don’t be a jerk: many reviews are anonymous, and people tend to be crueler when they know their names won’t be shown to the authors. Instead of saying that something “is garbage”, state clearly why you disagree with the authors’ proposal and conclusions. Let the facts speak for themselves, not your bias or opinion.
  • Consider the target publication. You can’t use the same criteria for a workshop, a conference and a journal. Normally people tend to be more permissive at workshops, where the evaluation is not that important if the idea is good, but require a more complete paper for conferences and journals.
  • Highlight the positive parts of the authors’ work, if any. Normally there is a reason why the authors have spent time on the presented research, even if the idea is not very well implemented.
  • Check the links, prototypes, evaluation files and, in general, all the supplementary material provided by the authors. A scientist should not only review the paper, but also the research described in it.
  • Be constructive. If you disagree with the authors on a point, always mention how they could improve their work. Otherwise they won’t know how to handle your issue and may ignore your review.

If you want more guidelines, you can check the ones Elsevier gives to its reviewers, or the ones by PLOS ONE.
