Last week I attended the 9th edition of iEMSs in Fort Collins, Colorado. iEMSs is a biennial conference that brings together between 300 and 400 researchers from software engineering, intelligent systems, environmental modeling and decision making domains (among others). There were very few people who knew about ontologies and the Semantic Web, which made it a unique opportunity to learn about the problems of other communities. Going to this kind of event (outside of your community of expertise) has been eye opening for me in the past, and I cannot recommend it enough. Get out of your community bubble once in a while :)
What was I doing at iEMSs?
I attended the conference to present 3 papers about our Model Integration project (MINT). The papers describe an overview of the project, in which we aim to reduce the time required to integrate together models from climate, hydrology, agriculture, economics and social sciences. In addition, we introduce a new approach to describe model variables and processes using the Ontosoft software registry and our plan to integrate Pegasus and Emely for efficient model coupling. More information is available in the conference program (hopefully our papers will soon be available in the conference proceedings as well). Overall, the presentations were well received and I was glad to learn that there is huge interest in some of the problems we are tackling, such as the description of models to facilitate their reusability or enabling model coupling.
One of the best parts of the conference was the keynotes. Temple Grandin started on Monday with a cry for acceptance of visual thinkers (“I see risk, other people try to measure it!”) together with the need to get closer to the infrastructure we use every day. Get out of the office and get your hands dirty once in a while!
The last keynote speaker was Thomas Vilsack, former US Secretary of Agriculture under the Obama administration. This is the first keynote I have seen given by a politician, with no slides and a direct but compelling speech. The speaker tackled several problems related to modeling, from the role of science in different debates (GMOs and climate change) to the need for new sustainable solutions given the increase of population around the globe. How can we make models that convince farmers and policy makers about the long term consequences of their actions? How can models be used to increase the productivity per individual acre? Can we find solutions so we become better consumers of food? How can we reduce and reuse food waste?
Excellent wake up call from former US Secretary of Agriculture, Thomas Vilsack on conflicting short- and long-term challenges in Agriculture, e.g. ecosystem markets #iEMSs2018 pic.twitter.com/fjEa6r4V7m
Given that many sessions happened in parallel, this is a personal view with the highlights of the talks I attended:
Ibrahim Demir’s FloodAI is a very cool approach that mixes science with visual explanations and early detection observations. They have done an impressive amount of work to be able to communicate their results with chat bots. No wonder he won a conference award!
Alexei Voinov described surveys, tools and methods for participatory modeling. Remaining challenges are a) people tend to use the tools and models they are more familiar with, rather than experiment with new ones in different contexts; b) failure in method execution is not reported.
Ruth Falconer (University of Abertay) presented the use of videogames in environmental modeling.
Sarah Mubareka’s report on integration of models of biomass supply. Creating accurate indicators for estimating biomass in Europe is a real challenge, as everyone uses different definitions and metrics in their country.
I have just returned from an amazing IUI2017 in Limassol, Cyprus and, as I have done with other conferences, I think it would be useful to share a summary of my notes in this post. This was my first time attending the IUI conference, and I am pleasantly surprised by both the quality of the event and the friendliness of the community. As a Semantic Web researcher, it was also very positive to learn how problems are tackled from a human-computer interaction perspective. I have to admit that this is often overlooked in many Semantic Web applications.
What was I doing in IUI2017?
My role in the conference was to present our paper towards the generation of data narratives, or, in a more ambitious manner, our attempt to write the “methods” section of a paper automatically (see some examples here). The idea is simple: in computational experiments, the inputs, methods (i.e., scientific workflows), intermediate results, outputs and provenance are explicit in the experiment. However, scientists have to process all these data by themselves and summarize it in the paper. By doing so, they may omit important details that are critical for reusing or reproducing the work. Instead, our approach aims to use all the resources that are explicit in the experiment to generate accurate textual descriptions in an automated way.
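To make the idea concrete, here is a minimal sketch of template-based narrative generation from an explicit workflow record. It is not the approach from our paper: the `Step` structure, the sentence template and the example workflow are all hypothetical and only illustrate how explicit inputs, methods and outputs can be turned into text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One workflow step (hypothetical structure, for illustration only)."""
    name: str
    method: str
    inputs: List[str]
    outputs: List[str]


def join_names(names: List[str]) -> str:
    """Join names as 'a', 'a and b' or 'a, b and c'."""
    if not names:
        return "nothing"
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


def describe_step(step: Step) -> str:
    """Fill a fixed sentence template with the metadata of one step."""
    return (
        f"The step '{step.name}' applied {step.method} to "
        f"{join_names(step.inputs)}, producing {join_names(step.outputs)}."
    )


def narrate(steps: List[Step]) -> str:
    """Concatenate the step descriptions in execution order."""
    return " ".join(describe_step(step) for step in steps)


# Hypothetical two-step workflow.
workflow = [
    Step("normalize", "z-score normalization", ["raw_counts.csv"], ["normalized.csv"]),
    Step("cluster", "k-means clustering", ["normalized.csv"], ["clusters.csv", "centroids.csv"]),
]
narrative = narrate(workflow)
print(narrative)
```

A fixed template like this is the simplest possible starting point; varying the level of detail for different readers is where the problem gets complex quickly.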
I wanted to attend the conference in part to receive feedback on our current approach. Although our work was well received, I learned that the problem can get complex really quickly. In fact, I think it can become a whole area of research itself! I hope to see more approaches in the future in this direction. But that is the topic for another post. Let’s continue with the rest of the conference:
The conference lasted three days, with one main keynote opening each of them. The conference opened with Shumin Zhai, from Google, who described their work on modern touchscreen keyboard interfaces. This will ring a bell to anyone reading this post, as the result of their work can be seen on any Android phone nowadays. I am sure they will not have problems finding users to evaluate their approaches.
In particular, the speaker introduced the system that captures gestures to recognize words, as if you were drawing a line. Apparently, before 2004 they had been playing around with different keyboard configurations that helped users write in a more efficient manner. However, people have different finger sizes, and adapting the keyboard to them is still a challenge. Current systems have several user models, and combine them to adapt to different situations. It was in 2004 when they came up with the first prototype of SHARK, a shape writer that used neural networks to decode keyboard movements. They refined their prototype until achieving the result that we see today on every phone.
However, there are still many challenges remaining. Smart watches have a screen that is too small for writing. And new formats without a screen, such as wearable devices or virtual reality, don’t use standard keyboards. Eye tracking solutions have not made significant progress, and while speech recognition has evolved a lot, it is not likely to replace traditional keyboards any time soon.
The second speaker was George Samaras, who described their work to personalize interfaces based on the emotions shown by the users of a system. The motivation for this need is that currently 80% of the errors of automated systems are due to human mistakes rather than mechanical ones, especially when the interfaces are complex, such as in aviation or nuclear plants. Here cognitive systems are crucial, and adapting the content and navigation to the humans using them becomes a priority.
The speaker presented their framework to classify users based on the relevant factors in interfaces. For example, the verbals prefer textual explanations, while imagers like image explanations for, e.g., browsing results. Another example is how users prefer to explore the results: we have the wholist, who prefers a top-down exploration, versus the analyst, who would rather go for a bottom-up search. This can become an issue in collaborations, as users who prefer to perceive the information in the same way may collaborate more efficiently together. A study performed over 10 years with more than 1500 users shows that personalized interfaces lead to faster task completion.
Finally, the speaker presented their work for tackling the emotions of users. Recognizing them is important, as depending on their mood, users may be keen to see the interface in one way or the other. They have developed a set of cognitive agents, which aim to personalize services and persuade users to complete certain tasks. Persuasion is more efficient when taking into account emotions as well.
The final keynote was presented by Panos Markopoulos, who introduced their work on HCI design for patient rehabilitation. Having a proper interaction with patients (in exercises for kids and elderly people, arm training for stroke survivors, etc.) is critical for their recovery. However, this interaction has to be meaningful or patients will get bored and not complete their recovery exercises. The speaker described their work with therapists to track patient recovery in exercises such as pouring wine, cleaning windows, etc. The talk ended with a summary of some of the current challenges in this area, such as adapting feedback from patient behavior, sustaining engagement in the long run or personalization of exercises.
Recommendation is still a major topic in HCI. Peter Brusilovsky gave a nice overview of their work on personalization in the context of relevance-based visualization, as part of the ESIDA workshop. Personalized visualizations are now gaining more relevance in recommendation, but picking the right visualization for users is still a challenge. In addition, users are starting to ask why certain recommendations are more relevant, so non-symbolic approaches like topic modeling present issues.
Semantic web as a means to address curiosity in recommendations. SIRUP uses LOD paths with cosine similarity to find potential connections relevant for users.
Most influential paper award: Trust in recommender systems (O’Donovan and Smyth), where they developed a trust model for users, taking into account provenance too. Congrats!
Exploration of datasets from natural language queries. Christina Christodoulakis presented an approach to help analysts explore the next query to perform based on previous queries. A cool feature being explored here is that they abstract queries using hierarchies (e.g., what is a “sum of money over period of time” instead of “revenue per month”). Kevin McCurley introduced Analyza, an impressive effort led by Google to explore data with conversation. Originally motivated to simplify complex interfaces when retrieving data from CSVs, they have developed a virtual data analyst that breaks down queries into smaller pieces to help users translate their natural language questions to database queries. I wonder if we will see this feature soon in Google spreadsheets, as it looks tremendously useful.
The gala dinner showed me something: the people of Cyprus know how to eat. It is the first time I have seen a table so full of food. And new dishes kept coming! It felt like the meal of Mannekenpix, one of the 12 tasks of Asterix.
IUI 2017 had 193 participants this year, almost half of them students (86), and an acceptance rate of 23% (27% for full papers). You can check the program for more details. I usually prefer this kind of conference because they are relatively small, you can see most of the presented work without having to choose and you can talk to everyone very easily. If I can, I will definitely come back.
I also hope to see more influence of Semantic Web techniques to address some of the challenges in HCI, as I think there is a lot of potential to help in explanation, trust or personalization. I look forward to attending next year in Tokyo!
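As a side note on the SIRUP item above, this is a minimal sketch of cosine similarity applied to path-based item profiles. I do not know the details of how SIRUP builds its vectors, so the representation here (counts of property types found on paths between items) and the example profiles are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    shared_keys = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in shared_keys)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile is treated as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical profiles: how often each property appears on the
# Linked Open Data paths that reach an item.
liked_item = Counter({"director": 2, "genre": 1})
candidate_a = Counter({"director": 2, "genre": 1})
candidate_b = Counter({"location": 3})

print(cosine_similarity(liked_item, candidate_a))
print(cosine_similarity(liked_item, candidate_b))
```

Candidates whose paths share no property with the liked item score zero, while identical profiles score one; anything in between gives a ranking of potential connections.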
Last week I attended a two-day event on Open Research Data: Implications and Society. The event was located at Warsaw’s University Library, close to the old district, and it took place while all the students were actually studying in the library.
The event was sponsored by the Research Data Alliance and OpenAire among others, with presenters from institutions like CERN, companies that aim at facilitating publishing scientific data like Figshare (or benefit from them like Altmetric) and people from the editorial world like Elsevier and Thomson Reuters. Lidia Stępińska-Ustasiak was the main organizer of the event, and she did a fantastic job. My special thanks to her and her team.
In general, the audience was very friendly and willing to learn more about the problems exposed by the presenters. The program was packed with keynotes and presentations, which made it quite a non-stop conference.
What I presented
I attended the event to talk about Research Objects and our approach for their proper preservation by using checklists. Check the slides here. In general, our proposal was well received, even though much work is still necessary to make it happen as a whole. Applications like RODL or MyExperiment are the first step forward towards achieving reproducible publications.
What I liked
The environment, the talks (kept to 10 minutes for the short talks and to 25 for keynotes), people staying to hear others and not running away after their presentations, and all the discussions that happened during and after the events.
What I missed
Even though I enjoyed the event very much, I missed some innovative incentives for scientists to actually share their methods and their data. Credit and attribution were the main reasons given by everyone to share their data. However, these are long term benefits. For instance, after sharing the data and methods I have used in several papers as Research Objects, I have noticed that it really takes a lot of time to document everything properly. It pays off in the long term when you (or others) want to reuse your own data, but not immediately. Thus, I can imagine that other scientists may use this as an excuse to avoid publishing their data and workflow when they publish the associated paper. The paper is the documentation, right?
My question is: can we provide a benefit for sharing data/workflows that is immediate? For example: if you publish the workflow, the “Methods” page of your paper will be written automatically, or you will have an interactive drawing that looks supercool on your paper, etc. I haven’t found an answer to this question yet, but I hope to see some advance in this direction in the future.
But enough with my own thoughts, let’s stick to the content. I summarize both days below.
After the welcome message, Marek Niezgódka introduced the efforts made in Poland towards open research data. The Polish Digital Library now offers access to all scientific publications for everyone, in order to foster Polish scholarly bibliography in the scientific world. Since Polish is not an easy language, they are investing in the development of tools and projects like Wordnet and Europeana.
Mark Parsons (Research Data Alliance) followed by describing the problem of replication of scientific results. Before working in RDA, he used to work at the NSIDC, which observes and measures climate change. Apparently, some results were really hard to replicate because different experts understood concepts differently. For example, the term “ice edge” is defined differently in several communities. Open data is not enough: we need to build bridges among different communities of experts, and this is precisely the mission of RDA. With more than 30 working and interest groups integrating people from industry and academia, RDA aims to improve the “data fabric” by building foundational terminologies, enabling discovery among different registries and standardizing methodologies between different communities.
Jean-Claude Burgelman (European Commission) provided a great overview of the open research lifecycle:
The presenter described the current concerns with open access in the European Commission, and how they are proposing a bottom-up approach by enabling a pilot for open research data which has provided encouraging preliminary results.
Although open data is currently being opened in some areas (see picture below), it is good to see that the European Commission is also focusing on infrastructures, hosting, intellectual property rights and governance. For example, in the open pilot even patents are possible with the open data policy.
The talk ended with an interesting thought: High impact journals are less than 1% of the scientific production.
The next presenter was Kevin Ashley, from the British Digital Curation Center. Kevin started his talk with the benefits of data sharing, both from a selfish view (credit) and the community view (for example, data from archaeology has been used by paleontology experts). Good research needs good data, and what some people consider noise could be a valuable input for other researchers in different areas.
I liked how Kevin provided some numbers regarding the maintenance of an infrastructure for open access of research papers. Assuming that only 1 out of 100 papers is reused, in 5 years we could save up to 3 million per year from buying papers online. Also, linking publication and data increases its value. Open data and closed software, on the other hand, are a barrier.
The talk ended with the typical reasons people give not to share their data, as well as the main problems that actually stop data reuse:
The evening continued with a set of quick presentations.
Giulia Ajmone (OECD) introduced open science policy trends by using the “stick and carrot” metaphor: carrots are financial incentives, proper acknowledgement and attribution, while the sticks are the mandatory rules necessary to make them happen. Individual policies are at the national level in many countries.
Magdalena Szuflita (Gdańsk University of Technology) tried to identify additional benefits for data sharing by doing a survey on economics and chemistry (areas where the researchers didn’t share their data).
Ralf Toepfer (Leibniz centre of economics) provided more details on open research data in economics, where up to 80% of the researchers do not share their data (although the majority of the people think other people should share their data). I personally find this very shocking in an environment where trust and credibility are key, as some of these studies might be the cause of big political changes.
Marta Teperek (University of Cambridge) talked about the training activities and workshops for sharing data at the University of Cambridge.
Helena Cousijn (Elsevier) described ways for researchers to store, share and discover data. I liked the slide comparing the research initiatives versus the research needs (see below). I also learnt that Elsevier has a data repository where they assign DOIs and up to 2 data journals.
Marcin Kapczyński introduced the data citation index they are developing at Thomson Reuters, which covers 240 high value multidisciplinary repositories. A cool feature is that it can distinguish between datasets and papers.
Monica Rogoza (National Library of Poland) presented an approach to connect their digital library to other repositories, providing a set of tools to visualize and detect pictures in texts.
The day ended with some tools and methodologies for opening data in different domains. Daniel Hook, from FigShare, gave the invited talk by appealing to our altruism instead of our selfishness for sharing data. He surveyed the different ages of research: individual research led to the age of enlightenment, institutional research to an age of evaluation, national research to an age of collaboration and international research to an age of impact. Unfortunately, sometimes impact might be a step back from collaboration. Most of the data is still hidden in Dropbox or pendrives, and when institutions share it we find three common cases: 1) they are forced to do it, in which case the budget for accomplishing it is low; 2) they are really excited to do it, but it is not a requirement; and 3) they may not understand the infrastructure, but they aim to provide tools to allow authors to collaborate internationally.
And finally, a manifesto:
The short talks can be summarized as follows:
Marcin Wichorowsky (University of Warsaw) talked about the GAME project database to integrate oceanographic data repositories and link them to social data.
Alexander Nowinsky (University of Warsaw) described COCOs, a cosmological simulation database which aims at storing large scale simulations of the universe (with just 2 datasets they are over 100TB!)
Marta Hoffman (University of Warsaw) introduced RepOD, the first repository for open data in Poland, complementary to other platforms like the Open Science Platform. It adapts CKAN and focuses explicitly on research data.
Henry Lütke (ETH Zurich) described their publication pipeline for scientific data, which uses OpenBis for data management, electronic notebooks and OAI-PMH to track the metadata. It is integrated with CKAN as well.
The second day was packed with presentations as well. Martin Hamilton (Jisc) gave the first keynote by analyzing the role of the pioneer. Assuming that in 2030 there will be tourists on Mars, what are the main causes that could enable it? Who were the pioneers that pushed this effort forward? For example, Tesla Motors will not initiate any lawsuit against someone who, in good faith, wants to use their technology for the greater good. These are the kind of examples we need to see for research data as well. New patrons may arise (e.g., Google, Amazon, etc. give awards as research grants) and there will be a spirit of co-opetition (i.e., groups with opposite interests working together on the same problem), but working together we could address the issue of open access in research data and move towards other challenges like full reproducibility of the scientific experiments.
Tim Smith (CERN, Zenodo) followed by describing how we often find ourselves on the shoulders of secluded giants. We build upon the work done by other researchers, but the shareability of data might be a burden in the process: “If you stand on the shoulders of hidden giants, you may not see too far”. Tim argued that researchers participating in the human collective enterprise that pushes research forward often look for their own best interest, and that by fostering feedback one’s own interest may become a collective interest. Of course, this also involves a scientist-centric approach providing access to the tools, data, materials and infrastructure that delivered the results. Given that software is crucial for producing research, Zenodo was presented as an application for collaborative development to publish code as part of the active research process (integrated with GitHub). The keynote ended by explaining how data is shared in an institution like CERN, where there are petabytes of data stored. Since all the data can’t be opened due to its size, only a set of selected data for education and research purposes is made public (currently around 40 TB). The funny thing is how opening data has actually benefited them: they ran an open challenge asking people to improve their machine learning algorithm on the input data. Machine learning experts, who had no idea about the purpose of the data, won.
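For readers who have not used OAI-PMH, here is a minimal sketch of what harvesting metadata with it looks like. It is not the ETH Zurich pipeline: the base URL and the sample response are made up for illustration, and only the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL of a ListRecords request for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(xml_text: str):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        title = record.findtext(f".//{DC_NS}title")
        results.append((identifier, title))
    return results


# Hand-written sample response (hypothetical repository and record).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

print(list_records_url("https://example.org/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```

The sketch parses a local string so it runs offline; against a real repository you would fetch the URL and follow resumption tokens for large result sets.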
A set of short presentations were next:
Pawel Krajewski presented the transPLANT project, a software infrastructure for plant scientists based on checklists for publishing the data. It follows the ISA-TAB format.
Cinzia Daraio (Sapienza) described how to link heterogeneous data sources in an interoperable setting with their ontology-based (14 modules!) data management system. The ontology is used to represent indicators on different disciplines and be able to do comparisons (e.g., opportunistic behavior).
Kimil Wais (University of Information Technology and Management in Rzeszów) showed how to monitor open data automatically by using an application, Odgar, based on R for visualizing and computing statistics.
Me: I presented our approach for preserving Research Objects by using checklists described above.
After the break, Mark Thorley (NERC-UK) gave the last invited talk. He presented Codata.org, an international group like RDA that, instead of following a bottom-up approach, follows a top-down one. As described before, a huge problem lies with the knowledge translators, who are people that know how to talk to experts in different domains about their uses of data. In this regard, the role of the knowledge broker/intermediary is gaining relevance: people that know the data and know how to use it for other people’s needs. Rather than exposing the data, in Codata they are working towards exposing and exploiting (IP rights) the knowledge behind it.
A series of short talks followed the invited talk:
Ben McLeish (Altmetric) described how in their company they look for any research output using text mining: Reddit, Youtube, repositories, blogs, etc. They have come up with a new relevance metric based on donut-shaped graphics which can even show how your institution is doing and how engaging your work is.
Krzysztof Siewicz (University of Warsaw) explained from the legal point of view how different data policies could interfere when opening data.
Magdalena Rutkowska-Sowa (University of Białystok) finished up by describing the models for commercialization of R&D findings. With Horizon 2020, new policy models and requirements will have to be introduced.
The second day finished with a panel discussion with Tim Smith, Giulia Ajmone, Martin Hamilton, Mark Parsons and Mark Thorley as participants, discussing further some of the issues presented during both days. Although I didn’t take many notes, some of the discussions were about how enterprises could figure out open data models, data privacy, how to build services on top of open data or the value of making data available.
A couple of weeks ago I attended the International World Wide Web (WWW) conference in Florence. This was my first time at WWW, and I was impressed by the number of attendees (apparently, more than 1400). Everyone was willing to talk about and discuss their work, so I met new people, talked to some I already knew and left with a very positive experience. I hope to be back in the future.
In this post I summarize my views on the conference. Given its size, I could not attend all the different talks, workshops and tutorials, but if you could not come you might be able to get an idea on the types of the contents that were presented. The proceedings can be accessed online here.
The conference was held in Fortezza da Basso, one of Florence’s largest historical buildings. Although it was packed with talks, tutorials and presentations, more than one attendee managed to skip one or two sessions to do some sightseeing, and I can’t blame them. I didn’t skip any sessions, but I managed to visit the Ponte Vecchio and have a walk around the city after the second day was over :).
My contribution: Linked Data Platform and Research Objects
My role in the conference was to present a poster in the Save-SD workshop. We use the Linked Data Platform standards to access Research Objects according to the Linked Data principles, which make them easy to create, manage, retrieve and edit. You can check our slides here, and we have a live demo prototype here. The poster can be seen in the picture below. We got some nice potential users and feedback from the attendants!
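To give a flavor of what accessing a Research Object through the Linked Data Platform looks like, here is a minimal sketch of the HTTP request that creates a new resource inside an LDP container. It is not our prototype: the container URL, the slug and the Turtle description are hypothetical, and the request is only built, not sent.

```python
from urllib.request import Request

# Type URI defined by the W3C Linked Data Platform vocabulary.
LDP_RESOURCE = "http://www.w3.org/ns/ldp#Resource"


def build_create_request(container_url: str, slug: str, turtle_body: str) -> Request:
    """Build (but do not send) a POST that asks a container to create a resource."""
    return Request(
        container_url,
        data=turtle_body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "text/turtle",           # the body is RDF in Turtle syntax
            "Slug": slug,                             # hint for the URI of the new resource
            "Link": f'<{LDP_RESOURCE}>; rel="type"',  # requested interaction model
        },
    )


# Hypothetical description of a Research Object.
description = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<> dcterms:title "My first Research Object" .
"""

request = build_create_request("http://example.org/ros/", "my-first-ro", description)
print(request.get_method(), request.full_url)
print(request.header_items())
```

Retrieving, editing and deleting the resource then map to plain GET, PUT and DELETE requests on the URI returned by the server, which is what makes this kind of interface easy to use from any HTTP client.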
The conference keynotes
The keynotes were one of the best parts of the conference. Jeanette Hofmann opened the first day by describing the dilemmas of digitalization, comparing them to the myth of falling between Scylla and Charybdis. She introduced four main dilemmas, which may not have a best solution:
The privacy paradox, as we have a lot of “free” services at our disposal, but the currency in which we pay for them is our own private data
Bias on free services: For example, Internet.org is an alliance of enterprises that claim to be offering local services for free in countries where people cannot afford them. But some protesters claim that they offer a manipulated internet where people can’t decide. Is it better to have something biased for free or an unbiased product for which you have to pay?
Data protection versus free access to information: illustrated with the right to be forgotten, celebrated in Germany as a success of the individual over Google, but heavily criticized in other countries like Spain where corrupt politicians use it to look better to the potential voters after the sentence has expired. The process of “being forgotten” is not transparent at all.
Big brother is always watching you: how do the security / law enforcement / secret services collect everything about us? (All for the sake of our own protection). National services collect the data on the foreigners to protect the locals. What about data protection? Shall we consider ourselves under constant surveillance?
The second keynote was given by Deborah Estrin, and it discussed what we could do with our small data. We are walking sensors constantly generating data with our mobile devices and “small data is to individuals what big data is to institutions”. However, most people don’t like analyzing their data. They download apps that passively record and use their data to show them useful stuff: healthy purchases based on your diet, decline at an old age, monitoring, etc. The issue of privacy is still there, but “is it creepy when you know what is going on, instead of everybody using this data without you knowing. Why can’t you benefit from your own data as well?”.
Andrei Broder, from Google, was the last keynote presenter. He did a retrospective of the Web, analyzing whether their predictions for the last decade were true or not, and doing some additional ones for the future. He introduced the 3 drivers of progress: scaling up with quality, a faster response and higher functionality levels:
The keynote also included some impressive data, from then and now. In 1999 people still needed an explanation of what a web crawler was. Today 20 million pages are crawled every day, and the index is over 100 petabytes. Wow. Regarding future predictions, it looks like Google is evolving from a search box to a request box:
Saving scholarly discourse
I attended the full day SAVE-SD workshop, designed for enhancing scholarly data with semantics, analytics and visualization. The workshop was organized by Francesco Osborne, Silvio Peroni and Jun Zhao, and it received a lot of attention (even though the LDOW workshop was running in parallel). One of the features of the workshop was that you could submit your paper in HTML using the RASH grammar. The paper is then enriched and can be directly converted to other formats demanded by publishers like the ACM’s PDF template.
Paul Groth kicked off the workshop by introducing in his keynote how to increase the productivity in scholarship by using knowledge graphs. I liked how Paul quantified productivity with numbers: taking as productivity the amount of stuff we can do in one hour, productivity has risen by up to 30% in places like the US since 1999. Scholarly output has grown by up to 60%, but that doesn’t necessarily translate into a productivity boost. The main reason why we are not productive is “the burden of knowledge”: we need longer times to study and process the amount of research output being produced in our areas of expertise. Even though tools for collaborating among researchers have been created, in order to boost our productivity we need synthesized knowledge, and Knowledge Graphs can help with that. Hopefully we’ll see more apps based on personalized knowledge graphs in the future 🙂
The rest of the workshop covered a variety of domains:
Bibliography: the Semantic Lancet portal, which allows exploring citations as first-class citizens, and Conference Live, a tool for accessing, collecting and exploiting conference information and papers as Linked Data.
Enhanced publications, where Bahar Sateli won the best paper award with her approach to create knowledge bases from papers using NLP techniques (pdf) and Hugo Mougard described an approach to align conference video talks to their respective papers.
Fostering collaborations: Luigi Di Caro described the impact of collaborators on one’s own research (d-index). I tested it and I am glad to see that I am less and less dependent on my co-authors!
Linked Data or DBpedia?
I was a bit disappointed to discover that although many different papers claimed to be using/publishing Linked Data, in reality they were just approaches to work with one dataset: DBpedia. Ideally Linked Data applications should exploit and integrate the links from different distributed sources and datasets, not just a huge centralized dataset like DBpedia. In fact, the only paper I saw that exploited the concept of Linked Data was the one presented by Ilaria Tiddi on using Linked Data to label academic communities (pdf), in which they aimed to explain data patterns by detecting communities of research topics, doing link traversal and applying clustering techniques according to the LSA distance.
Web mining and Social Networks: is WWW becoming the conference of the big companies?
After attending the Web mining and Social Network tracks, I wonder whether it is actually possible to have a paper on these topics accepted if Microsoft, IBM, Yahoo or Google is not supporting the work with their data. I think almost all the papers in these tracks had collaborators from one of these companies, and I fear that in the future WWW might become monopolized by them. It is true that having industry involved is good for research. They provide useful real world use cases and data to test them. However, most of the presented work boiled down to the presentation of a problem solved with a machine learning technique and a lot of training (which carries the risk of overfitting the model). There wasn’t much innovation in the solutions, and the data was not accessible, as in most cases it’s private. A way to overcome this issue could be to require the authors of submitted papers to share their data, which would be consistent with the open data movements we have been seeing in events like Open Research Data Day or Beyond the PDF, and would allow other researchers to test their own methods as well.
Opinions aside, some interesting papers were presented. Wei Song described how to extract patterns from titles for entity recognition with high precision to produce templates of web articles (pdf); I saw automatic tagging of pictures using a 6 level neural network plus the derivation of a three level taxonomy from the tags (although the semantics was a bit naive in my opinion) (pdf); Pei Li (Univ. of Zurich + Google) introduced how to link groups of entities together to identify business chains (pdf); and Gong Cheng described the creation of summaries for effective human-centered entity linking (pdf).
My personal favorites were the methods to detect content abusers in Yahoo Answers to help the moderators’ work (pdf), by analyzing the flagged contents of the users; and the approach for detecting early rumors in Twitter (pdf) by Zhe Zhao. According to Zhe, they were able to detect rumors up to 3 hours before anyone else.
Graph and subgraph mining
Since I have been exploring how to use graph mining techniques to find patterns in scientific workflows, I thought that attending these sessions might help me understand my problem better. Unfortunately none of the presenters described approaches for common sub-graph mining, but I learnt about current hot topics regarding social networks: finding the densest sub-graphs (pdf, pdf and pdf), which I think is important for determining which nodes are the most important to influence/control the network; and discovering knowledge from the graph, useful to derive small communities (pdf) and web discovery (pdf). I deliberately avoid providing details here, as these papers tend to get technical quite quickly.
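To give a flavour of what "densest sub-graph" means without going into the papers themselves, here is a minimal sketch of the classic greedy peeling heuristic: keep removing the node with the lowest degree and remember the node set with the best edges-per-node ratio seen along the way. This is a textbook approximation, not the method of any of the papers presented in the session, and the function name and toy graph are my own assumptions.

```python
from collections import defaultdict


def densest_subgraph_greedy(edges):
    """Greedy peeling: repeatedly remove a minimum-degree node and keep
    the intermediate node set with the highest density |E| / |V|."""
    # Build an undirected adjacency map, ignoring self-loops.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    nodes = set(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_density, best_nodes = 0.0, set(nodes)

    while nodes:
        density = n_edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Pick a minimum-degree node (ties broken by name for determinism).
        u = min(nodes, key=lambda x: (len(adj[x]), str(x)))
        for v in adj[u]:
            adj[v].discard(u)
        n_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]

    return best_nodes, best_density


# Toy graph (hypothetical): a 4-clique a-b-c-d with a tail d-e-f.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
dense_nodes, dense_value = densest_subgraph_greedy(toy_edges)
```

On the toy graph the tail gets peeled off first, so the clique is what remains as the densest part. The heuristic only guarantees an approximation of the optimum, which is one reason the papers quickly become more technical than this.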
Finally, I couldn’t miss the Semantic Web track, since it was the one that could have the most potential overlap with the work my colleagues and I do in Madrid. We had 5 different papers, each one on a different topic:
benchmarking: Axel Ngonga presented GERBIL, a general entity annotator benchmark that can compare up to 10 entity annotation systems (pdf).
instance matching: Arnab Dutta explained their approach to match instances depending on the schema by using Markov clustering (pdf).
provenance: Marcin Wylot described their approach for materializing views for representing the provenance of the information. The paper uses TripleProv as a query execution engine, and claims to be the most efficient way to handle provenance enabled queries (pdf).
RDF2RDB: an uncommon topic, as it is usually the other way around. Minh-Duc Pham proposed to obtain a relational schema from an RDF dump in order to exploit the efficiency of typical databases (pdf). However, he acknowledged that this could present some issues if the model is not static.
triplestores: Philip Stutz introduced TripleRush (pdf), a triplestore that uses sampling and random walks to create a special index structure and be more efficient in clustering and ranking RDF data.
I liked a paper discussing the gender roles in movies against the actual census (pdf). Gives you an idea of how manipulative the media can be.
The microposts workshop was fun, although mainly focused on named entity recognition (e.g., Pinar Kagaroz’s approach). I think that “random walk” is the phrase I have heard the most in the conference.
Lately I’ve been asked to do several reviews for different workshops, conferences and journals. In this post I would like to share with you a generic template to follow when reviewing a scientific publication. If you have been doing it for a while you may find it trivial, but I think it might be useful for people who have recently started reviewing. At least, when I started, I had to ask my advisor and colleagues for a similar one.
But first, several reasons why you should review papers:
It helps you identify whether a scientific work is good or not, and refine your criteria by comparing yourself with other reviewers. Also, it trains you to defend your opinion based on what you read.
It helps you refine your own work, by identifying common flaws that you normally don’t detect when writing your own papers.
It’s an opportunity to update your state of the art, or learn a little about other areas.
It allows you to contribute to the scientific community and gain public visibility.
A scientific work might be the result of months of work. Even if you think it is trivial, you should be methodical in explaining the reasons why you think it should be accepted or rejected (yes, even if you think the paper should be accepted). A review should not be just an “Accepted” or “Rejected” statement, but should also contain valuable feedback for the authors. Below you can see the main guidelines for a good review:
Start your review with an executive summary of the paper: this will let the authors know the main message you have understood from their work. Don’t copy and paste the abstract; try to communicate the summary in your own words. Otherwise they’ll just think you didn’t pay much attention when reading the paper.
Include a paragraph summarizing the following points:
Grammar: Is the paper well written?
Structure: Is the paper easy to follow? Do you think the order should have been different?
Relevance: Is the paper relevant for the target conference/journal/workshop?
Novelty: Is the paper dealing with a novel topic?
Your decision. Do you think the work should be accepted for the target publication? (If you don’t, expand your concerns in the following paragraphs)
Major Concerns: Here is where you should say why you disagree with the authors, and highlight your main issues. In general, a good research paper should successfully describe four main points:
What is the problem the authors are tackling? (Research hypothesis) This point is tricky, because sometimes it is really hard to find! And in some cases the authors omit it and you have to infer it. If you don’t see it, mention it in your review.
Why is this a problem? (Motivation). The authors could have invented a problem which had no motivation. A good research paper is often motivated by a real world problem, potentially with a user community behind it that benefits from the outcome.
What is the solution? (Approach). The description of the solution adopted by the authors. This is generally easy to spot in any paper.
Why is it a good solution? (Evaluation). The validation of the research hypothesis described in point one. The evaluation is normally the key of the paper, and the reason why many research publications are rejected. As my supervisor has told me many times, one does not evaluate an algorithm or an approach; one has to evaluate whether the proposed algorithm or approach validates the research hypothesis.
When a paper has the previous four points well described, it is accepted (generally). Of course, not all papers fall into the category of research papers (survey or analysis papers, for example). But the four previous points should cover a wide range of publications.
Minor concerns: You can point out minor issues after the big ones have been dealt with. Not mandatory, but it will help the authors to polish their work.
Typos: unless there are too many, you should point out the main typos you find in your review, or the sentences you think are confusing.
Don’t be a jerk: many reviews are anonymous, and people tend to be crueler when they know their names won’t be shown to the authors. Instead of saying that something “is garbage”, state clearly why you disagree with the authors’ proposal and conclusions. Let the facts speak for themselves, not your bias or opinion.
Consider the target publication. You can’t use the same criteria for a workshop, conference or journal. Normally people tend to be more permissive at workshops, where the evaluation is not that important if the idea is good, but require a good paper for conferences and journals.
Highlight the positive parts of the authors’ work, if any. Normally there is a reason why the authors have spent time on the presented research, even if the idea is not very well implemented.
Check the links, prototypes, evaluation files and, in general, all the supplementary material provided by the authors. A scientist should not only review the paper, but also the research described in it.
Be constructive. If you disagree with the authors on one point, always mention how they could improve their work. Otherwise they won’t know how to handle your issue and may ignore your review.
After a few days back in Madrid, I have finally found some time to write about the eScience 2014 conference, which took place last week in Guarujá, Brazil. The conference lasted for 5 days (the first two days with workshops), and it had attendees from all over the world. It was especially good to see many young people who could attend thanks to the scholarships awarded by the conference, even when they were not presenting a paper. I found it a bit unorthodox that the presenters couldn’t apply for these scholarships (I wanted to!), but I am glad to see this kind of giveaway. Conferences are expensive and I was able to have interesting discussions about my work thanks to this initiative. I think this is also a reflection of Jim Gray’s will: pushing science into the next generation.
We stayed at a tourist resort in Guarujá, at the beach. This is what you could see when you got out of the hotel:
And the jungle was not far away either. After a 20 minute walk you were able to arrive at something like this…
…which is pretty amazing. However, the conference schedule was packed with interesting talks from 8:30 to 20:30 most days, and in general we were unable to do any sightseeing. In my opinion they could have cut one workshop day and relaxed the schedule a little bit. Or at least removed the parallel sessions in the main conference. It always sucks to have to choose between two different interesting sessions. That said, I would like to congratulate everyone involved in the organization of the conference. They did an amazing job!
Another surprise was the number of Semantic Web people: I wasn’t expecting to see many, since the ISWC Conference took place at the same time in Italy, but I found quite a few. We are everywhere!
But let’s get back to the workshops, demos and conference. As I mentioned above, the first 2 days included workshop talks, demos and tutorials. Here are my highlights:
Workshops and demos:
Microsoft is investing in scientific workflows!: I attended the Azure research training workshop, where Mateus Velloso introduced the Azure infrastructure for creating and setting up virtual machines, web services, websites and workflows. It is really impressive how easily you are able to create and run experiments with their infrastructure, although you are limited to their own library of software components (in this case, a machine learning library). If you want to add your own software, you have to expose it as a web service.
Impressive visualizations using Excel sheets at the Demofest! All the demos belonged to Microsoft (guess who was one of the main sponsors of the conference) although I have to admit that they looked pretty cool. I was impressed by two demos in particular, the Sanddance beta and the Worldwide Telescope. The former is used to load Excel files with large datasets to play with the data, select, filter and plot the resources by different facets. Easy to use and very fluid in the animations. The latter was similar to Google Maps, but you were able to load your Excel dataset (more than 300K points at the same time) and show it in real time. For example, in the demo you could draw the itineraries of several whales in the sea at different points in time, and show their movement minute after minute.
New provenance use cases are always interesting. Dario Oliveira introduced their approach to extract biographic information from the Brazilian Historical Biographical Dictionary at the Digital Humanities Workshop. This included not only the life of the different persons collected as part of the dictionary, but also each reference that contributed to tell part of the story. Certainly a complex and interesting use case for provenance, which they are currently refining.
Paul Watson was awarded the Jim Gray Award. In his keynote, he talked about social exclusion and the effect of digital technologies. Being unable to get online may stop you from having access to many services, and he presented ongoing work on helping people with accessibility problems (even through scientific workflows). Clouds play an important role too, as they have the potential for dealing with the fast growth of applications. However, the people who could benefit the most from the cloud often do not have the resources or skills to do so. He also described e-Science Central, a workflow system for easily creating workflows in your web browser, with provenance recording and exploring capabilities and the possibility to tune and improve the scalability of your workflows with the Azure infrastructure. The keynote ended by highlighting how important it is to make things fun for the user (“gamification” of evaluations, for example), and how important eScience is for computer science research: new challenges are continuously presented, supported by real use cases in application domains with a lot of data behind them.
I liked the three dreams for eScience of the “strategic importance of eScience” panel:
Find and support the misfits, by addressing those people with needs in eScience.
Support cross domain overlap. Many communities base their work on the work made by other communities, although the collaboration rarely happens at the moment.
Cross domain collaboration.
Conference general highlights:
Great discussion in the “Going native Panel”, chaired by Tony Hey, with experts from chemistry, scientific workflows and ornithology (talk about domain diversity). They analyzed the key elements of a successful collaboration, explaining how in their different projects they have a wide range of collaborators. It is crucial to have passionate people, who don’t lose momentum after the grant for the project has been obtained. For example, one of the best databases for accessing chemical descriptions in the UK came out of a personal project initiated by a minority. In general, people like to consume curated data, but very few are willing to contribute. In the end what people want is to have impact. Showing relevance and impact (or reputation, altmetrics, etc.) will attract additional collaborators. Finally, the issue of data interoperability between different communities was brought up for discussion. Data without methods is in many cases not very useful, which encourages part of the work I’ve been doing during the last years.
Awesome keynotes!! The one I liked the most was given by Noshir Contractor, who talked about “Grand Societal Challenges”. The keynote was basically about how to assemble a “dream team” of people for delivering a product/proposal, and all the analyses that had been done to determine which factors are the most influential. He started by talking about the Watson team, who built a machine capable of beating a human on TV, and continued by presenting the tendencies people have when selecting people for their own teams. He also presented a very interesting study of videogames as “leadership online labs”. In videogames very heterogeneous people meet, and they have to collaborate in groups in order to be successful. The takeaway conclusion was that diversity in a group can be very successful, but it is also very risky and often ends in failure. That is why people tend to collaborate with people they have already collaborated with when writing a proposal.
The keynote by Kathleen R. McKeown was also amazing. She presented a high level overview of the work in NLP developed in their group concerning summarization of news, journal articles, blog posts, and even novels! (which IMO has a lot of merit without going into the detail). She presented co-reference detection of events, temporal summarization, sub-event identification and analysis of conversations in literature, depending on the type of text being addressed. Semantics can make a difference!
New workflow systems: I think I haven’t seen an eScience conference without new workflow systems being presented 😀 In this case the focus was more on the efficient execution and distribution of the resources. Dispel4py and Tigres workflow systems were introduced for scientists working in Python.
Cross domain workflows and scientific gateways:
Antonella Galizia presented the DRIHM infrastructure to set up Hydro-Meteorological experiments in minutes. Impressive, as they had to integrate models for meteorology, hydrology, pluviology and hydraulic systems, while reusing existing OGC standards and developing a gateway for citizen scientists. A powerful approach, as they were able to make flood predictions in certain parts of Italy. According to Antonella, one of the biggest challenges in achieving their results was to create a common vocabulary which could be understood by all the scientists involved. Once again we come back to semantics…
Rosa Filgueira presented another gateway, but for vulcanologists and rock physicists. Scientists often have trouble sharing data among different disciplines, even if they belong to the same domain (geology in this case). This is because every lab often records their data in a different way.
Finally, Silvia Olabarriaga gave an interesting talk about workflow management in astrophysics, heliophysics and biomedicine, distinguishing the conceptual level (user in the science gateway), abstract level (scientific workflow) and concrete level (how the workflow is finally executed on an infrastructure), and how to capture provenance at these different granularities.
Other more specific work that I liked:
A tool for understanding copyright in science, presented by Richard Hoskings. A plethora of different licenses coexist in the Linked Open Data, and it is often difficult to understand how one can use the different resources exposed on the Web. This tool helps guide the user through the possible consequences of using one resource or another in their applications. Very useful to detect any incompatibility in your application!
An interesting workflow similarity approach by Johannes Starlinger, which improves on the current state of the art by matching workflows efficiently. Johannes said they would release a new search engine soon, so I look forward to analyzing their results. They have published a corpus of similar workflows here.
Context of scientific experiments: Rudolf Mayer presented the work done in the Timbus project to capture the context of scientific workflows. This includes their dependencies, methods and data at a very fine granularity. Definitely related to Research Objects!
An agile annotation of scientific texts to identify and link biomedical entities by Marcus Silva, with the particularity of being capable of loading very large ontologies to do the matching.
Workflow ecosystems in Pegasus: Ewa Deelman presented a set of combinable tools for Pegasus able to archive, distribute, simulate and re-compute workflows efficiently. All tested with a huge workflow in astronomy.
Provenance is still playing an important role in the conference, with a whole session for related papers. PROV is being reused and extended in different domains, but I have yet to see an interoperable use across different domains that shows its full potential.
In summary, I think the conference has been a very positive experience and definitely worth the trip. It is very encouraging to see that collaborations among different communities are really happening thanks to the infrastructure being developed in eScience, although there are still many challenges to address. I think we will see more and more cross domain workflows and workflow ecosystems in the next years, and I hope to be able to contribute with my research.
I also got plenty of new references to add to the state of the art of my thesis, so I think that I also did a good job by talking to people and letting others know about my work. Unfortunately my return flight was delayed and I missed my connection back to Spain, turning my 14 hour flight home into almost 48 hours. Certainly the longest journey from any conference I have attended.
Apparently September was the month of library conferences. First, the DC-Ipres conference took place during the first week of the month, while the Theory and Practice of Digital Libraries (TPDL) was held from the 22nd to the 26th in Malta. I have recently realized that I forgot to add the summary of TPDL, so my highlights can be found below.
In general, the impression that I got is that despite its name, TPDL is a very technology-oriented event. Linked Data was a hot topic, but user interfaces, mining algorithms, classification, preservation and visualization approaches were also discussed for the library domain. Another curious fact is that many of the talks and papers were related to Europeana project data or models. I had no idea of the size of the project, which is leading to many contributions from a huge number of institutions all over Europe.
Since there were many parallel sessions, my highlights won’t cover everything. If you want more information you can see the whole program here.
The COST actions for Digital Libraries, which serve to create networks of researchers all over the world.
An interesting map-based visualization using hierarchies and Europeana data with a layered approach (see more here)
The project presented in the session “Using Requirements in Audio Visual Research, a quantitative approach”, which will link together fragments of videos (from a repository of more than 800k hours) and annotate them. I asked the person responsible whether the data was supposed to be made available or not, but for the moment it doesn’t look like it. Very cool ideas though, and very useful for journalists and regular users.
The semantic hierarchical structuring of cultural heritage objects done with Europeana data to put together resources that refer to the same “thing”, using metadata (for example, to detect duplicates and several different views (pictures) of the same object). Very useful to curate the data, but it lacked a comparison with other clustering methods, which should be done in the future.
The keynote by Sören Auer, where he presented several of the Linked Data aware applications that he and his group had been developing and how they could help librarians in different ways. Ontowiki was the most complete one, a semantic wiki for creating portals and annotating them according to the Linked Data principles (including content negotiation for each of its pages).
The “resurrecting myRevolution” paper, regarding the tweets and links that go missing on the web and how to archive and preserve them properly. This presentation in particular focused on tweets that referenced images that don’t exist anymore (e.g., those taken during the green revolution in Iran).
A nice motivational presentation by Sarah Callaghan on data citation, why we need it and why we should have it. More details here.
The Investigation Research Objects being created in the SCAPE project, based on the foundations laid by wf4Ever and combining them with persistent identifiers like DOIs.
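Since content negotiation came up in Sören Auer's keynote, here is a rough sketch of the idea for readers who haven't met it: the same URI returns HTML to a browser and RDF to a Linked Data client, depending on the HTTP Accept header. This is not how Ontowiki implements it; the function name, the list of available formats and the simplified parsing (no `type/*` wildcards) are all my own assumptions.

```python
def pick_representation(accept_header,
                        available=("text/html", "text/turtle",
                                   "application/rdf+xml")):
    """Return the best available media type for an HTTP Accept header,
    or None if nothing acceptable is available (simplified sketch)."""
    prefs = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        if not media:
            continue
        quality = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            key, _, value = field.strip().partition("=")
            if key.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs.append((quality, media))

    # Highest q first; the sort is stable, so header order breaks ties.
    prefs.sort(key=lambda pref: -pref[0])
    for quality, media in prefs:
        if quality <= 0:
            continue
        if media in available:
            return media
        if media == "*/*":
            return available[0]
    return None


# A Linked Data client asking for Turtle, and a typical browser header.
for_client = pick_representation("text/turtle, text/html;q=0.8")
for_browser = pick_representation(
    "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
```

A real server would then redirect or respond with the chosen serialization of the wiki page.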
Finally, I wouldn’t like to finish without mentioning that the organizers were given the title of Knights of the Digital Libraries, which was very well received by everyone in the conference. Below you can see some of the ceremony, along with a picture of Malta’s National Library.
After a summer break I’m back to blogging. Before continuing my tutorial, I’d like to talk about the Dublin Core conference, which took place last week in Lisbon and was co-located with iPres. Both conferences had a heavy presence of people from the digital libraries community, who provided different perspectives on many Semantic Web topics. Also, when registered in one event, people could freely sneak into the talks of the other and vice versa, providing a flexible schedule for all the attendees.
My main reason to be there was a 3 hour tutorial about PROV and its mapping to Dublin Core on the first day. 22 people attended and participated a lot during the session, which made it more interesting. There was some discussion and I received some questions, mainly from librarians, about PROV. The mapping received some criticism, but many participants considered it valuable as an entry point to use PROV in their datasets. Also, I learnt that there is now an effort to align PREMIS (http://www.loc.gov/standards/premis/) to PROV. For those interested, the slides of the tutorial can be accessed here: http://www.slideshare.net/dgarijo/provo-tutorial-dc2013-conference.
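For those who did not attend, the basic idea of the mapping can be sketched in a few lines: some Dublin Core terms are treated as more specific versions of PROV properties, so a Dublin Core statement lets you add the corresponding PROV statement. The table below is only a small illustrative subset written from memory, so please check it against the published mapping before relying on it; the function name and the example triples are hypothetical.

```python
# Illustrative subset of direct Dublin Core -> PROV correspondences.
# Assumption: to be verified against the published DC-to-PROV mapping.
DC_TO_PROV = {
    "dct:creator": "prov:wasAttributedTo",
    "dct:contributor": "prov:wasAttributedTo",
    "dct:created": "prov:generatedAtTime",
    "dct:source": "prov:wasDerivedFrom",
}


def dc_to_prov(triples):
    """Keep every input triple and add the PROV statement that follows
    from it when the predicate has a direct correspondence."""
    result = []
    for subject, predicate, obj in triples:
        result.append((subject, predicate, obj))
        if predicate in DC_TO_PROV:
            result.append((subject, DC_TO_PROV[predicate], obj))
    return result


# Hypothetical description of a report (prefixed names as plain strings).
report = [
    ("ex:report", "dct:creator", "ex:alice"),
    ("ex:report", "dct:created", "2013-09-02"),
    ("ex:report", "dct:title", "Annual report"),
]
enriched = dc_to_prov(report)
```

Terms without a direct counterpart, like the title here, are simply left as they are, which is why the mapping works as an entry point rather than a full translation.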
My highlights of the conference (there were parallel sessions, see the full program here):
Gildas Illien’s keynote (Darling, we need to talk), where he emphasized the different views of preservationists versus bibliographers, and summarized the process of making the BNF (French National Library) compliant with the Linked Data standards. As always, an important conclusion is to keep the complexity of the system (e.g., RDF) hidden from the users.
The vocabulary preservation session, where the LOCKSS system (Lots of copies keep stuff safe) was briefly presented and discussed, along with other projects like Memento, which aim to provide entities at a given point in time. The people from LOV were also present. Unfortunately the discussion ended without a clear conclusion, so Ivan Herman proposed to create a new W3C community group where the discussion could continue with all the interested parties.
The poor versus rich session on vocabularies used for exposing Linked Data of digital libraries across different countries like Spain, France and Germany. All the participants of the panel introduced how their systems had adopted Linked Data, although in the end there was not much discussion about the vocabularies used.
The last keynote by Carlos Morais Pires about the Horizon 2020 call and how it will be related to e-Infrastructure. Data will have to become an infrastructure itself, by sharing and federating it. “Homeless data becomes no data after all”. This is very interesting since many proposals in this regard (related also to Research Objects) will arise during the next year.
Finally, one of the most interesting sessions in my opinion was the one about Schema.org. Dan Brickley stated that Schema.org does not aim to be a top level ontology, but an empirical vocabulary to communicate with search engines and improve the visibility of your web pages (Google accepts annotations in microdata and RDFa). The vocabulary is still incomplete in many aspects, but different organizations and community groups are starting to propose terms for adoption (the OCLC people had recently proposed one to track bibliographic items). However, this task is not easy, as the need for a term has to be motivated by demonstrating that none of the existing ones fits its purpose.
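To make the Schema.org point a bit more concrete, this is roughly what a microdata annotation looks like when generated from a script. It is just a sketch: the helper name and the choice of a `Book` with `name` and `author` are my own example, not something shown in the session.

```python
from html import escape


def book_microdata(name, author):
    """Render a minimal HTML fragment annotated with Schema.org
    microdata for a book (illustrative helper, not an official tool)."""
    return (
        '<div itemscope itemtype="https://schema.org/Book">\n'
        f'  <span itemprop="name">{escape(name)}</span>\n'
        f'  <span itemprop="author">{escape(author)}</span>\n'
        '</div>'
    )


# Example with made-up values; special characters are escaped.
snippet = book_microdata("Cataloguing & Metadata", "A. Librarian")
```

A search engine reading the page can then pick up the type and the two properties without needing to understand the rest of the markup.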
And I’d like to end this post with two images: the Fado concert given during the dinner plus the reception by the Tuna of the university. Special thanks to Stuart Sutton and José Borbinha, the organizers.
After a 2 week holiday, I’m finally back to work. Before letting more time pass by, I would like to share here a small summary of the e-Science conference I attended about a month and a half ago in Chicago.
I’ll start with the keynotes. There were four in the 3 days that the conference lasted. Gerhard Klimeck (slides) introduced Nanohub, a platform to publish and use separate components and tools via user-friendly interfaces, showing how they could be used for different purposes like education or research in a scalable way. It has a lot of potential (especially since they try to make things easier through simple interfaces), but I found it curious that the notion of workflows doesn’t exist (or workflows are barely used).
Gregory Wilson (slides) raised a nice issue in e-Science: sometimes the main issue with the products developed by the scientific community is not that they have the wrong functionality, but that users don’t understand what these products are or how to use them. In order to address it, we should first prepare the users and then give them the tools.
The third speaker was Carole Goble (slides), who talked about reproducibility in e-Science and the multiple projects in which she is participating. She specifically mentioned the wf4Ever project (where she collaborates with the OEG) and the Research Objects, the data artifacts that myExperiment is starting to adopt in order to preserve workflows and their provenance.
The last keynote was given by Leonard Smith (slides), and unlike the others (which were more computer science oriented), he presented from the point of view of a scientist who is looking for the appropriate tools to keep doing his research successfully. He talked about doing “science in the dark” (predictions over past observations) versus “science in the light” (analysis with empirical evaluations), and showed the example of meteorological predictions. Apparently the Royal Society wanted to drop the weather predictions in the past, but they were forced by users to bring them back. Leonard highlighted the importance of never giving a 100% or 0% chance in the forecasts and ended his talk asking how the e-Science community could help this kind of research. I really recommend taking a look at the slides.
As for the panels, I attended the one about operating cities and Big Data. The work presented was very interesting, but I was a bit disappointed. I haven’t been to many panels before, and I thought a panel discussion was more a discussion between the speakers and the audience rather than presentations about the speakers’ work and a longer round of questions. This does not imply that the work was bad at all, just that I missed some debate among the invited speakers.
Regarding the sessions, most of them happened in parallel. The whole program can be seen here, so I will just post those which I enjoyed the most:
Workflow 1: Khalid Belhajjame presented the wf4Ever team’s analysis of decay in Taverna workflows (slides). It is definitely a good first step for those seeking to preserve the functionality of workflows and their reproducibility. In this session I also talked about our empirical analysis of scientific workflows, which aims to find common patterns in their functionality (see slides).
Data provenance: Beth Plale’s students (Pend Chen and You-Wei Cheah) introduced their work on temporal representation and quality of workflow traces, and Sarah Cohen-Boulakia presented her work on workflow rewriting in order to make scalable analyses of workflow graphs. I liked all of these presentations, as they were interesting and easy to follow. However, they all shared the need for real workflow traces (they had created artificial ones to test their approaches).
Workflow 2: From this session I found the work presented by Sonja Holl (slides) relevant. She talked about the approach they use to automatically find the appropriate parameters for running a workflow. Once again, she was interested in traces of real workflows, specifically from Taverna (since it is the system she had been dealing with).
In conclusion, I was very happy to attend the conference (my first one if I don’t count workshops!), even if I missed the 3-day workshops from Microsoft that happened earlier in the week. I had the chance to meet people in person that I had previously known only through e-mail, and I talked to all the thinking heads working close to what I do.
From the sessions it also became clear to me that the community is asking for a curated benchmark of scientific workflow provenance for testing their different algorithms and methods. Fortunately I have seen a call for papers with this theme: https://sites.google.com/site/bigprov13/. It covers provenance in general, but in the Wf4ever project we are already planning a joint submission with more than 100 executions of different workflows from the Taverna and Wings systems. Specifically, the ones from Wings are already published online as Linked Data (see some examples here). Let’s see how the call works out!