Linking Research

Open Research Data Day 2015: Get the credit!

Posted by dgarijov on June 10, 2015

Last week I attended a two-day event on Open Research Data: Implications and Society. The event was held at Warsaw’s University Library, close to the old district, and it took place while the students were studying in the library.

Warsaw’s Palace of culture

Warsaw’s old district

The event was sponsored by the Research Data Alliance and OpenAIRE, among others, with presenters from institutions like CERN, companies that aim to facilitate the publication of scientific data, like Figshare (or that benefit from it, like Altmetric), and people from the publishing world, like Elsevier and Thomson Reuters. Lidia Stępińska-Ustasiak was the main organizer of the event, and she did a fantastic job. My special thanks to her and her team.

In general, the audience was very friendly and eager to learn more about the problems raised by the presenters. The program was packed with keynotes and presentations, which made it quite a non-stop conference.

What I presented

I attended the event to talk about Research Objects and our approach for their proper preservation by using checklists. Check the slides here. In general, our proposal was well received, even though much work is still needed to make it happen as a whole. Applications like RODL or myExperiment are a first step towards achieving reproducible publications.
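
To give a flavour of what such a checklist can look like, here is a minimal, hypothetical sketch in Python: the resource types and the manifest format are made up for illustration and do not correspond to the actual RODL checklists.

```python
# Hypothetical completeness checklist for a Research Object.
# The required resource types below are illustrative only.
REQUIRED = ["workflow", "input_data", "results", "documentation"]

def missing_resources(manifest):
    """Return the required resource types absent from a manifest.

    `manifest` maps resource types to lists of files, e.g.
    {"workflow": ["analysis.t2flow"], "results": ["table1.csv"]}.
    """
    return [r for r in REQUIRED if not manifest.get(r)]

ro = {"workflow": ["analysis.t2flow"], "results": ["table1.csv"]}
print(missing_resources(ro))  # ['input_data', 'documentation']
```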

What I liked

The environment, the talks (kept to 10 minutes for the short talks and 25 for keynotes), people staying to hear others instead of running away after their own presentations, and all the discussions that happened during and after the sessions.

What I missed

Even though I enjoyed the event very much, I missed some innovative incentives for scientists to actually share their methods and their data. Credit and attribution were the main reasons given by everyone for sharing data. However, these are long-term benefits. For instance, after sharing the data and methods I have used in several papers as Research Objects, I have noticed that it really takes a lot of time to document everything properly. It pays off in the long term, when you (or others) want to reuse your own data, but not immediately. Thus, I can imagine that other scientists may use this as an excuse to avoid publishing their data and workflows when they publish the associated paper. The paper is the documentation, right?

My question is: can we provide a benefit for sharing data/workflows that is immediate? For example: if you publish the workflow, the “Methods” section of your paper gets written automatically, or you get an interactive figure that looks supercool in your paper, etc. I haven’t found an answer to this question yet, but I hope to see some advances in this direction in the future.

But enough with my own thoughts, let’s stick to the content. I summarize both days below.

Day 1

After the welcome message, Marek Niezgódka introduced the efforts made in Poland towards open research data. The Polish digital library now offers access to all scientific publications to everyone, in order to promote Polish scholarly bibliography in the scientific world. Since Polish is not an easy language, they are investing in the development of tools and projects like Wordnet and Europeana.

Mark Parsons (Research Data Alliance) followed by describing the problem of replicating scientific results. Before working at RDA, he worked at NSIDC, which observes and measures climate change. Apparently, some results were really hard to replicate because different experts understood concepts differently. For example, the term “ice edge” is defined differently in several communities. Open data is not enough: we need to build bridges among different communities of experts, and this is precisely the mission of RDA. With more than 30 working and interest groups bringing together people from industry and academia, RDA aims to improve the “data fabric” by building foundational terminologies, enabling discovery across different registries and standardizing methodologies across communities:

The data fabric

Jean-Claude Burgelman (European Commission) provided a great overview of the open research lifecycle:

Data Publication Lifecycle

The presenter described the current concerns with open access at the European Commission, and how they are proposing a bottom-up approach by enabling a pilot for open research data, which has produced encouraging preliminary results.
Although data is currently being opened only in some areas (see picture below), it is good to see that the European Commission is also focusing on infrastructure, hosting, intellectual property rights and governance. For example, even patents are possible under the pilot’s open data policy.

Open data by community

The talk ended with an interesting thought: high-impact journals account for less than 1% of scientific production.

The next presenter was Kevin Ashley, from the UK’s Digital Curation Centre. Kevin started his talk with the benefits of data sharing, both from the selfish view (credit) and the community view (for example, data from archaeology has been reused by paleontology experts). Good research needs good data, and what some people consider noise could be a valuable input for researchers in other areas.
I liked how Kevin provided some numbers regarding the maintenance of an infrastructure for open access to research papers. Assuming that only 1 out of 100 papers is reused, in 5 years we could save up to 3 million per year otherwise spent buying papers online. Also, linking publications and data increases their value. Open data with closed software, on the other hand, is a barrier.
The talk ended with the typical excuses people give for not sharing their data, as well as the main problems that actually stop data reuse:

Excuses and responses for not making your data available

What stops data reuse?

The evening continued with a set of quick presentations.

  • Giulia Ajmone (OECD) introduced open science policy trends using the “stick and carrot” metaphor: the carrots are financial incentives and proper acknowledgement and attribution, while the sticks are the mandatory rules necessary to make them happen. In many countries, individual policies exist at the national level.
  • Magdalena Szuflita (Gdańsk University of Technology) tried to identify additional benefits of data sharing by surveying researchers in economics and chemistry (areas where researchers tend not to share their data).

    Incentives for data sharing

  • Ralf Toepfer (Leibniz Information Centre for Economics) provided more details on open research data in economics, where up to 80% of researchers do not share their data (although the majority think other people should share theirs). I personally find this very shocking in an environment where trust and credibility are key, as some of these studies might drive big political changes.
  • Marta Teperek (University of Cambridge) talked about the training activities and workshops for sharing data at the University of Cambridge.
  • Helena Cousijn (Elsevier) described ways for researchers to store, share and discover data. I liked the slide comparing research initiatives with research needs (see below). I also learnt that Elsevier has a data repository where they assign DOIs, as well as two data journals.

    Initiatives vs research data needs

  • Marcin Kapczyński introduced the data citation index being developed at Thomson Reuters, which covers 240 high-value multidisciplinary repositories. A cool feature is that it can distinguish between datasets and papers.
  • Monica Rogoza (National Library of Poland) presented an approach to connect their digital library to other repositories, providing a set of tools to visualize and detect pictures in texts.

The day ended with some tools and methodologies for opening data in different domains. Daniel Hook, from Figshare, gave the invited talk, appealing to our altruism instead of our selfishness when sharing data. He surveyed the different ages of research: individual research led to the age of enlightenment, institutional research to an age of evaluation, national research to an age of collaboration and international research to an age of impact. Unfortunately, impact can sometimes be a step back from collaboration. Most data is still hidden in Dropbox folders or pen drives, and when institutions share it we find three common cases: 1) they are forced to do it, in which case the budget for accomplishing it is low; 2) they are really excited to do it, but it is not a requirement; or 3) they may not understand the infrastructure, but they aim to provide tools that allow authors to collaborate internationally.

And finally, a manifesto:

Manifesto for sharing data

The short talks can be summarized as follows:

  • Marcin Wichorowsky (University of Warsaw) talked about the GAME project database to integrate oceanographic data repositories and link them to social data.
  • Alexander Nowinsky (University of Warsaw) described COCOs, a cosmological simulation database which aims to store large-scale simulations of the universe (just two datasets already exceed 100 TB!).
  • Marta Hoffman (University of Warsaw) introduced RepOD, the first repository for open data in Poland, complementary to other platforms like the Open Science Platform. It adapts CKAN and focuses explicitly on research data.
  • Henry Lütke (ETH Zurich) described their publication pipeline for scientific data, using openBIS for data management, electronic notebooks, and OAI-PMH for the metadata; it is integrated with CKAN as well (see the harvesting sketch right after this list).
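
As a rough illustration of the OAI-PMH part of such a pipeline, the sketch below harvests one page of Dublin Core records from a repository; the endpoint URL is a placeholder, but the verb and metadataPrefix parameters are standard OAI-PMH.

```python
import requests
import xml.etree.ElementTree as ET

# Placeholder endpoint: any OAI-PMH repository answers the same verbs.
OAI_ENDPOINT = "https://example.org/oai"

def harvest_titles(endpoint):
    """Fetch one page of oai_dc records and return their dc:title values."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    response = requests.get(endpoint, params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    return [title.text for title in root.findall(".//dc:title", ns)]

print(harvest_titles(OAI_ENDPOINT))
```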

Day 2

The second day was packed with presentations as well. Martin Hamilton (Jisc) gave the first keynote, analyzing the role of the pioneer. Assuming that in 2030 there will be tourists on Mars, what are the main factors that could enable it? Who were the pioneers that pushed this effort forward? For example, Tesla Motors will not initiate any lawsuit against someone who, in good faith, wants to use their technology for the greater good. These are the kinds of examples we need to see for research data as well. New patrons may arise (e.g., Google, Amazon, etc. giving awards as research grants) and there will be a spirit of co-opetition (i.e., groups with opposing interests working together on the same problem), but by working together we could address the issue of open access to research data and move on to other challenges, like full reproducibility of scientific experiments.

Tim Smith (CERN, Zenodo) followed by describing how we often find ourselves on the shoulders of secluded giants. We build on the work done by other researchers, but the shareability of data can be a burden in the process: “If you stand on the shoulders of hidden giants, you may not see too far”. Tim argued that researchers participating in the human collective enterprise that pushes research forward often look after their own best interest, and that by fostering feedback one’s own interest may become a collective interest. Of course, this also requires a scientist-centric approach providing access to the tools, data, materials and infrastructure that delivered the results. Given that software is crucial for producing research, Zenodo was presented as an application for publishing code as part of the active research process, integrated with GitHub to support collaborative development. The keynote ended by explaining how data is shared at an institution like CERN, where petabytes of data are stored. Since not all the data can be opened due to its size, only a selected set of data for education and research purposes is made public (currently around 40 TB). The funny thing is how opening data has actually benefited them: they ran an open challenge asking people to improve their machine learning algorithm on the released data, and machine learning experts who had no idea about the purpose of the data won.

Zenodo-Github bridge
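
Beyond the GitHub bridge (which archives each release and assigns it a DOI automatically), the same result can be obtained through Zenodo’s REST deposit API. The sketch below is only illustrative: the token, file name and metadata are placeholders, and the endpoint paths should be checked against the current API documentation.

```python
import requests

ZENODO = "https://zenodo.org/api/deposit/depositions"
TOKEN = {"access_token": "YOUR-TOKEN"}  # personal access token (placeholder)

# 1. Create an empty deposition.
dep = requests.post(ZENODO, params=TOKEN, json={}).json()
dep_id = dep["id"]

# 2. Attach a snapshot of the code or data.
with open("analysis-v1.0.zip", "rb") as fp:
    requests.post(f"{ZENODO}/{dep_id}/files", params=TOKEN,
                  data={"name": "analysis-v1.0.zip"}, files={"file": fp})

# 3. Add minimal metadata so the archived code is citable.
metadata = {"metadata": {
    "title": "Analysis code for my experiment",
    "upload_type": "software",
    "description": "Snapshot of the code used to produce the results.",
    "creators": [{"name": "Doe, Jane"}],  # placeholder author
}}
requests.put(f"{ZENODO}/{dep_id}", params=TOKEN, json=metadata)

# 4. Publish: Zenodo mints a DOI for this version.
requests.post(f"{ZENODO}/{dep_id}/actions/publish", params=TOKEN)
```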

A set of short presentations were next:

  • Pawel Krajewski presented the transPLANT project, a software infrastructure for plant scientists based on checklists for publishing data. It follows the ISA-TAB format.
  • Cinzia Daraio (Sapienza) described how to link heterogeneous data sources in an interoperable setting with their ontology-based (14 modules!) data management system. The ontology is used to represent indicators on different disciplines and be able to do comparisons (e.g., opportunistic behavior).
  • Kamil Wais (University of Information Technology and Management in Rzeszów) showed how to monitor open data automatically with Odgar, an R-based application for visualizing and computing statistics.
  • Me: I presented our checklist-based approach for preserving Research Objects, described above.

After the break, Mark Thorley (NERC-UK) gave the last invited talk. He presented CODATA, an international group similar to RDA that, instead of following a bottom-up approach, follows a top-down one. As described before, a huge part of the problem lies with knowledge translation: people who know how to talk to experts in different domains about their uses of data. In this regard, the role of the knowledge broker/intermediary is gaining relevance: people who know the data and know how to use it for other people’s needs. Rather than exposing the data, in CODATA they are working towards exposing and exploiting (IP rights) the knowledge behind it.

A series of short talks followed the invited talk:

  • Ben McLeish (Altmetric) described how his company looks for any research output using text mining: Reddit, YouTube, repositories, blogs, etc. They have come up with a new relevance metric, presented as donut-shaped graphics, which can even show how your institution is doing and how engaging your work is.
  • Krzysztof Siewicz (University of Warsaw) explained, from a legal point of view, how different data policies could interfere with each other when opening data.
  • Magdalena Rutkowska-Sowa (University of Białystok) finished up by describing the models for commercialization of R&D findings. With Horizon 2020, new policy models and requirements will have to be introduced.

The second day finished with a panel discussion with Tim Smith, Giulia Ajmone, Martin Hamilton, Mark Parsons and Mark Thorley as participants, further discussing some of the issues presented during both days. Although I didn’t take many notes, some of the discussions were about how enterprises could figure out open data models, data privacy, how to build services on top of open data, and the value of making data available.

The panel. From left to right: Giulia Ajmone, Mark Thorley, Martin Hamilton, Tim Smith and Mark Parsons


WWW2015: Linked Data or DBpedia?

Posted by dgarijov on June 4, 2015

A couple of weeks ago I attended the International World Wide Web (WWW) conference in Florence. This was my first time at WWW, and I was impressed by the number of attendees (apparently more than 1400). Everyone was willing to talk and discuss their work, so I met new people, talked to some I already knew and left with a very positive impression. I hope to be back in the future.

In this post I summarize my views on the conference. Given its size, I could not attend all the different talks, workshops and tutorials, but if you could not come, this should give you an idea of the kind of content that was presented. The proceedings can be accessed online here.

The venue

The conference was held in the Fortezza da Basso, one of Florence’s largest historical buildings. Although the program was packed with talks, tutorials and presentations, more than one attendee managed to skip a session or two to do some sightseeing, and I can’t blame them. I didn’t skip any sessions, but I managed to visit the Ponte Vecchio and have a walk around the city after the second day was over :).

Fortezza da Basso (left) and Ponte Vecchio (right)

My contribution: Linked Data Platform and Research Objects

My role in the conference was to present a poster at the SAVE-SD workshop. We use the Linked Data Platform standards to access Research Objects according to the Linked Data principles, which makes them easy to create, manage, retrieve and edit. You can check our slides here, and we have a live demo prototype here. The poster can be seen in the picture below. We got some nice potential users and feedback from the attendees!

Our poster: Linked Data Platform and Research Objects
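
For readers unfamiliar with the Linked Data Platform, the interaction model is plain HTTP: resources live in LDP containers, and creating one is a POST with an RDF payload. A minimal sketch follows (the container URL is hypothetical, not our actual prototype endpoint):

```python
import requests

# Hypothetical LDP Basic Container holding Research Objects.
CONTAINER = "https://example.org/ldp/research-objects/"

turtle = """\
@prefix dcterms: <http://purl.org/dc/terms/> .
<> dcterms:title "Workflow run for experiment 42" .
"""

# POST a new RDF source into the container (W3C LDP interaction model).
response = requests.post(
    CONTAINER,
    data=turtle.encode("utf-8"),
    headers={
        "Content-Type": "text/turtle",
        "Slug": "experiment-42",
        "Link": '<http://www.w3.org/ns/ldp#Resource>; rel="type"',
    },
)
print(response.status_code, response.headers.get("Location"))
```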

The conference keynotes

The keynotes were one of the best parts of the conference. Jeanette Hofmann opened the first day by describing the dilemmas of digitalization, comparing them to the myth of being caught between Scylla and Charybdis. She introduced four main dilemmas, which may not have a best solution:

  • The privacy paradox, as we have a lot of “free” services at our disposal, but the currency in which we pay for them is our own private data
  • Bias in free services: for example, Internet.org is an alliance of enterprises that claims to offer local services for free in countries where people cannot afford them. But some protesters claim that it offers a manipulated internet where people cannot choose. Is it better to have something biased for free, or an unbiased product you have to pay for?
  • Data protection versus free access to information: illustrated by the right to be forgotten, celebrated in Germany as a success of the individual over Google, but heavily criticized in other countries like Spain, where corrupt politicians use it to look better to potential voters after their sentences have expired. The process of “being forgotten” is not transparent at all.
  • Big brother is always watching you: how do security, law enforcement and secret services collect everything about us (all for the sake of our own protection)? National services collect data on foreigners to protect the locals. What about data protection? Should we consider ourselves under constant surveillance?

The second keynote was given by Deborah Estrin, and it discussed what we could do with our small data. We are walking sensors, constantly generating data with our mobile devices, and “small data is to individuals what big data is to institutions”. However, most people don’t like analyzing their data. They download apps that passively record it and use it to show them useful things: healthy purchases based on their diet, monitoring decline in old age, etc. The issue of privacy is still there, but “is it creepy when you know what is going on, instead of everybody using this data without you knowing? Why can’t you benefit from your own data as well?”.

Andrei Broder, from Google, was the last keynote presenter. He did a retrospective of the Web, analyzing whether the predictions made over the last decade came true or not, and making some additional ones for the future. He introduced the three drivers of progress: scaling up with quality, faster responses and higher functionality levels.

The keynote also included some impressive figures, from then and now. In 1999, people still had to be told what a web crawler was. Today, 20 million pages are crawled every day, and the index is over 100 petabytes. Wow. Regarding future predictions, it looks like Google is evolving from a search box to a request box.


Saving scholarly discourse

I attended the full-day SAVE-SD workshop, dedicated to enhancing scholarly data with semantics, analytics and visualization. The workshop was organized by Francesco Osborne, Silvio Peroni and Jun Zhao, and it received a lot of attention (even though the LDOW workshop was running in parallel). One of the features of the workshop was that you could submit your paper in HTML using the RASH grammar. The paper is then enriched and can be converted directly into other formats demanded by publishers, such as the ACM’s PDF template.

Paul Groth kicked off the workshop with a keynote on increasing productivity in scholarship by using knowledge graphs. I liked how Paul quantified productivity with numbers: taking productivity as the amount of work we can do in one hour, it has risen by up to 30% in places like the US since 1999. Scholarly output has grown by up to 60%, but that does not necessarily translate into a productivity boost. The main reason why we are not more productive is “the burden of knowledge”: we need ever longer to study and process the amount of research output being produced in our areas of expertise. Even though tools for collaborating among researchers have been created, in order to boost our productivity we need synthesized knowledge, and knowledge graphs can help with that. Hopefully we’ll see more apps based on personalized knowledge graphs in the future 🙂

The rest of the workshop covered a variety of domains:

  • Bibliography: with the Semantic Lancet portal, which allows exploring citations as first-class citizens, and Conference Live, a tool for accessing, collecting and exploiting conference information and papers as Linked Data.
  • Licensing, with Elsevier’s copyright model.
  • Enhanced publications, where Bahar Sateli won the best paper award with her approach to create knowledge bases from papers using NLP techniques (pdf) and Hugo Mougard described an approach to align conference video talks to their respective papers.
  • Fostering collaborations: Luigi Di Caro described the impact of collaborators on one’s own research (d-index). I tested it and I am glad to see that I am less and less dependent on my co-authors!
D-index: a tool for testing your trajectory dependence

Linked Data or DBpedia?

I was a bit disappointed to discover that although many papers claimed to be using/publishing Linked Data, in reality they were just approaches that work with one dataset: DBpedia. Ideally, Linked Data applications should exploit and integrate the links from different distributed sources and datasets, not just a huge centralized dataset like DBpedia. In fact, the only paper I saw that exploited the concept of Linked Data was the one presented by Ilaria Tiddi on using Linked Data to label academic communities (pdf), in which they aimed to explain data patterns by detecting communities of research topics through link traversal and clustering based on the LSA distance.
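
To make the distinction concrete, here is a small sketch of what “following the links” means (it assumes the SPARQLWrapper library and the public DBpedia endpoint): DBpedia is only the starting point, and the owl:sameAs targets lead into other, independently hosted datasets that can then be queried or dereferenced on their own.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Start at DBpedia, but follow the owl:sameAs links into other datasets
# (Wikidata, GeoNames, ...) instead of stopping at a single central hub.
dbpedia = SPARQLWrapper("https://dbpedia.org/sparql")
dbpedia.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?other WHERE {
        <http://dbpedia.org/resource/Florence> owl:sameAs ?other .
    }
""")
dbpedia.setReturnFormat(JSON)

for binding in dbpedia.query().convert()["results"]["bindings"]:
    # Each URI belongs to a different dataset and has its own endpoint
    # or dereferenceable description that can be fetched next.
    print(binding["other"]["value"])
```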

Web mining and Social Networks: is WWW becoming the conference of the big companies?

After attending the Web Mining and Social Networks tracks, I wonder whether it is possible to have a paper accepted on these topics if Microsoft, IBM, Yahoo or Google is not supporting the work with their data. Almost all the papers in these tracks had collaborators from one of these companies, and I fear that in the future WWW might become monopolized by them. It is true that having industry involved is good for research: they provide useful real-world use cases and the data to test them. However, most of the presented work boiled down to a problem solved with a machine learning technique and a lot of training data (with the associated risk of overfitting the model). There was little innovation in the solutions, and the data was not accessible, as in most cases it is private. A way to overcome this issue could be to require authors of submitted papers to share their data, which would be consistent with the open data movements we have been seeing in events like Open Research Data Day or Beyond the PDF, and would allow other researchers to test their own methods as well.

Opinions aside, some interesting papers were presented. Wei Song described how to extract patterns from titles for entity recognition with high precision, in order to produce templates of web articles (pdf); I saw automatic tagging of pictures using a six-level neural network, plus the derivation of a three-level taxonomy from the tags (although the semantics was a bit naive in my opinion) (pdf); Pei Li (Univ. of Zurich + Google) introduced how to link groups of entities together to identify business chains (pdf); and Gong Cheng described the creation of summaries for effective human-centered entity linking (pdf).

My personal favorites were the method for detecting content abusers in Yahoo Answers by analyzing the content users flag, which helps the moderators’ work (pdf), and the approach for detecting early rumors on Twitter (pdf) by Zhe Zhao. According to Zhe, they were able to detect rumors up to 3 hours earlier than anyone else.

Graph and subgraph mining

Since I have been exploring how to use graph mining techniques to find patterns in scientific workflows, I thought that attending these sessions might help me understand my problem better. Unfortunately, none of the presenters described approaches for common sub-graph mining, but I learnt about current hot topics regarding social networks: finding the densest sub-graphs (pdf, pdf and pdf), which I think is important for determining which nodes matter most for influencing/controlling the network; and discovering knowledge from the graph, useful for deriving small communities (pdf) and for web discovery (pdf). I deliberately avoid providing details here, as these papers get very technical very quickly.
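
For intuition on the densest-subgraph problem that kept coming up, here is a compact sketch of the classic greedy “peeling” heuristic (a textbook 2-approximation, not the specific algorithms from the papers above): repeatedly drop a minimum-degree node and keep the intermediate subgraph with the best |E|/|V| density.

```python
def densest_subgraph(edges):
    """Greedy peeling: return (node set, density) of the densest prefix."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    nodes = set(adj)
    num_edges = len(edges)
    best_density, best_nodes = 0.0, set(nodes)

    while nodes:
        density = num_edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        u = min(nodes, key=lambda n: len(adj[n]))  # minimum-degree node
        num_edges -= len(adj[u])                   # its edges disappear
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.discard(u)

    return best_nodes, best_density

# A 4-clique with a pendant node: peeling recovers the clique.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e")]
print(densest_subgraph(edges))  # ({'a', 'b', 'c', 'd'}, 1.5)
```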

Semantic Web

Finally, I couldn’t miss the Semantic Web track, since it was the one with the most potential overlap with the work my colleagues and I do in Madrid. There were 5 different papers, each one on a different topic:

  • benchmarking: Axel Ngonga presented GERBIL, a general entity annotation benchmark that can compare up to 10 entity annotation systems (pdf).
  • instance matching: Arnab Dutta explained their approach to match instances depending on the schema by using Markov clustering (pdf).
  • provenance: Marcin Wylot described their approach of materializing views to represent the provenance of information. The paper uses TripleProv as a query execution engine, and claims to be the most efficient way to handle provenance-enabled queries (pdf).
  • RDF2RDB: an uncommon topic, as it is usually done the other way around. Minh-Duc Pham proposed deriving a relational schema from an RDF dump in order to exploit the efficiency of traditional relational databases (pdf). However, he recognized that this could present some issues if the model is not static.
  • triplestores: Philip Stutz introduced TripleRush (pdf) a triplestore that uses sampling and random walks to create a special index structure and be more efficient in clustering and ranking RDF data.

Final remarks

  • I liked a paper comparing gender roles in movies against the actual census (pdf). It gives you an idea of how manipulative the media can be.
  • The microposts workshop was fun, although it focused mainly on named entity recognition (e.g., Pinar Kagaroz’s approach). I think “random walk” is the phrase I heard most often at the conference.
  • Check Isabel Colluci’s analysis of contemporary social movements.
  • What are the top ten reasons why people lie on the internet? Check out this poster.

The next WWW will be in Montreal, Canada, and James Hendler was happy about it. Do you want to see more highlights? Check Paul Groth’s trip report here, Thomas Steiner’s here, Amy Guy’s here and Marcin Wylot’s here.
