A couple of weeks ago I attended the International World Wide Web (WWW) conference in Florence. This was my first time at WWW, and I was impressed by the number of attendees (apparently, more than 1400). Everyone was willing to talk about and discuss their work, so I met new people, caught up with some I already knew and left with a very positive experience. I hope to be back in the future.
In this post I summarize my views on the conference. Given its size, I could not attend all the different talks, workshops and tutorials, but if you could not come, this should give you an idea of the kind of content that was presented. The proceedings can be accessed online here.
The conference was held in Fortezza da Basso, one of Florence’s largest historical buildings. Although it was packed with talks, tutorials and presentations, more than one attendee managed to skip one or two sessions to do some sightseeing, and I can’t blame them. I didn’t skip any sessions, but I managed to visit the Ponte Vecchio and have a walk around the city after the second day was over :).
My contribution: Linked Data Platform and Research Objects
My role in the conference was to present a poster in the SAVE-SD workshop. We use the Linked Data Platform standards to access Research Objects according to the Linked Data principles, which makes them easy to create, manage, retrieve and edit. You can check our slides here, and we have a live demo prototype here. The poster can be seen in the picture below. We got some nice feedback and some potential users from the attendees!
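For readers unfamiliar with the Linked Data Platform (LDP), the create/retrieve/edit/delete cycle maps onto plain HTTP verbs. The sketch below only builds the requests, without sending them, and it is an illustration rather than the code behind our demo: the container URL, the function names and the choice of the `ro:ResearchObject` type are assumptions made for the example.

```python
from urllib.request import Request

# Hypothetical container URL, NOT the address of the actual demo.
CONTAINER = "http://example.org/ros/"
LDP_BASIC_CONTAINER = "http://www.w3.org/ns/ldp#BasicContainer"
# Assumption: the Research Object is typed with the wf4ever RO vocabulary.
RO_TURTLE = (
    b"@prefix ro: <http://purl.org/wf4ever/ro#> .\n"
    b"<> a ro:ResearchObject .\n"
)


def create_ro(slug: str) -> Request:
    """Create: POST to the parent container; Slug is a hint for the new URI."""
    return Request(
        CONTAINER,
        data=RO_TURTLE,
        method="POST",
        headers={
            "Content-Type": "text/turtle",
            "Slug": slug,
            # Ask the server to create the new resource as an LDP container.
            "Link": f'<{LDP_BASIC_CONTAINER}>; rel="type"',
        },
    )


def retrieve_ro(uri: str) -> Request:
    """Retrieve: GET the resource, negotiating for Turtle."""
    return Request(uri, method="GET", headers={"Accept": "text/turtle"})


def edit_ro(uri: str, turtle: bytes, etag: str) -> Request:
    """Edit: PUT the new state; If-Match guards against lost updates."""
    return Request(
        uri,
        data=turtle,
        method="PUT",
        headers={"Content-Type": "text/turtle", "If-Match": etag},
    )


def delete_ro(uri: str) -> Request:
    """Delete: remove the resource from its container."""
    return Request(uri, method="DELETE")
```

Passing any of these objects to `urllib.request.urlopen` would actually send the request; a real LDP server answers the POST with a `Location` header holding the URI of the newly created resource.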
The conference keynotes
The keynotes were one of the best parts of the conference. Jeanette Hofmann opened the first day by describing the dilemmas of digitalization, comparing them to the myth of being caught between Scylla and Charybdis. She introduced four main dilemmas, none of which may have a best solution:
- The privacy paradox, as we have a lot of “free” services at our disposal, but the currency in which we pay for them is our own private data
- Bias in free services: for example, Internet.org is an alliance of enterprises that claims to offer local services for free in countries where people cannot afford them. But some protesters claim that it offers a manipulated internet where people can’t choose what they see. Is it better to have something biased for free or an unbiased product for which you have to pay?
- Data protection versus free access to information: illustrated with the right to be forgotten, celebrated in Germany as a victory of the individual over Google, but heavily criticized in other countries like Spain, where corrupt politicians use it to look better to potential voters after their sentence has expired. The process of “being forgotten” is not transparent at all.
- Big brother is always watching you: how do the security / law enforcement / secret services collect everything about us (all for the sake of our own protection)? National services collect data on foreigners to protect the locals. What about data protection? Should we consider ourselves under constant surveillance?
The second keynote was given by Deborah Estrin, who discussed what we could do with our small data. We are walking sensors, constantly generating data with our mobile devices, and “small data is to individuals what big data is to institutions”. However, most people don’t like analyzing their own data. Instead, they download apps that passively record their data and use it to show them useful things: healthy purchases based on their diet, decline at an old age, monitoring, etc. The issue of privacy is still there, but “is it creepy when you know what is going on, instead of everybody using this data without you knowing? Why can’t you benefit from your own data as well?”.
Andrei Broder, from Google, was the last keynote presenter. He gave a retrospective of the Web, analyzing whether the predictions made a decade ago turned out to be true, and making some additional ones for the future. He introduced the 3 drivers of progress: scaling up with quality, faster response and higher functionality levels.
The keynote also included some impressive figures, from then and now. In 1999 people still had to be told what a web crawler was. Today 20 million pages are crawled every day, and the index is over 100 petabytes. Wow. Regarding future predictions, it looks like Google is evolving from a search box to a request box.
Saving scholarly discourse
I attended the full-day SAVE-SD workshop, designed for enhancing scholarly data with semantics, analytics and visualization. The workshop was organized by Francesco Osborne, Silvio Peroni and Jun Zhao, and it received a lot of attention (even though the LDOW workshop was running in parallel). One of the features of the workshop was that you could submit your paper in HTML using the RASH grammar. The paper is then enriched and can be directly converted to other formats demanded by publishers, like the ACM’s PDF template.
Paul Groth kicked off the workshop with a keynote on how to increase productivity in scholarship by using knowledge graphs. I liked how Paul quantified productivity with numbers: taking productivity as the amount of stuff we can do in one hour, it has risen by up to 30% in places like the US since 1999. Scholarly output has grown by up to 60%, but that doesn’t necessarily translate into a productivity boost. The main reason why we are not more productive is “the burden of knowledge”: we need more and more time to study and process the research output being produced in our areas of expertise. Even though tools for collaboration among researchers have been created, in order to boost our productivity we need synthesized knowledge, and Knowledge Graphs can help with that. Hopefully we’ll see more apps based on personalized knowledge graphs in the future 🙂
The rest of the workshop covered a variety of domains:
- Bibliography: the Semantic Lancet portal, which allows exploring citations as first-class citizens, and Conference Live, a tool for accessing, collecting and exploiting conference information and papers as Linked Data.
- Licensing, with Elsevier’s copyright model.
- Enhanced publications, where Bahar Sateli won the best paper award with her approach for creating knowledge bases from papers using NLP techniques (pdf) and Hugo Mougard described an approach to align conference video talks with their respective papers.
- Fostering collaborations: Luigi Di Caro described how to measure the impact of collaborators on one’s own research (d-index). I tested it and I am glad to see that I am less and less dependent on my co-authors!
Linked Data or DBpedia?
I was a bit disappointed to discover that although many different papers claimed to be using or publishing Linked Data, in reality they were just approaches to work with one dataset: DBpedia. Ideally, Linked Data applications should exploit and integrate the links from different distributed sources and datasets, not just a huge centralized dataset like DBpedia. In fact, the only paper I saw that exploited the concept of Linked Data was the one presented by Ilaria Tiddi on using Linked Data to label academic communities (pdf), in which they aimed to explain data patterns, detecting communities of research topics by doing link traversal and applying clustering techniques according to the LSA distance.
Web mining and Social Networks: is WWW becoming the conference of the big companies?
After attending the Web mining and Social Network tracks, I wonder whether it is actually possible to have a paper accepted on these topics if Microsoft, IBM, Yahoo or Google is not supporting the work with their data. I think almost all the papers in these tracks had collaborators from one of these companies, and I fear that in the future WWW might become monopolized by them. It is true that having industry involved is good for research. They provide useful real-world use cases and data to test them. However, most of the presented work boiled down to presenting a problem solved with a machine learning technique and a lot of training (which carries the risk of overfitting the model). There wasn’t much innovation in the solutions, and the data was not accessible, as in most cases it is private. A way to overcome this issue could be to require the authors of submitted papers to share their data, which would be consistent with the open data movements we have been seeing in events like Open Research Data Day or Beyond the PDF, and would allow other researchers to test their own methods as well.
Opinions aside, some interesting papers were presented. Wei Song described how to extract patterns from titles for entity recognition with high precision in order to produce templates of web articles (pdf); I saw automatic tagging of pictures using a six-level neural network plus the derivation of a three-level taxonomy from the tags (although the semantics were a bit naive in my opinion) (pdf); Pei Li (University of Zurich and Google) introduced how to link groups of entities together to identify business chains (pdf); and Gong Cheng described the creation of summaries for effective human-centered entity linking (pdf).
My personal favorites were the method to detect content abusers in Yahoo Answers by analyzing the flagged content of the users, which helps the moderators’ work (pdf), and the approach for detecting early rumors in Twitter (pdf) by Zhe Zhao. According to Zhe, they were able to detect rumors up to 3 hours before anyone else.
Graph and subgraph mining
Since I have been exploring how to use graph mining techniques to find patterns in scientific workflows, I thought that attending these sessions might help me understand my problem better. Unfortunately, none of the presenters described approaches for common sub-graph mining, but I learnt about current hot topics regarding social networks: finding the densest sub-graphs (pdf, pdf and pdf), which I think is important for determining which nodes are the most important to influence or control the network; and discovering knowledge from the graph, useful to derive small communities (pdf) and for web discovery (pdf). I deliberately avoid providing details here, as these papers tend to get technical quite quickly.
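To give a flavour of what “finding the densest sub-graph” means without going into the papers themselves, here is a minimal sketch of the classic greedy peeling heuristic (usually attributed to Charikar), which is known to return a sub-graph with at least half of the optimal density. It is not the method of any of the papers above; the function name and the toy graph are made up for illustration, and density is taken as the number of edges divided by the number of nodes.

```python
from collections import defaultdict


def densest_subgraph(edges):
    """Greedy peeling: repeatedly drop the lowest-degree node and
    remember the intermediate sub-graph with the highest |E| / |V|."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    nodes = set(adj)
    num_edges = sum(len(neigh) for neigh in adj.values()) // 2
    best_density = num_edges / len(nodes) if nodes else 0.0
    best_nodes = set(nodes)

    while len(nodes) > 1:
        # Linear scan for the minimum degree: simple, but O(n^2) overall.
        u = min(nodes, key=lambda n: len(adj[n]))
        num_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.remove(u)

        density = num_edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)

    return best_nodes, best_density


# Toy graph: a 4-clique (a, b, c, d) with a two-node tail (e, f).
toy_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("d", "e"), ("e", "f"),
]
dense_nodes, dense_value = densest_subgraph(toy_edges)
```

On this toy graph the tail is peeled off first and the sketch returns the clique {a, b, c, d}, with density 1.5. A production version would keep the nodes in a degree-indexed bucket structure so that the peeling runs in linear time.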
Finally, I couldn’t miss the Semantic Web track, since it was the one that could have the most potential overlap with the work my colleagues and I do in Madrid. We had 5 different papers, each one on a different topic:
- benchmarking: Axel Ngonga presented GERBIL, a general entity annotator benchmark that can compare up to 10 entity annotation systems (pdf).
- instance matching: Arnab Dutta explained their approach to match instances depending on the schema by using Markov clustering (pdf).
- provenance: Marcin Wylot described their approach for materializing views to represent the provenance of the information. The paper uses TripleProv as a query execution engine, and claims to be the most efficient way to handle provenance-enabled queries (pdf).
- RDF2RDB: an uncommon topic, as it is usually the other way around. Minh-Duc Pham proposed to obtain a relational schema from an RDF dump in order to exploit the efficiency of typical databases (pdf). However, he acknowledged that this could present some issues if the model is not static.
- triplestores: Philip Stutz introduced TripleRush (pdf), a triplestore that uses sampling and random walks to create a special index structure and be more efficient in clustering and ranking RDF data.
- I liked a paper comparing the gender roles in movies against the actual census (pdf). It gives you an idea of how manipulative the media can be.
- The microposts workshop was fun, although mainly focused on named entity recognition (e.g., Pinar Kagaroz’s approach). I think that “random walk” is the phrase I heard the most in the conference.
- Check Isabel Colluci’s analysis of contemporary social movements.
- What are the top ten reasons why people lie on the internet? Check out this poster.
The next WWW will be in Montreal, Canada, and James Hendler was happy about it. Do you want to see more highlights? Check Paul Groth’s trip report here, Thomas Steiner’s here, Amy Guy’s here and Marcin Wylot’s here.