A few months ago my supervisor told me about the opportunity to join a group of geologists on a field trip to Yosemite. The initiative was driven by the EarthCube community, in an effort to bring together experts from different geological domains (tectonics, geochemistry, etc.) and computer scientists. I immediately applied for a place on the trip, and I have just returned to Spain. It has been an amazing experience, so in this post I want to summarize my views and experiences from the whole week.
Travelling and people
For someone travelling from Europe, the trip was exhausting (two layovers and up to 24 hours of flights plus waiting), but I really think it was worth it. I have learnt a lot from the group and about the challenges geologists face when collecting and sharing their data, samples and methods. All participants were open and had the patience to clear up any doubts or concerns about the geological terms used in the exercises and talks. Also, all the attendees were highly motivated and enthusiastic to learn new technologies and methods that could help them solve some of their current issues. I think this was crucial for creating the positive environment of discussion and collaboration we enjoyed during the whole experience. I hope this trip helps push forward best practices and recommendations for the community.
Yosemite National Park
There is little I can say about the park and its surroundings that hasn’t already been said. Therefore, I’ll let the pictures speak for themselves:
What was the rationale behind the trip?
As I said before, the purpose of the field trip was to bring together computer scientists and geologists. This could be interesting for geologists for two reasons. First, the geologists could show and tell computer scientists how they work and their current struggles with hardware and software in the field. Second, geologists could connect with other geologists (or computer scientists) in order to foster future collaborations.
From a computer science point of view, I believe this kind of trip helps raise end users’ awareness of current technologies (in many cases we have the technology but not the users). Also, it always helps to see with one’s own eyes the real issues faced by scientists in a particular domain. It makes them easier to understand.
What was I doing there?
Nobody would believe me when I told them that I was going to travel to Yosemite with geologists to do some “field” work. And, to be honest, one of my main concerns when preparing the trip was that I had no idea how I would make myself useful to the rest of the attendees. I felt that I would learn a lot from all the other people, since some of their problems are likely to be similar to problems in other areas, and I wanted to give something in return. Therefore I talked to everyone and asked a lot of questions. I also gave a 10-minute introductory talk on the Semantic Web (available here), to help them understand the main concepts they had already heard in other talks or project proposals. Finally, I came up with a list of the challenges they face from the computational perspective and proposed extending existing standards to address some of them.
Challenges for geologists
I think it is worth describing here some of the main challenges that these scientists are facing when collecting, accessing, sharing and reusing data:
- Sample archival and description: there is no standard way of processing and archiving the metadata related to samples. Sometimes it is very difficult to find the metadata associated with a sample, and a sample with no metadata is worthless. Similarly, it is not trivial to find the samples that were used in a paper. NSF is now demanding a Data Management Plan, but what about a Sample Management Plan? Currently, every scientist is responsible for his/her samples, and some of those might be very expensive to collect (e.g., a sample from an expedition to Mount Everest). If someone retires or changes institutions, the samples are usually lost. Someone told me that the samples used in his work could be found in his parents’ garden, as he didn’t have space for them anymore (at least those could be found 🙂 ).
- Repository heterogeneity and redundancy: Some repositories have started collecting sample data (e.g., SESAR), which shows an effort from the community to address the previous issue. Every sample is given a unique identifier, but it is very difficult to determine whether a sample already exists in the database (or in other repositories). Similarly, there are currently no applications for exploiting the data in the repository. Domain experts perform SQL queries, which differ for each repository as well. This makes integrating data from different repositories difficult at the moment.
- Licensing: People are not sure which license they have to attach to their data. This is key to being attributed correctly when someone reuses your results. I have seen this issue in other areas as well. I think this link explains everything in great detail: http://creativecommons.org/choose/.
- Sharing and reusing data: Currently, if someone wants to reuse another researcher’s mapping data (i.e., the geological observations they have written down on a map), they have to contact the authors and ask them for a copy of their original field book. With luck, there will be a scanned copy or a digitized map, which then has to be compared (manually) with the researcher’s own observations. There are no approaches for performing such a comparison automatically.
- Trust: Data from other researchers is often taken on trust, as there is no means of checking whether the observations made by a scientist are correct other than going into the field.
- Sharing methods: I was surprised to hear that the main reason why the methods and workflows followed in an experiment are not shared is that there is no culture of doing it. Apparently the workflows are there because some people use them as a set of instructions for students, but they are not documented in the scientific publications. This is an issue for the reproducibility of the results. Note that here we define a workflow as the set of computational steps that are necessary to produce a research output in a paper. Geologists also have manual workflows for collecting observations in the field. These are described in their notebooks.
- Reliability: This was brought up by many scientists in the field. Many still think that the applications on their phones are often not reliable. In fact, we did some experiments with an iPhone and an iPad and you could see differences in their measurements due to their sensors. Furthermore, I was told that if a rock is magnetic, the devices become useless. Most of the scientists still rely on their compasses to perform their measurements.
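To make the repository redundancy challenge above more concrete, here is a minimal sketch of how candidate duplicate samples could be flagged across repositories. It is only an illustration under assumptions: the `SampleRecord` fields (`repository`, `sample_id`, `name`, `collector`) and the matching rule are hypothetical, and real repositories such as SESAR expose their own schemas.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    # Hypothetical fields chosen for illustration; real schemas will differ.
    repository: str
    sample_id: str
    name: str
    collector: str


def normalize(text):
    """Lowercase and keep only letters and digits, so 'YOS-12a' matches 'yos 12A'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def candidate_duplicates(records):
    """Group records that share the same normalized sample name and collector.

    Returns only the groups with more than one record. These are candidates
    for a human to review, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for record in records:
        key = (normalize(record.name), normalize(record.collector))
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


records = [
    SampleRecord("repo_a", "A1", "YOS-12a", "M. Doe"),
    SampleRecord("repo_b", "B7", "yos 12A", "m doe"),
    SampleRecord("repo_b", "B8", "YOS-13", "M. Doe"),
]
for group in candidate_duplicates(records):
    print([(r.repository, r.sample_id) for r in group])
```

Exact matching on normalized strings is deliberately crude. It would miss misspellings and would need a domain expert to confirm each candidate, but it shows why a shared vocabulary for sample metadata would make this kind of check much easier.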
Why should geologists share their data?
The vans haven’t been just a vehicle to take us to some beautiful places on this trip; they have been a useful means of getting people to discuss some of the challenges and issues described above. In particular, I would like to recall the conversation we had on one of the last days between Snir, Zach, Basil, Andreas, Cliff and others. After discussing some of the benefits that sharing has for other researchers, Andreas asked about the direct benefit he would obtain from sharing his data. This is crucial in my view: if sharing data is only going to benefit other people and not me, why should I do it (unless I get funding for it)? Below you can find the arguments in favor of adopting this practice as a community, tied to some of the potential benefits. (Quoting Cliff Joslyn in points 1 and 2)
- Meta-analysis: being able to reuse other researchers’ data to analyze and compare new features. This is also beneficial for one’s own research, in case you change your laptop/institution and no longer have access to your previous data.
- Using consumer communities to help curate data: apparently, some geophysicists would love to reuse the data produced by geologists. They could be considered clients and taken into account when applying for a grant as a collaboration.
- Credit and attribution: Recently some publishers like PLOS or Elsevier have started creating data journals. There you just upload your dataset as a publication, so people using it can cite it. Additionally, there are data repositories like FigShare, where just by uploading a file you make it citable. This way someone could cite an intermediate result you obtained during part of your experiments!
- Reproducibility: sharing data and methods is a clear sign of transparency. By accessing the data and methods used in a paper, a reviewer would be able to check the intermediate and final results of a paper in order to see if the conclusions hold.
Are these benefits enough to convince geologists to share and annotate their data? In my opinion, the amount of time that one has to spend documenting work is still a barrier for many scientists. The benefits cannot be seen instantly, and in most cases people don’t bother once the paper is written. It is an effort that a whole community has to undertake and make part of its culture. Obviously, automatic metadata recording will always help.
This trip has proved very useful for bringing together people from different communities. Now, how do we move forward? (Again, I am quoting Cliff Joslyn, who summarized some of the points discussed during the week):
- Identify motivated people who are willing to contribute their data.
- Creation of a community database.
- Agree on standards to use as a community, using common vocabularies to relate the main concepts on each domain.
- Analyze whether valuable efforts already exist instead of starting from scratch.
- Contact computer scientists, ontologists and user interface experts to create a model that is both understandable and easy to consume.
- Exploit the community database. Simple visualization in maps is often useful to compare and get an idea of mapped areas.
- Collaborate with computer scientists instead of considering them merely as servants. Computer scientists are interested in challenging real-world problems, but they have to be in the loop.
Finally, I would like to thank Matty Mookerjee, Basil Tikoff and all the rest of the people who made this trip possible. I hope it happens again next year. And special thanks to Lisa, our cook. All the food was amazing!
Below I attach a day-by-day summary of the main activities of the trip, in case someone is interested in attending future excursions. Apologies in advance for any incorrect usage of geological terms.
Summary of the trip
Day 1: After a short introduction on how to configure your notebook (your convention, narrative, location, legend, etc.) we learnt how to identify the rock we had in front of us by using the hand lens. Rocks can be igneous, metamorphic or sedimentary, and in this case, as can be seen in the pictures below, we were in front of the igneous type: granite, in particular.
Once you know the type of rock you are dealing with and its location, it’s time to sketch, leaving out the details and representing just those that are relevant for your observation. Note that different types of geologists might consider different features relevant. Another interesting detail is that observations are always associated with areas, not points, because of possible location error. This might sound trivial, but it makes a huge difference (and adds complexity) when representing the information as a computer scientist.
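To illustrate why areas are harder to handle than points, here is a minimal sketch, assuming a hypothetical model in which each observation is a centre coordinate plus an uncertainty radius in metres. The `Observation` class and the `may_overlap` rule are my own illustration, not something the geologists on the trip use.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in metres


@dataclass(frozen=True)
class Observation:
    # Hypothetical model: a centre point plus an uncertainty radius.
    label: str
    lat: float            # degrees
    lon: float            # degrees
    uncertainty_m: float  # radius of the area the observation may refer to


def distance_m(a, b):
    """Great-circle (haversine) distance between two observation centres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def may_overlap(a, b):
    """True if the two uncertainty circles intersect, i.e. the observations
    could refer to the same place."""
    return distance_m(a, b) <= a.uncertainty_m + b.uncertainty_m


first = Observation("granite outcrop", 37.7456, -119.5936, 10.0)
second = Observation("granite outcrop (other notebook)", 37.7457, -119.5936, 10.0)
distant = Observation("another outcrop", 37.7556, -119.5936, 10.0)

print(may_overlap(first, second))   # centres about 11 m apart, radii sum to 20 m
print(may_overlap(first, distant))  # centres more than 1 km apart
```

With points, two observations either coincide or they don't. With areas, "same place" becomes a question of overlap, and real observation areas are rarely circles, which is where the extra complexity comes from.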
The day ended with three short talks: one about the Strabo app for easily handling and mapping your data with your phone, and another about the Fieldmove app (Andrew Bladon) for easily measuring strike and dip, adding annotations and representing them on a map. Shawn Ross wrapped up the session by talking briefly about his collaborations with archaeologists on field data collection.
Day 2: We learnt about cross sections in the Sierra Nevada, after a short explanation of the evolution of the area from a geological perspective. Apparently geologists think in time when analyzing a landscape, in order to determine the main changes that were necessary to produce the current result. In this regard, it is like learning about the provenance of the earth, which I think is pretty cool.
Unfortunately, Matty’s favorite section was not accessible and had to be explained via a poster. Some flooding had destroyed the road, and the section was too far away to be reached on foot. Therefore we were driven to another place in the Sierra where we were asked to draw a cross section ourselves (with the help of a geologist). It was an area with very clear faults, and most of us drew their direction right. The excursion ended when one of the geologists gave a detailed explanation of the rationale behind her sketch, so we could compare.
When we arrived at the research center, Jim Bowing gave a short talk on state, and on how geologists should be aware of their observations and the value of the attributes described in them. We as computer scientists can only recreate what we are given. We then divided into groups and thought about use cases, reporting two to the rest of the groups.
Day 3: It was time to learn about the gear: GPS, tablet and laptop (which can be heavy), all equipped with long-lasting batteries (they could last more than 2 days of fieldwork). We went to the Deep Springs Valley, and after locating ourselves on a topographic map we followed a contact (i.e., the line between two geological units). We experienced some frustration with the devices (the screen was really hard to see), and we poured some acid on the rocks in order to determine whether they were carbonates or not.
The contact finished abruptly in a fault after a few hundred meters (represented as a “v” on a map). We determined its orientation and fault access, which was possible thanks to some of the mobile applications we were using in the field. If done by hand, we would have had to analyze our measurements at home.
After a brief stop at an observatory full of metamorphic rocks, we headed back to the research center. There, Cliff Joslyn and I gave a brief introduction to databases, relational models and the Semantic Web before doing another group activity. In this case, we tried to think about the perfect app for geologists, and what kind of metadata it would need to capture.
Day 4: We went to the Caldera, close to a huge crack in the ground, where we learnt a bit more about its formation. There was a volcanic eruption in two phases, which can be distinguished by the materials that surround the pumice stones.
We then went to the lakes, where we learnt from Matty how to extract a sample. First you ought to properly identify the rock, annotate it with the appropriate measurements (orientation, strike, dip), label the rock and then extract it. If you use a sample repository like SESAR, you may also ask in advance for identifiers and print stickers for labeling the rock.
We ended the hike with a short presentation by Amanda Vizedom on ontologies and a discussion of the future steps for the community.
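The sampling steps above (identify, annotate, label, extract) could be captured as a small record that refuses obviously invalid measurements. This is only a sketch under assumptions: the `SampleAnnotation` class and its fields are hypothetical, the identifier is a made-up placeholder, and the dip direction is derived with the right-hand rule convention (dip direction 90° clockwise from strike), which is one of several conventions in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleAnnotation:
    # Hypothetical record for illustration; not the schema of any repository.
    rock_type: str
    strike_deg: float                  # azimuth of the strike line, 0 <= strike < 360
    dip_deg: float                     # inclination from horizontal, 0 <= dip <= 90
    identifier: Optional[str] = None   # e.g. requested in advance from a repository

    def __post_init__(self):
        if not 0 <= self.strike_deg < 360:
            raise ValueError("strike must be in [0, 360) degrees")
        if not 0 <= self.dip_deg <= 90:
            raise ValueError("dip must be in [0, 90] degrees")

    @property
    def dip_direction_deg(self):
        """Dip direction, assuming the right-hand rule convention."""
        return (self.strike_deg + 90) % 360

    def is_ready_to_extract(self):
        """A sample should be labeled with an identifier before extraction."""
        return self.identifier is not None


sample = SampleAnnotation("granite", strike_deg=310.0, dip_deg=35.0,
                          identifier="HYPOTHETICAL-001")
print(sample.dip_direction_deg)      # (310 + 90) % 360
print(sample.is_ready_to_extract())
```

Checking the ranges at creation time means a typo made in the field is caught while the geologist is still standing in front of the rock, rather than at home.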