I attended the #scidata16 meeting last week as an amateur reporter, part of the award for being selected as a finalist in the SciData writing contest. After the conference I returned to my own office and my own project, searching for the code and datasets I needed to re-make some figures from a data analysis I'd done months earlier, and the previous day's discussions suddenly felt even more relevant. I thought it would be worthwhile to give our Science with Style readers some highlights from the conference, along with tips and tricks for data management and sharing. You'll be able to read my upcoming report on one of the keynote presentations in a future post on the Nature Jobs blog.
Early career researchers, especially PhD students, tend to focus on their own work and their own project. But as you progress through a research career, the projects you're involved in become much larger efforts, with less and less that's yours and yours alone. Anyone who's dug through a freezer full of boxes to find crucial samples that a student who graduated three years ago left in a box labelled "E. coli samples" will know the struggles facing those of us in lab management.
For researchers working with large datasets or on large collaborative projects, though, the concepts and importance of data management might not be as evident. As science students we learn to keep lab notebooks organized, and in graduate school we learn to organize our samples and important reagents. But when your entire project is stored digitally, how should it be organized? When, as PhD students or early career researchers, do we learn how to manage digital information?
While the conference covered quite a few topics related to data science, management, and open data, I'll focus on just a few highlights from the keynotes. You can read more in-depth coverage of the meeting in upcoming posts by myself and the other #scidata16 contest winners in the coming weeks.
Reproducibility: When you compare data science with wet lab science, there are more overlaps than you might think in how the two are conducted and managed. One shared principle is that both kinds of work need to be reproducible. The first keynote speaker, Dr Florian Markowetz of the University of Cambridge, gave the example of a paper that was retracted after two bioinformaticians noticed that its incredible findings were due to nothing more than Excel copy-paste errors. And those incredible figures you made once, but whose original code you can no longer find? You need the data and the analysis plan to make them again, or it's not a trustworthy result. My favorite quote from this talk was "A project is more than a beautiful result."
Dr. Markowetz also gave the audience five things that reproducibility can do for you. It can 1) help you avoid disaster, like a retracted paper; 2) help you write a paper, since it's easier to look up numbers and be confident in your figures; 3) help you during peer review, since you can share your data and let reviewers take a look for themselves; 4) give your work continuity, so you can come back to a problem later without starting all over again; and 5) build your reputation, which lets you submit your work to better journals and establishes you as a solid scientist.
Dr. Markowetz gave a great talk and emphasized that reproducibility is not a waste of time but a part of science itself: ask whether your lab mate, or a future student in your lab, could repeat the ground-breaking results you generate in your thesis. The big take-home message here is to make reproducibility part of your workflow early in your career.
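In practice, building reproducibility into your workflow can be as simple as scripting every step from raw data to final numbers, rather than copy-pasting values by hand (the failure mode behind the Excel retraction above). Here is a minimal sketch of that idea; the file and column names in the comments are hypothetical, not from any talk:

```python
# Minimal reproducible-analysis sketch: every step from raw data to
# summary lives in one script, so anyone can re-run it and get the
# same result. File and column names below are hypothetical examples.
import statistics

def summarize(rows):
    """Compute per-group means from (group, value) pairs."""
    groups = {}
    for group, value in rows:
        groups.setdefault(group, []).append(float(value))
    # Sort groups so the output order is deterministic across runs.
    return {g: statistics.mean(vs) for g, vs in sorted(groups.items())}

if __name__ == "__main__":
    # In a real project you'd load a versioned raw-data file instead,
    # e.g. rows = [(r["group"], r["value"]) for r in csv.DictReader(f)]
    rows = [("control", 1.0), ("control", 2.0), ("treated", 3.0)]
    means = summarize(rows)
    print(means)  # the same input always yields the same summary
```

The point is not the particular code but the habit: if the figure-making path is a script rather than a sequence of manual edits, your future self (or lab mate) can regenerate every number.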
Data sharing: Dr. Jenny Molloy (also from the University of Cambridge) opened the second keynote with an answer to the seemingly simple question 'What is data?', which she defined as collected observations and tabular calculations. Explaining what data you have is the first step toward sharing it. It's also important to understand that you can retain ownership and restrict how others use and reuse the data you share, much like copyright on images and written works.
In another series of five items, we also learned the five steps for data sharing: 1) get motivated and start early, 2) stay on top of your data, 3) share the way you want to, 4) make the most of your sharing experience, and 5) set an example for your colleagues. As for why sharing matters, Dr. Molloy emphasized how open data can lead to better career recognition and connections to new collaborators and employers, and she gave examples of how open science is creating new jobs for researchers with data management experience. Other presentations on open data highlighted tools available to researchers; if you're interested in learning more, be sure to check out the Open Knowledge Foundation website for examples and data management training.
Data management: Dr. Kevin Ashley from the University of Edinburgh discussed the tools and infrastructure already in place for data management. He emphasized first that data management is not something that happens at the end of a project but something that begins when you conceptualize an idea and think about what the data might eventually look like. The value of good data collection and management was also illustrated with astronomy: measurements from 8th-century astronomers are still being used by researchers today, though for purposes far from what the original observers intended. Dr. Ashley also mentioned the volume of data coming from the Hubble telescope, where numerous publications rest on observations that the papers' authors did not collect themselves. This keynote highlighted the importance of clear and open data management policies that let researchers pursue their own ideas without ever having collected the data themselves.
Dr. Ashley also noted that, given how quickly the number and size of datasets are growing relative to storage capacity, we will run short of storage space in the long term. Because of that, it's important for ECRs to consider what needs to be kept, and for how long. And for curious ECRs wondering about the practicalities, he also discussed a recommendation to set aside about 5% of a project's budget for data management, as well as the role of institutional infrastructure for data storage.
The future of data science: Dr. Andrew Hufton, the editor of Scientific Data, talked about the role of data journals and the importance of meeting journal requirements for open data sharing. Data journals are one way to get credit for the reproducibility of your results and to have your data cited even when you're not involved in the new paper itself. Data must be seen before it can be believed, and if it can't be shared, it isn't science. Dr. Hufton also emphasized how data sharing drives the impact of your work, especially for researchers in emerging or timely fields (such as Zika virus research).
Dr. Hufton also presented an acronym for good data sharing, the kind that allows others to replicate and build on an author's claims. This means making data FAIR: findable, accessible, interoperable (i.e. in the right format), and reusable (i.e. with really good descriptors for each header). He also emphasized that while supplementary materials are great, they are not curated or machine-readable and should not be the only place you put your results.
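One lightweight way to work toward the "reusable" part of FAIR is to ship a machine-readable data dictionary alongside your table, describing every column header. The sketch below is an illustration only; the field names, types, and descriptions are hypothetical, not from any FAIR specification:

```python
# Sketch of a machine-readable data dictionary stored next to a dataset,
# so each column header has a description another researcher can parse.
# All column names and descriptions here are hypothetical examples.
import json

data_dictionary = {
    "sample_id":   {"description": "Unique identifier for each sample", "type": "string"},
    "od600":       {"description": "Optical density at 600 nm", "type": "float"},
    "timepoint_h": {"description": "Hours since inoculation", "type": "float", "units": "hours"},
}

def validate_row(row, dictionary):
    """Check that a data row only uses columns described in the dictionary."""
    return set(row) <= set(dictionary)

if __name__ == "__main__":
    row = {"sample_id": "S001", "od600": 0.42, "timepoint_h": 4.0}
    print(validate_row(row, data_dictionary))  # True
    # Write the dictionary next to the data file, e.g. data_dictionary.json,
    # so the headers stay interpretable even when the paper is long forgotten.
    print(json.dumps(data_dictionary, indent=2))
```

Because the dictionary is plain JSON rather than a PDF supplement, both humans and scripts can check what a header means, which is exactly the curation gap Dr. Hufton flagged in supplementary materials.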
What’s next? One of the last points discussed really hit home for me: as scientists, we need to take time to remember why we do research. We are tackling the problems facing our world, and people's lives can be directly affected by our work. Work that isn't open, repeatable, or properly managed can hurt more than our own careers; poorly done work can harm the very people who rely on it to better their lives. A recent article about incentives in science highlighted this concept, which again raises the need to incentivize well-done, replicated studies instead of just more publications.
While it will take some time for research culture to change, you can already find Open Science peers through the OSF network, and you can reach out to your institution for support with data management and the open data platforms available. With just a few of the potential career benefits laid out in this post, there are certainly plenty of reasons to keep your datasets open, reproducible, and well managed. And if I didn't manage to convince you here, you can catch up on the #scidata16 tweets or see the presentations posted later on the Nature Jobs website.
I greatly enjoyed #scidata16, not only for the experience as a reporter-in-training but also as a bioinformatician and as someone interested in improving the research experience of PhD students and ECRs. The conference offered a great set of speakers, along with tips and tricks for researchers at all career stages and across a range of fields. Whether a dataset is big or small, making it readable, available, and interpretable by others in the long run is a more powerful tool than I would have thought before attending this conference. Who knows? It could even get you a publication in a Nature journal!