How to disclose code and data for peer-review

1 Why bother?

Starting a PhD is a bit like falling into the deep ocean. You better learn how to swim as fast as possible, you will get a lot of salty feedback, and you should try to avoid any predators by all means. The good news is that there are many awesome people out there who will help you to stay on top (or at least help you not to drown). Taking their advice, I have managed to build a method that allows me to sail smoothly on the deep waters of a submission procedure.
Despite a lot of critique, I actually like the peer-review process. But for it to be helpful, reviewers need access to all files and the raw code that was used for a manuscript. Only then they will be able to tell if the results of the manuscript are reliable or not. Even worse, the code and the data should be in an analysis ready format, so that the reviewers don’t have to wrangle with e.g. absolute directories. Journals will often give you the opportunity to submit some code files along the manuscript, but this will definitely destroy your neat project structure and reviewers will stop to look at your code after some frustrating minutes.
Using Rstudio’s project option, github, and figshare, I have managed a way to provide all of my code and data anonymously and in an analysis ready structure for reviewers. I feel save knowing that reviewers have access to all files but no one else has, reviewers will have an easy time reproducing my results, and editors will have more confidence in the reliability of my manuscript.

2 The basic set up

I always start with a R project on my local computer. It really helps to set up a clean project structure with folders for raw data, results, and so on. The Chapter 8 of the R for data science introduces some basic concepts of projects. The following image shows a R project for a current manuscript.

Fig. 1: Examplary project structure

Note that I have a README text file there, stating what each code-file is doing, where to find data, and so on. It is basically a tutorial for everyone working with the project on how to sucessfully work with it. Remember to navigate between the folders (e.g. when reading in data or when exporting plots) with the awesome here package. This will help the reviewers later on to easily reproduce your results without the need to change any code or directories.
After setting up the infrastructure, and before working on any code, I set my project under version control using github. I will not cover why you should use a version control tool while doing data science, but I strongly recommend it for everything you do. And you will need github for disclosing code and data anyways. Remember to set the repository on private, so no one else can access (or steal) your code before publication. Upon acceptance, you can change the repo to public and everyone can look at your awesome code.

Fig. 2: Remember to set your repo on private

Now, wouldn’t it be sufficient to give only reviewers access to the repo? To do that, you would need to invite the reviewers as contributors, and for that you will need their names. Not really helpful for blind peer-review, though. So we do need another tool to do that. Here figshare.com really shines. Figshare will deposit a snapshot of our repo from github and generate a link. Anyone who has the link can access and download a repo with all the code and data in the same structure as in the original repo.

Related