These days, pretty much everyone acknowledges that scientific analysis should be reproducible. Of course, this is still very often not the case. For the most part, I don’t think that people start out trying to write code and structure their data so no one else can reproduce their results. But I know from experience it’s very easy to start off with the best of intentions and end up with something that works if you cross your fingers and don’t change anything for fear it will break.

There are a ton of resources out there for how to use this or that fancy piece of software to make everything reproducible. But for the most part I think these miss the key point, which is that as scientists we’re all already super busy and don’t have time to invest in learning Docker (or whatever else). More than that, many of these tools and approaches require a degree of computational sophistication that is not needed to be a good scientist.

The conclusion I’ve come to is that the best way to make analysis reproducible is not any piece of software, but to have a few simple, easy-to-follow rules that you stick to without exception. This is largely based on ~10 years of trial and error. It’s not perfect, but it works for me, requires almost no extra effort, and can be followed by anyone with a basic grasp of the command line.

Really, it’s just the one rule

Keep everything in the one place

The raw data, your code, that Excel table someone sent you, the supplementary table from an old project, the picture of the whiteboard you took in lab meeting, everything. If you do that, and nothing else, you’ll at least have a record of all the bits and pieces that you used to generate your results.

The problem with this is that, just like the general advice to make your analysis reproducible, it’s hard to do in practice. Thankfully, the “everything in one place” rule is easy to follow if you learn a few basic tools.

1. Create a standard project structure and stick to it

It probably doesn’t matter too much what convention you use, as long as it’s sensible and you stick to it. My convention is that each project gets its own directory. That directory will always contain at least three folders:

  1. Code - every script and bit of code I use in the analysis.
  2. Data - all the raw data, tables, etc. Everything I had to run an experiment to generate, or that was provided by someone else.
  3. Results - everything generated by the code in Code.

At various points I have tried having an IntermediateResults folder for … intermediate results. That never worked for me, as knowing what is intermediate is a bit like knowing which draft of a paper is the final one.

I’ll also have other folders like Talks, Paper, and Admin, as and when they’re needed. But every project always gets the Code, Data, and Results folders, and I try to stick to the definitions above.
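
For example, setting up a new project is then just a single command (a sketch; the project name is a placeholder and the brace expansion assumes a bash-like shell):

mkdir -p ~/Projects/MyNewProject/{Code,Data,Results}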

2. Forget about cp, learn to love ln -s

You know that introductory Linux course where you learnt how to use cd to change directory, mv to move things, and cp to copy files? Well, forget about cp, you don’t need it any more (not really). All you need is

ln -s

Rather than copying a file, ln -s source destination will make a symbolic link (the -s part) at destination that points anything that opens destination to the file or directory source. Why is that so magical? Because it completely removes the barrier to putting everything in the one place.
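
For example (all paths here are made up):

ln -s ~/some/other/place/big_table.tsv Data/big_table.tsv

This creates Data/big_table.tsv as a pointer to ~/some/other/place/big_table.tsv, and you can check where any link points with ls -l, which prints the target after an arrow.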

Suppose my project uses whole genome sequencing BAM files as its starting point. They’re stored somewhere not sensible on the filesystem and there’s no way I can justify copying the whole damn lot into my project directory just to satisfy some preachy guy on the internet. But you’re not going to copy them, you’re just going to create a link to them. Which takes no time, takes up no disk space, and creates a permanent record of where you got the file from.
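
For example, something like this, where the source path is invented and stands in for wherever your sequencing facility actually put the files:

ln -s /warehouse/seq_data/run1234/PD0001.bam ~/Projects/MySuperReproducibileProject/Data/PD0001.bam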

You can and should use this trick for everything you need to include that logically lives somewhere else. That script from an old project you want to reuse? Symbolically link it to the Code directory of your new project. A supplementary table from something you published years ago? Symbolic link. You get the idea.

Of course, there will be times when you’ll want to make an actual copy of a file, like if you want to use some old code as a starting point but customise it for the current project.

3. All your paths must be relative

When you are writing code to perform your analysis, and you need to load a file, or save a table or a figure, you need to tell the programming language where to save to / load from. When you need to do this, your paths should never be absolute, but instead must always be relative, usually to the top level of your project folder. That is, every script I write will start with a line like

setwd('~/Projects/MySuperReproducibileProject/')

After that, no file I load or location I save to will begin with a / or a ~. Files will be loaded from Data/that_table_I_need.tsv. Results will be saved to Results/my_super_cool_plot.pdf.
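
Put together, the top of a typical script looks something like this (a sketch only; the file names are those from above and the plot is a stand-in for real analysis):

setwd('~/Projects/MySuperReproducibileProject/')
dat = read.table('Data/that_table_I_need.tsv', sep='\t', header=TRUE)
pdf('Results/my_super_cool_plot.pdf')
plot(dat)
dev.off()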

If you find yourself writing read.table('~/path/to/this/table/imma/use/real/quick.tsv'), stop. You’re doing it wrong and you have failed to follow rule 1. It’s easy to fix though. Just run

ln -s ~/path/to/this/table/imma/use/real/quick.tsv ~/Projects/MySuperReproducibileProject/Data/table_imma_use_real_quick.tsv

Then you can happily proceed by writing read.table('Data/table_imma_use_real_quick.tsv'). Hooray!

The benefits

If you follow these basic rules, you’ll find it’s actually not that hard to keep everything all in one place. But even though it’s not much effort, it’s also not zero effort. So why bother?

Because your paper is super important, everyone who builds on your work will thank you for it. But on the off chance the world is full of unworthy philistines who don’t appreciate your insights, I guarantee the future version of yourself will be very grateful that you put in the effort to make what you did followable.

There are some other side benefits to this approach too.

Small project folder size

Because everything that is large is not actually in the project directory, but linked into it symbolically from somewhere else, your project folder should not get unmanageably large. Which is great, because that makes backing it up or syncing it between machines super easy.
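
You can check this with du (using the example project from above). Because du doesn’t follow symbolic links by default, anything linked in counts for essentially nothing:

du -sh ~/Projects/MySuperReproducibileProject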

Most institutions I’ve worked at have a small home area that is automatically backed up, then a larger, not backed up “scratch” space. By using symbolic links, you can ensure that your project folders are small enough to live in your home space and hopefully be backed up automatically for you.

Easily package the whole thing for distribution

Suppose now you want to send your analysis to someone else. Obviously what you want to do is send them your project folder. But they won’t have access to your filesystem, so all those symbolic links are going to point at nothing and they won’t be able to get very far. Oh no, the system has failed! Or has it… Remember when I told you to forget about cp? I lied.

Because the cp command knows what symbolic links are and has options for how to handle them when things are copied. When you want to bundle your whole project up to ship it off to wherever, you need only run

cp -RL ~/Projects/MySuperReproducibileProject completeCopyOfMySuperReproducibileProject

The -R flag makes the copy recursive and the -L flag makes cp go to where each symbolic link is pointing and copy the actual file instead of just a link. So, in the copy, the symbolic links become real copies of the files that they point to.

Things to watch out for

Other than being lazy and not following my own rules, the thing that I find usually trips me up is expecting ln -s to behave like cp when that’s not really how it does (or should) behave. To see what I mean by this, think about what would happen if I try to do

cp file_that_does_not_exist someplace

As file_that_does_not_exist does not exist, the command will fail and say No such file or directory. On the other hand, if you run

ln -s file_that_does_not_exist someplace

it will run just fine. That is because ln -s doesn’t care that file_that_does_not_exist does not exist; it’s happy to create a link that points to nothing. Once you realise this, a lot of the other seemingly odd ways ln -s behaves start to make sense. For instance, I’ve often been caught out writing something like

cd ~/Projects/MySuperReproducibileProject
ln -s ../TheOtherThing/Code/usefulScripts.R Code/codeToUse.R

This seems like it should work. The file ../TheOtherThing/Code/usefulScripts.R exists relative to the directory where ln is run, and so does the directory Code. Unfortunately, this will point Code/codeToUse.R at ~/Projects/MySuperReproducibileProject/TheOtherThing/Code/usefulScripts.R, which is not what I wanted.

Why did this happen? Because ln doesn’t care if the thing you’re pointing to exists or not, it just does what it’s told. In this case I told it to make a link called codeToUse.R in the directory Code, pointing at ../TheOtherThing/Code/usefulScripts.R. It doesn’t care which directory the command was run from: a relative source is resolved relative to the directory the link lives in, ~/Projects/MySuperReproducibileProject/Code, not ~/Projects/MySuperReproducibileProject, where ln was run.

Fortunately, you mostly don’t need to know about any of this. Just make sure you always use absolute paths in the source part of ln -s source destination. So if, instead of

cd ~/Projects/MySuperReproducibileProject
ln -s ../TheOtherThing/Code/usefulScripts.R Code/codeToUse.R

I do,

cd ~/Projects/MySuperReproducibileProject
ln -s ~/Projects/TheOtherThing/Code/usefulScripts.R Code/codeToUse.R

things will work as you expect.
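
And if you’re ever unsure where a link actually points, ls -l shows the target after an arrow, and readlink prints just the target:

ls -l Code/codeToUse.R
readlink Code/codeToUse.R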

Optional extras

When you started reading this you were probably expecting lots about using GitHub and Docker, not the world’s longest love letter to symbolic links. There are definitely more things that you could do to make your work more reproducible.

Using some kind of version control like git/GitHub is certainly a good idea. I also try to always have one master script that runs all the other bits of code, reproducing my analysis from start to finish.
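
Something like the sketch below, where the individual script names are invented for illustration:

# Code/runEverything.R - a hypothetical master script
setwd('~/Projects/MySuperReproducibileProject/')
source('Code/01_prepareData.R')
source('Code/02_runAnalysis.R')
source('Code/03_makeFigures.R')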

Probably the biggest limitation is that while your project folder will contain all the data and code needed for the analysis, it doesn’t contain the specific setup of the machine you ran it on. Projects like Docker provide one way of trying to keep a record of the machine setup as well.

These are certainly good things to do if you have the time, skills, and patience. But I think it’s best to start with the basics.