These days, pretty much everyone acknowledges that scientific analysis should be reproducible. Of course, this is still very often not the case. For the most part, I don’t think that people start out trying to write code and structure their data so no one else can reproduce their results. But I know from experience it’s very easy to start off with the best of intentions and end up with something that works if you cross your fingers and don’t change anything for fear it will break.
There are a ton of resources out there for how to use this or that fancy piece of software to make everything reproducible. But for the most part I think these miss the key point, which is that as scientists we’re all already super busy and don’t have time to invest in learning Docker (or whatever else). More than that, many of these tools and approaches require a degree of computational sophistication that is not needed to be a good scientist.
The conclusion I’ve come to is that the best way to make analysis reproducible is not any piece of software, but to have a few simple, easy-to-follow rules that you stick to without exception. This is largely based on ~10 years of trial and error. It’s not perfect, but it works for me, requires almost no extra effort, and can be followed by anyone with a basic grasp of the command line.
Really, it’s just the one rule
Keep everything in the one place
The raw data, your code, that excel table someone sent you, the supplementary table from an old project, the picture of the whiteboard you took in lab meeting, everything. If you do that, and nothing else, you’ll at least have a record of all the bits and pieces that you used to generate your results.
The problem is that, just like the general advice to make your analysis reproducible, it’s hard to do in practice. Thankfully, the “everything in one place” rule is easy to follow if you learn a few basic tools.
1. Create a standard project structure and stick to it
It probably doesn’t matter too much what convention you use, as long as it’s sensible and you stick to it. My convention is that each project gets its own directory. That directory will always contain at least three folders:
- `Code` - every script and bit of code I use in the analysis.
- `Data` - all the raw data, tables, etc. Everything that I needed to do an experiment to generate, or information provided by someone else.
- `Results` - everything generated using `Code`.

I have tried having an `IntermediateResults` folder for … intermediate results at various points. That never worked for me, as knowing what is intermediate is a bit like knowing which draft of a paper is the final one.

I’ll also have other folders like `Talks`, `Paper`, and `Admin`, as and when they’re needed. But I always have the `Code`, `Data`, and `Results` folders for every project and try to stick to their definitions as given above.
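Setting this skeleton up is a one-off, and easy to script. A minimal sketch, assuming your projects live in `~/Projects` (the project name is just an example):

```
# Create the standard skeleton for a new project.
# Brace expansion needs bash or zsh.
mkdir -p ~/Projects/MySuperReproducibileProject/{Code,Data,Results}
```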
2. Symbolic links are the answer
You know that introductory linux course where you learnt how to use `cd` to change directory, `mv` to move things, and `cp` to copy files? Well, forget about `cp`, you don’t need it any more (not really). All you need is

```
ln -s
```

Rather than copying a file, `ln -s source destination` will make a symbolic (the `-s` part) link at `destination` that points anything looking for `destination` to the file or directory `source`. Why is that so magical? Because it completely removes the barrier to putting everything in the one place.
Suppose my project uses whole genome sequencing BAM files as its starting point. They’re stored somewhere not sensible on the filesystem and there’s no way I can justify copying the whole damn thing into my project directory just to satisfy some preachy guy on the internet. But you’re not going to copy them, you’re just going to create a link to them. Which takes no time, takes up no disk space and creates a permanent record of where you got the file from.
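In practice that’s one `ln -s` per file. Something like the following, where the BAM file’s location is made up for illustration:

```
# Link an existing BAM file into the project's Data folder without
# copying it (the source path is hypothetical).
ln -s /data/not_sensible/run42/sample1.bam ~/Projects/MySuperReproducibileProject/Data/sample1.bam
```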
You can and should use this trick for everything you need to include that logically lives somewhere else. That script from an old project you want to reuse? Symbolically link it to the `Code` directory of your new project. A supplementary table from something you published years ago? Symbolic link. You get the idea.
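For example, pulling an old script into a new project might look like this (the old project and script names are invented):

```
# Run from the top level of the new project.
# Link a script from an old project into Code (hypothetical paths).
ln -s ~/Projects/OldProject/Code/makeFigures.R Code/makeFigures.R
```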
Of course there will be times you’ll want to make an actual copy of files. Like if you want to use some old code as a starting point, but customise it for the current project.
3. All your paths must be relative
When you are writing code to perform your analysis, and you need to load a file, or save a table or a figure, you need to tell the programming language where to save to / load from. When you do this, your paths should never be absolute, but instead must always be relative, usually to the top level of your project folder. That is, every script I write will start with a line like

```
setwd('~/Projects/MySuperReproducibileProject/')
```

After that, no file I load or location I save to will begin with a `/` or a `~`. Files will be loaded from `Data/that_table_I_need.tsv`. Results will be saved to `Results/my_super_cool_plot.pdf`.
If you find yourself writing `read.table('~/path/to/this/table/imma/use/real/quick.tsv')`, stop. You’re doing it wrong and you have failed at following rule 1. It’s easy to fix though. Just run

```
ln -s ~/path/to/this/table/imma/use/real/quick.tsv ~/Projects/MySuperReproducibileProject/Data/table_imma_use_real_quick.tsv
```

Then you can happily proceed by writing `read.table('Data/table_imma_use_real_quick.tsv')`. Hooray!
The benefits
If you follow these basic rules, you’ll find it’s actually not that hard to keep everything all in one place. But even though it’s not much effort, it’s also not zero effort. So why bother?
Because your paper is super important, everyone who builds on your work will thank you for it. But on the off chance the world is full of unworthy philistines who don’t appreciate your insights, I guarantee the future version of yourself will be very grateful that you put in the effort to make what you did followable.
There are some other side benefits to this approach too.
Small project folder size
Because everything that is large is not actually in the project directory, but linked into it symbolically from somewhere else, your project folder should not get unmanageably large. Which is great, because that makes backing it up or syncing it between machines super easy.
Most institutions I’ve worked at have a small home area that is automatically backed up, then a larger, not backed up “scratch” space. By using symbolic links, you can ensure that your project folders are small enough to live in your home space and hopefully be backed up automatically for you.
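As a sketch, backing such a folder up to another machine can be a single `rsync` call (the host name here is a placeholder):

```
# -a preserves symbolic links as links; add -L if you want the
# backup to contain the actual files the links point to.
rsync -av ~/Projects/MySuperReproducibileProject backup-host:~/Projects/
```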
Easily package the whole thing for distribution
Suppose now you want to send your analysis to someone else. Obviously what you want to do is send them your project folder. But they won’t have access to your filesystem, so all those symbolic links are going to point at nothing and they won’t be able to get very far. Oh no, the system has failed! Or has it… Remember when I told you to forget about `cp`? I lied.
Because the `cp` command knows what symbolic links are and has options for how to handle them when things are copied. When you want to bundle your whole project up to ship it off to wherever, you need only run

```
cp -RL ~/Projects/MySuperReproducibileProject completeCopyOfMySuperReproducibileProject
```
The `-R` flag makes the copy recursive and the `-L` flag makes it go to where each symbolic link is pointing and copy the actual file instead of just the link. So the symbolic links become copies of the files that they point to.
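If you want to check that the copy really is self-contained, you can ask `find` to list any symbolic links left in it; if everything worked it should print nothing:

```
# List any remaining symbolic links in the packaged copy.
find completeCopyOfMySuperReproducibileProject -type l
```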
Things to watch out for
Other than being lazy and not following my own rules, the thing that usually trips me up is expecting `ln -s` to behave like `cp`, when that’s not really how it does (or should) behave. To see what I mean, think about what would happen if I try to do

```
cp file_that_does_not_exist someplace
```

As `file_that_does_not_exist` does not exist, the command will fail and say `No such file or directory`. On the other hand, if you run

```
ln -s file_that_does_not_exist someplace
```

it will run just fine. That is because `ln -s` doesn’t care that `file_that_does_not_exist` does not exist; it’s happy to create a link that points to nothing. Once you realise this, a lot of the other seemingly odd ways `ln -s` behaves start to make sense. For instance, I’ve often been caught out writing something like
```
cd ~/Projects/MySuperReproducibileProject
ln -s ../TheOtherThing/Code/usefulScripts.R Code/codeToUse.R
```
This seems like it should work. The file `../TheOtherThing/Code/usefulScripts.R` exists relative to the directory where `ln` is run, and so does the directory `Code`. Unfortunately, this will point `Code/codeToUse.R` at `~/Projects/MySuperReproducibileProject/TheOtherThing/Code/usefulScripts.R`, which is not what I wanted.
Why did this happen? Because `ln` doesn’t care if the thing you’re pointing to exists or not, it just does what it’s told. In this case I told it to make a file called `codeToUse.R` in the directory `Code` and point it at `../TheOtherThing/Code/usefulScripts.R`. It doesn’t care which directory the command was run from, so the link points at `../TheOtherThing/Code/usefulScripts.R` relative to the directory `~/Projects/MySuperReproducibileProject/Code`, not `~/Projects/MySuperReproducibileProject` where `ln` was run.
Fortunately, you mostly don’t need to know about any of this. Just make sure you always use absolute paths in the `source` part of `ln -s source destination`. So if, instead of

```
cd ~/Projects/MySuperReproducibileProject
ln -s ../TheOtherThing/Code/usefulScripts.R Code/codeToUse.R
```

I do

```
cd ~/Projects/MySuperReproducibileProject
ln -s ~/Projects/TheOtherThing/Code/usefulScripts.R Code/codeToUse.R
```

things will work as you expect.
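If you’re ever unsure where a link actually points, `ls -l` shows the stored target, and on GNU systems `readlink -f` resolves it to a final absolute path:

```
# Show the raw target stored in the link.
ls -l Code/codeToUse.R
# Resolve to the final absolute path (GNU coreutils).
readlink -f Code/codeToUse.R
```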
Optional extras
When you started reading this you were probably expecting lots about using GitHub and Docker, not the world’s longest love letter to symbolic links. There are definitely more things you could do to make your work more reproducible.
Using some kind of version control like git (hosted on GitHub or similar) is certainly a good idea. I also try to always have one master script that runs all the other bits of code, so my analysis can be re-run from start to finish.
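As a sketch of that master-script idea, with invented script names, it can be as simple as:

```
#!/usr/bin/env bash
# run_everything.sh - re-run the whole analysis from start to finish.
# The individual analysis scripts are hypothetical examples.
set -e   # stop at the first error
cd ~/Projects/MySuperReproducibileProject
Rscript Code/01_load_data.R
Rscript Code/02_run_analysis.R
Rscript Code/03_make_figures.R
```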
Probably the biggest limitation is that while your project folder will contain all the data and code needed for the analysis, it doesn’t contain the specific setup of the machine you ran it on. Projects like Docker provide one way of trying to keep a record of the machine setup as well.
These are certainly good things to do if you have the time, skills, and patience. But I think it’s best to start with the basics.