I am currently looking for a system which will allow me to version both the code and the data in my research.

I think my way of analyzing data is not uncommon, and this will be useful for many people doing bioinformatics and aiming for reproducibility.

Here are the requirements:

  • Analysis is performed on multiple machines (local, cluster, server).
  • All the code is transparently synchronized between the machines.
  • Source code versioning.
  • Generated data versioning.
  • Support for a large number of small generated files (>10k). These could also be deleted.
  • Support for large files (>1 GB). At some point old generated files can be permanently deleted. It would be insane to have transparent synchronization of those, but being able to synchronize them on demand would be nice.

So far I am using git + rsync/scp. But there are several downsides to it.

  • Synchronization between multiple machines is a bit tedious, i.e. you have to git pull before you start working and git push after each update. I can live with that.
  • You are not supposed to store large generated data files or a large number of files inside your repository.
  • Therefore I have to synchronize data files manually using rsync, which is error prone.

There is something called git annex. It seems really close to what I need. But:

  • A bit more work than git, but that's ok.
  • Unfortunately it seems it does not work well with a large number of files. Often I have more than 10k small files in my analysis. There are some tricks to improve indexing, but they don't solve the issue. What I need is one symlink representing the full contents of a directory.

One potential solution is to use Dropbox or something similar (like syncthing) in combination with git. But the downside is there will be no connection between the source code version and the data version.

Is there any versioning system for the code and the data meeting the requirements you can recommend?

Iakov Davydov

8 Answers


There are a couple of points to consider here, which I outline below. The goal here should be to find a workflow that is minimally intrusive on top of already using git.

As of yet, there is no ideal workflow that covers all use cases, but what I outline below is the closest I could come to it.

Reproducibility is not just keeping all your data

You have got your raw data that you start your project with.

All other data in your project directory should never just "be there", but have some record of where it comes from. Data processing scripts are great for this, because they already document how you went from your raw to your analytical data, and then the files needed for your analyses.

And those scripts can be versioned, with an appropriate single entry point of processing (e.g. a Makefile that describes how to run your scripts).

This way, the state of all your project files is defined by the raw data, and the version of your processing scripts (and versions of external software, but that's a whole different kind of problem).
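To make the "single entry point" idea concrete, here is a minimal sketch in Python of the timestamp rule that make applies. The names `needs_rebuild` and `run_pipeline` and the step layout are hypothetical, chosen only for illustration; in practice a Makefile or a workflow tool would do this job.

```python
import subprocess
from pathlib import Path


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(Path(src).stat().st_mtime > target_mtime for src in sources)


def run_pipeline(steps):
    """Run every (target, sources, command) step whose target is out of date.

    Steps are processed in the order given, so list them from raw data
    towards final results. Returns the targets that were regenerated.
    """
    executed = []
    for target, sources, command in steps:
        if needs_rebuild(target, sources):
            subprocess.run(command, check=True)
            executed.append(target)
    return executed
```

Listing the processing script itself among the sources of each step means that editing the script also marks its outputs as stale, which is what ties the state of the files to the version of the code.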

What data/code should and should not be versioned

Just as you would not version generated code files, you should not want to version 10k intermediary data files that you produced when performing your analyses. The data that should be versioned is your raw data (at the start of your pipeline), not automatically generated files.

You might want to take snapshots of your project directory, but not keep every version of every file ever produced. This already cuts down your problem by a fair margin.

Approach 1: Actual versioning of data

For your raw or analytical data, Git LFS (or alternatively Git Annex, which you already mention) is designed to solve exactly this problem: it adds tracking information for files to your Git tree, but does not store the content of those files in the repository (because otherwise the repository would grow by the size of a non-diffable file with every change you make).
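To illustrate what "tracking information" means, the sketch below builds the small text pointer that Git LFS commits in place of a large file: a spec version line, the SHA-256 of the content, and the size in bytes. This is only a demonstration of the concept; the function name `lfs_style_pointer` is hypothetical, and real pointers are created by Git LFS itself when you add a tracked file.

```python
import hashlib


def lfs_style_pointer(path, chunk_size=1 << 20):
    """Return the text of a Git LFS style pointer for the file at path."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte files never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

The pointer is a few hundred bytes regardless of the size of the data, so the repository history stays small while each commit still identifies exactly one version of the file.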

For your intermediate files, you do the same as you would do with intermediate code files: add them to your .gitignore and do not version them.

This raises a couple of considerations:

  • Hosting Git LFS data on GitHub is a paid service (the free tier is limited to 1 GB of storage/bandwidth per month, which is very little), and it is more expensive than other comparable cloud storage solutions. You could consider paying for the storage at GitHub or running your own LFS server (there is a reference implementation, but I assume this would still be a substantial effort).
  • Git Annex is free, but it replaces files by links and hence changes time stamps, which is a problem for e.g. GNU Make based workflows (a major drawback for me). Also, fetching of files needs to be done manually or via a commit hook.

Approach 2: Versioning code only, syncing data

If your analytical data stays the same for most of your analyses, the actual need to version it (as opposed to backing it up and documenting data provenance, which is essential) may be limited.

The key to getting this working is to put all data files in your .gitignore and ignore all your code files in rsync, with a script in your project root (extensions and directories are an example only):

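#!/bin/sh
# Work from the project root, wherever the script is called from.
cd "$(dirname "$0")" || exit 1
# Fetch data from the remote host; code files are skipped because git syncs them.
# -a (archive) already implies -r; -u keeps files that are newer locally.
rsync -auv \
    --exclude "*.r" \
    --include "*.RData" \
    --exclude "dir with huge files that you don't need locally" \
    yourhost:/your/project/path/* .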

The advantage here is that you don't need to remember the rsync command you are running. The script itself goes into version control.

This is especially useful if you do your heavy processing on a computing cluster but want to make plots from your result files on your local machine. I argue that you generally don't need bidirectional sync.

  • By the way you can use file hash instead of timestamp at least in biomake (doi: 10.1093/bioinformatics/btx306). – Iakov Davydov May 18 '17 at 17:11
  • @IakovDavydov I'm aware of this, but I haven't actually tried if it works – Michael Schubert May 18 '17 at 18:12
  • +1 for all the modification on the raw data should be documented +1 for the file hash. I am a big fan of literate programming myself and I also took advantage of that approach to document the version of my tools/packages I used for the analysis and the hash of the raw data. Hence, when I see a result or a plot I also have its context of production. – Mitra May 19 '17 at 12:33

Your question is somewhat open, but I think it could prove an interesting discussion. I don't believe in many cases it is worth storing the data you have created in git. As you've noted, it isn't designed for large files (although we have git-lfs) and it's definitely not designed for binary formats such as BAM.

I'm of the opinion that how a file was created and what has been done to it since is key. Large files that took much effort to create should be mirrored somewhere (but not necessarily in a version control system). Other less-important (or less difficult to create) files that have been lost, clobbered or otherwise tainted can be regenerated as long as you know how they came to be.

For what it's worth, I've been working on a piece of software called chitin (self-described as a shell for disorganised bioinformaticians). I also wrote a long blog post on why I thought this was a necessary project for me, but the main reason was that despite my attempts to organise my filesystem and make good archives of my experiments, over time I forgot what my shorthand directory names meant, or exactly which program generated which data.

chitin's goal is to automatically capture changes made to the file system during the execution of a command. It knows what commands to run to re-create a particular file, what commands have used that file and can tell you when and why that file was changed (and by who ;) too).
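The general idea of capturing file system changes around a command can be sketched as follows. This is not chitin's implementation, just a minimal illustration under the assumption that hashing every file before and after a command is affordable; `snapshot` and `run_and_record` are hypothetical names.

```python
import hashlib
import subprocess
from pathlib import Path


def snapshot(directory):
    """Map each file under directory to the SHA-256 of its contents.

    Files are read whole, which is fine for a sketch but too slow and
    memory hungry for directories full of multi-gigabyte files.
    """
    root = Path(directory)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def run_and_record(command, directory):
    """Run command inside directory and report which files it touched."""
    before = snapshot(directory)
    subprocess.run(command, cwd=directory, check=True)
    after = snapshot(directory)
    shared = set(before) & set(after)
    return {
        "command": command,
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(name for name in shared if before[name] != after[name]),
    }
```

Appending each returned record to a log gives, for any file, the list of commands that created or changed it.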

It's not finished (nothing ever is), but I feel that you might be going down the wrong road by wanting to store all your data and its versions when really, I think most people just want to know the commands that instigated changes. If the data history is important (and your code is well versioned), then you can simply check out any commit and execute your analysis to re-generate data.

Sam Nicholls

First of all, kudos to you for taking versioning seriously. The fact that you're mindful of this issue is a good sign that you want to do responsible research!

For many bioinformatics projects, data files are so large that versioning the data directly with a tool like git is impractical. But your question is really getting at a couple of different issues.

  • How do I do my research reproducibly, and show full provenance for each data point and result I produce?
  • How do I manage and synchronize my research work across multiple machines?

The short answer:

  • Archive the primary data.
  • Place your workflow under version control.
  • Version checksums of large data files.
  • Use GitHub to synchronize your workflow between machines.

The long answer:

Archive the primary data

As far as reproducibility is concerned, what matters most is the primary data: the raw, unprocessed data you collect from the instrument. If you are analyzing data that has been published by others, then write a script that will automate the task of downloading the data from its primary official source, and place that script under version control.
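Such a download script can be very short. The sketch below is one possible shape, not a prescribed one: the function name `fetch` is hypothetical, and the URL and checksum you pass in would come from the official source of the data set.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch(url, destination, expected_sha256=None):
    """Download url to destination unless already present, then verify it."""
    destination = Path(destination)
    if not destination.exists():
        destination.parent.mkdir(parents=True, exist_ok=True)
        # Download under a temporary name so an interrupted transfer is
        # never mistaken for the finished file.
        partial = destination.with_name(destination.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(destination)
    if expected_sha256 is not None:
        digest = hashlib.sha256()
        with open(destination, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_sha256:
            raise ValueError(f"checksum mismatch for {destination}")
    return destination
```

Committing a script that calls `fetch` once per primary data file, together with the expected checksums, documents both where the data came from and which exact version was analyzed.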

If you or a lab mate or a colleague produced the data and it is not yet published, then you should already have plans for submitting it to an archive. Indeed, most journals and funding agencies now require this prior to publication. I'd even go so far as to say the data should be submitted as soon as it's collected. Scientists worry a lot about having their data stolen and their ideas scooped, but statistically speaking, getting scooped is much less likely than nobody ever touching your data or reading your paper. But if you or an advisor insists, most data archives allow you to keep data private for an extended period of time until a supporting manuscript is published.

Putting (for example) Fastq files in a git repository is a bad idea for a lot of reasons. No hosting service will support files that big, git will be very slow with files that big, but most importantly git/GitHub is not archival! Use a proper data archive!

Place your workflow under version control

Treat your raw data as read-only. Only process the raw data using scripts, and keep these scripts under version control. Vince Buffalo describes this well in his book Bioinformatics Data Skills. Check it out!
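One simple way to enforce the read-only rule is to strip write permission from the raw data once it has been downloaded. A minimal sketch, assuming a POSIX file system; the function name `make_read_only` is hypothetical:

```python
import stat
from pathlib import Path


def make_read_only(directory):
    """Remove write permission (user, group, other) from files under directory."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in Path(directory).rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
```

This does not stop a determined user, but it turns an accidental overwrite by a script into an immediate error.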

Version checksums of large data files

If there are any data files that you want to track but are too big to place under version control, compute checksums and place these under version control. Checksums are very small alphanumeric strings that are, for all practical purposes, unique for each data file. So instead of putting that 5GB trimmed Fastq file or the 7GB BAM file under version control, compute their checksums and put the checksums under version control. The checksums won't tell you the contents of your files, but they can tell you when the file contents change.

This should give full disclosure and complete provenance for every data point in your analysis. The workflow has scripts/commands for downloading the primary data, scripts/commands for processing the data, and checksums that serve as a signature to validate intermediate and final output files. With this, anyone should be able to reproduce your analysis!
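Versioning checksums can be sketched in a few lines of Python. The function names and the manifest layout here are assumptions for illustration; the layout (checksum, two spaces, file name) is intended to match what the coreutils `sha256sum` tool writes for ordinary file names.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest):
    """Write one 'checksum  filename' line per file to manifest."""
    names = sorted(str(path) for path in paths)
    lines = [f"{sha256sum(name)}  {name}\n" for name in names]
    Path(manifest).write_text("".join(lines))


def changed_files(manifest):
    """Return the files that are missing or no longer match the manifest."""
    changed = []
    for line in Path(manifest).read_text().splitlines():
        recorded, name = line.split("  ", 1)
        if not Path(name).exists() or sha256sum(name) != recorded:
            changed.append(name)
    return changed
```

The manifest is a small text file, so it diffs well and can be committed alongside the scripts; the large files themselves stay out of the repository.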

Use GitHub to synchronize your workflow between machines

If your workflow is already under version control with git, it's trivial to push this to a hosting service like GitHub, GitLab, or Bitbucket. Then it's just a matter of using git push and git pull to keep your code up-to-date on your various machines.

Daniel Standage

The Open Science Framework uses versioning for all files and is free to use: https://osf.io

You can integrate data or code from various sources such as GitHub, Dropbox, Google Drive, figshare or Amazon cloud.

You can also store files on their server using OSF data storage, but I do not know exactly what the file size limit is.

H. Gourlé

The way we deal with this is:

  • All work is done in a single filesystem mounted on the cluster
  • This file system is mounted on local machines via sshfs/samba (depending on the location of the current "local" machine on the network).
  • Code is versioned with git and GitHub.
  • All computation is carried out via light-weight automated pipelines. We use ruffus in combination with an in-house utility layer. The system doesn't really matter as long as it is no more work to add another step to the pipeline than it would be to execute it manually.
  • All questionable design decisions are encoded in configuration files. These configuration files, along with a very detailed log output by the pipeline (what was run, what was the git commit of the code run, what was the time stamp of the files it was run on, etc.) and the initial input files are version controlled.
  • The benefit of this is that code + configuration + time = output. It is not expected that the whole pipeline will be rerun every time anything is changed, but the pipeline will tell you if something is out of date (it can use timestamps or file hashes), and it can all be run in one go before publication.
  • Any other analysis is carried out in Jupyter notebooks. These are version controlled.

To summarise, we don't synchronise because we only ever work from one disk location even if we use multiple CPU locations. We version control:

  • Code
  • Inputs, configuration, logs
  • Jupyter notebooks

The log records the git commits used to produce the current outputs.
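A log entry of this kind could look like the sketch below. It is not the logging used by ruffus or our utility layer, only an illustration of the information worth recording; the function names and the JSON-lines layout are assumptions.

```python
import json
import subprocess
import time
from pathlib import Path


def current_commit(repo="."):
    """Return the checked-out git commit hash, or None if unavailable."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:  # git is not installed on this machine
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def log_step(log_file, command, inputs, repo="."):
    """Append one JSON line describing a pipeline step to log_file."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "command": command,
        "commit": current_commit(repo),
        # Modification times of the inputs the step was run on.
        "inputs": {str(path): Path(path).stat().st_mtime for path in inputs},
    }
    with open(log_file, "a") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Because each line is self-contained JSON, the log can be version controlled and searched later to find which commit produced a given output.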

Ian Sudbery
  • Interesting Ian, this is the sort of design I aspire to, but don't really follow into practice. I'm intrigued by this in-house layer on top of ruffus, what is it? – Chris_Rands May 19 '17 at 17:54
  • The trick is to make writing a pipeline task easier than a cluster job submission script. The utility layer is provided by CGATPipelines (www.github.com/CGATOxford/CGATPipelines) – Ian Sudbery May 20 '17 at 00:07

Using Git for version-controlling code is a good practice, but it does not lend itself well to versioning large data files. Manually syncing data across multiple nodes is asking for trouble: you want this syncing either to be handled automatically in a managed environment, or to be avoided by keeping the files on a single network-attached storage device.

One tool you might want to look into is Arvados, which is designed for syncing bioinformatics data and workflows across multiple machines. From the project website:

Arvados is a platform for storing, organizing, processing, and sharing genomic and other big data. The platform is designed to make it easier for data scientists to develop analyses, developers to create genomic web applications and IT administers to manage large-scale compute and storage genomic resources. The platform is designed to run in the cloud or on your own hardware.

  • I would only use Arvados if it really got much better than it was before. – nuin May 18 '17 at 21:42
  • @nuin: What specifically do you think was lacking and/or would make it not appropriate as a solution to OP's problem? – woemler May 19 '17 at 13:15
  • @woemler I haven't tried it lately, and if I remember correctly it was a PITA to set up and run simple things. But as I said, I don't know if it got better. – nuin May 19 '17 at 14:58

This answer will sort of only cover the big data parts, i.e. things > 100 MB or so, and only if your analysis pipeline ties in with the Python ecosystem. It will require a bit of learning.

Try using quilt, which is sort of like 'docker for data' (GitHub page).

$ pip install quilt
$ quilt install uciml/iris -x da2b6f5  #note the short hash
$ python
>>> from quilt.data.uciml import iris
# you've got data


Pros:

  • There doesn't seem to be a file size limit for public packages (I haven't stress tested this though)
  • Hashes your data to ensure reproducibility
  • Support for versioning and tags

Cons:

  • Storing private data starts from $7 per user / month for up to 1 TB of data
  • Pretty much Python only at this point, with some community support for R

More information here.


I'm a bit late to answer this question, but we've developed pretty much exactly the system you describe, based on Git and git-annex. It's called DataLad; it is free and open source software, and currently mostly known within the neurosciences.

DataLad's core data structures are DataLad datasets: Git repositories with an optional annex for large or private data. Within a dataset, code and data can be version controlled in conjunction. You can stick to Git and git-annex commands for handling the dataset, but DataLad also offers a core API that aims to harmonize and simplify version control operations. For example, in a dataset configured according to your preferences, a datalad save commits a file based on name (glob), file type, size, or location into either Git or git-annex (instead of git add, git annex add, and git commit). Likewise, a datalad push can push Git history and annexed data to your clones (instead of git push and git-annex sync). Everything is completely decentralized, and DataLad integrates via git-annex's special remote mechanism with a variety of hosting providers (S3, OSF, Dropbox, ...), where you could store data if you like. A high-level overview is here: handbook.datalad.org/r.html?about

Unfortunately it seems it does not work well with a large number of files. Often I have more than 10k small files in my analysis. There are some tricks to improve indexing, but they don't solve the issue. What I need is one symlink representing the full contents of a directory.

DataLad overcomes this issue by using Git's submodule mechanism. Datasets can be nested and operated on recursively (to preserve a monorepo-like feel). With this mechanism, we have, for example, exposed the human connectome project dataset (80TB, 15 million files) to GitHub: github.com/datalad-datasets/human-connectome-project-openaccess

  • Wow, this sounds awesome! Is there some kind of a cleanup mechanism available? E.g., if the intermediate data is not needed anymore, is it possible to delete it permanently? – Iakov Davydov Jun 26 '21 at 15:03
  • You could use the git-annex dropunused command for this to automatically remove data that isn't referenced anymore: https://git-annex.branchable.com/git-annex-dropunused/. Or, use datalad drop / git annex drop on any annexed files of your choice to remove them permanently. – adswa Jun 30 '21 at 10:03