Background: We increasingly need a way to store large amounts of variant data associated with many subjects: think clinical trials and hospital patients, where we are looking for disease-causing or otherwise relevant genes. A thousand subjects is where we'd start, and there's talk of millions on the horizon. With the various genomic medicine initiatives under way, this is likely a wider need.

The problem: While there are plenty of platforms out there, this is a rapidly evolving field. It's difficult to get a feel for how (and whether) they perform and how they compare with each other:

  • What's scalable and can handle a lot of data? What sort of limits?
  • What's robust and not a teetering pile of hacked-together components?
  • What has a large community behind it and is actually used widely?
  • What makes for easy access and search from another service? (Command line, REST or software APIs)
  • What sort of variants do they handle?
  • What sort of parameters can be used in searching?

Solutions I've seen so far:

  • BigQ: used with i2b2, but its wider use is unclear
  • OpenCGA: looks the most developed, but I've heard complaints about the size of data it spits out
  • Using BigQuery over a Google Genomics db: doesn't seem to be a general solution
  • Gemini: recommended but is it really scalable and accessible from other services?
  • SciDb: a commercial general db
  • Quince
  • LOVD
  • Adam
  • Whatever platform DIVAS & RVD run on: which may not be freely available
  • Several graphical / graph genome solutions: We (and most other people) are probably not dealing with graph genome data at the moment, but is this a possible solution?
  • Roll your own: Frequently recommended but I'm sceptical this is a plausible solution for a large dataset.

Can anyone with experience give a review or high-level guide to this platform space?

Daniel Standage
  • My two cents: use MongoDB wrapped in a simple REST framework. Allows flexible model and queries and should scale into the billions of records on a single node. Working on a FLOSS project for this at the moment, but it is not production ready yet. – woemler May 23 '17 at 01:12
  • @woemler How does it compare with other approaches? Someone I know tried MongoDB ~5 years ago on 1000g genotypes. He said MongoDB was over 10x slower than bcf2 on parallel queries while having a much larger disk/memory footprint. That said, he was new to MongoDB back then and might not have been doing it the optimal way. – user172818 May 23 '17 at 13:18
  • @user172818: The newer versions of MongoDB (3.2+) are significantly faster than the versions from several years ago. I have benchmarked it against other free RDBMSs, and it typically performs as well as or better than they do, especially for complex data representations like variant calls. – woemler May 23 '17 at 13:28
  • Is storing the data more important here, or is processing statistics about the data (using Python, R, etc.) more important? – JustBeingHelpful Apr 08 '18 at 07:26
  • @macgyver : good observation. The data: supposedly people will want to mine and query the data, rather than look at summary stats and analyses. – agapow Apr 11 '18 at 13:43

1 Answer


An epic question. Unfortunately, the short answer is: no, there are no widely used solutions.

For several thousand samples, BCF2, the binary representation of VCF, should work well. I don't see the need for new tools at this scale. For larger sample sizes, the ExAC people are using the Spark-based Hail. It keeps all per-sample annotations (like GL, GQ and DP) in addition to genotypes. Hail is at least heavily used in practice, although mostly by a few groups so far.

A simpler problem is to store genotypes only, which is sufficient for the majority of end users. There are better approaches for storing and querying genotypes. GQT, developed by the Gemini team, enables fast queries over samples: it lets you quickly pull samples matching certain genotype configurations. As I remember, GQT is orders of magnitude faster than the Google Genomics API for doing PCA. Another tool is BGT. It produces a much smaller file and provides fast and convenient queries over sites. Its paper talks about ~32k whole-genome samples. I am in the camp that believes specialized binary formats like GQT and BGT are faster than solutions built on top of generic databases. I would encourage you to have a look if you only want to query genotypes.
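To make the genotype-only idea concrete, here is a minimal Python sketch. It is not the GQT or BGT format; the 2-bit codes, the function names, and the site keys are all assumptions made up for illustration. It only shows why dropping per-sample annotations makes storage compact and makes "pull samples with this genotype configuration" a cheap lookup.

```python
from typing import Dict, List

# Hypothetical 2-bit genotype codes (an assumption, not a real file format):
# 0 = hom-ref, 1 = het, 2 = hom-alt, 3 = missing.
HOM_REF, HET, HOM_ALT, MISSING = 0, 1, 2, 3


def pack_genotypes(codes: List[int]) -> bytes:
    """Pack one site's genotype codes, four samples per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def get_genotype(packed: bytes, sample_index: int) -> int:
    """Read back the 2-bit code for one sample at one site."""
    return (packed[sample_index // 4] >> (2 * (sample_index % 4))) & 0b11


def samples_matching(
    sites: Dict[str, bytes], n_samples: int, wanted: Dict[str, int]
) -> List[int]:
    """Return indices of samples whose genotype equals the wanted code
    at every site listed in `wanted`."""
    return [
        s
        for s in range(n_samples)
        if all(get_genotype(sites[key], s) == code for key, code in wanted.items())
    ]


if __name__ == "__main__":
    # Four samples, two sites; the site keys are illustrative only.
    sites = {
        "chr1:1000": pack_genotypes([HOM_REF, HET, HOM_ALT, HET]),
        "chr1:2000": pack_genotypes([HET, HET, HOM_REF, MISSING]),
    }
    print(samples_matching(sites, 4, {"chr1:1000": HET, "chr1:2000": HET}))
```

At four samples per byte, one site across a million samples takes about 250 kB before any compression, which gives a rough sense of why specialized genotype formats can stay small.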

Intel's GenomicsDB approaches the problem from a different angle. It does not actually keep a "squared" multi-sample VCF internally. Instead, it keeps per-sample genotypes/annotations and generates a merged VCF on the fly (this is my understanding, which could be wrong). I don't have first-hand experience with GenomicsDB, but I think something along these lines should be the ultimate solution in the era of 1M samples. I know GATK4 uses it at some step.
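The per-sample storage idea described above can be sketched in a few lines of Python. This is a toy illustration of "store per sample, merge on request", not the actual internals or API of any real tool; the `Site` tuple, the `merge_on_the_fly` name, and the choice to fill absent calls with `./.` are all assumptions. A real system working from gVCF-style input would use reference blocks to tell hom-ref apart from no-call.

```python
from typing import Dict, List, Tuple

# A site is identified by (chrom, pos, ref, alt); this key is an assumption.
Site = Tuple[str, int, str, str]


def merge_on_the_fly(
    per_sample: Dict[str, Dict[Site, str]], default: str = "./."
) -> Tuple[List[str], List[Tuple[Site, List[str]]]]:
    """Build a site-by-sample genotype table from per-sample call sets.

    Nothing "squared" is stored: each sample only holds its own calls,
    and the merged rows are produced when requested.
    """
    samples = sorted(per_sample)
    # Union of all sites seen in any sample, in sorted order.
    sites = sorted({site for calls in per_sample.values() for site in calls})
    rows = [
        (site, [per_sample[name].get(site, default) for name in samples])
        for site in sites
    ]
    return samples, rows


if __name__ == "__main__":
    # Two hypothetical samples with partly overlapping call sets.
    per_sample = {
        "S1": {("chr1", 100, "A", "G"): "0/1"},
        "S2": {("chr1", 100, "A", "G"): "1/1", ("chr1", 200, "C", "T"): "0/1"},
    }
    samples, rows = merge_on_the_fly(per_sample)
    print("site", *samples, sep="\t")
    for (chrom, pos, ref, alt), genotypes in rows:
        print(f"{chrom}:{pos}:{ref}>{alt}", *genotypes, sep="\t")
```

The point of the design is that adding a new sample only adds one more per-sample call set; no existing merged file has to be rewritten.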

As for the others in your list, Gemini might not scale that well, I guess; that is partly the reason they work on GQT. Last time I checked, BigQuery did not query individual genotypes, only site statistics. The Google Genomics APIs access individual genotypes, but I doubt they can be performant. Adam is worth trying, though I have not tried it myself.

  • +1 for hail, clearly The Right Answer at this point – blmoore May 23 '17 at 08:45
  • You can query individual genotypes using BigQuery. The biggest challenge at this point is having to write your own queries to do analysis. – Greg May 23 '17 at 15:11