First, sorry if I am missing something basic - I am a programmer recently turned bioinformatician, so I still don't know a lot of stuff. This is a cross-post of a Biostars question; I hope that's not bad form.

While it is obvious that scRNA-seq data contain lots of zeroes, I couldn't find any detailed explanation of why they occur except for short notices along the lines of "substantial technical and biological noise". For the following text, let's assume we are looking at a single gene that is not differentially expressed across cells.

If zeroes were caused solely by low capture efficiency and sequencing depth, all observed zeroes should be explained by low mean expression across cells. This, however, does not seem to be the case, as the distribution of gene counts across cells often has more zeroes than would be expected from a negative binomial model. For example, the ZIFA paper explicitly uses a zero-inflated negative binomial distribution to model scRNA-seq data. Modelling scRNA-seq as zero-inflated negative binomial seems widespread throughout the literature.

However, assuming a negative binomial distribution for the original counts (as measured in bulk RNA-seq) and assuming that every RNA fragment of the same gene from every cell has approximately the same (low) chance of being captured and sequenced, the distribution across single cells should still be negative binomial (see this question for related math).
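The thinning argument above can be checked numerically. The following is only a minimal sketch: the parameter values (`r`, `mu_true`, `capture`) are made up for illustration, and capture is assumed to be independent for every molecule. Under those assumptions, binomial thinning of a negative binomial keeps the size parameter and scales the mean, so the zero fraction should match the plain negative binomial formula.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 200_000
r = 2.0          # NB size parameter (hypothetical value)
mu_true = 50.0   # mean true transcript count per cell (hypothetical value)
capture = 0.05   # per-molecule capture probability (hypothetical value)

# NumPy parameterises the NB by (n, p) with mean n * (1 - p) / p,
# so p = r / (r + mu) gives mean mu.
true_counts = rng.negative_binomial(r, r / (r + mu_true), size=n_cells)

# Each molecule is captured independently with the same probability.
observed = rng.binomial(true_counts, capture)

# If the thinned counts are NB with the same r and mean mu_true * capture,
# the zero fraction is (r / (r + mu_obs)) ** r.
mu_obs = mu_true * capture
zero_expected = (r / (r + mu_obs)) ** r
zero_observed = np.mean(observed == 0)

print(f"expected zero fraction: {zero_expected:.4f}")
print(f"observed zero fraction: {zero_observed:.4f}")
```

With these assumptions the two numbers should agree up to sampling noise, i.e. low capture efficiency alone produces many zeroes but no excess over the negative binomial.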

So the only remaining possible cause is that inflated zero counts are caused by PCR. Only non-zero counts (after capture) are amplified and then sequenced, shifting the mean of the observed gene counts away from zero while the pre-PCR zero counts stay zero. Indeed some quick simulations show that such a procedure could occasionally generate zero-inflated negative binomial distributions. This would suggest that excessive zeroes should not be present when UMIs are used - I checked one scRNA-seq dataset with UMIs and it seems to be fit well by plain negative binomial.
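One way such a simulation might look is sketched below. It is not the simulation referred to above, just an illustration under loud assumptions: captured molecules are negative binomial, every captured molecule yields at least one read, and the number of extra PCR copies per molecule is Poisson with a made-up mean. The negative binomial is fitted by the method of moments (not maximum likelihood) purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)

n_cells = 100_000
r, mu = 2.0, 0.5           # NB parameters of captured molecules (hypothetical)
mean_extra_copies = 20.0   # mean extra PCR copies per molecule (hypothetical)

# Captured molecules per cell; this is what UMI counts would report.
molecules = rng.negative_binomial(r, r / (r + mu), size=n_cells)

# Assumption: each molecule yields 1 + Poisson(mean_extra_copies) reads.
# The sum of m independent Poisson(lam) draws is Poisson(m * lam).
reads = molecules + rng.poisson(mean_extra_copies * molecules)

def nb_zero_fraction(x):
    """Zero fraction predicted by a moment-matched NB (needs var > mean)."""
    m, v = x.mean(), x.var()
    size = m**2 / (v - m)
    return (size / (size + m)) ** size

umi_zero_obs = np.mean(molecules == 0)
umi_zero_fit = nb_zero_fraction(molecules)
read_zero_obs = np.mean(reads == 0)
read_zero_fit = nb_zero_fraction(reads)

print(f"UMI counts:  observed {umi_zero_obs:.3f}, NB fit {umi_zero_fit:.3f}")
print(f"read counts: observed {read_zero_obs:.3f}, NB fit {read_zero_fit:.3f}")
```

In this toy setup the molecule (UMI-like) counts are matched by the fitted negative binomial, while the read counts keep the same zero fraction but have a much larger mean, so the fitted negative binomial predicts far fewer zeroes than are observed.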

Is my reasoning correct? Thanks for any pointers.

The question How can we distinguish between true zero and dropout-zero counts in single-cell RNA-seq? is related, but provides no clues to my present inquiry.

Martin Modrák
3 Answers


It may be necessary to distinguish between methods that use unique molecular identifiers (UMIs), such as 10X's Chromium, Drop-seq, etc., and non-UMI methods, such as Smart-seq. At least for UMI-based methods, the alternative perspective, that there is no significant zero-inflation in scRNA-seq, is also advocated in the single-cell research community. The argument is straightforward: the empirical mean expression vs. dropout rate curve matches the theoretically predicted one, given the current levels of capture efficiency.
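To make the "theoretically predicted curve" concrete, here is a small sketch of the comparison. The function name `expected_zero_fraction` and the simulated gene means are illustrative assumptions, not taken from any of the cited posts. For a Poisson model with mean `mean` the expected zero fraction is `exp(-mean)`; for a negative binomial with dispersion `dispersion` (variance `mean + dispersion * mean**2`) it is `(1 + dispersion * mean) ** (-1 / dispersion)`.

```python
import numpy as np

def expected_zero_fraction(mean, dispersion=0.0):
    """Expected fraction of zeroes for Poisson (dispersion=0) or NB counts."""
    mean = np.asarray(mean, dtype=float)
    if dispersion == 0.0:
        return np.exp(-mean)
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)

rng = np.random.default_rng(2)

# Simulated stand-in for a cells x genes count matrix without zero inflation.
gene_means = np.logspace(-2, 1, 30)
counts = rng.poisson(gene_means, size=(20_000, gene_means.size))

# Per-gene empirical dropout rate vs. the rate predicted from the gene mean.
empirical = (counts == 0).mean(axis=0)
theoretical = expected_zero_fraction(counts.mean(axis=0))
max_gap = np.max(np.abs(empirical - theoretical))

print(f"largest gap between empirical and predicted dropout: {max_gap:.4f}")
```

On real data one would replace `counts` with the observed matrix; genes lying well above the predicted curve would be candidates for zero inflation.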


Svensson Blog

A couple of blog posts from Valentine Svensson argue this point rather pedagogically, and include citations from across the literature:

Droplet scRNA-seq is not zero-inflated

Count-depth variation makes Poisson scRNA-seq data negative binomial


There is a more extensive preprint by Tang, Shahrezaei, et al. (BioRxiv, 2018) that claims to show a binomial model is sufficient to account for the observed dropout noise. Here is a snippet of a relevant conclusion:

Importantly, as bayNorm recovered dropout rates successfully in both UMI-based and non-UMI protocols without the need for specific assumptions, we conclude that invoking zero-inflation models is not required to describe scRNA-seq data. Consistent with this, the differences in mean expression levels of lowly expressed genes observed between bulk and scRNA-seq data, which were suggested to be indicative of zero-inflation, were recovered by our simulated data using the binomial model only.

Multinomial Modeling

There is also a very clearly written preprint by Townes, Irizarry, et al. (BioRxiv, 2019) where the authors consider scRNA-seq as a proper compositional sampling (i.e., multinomial process) and they come to a similar conclusion, though specifically for UMI-based methods. From the paper:

The multinomial model makes two predictions which we verified using negative control data. First, the fraction of zeros in a sample (cell or droplet) is inversely related to the total number of UMIs in that sample. Second, the probability of an endogenous gene or ERCC spike-in having zero counts is a decreasing function of its mean expression (equations provided in Methods). Both of these predictions were validated by the negative control data (Figure 1). In particular, the empirical probability of a gene being zero across droplets was well calibrated to the theoretical prediction based on the multinomial model. This also demonstrates that UMI counts are not zero inflated.
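Both predictions are easy to reproduce on simulated data. The sketch below is not the authors' code; the relative abundances `pi` and the per-cell totals are invented, and each cell is drawn as a single multinomial sample. Under that model the probability that gene j is zero in a cell with n UMIs is `(1 - pi[j]) ** n`.

```python
import numpy as np

rng = np.random.default_rng(3)

n_genes, n_cells = 500, 300

# Hypothetical relative abundances shared by all cells (negative-control style).
weights = rng.lognormal(mean=0.0, sigma=2.0, size=n_genes)
pi = weights / weights.sum()

# Hypothetical total UMI counts per cell.
totals = rng.integers(200, 5000, size=n_cells)
counts = np.array([rng.multinomial(n, pi) for n in totals])

# Prediction 1: the fraction of zeroes in a cell drops as its total UMIs grow.
zero_frac_per_cell = (counts == 0).mean(axis=1)
corr = np.corrcoef(np.log(totals), zero_frac_per_cell)[0, 1]

# Prediction 2: P(gene is zero) decreases with expression and is calibrated
# to (1 - pi_j) ** n_i averaged over cells.
p_zero_theory = ((1.0 - pi)[None, :] ** totals[:, None]).mean(axis=0)
p_zero_empirical = (counts == 0).mean(axis=0)
mean_gap = np.mean(np.abs(p_zero_empirical - p_zero_theory))

print(f"correlation of log(total UMIs) with zero fraction: {corr:.3f}")
print(f"mean |empirical - theoretical| P(zero) per gene:   {mean_gap:.4f}")
```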

Furthermore, by comparing raw read counts (prior to UMI-based deduplication) and UMI counts, they conclude that PCR is indeed the cause of zero-inflation:

The results suggest that while read counts appear zero-inflated and multimodal, UMI counts follow a discrete distribution with no zero inflation (Figure S1). The apparent zero inflation in read counts is a result of PCR duplicates.

I highly recommend giving this a read, especially because it nicely situates other common generative models (e.g., binomial, Poisson) as valid simplifying assumptions of the multinomial model.

It should be noted that this same group previously published a work (Hicks, Irizarry, et al. 2018), mostly focused on non-UMI-based datasets (Smart-seq), where they showed evidence that, relative to bulk RNA-seq, there was significant zero-inflation.

    One more preprint arguing that zero inflation is not a good model: https://www.biorxiv.org/content/10.1101/477794v1.abstract – Martin Modrák Mar 02 '19 at 21:12

The Biostars thread turned out to be helpful. The most interesting possible cause, not mentioned in Ian Sudbery's answer, is that due to the bursty nature of transcription, the true distribution of transcript counts across cells can be bimodal with a peak at zero, even assuming a simple model of transcription such as the random telegraph model. See for example Dattani & Barahona 2017 for a more detailed discussion.

However, the random telegraph model predicts that even with a peak at zero, there should usually be a non-negligible probability of having small but nonzero counts, which is not always the case in zero-inflated scRNA-seq data: see for example the Isl1 gene in the Retina dataset (click Explore -> search for "Isl1").
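A rough illustration of both points, assuming the standard steady-state result that the telegraph model gives a Poisson-Beta distribution of transcript counts (rates expressed in units of the degradation rate). All rate values below are invented for the example and are not fitted to any dataset.

```python
import numpy as np

rng = np.random.default_rng(4)

n_cells = 200_000
k_on, k_off = 0.2, 0.3   # switching rates / degradation rate (hypothetical)
k_syn = 50.0             # synthesis rate / degradation rate (hypothetical)

# Steady state of the telegraph model: counts ~ Poisson(k_syn * X),
# with X ~ Beta(k_on, k_off). For k_on, k_off < 1 the Beta is U-shaped.
activity = rng.beta(k_on, k_off, size=n_cells)
counts = rng.poisson(k_syn * activity)

p_zero = np.mean(counts == 0)
p_small = np.mean((counts >= 1) & (counts <= 5))
p_high = np.mean(counts > 25)

print(f"P(count = 0)       = {p_zero:.3f}")
print(f"P(1 <= count <= 5) = {p_small:.3f}")
print(f"P(count > 25)      = {p_high:.3f}")
```

With these made-up rates the distribution has one mode at zero and another at high counts, yet the mass on small nonzero counts stays clearly above zero, which is the feature that seems to be missing for genes like Isl1.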

Martin Modrák
I know of no references for this, but in general, I would say that your reasoning is sound. I would just add that in contrast to what I suspect you have simulated, not all transcripts are equally likely to be captured and amplified. We don't really understand what the determinants of this are, but for example, GC content is definitely related.

Ian Sudbery
  • Thanks for the note. I understand there are between-gene differences due to GC content etc., but I am only looking at one gene at a time, so it should not matter. Do you believe that the sequencing results could be noticeably influenced by within-gene differences - e.g. because a specific fragment of a long transcript is much better captured/amplified than others, letting us observe large non-zero counts only in cells where this fragment was captured and zeroes otherwise? – Martin Modrák Oct 21 '17 at 12:36