18

In single-cell RNA-seq data we have an inflated number of 0 (or near-zero) counts due to low mRNA capture rate and other inefficiencies.

How can we decide which genes are 0 due to gene dropout (lack of measurement sensitivity), and which are genuinely not expressed in the cell?

Deeper sequencing does not solve this problem, as shown in the saturation curve of 10x Chromium data below:

[Figure: saturation curve of 10x Chromium data]

Also see Hicks et al. (2017) for a discussion of the problem:

Zero can arise in two ways:

  1. the gene was not expressing any RNA (referred to as structural zeros) or

  2. the RNA in the cell was not detected due to limitations of current experimental protocols (referred to as dropouts)
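
To make the distinction concrete, here is a minimal sketch of how both kinds of zeros end up looking identical in the observed counts. It assumes a simple binomial thinning model in which every mRNA molecule is captured independently with the same probability; the gene counts, the 10% capture rate, and the function name `simulate_cell` are all hypothetical and chosen only for illustration.

```python
import random

def simulate_cell(true_counts, capture_rate, rng):
    """Binomially thin true molecule counts.

    Each molecule is captured independently with probability
    capture_rate (an assumed, simplified capture model).
    """
    return [
        sum(1 for _ in range(n) if rng.random() < capture_rate)
        for n in true_counts
    ]

rng = random.Random(0)
true_counts = [0, 0, 3, 5, 50, 200]  # hypothetical true counts per gene
observed = simulate_cell(true_counts, capture_rate=0.1, rng=rng)

# Label each observed zero using the (normally unknown) ground truth.
for gene, (t, o) in enumerate(zip(true_counts, observed)):
    if o == 0:
        kind = "structural zero" if t == 0 else "dropout"
        print(f"gene {gene}: true={t}, observed=0 ({kind})")
```

In real data only `observed` is available, which is why the two cases cannot be separated by looking at a single cell's counts alone.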

Peter

2 Answers

15

This is actually one of the main problems in analyzing scRNA-seq data, and there is no established method for dealing with it. Different (dedicated) algorithms handle it in different ways, but mostly you rely on how good the error modelling of your software is (a great read is the review by Wagner, Regev & Yosef, esp. the section on "False negatives and overamplification"). There are a couple of options:

  • You can impute values, i.e. fill in the gaps on technical zeros. CIDR and scImpute do it directly. MAGIC and ZIFA project cells into a lower-dimensional space and use their similarity there to decide how to fill in the blanks.
  • Some people straight up exclude genes that are expressed in very low numbers. I can't give you citations off the top of my head, but many trajectory inference algorithms like monocle2 and SLICER have heuristics to choose informative genes for their analysis.
  • If the method you use for analysis doesn't model gene expression explicitly but uses some other distance method to quantify similarity between cells (like cosine distance, euclidean distance, correlation), then the noise introduced by dropout can be covered by the signal of genes that are highly expressed. Note that this is dangerous, as genes that are highly expressed are not necessarily informative.
  • ERCC spike-ins can help you reduce technical noise, but I am not familiar with the Chromium protocol, so maybe they don't apply there (?)
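
As a rough illustration of the second and third options, here is a minimal sketch that drops rarely detected genes and then compares cells by cosine distance. The function names, the toy count matrix, and the 50% detection threshold are my own assumptions for the example, not what monocle2 or SLICER actually do.

```python
import math

def filter_genes(counts, min_fraction=0.1):
    """Return indices of genes detected (count > 0) in at least
    min_fraction of cells. counts is a list of cells, each a list
    of per-gene counts."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    keep = []
    for g in range(n_genes):
        detected = sum(1 for cell in counts if cell[g] > 0)
        if detected / n_cells >= min_fraction:
            keep.append(g)
    return keep

def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy matrix: 4 cells (rows) x 4 genes (columns).
counts = [
    [0, 10, 0, 3],
    [0, 12, 1, 0],
    [0,  0, 0, 4],
    [0, 11, 0, 0],
]
keep = filter_genes(counts, min_fraction=0.5)
filtered = [[cell[g] for g in keep] for cell in counts]
print(keep)
print(cosine_distance(filtered[1], filtered[3]))
```

Note how the distance between cells is dominated by the highly expressed gene, which is exactly the danger mentioned above: a zero in a lowly expressed gene barely moves the distance, whether it is a dropout or not.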

Since we are speaking about noise, you might consider using a protocol with unique molecular identifiers (UMIs). They remove the amplification errors almost completely, at least for the transcripts that you capture...
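
The idea behind UMI counting can be sketched as follows: reads that share the same cell barcode, gene, and UMI are treated as amplification copies of one original molecule and counted once. This is a simplified illustration with made-up barcodes and a hypothetical `umi_counts` helper; real pipelines also correct sequencing errors in the UMIs, which is ignored here.

```python
from collections import defaultdict

def umi_counts(reads):
    """Collapse reads to molecule counts.

    reads: iterable of (cell_barcode, gene, umi) tuples, one per read.
    Returns {(cell_barcode, gene): number of distinct UMIs}.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

# Hypothetical reads; the second one is a PCR duplicate of the first.
reads = [
    ("cellA", "Gapdh", "AACG"),
    ("cellA", "Gapdh", "AACG"),
    ("cellA", "Gapdh", "TTGC"),
    ("cellB", "Gapdh", "AACG"),
]
counts = umi_counts(reads)
print(counts)
```

This removes the amplification bias for molecules that were captured, but a molecule that was never captured still shows up as a zero, so UMIs do not fix dropout itself.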

EDIT: Also, I would highly recommend using something more advanced than PCA to do the analysis. Software like the above-mentioned Monocle or destiny is easy to operate and increases the power of your analysis considerably.

galicae
1

Some people use imputation to differentiate between true zeros and dropout in single-cell data. Some approaches you can look into:

M__
burger