
I want to find a cutoff value for each gene, above which we can consider a gene expressed.

The problem is that not all effectively non-expressed genes will have 0 counts, for example because of sequencing errors.

My question is related to this one, but is concerned with single-cell data. The technical differences between bulk and single-cell RNA-seq warrant a separate question for this problem. It is also similar to this one, but focuses on gene expression rather than on differences between zero counts.

Peter
  • Would this cutoff be applied to non-transformed values or to raw counts? What kind of counts do you have: at the transcript or at the gene level? – llrs Sep 17 '18 at 15:01
  • You have to be careful when setting these kinds of thresholds in single-cell experiments. Using a mean or variance cutoff may cause you to remove genes that define a rare subset of cells. For example, genes expressed only in a subset of cells that represents 2% of your total population may have a low mean expression. – GWW Sep 17 '18 at 15:26
  • @Llopis I don't mind either way. The idea is to assign expressed / non expressed status to each gene (with some confidence). I have counts at the gene level. – Peter Sep 18 '18 at 13:19
  • @GWW Thanks for the comment. Yes I agree a simple approach won't do. Maybe for each gene we can look at the distribution of expression and choose the first minimum, or take a subset (cluster of similar cells, cell type) of cells and work with their mean / variance. – Peter Sep 18 '18 at 13:24
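The concern raised in the comments above can be illustrated with a small sketch. The toy counts, the helper name `summarize_gene`, and the `min_count` parameter are all assumptions made up for this example; the point is only that a population-wide mean hides a gene that is clearly expressed in 2% of cells.

```python
from statistics import mean


def summarize_gene(counts, min_count=1):
    """Return (mean over all cells, fraction of cells detected,
    mean over detected cells only) for one gene.

    `min_count` is a hypothetical detection threshold, not a recommended value.
    """
    detected = [c for c in counts if c >= min_count]
    frac_detected = len(detected) / len(counts)
    mean_detected = mean(detected) if detected else 0.0
    return mean(counts), frac_detected, mean_detected


# Toy data for 100 cells (invented for illustration).
# A marker of a rare population: 5 counts in 2 cells, 0 elsewhere.
rare_marker = [5] * 2 + [0] * 98
# A weakly and broadly detected gene: 1 count in 20 cells.
diffuse_gene = [1] * 20 + [0] * 80

rare_mean, rare_frac, rare_mean_detected = summarize_gene(rare_marker)
diffuse_mean, diffuse_frac, diffuse_mean_detected = summarize_gene(diffuse_gene)

print("rare marker :", rare_mean, rare_frac, rare_mean_detected)
print("diffuse gene:", diffuse_mean, diffuse_frac, diffuse_mean_detected)
```

Here the rare marker has the lower overall mean even though it is the more strongly expressed gene in the cells where it is detected, so a cutoff on the overall mean alone would rank the two genes the wrong way round.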

2 Answers


I am not sure this fully addresses your question, but the way Monocle approaches this problem makes sense to me. The normalization methods in Seurat and Monocle are somewhat similar (correct me if I am wrong, this is simply my personal experience), as @kohlkopf mentioned.

Monocle specifically has a function that displays dispersion versus expression for the genes, which you can use to decide on the cutoff. Here is an example of this plot, which I generated while processing the PBMC data downloaded from the Seurat tutorial.

(image: dispersion versus expression plot)

You can assign the cutoff manually, but I doubt that answers your question of how to decide on the cutoff value. You may need to try a range of expression cutoffs and see which value best answers a biologically relevant question in your context.
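As a rough illustration of trying a range of cutoffs, here is a minimal Python sketch. It is not Monocle's code or API (Monocle is an R package); it only computes, per gene, a mean and a method-of-moments dispersion under an assumed negative binomial model (variance = mean + dispersion × mean²), then reports which genes pass each candidate mean-expression cutoff. The gene names, counts, and cutoff values are invented for the example.

```python
from statistics import mean, variance


def mean_and_dispersion(counts):
    """Per-gene mean and a method-of-moments dispersion estimate.

    Assumes var = mu + phi * mu**2, so phi = (var - mu) / mu**2,
    clipped at 0 when the gene is not overdispersed.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0, 0.0
    phi = max((variance(counts) - mu) / mu**2, 0.0)
    return mu, phi


def genes_passing(expr, mean_cutoff):
    """Names of genes whose mean count is at least `mean_cutoff`."""
    return [
        gene
        for gene, counts in expr.items()
        if mean_and_dispersion(counts)[0] >= mean_cutoff
    ]


# Toy counts for 8 cells (hypothetical data).
expr = {
    "geneA": [0, 0, 0, 1, 0, 0, 0, 0],    # barely detected
    "geneB": [2, 3, 1, 4, 2, 3, 2, 3],    # evenly expressed
    "geneC": [0, 0, 10, 0, 0, 12, 0, 0],  # high in a few cells only
}

for gene, counts in expr.items():
    mu, phi = mean_and_dispersion(counts)
    print(f"{gene}: mean={mu:.3f} dispersion={phi:.3f}")

# Scan a range of candidate cutoffs and see what survives each one.
for cutoff in (0.1, 0.5, 1.0, 3.0):
    print(cutoff, genes_passing(expr, cutoff))
```

Looking at how the list of retained genes changes across cutoffs, together with the dispersion of each gene, gives a feel for where a threshold starts removing genes that matter for the biological question.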

If you are interested in how Monocle approaches this, I think the Monocle introduction and Dave Tang's Blog do an excellent job of explaining how it works.

Again, you are right that a 0 could be a sequencing error. But I would also suggest that, if you are looking at genes with very low expression, scRNA-seq, at least with droplet-based techniques, can be really difficult in that context. I once heard someone who was looking for lncRNA expression explain that a "0" may sometimes be a true "0", given how lowly expressed lncRNAs are in scRNA-seq.

Anyhow, I am not sure I have answered the question, but I do hope it helps you somehow. I am not able to post comments, so this is the best I can do.

BC Wang

Much of the Seurat workflow incorporates normalized, scaled, and mean-centered expression values.

Raw gene expression counts for a cell are normalized by the cell's total expression, multiplied by 10,000, and log-transformed. The expression level of each gene is then scaled by dividing the centered expression values by their standard deviation, resulting in z-scores. It's not an ideal solution, but I find that a threshold of 1 (in the realm of z-scores) makes sense. A z-score of 1 is one standard deviation above the mean, and the mean may be interpreted as baseline expression.
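The steps described above can be sketched in plain Python. This is not Seurat's implementation, only a minimal reading of the description: `log1p` is assumed for the log transform, the sample standard deviation is assumed for scaling, and the count matrix and function names are invented for the example.

```python
import math
from statistics import mean, stdev


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide by the cell's total counts, multiply by scale_factor, take log1p."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def z_scores(values):
    """Center one gene's values across cells and divide by their standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


# Toy raw counts: rows are cells, columns are genes (hypothetical data).
counts = [
    [10, 0, 5],
    [8, 0, 7],
    [9, 6, 5],
    [11, 0, 4],
]

normalized = [log_normalize(cell) for cell in counts]
per_gene = [list(gene) for gene in zip(*normalized)]  # transpose: one row per gene
scaled = [z_scores(gene) for gene in per_gene]

# Call a gene "expressed" in a cell when its z-score exceeds 1.
threshold = 1.0
expressed = [[z > threshold for z in gene] for gene in scaled]
print(expressed[1])
```

Note that a z-score is relative to the gene's own mean across cells, so a gene that is expressed at a similar level in every cell will not exceed the threshold anywhere.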

This is my 2¢. I would love to hear what other people are doing.

Kohl Kinning