13

I have a gene expression count matrix produced from bulk RNA-seq data. I'd like to find genes that were not expressed in a group of samples and were expressed in another group.

The problem of course is that not all effectively non-expressed genes will have 0 counts due to sequencing errors, or because they were expressed in a small subset of cells.

I'm interested in solutions using R.

Peter
  • 2,634
  • 15
  • 33

4 Answers4

13

I'd like to find genes that were not expressed in a group of samples and were expressed in another group.

This is, fundamentally, a differential expression analysis, with a twist. To solve this, you’d first use a differential expression library of your choice (e.g. DESeq2) and perform a one-tailed test of differential expression.

Briefly, you’d perform the normal setup and then use

results(dds, altHypothesis = 'greater')

To perform a one-tailed test. This will give you only those genes that are significantly upregulated in one group. Check chapter 3.9 of the vignette for details.

Of course this won’t tell you that the genes are unexpressed in the other group. Unfortunately I don’t know of a good value to threshold the results; I would start by plotting a histogram of the (variance stabilised) expression values in your first group, and then visually choose an expression threshold that cleanly separates genes that are clearly expressed from zeros:

vst_counts = assay(vst(dds))
dens = density(vst_counts[, replicate])
plot(dens, log = 'y')

(This merges the replicates in the group, which should be fine.)

Counts follow a multimodal distribution, with one mode for unexpressed and one or more for expressed genes. The expression threshold can be set somewhere between the clearly unexpressed and expressed peaks:

enter image description here

Here I used identify(dens) to identify the threshold interactively but you could also use an analytical method:

threshold = identify(dens)
quantile = sum(dens$x < dens$x[threshold]) / length(dens$x)

# Using just one replicate here; more robust would be to use a mean value.
nonzero_counts = counts(dds, normalized = TRUE)[, replicates[1]]
nonzero_counts = nonzero_counts[nonzero_counts > 0]

(expression_threshold = quantile(nonzero_counts, probs = quantile))
26.5625%
4.112033
Konrad Rudolph
  • 4,845
  • 14
  • 45
  • I like this approach but I have a few concerns. In the DE step, how does the differential expression test perform with zero count values? If one group has zero values and the other group has near zero, or clearly expressed, but with high variability etc. In the cutoff step, (assuming that the blue line is the threshold) it seems you have a very low cutoff value (~1), so you will basically accept all DE genes (or accept all genes above count 0)? – Peter Jun 14 '17 at 10:35
  • Also the graph is not bimodal, but has 4 local maxima. The second peak corresponds to roughly 1-2 counts, and when you have 20 million reads, they are probably not not expressed. Is it not better to have the cutoff after the 3rd peak, around rlog=3? – Peter Jun 14 '17 at 10:37
  • @Peter Your observation about the different modes is a good one that I glossed over. In fact, the distinction should be between a mode for unexpressed genes, and “all the rest”. In order to determine which mode corresponds to truly unexpressed genes, it might be necessary to plot the density with different bandwidths and see how the modes change. My plot isn’t perfect in this regard since I used simulated data based on averaged counts. Real counts would behave better. – Konrad Rudolph Jun 14 '17 at 10:48
  • 1
    @Peter Your comments made many good points so I redid the example with actual read count data, and used VST instead of rlog, which appears to work somewhat better here. – Konrad Rudolph Jun 14 '17 at 12:24
10

A common method is to use zFPKMs, which you can find implemented in R here.

Having said that, there's an inherent problem in declaring a difference between two things on either side of a given threshold. Given that, I would encourage you to use more than "expressed in one and not in another", likely adding at least a "with a minimal difference of X between them" metric. You may also find the tau metric useful, which is implemented in R here. This is meant to measure tissue-specificity, which is more akin to what you're probably interested in doing.

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
5

Depending on how much effort you wish to put into this, here is one suggestion I have see used before (uses more than just R). Steps 1-4 come from this paper (see supplimentary material section "Calculation of per gene Local FDR").

  1. Take your gene models and collapse the introns down to maybe 100bp
    Thus if we have a gene with two exons and a 1kb intron eg. exons (1000, 1100) and (2000,2100), we reduce the intron size so that the exons are (1000,1100) and (1200,1300). This is because we need to find sufficient gene free space to fit the null gene in.

  2. Shift each gene into the nearest genomic space at least 5 kp from a genomic annotation and free of ESTs.

  3. Calculate the FPKM distribution of this shifted null set

  4. From these two distributions it should be possible to calculate a local FDR for expression.
    You will need to use a "mixing proportion" to do this. The above reference used Qvality to do this.

  5. Carry out differential expression analysis as described in Konrad's answer

  6. Subset the differentially expressed genes to only consider those that are not expressed in the control condition

  7. You might want to also recalculate the FDRs for differential expression as you are only considering a subset of the tests you might have.

Ian Sudbery
  • 3,311
  • 1
  • 11
  • 21
  • This sounds like a promising approach for calculating a threshold. However, I don’t entirely understand your step 1, could you expand on that? – Konrad Rudolph Jun 14 '17 at 09:53
  • 1
    Edited to include more detail on step 1. – Ian Sudbery Jun 15 '17 at 09:54
  • Yes the calculation of local FDR is something that has come up lately and been supported both by Gordon Smith and Mike Love. In any case if you do not want to do that one can always start from a list of genes that are above a certain threshold of expression levels to restrict the universe of expressed genes and then perform your DE with DESeq2. – ivivek_ngs Jun 15 '17 at 10:58
0

Unfortunatly that is not doable - there is now way of distinguishing between features not expressed and features not expressed for technical reasons. Furthermore a lot of genes might seem lowly expressed to to technical noise.

This also means that you will have to choose a cutoff and simply use that. For choosing a cutoff looking at the replicate distribution of log10(counts) and log10(FPKM) are probably the best options.

  • 2
    I think you’ve posted your answer on the wrong question: did you mean to post it here instead? https://bioinformatics.stackexchange.com/q/693/29 – Konrad Rudolph Jun 13 '17 at 13:14
  • @KonradRudolph Actually, the linked question did not exist when this answer was posted. – Peter Jun 15 '17 at 10:03