
The results obtained by running the results command from DESeq2 contain a "baseMean" column, which I assume is the mean across samples of the normalized counts for a given gene.

How can I access the normalized counts proper?

I tried the following (continuing with the example used here):

> dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = col_data, design = ~ geno_treat)
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> res <- results(dds, contrast=c("geno_treat", "prg1_HS30", "prg1_RT"))

Here is what I have for the first gene:

> res["WBGene00000001",]$baseMean
[1] 181.7862
> mean(assays(dds)$mu["WBGene00000001",])
[1] 231.4634
> mean(assays(dds)$counts["WBGene00000001",])
[1] 232.0833

assays(dds)$counts corresponds to the raw counts. assays(dds)$mu seems to be a transformation of these counts approximately preserving their mean, but this mean is very different from the "baseMean" value, so these are likely not the normalized values.

bli

2 Answers


The normalized counts themselves can be accessed with counts(dds, normalized=T).
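For reference, a minimal sketch (reusing the `dds` object from the question, and assuming plain size-factor normalization rather than gene-specific normalization factors) showing that `normalized=TRUE` simply divides the raw counts by the per-sample size factors:

```r
# Assumes the dds object from the question, with DESeq() already run
norm_counts <- counts(dds, normalized = TRUE)

# Equivalent manual computation: divide each sample (column) by its size factor
manual <- sweep(counts(dds), 2, sizeFactors(dds), "/")
all.equal(norm_counts, manual)   # TRUE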

Now as to what the baseMean actually means, that will depend upon whether an "expanded model matrix" is in use or not. Given your previous question, we can see that geno_treat has a bunch of levels, which means that expanded models are not in use. In such cases, the baseMean should be the mean of the base factor in geno_treat.

Devon Ryan
  • mean(assays(dds)$mu["WBGene00000001",] / sizeFactors(dds)) gives something very close to res["WBGene00000001",]$baseMean (181.9139). The vignette says that assays(dds)$mu correspond to s_j * q_ij where, according to the DESeq2 paper, s_j is the size factor for library j and q_ij is, if I understand correctly, an estimation of a quantity "proportional to the concentration of cDNA fragments from the gene in the sample". Is this q_ij what you call "base factor" ? – bli May 24 '17 at 07:44
  • Not really. With an "expanded model matrix" the intercept (aka, "baseMean") should be approximately the average of the normalized counts. For non-"expanded model matrices" then one level of the group factor (the "base level") becomes the intercept, so the baseMean is then just the average of its counts. The base level is the alphanumerically first level in R by default, if you have grp = factor(c("WT", "MUT")) then the MUT group is used to calculate the baseMean (assuming no expanded model matrices). – Devon Ryan May 24 '17 at 09:52
  • 1
    Here, I find that the mean of all normalized values (mean(counts(dds, normalized=T)["WBGene00000001",])), is actually equal to the baseMean (res["WBGene00000001",]$baseMean). This is not the case if I compute the mean only for a given level of the geno_treat factor. – bli May 24 '17 at 12:13
  • I guess it's using an "expanded model matrix" then. I know Michael used to hard-code exactly the cases when that would happen, but perhaps he's since generalized the code, which is nice :) – Devon Ryan May 24 '17 at 12:58
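Devon's point about the base level can be checked directly in base R with a small hypothetical factor (the `grp` name and levels here are illustrative, not from the question):

```r
# Base R: the reference ("base") level of a factor is alphanumerically first by default
grp <- factor(c("WT", "WT", "MUT", "MUT"))
levels(grp)   # "MUT" "WT"  -> "MUT" is the base level

# relevel() sets a different reference level explicitly
grp <- relevel(grp, ref = "WT")
levels(grp)   # "WT" "MUT"
```

With DESeq2, the same `relevel()` call on the design factor (before running `DESeq()`) controls which level becomes the intercept.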

It depends what you mean by “normalised”. As Devon said, the normalized = TRUE argument to the counts function gives you normalised counts. However, these are “only” library-size normalised (i.e. divided by the sizeFactors(dds)).

However, as the vignette explains, downstream processing generally requires more advanced normalisation, to account for the heteroscedasticity of the counts. This is often done by simply logging the counts but this has obvious drawbacks (most trivially, what do we do with 0 counts? A workaround is to add a pseudocount but that’s problematic too).

DESeq2 offers two different methods to perform a more rigorous analysis:

  • rlog — a regularised log, and
  • vst — a variance stabilising transformation.

You’d generally use either of these for downstream analysis, not counts(dds, normalized = TRUE).
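As a sketch of the two transformations (reusing the `dds` object from the question's code; `blind = FALSE` lets the estimation use the experimental design):

```r
# Regularised log and variance stabilising transforms (DESeq2)
rld <- rlog(dds, blind = FALSE)
vsd <- vst(dds, blind = FALSE)

# Both return DESeqTransform objects; assay() extracts the transformed
# matrix, which is what you'd feed to clustering or PCA
head(assay(rld))
plotPCA(vsd, intgroup = "geno_treat")
```

For large datasets, vst is considerably faster than rlog; the vignette recommends it when the sample count is large.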

Konrad Rudolph
  • To make my confusion clear: say I'm running a parametric or non-parametric test on any comparison between genes — can I use the rlog or vst values? I currently use rlog and vst for clustering, PCA, correlation, and WGCNA. So is that possible? – kcm Jan 29 '19 at 05:44
  • @krushnachChandra It depends on the test and the comparison. DESeq for instance does not use either rlog or vst for its test of differential expression, as these would be inappropriate. – Konrad Rudolph Jan 29 '19 at 12:13
  • Yes, I read that in the paper. For example, if I have a cluster and I want to see whether its genes show a significant difference in expression, and I need to run, say, a non-parametric test, can I use the rlog values, or do I have to use the normalised counts? – kcm Jan 29 '19 at 12:25
  • 1
    @krushnachChandra You cannot use count normalised values in this scenario, nor simple rlog values, since they are not normalised for transcript length (and thus longer transcripts will show higher expression on average). You could use something like size-factor normalised TPM instead. You can also log those TPMs but for a nonparametric test it would make no difference — that’s kind of the point of non-parametric tests. – Konrad Rudolph Jan 29 '19 at 13:46
  • So for PCA and clustering I can use rlog — I hope I'm doing that part correctly; let me know. I read your post. Can I take the DESeq2 values and convert them to TPM? DESeq2 also computes size factors ("dds <- estimateSizeFactors(dds)"), I believe. So can I use your TPM conversion method on DESeq2 values, or just on raw counts? – kcm Jan 30 '19 at 06:10
  • I'm curious about this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4986529/ — Figure 6, last box plot: "(Right) Mean expression of the two gene sets in each population. Error bars represent SD. CPM, counts per million". They use CPM values, which I guess are edgeR output, to make the comparison. I was wondering whether the same could be done with rlog values? Sorry for the mundane question; my understanding of these concepts is fuzzy, as I'm not a statistician. – kcm Jan 30 '19 at 11:07