I have seen many posts regarding counts to RPKM and TPM. I haven't seen any post for counts to FPKM.

I have paired-end RNA-Seq data and extracted the counts for all samples using featureCounts.

Here is a function that converts counts to RPKM using the gene length:

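# counts:  numeric vector of read counts for one sample
# lengths: gene lengths in base pairs, in the same order as counts
rpkm <- function(counts, lengths) {
  rate <- counts / lengths      # reads per base
  rate / sum(counts) * 1e9      # per kilobase (1e3) and per million reads (1e6)
}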

I know that RPKM is mainly used for single-end data. Can I use the function above to convert counts to FPKM, given that my data is paired-end? [For TCGA data, HTSeq counts are converted to HTSeq FPKM.]

I have read somewhere that the same function applies to both cases: for single-end data the result is RPKM, and for paired-end data it is FPKM. Am I right or wrong?

If not, can anyone show a function or some code to convert counts to FPKM, please?

Thank you.

beginner
  • Why do you need RPKM/FPKM? Haven't you heard that these values are obsolete? – benn Oct 22 '18 at 10:51
  • I guess the OP wants to follow the most up-to-date procedures but, seeing that TCGA used FPKM, is trying to work the same way. I thought that FPKM values were OK: see this https://bioinformatics.stackexchange.com/a/4270/48, or this https://bioinformatics.stackexchange.com/a/69/48. @beginner I hope the links are helpful; update the question otherwise. – llrs Oct 22 '18 at 11:09
  • @Llopis Thanks for the links. In this https://bioinformatics.stackexchange.com/questions/66/how-to-compute-rpkm-in-r/69#69 I see that the FPKM function uses the effective length. Is that the same as gene_length? – beginner Oct 22 '18 at 11:14
  • @b.nota Yes, but as TCGA data is FPKM I wanted to use FPKM data. – beginner Oct 22 '18 at 11:17
  • @beginner See this question and answers: https://bioinformatics.stackexchange.com/q/367/48 – llrs Oct 22 '18 at 11:31

2 Answers

I have seen many posts regarding counts to RPKM and TPM.

There’s your answer then: FPKM = RPKM. It’s simply a more accurate name.

Speaking of RPKM for paired-end data is discouraged because the reference to “read” in this context lends itself to ambiguity. But mathematically the quantity is the same: we are counting fragments, not individual reads (of which each fragment has two, for paired-end data).
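To illustrate that the computation does not change, here is a minimal sketch that applies the formula to a genes x samples count matrix. The helper name `counts_to_fpkm` and the toy numbers are made up for illustration. It assumes lengths in base pairs (so the combined scaling factor is 1e9) and uses the sum of counted fragments per sample as the library size, as the function in the question does. Whether the result is called RPKM or FPKM depends only on whether the counts are reads or fragments.

```r
# Hypothetical helper: FPKM (numerically the same as RPKM) for a count matrix.
# Assumes `lengths` holds feature lengths in bp, one per row of `counts`.
counts_to_fpkm <- function(counts, lengths) {
  stopifnot(nrow(counts) == length(lengths))
  lib_size <- colSums(counts)              # fragments counted per sample
  per_kb <- counts / (lengths / 1e3)       # fragments per kilobase of feature
  sweep(per_kb, 2, lib_size / 1e6, "/")    # per million counted fragments
}

# Toy data, invented for this example
counts <- matrix(c(100, 300, 600,
                   200, 200, 600),
                 ncol = 2,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))
lengths <- c(1000, 2000, 3000)

fpkm <- counts_to_fpkm(counts, lengths)
```

For paired-end data, the counts passed in should be fragment counts (one per read pair), which is what makes the result an FPKM.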

But as mentioned in the comments, there are good reasons against using FPKMs.

Konrad Rudolph
  • Thanks for the answer. I have a question. I have Ensembl gene IDs as rows and samples as columns with counts, plus another column, gene_length. I'm trying to convert those counts to TPM. In this post [https://gist.github.com/slowkow/6e34ccb4d1311b8fe62e] they used gene_length to convert counts to TPM, but in one of your posts on RPubs [http://rpubs.com/klmr/rnaseq-norm] you used the effective length. Which one is right for converting counts to TPM? – beginner Oct 23 '18 at 11:43
  • In every case, the correct number is the effective length, i.e. the transcript/gene length minus the average read length plus 1 (this is the “effective” length because it represents the number of possible alignment positions of a read to the transcript). That said, in practice it should make very little difference. – Konrad Rudolph Oct 23 '18 at 11:47
  • I see that the post gives EffectiveLength = Length - mean_fragment_length + 1. How do you know that mean_fragment_length is 300? One more question: can I use this for Ensembl gene IDs, or only for transcripts? – beginner Oct 23 '18 at 11:57
  • @beginner The fragment length is determined by your sequencing protocol. It’s (more or less) the distance between the two paired reads (including the read lengths, minus any overlap). For a typical long RNA-seq library prep it’s often in the order of 300–400 bp. You can run Picard’s CollectInsertSizeMetrics tool on your aligned BAM files to find the mean fragment length. – Konrad Rudolph Oct 23 '18 at 13:35
  • Sure, I will. I'm working with TCGA data. Could you please answer my other question from the previous comment? – beginner Oct 23 '18 at 13:44
  • You can use this on any annotated feature you like, both genes and transcripts. There are some fundamental issues with using genes (essentially, you lose information about alternative transcripts, and gene-level statistics by necessity need to average across transcripts) but that isn’t specific to quantification, and in practice it’s usually not an issue. – Konrad Rudolph Oct 23 '18 at 13:50
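The effective-length question from the comments can be sketched as follows. This is only an illustration under stated assumptions: the helper name `counts_to_tpm` is hypothetical, lengths are in base pairs, a single mean fragment length is used for the sample, and the effective length takes the form quoted above (Length - mean_fragment_length + 1). Features shorter than the fragment length would need separate handling, so the sketch simply refuses them.

```r
# Hypothetical helper: TPM for one sample from raw counts.
counts_to_tpm <- function(counts, lengths, mean_fragment_length) {
  eff_len <- lengths - mean_fragment_length + 1   # effective length
  stopifnot(all(eff_len > 0))                     # too-short features not handled here
  rate <- counts / eff_len                        # fragments per effective base
  rate / sum(rate) * 1e6                          # rescale so the sample sums to 1e6
}

# Toy data, invented for this example
counts <- c(geneA = 100, geneB = 300, geneC = 600)
lengths <- c(1000, 2000, 3000)

tpm_eff <- counts_to_tpm(counts, lengths, mean_fragment_length = 300)
# With a fragment length of 1 the effective length equals the plain length
tpm_raw <- counts_to_tpm(counts, lengths, mean_fragment_length = 1)
```

Comparing `tpm_eff` with `tpm_raw` on real data is a quick way to check how much the effective-length correction matters for a given library.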

You can use the countToFPKM package. It provides an easy-to-use function that converts a read count matrix into an FPKM matrix.

The fpkm() function requires three inputs to return FPKM as numeric matrix normalized by library size and feature length:

  • counts: A numeric matrix of raw feature counts.
  • featureLength: A numeric vector of feature lengths, which can be obtained using biomaRt.
  • meanFragmentLength: A numeric vector of mean fragment lengths, which can be calculated with Picard's CollectInsertSizeMetrics.
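To make the role of the three inputs concrete, here is a base-R sketch of an effective-length FPKM calculation with one mean fragment length per sample. It is not the countToFPKM source code: the name `fpkm_sketch`, the handling of too-short features (set to NA) and the toy values are assumptions made for illustration, and the package itself may treat edge cases differently.

```r
# Base-R sketch, NOT the countToFPKM implementation.
fpkm_sketch <- function(counts, featureLength, meanFragmentLength) {
  stopifnot(nrow(counts) == length(featureLength),
            ncol(counts) == length(meanFragmentLength))
  # features x samples matrix of effective lengths: length - fragment length + 1
  eff_len <- outer(featureLength, meanFragmentLength, "-") + 1
  eff_len[eff_len <= 0] <- NA              # feature too short for this library
  lib_size <- colSums(counts)              # fragments counted per sample
  sweep(counts / eff_len, 2, lib_size, "/") * 1e9
}

# Toy data, invented for this example
counts <- matrix(c(90, 10, 50, 50), ncol = 2,
                 dimnames = list(c("g1", "g2"), c("s1", "s2")))
featureLength <- c(1299, 2299)             # bp, one value per row
meanFragmentLength <- c(300, 400)          # bp, one value per column

fpkm_matrix <- fpkm_sketch(counts, featureLength, meanFragmentLength)
```

The vector lengths must match the matrix dimensions: one feature length per row and one mean fragment length per sample column.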

See https://github.com/AAlhendi1707/countToFPKM