21

I used to work with publicly available genomic references, where basic statistics are usually available, and if they are not, you only have to compute them once, so there is no reason to worry about performance.

Recently I started a sequencing project on a couple of different species with mid-sized genomes (~1 Gbp). While testing different assembly pipelines, I had to compute the number of unknown nucleotides many times, in both raw reads (fastq) and assembly scaffolds (fasta), so I decided to optimize the computation.

  • For me it is reasonable to expect 4-line formatted fastq files, but a general solution is still preferred
  • It would be nice if the solution worked on gzipped files as well

Q: What is the fastest way (performance-wise) to compute the number of unknown nucleotides (Ns) in fasta and fastq files?

Kamil S Jaron
  • Presumably, you might want the total number of nucleotides as well as the number of Ns (to give proportion of Ns). This is a fairly fast implementation that provides this plus a few other metrics for fasta: https://github.com/ENCODE-DCC/kentUtils/blob/master/src/utils/faSize/faSize.c – Chris_Rands Jun 02 '17 at 11:56

10 Answers

26

For FASTQ:

seqtk fqchk in.fq | head -2

It gives you the percentage of "N" bases, though, not the exact count.
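If you need a count rather than a percentage, you can recover an approximate one from the ALL row of the fqchk output. A minimal Python sketch; this assumes the output contains a POS header row with #bases and %N columns (true for the seqtk versions I know of, but check yours), and since the percentage is rounded, the recovered count is only approximate:

import subprocess

# Run `seqtk fqchk` and parse the summary ("ALL") row.
out = subprocess.run(["seqtk", "fqchk", "in.fq"], capture_output=True,
                     text=True, check=True).stdout.splitlines()
header = next(l.split() for l in out if l.startswith("POS"))
all_row = next(l.split() for l in out if l.startswith("ALL"))
n_bases = int(all_row[header.index("#bases")])
pct_n = float(all_row[header.index("%N")])
print(round(n_bases * pct_n / 100.0))  # approximate, since %N is rounded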

For FASTA:

seqtk comp in.fa | awk '{x+=$9}END{print x}'

This command line also works with FASTQ, but it will be slower as awk is slow.

EDIT: ok, based on @BaCH's reminder, here we go (you need kseq.h to compile):

// to compile: gcc -O2 -o count-N this-prog.c -lz
#include <zlib.h>
#include <stdio.h>
#include <stdint.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

// Maps ASCII codes to: 0-3 for A/C/G/T (either case), 5 for '-' (gap),
// and 4 for everything else (counted as N below).
unsigned char dna5tbl[256] = {
    0, 1, 2, 3,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 5, 4, 4,
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 0, 4, 1,  4, 4, 4, 2,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  3, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 0, 4, 1,  4, 4, 4, 2,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  3, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4, 
    4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4
};

int main(int argc, char *argv[]) {
    long i, n_n = 0, n_acgt = 0, n_gap = 0;
    gzFile fp;
    kseq_t *seq;
    if (argc == 1) {
        fprintf(stderr, "Usage: count-N <in.fa>\n");
        return 1;
    }
    if ((fp = gzopen(argv[1], "r")) == 0) {
        fprintf(stderr, "ERROR: fail to open the input file\n");
        return 1;
    }
    seq = kseq_init(fp);
    while (kseq_read(seq) >= 0) {
        for (i = 0; i < seq->seq.l; ++i) {
            int c = dna5tbl[(unsigned char)seq->seq.s[i]];
            if (c < 4) ++n_acgt;
            else if (c == 4) ++n_n;
            else ++n_gap;
        }
    }
    kseq_destroy(seq);
    gzclose(fp);
    printf("%ld\t%ld\t%ld\n", n_acgt, n_n, n_gap);
    return 0;
}

It works for both FASTA/Q and gzip'ed FASTA/Q. The following uses SeqAn:

#include <seqan/seq_io.h>

using namespace seqan;

int main(int argc, char *argv[]) {
    if (argc == 1) {
        std::cerr << "Usage: count-N <in.fastq>" << std::endl;
        return 1;
    }
    std::ios::sync_with_stdio(false);
    CharString id;
    Dna5String seq;
    SeqFileIn seqFileIn(argv[1]);
    long i, n_n = 0, n_acgt = 0;
    while (!atEnd(seqFileIn)) {
        readRecord(id, seq, seqFileIn);
        for (i = beginPosition(seq); i < endPosition(seq); ++i)
            if (seq[i] < 4) ++n_acgt;
            else ++n_n;
    }
    std::cout << n_acgt << '\t' << n_n << std::endl;
    return 0;
}

On a FASTQ with 4-million 150bp reads:

  • The C version: ~0.74 sec
  • The C++ version: ~2.15 sec
  • An older C version without a lookup table (see the previous edit): ~2.65 sec
user172818
  • based on Devon Ryan's benchmarking you are the fastest so far... – Kamil S Jaron Jun 02 '17 at 08:34
  • Can you add #include <inttypes.h> to your first C example? – Devon Ryan Jun 02 '17 at 08:54
  • I can’t not upvote SeqAn, but I’m seriously concerned that your C++ code is so much slower than your C code. They should be virtually identical. That said, you didn’t specify std::ios::sync_with_stdio(false), which might account for some of the difference; and the loop could be simplified by using iterators, though I’d be surprised if this impacted the speed on any setting above -O1. – Konrad Rudolph Jun 02 '17 at 10:33
  • @KonradRudolph sync_with_stdio helps little when reading from a normal file. I don't know the internal of seqan, so I can't explain why it is slower. That said, I wouldn't worry too much about seqan performance. Gzip decompression is much slower than parsing fastq. @DevonRyan added. – user172818 Jun 02 '17 at 14:35
  • The difference is in readfq. That beast is fast. I know, I tried to replace Heng's code by something less ugly in C++ and couldn't get anywhere near it in simple read performance. Best I could come up with in clean C++ was a speed factor of 2x slower. – BaCh Jun 02 '17 at 15:01
  • you need to add -lz to your compile flags (at least for the current Ubuntu LTS) gcc -O2 -o count-N this-prog.c -lz – Matt Bashton Jun 02 '17 at 20:29
  • Are your numbers based on flushing the cache between samples? Otherwise, those numbers will almost certainly be incorrect. – Alex Reynolds Jun 23 '17 at 14:35
  • Out of curiosity I just checked the performance with a straightforward implementation in seqan3 and to my great disappointment it’s still a factor 3 slower than kseq.h. It’s of course entirely possible that I did something wrong but I don’t think so. – Konrad Rudolph Apr 21 '21 at 09:35
18

5 hours and no benchmarks posted? I am sorely disappointed.

I'll restrict the comparison to just be fasta files, since fastq will end up being the same. So far, the contenders are:

  1. R with the ShortRead package (even if not the fastest, certainly a super convenient method).
  2. A pipeline of grep -v "^>" | tr -cd A | wc -c
  3. A pipeline of grep -v "^>" | grep -io A | wc -l
  4. A pipeline of seqtk comp | awk
  5. Something custom in C/C++

The R code can't actually count N records for some reason when I try it, so I counted A for everything.

I'll exclude all python solutions, because there's no conceivable universe in which python is fast. For the C/C++ solution I just used what @user172818 posted (everything I tried writing took the same amount of time).

Repeating 100 times and taking the average (yes, one should take the median):

  1. 1.7 seconds
  2. 0.65 seconds
  3. 15 seconds
  4. 1.2 seconds
  5. 0.48 seconds

Unsurprisingly, anything in straight C or C++ is going to win. grep with tr does surprisingly well: even though grep is very heavily optimized, I still expected it to have more overhead. Piping to grep -io is a performance killer. The R solution is surprisingly good given the typical R overhead.
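If you want to reproduce this kind of benchmark, a harness along these lines works. This Python sketch is illustrative only (not the script actually used here); the placeholder command can be swapped for any contender above, and note that the input file stays in the OS cache between repetitions:

import statistics
import subprocess
import time

# Placeholder pipeline; substitute any of the contenders above.
CMD = "grep -v '^>' seqs.fa | tr -cd A | wc -c"

samples = []
for _ in range(100):
    start = time.perf_counter()
    subprocess.run(CMD, shell=True, stdout=subprocess.DEVNULL, check=True)
    samples.append(time.perf_counter() - start)

print("mean:   %.3f s" % statistics.mean(samples))
print("median: %.3f s" % statistics.median(samples))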

Update 1: As suggested in the comments

  1. sum(x.count('A') for x in open('seqs.fa','r') if x[0] != '>') in python

This yields a time of

  1. 1.6 seconds (I'm at work now, so I've adjusted the time as best I can to account for the different computers)

Update 2: More benchmarks from the comments

  1. awk -F A '!/^>/ {cnt+=NF-1}END{print cnt}' foo.fa
  2. perl -ne 'if(!/^>/){$count += tr/A//} END{print "$count\n"}' foo.fa

These yield times of:

  1. 5.2 seconds
  2. 1.18 seconds

The awk version is the best hack I could come up with; there are likely better ways.

Update 3: Right, so regarding fastq files, I'm running this on a 3.3GB gzipped file with 10 repetitions (this takes a bit of time to run), so I'm initially going to limit tests to commands that I can trivially modify to handle compressed files (after all, uncompressed fastq files are an abomination).

  1. seqtk fqchk
  2. sed -n '2~4p' <(zcat foo.fastq.gz) | grep -io A | wc -l
  3. sed -n '1d;N;N;N;P;d' <(zcat Undetermined_S0_R1_001.fastq.gz) | tr -cd A | wc -c
  4. The C example from user172818 (the comparison to C++ is already provided there, so no need to include it).
  5. bioawk -c fastx '{n+=gsub(/A/, "", $seq)} END {print n}' foo.fastq.gz

The average times are:

  1. 1 minute 44 seconds
  2. 2 minutes 6 seconds
  3. 1 minute 52 seconds
  4. 1 minute 15 seconds
  5. 3 minutes 8 seconds

N.B., these will often not handle fastq files with entries spanning more than 4 lines. However, such files are the dodo birds of bioinformatics, so that's not practically important. Also, I couldn't get ShortRead to work with the compressed fastq file. I suspect there's a way to fix that.
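If you ever do meet one of those dodo birds, here is an untested pure-Python sketch of a counter that tolerates wrapped records: it tracks the sequence length so that a quality line starting with '@' can't be mistaken for a header (illustrative only, not part of the benchmarks above):

import gzip
import sys

def count_n_fastq(path):
    # Count N/n in a FASTQ file (optionally gzipped), allowing records
    # whose sequence and quality span multiple lines.
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    with opener(path, "rt") as fh:
        header = fh.readline()
        while header:
            if not header.startswith("@"):
                raise ValueError("malformed FASTQ header: " + header.rstrip())
            seq_len = 0
            line = fh.readline()
            # sequence lines continue until the '+' separator
            while line and not line.startswith("+"):
                s = line.strip()
                total += s.upper().count("N")
                seq_len += len(s)
                line = fh.readline()
            # quality holds exactly seq_len characters; counting them avoids
            # mistaking a quality line that starts with '@' for a header
            qual_len = 0
            while qual_len < seq_len:
                qual_len += len(fh.readline().strip())
            header = fh.readline()
    return total

if __name__ == "__main__":
    print(count_n_fastq(sys.argv[1]))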

Devon Ryan
  • Nice benchmark, what is the size of your test fasta file? Also, why not add Python if you have time, yes it will not be fastest, but interesting to see for comparison, I would try: sum(x.count('A') for x in open('seqs.fa','r') if x[0] != '>'), probably BioPython will be slower for this – Chris_Rands Jun 02 '17 at 07:53
  • @Chris_Rands I used dm6 (uncompressed), so it's not terribly large. I figured that python would be so self-evidently slower that I didn't want to bother, but I suppose it wouldn't hurt to add. – Devon Ryan Jun 02 '17 at 07:57
  • @Chris_Rands I've updated with a python benchmark. It does as well as R, which is better than I would have guessed. – Devon Ryan Jun 02 '17 at 08:08
  • Thanks! I'm not so surprised since python's str.count is implemented in C. I also feel there should be a fast awk only solution, but I'm not able to figure it out – Chris_Rands Jun 02 '17 at 08:14
  • Perl: perl -ne 'if(!/^>/){$count += tr/A//} END{print "$count\n"}' dm6.fa – gringer Jun 02 '17 at 08:20
  • Okay, now that the question has been updated, you can modify these to work for the general FASTQ case (multi-line sequence + quality string, with '@' possibly appearing in the quality line(s)). – gringer Jun 02 '17 at 08:37
  • My awk-fu is sufficient to come up with a working example, but not good enough to come up with one that's not painfully slow. – Devon Ryan Jun 02 '17 at 08:37
  • @gringer There goes me being productive today... – Devon Ryan Jun 02 '17 at 08:38
  • Can you try my bioawk -c fastx '{n+=gsub(/A/, "", $seq)} END {print n}' ? I'm curious to see how it compares with the other solutions on your data. – bli Jun 02 '17 at 09:39
  • @bli: Yup, I have 5 others running at the moment... – Devon Ryan Jun 02 '17 at 09:39
  • The R code can count Ns once you write it correctly. ;-) (My bad) – Konrad Rudolph Jun 02 '17 at 10:29
  • The awk and perl versions do not give the same results on my test fasta file: The awk version find one more "A" in C. elegans genome than the perl version. I guess it depends whether one of the sequence starts or ends with "A". – bli Jun 02 '17 at 10:45
  • I've added a fastq comparison with some methods (including bioawk). – Devon Ryan Jun 02 '17 at 11:34
  • @bli There should just be a -1 in the awk code. Among the many reasons to use bioawk instead! – Devon Ryan Jun 02 '17 at 11:35
  • I think the zcat is not necessary with bioawk. Does it improve the speed? – bli Jun 02 '17 at 11:39
  • @bli It's not needed and that's a mistake on my part (I didn't actually use zcat in the benchmark). Thanks for catching that! – Devon Ryan Jun 02 '17 at 11:41
  • Why are you counting the As and not the Ns? Or am I missing something? – terdon Jun 02 '17 at 12:34
  • Because the original ShortRead variant couldn't handle N and the performance comparisons are the same regardless. "The R code can't actually count N records for some reason when I try it, so I counted A for everything." – Devon Ryan Jun 02 '17 at 12:35
  • Yes, sorry, I missed that when I first read the answer. – terdon Jun 02 '17 at 13:04
  • You say: "Repeating 100 times and taking the average (yes, one should take the median):" But the average is the mean not median - so which one did you use? I appreciate the median is less susceptible to skew but in this case surely run time is normally distributed (in which case mean is fine) unless you're running on a system with very high instantaneous CPU or IO load from other users which is creating contention. – Matt Bashton Jun 02 '17 at 21:07
  • The mean, not the median, thus the caveat. Yes, run time should be normally distributed, but occasional weird IO spikes and such do happen. – Devon Ryan Jun 02 '17 at 21:20
  • Did you flush system cache between all these tests? – Torst Jun 03 '17 at 02:14
  • Nah, in practice the files were in the cache for all of the runs. – Devon Ryan Jun 03 '17 at 06:28
  • You should flush cache between tests. Your numbers will be misleading, otherwise, (well, wrong, really) regardless of whether you draw a mean or a median from test results. – Alex Reynolds Jun 23 '17 at 14:32
10

I think that this should be pretty fast:

FASTA:

grep -v "^>" seqs.fa | tr -cd N | wc -c

FASTQ:

sed -n '1d;N;N;N;P;d' seqs.fq | tr -cd N | wc -c

See this answer on SO about how to count characters in BASH using different approaches.

Karel Břinda
  • Your FASTQ answer unfortunately has the same problem, fundamentally, as Iakov's. That is, FASTQ records have a variable number of lines. – Konrad Rudolph Jun 01 '17 at 17:29
  • I know that the original standard admits FASTQ line splitting. However, have you ever seen such a file in practice? – Karel Břinda Jun 01 '17 at 17:33
  • I haven’t, but I’ve never myself worked with Sanger sequencing/PacBio data. I think you might encounter them there. For short read data I guess the “record = 4 lines” assumption is safe. – Konrad Rudolph Jun 01 '17 at 17:34
  • Long read data do not use fastq. Currently PacBio uses bam (yes, really bam) and ONT fast5 (h5 archive). I think it is very reasonable to assume 4 line fastq. – Kamil S Jaron Jun 01 '17 at 22:45
8

FASTQ

As was pointed out, fastq can be complicated. But in the simple case where you have four lines per record, one possible solution in bash is:

sed -n '2~4p' seqs.fastq | grep -io N | wc -l
  • sed -n '2~4p' will print every fourth line, starting with line 2 (the sequence lines)
  • grep -o N will output one line containing N for every matching symbol
  • wc -l will count the lines

I suspect this python approach will be faster:

cat seqs.fastq | python3 -c "import sys; print(sum(line.upper().count('N') for line in sys.stdin))"

FASTA

Coreutils:

grep -v "^>" seqs.fasta | grep -io N | wc -l

Python:

cat seqs.fasta | python3 -c "import sys; print(sum(line.upper().count('N') for line in sys.stdin if not line.startswith('>')))"
terdon
Iakov Davydov
  • Now you’re counting lines containing Ns, not Ns. Also, FASTQ records can span more than four lines. :-( – Konrad Rudolph Jun 01 '17 at 17:03
  • @KonradRudolph with -o grep outputs individual symbols, I will think how to correct fastq code – Iakov Davydov Jun 01 '17 at 17:07
  • FASTQ is tricky because @ can also start the quality string line; see http://chat.stackexchange.com/transcript/message/37809200#37809200 and following lines. For reference, here’s a Perl script that parses FASTQ and trims the mate information off of the identifier. It can probably serve as a starting point; I think this is very close to the minimal code: https://gist.github.com/klmr/81a2b86cd93c706d28611f40bdfe2702 – Konrad Rudolph Jun 01 '17 at 17:14
  • @KonradRudolph, I updated the answer to indicate that this works only in the simple case – Iakov Davydov Jun 01 '17 at 18:11
6

Using bioawk

With bioawk (counting "A" in the C. elegans genome, because there seems to be no "N" in this file), on my computer:

$ time bioawk -c fastx '{n+=gsub(/A/, "", $seq)} END {print n}' genome.fa 
32371810

real    0m1.645s
user    0m1.548s
sys     0m0.088s

bioawk is an extension of awk with convenient parsing options. For instance -c fastx makes the sequence available as $seq in fasta and fastq format (including gzipped versions), so the above command should also handle fastq format robustly.

gsub is a standard awk function. It returns the number of substitutions made (I got this trick from https://unix.stackexchange.com/a/169575/55127). I welcome comments about how to count inside awk in a less hackish way.

On this fasta example it is almost 3 times slower than the grep, tr, wc pipeline:

$ time grep -v "^>" genome.fa | tr -cd "A" | wc -c
32371810

real    0m0.609s
user    0m0.744s
sys     0m0.184s

(I repeated the timings, and the above values seem representative)

Using python

readfq

I tried the python readfq retrieved from the repository mentioned in this answer: https://bioinformatics.stackexchange.com/a/365/292

$ time python -c "from sys import stdin; from readfq import readfq; print sum((seq.count('A') for _, seq, _ in readfq(stdin)))" < genome.fa
32371810

real    0m0.976s
user    0m0.876s
sys     0m0.100s

"pure python"

Comparing with something adapted from this answer: https://bioinformatics.stackexchange.com/a/363/292

$ time python -c "from sys import stdin; print sum((line.count('A') for line in stdin if not line.startswith('>')))" < genome.fa
32371810

real    0m1.143s
user    0m1.100s
sys     0m0.040s

(It is slower with python 3.6 than with python 2.7 on my computer)

pyGATB

I just learned (08/06/2017) that GATB includes a fasta/fastq parser and has recently released a python API. I tried to use it yesterday to test another answer to the present question and found a bug. This bug is now fixed, so here is a pyGATB-based answer:

$ time python3 -c "from gatb import Bank; print(sum((seq.sequence.count(b\"A\") for seq in Bank(\"genome.fa\"))))"
32371810

real    0m0.663s
user    0m0.568s
sys     0m0.092s

(You can also do sequence.decode("utf-8").count("A") but this seems a little slower.)

Although I used python3.6 here (pyGATB seems to be python3-only), this is faster than the other two python approaches (whose reported timings were obtained with python 2.7). It is even almost as fast as the grep, tr, wc pipeline.

Biopython

And, to have even more comparisons, here is a solution using SeqIO.parse from Biopython (with python2.7):

$ time python -c "from Bio import SeqIO; print(sum((rec.seq.count(\"A\") for rec in SeqIO.parse(\"genome.fa\", \"fasta\"))))"
32371810

real    0m1.632s
user    0m1.532s
sys     0m0.096s

This is a bit slower than the "pure python" solution, but perhaps more robust.

There seems to be a slight improvement with @peterjc's suggestion to use the lower level SimpleFastaParser:

$ time python -c "from Bio.SeqIO.FastaIO import SimpleFastaParser; print(sum(seq.count('A') for title, seq in SimpleFastaParser(open('genome.fa'))))"
32371810

real    0m1.618s
user    0m1.500s
sys     0m0.116s

(I did a series of timings and tried to take one that seemed representative, but there's a lot of overlap with the higher-level parser's timings.)

scikit-bio

I read about scikit-bio today (23/06/2017), which has a sequence reader.

$ time python3 -c  "import skbio; print(sum(seq.count('A') for seq in skbio.io.read('genome.fa', format='fasta', verify=False)))"
32371810

real    0m3.643s
user    0m3.440s
sys     0m1.228s

pyfastx

I read about pyfastx today (27/08/2020), which has a sequence reader.

$ time python3 -c "from pyfastx import Fasta; print(sum(seq.count('A') for (_, seq) in Fasta('genome.fa', build_index=False)))"
32371810

real    0m0.781s
user    0m0.740s
sys     0m0.040s

(The timing is done using python 3.6, on the same machine as previous timings from 3 years ago, so it should be comparable.)

fastx_read from mappy

I realize (27/08/2020) that I hadn't added a benchmark for fastx_read from mappy, which I have been using regularly for quite some time already:

$ time python3 -c "from mappy import fastx_read; print(sum(seq.count('A') for (_, seq, _) in fastx_read('genome.fa')))"
32371810

real    0m0.731s
user    0m0.648s
sys     0m0.080s

Benchmarks on a fastq.gz file

I tested some of the above solutions (or adaptations thereof), counting "N" in the same file that was used here: https://bioinformatics.stackexchange.com/a/400/292

bioawk

$ time bioawk -c fastx '{n+=gsub(/N/, "", $seq)} END {print n}' SRR077487_2.filt.fastq.gz
306072

real    1m9.686s
user    1m9.376s
sys     0m0.304s

pigz + readfq python module

readfq doesn't complain and is very fast when I pass it the compressed fastq directly, but it returns something wrong, so don't forget to take care of the decompression manually.

Here I tried with pigz:

$ time pigz -dc SRR077487_2.filt.fastq.gz | python -c "from sys import stdin; from readfq import readfq; print sum((seq.count('N') for _, seq, _ in readfq(stdin)))"
306072

real    0m52.347s
user    1m40.716s
sys     0m8.604s

The computer has 16 cores, but I suspect the limiting factor for pigz is reading from the disk: the processors are very far from running full speed.

And with gzip:

$ time gzip -dc SRR077487_2.filt.fastq.gz | python -c "from sys import stdin; from readfq import readfq; print sum((seq.count('N') for _, seq, _ in readfq(stdin)))"
306072

real    0m49.448s
user    1m31.984s
sys     0m2.312s

Here gzip and python each used a full processor, and the resource-usage balance was slightly better. I work on a desktop computer; I suppose a computing server would take better advantage of pigz.

pyGATB

$ time python3 -c "from gatb import Bank; print(sum((seq.sequence.count(b\"N\") for seq in Bank(\"SRR077487_2.filt.fastq.gz\"))))"
306072

real    0m46.784s
user    0m46.404s
sys     0m0.296s

biopython

$ time python -c "from gzip import open as gzopen; from Bio.SeqIO.QualityIO import FastqGeneralIterator; print(sum(seq.count('N') for (_, seq, _) in FastqGeneralIterator(gzopen('SRR077487_2.filt.fastq.gz'))))"
306072

real    3m18.103s
user    3m17.676s
sys     0m0.428s

pyfastx (added 27/08/2020)

$ time python3 -c "from pyfastx import Fastq; print(sum(seq.count('N') for (_, seq, _) in Fastq('SRR077487_2.filt.fastq.gz', build_index=False)))"
306072

real    0m51.140s
user    0m50.784s
sys     0m0.352s

fastx_read from mappy (added 27/08/2020)

$ time python3 -c "from mappy import fastx_read; print(sum(seq.count('N') for (_, seq, _) in fastx_read('SRR077487_2.filt.fastq.gz')))"
306072

real    0m52.336s
user    0m52.024s
sys     0m0.300s

Conclusion

pyGATB and bioawk both transparently handle compression (gzipped or not) and format differences (fasta or fastq). pyGATB is quite a new tool, but it seems more efficient than the other python modules I tested.

(Update 27/08/2020) pyfastx and mappy seem quite convenient too, and are only slightly slower than pyGATB.

bli
  • Could you add the Biopython low-level string FASTA parser numbers? time python -c "from Bio.SeqIO.FastaIO import SimpleFastaParser; print(sum(seq.count('A') for title, seq in SimpleFastaParser(open('genome.fa'))))" – Peter Cock Jun 09 '17 at 11:38
  • Why do all your examples count the letter A, rather than N (and perhaps n) as per the question? – Peter Cock Jun 09 '17 at 11:39
  • I count letter "A" because the first large file I thought about for my testing happened to have no "N". – bli Jun 09 '17 at 12:43
  • I would think that all the Python examples will score badly on this small example file due to the overhead of loading Python and importing libraries. But if the user only wants to count one genome, that is a fair test all the same. – Peter Cock Jun 10 '17 at 21:59
6

Decompression of gzipped FASTQ is the main issue

If we take real-world gzipped FASTQ (which, as the OP suggested, would be beneficial) rather than trivial FASTA as the starting point, then the real issue is actually decompressing the file, not counting the Ns, and in that case the C program count-N is no longer the fastest solution.

Additionally, it would be good to benchmark with a specific file which actually contains Ns, because some methods show quite interesting execution-time differences when counting the more frequently occurring As rather than Ns.

One such file is:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/SRR077487_2.filt.fastq.gz

It's also worth checking that the various solutions return the correct answer: there should be 306072 Ns in the above file.
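A simple (slow but obviously correct) Python reference count, assuming 4-line records, can serve as that check:

import gzip

n = 0
with gzip.open("SRR077487_2.filt.fastq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the sequence line of each 4-line record
            n += line.count("N")
print(n)  # should print 306072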

Next, note that decompressing this file to /dev/null is slower with zcat and gzip (both gzip 1.6 on my system) than with a parallel implementation of gzip such as Mark Adler's pigz, which appears to use 4 threads for decompression. All timings represent an average of 10 runs, reporting real (wall-clock) time.

time pigz -dc SRR077487_2.filt.fastq.gz > /dev/null

real    0m29.0132s

time gzip -dc SRR077487_2.filt.fastq.gz > /dev/null

real    0m40.6996s

There is a ~11.7-second difference between the two. Next, I benchmark a one-liner which works on FASTQ and gives the correct answer (note: I've yet to encounter a FASTQ file which is not 4-line, and seriously, who generates those files!):

time pigz -dc SRR077487_2.filt.fastq.gz | awk 'NR%4==2{print $1}' | tr -cd N | wc -c
306072

real    0m34.793s

As you can see, the counting adds ~5.8 seconds to the total run time versus pigz-based decompression. Additionally, this time delta is higher when using gzip: ~6.7 seconds above gzip decompression alone.

time gzip -dc SRR077487_2.filt.fastq.gz | awk 'NR%4==2{print $1}' | tr -cd N | wc -c                                                                
306072

real    0m44.399s

The pigz | awk | tr | wc based solution is, however, ~4.5 seconds faster than the count-N C solution:

time count-N SRR077487_2.filt.fastq.gz 
2385855128      306072  0

real    0m39.266s

This difference appears to be robust to re-running as many times as I like. I expect that if you could use pthreads in the C-based solution, or alter it to read the decompressed stream from pigz, it would also show an increase in performance.
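One quick way to approximate the second idea without touching the C code is to let pigz decompress in a separate process and do the counting in the consumer. A rough Python sketch of the pattern (not benchmarked here):

import subprocess

# pigz decompresses on its own cores; this process only counts.
proc = subprocess.Popen(["pigz", "-dc", "SRR077487_2.filt.fastq.gz"],
                        stdout=subprocess.PIPE)
n = 0
for i, line in enumerate(proc.stdout):
    if i % 4 == 1:  # sequence line of each 4-line record
        n += line.count(b"N")
proc.wait()
print(n)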

Benchmarking an alternative pigz-plus-grep variant shows it takes more or less the same time as the tr-based variant:

time pigz -dc SRR077487_2.filt.fastq.gz | awk 'NR%4==2{print $1}' | grep -o N | wc -l
306072

real    0m34.869s

Note that the seqtk-based solution discussed above is noticeably slower:

time seqtk comp  SRR077487_2.filt.fastq.gz | awk '{x+=$9}END{print x}'
306072

real    1m42.062s 

However, it's worth noting that seqtk comp does a bit more than the other solutions.

Matt Bashton
5

Honestly, the easiest way (especially for FASTQ) is probably to use a dedicated parsing library, such as R/Bioconductor:

suppressMessages(library(ShortRead))

seq = readFasta(commandArgs(TRUE)[1]) # or readFastq
cat(colSums(alphabetFrequency(sread(seq))[, 'N', drop = FALSE]), '\n')

This may not be the fastest, but it's pretty fast, since the relevant methods are highly optimised. The biggest overhead is probably from loading ShortRead, which is an unjustifiable slog.

Konrad Rudolph
5

If it is raw speed you're after, then writing your own little C/C++ program is probably what you need to do.

Fortunately, the worst part (a fast and reliable parser) has already been tackled: readfq from Heng Li is probably the fastest FASTA/FASTQ parser around.

And it's easy to use: the example on GitHub can easily be expanded to do what you need. Just add a parameter parser (for the filename you want to analyse), write a simple 'N'-counter loop, and print the results to stdout. Done.
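For instance, with the Python port of readfq from the same repository, the whole program is a few lines (an illustrative, untested sketch; readfq yields (name, seq, qual) tuples):

import sys
from readfq import readfq  # Heng Li's parser, from the linked repository

n = 0
with open(sys.argv[1]) as fh:
    for name, seq, qual in readfq(fh):
        n += seq.count("N") + seq.count("n")
print(n)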

Otherwise, if it is simplicity you are after, see Konrad's answer for an R-based solution, or a very simple script using Biopython.

BaCh
  • God that’s absolutely atrocious code. ☹️ I think that for raw speed, SeqAn (C++) is probably on par, and a lot higher quality. – Konrad Rudolph Jun 01 '17 at 17:19
  • It's not the prettiest around. Then again, it does what it needs to do and doesn't come with a whole library dependency. Do you have a benchmark with SeqAn & readfq? – BaCh Jun 01 '17 at 17:26
  • I'm curious to see how readfq and SeqAn compare with the fasta/fastq parser from GATB: http://gatb-core.gforge.inria.fr/doc/api/classgatb_1_1core_1_1bank_1_1impl_1_1BankFasta.html#details I'm not really able to code in C/C++ though. – bli Jun 08 '17 at 11:57
3

For FASTA files, I've implemented a relatively efficient method in pyfaidx v0.4.9.1. This post made me realize that my previous code was quite slow and easy to replace:

$ pip install pyfaidx
$ time faidx -i nucleotide ~/Downloads/hg38.fa
name    start   end A   T   C   G   N   others
chr1    1   248956422   67070277    67244164    48055043    48111528    18475410    
chr10   1   133797422   38875926    39027555    27639505    27719976    534460  
chr11   1   135086622   39286730    39361954    27903257    27981801    552880  
chr11_KI270721v1_random 1   100316  18375   19887   31042   31012   0   
...
chrX    1   156040895   46754807    46916701    30523780    30697741    1147866 
chrY    1   57227415    7886192 7956168 5285789 5286894 30812372    
chrY_KI270740v1_random  1   37240   8274    12978   7232    8756    0   

real    2m28.251s
user    2m15.548s
sys 0m9.951s
Matt Shirley
0

Here are perl 1-liners for fasta and fastq files that are fast enough if you just want totals:

$ perl -ne 'BEGIN { my $n=0 }; /^>/ or $n += s/N//gi; END { print "N=$n\n" }' contigs.fasta

For each line of a FASTA file, if it is not a header line (/^>/), it's a sequence line. The s/N//gi substitution returns the number of substitutions made, which is added to the running total.

$ zcat reads.fastq | perl -ne 'BEGIN { my $n=0 }; /^[ATCGN]+$/i and $n += s/N//gi; END { print "N=$n\n" }'

For each sequence line of a FASTQ file, matched by /^[ATCGN]+$/i (i.e. consisting entirely of nucleotide codes), the substitutions are counted in the same way.

You may use pigz instead of zcat (or gunzip -c) if you have pigz installed.

An uncompressed fastq file of 14M 250bp Illumina reads completed in 40sec.

  • If you could explain what each perl line does, it would improve the answer, especially commenting on the differences between both approaches – llrs Mar 01 '18 at 09:02