How to maximize fastq parsing with FastqGeneralIterator (Bio.SeqIO.QualityIO)

Question

I'm contributing to a python-based project that uses Biopython to analyze fastq files. It currently uses SeqIO.parse, which populates various structures with all of the fastq information (including converting quality scores). There is apparently a faster (lighter-weight) parser called FastqGeneralIterator that doesn't populate all of these items.

I'd like to use FastqGeneralIterator, but be able to perform some of the functions SeqIO.parse offers on the fly (not for every sequence). For example, I'd like to convert base quality to PHRED scores for specific sequences, but I don't see that function available.

Is there an easy way to take a tuple output by FastqGeneralIterator and convert it into a proper Bio.SeqIO SeqRecord?

Not sure where is your problem exactly. Just in case it might be of interest to you, you can find some examples of python code parsing fastq (including FastqGeneralIterator, and faster alternatives) in the last part of this answer: https://bioinformatics.stackexchange.com/a/380/292 — bli, Jun 23 '17 at 12:02
You may also be interested in this: http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO.FastqPhredWriter-class.html (There's a FastqPhredWriter in Biopython) — bli, Jun 23 '17 at 12:07

winni2k · Accepted Answer · 2019-07-01T07:35:48.570

I had not come across FastqGeneralIterator before, but I will start using it for some of my work!

One answer is to replicate the code in the FastqPhredIterator, which does exactly what you are looking to do.

However, My personal preference would be to use the biopython infrastructure as much as possible, especially when it comes to the tricky aspects of record parsing. If the records you want extra info on are uncommon, then using FastqPhredIterator for phred quality extraction might work for you:

import io
from Bio.SeqIO.QualityIO import FastqGeneralIterator, FastqPhredIterator

with open("input.fastq", "rU") as handle:
    for title, sequence, quality in FastqGeneralIterator(handle):
        # do something with title, sequence, and quality
        # Set special_case to True when you want more info on the record
        special_case = False
        if special_case:
            record_stream = io.StringIO("\n".join([title, sequence, "+", quality]))
            record = next(FastqPhredIterator(record_stream))
            # do something special with record

What we are doing here is combining the record strings into one and dumping them into a StringIO object, which implements the file handle duck-type. We then hand the in-memory handle to FastqPhredIterator to parse the FASTQ record again. Make sure to set special_case to True only when you want extra info on a record, as the special_case branch makes at least two extra copies of the FASTQ record.

Very useful, thank you! I was hoping there were dedicated methods for these tasks. — Mark Ebbert, Jun 23 '17 at 18:05

How to maximize fastq parsing with FastqGeneralIterator (Bio.SeqIO.QualityIO)

1 Answers1