How to remove sequences from a fasta file using a sequence ID list which contains a space within the id?

Question

I have a fasta file that contains sequence reads and sequence id file that needed to be removed from the fasta file. I have done this earlier, but since id contains a space my piece of code is not working and I don't know how to change it since I'm still learning. fasta file looks like this.

>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCA
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/105/20691_65844 id=488
AGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCAAGACTTTAT
>m64041_200717_231916/108/17414_67048 id=4956
TGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG

IDList file is like this.

m64041_200717_231916/74/12723_24868 id=30
m64041_200717_231916/108/17414_67048 id=4956
m64041_200717_231916/105/20691_65844 id=488

I need to remove these sequences from that fasta file. I have a code I used earlier. But it is not working since id contains a space (I think). Previous code is like this.

# use as follows
# ./filter_seq.bash <fastafile> <list_of_ids_you_want_to_remove>
# ./filter_seq.bash all.fasta bacterial_IDlist
grep ">" $1 | sed 's/>//g' | sort | uniq > all_del
sort $2 | uniq > query_del
comm -23 all_del query_del | sort | uniq > filtered_del
perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' filtered_del $1 > filtered.fasta
rm *_del

I would be grateful if someone can help me.

I would highly recommend getting rid of the spaces first using something like sed 's, ,_,g' -I in.fa. Spaces in any kind of genetic data file are bad news and will cause downstream errors. It will make further analysis much easier. — user438383, Feb 04 '21 at 18:49
@user438383 why do you say that? Spaces in fasta headers are incredibly common, allowed by the standard and should not cause any issues at all. I can't think of any tool that would have trouble parsing a fasta header like that and would even say that if a tool cannot do this, it is not fit for purpose. As an example, have a look at pretty much any fasta sequence from refseq. Here's human tp53: https://www.ncbi.nlm.nih.gov/nuccore/NG_017013.2?report=fasta. — terdon, Feb 05 '21 at 14:02

score 1 · Answer 1 · answered Feb 04 '21 at 20:55

Install (pip) biopython if u don't have it.

from Bio import SeqIO
with open('/path/ids.txt', 'r') as fin:
    to_keep = set([])
    for line in fin:
        to_keep.add(line.strip().split(' ')[0])
with open('/path/output.fa', 'w') as fout:
    for record in SeqIO.parse('/path/sequences.fa','fasta'):
        if record.id in to_keep:
            SeqIO.write(record, fout, 'fasta')

score 0 · Answer 2 · answered Feb 05 '21 at 13:55

I tend to do this using the FastaToTbl and TblToFasta scripts I have posted before. FastaToTbl will convert fastq sequences to single-line, tab-separated recors which makes them far easier to parse. Using these two scripts, you can do:

$ FastaToTbl file.fa | grep -vwf idList | TblToFasta 
>m64041_200717_231916/77/1941_50622 id=3115 
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCC
TGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857 
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCAC
ACTTG

The -v option will tell grep to return non-matching lines, and the -f option tells it to read the patterns to search for from a file. In this case, we give it the file with the IDs to skip (idList). The -w ensures we only find "whole word" matches, so that something like m64041_200717_231916/77/1941_50622 id=31151111 will not count as a match for m64041_200717_231916/77/1941_50622 id=31151111.

Another approach would be to just use awk directly:

$ awk '(NR==FNR){s[">"$0]++} (/^>/){ s[$0] ? a=0 : a=1}a' idList file.fa 
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG

Here is the same awk script slightly expanded to make it easier to understand:

 awk '{
        if (NR==FNR){idsToSkip[">"$0]++} 
        else if (/^>/){ 
            if($0 in idsToSkip){
                wantThisId = 0
            }
            else{
                wantThisId = 1
            }
        }
        if(wantThisId == 1){
            print
        }
      }' idList file.fa

How to remove sequences from a fasta file using a sequence ID list which contains a space within the id?

2 Answers2

Linked