1

a python beginner here. I have a fasta file with 2500+ sequences, and after doing some analysis I want to remove around 200+ sequences based on the matching IDs. Now, I have one fasta file (as sample.fa) and a text file with a list of IDs for the sequences that I want to remove from this fasta file. Below is an example of the fasta sequences. For simplicity, I have shortened the sequences and their respective IDs. The ID in the text file (for the sequences to remove, and in the fast file will be exactly same). The sequences were originally downloaded form NCBI and had very long headers so I have them shortened as at this point I don't need all that info, usually just the sequence ID in the headers is enough.

Sample fasta sequences in fasta file:

>WP_017417394.1-Bacillus_amyloliquefaciens_gp.
MRGKKVWISLLFALALIFTMAFGSTSPAQAAGKSNGEKKYIVGFKQTMSTMSAAKKKDVISEKGGKVQKQFKYVDAASAT
>WP_268291751.1-Bacillus_inaquosorum
MRRKKLWISLLFALTLIFTMAFSNMSAQAAGKSSTEKKYIVGFKETMSAMSSAKKKDVISEKGGKVQKQFKYVNAATATL
>WP_003327717.1-Bacillus_atrophaeus
MRSRKLWIGLLFAFTLVFTMAFSSMSPAQAAGKSSEEKKYIVGFKQTMGAMSAAAKKDVIAENGGKLQKQFKYVDAATAT
>WP_206702175.1-Bacillus_vallismortis
MAFSNMSAQAAGKSSTEKKYIVGFKQTMSVMSSAKKKDVISEKGGKVQKQFKYVNAAAATLDAKAVKELKQDPSVAYVEE
>KAF1680298.1-Bacillus_sp.
MKSKKLWISLLFALTLVFTMAFSNMSAQASGKNSTEKKYIVGFKQTMSTAKKKDIISEKGGKVQKQFKYVNAASATLDQK
>WP_172293960.1-Bacillus_sp.
MGKKRLWYSLLTVLLLVFSMAFSHPANAVKSSAKDVKKDYIVGFKSSLKTTSAKKDVIKENGGKVDKHFRSINAAKVTLD
>WP_082194748.1-Bacillus_xiamenensis
MCVKKKNVMTSVLLAVPLLFSAGFGGSMANAETASKSESEKSYIVGFKASATTNSSKKQAVTQNGGTLEKQYRLINAAQV
>WP_048353666.1-Bacillus_glycinifermentans
MKKRSLWLSVLTALLLVFTMAFSNPASAEQPAKDVEKDYIVGFKSSVKTASVKKDVIKESGGKVDKQFKIINAAKATLDQ
>WP_206702858.1-Bacillus_sonorensis
MTALLLVLSTAFSSPASAAQPAKDVEKDYIVGFKSSVKTAAVKKDVIKENGGKVDKQFKIINAAKATLDQEEVKALKKD

The list of IDs in a separate text file that need to be removed from the above fasta file:

>WP_017417394.1-Bacillus_amyloliquefaciens_gp.
>WP_003327717.1-Bacillus_atrophaeus
>KAF1680298.1-Bacillus_sp.
>WP_048353666.1-Bacillus_glycinifermentans
>WP_206702858.1-Bacillus_sonorensis

I searched and tried a few scripts shared elsewhere (e.g. from here), but I couldn't make it work with my files. So all the help will be really appreciated. Thanks in advance.

PS. I am using using python usually with its own IDLE on windows 10 PC.

I did one thing with this code. I removed the '>' character from the headers file and then this code works but it keeps those headers and their respective sequences. I think it works as intended in this code, but I want to remove those from the sequence file not to keep.

Update: as asked below, I copied this code from another post.

from Bio import SeqIO
with open('headers.txt', 'r') as fin:
    to_keep = set([])
    for line in fin:
        to_keep.add(line.strip().split(' ')[0])
with open('output.fa', 'w') as fout:
    for record in SeqIO.parse('sample.fa','fasta'):
        if record.id in to_keep:
            SeqIO.write(record, fout, 'fasta')
gringer
  • 14,012
  • 5
  • 23
  • 79
Irfan
  • 81
  • 5
  • 2
    And how did they not work? I was about to write an answer but it is essentially what had already written as an answer to the other question. Please [edit] and clarify why those other solutions didn't work, since right now, this looks like a duplicate question. – terdon Dec 20 '23 at 17:29
  • This is not a good FASTA file format. IDs and description should be separate by space. I suggest to take a look at the official documentation. Good practice will make your life easier here an example or – Supertech Dec 25 '23 at 20:15
  • yes, I agree. but I do that on purpose b/c after some manipulation, I make a phylogenetic tree and visualize (usually on ITOL) which messes up the formatting, sometimes it is very long and I am interested to see only the accession ID and the bacterial name. So I delete the rest and make it as one continuous string so everything I need stays in and still within a short limit to be visible. – Irfan Dec 28 '23 at 05:55

4 Answers4

5

You can use something like seqkit: https://bioinf.shenwei.me/seqkit/

seqkit has a grep sub-command that is one of the most flexible I've ever seen. Your use case is quite simple, you can use seqkit grep -v like so:

seqkit grep -v -n -f list_of_IDs_to_exclude.txt in.fasta > out.fasta
gringer
  • 14,012
  • 5
  • 23
  • 79
Ram RS
  • 2,297
  • 10
  • 29
2

This should work

import argparse

ap = argparse.ArgumentParser(description=doc, formatter_class=argparse.ArgumentDefaultsHelpFormatter) ap.add_argument('--fasta', required=True, type=str, help='Fasta file to extract from') ap.add_argument('--headers', required=True, type=str, help='Headers to remove, one per line.')

conf = ap.parse_args() f = conf.fasta h = conf.headers header_set = set()

with open(h) as header_file: for line in header_file: line = line.rstrip() header_set.add(line)

with open(f) as fasta_file: remove_sequence = False for line in fasta_file: if '>' in line: header_id = line.split()[0].replace(">", "") if header_id in header_set: remove_sequence = True else: remove_sequence = False print(line.rstrip()) elif not remove_sequence: print(line.rstrip())

exit()

Save it as python script and then run

python remove_fasta.py --fasta sample.fa --headers listofheaders.txt > cleaned.fasta

CorteZero
  • 109
  • 1
  • 6
  • 2
    Thanks. I tried this first thing in the morning. First it ran but didn't remove anything, then I removed '>' character from the headers list and it gave the correct expected cleaned file. What if I want to run it directly form the python IDLE and not from cmd? – Irfan Dec 21 '23 at 02:23
  • I'm glad it worked. As you can see, there are other options that will get the job done, but I understood that you want to learn how to do this in Python. I don't personally use the Python IDLE on Windows, but I believe you can save your code, open it in IDLE, and click on "run module". Make sure to modify the file locations as needed. I hope this helps, and good luck with your coding – CorteZero Dec 21 '23 at 10:29
2

I would argue writing a whole python script for this is overkill. Assuming you are using a*nix based operating system like Linux or macOS, you can use the two scripts I have posted here, FastaToTbl and TblToFasta, and just grep out the ones you do not want:

FastaToTbl sample.fa | grep -v -xFf listofheaders.txt | TblToFasta > cleaned.fa
terdon
  • 10,071
  • 5
  • 22
  • 48
  • Hi terdon, I am only using windows 10 PC with python. I am trying the above suggestions as well. Writing my own code is a big opportunity to learn as well. – Irfan Dec 21 '23 at 01:57
  • Oh. Windows? OK. But then please add that to your question. Windows isn't really used by bioinformaticians at all so nobody would guess you are trying this in windows. If you want to learn bioinformatics, I would urge you to use Linux or macOS or basically anything other than Windows. – terdon Dec 21 '23 at 09:43
  • I agree with you. I am already on to it. ") – Irfan Dec 22 '23 at 00:59
2

After modifying the code above in the original post, this one works the best for me. I got it working after putting 'not' in the second last line instead of "if record.id in to keep" to "if record.id not in to keep":

from Bio import SeqIO
with open('headers.txt', 'r') as fin:
    to_remove = set([])
    for line in fin:
        to_remove.add(line.strip().split(' ')[0])
with open('output.fa', 'w') as fout:
    for record in SeqIO.parse('sample.fa','fasta'):
        if record.id in to_remove:
            pass
        else:
            SeqIO.write(record, fout, 'fasta')
gringer
  • 14,012
  • 5
  • 23
  • 79
Irfan
  • 81
  • 5