a python beginner here. I have a fasta file with 2500+ sequences, and after doing some analysis I want to remove around 200+ sequences based on the matching IDs. Now, I have one fasta file (as sample.fa) and a text file with a list of IDs for the sequences that I want to remove from this fasta file. Below is an example of the fasta sequences. For simplicity, I have shortened the sequences and their respective IDs. The ID in the text file (for the sequences to remove, and in the fast file will be exactly same). The sequences were originally downloaded form NCBI and had very long headers so I have them shortened as at this point I don't need all that info, usually just the sequence ID in the headers is enough.
Sample fasta sequences in fasta file:
>WP_017417394.1-Bacillus_amyloliquefaciens_gp.
MRGKKVWISLLFALALIFTMAFGSTSPAQAAGKSNGEKKYIVGFKQTMSTMSAAKKKDVISEKGGKVQKQFKYVDAASAT
>WP_268291751.1-Bacillus_inaquosorum
MRRKKLWISLLFALTLIFTMAFSNMSAQAAGKSSTEKKYIVGFKETMSAMSSAKKKDVISEKGGKVQKQFKYVNAATATL
>WP_003327717.1-Bacillus_atrophaeus
MRSRKLWIGLLFAFTLVFTMAFSSMSPAQAAGKSSEEKKYIVGFKQTMGAMSAAAKKDVIAENGGKLQKQFKYVDAATAT
>WP_206702175.1-Bacillus_vallismortis
MAFSNMSAQAAGKSSTEKKYIVGFKQTMSVMSSAKKKDVISEKGGKVQKQFKYVNAAAATLDAKAVKELKQDPSVAYVEE
>KAF1680298.1-Bacillus_sp.
MKSKKLWISLLFALTLVFTMAFSNMSAQASGKNSTEKKYIVGFKQTMSTAKKKDIISEKGGKVQKQFKYVNAASATLDQK
>WP_172293960.1-Bacillus_sp.
MGKKRLWYSLLTVLLLVFSMAFSHPANAVKSSAKDVKKDYIVGFKSSLKTTSAKKDVIKENGGKVDKHFRSINAAKVTLD
>WP_082194748.1-Bacillus_xiamenensis
MCVKKKNVMTSVLLAVPLLFSAGFGGSMANAETASKSESEKSYIVGFKASATTNSSKKQAVTQNGGTLEKQYRLINAAQV
>WP_048353666.1-Bacillus_glycinifermentans
MKKRSLWLSVLTALLLVFTMAFSNPASAEQPAKDVEKDYIVGFKSSVKTASVKKDVIKESGGKVDKQFKIINAAKATLDQ
>WP_206702858.1-Bacillus_sonorensis
MTALLLVLSTAFSSPASAAQPAKDVEKDYIVGFKSSVKTAAVKKDVIKENGGKVDKQFKIINAAKATLDQEEVKALKKD
The list of IDs in a separate text file that need to be removed from the above fasta file:
>WP_017417394.1-Bacillus_amyloliquefaciens_gp.
>WP_003327717.1-Bacillus_atrophaeus
>KAF1680298.1-Bacillus_sp.
>WP_048353666.1-Bacillus_glycinifermentans
>WP_206702858.1-Bacillus_sonorensis
I searched and tried a few scripts shared elsewhere (e.g. from here), but I couldn't make it work with my files. So all the help will be really appreciated. Thanks in advance.
PS. I am using using python usually with its own IDLE on windows 10 PC.
I did one thing with this code. I removed the '>' character from the headers file and then this code works but it keeps those headers and their respective sequences. I think it works as intended in this code, but I want to remove those from the sequence file not to keep.
Update: as asked below, I copied this code from another post.
from Bio import SeqIO
with open('headers.txt', 'r') as fin:
to_keep = set([])
for line in fin:
to_keep.add(line.strip().split(' ')[0])
with open('output.fa', 'w') as fout:
for record in SeqIO.parse('sample.fa','fasta'):
if record.id in to_keep:
SeqIO.write(record, fout, 'fasta')