I am using the following command:
from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]

for myfile in SeqIO.parse(fasta_file, "fasta"):
    # Only consider sequences longer than 250 residues
    if len(myfile.seq) > 250:
        gene_id = myfile.id
        # group(1) is the part of the ID that gets printed as the gene ID
        mylist = re.match(r"(H149xcV_[^\W_]+[^\W]+[^\W])[^\W]+", gene_id)
        print(">" + mylist.group(1))
And this is providing me with duplicates of the same gene:
How can I reformat my command so that I only receive unique gene IDs?
within the command line to write my Python command. @RamRS – andnowmywatchbegins Nov 24 '20 at 00:23
sort but I am not sure as to how I would use it? I must apologize, I am kind of new to bioinformatics, so I am still getting the hang of it. I know I can write my Python output to another file, but is there a way to obtain the unique IDs afterwards? Thanks! – andnowmywatchbegins Nov 24 '20 at 02:46
seqkit rmdup instead of reinventing the wheel. https://bioinf.shenwei.me/seqkit/usage/#rmdup – zorbax Nov 24 '20 at 09:47
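If you would rather stay in Python, one possible approach is to remember the IDs you have already printed in a set and skip any repeat. The sketch below is only an illustration under some assumptions: the helper name unique_gene_ids, the Record stand-in, and the demo pattern are all hypothetical and not part of the original script. The records are plain stand-ins so the example runs without a FASTA file.

```python
import re
from collections import namedtuple


def unique_gene_ids(records, pattern, min_len=250):
    """Yield each captured gene ID once, in order of first appearance.

    `records` is any iterable of objects with `.id` and `.seq`
    (hypothetical helper, not part of Biopython).
    """
    seen = set()
    for record in records:
        # Same length filter as the original script: keep only len > min_len
        if len(record.seq) <= min_len:
            continue
        match = pattern.match(record.id)
        if match is None:
            continue  # ID does not fit the pattern, so skip it
        gene_id = match.group(1)
        if gene_id not in seen:
            seen.add(gene_id)
            yield gene_id


# Stand-in records and pattern for illustration only (both hypothetical)
Record = namedtuple("Record", ["id", "seq"])
demo_pattern = re.compile(r"(gene\d+)_t\d+")
demo_records = [
    Record("gene1_t1", "A" * 300),
    Record("gene1_t2", "A" * 400),  # same gene as above, should be skipped
    Record("gene2_t1", "A" * 100),  # too short, filtered out
    Record("gene3_t1", "A" * 251),
]

demo_ids = list(unique_gene_ids(demo_records, demo_pattern))
for gene_id in demo_ids:
    print(">" + gene_id)
```

With a real file, you could presumably pass SeqIO.parse(fasta_file, "fasta") as the records argument and your own compiled regex as the pattern. A set keeps the membership check fast, and the order of first appearance in the file is preserved.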