1

I am using the following command:

from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile.seq) > 250:
        gene_id = myfile.id
        mylist = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)", gene_id)
        print(">" + mylist.group(1))

And this is providing me with duplicates of the same gene:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2
>H149xcV_Fge342_r3_h2
>H149xcV_Fge342_r3_h2

How can I reformat my command so that I only receive unique gene IDs:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2
M__

6 Answers

2

I wouldn't use Python for this, myself. Instead you can use the FastaToTbl and TblToFasta scripts I have posted before, and pipe to standard *nix utilities:

FastaToTbl file.fa | sort -u | TblToFasta > file.uniq.fa

Or, if you don't have GNU sort:

FastaToTbl file.fa | sort | uniq | TblToFasta > file.uniq.fa

Alternatively, you can also do something like this:

awk -v RS='>' -v ORS='>' '++a[$0]==1' file.fa | sed 's/>$//'

For example, given this input file:

$ cat file.fa
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA

You can do:

$ awk -v RS='>' -v ORS='>' '++a[$0]==1' file.fa | sed 's/>$//'
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT

Both of the above approaches have the advantage of not needing to store the entire file in memory.
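For comparison, the same whole-record deduplication can be sketched in plain Python; `dedup_fasta` is a hypothetical helper, not one of the scripts above, and it assumes the full record text (header plus sequence lines) is the dedup key, just like the awk one-liner:

```python
# Hypothetical helper: keep only the first occurrence of each FASTA record,
# using the full record text (header + sequence lines) as the dedup key,
# like the awk one-liner above. The seen-set still grows with the number
# of unique records, so it is not fully memory-free.
def dedup_fasta(lines):
    seen = set()
    record, out = [], []
    for line in list(lines) + [">"]:  # ">" sentinel flushes the last record
        if line.startswith(">") and record:
            key = "\n".join(record)
            if key not in seen:
                seen.add(key)
                out.extend(record)
            record = []
        if line != ">":
            record.append(line.rstrip("\n"))
    return out
```

For example, `dedup_fasta(open("file.fa"))` would return the deduplicated lines in first-seen order.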

terdon
  • There is also an application at UCSC utilities that converts Fasta to Tab separated file. It's called faToTab and can be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/ – Supertech Feb 24 '22 at 21:19
1
from Bio import SeqIO
import sys
import re 

unique = []
fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile.seq) > 250:
        gene_id = myfile.id
        if gene_id not in unique:
            unique.append(gene_id)
            mylist = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)", gene_id)
            print(">" + mylist.group(1))
user438383
pippo1980
  • Yeah, I would do something along these lines too. Though, I would use a set or dict insead of list if the fasta file is big... – Kamil S Jaron Jan 27 '21 at 15:55
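Following that comment, the same logic with a set gives O(1) membership tests; a minimal sketch, assuming the regex is meant to be the balanced pattern r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)":

```python
import re

# Assumed, balanced version of the regex from the answer above.
PATTERN = re.compile(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)")

def unique_headers(gene_ids):
    seen = set()  # set membership is O(1), unlike a list
    out = []
    for gene_id in gene_ids:
        m = PATTERN.match(gene_id)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))
            out.append(">" + m.group(1))
    return out
```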
1

Perhaps itertools might be an option? This recipe comes from the itertools recipes section of the Python docs (also available in the more-itertools project): https://docs.python.org/3/library/itertools.html#itertools-recipes

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
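For instance, applied to the duplicated headers from the question (the recipe is repeated here so the snippet runs standalone):

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

headers = [
    ">H149xcV_Fge342_r3_h2",
    ">H149xcV_JEP3324_r3_h2",
    ">H149xcV_JEP3324_r3_h2",
    ">H149xcV_Fge342_r3_h2",
]
print(list(unique_everseen(headers)))
```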
1

I would use a dict so that the lookup time is constant. If you build a list of all the unique gene IDs and then check membership in that list for every gene, your runtime is O(N²). With a dict, each lookup is O(1), and inserting an ID that is already present simply overwrites the existing entry. Something like this should do it for you:

from Bio import SeqIO
import sys

genes = {}
fasta_file = sys.argv[1]
for entry in SeqIO.parse(fasta_file, "fasta"):
    genes[entry.id] = entry.seq

Then you can write a new, unique fasta file:

with open('unique_genes.fasta', 'w') as out:
    for key,value in genes.items():
        out.write( '>%s\n%s\n'%(key,value) )
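A quick sanity check of the overwrite behaviour; since Python 3.7 dicts preserve insertion order, so the output keeps each ID's first-appearance position:

```python
genes = {}
records = [("seq1", "ACT"), ("seq2", "ATG"), ("seq2", "ATG"), ("seq1", "ACT")]
for gid, seq in records:
    genes[gid] = seq  # a duplicate ID just overwrites the same entry
print(list(genes))    # IDs in order of first appearance
```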
Sevy
0

Your question isn't well formatted; for example, list is used as a variable name, which shadows the built-in list type...

print (">"+list.group(1))

However, the answer you're after is going to be something like:

print (">"+set(name_of_list.group(1)))

Anytime you want to get the unique elements of a list, just wrap it in set()
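One caveat: set() does not preserve insertion order. If the original order matters, dict.fromkeys() (Python 3.7+) gives an ordered dedup; a minimal sketch with made-up IDs:

```python
ids = ["Fge342", "bTr423", "Fge342", "kN893", "bTr423"]
unique_ids = list(dict.fromkeys(ids))  # ordered dedup: keeps first occurrence
print(unique_ids)
```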

Liam McIntyre
0

I use the following code. It doesn't only remove repeated sequences but also similar ones (with a fairly simplistic approach). I only use it sporadically and I'm sure it could be cleaned up considerably. Remove the debug prints if they annoy you.

import sys
import requests 
import shlex, subprocess
import random
from Bio import SeqIO
import editdistance

if len(sys.argv) < 3:
    print("Usage: " + sys.argv[0] + " input_fasta output_fasta")
    sys.exit(0)

savefile = sys.argv[2]
content_file = sys.argv[1]

class Sequence:
    id = ""
    sequence = ""

def are_similar(seq1, seq2):
    l1 = len(seq1.sequence)
    l2 = len(seq2.sequence)
    if abs(l1 - l2) < l1 / 40:
        d = editdistance.eval(seq1.sequence[0:250], seq2.sequence[0:250])
        #d = editdistance.eval(seq1.sequence, seq2.sequence)
        print(seq1.id + " " + seq2.id + " -> " + str(d))
        #if d < l1/20:
        if d < 50:
            return True
    return False

def find_repeats(sequences, seq, repeated):
    b = False
    for seq2 in sequences:
        if b and seq2.id not in repeated:
            if are_similar(seq, seq2):
                repeated.add(seq2.id)
        if seq2.id == seq.id:
            b = True

sequences = []

res = ""

for record in SeqIO.parse(content_file, "fasta"):
    seq = Sequence()
    seq.id = record.id
    seq.sequence = record.seq
    sequences.append(seq)

repeated_sequences = set()
for seq in sequences:
    find_repeats(sequences, seq, repeated_sequences)

print(repeated_sequences)

def mySortFunc(e):
    return len(e.sequence)

sequences.sort(key=mySortFunc)

for seq in sequences:
    if seq.id not in repeated_sequences:
        res += ">" + seq.id + "\n"
        res += str(seq.sequence) + "\n"

output_file = open(savefile, "w")
output_file.write(res)
output_file.close()

EDIT: I see that's not exactly what OP is asking ...

juanjo75es