1

I am using the following command:

from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile.seq) > 250:
        gene_id = myfile.id
        mylist = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)", gene_id)
        print(">" + mylist.group(1))

And this is providing me with duplicates of the same gene:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2
>H149xcV_Fge342_r3_h2
>H149xcV_Fge342_r3_h2

How can I reformat my command so that I only receive unique gene IDs:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2
M__

6 Answers

2

I wouldn't use Python for this, myself. Instead you can use the FastaToTbl and TblToFasta scripts I have posted before, and pipe to standard *nix utilities:

FastaToTbl file.fa | sort -u | TblToFasta > file.uniq.fa

Or, if you don't have GNU sort:

FastaToTbl file.fa | sort | uniq | TblToFasta > file.uniq.fa

Alternatively, you can also do something like this:

awk -v RS='>' -v ORS='>' '++a[$0]==1' file.fa | sed 's/>$//'

For example, given this input file:

$ cat file.fa
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA

You can do:

$ awk -v RS='>' -v ORS='>' '++a[$0]==1' file.fa | sed 's/>$//'
>seq1
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
ACTTCGCAGAGGCTTCGGAGAGA
>seq2
ATGGCGCGCTTAGGAGCGCTAGGACT

Both of the above approaches have the advantage of not needing to store the entire file in memory.
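For comparison, the same whole-record deduplication can be sketched in plain Python; `dedup_fasta` is a hypothetical helper, not one of the scripts above, and it assumes the full record text (header plus sequence lines) is the dedup key, just like the awk one-liner:

```python
# Hypothetical helper: keep only the first occurrence of each FASTA record,
# using the full record text (header + sequence lines) as the dedup key,
# like the awk one-liner above. The seen-set still grows with the number
# of unique records, so it is not fully memory-free.
def dedup_fasta(lines):
    seen = set()
    record, out = [], []
    for line in list(lines) + [">"]:  # ">" sentinel flushes the last record
        if line.startswith(">") and record:
            key = "\n".join(record)
            if key not in seen:
                seen.add(key)
                out.extend(record)
            record = []
        if line != ">":
            record.append(line.rstrip("\n"))
    return out
```

For example, `dedup_fasta(open("file.fa"))` would return the deduplicated lines in first-seen order.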

terdon
  • There is also an application at UCSC utilities that converts Fasta to Tab separated file. It's called faToTab and can be downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/ – Supertech Feb 24 '22 at 21:19
1
from Bio import SeqIO
import sys
import re 

unique = []
fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile.seq) > 250:
        gene_id = myfile.id
        if gene_id not in unique:
            unique.append(gene_id)
            mylist = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)", gene_id)
            print(">" + mylist.group(1))
user438383
pippo1980
  • Yeah, I would do something along these lines too. Though, I would use a set or dict insead of list if the fasta file is big... – Kamil S Jaron Jan 27 '21 at 15:55
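Following that comment, the same logic with a set gives O(1) membership tests; a minimal sketch, assuming the regex is meant to be the balanced pattern r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)":

```python
import re

# Assumed, balanced version of the regex from the answer above.
PATTERN = re.compile(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)")

def unique_headers(gene_ids):
    seen = set()  # set membership is O(1), unlike a list
    out = []
    for gene_id in gene_ids:
        m = PATTERN.match(gene_id)
        if m and m.group(1) not in seen:
            seen.add(m.group(1))
            out.append(">" + m.group(1))
    return out
```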
1

Perhaps itertools might be an option? This recipe comes from the itertools recipes section of the Python docs (also available in the more-itertools project): https://docs.python.org/3/library/itertools.html#itertools-recipes

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
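For instance, applied to the duplicated headers from the question (the recipe is repeated here so the snippet runs standalone):

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

headers = [
    ">H149xcV_Fge342_r3_h2",
    ">H149xcV_JEP3324_r3_h2",
    ">H149xcV_JEP3324_r3_h2",
    ">H149xcV_Fge342_r3_h2",
]
print(list(unique_everseen(headers)))
```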
1

I would use a dict so that the lookup time is constant. If you build a list of all the unique gene IDs and then check membership in that list for every gene, your runtime is O(N²). With a dict, each lookup is O(1), and inserting an ID that is already present simply overwrites the existing entry. Something like this should do it for you:

from Bio import SeqIO
import sys

genes = {}
fasta_file = sys.argv[1]
for entry in SeqIO.parse(fasta_file, "fasta"):
    genes[entry.id] = entry.seq

Then you can write a new, unique fasta file:

with open('unique_genes.fasta', 'w') as out:
    for key,value in genes.items():
        out.write( '>%s\n%s\n'%(key,value) )
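A quick sanity check of the overwrite behaviour; since Python 3.7 dicts preserve insertion order, so the output keeps each ID's first-appearance position:

```python
genes = {}
records = [("seq1", "ACT"), ("seq2", "ATG"), ("seq2", "ATG"), ("seq1", "ACT")]
for gid, seq in records:
    genes[gid] = seq  # a duplicate ID just overwrites the same entry
print(list(genes))    # IDs in order of first appearance
```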
Sevy
0

Your question isn't well formatted; for example, list is used as a variable name, which shadows the built-in list type...

print (">"+list.group(1))

However, the answer you're after is going to be something like:

print (">"+set(name_of_list.group(1)))

Anytime you want to get the unique elements of a list, just wrap it in set()
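One caveat: set() does not preserve insertion order. If the original order matters, dict.fromkeys() (Python 3.7+) gives an ordered dedup; a minimal sketch with made-up IDs:

```python
ids = ["Fge342", "bTr423", "Fge342", "kN893", "bTr423"]
unique_ids = list(dict.fromkeys(ids))  # ordered dedup: keeps first occurrence
print(unique_ids)
```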

Liam McIntyre
0

I use the following code. It doesn't only remove repeated sequences but also similar ones (with a fairly simplistic approach). I only use it sporadically and I'm sure it could be cleaned up considerably. Remove the debug prints if they annoy you.

import sys
import requests 
import shlex, subprocess
import random
from Bio import SeqIO
import editdistance

if len(sys.argv) < 3:
    print("Usage: " + sys.argv[0] + " input_fasta output_fasta")
    sys.exit(0)

savefile = sys.argv[2]
content_file = sys.argv[1]

class Sequence:
    id = ""
    sequence = ""

def are_similar(seq1, seq2):
    l1 = len(seq1.sequence)
    l2 = len(seq2.sequence)
    if abs(l1 - l2) < l1 / 40:
        d = editdistance.eval(seq1.sequence[0:250], seq2.sequence[0:250])
        #d = editdistance.eval(seq1.sequence, seq2.sequence)
        print(seq1.id + " " + seq2.id + " -> " + str(d))
        #if d < l1/20:
        if d < 50:
            return True
    return False

def find_repeats(sequences, seq, repeated):
    b = False
    for seq2 in sequences:
        if b and seq2.id not in repeated:
            if are_similar(seq, seq2):
                repeated.add(seq2.id)
        if seq2.id == seq.id:
            b = True

sequences = []

res = ""

for record in SeqIO.parse(content_file, "fasta"):
    seq = Sequence()
    seq.id = record.id
    seq.sequence = record.seq
    sequences.append(seq)

repeated_sequences = set()
for seq in sequences:
    find_repeats(sequences, seq, repeated_sequences)

print(repeated_sequences)

def mySortFunc(e):
    return len(e.sequence)

sequences.sort(key=mySortFunc)

for seq in sequences:
    if seq.id not in repeated_sequences:
        res += ">" + seq.id + "\n"
        res += str(seq.sequence) + "\n"

output_file = open(savefile, "w")
output_file.write(res)
output_file.close()

EDIT: I see that's not exactly what OP is asking ...

juanjo75es