3

I have 6 multi-fasta files, each of which contains ca. 1500 sequences like this:

>Haladaptatus sp. R4
MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAES
PGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTD
AERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGR
GTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTES
ALEEVADR
>Methanosphaera sp. WGK6
MAKVNVYSLKGDITEEIELPEIFEEEYRPDVIKRAVISTQTARIQPWGANPMAGKRTTAE
SFGSGRGAAMVPRVKGSSKAAFVPQAIGGRKAHPPRVNTIYHEKINKKERILAIRSAIAA
TANKEIVEQRGHAVANLEQVPFVVDDELETIKTTKETREIFKDLGIMDDILRAKKGRKIK
SGKGKLRGRKYRTPKGPLVVVGNDRGISLGARNHAGVEVVEVNNINAELLAPGTHAGRLT
IYTKSAVEKLADLFQQNRS

and so on. I need to create a new result file containing the sequences that are common to all of these multi-fasta files. Example: if the sequence

>Haladaptatus sp. R4
    MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAES
    PGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTD
    AERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGR
    GTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTES
    ALEEVADR

occurs in files 1, 2, 3, 4, 5 and 6, it should be written to the result file. Can anyone help?

Edited: this is my Python script, which should work on 2 files (not 6 for now, to make it simpler):

f1 = open('file1.fasta').readlines()   # input/output filenames assumed
g1 = open('file2.fasta').readlines()
wynik = open('wynik.fasta', 'w')

for z in f1:
    if z[0] == '>':
        znaleziony = False
        gatunek1 = z[1:len(z)]
        #print(gatunek1)
        for x in g1:
            #print(x)
            if x[0] == '>':
                gatunek2 = x[1:len(x)]
                if gatunek1 == gatunek2:
                    wynik.write('>')
                    wynik.write(gatunek1)
                    znaleziony = True
                    continue
    elif znaleziony == True:
        wynik.write(z)

wynik.close()

Unfortunately, it repeats some sequence identifiers and compares only the identifiers (not the whole sequences).

M__
  • 12,263
  • 5
  • 28
  • 47
Patrycja
  • 41
  • 3
  • Are you looking for duplicates? – M__ Aug 17 '19 at 15:58
  • Welcome to the site. What approach have you tried to do this? In which step are you unsure how to continue? Could you post your code so that we can advise how to continue? – llrs Aug 17 '19 at 16:31
  • @MichaelG. I wouldn't say so, because there shouldn't be any duplicates within any single file – Patrycja Aug 17 '19 at 17:36
  • @llrs I tried to run a script in Python. I've begun by finding common sequences between two files (not six for now) to make it simpler. It finds common sequences but also writes just the names of several sequences many times in the file, like this: >Natrialbaceae

    Natrialbaceae Natrialbaceae

    MEATVRDLDGSDAGSTELPAVFETTYRPDLIARAVRVAQANRKQDYGADEFAGMRTPAES FGSGRGMAHVPRQEGRGRRVPQTIKGRKAHPPKAEKDQSESINTKEKKLAVRSAIAATTD AELVADRGHQFDDDAEIPVVVSDEFEDLVKTKEVVEFLEAAGLEADVERADEGRSIRSGR GKTRGRKYKQPKSILFVTSSESGPSRAARNLAGADVTTAAEVNAEDLAPGAQAGRLTVWT ESALEEVADR

    – Patrycja Aug 17 '19 at 17:44
  • `for z in f1:
    if z[0] == '>': znaleziony = False gatunek1 = z[1:len(z)] #print(gatunek1) for x in g1: #print(x) if x[0] == '>': gatunek2 = x[1:len(x)] if gatunek1 == gatunek2: wynik.write('>') wynik.write(gatunek1) znaleziony = True continue
    elif znaleziony == True:
    	wynik.write(z)
    	
    

    wynik.close()`

    – Patrycja Aug 17 '19 at 17:57
  • 2
    Please edit your post and add the code there - comments don't allow for good code formatting and that makes your code difficult to understand. – Ram RS Aug 17 '19 at 19:26
  • Based on your question it is not entirely clear if you are looking for common header + sequence or also sequences that are identical too each other but have been mislabeled. "Unfortunately it repeats some identifiers of sequence and does compare only identifiers (not whole sequence)." I would assume that identical identifiers have the same sequence. Can you elaborate on what exactly you are looking to extract? – Mack123456 Aug 18 '19 at 17:05
  • Hello @Patrycja, your code is not well indented. Please edit the post and indent the code properly as Python blocks are identified by indentation and bad indentation makes the code impossible to read. – Ram RS Aug 18 '19 at 18:55
  • Hi @Patrycja, if I get a minute I'll do it in Python pandas. The duplicated function will make short work of the problem – M__ Aug 19 '19 at 11:41

2 Answers

3

If both the ID lines and the sequences are duplicated, you can do this using the FastaToTbl and TblToFasta scripts I have posted before and just normal Linux utilities:

FastaToTbl file1.fa file2.fa ... file6.fa | 
    sort | uniq -dc | awk '$1==6' | sed 's/^  *6 //' | TblToFasta > dupes.fa

To understand this, let's look at the output of each step. The FastaToTbl script just converts fasta to a tab-delimited format where the ID is the first field and the sequence is the second, all on one line:

$ FastaToTbl file*fa | head -n1
Haladaptatus sp. R4 MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAESPGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTDAERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGRGTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTESALEEVADR

Then, we pass this through sort and uniq -dc which prints out the number of times a line was repeated (only for duplicated lines):

$ FastaToTbl file*fa | sort | uniq -dc
      4 some other sequence ALASOOPAKASKAPPPQQCDDREETAYSSLDAS
      6 Haladaptatus sp. R4 MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAESPGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTDAERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGRGTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTESALEEVADR

Since we only want those that were present in all 6 files, we can use awk telling it to only print lines whose first field was 6:

$ FastaToTbl file*fa | sort | uniq -dc | awk '$1==6'
      6 Haladaptatus sp. R4 MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAESPGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTDAERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGRGTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTESALEEVADR

But then we don't want the 6 as part of the final output, so we need to remove it:

$ FastaToTbl file*fa | sort | uniq -dc | awk '$1==6' | sed 's/^  *6 //'
Haladaptatus sp. R4 MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAESPGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTDAERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGRGTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTESALEEVADR

Finally, convert to fasta again:

$ FastaToTbl file*fa | sort | uniq -dc | awk '$1==6' | sed 's/^  *6 //' | TblToFasta 
>Haladaptatus sp. R4 
MQATVRDLNGEDADTVDLPDVFETTVRTDLIKRAVLAAQANRKQDYGTDPHAGMRTSAES
PGSGRGMAHVPQTNGRGARVPFTVGGRVAHPPKAEKDRSRSINKKERKLAVRSAIAATTD
AERVSERGHRFDEDTELPLVVSDDFEDLVKTQEVVSFLEAVGIDADIARAEDNKKVRAGR
GTTRGRKYKTPKSILFVTSEEPSRAARNLAGADVATAREVNTEDLAPGTQAGRLTVWTES
ALEEVADR

Alternatively, if your files are small enough to be loaded into memory, you could do this instead:

FastaToTbl file*fa | 
    awk '{a[$0]++}END{for(s in a){if(a[s]==6){print s}}}' | 
        TblToFasta > dupes.fa

Or you could just do the whole thing in a single awk operation:

awk '{ 
        if(/^>/){
            a[seq]++; 
            seq=$0; 
        }
        else{
            seq=seq"\n"$0
        }
      }
      END{
        a[seq]++;
        for(s in a){
            if(a[s]==6){
                print s
            }
        }
       }' file*.fa 
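
The same logic — count each full (header, sequence) record across the files and keep those seen in all of them — can also be sketched in plain Python, in case you prefer to stay in one language. This is a sketch, not part of the original answer: filenames come from the command line, and it assumes no duplicates within a single file, as the question states.

```python
from collections import Counter
import sys

def read_records(path):
    """Yield (header, sequence) tuples from a FASTA file,
    joining multi-line sequences into one string."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, ''.join(seq)

def common_records(paths):
    """Return records that appear (identically) in every input file."""
    counts = Counter()
    for p in paths:
        counts.update(read_records(p))   # count identical (header, seq) pairs
    return [rec for rec, n in counts.items() if n == len(paths)]

# usage: python common.py file1.fa file2.fa ... file6.fa > common.fa
if __name__ == '__main__':
    for header, seq in common_records(sys.argv[1:]):
        print(header)
        print(seq)
```

Like the awk version, this holds one copy of every distinct record in memory, so it is only suitable if the files fit in RAM.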
terdon
  • 10,071
  • 5
  • 22
  • 48
0

For pandas you simply load the fasta names and fasta sequences into separate arrays. The "fasta names" array becomes the index of a pandas Series, whilst the sequences are the values. The Series will correctly re-associate each name with its sequence. You load all the files into a single Series, use the duplicated() function, then write out.

The easiest way to do this is to concatenate all your files into one via the shell ...

cat file*.fasta > concat_file.fa

Python script (biopython + pandas) ...

from Bio import SeqIO
import pandas as pd

path = "/Users/username/location/concat_file.fa"
outpath = "/Users/username/Desktop/duplicates.fasta"
fasta_name = []
fasta_seq = []

for record in SeqIO.parse(path, "fasta"):
    fasta_name.append('>' + record.id)
    fasta_seq.append(str(record.seq))  # str() so plain text is written out
df = pd.Series(fasta_seq, index = fasta_name)
df_dups = df.index.duplicated()
mydups = df[df_dups]
mydups.to_csv(outpath, sep="\n", index=True, header=False)

...

I checked it on a ~4000-sequence fasta file of unique sequence ids (dengue), duplicated the last sequence id + sequence, and the only sequence output was that very last sequence.
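
One caveat worth noting: `Index.duplicated()` defaults to `keep='first'`, so with six copies of a record it flags five rows, and a record present in only two of the six files gets reported as well. A hedged sketch (toy data invented here, not from the question) of counting occurrences instead, keeping one copy of each name seen exactly N times:

```python
import pandas as pd

# toy stand-ins for the parsed FASTA headers and sequences
names = ['>A', '>A', '>B', '>C', '>C']
seqs = ['MQAT', 'MQAT', 'MAKV', 'GGG', 'GGG']
s = pd.Series(seqs, index=names)

# default keep='first' flags every copy except the first occurrence
assert s.index.duplicated().tolist() == [False, True, False, False, True]

# count copies per name and keep one copy of those seen exactly N times
N = 2
counts = s.groupby(level=0).size()
common = s[~s.index.duplicated()].loc[counts[counts == N].index]
print(common)
```

Note that, like the answer above, this keys on the names only; to require the sequences to match too, the (name, sequence) pair would have to be the grouping key.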

M__
  • 12,263
  • 5
  • 28
  • 47