1

I have seven fasta files, one per gene, with more than 400 fasta entries per file, like this: Input:

Gene1.fasta

>1721_1 gene name  
ATG   
>10_14 gene name  
GTT

Gene2.fasta

>1721_1 gene name  
TGA  
>10_14 gene name   
GAT

Output:

>1721_1  
ATGTGA  
>10_14   
GTTGAT

What I want to do is create one fasta (continuously) file containing seven genes per sample. As you can see, the sample name is not the same before the _ in the fasta headers.

Is there any way to do this with awk or other command-line tools? Thank you very much.

Maximilian Press
  • 3,989
  • 7
  • 23
Saruul B
  • 11
  • 2
  • 1
    cat Gene*.fasta > all_genes.fasta ? – Timur Shtatland Oct 04 '21 at 03:46
  • @TimurShtatland Thank you. But it should contain all genes sequences of each sample. Each sample should be separate files. – Saruul B Oct 04 '21 at 11:25
  • I would suggest using samtools faidx to pull the sample's gene out of each file, and then removing the header. It will ignore anything after whitespace so gene name after space should not be problem. You could iterate across sample names in a bash loop. Of course, it requires that every sample have every gene. – Maximilian Press Oct 05 '21 at 11:57
  • okay. I will try – Saruul B Oct 11 '21 at 04:31

3 Answers3

2

The concat command from seqkit could concatenate sequences with same ID from multiple files.

seqkit concat [file1] [file2] ...
Forrest Vigor
  • 387
  • 1
  • 4
0

I think it can be done using python dictionary. You can split the file using the header and then make the fasta id as your key and the sequence as your value. Then using simple loop and conditions you can concatenate them.

Lazy PhD
  • 1
  • 1
0

Assuming your files are small enough to allow you to fit one of them into memory, you can do this:

$ awk '{ 
        if(/^>/){ name=$0 }
        else{ 
            if(NR==FNR){seq[name]=$0 } 
            else{
                print name; 
                printf "%s%s\n", seq[name],$0
            }
        }
      }' file1.fa file2.fa 
>1721_1 gene name
TGA  
>10_14 gene name
GTTGAT

If you cannot fit the files in memory, you can try this:

$ join -o1.1,1.2,2.2 -t $'\t' <(FastaToTbl file1.fa ) <(FastaToTbl file2.fa) | sed 's/\t//2' | TblToFasta 
>1721_1 gene name 
ATGTGA
>10_14 gene name 
GTTGAT

Note that this approach requires the FastaToTbl and TblToFasta scripts I have posted here: https://bioinformatics.stackexchange.com/a/2651/298.

Or, if brevity is more your thing, the same approach but more compact:

terdon
  • 10,071
  • 5
  • 22
  • 48