17

I have a FASTA file like this:

>sample 1 gene 1
atgc
>sample 1 gene 2
atgc
>sample 2 gene 1 
atgc

I want to get the following output, with one break between the header and the sequence.

>sample 1 gene 1   atgc
>sample 1 gene 2   atgc
>sample 2 gene 1   atgc
AudileF
  • Thanks so much everyone. You're making it hard to choose. I wanted something for a multi line fasta so both terdon and Chris scripts are correct. So I will defer to the saying first come first served. – AudileF Oct 18 '17 at 12:29

12 Answers

13

If you have multi-line FASTA files, as is very common, you can use these scripts¹ to convert between FASTA and tbl (sequence_name <TAB> sequence) format:

  • FastaToTbl

    #!/usr/bin/awk -f
    {
        if (substr($1,1,1)==">")
            if (NR>1)
                printf "\n%s\t", substr($0,2,length($0)-1)
            else
                printf "%s\t", substr($0,2,length($0)-1)
        else
            printf "%s", $0
    }END{printf "\n"}
    
  • TblToFasta

    #! /usr/bin/awk -f
    {
      sequence=$NF
    
      ls = length(sequence)
      is = 1
      fld  = 1
      while (fld < NF)
      {
         if (fld == 1){printf ">"}
         printf "%s " , $fld
    
         if (fld == NF-1)
          {
            printf "\n"
          }
          fld = fld+1
      }
      while (is <= ls)
      {
        printf "%s\n", substr(sequence,is,60)
        is=is+60
      }
    }
    

Save those in your $PATH, make them executable, and you can then do:

$ cat file.fa
>sequence1 
ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGC
TAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
>sequence2 
GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGC
ATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC
$ FastaToTbl file.fa 
sequence1   ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGCTAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
sequence2   GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGCATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC

And, to get the Fasta back:

$ FastaToTbl file.fa | TblToFasta
>sequence1 
ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGC
TAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
>sequence2 
GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGC
ATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC

This can be a very useful trick when searching a fasta file for a string:

FastaToTbl file.fa | grep 'foo' | TblToFasta
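The reason this helps: in a wrapped FASTA file a motif can straddle a line break, so a plain grep on the original file misses it. Here is a minimal Python sketch of the same idea, assuming the whole file fits in memory and that `>` only appears at the start of header lines. The function name `find_motif` is made up for illustration.

```python
def find_motif(fasta_text, motif):
    """Return the headers of records whose unwrapped sequence contains motif."""
    hits = []
    # Every chunk after a ">" is one record: a header line, then sequence lines.
    for record in fasta_text.split(">")[1:]:
        header, _, body = record.partition("\n")
        # Joining the sequence lines is what makes cross-line matches visible.
        sequence = "".join(body.split())
        if motif in sequence:
            hits.append(header.strip())
    return hits


if __name__ == "__main__":
    # "GGAG" spans the line break inside sequence1.
    example = ">sequence1\nATGCGG\nAGCTTA\n>sequence2\nGTACTC\n"
    print(find_motif(example, "GGAG"))
```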

If you really want to keep the leading > of the header (which doesn't seem very useful), you could do something like this:

$ perl -0pe 's/\n//g; s/(.)>/$1\n>/g; s/$/\n/;' file.fa 
>sequence1 ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGCTAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
>sequence2 GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGCATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC

But that will read the entire file into memory. If that's an issue, add an empty line between each fasta record, and then use perl's paragraph mode to process each "paragraph" (sequence) at a time:

perl -pe  's/>/\n>/' file.fa | perl -00pe 's/\n//g; s/.>/\n>/g; s/$/\n/;'
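The same record-at-a-time idea can be sketched in Python with a generator, so that only one sequence is held in memory at any moment. This is only an illustration, not a replacement for the commands above; `fasta_records` and `fasta_to_tbl` are hypothetical names, and the sketch assumes every record starts with a `>` header line.

```python
import io
import sys


def fasta_records(handle):
    """Yield (header, sequence) pairs one record at a time from an open handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tbl(handle, out):
    """Write one '>header<TAB>sequence' line per record."""
    for header, sequence in fasta_records(handle):
        out.write(f">{header}\t{sequence}\n")


if __name__ == "__main__":
    demo = io.StringIO(">sequence1 \nATGC\nGGAG\n>sequence2 \nGTAC\n")
    fasta_to_tbl(demo, sys.stdout)
```

To run it on a real file, pass an open file object (or `sys.stdin`) instead of the in-memory demo handle.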

¹ Credit to Josep Abril, who wrote these scripts more than a decade ago.

terdon
  • I added an answer just to show possible improvements to the 2 awk scripts which IMHO are over-complicated with too much code duplication and other issues. – Ed Morton May 14 '23 at 13:26
12

There is a very simple Biopython solution that is minimal, readable, and handles multi-line FASTA:

from Bio import SeqIO

for record in SeqIO.parse('example.fa', 'fasta'):
    print('>{}\t{}'.format(record.description, record.seq))
Chris_Rands
9

Assuming there is only one sequence line per record, use paste with two stdin arguments (- -), so that each pair of consecutive lines is joined with a tab:

cat your.fasta | paste - -
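To see why the one-line-per-sequence assumption matters, here is a small Python sketch of what pairing lines two at a time does. It mimics the behaviour of `paste - -` as described here rather than reimplementing the tool; `paste_pairs` is a made-up name.

```python
from itertools import zip_longest


def paste_pairs(lines, sep="\t"):
    """Join lines two at a time: line 1 with line 2, line 3 with line 4, ..."""
    stripped = (line.rstrip("\n") for line in lines)
    # Passing the same iterator twice makes zip_longest consume consecutive pairs.
    return [sep.join(pair) for pair in zip_longest(stripped, stripped, fillvalue="")]


if __name__ == "__main__":
    single_line = [">sample 1 gene 1", "atgc", ">sample 1 gene 2", "atgc"]
    wrapped = [">sample 1 gene 1", "at", "gc", ">sample 1 gene 2", "atgc"]
    print(paste_pairs(single_line))  # headers and sequences pair up correctly
    print(paste_pairs(wrapped))      # pairs drift out of step after the wrapped record
```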
Pierre
  • 2
    Note that this will fail if you have multi line sequences (as Pierre pointed out), but also if you have any blank lines in the file. You might also want to remove the UuOC: paste - - < file.fa. – terdon Oct 17 '17 at 12:24
8

You can use these commands:

perl -pe 's/>(.*)/>$1\t/g; s/\n//g; s/>/\n>/g' sequences.fa | grep -v '^$'

Explanation:

  1. Append a tab to every header line
  2. Join all lines
  3. Split the single obtained line by the '>' character
  4. Remove the empty line (the first line is empty due to the fact that '>' is the first character of the FASTA file)
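The four steps can be mirrored one by one in Python. This is a sketch that works on the whole file contents as a single string, so it assumes the file fits in memory and that `>` appears only at the start of header lines. `fasta_to_table` is a hypothetical name.

```python
import re


def fasta_to_table(text):
    """Apply the four steps above to FASTA text and return the resulting lines."""
    # 1. Append a tab to every header line.
    text = re.sub(r"^(>.*)$", lambda m: m.group(1) + "\t", text, flags=re.MULTILINE)
    # 2. Join all lines.
    text = text.replace("\n", "")
    # 3. Split the single resulting line at each ">" character.
    text = text.replace(">", "\n>")
    # 4. Remove the empty line left in front of the first ">".
    return [line for line in text.split("\n") if line]


if __name__ == "__main__":
    example = ">sample 1 gene 1\natgc\n>sample 1 gene 2\natgc\n"
    for row in fasta_to_table(example):
        print(row)
```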
terdon
Karel Břinda
7

A very useful tool for this kind of data manipulation is bioawk:

$ bioawk -c fastx '{print ">"$name" "$comment"\t"$seq}' test.fa
>sample 1 gene 1    atgc
>sample 1 gene 2    atgc
>sample 2 gene 1    atgc

bioawk is based on awk, with added parsing capabilities. Here, -c fastx tells bioawk that the input is in FASTA or FASTQ format, which makes the variables $name (the text between ">" and the first blank character), $comment (everything after the first blank character) and $seq (the sequence, on one line) available within awk instructions.
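The name/comment split described above can be illustrated with a few lines of Python. This only follows the description in this answer and is not bioawk's actual parser; `split_header` is a made-up helper.

```python
def split_header(header_line):
    """Split a FASTA header into (name, comment) at the first blank character."""
    parts = header_line.lstrip(">").split(None, 1)
    name = parts[0] if parts else ""
    comment = parts[1].strip() if len(parts) > 1 else ""
    return name, comment


if __name__ == "__main__":
    name, comment = split_header(">sample 1 gene 1")
    # Re-joining name and comment with a space restores the original header text.
    print(f">{name} {comment}\tatgc")
```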

See for instance this answer for another use case.

bli
4

Where possible, I recommend using a dedicated parsing library, rather than hacking a parser together: as you can see in the other answers, parsing even simple formats gets complex pretty quickly if you value correctness.

Here’s a small R script that does what we need, using ‘seqinr’:

#!/usr/bin/env Rscript
suppressPackageStartupMessages(library(seqinr))
parsed = read.fasta(file('stdin'), as.string = TRUE)
table = data.frame(unlist(parsed), row.names = sapply(parsed, attr, 'Annot'))
write.table(table, stdout(), sep = '\t', quote = FALSE, col.names = FALSE)

Save it as fasta-to-tsv, make it executable, and use it as follows:

fasta-to-tsv < input.fasta > output.tsv

Equivalent code of similar length can be written in Python or Perl.

Konrad Rudolph
  • Could you also explain what packages need to be installed in order for R to do this? seqinr isn't part of vanilla R. – terdon Jun 14 '20 at 18:03
  • @terdon I’m slightly confused: ‘seqinr’ is the package name, so that’s the one to install. – Konrad Rudolph Jun 14 '20 at 18:08
  • Well, sometimes the name of the package isn't the same as the library you load from it. And in any case, R being the horribly complicated mess that it is, you can never know (or I can't, anyway) whether the package can be installed with install.packages or needs to be done through bioconductor or whatever else may be the case. R package management is very far from trivial, so I thought it would be useful to explain how to install the packages necessary to run your code. I've edited the command in now. I'd upvote, but I already did so back in 2017 :) – terdon Jun 14 '20 at 18:26
  • 1
    @terdon “sometimes the name of the package isn't the same as the library you load from it” — No, for R it always is, by definition. As for installation, I’ll give you that there’s a special case for packages not on CRAN but it only really makes sense to spell out the exceptions, not the rule. You wouldn’t mention installation instructions for Perl packages on CPAN, nor for Python on PyPI, NPM for JavaScript or Rust crates. I’ve undone your edit because (a) it’s redundant, and (b) it was wrong. Never install packages via sudo R unless you’re the sysadmin maintaining a cluster. – Konrad Rudolph Jun 14 '20 at 19:29
  • Ah, I tried it without but it failed when trying to install to my user's local library, so I assumed the package could only be installed system wide. But that just proves why it is helpful to include the installation command. I don't know R (and hate it with a passion) so all these things that you mention which are obvious to you, are not to me. And possibly others, which is why yes, I always include commands to install packages in my answers. – terdon Jun 14 '20 at 23:49
3

This can be done easily with seqkit fx2tab:

seqkit fx2tab seq.fa

However, seqkit will not print the "greater than" symbol (">"). If you do need the symbol:

seqkit fx2tab seq.fa | sed 's/^/>/g'
Forrest Vigor
2

Just suggested improvements to the awk scripts in @terdon's answer, using any POSIX awk:

$ cat FastaToTbl
#!/usr/bin/env bash

awk -v OFS='\t' '
    {
        if ( /^>/ ) {
            out = (NR>1 ? ORS : "") substr($0,2) OFS
        }
        else {
            out = $0
        }
        printf "%s", out
    }
    END { print "" }
' "${@:--}"

$ cat TblToFasta
#!/usr/bin/env bash

awk -F'\t' '
    {
        gsub(/.{60}/,"&"ORS,$2)
        sub(ORS"$","",$2)
        print ">" $1 ORS $2
    }
' "${@:--}"

$ ./FastaToTbl file.fa
sequence1       ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGCTAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
sequence2       GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGCATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC

$ ./FastaToTbl file.fa | ./TblToFasta
>sequence1
ATGCGGAGCTTAGATTCTCGAGATCTCGATATCGCGCTTATAAAAGGCCCGGATTAGGGC
TAGCTAGATATCGCGATAGCTAGGGATATCGAGATGCGATACG
>sequence2
GTACTCGATACGCTACGCGATATTGCGCGATACGCATAGCTAACGATCGACTAGTGATGC
ATAGAGCTAGATCAGCTACGATAGCATCGATCGACTACGATCAGCATCAC

FastaToTbl will actually work in any awk, but TblToFasta requires a POSIX awk for support of regexp intervals like {60}. If you have a very old, pre-POSIX awk that doesn't support regexp intervals, get a new awk. If that's impossible for some reason, change TblToFasta to the following and it will also work in any awk:

$ cat TblToFasta
#!/usr/bin/env bash

awk -F'\t' '
    BEGIN {
        dots = sprintf("%"60"s","")
        gsub(/ /,".",dots)
    }
    {
        gsub(dots,"&"ORS,$2)
        sub(ORS"$","",$2)
        print ">" $1 ORS $2
    }
' "${@:--}"
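For comparison, the 60-column wrapping that TblToFasta performs can be sketched in Python with plain string slicing. This is an illustration only, assuming tab-separated input with the name in the first field and the sequence in the second; `tbl_to_fasta` is a hypothetical name.

```python
def tbl_to_fasta(tbl_line, width=60):
    """Convert one 'name<TAB>sequence' line into a wrapped FASTA record."""
    name, _, sequence = tbl_line.rstrip("\n").partition("\t")
    # Cut the sequence into chunks of at most `width` characters.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + wrapped)


if __name__ == "__main__":
    print(tbl_to_fasta("sequence1\t" + "ACGT" * 20))
```

Slicing never produces a trailing empty chunk, so there is no need for the extra step that strips a final newline when the sequence length is an exact multiple of the width.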

Ed Morton
  • 2
    Heh, I have been using these two scripts for decades, long before I knew the first thing about awk, and I've never bothered to go and review them. Your versions are much simpler, thanks! – terdon May 14 '23 at 13:30
  • 1
    It's hard to figure out why that second one especially was so complicated. I feel like it maybe was intended to handle additional types of input, not just what's output by the first script, e.g. maybe lines which have multiple chunks of sequence string separated by spaces? That loop on field numbers but doing something specific for field numbers 1 and NF-1 (which could also be 1?) and printing every other field, then falling into a second loop that's again printing substrings of the last field is baffling. – Ed Morton May 14 '23 at 14:09
  • 1
    Ohhhhhh hang on - I get it. That first loop is trying to handle a sequence name that can contain spaces because they didn't set FS to \t. There's obviously far more concise ways they could have done that (but if we just set FS to \t then we don't have to). I'm assuming sequence names can't contain tabs. – Ed Morton May 14 '23 at 14:14
2

Remove empty records (description without sequence):

awk '$2{print RS}$2' FS='\n' RS=\> ORS= f1.fa > f2.fa

Remove blank lines:

sed '/^$/d' f2.fa > f3.fa

Convert multi-line fasta to single-line fasta:

awk '/^>/ {printf("%s%s\n", (NR>1 ? "\n" : ""), $0); next} {printf("%s",$0)} END {printf("\n")}' f3.fa > f4.fa

Finally, apply @Pierre's solution:

cat f4.fa | paste - - > f.txt
burger
2

In cases where there is no sequence wrapping and each sequence occupies only a single line, the following shell command is probably going to be fastest, easiest, and most convenient.

paste - - < your.fasta > your.new.fasta
Daniel Standage
1

This is an old post and many solutions have already been offered. Since it's a frequently asked question, I think it's worth mentioning an overlooked tool set that contains a stand-alone program called faToTab, in addition to many other useful bioinformatics tools.

faToTab inputFile.fasta outFileFasta_tab.txt

It’s a treasure chest in my opinion. Here are the links to the utilities folder and details: Description and Download instructions - Binaries by machine type - Link to the GitHub page.

Anaconda installation is:

conda install -c bioconda ucsc-fatotab
conda install -c bioconda/label/cf201901 ucsc-fatotab
M__
Supertech
1

In python I would do:

#...Suppose you have header information and 
#...sequences stored in lists

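# Placeholder data: replace with your own lists (all of the same length)
header_info1 = ["sample 1", "sample 1", "sample 2"]
header_info2 = ["gene 1", "gene 2", "gene 1"]
sequences = ["atgc", "atgc", "atgc"]

# Placeholder output path: replace with your own
with open("pathtoyourfile.tsv", "w") as table:
    last = len(sequences) - 1
    for i, (h1, h2, s) in enumerate(zip(header_info1, header_info2, sequences)):
        # Write a newline after every record except the last one
        end = "" if i == last else "\n"
        table.write(f">{h1}\t{h2}\t{s}{end}")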

So basically I use f-strings and iterate over the three lists, which are of the same length. On the last record I leave out the trailing newline (\n), since nothing further is written after it.

Spartan 117
  • How is this any different / better than the other pythonic answer? This looks a lot more verbose but doing actually less than the original one (assuming data is loaded already). It could be I am just missing something, but it would be good to explain why your solution might be favorable in some cases. – Kamil S Jaron Feb 28 '22 at 17:13
  • It is not that my solution is favorable in comparison to the other. As a developer, I found myself very often involved in situations where you have to bend the codes to my needs. Rather than understanding pre-made functions and losing time, I prefer to go on my own. Moreover, I saw that, commonly, f strings are not very well known yet, and in python-format function is used. I thought only it was a good occasion to share something that I commonly use. – Spartan 117 Mar 21 '22 at 16:37