2

I have the following two small FASTAs, dna1.fa and dna2.fa:

# dna1.fa
>chr1
ACTGACTAGCTAGCTAACTG
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG

# dna2.fa
>chr1
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG

If I concatenate these two files using cat, the piped result is as follows:

$ cat dna1.fa dna2.fa > total1.fa
$ cat total1.fa
>chr1
ACTGACTAGCTAGCTAACTG
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
>chr1
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG

Here, as expected, these are two files which have concatenated together. It seems this is usually what is done when one concatenates several FASTAs together in bioinformatics.

Question1: Is it possible to concatenate these "based on the chromosome"? Here is the desired result:

>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG

Note that these are exceptionally small FASTAs---usually one could be dealing with files of several GB in size.

Question2: The meta question here is, are there any standard bioinformatic analyses whereby the "concatenated by chromosome" FASTA would be more useful than the straightforward concatenated FASTA via cat alone?

EB2127
  • 1,413
  • 2
  • 10
  • 23

1 Answers1

2

Using the TblToFasta and FastaToTbl scripts I have posted before, combined with classic *nix utils, you can do:

$ join -a1 -a2 <(FastaToTbl dna1.fa | sort) <(FastaToTbl dna2.fa | sort) | 
    sed 's/ /=/2' | TblToFasta | tr = '\n'
>chr1 
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2 
GCATCGTAGCTAGCTACGAT
>chr3 
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4 
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5 
CATCGATCGTACGTACGTAG

Explanation

The FastaToTbl will put everything on the same line, with the sequence's ID at the beginning of the line, then a tab and then the sequence (no matter how many lines of sequence you have). So, we do this to both files, sort them and pass that as input to join, which will join its input on lines where the 1st field is the same (by default). The -a1 -a2 means "also print non-matching lines from both files" so we don't miss cases where a sequence is in only one of the two files.

Since the join will add a space between the two joined lines, we deal with that by replacing the second space on each line with a = (sed 's/ /=/2') and then pass the whole thing to TblToFasta to revert to fasta format. Finally, we use tr to replace the = with a newline so you can see where each sequence came from as you show in your example.

For longer sequences where you want to avoid sorting the input, you can do:

$ FastaToTbl dna1.fa dna2.fa | 
    awk '{
            seq[$1] ? seq[$1]=seq[$1]"\n"$2 : seq[$1]=$2 
         }
         END{
              for(chr in seq){
                printf ">%s\n%s\n", chr, seq[chr]
              }
         }'
>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG

Explanation

Here, the tbl format sequences (see above) are passed to an awk script which does the work:

  • seq[$1] ? seq[$1]=seq[$1]"\n"$2 : seq[$1]=$2 : if the 1st field ($1, the sequence identifier) is already in the array seq, then set the value of seq[$1] to the current value (so whatever sequence has been already seen for this ID), a newline and the 2nd field ($2, the sequence). If it hasn't been seen before, then set the value of seq[$1] to the 2nd field, the sequence.
  • END{ for(chr in seq){ printf ">%s\n%s\n", chr, seq[chr]} : after processing everything, go through each element in the array seq and print out the data in fasta format.

However, that will mean keeping both files in memory which may be a problem for huge sequences. You can avoid that by using temp files:

$ FastaToTbl dna1.fa dna2.fa | 
    awk '{ 
            if(!seen[$1]++){
                print ">"$1 > $1".tmp"
            } 
            print $2 >> $1".tmp"
        }' && cat *.tmp > out.fa && rm *.tmp
$ cat out.fa 
>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG

Explanation

This is the same basic idea as the script above, only instead of storing the information in an array, we print it as it comes into temp files.

  • if(!seen[$1]++){print ">"$1 > $1".tmp"} : if this is the first time we've seen this ID, print it as a fasta header into the file $1".tmp" (e.g. chr1.tmp).
  • print $2 >> $1".tmp" append the current sequence to the corresponding temp file.

Once the temp files have been created, we concatenate them (cat *.tmp > out.fa) and delete them (rm *tmp).


As for why you would want to do this, an obvious example is reconstructing a reference genome from short sequences. You would want all the sequences of the same schromosome together. Of course, you would also need to assemble them in the right order, which you aren't doing here.

Alternatively, you might want to concatenate them this way so as to avoid having the extra, unneeded lines (>chrN) which would make your file larger. Or you would want to be able to retrieve all sequences with the same ID at once.

But I don't really know why you would want to do it. You're the one asking for it, after all! :)

terdon
  • 10,071
  • 5
  • 22
  • 48