Using the TblToFasta and FastaToTbl scripts I have posted before, combined with classic *nix utils, you can do:
$ join -a1 -a2 <(FastaToTbl dna1.fa | sort) <(FastaToTbl dna2.fa | sort) |
sed 's/ /=/2' | TblToFasta | tr = '\n'
>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG
Explanation
The FastaToTbl
will put everything on the same line, with the sequence's ID at the beginning of the line, then a tab and then the sequence (no matter how many lines of sequence you have). So, we do this to both files, sort them and pass that as input to join
, which will join its input on lines where the 1st field is the same (by default). The -a1 -a2
means "also print non-matching lines from both files" so we don't miss cases where a sequence is in only one of the two files.
Since the join
will add a space between the two joined lines, we deal with that by replacing the second space on each line with a =
(sed 's/ /=/2'
) and then pass the whole thing to TblToFasta
to revert to fasta format. Finally, we use tr
to replace the =
with a newline so you can see where each sequence came from as you show in your example.
For longer sequences where you want to avoid sorting the input, you can do:
$ FastaToTbl dna1.fa dna2.fa |
awk '{
seq[$1] ? seq[$1]=seq[$1]"\n"$2 : seq[$1]=$2
}
END{
for(chr in seq){
printf ">%s\n%s\n", chr, seq[chr]
}
}'
>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG
Explanation
Here, the tbl format sequences (see above) are passed to an awk script which does the work:
seq[$1] ? seq[$1]=seq[$1]"\n"$2 : seq[$1]=$2
: if the 1st field ($1
, the sequence identifier) is already in the array seq
, then set the value of seq[$1]
to the current value (so whatever sequence has been already seen for this ID), a newline and the 2nd field ($2
, the sequence). If it hasn't been seen before, then set the value of seq[$1]
to the 2nd field, the sequence.
END{ for(chr in seq){ printf ">%s\n%s\n", chr, seq[chr]}
: after processing everything, go through each element in the array seq
and print out the data in fasta format.
However, that will mean keeping both files in memory which may be a problem for huge sequences. You can avoid that by using temp files:
$ FastaToTbl dna1.fa dna2.fa |
awk '{
if(!seen[$1]++){
print ">"$1 > $1".tmp"
}
print $2 >> $1".tmp"
}' && cat *.tmp > out.fa && rm *.tmp
$ cat out.fa
>chr1
ACTGACTAGCTAGCTAACTG
GCATCGTAGCTAGCTACGAT
>chr2
GCATCGTAGCTAGCTACGAT
>chr3
CATCGATCGTACGTACGTAG
CATCGATCGTACGTACGTAG
>chr4
ATCGATCGATCGTACGATCG
ATCGATCGATCGTACGATCG
>chr5
CATCGATCGTACGTACGTAG
Explanation
This is the same basic idea as the script above, only instead of storing the information in an array, we print it as it comes into temp files.
if(!seen[$1]++){print ">"$1 > $1".tmp"}
: if this is the first time we've seen this ID, print it as a fasta header into the file $1".tmp"
(e.g. chr1.tmp
).
print $2 >> $1".tmp"
append the current sequence to the corresponding temp file.
Once the temp files have been created, we concatenate them (cat *.tmp > out.fa
) and delete them (rm *tmp
).
As for why you would want to do this, an obvious example is reconstructing a reference genome from short sequences. You would want all the sequences of the same schromosome together. Of course, you would also need to assemble them in the right order, which you aren't doing here.
Alternatively, you might want to concatenate them this way so as to avoid having the extra, unneeded lines (>chrN
) which would make your file larger. Or you would want to be able to retrieve all sequences with the same ID at once.
But I don't really know why you would want to do it. You're the one asking for it, after all! :)