RNA seq fasta file annotation from alignment to reference matches

Question

I've got a fasta file with some RNA seq data and another csv file with the output from plast where I've aligned it to a reference using plastn. I'm struggling with figuring out a command to append my original RNA seq data fasta headers with the annotations.

I've looked at awk and sed but can't wrap my head around how to use them in this instance.

CSV output example:

Contig11    ref|XM_021487230.1|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA   78.95   95  20  0   371 465 4627    4721    1e-14   82.4    75  625 1   0   15.2    0   5692    1   0   15.2    0

And example fasta contig:

>Contig11
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA

How would I be able to parse through the csv to find a matching contig with the fasta file, and append the second csv column to the fasta header. Ideally to produce:

>Contig11 ref|XM_021487230.1|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA....

Cheers for any help!

What format is your "csv" in? It isn't a "real" csv (that means "comma separated file" where columns are defined by commas), so what is the column separator? Is it a space or a tab? — terdon, Jul 09 '18 at 15:54

terdon · Accepted Answer · 2018-07-09T16:24:50.520

An awk approach (assuming your input "csv" file is space-separated):

$ awk '(NR==FNR){a[">"$1]=$2; next;} (/>/){$1=$1" "a[$1]}1;' foo.csv foo.fa 
>Contig11 ref|XM_021487230.1|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA ref|XM_021487230.1|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA

Explanation

NR==FNR : FNR is the line number of the current input file and NR is the line number of all input. When processing more than one file, the two will only be equal while reading the first file.
(NR==FNR){a[">"$1]=$2; next;} : if we're reading the first file, the csv, save the second field as the value in the associative array a, whose indices are a > followed by the 1st field. I use the > only to simplify matching it later in the fasta file. The next moves on to the next line so the rest of the command isn't run for the 1st file.
(/>/){$1=$1" "a[$1]}: Now we're in the 2nd file. If the line starts with a >, then set the 1st field to be itself, a space and then whatever was stored in the array a for this 1st field.
1; : this is awk shorthand for "print this line".

You could expand the above into:

awk '{
        if(NR==FNR){
            a[">"$1]=$2; 
        }
        else{
            if(/>/){
                $1=$1" "a[$1]
            }
        }
        print;
      }' foo.csv foo.fa

If your files are small enough to be sorted easily, you can use the FastaToTbl and TblToFasta scripts I have posted before and then join:

$ join -o 1.1,2.2,1.2 <(FastaToTbl foo.fa | sort ) <(sort  foo.csv ) | TblToFasta 
>Contig11 ref|XM_021487230.1|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA 
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA
GACTCTCTTAAGGTAGCCAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATTA
>Contig12 mamamama|_PREDICTED:_Mizuhopecten_yessoensis_UNC93-like_protein_MFSD11_(LOC110443183),_mRNA 
AKHSALKJHDLKAJBDLKAJDBADADKJALDKJBAD

RNA seq fasta file annotation from alignment to reference matches

1 Answers1

Explanation