How to generate a weblogo image that takes into account sequence frequencies?

Question

I have a large aligned amino acid sequences file (fasta file) from next generation sequencing data. The unique sequences have been aligned but these all have certain read counts (frequencies associated with them).

I would like to generate a weblogo image that has the true probability of amino acids at each position. Is there a way to input frequencies/read counts with the aligned files of sequences using weblogo? For example, in the title of the fasta file or as a csv/tsv file?

So that when generating the weblogo image it takes into account the abundances of each sequence?

can't you just duplicate the sequences to match the original frequencies? — heathobrien, Jan 24 '18 at 10:50

score 3 · Accepted Answer · answered Jan 24 '18 at 11:43

Don't use the unique sequences. The whole point of sequence logos is that the height of the letter represents its frequency at a given position. You remove this information when you make your sequences unique. So, while it makes sense to use unique sequences when aligning, you shouldn't when building your logo.

Generate a simple mapping file (counts) with the number of times each sequence was found:

seq1 3
seq2 10
seq3 2

Then, given an aligned protein fasta file (foo.aln) like this:

>seq1
MEEPQS---DPSVEPP-LSQETFSDLWKLCFLPENNVLSPLPSQAM-DDL
>seq2
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq3
MEEPCS---DPSVEPPQLSQETFSDLRKL--LPENNFLSPLPSQAMCDDL

You can simply repeat the sequences as needed (see my answer here for the FastaToTbl and TblToFasta commands):

$ awk '
        (NR==FNR){
            count[$1]=$2; 
            next
        } 
        {
            for(k=1;k<=count[$1];k++){
                print $1"."k"\t"$2
            }
        }' counts <(FastaToTbl foo.aln) | TblToFasta
>seq1.1 
MEEPQS---DPSVEPP-LSQETFSDLWKLCFLPENNVLSPLPSQAM-DDL
>seq1.2 
MEEPQS---DPSVEPP-LSQETFSDLWKLCFLPENNVLSPLPSQAM-DDL
>seq1.3 
MEEPQS---DPSVEPP-LSQETFSDLWKLCFLPENNVLSPLPSQAM-DDL
>seq2.1 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.2 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.3 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.4 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.5 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.6 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.7 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.8 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.9 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq2.10 
MEEPQSCFGDPSVEPPPLSQETFKDLWKL--LCENNVLS---SQAM-DDL
>seq3.1 
MEEPCS---DPSVEPPQLSQETFSDLRKL--LPENNFLSPLPSQAMCDDL
>seq3.2 
MEEPCS---DPSVEPPQLSQETFSDLRKL--LPENNFLSPLPSQAMCDDL

Note that the above assumes your shell supports the <(command) syntax for process substitution. Bash, the default shell on most Linux and macOSX installations does, but if you are using a shell that doesn't, you might need to run FastaToTbl foo.aln > foo.tbl first, and then use foo.tbl as input.

That's great thanks. I have my tabular file containing the read counts and my aligned fasta file although I am only familar with using R not c++. Can this script be adapted/used in R? — clfrankling, Jan 29 '18 at 12:27
That isn't C++! It's just a shell command using awk. Just copy/paste the command above directly into a *nix terminal. You will have to also take the two scripts from the other answer I link to. — terdon, Jan 29 '18 at 12:28
Ok sorry I did not recognise this. I have my two files: sequence.tbl (ID with aligned sequences already converted to tabular form) and counts.tbl (ID with read counts). I set up the directory to these files and entered your code into my ubuntu terminal and nothing happens. I doesn't print the output or come up with any errors? Do I need to add a command to print into an output tabular file? — clfrankling, Jan 31 '18 at 11:46
I copy/paste the exact command above just changing the last line to, }' counts.tbl < sequence.tbl — clfrankling, Jan 31 '18 at 11:54
@clfrankling yeah, that's a big change :) The < is wrong. If you already have them in tbl format, you need the command to end with }' counts.tbl sequence.tbl. Ping me (@terdon) in our chat room if you need more help. It will be easier to debug that way — terdon, Jan 31 '18 at 16:49

How to generate a weblogo image that takes into account sequence frequencies?

1 Answers1