Is there an easy way to create a summary of a VCF file (v4.1) with structural variations?

Question

I got a bunch of vcf files (v4.1) with structural variations of bunch of non-model organisms (i.e. there are no known variants). I found there are quite a some tools to manipulate vcf files like VCFtools, R package vcfR or python library PyVCF. However none of them seems to provide a quick summary, something like (preferably categorised by size as well):

type    count
DEL     x
INS     y
INV     z
....

Is there any tool or a function I overlooked that produces summaries of this style?

I know that vcf file is just a plain text file and if I will dissect REF and ALT columns I should be able to write a script that will do the job, but I hoped that I could avoid to write my own parser.

--- edit ---

So far it seems that only tool that aims to do summaries (@gringer answer) is not working on vcf v4.1. Other tools would provide just partial solution by filtering certain variant type. Therefore I accept my own parser perl/R solutions, till there will be a working tool for stats of vcf with structural variants.

What sort of summary are you looking for? Just raw counts (of number of insertions/deletions)? I'm just curious as I think personally I'd be interested to know how that variation is actually spread along a genome/contig/sequence spatially, would that be of interest to you? — Sam Nicholls, May 28 '17 at 21:49
I am comparing couple of relative species. However, I have not identified homologous regions yet (the annotation is so far under construction), therefore positioning of SVs is right now not very relevant. However, what can I do already is take a look if there is at least a difference in counts / sizes of SVs between species. — Kamil S Jaron, May 28 '17 at 21:53
Maybe even better than summary would be to get type of SV and its size for every SV call. Then I could a summary in R would be trivial. — Kamil S Jaron, May 28 '17 at 21:57
You've probably seen this already, but vcftools has a sibling called bcftools, which has a query function, that allows users to query a VCF/BCF to pull out fields and information and output their own format. It might not do exactly what you want, but might get you very close (enough to maybe just need a little post-processing in R?). — Sam Nicholls, May 28 '17 at 22:05

gringer · Answer 1 · 2022-11-04T03:21:17.493

According to the bcftools man page, it is able to produce statistics using the command bcftools stats. Running this myself, the statistics look like what you're asking for:

# This file was produced by bcftools stats (1.2-187-g1a55e45+htslib-1.2.1-256-ga356746) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats  OVLNormalised_STARout_KCCG_called.vcf.gz
#
# Definition of sets:
# ID    [2]id   [3]tab-separated file names
ID  0   OVLNormalised_STARout_KCCG_called.vcf.gz
# SN, Summary numbers:
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  108
SN  0   number of records:  333
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 313
SN  0   number of MNPs: 0
SN  0   number of indels:   20
SN  0   number of others:   0
SN  0   number of multiallelic sites:   0
SN  0   number of multiallelic SNP sites:   0
# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   302 11  27.45   302 11  27.45
# SiS, Singleton stats:
...
# IDD, InDel distribution:
# IDD   [2]id   [3]length (deletions negative)  [4]count
IDD 0   -9  1
IDD 0   -2  4
IDD 0   -1  6
IDD 0   1   4
IDD 0   2   1
IDD 0   3   3
IDD 0   4   1
# ST, Substitution types:
# ST    [2]id   [3]type [4]count
ST  0   A>C 2
ST  0   A>G 78
ST  0   A>T 2
ST  0   C>A 5
ST  0   C>G 0
ST  0   C>T 66
ST  0   G>A 67
ST  0   G>C 0
ST  0   G>T 1
ST  0   T>A 1
ST  0   T>C 91
ST  0   T>G 0
# DP, Depth distribution
# DP    [2]id   [3]bin  [4]number of genotypes  [5]fraction of genotypes (%)    [6]number of sites  [7]fraction of sites (%)
DP  0   >500    0   0.000000    333 100.000000

This is annotating what is explicitly in the VCF file, and that's SNVs (and INDELs). If you want a structural variant analysis (i.e. on a larger scale than single nucleotides), then you'll need to use something that does more than a summary of the VCF file.

While Inversions, large-scale deletions, and breakpoints are part of the 4.1 specifications, they are unfortunately not currently supported by bcftools.

score 6 · Accepted Answer · edited Nov 04 '22 at 03:23

The "my own parser" solutions. The information I was searching for in part of column INFO, namely in variables SVLEN and SVTYPE.

very fast SV types + counts (by @user172818 in comment) :

zcat var.vcf.gz | perl -ne 'print "$1\n" if /[;\t]SVTYPE=([^;\t]+)/' | sort | uniq -c

quite slow SV types + counts + sizes :

SV_colnames <- c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE1')
ssplit <- function(s, split = '='){
    unlist(strsplit(s, split = split))
}
note, capital letters just respect original naming conventions of the VCF file
getSVTYPE <- function(info){
    ssplit(grep("SVTYPE", info, value = T))[2]
}
getSVLEN <- function(info){
    SVLEN <- ssplit(grep("SVLEN", info, value = T))
    ifelse(length(SVLEN) == 0, NA, as.numeric(SVLEN[2]))
}
load_sv <- function(file){
    vcf_sv_table <- read.table(file, stringsAsFactors = F)
    colnames(vcf_sv_table) <- SV_colnames
    # possible filtering
    # vcf_sv_table <- vcf_sv_table[vcf_sv_table$FILTER == 'PASS',]
    vcf_sv_table_info <- strsplit(vcf_sv_table$INFO, ';')
    vcf_sv_table$SVTYPE <- unlist(lapply(vcf_sv_table_info, getSVTYPE))
    vcf_sv_table$SVLEN <- unlist(lapply(vcf_sv_table_info, getSVLEN))
    return(vcf_sv_table)
}
sv_df <- load_sv('my_sv_calls.vcf')
table(sv_df$SVTYPE)

score 2 · Answer 3 · answered Sep 16 '19 at 20:19

2

For structural variants I think you can use SURVIVOR like SURVIVOR stats which was design specifically for such purpose (statistics on SV file).

answered Sep 16 '19 at 20:19

Medhat Helmy

121
3

score 2 · Answer 4 · answered May 29 '17 at 05:23

2

You might want to try Bio-VCF. From the authors description

Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf is not only very fast for genome-wide (WGS) data, it also comes with a really nice filtering, evaluation and rewrite language and it can output any type of textual data, including VCF header and contents in RDF and JSON.

In addition it is a very fast parser. The rewrite DSL might help you customise your filtering and needs.

answered May 29 '17 at 05:23

eastafri

319
1
3

Has Bio-vcf got any functionality for structural variation as the OP requested? A quick scan of the GitHub page reveals nothing. Whilst many of these tools are useful for filtering SNVs and short indels most fail to do anything of worth for SVs. – Matt Bashton Jun 03 '17 at 10:55

score 1 · Answer 5 · answered Oct 21 '19 at 23:01

Let's say you have an INFO field called SVTYPE to indicate the structure variant type.

Here is how you get the stats easily:

Install vcfstats: pip install vcfstats
Define a macro to extract the SVTYPE information, say in svtype.py:

from vcfstats.macros import cat
@cat
def SVTYPE(variant):
    return variant.INFO['SVTYPE']

Generate the statistics:

vcfstats --vcf <your vcf file> \
   --outdir <the output directory> \
   --macro svtype.py \
   --formula 'COUNT(1) ~ SVTYPE[DEL,INS,INV...]' \ # desired svtypes
   --title 'Number of SV types'

Then in the output directory you will have the stats file named Number_of_SV_Types.txt and a plot for it as well.

Check the details at https://github.com/pwwang/vcfstats

The advantage of this solution is that it using a standard parser. The disadvantage is that it requires several steps (unlike zcat var.vcf.gz | perl -ne 'print "$1\n" if /[;\t]SVTYPE=([^;\t]+)/' | sort | uniq -c that will make the summary in one go). — Kamil S Jaron, Oct 24 '19 at 08:59

Is there an easy way to create a summary of a VCF file (v4.1) with structural variations?

5 Answers5

note, capital letters just respect original naming conventions of the VCF file