17

I got a bunch of vcf files (v4.1) with structural variations of bunch of non-model organisms (i.e. there are no known variants). I found there are quite a some tools to manipulate vcf files like VCFtools, R package vcfR or python library PyVCF. However none of them seems to provide a quick summary, something like (preferably categorised by size as well):

type    count
DEL     x
INS     y
INV     z
....

Is there any tool or a function I overlooked that produces summaries of this style?

I know that vcf file is just a plain text file and if I will dissect REF and ALT columns I should be able to write a script that will do the job, but I hoped that I could avoid to write my own parser.

--- edit ---

So far it seems that only tool that aims to do summaries (@gringer answer) is not working on vcf v4.1. Other tools would provide just partial solution by filtering certain variant type. Therefore I accept my own parser perl/R solutions, till there will be a working tool for stats of vcf with structural variants.

Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59

5 Answers5

9

According to the bcftools man page, it is able to produce statistics using the command bcftools stats. Running this myself, the statistics look like what you're asking for:

# This file was produced by bcftools stats (1.2-187-g1a55e45+htslib-1.2.1-256-ga356746) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats  OVLNormalised_STARout_KCCG_called.vcf.gz
#
# Definition of sets:
# ID    [2]id   [3]tab-separated file names
ID  0   OVLNormalised_STARout_KCCG_called.vcf.gz
# SN, Summary numbers:
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  108
SN  0   number of records:  333
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 313
SN  0   number of MNPs: 0
SN  0   number of indels:   20
SN  0   number of others:   0
SN  0   number of multiallelic sites:   0
SN  0   number of multiallelic SNP sites:   0
# TSTV, transitions/transversions:
# TSTV  [2]id   [3]ts   [4]tv   [5]ts/tv    [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV    0   302 11  27.45   302 11  27.45
# SiS, Singleton stats:
...
# IDD, InDel distribution:
# IDD   [2]id   [3]length (deletions negative)  [4]count
IDD 0   -9  1
IDD 0   -2  4
IDD 0   -1  6
IDD 0   1   4
IDD 0   2   1
IDD 0   3   3
IDD 0   4   1
# ST, Substitution types:
# ST    [2]id   [3]type [4]count
ST  0   A>C 2
ST  0   A>G 78
ST  0   A>T 2
ST  0   C>A 5
ST  0   C>G 0
ST  0   C>T 66
ST  0   G>A 67
ST  0   G>C 0
ST  0   G>T 1
ST  0   T>A 1
ST  0   T>C 91
ST  0   T>G 0
# DP, Depth distribution
# DP    [2]id   [3]bin  [4]number of genotypes  [5]fraction of genotypes (%)    [6]number of sites  [7]fraction of sites (%)
DP  0   >500    0   0.000000    333 100.000000

This is annotating what is explicitly in the VCF file, and that's SNVs (and INDELs). If you want a structural variant analysis (i.e. on a larger scale than single nucleotides), then you'll need to use something that does more than a summary of the VCF file.

While Inversions, large-scale deletions, and breakpoints are part of the 4.1 specifications, they are unfortunately not currently supported by bcftools.

gringer
  • 14,012
  • 5
  • 23
  • 79
6

The "my own parser" solutions. The information I was searching for in part of column INFO, namely in variables SVLEN and SVTYPE.

very fast SV types + counts (by @user172818 in comment) :

zcat var.vcf.gz | perl -ne 'print "$1\n" if /[;\t]SVTYPE=([^;\t]+)/' | sort | uniq -c

quite slow SV types + counts + sizes :

SV_colnames <- c('CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE1')

ssplit <- function(s, split = '='){ unlist(strsplit(s, split = split)) }

note, capital letters just respect original naming conventions of the VCF file

getSVTYPE <- function(info){ ssplit(grep("SVTYPE", info, value = T))[2] }

getSVLEN <- function(info){ SVLEN <- ssplit(grep("SVLEN", info, value = T)) ifelse(length(SVLEN) == 0, NA, as.numeric(SVLEN[2])) }

load_sv <- function(file){ vcf_sv_table <- read.table(file, stringsAsFactors = F) colnames(vcf_sv_table) <- SV_colnames # possible filtering # vcf_sv_table <- vcf_sv_table[vcf_sv_table$FILTER == 'PASS',] vcf_sv_table_info <- strsplit(vcf_sv_table$INFO, ';') vcf_sv_table$SVTYPE <- unlist(lapply(vcf_sv_table_info, getSVTYPE)) vcf_sv_table$SVLEN <- unlist(lapply(vcf_sv_table_info, getSVLEN)) return(vcf_sv_table) }

sv_df <- load_sv('my_sv_calls.vcf') table(sv_df$SVTYPE)

gringer
  • 14,012
  • 5
  • 23
  • 79
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
2

For structural variants I think you can use SURVIVOR like SURVIVOR stats which was design specifically for such purpose (statistics on SV file).

Medhat Helmy
  • 121
  • 3
2

You might want to try Bio-VCF. From the authors description

Bio-vcf is a new generation VCF parser, filter and converter. Bio-vcf is not only very fast for genome-wide (WGS) data, it also comes with a really nice filtering, evaluation and rewrite language and it can output any type of textual data, including VCF header and contents in RDF and JSON.

In addition it is a very fast parser. The rewrite DSL might help you customise your filtering and needs.

eastafri
  • 319
  • 1
  • 3
  • Has Bio-vcf got any functionality for structural variation as the OP requested? A quick scan of the GitHub page reveals nothing. Whilst many of these tools are useful for filtering SNVs and short indels most fail to do anything of worth for SVs. – Matt Bashton Jun 03 '17 at 10:55
1

Let's say you have an INFO field called SVTYPE to indicate the structure variant type.

Here is how you get the stats easily:

  1. Install vcfstats: pip install vcfstats
  2. Define a macro to extract the SVTYPE information, say in svtype.py:
from vcfstats.macros import cat
@cat
def SVTYPE(variant):
    return variant.INFO['SVTYPE']
  1. Generate the statistics:
vcfstats --vcf <your vcf file> \
   --outdir <the output directory> \
   --macro svtype.py \
   --formula 'COUNT(1) ~ SVTYPE[DEL,INS,INV...]' \ # desired svtypes
   --title 'Number of SV types'

Then in the output directory you will have the stats file named Number_of_SV_Types.txt and a plot for it as well.

Check the details at https://github.com/pwwang/vcfstats

Panwen Wang
  • 111
  • 2
  • The advantage of this solution is that it using a standard parser. The disadvantage is that it requires several steps (unlike zcat var.vcf.gz | perl -ne 'print "$1\n" if /[;\t]SVTYPE=([^;\t]+)/' | sort | uniq -c that will make the summary in one go). – Kamil S Jaron Oct 24 '19 at 08:59