23

When you look at all the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to use/download?

You have a combination of choices.

First part options:

  • dna_sm - Repeats soft-masked (converts repeat nucleotides to lowercase)
  • dna_rm - Repeats hard-masked (converts repeats to N's)
  • dna - No masking
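
To make the three masking options concrete, here is a minimal Python sketch of how the same sequence would look in each file type. The function names are made up for illustration and are not part of any Ensembl tool; the only assumption is that soft-masked repeats are the lowercase bases.

```python
def hard_mask(seq: str) -> str:
    """Replace every soft-masked (lowercase) base with N, as in a dna_rm file."""
    return "".join("N" if base.islower() else base for base in seq)


def unmask(seq: str) -> str:
    """Uppercase every base, as in a plain dna file."""
    return seq.upper()


def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase)."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


if __name__ == "__main__":
    soft = "ACGTacgtacGGTT"          # toy dna_sm-style sequence
    print(hard_mask(soft))           # dna_rm-style view
    print(unmask(soft))              # dna-style view
    print(masked_fraction(soft))     # share of the sequence flagged as repeat
```

Since the soft-masked file can be converted to either of the other two this way, it carries the most information of the three.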

Second part options:

  • .toplevel - Includes haplotype information (not sure how aligners deal with this)

  • .primary_assembly - Single reference base per position

Right now I usually use a non-masked primary assembly for analysis, so in the case of humans: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Does this make sense for standard RNA-Seq, ChIP-Seq, ATAC-Seq, CLIP-Seq, scRNA-Seq, etc.?

In what cases would I prefer other genomes? Which tools/aligners take into account softmasked repeat regions?

story
  • Relevant blog post: http://genomespot.blogspot.ch/2015/06/mapping-ngs-data-which-genome-version.html – Chris_Rands Jun 07 '17 at 14:02
  • What kind of "alignments"? Protein tblastn? Whole genome alignments? NGS read alignments? Gene-level alignments? – terdon Jun 07 '17 at 14:52

4 Answers

13

There's rarely a good reason to use a hard-masked genome (sometimes for blast, but that's it). For that reason, we use soft-masked genomes, which only have the benefit of showing roughly where repeats are (we never make use of this for our *-seq experiments, but it's there in case we ever want to).

For primary vs. toplevel, very few aligners can properly handle additional haplotypes. If you happen to be using BWA, then the toplevel assembly would benefit you, but only if you use a dedicated wrapper to handle the ALT information (see bwakit). If you run BWA (bwa-mem) directly from the command line without this wrapper, do not use the toplevel assembly. For STAR/hisat2/bowtie2/BBmap/etc., the haplotypes will just cause you problems by incorrectly inflating multimapper rates. Note that none of these aligners actually uses soft-masking.

ATpoint
Devon Ryan
9

Generally, you should use the soft-masked or unmasked primary assembly. Cross-species whole-genome aligners, especially older ones, do need to know soft-masked regions; otherwise they can be impractically slow for mammalian genomes. Modern read aligners are designed to work with repeats efficiently and therefore they don't need to see the soft mask.

For GRCh38, though, I would recommend using the official build from the GRC FTP site. Most people will probably choose "no_alt_analysis_set". Using the Ensembl version is discouraged because of its chromosome naming: for GRCh38, "chr1" is used more often than "1". At one point, Ensembl actually agreed to use "chr1" as well, but didn't make that happen, due to technical issues, I guess.
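
To illustrate the naming difference only, here is a rough Python sketch that rewrites Ensembl-style FASTA headers to the "chr" style. The helper names are hypothetical, and it deliberately handles just 1-22, X, Y and MT; unplaced scaffolds are left untouched because their "chr"-style names are not a simple prefix. Renaming headers does not turn the Ensembl file into the GRC analysis set, so treat this as an explanation of the mismatch, not a replacement for downloading the official build.

```python
# Ensembl-style names of the primary human chromosomes (assumption: GRCh38).
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def to_chr_style(name: str) -> str:
    """Map an Ensembl-style sequence name to the 'chr' style where it is unambiguous."""
    if name in PRIMARY:
        return "chr" + name
    if name == "MT":
        return "chrM"
    return name  # scaffolds etc. are left unchanged on purpose


def rename_header(line: str) -> str:
    """Rename the sequence in a FASTA header line (line ending already stripped).

    Non-header lines are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    name, sep, rest = line[1:].partition(" ")
    return ">" + to_chr_style(name) + sep + rest


if __name__ == "__main__":
    for header in (">1 dna:chromosome", ">MT dna:chromosome", ">KI270728.1 dna:scaffold"):
        print(rename_header(header))
```

The same mapping has to be applied to the annotation (GTF/GFF) and any BED files, otherwise the names will not match between files.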

As for alternate haplotypes, most aligners can't work with them, and no variant callers can take advantage of these sequences either. When you align to a reference genome containing haplotypes with an aligner that does not support these extra sequences, you will get poor mapping results.

user172818
4

Which tools/aligners take into account softmasked repeat regions?

If you're doing whole genome - whole genome alignment (rather than read alignment) then using the soft-masked genome is definitely best. Tools suited to such large-scale alignment tasks tend to skip marked repeats completely in their initial steps, to prevent the build-up of bogus short alignments that can have a massive impact on run time and memory usage. For example, LASTZ skips lowercase letters during the seeding stage.
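
As a toy illustration of the idea (this is not LASTZ's actual algorithm, just a sketch with exact k-mer seeds and made-up function names), seeding that ignores any k-mer touching a lowercase base could look like this:

```python
from collections import defaultdict


def index_kmers(target: str, k: int) -> dict:
    """Index the start positions of k-mers in target, skipping soft-masked ones."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        kmer = target[i:i + k]
        if kmer.isupper():  # False as soon as one base is lowercase
            index[kmer].append(i)
    return index


def find_seeds(query: str, target: str, k: int) -> list:
    """Return (query_pos, target_pos) pairs of exact k-mer matches outside repeats."""
    index = index_kmers(target, k)
    hits = []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        if kmer.isupper():
            hits.extend((j, i) for i in index.get(kmer, []))
    return hits


if __name__ == "__main__":
    target = "ACGTacacacacGGCA"   # lowercase = soft-masked repeat
    query = "TTACGTacacGGCA"
    print(len(find_seeds(query, target, 4)))                  # seeds with masking respected
    print(len(find_seeds(query.upper(), target.upper(), 4)))  # seeds when masking is ignored
```

Even in this tiny example the unmasked run produces several times more seeds, most of them inside the repeat; at mammalian genome scale that difference is what makes the soft mask matter for run time and memory.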

Chris_Rands
2

TOPLEVEL

These files contain all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N-padded haplotype/patch regions.

E.g. I have used the soft-masked assemblies for genome annotation pipelines like MAKER, and the unmasked toplevel ones for RNA-Seq and ChIP-Seq analysis.

PRIMARY ASSEMBLY

Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. This file is best used for performing sequence similarity searches where patch and haplotype sequences would confuse analysis.
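
If you only have a toplevel file at hand, the relationship between the two can be sketched in Python as "toplevel minus haplotype/patch records". The sketch below assumes that those records can be recognised by a name prefix; the default "CHR_" is my assumption about how they are named, so check the headers of your own file before relying on it. The function names are hypothetical, and downloading the primary_assembly file directly is the safer route when it exists.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_alt_records(lines, alt_prefix="CHR_"):
    """Yield only records whose name does not start with alt_prefix.

    alt_prefix is an assumed naming convention for haplotype/patch records.
    """
    for header, seq in iter_fasta(lines):
        name = header.split(" ", 1)[0]
        if not name.startswith(alt_prefix):
            yield header, seq


if __name__ == "__main__":
    toy = [">1 toy chromosome\n", "ACGT\n", "ACGT\n",
           ">CHR_TOY_HAP toy haplotype\n", "ACGA\n",
           ">MT toy mitochondrion\n", "GGCC\n"]
    for header, seq in drop_alt_records(toy):
        print(header, len(seq))
```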