When you look at the genome files available from Ensembl, you are presented with a bunch of options. Which one is the best to download and use?
Each file name is a combination of two choices.
First part options:
- dna_sm - Repeats soft-masked (converts repeat nucleotides to lowercase)
- dna_rm - Repeats hard-masked (converts repeat nucleotides to N's)
- dna - No masking
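To make the difference between the three masking styles concrete, here is a minimal Python sketch. It assumes the soft-masked file marks repeats purely by lowercase letters, as described above; the function names `unmask` and `hard_mask` are made up for illustration and are not part of any Ensembl tooling.

```python
import re

def unmask(soft_masked: str) -> str:
    """Drop the soft-masking by upper-casing every base (dna_sm -> dna style)."""
    return soft_masked.upper()

def hard_mask(soft_masked: str) -> str:
    """Replace every lowercase (repeat) base with N (dna_sm -> dna_rm style)."""
    return re.sub(r"[a-z]", "N", soft_masked)

# Toy sequence: the lowercase stretch stands in for an annotated repeat.
seq = "ACGTacgtacgtTTGA"
print(unmask(seq))     # ACGTACGTACGTTTGA
print(hard_mask(seq))  # ACGTNNNNNNNNTTGA
```

Seen this way, the soft-masked file carries the most information: the other two styles can be derived from it, but not the other way around.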
Second part options:
- .toplevel - Includes haplotype information (not sure how aligners deal with this)
- .primary_assembly - Single reference base per position
Right now I usually use a non-masked primary assembly for analysis, so in the case of humans: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
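As a rough sketch of how the two choices combine into a file name, the Python snippet below rebuilds the example above. The pattern is inferred only from that one file name, and the helper `ensembl_fasta_name` is hypothetical; whether a given combination actually exists for a species should be checked against the Ensembl FTP listing.

```python
# Options as listed in the question; the naming pattern is an assumption
# inferred from the single example file name.
MASKING = ("dna", "dna_sm", "dna_rm")
ASSEMBLY = ("toplevel", "primary_assembly")

def ensembl_fasta_name(species: str, build: str,
                       masking: str = "dna",
                       assembly: str = "primary_assembly") -> str:
    """Compose a genome FASTA file name from the two choices."""
    if masking not in MASKING:
        raise ValueError(f"unknown masking option: {masking}")
    if assembly not in ASSEMBLY:
        raise ValueError(f"unknown assembly option: {assembly}")
    return f"{species}.{build}.{masking}.{assembly}.fa.gz"

print(ensembl_fasta_name("Homo_sapiens", "GRCh38"))
# Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
```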
Does this make sense for standard RNA-Seq, ChIP-Seq, ATAC-Seq, CLIP-Seq, scRNA-Seq, etc.?
In what cases would I prefer one of the other genome files? Which tools/aligners take soft-masked repeat regions into account?