I'm currently trying to assemble the genome of a rodent parasite, Nippostrongylus brasiliensis. An existing reference genome is available, but it is highly fragmented. Here are some continuity statistics for the scaffolds of the current Nippo reference genome (assembled from Illumina reads):

Total sequences: 29375
Total length: 294.400206 Mb
Longest sequence: 394.171 kb
Shortest sequence: 500 b
Mean Length: 10.022 kb
Median Length: 2.682 kb
N50: 2024 sequences; L50: 33.527 kb
N90: 11638 sequences; L90: 4.263 kb
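
For anyone wanting to reproduce figures like these from a list of sequence lengths, here is a minimal Python sketch. It is not the script that produced the numbers above; the function name `contiguity_stats` and the toy lengths are made up for illustration. Note that the listing above labels the sequence count "N50" and the length "L50", while many tools use those two names the other way around, so the sketch simply returns both values as a (count, length) pair.

```python
def contiguity_stats(lengths, fraction=0.5):
    """Return (count, length) for a cumulative fraction of the assembly.

    Sequences are sorted longest first and added up until the running
    total reaches `fraction` of the total length. `count` is the number
    of sequences needed, `length` is the length of the last one added.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count, length
    # Only reachable through floating-point rounding of `target`.
    return len(ordered), ordered[-1]


# Toy data (hypothetical, not from the Nippo assembly).
toy_lengths = [10, 8, 6, 4, 2]
print(contiguity_stats(toy_lengths, 0.5))  # (2, 8)
print(contiguity_stats(toy_lengths, 0.9))  # (4, 4)
```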

This genome is most likely difficult to assemble because of the highly repetitive nature of the genomic sequences. These repetitive sequences come in (at least) three classes:

  1. Tandem repeats with a repeat unit length greater than the read length of Illumina sequencers (e.g. 171bp)
  2. Tandem repeats with a cumulative length greater than the fragment length of Illumina sequencers, or the template length for linked reads (e.g. 20kb)
  3. Complex (i.e. non-repetitive) sequence that appears at multiple places throughout the genome

Canu seems to deal quite well with the first two types of repeats, despite the abundance of repetitive structure in the genome. Here's the unitigging summary produced by Canu on one of the assemblies I've attempted. Notice that about 30% of the reads either span or contain a long repeat:

category            reads     %          read length        feature size or coverage  analysis
----------------  -------  -------  ----------------------  ------------------------  --------------------
middle-missing        694    0.07     7470.92 +- 5552.00        953.06 +- 1339.13    (bad trimming)
middle-hump           549    0.05     3770.05 +- 3346.10         74.23 +- 209.86     (bad trimming)
no-5-prime           3422    0.33     6711.32 +- 5411.26         70.92 +- 272.99     (bad trimming)
no-3-prime           3161    0.30     6701.35 +- 5739.86         87.41 +- 329.42     (bad trimming)

low-coverage        27158    2.59     3222.51 +- 1936.79          4.99 +- 1.79       (easy to assemble, potential for lower quality consensus)
unique             636875   60.76     6240.20 +- 3908.44         25.22 +- 8.49       (easy to assemble, perfect, yay)
repeat-cont         48398    4.62     4099.55 +- 3002.72        335.54 +- 451.43     (potential for consensus errors, no impact on assembly)
repeat-dove           135    0.01    16996.33 +- 6860.08        397.37 +- 319.52     (hard to assemble, likely won't assemble correctly or even at all)

span-repeat        137927   13.16     9329.94 +- 6906.27       2630.06 +- 3539.53    (read spans a large repeat, usually easy to assemble)
uniq-repeat-cont   155725   14.86     6529.83 +- 3463.16                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
uniq-repeat-dove    28248    2.70    12499.99 +- 8446.95                             (will end contigs, potential to misassemble)
uniq-anchor          5721    0.55     8379.86 +- 4575.71       3166.22 +- 3858.35    (repeat read, with unique section, probable bad read)

However, the third type of repeat is giving me a bit of grief. Using the above assembly, here are the continuity parameters from the assembled contigs:

Total sequences: 3505
Total length: 322.867456 Mb
Longest sequence: 1.762243 Mb
Shortest sequence: 2.606 kb
Mean Length: 92.116 kb
Median Length: 42.667 kb
N50: 417 sequences; L50: 194.126 kb
N90: 1996 sequences; L90: 35.634 kb

It's not a bad assembly, particularly given the complexity of the genome, but I feel like it could be improved by tackling the complex genomic repeats in some fashion. About 60Mb of the contigs in this assembly are linked with each other in a huge web (based on the GFA output from Canu):

[Image: 60Mb linked structure from Canu GFA]

The repetitive regions are typically over 500bp in length, average about 3kb, and I've seen at least one case which seems to be a 20kb sequence duplicated in multiple regions.
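
One way to put a number on the linked web described above is to group the GFA segments into connected components and add up their lengths. The following is a rough sketch of that idea, assuming a GFA1-style file with tab-separated `S` (segment) and `L` (link) lines and an optional `LN:i:` length tag on segments. The function name `linked_component_sizes` and the toy records are hypothetical, and this is not how the 60Mb figure in the post was obtained.

```python
from collections import defaultdict


def linked_component_sizes(gfa_lines):
    """Group GFA segments into connected components using L (link) lines.

    Returns a list of (total_length_bp, segment_count), largest first.
    """
    length = {}
    parent = {}
    links = []

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            seg_len = 0 if seq == "*" else len(seq)
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    seg_len = int(tag[5:])
            length[name] = seg_len
            parent.setdefault(name, name)
        elif fields[0] == "L" and len(fields) >= 5:
            links.append((fields[1], fields[3]))

    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    totals = defaultdict(lambda: [0, 0])
    for name in parent:
        root = find(name)
        totals[root][0] += length.get(name, 0)
        totals[root][1] += 1
    return sorted((tuple(t) for t in totals.values()), reverse=True)


# Toy GFA records (hypothetical): two linked contigs and one isolated contig.
toy_gfa = [
    "S\ta\t*\tLN:i:100",
    "S\tb\t*\tLN:i:200",
    "S\tc\tACGT",
    "L\ta\t+\tb\t-\t0M",
]
print(linked_component_sizes(toy_gfa))  # [(300, 2), (4, 1)]
```

On a real file, passing an open file handle as `gfa_lines` should work the same way; the first tuple returned would then be the largest linked structure.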

The Canu defaults seem to give the best assembly results for the few parameters that I've tried, with one exception: trimming. I've tried playing around a little bit with the trimming parameters, and curiously a trimming coverage of 5X (with overlap of 500bp) seems to give a more contiguous assembly than with a trimming coverage of 2X (with the same overlap).
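
For reference, the two trimming runs could be set up along the following lines. This is only a sketch: it assumes that the "trimming coverage" and "overlap" mentioned above correspond to Canu's `trimReadsCoverage` and `trimReadsOverlap` parameters, that the genome size is given as roughly 300m, and that the reads are passed with `-nanopore-raw` (the read-input option differs between Canu versions). The prefix and file names are placeholders, not the ones actually used.

```python
import shlex


def canu_trim_command(prefix, reads, coverage, overlap, genome_size="300m"):
    """Build a Canu command line with explicit trimming parameters.

    All names and values passed in are placeholders; check them against
    the documentation of the installed Canu version before running.
    """
    return [
        "canu",
        "-p", prefix,                        # output file prefix
        "-d", f"{prefix}-canu",              # output directory
        f"genomeSize={genome_size}",         # assumed approximate genome size
        f"trimReadsCoverage={coverage}",     # minimum evidence depth to keep bases
        f"trimReadsOverlap={overlap}",       # minimum overlap between evidence
        "-nanopore-raw", reads,              # assumption: Canu 1.x style input flag
    ]


# Print (do not run) the two commands being compared.
for cov in (2, 5):
    cmd = canu_trim_command(f"nippo_trim{cov}x", "nippo_reads.fastq.gz", cov, 500)
    print(shlex.join(cmd))
```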

If anyone is interested in having a look at these data themselves, basecalled FASTQ files from the Nippo sequencing runs can be found here. Raw nanopore signal files are available within ENA project PRJEB20824. There's also a Zenodo archive here that contains the GFA and assembly contigs.

I can use Illumina data to correct the Canu assembly, but that doesn't help with resolving the "type 3" repeats. The regions are sufficiently similar that Illumina reads get mapped to multiple points in the genome. The Illumina contigs are high quality (i.e. they have good BUSCO scores, indicating few variant errors), but quite short. Any sniff of a repeat and the contig ends. I've got more than a few examples of regions that would make an Illumina read (even 10x linked reads) cower in fear.

Does anyone have any other suggestions on how I could resolve these complex repeats?

Computational solutions would be preferred, but resequencing is not out of the question.


1 Answer


"A few" 100kb reads won't help much. You need to apply the ultra-long protocol, which is different from the standard protocol.

You can't resolve 20kb near-identical repeats/segdups with 10kb reads. All you can do is hope that a few exceptionally long reads span some units by chance. For divergent copies, it is worth looking at this paper. It uses Illumina reads to identify k-mers in unique regions and ignores non-unique k-mers at the overlapping stage. The paper says that this strategy is better than using standard overlappers, which I buy, but it probably can't resolve a 20kb segdup with a handful of mismatches, either.
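
To make the unique-k-mer idea a bit more concrete, here is a toy Python sketch. It is not the paper's implementation: the function names, the count thresholds and the sequences are all made up, and reverse complements and sequencing errors are ignored. It counts k-mers in short reads, keeps those whose counts fall in a "unique" range, and scores a pair of long reads only on the unique k-mers they share.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of `seq` (forward strand only)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_kmers(short_reads, k, min_count, max_count):
    """k-mers whose count in the short reads looks single-copy.

    The thresholds are assumptions; in practice they would be chosen
    from the k-mer count histogram of the Illumina data.
    """
    counts = Counter()
    for read in short_reads:
        counts.update(kmers(read, k))
    return {km for km, n in counts.items() if min_count <= n <= max_count}


def shared_unique_kmers(long_a, long_b, unique, k):
    """Number of distinct unique k-mers present in both long reads."""
    a = {km for km in kmers(long_a, k) if km in unique}
    b = {km for km in kmers(long_b, k) if km in unique}
    return len(a & b)


# Toy data (hypothetical): "CCC" plays the role of a repeat k-mer.
short_reads = ["AAACCC", "CCCGGG", "CCCTTT"]
unique = unique_kmers(short_reads, k=3, min_count=1, max_count=1)

# These two reads share only the repeat k-mer, so no overlap is reported.
print(shared_unique_kmers("AAACCCGGG", "TTCCCTTT", unique, 3))  # 0
# These two reads share unique flanking k-mers as well.
print(shared_unique_kmers("AAACCCGGG", "ACCCGGGA", unique, 3))  # 4
```

The limitation mentioned above shows up directly here: if two repeat copies differ by only a handful of mismatches, very few k-mers distinguish them, so the filtered overlap has little signal to work with.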

Such mismatch-based approaches always have limitations and may not work for recent segdups/repeats. The ultimate solution is to get long reads, longer than your repeat/segdup units. The ~100kb reads in the recent preprint will be a game changer for you. If your ~20kb repeats are not tandem, 10X's ~100kb linked reads may help, too.
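
The read-length argument can be turned into a back-of-the-envelope estimate. The sketch below assumes read start positions are uniform across the genome and asks how many reads are expected to cover one repeat copy plus some unique sequence on each side. The function name, the 500bp anchor and the read-length lists are illustrative assumptions; the 300Mb genome size is only a round figure based on the assembly totals in the question.

```python
def expected_spanning_reads(read_lengths, genome_size, repeat_len, anchor=500):
    """Expected number of reads spanning one repeat copy.

    A read spans the repeat if it covers the repeat plus `anchor` bp of
    unique sequence on both sides. A read of length L has
    L - (repeat_len + 2 * anchor) + 1 start positions that achieve this
    (none if L is too short), out of roughly `genome_size` possible starts.
    """
    needed = repeat_len + 2 * anchor
    usable_starts = sum(max(0, length - needed + 1) for length in read_lengths)
    return usable_starts / genome_size


genome = 300_000_000  # assumed round figure for the genome size

# Hypothetical read sets: 10kb reads can never span a 20kb repeat.
print(expected_spanning_reads([10_000] * 1000, genome, 20_000))  # 0.0

# Hypothetical ultra-long run: 30,000 reads of 100kb each (about 10X).
print(expected_spanning_reads([100_000] * 30_000, genome, 20_000))
```

Feeding in the actual read-length distribution from a sequencing run gives a quick check of whether a given repeat size is likely to be spanned at all before committing to more sequencing.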
