19

All long-read sequencing platforms are based on single-molecule sequencing, which results in higher per-base error rates. For this reason, a polishing step was added to genome assembly pipelines: raw reads are mapped back to the assembly and used to correct its details.

I have a decent PacBio RSII dataset from a single individual of a heavily heterozygous non-model species. The assembly went well, but when I tried to polish it using Quiver, it did not converge over a couple of iterations, and I bet that is because the haplotypes are too divergent.

Is there any other way to polish a genome with such properties? For instance, is there a way to separate long reads by haplotype, so I could polish using one haplotype only?

gringer
Kamil S Jaron

3 Answers

4

A few possibilities:

Falcon

Try FALCON and FALCON-Unzip. These are designed for exactly your problem and your data: https://github.com/PacificBiosciences/FALCON

Not Falcon

If you think you have assembled haplotypes (which seems reasonable to expect given enough coverage), you should be able to see the two haplotypes by just doing all pairwise alignments of your contigs. Haplotypes should show up as pairs of contigs that are MUCH more similar (even with a lot of between-haplotype divergence) than other pairs. Once you have all such pairs, you can simply select one of each pair to polish.
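The pairing step could be sketched roughly as follows, assuming you already have an all-vs-all alignment of your contigs in PAF format (for example from minimap2). The function name and the `min_cov` and `min_identity` thresholds are illustrative assumptions, not tuned values, and overlapping alignment blocks are simply summed, so treat the output as a list of candidates to inspect rather than a final answer.

```python
from collections import defaultdict

def candidate_haplotype_pairs(paf_lines, min_cov=0.5, min_identity=0.8):
    """Return [((contig_a, contig_b), coverage), ...] sorted by coverage.

    Hypothetical helper: thresholds are placeholders, not recommendations.
    """
    span = defaultdict(int)     # (query, target) -> query bases inside alignments
    matches = defaultdict(int)  # (query, target) -> matching bases (PAF column 10)
    blocks = defaultdict(int)   # (query, target) -> alignment block length (column 11)
    qlens = {}
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[0] == f[5]:
            continue  # skip malformed lines and self-alignments
        key = (f[0], f[5])
        qlens[f[0]] = int(f[1])
        span[key] += int(f[3]) - int(f[2])
        matches[key] += int(f[9])
        blocks[key] += int(f[10])

    pairs = {}
    for (q, t), bases in span.items():
        if qlens[q] == 0 or blocks[(q, t)] == 0:
            continue
        # Overlapping blocks are not merged, so cap coverage at 1.0.
        cov = min(1.0, bases / qlens[q])
        ident = matches[(q, t)] / blocks[(q, t)]
        if cov >= min_cov and ident >= min_identity:
            pair = tuple(sorted((q, t)))
            pairs[pair] = max(pairs.get(pair, 0.0), cov)
    return sorted(pairs.items(), key=lambda kv: -kv[1])

# Toy input: ctgA aligns along most of its length to ctgB, barely to ctgC.
paf = [
    "ctgA\t1000\t0\t900\t+\tctgB\t1100\t50\t950\t850\t900\t60",
    "ctgA\t1000\t0\t100\t+\tctgC\t5000\t0\t100\t60\t100\t10",
]
result = candidate_haplotype_pairs(paf)
```

From each reported pair you would then keep one contig for polishing.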

roblanf
  • I do indeed have both haplotype sequences. I got them using a tool called HaploMerger2. But this tool produces a chimeric haploid assembly, so they are not really correctly phased haplotypes. FALCON-Unzip is indeed software that could work. It was too young to try at the time, but I could give it another shot now. – Kamil S Jaron May 22 '17 at 09:39
3

You could also have a go at Canu. It's designed for long-read assembly (both PacBio and Nanopore), although not specifically for complex population sequencing. It tries to strip a genome down into its unique components, and generates paths from those components that are well-supported from the reads.

With regard to polishing, it seems that polishing often doesn't converge, and there will be a lot of variants that just oscillate between two possibilities. For me and at least one other person at London Calling this year, there was basically no gain in accuracy from polishing past the third iteration. I used my own error correction algorithm, but they used the more "standard" polishing with Pilon. For what it's worth, the nanopore WGS consortium used Racon for polishing their Canu assemblies.
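One rough way to spot this oscillation is to keep the consensus from each polishing round and flag contigs that flip back to the sequence they had two rounds earlier. This is a minimal sketch, assuming each round is loaded into a dict of contig name to sequence; the function name is hypothetical, and it only catches exact period-two flips of a whole contig, not individual sites.

```python
def oscillating_contigs(rounds):
    """rounds: list of {contig_name: sequence}, one dict per polishing round.

    Returns the sorted names of contigs whose sequence in some round equals
    the sequence from two rounds earlier but differs from the previous round.
    """
    flagged = set()
    for i in range(2, len(rounds)):
        two_back, previous, current = rounds[i - 2], rounds[i - 1], rounds[i]
        for name, seq in current.items():
            if two_back.get(name) == seq and previous.get(name) != seq:
                flagged.add(name)
    return sorted(flagged)

# Toy input: c1 flips ACGT -> ACCT -> ACGT, c2 changes once and then stays.
rounds = [
    {"c1": "ACGT", "c2": "AAAA"},
    {"c1": "ACCT", "c2": "AAAT"},
    {"c1": "ACGT", "c2": "AAAT"},
]
flagged = oscillating_contigs(rounds)
```

If most of the remaining changes come from flagged contigs, further iterations are unlikely to help.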

gringer
  • I actually assembled the genome using Canu. I got ~2x the haploid genome size, which I then collapsed to haplotypes using HaploMerger2.

    I know that globally the assembly is good. It just needs to be polished.

    – Kamil S Jaron May 22 '17 at 09:34
  • Oh, yes. Sorry, I looked at the first answer and had assumed that this was just about assembly. I realise now that the question was discussing polishing, rather than assembly. – gringer May 23 '17 at 14:19
  • @gringer I was also trying to polish a highly heterozygous genome assembly (generated by Canu) using Racon (Quiver would collapse haplotypes), but couldn't get a satisfying output (basically, none of the statistics changed). Any advice? – aechchiki Jul 31 '18 at 07:19
  • My general recommendation at the moment would be to use nanopolish in methylation mode to correct, then Pilon with Illumina reads to only correct the homopolymer fragments (i.e. no SNP correction, and no long-range scaffolding). Based on this:

    https://github.com/rrwick/Basecalling-comparison#methylation

    – gringer Jul 31 '18 at 11:05
0

There is a new tool called Hapo-G for diploid-aware polishing. Unfortunately, it seems to be designed for short reads only. Still, it seems to be a more practical choice than inferring the phase information of individual reads.

Kamil S Jaron