Short-read depth recommendations for polishing

ONT-only bacterial genome assemblies now regularly have <10 errors (see my last post), which makes short-read polishing less crucial than it used to be. George Bouras and Matthew Croxen were chatting on the µbioinfo Slack, and the question came up of how much short-read depth is necessary. If there are only a few errors to fix, can you use shallow short-read sequencing¹ to save money?

I started experimenting with this in January 2024 with the intention of putting the results on this blog, and I used my current preferred polishing method: Polypolish followed by Pypolca. Pypolca is a Python-based reimplementation of the POLCA polisher made by George that’s easier to install and run. However, I soon noticed that Pypolca could introduce a lot of errors at low depths. I also noticed some cases where Polypolish introduced errors at low depths, and this really bothered me, because I explicitly designed Polypolish to not introduce errors.

Why am I so hung up on introduced errors? A few years ago, a good ONT-only bacterial genome assembly would contain hundreds to thousands of errors. If a polisher could fix those but introduced a few new errors in the process, that wasn’t a big deal – it would still make the assembly much better. But with today’s much-more-accurate ONT-only assemblies, a polisher that introduces errors can easily make an assembly worse, not better. So introduced errors are now a big deal!

It became clear that both Polypolish and Pypolca would need enhancements to avoid introducing errors when short-read depth is low. At this point, I decided the topic probably deserved a whole manuscript, not just a blog post. Since Pypolca is central to this, I teamed up with George. The paper introduces Pypolca and describes improvements to both Pypolca and Polypolish that help in low-depth scenarios, and we benchmarked these tools against FMLRC2, HyPo, NextPolish and Pilon². George and I split the analytical work (with help from the other authors, of course) and George did most of the writing. The manuscript is now published in Microbial Genomics:

How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies

As is often the case, this paper grew into something larger than originally planned, and it answers some other interesting questions as well. So please check it out for the full story! But in this post, I’d like to come back to the original question: when polishing a modern ONT-only assembly, is shallow short-read sequencing good enough?

The answer is… not really. 5× short-read depth can probably fix about ⅓ of the errors, and 10× depth can probably fix about ⅔ of the errors, but you need 25× or more to fix most of the errors. So I sadly cannot recommend shallow short-read sequencing for polishing. These results have, however, made me refine my recommended short-read depth. In the past, I’ve suggested 100–300× short-read depth when aiming for a perfect assembly. Assuming all goes well on the ONT side, I now feel comfortable reducing that recommendation to 50× short-read depth.
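To turn a depth target into an amount of sequencing, the usual back-of-the-envelope relation is depth ≈ total sequenced bases ÷ genome size. Here is a minimal Python sketch of that arithmetic. It is not from the paper: the function names are hypothetical, and the 5 Mbp genome and 150 bp paired-end reads are assumed example values, so swap in your own.

```python
import math


def mean_depth(total_bases: int, genome_size_bp: int) -> float:
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def read_pairs_for_depth(genome_size_bp: int, target_depth: float,
                         read_length_bp: int) -> int:
    """Read pairs needed to reach target_depth, assuming paired-end reads
    of a fixed length and ignoring trimming/QC losses (an assumption)."""
    bases_needed = genome_size_bp * target_depth
    bases_per_pair = 2 * read_length_bp
    return math.ceil(bases_needed / bases_per_pair)


if __name__ == "__main__":
    genome_size = 5_000_000  # assumed example: a 5 Mbp bacterial genome
    read_length = 150        # assumed example: 2x150 bp paired-end reads
    for depth in (5, 10, 25, 50, 100):
        pairs = read_pairs_for_depth(genome_size, depth, read_length)
        print(f"{depth}x depth: about {pairs:,} read pairs")
```

Under those assumptions, 50× depth works out to roughly 833,000 read pairs (250 Mbp of sequence). Real runs lose some reads to QC and trimming, so it is worth leaving a margin above the target.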

Footnotes

¹ I was in the habit of saying ‘Illumina sequencing’ because that used to be the only short-read game in town. But now that other platforms (e.g. MGI) are becoming more common, I’m trying to retrain myself to say ‘short-read sequencing’ instead.
² I didn’t include ntEdit, Racon and wtpoa because those polishers performed worse than the rest when I benchmarked them in the Polypolish paper.