Yet another ONT accuracy test: Dorado v0.5.0
This is my third ONT-only accuracy update in 2023 – it’s been a busy year for ONT! This one is motivated by the early-December release of Dorado v0.5.0 and its new v4.3 basecalling models.
Last time, I tested the three standard Dorado models (fast, hac and sup) along with a research model on Rerio that was fine-tuned for bacteria. However, that research model now seems to be obsolete, as the new v4.3 basecalling models incorporate this bacterial training.[1] This time, I only tested the sup model (not fast or hac), both to save myself time and because I think of sup basecalling as the default choice. Given sufficient compute (which we have with Onion), why would you use anything but sup?
Methods
I used the same genomes and methods as my last accuracy post, so look there for the details. The results below compare the previous `dna_r10.4.1_e8.2_400bps_sup@v4.2.0` model to the current `dna_r10.4.1_e8.2_400bps_sup@v4.3.0` model.
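For reference, the qscores in the tables below are Phred-style: read qscores correspond to per-read identity, and the assembly qscores are consistent with the total error count divided by genome size. Here is a minimal Python sketch of those conversions (just for illustration – the genome size in the example is a made-up round number, and the real methods are in the previous post):

```python
import math

def identity_to_qscore(identity):
    """Convert a read identity (e.g. 0.971) into a Phred-style qscore."""
    error_rate = 1.0 - identity
    if error_rate <= 0.0:
        return float('inf')  # a perfect sequence has no finite qscore
    return -10.0 * math.log10(error_rate)

def assembly_qscore(error_count, genome_size):
    """Convert an assembly's error count into a Phred-style qscore."""
    if error_count == 0:
        return float('inf')  # e.g. the Listeria monocytogenes assemblies
    return -10.0 * math.log10(error_count / genome_size)

print(identity_to_qscore(0.971))       # ≈ 15.4 (the mean for sup v4.2.0 reads)
print(assembly_qscore(27, 1_800_000))  # ≈ 48.2 (27 errors in a ~1.8 Mbp genome)
```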
Results: read accuracy
This table shows the simplex read identity (left) and qscore (right) across all nine genomes:
average | sup v4.2.0 | sup v4.3.0 |
---|---|---|
mean | 97.1%, Q15.4 | 97.7%, Q16.4 |
median | 98.6%, Q18.5 | 99.1%, Q20.5 |
mode | 99.0%, Q20.0 | 99.4%, Q22.2 |
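The mean and median in the table above are straightforward, but the mode of a continuous identity distribution needs a density estimate. Here's one way it could be done (a rough sketch using a Gaussian KDE – not necessarily the exact approach behind the numbers above):

```python
import numpy as np
from scipy import stats

def identity_summary(identities):
    """Summarise per-read identities (as fractions, e.g. 0.991) with the
    mean, median and a KDE-based estimate of the modal identity."""
    identities = np.asarray(identities, dtype=float)
    mean_id = identities.mean()
    median_id = np.median(identities)
    kde = stats.gaussian_kde(identities)   # smooth the identity distribution
    grid = np.linspace(identities.min(), identities.max(), 1000)
    mode_id = grid[np.argmax(kde(grid))]   # mode = peak of the density
    return mean_id, median_id, mode_id
```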
Results: assembly accuracy
This table shows the error count (left) and qscore (right) for each assembly:
Genome | sup v4.2.0 | sup v4.3.0 |
---|---|---|
Campylobacter jejuni | 27, Q48.2 | 5, Q55.5 |
Campylobacter lari | 28, Q47.3 | 18, Q49.2 |
Escherichia coli | 70, Q48.7 | 1, Q67.2 |
Listeria ivanovii | 9, Q55.1 | 5, Q57.7 |
Listeria monocytogenes | 0, Q∞ | 0, Q∞ |
Listeria welshimeri | 2, Q61.5 | 1, Q64.5 |
Salmonella enterica | 8, Q57.8 | 3, Q62.0 |
Vibrio cholerae | 22, Q52.7 | 2, Q63.2 |
Vibrio parahaemolyticus | 7, Q58.7 | 2, Q64.1 |
total, average | 173, Q52.6 | 37, Q59.3 |
Since there are only 37 remaining errors in the sup v4.3.0 assemblies, I’ve included them all below. Each comparison shows the ONT-only assembly (top) vs the Illumina-polished reference (bottom):
All sup v4.3.0 errors
Discussion and conclusions
The v4.3.0 model is clearly a big improvement over v4.2.0! Read accuracy got noticeably better: v4.3.0 reads had roughly 2/3 as many errors as v4.2.0 reads. And there was an even bigger improvement in consensus accuracy: v4.3.0 assemblies had less than 1/4 as many errors as v4.2.0 assemblies. The remaining assembly errors are mostly in long homopolymers. The exception was Campylobacter lari, which had more errors than the other genomes, mostly in the `GATC` motif.[2] It's clear that sup v4.3.0 is now the best basecalling model – much better than the bacterial research model on Rerio.
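For anyone wanting to do a similar breakdown of their own assembly errors, here's a rough sketch of the kind of check involved – flagging whether an error position sits in a homopolymer or near a `GATC` motif. The function names and thresholds are illustrative, not taken from my actual analysis:

```python
def in_homopolymer(seq, pos, min_len=4):
    """True if the base at pos is part of a homopolymer run of at least
    min_len bases (an arbitrary threshold, chosen for illustration)."""
    base = seq[pos]
    start, end = pos, pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1 >= min_len

def near_motif(seq, pos, motif='GATC', window=5):
    """True if the motif occurs within window bases of pos."""
    lo = max(0, pos - window - len(motif) + 1)
    hi = min(len(seq), pos + window + len(motif))
    return motif in seq[lo:hi]

# Example: classify a couple of (hypothetical) error positions
ref = 'ACGTGATCAAAAAATTTGATCCG'
for pos in (5, 10):
    print(pos, in_homopolymer(ref, pos), near_motif(ref, pos))
```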
While there is also a new version of Medaka with polishing models to match the v4.3.0 basecalling models, it once again failed to help much. Three of the nine genomes got better with Medaka polishing, three got worse and three didn't change.[3] So my opinion on Medaka polishing of sup assemblies remains the same: don't bother.[4]
I should again emphasise that these ONT-only genomes came from very deep read sets and careful Trycycler assembly, so a more typical ONT-only genome (e.g. 50–100× depth assembled with Flye) probably won't be quite this good. Another caveat is that these genomes are from well-studied taxa, so their DNA modifications/motifs are more likely to be represented in ONT's basecaller training set.[5] How well does ONT perform on Rubeoparvulum, Abyssicoccus, Zhihengliuella and Haloactinopolyspora?[6] It would be interesting to assess ONT-only accuracy on a really diverse set of genomes.
Overall, I'm very impressed with these results, and perfect ONT-only bacterial assemblies are now looking closer than ever. And as always, there are plenty of developments on the horizon which may improve accuracy even further. The closest is the E8.2.1 motor protein, scheduled for release next year, which ONT claims will increase read accuracy by another couple of qscores.[7] The steady year-after-year improvement of ONT sequencing, while occasionally frustrating (when I need to rebasecall all my data), has been very fun to witness.
Footnotes
1. As reported by Chris Seymour and Mike Vella.
2. I haven't checked, but I assume this is due to Dam methylation, which occurs in many species, including Campylobacter.
3. The sup v4.3.0 + Medaka error counts were: 3, 28, 1, 4, 0, 1, 6, 3 and 1 (same genome order as the above table) for a total of 47 errors.
4. While I didn't test the hac v4.3.0 model this time, I previously found that Medaka is quite beneficial for hac assemblies.
5. I don't exactly know what's in ONT's basecaller training set. Presumably there's amplified DNA (no mods) and native DNA from humans, plants and various bacterial species. Perhaps also synthetic DNA with randomly placed modifications? And what else? I've asked ONT in the past, but they seem tight-lipped about it. Inquiring minds want to know!
6. I've never heard of these taxa either. I just went to GTDB and grabbed some random obscure ones.
7. They talked about this in their recent tech update presentation. At T=20min, Lakmal talks about motor proteins and E8.2.1. At T=42min, Rosemary talks about the rollout, including the lot number which will indicate whether a kit contains E8.2.1.