
About six months ago, I tested ONT-only assembly accuracy using R10.4.1 reads: ONT-only accuracy with R10.4.1.

In that post, I found that while the median ONT-only accuracy was quite good (~Q50), it varied a lot between bacterial species. Some genomes (Salmonella and Listeria) had very few errors, while others (E. coli and Campylobacter) had many. DNA methylation motifs seemed to be the biggest cause of errors, presumably because the basecaller wasn’t trained on native DNA that included those methylation patterns.

There have been a couple of ONT developments since that post: the move to a 5 kHz sampling rate[1] and the maturation of Dorado[2]. So I decided it was time to once again quantify ONT-only assembly accuracy on bacterial genomes.

Methods

I used the following nine genomes, sequenced by my colleagues Louise and Hasini:

These include the same five genomes I used last time plus four additional ones. Each had deep ONT and Illumina reads, and I produced ground-truth genomes using my usual approach[3].

Unlike last time where I only tested sup basecalling, this time I tested a range of basecalling models:

  • dna_r10.4.1_e8.2_400bps_fast@v4.2.0
  • dna_r10.4.1_e8.2_400bps_hac@v4.2.0
  • dna_r10.4.1_e8.2_400bps_sup@v4.2.0
  • res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation

For brevity, I’ll refer to these as ‘fast’, ‘hac’, ‘sup’ and ‘res’. The first three are the current standard Dorado models at different levels of speed/accuracy[4]. The last one is interesting – it’s a sup-sized research model available on Rerio that has been fine-tuned for native bacterial DNA[5]. All basecalling was simplex only, and I demultiplexed and trimmed using Dorado. I then did a bit of read QC with Filtlong: first --min_length 10000 to remove short reads[6], then --keep_percent 90 to discard the worst 10% of each read set. The post-QC read sets were nice and deep (ranging from about 1 Gbp to 2.5 Gbp per genome).

I made a Trycycler ONT-only assembly from each read set, also trying Medaka when possible[7]. This resulted in seven assemblies for each genome: fast, hac, hac+Medaka, sup, sup+Medaka, res and res+Medaka. To quantify read accuracy, I aligned the ONT reads to my ground-truth genomes[8] and calculated identity from all alignments >10 kbp in length. To quantify assembly accuracy, I counted the number of differences[9] between each assembly and my ground-truth genome.
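To make the read-accuracy step concrete, here’s a minimal Python sketch of the identity-to-qscore conversion, using footnote 10’s definition with error rate = 1 − identity. The identity values and the helper name `identity_to_qscore` are made up for illustration – the real identities come from aligning the ONT reads to the ground-truth genomes:

```python
import math
import statistics

def identity_to_qscore(identity):
    """Phred-style qscore from an alignment identity in [0, 1]."""
    if identity >= 1.0:
        return float("inf")  # no errors -> no finite qscore
    return -10.0 * math.log10(1.0 - identity)

# Made-up identities standing in for alignments >10 kbp:
identities = [0.972, 0.986, 0.991, 0.964, 0.978]

mean_id = statistics.mean(identities)
median_id = statistics.median(identities)
print(f"mean:   {mean_id:.1%} (Q{identity_to_qscore(mean_id):.1f})")
print(f"median: {median_id:.1%} (Q{identity_to_qscore(median_id):.1f})")
```

Note how quickly qscores grow as identity approaches 100%: Q20 = 99%, Q30 = 99.9%, and so on.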

Results: read accuracy

This table shows the simplex read identity and qscore[10] across all nine genomes:

| average | fast | hac | sup | res |
|---------|------|-----|-----|-----|
| mean | 91.0% (Q10.5) | 96.2% (Q14.1) | 97.1% (Q15.4) | 97.3% (Q15.7) |
| median | 92.1% (Q11.0) | 97.6% (Q16.2) | 98.6% (Q18.5) | 98.7% (Q19.0) |
| mode[11] | 93.0% (Q11.5) | 98.2% (Q17.4) | 99.0% (Q20.0) | 99.2% (Q21.0) |
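The modal values above are computed as footnote 11 describes: round each identity to three decimal places and take the most common value. A minimal sketch of that round-then-count approach, with made-up identities and an illustrative helper name:

```python
from collections import Counter

def modal_identity(identities):
    """Round each identity to three decimal places and return the most
    common value, i.e. the peak of the read-identity distribution."""
    rounded = [round(i, 3) for i in identities]
    return Counter(rounded).most_common(1)[0][0]

# Made-up identity values for illustration:
identities = [0.972316, 0.972481, 0.971902, 0.985557, 0.972044]
print(modal_identity(identities))  # → 0.972
```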

Results: assembly accuracy

This table shows the error count and qscore for each assembly:

| Genome | fast | hac | hac+Medaka | sup | sup+Medaka | res | res+Medaka |
|--------|------|-----|------------|-----|------------|-----|------------|
| Campylobacter jejuni | 2539 (Q28.4) | 159 (Q40.5) | 137 (Q41.1) | 27 (Q48.2) | 73 (Q43.8) | 13 (Q51.3) | 23 (Q48.9) |
| Campylobacter lari | 2739 (Q27.4) | 208 (Q38.6) | 139 (Q40.4) | 28 (Q47.3) | 65 (Q43.7) | 18 (Q49.2) | 17 (Q49.5) |
| Escherichia coli | 5510 (Q29.8) | 110 (Q46.8) | 96 (Q47.3) | 70 (Q48.7) | 72 (Q48.6) | 46 (Q50.5) | 44 (Q50.7) |
| Listeria ivanovii | 936 (Q34.9) | 9 (Q55.1) | 6 (Q56.9) | 9 (Q55.1) | 4 (Q58.6) | 8 (Q55.6) | 5 (Q57.7) |
| Listeria monocytogenes | 795 (Q35.7) | 2 (Q61.7) | 0 (Q∞) | 0 (Q∞) | 0 (Q∞) | 0 (Q∞) | 0 (Q∞) |
| Listeria welshimeri | 668 (Q36.2) | 4 (Q58.5) | 1 (Q64.5) | 2 (Q61.5) | 3 (Q59.7) | 2 (Q61.5) | 1 (Q64.5) |
| Salmonella enterica | 4132 (Q30.7) | 45 (Q50.3) | 35 (Q51.4) | 8 (Q57.8) | 13 (Q55.7) | 5 (Q59.8) | 7 (Q58.4) |
| Vibrio cholerae | 4664 (Q29.5) | 50 (Q49.2) | 24 (Q52.4) | 22 (Q52.7) | 14 (Q54.7) | 0 (Q∞) | 4 (Q60.2) |
| Vibrio parahaemolyticus | 4202 (Q30.9) | 42 (Q50.9) | 7 (Q58.7) | 7 (Q58.7) | 17 (Q54.8) | 7 (Q58.7) | 7 (Q58.7) |
| total/average[12] | 26185 (Q30.8) | 629 (Q47.0) | 445 (Q48.5) | 173 (Q52.6) | 261 (Q50.8) | 99 (Q55.0) | 108 (Q54.6) |
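The assembly qscores here follow footnote 10’s definition, and the total/average row pools errors and genome sizes as footnote 12 describes. A minimal sketch with hypothetical numbers (`assembly_qscore` and `average_qscore` are illustrative helper names, and the 5 Mbp genome size is made up):

```python
import math

def assembly_qscore(errors, genome_size):
    """Assembly qscore per footnote 10: -10 * log10(errors / genome size)."""
    if errors == 0:
        return float("inf")  # a perfect assembly gets Q∞
    return -10.0 * math.log10(errors / genome_size)

# Hypothetical example: 27 errors in a 5 Mbp genome.
print(f"Q{assembly_qscore(27, 5_000_000):.1f}")  # prints Q52.7

# Per footnote 12, the average qscore pools errors and sizes across genomes:
def average_qscore(error_counts, genome_sizes):
    return assembly_qscore(sum(error_counts), sum(genome_sizes))
```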

Discussion and conclusions

As expected, the fast reads/assemblies were pretty rough and had a lot of errors, the hac reads/assemblies were much better, and the sup reads/assemblies were the best. You should really only use fast basecalling if you’re computationally constrained or doing an analysis that isn’t sensitive to errors, e.g. species identification.

Overall, assembly accuracy definitely improved compared to the last time I tested it. The current hac accuracy is similar to my previous sup accuracy, and the current sup accuracy is very good: fewer than 100 errors for all genomes and fewer than 10 errors for most. There are still some E. coli errors in M1.EcoMI methylation motifs and Campylobacter errors in CtsM methylation motifs, and long homopolymers (e.g. 10+ bp) sometimes have indel errors. So ONT accuracy still struggles with the same things it used to, but it’s moving in the right direction.

Regarding Medaka, I’ve been less impressed by its polishing than I used to be. It clearly helped with the hac assemblies, but it performed erratically with the sup/res assemblies, making things worse more often than better. My recommendation is therefore to use Medaka if you are assembling hac reads but skip it for sup/res reads.

Going into this, I was most curious about how Rerio’s research bacterial basecalling model would perform, and it did well! Read accuracy was slightly improved over sup, and assembly accuracy was the same or better than sup for all nine genomes. At the time of writing, this seems to be the best basecalling model for native bacterial DNA. ONT has been trying to simplify their sequencing/analysis lately[13], so I understand that they are reluctant to provide additional basecalling models in Dorado. But I hope they continue to host organism-specific fine-tuned models on Rerio, for users who want to get the most out of ONT-only sequencing.

Overall, I’m very happy with these results – they show a big accuracy improvement since earlier this year! For bacterial genomes, near-perfect ONT-only assemblies are now often possible, and truly perfect ONT-only assemblies are occasionally possible.

Read availability

1 Dec 2023 update: I got a number of requests for the FASTQs, and I’m happy to announce that they are now available on NCBI: PRJNA1042815.

Note that each isolate has multiple associated SRA runs, so check the library name to make sure you’re getting the one you want. The runs used in this blog post contain the following text in their library name: 2023-09_illumina, 2023-09_nanopore_fast, 2023-09_nanopore_hac, 2023-09_nanopore_sup or 2023-09_nanopore_res.

Also note that these are pre-QC reads, so many of the read sets are very large and have a poor N50. You therefore might want to do some read-length-based QC after downloading.
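For illustration, here’s a trivial Python sketch of what read-length-based QC means – in practice I’d use Filtlong as described in the Methods, and the helper and reads below are made up:

```python
def filter_by_length(reads, min_length=10_000):
    """Keep reads at or above min_length (reads are (name, seq) pairs)."""
    return [(name, seq) for name, seq in reads if len(seq) >= min_length]

# Toy example with made-up reads:
reads = [("read1", "A" * 15_000), ("read2", "A" * 4_000), ("read3", "A" * 12_000)]
kept = filter_by_length(reads)
print([name for name, _ in kept])  # → ['read1', 'read3']
```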

Footnotes

  1. Previously, ONT sequencing generated raw data at a 4 kHz rate (~10 samples/nucleotide). Now they use 5 kHz sampling (~12.5 samples/nucleotide), giving basecallers a bit more data to work with. 

  2. While Dorado has been available since 2022, it has only recently gotten to the point where it’s a full replacement for Guppy. It’s now the default basecaller in MinKNOW and can do barcode demultiplexing/trimming. 

  3. Trycycler, Polypolish, POLCA and manual curation. See this tutorial for more details. 

  4. The fast model is small: high speed and low accuracy. The hac model is medium: slower speed and higher accuracy. And the sup model is big: slowest speed and highest accuracy. 

  5. If you have access to the ONT community site, you can read more about it here: community.nanoporetech.com/posts/research-release-basecall

  6. Discarding all reads <10 kbp in size is quite aggressive. I wouldn’t normally do this, but the unfiltered read sets were very deep so I could get away with it here. It makes assembly easier, but it also pretty much guarantees that small plasmids will be lost. 

  7. There are Medaka models for hac and sup reads (r1041_e82_400bps_hac_v4.2.0 and r1041_e82_400bps_sup_v4.2.0). I also used the sup Medaka model for my res reads, since that basecalling model is based on sup. There isn’t a fast-read Medaka model, so I didn’t use Medaka on my fast-read assemblies. 

  8. Read accuracies were calculated using all ONT reads, i.e. before I ran Filtlong. This means there was no read QC other than barcode demultiplexing (which serves as a bit of QC because very bad reads are more likely to end up in the unclassified bin). You can therefore expect higher mean and median values if you do some read QC (e.g. discarding ‘fail’ reads or running Filtlong). 

  9. I used my compare_assemblies.py script to get a difference count. 

  10. Qscore is defined as -10 × log10(errors / genome size). I think of it like this: the qscore tens place is the number of nines in the accuracy: Q20 = 99%, Q30 = 99.9%, Q40 = 99.99%, etc. 

  11. The modal accuracy represents the peak of the read-identity distribution. To calculate it, I rounded each identity value to three decimal places (e.g. 0.972316 → 0.972) and took the most common value. 

  12. I calculated average qscores using the total number of errors across all nine genomes divided by the total size of all nine genomes. 

  13. ONT sequencing has had a lot of choices in recent years: R9.4.1 vs R10.4.1, 260 bp/s vs 400 bp/s, fast vs hac vs sup, rapid vs ligation, Guppy vs Dorado, simplex vs duplex, fast5 vs pod5, etc. Some of these options are being retired (R9.4.1, 260 bp/s, Guppy and fast5), which will hopefully simplify things going forward.