ONT-only accuracy: 5 kHz and Dorado
About six months ago, I tested ONT-only assembly accuracy using R10.4.1 reads:
ONT-only accuracy with R10.4.1.
In that post, I found that while the median ONT-only accuracy was quite good (~Q50), it varied a lot between bacterial species. Some genomes (Salmonella and Listeria) had very few errors, while others (E. coli and Campylobacter) had many. DNA methylation motifs seemed to be the biggest cause of errors, presumably because the basecaller wasn’t trained on native DNA that included those methylation patterns.
There have been a couple of ONT developments since that post: the move to a 5 kHz sampling rate1 and the maturation of Dorado2. So I decided it was time to once again quantify ONT-only assembly accuracy on bacterial genomes.
Methods
I used the following nine genomes, sequenced by my colleagues Louise and Hasini:
- Campylobacter jejuni (ATCC-33560)
- Campylobacter lari (ATCC-35221)
- Escherichia coli (ATCC-25922)
- Listeria ivanovii (ATCC-19119)
- Listeria monocytogenes (ATCC-BAA-679)
- Listeria welshimeri (ATCC-35897)
- Salmonella enterica (ATCC-10708)
- Vibrio cholerae (ATCC-14035)
- Vibrio parahaemolyticus (ATCC-17802)
These include the same five genomes I used last time plus four additional ones. Each had deep ONT and Illumina reads, and I produced ground-truth genomes using my usual approach3.
Unlike last time, when I only tested sup basecalling, this time I tested a range of basecalling models:
- `dna_r10.4.1_e8.2_400bps_fast@v4.2.0`
- `dna_r10.4.1_e8.2_400bps_hac@v4.2.0`
- `dna_r10.4.1_e8.2_400bps_sup@v4.2.0`
- `res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation`
For brevity, I’ll refer to these as ‘fast’, ‘hac’, ‘sup’ and ‘res’. The first three are the current standard Dorado models at different levels of speed/accuracy4. The last one is interesting – it’s a sup-sized research model available on Rerio that has been fine-tuned for native bacterial DNA5. All basecalling was simplex only, and I demultiplexed and trimmed using Dorado. I then did a bit of read QC with Filtlong: first `--min_length 10000` to remove short reads6, then `--keep_percent 90` to discard the worst 10% of each read set. The post-QC read sets were nice and deep (ranging from about 1 Gbp to 2.5 Gbp per genome).
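The two Filtlong passes could be scripted roughly as follows. This is a minimal sketch rather than the exact commands used for this post: the file names and function names are hypothetical, it assumes `filtlong` is on the PATH, and it assumes the two filters are run as separate passes with an intermediate file.

```python
import subprocess


def build_filtlong_passes(raw_reads, length_filtered, final_reads):
    """Return (argv, output_path) pairs for the two QC passes (hypothetical helper)."""
    return [
        # Pass 1: drop every read shorter than 10 kbp.
        (["filtlong", "--min_length", "10000", raw_reads], length_filtered),
        # Pass 2: keep the best 90% of the remaining reads (measured in bases).
        (["filtlong", "--keep_percent", "90", length_filtered], final_reads),
    ]


def run_passes(passes):
    """Run each pass, writing Filtlong's stdout (FASTQ) to the output path."""
    for argv, output_path in passes:
        with open(output_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # File names are placeholders; this only prints the commands.
    passes = build_filtlong_passes("reads.fastq.gz", "reads_10kbp.fastq", "reads_qc.fastq")
    for argv, output_path in passes:
        print(" ".join(argv), ">", output_path)
```

Calling `run_passes(passes)` would actually execute the commands. Keeping the command construction separate from execution makes it easy to check the flags before spending time on a deep read set.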
I made a Trycycler ONT-only assembly from each read set, also trying Medaka when possible7. This resulted in seven assemblies for each genome: fast, hac, hac+Medaka, sup, sup+Medaka, res and res+Medaka. To quantify read accuracy, I aligned the ONT reads to my ground-truth genomes8 and calculated identity from all alignments >10 kbp in length. To quantify assembly accuracy, I counted the number of differences9 between each assembly and my ground-truth genome.
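As a rough illustration of the read-identity step, here is a minimal sketch that assumes the alignments are in PAF format (as produced by minimap2) and that identity is defined as matching bases divided by alignment block length. Both are assumptions on my part for the example: the aligner, the file format and the exact identity definition are not specified above, and the function name is hypothetical.

```python
def paf_identities(paf_lines, min_aln_len=10_000):
    """Yield per-alignment identity for alignments longer than min_aln_len."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        matches = int(fields[9])     # PAF column 10: number of matching bases
        block_len = int(fields[10])  # PAF column 11: alignment block length
        if block_len > min_aln_len:
            yield matches / block_len


if __name__ == "__main__":
    # Two made-up PAF records: one long alignment and one that is too short.
    example = [
        "read1\t12500\t10\t12010\t+\tchr\t5000000\t100\t12100\t11880\t12000\t60",
        "read2\t9000\t0\t8000\t-\tchr\t5000000\t500\t8500\t7900\t8000\t60",
    ]
    print(list(paf_identities(example)))
```

Only the first record passes the >10 kbp filter, so the second read contributes nothing to the identity distribution.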
Results: read accuracy
This table shows the simplex read identity and qscore10 for each basecalling model, across all nine genomes:
average | fast | hac | sup | res |
---|---|---|---|---|
mean | 91.0% Q10.5 | 96.2% Q14.1 | 97.1% Q15.4 | 97.3% Q15.7 |
median | 92.1% Q11.0 | 97.6% Q16.2 | 98.6% Q18.5 | 98.7% Q19.0 |
mode11 | 93.0% Q11.5 | 98.2% Q17.4 | 99.0% Q20.0 | 99.2% Q21.0 |
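The conversion between identity and qscore, and the modal-identity calculation described in the footnotes, can be sketched like this. The function names are mine and this is an illustration of the definitions, not the script used to build the table.

```python
import math
from collections import Counter


def identity_to_qscore(identity):
    """Convert a fractional identity (e.g. 0.99) to a Phred-style qscore."""
    if identity >= 1.0:
        return math.inf  # no errors at all
    return -10 * math.log10(1 - identity)


def modal_identity(identities):
    """Round identities to three decimal places and return the most common value."""
    rounded = [round(x, 3) for x in identities]
    return Counter(rounded).most_common(1)[0][0]


if __name__ == "__main__":
    # Modal identities from the table (fast, hac, sup, res).
    for identity in (0.930, 0.982, 0.990, 0.992):
        print(f"{identity:.1%} -> Q{identity_to_qscore(identity):.1f}")
```

For the modal row this reproduces Q11.5, Q17.4, Q20.0 and Q21.0. The mean and median rows can differ in the last digit when recomputed this way, because the percentages shown in the table are already rounded.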
Results: assembly accuracy
This table shows the error count and qscore for each assembly:
Genome | fast | hac | hac+Medaka | sup | sup+Medaka | res | res+Medaka |
---|---|---|---|---|---|---|---|
Campylobacter jejuni | 2539 Q28.4 | 159 Q40.5 | 137 Q41.1 | 27 Q48.2 | 73 Q43.8 | 13 Q51.3 | 23 Q48.9 |
Campylobacter lari | 2739 Q27.4 | 208 Q38.6 | 139 Q40.4 | 28 Q47.3 | 65 Q43.7 | 18 Q49.2 | 17 Q49.5 |
Escherichia coli | 5510 Q29.8 | 110 Q46.8 | 96 Q47.3 | 70 Q48.7 | 72 Q48.6 | 46 Q50.5 | 44 Q50.7 |
Listeria ivanovii | 936 Q34.9 | 9 Q55.1 | 6 Q56.9 | 9 Q55.1 | 4 Q58.6 | 8 Q55.6 | 5 Q57.7 |
Listeria monocytogenes | 795 Q35.7 | 2 Q61.7 | 0 Q∞ | 0 Q∞ | 0 Q∞ | 0 Q∞ | 0 Q∞ |
Listeria welshimeri | 668 Q36.2 | 4 Q58.5 | 1 Q64.5 | 2 Q61.5 | 3 Q59.7 | 2 Q61.5 | 1 Q64.5 |
Salmonella enterica | 4132 Q30.7 | 45 Q50.3 | 35 Q51.4 | 8 Q57.8 | 13 Q55.7 | 5 Q59.8 | 7 Q58.4 |
Vibrio cholerae | 4664 Q29.5 | 50 Q49.2 | 24 Q52.4 | 22 Q52.7 | 14 Q54.7 | 0 Q∞ | 4 Q60.2 |
Vibrio parahaemolyticus | 4202 Q30.9 | 42 Q50.9 | 7 Q58.7 | 7 Q58.7 | 17 Q54.8 | 7 Q58.7 | 7 Q58.7 |
total/average12 | 26185 Q30.8 | 629 Q47.0 | 445 Q48.5 | 173 Q52.6 | 261 Q50.8 | 99 Q55.0 | 108 Q54.6 |
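For reference, the assembly qscore and the pooled average in the bottom row follow the definitions given in the footnotes, which can be sketched as below. The genome sizes in the example are hypothetical round numbers chosen for easy arithmetic (the real sizes are not listed in this post), and the function names are mine.

```python
import math


def assembly_qscore(errors, genome_size):
    """Qscore = -10 * log10(errors / genome size); zero errors gives infinity."""
    if errors == 0:
        return math.inf
    return -10 * math.log10(errors / genome_size)


def pooled_qscore(error_counts, genome_sizes):
    """Average qscore computed from total errors over total genome size."""
    return assembly_qscore(sum(error_counts), sum(genome_sizes))


if __name__ == "__main__":
    # Hypothetical example: 50 errors in a 5 Mbp genome.
    print(round(assembly_qscore(50, 5_000_000), 1))
    # Hypothetical two-genome pool: 100 errors over 10 Mbp in total.
    print(round(pooled_qscore([10, 90], [2_000_000, 8_000_000]), 1))
```

Pooling errors and bases before taking the logarithm keeps the average finite even when some assemblies have zero errors, which a simple mean of per-genome qscores would not.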
Discussion and conclusions
As expected, the fast reads/assemblies were pretty rough and had a lot of errors, the hac reads/assemblies were much better, and the sup reads/assemblies were the best. You should really only use fast basecalling if you’re computationally constrained or doing an analysis that isn’t sensitive to errors, e.g. species identification.
Overall, assembly accuracy definitely improved compared to the last time I tested it. The current hac accuracy is similar to my previous sup accuracy, and the current sup accuracy is very good: fewer than 100 errors for all genomes and fewer than 10 errors for most. There are still some E. coli errors in M1.EcoMI methylation motifs and Campylobacter errors in CtsM methylation motifs, and long homopolymers (e.g. 10+ bp) sometimes have indel errors. So ONT-only assemblies still struggle with the same things they used to, but accuracy is moving in the right direction.
Regarding Medaka, I’ve been less impressed by its polishing than I used to be. It clearly helped with the hac assemblies, but it performed erratically with the sup/res assemblies, making things worse more often than better. My recommendation is therefore to use Medaka if you are assembling hac reads but skip it for sup/res reads.
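The Medaka comparison can be tallied directly from the assembly accuracy table. The sketch below copies the per-genome error counts from that table (in table order) and counts how often polishing lowered, matched or raised the error count. The `tally` helper is my own name for the illustration.

```python
# Error counts per genome, copied from the assembly accuracy table (table order).
hac = [159, 208, 110, 9, 2, 4, 45, 50, 42]
hac_medaka = [137, 139, 96, 6, 0, 1, 35, 24, 7]
sup = [27, 28, 70, 9, 0, 2, 8, 22, 7]
sup_medaka = [73, 65, 72, 4, 0, 3, 13, 14, 17]
res = [13, 18, 46, 8, 0, 2, 5, 0, 7]
res_medaka = [23, 17, 44, 5, 0, 1, 7, 4, 7]


def tally(before, after):
    """Return (better, same, worse) counts for polished vs unpolished assemblies."""
    better = sum(a < b for b, a in zip(before, after))
    same = sum(a == b for b, a in zip(before, after))
    worse = sum(a > b for b, a in zip(before, after))
    return better, same, worse


if __name__ == "__main__":
    print("hac:", tally(hac, hac_medaka))  # (9, 0, 0)
    print("sup:", tally(sup, sup_medaka))  # (2, 1, 6)
    print("res:", tally(res, res_medaka))  # (4, 2, 3)
```

Medaka improved all nine hac assemblies, but across the 18 sup and res assemblies it made nine worse and only six better.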
Going into this, I was most curious about how Rerio’s research bacterial basecalling model would perform, and it did well! Read accuracy was slightly improved over sup, and assembly accuracy was the same as or better than sup for all nine genomes. At the time of writing, this seems to be the best basecalling model for native bacterial DNA. ONT has been trying to simplify their sequencing/analysis lately13, so I understand that they are reluctant to provide additional basecalling models in Dorado. But I hope they continue to host organism-specific fine-tuned models on Rerio, for users who want to get the most out of ONT-only sequencing.
Overall, I’m very happy with these results – they show a big accuracy improvement since earlier this year! For bacterial genomes, near-perfect ONT-only assemblies are now often possible, and truly perfect ONT-only assemblies are occasionally possible.
Read availability
1 Dec 2023 update: I got a number of requests for the FASTQs, and I’m happy to announce that they are now available on NCBI: PRJNA1042815.
Note that each isolate has multiple associated SRA runs, so check the library name to make sure you’re getting the one you want. The runs used in this blog post contain the following text in their library name: 2023‑09_illumina, 2023‑09_nanopore_fast, 2023‑09_nanopore_hac, 2023‑09_nanopore_sup or 2023‑09_nanopore_res.
Also note that these are pre-QC reads, so many of the read sets are very large and have a poor N50. You therefore might want to do some read-length-based QC after downloading.
Footnotes
1. Previously, ONT sequencing generated raw data at a 4 kHz rate (~10 samples/nucleotide). Now they use 5 kHz sampling (~12.5 samples/nucleotide), giving basecallers a bit more data to work with.
2. While Dorado has been available since 2022, it has only recently gotten to the point where it’s a full replacement for Guppy. It’s now the default basecaller in MinKNOW and can do barcode demultiplexing/trimming.
3. Trycycler, Polypolish, POLCA and manual curation. See this tutorial for more details.
4. The `fast` model is small: high speed and low accuracy. The `hac` model is medium: slower speed and higher accuracy. And the `sup` model is big: slowest speed and highest accuracy.
5. If you have access to the ONT community site, you can read more about it here: community.nanoporetech.com/posts/research-release-basecall.
6. Discarding all reads <10 kbp in size is quite aggressive. I wouldn’t normally do this, but the unfiltered read sets were very deep so I could get away with it here. It makes assembly easier, but it also pretty much guarantees that small plasmids will be lost.
7. There are Medaka models for hac and sup reads (`r1041_e82_400bps_hac_v4.2.0` and `r1041_e82_400bps_sup_v4.2.0`). I also used the sup Medaka model for my res reads, since that basecalling model is based on sup. There isn’t a fast-read Medaka model, so I didn’t use Medaka on my fast-read assemblies.
8. Read accuracies were calculated using all ONT reads, i.e. before I ran Filtlong. This means there was no read QC other than barcode demultiplexing (which serves as a bit of QC because very bad reads are more likely to end up in the unclassified bin). You can therefore expect higher mean and median values if you do some read QC (e.g. discarding ‘fail’ reads or running Filtlong).
9. I used my `compare_assemblies.py` script to get a difference count.
10. Qscore is defined as -10 × log10(errors / genome size). I think of it like this: the qscore tens place is the number of nines in the accuracy: Q20 = 99%, Q30 = 99.9%, Q40 = 99.99%, etc.
11. The modal accuracy represents the peak of the read-identity distribution. To calculate it, I rounded each identity value to three decimal places (e.g. 0.972316 → 0.972) and took the most common value.
12. I calculated average qscores using the total number of errors across all nine genomes divided by the total size of all nine genomes.
13. ONT sequencing has had a lot of choices in recent years: R9.4.1 vs R10.4.1, 260 bp/s vs 400 bp/s, fast vs hac vs sup, rapid vs ligation, Guppy vs Dorado, simplex vs duplex, fast5 vs pod5, etc. Some of these options are being retired (R9.4.1, 260 bp/s, Guppy and fast5), which will hopefully simplify things going forward.