Short-read polishing of short-read assemblies
Marit Hetland recently asked me what I recommend for polishing short-read assemblies. Unicycler used to run Pilon at the end of its pipeline, but I saw many cases where it introduced errors, so I removed Pilon from Unicycler in a recent version. I developed Polypolish for short-read polishing of long-read assemblies and tested it along with other polishers, but I don’t really know how these tools perform on short-read assemblies.
So I didn’t have a good answer for Marit, and her question made me realise that some tests were in order. Hence this blog post!
Methods
I used the same five ATCC genomes described in my previous post because I had both deep reads1 and carefully curated ground-truth assemblies. I used Unicycler (which runs SPAdes) to make the short-read assemblies. For polishers, I tried POLCA and Polypolish (my two favourites), as well as Pilon (formerly part of Unicycler).
Counting errors in short-read assemblies can be tricky due to their fragmentation. It’s straightforward with single-copy contigs (multiplicity=1): just align the contig to the reference genome2 and count discrepancies3. But repeat contigs (multiplicity>1) can be trickier, as they might have different error counts depending on which repeat instance they are compared to.
I decided to count the errors in each contig’s best alignment to the ground-truth reference. For example, if a contig represents a sequence repeated five times in the genome, and it matches at least one of those instances perfectly, I consider it error-free, even if there are errors in its alignments to the other four instances.
Here is the count_errors.py
script I used:
And here are the Bash commands I ran:
Results
ATCC-10708 Salmonella enterica
Polisher | Errors | Accuracy | Result |
---|---|---|---|
none4 | 135 | Q45.45 | |
Pilon | 135 | Q45.45 | same |
POLCA | 143 | Q45.20 | worse (+8 errors) |
Polypolish | 135 | Q45.45 | same |
ATCC-17802 Vibrio parahaemolyticus
Polisher | Errors | Accuracy | Result |
---|---|---|---|
none5 | 341 | Q41.72 | |
Pilon | 348 | Q41.63 | worse (+7 errors) |
POLCA | 366 | Q41.41 | worse (+25 errors) |
Polypolish | 342 | Q41.71 | worse (+1 error) |
ATCC-25922 Escherichia coli
Polisher | Errors | Accuracy | Result |
---|---|---|---|
none6 | 1 | Q67.11 | |
Pilon | 22 | Q53.68 | worse (+21 errors) |
POLCA | 25 | Q53.13 | worse (+24 errors) |
Polypolish | 5 | Q60.12 | worse (+4 errors) |
ATCC-33560 Campylobacter jejuni
Polisher | Errors | Accuracy | Result |
---|---|---|---|
none7 | 1 | Q62.38 | |
Pilon | 1 | Q62.38 | same |
POLCA | 2 | Q59.37 | worse (+1 error) |
Polypolish | 1 | Q62.38 | same |
ATCC-BAA-679 Listeria monocytogenes
Polisher | Errors | Accuracy | Result |
---|---|---|---|
none | 0 | Q∞ | |
Pilon | 0 | Q∞ | same |
POLCA | 0 | Q∞ | same |
Polypolish | 0 | Q∞ | same |
Discussion and conclusions
It’s important to note that short-read contigs can contain errors, often due to collapsed repeat sequences and incorrect tandem repeat counts. When I developed Unicycler, I assumed that SPAdes contigs would be reliable, i.e. repeats and ambiguity would cause assembly fragmentation but not errors in contigs. So a Unicycler hybrid assembly doesn’t change the SPAdes contigs, it just scaffolds them together with long reads. But it’s now clear to me that errors can and do occur in SPAdes contigs, so those errors will also end up in a Unicycler hybrid assembly. That’s why I now recommend a long-read-first approach to hybrid assembly.8
As for the effect of short-read polishing on short-read assemblies, the results were clear: polishing either made no difference or it introduced errors. There was not a single case where polishing reduced the error count. Polypolish introduced the fewest total errors (5) and POLCA introduced the most (58). Introduced errors were mostly found in or near repetitive sequences, once again highlighting that repeats are the bane of genome assembly.9 So my recommendation is simple: don’t polish after a short-read Unicycler assembly!
Footnotes
-
130× to 700× for ONT reads, 550× to 820× for Illumina reads. ↩
-
I actually use a doubled version of the reference genome, where each circular sequence is duplicated. This allows for a contig which spans the start/end boundary to align in a single alignment. We did something similar in our long-read assembler benchmarking paper – see Figure S6. ↩
-
I counted unaligned parts plus all non-match CIGAR parts (
X
,I
andD
) as errors. I also counted each base of an insertion/deletion as an error, e.g. a 100-bp deletion counts as 100 errors. This makes my error counts and corresponding accuracy qscores very unforgiving in the case of structural errors. ↩ -
All 135 pre-polishing errors in the Salmonella assembly were in the plasmid contig, which failed to circularise in the Illumina-only assembly and also contained some errors. These seem to be related to very low Illumina read depth in the affected regions. ↩
-
The 341 pre-polishing errors in the Vibrio assembly occurred at 3 locations. The first was an
AACAGC
tandem repeat – based on long reads, there are 32 copies, but the Illumina assembly cut this down to 21 copies. The second was an inexact 134-bp tandem repeat – based on long reads, there are ~6.5 copies, but the Illumina assembly cut this down to 4.5 copies. The third was a cluster of 7 mismatches. I can’t explain this one, other than to say the Illumina reads in this region have a lot of errors. Curiously, the low-k-mer SPAdes graph (k=27) got the sequence right, but the final graph (k=127) got it wrong. ↩ -
The one pre-polishing error in the E. coli assembly was in the rRNA operon, a 7× repeat. This was collapsed down to a single contig in the Illumina-only assembly, but there is some variation in the operon, so the Illumina contig represents a consensus sequence, and that consensus is not an exact match for any of the actual instances. ↩
-
The one pre-polishing error in the Campylobacter assembly was in a G×10 homopolymer: my ‘truth’ assembly had G×10 and the Illumina-only assembly had G×11. The Illumina reads were pretty evenly split between G×10 and G×11. I decided that G×10 was better because the R10.4.1 reads supported that length and the corresponding open reading frame was longer with G×10. But there may be heterogeneity here, i.e. both lengths are correct. ↩
-
For details on long-read-first hybrid assembly, see our ‘perfect assembly’ paper and accompanying online tutorial. ↩
-
This is why long reads are so much better for assembly: they can usually span all repeats in a bacterial genome, which effectively means there are no repeats. ↩