Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies

High quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences, but has high error rates, which make sequence assembly and analysis difficult as genome size and complexity increases. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have addressed eukaryotic organisms. Here, we present novel results using simulated and empirical ONT and DNA libraries to identify best practices for sequencing and assembly for several model species. We find that the unique error structure of ONT libraries causes errors to accumulate and assembly statistics plateau as sequence depth increases. High-quality assembled eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.

eukaryotic sequences require high-molecular-weight DNA extractions that increase sequence read length, and computational protocols that reduce error through pre-assembly correction and read selection. Our quantitative results will be helpful for researchers seeking guidance for de novo assembly projects.

DATA DESCRIPTION Background
Many factors affect the quality of a de novo genome assembly. Genome size increases the size of the "puzzle" to put together, while the size of the pieces (sequence reads) remains the same. Repetitive regions and mobile genetic elements create unique challenges because they are present in more than one location in the genome, so without contextual information it is difficult to identify how many copies exist in the complete genome. For example, Alu repeat elements reach >1 million copies in the human genome [1].
Third generation sequencing technologies can theoretically solve these problems. Long sequence reads span repetitive regions and can potentially identify the exact size and location of repeats on a chromosome. The larger "puzzle pieces" require less sequencing effort to span the entire genome. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are the current front-runners in long-read sequencing platforms; both are capable of average read lengths in the order of tens of thousands of base pairs and, theoretically, entire chromosomes can be sequenced in a single read [2,3]. Both are also capable of producing high-quality assembled sequences with reasonable amounts of data [4].
ONT offers several advantages over PacBio. Nanopore sequencing relies on running molecular fragments through engineered nanopores and recording the resulting alterations in electrical current. The technology is versatile and can be used for DNA sequencing, mRNA sequencing, amplification-free mRNA quantification [5], and measuring DNA base modifications like methylation [6,7]. ONT libraries can be readily prepared with low amounts of input DNA, which is an important consideration when studying small or difficult-to-sample organisms. ONT platforms are inexpensive and designed to be used in individual research laboratories and classrooms [8]. Both PacBio and ONT continually refine their technologies and, we believe, both will be a part of modern science. Similar projects have analyzed experimental design for PacBio [9,10] but the few benchmarking studies for ONT sequencing and assembly have been carried out with prokaryotes [11,12] or single organisms with large, repeat-heavy genomes [13]. For these reasons, we have chosen to study ONT and quantify how this inexpensive, accessible technology may be best utilized to produce high-quality assembled reference genome sequences. Here, we fill an important gap in the literature by addressing eukaryotic organisms with moderately sized genomes.
ONT sequence reads present unique challenges for genome sequence assembly. DNA molecules do not move through protein nanopores at a constant rate, and changes in current are a composite signature reflecting 3-5 nucleotides occupying the nanopore (for R9.4.1 flow cells). The signal processing has trouble detecting changes in current with homopolymers (single nucleotide repeats, for example AAAAAA), short tandem repeats and heavily methylated sites [14,15]. As a result, ONT sequence reads contain small nucleotide stretches that have been incorrectly identified, inserted and deleted ( Figure 1A, B). This error structure results in relatively few large, contiguous stretches of correctly identified nucleotides ( Figure 1B) for prokaryotes and eukaryotes ( Figure 1C, D) and is uniquely challenging for assembly algorithms.

Context
The typical recommendation for genome assembly is that increasing sequencing depth (i.e., the average number of times a nucleotide in the genome is sequenced) increases assembly quality and contiguity. Here, we test this by simulating extremely high depth ONT libraries for Escherichia coli (NCBI:txid562), Caenorhabditis elegans (NCBI:txid6239) roundworms, Arabidopsis thaliana (NCBI:txid3702) mouse-ear cress, and Drosophila melanogaster (NCBI:txid7227) fruit flies. For each organism, we assemble sequence read sets at different depths and measure the contiguity, completeness, and accuracy of assembled sequences relative to the current reference genome. Many de novo assembly projects target organisms without reference genomes, and we also measure the identification of a set of genes thought to be found in single copy in each organism. Pure ONT data sets result in superior assembled genome sequences, but can be expensive to generate. We also analyze assembled sequences from "hybrid" DNA read data sets comprising ONT and less expensive short Illumina sequence reads.
We use results from our simulated sequences to guide experimental ONT sequencing and assembly for four strains of the nematodes C. remanei and C. latens. For eukaryotes, both ploidy and breeding system can influence the assembly of genome sequence. Polyploidy can create scenarios in which many sites in a genome look similar, making it difficult to place these regions within an assembled genome [16,17]. For non-haploid organisms, there ONT sequence reads contain mixtures of errors including miscalled nucleotides, deletions, insertions and truncated homopolymers. When aligned to the reference genome this results in many single and multibase deletions, insertions and mismatches for (A) E. coli and (B) relatively few stretches of contiguous matching sequence that extend beyond a few nucleotides. For (C), D. melanogaster, the error profile is similar but the large, repeat-rich genome results in multi-nucleotide deletions and insertions, and again, in (D), few stretches of long, contiguous matching sequence. may be heterozygosity, and at these sites the effective sequencing coverage is cut in half. For small, non-clonal metazoans, sufficient DNA for sequencing cannot be acquired from a single individual. DNA must be extracted from pools of individuals, and -for outbreeding organisms -there may be substantial individual diversity within the pool.
We find that ONT approaches can produce contiguous assembled sequences with relatively high sequencing coverage of >60× -well above current recommendations [18].
Pure ONT sequencing and assembly outperforms the hybrid approach. We find that contiguous assemblies cannot be achieved by solely increasing ONT sequencing depth as errors accumulate and assembly statistics plateau. Pre-assembly filtering and read correction improve contiguity, and post-assembly polishing using short Illumina DNA sequence reads increases accuracy. We also find that the use of Illumina data, even at low sequencing depths, increases accuracy through iterative polishing. Our results demonstrate that rigorous experimental design can improve inference of a genome sequence and reduce the effort and cost required.

DNA extraction and sequencing
Nematodes used for genomic sequencing were grown on two 100-mm nematode growth medium (NGM) plates [19] seeded with E. coli OP50. Worms were harvested by washing plates with M9 mineral medium into 15 mL conical tubes. Samples were rocked on a tabletop rocker for 1 hour before being centrifuged to pellet worms. The supernatant was removed and tubes refilled with sterile M9 mineral medium, mixed and pelleted by centrifugation again. This process was repeated five times or until the supernatant was clear after centrifugation. The pellet was moved to 2-mL tubes and frozen at −20 °C until extraction. Worm pellets were allowed to thaw to room temperature, then flash-frozen in liquid nitrogen and repeated three times. Worm pellets were placed in 1.2 mL of lysis buffer solution (100 mM ethylenediaminetetraacetic acid [EDTA], 50 mM Tris, and 1% SDS) and 20 μL of Proteinase K (100 mg/mL). Tubes were then placed on a 56 °C heat block for 30 minutes with shaking. Genomic DNA (gDNA) was then isolated using a modified phenol-chloroform extraction [20]. DNA concentrations and purity were measured with a Qubit 4 (Life Technologies, Carlsbad, CA USA) and NanoDrop® 1000 spectrophotometer, respectively (Thermo Fisher Scientific, Waltham, MA USA). Extracts were visualized on a 0.8% agarose gel to verify high-molecular-weight gDNA. DNA size selection was carried out using the Short Read Eliminator Kit from Circulomics Inc., according to the manufacturer's guidelines (Baltimore, MD, USA). ONT reads for each strain were trimmed of adapters using Porechop (RRID:SCR_016967) [21] with the -discard_middle flag to remove chimeric reads.

Assembly and decontamination of real data sets
We evaluated contaminants in the C. remanei PX356, C. remanei PX439, C. remanei EM464 and C. latens PX534 draft assemblies with Blobtools2 [22] and SIDR [23]. Blobtools2 uses phylum-level taxonomic assignment, read coverage depth and GC content to identify non-target contigs within the assembly. To prepare data for visualization with BlobToolKit, the ONT reads were aligned to the initial assemblies using minimap2 v2.17 [24] and sorted using SAMtools v1.8 (RRID:SCR_002105) [25]. A reference database for taxonomic identification of scaffolds was created using blastn 2.2.31 (RRID:SCR_001598) [26] using the Nucleotide database from the National Center for Biotechnology Information [27]. An additional database was generated using diamond tblastx (RRID:SCR_011823) [28]. Contigs that were >10,000 bp and taxonomically identified as Nematoda were retained in the assembly.
SIDR [23] utilizes ensemble-based machine learning to train a model capable of discriminating target and contaminant contigs based on measured predictor variables. To prepare data for SIDR decontamination Illumina, DNA and RNA sequence reads were aligned to the draft assembly with bwa [29] using the BWA-MEM algorithm. We used bbtools/bbmap software (RRID:SCR_016965) [30] to calculate the average sequence coverage, length, GC, bases covered in alignment, RNA average sequence coverage and bases covered in RNA alignment for each contig. We used these data to train a bootstrap aggregated (bagged) decision tree model based on the blast identification of contig origin and used this model to predict the origin, target or contaminant of each contig. SIDR allowed us to assign probable Nematoda origin to contigs that lacked blast identification.

4/26
Pseudomonas, Serratia and Stenotrophomonas [31]. We also identified one large 4.57 Mb E. coli contig in the C. latens PX534 that had been labeled Nematoda by BlobToolKit and removed it from the C. latens PX534 assembled sequences.

Heterozygosity analyses
We assessed heterozygosity using the software Jellyfish v2.2.4 (RRID:SCR_005491) [32] and GenomeScope (RRID:SCR_017014) [33]. Jellyfish is a fast, memory efficient k-mer counting tool that searches for occurrences of unique k-mers in raw sequence reads based on user selected size. GenomeScope estimates genome heterozygosity, genome size, and repeat content using a k-mer-based statistical approach. Jellyfish was run using the default settings and a k-mer size of 21. This k-mer size was selected based on recommendation from the maker of Jellyfish and should be sufficient to obtain all unique k-mers present. A histogram file of all k-mer occurrences was created by Jellyfish and then uploaded into the online interface of GenomeScope. GenomeScope was run using default settings with a k-mer length of 21 and a read length consistent with each sample's sequencing parameters.

Evaluation
We used the software BUSCO version 4.0.1 (Benchmarking Universal Single-Copy Orthologs; RRID:SCR_015008) [34] to identify conserved gene sets. Briefly, BUSCO searches assembled DNA sequences for a set of unique genes that are expected to be conserved in single copy in an evolutionarily related group of organisms. We used BUSCO v4.0.1 [34] and the data sets Nematoda_odb10 for C. elegans, C. remanei, and C. latens; Metazoan_odb10 for D. melanogaster; and Viridiplantae_odb10 for A. thaliana. We also measured BUSCO completeness with the Diptera_odb10 for the D. melanogaster assemblies, but found that, in some instances, our assembled sequence contained a greater proportion of conserved genes than the reference sequence. We chose to focus on the Metazoan_odb10 for D. melanogaster and present both sets of statistics in the additional files in the Gigascience database [35].
BUSCO genes in each assembly were classified as single copy, duplicated, fragmented or missing. For single-copy and duplicated classifications, the target gene must be present and >95% identical to the expected size and sequence of the reference in the database [34].

Simulated sequence libraries
We obtained ONT sequences (R9 chemistry) from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) [36]. The E. coli data set contained We obtained the reference genome sequence for E. coli strain K12_MG1655 from NCBI; all other reference genome sequences were obtained from Ensembl (release 95) [37].
We simulated 150 base pair (bp) paired-end Illumina DNA libraries with the software ART (RRID:SCR_006538) [38] and ONT DNA libraries with the software NanoSim v2.0.0 (RRID:SCR_018243) [39]. Both software programs utilize an assembled sequence to generate simulated libraries with read profiles like an empirically generated library. NanoSim  [12] to study prokaryotic genome sequences.

C. remanei and C. latens sequence libraries
Output for each flow cell used for each strain is summarized in

Assembly of simulated data
We analyzed ONT libraries assembled with the Canu (RRID:SCR_015880) [18] and Flye (RRID:SCR_017016) [41] software packages. Both are well-established assemblers that consistently outperform other approaches in assembly contiguity and accuracy [11][12][13]. We used a 64-core 512 gigabyte (GB) Dell PowerEdge R730 server for all assemblies and found these resources sufficient for our target organisms and libraries. Other assembly software, such as Miniasm/Minipolish [42], wtdbg2 (RRID:SCR_017225)/redbean [43], Shasta [44] and Raven (RRID:SCR_001937) [45], prioritize computational efficiency and may be appropriate for large genomes or laboratories with computational limitations. We assembled genome sequences using three approaches ( Figure 2). We first assembled the simulated ONT read sets with Canu v1.9 [18]. Each genome was assembled at decreasing coverage depths until the assembler errored out (i.e., failed to produce an assembly) or produced an assembly far from the appropriate size for the organism. To minimize the influence of individual reads and stochastic assembly artifacts, each read set was generated by selecting a random subset of the full simulated data set. The second approach used the same subsets of ONT data but was assembled with Flye 2.8 [41].
Hybrid genome assembly combining high-accuracy Illumina short reads with ONT long reads can be performed with Unicycler [46] or MaSuRCA (Maryland Super Read Cabog Assembler; RRID:SCR_010691) [47]. Unicycler is currently only recommended for bacterial genomes, so we focus here on MaSuRCA performance. The long-read data sets that produced the most contiguous, highest accuracy genome sequences were paired with 2 ×150 simulated paired-end Illumina data and assembled using MaSuRCA v3.3.9 [48].
Coverage depths were adjusted for both data sets to better understand the effects of increasing or decreasing coverage on the final assembly. Here, we retained the ONT data set to maximize our ability to draw parallels between assembly approaches. For example, the minimum C. elegans ONT data set that assembled with Canu [18] was an average of ∼60× coverage across the genome. This read set was used in the MaSuRCA trials, with 50× and 100× Illumina coverage, respectively.
We also used Flye [41] to assemble read sets that had been corrected using the "canucorrect" function. Canu correction uses all-versus-all overlap information to correct individual reads [18]. In addition to the consensus information, Canu employs two filters to avoid biases caused by sequence quality or repeats. From read-length estimates, Canu uses the longest available reads up to the user-specified coverage depth for correction [18]. By default, this is set to 40× coverage. Assembling these data sets with both Canu and Flye allowed us to compare the effectiveness of the error correction modules within each software package. We also assembled libraries with reads selected by the Canu correction module but not subject to correction to test the influence of error correction versus read length on assembly.
The most contiguous assemblies from the long-read only and hybrid categories were error corrected using Pilon v1.23 (RRID:SCR_014731) [49] to determine the effect of short-read polishing on the accuracy of the draft assemblies. Each simulated assembled sequence was polished with the entire simulated paired-end data set. Four rounds of polishing were completed for each assembly, with statistics measured after each round with QUAST v5.0.2 (RRID:SCR_001228) [50] and BUSCO v4.0.3 [34].

Assembly of ONT nematode data
Preliminary assemblies were created for each strain by first correcting the complete library with Canu -correct (Canu v2.0) [18]. Corrected reads from Canu were then assembled using Flye 2.8 [40]. The draft assembly for each strain was polished using Pilon v1.23 [49] and Illumina paired-end reads. All data sets contained a small subset of nematode microbiota that were removed from the final assemblies [22,23].

DATA VALIDATION AND QUALITY CONTROL Assessing the quality of assembled simulated sequences
We measured genome statistics relative to the reference sequence of each organism with QUAST [50]. We assessed contiguity and accuracy of the assembled sequence through eight statistics: 1. NG50 is a size median statistic indicating that 50% of the expected assembled genome sequence (where the expectation is based on a known reference) is contained in contiguous sequences of that size or larger.
2. NGA50 is a similar size median, but indicates that 50% of the expected assembled genome sequence that aligns to the reference genome is contained in contiguous sequences of that size or larger. 3.
LG50 is the number of linkage groups or contiguous assembled sequences containing 50% of the expected assembled genome sequence.

4.
LGA50 is similar but is measured in the portion of the assembled sequence that aligns to the reference genome. (C) few mismatches; and (D) few insertion/deletion errors (indels) than the reference sequence. The influence of sequencing coverage on contiguity or fragmentation was not monotonic for Canu-or MaSuRCA-assembled sequences. The relationship between sequencing coverage and accuracy was not monotonic for Flye assembled sequences.

Escherichia coli
The E. coli genome is relatively small at 4.64 Mb and contains few repeats (87.8% of the genome codes for proteins [51]). Assembly of 50,000 ONT sequences (∼62× coverage; N50 8350 bp) with Canu [18] and Flye [41] resulted in circular contigs with high contiguity and few mismatches, insertions or deletions ( Figure 3A-D). At low ONT coverage (∼19×), the assemblies produced by Canu and Flye were contained in six and nine contigs, respectively.
Both assembled sequences had high rates of mismatches, insertions and deletions than sequences assembled from high coverage libraries ( Figure 3C, 3D). Increasing ONT coverage beyond 62× resulted in poor contiguity for Canu-assembled sequences ( Figure 3A) and increased mismatches for Flye-assembled sequences ( Figure 3C). Assembly of hybrid  Figure 3B).
After polishing the assemblies with Pilon [49], the Canu-assembled sequence had fewer mismatches and indels per 100 kilobase pairs (Kb) than the Flye-assembled sequence or an  assembly produced by Flye using reads that were first corrected with Canu ( Figure 4A-C).
Despite its dependence on higher accuracy Illumina sequences, the MaSuRCA assembled sequence had more mismatches, insertions and deletions both before and after polishing ( Figure 4B). We used MaSuRCA to assemble a set of paired-end Illumina sequences to test the influence of adding ONT data to the assembly process. The resulting sequence was 98.76% of the E. coli genome and was fragmented into 74 contiguous pieces. The accuracy, 1.7 mismatches and 0.04 indels per 100 Kb, was improved versus the hybrid assembled sequences. This indicates that the inclusion of ONT data can introduce errors in hybrid approaches that cannot be corrected with post-assembly polishing.

Caenorhabditis elegans
The C. elegans genome is 100.8 Mb [52] contained in a single X chromosome and five autosomes [53]. C. elegans is a diploid self-fertile hermaphrodite with low levels of genetic diversity [54]. Approximately 16% of the genome is repetitive [55]. Canu, Flye, and MaSuRCA-assembled sequences plateaued in contiguity and accuracy above ∼70× coverage ( Figure 5A-D). The read N50 for these data sets ranged between 20,742 and 22,840. Both Canu and Flye assembled chromosome-scale contigs with low rates of mismatches, insertions and deletions. The MaSuRCA [48] hybrid assembly approach did not perform as well, even with high ONT coverage ( Figure 5A, B).
To improve these assemblies, we experimented with polishing the Canu and Flye assembled sequences. Additionally, we used Flye to assemble the Canu-corrected data set ("Corrected" in Figure 6) and used Flye to assemble the reads selected by the Canu correction module but not corrected ("Selected" in Figure 6). The data set corrected with Canu assembly of C. elegans ONT libraries produces assembled sequences with high contiguity, low fragmentation, few mismatches and relatively more insertions and deletions that are eliminated with postassembly polishing. The performance of Flye was increased through read selection, correction, and post-assembly polishing with the highest-quality C. elegans sequence produced through Canu correction, Flye assembly and post-assembly polishing. The data set with reads selected for length did not assemble as well as that with reads selected and corrected, indicating that correction is an important step in assembly.

Drosophila melanogaster
The 180-Mb genome of the diploid fruit fly D. melanogaster presents multiple challenges for sequencing and assembly, including approximately 60 Mb of repetitive heterochromatin [58]. The genome sequence is contained in a large X chromosome, a small ("dot") Ychromosome, and three autosomes. Assemblies of the simulated ONT D. melanogaster libraries repeated patterns seen with the C. elegans data set; the most contiguous Canu assembly was produced with ∼113× ONT coverage and a read N50 of 11,679 (∼16 Gb of data; Figure 8A, B). This assembly produced 145 contiguous pieces, but many of these were small. The LGA50 was four contigs ( Figure 8B). The Flye assembly of the same data set was much less contiguous, producing 482 contigs. When the data set was corrected with Canu then assembled with Flye, the assembled sequences had high accuracy ( Figure 9).
The top-performing Canu assembly contained 91% of the metazoan genes expected to be conserved in single copy with BUSCO [34] prior to polishing, and 94% after four rounds of Pilon with paired-end reads. In comparison, the D. melanogaster reference sequence contains 94.1% of the expected single copy metazoan genes. Additional data decreased contiguity, with a slight increase in accuracy and NGA50 ( Figure 8A, D). Hybrid MaSuRCA [48] assemblies for D. melanogaster performed markedly worse than Canu assemblies of ONT data ( Figure 8B). The top-performing MaSuRCA assembly produced 220 contigs, with 93.9% of the expected metazoan genes identified.

Arabidopsis thaliana
The 135-Mb genome sequence of the self-fertile plant A. thaliana is contained in five autosomes [59]. A. thaliana has undergone at least two rounds of whole genome duplication [60] and contains large tracts of highly similar genomic regions. Assembly of the ONT libraries with Canu [18] and Flye [41] resulted in sequences with lower contiguity and higher fragmentation than the other model organisms, even when the library contained ∼420× coverage and a read N50 of 19,577 bp (56.7 Gb of sequenced nucleotides).
Canu produced 30 contiguous pieces (LGA50 five chromosomes; Figure 10) with 98.8% of the expected Viridiplantae genes [35] identified prior to polishing [34]. Flye produced 26 contigs and 98.8% of the expected Viridiplantae genes. The Flye assembly contained many more mismatches and indels than its Canu counterpart, but this discrepancy was alleviated with Pilon polishing [35]. Following polishing with Pilon [49], the Flye-assembled sequences contained 99.1% of the expected Viridiplantae genes [34], matching that of the TAIR10 reference genome for A. thaliana. Interestingly, our combined approach of long-read methods did not work with A. thaliana data sets, and neither Flye nor Canu was able to assemble a draft genome with <40× coverage. The hybrid MaSuRCA assemblies for A. thaliana performed the best out of the three eukaryotes. The top-performing MaSuRCA [48] assembly produced 45 contiguous pieces, with 98.8% of the expected Viridiplantae genes identified in the sequence [34]. This was also brought up to 99.1% after polishing with Pilon [35].

Caenorhabditis remanei and C. latens
We used our simulated data to create a protocol for de novo sequencing and assembly for three strains of the nematode C. remanei [61] and one of the closely related nematode C. latens [62]. Both C. remanei and C. latens are obligate outcrossing species with high levels of nucleotide diversity [63] that have hobbled previous assembly attempts [54,64] The worm's small size means that pools of individuals harboring both individual and population-level diversity must be sacrificed for DNA sequencing and assembly. We prepared high-molecular-weight DNA extracts and generated ONT libraries for an outbred C. remanei strain EM464, two inbred laboratory strains PX356 and PX439, and the inbred laboratory C. latens strain PX534. We tested library preparation and computational protocols with the inbred C. remanei strain PX356 ( Figure 11A-D) and used these to generate assembled sequences for the other C. remanei and C. latens strains.
The most contiguous C. remanei PX356 assembly was achieved by using Canu [18] to correct the entire ∼156× coverage ONT data set prior to assembly with Flye [41]. After correction, the ∼40× coverage data set had a read N50 of 21,340 bp and resulted in an assembled sequence contained in 163 contigs with 89.1% of the expected conserved nematode genes [34]. Decontamination with Blobtools2 [22] and SIDR [23]  We compared our results (Figures 11, 12) with a recently published sequence for the related C. remanei strain PX506 [40] generated with PacBio sequencing and chromatin conformation capture to produce chromosome-scale sequences [64].
Canu-only and Flye-only assemblies contained more contigs and had a smaller N50 value than the combined approach described above ( Figure 12A-D). The assembly graph ( Figure 13) shows several large contiguous fragments, but also multiple regions difficult to disentangle and smaller disjunct fragments. Analyses with GenomeScope [33]  ∼0.7% of the genome remains heterozygous ( Figure 14). Assemblies with subsamples of the long-read data were increasingly fragmented with decreasing coverage. Flye assemblies with initial coverages of 102×, 70×, and 52× produced 183, 242, and 267 contigs, respectively.
BUSCO identified 97.7% (increased from 89.1%) of the expected conserved single copy genes following four rounds of polishing using Pilon [49] with Illumina paired-end reads at ∼225× average depth (Figure 8). We tested the influence of Illumina coverage on polishing and found that 97.6% of the expected conserved nematode genes could be identified after polishing with just ∼20× Illumina coverage (Table 3), indicating that a large amount of data is not necessary to correct most errors in an assembly. However, this was achieved after three successive rounds of error correction with the Pilon software package [49] utilizing the same Illumina DNA sequence reads. We generated assembled genome sequences for C. remanei PX439, C. remanei EM464, and C. latens PX534 using the "best practices" protocols developed through simulations and our C. remanei PX356 experimentation. Our C. remanei PX439 data set had an ONT coverage depth of ∼120× with a read N50 of 18,233 bp. After Canu correction, the read N50 increased

DISCUSSION
We have found that Canu [18] and Flye software packages [41] produce high-quality contiguous draft assemblies from ONT libraries for eukaryotes with moderate genome sizes. In many cases, Flye slightly outperforms Canu, while being much faster and using fewer computational resources [12,44]. We found that, given the same set of corrected reads, Flye regularly produces a more contiguous, higher accuracy assembled sequence when compared with Canu. This finding highlights the need for performing multiple assembly strategies to identify the top performing strategy for each data set.
The hybrid assemblies produced by MaSuRCA [48] contained a higher proportion of expected conserved genes than the unpolished long-read only assemblies, and had fewer mismatches and insertion/deletion errors. However, given the same amount of ONT data, both Canu [18] and Flye [41] assembled more contiguous genome sequences. The long-read-only assemblies also produced larger NG50 values and contained a larger fraction of the expected genome than hybrid MaSuRCA assemblies. The shortfalls of long-read-only assembly can be overcome by "polishing" with ∼20× Illumina DNA Figure 14. Analysis with the GenomeScope software revealed residual heterozygosity in the C. remanei PX356 sequences (estimated at 0.712%). Here, the residual heterozygosity is visible as a shoulder at ∼50× sequencing coverage.  sequences using software such as Pilon [49] to error-correct the draft assemblies. We found that polishing was more important for increasing accuracy of the real-world data than the results seen with simulated data sets. These methods improved the accuracy of Canu or Flye assemblies to be on par with, or better than, those produced by MaSuRCA.
These highly contiguous assemblies were achieved with relatively high sequencing depth; often >100× coverage across the genome. Although ONT can theoretically produce megabase-sized reads, in reality, many of the sequence reads in "real" projects are shorter because of fragmentation that occurs during DNA purification and library preparation. ONT libraries may have many more small reads and >100× "real" coverage may be necessary to achieve >40× coverage with reads >25 Kb and highly contiguous assemblies. Increasing read N50 in a library through improvement of DNA isolation and library preparation methods reduces the necessary coverage to produce a high-quality assembly [66]. Therefore, researchers should adjust their coverage goals based on the quality of libraries they can reliably produce.
Our findings with simulated data sets were supported by all real-world data sets tested.
Caenorhabditis remanei PX356, PX439 and C. latens PX534 have assembled sequences generated with paired-end Illumina sequences [55]. Our assembled sequences were at least one order of magnitude more contiguous than previous attempts, highlighting the ability of ONT long reads to improve contiguity. Outbreeding Caenorhabditis are resistant to inbreeding [67] and previous assembly attempts resulted in over-assembly due to residual heterozygosity [64] or under-assembly. Despite this, we finalized a highly contiguous and complete assembly for these nematodes ( Table 4). The strains we targeted are closely related and interfertile [68], yet their assembled genome sequences vary by as much as 10% in length. This level of structural variation between closely related species highlights the need for de novo assembled sequences in molecular evolution research.
The findings we present here clearly demonstrate that existing genome sequences can be improved by adding in ONT data sets. ONT is continually increasing accuracy through library preparation, and a new "Q20" sequencing chemistry kit is currently in the hands of developers. A high accuracy "bonito" basecaller is in development [69] and new software like NECAT [70] and NextDenovo/NextPolish [71] are being developed to specifically address ONT error structures. As sequencing and assembly technologies continue to improve and reduce in cost, many researchers will find published data sets can be updated to further improve downstream analyses.

REUSE POTENTIAL
Considering our findings, we suggest that long-read data is prioritized when undertaking de novo genome assembly projects. Our results indicate that an assembly with sufficient ONT long read coverage will be highly contiguous, and polishing with Illumina data can achieve high levels of accuracy. For situations in which high-coverage ONT libraries are not feasible, MaSuRCA-assembled [48] Illumina and ONT read sets can produce reliable draft sequences. However, the quality and contiguity of the assembled sequence is determined by Illumina read depth and effort should be made to increase Illumina read depth, even if it is at the expense of ONT sequences. MaSuRCA-assembled Illumina sequences have fewer mismatches and insertion/deletion errors than MaSuRCA-assembled ONT and Illumina hybrid read sets, indicating that the inclusion of ONT sequences introduces errors. We suggest that error correction with Illumina DNA sequences and the Pilon software package [49] is a necessary finishing step in any assembly project utilizing ONT data.
Our results demonstrate that near-chromosome-level genome sequences are achievable with sufficient ONT data. However, chromosome-level genome assemblies are often not necessary to address many research questions, particularly those focused on small numbers of genes, phylogenomic information or population variation. Researchers should approach genome sequencing by first determining what genome-completion level will be sufficient for their research goals. To aid in this approach, we hope our study will help researchers determine the amount of sequencing effort, and the sequencing approaches, that will best suit their needs.