Long-read assemblies reveal structural diversity in genomes of organelles – an example with Acacia pycnantha

Organelle genomes are typically represented as single, static, circular molecules. However, there is evidence that the chloroplast genome exists in two structural haplotypes and that the mitochondrial genome can display multiple circular, linear or branching forms. We sequenced and assembled chloroplast and mitochondrial genomes of the Golden Wattle, Acacia pycnantha, using long reads, iterative baiting to extract organelle-only reads, and several assembly algorithms to explore genomic structure. Using a de novo assembly approach agnostic to previous hypotheses about structure, we found that different assemblies revealed contrasting arrangements of genomic segments; a hypothesis supported by mapped reads spanning alternate paths.

genome (mitome) is ∼800 kbp and probably exists in multiple dynamic structures. Although the mitome has traditionally been represented as a single circular structure, there is physical evidence of multiple shapes [2]. Long sequencing reads indicate several linear, branched, or smaller circular structures [3,4] that may recombine at repeat regions [5].
Long sequencing reads may be able to span repetitive regions (depending on length) and should better capture their placement in assemblies, thereby revealing more of the structural complexity in both plastomes and mitomes. Presently, long reads generated using technologies such as Oxford Nanopore and Pacific Biosciences (PacBio) have a higher error rate than short reads from Illumina. Thus, reads from both can be combined when assembling genomes: long reads can reveal broad organelle genome structure, and short reads can correct errors. This hybrid approach has demonstrated improved accuracy in organelle genome assembly [6].
Recent work using these combined technologies has been highly successful in revealing new information about organelle genome structure. Long reads provide strong evidence that the plastome exists in two structural haplotypes in equal proportions across land plants [7], which supports certain theories of recombination. Gymnosperm mitome assemblies based on long reads reveal considerable complexity in mitome structure, and branching may be related to recombination processes [4].
Here, we used long (Oxford Nanopore) and short (Illumina) sequencing reads to assemble the plastome and mitome of Acacia pycnantha (Golden Wattle, Australia's floral emblem; NCBI:txid880440). The iconic, economically important Acacia genus has more than 1000 species, yet currently, long-read organelle assemblies are lacking. Data from nuclear ribosomal DNA and plastomes have provided a good basis for phylogenetic investigation [8]. Three plastomes and one mitome for the genus are available in RefSeq [9], and 94 additional partial assemblies (incomplete, with gaps in non-coding, repeat rich regions) in the National Center for Biotechnology Information (NCBI) database, many from Williams et al. [8]. These new long-read assemblies will complement and expand our knowledge of Acacia organelle structures.
To further facilitate exploration of genome structure and to limit the introduction of bias or errors, we assembled sequencing reads de novo, rather than mapping them to an existing assembly. Analysis steps are automated in reproducible scripts with freely available tools.

Obtaining organelle reads
We used an iterative approach to extract organelle-only reads from full genomic read sets containing a mixture of nuclear, mitome and plastome reads. First, gene coding sequences from related Acacia species were used as baits to extract organelle Nanopore reads. These reads were assembled, and the assembly itself used as bait for repeat organelle read extraction from the full Nanopore read set. These reads were assembled, and this second assembly was used as bait to extract short Illumina reads from the full Illumina read set. The short reads were then used to polish the assembly (Figure 1). Additional assemblies were completed in different configurations as discussed in more detail below.

Assembling organelle genomes
Short-read technologies have provided the dominant source of organelle sequencing read data for the past 10 years. Therefore, by necessity, assembly tools have also been based on reads of these lengths and fidelity, and have typically relied on mapping the reads to existing assemblies. More recently, the tools NOVOPlasty (RRID:SCR_017335) [1] and GetOrganelle [10] have been developed using iterative baiting algorithms to perform de novo organelle assemblies, and have improved accuracy. However, they are not configured to work with long-read data, which differs not only in read length, but in read length variability and a higher error profile [11]. Since we specifically wanted to investigate assemblies based on long reads, we therefore used assemblers optimized for these types of data.

Choice of assembly tools
Organelles and bacteria have similar genomes because they probably have shared ancestry [12]. In assembling organelle genomes, it is therefore appropriate to consider methods used for bacterial genome assembly, which is an area of active research because accurate assemblies underpin many areas of public health. Recent benchmarking of bacterial assembly tools for long reads found the best performers to be Flye (RRID:SCR_017016) [13], Raven (RRID:SCR_001937) [14] and Miniasm [15] and Minipolish [16], but no single tool performed best on all metrics such as reliability, circularization, errors, or completeness [16].
Here, we used a combination of these well-tested assemblers, which also capture the diversity of algorithms in current use. For example, Flye uses approximate repeat graphs; Miniasm is a true Overlap-Layout-Consensus (OLC) method and only outputs unitigs; and Raven combines an OLC method with improved graph cleaning by removing unsupported overlaps. All are designed to work with "noisy" long reads such as Oxford Nanopore sequences. Because we also include short reads in this analysis, we assembly by Unicycler [17] -a hybrid method using an initial short-read assembly followed by long-read scaffolding ( Figure 1).

Annotations
This analysis is primarily concerned with establishing first-pass assemblies for Acacia organelles, using long reads to explore structural configuration. Annotation is the next step to further investigate gene and feature content, arrangement, loss or duplication, transfer between organelles and nuclear genomes, and comparison with related and more distant species. In appreciation of the complexity of this, and the need for domain-specific knowledge to best produce a useful annotation, this work does not attempt to present a complete or final annotation of these organelle genomes. However, basic annotations are presented to provide an initial visualization of the feature landscape of these organelles.
We used GeSeq [18,19] to produce annotations, which is automated, reproducible, and has been used successfully for plant mitomes and plastomes [20,21]. Future editing and refinements of these annotations are expected, and will no doubt improve the interpretation of structural and physiological processes. A continuous iterative process of refinement of assemblies, then annotations, then assemblies, and so on, would indeed be beneficial, particularly in the study of non-model organisms.

Read trimming and filtering
Raw, uncorrected Nanopore reads were used because correction can introduce artificial consensus sequences, and because we have additional, more accurate data (Illumina) available for correction. To ensure Illumina reads were as accurate as possible for the correction step, we used fastp version 0.20.0 (RRID:SCR_016962) [23] to filter and trim reads, with the following settings: discard read or pair if more than 3 Ns; require min length 130; require average quality 35. This reduced the number of read pairs to ∼410 million (Table 1).

Extraction and assembly of organelle-only Nanopore reads: round 1
To extract organelle-only reads from the full read sets, we used known sequences from related taxa as "baits". For the plastome, we used three coding sequences from Acacia ligulata (NC_026134.2) in FASTA nucleotide format. We chose the genes rbcL, matK and ndhF because these are all likely to be plastid-only genes and are also well conserved. The rbcL and matK genes are usually located at either end of the large single-copy (LSC) region, and ndhF is usually in the small single-copy (SSC) region; these are well spaced around the plastid so that long reads should be extracted with roughly even coverage. As the mitome is much larger than the plastome, we used all 38 of the coding sequences from the mitome of Acacia ligulata (NC_040998.1). We mapped the raw Nanopore reads (∼5.5 million) to the baits with minimap2 version 2.17 (RRID:SCR_018550) [24] and used SAMTOOLS version 1.9 (RRID:SCR_002105) [25] to extract mapped reads. We then used Filtlong version 0.2.0 [26] to keep only the longest of the extracted reads up to a coverage of 250×, because assembly becomes more fragmented or impossible when coverage is too high (indeed, preliminary tests confirmed this with our data). For the plastome, we extracted ∼28,000 reads, downsampled to 901 reads (longest ∼121 kbp). For the mitome, we extracted ∼14,000 reads, with no downsampling because coverage did not meet the cut-off (250×) (longest ∼105 kbp) ( Table 1). Extracted Nanopore reads were assembled with Flye version 2.8 [13] and the assembly was polished with two rounds of Racon version 1.4.11 (RRID:SCR_017642) [27].

Extraction and assembly of organelle-only Nanopore reads: round 2
We used the first assembly as the bait file for the next round of extracting organelle reads from the original full read set. In Minimap2, we set a minimum match value to 5000, as preliminary tests showed that more leniency here resulted in too many reads being extracted to assemble properly. Again, we kept only the longest reads to a target coverage of 250×. From the ∼5.5 million raw reads, for the plastome, we extracted ∼70,000 reads (approximately twice as many as in round 1), downsampled to 864 reads (longest ∼121,000 bp; the same as round 1). For the mitome, we extracted ∼14,000 reads (similar to round 1), downsampled slightly to ∼12,000 reads (longest ∼105 kbp; same as round 1) ( Table 1). As in the first round, these reads were then assembled with Flye and polished with two rounds of Racon. In testing, further rounds of Racon polishing made little difference.

Extraction of organelle-only Illumina reads
Using the round 2 assembly as baits, we then extracted organelle-only reads from the filtered and trimmed Illumina reads (∼410 million read pairs). The extracted read sets were then randomly downsampled to a coverage of 250× using Rasusa version 0.2.0 [28]. For the plastome, this resulted in ∼26 million read pairs, downsampled to ∼130,000 read pairs; for the mitome, this resulted in ∼23 million read pairs, downsampled to ∼670,000 read pairs (Table 1).

Polishing the assembly with Illumina reads
The round 2 assembly was then polished with the extracted, downsampled Illumina reads, using two rounds of Pilon version 1.23 (RRID:SCR_014731) [29,30], with a mindepth of 0.5 and fix set to bases (not contig breaks).

Unicycler assembly
Using both the extracted Illumina and Nanopore reads, we used Unicycler version 0.4.8 [17] to perform a hybrid assembly. Unicycler first performs a short-read only assembly using Spades (RRID:SCR_000131) [31], and scaffolds this with long reads.

Assembly comparisons and verification
Assembly graphs were visualized with the Bandage tool GUI version 0.8.1 [32]. In particular, we used the BLAST v2.2.21 (RRID:SCR_001598) [33] tool within Bandage to compare assemblies. After loading a genome graph, a local BLAST database was built, and the query assembly file was compared; the assembly graph was then colored by BLAST hits. We made several comparisons: comparing each assembly to the Unicycler assembly, and comparing the Unicycler assembly to three closely related taxa from NCBI reference genomes. Further read-mapping was done to verify that long reads spanned multiple alternate structures.

RESULTS Overview
Plastome or mitome reads were extracted from the full read sets of Nanopore and Illumina data ( Table 1). Short reads were assembled with Spades within Unicycler. Long reads were assembled with Flye, Miniasm and Raven, and assemblies were polished with long reads and then short reads. A hybrid assembly was performed with Unicycler. Assembly statistics are shown in Table 1, and summarized in Tables 2 and 3. Supplementary files available at Zenodo include assemblies, assembly graphs, and annotation files [40].

Plastome short-read assembly
Despite being based only on short-reads, the plastome assembly is fairly well-resolved:

Plastome long-read assembly
As expected, this assembly is more fully resolved than the short-read assembly (    Table 1 for all statistics).

Plastome hybrid assembly
The hybrid assembly by Unicycler resolved the plastome assembly into a single circle, of length 173,902 bp, which is very similar to the long-read polished assembly size 173,695 bp (Table 2, Figure 4). As Unicycler is designed to work well for hybrid read sets like this, and  can make use of short-read accuracy and long-read bridging, we consider this the best representation of the plastome in this analysis. Although this assembly is resolved into a circle, we keep in mind that there are probably two orientations of the SSC placement [7] and suspect that the long-read assembly alone does not call consensus on this ambiguity.

Plastome assemblies with other tools
Plastomes were also assembled using two other tools. Long reads were assembled with Miniasm, then polished with Minipolish and Pilon, producing an assembly of ∼275 kbp (Table 1, Figure 5). This assembly is much longer than both the Flye and Unicycler assemblies. Miniasm makes unitigs using the overlap-layout method, but with no consensus step. Here, because of either sequencing error and/or the multiple SSC orientations, it has probably assembled very similar regions that have not been collapsed into a consensus. BLAST reveals that this is probably the case, because almost the entire Miniasm assembly matches the Unicycler assembly ( Figure 5). To better visualize the components of this assembly, we used BLAST to find locations of the LSC, SSC and IRs, taken from the Flye assembly in Figure 3 ( Figure 6). Here, we can see that Miniasm has assembled reads into two contigs, one of which is almost the entire plastome, but that there is some ambiguity in the overlap of the other contig. Long reads were also assembled with Raven, then polished with short reads and Pilon, producing a single contig of ∼210 kbp (Table 1, Figure 7). Raven uses OLC in a slightly different way to Miniasm, and then includes a consensus step using Racon. Thus, because it is using OLC, this assembly is longer than the Flye/Unicycler assemblies, but because it includes a consensus step, it is shorter than the Miniasm assembly. Again, to better visualize the components of this assembly, we used BLAST to compare it to the LSC, SSC and IR regions taken from the Flye assembly in Figure 3 (Figure 8) where we can see that the IR has been assembled approximately three times, and the SSC twice.

Plastome draft annotation
The draft annotation (Figure 9) is a visual first-pass approximation of the gene and feature content, rather than a highly polished finished annotation. Supplementary files, including  GenBank and GFF3 formats of this annotation are available for researchers to further explore this annotation [40].

Plastome summary
Based on the Unicycler assembly here, and annotations by GeSeq, the assembly of the Acacia pycnantha plastome is broadly similar to previous results found in other Acacia species (Table 4)    Acacia ligulata statistics are from [41]; Acacia dealbata statistics are from [42]. Acacia pycnantha statistics are derived from the GeSeq annotation, visualized in OGDRAW.

Mitome short-read assembly
The mitome assembly based on short reads has 161 contigs, ranging from 127 bp to ∼55,000 bp in length, giving a total length of 823,167 bp ( Table 3). As expected, the assembly graph shows some unresolved ambiguity, at least one dead end, and several very small fragments ( Figure 11).

Mitome long-read assembly
The long-read assembly of the mitome is a vast improvement over the short-read assembly ( Table 3) in terms of contig lengths and contiguity. The number of contigs has reduced from 161 to 6, the shortest contig has increased in size from 127 bp to ∼43 kbp, and the longest has increased from ∼55 kbp to ∼392 kbp. Total length has increased from ∼823 kbp to ∼906 kbp. The assembly graph is much less tangled: there are two possibly circular segments of ∼93 kbp and ∼108 kbp, and the remainder forms a single structure, albeit with    Contigs are colored according to their match with the hybrid Unicycler mitome assembly, using the BLAST tool within Bandage. Labels show contig lengths and depths. This graph is unpolished; contig sizes differ slightly after polishing with both Racon and Pilon.

Mitome hybrid assembly
The mitome hybrid assembly produced by Unicycler is well resolved, with some apparent improvements over the long-read assembly by Flye (Table 3, Figure 13). The number of contigs has decreased from 6 to 4, the shortest contig has increased in size from ∼43 kbp to ∼54 kbp, and the longest contig from ∼392 kbp to ∼579 kbp. Total size has decreased from ∼906 kbp to ∼818 kbp. There are three apparent circular segments of sizes ∼93 kbp, ∼93 kbp and ∼54 kbp.
As with the plastome assembly, based on the strengths of Unicycler in working with hybrid read sets, we consider this Unicycler assembly the best representation of the mitome in this analysis. However, by also considering the long-read Flye assembly, we can explore the complexity of this genome structure further. The Flye assembly joins a longer section together, indicating how a particular segment (shown in red) may be integrated. Although the Flye assembly is substantially longer than the Unicycler assembly, a BLAST comparison ( Figure 12) shows that all components match well to the Unicycler assembly, indicating that additional length may be from a similar repeat region that has not been collapsed. The Flye assembly also shows that one of its circular segments (shown in black) is twice the size of the similar segment in the Unicycler assembly, which again suggests a repeat region that has not been collapsed. In the Unicycler assembly, BLAST shows that this segment matches to a related Acacia species ( Figure 17). However, whether these repeat regions are truly independent and should be collapsed is unclear, demonstrating the complexity of this structure.

Mitome assemblies with other tools
Mitomes were also assembled with two other tools. Long reads were assembled with Miniasm, then polished with Minipolish and Pilon, producing an assembly of ∼983 kbp (Table 1, Figure 14). As with the plastome results, this assembly is much longer than the Flye and Unicycler assemblies, which is expected as Miniasm has no consensus step. Using BLAST to compare this assembly with that produced by Unicycler, all sections match the Unicycler assembly ( Figure 14).
Long reads were also assembled with Raven, then polished with short reads and Pilon, producing an assembly of ∼935 kbp (Table 1, Figure 15). As with the plastome results, this Raven assembly is shorter than the Miniasm assembly, but is longer than the Flye/Unicycler  assemblies because of the algorithm employed. A BLAST comparison with the Unicycler assembly confirms that all sections match ( Figure 15).
An interesting result from the Minisam and Raven assemblies is the placement of a particular segment, colored in black. In the Unicycler assembly, this segment is circularized, but these assemblies indicate how the segment may join to other parts of the genome (Figures 14, 15).

Mitome draft annotation
The draft annotation ( Figure 16) is presented to provide a first-pass visualization of gene and feature content. To further explore this annotation, supplementary files available at Zenodo [40] include GenBank and GFF3 formats of this annotation.

Mitome summary
Based on the Unicycler assembly, the assembly of the Acacia pycnantha mitome is 818,342 bp in length, and may be arranged in a long linear piece and three smaller circular segments. Alternative assembly results suggest that some circular segments can be incorporated into the larger structure. In comparison, the closest sequenced relative, Acacia ligulata is substantially shorter, with a total length of 698,138 bp [43]. This was assembled with short Illumina reads into 10 contigs, followed by manual editing and joining. is much non-matching sequence in the Acacia pycnantha assemblies, increasing in concert with phylogenetic distance (Figure 17).

Mitome structural variation
A major aim of this study is to understand more about multiple structures of an organelle genome that may exist simultaneously. Although various assembly results suggest this, we performed an extra manual check to deduce whether long reads would span multiple structures. To do this, we specifically looked at the location of the red contig, with a size of ∼90 kbp. In the long-read Flye assembly (Figure 12), this contig (=edge 6) is integrated into the large blue connected component, and located between the repeat region of 2,770 bp (=edge 1). However, in the Unicycler assembly ( Figure 13), the red contig is excised as an independent circular contig, and a small fragment is also present in the long linear contig.
The question is: can both these structures exist? Do long reads support both configurations? The two paths to compare in Figure 12 are a structure that includes edge 6 (edge 8-edge 1-edge 6-edge 1-edge 7) and a structure that excludes edge 6 (edge 8-edge 1-edge 7). We mapped the long reads to these paths to visually inspect whether there was support for both assemblies. To do this, we drew the Flye assembly in Bandage in double mode ( Figure 18), extracted nodes in the two paths of interest (keeping direction consistent), reduced the lengths of the outer contigs (edges 7 and 8) to only 10,000 bp, and combined them into a single FASTA file of paths. Then, we used this FASTA file as bait to extract matching reads from the full mitome Nanopore read set with minimap2 [24], and visualized the bam track (available at Zenodo [40]) in JBrowse [44] in Galaxy [45]. This confirmed that long reads span both paths and thus both structural configurations, where the ∼90 kbp red A. E. Syme et al.

Figure 18.
Tracing alternate paths through the mitome assembly graph, to include or exclude edge 6. Graph is drawn in Bandage in double mode, so the path directionality is maintained when nodes are extracted.
contig can be incorporated into the larger mitome structure ( Figure 12) or not ( Figure 13).

DISCUSSION
Long sequencing reads are becoming the "new normal" for genome assembly projects, providing new ways to investigate structural complexity. In this work, we successfully extracted organelle-only reads from full nuclear + organelle read sets, and assembled the reads under various algorithms in well-tested tools. In this case, we consider that the hybrid short-and long-read assembly produced by Unicycler gave the best representations of the Acacia pycnantha mitome and plastome. Draft annotations have been presented from the GeSeq tool.
Additional assemblies have suggested the existence of multiple mitome configurations; a hypothesis supported by long reads that span alternate assembly graph paths. This builds on an increasing body of work that refutes the existence of a single, static, circular mitochondrial genome [3,4,46,47].
There are many avenues to explore to improve both assemblies and annotations, so we consider the assemblies presented here as "version 1". New technologies, such as PacBio HiFi sequencing, are improving long-read fidelity. Oxford Nanopore raw sequencing data can benefit from being re-basecalled with new tools [48], and trained on relevant taxonomic data [49]. Long-read specific assemblers are continually optimized, particularly to error profiles, and there is a large focus on improving the assembly of repeat regions [50].
One option to explore in further research is that multiple structures may be better assembled via metagenomic approaches. Alternate structures could be thought of as part of a metagenomic pool, and reads clustered and assembled accordingly, considering that abundances of alternate forms may not be equal.

19/23
In either case, the increased use of long reads and de novo assembly will further improve organelle assemblies and pave the way for fuller genomic comparison across species.

AVAILABILITY OF SOURCE CODE AND REQUIREMENTS
A copy of the assembly script (assembler.sh) is available in Zenodo [34] and in GitHub [40], with instructions on how to run the script and the required inputs and tools. The latest release (v1.1) contains the MIT license.

DATA AVAILABILITY
Raw sequence data has been deposited in NCBI under BioProject PRJNA752212. Extracted organelle data is available at Zenodo [51]. Supplementary files also available at Zenodo [40] include, for each organelle genome: 14 assemblies in fasta format, and associated GFA format file if available (not all stages produce this file), as well as the Spades GFA from Unicycler. For each Unicycler assembly there is a set of annotation files that include GenBank and GFF3 formats, and outputs from HMMER, ARAGORN, and tRNAscan-SE. There is also a bam file of reads mapped to alternate assembly paths for the mitome to explore the 90 kbp contig of interest.

ETHICAL APPROVAL
Not applicable.