A high-quality draft genome for Melaleuca alternifolia (tea tree): a new platform for evolutionary genomics of myrtaceous terpene-rich species

The economically important Melaleuca alternifolia (tea tree) is the source of a terpene-rich essential oil with therapeutic and cosmetic uses around the world. Tea tree has been cultivated and bred in Australia since the 1990s. It has been extensively studied for the genetics and biochemistry of terpene biosynthesis. Here, we report a high quality de novo genome assembly using Pacific Biosciences and Illumina sequencing. The genome was assembled into 3128 scaffolds with a total length of 362 Mb (N50 = 1.9 Mb), with significantly higher contiguity than a previous assembly (N50 = 8.7 Kb). Using a homology-based, RNA-seq evidence-based and ab initio prediction approach, 37,226 protein-coding genes were predicted. Genome assembly and annotation exhibited high completeness scores of 98.1% and 89.4%, respectively. Sequence contiguity was sufficient to reveal extensive gene order conservation and chromosomal rearrangements in alignments with Eucalyptus grandis and Corymbia citriodora genomes. This new genome advances currently available resources to investigate the genome structure and gene family evolution of M. alternifolia. It will enable further comparative genomic studies in Myrtaceae to elucidate the genetic foundations of economically valuable traits in this crop.

production of tea tree oil is one of the more important industries based on an essential oil from a myrtaceous species in Australia and overseas [1].

Context
Here we report a de novo high quality draft assembly of the M. alternifolia genome, using the short sequencing reads from an earlier draft genome [6] together with newly generated Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing reads. This new draft assembly resulted in a genome size of 362 megabase pairs (Mb), which was close to the size of the previous assembly (356.5 Mb) [6] and the physical size estimated from flow cytometry (357 Mb) [7]. The overall assembly statistics were improved by using longer sequencing reads; the N50 of assembled scaffolds was 1.9 Mb, which is about 214 times higher than the N50 achieved by Calvert et al. [6]. The BUSCO (benchmarking universal single-copy orthologs) score for the genome, which assesses the presence of single-copy orthologs [8], was also increased from 86.3% to 98.1%. The subsequent gene annotation led to more than 37,000 genes being predicted, and a high level of ortholog completeness (89.4%) was confirmed for the predicted proteome. Furthermore, the organisation of genes could be resolved in more detail, with many long scaffolds showing synteny to Eucalyptus grandis chromosomes.
This new genome sequence for tea tree should meet our aim to generate a resource for comparative studies with other Myrtaceae species to investigate mechanisms of genome evolution in this group. We are especially interested in mechanisms underlying gene family evolution and, in particular, the terpene synthase (TPS) gene family. This gene family is responsible for the final stages of terpene biosynthesis and is highly diversified in plants [9].
Tandem duplication is thought to be an important mechanism for gene family evolution in the TPS family, and more broadly for adaptive genes in long-lived eucalypts and other tree groups [10]. An earlier analysis of the TPS gene family in M. alternifolia revealed relatively Protocol providing additional information for the creation of a high-quality draft genome for Melaleuca alternifolia (tea tree) [12]. https://www.protocols.io/widgets/doi?uri=dx.doi.org/10.17504/protocols.io.bwi2pcge few TPS compared with other myrtaceous species [6]. This new, more contiguous genome will allow more confident exploration of questions on gene family size and microsynteny analysis in tea tree.

DNA extraction
Fresh, young foliage was collected from the reference genotype SCU1 of Melaleuca alternifolia, the same individual used by Calvert et al. [6] for Illumina sequencing. To yield high-quality DNA for PacBio sequencing, the cetrimonium bromide (CTAB) extraction protocol from Healey et al. [11] was used, with modifications as mentioned in the protocols.io protocol ( Figure 2) [12].
The quantity of extracted DNA was measured fluorometrically with a Qubit BR assay (Invitrogen), and the quality was assessed using a Nanodrop (Thermo Fisher Scientific).
Furthermore, the integrity of the DNA was examined on a 0.5% agarose gel.

DNA sequencing
The DNA sample was sent to the Ramaciotti Centre for Genomics (University of New South Wales, Sydney, Australia). Here, further quality control was undertaken using Pippin Pulse gel electrophoresis (Sage Science). DNA was concentrated to 200-250 ng/μl using AMPure PB beads (Pacific Biosciences) so that it was suitable for library preparation, where the aim was to generate DNA fragments of 20-50 kilobase pairs (Kb). The gDNA fragments were sequenced using two PacBio Sequel single-molecule real-time (SMRT) cells. Raw sequencing reads are available from the National Center for Biotechnology Information (NCBI) BioProject PRJNA702189.

Genome assembly
Before assembly, all reads were mapped to the reference sequence of the E. grandis chloroplast (GenBank accession MG925369.1) using minimap2 v2.17-r941 (RRID:SCR_018550) [13]. Reads aligning with more than 80% of their length were filtered out and the remaining reads were used for the nuclear genome assembly.
With the current advances in sequencing technologies, many tools have been developed to meet the demand for long-read assemblers, using different assembly algorithms [14,15].
Furthermore, the selected assembly was screened for contamination using BlobTools  [25]. The required coverage files were created with minimap2 and SAMtools v1.9 (RRID:SCR_002105) [26]. Post contaminant removal, the final genome assembly was deposited at NCBI GenBank with accession number JAGKPW010000000.
The heterozygosity of the genome was calculated by aligning the Illumina paired-end libraries to the final assembly using minimap2. The alignment files were assigned to read groups with Picard v2.23.8 (RRID:SCR_006525) [27], and variants were called using the Genome Analysis Toolkit (GATK v4.1.9.0, RRID:SCR_001876). The heterozygosity ratio was calculated as the number of heterozygous sites versus the total number of nucleotides in the assembly.

Gene prediction
The Fgenesh++ v7.2.2 pipeline (RRID:SCR_018928) [28] was used to predict genes in the assembled scaffolds. The NCBI non-redundant plant protein database (provided by Softberry), the Eucalyptus grandis gene matrix (purchased by the Australian BioCommons), and RNA-seq spliced read alignments were provided as evidence. For the RNA evidence, the M. alternifolia RNA-seq data from three other individual trees of the same chemotype (BioProjects PRJNA388506; BioSamples SAMN07178263, SAMN07178261, SAMN07178248) were downloaded and converted to fastq format with the NCBI sequence read archive (SRA) toolkit. They were subjected to quality control using FastQC (RRID:SCR_014583) and trimming with Flexbar (RRID:SCR_013001) [29], following the method by Padovan et al. [30].
They were then aligned to the genome using ReadsMap (v1.10.1). Upon completion of the Fgenesh++ run, the predicted genes were filtered to remove incomplete gene models  [32] and AGAT v.0.5.1 [33]. With InterProScan, a similarity-based approach was used to screen predicted proteins for sequences listed in the PfamA database (RRID:SCR_004726) [34]. Furthermore, using the taxonomy search of the Pfam database [35], a list of protein families known to be present in eudicotyledons was created to assess which InterProScan results contained those eudicot protein families.

DATA VALIDATION AND QUALITY CONTROL A de novo genome assembly for Melaleuca alternifolia
The modified DNA extraction protocol yielded high-molecular-weight (HMW) DNA of high purity (260:280 ratio = 1.86, 260:230 ratio = 2.01). As confirmed by Pippin Pulse gel electrophoresis, gDNA was highly intact with a size greater than 40 Kb. Subsequent PacBio sequencing resulted in 1,831,348 reads, with a GC content of 42%, individual sequence lengths of between 50 and 71,437 bp and a total length of 19,732,598,005 bp. Based on a flow cytometry genome size estimate of 356.97 Mb, this implies a sequencing depth of around 55×.
After sequencing, the tea tree genome was assembled with the three different tools Canu, Flye and MaSuRCA. For the MaSuRCA assembly, Illumina sequencing reads [6] were included, which covered the genome more than 300×. Consequently, the Canu and Flye For each of the three assemblies, BUSCO analysis was conducted to assess their completeness of orthologous sequences. The MaSuRCA assembly had the highest score (98.1%) for complete BUSCO groups, followed by Canu (97.3%) and Flye (96.9%) ( Table 1).
The MaSuRCA assembly also contained the fewest fragmented and missing reference genes (1.9%), but the lowest proportion of duplicated orthologs was found in the Flye assembly (3.1%) ( Table 1). Unlike the other assemblies, the Canu assembly had a notably higher percentage (24%) of complete but duplicated BUSCOs. Taken together with the notably larger total assembly length for the Canu assembly (451 Mb) relative to the other approaches, this high percentage of duplicated sequences suggests that Canu might have assembled more than one haplotype [51]. These findings coincide with recently reported assembly comparisons, in which Canu resulted in larger than expected assembly sizes containing uncollapsed haplotypes [52,53]. In the M. alternifolia assembly, this might be explained by the poor resolution of haplotypes in highly heterozygous regions, as it was also observed by Guiglielmoni et al. [53]. Furthermore, the propensity of PacBio reads to contain higher error rates [54] might have contributed to this outcome. An increased sequencing depth followed by the phasing of haplotypes with specific tools should be able to resolve these errors [16].
It is possible that MaSuRCA exceeded the other two tools owing to the increased sequencing depth achieved by including Illumina reads. Canu and Flye both reportedly perform well with a sequencing depth of 50× or less, but their performance is expected to  improve further with increasing coverage [16,17]. Particularly for repeat-rich plant genomes, higher PacBio coverage might be required for long-read-only assemblies. Given its higher contiguity and completeness over the other assemblies, MaSuRCA was used for further analyses.
BlobTools was used to screen for contaminating sequences from other species. Sixty

Fgenesh++ prediction results
A total of 52,135 protein-coding genes were predicted by Fgenesh++, of which 23,920 were predicted based on protein evidence (high sequence similarity to NCBI nr-protein plant database), with the remaining genes being predicted de novo. A total of 2170 genes had missing start or stop codons. A further 12,739 genes were excluded because of overlaps with transposable elements, resulting in a filtered set of 37,226 complete gene models.
BUSCO assessment of the translated sequences revealed an overall high level of ortholog completeness in the proteome after filtering the predictions (89.4%), with most of them being single-copy (Table 3). Furthermore, a screen against the reference database of PfamA protein families showed that 64.39% of all proteins contain known eudicot Pfam domains.

Synteny to eucalypts
The predicted coding sequences (CDS) of M. alternifolia were aligned to eucalyptus reference CDS using CoGe Synmap. The resulting syntenic map showed a high level of collinearity in the gene order of M. alternifolia and E. grandis (Figure 4). Since the syntenic path assembly ordered scaffolds based on their synteny to the reference genome of E. grandis, no conclusions can be drawn about the orientation of short scaffolds containing too few genes. However, the assembled scaffolds of sufficient size reveal rearrangements on the chromosome scale; some inversions in the tea tree genome were evident (E. grandis chromosome one), together with one translocation between the E. grandis chromosomes eight and six. Comparison with C. citriodora CDS showed a similar plot, but there were more gaps, especially on C. citriodora chromosomes 4 and 5 (see syntenic maps in GigaDB [55]). Furthermore, a comparison of M. alternifolia with P. trichocarpa revealed that synteny is still identifiable between these two rosids. Nevertheless, the synteny was reduced, with larger genetic distance leading to translocations and inversions being more notable than among the examined Myrtaceae.
Whole-genome alignments with Mummerplot resulted in similar findings, with an overall high sequence similarity among the compared Myrtaceae, with more inversions in the tea tree alignment to C. citriodora than to E. grandis (see whole-genome alignments Mummerplot in GigaDB [55]). The underlying cause of the different densities in the pairwise synplots between Melaleuca, Corymbia and Eucalyptus, and differences in the rearrangements detected between the pairs will require more investigation.
M. alternifolia is an ideal model species to compare genomic features with eucalypts. The two tribes Melaleuceae and Eucalypteae share some anatomical features, such as capsular fruit, an abundance of oil glands in their leaves, and epicormic buds underneath their bark [1,2]. Their sequence similarities are expected to be high, with genetic markers being transferable to some extent [56]. The syntenic comparisons and whole-genome alignments of our study confirmed these expectations.

Orthogroup discovery
The inference of phylogenetic relationships and the discovery of orthogroups based on the sequences of selected species affirmed the fgenesh++ gene prediction results for M. alternifolia.
OrthoFinder assigned 248,528 genes (91.3% of total) to 29,672 orthogroups. Of these, 10,982 orthogroups contained genes from all eight species, while 4386 groups were species-specific. Most of the M. alternifolia genes (78%) were assigned to orthogroups with two or more species present (see tables in GigaDB for more OrthoFinder-related statistics [55]).
Phylogenetic relationships were inferred based on the gene trees created by OrthoFinder for the eight selected species [37]. The tea tree genome will make a valuable addition to the growing quiver of sequences available for Myrtaceae. So far, genomic studies on Myrtaceae have mainly been motivated by the economic importance of some species and the relevance to identify the genetic foundations of specific traits [57]. Hence, the genomes of E. grandis and E. globulus were the first trees of the Myrtaceae family to be sequenced [42], but these were soon followed by the genome of a sister genus, C. citriodora [47,58]. Their geographical isolation, thus their separated evolution in Australia, make eucalypts a model taxon for comparative genomic studies to find shared evolutionary history, but also to find unique genome characteristics, compared with other woody perennials of the rosid lineage, such as Populus and V.
In addition to providing an outgroup for phylogenetic analyses of derived and ancestral characters in the eucalypts, the M. alternifolia genome resource will help to elucidate genomic responses to adaptive pressures among the Myrtaceae evolving on the Australian continent.
Although the evolutionary history of the Melaleuca remains uncertain, the genera Eucalyptus and Melaleuca are thought to have had a shared ancestry until 68 mega annum (Ma) [59], and current evidence suggests that Melaleuca and Eucalyptus largely adapted to contrasting environments [60]. Most of the genus Melaleuca are small trees or shrubs that grow in wetland or periodically waterlogged habitats [1], and have developed adaptations consistent with their waterlogged life history. For example, Melaleuca trees are capable of gas exchange through their bark [61], and seedlings develop aquatic roots as well as changed leaf morphology when submerged [62,63]. Thus, melaleucas are well-adapted to permanently or seasonally wet habitats. In contrast, most eucalypts are adapted to survive in arid environments with frequently occurring fires [64]. This divergence in their habitat might have evoked distinct adaptive responses in eucalypts and melaleucas.
The genome of another related species, Leptospermum scoparium (mānuka), has recently been sequenced [65]. Phylogenetic studies on Myrtaceae indicate that Leptospermum and include TPS genes and other genes involved in stress responses. In E. grandis, for example, most resistance (R) genes containing nucleotide-binding sites and leucine-rich repeat domains (NBS-LRR) were shown to be organised in tandem clusters [68], as well as genes encoding for S-domain receptor-like kinases (SDRLK) and MYB transcription factors [42,69].
These findings indicate that tandem duplication is essential for adaptation to dynamic environments, especially for genes involved in responses to abiotic and biotic stresses.
Therefore, comparison of how similarly these gene families evolved in Myrtaceae, and which selection pressures might have influenced their genome structure, is warranted.

REUSE POTENTIAL
De novo assembly of the M. alternifolia genome using MaSuRCA with PacBio long-reads and Illumina short-reads resulted in a high-quality draft genome that was similar in length (362 Mb) to the previously reported short-read assembly (357 Mb) [6], but with considerably longer scaffolds. The N50 value was increased by a factor of 214. The completeness of the hybrid assembly was also improved, as indicated by 6.1% fewer BUSCOs missing from the assembly, and a decrease of 5.6% fewer fragmented BUSCOs than the previous draft.
Alignment of M. alternifolia sequences to E. grandis and C. citriodora indicated high sequence similarities and correlations in gene order among the three Myrtaceae. The longer scaffolds of this new assembly will help to illuminate genome organisation in M. alternifolia and will allow further exploration of the significance of tandem gene duplication as a mechanism of gene family evolution in tea tree and related Myrtaceae species.

DATA AVAILABILITY
This whole genome sequencing project has been deposited at NCBI GenBank under the accession JAGKPW000000000. The assembly version described in this paper is version JAGKPW010000000. PacBio raw sequencing reads are available under BioProject PRJNA702189. Supplementary information regarding the computational methods has been summarised at protocols.io [12]. Other supporting data are available in the GigaScience GigaDB repository [55].

CONSENT FOR PUBLICATION
Not applicable.