Chromosome-scale assembly of the highly heterozygous genome of red clover (Trifolium pratense L.), an allogamous forage crop species

Relative to other crops, red clover (Trifolium pratense L.) has various favorable traits making it an ideal forage crop. Conventional breeding has improved varieties, but modern genomic methods could accelerate progress and facilitate gene discovery. Existing short-read-based genome assemblies of the ∼420 megabase pair (Mbp) genome are fragmented into >135,000 contigs, with numerous order and orientation errors within scaffolds, probably associated with the plant’s biology, which displays gametophytic self-incompatibility resulting in inherent high heterozygosity. Here, we present a high-quality long-read-based assembly of red clover with a more than 500-fold reduction in contigs, improved per-base quality, and increased contig N50 by three orders of magnitude. The 413.5 Mbp assembly is nearly 20% longer than the 350 Mbp short-read assembly, closer to the predicted genome size. We also present quality measures and full-length isoform RNA transcript sequences for assessing accuracy and future genome annotation. The assembly accurately represents the seven main linkage groups in an allogamous (outcrossing), highly heterozygous plant genome.

systems [3]. The improved protein storage and utilization of this forage appears to be associated with the post-harvest oxidation of o-diphenolic compounds by an endogenous polyphenol oxidase [4], although condensed tannins could also play a role [5]. Red clover tissues accumulate polyphenol oxidizable phenolics (mainly caffeic acid derivatives), condensed tannins, and various specialized metabolites, including flavonoid compounds [6,7]. Such compounds can influence animal and rumen physiology both negatively [8] and positively [9]. Specialized metabolites from red clover also have potential medicinal or nutraceutical value (see, for example, [10]). Improved red clover varieties have been developed, especially with respect to persistence, disease resistance, and yield, but further improvements could be made in these and other traits affecting quality and nutritional value [1]. Genetic progress and greater understanding of the physiology and biochemistry of agronomic and quality traits could be accelerated using genomic tools based on the production of a high-quality reference genome for the species. Such a genome would also facilitate gene discovery efforts.

Context
Red clover is a hermaphroditic allogamous (outcrossing) diploid (2n = 2x = 14) with a homomorphic gametophytic self-incompatibility (GSI) system [11] whereby a pistil expressed S-RNase mediates the degradation of pollen tubes from "self" pollen [12]. The GSI locus has been mapped to linkage group one in red clover. The GSI system in red clover appears to be especially effective [13], making red clover an obligate out-crossing species with a high degree of heterozygosity. This high degree of heterozygosity has made genome assembly with short-read sequencing data difficult. Two previous short-read genome assemblies [14,15] have been reported with limited contiguity (>135,000 contigs), completeness, and accuracy. We report a long-read based assembly consisting of 258 contigs, which provides a much-improved reference genome to enhance genome-enabled red clover improvement.

Sample information
The individual used in this study is HEN17-A07, a red clover plant selected from the US Dairy Forage Research Center (Madison, WI, USA) breeding program, representing elite North American red clover germplasm ( Figure 1). This individual derived from 30 years of selection and breeding for red clover grazing tolerance, persistence, biomass yield, and Fusarium oxysporum Schlect resistance [16,17]. Source varieties and germplasm for HEN17-A07 include the red clover varieties "Dominion" [18] and "Redlangraze" (ABI Alfalfa Inc., now part of Land O'Lakes, Inc. Arden, MN, USA); and experimental populations C452, C11, and C827 out of the US Dairy Forage Research Center red clover breeding program. Plant material used for all nucleic acid isolations was clonally propagated from the original selected plant and maintained in a growth chamber at 22°C with 18-h days and light intensities of approximately 400 μmol m −2 s −1 .

DNA and RNA extraction and sequencing
Approximately 0.8 g of frozen unexpanded leaf tissue from the red clover individual Hen17-A07 (hereafter referred to as "red clover") was ground in a mortar and pestle under liquid nitrogen. High-molecular-weight DNA was extracted using the NucleoBond HMW DNA extraction kit as directed by the manufacturer (Macherey Nagel, Allentown, PA, USA). The DNA pellet was resuspended in 150 μL of 5-mM Tris-Cl pH 8.5 (kit buffer HE) by standing at 4°C overnight, with integrity estimated by fluorescence measurement (Qubit, Qiagen, Germantown, MD, USA), optical absorption spectra (DS-11, DeNovix), and size profile (Fragment Analyzer, Thermo Fisher, Waltham, MA, USA).
The Ligation Sequencing Kit (SQK-LSK109) was used to prepare libraries for Nanopore sequencing from the extracted DNA as directed by the manufacturer (Oxford Nanopore Technologies, Oxford, UK). The libraries were sequenced in 14 R9.4 MinION flowcells on a GridION ×5 instrument. The Guppy version 3.3 basecaller was used to call sequence bases producing 60 gigabase pairs (Gbp) of nanopore sequence in 4.5 million pass_filter reads, having average read length of 13.6 kilobase pairs (Kbp).
The DNA for HiFi sequencing was sheared (Hydroshear, Diagenode, Denville, NJ, USA) using a speed code setting of 13 to achieve a size distribution with peak at approximately 23 Kbp. Smaller fragments were removed by size selection for >12 Kbp fragments (BluePippin, Sage Science, Beverly, MA, USA). Size-selected DNA was used to prepare a SMRTbell library using the SMRTbell Express Template Prep Kit 2.0 as recommended by the manufacturer (Pacific Biosciences, Menlo Park, CA, USA). The library was sequenced in two SMRT Cell 8M cells on a Sequel II instrument using Sequel Sequencing Kit 3.0, producing 23 [19]. A library was prepared using a TruSeq DNA PCR-Free library preparation kit, according to manufacturer guidance, and was sequenced on a NextSeq 500 instrument (Illumina) with a NextSeq High Output v2 300 cycle kit, generating 198 million 2 × 150 paired end reads. This resulted in 30.0 Gbp of short-read data, which were used for error-correction and assembly validation.
The Omni-C library was prepared from unexpanded leaf tissue collected from plants grown in the dark for 3 days, and ground in liquid nitrogen with a mortar and pestle. The The relatively large size of the alternate haplotype assembly likely reflects the obligate heterozygosity of red clover, since high heterozygosity supports more complete separation of parental haplotypes during HiFi-based assembly. The primary haplotype assembly was retained for use in downstream polishing and assembly quality assessment. Residual haplotype sequence was removed from the assembled contigs using purge_dups v1.2.5 (RRID:SCR_021173) [22]. Depth of coverage cutoff values for the purge_dups workflow were estimated from minimap2 (RRID:SCR_018550) [23] alignments of HiFi reads to the contigs.   Scaffolds were created from the HiFi Contigs using the SALSA v2 scaffolding workflow [24]. Omni-C reads were aligned to the purged contig assembly using BWA MEM [25] with the "-SP5" flag to disable paired-end read recovery. Resulting BAM files were converted to a bed file format using the Bedtools2 (RRID:SCR_006646) [26] tool "bamToBed". SALSA was subsequently run without misassembly detection to avoid unnecessary contig breaks and the "DNASE" setting owing to the use of OmniC reads for scaffolding. This placed the 258 contigs into 143 scaffolds with a scaffold N50 of 15.6 Mbp (Table 1). This intermediary dataset is referred to as "Omni-C scaffolds" for convenience. The contiguity, as summarized by the contig and scaffold N50 values, compared favorably with legume assemblies that had the benefit of extensive polishing, such as the Medicago truncatula reference, MedTr 4.0 [27] (Figure 2).

Scaffold placement using linkage data
Previously published expressed sequence tag (EST) [28], bacterial artificial chromosome (BAC) end [14], and Oxford Nanopore read overlaps were used to generate super-scaffolds representing the best approximation of red clover linkage group chromosomes. EST and BAC reads were converted to fasta format and aligned against Hi-C scaffolds using BWA MEM. A custom script [29] was used to order and orient EST and BAC information into a tabular, bipartite graph format for comparison. Oxford Nanopore reads were aligned to the Omni-C scaffolds with minimap2 [23] and overhanging reads were identified using custom Perl scripts [30]. Overlapping reads from two different contigs were combined into bipartite graphs as evidence of connection.
BAC, EST, and Oxford Nanopore datasets were analyzed using the Python NetworkX (version 2.5) [31]  formation. The Oxford Nanopore read overlaps showed substantial overlap with the underlying EST dataset, but the BAC end sequence showed no substantial overlap with other datasets. The final linkage group super-scaffolds were generated by assigning Omni-C scaffolds to linkage groups and ordering them according to their placement in the EST alignment dataset. Scaffolds that did not have EST mappings but were identified via Nanopore overlaps (four scaffolds in total) were incorporated into the final super-scaffolds on the side of the scaffold indicated by overlapping read data. The final set of super-scaffolds were generated using the "agp2fasta" utility of the "CombineFasta" Java tool (version 0.0.17) [32]. The final set of super-scaffolds is referred to as "ARS_RCv1.1" in the text.

Iso-seq transcript identification
Iso-seq sequence data was processed for isoform identification using the Iso-Seq Analysis pipeline in smrtlink v9.0.0.92188, including the option to map putative isoforms to the assembly scaffolds imported as a reference genome. A total of 9.2 million HiFi reads were generated from the 49 million sub-reads, of which 8,899,606 (97%) were classified as full-length non-concatemer reads (FLNC) with a mean length of 3.2 Kbp. These FLNC reads collapsed to 437,586 predicted unique polished high-quality isoforms, of which 308,804 (70%) mapped to 24,955 unique gene loci in the assembly, consistent with approximately 12 isoforms per unique loci. These gene loci are provided as BED coordinate files for future annotation efforts.

Assembly error-rate assessment
Genome quality was tested using a composite of k-mer and read mapping quality statistics as implemented in the Themis-ASM workflow [33]. All references to short-read whole genome sequence (WGS) data refer to the short-reads generated from the HEN17-A07 individual sequenced and assembled in this study unless otherwise mentioned. The completeness and quality of the assembly was first assessed using Merqury [34] k-mer analysis and FreeBayes (RRID:SCR_010761) [35] variant analysis. Merqury estimated the overall quality of the assembly at a Phred-based [36] quality value (QV) score of 49, which corresponds to an error every 129,000 bases ( Table 2). Comparison of k-mer profiles between the HiFi contigs and the previously published TGACv2 red clover assembly [14] (accession GCA_900292005.1) using the -Py-UpSet Python module [37] (Figure 3) indicated that only 55.2% of all k-mers were shared between the two assemblies. This surprisingly low shared content could be the result of real differences in the genomes of the different varieties of this highly heterozygous species (the earlier assembly used an individual from the Milvus variety versus the Hen17-A07 individual used here), or the higher degree of completeness of the current assembly (the previous assembly comprised 135,023 contigs and was 68 Mbp smaller), or assembly and haplotype switching errors in the short read assembly, or a combination of these factors. The Themis-ASM analysis of TGACv2 estimated an error every 142 bases, indicating that the ARS_RCv1.1 assembly has a three orders of magnitude improvement in k-mer-based QV estimates. Indeed, the count of erroneous, singleton k-mers identified in the TGACv2 assembly was over 40 million, compared to less than 10,000 in the ARS_RCv1.1 assembly (Figure 4). This represents a substantial improvement in assembly accuracy enabled by improved sequencing technologies.   between the two assemblies still points towards a higher error rate in the TGACv2 reference, and suggests that the ARS_RCv1.1 assembly is more suitable as a reference for short-read WGS alignment in the red clover species. The MedTr4 assembly represents a high-quality reference for most legume species, and has been used in several whole genome comparisons to indicate assembly quality [39,40]. This includes the original release of the TGACv2 reference, where synteny was identified between MedTr4 and the TGACv2 assembly [14]. However, the Merqury-estimated error rate of one out of every ten bases when mapping red clover WGS reads suggests that MedTr4 is unsuitable as a reference for red clover WGS alignment. This conclusion is supported by the observation that over 60% of the HEN17-A07 individual WGS reads were unmapped when aligned to the MedTr4 reference. This suggests that more distantly related legume species require a high-quality reference genome assembly for satisfactory alignment quality metrics. The approach described here provides a method to develop these reference assemblies for highly heterozygous allogamous species, such as red clover, without the requirement for extensive post-hoc polishing.

Structural variant assessment and comparative alignments
The structural accuracy of the super-scaffolds was assessed using a variety of comparative metrics native to the Themis-ASM workflow [33]. The short-read WGS data alignments were used as a basis for FRC_align [41] quality metrics, which identified a relatively low number of regions with predicted interscaffold alignments in ARS_RCv1.1 (   Figure 5A).
Furthermore, these whole-scaffold alignments revealed several structural variants representing potential expansions of the TGACv2 reference compared to ARS_RCv1.1 ( Figure 5B). The accuracy of ARS_RCv1.1 super-scaffold placement on a macro-scale was examined by alignment of previously generated BAC end sequence data from the Milvus B individual [14] to both assemblies with minimap2 using the "-x sr" preset.  Figure 4).

Re-use potential
We report a new red clover reference assembly using a combination of HiFi and Nanopore-based long-read sequencing, with Omni-C and BAC-end sequence scaffolding to produce chromosome-scale super-scaffolds. The quality of the assembly demonstrates that low-error rate long reads are suitable for resolving issues in assembling allogamous heterozygous (>50%) diploid plant genomes and generating continuous scaffolds. The addition of Omni-C read linkage data supported generation of an assembly with only 143 scaffolds. These scaffolds were then combined into seven final linkage-group super-scaffolds, better reflecting the haploid structure of red clover chromosomes.
Compared with a previous reference for the species, ARS_RCv1.1 contains 20% more assembled sequence and has an error rate that is lower by three orders of magnitude.
Comparative mapping statistics to other legume genome assemblies suggest that this assembly will enable better alignment of red clover short-read WGS data, improve gene model prediction, and facilitate transcriptomic studies and gene discovery efforts based on both marker-phenotype association and sequence identity. Previous assemblies of red clover were limited by the error-rates or length of reads used to construct them.  We demonstrate that recent improvements in DNA sequencing technologies are finally capable of generating a suitable assembly for this highly heterozygous species, and that these methods can be applied to other similar species without the need for expert curation.

AVAILABILITY OF SOURCE CODE AND REQUIREMENTS
Project name: Themis-ASM.

DATA AVAILABILITY
All sequence data used in the assembly, scaffolding, and analysis of ARS_RCv1.

CONSENT FOR PUBLICATION
Not applicable.