Chromosome-level assembly of the common vetch (Vicia sativa) reference genome

Vicia sativa L. (common vetch, n = 6) is an annual, herbaceous, climbing legume, originating in the Fertile Crescent of the Middle East and now widespread in the Mediterranean basin, West, Central and Eastern Asia, North and South America. V. sativa is of economic importance as a forage legume in countries such as Australia, China, and the USA, and contributes valuable nitrogen to agricultural rotation cropping systems. To accelerate precision genome breeding and genomics-based selection of this legume, we present a chromosome-level reference genome sequence for V. sativa, constructed using a combination of long-read Oxford Nanopore sequencing, short-read Illumina sequencing, and high-throughput chromosome conformation data (CHiCAGO and Hi-C) analysis. The chromosome-level assembly of six pseudo-chromosomes has a total genome length of 1.65 Gbp, with a median contig length of 684 Kbp. BUSCO analysis of the assembly demonstrated very high completeness of 98% of the dicotyledonous orthologs. RNA-seq analysis and gene modelling enabled the annotation of 53,218 protein-coding genes. This V. sativa assembly will provide insights into vetch genome evolution and be a valuable resource for genomic breeding, genetic diversity and for understanding adaption to diverse arid environments.


DATA DESCRIPTION Background
Vicia sativa L. (common vetch, NCBI:txid3908) (Figure 1) is an annual legume belonging to the Fabaceae family, and Vicia genus [1]. The Vicia genus contains about 180-210 species, including the economically important crop broad bean [2]. To date, no chromosome-level genome assembly has been reported within the Vicia genus. Interestingly, V. sativa has at least three different reported haploid chromosome numbers: n = 5, 6 or 7 [3], but n = 6 is the best characterized karyotype.
V. sativa is thought to have originated in the Fertile Crescent of the Middle East and is now widespread on every continent as both a crop and a weed [4]. V. sativa is a multipurpose legume; the plants are often grown for forage and the seeds can be used safely as a feed for ruminant animals. V. sativa seed contains up to 30% crude protein and is rich in essential amino acids and unsaturated fatty acids [5]. However, only a small amount of the seed can be safely fed to monogastric animals like chickens and pigs, because of the presence of the neurotoxic proteinaceous amino acids -cyano-L-alanine and -glutamyl--cyano-alanine [6].
V. sativa is often used in crop rotation systems to increase nitrogen input to the soil. In a study of V. sativa/wheat rotation over a 4-year-period, cultivation of V. sativa during autumn increased soil water storage and subsequently increased biological yield and grain yield of wheat. Both yields were doubled in the third year compared with the second year of the rotation [7]. Furthermore, the symbiosis between soil rhizobia bacteria and V. sativa roots allows the plant to fix atmospheric nitrogen and later provide nitrogen for the following crop, hence reducing the use of expensive nitrogen fertilizer [8]. V. sativa exhibits excellent drought tolerance and is suitable for cultivation in arid areas. In one drought tolerance study, V. sativa could withstand a month of drought stress, with the leaf weight not decreasing significantly compared with the non-drought control [9]. V. sativa offers multiple usage and is a valuable crop in a sustainable agricultural system [10].
With the important value of V. sativa, vetch breeders have primarily selected for traits conferring high yield, pod shattering, flowering time, disease resistance against Ascochyta fabae, Uromyces viciae-fabae (rust) and Sclerotinia sclerotium [11]. Recently published transcriptome data has allowed agriculturally important traits to be uncovered at the gene expression level, such as pod-shattering resistance [12] and drought tolerance genes [13] in V. sativa. However, a lack of high-quality genome reference is currently impeding the genetic mapping of important genes and hindering further applications such as genome editing when compared with other crops.

Context
In this study, we assembled a high-quality chromosome-level reference genome for V. sativa, which is the first chromosome-level reference genome in the Vicia genus. We performed genome annotation using RNA-seq data from five tissues to ensure most of the expressed genes were captured. We also included a phylogenetic analysis of V. sativa and legume relatives. We envisage that our V. sativa genome will be an important resource for evolutionary studies of this species. The well-annotated chromosome-level genome will also provide important information to facilitate genetic mapping, gene discovery and functional gene studies.

Sampling and sequencing
To prepare V. sativa for whole genome sequencing (WGS) using long-read and short-read data, seeds of cultivar Studenica (V. sativa subsp. sativa) were obtained from the South Australian Research and Development Institute (SARDI, South Australia, Australia). Seeds were sterilized and germinated in vitro on half-strength Murashige & Skoog (1/2 MS) basal medium with 1% sucrose for 3 days at 25 °C, in the dark. Bulk 3-mm-long primary root tips were then harvested and snap-frozen in liquid nitrogen for subsequent DNA extraction. DNA was extracted using the phenol:chloroform method [14], with an additional high-salt low-ethanol wash to improve DNA purity [15]. High-quality DNA was confirmed by electrophoresis on 1% agarose gel. The DNA was sent to the Australian Genome Research Facility (AGRF, Melbourne, Australia), and Novogene Co., Ltd (Hong Kong, China) for library preparation and sequencing on a PromethION (PromethION, RRID:SCR_017987) and Novo-Seq 6000 (Illumina NovaSeq 6000 Sequencing System, RRID:SCR_016387), respectively. We obtained 72 gigabase pairs (Gbp) of Nanopore long-read data, and 205 Gb paired-end short-read data (150 base pairs [bp] read length).
To produce V. sativa CHiCAGO sequencing data [16] and Hi-C sequencing data [17], 2 g of young leaf tissue was snap-frozen in liquid nitrogen and sent to Dovetail Genomics (USA) for library preparation and sequencing. CHiCAGO and Hi-C libraries were sequenced on an Illumina HiSeq X (Illumina HiSeq X Ten, RRID:SCR_016385) to produce 162 Gbp of CHiCAGO and 148 Gbp of Hi-C sequencing data, respectively.

Genome size estimation and genome assembly
We first performed a genome size estimation for V. sativa. To do this, short-Illumina (paired-end 150 nt) reads were trimmed using TrimGalore v0.4.2 (Trim Galore, RRID:SCR_011847) with default parameters and 25-mers were counted using Jellyfish v2.2.6 (Jellyfish, RRID:SCR_005491) [19]. The 25-mer count distribution data was used as an input to GenomeScope (GenomeScope, RRID:SCR_017014) [20] for genome size estimation with the read length set to 150 and max k-mer coverage set to 1 million. GenomeScope estimated a genome-wide heterozygosity level of 0.09% ( Figure 2) and a genome size of 1.61 Gbp; approximately 160 megabase pairs (Mbp) smaller than the genome size estimated by flow cytometry (1.77 Gbp) [21].
Next, we conducted contig assembly from the Nanopore long-reads using Canu v1.7 (Canu, RRID:SCR_015880) [22] under default parameters with the expected genome size set at 1.77 Gbp. Canu was used to perform read trimming and sequencing error correction for the input data before performing contig assembly. The assembled contigs were polished using clean WGS short-reads with Pilon v1.22 (Pilon, RRID:SCR_014731) [23] under default parameters. We repeated the polishing step and observed a further improvement in contig quality (Table 2). Contig quality was assessed using BUSCO v5.2.2 (BUSCO, RRID:SCR_015008) [24] for the completeness of the genome, and after two rounds of polishing, complete BUSCOs increased from 69.9% to 97.8% (Table 2). Overall, we obtained

Chromosome-level assembly using Hi-C and linkage map data
To generate a chromosome-level assembly for V. sativa, Hi-C proximity [25] ligation data and CHiCAGO [26] were used to order and orient the contigs along chromosomes. The scaffolding process was carried out by Dovetail Genomics (Santa Cruz, CA, USA) using Dovetail™ Hi-C library reads to connect and order the input set of contigs. After scaffolding with HiRise (v2.1.7) [51], the assembled genome sequence initially comprised 1.8 Gbp, with a scaffold and contig N50 of 51.4 and 0.6 Mbp, respectively. A high fraction of the assembled sequences (93%) were contained in four pseudo-chromosomes ( Figure 3A); however V. sativa has six pairs of chromosomes [1]. We observed that two of the four pseudo-chromosomes had weak interactions, suggesting misjoining of two contigs ( Figure 3A). In parallel to the HiRise analysis, we performed a second chromosome-level assembly using 3D-DNA (3D de novo assembly, RRID:SCR_017227) [27]. 3D-DNA scaffolding was performed by first mapping Hi-C reads to the contig assembly using Juicer v1.6 (B) The collinearity between pseudo-chromosomes "a" to "c" and "d", and between pseudo-chromosomes "b" to "e" and "f" in (A) were confirmed by Mummer alignment.
(Juicer, RRID:SCR_017226) [28], which then generated 304,484,352 uniquely mapped pair-end reads and of which 51.1% (155,477,211) of the uniquely mapped reads were identified as valid Hi-C contacts. The 3D-DNA v180114 pipeline was integrated to anchor contigs to pseudo-chromosomes based on valid Hi-C contacts. The output file was used to generate a Hi-C heatmap for manual inspection using Juicebox Assembly Tools v1.11.08 (Juicebox, RRID:SCR_021172). This revealed six high-quality pseudo-chromosomes ( Figure 3A).
To further support that these two HiRise pseudo-chromosomes were misjoined, we  (Table 4).

DATA VALIDATION AND QUALITY CONTROL
Three approaches were used to assess the quality of the final version of our genome assembly. First, the WGS short-read data was mapped to this final assembly. A very high proportion (99.7%) was mapped (  RRID:SCR_018970) [32] and LTR_FINDER_parallel v1.2 [33] into LTR_retriever v2.9.0 (LTR_retriever, RRID:SCR_017623) [34], suggest that the genome reached a reference quality.

Genome annotation
To annotate the V. sativa genome assembly, we masked repeat regions of the genome, then  RepeatModeler (RepeatModeler, RRID:SCR_015027), which generated a non-redundant transposable element (TE) library used to annotate the TE regions on the genome. The TE library generated from EDTA was also used as an input to RepeatMasker v4.1.2 (RepeatMasker, RRID:SCR_012954) to identify and perform "hard-masking" and "soft-masking" for the repetitive region on the genome. A total of 83.9% of the genome was masked, and 64.4% of the genome was detected as LTR elements (Table 6). After genome masking, a combination of ab initio prediction and transcript evidence from the RNA-seq was used for gene prediction. Briefly, each RNA-seq data set was trimmed for low quality bases using TrimGalore v0.4.2, and mapped to the hard-masked-genome by using STAR v2.7.9 (STAR, RRID:SCR_004463) [37] to generate BAM files. Then the soft-masked genome and the BAM files generated from STAR were used for gene prediction using BRAKER v2.1.6 (BRAKER, RRID:SCR_018964) [38]. A total of 53,218 predicted protein-coding-genes were identified (Table 7). To assess the completeness of these protein-coding-genes, BUSCO v5.1.3 with fabales_odb10 gene sets were used which then identified 5127 (95.6%) complete, 395 (7.4%) duplicated, 70 fragmented (1.3%) and 169 missing (3.1%) orthologs.
Putative functions of the predicted protein-coding-genes were characterized by comparing the predicted proteins against the SwissProt and National Center for Biotechnology Information (NCBI) non-redundant database using Diamond v2.0.11 (DIAMOND, RRID:SCR_016071) [39] with e-value cut-off of 1 × 10 −5 . Protein motifs and domains were annotated by comparing the predicted proteins against the InterPro Gigabyte, 2022, DOI: 10.46471/gigabyte. 38   . As a result, we were able to annotate 47,580 (89.4%) predicted protein-coding genes with at least one function term (Table 8).

10/19
In addition, we also identified and annotated non-coding RNA in the V. sativa genome.

Phylogenetic tree construction and divergence time estimation
We identified the orthogroups, phylogenetic positions and divergence times between V. sativa and 11 other plant species. The source of the protein-coding sequences used in our analysis are listed in Table 10  (OrthoFinder, RRID:SCR_017118) [55] with default parameters. A total of 10,009 single-copy and 43,209 multi-copy genes were identified in the V. sativa annotation ( Figure 6B, Table 11), forming 19,096 orthogroups ( Figure 6A, Table 10). Comparing orthogroups amongst V. sativa, P. sativum, M. truncatula, P. vulgaris, F. albida, we identified 2309 orthogroups that are specific to V. sativa ( Figure 6A). Orthofinder was further used to perform phylogenetic reconstruction with the multiple sequence alignment approach (using the -msa command). The generated species tree has a support value of one on all nodes (Figure 7), indicating the high reliability of the revealed phylogenetic relationships.
To estimate divergence times between V. sativa and other important legume species ( Substitution models were selected using BEAST Model Test [61] for each alignment and were allowed to coalesce using unlinked relaxed log-normal molecular clocks [62]. A calibrated Yule prior [63] was used to inform tree building and speciation with four node calibrations (Table 12) Figure 6C).
To identify whole genome duplication events (WGD), WGDI v0.5.1 [73] was used to identify gene collinearity between V. sativa, M. truncatula and P. vulgaris. The K s (synonymous substitutions per synonymous site) value was calculated based on the identified collinearity gene to construct a frequency distribution map. The Ks distribution indicated that V. sativa, M. truncatula and P. vulgaris share the same ancestral WGD event.
The estimated time of this WGD event (∼58 MYA) [74] and the corresponding Ks value (∼0.93, Figure 6D) reveal that the average mutation rate of V. sativa genome is 8.02 ×10 −9 per site per year.

REUSE POTENTIAL
Understanding the genetic, epigenetic and epitranscriptomic basis of the evolutionary processes shaping drought tolerance, low nutrient requirements and adaption to broad habitats requires comparison of multiple legume genomes, preferentially assembled at the chromosome level. In this study, we present a complete chromosome-level genome assembly for the legume V. sativa (common vetch) and provided a detailed genome annotation. There are >19,000 species of legumes, about 200 within the Vicia genus, and this genome will serve as an excellent reference for the assembly of other Vicia genomes. The V. sativa genome will also facilitate comparative analyses aimed at understanding the evolutionary origin and dynamics of legume specific gene families. Our new V. sativa genome will greatly benefit legume researchers and plant breeders who are interested in conventional as well as engineering crop improvement.

DATA AVAILABILITY
Final assembly and original Nanopore assembly, as well as annotation files, Supplementary File 1, predicted transcript and protein sequences, and bioinformatics supporting information, were deposited in the database GigaDB [57]. Additionally, assembly, Illumina and Nanopore subreads, and transcriptome raw data are available at NCBI and can be accessed with BioProject PRJNA762450 and BioSample SAMN21393724. Illumina and Nanopore subreads can be obtained, with SRR16004114 and SRR16004115; and RNA-sequencing raw reads, as follows: SAMN21545804, SAMN21545805, SAMN21545806, SAMN21545807 and SAMN21545808. Additional data is available in the GigaScience GigaDB database [57].