The draft genome assembly of the critically endangered Nyssa yunnanensis, a plant species with extremely small populations endemic to Yunnan Province, China

Nyssa yunnanensis is a deciduous tree species in the family Nyssaceae within the order Cornales. As only eight individual trees and two populations have been recorded in China’s Yunnan province, this species has been listed among China’s national Class I protection species since 1999 and also among the 120 PSESP (Plant Species with Extremely Small Populations) in the Implementation Plan of Rescuing and Conserving China’s Plant Species with Extremely Small Populations (PSESP) (2011-2015). Here, we present the draft genome assembly of N. yunnanensis. Using 10X Genomics linked-reads sequencing data, we carried out de novo assembly and annotation. The N. yunnanensis genome assembly is 1475 Mb in length, containing 288,519 scaffolds with a scaffold N50 length of 985.59 kb. Within the assembled genome, 799.51 Mb was identified as repetitive elements, accounting for 54.24% of the sequenced genome, and a total of 39,803 protein-coding genes were predicted. With the genomic characteristics of N. yunnanensis now available, our study may facilitate future conservation biology studies to help protect this extremely threatened tree species.

Although N. yunnanensis is not the first species sequenced in the Nyssaceae family, a detailed understanding of this endangered species' genomic makeup along with other information, such as population structure and reproductive biology, is urgently required to improve the PSESP conservation strategy for its continued survival.

METHODS
A protocol collection gathering together the methods for DNA extraction and for 10X library construction and sequencing with DNBSEQ-G50 is available via protocols.io (Figure 2).

Plant material
We selected and sampled a 70 cm high individual tree of Nyssa yunnanensis from Ruili, Yunnan province, China (24° 03′ 02.72′′ N, 97° 56′ 20.99′′ E, altitude 843 m). Fresh young leaves were collected, immediately transferred into liquid nitrogen, and stored on dry ice until DNA and RNA extraction. Voucher specimens and images were collected and stored in the CNGB herbarium (Figure 3). The extracted DNA is now stored in the BGI-sample center (vouchers RL0289 and RL1182).

DNA extraction and sequencing
Total genomic DNA was extracted from leaf tissues of N. yunnanensis using a modified CTAB method [13]. Quality control was done using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA).
A Sage Science Pippin Pulse electrophoresis system was used to evaluate the molecular weight of the DNA, and high-molecular-weight (HMW) gDNA with a length of around 50 kb was obtained for sequencing. The HMW gDNA was then loaded onto a Chromium Controller chip with 10X Chromium reagents and gel beads, and the remaining library preparation steps were carried out according to the manufacturer's protocol [14].

RNA extraction and sequencing
Total RNA was extracted from young leaves of the same individual N. yunnanensis tree using a CTAB-pBIOZOL method [17]. The purity, concentration, and integrity of RNA samples were measured on a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), a [...]

Genome size estimation
The genome size of N. yunnanensis was estimated at 1.64 Gb using 21-mer counts of clean reads from the 10X Genomics library. First, k-mer frequency distribution analyses were performed using the kmer_freq_hash software within the gce v1.0.0 package (GCE, RRID:SCR_017332) based on the clean 10X Genomics data with the parameters "-k 21 -l reads.lst -t 8". Then, the gce software within the same package was used to estimate the overall characteristics of the genome, including genome size, repeat proportions, and level of heterozygosity [19].
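The principle behind k-mer-based genome size estimation can be illustrated with a minimal Python sketch. It is not the gce implementation, which additionally models repeats and heterozygosity; it only shows the basic relation that genome size is approximately the total number of k-mers divided by the depth of the main peak in the k-mer frequency histogram. The function names and the `min_depth` cutoff for discarding low-depth (probably erroneous) k-mers are assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def depth_histogram(kmer_counts):
    """Map each depth to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak depth).

    Depths below min_depth are dropped because they are dominated by
    sequencing errors (an assumed, illustrative cutoff).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: 500 error k-mers at depth 1, 100 genomic k-mers at depth 20.
    toy = {1: 500, 20: 100}
    print(estimate_genome_size(toy))  # 100.0
```

In practice the histogram would come from a dedicated k-mer counter run on the full read set rather than from the pure-Python `count_kmers` above, which is only practical for toy inputs.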

Genome evaluation
The completeness of the N. yunnanensis assembly was estimated using two strategies.
RRID:SCR_011930) [29] was used to carry out self-training with the default settings. To search for homologs, protein sequences of Camptotheca acuminata and Arabidopsis thaliana were used as references. For RNA evidence, a de novo approach was used. All of the clean RNA reads were assembled into inchworm contigs to serve as expressed sequence tag evidence using Trinity v2.0.6 (Trinity, RRID:SCR_013048) [30] with the parameters "-min_contig_length 100 -min_kmer_cov 2 -inchworm_cpu 6 -group_pairs_distance 200 -no_run_chrysalis". MAKER-P v2.31 (MAKER, RRID:SCR_005309) [31] was used to perform the prediction based on the evidence above. The first round of MAKER-P was run with the "protein2genome" and "est2genome" parameters set to "1" to obtain evidence-supported gene models. SNAP [32] was then trained on these gene models, and MAKER-P was run for a second round with default parameters to generate the final consensus gene set.

The search tool tRNAscan-SE v1.23 (tRNAscan-SE, RRID:SCR_010835) [33] was used to identify tRNA genes. The rRNA sequences of Arabidopsis thaliana and Oryza sativa were aligned against the N. yunnanensis assembly using BLASTN (BLASTN, RRID:SCR_001598) (E-value ≤ 1e−05) to identify rRNA genes. MicroRNAs and snRNAs were detected by searching the sequences against the Rfam database [34] using INFERNAL (Infernal, RRID:SCR_011809) [35].
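As a hedged illustration of the rRNA homology search, the sketch below filters BLASTN hits by the E-value threshold given above (≤ 1e−05). It assumes the hits were written in BLAST's standard 12-column tabular format (`-outfmt 6`), in which the subject start, subject end, and E-value are the 9th, 10th, and 11th columns; the text does not state which output format was used, and the function name and the example sequence identifiers are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text, max_evalue=1e-5):
    """Keep BLAST tabular hits whose E-value is at or below max_evalue.

    Assumes the default 12 columns of '-outfmt 6':
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue <= max_evalue:
            sstart, send = int(row[8]), int(row[9])
            hits.append({
                "query": row[0],
                "subject": row[1],
                # Minus-strand hits report sstart > send, so normalise the order.
                "start": min(sstart, send),
                "end": max(sstart, send),
                "evalue": evalue,
            })
    return hits


if __name__ == "__main__":
    # Hypothetical hits: the first passes the threshold, the second does not.
    example = (
        "rRNA_18S\tscaffold_1\t98.5\t1800\t27\t0\t1\t1800\t5000\t6799\t1e-50\t3200\n"
        "rRNA_5S\tscaffold_2\t85.0\t40\t6\t0\t1\t40\t900\t861\t2e-04\t35.0\n"
    )
    for hit in filter_hits(example):
        print(hit["query"], hit["subject"], hit["start"], hit["end"])
```

Overlapping hits on the same scaffold would still need to be merged before counting rRNA loci; that step is left out of this sketch.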

RESULTS & DISCUSSION
Assembly and annotation of the N. yunnanensis genome
We assembled a draft genome of the highly endangered tree species N. yunnanensis with DNBSEQ-G50 data from a 10X Genomics linked-reads library. The final genome assembly was 1.475 Gb in length, which is close to the estimated genome size of 1.64 Gb, with a scaffold N50 of 985.59 kb and a contig N50 of 32.33 kb (Table 1). The assembled genome size was also close to the genome size of 1.23 Gb estimated from the raw data produced [39] for the Digitization of the Ruili Botanical Garden project [40]. The GC content of the N. yunnanensis assembly was 42.18% excluding gaps, and a total of 54.24% of the assembly was composed of repetitive elements (Table 2). We ultimately obtained 39,803 protein-coding genes and successfully annotated 96.57% of the N. yunnanensis gene loci (Table 3). Non-coding genes were also annotated, identifying 175 microRNA (miRNA), 1,130 transfer RNA (tRNA), 1,502 ribosomal RNA (rRNA) and 3,106 small nuclear RNA (snRNA) genes (Table 4).
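For readers less familiar with the summary statistics reported here, the following Python sketch shows how N50 and gap-excluded GC content are conventionally defined. It is an illustration under the usual definitions, not the script used to produce Table 1, and the function names are hypothetical. N50 is the length of the sequence at which the cumulative length, summed from the longest sequence downward, first reaches half of the total assembly length; GC content excluding gaps ignores every base that is not A, C, G, or T.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


def gc_content(sequence):
    """Return the GC fraction of a sequence, ignoring gaps (N) and other symbols."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


if __name__ == "__main__":
    # Total length is 54, so N50 is reached once the running sum hits 27 (10 + 9 + 8).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
    print(gc_content("GGCCNNNNAATT"))  # 0.5
```

Applied separately to scaffold lengths and to contig lengths (scaffolds split at runs of N), the same `n50` function gives the scaffold N50 and contig N50, respectively.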

Data validation and quality control
The BUSCO analysis showed that 1,244 (90.5%) of the 1,375 expected conserved plant orthologs were detected as complete in the N. yunnanensis assembly and 81.9% of them were identified as complete and single-copy genes (

Potential for reuse
Here we report a draft genome assembly of the PSESP plant species N. yunnanensis. The completeness assessments carried out by read mapping and BUSCO indicate that this draft assembly is highly complete. As part of the 10KP (10,000 Plants) Genome Sequencing Project [41], the sequencing data and the annotated draft assembly generated in this study can be used for future phylogenetic and comparative genomic analyses, such as resolving the controversial phylogenetic relationships within the genus Nyssa. In particular, given the extremely small population size of N. yunnanensis, the genomic resources released in this study will support further research on the conservation biology of this highly endangered species as well as other PSESP species.

DATA AVAILABILITY
The 10X Genomics clean reads and RNA-seq clean reads are deposited in NCBI under the BioProject accession PRJNA438407, with SRA accession numbers SRX8345787 and SRX8373586. These reads are also deposited in the CNGB Nucleotide Sequence Archive (CNSA) with accession number CNP0001048. The genome assembly, protein-coding gene annotations, and repeat annotations are deposited in the GigaScience GigaDB repository [42].