A chromosome-level genome assembly and annotation of the maize elite breeding line Dan340

Background Maize is an important model organism for genetics and genomics research. Though reference genomes of maize are available, some genomes of important genetic germplasms for maize breeding are still lacking, for instance, the cultivar Dan340, which is a backbone inbred line of the LvDa Red Cob Group with several desirable characteristics. In this study, we constructed a high-quality chromosome-level reference genome for Dan340 by using long HiFi reads, short reads, and Hi-C. The final assembly of the Dan340 genome was 2348.72 Mb, which was anchored to 10 chromosomes. Repeat sequences accounted for 73.40% of the genome and 39,733 protein-coding genes were annotated. Comparative genomic analysis between Dan340 and other maize lines identified that 1806 genes from 359 gene families were specific to Dan340. Conclusions Our genome assembly and annotation provide a valuable resource for improving maize breeding and further understanding the intraspecific genome diversity in maize.

resistance, high combining ability, and wide adaptability. More than 50 maize hybrids have been derived from Dan340 since 2000, and their planting area has reached 19 million ha. Dan340 is thought to have originated from a landrace in China, and it differs significantly at the genetic level from the other germplasms that make up the most important core maize germplasms in China [12]. Therefore, Dan340 could serve as a model inbred line for the genetic dissection of desirable agronomic traits, combining ability, heterosis, and breeding history.
In the present study, we constructed a high-quality chromosome-level reference genome for Dan340 by combining PacBio long HiFi sequencing reads, Illumina short reads, and chromosomal conformational capture (Hi-C) sequencing reads. The completeness and continuity of the resulting genome are comparable with those of other important maize inbred lines: B73 [4], Mo17 [7], SK [13], PH207 [5], and HZS [8]. Furthermore, comparative genomic analyses were performed between Dan340 and other maize lines. Genes and gene families specific to Dan340 were identified. In addition, large numbers of structural variations between Dan340 and other maize inbred lines were detected. The assembly and annotation of this genome will increase our understanding of the intraspecific genomic diversity in maize and provide a novel resource for maize breeding improvements.

Plant materials and DNA sequencing
The inbred line Dan340 (
One Hi-C library was constructed using young leaves following previously published procedures [14], with slight modifications outlined in our published protocol [15] (Figure 2). In brief, approximately 5-g leaf samples from seedlings were cut into minute pieces and cross-linked using a 4% formaldehyde solution at room temperature in a vacuum for 30 min. Each sample was mixed with excess 2.5 M glycine for 5 min to quench the cross-linking reaction and then placed on ice for 15 min. The cross-linked DNA was extracted and then digested for 12 h with 20 units of DpnII restriction enzyme (NEB, Ipswich, MA, USA, Catalog #R0543S) at 37°C. Next, the resuspended mixture was incubated at 62°C for 20 min to inactivate the restriction enzyme. The sticky ends of the digested fragments were biotinylated and proximity ligated to form enriched ligation junctions and then ultrasonically sheared to a size of 200-600 bp. The biotin-labelled DNA fragments were pulled down and ligated with Illumina paired-end adapters, and then amplified by PCR to produce the Hi-C sequencing library. The library was sequenced using an Illumina HiSeq X Ten platform with 2 × 150 bp paired-end reads. After removing low-quality sequences and trimming adapter sequences, 304.37 Gb (approximately 130×) of clean data were generated and used for the genome assembly.
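The quoted sequencing depth can be reproduced with a short calculation. The sketch below is illustrative only: it assumes that depth is the clean data yield divided by the final assembly size of 2348.72 Mb reported in this paper (an estimated genome size may have been used instead), and the function name is hypothetical.

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage for a data yield in Gb and a genome size in Mb."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    # Convert Gb to Mb so that both quantities share the same unit.
    return (total_bases_gb * 1000.0) / genome_size_mb


# Values from the text: 304.37 Gb of clean Hi-C data.
# Assumption: the 2348.72 Mb final assembly size is used as the denominator.
hic_depth = sequencing_depth(304.37, 2348.72)
print(f"Hi-C depth: {hic_depth:.1f}x")
```

Under this assumption the result rounds to the approximately 130× stated above.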

Genome assembly
To obtain a high-quality genome assembly of Dan340, we employed both PacBio HiFi reads and Illumina short reads, with scaffolding informed by high-throughput Hi-C. The assembly was performed in a stepwise fashion. First, a de novo assembly of the long CCS reads generated from PacBio single-molecule real-time (SMRT) sequencing was performed with default parameters, followed by the deduplication of reads using pbmarkdup (Version 0.2.0) [18], as recommended by PacBio. Next, HiFi reads were aligned to each other and assembled into genomic contigs using Hifiasm [16] with default parameters. The primary contigs (p-contigs) were then polished using Quiver [19] by aligning the SMRT reads.
Then, Pilon [20] (RRID:SCR_014731) was used to perform a second round of error correction using the short paired-end reads generated by the Illumina HiSeq platforms.
Subsequently, the Purge Haplotigs pipeline [21] was used to remove redundant sequences arising from heterozygosity. The draft genome assembly was 2348.68 Mb, with a high level of continuity and a contig N50 length of 45.11 Mb.
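For readers unfamiliar with the contig N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed: the N50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths used here are hypothetical toy values, not data from the Dan340 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```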
Reads were excluded from subsequent analyses if they did not align within 500 bp of a restriction site or did not map uniquely. The number of Hi-C read pairs linking each pair of scaffolds was then tabulated. ALLHiC (Version 0.8.12) [24] was used in simple diploid mode to scaffold the genome and optimize the ordering and orientation of each clustered group, producing a chromosome-level assembly. The Juicebox Assembly Tools (Version 1.9.8, RRID:SCR_021172) [25] were used to visualize and manually correct the large-scale inversions and translocations to obtain the final pseudo-chromosomes (Figure 3). Finally, 2315 scaffolds (representing 91.30% of the total length) were anchored to 10 chromosomes (Table 1).
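The read-exclusion rule described above can be illustrated with a small sketch. It is not the pipeline used in this study: it assumes that restriction sites are the GATC recognition sequence of DpnII, that "within 500 bp" is inclusive, and that unique mapping is supplied as a precomputed flag. All function names are hypothetical.

```python
from bisect import bisect_left

DPNII_SITE = "GATC"  # recognition sequence of DpnII
MAX_DISTANCE = 500   # bp threshold quoted in the text


def restriction_sites(sequence, site=DPNII_SITE):
    """Return the sorted 0-based start positions of every site occurrence."""
    seq = sequence.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def distance_to_nearest_site(position, sites):
    """Return the distance from position to the closest site, or None if no sites."""
    if not sites:
        return None
    idx = bisect_left(sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sites[max(idx - 1, 0):idx + 1]
    return min(abs(position - s) for s in candidates)


def keep_read(position, is_unique, sites, max_distance=MAX_DISTANCE):
    """Keep a read only if it maps uniquely and lies near a restriction site."""
    if not is_unique:
        return False
    distance = distance_to_nearest_site(position, sites)
    return distance is not None and distance <= max_distance


# Hypothetical toy scaffold with a single DpnII site at position 1000.
toy_scaffold = "A" * 1000 + "GATC" + "A" * 1000
toy_sites = restriction_sites(toy_scaffold)
print(keep_read(600, True, toy_sites), keep_read(100, True, toy_sites))
```

In practice the alignment position and uniqueness would come from the mapped Hi-C reads; here they are passed in directly to keep the example self-contained.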

Evaluation of the assembly quality
We assessed the quality of the assembly using several independent methods. First, the short reads obtained from the Illumina sequencing data were aligned to the final assembly using BWA [26]. The percentage of reads mapped to the reference genome was 97.48%. Second, a total of 248 conserved genes present in six eukaryotic model organisms were selected to form the core gene library for the Core Eukaryotic Genes Mapping Approach (CEGMA) [27] (RRID:SCR_015055) evaluation. To evaluate its integrity, our assembled Dan340 genome was aligned to this core gene library using TBLASTN (RRID:SCR_011822) [28], GeneWise (Version 2.2.0, RRID:SCR_015054) [29], and the GeneID tools (Version 1.4, RRID:SCR_021639) [30]. Our results showed that 238 complete (95.97%) and 243 partial (97.98%) genes were detected in our assembly. Third, the completeness was assessed using the benchmarking universal single-copy orthologs (BUSCO) [31] (RRID:SCR_015008). The final assembly was tested against BUSCO (v.3) with the embryophyta_odb10 database [32], which includes 1614 conserved core genes. Our results showed that 98.08% (1583), 1.11% (18), and 0.81% (13) Figure 5 and Table 2).
A higher LAI score indicates a more complete genome assembly because more intact LTR retrotransposons are identified, as was the case for our Dan340 genome. Furthermore, whole-genome sequence alignments of Dan340 to the genomes of the other three maize inbred lines demonstrated that our assembly has highly collinear relationships with other published maize genomes (Figure 6). Taken together, our assessment results suggest that the Dan340 genome assembly is of high quality.
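The CEGMA and BUSCO percentages reported in this section follow directly from the gene counts and library sizes. The sketch below recomputes them as a consistency check. The helper names and the 0.01 percentage-point tolerance are assumptions made for illustration; the counts and reported percentages are those given in the text.

```python
def observed_percent(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def matches_reported(count, total, reported, tolerance=0.01):
    """True if the recomputed percentage agrees with the reported value.

    The tolerance (an assumption) allows for rounding in the reported figures.
    """
    return abs(observed_percent(count, total) - reported) < tolerance


CEGMA_TOTAL = 248    # core genes in the CEGMA library
BUSCO_TOTAL = 1614   # conserved core genes in the BUSCO database used

# (count, reported percentage) pairs taken from the text.
cegma_checks = [(238, 95.97), (243, 97.98)]
busco_checks = [(1583, 98.08), (18, 1.11), (13, 0.81)]

for total, checks in ((CEGMA_TOTAL, cegma_checks), (BUSCO_TOTAL, busco_checks)):
    for count, reported in checks:
        print(count, total, reported, matches_reported(count, total, reported))
```

The three BUSCO counts also sum to the 1614 genes in the database, which is a useful additional check.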

Genome annotation
Repeat sequences of the Dan340 genome were annotated using both ab initio and homolog-based search methods. For the ab initio prediction, RepeatModeler (Version 1.0.8, RRID:SCR_015027) [36], RepeatScout (Version 1.0.5, RRID:SCR_014653) [37], and LTR_Finder [34] were used to discover transposable elements (TEs) and to build a TE library. The integrated TE library and a known repeat library (Repbase Version 15.02, homolog-based, RRID:SCR_021169) were supplied to RepeatMasker (Version 3.3.0, RRID:SCR_012954) [38] to predict the TEs. For the homolog-based predictions, RepeatProteinMask was run to detect TEs in our genome by comparing it against a TE protein database. Tandem repeats were identified in the genome using Tandem Repeats Finder (Version 4.07b, RRID:SCR_022193) [39]. As a result, 1723.99 Mb of repeat sequences were identified, accounting for 73.40% of the genome size. Among these repeat sequences, 1555.57 Mb were predicted to be long-terminal repeat (LTR) retrotransposons, and 44.53 Mb were predicted to be DNA transposons, accounting for 66.23% and 1.60% of the genome, respectively. Furthermore, among the LTR retrotransposons, the Gypsy and Copia superfamilies comprised 23.81% and 12.75% of the genome, respectively. Thus, retrotransposons accounted for a large proportion of the Dan340 genome, which was consistent with the genomic characteristics of other maize inbred lines (Table 2).
All repetitive regions except the tandem repeats were soft-masked for protein-coding gene annotations. Five ab initio gene prediction programs, Augustus (Version 3.0.2, RRID:SCR_008417) [40][41][42], GENSCAN (Version 1.0, RRID:SCR_013362) [43], GeneID [30], GlimmerHMM (Version 3.0.2, RRID:SCR_002654) [44], and SNAP (Version 2013-02-16, RRID:SCR_007936) [45], were used to predict genes.
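Soft-masking, as mentioned above, conventionally means writing repeat regions in lowercase while leaving the rest of the sequence in uppercase, so that gene predictors can treat repeats differently without losing the underlying bases. The sketch below illustrates the idea on a toy sequence. It assumes 0-based, half-open repeat intervals and hypothetical function names; it is not the masking procedure used in this study, in which tandem repeats were left unmasked.

```python
def soft_mask(sequence, intervals):
    """Lowercase the bases inside 0-based, half-open (start, end) intervals."""
    masked = list(sequence.upper())
    for start, end in intervals:
        if start < 0 or end > len(masked) or start > end:
            raise ValueError(f"invalid interval: ({start}, {end})")
        for i in range(start, end):
            masked[i] = masked[i].lower()
    return "".join(masked)


def masked_fraction(sequence):
    """Return the fraction of bases that are soft-masked (lowercase)."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# Hypothetical toy sequence and repeat intervals, for illustration only.
toy_masked = soft_mask("ACGTACGTAC", [(2, 5), (7, 9)])
print(toy_masked, masked_fraction(toy_masked))
```

Applied genome-wide, the masked fraction computed this way corresponds to the proportion of the assembly annotated as repeats that were passed to masking.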
In addition, the protein sequences of five related species (Sorghum bicolor, Setaria italica, Hordeum vulgare, Triticum aestivum, and Oryza sativa) were downloaded from Ensembl and NCBI. Homologous sequences were aligned against the genome using TBLASTN (E-value 1 × 10⁻⁵). GeneWise [29] was employed to predict gene models based on the sequence alignment results.
For the RNA-seq predictions, fresh samples of six tissues (stem, endosperm, embryo, bract, silk, and ear tip) were collected. The total RNA was extracted from each sample using an RNAprep Pure Plant Kit (Tiangen Biotech Co., Ltd., Beijing, China). The isolated, purified RNA, with fragment lengths of approximately 300 bp, was used as the template for constructing a cDNA library. The NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA) was used to construct the cDNA library following the manufacturer's instructions. The sequencing was performed on an Illumina HiSeq X Ten platform, and 150-bp paired-end reads were generated. Raw reads were trimmed by removing the adapter sequences, reads with more than 5% of unknown base calls (N), and low-quality bases (base quality less than 5). Clean paired-end reads were aligned to the genome using TopHat (Version 2.0.13, RRID:SCR_013035) [46] to identify exon regions and
Table 3. Summary statistics of annotated protein-coding genes in Dan340 and other maize inbred lines and common crop species.
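The RNA-seq read-cleaning rules given in this section (discarding reads with more than 5% unknown bases and removing bases with quality below 5) can be sketched as follows. This is a simplified illustration rather than the tool used in the study: adapter removal is omitted, low-quality bases are assumed to be trimmed from the 3' end only, the N fraction is assumed to be evaluated before trimming, and all function names are hypothetical.

```python
def n_fraction(seq):
    """Return the fraction of unknown base calls (N) in a read."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def trim_low_quality_tail(seq, quals, min_quality=5):
    """Trim 3' bases whose quality is below min_quality (assumed trimming rule)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def clean_read(seq, quals, max_n_fraction=0.05, min_quality=5):
    """Return the cleaned (sequence, qualities) pair, or None if the read is dropped."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    # Reads with more than 5% unknown bases are discarded outright.
    if n_fraction(seq) > max_n_fraction:
        return None
    trimmed_seq, trimmed_quals = trim_low_quality_tail(seq, quals, min_quality)
    if not trimmed_seq:
        return None
    return trimmed_seq, trimmed_quals


# Hypothetical toy reads with integer Phred qualities.
print(clean_read("ACGTACGTAC", [30] * 8 + [3, 2]))
print(clean_read("ANNTACGTAC", [30] * 10))
```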

Comparative genomic analysis between Dan340 and other maize lines
We applied the OrthoMCL pipeline [55] to identify orthologous gene families among four maize inbred lines: Dan340, B73, Mo17, and SK. The longest protein from each gene was selected, and proteins shorter than 30 amino acids were removed.
Subsequently, pairwise sequence similarities between all input protein sequences were calculated [56] (RRID:SCR_001010) with an E-value cut-off of 1 × 10⁻⁵. Markov clustering (MCL) of the resulting similarity matrix was used to define the ortholog cluster structure of the proteins, using an inflation value (-I) of 1.5 (default setting of OrthoMCL).
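The protein pre-filtering step described above (one longest protein per gene, with proteins shorter than 30 amino acids removed) can be sketched as follows. The input layout, identifiers, and function name are hypothetical and chosen only for illustration; ties between equally long proteins are resolved here by keeping the first one encountered, which is an assumption.

```python
MIN_PROTEIN_LENGTH = 30  # amino acids, threshold quoted in the text


def longest_protein_per_gene(proteins, min_length=MIN_PROTEIN_LENGTH):
    """Select the longest protein per gene and drop those below min_length.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id to (protein_id, sequence).
    """
    best = {}
    for gene_id, protein_id, sequence in proteins:
        current = best.get(gene_id)
        # Keep the first protein seen on ties (assumed tie-breaking rule).
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return {
        gene_id: entry
        for gene_id, entry in best.items()
        if len(entry[1]) >= min_length
    }


# Hypothetical toy input: two isoforms of geneA and one short protein of geneB.
toy_proteins = [
    ("geneA", "geneA.t1", "M" * 40),
    ("geneA", "geneA.t2", "M" * 55),
    ("geneB", "geneB.t1", "M" * 12),
]
selected = longest_protein_per_gene(toy_proteins)
print(sorted(selected))
```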
Next, comparative analyses were performed among Dan340, B73, Mo17, and SK (Figure 7A). The genes from the Dan340 genome and those from B73, Mo17, and SK were clustered into 27,654 gene families. Of these, 15,690 families were shared among the four maize inbred lines, representing a core set of genes across these maize genomes. We found 1806 genes from 359 gene families that were specific to Dan340, many of which had functional GO annotations related to "protein phosphorylation", "single-organism catabolic process", and "pheromone binding" (Figure 7B). In the KEGG functional enrichment analysis, the most enriched pathways among the Dan340-specific genes were "antifolate resistance", "epithelial cell signaling in Helicobacter pylori infection", and "pentose and glucuronate interconversions" (Figure 7C). In addition, OrthoMCL was used to identify the core and dispensable gene sets based on gene families. The gene families that were shared among the four inbred lines were defined as core gene families. Gene families shared among three inbred lines, shared between two inbred lines, and present in only one inbred line (private gene families) are also displayed in Figure 7D.
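The classification of gene families into core, partially shared, and private sets described above depends only on which inbred lines contribute genes to each family. The sketch below illustrates this with toy data. The family and gene identifiers, the category labels, and the function names are hypothetical, and the input is assumed to be a mapping from each line to its member genes in a family.

```python
LINES = ("Dan340", "B73", "Mo17", "SK")


def classify_family(members):
    """Classify one gene family by the lines that contribute genes to it.

    members: dict mapping a line name to a list of gene IDs in the family.
    """
    present = [line for line in LINES if members.get(line)]
    if not present:
        raise ValueError("family has no genes from any line")
    if len(present) == len(LINES):
        return "core"
    if len(present) == 1:
        return f"private:{present[0]}"
    return f"shared_by_{len(present)}"


def private_genes(families, line):
    """Return the sorted gene IDs in families found only in the given line."""
    return sorted(
        gene
        for members in families.values()
        if classify_family(members) == f"private:{line}"
        for gene in members[line]
    )


# Hypothetical toy families, for illustration only.
toy_families = {
    "fam1": {"Dan340": ["d1"], "B73": ["b1"], "Mo17": ["m1"], "SK": ["s1"]},
    "fam2": {"Dan340": ["d2", "d3"]},
    "fam3": {"B73": ["b2"], "SK": ["s2"]},
}
for name, members in toy_families.items():
    print(name, classify_family(members))
print(private_genes(toy_families, "Dan340"))
```

With real OrthoMCL output, counting the families labelled private to Dan340 and the genes they contain would correspond to the 359 families and 1806 genes reported above.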

Genetic variation analysis
To MaizeGDB [58], and the genome of SK was obtained from the National Genomics Data Center [59]. Next, the output of Nucmer was analyzed using SyRI [60] with default parameters to identify variations. Using this pipeline, we obtained sets of structural variations and wrote them to a VCF file. We also used PBSV ( (Table 5). Furthermore, the structural variations present in Mo17 and SK were also detected in this study (Tables 6 and 7). The dataset generated by PBSV is available in GigaDB [62]. These datasets provide abundant variation resources for future molecular improvement and breeding in maize.
Comparisons of the Dan340 genome with the reference genomes of three other common maize inbred lines identified 1806 genes from 359 gene families that were specific to Dan340. We also detected large numbers of structural variants between Dan340 and other maize inbred lines, which may underlie the phenotypic differences between Dan340 and other maize varieties. Therefore, the assembly and annotation of this genome improve our understanding of the intraspecific genomic diversity in maize and provide novel resources for maize breeding improvements.

DATA AVAILABILITY
The raw sequence data have been deposited in NCBI under project accession No. PRJNA795201. Data is also available in the GigaScience GigaDB repository [62].