Chromosome-level genome assembly of a benthic associated Syngnathiformes species: the common dragonet, Callionymus lyra

Background: The common dragonet, Callionymus lyra, is one of three Callionymus species inhabiting the North Sea. All three species show strong sexual dimorphism. The males show strong morphological differentiation, e.g., species-specific colouration and size relations, while the females of different species have few distinguishing characters. Callionymus belongs to the ‘benthic associated clade’ of the order Syngnathiformes. The ‘benthic associated clade’ is so far not represented by genome data and serves as an important outgroup for understanding the morphological transformation in ‘long-snouted’ Syngnathiformes such as seahorses and pipefishes.

Findings: Here, we present the chromosome-level genome assembly of C. lyra. We applied Oxford Nanopore Technologies’ long-read sequencing, short-read DNBseq, and proximity-ligation-based scaffolding to generate a high-quality genome assembly. The resulting assembly has a contig N50 of 2.2 Mbp and a scaffold N50 of 26.7 Mbp. The total assembly length is 568.7 Mbp, of which over 538 Mbp were scaffolded into 19 chromosome-length scaffolds. The identification of 94.5% complete BUSCO genes indicates high assembly completeness. Additionally, we sequenced and assembled a multi-tissue transcriptome with a total length of 255.5 Mbp that was used to aid the annotation of the genome assembly. The annotation resulted in 19,849 annotated transcripts and identified a repeat content of 27.7%.

Conclusions: The chromosome-level assembly of C. lyra provides a high-quality reference genome for future population genomic, phylogenomic, and phylogeographic analyses.


DATA DESCRIPTION

BACKGROUND INFORMATION
Until recently, the family Callionymidae was placed into the order Perciformes, which is often considered a 'polyphyletic taxonomic wastebasket for families not placed in other orders' [1]. However, recent phylogenetic analyses suggest a placement of Callionymidae within the order Syngnathiformes, which currently contains ten families with highly derived morphological characters, such as pipefishes and seahorses [1]. Syngnathiformes has recently been divided into two clades, a 'long-snouted clade' and a 'benthic associated clade,' each comprising five families [2]. The 'long-snouted clade' (Syngnathidae, Solenostomidae, Aulostomidae, Centriscidae, and Fistulariidae) is currently represented by genomes from the Gulf Pipefish (Syngnathus scovelli) and the Tiger Tail Seahorse (Hippocampus comes) [3,4] and by additional draft assemblies of pipefish [5]. A genome of the 'benthic associated clade' (Callionymidae, Draconettidae, Dactylopteridae, Mullidae, and Pegasidae) has not yet been sequenced and analysed. Callionymidae comprises 196 species [6], of which the common dragonet, Callionymus lyra (Linnaeus, 1758) (Figure 1), is one of three Callionymus species inhabiting the North Sea [7]. All three species also occur in the East Atlantic and the Mediterranean Sea [6]. They represent essential prey fish for commercially important fish species such as the cod (Gadus morhua) [8]. The males of the North Sea dragonet species (C. lyra, C. maculatus, C. reticulatus) show strong morphological differentiation in the form of species-specific colouration and size relations. The much less conspicuous females can be distinguished morphologically only with considerable inaccuracy, by the presence or absence of their preopercular basal spine and by various percentage length ratios. The great resemblance among the females of the different species, together with the fact that all three species occur in sympatry, suggests that hybridization among them is possible.
Here, we present the chromosome-level genome of the common dragonet, representing the first genome of the 'benthic associated' Syngnathiformes clade as a reference for future population genomic, phylogenomic, and comparative genomic analyses. The chromosome-level genome assembly was generated as part of a six-week university master's course. For a detailed description and outline of the course, see Prost et al. [9].

[…] were initially frozen at −20°C on the ship and later stored at −80°C until further processing.

SAMPLING, DNA EXTRACTION, AND SEQUENCING
The study was conducted in compliance with the 'Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from Their Utilization'.
We extracted high molecular weight genomic DNA (hmwDNA) from muscle tissue of the female individual following the protocol of Mayjonade et al. [10]. The quantity and quality of the DNA were evaluated using the Genomic DNA ScreenTape on the Agilent 2200 TapeStation system (Agilent Technologies). Library preparation for long-read sequencing followed the […] We also sent tissue samples to BGI Genomics (Shenzhen, China) to generate further sequencing data. A 100 bp paired-end short-read genomic DNA sequencing library was prepared from the muscle tissue of the female individual. This library was later used for genome assembly polishing. Moreover, a 100 bp paired-end RNAseq library was prepared from pooled RNA isolates derived from kidney, liver, gill, gonad, and brain tissues of the male individual. Both libraries were sequenced on BGI's DNBseq platform (BGISEQ-500/DNBSEQ-G50 sequencing) [11]. We received a total of 159,925,221 read pairs (∼32 Gbp) of pre-filtered genomic DNA sequencing data and 61,496,990 read pairs (∼12.3 Gbp) of pre-filtered RNAseq data.
Furthermore, we prepared a Hi-C library using the Dovetail™ Hi-C Kit (Dovetail Genomics, Santa Cruz, California, USA) from muscle tissue of the female and sent the library to Novogene Co., Ltd. (Beijing, China) for sequencing on an Illumina NovaSeq 6000. Sequencing yielded a total of 104,668,356 pre-filtered 150 bp paired-end read pairs, or 31.4 Gbp of sequencing data. These data were used for proximity-ligation scaffolding of the assembly.
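The reported yields follow directly from the read-pair counts and read lengths given above. The short sketch below reproduces that arithmetic; the function name `yield_gbp` and the dictionary labels are illustrative and not part of the original workflow.

```python
def yield_gbp(read_pairs: int, read_length_bp: int) -> float:
    """Total sequenced bases of a paired-end library in Gbp (two reads per pair)."""
    return read_pairs * 2 * read_length_bp / 1e9


# Read-pair counts and read lengths as reported in the text.
libraries = {
    "DNBseq genomic DNA (100 bp PE)": (159_925_221, 100),
    "DNBseq RNAseq (100 bp PE)": (61_496_990, 100),
    "Hi-C, NovaSeq 6000 (150 bp PE)": (104_668_356, 150),
}

for name, (pairs, length) in libraries.items():
    print(f"{name}: {yield_gbp(pairs, length):.1f} Gbp")
```

The three values round to 32.0, 12.3, and 31.4 Gbp, matching the figures quoted for the pre-filtered data.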

GENOME SIZE ESTIMATION
We estimated the genome size of C. lyra using both k-mer frequencies and flow cytometry. The k-mer frequency for k = 21 was calculated from the short-read DNBseq data and summarized as histograms with jellyfish v.2.2.10 (RRID:SCR_005491) [12]. Plotting the histograms and calculating the genome size and heterozygosity with GenomeScope v.1.0 (RRID:SCR_017014) [13] resulted in a genome size estimate of approximately 562 Mbp. For the genome size estimation using flow cytometry, frozen muscle tissue was finely chopped with a razor blade in 200 μl LeukoSure Lyse Reagent (Beckman Coulter Inc., Fullerton, CA, USA). Large debris was removed by filtering through a 40 μm nylon cell strainer, and an RNase treatment was performed at a final concentration of 0.3 mg/ml. Simultaneously, we stained the DNA in the nuclei with propidium iodide (PI) at a final concentration of 0.025 mg/ml and incubated the solution for 30 min at room temperature, protected from light. Fluorescence intensities of the nuclei were recorded on the CytoFLEX Flow Cytometer (Beckman Coulter Inc., Fullerton, CA, USA). The domestic cricket (Acheta domesticus, C-value: 2.0 pg) was used as a reference to determine the genome size of C. lyra.
For a more precise estimate, we analysed five independent technical replicates, resulting in an average C-value of 0.66 pg, which corresponds to a haploid genome size of approximately 645 Mbp. […] (Table 1).
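The two estimation routes can be illustrated with a few lines of arithmetic. The sketch below assumes the standard conversion of 1 pg DNA ≈ 978 Mbp and a simple proportional relationship between fluorescence and DNA content. The k-mer function is a deliberately naive stand-in (total k-mers divided by the main peak depth) and is not the model GenomeScope fits. All function names and the toy inputs are hypothetical.

```python
MBP_PER_PG = 978.0  # assumed standard conversion: 1 pg of DNA is about 978 Mbp


def c_value_from_fluorescence(sample_peak: float, reference_peak: float,
                              reference_c_value_pg: float) -> float:
    """Scale the reference C-value by the ratio of PI fluorescence peaks."""
    return reference_c_value_pg * sample_peak / reference_peak


def c_value_to_mbp(c_value_pg: float) -> float:
    """Convert a haploid C-value in picograms to a genome size in Mbp."""
    return c_value_pg * MBP_PER_PG


def naive_kmer_genome_size(histogram: dict, min_depth: int) -> float:
    """Naive estimate: (total k-mers above an error cutoff) / (main peak depth).

    `histogram` maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below `min_depth` are treated as sequencing errors and ignored.
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth


# Reported average C-value of 0.66 pg -> roughly 645 Mbp.
print(f"{c_value_to_mbp(0.66):.0f} Mbp")

# Toy histogram (made-up numbers) to show the k-mer calculation.
toy = {1: 500, 2: 100, 9: 10, 10: 80, 11: 10}
print(naive_kmer_genome_size(toy, min_depth=3))
```

With the reported C-value the conversion gives about 645 Mbp, consistent with the flow cytometry estimate in the text. The gap to the 562 Mbp k-mer estimate is not explained by this sketch.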

GENOME ASSEMBLY AND POLISHING
After concatenation of all read files, the final dataset was further examined with NanoPlot v.1.0.0 (Table 1) [15]. It comprised a total of 31 Gbp, or approximately 55-fold coverage, as the basis for the genome assembly.
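Fold coverage is the total number of sequenced bases divided by the genome size. The text does not state which genome size estimate was used as the denominator, so the sketch below evaluates both estimates reported above. The function name is illustrative.

```python
def fold_coverage(total_bases_bp: float, genome_size_bp: float) -> float:
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases_bp / genome_size_bp


ont_bases = 31e9  # 31 Gbp of nanopore reads, as reported

# Both genome size estimates from the text; which one underlies the
# "approximately 55-fold" figure is an assumption, not stated in the source.
for label, size_bp in (("k-mer estimate", 562e6),
                       ("flow cytometry estimate", 645e6)):
    print(f"{label}: {fold_coverage(ont_bases, size_bp):.1f}x")
```

Using the k-mer estimate of 562 Mbp gives about 55-fold coverage, in line with the reported value. The larger flow cytometry estimate gives a correspondingly lower figure.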
We assembled the genome of C. lyra with wtdbg2 v.2.2 (RRID:SCR_017225) [16] using the default parameters for ONT reads. The resulting assembly was subjected to a three-step polishing approach. First, a single iteration of racon v.1.4.3 (RRID:SCR_017642) [17] corrected errors typical of the MinION platform, namely homopolymer and repeat errors. Next, we ran one iteration of medaka v.0.11.5 [18] on the racon-polished assembly; according to its developers, medaka is most effective after a polishing run with racon. Following polishing with the long-read data, we used three iterations of pilon v.1.23 (RRID:SCR_014731) [19] to correct random and single-base errors with the high-quality short-read data.
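The order of the polishing steps can be sketched as a list of shell commands. This is only an illustration of the racon, medaka, and pilon sequence described above, not the commands actually run: the file names, thread count, the use of minimap2 for long-read overlaps, the use of bwa and samtools for short-read mapping, and the `pilon.jar` path are all assumptions. The function only builds the command strings and does not execute anything.

```python
from typing import List


def polishing_commands(draft: str, ont_reads: str, sr_r1: str, sr_r2: str,
                       threads: int = 8, pilon_rounds: int = 3) -> List[str]:
    """Return shell commands for racon -> medaka -> pilon polishing (not run)."""
    cmds = [
        # Step 1: one racon round, using long-read overlaps from minimap2.
        f"minimap2 -x map-ont -t {threads} {draft} {ont_reads} > ont.paf",
        f"racon -t {threads} {ont_reads} ont.paf {draft} > racon.fasta",
        # Step 2: one medaka round on the racon-polished assembly.
        f"medaka_consensus -i {ont_reads} -d racon.fasta -o medaka -t {threads}",
    ]
    current = "medaka/consensus.fasta"
    # Step 3: repeated pilon rounds; short reads are re-mapped each round.
    for i in range(1, pilon_rounds + 1):
        bam = f"pilon{i}.bam"
        cmds += [
            f"bwa index {current}",
            f"bwa mem -t {threads} {current} {sr_r1} {sr_r2} "
            f"| samtools sort -@ {threads} -o {bam} -",
            f"samtools index {bam}",
            f"java -jar pilon.jar --genome {current} --frags {bam} "
            f"--output pilon{i}",
        ]
        current = f"pilon{i}.fasta"
    return cmds


# Hypothetical input file names, for illustration only.
for cmd in polishing_commands("contigs.fasta", "ont_reads.fastq.gz",
                              "dnbseq_R1.fastq.gz", "dnbseq_R2.fastq.gz"):
    print(cmd)
```

Each pilon round takes the previous round's output as its input genome, which is why the short reads are mapped again before every iteration.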

ASSEMBLY QC AND SCAFFOLDING
We calculated assembly continuity statistics using QUAST v.5.0.2 (RRID:SCR_001228) [20] and performed a gene set completeness analysis using BUSCO v.4.0.6 (RRID:SCR_015008) [21] with the provided database for Actinopterygii orthologous genes […] complete, single copy) and only 4.4% missing BUSCOs, which suggests that the assembly contains most of the coding regions of the genome (Figure 2, Table 2).
To achieve chromosome-length scaffolds, we used the long-read based assembly and the generated Hi-C data as input for the HiRise scaffolding pipeline [22] as part of the Dovetail […] Mbp) of the total assembly length was scaffolded into 19 chromosome-length scaffolds (Figure 3A). The number of chromosome-length scaffolds is consistent with the haploid number of chromosomes derived from karyotypes of females of two Callionymidae species (C. beniteguri and Repomucenus ornatipinnis) [23]. Therefore, the number of chromosomes appears to be relatively conserved within Callionymidae, and it is likely that C. lyra follows the same chromosomal sex determination system as C. beniteguri and R. ornatipinnis (♀: X₁X₂-X₁X₂ (2n = 38); ♂: X₁X₂-Y (2n = 37)) [23]. For a final assembly quality control, we mapped the raw nanopore reads with minimap2 v.2.17-r941 [24] […] with a length of <200 bp from the final assembly (for final statistics see Table 3). In addition, we screened for mitochondrial sequence contamination with BLASTN v.2.9.0+ (RRID:SCR_001598) [27] using the available mitochondrial genome sequence of C. lyra […] JupiterPlot v.1.0 [28], found overall strong agreement with only a few differences (Figure 3B).
These differences likely constitute assembly errors in the contig assembly that were fixed by HiRise during scaffolding. A BUSCO analysis of the final assembly found slightly fewer complete BUSCO genes than in the wtdbg2 contig assembly (94.5% vs. 95.0%) (Figure 2, Table 2).
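The continuity statistics reported for the assembly, such as the contig and scaffold N50, are defined on the sorted sequence lengths: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below is a minimal implementation of that definition, not QUAST's own code, and the example lengths are made up.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy example: total length is 54, so half is 27; 10 + 9 + 8 reaches 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Applied to contig lengths this gives the contig N50, and applied to scaffold lengths the scaffold N50. The jump between the two values after Hi-C scaffolding reflects contigs being joined into chromosome-length scaffolds.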

TRANSCRIPTOME ASSEMBLY AND QUALITY
In addition to the genome, we assembled the transcriptome of C. lyra for subsequent use in the genome annotation using Trinity v.2.9.0 (RRID:SCR_013048) [29,30] based on the 12.3 Gbp multi-tissue RNAseq data. The resulting transcriptome assembly has a total length of 255.5 Mbp (Table 3). BUSCO analysis suggests a high transcriptome completeness, with 87.8% of orthologous genes found in the transcriptome assembly (Figure 2, Table 2).

Repeat annotation
To annotate repeats in the assembly, we created a custom de novo repeat library using RepeatModeler v.1.0.11 (RRID:SCR_015027) [31] and combined this library with the […] (Table 4). […] The scaffolds were blasted against the NCBI nucleotide database. Scaffolds with assignments to Proteobacteria or Uroviricota were removed from the final assembly.

Gene annotation
Prior to annotating genes, interspersed repeats in the genome were hard-masked and simple repeats soft-masked to increase the accuracy and efficiency of locating genes. Gene […] [3,4]. Of all identified gene models, 96% had an AED score of ≤ 0.5 (AED score distributions in GigaDB [14]), indicating a high quality of the annotated gene models [37]. In addition, BUSCO analysis identified 87.0% complete BUSCOs, which suggests a high completeness of the annotation (Figure 2, Table 2).

CONCLUSION
Here we report the first genome assembly of the 'benthic associated' Syngnathiformes clade, the sister group to the 'long-snouted clade' (e.g., seahorses and pipefish). The annotated genome of Callionymus lyra, with its high continuity (chromosome-level), provides an essential reference to study speciation and potential hybridization in Callionymidae and is an important resource for phylogenomic analyses among syngnathiform fish.

DATA AVAILABILITY
All raw data generated in this study, including Nanopore long reads, DNBseq short reads, Hi-C reads, and RNAseq data, as well as the chromosome-level assembly, are accessible at GenBank under BioProject PRJNA634838. Annotation, results files, and other data are available in the GigaDB repository [14].