A genome assembly of the Atlantic chub mackerel (Scomber colias): a valuable teleost fishing resource

The Atlantic chub mackerel, Scomber colias (Gmelin, 1789), is a medium-sized pelagic fish with substantial importance in the fisheries of the Atlantic Ocean and the Mediterranean Sea. Over the past decade, this species has gained special relevance, being one of the main targets of pelagic fisheries in the NE Atlantic. Here, we sequenced and annotated the first high-quality draft genome assembly of S. colias, produced with PacBio HiFi long reads and Illumina paired-end short reads. The estimated genome size is 814 Mbp, distributed into 2,028 scaffolds and 2,093 contigs with an N50 length of 4.19 and 3.34 Mbp, respectively. We annotated 27,675 protein-coding genes and the BUSCO analyses indicated high completeness, with 97.3% of the single-copy orthologs in the Actinopterygii library profile. The present genome assembly represents a valuable resource to address the biology and management of this relevant fishery. Finally, this genome assembly ranks fourth in high-quality genome assemblies within the order Scombriformes and first in the genus Scomber.

. Scomber colias is usually found at depths of up to 300 m and occupies a key position in the trophic web. This species acts as a link between primary producers and top predators, since it feeds mainly on zooplankton and small pelagic fish, and is an essential element of the diet of larger pelagic fish (e.g., tuna, swordfish, and sharks) and marine mammals (e.g., dolphins and seals) [3]. Besides its ecological importance, S. colias also supports important commercial fisheries for several countries across its distribution range, being an important component in the diet of local populations [1,4]. This is probably related to its nutritional value, as this mackerel is a valued source of important fatty acids for human nutrition, particularly docosahexaenoic acid (DHA), an omega-3 fatty acid [5,6]. Additionally, S. colias is used as bait for the tuna longline and handline fisheries, and is caught in purse seine and pelagic trawl fisheries which target sardines and anchovies [7].
The availability of S. colias makes it a sustainable marine resource [6] and a viable alternative to the European sardine (Sardina pilchardus), which is under fishing restrictions due to a population decline. Curiously, fluctuations in abundance and a northwards shift in the distribution of S. colias, with a likely inverse relationship with sardine abundance, have been recently demonstrated [8]. Due to its ecological and economic importance, S. colias has been the focus of several recent studies on different aspects of its fisheries and biology [3,8,9]. Yet, genomic resources for the species are still limited. Presently a (liver) transcriptome [10], a mitogenome [11], and single-nucleotide polymorphism (SNP) data obtained through restriction site-associated DNA sequencing [12], have been described for the species. With the vast majority of the world's fish stocks already in collapse, and with climate change as additional pressure, information on fish genomes is becoming a pressing tool to address conservation efforts [13,14]. Here, we report the first high-quality draft genome of S. colias, assembled with Illumina and Pacific Biosciences (PacBio) Single Molecule High-Fidelity (HiFi) reads. This resource provides a critical platform to uncover the species' adaptive physiological potential in a changing environment. Specifically, it will help understand the current observed populational northward shift, postulated to be part of a more general expansion of species from warmer areas [8]. Moreover, being one of the genomes with higher quality within the family Scombridae and the first within the Scomber Gigabyte, 2022, DOI: 10.46471/gigabyte.40 2/21 genus, this information will help to improve the conservation, management, and sustainable exploitation of this valuable fish resource as well as that of its highly valued congeners.

Sampling and DNA extraction
Two specimens of S. colias were collected at 2 sampling time points. The first specimen was collected in 2017, during the "Programa Nacional de Amostragem Biológica" managed by the Instituto Português do Mar e da Atmosfera" (IPMA), in North Atlantic waters (41.501944 N 8.851667 W). From this individual, 2 tissue types were collected, and were stored in 100% ethanol (muscle) or RNA later (liver). Liver tissue was used to produce and describe the first liver transcriptome of S. colias [10]. Muscle tissue was used in the present study, for genomic DNA (gDNA) extraction using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany), following the manufacturer's instructions. The gDNA was then used for Illumina paired-end (PE) sequencing (described below). The second specimen was caught in 2020, near Mira, Portugal (40.5588270 N 9.4529720 W). Immediately upon harvesting, the muscle was snap frozen in liquid nitrogen. The frozen tissue was shipped to Brigham Young University DNA Sequencing Center (BYU), where gDNA with high molecular weight was extracted from 1.1 g of muscle using the QIAGEN Genomic-tip 20/G kit. The quality and concentration of the gDNA were assessed with Qubit Fluorometric system (ThermoFisher), and the fragment size was determined with a fragment analyser (Agilent Technologies, RRID:SCR_013575) before loading on the Pacbio Sequel II system (PacBio Sequel II System, RRID:SCR_017990).

DNA sequencing libraries construction and sequencing
For the first DNA sample, Illumina PE library preparation and sequencing were carried out by Macrogen, Inc. (Seoul, Korea), using Illumina HiSeq X Ten platform (Illumina HiSeq X Ten, RRID:SCR_016385), with 250 bp PE configuration. For the second specimen, PacBio HiFi library preparation and sequencing were performed at BYU, following the manufacturer's recommendations [15]. The size-selected fraction had a mean read length of 15.3 kbp and was selected on the SageELF system (Sage Science, RRID:SCR_014808). The sequencing was conducted on two single-molecule, real-time (SMRT) cells using Sequel II system v.9.0, with a run time of 30 h, and 2.9 h pre-extension. The circular consensus analysis was performed in SMRT ® Link v9.0 [16] under default settings (the statistics of raw data generated from each PacBio SMRT cell can be viewed in Additional File 1 [17, 18]).
Only HiFi reads with match hits over 90% identity and query coverage of 50% in the Actinopterygii taxon (NCBI:txid7898), or without match hits at all, were considered for further analysis (Figure 2).

Mitochondrial genome assembly
Given that 2 specimens were used for the distinct sequencing approaches, i.e., PacBio HiFi and Illumina PE, the whole mitochondrial genome (mtDNA) was assembled and

Nuclear genome assembly and assessment
For whole-genome assembly a combined approach, using short-and long-read assemblies, was applied ( Figure 2). While long-read assemblies were mainly used to produce the primary assembly, short-read assemblies were used to scaffold and improve the contiguity of the basal assembly. In summary, short-read assemblies were performed with the W2RAP  [28] with optimised parameters (default). While Hifiasm generated 2 pseudo-haplotypes per assembly, HiCanu generated 1 merged assembly. To choose the "best" assembly we applied a series of analyses, including Bandage (a bioinformatics application for nagivating de novo assembly graphs easily) v.0.8.1 [29] and manual inspection; Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.2.2 (BUSCO, RRID:SCR_015008) [30] with Eukaryota and Actinopterygii databases was used to assess the gene completeness of the assemblies, and Quality Assessment Tool for Genome Assemblies  (QUAST) v.5.0.2 (QUAST, RRID:SCR_001228) [31], to check general metrics of the assemblies ( Figure 2). Due to discrepancies in the length of the Hifiasm primary and alternative pseudo-haplotypes, we chose to concatenate them in a single assembly. At this point, the assembly with the highest complete BUSCO scores, highest contiguity (N50), and longest contig, was selected for further analysis. The pseudo-haplotypes were separated by purge_dups v.1.2.5 (purge dups, RRID:SCR_021173) [32]. After the first round of purging and inspection by k-mer plot, produced by the KAT tool, cutoffs were manually adjusted. To assess the influence of purge_dups in the genome, BUSCO (rate of deduplicates) and QUAST (N50 and genomic length per pseudo-haplotype) were used. Next, to improve the contiguity and quality of the assembly, short-read assemblies were used to structurally scaffold the assembly without the introduction of any new bases in the assembly, similar to the literature [33,34] (Figure 2). The 4 short-read assemblies were inputted to the Long Interval Nucleotide K-mer Scaffolder (LINKS) v.1.8.7 [35], being used as long reads; using several distance values, i.e., -d 0.5, 1.5, 3, 9, 27 kb, the primary assembly was rescaffolded interactively for 5 rounds (additional parameters: -k 21 -e 0.5). Furthermore, the scaffolded genome and the long-read assemblies, initially produced by Hifiasm and HiCanu and discarded based on contiguity and completeness, were inputted to Cobbler v.0.6.1 [36] and RAILS v.1.5.1 [36] pipeline, with default parameters. This allowed gap filling of ambiguity regions (produced by short-read scaffolding), and further rescaffolding using long-read information. To evaluate the final assembly, several metrics and software were used. In

Assessing the nuclear receptor and the "chemical defensome" repertoire in Scomber colias
To demonstrate the value of the present genome resourse, we collected the repertoire of nuclear receptors (NRs) in S. colias via TBLASTN (TBLASTN, RRID:SCR_011822) searches in the primary genome assembly with default parameters. Protein sequences of DNA-binding domains and ligand-binding domains in H. sapiens NRs were collected from the RefSeq [50] database and used in a query (NP_000466. To identify the genes related to the chemical defensome, target genes were selected based on a previous report profiling the "chemical defensome" of teleost species [68]. Next, gene names were used as queries to search the deduced S. colias genome annotation, a simple but successful approach for well-annotated genomes such as D. rerio [68]. When gene names were not retrieved from S. colias genome annotation (i.e., fthl, gstp, hsph, maff, nme8, slc21), further TBLASTN searches were performed in the primary genome assembly with optimised parameters (-max_hsps 1 to keep the best query-subject pair), using D. rerio sequences as a query.

Demography with pairwise sequentially Markovian coalescent (PSMC)
To explore the variation in the demographic history of the species, a pairwise sequentially

DATA VALIDATION
To produce the S. colias genome assembly, 2 sequencing strategies were used: Illumina PE short reads and PacBio HiFi long reads. The PE dataset was used to assess the genomic proprieties of the S. colias species and scaffold the long-read assembly, while HiFi reads were used to perform the primary genome assembly and gap closing (Figure 2).
The Illumina sequencing yielded 149 M of PE reads and the PacBio sequencing generated 1.7 M of HiFi reads (Table 1). Trimmed short reads were used to estimate the genome size (817 Mbp), heterozygosity rate (1.31%), and genome repeat content (approximately 26%), using GenomeScope2 (Figure 3). The complete statistics of GenomeScope2 can be consulted in Additional File 3 [17, 18]. In parallel, the HiFi dataset was inspected, and mitochondrial reads, as well as possible sources of contamination, were removed (amounting to 0.31% of the initial dataset) ( Table 1).
For the mtDNA assemblies, a total of 38,868 mtDNA PE reads were filtered by GetOrganelle and a total of 792 mtDNA PacBio HiFi reads were filtered by BLASTN search.
The 2 assemblies had the same length, 16,570 bp, and differed from each other by 0.    (uncorrected p-distances). The mtDNA gene content and arrangement is as expected for most fishes and is standard for vertebrates [72], consisting of 13 protein-coding genes, 22 transfer RNA (trn), and 2 ribosomal RNA (rrn) (Figure 4). The primary genome assembly was produced using filtered PacBio HiFi reads and the below software packages and settings. Following the above-mentioned criteria (Material and Methods: Nuclear genome assembly and assessment) the Sco_k21 assembly was selected, with both pseudo-haplotypes merged and subjected to purge_dups. Detailed statistics of Hifiasm and HiCanu genome assemblies can be consulted in Additional   File 4 [17, 18]. Although the purge_dumps generated a primary and an alternative assembly, only the primary assembly was used in subsequent steps. At the same time, 4 short-read genome assemblies were performed with W2RAP software, and contigs with over 500 bp were used as "long reads" to scaffold the primary assembly. Additional File 5 shows QUAST and BUSCO statistics for the PE genome assemblies [17,18]. Importantly, during the scaffolding process, only structural information of short-read assemblies was used, without the inclusion of bases. Lastly, the remaining non-basal long read assemblies were used to fill gaps inserted during the scaffolding stage. The final assembly (primary assembly) of   (Table 2). In the primary assembly, the k-mer analyses (via Merqury) showed a low level of k-mer duplication in the genome (colour blue, green, purple, and orange in Figure 5a), indicating a high level of haplotype uniqueness (red colouring in Figure 5a), and a similar k-mer distribution pattern to GenomeScope2 (performed with Illumina PE reads). Additionally, we found a high mapping rate in the Illumina, PacBio, and RNA-Seq reads, against the primary assembly of 95%, 99.8%, and 90.02%, respectively. Overall, these results provide evidence of the high quality of the S. colias genome assembly (Table 2). Our S. colias genome assembly ranks fourth in high-quality genome assemblies within the order Scombriformes and first in the genus Scomber (Additional File 6) [17, 18].
The RepeatMasker software masked 29.62% of bases in the primary genome assembly.
The masked regions were predominantly linked to DNA elements (11.66%), long interspersed nuclear elements (4.11%), long terminal repeats (2.58%), and simple repeats (2.88%). Furthermore, 8.62% of the genome was masked and annotated as "unclassified", and only a small percentage were classified as short interspersed nuclear elements, small RNA, or satellite repeats (Table 3). The genome annotation process generated about 27,675 protein-coding genes and 30,999 protein-coding sequences. On average, we found 9.5 exons and 1,656 bp lengths per CDS (Table 4). Of the CDS, 30,355 had at least 1 BLASTP hit in SwissProt or RefSeq databases, 27,101 were identified in the InterPro database, and 21,664 of these were classified as belonging to a specific homolog superfamily (Table 5).
To validate the protein-coding sequences we performed phylogenetic analysis (via OrthoFinder) and BUSCO analysis (using the Actinopterygii library profile) (Figure 5b   analysis. Alignment, trimming and concatenation of all single-copy orthologues, resulted in a final 120,886 aa-long supermatrix alignment that was used for phylogenomic inference in IQ-Tree. The resulting Maximum Likelihood phylogenetic tree has maximum support for almost all nodes (Figure 5b). The phylogeny recovered the reciprocal monophyletic Acanthopterygii groups Pelagiaria, Eupercaria, Anabantaria, Carangaria, and Ovalentaria, with Pelagiaria being the basal clade and represented by the 3 Scombrifomes, including S. colias (Figure 5b). These results are in accordance with the most recent phylogenomic study of ray-finned fishes [73], as well as the Ensembl Compara Species Tree of Ensembl database [51]. BUSCO analysis showed the S. colias proteome with 93.6% of the groups complete, 2% fragmented, and 4.4% missing (Figure 5c). In comparison, T. maccoyii had 99.8% BUSCO groups complete, while T. orientalis had but 82.8%. These results are expected, since the T. maccoyii genome assembly, part of the Vertebrate Genome Project [74], was built at chromosome level, with multiple technologies (including 46x PacBio data, 46x 10X Genomics Chromium data, BioNano data, and Arima Hi-C data) and several manual curation steps [75]. In contrast, both T. Orientalis [76] and S. colias were built at scaffold level using only short-and long-read information.
We further explored the quality of the annotation by investigating the repertoire of the NRs superfamily in the S. colias assembly. NRs are critical molecular physiology components, with vital roles in animal physiology and disruption [77]. Moreover, their   in line with the repertoire described for other teleost species [78]. Among the retrieved NRs we found those that are key components of the "chemical defensome"-an ensemble of related and unrelated genes that protect organisms against chemical stressors, and are thus critical under anthropogenic chemical build-up and climate change scenarios-such as the xenobiotic-inducible pregnane X receptor (pxr, nr1i2) [68,79]. Subsequent analysis, using gene names, further suggested the presence of gene annotations for the vast majority of the reported members of the teleost "chemical defensome" in S. colias, similarly to that  We additionally validated our dataset by examining the present population structure of the species, since the genome may also provide clues regarding its past demographic history [69]. One popular method to produce these inferences is the pairwise sequentially Markovian coalescent (PSMC) model, here applied to the S. colias final genome assembly.
Since PSMC requires an estimation of the genome-wide mutation rate, and since this has never previously been produced for S. colias, we used the recently estimated genome-wide mutation rate of the yellowfin tuna, T. albacares, of 7.3 × 10 −9 mutations/site/generation [70].
The results suggest a past population expansion between 160,000-115,000 years ago, with maximum effective population size (N e ) of 36,000 during the end of the Mid-Pleistocene Transaction, corresponding to the Eemian (i.e., the last interglacial period) and the transition between Marine Isotope Stages (MIS) 5 and 6 ( Figure 6). This population expansion is followed by an apparent decrease in the N e to around 25,000 at the beginning of the Late Pleistocene, corresponding to the beginning of the Last Glacial Period. These results, suggesting the influence of climatic changes from the Pleistocene glaciation cycles on the N e , are following other recent studies on Scombriformes, such as the Pacific Sierra mackerel, Scomberomorus sierra [80], and the Indo-Pacific Yellowfin tuna T. albacares [70], as well as in other pelagic marine species such as the killer whale [81].

REUSE POTENTIAL
This study reports the first genome assembly of Atlantic chub mackerel. Scomber colias is a valuable marine resource, with a high impact on the fisheries of several countries along the west coast of the Atlantic Ocean and the Mediterranean Sea. Ecologically, this species