The genome of a giant (trevally): Caranx ignobilis

Caranx ignobilis, commonly known as giant kingfish or giant trevally, is a large, reef-associated apex predator. It is a prized sportfish, targeted throughout its tropical and subtropical range in the Indian and Pacific Oceans. It has also gained significant interest in aquaculture due to its unusual freshwater tolerance. Here, we present a draft assembly of the estimated 625.92 Mbp nuclear genome of a C. ignobilis individual from Hawaiian waters, which host a genetically distinct population. Our 97.4% BUSCO-complete assembly has a contig NG50 of 7.3 Mbp and a scaffold NG50 of 46.3 Mbp. Twenty-five of the 203 scaffolds contain 90% of the genome. We also present noisy, long-read DNA, Hi-C, and RNA-seq datasets; the RNA-seq dataset covers eight distinct tissues and can help with annotations and studies of freshwater tolerance. Our genome assembly and its supporting data are valuable tools for ecological and comparative genomics studies of kingfishes and other carangoid fishes.


Oceans [9], and is heavily targeted by small-scale and recreational fisheries throughout its range. Understanding its evolutionary and ecological role in ecosystem structure and function is important for fisheries management and the protection of reef and coral ecosystems. Importantly, new putative populations of C. ignobilis in the Indian and Pacific Oceans have recently been described using genomic datasets [10]. A highly continuous genome allows for the inference of demographic history, the detection of genomic signals of selection and adaptation, and comparative genomic studies with other carangoid fishes, such as studies of hybridization with the closely related bluefin trevally, Caranx melampygus [11].
For our C. ignobilis assembly, we present the results derived from 58.25 Gbp of Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing data. Illumina paired-end sequencing data were also generated from libraries for both RNA-seq and Hi-C, totaling 347.6 Gbp. Both datasets were used for scaffolding and are also valuable individually. The estimated genome size is 625.92 Mbp [14,15], of which 96.7% is covered by known bases in the primary haploid assembly. In addition to being highly contiguous, our genome assembly contains complete, unduplicated copies of >95% of the expected single-copy orthologs, suggesting the assembly is reasonably complete. This draft assembly and the supporting sequencing datasets are of sufficiently high quality to serve as valuable resources for a variety of prospective comparative and population genomics studies.
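As a rough illustration of what these figures imply, the sketch below derives the nominal PacBio sequencing depth and the number of known bases from the values quoted above. It assumes that the 58.25 Gbp figure is the yield before any correction or filtering and that depth is simply yield divided by the estimated genome size; the variable and function names are hypothetical and are not taken from the authors' scripts.

```python
# Values quoted in the text above.
PACBIO_BASES = 58.25e9    # PacBio SMRT yield, in bp
GENOME_SIZE = 625.92e6    # estimated genome size, in bp
KNOWN_FRACTION = 0.967    # share of the estimate covered by known bases


def nominal_depth(total_bases: float, genome_size: float) -> float:
    """Return the average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Assumption: depth is computed from the quoted yield, before read correction.
pacbio_depth = nominal_depth(PACBIO_BASES, GENOME_SIZE)
known_bases = KNOWN_FRACTION * GENOME_SIZE

print(f"PacBio depth: {pacbio_depth:.1f}x")
print(f"Known bases:  {known_bases / 1e6:.1f} Mbp")
```

Under these assumptions, the long-read data correspond to roughly 93-fold nominal depth, and the known bases in the primary assembly amount to about 605 Mbp.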

METHODS
An overview of the methods used in this study is provided here. Where appropriate, additional details, such as the code of custom scripts and the commands used to run software tools, are provided in a file in GigaDB [16].

Sample acquisition and sequencing
Blood, brain, eye, fin, gill, heart, kidney, liver, and muscle tissues from one C. ignobilis University (BYU; Provo, Utah, USA) and stored at −80°C until sequencing. The blood sample was used to create the Omni-C dataset. All the non-blood tissue samples were used for short-read RNA sequencing; the heart tissue was also used for long-read DNA sequencing. Hi-Seq 2500 (RRID:SCR_016383) at the DNASC. Finally, the "Omni-C Proximity Ligation Assay Protocol" version 1.0 was followed using a Dovetail Genomics Omni-C kit to prepare the DNA for Illumina paired-end sequencing. Adapters were obtained from Integrated DNA Technologies, and sequencing proceeded in Rapid Run mode for 250 cycles in one lane on an Illumina Hi-Seq 2500.
The scaffolds were visually inspected using a Hi-C contact matrix (Figure 4) created with PretextView v0.1.4 (https://github.com/wtsi-hpag/PretextView) (RRID:SCR_022024) and
Figure 3. The NG curve and the area under it are plotted for the contigs and scaffolds. This visually demonstrates an increase in continuity from contigs to scaffolds. Scaffolding with RNA-seq data, which has minimal effect on its own (data not shown), further increases the scaffold-level continuity. This plot also shows that duplicate purging and fixing misassemblies slightly reduced the contig-level continuity, as expected.
The visual comparisons with other carangoid genomes were created for the cursory comparative genomics analysis and coarse validation via the observation of general similarities. Dot plots were generated using Mashmap v2.0 commit #ffeef48 (RRID:SCR_022194) [41] (-f 'one-to-one' -pi 95 -s 10000) and the comparison of single-copy orthologs was created using ChrOrthLink commit #d29b10b (RRID:SCR_022195) after the assessment with BUSCO v3.0.6 [39] using the Vertebrata set from OrthoDB v9 [42]. The genome assemblies obtained from NCBI for these analyses were the following (alphabetical order): Caranx melampygus (bluefin trevally) [11], Echeneis naucrates (live sharksucker) [43,44], Seriola dumerili (greater amberjack) [43,44], Seriola quinqueradiata (yellowtail) [45,46], Seriola rivoliana (longfin yellowtail) [47], Trachinotus ovatus (golden pompano) [48,49], and Trachurus trachurus (Atlantic horse mackerel) [50][51][52].
Figure 4. In the context of scaffolding, Hi-C contact matrices show how correct the scaffolds are based on Hi-C alignment evidence. The longest 26 scaffolds are shown, ordered by descending length from top-left to bottom-right; the grey lines show the scaffold boundaries. Off-diagonal marks, especially dark and large ones, are possible evidence of mis-assembly and/or incorrect scaffolding.
Regions with sharp edges similar to where the grey lines appear, but without the grey lines (e.g., three such locations occur in the top-left square), are joins between contigs in that scaffold that lack Hi-C evidence. The lack of Hi-C alignment evidence could suggest that these joins are invalid; however, evidence for these joins does exist from the RNA-seq alignments. The detection of any spurious joins would, at a minimum, require manual curation. Such curation would enable additional adjustments to fix the minor issues evidenced in the contact matrix.
distribution is plotted in Figure 2. A summary of the results for the sequencing run is available in Table 1. This genome is the second for the Caranx genus and ranks highly in terms of N50 among the available carangoid genomes [49,51].
The RNA-seq from the eight tissues (i.e., brain, eye, fin, gill, heart, kidney, liver, and muscle) generated 435.99 M pairs of reads totaling 108.30 Gbp. Across all eight tissues, the mean and N50 read lengths were 124.21 and 125 bp, respectively. The combined results from all eight tissues are provided in Table 1, while the results from each tissue are available in Table 2. Omni-C sequencing generated 80.92 Gbp of data across 169.1 M read pairs. The N50 and mean read length were respectively 250 and 239.3 bp.
The Omni-C results are also provided in Table 1 with the PacBio and RNA-seq data.
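The reported mean read lengths can be cross-checked against the reported yields and read-pair counts. The short sketch below does this under the assumption that each read pair contributes exactly two reads and that the mean is total bases divided by total reads; the function name is hypothetical.

```python
def mean_read_length(total_bases: float, read_pairs: float) -> float:
    """Mean read length for paired-end data: total bases over total reads."""
    return total_bases / (2 * read_pairs)


# Yields and read-pair counts quoted in the text above.
rna_mean = mean_read_length(108.30e9, 435.99e6)   # RNA-seq, eight tissues
omnic_mean = mean_read_length(80.92e9, 169.1e6)   # Omni-C

print(f"RNA-seq mean read length: {rna_mean:.1f} bp")
print(f"Omni-C mean read length:  {omnic_mean:.1f} bp")
```

Both values agree with the reported means (124.21 bp and 239.3 bp) to within the rounding of the quoted yields and read counts.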
The RNA-seq and Omni-C reads were not corrected, but the quality was assessed using fastqc [54].
Table 1. The results from each type of DNA and RNA sequencing from Caranx ignobilis. PE = Paired-end reads; SMRT = single-molecule real-time sequencing; CLR = continuous long-reads.
Table 2. Results of the RNA sequencing of each tissue from one Caranx ignobilis individual. The eight tissues were spread across two lanes and run on an Illumina Hi-Seq 2500 in Rapid Run mode for 250 cycles to generate paired-end reads. Unless otherwise specified, lengths of nucleotide sequences are measured in base pairs (bp).

PacBio CLR error correction
The correction process reduced the number of reads from 3. 74

Genome assembly, duplicate purging, and scaffolding
The initial assembly generated by Canu comprised 1.8 K contigs for a total assembly size of 758 Mbp. That was a diploid assembly: both haplotypes were present and intermixed, separated whenever a bubble in the assembly graph prevented a single, reasonable contig.
can be visualized through the auNG as shown in Figure 3 (also see Table 3).
Table 3. Continuity statistics for the Caranx ignobilis genome assembly at the contig and scaffold level. Note that the auNG value is the area under the NG curve, not the N curve. The final set of scaffolds (far right column) is the same as "Scaffolds (SALSA + Rascaf)" except that the identified contaminants were manually removed from the assembly and the gaps were unified to 100 Ns. Unless otherwise specified, all nucleotide sequences are measured in base pairs (bp).
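For readers unfamiliar with these continuity metrics, the following minimal sketch shows one common way to compute NG50 and the area under the NG curve (auNG) from a list of sequence lengths and an estimated genome size. It is an illustration on toy values, not the authors' implementation: the function names are hypothetical, and the choice to truncate the curve at 100% of the genome size is an assumption.

```python
def ngx(lengths, genome_size, x=50):
    """Length of the sequence at which the running total, longest first,
    reaches x% of the estimated genome size (0 if it is never reached)."""
    target = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def aung(lengths, genome_size):
    """Area under the NG curve, with the x-axis running from 0 to 100% of
    the estimated genome size (assumption: the curve is cut off at 100%)."""
    running = 0
    area = 0
    for length in sorted(lengths, reverse=True):
        covered = min(length, genome_size - running)
        if covered <= 0:
            break
        area += length * covered  # curve height times the width it spans
        running += length
    return area / genome_size


# Toy data: four contigs, then the same bases after joining the two longest.
contigs = [40, 30, 20, 10]
scaffolds = [70, 20, 10]
print(ngx(contigs, 100), aung(contigs, 100))      # 30 30.0
print(ngx(scaffolds, 100), aung(scaffolds, 100))  # 70 54.0
```

In the toy example, joining two contigs raises both metrics without adding any bases. Because auNG integrates over the whole curve, it also responds to joins that do not change the sequence sitting at the 50% mark.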
Paired-end Illumina reads, such as those produced from Hi-C or RNA-seq libraries, can provide information to order and orient contigs into scaffolds. However, they contain insufficient information for gap-filling procedures. Accordingly, scaffolding should increase sequence lengths, decrease the number of sequences, and leave the number of known bases unchanged. This pattern was evident in the assembly statistics from our iterative scaffolding procedure (Table 3). It is important to note that SALSA and Rascaf introduce gaps of unknown size, using fixed runs of 500 and 17 Ns, respectively, to represent such gaps. For submission to NCBI, these gaps were converted to a fixed length of 100 Ns; the evidence for whether the joins were supported by Hi-C or RNA-seq data was submitted in an accompanying file in AGP format Table 3). All joins were represented in a contact matrix (Figure 4), showing the Hi-C evidence for the assembly. Some joins were poorly supported by the Hi-C evidence, which was not surprising as some joins were based on RNA-seq evidence instead. Without manual curation, it is difficult to ascertain whether any individual join is spurious.
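The gap unification described above can be sketched in a few lines. The example below is a hypothetical illustration rather than the script used for the NCBI submission: it assumes that every run of N characters represents a scaffolding gap (so an isolated ambiguous base would also be expanded), and the function names are invented for this sketch.

```python
import re

# Assumption: any run of N/n marks a gap of unknown size.
GAP_RUN = re.compile(r"[Nn]+")


def unify_gaps(sequence: str, gap_length: int = 100) -> str:
    """Replace every run of Ns with a fixed-length run of Ns."""
    return GAP_RUN.sub("N" * gap_length, sequence)


def known_bases(sequence: str) -> int:
    """Count the bases that are not N."""
    return len(sequence) - sequence.upper().count("N")


# Toy scaffold with one SALSA-style gap (500 Ns) and one Rascaf-style gap (17 Ns).
scaffold = "ACGT" + "N" * 500 + "GGCC" + "N" * 17 + "TTAA"
unified = unify_gaps(scaffold)

print(len(scaffold), len(unified))                   # 529 212
print(known_bases(scaffold), known_bases(unified))   # 12 12
```

The total length changes because the gap sizes are arbitrary placeholders, whereas the number of known bases is unchanged, which matches the expectation that scaffolding and gap unification add no sequence.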
The assembly completeness, as assessed with single-copy orthologs, was also evaluated at the contig and scaffold level (
Summary BUSCO results for the Caranx ignobilis genome assembly at the various contig and scaffold stages. Each value is the percentage of the single-copy orthologs (n = 3,640) in the Actinopterygii lineage dataset from OrthoDB v10.
Summary of repeat content in the Caranx ignobilis genome assembly as reported by RepeatMasker [29] using the Dfam v3.3 [33] and RepBase RepeatMasker v20181026 [34,35] repeat libraries.

Comparison between the genomes of the giant trevally and other carangoids
We compared the C. ignobilis genome with the published genomes of other carangoids spanning the carangoid phylogeny, including the live sharksucker (Echeneis naucrates) [43,44], the golden pompano (Trachinotus ovatus) [48,49], the yellowtail (Seriola quinqueradiata) [45,46], the longfin yellowtail (Seriola rivoliana) [47], the greater amberjack (Seriola dumerili) [57,58], the Atlantic horse mackerel (Trachurus trachurus) [50][51][52], and the closely related bluefin trevally (Caranx melampygus) [11]. We generated dot plots to visualize the genome alignments and look for general similarities between the genomes (Figure 5). Some structural variations can be seen, but overall there do not appear to be regions of significant variation (e.g., inversions or frameshifts) between C. ignobilis and other carangoid species. We similarly compared the same genomes by visualizing the grouping of single-copy orthologs plotted along the assemblies (Figure 6). Large groupings of orthologs consistently appear between genomes, suggesting orthology not just between genes but also between larger genomic regions. However, at this scale and by comparing several genomes at once, it is difficult to make more refined inferences on the evolution of
However, some joins based on RNA-seq data can be spurious under certain conditions, such as when RNA-seq reads split across introns and the mapping software mistakenly assigns each end to different genes with similar sequences (e.g., from duplication events or gene families). The true structure of the genome can be further elucidated by karyotype analysis, additional sequencing data (e.g., Ultra-long Nanopore (Oxford, England, UK)), and one-on-one comparisons with high-quality, chromosome-scale assemblies from related species. Ultimately, this genomic dataset is useful for future comparative studies on genome structure and evolution within Carangiformes and, more broadly, marine teleosts.
See Table 6 for a complete list of the datasets and their mapping to BioSamples. The contigs, the scaffolds resulting from Hi-C evidence, and the scaffolds resulting from Hi-C or RNA-seq evidence are also available from the Center for Open Science's (https://www.cos.io) Open Science Framework [80]. Snapshots of the code and other results files are available in the GigaDB repository [16].