Genome assembly of the roundjaw bonefish (Albula glossodonta), a vulnerable circumtropical sportfish

The roundjaw bonefish, Albula glossodonta, is the most widespread albulid in the Indo-Pacific and is vulnerable to extinction. We assembled the genome of a roundjaw bonefish from Hawai‘i, USA, which will be instrumental for effective transboundary management and conservation when paired with population genomics datasets. The 1.05 gigabase pair (Gbp) contig-level assembly had a 4.75 megabase pair (Mbp) NG50 and a maximum contig length of 28.2 Mbp. Scaffolding yielded an LG50 of 20 and an NG50 of 14.49 Mbp, with the longest scaffold reaching 42.29 Mbp. The genome comprised 6.5% repetitive elements and was annotated with 28.3 K protein-coding genes. We then evaluated population genetic connectivity between six atolls in the Western Indian Ocean with 38,355 SNP loci across 66 A. glossodonta individuals. We discerned shallow population structure and observed genetic homogeneity between atolls in Seychelles and reduced gene flow between Seychelles and Mauritius. The South Equatorial Current might be the limiting mechanism of this reduced gene flow. The genome assembly will be useful for addressing taxonomic uncertainties of bonefishes globally.

. Roundjaw bonefish (Albula glossodonta) adult. Quantitative morphological data for this illustration of A. glossodonta were obtained primarily from Hidaka et al. [18] and Shaklee and Tamaru [14]. These were evaluated to select specific values for details such as the number of pored lateral line scales (76) and the number of rays in the pectoral (18), dorsal (16), pelvic (10), and anal fins (9). Each of these was portrayed in the illustration to be at or near the middle of the ranges reported in the reference articles. While some limited information was found in the literature describing coloration and general shape, the artist found particular benefit in some excellent, detailed photographs by Derek Olthuis of samples that Dr. J. S. K. Kauwe caught in Hawai'i and later genetically identified as A. glossodonta. A full-resolution version of this illustration can be viewed at https: //www.timjohnsongallery.com/albula-glossodonta-illustration (also archived at the Internet Archive (https://web. archive.org) on 4 March 2022).
Most of the species in the A. vulpes complex can be found in the Caribbean Sea and Atlantic Ocean. By contrast, A. glossodonta can be found throughout the Indian and Pacific Oceans; this range overlaps slightly with A. koreana [19] from the A. vulpes complex and drastically with each species in the A. argentea complex [4]. A. glossodonta may be distinguished genetically from other species, but morphological identification based on its more rounded jaw and larger average size is difficult for non-experts [4,20]. This difficulty, alongside underregulated fisheries and anthropogenic habitat loss, poses considerable threats to the future of this species. A. glossodonta has been evaluated as "Vulnerable" on the International Union for the Conservation of Nature's (IUCN) Red List of Threatened Species™ [7], and several incidents of overexploitation, including regional extirpation, have been reported [21][22][23][24][25].

Context
The threat to A. glossodonta and other bonefish species will persist unless identification is made easier and population genomics techniques are employed to understand and identify evolutionarily significant units, areas of overlap between species, presence and extent of hybridization, and life-history traits, especially migration and spawning [4]. Genetic identification has previously been accomplished using only a portion of the mitochondrial cytochrome b gene and some microsatellite markers [6,9,15,19,[26][27][28][29][30][31][32][33], which probably provide an insufficient taxonomic history [4,[34][35][36]. To contribute to a more robust capacity for identification and enable more complex genomics-based analyses, we present a high-quality genome assembly of an A. glossodonta individual. A transcriptome assembly was also created and was used alongside computational annotation methods to create structural and functional annotations for the genome assembly. Additionally, we present results from a population genomic analysis of A. glossodonta populations in Seychelles and Mauritius, two island nations that support lucrative bonefish fly fishing industries.

METHODS
An overview of the methods used in this study is provided here. Where appropriate, additional details, such as the code for custom scripts and the commands used to run software, are provided in the annotated shell scripts hosted in GigaDB [37].

Tissue collection and preservation
Blood, gill, heart, and liver tissues from one A. An estimate of the number of k-mers present in the reads is required to run BFCounter.
This number is based on the number of reads, the length of the reads, and k-mer size according to this equation: where n is the number of reads, l is the read length, k is the k-mer size, and T is the total number of k-mers (not necessarily unique or distinct) present in the reads. This assumes a uniform read length.

PacBio CLRs
Several methods were attempted to correct the PacBio CLRs. Corrected reads from each method that did not fail were assembled, and the assembly results were used to choose the correction strategy. Ultimately, a hybrid correction strategy was employed. Typically, a "hybrid" correction strategy is one in which more than one data type (i.e., PacBio CLRs and Illumina short reads) are employed. This differs from a "self" correction strategy in which only the PacBio CLRs are used to correct themselves. Our hybrid strategy is not fully described by the word "hybrid" because both a "self" and "hybrid" strategy were serially employed; we referred to this strategy as a "dual" strategy ( Figure 2). First, the reads were self-corrected using Canu v1.6 (RRID:SCR_015880) [48]. Second, the self-corrected reads were further corrected using Illumina short-reads (previously corrected with Quake) using CoLoRMap (commit #baa680, RRID:SCR_022012) [49]. Note that Illumina short reads had to be interleaved into a single file for CoLoRMap. Similarly, all PacBio reads were concatenated into a single file, and the headers were required to be unique up to the first space. CoLoRMap is a pipeline with a basic wrapper script. In practice, it makes more sense to run each step in the wrapper script as separate jobs to avoid re-computing if a failure (e.g., too much random access memory (RAM) or time) occurs in a downstream step; this is how we ran the pipeline.

Genome size estimation
Genome size was estimated using a k-mer analysis on the corrected Illumina WGS reads.
First, the k-mer coverage was estimated using ntCard v1.0.1 [46]. The k-mer coverage histogram was computationally processed to calculate the area under the curve and identify the peak to determine genome size according to the following equation:  where a is the area under the curve, p is the number of times the k-mers occur (the x-value) at the peak, and s is the genome size.

Genome assembly, polishing, and scaffolding
Multiple assemblies were generated from various correction strategies. The final assembly was based on a hybrid correction strategy as described. The assembly was created using ( Table 2).

Transcriptome assembly
The transcriptome was assembled from Illumina RNA-seq reads from all three tissues (i.e., gill, heart, and liver). The raw reads were used because Rcorrector did not modify any bases, thus making the raw reads and the "corrected" reads identical. The transcripts were assembled using Trinity v2.6.6 (RRID:SCR_013048) [

ddRAD sequence assembly and filtering
We assembled all ddRAD data using the program ipyrad v0.9.31 (RRID:SCR_022016) [68]. The input parameters for ipyrad are shown in Table 3. All A. glossodonta reads were mapped to the genome assembly. The seven-step ipyrad workflow was as follows: (1) Sequences were demultiplexed by identifying individual sample barcode sequences and restriction overhangs.
(2) Barcodes and adapters were trimmed from reads, which were then hard-masked using a q-score threshold of 20 and filtered for a maximum number of undetermined bases per read.
(3) Reads were clustered with a minimum depth of coverage of six to retain clusters in the ddRAD assembly.
(4) Sequencing error rate and heterozygosity were jointly estimated from site patterns across the clustered reads assuming a maximum of two consensus alleles per individual. (5) Consensus base calls were determined for each allele using the parameters from step four and filtered for a maximum number of undetermined sites per locus. (6) Consensus sequences and aligned reads were clustered for each sample. (7) Data were filtered by the maximum number of alleles per locus, the maximum number of shared heterozygous sites per locus, and other criteria [68], and output files were formatted for downstream analyses. We included all loci shared by at least 10 individuals.

Detection of loci under selection
Before conducting population genomic analyses, we performed outlier tests to identify loci putatively under selection. These are usually identified by a significant difference in allele frequencies between populations [97]. Specifically, we implemented two outlier detection methods that accommodate missing data: pcadapt v4.  were also annotated with respect to the MAKER-based annotations of the genome assembly using SnpEff v5.0e (RRID:SCR_005191) [104].

Population structure and genetic differentiation
To examine population structure, we used a model-based clustering method to reconstruct the genetic ancestry of individuals using sparse non-negative matrix factorization (sNMF) and least-squares optimization. Model-based analyses were performed using the package LEA v2.6.0 (RRID:SCR_022020) [105] in R. The sNMF function in LEA estimates the number of ancestral populations and the probability of the number of gene pools from which each individual originated by calculating an ancestry coefficient and investigating the model's fit through cross-entropy criterion [106]. We calculated and visualized cross-entropy scores of  of the atolls that comprised our sampling localities ( Figure 5). The five Seychelles atolls we sampled were spread among three separate clusters of islands, which are commonly referred to as the "outer" island groups owing to the geographic locations of these outlying coralline islands relative to the densely populated, granitic "inner" islands of the Seychelles

Hi-C sequencing
Sequencing yielded 88.7 million pairs of reads comprising 44.28 Gbp. The mean and N50 read lengths were 249.493 and 250, respectively. A summary of these results is presented in

Illumina DNA
When Quake corrects paired-end reads, three outcomes are possible for each pair of reads: (i) both reads are either already correct or correctable, (ii) one read is either correct or correctable and the other is low-quality, or (iii) both reads are low-quality. Of the original 218.96 million reads (109.5 million pairs of reads), Quake corrected 62.7 million of them and removed 51.6 million of them. 5.97 million pairs of reads were discarded because both reads were rated as erroneous. 39.6 million pairs of reads had one read that was correct or correctable and one read that was low-quality; these were also discarded. The remaining 63.88 million pairs of reads were either correct or correctable and were kept in the final read set containing 29.11 Gbp of sequence.

Illumina RNA
No corrections were made to the RNA-seq reads by Rcorrector [47].

PacBio CLRs
The dual-correction strategy (self-correction followed by hybrid-correction) reduced the number of reads from 9.5 million to 2. 79

Genome size estimation
The genome size was estimated to be approximately 0.933 Gbp as a result of the k-mer analysis, which was consistent with the authors' expectations based on two closely related elopomorph species [110,111].

Genome assembly, polishing, and scaffolding
The initial assembly from Canu comprised 3,799 contigs with a total assembly size of   Continuity statistics for the Albula glossodonta genome assembly at various stages. The "Scaffolds (Hi-C + RNA-seq)" column represents the final assembly. Also note that when submitted to GenBank, the gaps were all converted to a length of 100 bp. Unless otherwise specified, all nucleotide sequences are measured in base pairs (bp).
(+4.37%). Table 6 summarizes the assembly continuity statistics, and the auNG is visualized in Figure 8.
The assembly completeness, as assessed with single-copy orthologs, was also evaluated at each stage ( Table 2). The results suggest that the modifications made to the primary Canu-based assembly from polishing and scaffolding did not significantly impact the correct assembly of single-copy orthologs. The final set of scaffolds had 3,481 complete single-copy orthologs (95.6% of 3,640 from the ODB10 Actinopterygii set). Of these 88.4% (3,076) were present in the assembly only once, and 11.6% (405) were present more than once. Twenty-five (0.7%) and 135 (3.7%) single-copy orthologs were fragmented in and missing from the assembly, respectively. To show that the polished contigs (green) share nearly the same curve, the line was plotted more thickly so it can just barely be seen. Similarly, the Hi-C + RNA-seq scaffolds (purple) curve is very similar to the Hi-C only scaffolds (blue) curve. In this case, differences are more apparent. In certain places, e.g., at the highest peak, the Hi-C + RNA-seq scaffolds (purple) are plotted more thickly so it can be seen behind the Hi-C only scaffolds (blue).

Transcriptome assembly
The transcriptome assembly generated by Trinity comprised 455K sequences with a mean sequence length of 1,177 bp. The N50 and L50 were 2.6 Kbp and 56K, respectively. The N90 and L90 were, respectively, 410 bp and 270K. Of the 3,640 single-copy orthologs in the ODB10 Actinopterygii set, 86.4% (3,144) were complete; 39.5% (1,241) of which were present only once in the transcript set. One hundred and twenty-eight (3.5%) single-copy orthologs were fragmented in the transcript set, 368 (10.1%) were missing (see Table 2).

Computational genome annotation
Computational structural and functional annotation yielded 28.3 K protein-coding genes. Of these, 17.2 K and 15.6 K have annotated 5′ and 3′ untranslated regions (UTRs), respectively.
Eighteen hundred (1,800) tRNA genes were also identified. The annotations are available with the assembly on GenBank. Approximately 6.5% of the genome comprised repetitive elements, and a summary of repeat types is available in Table 7.

Population genomic analysis
Cross-entropy scores generated by the model-based population differentiation analysis, sNMF, provided support for a single population of A. glossodonta across all localities. However, individual ancestry plots generated from sNMF results showed evidence of genetic differentiation in individuals from Mauritius (St. Brandon's Atoll), compared to the Seychelles sites ( Figure 9A). This differentiation was corroborated by PCA visualization of the first two principal components, where St. Brandon's Atoll individuals clustered separately from the four Seychelles island groups ( Figure 9B). Together, both population differentiation analyses indicated weak geographic population structure across all sampling localities, with reduced gene flow between St. Brandon's Atoll and the Seychelles sites. Pairwise F ST results also indicated greater genetic differentiation between St. Brandon's Atoll and all other island groups ( Table 8). Estimates of observed and expected heterozygosity were similar across island groups (Table 9), suggesting no differences in genetic diversity between sampling localities and providing no evidence for distinguishing metapopulation processes such as inbreeding. A test of isolation by distance between sampling sites was not significant (p = 0.1501).

Discussion
Albula glossodonta is an important fishery species in the Indo-Pacific for both subsistence and recreational purposes [21,31,115,116]   ∼1,500-2,000 km from the Seychelles Islands ( Figure 5). This genetic structuring was unexpected, given the long pelagic larval duration of A. glossodonta. However, there is evidence of limited gene flow between Seychelles and Mauritius in other marine fish species with pelagic larvae, such as Lutjanid kasmira [128], Lethrinus nebulosus [129], and Pristipomoides filamentous [130].
We attribute the observed genetic structure between Seychelles and St. Brandon's Atoll to the ocean currents in the SWIO and their role in larval transport [131,132]. St. Brandon's Atoll is in the direct path of one of the bifurcated arms of the South Equatorial Current as it passes through the Mascarene Plateau [124,133]. The South Equatorial Current pushes water westward, which may create a barrier to gene flow to islands south of Seychelles such as Mauritius and Réunion [129,130,133]. Although there are currently no bonefishor even elopomorph -larval dispersal models for the Indian Ocean, pelagic larval dispersal simulation models of coral species in the SWIO corroborate the biogeographic break between Seychelles and Mauritius, suggesting connectivity is limited even when the pelagic larval duration is between 50-60 days [124,133]. However, these models considered coral larvae, which are completely reliant on currents for their dispersal [124,133,134]. While the dispersal behavior of A. glossodonta larvae is unknown, we speculate that, similar to eels (Anguillidae; which also have long pelagic larval durations), bonefishes could disperse greater distances than passive corals by having the ability to swim (e.g., Anguilla japonica [135]) or may even take part in vertical migrations (e.g., Anguilla japonica [136,137]). While officially undescribed, swimming ability in bonefish leptocephali has been observed [138], and vertical migrations have previously been theorized [121,139].
Genome-wide datasets have enabled researchers to better delineate population connectivity across seascapes for marine species where conventional markers (e.g., mtDNA, microsatellites) have not provided sufficient genomic resolution [126,140,141]. Such advances in genomic sequencing have altered our view of population connectivity in other marine fishes such as yellowfin tuna (Thunnus albacores [142]) and the American eel (Anguilla rostrata [143]). These studies, including ours, highlight the power of large genomic datasets for investigating connectivity in open ocean environments containing few, if any, natural barriers that were traditionally thought to drive population structure. Although there has been a rapid increase in the number of studies using next-generation sequencing datasets for marine fishes, few studies to date have employed the use of genomic datasets on elopomorphs, and none on bonefish [143][144][145].

REUSE POTENTIAL
This is the first genome assembly and annotation for an albulid species, as well as the first use of a genome-wide SNP dataset to investigate population structure for Albula glossodonta or any bonefish species in the Indian Ocean. Individuals of A. glossodonta were genetically homogenous across four coralline island groups in the Seychelles Archipelago, but they showed evidence of genetic differentiation between the Seychelles and Mauritius (St. Brandon's Atoll). These patterns of connectivity are probably facilitated by pelagic larval dispersal, which is presumed to be strongly shaped by currents in the SWIO. Only with high-resolution genomic data were we able to discern this pattern of population structure between Seychelles and Mauritius. Our dataset serves as a valuable resource for future genomic studies of bonefishes to facilitate their management and conservation.

DATA AVAILABILITY
The raw reads, genome assembly, and annotations are available under BioProject PRJNA668352 and BioSamples SAMN16516506-SAMN16516510 and SAMN17284271.