Genome assembly and annotation of the tambaqui (Colossoma macropomum): an emblematic fish of the Amazon River Basin

Colossoma macropomum, known as “tambaqui”, is the largest Characiformes fish in the Amazon River Basin and a leading species in Brazilian aquaculture and fisheries. Good quality meat and excellent adaptability to culture systems are some of its remarkable farming features. To support studies into the genetics and genomics of the tambaqui, we have produced the first high-quality genome for the species. We combined Illumina and PacBio sequencing technologies to generate a reference genome, assembled with 39× coverage of long reads and polished to a consensus quality value (QV) of 36 with 130× coverage of short reads. The genome was assembled into 1269 scaffolds (a total of 1,221,847,006 bases), with a scaffold N50 size of 40 Mb, where 93% of all assembled bases were placed in the largest 54 scaffolds corresponding to the diploid karyotype of the tambaqui. Furthermore, the NCBI Annotation Pipeline annotated genes, pseudogenes, and non-coding transcripts using the RefSeq database as evidence, guaranteeing a high-quality annotation. A Genome Data Viewer for the tambaqui was produced, which will benefit groups interested in exploring the unique genomic features of the species. The availability of a highly accurate genome assembly for tambaqui provides the foundation for the discovery of novel ecological and evolutionary insights, and is a helpful resource for aquaculture.


INTRODUCTION
The Amazon Basin harbors enormous freshwater ichthyodiversity throughout its rivers and tributaries, with 2406 validated freshwater native fish species from 232,936 georeferenced records [1]. Colossoma macropomum (NCBI:txid42526, fishbase ID:263) is the largest Characiformes representative found across the Amazon River and its tributaries, with individuals reaching 1 meter in length and 30 kg in weight (Figure 1) [2]. This species is known by different common names, such as "tambaqui" in Brazil and "cachama negra" in Colombia. Tambaquis are omnivorous/frugivorous benthopelagic fish, and they have an essential ecological role as seed dispersers [3]. They are potamodromous fish, with upstream migration and reproduction taking place in the white waters along woody shores between November and February [4]. The tambaqui is an important food and income source for Amazonian fishing communities; it is the most frequently farmed native fish species in Brazil, with a production of 101,079 metric tons in 2019 [5,6].
The ecological and economic importance of the tambaqui means it is a comparatively well-studied species. Research to date has focused on its biological adaptations to the Amazon River waters, and on the genetics of production traits to assist selective breeding programs. Transcriptomic characterization of tambaqui (i) exposed to distinct climate change scenarios and (ii) during gonadal differentiation has provided helpful resources for understanding the molecular mechanisms underlying both adaptation to a future climate and the process of sex determination [7,8,9]. Other molecular mechanisms related to enzymatic capacity for long-chain polyunsaturated fatty acid biosynthesis have also been confirmed by the functional characterization of core genes in these processes [10,11]. The first steps for deciphering the structure and functional dynamics of the tambaqui genome have already been taken, with large-scale single nucleotide polymorphism (SNP) discovery allowing a high-density genetic linkage map of the species to be built [12], along with preliminary microRNA identification and characterization [13]. Equally pertinent are the new findings in morphology: specimens lacking intramuscular bones were identified in a fish farm in Brazil; however, the genetic and molecular mechanisms underlying the expression of such desirable phenotypes for the fish market remain unknown [14,15].
Considering the great need for increased genetic resources for the tambaqui to assist fishery management and aquaculture [16], here we present the first high-quality reference genome for C. macropomum. This complete set of DNA provides a valuable resource for the study of evolutionary and functional genomics in bony fishes, providing a window of opportunity to reveal singularities of the tambaqui genome, as well as to help develop molecular techniques to improve selective breeding programs.

DNA isolation and taxonomy identification
Genomic DNA was isolated from caudal fin-clip samples from a C. macropomum specimen obtained from the germplasm bank maintained by the National Center for Research and Conservation of Freshwater Aquatic Biodiversity of the Brazilian Ministry of the Environment. The specimen was a 3.5-kg female (Figure 1). To confirm the taxonomic status of the specimen used in this work, we carried out (i) an external morphological evaluation [17], and (ii) a preliminary genetic analysis of an initial Illumina run for C. macropomum using the k-mer-matching tool Seal from the BBTools package (v 37.90, RRID:SCR_016968) [18]. We downloaded the sequences of one mitochondrial and four nuclear genes of C. macropomum and its two close relatives, Piaractus brachypomus and P. mesopotamicus (Table 1).
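The idea behind the k-mer-matching check can be illustrated with a minimal Python sketch. This is not Seal's implementation: the function names, the tie-handling rule, and the toy sequences are assumptions made for illustration only. Each read is assigned to the reference sequence with which it shares the most k-mers, on either strand.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a DNA sequence (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def assign_reads(reads, references, k=31):
    """Count reads per reference, assigning each read to the reference
    sharing the most k-mers with it.

    Reads with no shared k-mer are counted as 'unmatched'; reads tied
    between references are counted as 'ambiguous' (an illustrative rule).
    """
    # Index both strands of every reference sequence.
    index = {
        name: kmers(seq, k) | kmers(reverse_complement(seq), k)
        for name, seq in references.items()
    }
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        hits = {name: len(read_kmers & ref) for name, ref in index.items()}
        best = max(hits.values(), default=0)
        if best == 0:
            counts["unmatched"] += 1
            continue
        winners = [name for name, n in hits.items() if n == best]
        counts[winners[0] if len(winners) == 1 else "ambiguous"] += 1
    return counts


# Toy example with made-up sequences and a short k for readability.
toy_refs = {"species_A": "ACGTACGTTTGACCA", "species_B": "GGGCCCTTTAAAGGG"}
toy_reads = ["ACGTACGTT", "CCCTTTAAA", "TTTTTTTTT"]
toy_counts = assign_reads(toy_reads, toy_refs, k=5)
print(dict(toy_counts))
```

In a real check, the references would be the marker-gene sequences of the candidate species, and a specimen would be considered correctly identified when the large majority of matching reads are assigned to the expected species.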

Sequencing and assembly
Different data types were produced for the genome assembly of C. macropomum.
High-molecular-weight DNA was extracted from muscle and fin clip using the MagMAX CORE nucleic acid purification kit (Thermo Fisher Scientific, Carlsbad, CA, USA) to produce PacBio continuous long reads (CLR) and Illumina paired and jumping reads (Table 2). The 3) [20], implemented in Canu assembler [21] and GenomeScope [22]. (Table 3). For the genome assembly, PacBio reads were input to the assembler Flye (v2.5, RRID:SCR_017016) [23] with the parameters "--genome-size 1.5g --pacbio-raw". Then, the assembly was polished using the Illumina reads with Pilon software (RRID:SCR_014731) [24], with the parameters "--frags" for paired reads and "--jumps" for mate-pair reads. Finally, the assembly of the tambaqui underwent one round of purging with Purge_Dups (RRID:SCR_021173) [25]. Purging removes sequences representing duplicated portions of a chromosome, which can be erroneously kept in assemblies when the two haplotypes are highly divergent in those regions. This removed 1,167 contigs and 26 Mbp (megabase pairs) of haplotypic retention. The final tambaqui genome was assembled into 1,269 scaffolds with a scaffold N50 of 40 Mbp and a total assembly length of 1,221,847,006 bp (Table 2). Of the assembled bases, 93% are placed in 54 scaffolds, which represent the main tambaqui karyotype [26]. We have also identified the mitochondrial genome (Figure 3) within our assembled genome: it is represented by scaffold NW_023495502.1, which is 16,715 bp in length and has conserved gene content and synteny with the C. macropomum mitogenome available at the National Center for Biotechnology Information (NCBI; KP188830.1).
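For readers unfamiliar with the contiguity statistics quoted above, the following Python sketch shows how a scaffold N50 and the fraction of bases in the largest n scaffolds can be computed from a list of scaffold lengths. The function names and the toy lengths are illustrative assumptions, not part of the pipeline used here.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fraction_in_largest(lengths, n):
    """Return the fraction of all bases contained in the n longest sequences."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


# Toy scaffold lengths (made up); a real run would read them from a FASTA index.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))                     # 8
print(fraction_in_largest(toy_lengths, 2))  # 0.6
```

Applied to the scaffold lengths of the tambaqui assembly, `fraction_in_largest(lengths, 54)` would correspond to the 93% figure reported above.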

Repeat sequences and gene annotation
We identified repeat sequences in C. macropomum using homology-based and de novo approaches. A de novo library of repeats was created for the tambaqui using the RepeatModeler2 package (RRID:SCR_015027) [27]. This library was then combined with RepBase (release 26.04, RRID:SCR_021169) [28], forming the final "teleost" library with which C. macropomum genome repeats were searched. Splign [32] and ProSplign [33]. These alignments are submitted to Gnomon [34] for gene prediction. Gnomon (i) merges non-conflicting alignments into putative models, then (ii) extends predictions missing a start or a stop codon or internal exon(s) using a hidden Markov model (HMM) algorithm. Finally, Gnomon (iii) builds pure ab initio predictions where it finds open reading frames of sufficient length but with no supporting alignment detected. Models built on RefSeq transcript alignments are given preference over overlapping Gnomon models with the same splice pattern. Table 5 presents a summary of the annotation of C. macropomum. A detailed description of the tambaqui genome annotation can be found on the NCBI Eukaryotic Annotation Page [35]. A detailed annotation report can be found at [36].

RESULTS AND DISCUSSION
further exploration of the tambaqui genome, especially by those who are not specialist bioinformaticians, such as geneticists working on selective breeding programs.

Evaluating the completeness of the genome assembly and annotation
The final assembly of the tambaqui is 1.2 Gbp with a scaffold N50 size of 40.163 Mbp (Table 2). Figure 2A shows the k-mer-based prediction of genome size obtained from the Illumina reads produced to polish this assembly. Further, Figure 2B presents a Merqury [39] k-mer plot of the final assembly: Merqury produces a mapping-free evaluation of k-mer completeness in genomes by comparing the assembly k-mers with those of the raw reads for the specimen. In this case, we used the high-quality Illumina reads (Table 3) to plot the Merqury evaluation against the genome k-mers. Figure 2B shows that (i) the k-mers in the genome are in accordance with its Illumina read k-mers, (ii) the assembly k-mers have the same distribution as the raw-read k-mers (Figure 2A), and (iii) most of the assembly k-mers (pink color) are unique in the genome, showing that the final assembly of the tambaqui has low levels of haplotypic retention (blue color). The final Phred-like Merqury QV score is 36.73, meaning that the tambaqui assembled bases are more than 99.9% accurate. The Merqury completeness score shows that 89.31% of k-mers in the Illumina reads are present in the assembly, which is a good recovery of k-mers for a species with 0.6% heterozygosity. For the tambaqui genome, 93% of the assembled bases are present in the largest 54 scaffolds. We performed a first nucleotide synteny analysis of Benchmarking Universal Single-Copy Ortholog (BUSCO) genes found in the first 54 scaffolds of C. macropomum against the BUSCO genes in the genome of Ictalurus punctatus [40] using busco2fasta [41] and Circos [42]. The synteny is presented in Figure 4. C. macropomum and I. punctatus shared a common ancestor ∼150 million years ago [43]. The image shows a good degree of synteny in terms of BUSCO genes; in several cases, entire chromosomes are syntenic. Figures 5 and 6 show similar analyses with C. auratus [44] and Astyanax mexicanus [45], species with different levels of relatedness to C. macropomum, demonstrating the potential of this highly contiguous genome for studies of chromosome evolution.
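The relationship between the Phred-like QV reported above and per-base accuracy follows the standard Phred definition, QV = -10 log10(error rate). A minimal Python sketch of the conversion (function names are illustrative):

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


error = qv_to_error_rate(36.73)
accuracy = 1 - error
print(f"error rate: {error:.2e}, accuracy: {accuracy:.4%}")
```

For QV 36.73 this gives an error rate of roughly 2.1 × 10^-4, that is, about one error every 4,700 bases, which is consistent with the statement that the assembled bases are more than 99.9% accurate.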
The phylogeny presented herein (Figure 7) is consistent with other studies [53,54]. Synteny analysis confirms moderate conservation of homologous single-copy gene order between these species. While large syntenic blocks persist, a relatively large portion of C. macropomum genes are also fragmented into two or more linkage groups of the A. mexicanus genome.

Re-use Potential
Seasonal and long-term changes in environmental conditions are well known to be associated with periodic events of low dissolved oxygen in the water, leading to hypoxia and even anoxia. Tambaqui is an Amazon fish species that has developed adaptations to deal with this, such as enlargement of the lower lip to take up oxygen more effectively and survive in hypoxic conditions. These, along with other fish adaptations to the Amazon aquatic ecosystem, raise intriguing scientific questions that could be addressed using the present well-assembled and annotated tambaqui genome. Moreover, the availability of this annotated genome will pave the way for the development of tools for genomic breeding programs of tambaqui, the most important native species for aquaculture in South America.

DATA AVAILABILITY
The data sets supporting the results of this article are available in the GigaScience Database [55]. All sequencing data are available on NCBI under the BioProjects PRJNA702552 and PRJEB40318. The former contains the Sequence Read Archive (SRA) experiments with accession numbers SRX10122091 to SRX10122101. The latter comprises the assembled genome and sequence annotations with the accession number GCF_904425465.1.
The genome sequence and annotation files (including coding sequences and proteins) can be downloaded from the NCBI FTP server [37]. A data viewer is also available [38].