Chromosome-level genome assembly of the humpback puffer, Tetraodon palembangensis

The humpback puffer, Tetraodon palembangensis, is a poisonous freshwater pufferfish species mainly distributed in Southeast Asia (Thailand, Laos, Malaysia and Indonesia). The humpback puffer has many interesting biological features, such as inactivity, tetrodotoxin production and body expansion. Here, we report the first chromosome-level genome assembly of the humpback puffer. The genome size is 362 Mb, with a contig N50 value of ∼1.78 Mb and a scaffold N50 value of ∼15.8 Mb. Based on this genome assembly, ∼61.5 Mb (18.11%) of repeat sequences were identified, 19,925 genes were annotated, and the function of 90.01% of these genes could be predicted. Finally, a phylogenetic tree of ten teleost fish species was constructed. This analysis suggests that the humpback puffer and T. nigroviridis shared a common ancestor 18.1 million years ago (MYA), and diverged from T. rubripes 45.8 MYA. The humpback puffer genome will be a valuable genomic resource for elucidating possible mechanisms of tetrodotoxin synthesis and tolerance.

genome assembly of the humpback puffer. This assembly will be valuable for further study of mechanisms, such as tetrodotoxin synthesis and expansion defense. Comparative genomic analysis will help us to better understand the phylogenetic evolution and special gene families of the Tetraodontidae.

METHODS
All methods used to isolate DNA/RNA, construct libraries, and conduct genomic sequencing are available in a protocols.io collection (Figure 2 [7]).

Sample collection and sequencing
The sample (CNGB ID: CNS0224034) used in this study was an adult humpback puffer bought from the YueHe Flower-Bird-Fish market in Guangzhou, Guangdong Province, China. DNA and RNA were both isolated from blood following published protocols [8,9]. Then, a paired-end single-tube long fragment read (stLFR) library and an RNA library were constructed according to the protocol published by Wang et al. [10]. A Hi-C library was constructed from blood according to the protocol published by Huang et al. [11]. These three libraries were then sequenced on the BGISEQ-500 platform (RRID:SCR_017979) [12]. A Nanopore library was constructed with DNA isolated from blood using the QIAamp DNA Mini Kit (Qiagen) [13] and sequenced on the GridION platform (RRID:SCR_017986) [14]. In total, we obtained 146 Gb (∼312×) of raw stLFR data, 21 Gb of raw RNA data, 19 Gb (∼49×) of raw Hi-C data, and 12 Gb (∼32×) of raw Nanopore data (Table 1).
Raw stLFR reads were subjected to quality control to improve the assembly quality.

Genome assembly
Jellyfish (v2.2.6, RRID:SCR_005491) was used to count the 17-mers in all clean stLFR reads [17]. GenomeScope [18] was then used to estimate the humpback puffer genome size at about 385 Mb (Table 2 and Figure 1b). The genome size, $G$, was defined as $G = K_{num}/K_{depth}$, where $K_{num}$ is the total number of k-mers and $K_{depth}$ is the most frequently occurring k-mer depth.
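The relation G = K_num / K_depth can be illustrated with a minimal Python sketch. It is not the GenomeScope model used in this study: it simply applies the formula to a k-mer depth histogram given as two whitespace-separated columns (depth, number of distinct k-mers). The function names, the `min_depth` cutoff for excluding low-depth error k-mers, and the toy histogram are assumptions made for illustration only.

```python
def parse_histogram(text):
    """Parse two-column 'depth count' lines into a dict {depth: count}."""
    histogram = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=3):
    """Apply G = K_num / K_depth to a k-mer depth histogram.

    min_depth is a hypothetical cutoff: k-mers seen fewer times are
    treated as sequencing errors and ignored (an assumption of this sketch).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # K_num: total number of k-mer occurrences (depth x distinct k-mers).
    k_num = sum(d * n for d, n in usable.items())
    # K_depth: the depth shared by the largest number of distinct k-mers.
    k_depth = max(usable, key=usable.get)
    return k_num / k_depth, k_depth


# Toy histogram (invented values, not data from this study).
toy_text = """
1 900
2 100
18 50
19 120
20 200
21 110
22 40
"""
toy_histogram = parse_histogram(toy_text)
toy_size, toy_peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {toy_peak}, estimated size = {toy_size:.1f} bp")
```

With real data, the result is sensitive to the error cutoff and to heterozygous or repeat peaks, which is why model-based tools such as GenomeScope are generally preferred for the final estimate.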
To assemble the humpback puffer genome, we first converted the format of the clean stLFR reads, then used Supernova (v. 2.0.1, RRID:SCR_016756) to produce the draft assembly. Next, we used GapCloser (v. 1.12, RRID:SCR_015026) [19] to fill gaps with stLFR reads. To further improve the assembly quality, TGSgapFiller [20] was used to re-fill gaps with Nanopore reads, and Pilon (v. 1.22, RRID:SCR_014731) [21] was used to polish the assembly twice. At this stage, the genome assembly was about 362 Mb, with a scaffold N50 of 7.1 Mb and a contig N50 of 1.8 Mb (Table 3).
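For readers unfamiliar with the N50 statistic reported here, the following Python sketch shows one common way to compute scaffold and contig N50 values. It is an illustration, not the script used in this study. The function names and toy sequences are hypothetical, and splitting scaffolds into contigs at every run of N is a simplifying assumption (assembly tools often apply a minimum gap length).

```python
import re


def n50(lengths):
    """Return the length L such that sequences of length >= L
    cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty set of sequences")


def contig_lengths(scaffolds):
    """Split each scaffold at runs of N (gaps) and return contig lengths.

    Assumption of this sketch: any run of one or more N/n is a gap.
    """
    lengths = []
    for sequence in scaffolds:
        for piece in re.split(r"[Nn]+", sequence):
            if piece:
                lengths.append(len(piece))
    return lengths


# Toy scaffolds (invented sequences, not data from this study).
toy_scaffolds = ["ACGTNNNNACG", "ACGTAC"]
scaffold_n50 = n50([len(s) for s in toy_scaffolds])
contig_n50 = n50(contig_lengths(toy_scaffolds))
print("scaffold N50:", scaffold_n50)
print("contig N50:", contig_n50)
```

Because gap filling merges adjacent contigs without changing scaffold boundaries, steps such as GapCloser and TGSgapFiller are expected to raise the contig N50 while leaving the scaffold N50 largely unchanged.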
Using the genome assembly and the Hi-C data validated with HiC-Pro, a contact matrix was generated with Juicer (v3, RRID:SCR_017226). Finally, we performed chromosome-level scaffolding using the 3D de novo assembly (3D-DNA) pipeline (v. 170123) [22]. This anchored 91.2% of all sequences to 18 chromosomes, with lengths ranging from 11 Mb to 35 Mb (Table 4 and Figure 3).

Genome evolution
To study the evolutionary status of the humpback puffer among bony fish species, we clustered gene families from protein sequence alignments of the humpback puffer and nine other teleosts (Xiphophorus maculatus, Gasterosteus aculeatus, Sebastes schlegelii, Oryzias latipes, Gadus morhua, Oreochromis niloticus, Tetraodon nigroviridis, Danio rerio, and Takifugu rubripes) using the TreeFam v0.50 pipeline [45]. Protein-coding gene sequences for all of these species were downloaded from NCBI, except for S. schlegelii [46], which was obtained from the China National Genebank Nucleotide Sequence Archive (CNSA; Accession ID: CNP0000222). To improve analysis quality, we removed genes with frameshifts or encoding fewer than 50 amino acids, as well as redundant copies, keeping only the longest transcript of each gene for comparative genomic analysis. A total of 21,022 gene families were identified, of which 40 were unique to the humpback puffer (Table 9 and Figure 5a). Among the 21,022 gene families, we identified 4,461 single-copy protein-coding genes shared by all species. We used MUSCLE v3.8.31 [47] with default parameters to align these orthologs. The alignments were then concatenated into a "super alignment matrix" of 3,584,782 amino acids. Based on this matrix, a phylogenetic tree was constructed using RAxML v8.2.4 [48] with the best-fitting amino acid substitution model, JTT. Clade support was assessed by bootstrapping with 1,000 alignment replicates (Figure 5b). Next, we calculated the divergence times among these teleosts using the MCMCTree tool included in PAML (v4.7a, RRID:SCR_014932) [49], with the parameters "-rootage 500 -clock 3 -alpha 0.431879". The fossil calibration times (Table 10) were obtained from TimeTree [50]. The results showed that the humpback puffer and T. nigroviridis, two species belonging to the same genus, shared a common ancestor 18.1 million years ago (MYA).
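The selection of single-copy gene families and the concatenation of their alignments into a super alignment matrix can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the TreeFam pipeline or the scripts used in this study: gene families are assumed to be available as a dictionary mapping each family to its genes per species, alignments as dictionaries of equal-length aligned strings, and all names and toy data are hypothetical.

```python
def single_copy_families(families, species):
    """Return IDs of families with exactly one gene in every species.

    families: {family_id: {species_name: [gene_id, ...]}}
    """
    return sorted(
        family_id
        for family_id, members in families.items()
        if all(len(members.get(name, [])) == 1 for name in species)
    )


def concatenate_alignments(alignments, species):
    """Join per-family alignments into one supermatrix row per species.

    alignments: list of {species_name: aligned_sequence}; within each
    alignment all sequences must have the same aligned length.
    """
    parts = {name: [] for name in species}
    for index, alignment in enumerate(alignments):
        lengths = {len(alignment[name]) for name in species}
        if len(lengths) != 1:
            raise ValueError(f"alignment {index} has unequal row lengths")
        for name in species:
            parts[name].append(alignment[name])
    return {name: "".join(rows) for name, rows in parts.items()}


# Toy input (invented identifiers and sequences, not data from this study).
toy_species = ["puffer", "tnig", "zebrafish"]
toy_families = {
    "F1": {"puffer": ["p1"], "tnig": ["t1"], "zebrafish": ["z1"]},
    "F2": {"puffer": ["p2", "p3"], "tnig": ["t2"], "zebrafish": ["z2"]},
    "F3": {"puffer": ["p4"]},
}
toy_alignments = [
    {"puffer": "MK-L", "tnig": "MKAL", "zebrafish": "MRAL"},
    {"puffer": "GG", "tnig": "GA", "zebrafish": "G-"},
]
toy_single_copy = single_copy_families(toy_families, toy_species)
toy_matrix = concatenate_alignments(toy_alignments, toy_species)
print(toy_single_copy)
print(toy_matrix["puffer"])
```

In the toy input, only family F1 qualifies: F2 has two copies in one species and F3 is missing from two species. Each row of the resulting matrix has the same length, which is the property that tree-building programs require of a concatenated alignment.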

DATA VALIDATION AND QUALITY CONTROL
To demonstrate the quality of the genome assembly and gene set, we performed a quality evaluation using the actinopterygii_odb10 database from Benchmarking Universal Single-Copy Orthologs (BUSCO v.4.1.2, RRID:SCR_015008) [51]. The results showed that 95.7% and 90.7% of complete BUSCOs were covered by the genome assembly and gene set, respectively (Table 11).

REUSE POTENTIAL
We assembled the first annotated chromosome-level genome of the humpback puffer.
These resources will be helpful for studying the mechanism of body expansion displayed by this fish species, the synthesis mechanism and treatment of tetrodotoxin, and the evolution of freshwater puffers. Furthermore, the humpback puffer genome will fill a gap in the Fish 10K program and in the phylogenetic tree of life.

DATA AVAILABILITY
The project has been deposited at the CNGB Nucleotide Sequence Archive (CNSA) under accession ID CNP0001025. The genomic data are available in the GigaScience Database [52].