Bicolor angelfish (Centropyge bicolor) provides the first chromosome-level genome of the Pomacanthidae family

The Bicolor Angelfish, Centropyge bicolor, is a tropical coral reef fish named for its striking two-color body. However, because high-quality genomic data are lacking, little is known about the genome of this species. Here, we present a chromosome-level C. bicolor genome constructed using Hi-C data. The assembled genome is 650 Mbp in size, with a scaffold N50 value of 4.4 Mbp and a contig N50 value of 114 Kbp. A total of 21,774 protein-coding genes were annotated. Our analysis will help others to choose the most appropriate de novo genome sequencing strategy based on resources and target applications. To the best of our knowledge, this is the first chromosome-level genome for the Pomacanthidae family, and it might contribute to further studies exploring coral reef fish evolution, diversity and conservation.

Protocols for BGISEQ-500, stLFR and Hi-C library preparation and construction, and genome assembly, for the Bicolor Angelfish, Centropyge bicolor [2]. https://www.protocols.io/widgets/doi?uri=dx.doi.org/10.17504/protocols.io.bpxhmpj6

possible to explore the genetic mechanisms of body color development in coral reef fish by comparative genomic methods.

METHODS AND RESULTS
A protocols collection for BGISEQ-500, stLFR and Hi-C library construction is available on protocols.io (Figure 2) [2].

Sample collection and genome sequencing
A C. bicolor individual was collected from the market in Xiamen, Fujian Province, China.
The stLFR data were filtered using SOAPnuke (v1.6.5; RRID:SCR_015025) [4] to remove low-quality reads (sequences with more than 40% of bases with a quality score lower than 8), polymerase chain reaction (PCR) duplicates, adaptor sequences and reads with a high proportion (greater than 10%) of ambiguous bases (Ns). We obtained 62.6 Gbp (∼91.67×) of clean data (Table 1) to assemble the draft genome. Meanwhile, HiC-Pro (v. 2.8.0) [5] was used for quality control of the raw Hi-C data, and 42.51 Gbp (∼64.19×) of valid data were used to assemble the genome to the chromosome level (Table 1).
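The two per-read thresholds described above can be illustrated with a minimal sketch. This is not SOAPnuke's implementation: the function names are hypothetical, Phred+33 quality encoding is an assumption, and adaptor and PCR-duplicate removal are not covered.

```python
# Illustrative sketch of the per-read filtering thresholds stated in the text.
# Assumptions (not from the source): Phred+33 quality encoding; function names
# are hypothetical; adaptor trimming and duplicate removal are omitted.

def is_low_quality(qual: str, min_q: int = 8, max_frac: float = 0.40,
                   offset: int = 33) -> bool:
    """True if more than max_frac of bases have a quality score below min_q."""
    if not qual:
        return True  # treat empty reads as unusable
    low = sum(1 for ch in qual if ord(ch) - offset < min_q)
    return low / len(qual) > max_frac


def has_too_many_ns(seq: str, max_frac: float = 0.10) -> bool:
    """True if more than max_frac of bases are ambiguous (N)."""
    if not seq:
        return True
    return seq.upper().count("N") / len(seq) > max_frac


def keep_read(seq: str, qual: str) -> bool:
    """Keep a read only if it passes both the quality and the N-content test."""
    return not is_low_quality(qual) and not has_too_many_ns(seq)


if __name__ == "__main__":
    print(keep_read("ACGTACGTAC", "IIIIIIIIII"))  # high-quality read
    print(keep_read("ACGTACGTAC", "#####IIIII"))  # half the bases below Q8
    print(keep_read("NNACGTACGT", "IIIIIIIIII"))  # 20% ambiguous bases
```

Both thresholds are strict inequalities here ("more than 40%", "greater than 10%"), so a read sitting exactly on a threshold is retained.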

Genome assembly
Before genome assembly, the k-mer distribution of the stLFR clean data was analyzed with GenomeScope (RRID:SCR_017014) to assess genome complexity [6]. The genome size of C. bicolor was estimated as 662.27 Mbp (megabase pairs), with 37.6% repeat sequences and 1.16% heterozygous sites (Table 2, Figure 3). Sequencing depth was calculated as total bases divided by genome size, where the genome size is the k-mer estimate shown in Table 2. The genome size, $G$, was defined as $G = K_{\mathrm{num}} / K_{\mathrm{depth}}$, where $K_{\mathrm{num}}$ is the total number of k-mers and $K_{\mathrm{depth}}$ is the depth of the most frequently occurring k-mer. We reformatted the clean stLFR data into 10× Genomics format using an in-house script [7] and assembled the draft genome using Supernova (v.2.0.1, RRID:SCR_016756) [8] with default parameters. The draft genome was 681 Mbp, similar to the estimated genome size, with a contig N50 of 115.5 Kbp (kilobase pairs) and a scaffold N50 of 4.4 Mbp (Table 3).
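The quantities used in this section can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function names are hypothetical, and the N50 definition used (the length of the sequence at which the cumulative length, sorted from longest to shortest, first reaches half the assembly) is the common convention rather than something stated in the text.

```python
# Illustrative sketch of the formulas in the text; function names are
# hypothetical and not part of GenomeScope or Supernova.

def genome_size_from_kmers(k_num: float, k_depth: float) -> float:
    """G = K_num / K_depth: total k-mer count over the peak k-mer depth."""
    return k_num / k_depth


def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Sequencing depth = total bases / genome size."""
    return total_bases / genome_size


def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover half the assembly."""
    if not lengths:
        raise ValueError("n50 requires at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Hi-C valid data and estimated genome size, both taken from the text.
    depth = sequencing_depth(42.51e9, 662.27e6)
    print(f"Hi-C depth: {depth:.2f}x")
    # Toy contig lengths, purely for illustration.
    print(n50([2, 3, 4, 5, 6]))
```

With the values reported above, 42.51 Gbp of valid Hi-C data over the 662.27 Mbp estimate gives a depth of about 64.19×, in line with the figure quoted in the text.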
Homolog-based and ab initio prediction were used to identify the protein-coding genes.
To predict gene functions, the 21,774 genes were aligned against several public databases, including TrEMBL [24], SwissProt [24], KEGGViewer [25] and InterProScan [26]. As a result, 99.67% of all genes were functionally annotated (Table 9, Figure 7). The GLEAN gene set is the integrated result of de novo gene predictions and homology-based gene predictions.

Phylogenetic analysis
We downloaded the gene data of seven representative teleost fishes from NCBI to study the phylogenetic relationships between C. bicolor and other teleosts. These seven fishes were: Danio rerio, Based on the phylogenetic tree and single-copy sequences, the divergence times between species were estimated by MCMCTREE with the parameters "-model 0 -rootage 500 -clock 3". The results showed that C. bicolor diverged from its common ancestor with L. crocea ∼34.95 million years ago (Figure 8).

Analysis of bicolor formation in teleosts
Current studies suggest that different pigment cells produce different pigments. Several types of pigment cells have already been identified in teleosts [30]. C. bicolor has an attractive body color with clear color boundaries, but the underlying molecular mechanism remains unknown. Compared with other teleosts, C. bicolor has 1,081 expanded gene families and 57 specific gene families (Figure 9). Functional enrichment analysis showed that notable expansion occurred in gene families related to visual development and enzyme metabolism (Figure 9).

RE-USE POTENTIAL
Coral reef fishes, with their distinctive color patterns and color morphs, are important for understanding the adaptive evolution of fishes. In this study, we assembled a high-quality, chromosome-level genome of C. bicolor, with a length of 681 Mbp, and annotated 21,774 genes. This is the first genome of a fish from the Pomacanthidae family.
These genomic data will be useful for genome-scale comparisons and further studies on the mechanisms underlying colorful body development and adaptation.

DATA AVAILABILITY
The data sets supporting the results of this article are available in the GigaScience Database [31]. Raw reads from genome sequencing and assembly are deposited at the China National Gene Bank under reference number CNP0001160, which contains sample