Genome assembly and annotation of the Brown-Spotted Pit viper Protobothrops mucrosquamatus

The Brown-Spotted Pit viper (Protobothrops mucrosquamatus), also known as the Chinese habu, is a widespread and highly venomous snake distributed from Northeastern India to Eastern China. Genomics research can contribute to our understanding of venom components and natural selection in vipers. Here, we collected a male P. mucrosquamatus individual from China and sequenced and assembled its genome. We generated a highly continuous reference genome with a length of 1.53 Gb and a repeat element content of 41.18%. Using this genome, we identified 24,799 genes, 97.97% of which could be annotated. We verified the validity of our genome assembly and annotation process by generating a phylogenetic tree based on the nuclear genome single-copy genes of six other reptile species. The results of our research will contribute to future studies on Protobothrops biology and the genetic basis of snake venom.


INTRODUCTION
Protobothrops mucrosquamatus, commonly known as the brown spotted pit viper or Chinese habu, belongs to the Viperidae (viper) family of snakes. This species is widely distributed in northern Vietnam, Laos, northern Myanmar, northeastern India, as well as southwestern and eastern China (Figure 1) [1]. P. mucrosquamatus is a venomous snake with tubular venom-conducting fangs and loreal pits. Its venom acts by impairing the blood circulation system of its prey [2]. Compared with other terrestrial venomous snakes, the maximum amount of venom that P. mucrosquamatus discharges at one time is higher than that of Trimeresurus stejnegeri, Gloydius blomhoffii and Bungarus multicinctus [3]. Its toxicity per unit dose is also higher than that of Deinagkistrodon acutus and T. stejnegeri [3].
Snake venom, while it may damage the health of organisms [1,2,4–6], can also play a role in biomedicine [5,7–9], particularly in antivenom development, disease treatment and many other fields [10]. High-quality reference genomes and transcriptomes are required to detect venom genes, gain insights into toxin-manufacturing mechanisms, and design safe and effective antivenoms and other drugs [11,12]. Moreover, the rapid evolution of venom proteins generally occurs under environmental stress [13,14], such as predation needs. Hence, the genes coding for proteinaceous venom components are an excellent model system for studying adaptation and natural selection [15].

Context
While snake venoms are dangerous to human health, they are also a potential gold mine of bioactive proteins that can be harnessed for drug discovery [16]. Snake genomics also has great potential for studying venom evolution and toxicology. Here, we assembled a highly contiguous genome of a male P. mucrosquamatus individual collected from Guilin, Guangxi, China, using single-tube long fragment read (stLFR) technology [17] and whole genome sequencing (WGS). The total size of the genome we generated is 1.53 Gb, including 41.18% repeat elements. This data provides new material for future research on the Protobothrops genome and the genetic basis of its venom.

Methods
Detailed stepwise protocols are gathered in a protocols.io collection, with the minor adaptations outlined below [18] (Figure 2).

Sample collection and sequencing
A male P. mucrosquamatus individual was captured in Guilin, Guangxi, China. After collection and identification, the specimen was quickly frozen on Drikold dry ice at −80 °C for storage and transport in order to preserve DNA and RNA molecules. Samples from the heart, stomach, liver, and kidney were used for RNA sequencing. A muscle sample was used for stLFR and WGS sequencing. DNA extraction, library construction and sequencing are outlined in the protocols.io protocols [18].
To annotate the function of P. mucrosquamatus genes, BLAST searches were run against multiple databases, including SwissProt, TrEMBL (RRID:SCR_004426), and Kyoto Encyclopedia of Genes and Genomes (KEGG; RRID:SCR_012773), with an E-value cut-off of 1 × 10⁻⁵. To predict motifs and domains, InterProScan (v5.52-86.0; RRID:SCR_005829) [27] as well as gene ontology (GO; RRID:SCR_002811) were employed. The results of this analysis describe the roles of the predicted genes and their involvement in biological processes.
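The E-value filtering step can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes the BLAST results were written in the 12-column tabular format (`-outfmt 6`), treats the cut-off as inclusive, and keeps the highest bit-score hit per gene. The function name `best_hits` and all gene and protein IDs are hypothetical.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
# Using this output format is an assumption, not stated in the text.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

E_VALUE_CUTOFF = 1e-5  # cut-off stated in the text


def best_hits(handle, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Hits with an E-value above the cut-off are discarded; among the
    remaining hits, the one with the highest bit score is kept.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        if evalue > cutoff:
            continue
        bits = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or bits > current[2]:
            best[row["qseqid"]] = (row["sseqid"], evalue, bits)
    return best


# Hypothetical example rows (all identifiers and values are made up).
example = io.StringIO(
    "gene1\tP12345\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t180.0\n"
    "gene1\tQ99999\t60.0\t150\t60\t0\t1\t150\t1\t150\t1e-8\t60.0\n"
    "gene2\tP54321\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20.0\n"
)
hits = best_hits(example)
```

In this toy input, `gene2` has no hit passing the cut-off and would therefore remain without a BLAST-based annotation.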

RESULTS
In this study, 224.27 Gb of linked-read data was obtained from stLFR sequencing and 96.93 Gb of short-read data was obtained from WGS sequencing, for a total of 321.20 Gb (Table 1).
We produced a high-continuity P. mucrosquamatus genome assembly with a total genome size of 1.53 Gb, a GC content of 39.86% and a scaffold N50 length of 362.40 kb (Table 2). The assembly, whose longest scaffold reaches 5.31 Mb, has 149,173 scaffolds over 500 bp, with a total length of 1.51 Gb, accounting for 98.82% of the entire genome. We foresee that this resource will provide new perspectives for the study of viper genomics.
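For readers unfamiliar with the contiguity statistics reported above, the sketch below shows how a scaffold N50 and the fraction of the assembly held in scaffolds above a length threshold are commonly computed. It is an illustrative assumption about the calculation, not the tool used in this study; the function names and the toy scaffold lengths are hypothetical.

```python
def n50(lengths):
    """Return N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths given")


def fraction_in_long_scaffolds(lengths, min_length):
    """Return the fraction of total bases found in scaffolds
    that are at least min_length long."""
    total = sum(lengths)
    kept = sum(length for length in lengths if length >= min_length)
    return kept / total


# Toy scaffold lengths (made up; total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)                                   # 70
toy_fraction = fraction_in_long_scaffolds(toy_lengths, 30)   # 0.9
```

With the toy lengths, the two longest scaffolds (80 + 70 = 150) reach half of the total of 300, so the N50 is 70.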
We identified 41.18% repetitive elements in our P. mucrosquamatus genome. Long interspersed nuclear elements (LINEs) constituted the largest proportion of this assembly (Tables 3 and 4). Using homology-based, de novo and RNA-sequencing annotation methods, 24,799 protein-coding genes were identified in our P. mucrosquamatus genome assembly. The average P. mucrosquamatus gene is 1.53 bp long and contains 8.96 exons. Additionally, 387 miRNAs, 319 tRNAs and 289 snRNAs were predicted in our P. mucrosquamatus genome (Table 5).
According to our KEGG enrichment analysis, Environmental Information Processing, Organismal Systems and Metabolism pathways comprise a significant proportion of the annotated pathways. In particular, Signal Transduction pathways take up the largest proportion. Genes associated with the Immune (2,445) and Endocrine (2,033) systems accounted for the largest number of Organismal Systems pathways (Figure 4a). Based on our GO analysis, 7,900 genes relate to binding and 7,740 genes to cellular processes (Figure 4b).

DATA VALIDATION AND QUALITY CONTROL
BUSCO v5.2.2 was used to evaluate the completeness and quality of our assembly [40]. Our BUSCO analysis indicates that this genome assembly is 83.6% complete based on the vertebrata_odb10 database (Figure 5).
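As a reading aid for the completeness figure, the sketch below parses a BUSCO-style one-line summary and checks that the complete fraction (C) equals the sum of single-copy (S) and duplicated (D) BUSCOs. Only the 83.6% value comes from this study; the S, D, F, M and n values in the example line are placeholders invented for illustration, and the helper name `parse_busco_summary` is hypothetical.

```python
import re

# Pattern for a BUSCO one-line summary such as
# "C:83.6%[S:82.0%,D:1.6%],F:6.4%,M:10.0%,n:3354"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages (C, S, D, F, M) as floats and n as an int."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    fields = match.groupdict()
    result = {key: float(fields[key]) for key in ("C", "S", "D", "F", "M")}
    result["n"] = int(fields["n"])
    return result


# Placeholder line: only C = 83.6% is taken from the text;
# every other number is made up for illustration.
example_line = "C:83.6%[S:82.0%,D:1.6%],F:6.4%,M:10.0%,n:3354"
summary = parse_busco_summary(example_line)

# Complete = single-copy + duplicated; complete + fragmented + missing = 100%.
complete_from_parts = summary["S"] + summary["D"]
total_percent = summary["C"] + summary["F"] + summary["M"]
```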
To check the quality of our assembly, we constructed a phylogenetic tree using protein sequences from NCBI and CNGB for seven other amphibian and reptile species (Anolis carolinensis, Chelonia mydas, Deinagkistrodon acutus, Ophiophagus hannah, Python bivittatus, Xenopus tropicalis and Alligator mississippiensis), as well as Gallus gallus, Homo sapiens, Mus musculus and Danio rerio. The relationships among these species in the phylogenetic tree agree with previous research, supporting the validity of our assembly and annotation (Figure 6). In total, 1,177 single-copy loci were found.
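The selection of single-copy loci can be illustrated with a short sketch. It assumes a tab-separated gene-count table with one row per orthogroup and one column per species, shaped like the gene-count table that OrthoFinder writes; the exact file layout, the species column names and the orthogroup IDs below are assumptions made for illustration only.

```python
import csv
import io


def single_copy_orthogroups(handle):
    """Return IDs of orthogroups with exactly one gene in every species.

    Assumes a tab-separated table whose first column is 'Orthogroup',
    followed by one count column per species and an optional 'Total'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    species = [
        column for column in reader.fieldnames
        if column not in ("Orthogroup", "Total")
    ]
    selected = []
    for row in reader:
        if all(int(row[name]) == 1 for name in species):
            selected.append(row["Orthogroup"])
    return selected


# Hypothetical gene-count table (species labels and counts are made up).
example_table = io.StringIO(
    "Orthogroup\tP_mucrosquamatus\tD_acutus\tG_gallus\tTotal\n"
    "OG0000001\t1\t1\t1\t3\n"
    "OG0000002\t2\t1\t1\t4\n"
    "OG0000003\t1\t0\t1\t2\n"
    "OG0000004\t1\t1\t1\t3\n"
)
single_copy = single_copy_orthogroups(example_table)
```

Orthogroups with a duplicated gene (`OG0000002`) or a missing species (`OG0000003`) are excluded, so only strict one-to-one loci remain for tree building.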

REUSE POTENTIAL
This genomic data will provide new resources for further study of viper biology and evolution, as well as the genetic basis of viper venom.

DATA AVAILABILITY
The data that support the findings of this study have been deposited into the CNGB Sequence Archive (CNSA) [41] of the China National GeneBank DataBase (CNGBdb) [42] with the accession number CNP0004048. Raw reads are available in the Sequence Read Archive under the BioProject ID PRJNA943598, and additional data is available in the GigaDB repository [43].

EDITOR'S NOTE
This paper is part of a series of Data Release papers presenting the reference genomes of different snake species [44].

Abbreviations: stLFR, single-tube long fragment read; TE, transposable element; WGS, whole genome sequencing.

The Institutional Review Board of BGI (BGI-IRB E22017) approved sample collection, experiments, and research design in this study. Throughout this research, strict adherence to the guidelines set by BGI-IRB was ensured during all procedures.

Figure 3. Distribution of TEs in our P. mucrosquamatus genome. The TEs include DNA transposons (DNA) and RNA transposons (i.e., LINEs, LTRs and SINEs). (a) Distribution of de novo sequence divergence rates. (b) Distribution of known sequence divergence rates.

Figure 6. Phylogenetic tree reconstructed using single-copy genes from nuclear genomes. The numbers on the branches of the phylogenetic tree represent the branch length obtained in OrthoFinder.

Table 1. Summary statistics of P. mucrosquamatus sequenced reads.

Table 2. Summary of the features of the P. mucrosquamatus genome.

Table 3. Statistics for the repetitive sequences identified in our P. mucrosquamatus genome.

Table 4. Summary of the TEs in our P. mucrosquamatus genome.

Table 5. Statistics for the miRNA, tRNA, rRNA and snRNA predicted in our P. mucrosquamatus genome.

Table 6. Results of gene functional annotation.