Chromosomal-level genome assembly of golden birdwing Troides aeacus (Felder & Felder, 1860)

The golden birdwing Troides aeacus (Lepidoptera, Papilionidae), a significant species in Asia, faces habitat loss due to urbanization and human activities, necessitating its protection. However, the lack of genomic resources hinders our understanding of its biology and diversity and impedes conservation efforts based on genetic information or markers. Here, we present the first chromosomal-level genome assembly of T. aeacus using PacBio SMRT and Omni-C scaffolding technologies. The assembled genome (351 Mb) contains 98.94% of the sequences anchored to 30 pseudo-molecules. The genome assembly has high sequence continuity, with contig length N50 = 11.67 Mb and L50 = 14, and scaffold length N50 = 12.2 Mb and L50 = 13. A total of 24,946 protein-coding genes were predicted, with high BUSCO completeness scores (98.8% and 94.7% for the genome and proteome, respectively). This genome offers a significant resource for understanding swallowtail butterfly biology and supporting its conservation.


INTRODUCTION
The golden birdwing butterfly Troides aeacus (Figure 1A) is a swallowtail butterfly that is widely distributed in Asia, including Bangladesh, Myanmar, Cambodia, China, India, Laos, Malaysia, Nepal, Thailand, and Vietnam [1]. The species is generally large, with a wingspan reaching ∼15 cm, and has iconic black forewings and golden-yellow hindwings marked with grey stripes and black spots [2,3]. Due to its attractive appearance, it has been widely collected and traded in curio markets [2,4,5].
Similar to other holometabolans, T. aeacus passes through larval and pupal stages, with five larval instars preceding the green-girdled pupal stage [3]. The larvae are generally dependent on Aristolochiaceae host plants, especially of the genus Aristolochia, which can be commonly found in Asia [1–3,6]. After emergence, the adults feed and live around nectaring flowers such as those in the genera Hibiscus, Ixora, Lantana, Mussaenda, and Spathodea [1,7]. Anthropogenic activities, including deforestation, grazing, herbicide application, hunting, land reclamation, mine exploitation, and trading, have been suggested to pose threats to T. aeacus [1,3,8]. In certain places, such as Hong Kong, protection and restoration efforts have also been suggested for T. aeacus to recover its lost habitat. In Taiwan, the trade of endemic subspecies such as T. aeacus is regulated by the Convention on International Trade in Endangered Species of Wild Fauna and Flora [9].

CONTEXT
To date, the genomic resources in the genus Troides are confined to T. helena [10] and T. oblongomaculatus [11]. In light of the high conservation value of T. aeacus and its phylogenetic importance for understanding the diversification of butterflies [12], this species has been selected for genome sequencing by the Hong Kong Biodiversity Genomics Consortium (also known as EarthBioGenome Project Hong Kong), which is formed by investigators from eight publicly funded universities in Hong Kong. Here, we report a chromosomal-level genome assembly of the golden birdwing T. aeacus.

Sample collection and species identification
A pupa of the golden birdwing T. aeacus was obtained at Lui Kung Tin, Yuen Long District, Hong Kong (22.425886 °N, 114.

DNA shearing, PacBio library preparation, and sequencing
A total of 120 μl of DNA, corresponding to 10 μg DNA, was transferred to a g-tube (Covaris Part No. 520079). The tube was then subjected to six centrifugation steps at 2,000 × g for 2 minutes each. The resultant DNA was saved in a 2 mL DNA LoBind ® Tube (Eppendorf Cat.  the SMRTbell library, respectively. Next, the library was loaded at an on-plate concentration of 50-90 pM using the diffusion loading mode. The sequencing was conducted on the Sequel IIe System with an internal control provided in the kit. Sequencing was performed with 30-hour movies and a 120 min pre-extension, using the software SMRT Link v11.0 (PacBio). HiFi reads were generated and collected for further analysis. One SMRT cell was used for this sequencing (Table 1).

Omni-C library preparation and sequencing
An Omni-C library was made using the Dovetail ® Omni-C ® Library Preparation Kit (Dovetail Cat. No. 21005) according to the provided protocol. In summary, 80 mg of frozen, powdered tissue sample was placed in a microcentrifuge tube with 1 mL 1× PBS and formaldehyde. were used to amplify the constructed library. Size selection was carried out with SPRIselect™ Beads targeting fragments ranging between 350 bp and 1,000 bp. Finally, the concentration and fragment size of the sequencing library were examined with the Qubit ® Fluorometer, Qubit™ dsDNA HS, and BR Assay Kits, and the TapeStation D5000 HS ScreenTape, respectively. The resultant library was sequenced on the Illumina HiSeq-PE150 platform (Table 1).
Haplotypic duplications were identified and removed using purge_dups (RRID:SCR_021173) based on the depth of the HiFi reads [15]. Proximity ligation data from the Omni-C library were used to scaffold the genome assembly by YaHS [16]. Transposable elements (TEs) were annotated as previously described [17] using the automated Earl Grey TE annotation pipeline (version 1.2, https://github.com/TobyBaril/EarlGrey). A total of 38,780 Papilionidae reference protein sequences were downloaded from NCBI as protein hits to perform genome annotation using Braker (v3.0.8; RRID:SCR_018964) [18] with default parameters.

DATA VALIDATION AND QUALITY CONTROL
During DNA extraction and PacBio library preparation, the samples were subjected to quality control with the NanoDrop™ One/OneC Microvolume UV-Vis Spectrophotometer, Qubit ® Fluorometer, and overnight pulsed-field gel electrophoresis. The Omni-C library was inspected by Qubit ® Fluorometer and TapeStation D5000 HS ScreenTape.
Regarding the genome assembly, the Hifiasm output was searched against the NT database using BLAST, and the resultant output was used as input for Blobtools (v1.1.1; RRID:SCR_017618) [19]. Scaffolds that were identified as possible contamination were removed from the assembly manually (Figure 2). A k-mer-based statistical approach was applied to estimate the heterozygosity of the genome. The repeat content and the corresponding sizes were analysed using Jellyfish (RRID:SCR_005491) [20] and GenomeScope (RRID:SCR_017014) [21] (Figure 1D; Table 4). Furthermore, telomeric repeats were inspected by FindTelomeres [22]. BUSCO (v5.5.0) [23] was used to assess the completeness of the genome assembly and gene annotation with the Lepidoptera dataset (lepidoptera_odb10).
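The k-mer-based estimate described above rests on a k-mer frequency histogram: Jellyfish counts k-mers in the reads, and GenomeScope fits a model to the resulting histogram. The following is only a toy sketch of the counting step, not the authors' pipeline; the function names and the choice of k are illustrative assumptions (the k-mer size used in this study is not stated here), and no model fitting is attempted.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (upper case, ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k):
    """Map each k-mer multiplicity to the number of distinct canonical k-mers with it."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


if __name__ == "__main__":
    # Two identical toy reads; k=3 is chosen only to keep the example readable.
    histogram = kmer_histogram(["ACGTA", "ACGTA"], k=3)
    for multiplicity in sorted(histogram):
        print(multiplicity, histogram[multiplicity])
```

In a real histogram from a diploid genome, heterozygous and homozygous k-mers typically form two peaks at roughly half and full coverage, and it is the relative size of those peaks that a model-fitting tool uses to estimate heterozygosity.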

Genome assembly of T. aeacus
A total of 27 Gb of HiFi bases were generated, with an average HiFi read length of 9,688 bp and 78X coverage (Supplementary Information 1). After incorporating 21.7 Gb of Omni-C data, the resulting genome assembly was 350.66 Mb in size with 36 scaffolds, 30 of which are of chromosome length (Figure 1B-C; Tables 2 and 3). The genome has high contiguity with a scaffold N50 value of 12.21 Mb, and high completeness with a complete BUSCO (RRID:SCR_015008) estimation of 98.8% (lepidoptera_odb10) (Figure 1B, Table 2). Although the estimated genome size was about 268.3 Mb with a 2.93% nucleotide heterozygosity rate (Figure 1D; Table 4), the assembled T. aeacus genome is similar in size to other swallowtail butterfly genomes, including T. helena (∼330 Mb) [10] and T. oblongomaculatus (∼348 Mb) [11]. In addition, 43 telomeres were found in 25 scaffolds of the assembled genome (Table 5). Furthermore, 23,068 gene models were predicted with a BUSCO score of 94.7% (lepidoptera_odb10).
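For readers less familiar with the contiguity statistics quoted above, the sketch below shows how N50 and L50 are conventionally defined and how mean coverage follows from the sequencing yield. It is a minimal illustration: the function names and the toy lengths are assumptions for the example, and it is not the tool used to produce the values reported here.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the assembly; L50 is the
    number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise RuntimeError("unreachable: cumulative length never reached half the total")


def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    # Toy lengths (arbitrary units), not values from this assembly.
    print(n50_l50([40, 30, 20, 10]))  # (30, 2)
    # Rounded figures from the text: 27 Gb of HiFi bases, 350.66 Mb assembly.
    print(f"{mean_coverage(27e9, 350.66e6):.1f}x")  # 77.0x
```

With the rounded yield of 27 Gb the arithmetic gives about 77x, close to the reported 78X; the small difference presumably reflects rounding of the yield figure.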

Repeat content
A total repetitive content of 29.50% was identified in the assembled genome, including 5.16% unclassified elements (Figure 1E; Table 6). Among the known repeats, long interspersed nuclear elements (LINEs) were the most abundant (12.01%), followed by short interspersed nuclear element (SINE) retrotransposons (6.38%) and DNA transposons

CONCLUSION AND REUSE POTENTIAL
This study presents the first chromosomal-level genome assembly of the golden birdwing T. aeacus, a useful and precious resource for further phylogenomic studies of birdwing butterfly species in terms of species diversification and conservation.

DATA AVAILABILITY
The final genome assembly was submitted to NCBI under the accession number GCA_033220335.2. The raw reads yielded from this study were deposited on the NCBI database under the BioProject accession number PRJNA973839. The genome annotation files were deposited in figshare [28].
10538 °E) in August 2022. The pupa was snap-frozen in liquid nitrogen upon collection. The frozen pupa was then ground into a fine powder and stored at −80 °C until DNA isolation. A portion of the powder was used for molecular species identification with the QIAamp DNA Mini Kit (Qiagen Cat. No. 51306), following the provided protocol. The DNA was then used as a template for conventional PCR with the following protocol: an initial denaturation step at 95 °C for 3 minutes, followed by 36 amplification cycles of denaturation at 95 °C for 30 seconds, primer annealing at 55 °C for 30 seconds, and extension at 72 °C for 1 minute, and a final extension step at 72 °C for 3 minutes. The reaction mixture included PCR buffer, DNA template, 2 mM dNTP, 1.5 mM MgCl2, 0.4 mM each of the forward and reverse primers (LCO1490: 5′-GGTCAACAAATCATAAAGATATTGG-3′, HCO2198: 5′-TAAACTTCAGGGTGACCAAAAAATCA-3′) [13], and Taq DNA polymerase. The PCR was performed on a T100™ thermal cycler (Bio-Rad, USA). The unpurified PCR products were sent to BG Hong Kong for Sanger sequencing. The returned sequence was validated with the chromatogram, and the resultant sequence was searched against GenBank for species validation using the BLASTN algorithm (RRID:SCR_001598).
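The thermal cycling program described above can be written down as a small data structure, which makes the cycle structure explicit and gives the total programmed hold time. This is only an illustrative sketch: the class and function names are assumptions for the example, and the calculation ignores ramping between temperatures, so the actual run time on a thermal cycler would be longer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One temperature hold in a PCR program."""
    name: str
    temp_c: int
    seconds: int


# Values taken from the protocol in the text.
INITIAL = Step("initial denaturation", 95, 180)
CYCLE = (
    Step("denaturation", 95, 30),
    Step("primer annealing", 55, 30),
    Step("extension", 72, 60),
)
FINAL = Step("final extension", 72, 180)
N_CYCLES = 36


def hold_time_seconds(initial, cycle, n_cycles, final):
    """Sum of programmed hold times, excluding temperature ramping."""
    return initial.seconds + n_cycles * sum(step.seconds for step in cycle) + final.seconds


if __name__ == "__main__":
    total = hold_time_seconds(INITIAL, CYCLE, N_CYCLES, FINAL)
    print(f"{total} s = {total // 60} min")  # 4680 s = 78 min
```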
No. 022431048) at 4 °C until library preparation. The molecular weight of the isolated DNA was examined by overnight pulsed-field gel electrophoresis. The electrophoresis profile was set as follows: 5 K as the lower end and 100 K as the higher end for the designated molecular weight; Gradient = 6.0 V/cm; Run time = 15 h:16 min; included angle = 120°; Int. Sw. Tm = 22 s; Fin. Sw. Tm = 0.53 s; Ramping factor: a = Linear. The gel was run in 1.0% PFC agarose in 0.5× TBE buffer at 14 °C. A SMRTbell library was made using the SMRTbell ® prep kit 3.0 (PacBio Ref. No. 102-141-700), following the provided protocol. In summary, single-stranded overhangs of the genomic DNA were removed, and the DNA was repaired from any physical damage caused by shearing. Subsequently, both DNA ends were tailed with an A-overhang, and ligation of T-overhang SMRTbell adapters was performed at 20 °C for 30 minutes. The SMRTbell library was then purified with SMRTbell ® cleanup beads (PacBio Ref. No. 102158-300). The size and concentration of the library were assessed with pulsed-field gel electrophoresis and the Qubit ® Fluorometer, Qubit™ dsDNA HS, and BR Assay Kits (Invitrogen™ Cat. No. Q32851), respectively. A subsequent nuclease treatment step was carried out to remove non-SMRTbell structures in the library. A final size-selection step was performed to remove small DNA fragments in the library with 35% AMPure PB beads. The
The fixed DNA was digested with endonuclease DNase I. Next, the concentration and size of the digested sample were examined by the Qubit ® Fluorometer, Qubit™ dsDNA HS, and BR Assay Kits (Invitrogen™ Cat. No. Q32851) and the TapeStation D5000 HS ScreenTape, respectively. Both DNA ends were polished, and ligation of the biotinylated bridge adaptor was conducted at 22 °C for 30 minutes. The subsequent proximity ligation between crosslinked DNA was performed at 22 °C for 1 hour. After ligation, the DNA was reverse crosslinked and purified with SPRIselect™ Beads (Beckman Coulter Product No. B23317) to remove the biotin that was not internal to the ligated fragments. The Dovetail™ Library Module for Illumina (Dovetail Cat. No. 21004) was used for end repair and adapter ligation. The DNA was tailed with an A-overhang, which allowed Illumina-compatible adapters to ligate to the DNA fragments at 20 °C for 15 minutes. The Omni-C library was then sheared into fragments with USER Enzyme Mix and purified with SPRIselect™ Beads. The isolation of DNA fragments with internal biotin was performed with Streptavidin Beads. Universal and Index PCR Primers from the Dovetail™ Primer Set for Illumina (Dovetail Cat. No. 25005)

Figure 2. Genome assembly quality control and contaminant detection.

Table 1. Summary of the genome sequencing data.
Sequel ® II binding kit 3.2 (PacBio Ref. No. 102-194-100) was used for final preparation. In short, Sequel II primer 3.2 and Sequel II DNA polymerase 2.2 were annealed and bound to

Table 2. Details of the genome assembly statistics.

Table 5. Summary of telomeric repeats found in 25 scaffolds.

Table 6. Summary of the repetitive elements in the genome.