Annotation of segmentation pathway genes in the Asian citrus psyllid, Diaphorina citri

Insects have a segmented body plan that is established during embryogenesis when the anterior–posterior (A–P) axis is divided into repeated units by a cascade of gene expression. The cascade is initiated by protein gradients created by translation of maternally provided mRNAs, localized at the anterior and posterior poles of the embryo. Combinations of these proteins activate specific gap genes to divide the embryo into distinct regions along the anterior–posterior axis. Gap genes then activate pair-rule genes, which are usually expressed in parts of every other segment. The pair-rule genes, in turn, activate expression of segment polarity genes in a portion of each segment. The segmentation genes are generally conserved among insects, although there is considerable variation in how they are deployed. We annotated 25 segmentation gene homologs in the Asian citrus psyllid, Diaphorina citri. Most of the genes expected to be present in D. citri based on their phylogenetic distribution in other insects were identified and annotated. Two exceptions were eagle and invected, which are present in at least some hemipterans, but were not found in D. citri. Many of the segmentation pathway genes are likely to be essential for D. citri development, and thus they may be useful targets for gene-based pest control methods.


Introduction
Segmentation is the process by which repeated units of similar groups of cells are created along the anterior-posterior axis of a developing embryo. The molecular mechanisms involved in this process were first elucidated by large-scale developmental mutant screens in the insect model Drosophila melanogaster [1][2][3][4][5]. In Drosophila, segmentation begins with cytoplasmic inheritance of mRNAs that are maternally produced and provided to the oocyte. The products of these maternal-effect genes create gradients that define positional information within the embryo and activate a group of genes known as gap genes.
Gap genes are expressed in broad, well-defined domains in the early embryo and activate the next set of transcription factors: the pair-rule genes. Pair-rule genes are expressed in every other segment of the developing embryo. Together, they activate the segment polarity genes, which are expressed in a portion of each segment.
Comparative studies in diverse arthropod species have shown that some aspects of the segmentation pathway are highly conserved, while other aspects have undergone evolutionary change [6]. The hemipteran insects that have been examined seem to employ a particularly divergent method of segmentation. Most strikingly, the pair-rule genes, which are usually considered to be the most conserved portion of the segmentation pathway among insects, have lost their pair-rule expression and function and are expressed segmentally in at least some hemipterans [7][8][9].

Context
As part of a community genome annotation project, we have annotated 25 homologs of Drosophila segmentation pathway genes in the genome of the hemipteran agricultural pest, Diaphorina citri (NCBI:txid121845), also known as the Asian citrus psyllid. A few segmentation genes were identified in a previous version of the D. citri genome [10]. However, many of those genes were incomplete because of genome assembly errors. The current genome version (v3) is much higher quality [11], allowing us to annotate genes with much higher accuracy and confidence. Except for eagle and invected, which appear to be missing from the D. citri v3 genome, we annotated all of the segmentation genes expected to be present in D. citri. Our annotations pave the way for future work aimed at understanding the expression and function of these genes during D. citri segmentation and the identification of essential genes that could be used as insect control targets.

RESULTS AND DISCUSSION
We searched the D. citri v3 genome for orthologs of genes known to be involved in segmentation in Drosophila.
Table 2. The Drosophila melanogaster numbers were determined from FlyBase (RRID:SCR_006549) [16]. Ortholog numbers for Apis mellifera [17], Tribolium castaneum [18] and Acyrthosiphon pisum [19] are based on genome publications or NCBI (RRID:SCR_006472). Diaphorina citri ortholog numbers represent our final manual annotation.

Maternal effect genes
One-to-one orthologs of caudal (cad), dorsal (dl), and nanos (nos) were found in the D. citri v3 genome. dl was previously annotated in the D. citri genome v1.1 because of its role in innate immunity [10]. Here, we annotated a second isoform of dl (Table 1).
Phylogenetic analysis of the knirps family suggests that a single ancestral gene duplicated early in the insect lineage, producing two paralogs called knirps-related (knrl) and eagle (eg) [23]. Subsequent duplications have occurred in various insect lineages. A duplication in the lineage leading to Drosophila resulted in the paralogs knirps and knirps-like (also called knirps-related) [23]. A separate duplication of knrl seems to have occurred in the hemipteroid lineage, leading to three knirps family genes (two knrl and one eg) in most hemipterans [23]. In the D. citri genome v3 we identified three potential knirps family genes (Tables 1, 2), one of which was annotated as knirps in D. citri genome v1.1 [10]. These three knirps family genes are all located on the same chromosome, within a 900-kilobase pair (Kb) region. All three predicted proteins contain the highly conserved 94-amino acid N-terminal domain and the C-terminal PIDLS motif commonly found in knirps family members [23]. However, none of them contains the GASS-domain motif that is unique to the Eg protein [23]. Owing to the lack of this signature Eagle motif, the resulting D. citri annotations were named knirps-related 1 (knrl1), knirps-related 2 (knrl2) and knirps-related 3 (knrl3). Despite the lack of a GASS domain, it is possible that one of these genes is the ortholog of eg but has lost the characteristic motif. Interestingly, D. citri knrl1 has a small exon just 5′ of the highly conserved coding exon that is the first exon in most knirps family genes (Figure 2). Similar gene structure has been reported for one knirps family gene each in D. melanogaster, the honeybee Apis mellifera, A. pisum, and the human louse Pediculus humanus.
Most insects have two otd genes. However, in both Drosophila and A. pisum only one otd gene has been identified. Drosophila is missing the otd-2 ortholog, while otd-1 has apparently been lost in pea aphids [24]. In the D. citri v3 genome, we found two otd genes adjacent to one another on chromosome 4 (Table 1). Phylogenetic analysis suggests that one of these genes is an otd-1 ortholog, while the other is an otd-2 ortholog (Figure 3).
The genomic clustering of otd-1 and otd-2 has also been noted in other insects and crustaceans where their genomic location has been examined [25], providing further support for the identification of the D. citri genes as otd-1 and otd-2.
[26,29]. We performed BLAST searches of D. citri genome v3 with all of these hemipteran gt orthologs, but were unable to identify a D. citri gt ortholog (Table 2).
btd is a member of the Sp family of transcription factors. Recent reports indicate that the ancestral state for arthropods, and perhaps all metazoans, is likely to be the presence of three Sp members [31]. These three Sp family genes cluster into three monophyletic clades (Sp5/btd, Sp1-4/Spps and Sp6-9/Sp1) [31]. Even though the ancestral state appears to be the presence of three Sp family members, btd is absent from the A. pisum genome [19]. Furthermore, repeated efforts to clone btd from Oncopeltus fasciatus have only resulted in the identification of the two non-btd Sp genes. This suggests that btd may have been lost in the lineage leading to hemipterans. We were also unable to find a true btd ortholog in either the D. citri genome v3 or independent de novo transcriptomes (Table 2). Two Sp family members were found that appear to be orthologous to Sp1 and Spps, but these were not annotated.
(opa) in the D. citri genome v3 (Tables 1, 2). A partial copy of h had been annotated in a previous genome version [10]. Three of the pair-rule genes we annotated have closely related paralogs and required additional analysis before gene identities could be assigned.
Prd is a member of the Pax3/7 family of proteins. In Drosophila there are three Pax3/7 family genes, which are known to be involved in segmentation and neurogenesis: prd, gooseberry (gsb) and gooseberry-neuro (gsb-n). While the number of Pax3/7 genes varies in arthropods, data from insects and arachnids suggest that the roles of Pax3/7 in segmentation and neurogenesis are likely to be conserved in all arthropods [32]. In the D. citri genome v3, we also found three Pax3/7 genes, which we named paired, gooseberry and gooseberry-neuro based on reciprocal BLAST analysis and genomic location (Tables 1, 2). The gsb and gsb-n orthologs are discussed in more detail in the segment polarity gene section.
odd is a zinc finger transcription factor with three close relatives known as brother of odd with entrails limited (bowl), sister of odd and bowl (sob) and drumstick (drum). All four genes are in a conserved cluster in Tribolium castaneum and D. melanogaster [33]. In D. citri genome v3, drum, odd and sob are all located within a 400-Kb region, with odd and sob overlapping one another on opposite strands. It remains unclear whether the overlap is correct or results from misassembly, but the genes are almost certainly located very close together. D. citri bowl is located on the same chromosome about 20 megabase pairs (Mb) away. Separation of bowl from the rest of the cluster has also been observed in Anopheles gambiae [34].
Insects have four runt domain-containing genes: run, Runt-related A (RunxA), Runt-related B (RunxB) and lozenge (lz). All four genes are typically found in a cluster, and their order and orientation are well conserved across insects [35] (Figure 4). We were able to annotate full-length models for all four genes in the D. citri genome. The cluster appears to be intact, with all four genes identified in their expected order within a 300-Kb region (Figure 4).
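The kind of check described for the runt domain cluster (all genes on one chromosome, within a limited span, in an expected order) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Gene` record, the chromosome name, all coordinates, and the gene order used below are hypothetical placeholders and do not reproduce the D. citri v3 annotation or the order shown in Figure 4.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based start coordinate (hypothetical values below)
    end: int    # inclusive end coordinate


def cluster_span(genes: List[Gene]) -> int:
    """Length in bp from the first gene start to the last gene end."""
    return max(g.end for g in genes) - min(g.start for g in genes) + 1


def is_intact_cluster(genes: List[Gene], expected_order: List[str], max_span: int) -> bool:
    """True if all genes share a chromosome, fit within max_span,
    and appear in the expected order (read in either direction)."""
    if len({g.chrom for g in genes}) != 1:
        return False
    if cluster_span(genes) > max_span:
        return False
    observed = [g.name for g in sorted(genes, key=lambda g: g.start)]
    return observed == expected_order or observed == expected_order[::-1]


# Hypothetical coordinates and order, for illustration only.
genes = [
    Gene("run", "chrX", 120_000, 135_000),
    Gene("lz", "chrX", 40_000, 60_000),
    Gene("RunxA", "chrX", 250_000, 270_000),
    Gene("RunxB", "chrX", 180_000, 200_000),
]
expected = ["lz", "run", "RunxB", "RunxA"]

print(cluster_span(genes))
print(is_intact_cluster(genes, expected, max_span=300_000))
```

Accepting the reversed order reflects that a cluster read from the opposite strand is still the same arrangement. Gene orientation, which the assembly leaves uncertain for some genes, is deliberately not modeled here.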

Segment polarity genes
Many segment polarity genes are members of the Wnt and Hedgehog signaling pathways. Manual annotation of the Wnt pathway genes in the D. citri genome v3 is described in a separate report [36]. Here, we report the manual annotation of the segment polarity genes gooseberry (gsb) and engrailed (en) (Table 1). gsb and en each have a tightly linked paralog in many insects [37][38][39]. Surprisingly, we were unable to find the en paralog invected (inv) in the current genome assembly or the de novo transcriptome. However, we did find and annotate the gsb paralog gooseberry-neuro (gsb-n) in its expected position adjacent to gsb (Table 1). This positional information helped verify the identity of gsb-n, since phylogenetic analysis was inconclusive.

CONCLUSION
We searched for orthologs of 33 Drosophila segmentation genes in the D. citri v3 genome and identified and annotated 25 homologous genes. We were unable to find orthologs for 10 of the Drosophila genes, while D. citri has one segmentation gene (otd-2) whose ortholog has been lost in Drosophila. Most of the absences, except eagle and invected, were expected, based on the known phylogenetic distribution of the genes. While all the genes discussed in this report were initially identified because of their role in embryonic patterning and segmentation in Drosophila, many have other important functions, such as pole cell development, neural stem cell maintenance, sex determination, and immune function. Analysis of expression patterns and gene function will be required to determine which of these genes are involved in D. citri segmentation.
Figure 4. Cluster information from other insects was obtained from [35]. The runt domain (RD) clusters in D. melanogaster and T. castaneum have three genes in a core cluster, with lozenge (lz) separated from the cluster, but on the same chromosome distal to runt (run). The RD clusters in A. mellifera and P. humanus have all four genes clustered together with lz proximal to run. The RD cluster in D. citri most closely follows the pattern seen in A. mellifera and P. humanus. D. citri lz appears to be transcribed in the opposite direction compared to other insects, but this may be attributable to local misassembly. The orientation of D. citri Runt-related B (RunxB) is uncertain, since there are tandem artifactual duplicates that are on opposite strands. We chose the RunxB copy closest to Runt-related A (RunxA) for annotation. Future assembly improvements may help resolve the gene orientation in this cluster.

DATA VALIDATION AND QUALITY CONTROL
Orthology of annotated genes was verified by reciprocal BLAST searches and phylogenetic analysis. Completeness of gene models (defined as a complete coding region) was assessed by comparing the encoded protein to orthologous proteins. Evidence used for validation of gene models is shown in Table 1.
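As an illustration of the reciprocal BLAST step, the following Python sketch shows one common way to call reciprocal best hits from two sets of search results. It is a minimal sketch, not the procedure used in this project: the hit tuples, the gene identifiers (`Dmel_*`, `Dcit_*`) and the bit scores are hypothetical, and parsing of real BLAST output is left out.

```python
from typing import Dict, List, Tuple

# Each hit is (query_id, subject_id, bit_score). All identifiers and scores
# below are hypothetical and serve only to illustrate the logic.
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: List[Hit], reverse: List[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Forward search: Drosophila proteins against D. citri gene models.
forward = [
    ("Dmel_cad", "Dcit_g1", 210.0),
    ("Dmel_cad", "Dcit_g2", 95.0),
    ("Dmel_prd", "Dcit_g3", 180.0),
]
# Reverse search: D. citri gene models against Drosophila proteins.
reverse = [
    ("Dcit_g1", "Dmel_cad", 205.0),
    ("Dcit_g3", "Dmel_gsb", 190.0),
    ("Dcit_g3", "Dmel_prd", 175.0),
]

pairs = reciprocal_best_hits(forward, reverse)
print(pairs)
```

In this toy input the second candidate fails the reciprocal test because its best reverse hit is a paralog. Cases like that are where additional evidence, such as phylogenetic analysis or genomic location, is needed before a name is assigned.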

REUSE POTENTIAL
This manual curation was carried out as a part of the D. citri community annotation project [40], with a goal to annotate gene families related to immune response, metabolism and other major functions [36,41–43]. As scientists search for ways to control the spread of Huanglongbing, understanding the developmental pathways of its vector, D. citri, may provide insights into essential genes that could be targeted by pest control methods. The availability of accurate gene models will facilitate the design of experiments aimed at understanding the expression and function of these genes.

DATA AVAILABILITY
The gene models will be part of an updated official gene set (OGS) for D. citri that will be submitted to NCBI. The OGS (v3) will also be publicly available for download, BLAST analysis and expression profiling on Citrusgreening.org and the Citrus Greening Expression Network [40]. The D. citri genome assembly (v3), OGS (v3) and transcriptomes are accessible on Citrusgreening.org and NCBI. Accession numbers for genes used in multiple alignments or phylogenetic trees are provided in Table 3, and all additional data is available via the GigaScience GigaDB repository [44].
Table 3. Species, accession number, full name and abbreviated name are provided for all orthologs used in multiple alignments and phylogenetic trees [18,21,23,45–50]. The asterisk (*) denotes a gene that appears to be incorrectly named doublesex in GenBank, but is actually a knirps family gene.
Gigabyte, 2021, DOI: 10.46471/gigabyte.26

EDITOR'S NOTE
This article is one of a series of Data Releases crediting the outputs of a student-focused and community-driven manual annotation project curating gene models and, if required, correcting assembly anomalies, for the Diaphorina citri genome project [11].