The genome of Tripterygium wilfordii and characterization of the celastrol biosynthesis pathway

Tripterygium wilfordii is a vine from the Celastraceae family that is used in traditional Chinese medicine (TCM). The active ingredient, celastrol, is a friedelane-type pentacyclic triterpenoid with putative roles as an antitumor, immunosuppressive, and anti-obesity agent. Here, we report a reference genome assembly of T. wilfordii with high-quality annotation using a hybrid sequencing strategy. The total genome size obtained is 340.12 Mb, with a contig N50 value of 3.09 Mb. We successfully anchored 91.02% of sequences into 23 pseudochromosomes using high-throughput chromosome conformation capture (Hi–C) technology. The super-scaffold N50 value was 13.03 Mb. We also annotated 31,593 structural genes, with a repeat percentage of 44.31%. These data demonstrate that T. wilfordii diverged from Malpighiales species approximately 102.4 million years ago. By integrating genome, transcriptome and metabolite analyses, as well as in vivo and in vitro enzyme assays of two cytochrome P450 (CYP450) genes, TwCYP712K1 and TwCYP712K2, it is possible to investigate the second biosynthesis step of celastrol and demonstrate that this was derived from a common ancestor. These data provide insights and resources for further investigation of pathways related to celastrol, and valuable information to aid the conservation of resources, as well as understand the evolution of Celastrales.


INTRODUCTION
Tripterygium wilfordii Hook. f. (NCBI: txid458696) is a perennial twining shrub belonging to the Celastraceae family. It is known in China as 'Lei gong teng' (meaning: Thunder God Vine). It is indigenous to Southeast China, the Korean Peninsula, and Japan, and has been cultivated worldwide as a medicinal plant [1,2] (Figure 1). The extract of T. wilfordii bark has been used as a pesticide in China since ancient times, and was first recorded in the Illustrated Catalogues of Plants published in 1848 [3]. The potential medicinal activity of T. wilfordii has been studied since the 1960s, with its root being used to alleviate the symptoms of leprosy patients in Gutian County, Fujian Province, China [4]. This application ignited the interest of researchers in various fields. T. wilfordii was then reported to be effective in the treatment of autoimmune diseases, such as rheumatoid arthritis and systemic psoriasis [5,6]. In recent decades, many studies have examined the potential anticancer, antidiabetic and anti-inflammatory effects of extracts of T. wilfordii [7][8][9].  For Illumina sequencing, qualified DNA was fragmented using a Covaris device (MA, USA). Fragmented DNA was end-repaired; poly(A) tail and adaptor addition was performed using the Next Ultra DNA Library Prep Kit (NEB, MA, USA), then the appropriate samples were selected by electrophoretic analysis. The size-selected product was PCR-amplified, and the final product was purified and validated using AMPure XP beads (Beckman Coulter, CA, USA) and an Agilent Bioanalyzer 2100. Using the HiSeq 2500 platform, 150-bp paired-were sequenced. Clean data were obtained by removing adaptor reads, unidentified nucleotides (N) and low-quality reads from the raw reads, and Q20, Q30 and GC content of the clean data were calculated for quality assessment (Table 1). We estimated the genome size by performing k-mer frequency analysis. The k-mer frequencies (k-mer size = 17) were obtained using Jellyfish v2.2.7 [24] with jellyfish count -G 2 -s 5G -m 17 and jellyfish stats as the default parameters.
For long-read sequencing, qualified DNA was sheared into fragments in a g-TUBE (Covaris, MA, USA) by centrifugation, and quantity and quality were controlled by an Agilent Bioanalyzer 2100. To construct a sequencing library, the fragmented DNA was end-repaired and poly(A) tail and adaptor addition was performed using Next Ultra II End Repair/dA-Tailing Module, Next FFPE DNA Repair Mix and Next Quick Ligation Module (NEB, MA, USA), respectively, according to the manufacturer's instructions. The final product was validated using an Agilent Bioanalyzer 2100. Finally, the qualified DNA library was sequenced using Oxford Nanopore Technology (ONT) on the PromethION platform.

Sequence anchoring
Hi-C library preparation and sequencing were based on a protocol described previously, with some modifications [36][37][38] Clean data were obtained by removing adaptor reads, unidentified nucleotides (N) and low-quality reads from the raw reads, and Q20, Q30 and GC content of the clean data were calculated for quality assessment (Table 2). Clean data were first mapped to the draft genome using BWA software (version: 0.7.8-r455) [29]. After removal of PCR duplicates and unmapped reads using SAMtools v1.9 [39], based on the numbers of interacting read pairs,  low-quality reads were removed to generate clean data, and Q20, Q30 and GC content of the clean data were calculated for quality assessment (Table 3). RNA-seq data was mapped back to the genome assembly of T. wilfordii using HISAT v2.0.4 with default parameters [41]. The read numbers of each gene were counted using HTSeq v0.6.1 with the parameter: -m union [42]. Fragments per kilobase of transcript per million fragments mapped (FPKM) of each gene was calculated based on the length of the gene and number of read counts mapped to this gene [43]. For full-length transcriptome sequencing by PacBio, the best quality RNA samples of each tissue were mixed together to build an isoform sequencing library using the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System protocol, as described by Pacific Biosciences (PN 100-092-800-03). Samples were then sequenced on the PacBio Sequel platform. Sequence data were processed using SMRTlink 7.0 software [44] with the parameters: -minLength 50, -maxLength 15000, -minPasses 1. Error correction was achieved using the Illumina RNA-seq data with LoRDEC v0.7, with the parameters: -k 23, -s 3 [45]. Redundancy in the corrected consensus reads was removed by CD-HIT v4.6.8 [46], with the parameters: -c 0.95, -T 6, -G 0, -aL 0.00, -aS 0.99, -AS 30 to obtain the final transcripts for the subsequent analysis.
A combined strategy based on homology, gene prediction, RNA-seq and PacBio data was used to annotate gene structure. communis, Oryza sativa, and Amborella trichopoda, was inferred through all-against-all protein sequence similarity searches using OthoMCL v1.4 [67], with the parameters: -mode 3 and -inflation 1.5. Proteins containing fewer than 50 amino acids were removed, and only the longest predicted transcript per locus was retained.
Single-copy orthologous genes were retrieved from the 12 species and aligned using MUSCLE v3.8.31 with default parameters [68]. All alignments were combined to produce a super-alignment matrix, which was used to construct a maximum likelihood (ML) phylogenetic tree using RAxML v8.2.12 [69]  Divergence times between species were calculated using the MCMCtree v4.9 program implemented for phylogenetic analysis by maximum likelihood (PAML) with the default parameters [70]. The following calibration points were applied: M. truncatula-G. uralensis  Table 4. Primers used for gene cloning.

Genome-wide identification of CYP genes
The hidden Markov model (HMM) profile of Pfam PF06200 [62] was used to extract full-length CYP candidates from the T. wilfordii genome by the HMM algorithm (HMMER) [71], filtering by a length between 400 and 600 amino acids [74].

Phylogenetic analyses
Multiple sequence alignments and phylogenetic tree construction were performed using MEGA X [75], with either the neighbor-joining or ML method with a bootstrap test (n = 1000 replications).

Co-expression analysis
Gene expression pattern analysis was performed using Short Time-series Expression Miner (STEM) software [76] on the OmicShare tools platform [77]. The parameters were set as follows: the maximum unit change in model profiles between time points was 1; the maximum output profile number was 20 (similar profiles were merged); the minimum ratio of fold change of differentially expressed genes (DEGs) was no less than 2.0, and the P-value was <0.05.

Gene cloning
The complete open reading frames (ORFs) of the putative CYP genes were amplified using the primers listed in Table 4, with cDNA from T. wilfordii root used as the template.
According to the manufacturer's instructions, fragments were cloned into the entry vector pDONR207 and yeast expression vector pYesdest52 using the Gateway BP Clonase II Enzyme Kit and LR Clonase II Enzyme Kit (Invitrogen, MA, USA), respectively.

Metabolite analysis
Plant tissue was ground into powder in liquid nitrogen then freeze dried. Fifty milligrams of sample was suspended in 2 mL of 80% (v/v) methanol, set overnight at room temperature, then extracted in an ultrasonic water bath for 60 min. After centrifugation at 12,000g for 2 min, the supernatant was filtered through a 0.2-μm Millipore filter before liquid chromatography-mass spectrometry (LC-MS) analysis.
Levels of celastrol and wilforic acid A were analyzed using an Agilent 1260LC-6400 QQQ

Enzyme assays of yeast in vivo
Yeast in vivo assays were performed following a previously described protocol with some modifications [78]. The yeast expression vector constructs or empty vector were transformed into the yeast Saccharomyces cerevisiae WAT11 [79,80]

Enzyme assays in vitro
The protocol for enzyme assays in vitro was performed as described previously with some modifications [81]. Yeast transformation and target protein induction were performed as

Syntenic analyses
The genomes of T. wilfordii, O. sativa japonica and V. vinifera were compared using MCScan Toolkit v1.1 [83] implemented in Python. The genomes of O. sativa v32 and V. vinifera v32 were downloaded from Ensembl Plants [84]. Syntenic gene pairs were identified using an all-vs-all BLAST search using LAST [85], filtered to remove pairs with scores below 0.7, and clustered into syntenic blocks in MCScan. Microsynteny plots were constructed using MCScan.

Genome sequencing, assembly, and annotation
We obtained 77.86 Gb of Nanopore reads, amounting to 207.16× coverage of the 375.84-Mb genome, a size estimated by k-mer distribution analysis ( Figure 3 and Table 5).
The GC content of the genome was 37.19%, with 0.00% N ( Table 7).
Short reads obtained from Illumina sequencing in the genome survey were aligned to the genome (Table 5), which exhibited a high consistency with a 95.31% mapping rate and 93.99% coverage (Table 10).
In addition, 87.85% of expressed sequence tags (ESTs) could be identified in the assembly, indicating high coverage of the genome (Table 11).  (Tables 12 and 5).
Furthermore, we anchored 91.02% of the original 342.61-Mb assembly into 23 groups using Hi-C technology (Tables 13 and 14).  (Tables 13 and 14). The number of groups, hereafter referred to as pseudochromosomes, corresponded well to the number of chromosomes reported previously [86].
For genome annotation and gene expression profile analyses, roots, stems, young leaves, mature leaves, flower buds and flowers of T. wilfordii plants were collected prior to RNA-seq using the Illumina platform. Furthermore, RNA samples from different tissues were mixed, then sequenced using the PacBio platform to obtain full-length transcriptome sequences (Table 5). A combined strategy involving de novo prediction, homology prediction, RNA-seq and PacBio read alignment was used to construct the gene structure for the T. wilfordii genome. The final set of annotated genes amounted to 31,593 genes, with an average length of 3180 bp and an average coding sequence length of 1182 bp (Table 15).  (Table 16).
Repeat sequence annotation showed that the T. wilfordii genome contained 44.31% repetitive sequences. Among these sequences, tandem repeats (small satellites and microsatellites) and interspersed repeats accounted for 0.95% and 43.36%, respectively.   Integrated distributions of the genes, repeats, noncoding RNA densities, and all detected segmental duplications are shown in Figure 5.

Comparative genomic analysis
To identify evolutionary characteristics and gene families, the T. wilfordii genome was and 485 of these shared families were single-copy gene families ( Figure 6). Gene family numbers were compared between T. wilfordii and four fabid species. As shown in Figure 7A (Figures 5B and 8).
These results were consistent with a previously proposed phylogenetic order, in which Celastrales and Malpighiales were found to be sister to each other [88].

Genome-wide identification and analysis of CYP candidates involved in celastrol biosynthesis
The candidate gene identification procedure is illustrated in Figure 9. these were annotated phylogenetic analysis with CYPs from A. thaliana. Thirty-five CYPs related to triterpenoid oxidases were identified, belonging to different subfamilies, including CYP716, CYP72, CYP71, CYP93, CYP705A and CYP81, which were previously reported to be functionally associated with diverse triterpenoid structural modifications ( Figure 10) [89]. On the other hand, the expression patterns of the 213 identified CYPs were identified with TwOSC1 and TwOSC3, which are the two committed enzymes involved in the biosynthesis of the precursor of celastrol in T. wilfordii [21]. Based on RNA-seq data for various tissues, 20 profiles of gene coexpression were obtained, of which only profiles #3 and #13 showed significance (P-value < 0.05) ( Figure 11). Profile #3 contained TwOSC3, and 45 CYPs showed similar expression patterns, while profile #13 included TwOSC1, and 51 CYPs had coexpression trends ( Figure 12). This suggests that these CYPs are potentially involved in the biosynthesis of celastrol.
To narrow down the candidate genes, the 35 CYPs identified by phylogenetic analysis were compared, and the genes showed patterns of coexpression with TwOSC1 and TwOSC3. As shown in Figure 13A, nine and seven triterpenoid biosynthesis-related CYPs showed patterns of coexpression with TwOSC1 and TwOSC3, respectively. However, no CYPs were common between the TwOSC1 group and TwOSC3 group. Based on tissue expression profiles, the 16 CYPs were clustered separately into two clades with TwOSC1 and TwOSC3, in which TwOSC3 exhibited root-specific expression, while TwOSC1 was highly expressed in leaves and other aerial parts ( Figure 13B).
Gene-to-gene and gene-to-metabolite Pearson's correlation coefficients (r) were calculated using the tissue expression profiles of the 16 outstanding CYPs mentioned above, as well as three other known genes related to celastrol biosynthesis (TwHMGR1, TwFPS1 and TwDXR) [90][91][92], and the known intermediate product and celastrol concentrations [21].
As shown in Figure 13C, seven CYPs positively correlated with celastrol biosynthesis-related genes, with high Pearson's r and significant P-values. In addition, these CYPs highly

Heterologous expression and characterization of putative CYPs
The full-length ORFs of TwCYP712K1, TwCYP712K2 and TwCYP712K3 were successfully clones and separately expressed in yeast fed with friedelin or 29-hydroxy-friedelan-3-one. However, no new peaks could be detected from the yeast strains expressing the enzymes and supplemented with friedelin compared with the empty vector (EV) control (see GigaDB [87]). This probably because the hydrophobic substrate could not be transported into the yeast cells. When fed with 29-hydroxy-friedelan-3-one, both TwCYP712K1 and TwCYP712K2 converted the substrate to a new compound possessing the same mass charge ratio (m/z) as the polpunonic acid standard, while TwCYP712K3 showed no such activity in this assay ( Figure 15A and C). To further explore the enzyme activities, microsomes were extracted from yeast cells and the proteins incubated with friedelin or 29-hydroxy-friedelan-3-one for 12 h. As shown in Figure 15B, TwCYP712K1 and TwCYP712K2 converted 29-hydroxy-friedelan-3-one to a new compound in vitro, consistent with previous yeast in vivo assays. In addition, no new peak was detected from the enzyme reactions supplemented with friedelin compared with the EV control (see GigaDB [87]).

Evolutionary analyses of TwCYP712K1 and TwCYP712K2
Genome analysis showed that TwCYP712K1 (tw18g06010.1) and TwCYP712K2 (tw18g06210.1) were located in pseudochromosome 18 within an approximately 200-kb region. This led us to examine the evolutionary relationship between specific CYPs. We hypothesized that a gene duplication event must have occurred during evolution. However, amino acid alignment indicated only 70.57% identity between TwCYP712K1 and

DISCUSSION
In this study, we provided a high-quality reference genome of T. wilfordii with a 340.12 Mb genome assembly (90.5% of the 375.84 Mb estimated genome size) and 3.09 Mb contig N50, and successfully anchored 91.02% of the sequences into 23 pseudochromosomes (Table 12).
The quality of our genome is close to that of the recently published T. wilfordii genome (348.38 Mb total contigs and 4.36 Mb contig N50), which was sequenced and assembled using Illumina, PacBio and Hi-C sequencing [86]. They also identified a key CYP gene that can catalyze the oxidation of a methyl group to the acid moiety of dehydroabietic acid in triptolide biosynthesis, another clinically used specialized metabolite in T. wilfordii.
Based on this genomic data, 35 CYP genes related to triterpenoid structure modification were identified according to phylogenetic analysis ( Figure 8); 16 of these were co-expressed with TwOSC1 or TwOSC3 according to tissue-specific transcript profiles (Figures 11 and 12).
These genes could be divided into two groups: the TwOSC1 group was highly expressed in leaves or other aerial parts, and the TwOSC3 group was specifically expressed in roots The gradient bar represents the expression levels from high (red) to low (blue) with log2 normalization, and the gray color represents the empty value. R, root; YL, young leaf; L, leaf; S, stem; FB, flower bud; F, flower. The numbers 1-3 represent three biological replicates. (C) Matrix of Pearson's correlation coefficient and corresponding P-value of compounds, biosynthesis-related genes and CYP candidates. The lower triangle matrix represents Pearson's correlation coefficient, and the upper triangle matrix presents P-values (green indicates P-values < 0.01, and gray indicates nonsignificant correlation). The gradient bar represents the correlation coefficient of positive or negative correlation from high to low. The black boxes enclose the highly correlated genes or compounds. (D) Phylogenetic analysis of putative CYPs in celastrol biosynthesis and CYPs known to catalyze structural modifications on triterpenoid scaffolds A phylogenetic tree was built using the maximum-likelihood method with a bootstrap test (n = 1000 replications). Allene oxide synthases AtCYP74A1 was set as an outgroup.
testing revealed that the expression levels of six CYPs significantly correlated with the expression patterns of genes involved in celastrol biosynthesis and the accumulation patterns of celastrol and its biosynthetic intermediates ( Figure 13C). A more subdivided phylogenetic tree showed that three putative CYPs were clustered close to CYP712K4, which was cloned from M. ilicifolia ( Figure 13D), another plant belonging to the Celastraceae family. CYP712K4 encodes an enzyme that catalyzes the oxidation of the C-29 position of friedelin to produce polpunonic acid [94].
Both in vivo and in vitro assays revealed that TwCYP712K1 and TwCYP712K2 could use 29-hydroxyfriedelan-3-one as a substrate to produce a new compound as the only product, indicating the oxidation of 29-hydroxyfriedelan-3-one at the C-29 position catalyzed by CYPs ( Figure 15D). However, the peak of product from reactions of TwCYP712K1 and TwCYP712K3 were deviated with the peak of polpunonic acid standard. To demonstrate Bars are means ± SD from three independent biological replicates, n.d=not detected in our experimental condition; the value of not detected compounds was set to 0. that these were same compound, future experiments will need to add another reaction containing 29-hydroxyfriedelan-3-one, buffer, enzyme and polpunonic acid standard (small amount, mixed into the reaction before LC-MS) in our follow on research. Comparative genome analysis showed that TwCYP712K1 and TwCYP712K2 derived from a common ancestor ( Figure 17). Although they catalyzed the same reaction and were located close to each other on the same chromosome, the identity of the amino acid sequence (70.57%) was not high (Figure 16). This suggests that TwCYP712K1 and TwCYP712K2 did not come from recent gene duplication, but separated during the evolution of the Celastraceae family. Interestingly, important catalytic activity for polpunonic acid biosynthesis in T. wilfordii was conserved in both these enzymes. As more genomes of the Celastraceae family are released, further evolutionary details of TwCYP712K1 and TwCYP712K2 can be investigated. There are many reports of genes encoding certain natural product pathways being grouped together in gene clusters to catalyze the biosynthesis of plant specialized metabolism, including triterpenoids [93, 96, 97]. However, neither the CYPs we identified, nor the signature enzymes TwOSC1 (tw21g04301.1) and TwOSC3 (tw20g03871.1), were clustered together.

POTENTIAL FOR REUSE
We reported the reference genome assembly of T. wilfordii and provided a useful strategy for screening the genes involved in plant specialized metabolism. For further exploration, the genome can be used for comparative genomic analyses; for example, to resolve the controversial phylogenetic relationships within the COM (Celastrales, Oxalidales and Malpighiales) clade [88]. Additionally, full-length transcriptome and tissue-specific RNA-seq data can be used to mine all the biosynthetic pathway genes of celastrol, as well as the biosynthetic pathways of the diterpenoid and alkaloid active ingredients.

DECLARATIONS ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable.

CONSENT FOR PUBLICATION
Not applicable.