A draft genome assembly of the eastern banjo frog Limnodynastes dumerilii dumerilii (Anura: Limnodynastidae)

Amphibian genomes are usually challenging to assemble due to their large genome size and high repeat content. The Limnodynastidae is a family of frogs native to Australia, Tasmania and New Guinea. As an anuran lineage that successfully diversified on the Australian continent, it represents an important lineage in the amphibian tree of life but lacks reference genomes. Here we sequenced and annotated the genome of the eastern banjo frog Limnodynastes dumerilii dumerilii to fill this gap. The total length of the genome assembly is 2.38 Gb with a scaffold N50 of 285.9 kb. We identified 1.21 Gb of non-redundant sequences as repetitive elements and annotated 24,548 protein-coding genes in the assembly. BUSCO assessment indicated that more than 94% of the expected vertebrate genes were present in the genome assembly and the gene set. We anticipate that this annotated genome assembly will advance the future study of anuran phylogeny and amphibian genome evolution.


INTRODUCTION
Recent advances in genome sequencing technology have allowed efficient decoding of the genomes of many species [1,2]. So far, genome sequences are publicly available for more than one thousand species sampled across the animal branch of the tree of life. These genomic resources have vastly improved our understanding of the origin and evolutionary history of metazoans [3,4], facilitated advances in agriculture [5], enhanced approaches for conservation of endangered species [6], and  libraries (2 kb × 3, 5 kb × 3, 10 kb × 2, and 20 kb × 2). All 14 libraries were subjected to paired-end sequencing on the HiSeq 2000 platform following the manufacturer's instructions (Illumina, San Diego, CA, USA), using PE100 or PE150 chemistry for the short-insert libraries and PE49 for the mate-paired libraries [26] (Table 1).
The raw sequencing data from each library were subjected to strict quality control by SOAPnuke (v1.5.3, RRID:SCR_015025) [27] prior to downstream analyses (see protocols.io [28] for detailed parameters for each library). Briefly, for the raw reads from each library, we trimmed the unreliable bases at the head and tail of each read where the per-position GC content was unbalanced or the per-position base quality was low across all reads; we removed read pairs with adapter contamination or with a high proportion of low-quality or unknown (N) bases; we removed duplicate read pairs potentially resulting from polymerase chain reaction (PCR) amplification (i.e. PCR duplicates); and we also removed overlapping read pairs in all but the 170 bp and 250 bp libraries, where the paired reads were expected to overlap. As shown in Table 2, data reduction in the short-insert libraries was mainly caused by truncation of the head and tail of each read and by the discarding of read pairs with too many low-quality bases. It is noteworthy that PCR duplication rates for all the short-insert libraries were extremely low (0.2%-2.6%), indicating that sequences from these libraries are diverse. In contrast, data reduction in the mate-paired libraries was mainly due to the discarding of PCR duplicates, which made up 22.6%-83.0% of the raw data (Table 2). A total of 176 Gb of clean sequences were retained for genome assembly after these strict quality controls, representing 69 times coverage of the estimated haploid genome size of L. d. dumerilii in terms of sequence depth, and 1,093 times in terms of physical depth (Table 1).
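The two depth figures quoted above follow from simple ratios, illustrated in the minimal sketch below. The clean-data total (176 Gb) and the genome size estimate (2.54 Gb, reported later in the text) come from the article; the function names are illustrative, and the physical-depth helper assumes the common definition (number of read pairs multiplied by insert size, divided by genome size), which the text does not spell out.

```python
def sequence_depth(total_bases: float, genome_size: float) -> float:
    """Average number of read bases covering each genomic position."""
    return total_bases / genome_size


def physical_depth(n_pairs: float, insert_size: float, genome_size: float) -> float:
    """Average number of inserts (read-pair spans) covering each position.

    Assumed definition: pairs x insert size / genome size, for one library.
    Per-library values would be summed to get a total physical depth.
    """
    return n_pairs * insert_size / genome_size


CLEAN_BASES = 176e9   # clean data retained after quality control (from the text)
GENOME_SIZE = 2.54e9  # k-mer-based haploid genome size estimate (from the text)

# Rounds to the 69X sequence depth reported in the article.
print(f"sequence depth: {sequence_depth(CLEAN_BASES, GENOME_SIZE):.1f}X")
```

Reproducing the 1,093X physical depth would require the per-library pair counts and insert sizes from Table 1, which are not included here.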

Genome size estimation and genome assembly
To obtain a robust estimation of the genome size of L. d. dumerilii, we conducted k-mer analysis with all of the clean sequences (131 Gb) from the four short-insert libraries using a  (Table 3), which was calculated as the number of effective k-mers (i.e. total k-mers minus erroneous k-mers) divided by the homozygous peak depth following Cai et al. [30]. It is worth noting that the presence of a distinct heterozygous peak, which displayed half of the depth of the homozygous peak in the k-mer frequency distribution, suggests that the diploid genome of this wild-caught individual has a high level of heterozygosity (Figure 3). The rate of heterozygosity was estimated to be around 1.17% by followed by GapCloser (v1.10.1, RRID:SCR_015026) [9] for gap filling with the clean reads from the four short-insert libraries.

Figure 3. A 21-mer frequency distribution of the L. d. dumerilii genome data. The first peak at coverage 21X corresponds to the heterozygous peak. The second peak at coverage 42X corresponds to the homozygous peak.

Gigabyte, 2020, DOI: 10.46471/gigabyte.2
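The genome size formula described above (effective k-mers divided by the homozygous peak depth) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the histogram values are toy numbers, and treating every k-mer below a fixed depth cutoff as erroneous is an assumed simplification, since the text does not say how erroneous k-mers were identified.

```python
from __future__ import annotations


def estimate_genome_size(histogram: dict[int, int],
                         error_cutoff: int,
                         homo_peak_depth: int) -> float:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers observed at that depth.
    k-mers with depth below error_cutoff are treated as erroneous (assumption).
    """
    total = sum(depth * count for depth, count in histogram.items())
    erroneous = sum(depth * count for depth, count in histogram.items()
                    if depth < error_cutoff)
    effective = total - erroneous
    return effective / homo_peak_depth


# Toy histogram: a low-depth error tail, a heterozygous peak at 21X
# and a homozygous peak at 42X, mirroring the peak positions in Figure 3.
toy_histogram = {1: 1000, 2: 200, 21: 50, 42: 100}
print(estimate_genome_size(toy_histogram, error_cutoff=5, homo_peak_depth=42))
```

With the toy values, the effective k-mer count is 21 × 50 + 42 × 100 = 5,250, so the estimate is 5,250 / 42 = 125 positions.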

Protein-coding gene annotation
Similar to repetitive element annotation, both homology-based and de novo predictions were employed to build gene models for the L. d. dumerilii genome assembly [37]. For homology-based prediction, protein sequences from diverse vertebrate species (see [37]

Assembly and annotation of the L. d. dumerilii genome
We assembled the nuclear genome of a female eastern banjo frog L. d. dumerilii (Figure 2) with ∼176 Gb (69X) clean HiSeq data from four short-insert libraries (170 bp × 1, 250 bp × 1, 500 bp × 1, and 800 bp × 1) and ten mate-paired libraries (2 kb × 3, 5 kb × 3, 10 kb × 2, and 20 kb × 2) (Tables 1-2). The final genome assembly comprised 520,896 sequences with contig and scaffold N50s of 10.2 kb and 286.0 kb, respectively, and a total length of 2.38 Gb, which is close to the estimated genome size of 2.54 Gb by k-mer analysis (Tables 3-4 and Figure 3). There are 242 Mb of regions present as unclosed gaps (Ns), accounting for 10.2% of the assembly. The GC content of the L. d. dumerilii assembly excluding gaps was estimated to be 41.0% (Table 4). The combination of homology-based and de novo prediction methods masked 1.21 Gb of non-redundant sequences as repetitive elements, accounting for 56.4% of the L. d. dumerilii genome assembly excluding gaps (Table 5). We also obtained 24,548 protein-coding genes in the genome assembly, of which 67% had complete ORFs (Table 6).

Note: N50 is the length of the shortest scaffold (or contig) for which longer and equal length scaffolds (or contigs) cover at least 50% of the assembly. L50 is the smallest number of scaffolds (or contigs) whose summed length makes up 50% of the assembly size. For BUSCO assessment, C represents complete BUSCOs, S represents complete and single-copy BUSCOs, D represents complete and duplicated BUSCOs, F represents fragmented BUSCOs and M represents missing BUSCOs.
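The N50 and L50 definitions given in the note can be made concrete with a short sketch. The function name and the toy scaffold lengths are illustrative only; real input would be the sequence lengths read from the assembly FASTA.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of scaffold or contig lengths.

    N50: length of the shortest sequence in the set of longest sequences
         that together cover at least 50% of the total length.
    L50: number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy lengths (total 300): the two longest sequences sum to 150, i.e. 50%.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))
```

For the toy input the two longest sequences (80 and 70) already reach half of the total length, so N50 is 70 and L50 is 2.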

Data validation and quality control
Two strategies were employed to estimate the completeness of the L. d. dumerilii genome assembly. First, all the clean reads from the short-insert libraries were aligned to the genome assembly using BWA-MEM (BWA, version 0.7.16, RRID:SCR_010910) with default parameters [44]. We observed that 99.6% of reads could be mapped back to the assembled genome and 85.6% of the input reads were mapped in proper pairs as assessed by samtools flagstat (SAMtools v1.7, RRID:SCR_002105), suggesting that most sequences of the L. d. dumerilii genome were present in the current assembly. Of note, by comparing the genomic distributions of the properly paired reads and the remaining mapped reads in the final assembly, we observed that the reads that could not be mapped in proper pairs tended to locate on the ends of scaffolds, the flanking regions of assembly gaps and the genomic regions annotated as tandem repeats (

Table 6. Summary of protein-coding genes annotated in the L. d. dumerilii genome.
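The mapping rates reported above come from samtools flagstat. The sketch below shows one way such percentages could be recomputed from flagstat's text output. The sample text is made up to mirror the reported rates and is not the authors' actual output. The helper names are hypothetical, and dividing by the "in total" count is an assumption about how "input reads" was defined.

```python
import re

# Hypothetical flagstat excerpt; counts are invented for illustration.
SAMPLE_FLAGSTAT = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
996 + 0 mapped (99.60% : N/A)
1000 + 0 paired in sequencing
856 + 0 properly paired (85.60% : N/A)
"""

# "<passed> + <failed> <label>", optionally followed by a parenthesised note.
LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*)?$")


def parse_flagstat(text):
    """Map each flagstat label to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


def mapping_rates(text):
    """Return (% mapped, % properly paired), both relative to total reads."""
    counts = parse_flagstat(text)
    total = counts["in total"]
    return (100 * counts["mapped"] / total,
            100 * counts["properly paired"] / total)


print(mapping_rates(SAMPLE_FLAGSTAT))
```

Note that flagstat itself reports the properly paired percentage relative to reads paired in sequencing, and its "in total" count includes secondary and supplementary alignments, so the denominator should be chosen to match the statistic being reported.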

Re-use potential
Here, we report a draft genome assembly of the eastern banjo frog L. d. dumerilii. It represents the first genome assembly from the family Limnodynastidae (Anura: Neobatrachia). Although the continuity of the assembly in terms of contig and scaffold N50s is modest, probably due to the high repeat content (56%) and heterozygosity (1.17%), the completeness of this draft assembly is demonstrated to be high according to read mapping and BUSCO assessment. Thus, it is suitable for phylogenomic and comparative genomic analyses with other available anuran genomes or phylogenomic datasets. In particular, the high-quality protein-coding gene set derived from the genome assembly will be useful for deducing orthologous relationships across anuran species or reconstructing the ancestral gene content of anurans. Due to the evolutionary importance of Limnodynastes frogs in Australia, the genomic resources released in this study will also support further research on the biogeography of speciation, evolution of male advertisement calls, hybrid zone dynamics, and conservation of Limnodynastes frogs.

DATA AVAILABILITY
The raw sequencing reads are deposited in NCBI under the BioProject accession PRJNA597531 and are also deposited in the CNGB Nucleotide Sequence Archive (CNSA) with accession number CNP0000818. The clean reads that passed quality control, the genome assembly, and the protein-coding gene and repeat annotations are deposited in the GigaScience GigaDB repository [48]. The genome assembly is also deposited in NCBI under accession number GCA_011038615.1.