svaRetro and svaNUMT: modular packages for annotating retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data

Nuclear integration of mitochondrial genomes and retrocopied transcript insertion are biologically important but often-overlooked aspects of structural variant (SV) annotation. While tools for their detection exist, these typically rely on reanalysis of primary data using specialised detectors rather than leveraging calls from general purpose structural variant callers. Such reanalysis potentially leads to additional computational expense and does not take advantage of advances in general purpose structural variant calling. Here, we present svaRetro and svaNUMT; R packages that provide functions for annotating novel genomic events, such as nonreference retrocopied transcripts and nuclear integration of mitochondrial DNA. The packages were developed to work within the Bioconductor framework. We evaluate the performance of these packages to detect events using simulations and public benchmarking datasets, and annotate processed transcripts in a public structural variant database. svaRetro and svaNUMT provide modular, SV-caller agnostic tools for downstream annotation of structural variant calls.

potential in-frame gene fusions [8], classify complex multi-SV events like chromothripsis [9] or chromoplexy [10] and extra-chromosomal DNA detection [11]. However, few tools exist for downstream analysis or annotation of SVs from general purpose SV callers. Much SV annotation software is tightly coupled to specific SV callers (e.g., LINX depends on features of GRIDSS v2) or has its own integrated detector that operates directly on BAM files rather than on an SV call set (e.g., AmpliconArchitect [11] and RetroSeq (RRID:SCR_005133) [12]).
Failure to reuse SV calls from general purpose callers, where these already exist, and reanalysis of alignment files by possibly multiple, specialised callers results in inefficiencies in analysis pipelines. This, in turn, leads to extended run time and higher computing costs.
It also misses the opportunity to benefit from improvements in general purpose callers over time. This can be avoided through tools that make appropriate use of SV callsets from general purpose tools. Two biological phenomena currently underserved by annotation tools for general purpose SV calls are Nuclear Mitochondrial insertion (NUMTs) [13] and retroposed transcript (RT) insertion.
The insertion of mitochondrial DNA (mtDNA) fragments into the nuclear DNA during the double-strand break repair has been observed in multiple species, including human and yeast [14][15][16][17][18]. NUMTs are present in the normal genome, having integrated during evolution. Active NUMTs have been observed in cancer; it was proposed that NUMTs are formed during mitosis when the nuclear membrane breaks down, allowing mtDNA to escape from degrading mitochondria, which is accelerated in cancer, and migrate into the nuclear genome [19]. Somatic NUMT events in human cancer cells have not been extensively studied, and further investigation is needed to understand their extent and role in cancer development [19,20]. Despite their potential biological significance, these events are often overlooked [14]. Dinumt is a NUMT detection tool that identifies NUMTs from whole genome sequencing (WGS) data [13,21]. Dinumt uses paired reads in which one read maps to mtDNA or known reference NUMTs, and the other read maps to nuclear DNA.
Reads that map to nuclear chromosomes are clustered by locations and the insertion sites of the mtDNA sequences are estimated. Read pairs with one read not uniquely mapped are discarded, thus limiting its ability to detect NUMTs in repetitive regions.
RTs are associated with LINE element reactivation in cancer [22], but also occur in the germline leading to processed pseudogenes [23]. RTs can interfere with the expression of their parent genes [24], generate antisense transcripts [25], and compete for microRNA binding with their parent genes [26]. Additionally, mutations introduced by the process may drive cancer evolution, particularly when the RT is inserted into another gene. GRIPper is a tool that identifies possible insertion sites of RTs by looking for discordant paired reads where one read maps to an exon region [27]. The insertion loci are predicted using soft-clipped reads. GRIPper does not look for exon-exon junctions, which are present in most RTs.
Here, we present two R packages for the analysis of SVs calls. svaNUMT and svaRetro provide flexible frameworks to analyse and explore NUMTs and RTs. By using SV callsets generated by general purpose callers, our packages are computationally efficient and avoid reanalysing primary sequencing data. Integrated into the Bioconductor framework, svaNUMT and svaRetro are compatible with many existing tools for further downstream data analysis and applications. Structural variant (SV) calls are first loaded as variant call format (VCF) objects with VariantAnnotation [28], then converted into breakend-centric GRanges with StructuralVariantAnnotation [29]. svaRetro takes as input the Granges data and a TxDb annotation object, which stores the transcript metadata. The output of svaRetro is a list of GRanges grouped by the source gene of the retrotransposed transcripts. svaNUMT requires only the GRanges object as input. The results are grouped by events and the locations of the breakends. The output can be easily converted to BEDPE format, which is commonly used for downstream analyses.

IMPLEMENTATION
Input and output data format svaRetro and svaNUMT are designed to work with SV callsets generated by generic callers.
In typical use, SV calls in a VCF file are loaded into a breakend-centric GRanges object using VariantAnnotation (RRID:SCR_000074) [28] and StructuralVariantAnnotation (RRID:SCR_018683) [29]. The packages then search for evidence supporting events of interest. For RT detection, svaRetro also requires a TxDb object that stores transcript metadata. The TxDb object can be loaded via pre-existing annotation packages or generated from existing data [30]. The output of the packages are lists of GRanges objects that can be converted to various data formats, including BEDPE, supporting further analysis ( Figure 1).
Both svaRetro and svaNUMT take as input a GRanges object with a breakend-centric notation, where a GRanges record is used to represent each breakend, and a breakpoint consists of a pair of breakends. Although breakpoint-centric data structures are available for SV representation (e.g., Pairs object in rtracklayer (RRID:SCR_021325) [31]), we have chosen the breakend-centric notation because it simplifies frequent operations in the analysis, such as overlap finding with genes and repeats. Users can choose threshold parameters before calling svaRetro and svaNUMT to increase sensitivity or precision.
Details of each parameter are also described in the documentation of the R packages.
The output formats of svaRetro and svaNUMT are GRanges to support flexible downstream analyses. In cases where alternative formats are required, StructuralVariantAnnotation [29] provides functions for format conversion between BEDPE (RRID:SCR_006646) [32] and Pairs with GRanges. Metadata associated with each input SV call is preserved in the output of svaRetro and svaNUMT, enabling further filtering of the results as required.  insertion sites and exon-exon junctions, if available. Each candidate insertion site is annotated by the potential source transcript(s) and whether exon-exon junctions are detected for the source transcript(s). Exon-exon junction calls are annotated by the exon indices, corresponding transcripts satisfying the minscore threshold, and National Center for Biotechnology Information (NCBI) gene symbols.

Identifying retrotransposed transcripts
RT insertion sites can be discovered on both 5′ and 3′ sides, only one side, or none. An insertion site could be missed, even when the breakend is reported in the SV callset, owing to a sizable 5′ truncation despite the tolerant threshold, a 5′ inversion, or a combination of rearrangements.

Identifying nuclear-mitochondrial genome fusion events
svaNUMT searches for NUMT events by identifying SVs (in breakend notation) supporting the fusion of nuclear chromosome and mitochondrial genome. In the event of mtDNA integration in nuclear genomes, it is expected that split reads and discordant reads are detected near the integration sites. These features, when picked up by an SV caller, are represented as translocation events between mtDNA and nuclear DNA in the SV calls, given that the mitochondrial reference genome is included in the library (see Figure 3). In practice, the inserted mtDNA read sequences may also map to nuclear chromosomes, with many of these sites annotated in known NUMTs [35]. The maximum distance allowed between the insertion sequence and known NUMTs are set by the maxgap_numtS parameter. Alternatively, if the insertion sequence cannot be mapped to a locus with confidence, the NUMT event can be reported in the SV calls as an insertion with a DNA sequence. The min_len parameter sets the minimum length allowed of the insertion sequences. The insertion sequence may not match perfectly with the mitochondria reference genome (chrM), and sequences with an alignment score lower than the min_Align parameter are discarded. svaNUMT can identify NUMTs in all of the above scenarios.
A NUMT event consists of two insertion sites, which can be linked by phasing nearby events. svaNUMT annotates linked insertion sites where possible. The maximum distance

Benchmarking and application
We evaluated svaRetro and svaNUMT on simulated and human cell line data and compared the results with existing tools for RT and NUMT detection. While there were tools developed for transposable elements from WGS data, few tools were available to detect RTs. To the best of our knowledge, GRIPper is the only published RT detection tool [27]. To benchmark NUMT detection, svaNUMT was evaluated against dinumt [13]. For svaRetro and svaNUMT, we used GRIDSS to call SVs on these samples. For the cell lines, results from each tool were compared and discrepancies inspected manually.

Testing on simulated data
We next tested svaRetro and svaNUMT using 500 non-overlapping simulated events on chromosome 1 (chr1). To generate the simulated events, chromosome 1 was first divided into 570 uniform intervals. Of these, 507 overlapped (at least 80% overlap) the set of high-confidence Tier 1 regions defined by Zook et al. [36]. Intervals not in high confidence regions were excluded. A final set of 500 intervals was then randomly selected. A transcript sequence, randomly selected from the UCSC annotation database [32], accompanied by a polyadenylation sequence, was inserted into a random location of each interval. Simulated NUMTs were generated through insertions of 500 mtDNA sequences with polyadenylation on the chr1 sequence, where insertion sites were selected using the same method as

Testing on cell line data
We applied svaRetro, GRIPper, svaNUMT and dinumt on a human germline HG002, a GIAB cell line using 60× coverage WGS [41], and a tumour cell line, COLO829, derived from a cutaneous melanoma with the matched lymphoblastoid (normal) cell line [34].  transcript sequence mapped to a processed pseudogene (on chr1) as well as the source gene (on chr16). GRIDSS reported this event as a translocation of the in-reference pseudogene from chr1; therefore, this event was not reported by svaRetro (see file in Zenodo [39]). The rest of the RT events were successfully identified by svaRetro. In addition, six germline low confidence NUMTs were discovered by svaNUMT, but no sufficient evidence was observed in the sequencing reads.

Runtime
Runtimes of svaNUMT and svaRetro were compared to those of dinumt and GRIPper respectively on HG002 and COLO829 (Table 1). svaNUMT and svaRetro were run using R (version 4.1.3) on an Intel i5 processer and 10 gigabytes (GB) memory. GRIPper was run using four threads.

Application to gnomAD-SV database
We established a catalog of nonreference RTs using svaRetro on the gnomAD-SV dataset [42], where RTs were largely unannotated. NUMT annotation was not applicable because mitochondrial SVs were excluded from the database. In total, 46,926 candidate insertion sites were detected by svaRetro, including single-exon transcript insertions and/or with insertions with only one side of the insertion detected. The distribution of all source genes and candidate insertion sites are shown in Figure 5. Using the 'PASS' filter, 598 high-confidence insertion sites were supported by exon-exon junctions and high-quality SV breakpoint calls. We classified these high-confidence insertion sites by the repeat class using RepeatMasker [43]. Figure 6 shows that RT insertions can be detected in both nonrepetitive and repetitive sequences.

CONCLUSION
The R/Bioconductor packages svaRetro and svaNUMT were developed to identify and annotate retrotransposed transcripts and NUMT insertions. Existing annotation tools are often tightly coupled to specific SV callers, or they integrate their own detectors that reanalyse the primary data (alignments). Although some downstream analyses may be unable to avoid such specialised callers, this general approach leads to inefficient use of the computational resources. To address these challenges, svaRetro and svaNUMT provide a modular infrastructure to annotating calls from general purpose SV callers.
A potential shortcoming of this approach is that svaRetro and svaNUMT may be coupled  RepeatMasker [43] repeat class (repClass) annotations of the high confidence RT insertion loci detected from the gnomAD-SV database using svaRetro. 246 (41.1%) of the RT insertion loci were in non-repeat regions. Out of the remainder, the majority were within short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), including 213 (35.6%) within SINEs and 50 (8.3%) within LINEs. In the online version of this paper this is presented in an interactive form created by the code and frictionless data package presented alongside this work [40]. figure-6.html Integrated into the R/Bioconductor framework [44,45], the packages are compatible with many other available tools for more comprehensive downstream analyses.

DATA AVAILABILITY
Data and scripts supporting the results of this article are available via the Zenodo repository [39]. The README file includes a more detailed description of each of the files/figure. The Frictionless Data Package and snapshots of the code are also available in the GigaScience Press GigaDB repository [40].

EDITOR'S NOTE
A CODECHECK certificate for this paper is available confirming that the relevant figures in the paper could be independently reproduced [46]. This was carried out in parallel with generating frictionless data wrappers and using these to create dynamic versions of the figures.

Consent for publication
Not applicable.