ensemblQueryR: fast, flexible and high-throughput querying of Ensembl LD API endpoints in R

We present ensemblQueryR, an R package for querying Ensembl linkage disequilibrium (LD) endpoints. This package is flexible, fast and user-friendly, and optimised for high-throughput querying. ensemblQueryR provides functions that are intuitive and amenable to custom code integration, uses familiar R object types as inputs and outputs, and offers parallelisation functionality. For each Ensembl LD endpoint, ensemblQueryR provides two functions, permitting both single- and multi-query modes of operation. The multi-query functions are optimised for large query sizes and provide optional parallelisation to leverage available computational resources and minimise processing time. We demonstrate improved computational performance of ensemblQueryR over an existing tool in terms of random access memory (RAM) usage and speed, delivering a 10-fold speed increase whilst using a third of the RAM. Finally, ensemblQueryR is near-agnostic to operating system and computational architecture through Docker and Singularity images, making this tool widely accessible to the scientific community.

Subjects Software and Workflows, Bioinformatics, Genetics, Software Engineering

STATEMENT OF NEED
Background
Linkage disequilibrium (LD) is the non-random association of alleles arising from different loci [1]. In population genetics, LD is a measure of the frequency with which an allele of one variant is correlated with an allele of a proximal variant within a particular population [2].
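To make the definition concrete, the following minimal sketch computes the standard pairwise LD statistics (D, D′ and r²) for two biallelic loci from a haplotype frequency and the two allele frequencies. It uses the textbook formulas rather than anything specific to Ensembl or ensemblQueryR, and the function and argument names are illustrative assumptions.

```python
def ld_metrics(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Names are illustrative; the formulas are the standard textbook definitions.
    """
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales |D| by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Perfectly correlated alleles: A always travels with B.
print(ld_metrics(0.5, 0.5, 0.5))
# Independent loci: the haplotype frequency equals the product of allele frequencies.
print(ld_metrics(0.25, 0.5, 0.5))
```

In the first call the two alleles are always inherited together, so both D′ and r² reach 1; in the second the loci are independent and both statistics are 0.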
There are many applications for LD measures in genomics workflows. For example, in the context of genome-wide association studies (GWAS), which have been used to detect associations between genetic variants and a wide range of human phenotypes, downstream interrogation of local LD structure is required to identify the potential 'causal' variant at a nominated locus that exerts an effect on the downstream phenotype. Equally, in expression quantitative trait loci (eQTL) analyses, which aim to uncover associations between genetic variants and the expression of a cis or trans gene (eGene), LD information is required for the identification of the potential causal variant affecting the expression of the eGene. Further downstream, LD information is useful for functional annotations, where genetic variants or regions in LD with a target variant can aid in the identification of biological processes that might be affected by the GWAS- or eQTL-implicated target variant.
As such, it is important that the LD information for a range of human populations can be easily queried by researchers in an efficient and accessible way.
Despite the widespread usage of LD measures in genomic research, the majority of tools available at present are web-based. Although these offer user-friendly interfaces and can be useful for one-off or small queries, they do not promote reproducibility and are not suited to workflow-oriented researchers wishing to submit multiple large queries. Programmatic tools offer a solution to these problems; however, very few tools for the retrieval of LD metrics exist.
To our knowledge, only one R package provides a programmatic interface for LD metric retrieval. LDlinkR (version 1.2.3) [3] provides an R-based interface to the web-based tool LDlink [4], permitting retrieval of LD metrics using a range of query types. However, LDlinkR has a number of key limitations with respect to speed and query handling. Firstly, the user is required to obtain an access token by signing up on the NCBI website, which is then supplied as an argument to all LDlinkR functions. This requirement is in place to limit user queries, meaning that attempts to speed up the tool using parallelisation easily exceed query limits and cause the tool to return timeout errors. This can result in the user's access token being blocked. Secondly, a number of functions for retrieving LD metrics (such as the LDpair and LDproxy functions) are configured for single queries only, meaning that the user must write custom code to submit more than one query at a time. As such, although LDlinkR is a useful programmatic alternative to the LDlink web tool, it is not suited to fast, high-throughput multi-query retrieval of LD metrics.
Ensembl (RRID:SCR_002344) is another widely used source of LD metrics, offering an application programming interface (API) that supports an array of query configurations [5,6]. However, direct API usage presents some challenges, as it requires some technical expertise. Additionally, it is not easily integrable with typical R workflows, precludes the input of standard R objects (such as data frames, lists or vectors), does not output data in an intuitive format and is not easily adaptable to high-throughput workflows. To our knowledge, no R package has been developed to facilitate querying the Ensembl API and, in particular, to retrieve Ensembl LD metrics. In light of this, and to address the limitations of current tools, we present ensemblQueryR. Our R package provides fast, efficient, user-friendly querying of Ensembl LD data, with a focus on intuitive, high-throughput R workflow integration. ensemblQueryR has been made freely available (DOI: 10.5281/zenodo.7837882) [7,8]. The package can also be used in Docker (RRID:SCR_016445) [9] or Singularity [10] containers, for which the images can be found on Docker Hub [11] or the Singularity image repository [12].

Implementation
Our approach
ensemblQueryR provides a suite of functions that wrap around three Ensembl API 'endpoints'. These endpoints operate to retrieve data from Ensembl databases through the following query configurations:
1. Window: retrieval of the LD metrics for a variant and all the other variants in a window around the target variant;
2. Pair: retrieval of the LD metrics between a pair of target variants;
3. Region: retrieval of the LD metrics between all pairs of variants in a defined target region.
[Fragment of Table 1: This function takes a data frame containing genomic coordinate(s) and retrieves LD metrics (D′ and R²) for all rsIDs within the defined region(s).]
Single-query and multi-query wrapper functions are provided for each of these Ensembl API endpoints, all of which are described in detail in Table 1.
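As a rough illustration of the three query configurations, the sketch below assembles request URLs following the path layout published in the Ensembl REST documentation (`ld/:species/:id/:population_name`, `ld/:species/pairwise/:id1/:id2` and `ld/:species/region/:region/:population_name`). It is not the package's internal code: the helper names, the default species, the example population name, the example region and the second rsID are all assumptions made for illustration, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

SERVER = "https://rest.ensembl.org"  # public Ensembl REST server (assumed here)


def _build(path_parts, **params):
    """Join path components and append any optional query parameters."""
    # Keep ':' unescaped so population names and regions stay readable.
    path = "/".join(quote(str(part), safe=":") for part in path_parts)
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{SERVER}/{path}" + (f"?{query}" if query else "")


def window_url(rsid, population, window_size=None, species="homo_sapiens"):
    """Window: LD between one variant and all variants in a window around it."""
    return _build(["ld", species, rsid, population], window_size=window_size)


def pair_url(rsid1, rsid2, population=None, species="homo_sapiens"):
    """Pair: LD between two named variants."""
    return _build(["ld", species, "pairwise", rsid1, rsid2], population_name=population)


def region_url(region, population, species="homo_sapiens"):
    """Region: LD between all pairs of variants in a genomic region."""
    return _build(["ld", species, "region", region, population])


# Illustrative inputs only; the population name and region are example values.
POP = "1000GENOMES:phase_3:EUR"
print(window_url("rs4129267", POP, window_size=500))
print(pair_url("rs4129267", "rs2228145"))
print(region_url("6:25837556..25843455", POP))
```

Keeping URL construction separate from request submission makes it easy to inspect a query before spending any of the API's request allowance on it.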
To make ensemblQueryR useful in a high-throughput context, the main challenge is that the Ensembl API endpoints are configured to handle single queries. To address this, ensemblQueryR's three multi-query functions (with names ending in 'Dataframe', as described in Table 1) take data frame objects as input, where each row needs to be submitted as a separate query to the Ensembl API. The base R lapply function (version 4.0.5) [14] is then used to apply the corresponding single-query function over the input data frame, iteratively formulating an API query from each data frame row.
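The row-wise pattern described above can be sketched in a language-neutral way as follows. This is an analogue of applying a single-query function over each row, not the package's R implementation: `query_pair` is a hypothetical stand-in that makes no network call, and the optional `workers` argument (using a thread pool) is an assumption added to show how the same loop could be spread across workers.

```python
from concurrent.futures import ThreadPoolExecutor


def query_pair(row):
    """Hypothetical single-query function: one input row gives one result record.

    A real implementation would send a request to the API here; this stub
    only echoes the inputs so the example runs offline.
    """
    return {
        "rsid1": row["rsid1"],
        "rsid2": row["rsid2"],
        "query": f"{row['rsid1']}/{row['rsid2']}",
    }


def query_rows(rows, single_query, workers=1):
    """Apply a single-query function to every row, optionally across workers."""
    if workers <= 1:
        # Sequential case: the direct analogue of applying a function over rows.
        return [single_query(row) for row in rows]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so rows and results stay aligned.
        return list(pool.map(single_query, rows))


# Each dict plays the role of one data frame row; the rsIDs are illustrative.
rows = [
    {"rsid1": "rs4129267", "rsid2": "rs2228145"},
    {"rsid1": "rs2228145", "rsid2": "rs4129267"},
]
print(query_rows(rows, query_pair))
print(query_rows(rows, query_pair, workers=2))
```

Because each row is an independent query, the sequential and multi-worker paths return the same records in the same order; only the elapsed time differs.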
Building on ensemblQueryR's high-throughput capabilities, we implemented optional parallelisation for all multi-query functions. Each multi-query function (those with names ending in 'Dataframe' in Table 1) has an argument that allows the user to set a number of 'cores' to parallelise the query across. Using this functionality can significantly reduce run-time, particularly for larger queries where the parallelisation overheads represent a small proportion of the overall memory requirements. For example, with a query size of

Benchmarking
LDlinkR is an alternative R package that offers LD metric retrieval. As such, it was important to benchmark against this tool to demonstrate the utility of ensemblQueryR for high-throughput querying. Of the functions contained in the LDlinkR and ensemblQueryR packages, two are particularly comparable in their functionality. Both LDpair (from LDlinkR) and ensemblQueryLDwithSNPpair (from ensemblQueryR) take a pair of reference SNP cluster identifiers (rsIDs) as input and output a table containing the LD metrics for the query pair. As such, these functions were selected for benchmarking. To compare the performance of the two functions, the computation speed and RAM usage were assessed at three query sizes representing a range of throughputs: 100, 1,000 and 10,000 queries (Figure 1). For each function and query size combination, performance (speed and peak RAM usage) was tested ten times to account for temporal fluctuations in processing speed and peak RAM usage, thus enabling a precise performance assessment.
Firstly, comparing execution speed, we found that, on average (across the ten tests), ensemblQueryLDwithSNPpair was 10.2 times faster in the 100-query test, taking an average of 0.208 min compared to the 2.12 min for LDpair (Figure 1b). The 1,000-query test found that ensemblQueryLDwithSNPpair was, on average, 9.92 times faster than LDpair, taking an average of 1.97 min compared to the 19.5 min for LDpair. Finally, in the 10,000-query test, LDpair was unable to produce a final results table in seven out of ten tests, in these instances returning an error message ('Bad Gateway (HTTP 502)'). In contrast, ensemblQueryLDwithSNPpair produced a final results table in all tests, demonstrating its reliability for high-throughput querying. Looking at the only three successful tests of LDpair, we found that ensemblQueryLDwithSNPpair was, on average, 10.9 times faster than LDpair, taking an average of 18.5 min compared to the 202 min (>3 h) for LDpair.
These speed improvements are likely due to request rate limits from the server side, which are higher for Ensembl, thus enabling fast concurrent or parallel requests.
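The fold-speed figures can be checked with simple arithmetic on the mean run times quoted above. The short sketch below divides the LDpair mean by the ensemblQueryLDwithSNPpair mean for each query size; because the quoted means are themselves rounded, the recomputed ratios may differ slightly from the reported ones in the last digit.

```python
# Mean run times in minutes, as quoted in the text:
# (ensemblQueryLDwithSNPpair, LDpair) per query size.
MEAN_MINUTES = {
    100: (0.208, 2.12),
    1_000: (1.97, 19.5),
    10_000: (18.5, 202.0),  # LDpair mean is over its three successful runs only
}


def fold_speedup(fast_minutes, slow_minutes):
    """How many times faster the first tool is than the second."""
    return slow_minutes / fast_minutes


speedups = {
    n_queries: fold_speedup(ensembl_min, ldpair_min)
    for n_queries, (ensembl_min, ldpair_min) in MEAN_MINUTES.items()
}

for n_queries, ratio in speedups.items():
    print(f"{n_queries:>6} queries: {ratio:.2f}x faster")
```

All three ratios sit close to 10, in line with the roughly 10-fold improvement described in the text.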
Secondly, we compared the peak RAM usage (the maximum RAM utilised at any time during function execution) between ensemblQueryLDwithSNPpair and LDpair (Figure 1a).
We found that across query sizes, ensemblQueryLDwithSNPpair had approximately a third (range: 20.8-49.6%) of the peak RAM usage of LDpair. These peak RAM usage improvements are likely due to a focus on within-function reductions of intermediate object storage and a reduction of the number of operations carried out within the ensemblQueryLDwithSNPpair function.
We conclude that, by comparing analogous functions (ensemblQueryLDwithSNPpair and LDpair), ensemblQueryR provides a performance improvement over LDlinkR with respect to both speed (×10) and memory usage (1/3), underscoring the utility of our tool in the context of high-throughput workflows.

Usage
The following code provides an implementation example for ensemblQueryR. In this case, the target variant was taken from the OpenTargets homepage [15] with rsID rs4129267. To find out which variants, within a set of genomic window sizes, are in LD (R² > 0.8 and D′ > 0.8) with the target variant, the ensemblQueryLDwithSNPWindow function was implemented as in Figure 2.
[Figure 1 caption, partially recovered: ... for peak RAM usage (Figure 1a) and time measured in minutes (Figure 1b) for the three query sizes.]

Limitations
It is important to note that there are some limitations to this R package. First, although this package enables high-throughput querying of the Ensembl API, there is an inherent limit to the number of queries that can be submitted arising from the API query limit, which is set at 54,000 requests per hour [5,6]. On the Ubuntu system used to develop ensemblQueryR (Ubuntu server 16.04 LTS with kernel version 4.4.0-210-generic, total RAM 251 G), 54,000 API requests via ensemblQueryLDwithSNPpairDataframe took ∼1.93 h on a single core, making the per-hour request rate 27,867. As such, even query sizes of 54,000 can be run unparallelised and are unlikely to exceed the Ensembl API hourly rate. However, this request limit must be considered by users when applying parallelisation to large queries.
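The request budget can be reasoned about with a small calculation. The sketch below uses the figures quoted above (a limit of 54,000 requests per hour and a reported single-core rate of 27,867 requests per hour) and assumes, purely for illustration, that throughput scales linearly with the number of cores; real scaling will depend on network latency and server behaviour.

```python
import math

API_LIMIT_PER_HOUR = 54_000          # Ensembl request limit quoted in the text
REPORTED_SINGLE_CORE_RATE = 27_867   # requests per hour reported for one core


def requests_per_hour(n_requests, hours):
    """Average request rate for a run of a given size and duration."""
    return n_requests / hours


def projected_rate(single_core_rate, cores):
    """Projected rate under the simplifying assumption of linear scaling."""
    return single_core_rate * cores


def max_cores_within_limit(single_core_rate, limit=API_LIMIT_PER_HOUR):
    """Largest core count whose projected rate stays at or below the limit."""
    return max(1, math.floor(limit / single_core_rate))


# Recomputing the rate from the rounded duration gives a value close to, but
# not identical with, the reported 27,867; the gap is consistent with the
# duration having been rounded to 1.93 h.
print(round(requests_per_hour(54_000, 1.93)))

print(max_cores_within_limit(REPORTED_SINGLE_CORE_RATE))
print(projected_rate(REPORTED_SINGLE_CORE_RATE, 2) > API_LIMIT_PER_HOUR)
```

Under the linear-scaling assumption, a single core stays comfortably within the hourly limit, whereas two cores running flat out would already project to slightly more than 54,000 requests per hour.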
The second limitation is that the parallelisation library used to enable the multi-core functionality is the R package 'parallel' (version 3.6.2) [19], which works on POSIX systems (Mac, Linux and other Unix-based operating systems) but does not work on Windows.

Scope for future development
At present, this package provides wrappers for the Ensembl API endpoints that retrieve LD data. However, the Ensembl API offers ∼109 other endpoints [5], all of which have the potential to be wrapped into R functions and included in this package. As such, there is scope for the usefulness of this package beyond LD metrics and further development will expand its utility to R users across an array of bioscience disciplines.
Docker and Singularity images
In order to make this tool widely accessible to the research community, we have provided a Docker image (Figure 3). It is based on a Rocker (RRID:SCR_024215) [16] image with R version 4.0.0 and Ubuntu version 20.04 LTS (Focal Fossa) pre-installed. To start a container using the 'ensemblqueryr' Docker image and launch an

Table 1. The functions comprising the ensemblQueryR package and their relationship to the three LD Ensembl API endpoints.