DAPT: A package enabling distributed automated parameter testing

Modern agent-based models (ABMs) and other simulation models require evaluation and testing of many different parameters. Managing that testing for large-scale parameter sweeps (grid searches), as well as storing simulation data, requires multiple, potentially customizable steps that may vary across simulations. Furthermore, parameter testing, processing, and analysis are slowed if simulation and processing jobs cannot be shared across teammates or computational resources. Although high-performance computing (HPC) has become increasingly available, models can often be tested faster by pooling multiple computers and HPC resources. To address these issues, we created the Distributed Automated Parameter Testing (DAPT) Python package. By hosting parameters in an online (and often free) “database”, multiple individuals can run parameter sets simultaneously in a distributed fashion, enabling ad hoc crowdsourcing of computational power. Combining this with a flexible, scriptable tool set, teams can evaluate models and assess their underlying hypotheses quickly. Here, we describe DAPT and provide an example demonstrating its use.


STATEMENT OF NEED
Evaluating a new computational model requires testing many parameter sets and validating the results [1,2], collectively called model exploration (ME) [3]. For complex models with many parameters to explore, computational time can be high, and managing the testing pipeline, processing the results, and storing the data can quickly become cumbersome. To mitigate this, tools that facilitate ME on high-performance computing (HPC) resources, such as Extreme-scale Model Exploration with Swift (EMEWS) [4] and Open MOdeL Experiment (OpenMOLE) [5], have been developed. EMEWS and OpenMOLE distribute large-scale ME jobs to HPC systems. Additionally, they can adaptively explore parameter spaces to achieve predefined simulation outputs or goals.
However, HPC, and by extension these ME software packages, comes with several complications. The "headless" (non-graphical) nature of HPC means that people unfamiliar with command-line terminals may struggle to use the resources, which can slow onboarding, particularly for multidisciplinary team members or people less familiar with servers. Sharing data produced by a simulation run on an HPC system can also be challenging. After prototyping simulation models and analysis workflows on desktop workstations, it can be time consuming to adapt them to HPC resources, particularly for applications that require a graphical user interface (GUI) or software not supported on the HPC platform. Finally, not all teams have low-cost access to HPC and cloud compute resources.
As a result, many teams come to rely on a single member to run the model exploration, either on a personal computer or on HPC. Since compute and processing times may already take up a considerable portion of project time, concentrating this work on one team member or one compute system compounds the problem. One way to combat this is to split the parameter sets among the team (or a broader community), have each member run them on their own computer or HPC resource, and then upload the results to a shared storage solution. This distributed computational approach has been automated and shown to be effective in large-scale projects such as Folding@Home (F@H) [6]. F@H uses a community of people and organizations who volunteer their computational resources to simulate protein folding. However, F@H is not ideal for small groups: the code is closed-source, requiring a team to develop equivalent software anew, and it requires servers to assign jobs. Moreover, F@H is tailored to one specific scientific problem; it was not designed to facilitate independent third-party scientific workflows.
There are many ways to leverage distributed computing for model testing. For example, F@H uses a client-server architecture. With this approach, clients (computers operated by community volunteers) get simulation parameters from a job distribution server. This server also maintains a database of simulation parameters and job statuses. An alternative is a database-centric design. In this approach, each client interacts directly with the database to gather parameters and update the job status. This second method removes the need for a centralized server, making setup and maintenance much simpler, as only the database needs to be managed. Furthermore, depending on the database requirements, there are many freely available cloud platforms which can be used to store parameters. For example, Google Sheets can be used as an online "database" that stores tests in each row and parameters in each column.
To address the issues discussed above, we created DAPT (Distributed Automated Parameter Testing). In particular, we aimed to (1) make ME more broadly accessible to small teams with diverse programming backgrounds through a simple Python library, (2) allow small teams to pool their individual computational resources to perform concurrent, distributed ME using a database-centric architecture, and (3) provide easy integration of "off-the-shelf" cloud resources and storage services for simple inclusion in ME pipelines. By adhering to these design principles, once the workflow is created, new teammates, or even those who simply have idle computing resources, can contribute to a team's parameter studies through straightforward code sharing. Thus, DAPT allows ad hoc crowdsourcing of computational power to create a small-scale, F@H-like testing environment.
Computational models require extensive parameter testing and simulation to explore and validate them. To our knowledge, no other software package allows pipelines to easily connect with application programming interfaces (APIs) and enables serverless, ad hoc crowdsourcing of computing power. We created DAPT to allow easy integration of low-cost (or free) cloud services (e.g., Google Sheets and Box) into ME pipelines and to let all members of a team, rather than just one person, pool their computing resources to run simulations. Each parameter set is encoded as a job, corresponding to one row in the database. Each job must have an id attribute, which serves as a unique job identifier, and a status attribute, which stores the task the job is currently completing.
The status field is initially empty and is updated as the job proceeds. The value "successful" indicates a job is complete, "failed" means the job finished unsuccessfully, and any other value shows which task of the user-defined pipeline the job is currently completing. Other attributes can be included in the table to add additional information; a full list of supported attributes is found in the documentation [7]. To test stochastic models that require multiple runs at the same location in parameter space, parameter sets can be re-run in as many new, unique jobs as needed.
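As an illustrative sketch of this schema, consider a job table stored as CSV (the migration-bias columns are hypothetical parameter attributes used here for illustration; only id and status are required):

```python
import csv
import io

# A hypothetical job table: each row is one job. Only "id" and "status" are
# required; the migration-bias columns are illustrative parameter attributes.
table = """\
id,status,attached_worker_migration_bias,unattached_worker_migration_bias
t1,successful,0.1,0.1
t2,in progress,0.5,0.5
t3,,0.9,0.9
"""

jobs = list(csv.DictReader(io.StringIO(table)))

# An empty status marks a job that has not been started yet.
pending = [j["id"] for j in jobs if not j["status"]]
print(pending)  # ['t3']
```

Any client that can read and write this table, whether a CSV file, a Google Sheet, or another backend, can participate in the parameter sweep.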
The class that brings all the components together is the Param class, short for Parameter.
The Param class interacts with the Database instance to manage the compute jobs. The next parameters to be tested are retrieved using the next_parameters() method. This method returns the parameters from the next entry in the database with an empty status attribute.
Other constraints can be placed on this method, such as the required computational power to run a parameter set. This method also marks the job's status as "in progress" and populates any related fields that are present. The successful() and failed() methods are used to mark that a job was completed successfully or with a problem, respectively.
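This interaction pattern can be sketched as follows; this is a self-contained illustration using an in-memory table in place of DAPT's real Param and Database classes, with method names mirroring those described above:

```python
class SketchParam:
    """Minimal stand-in for a Param-like class, backed by an in-memory job table."""

    def __init__(self, jobs):
        self.jobs = jobs  # list of dicts with "id" and "status" keys

    def next_parameters(self):
        # Claim the first job whose status is empty; return None when done.
        for job in self.jobs:
            if not job["status"]:
                job["status"] = "in progress"
                return job
        return None

    def successful(self, job_id):
        self._set_status(job_id, "successful")

    def failed(self, job_id):
        self._set_status(job_id, "failed")

    def _set_status(self, job_id, status):
        for job in self.jobs:
            if job["id"] == job_id:
                job["status"] = status


jobs = [{"id": "t1", "status": "", "bias": 0.1},
        {"id": "t2", "status": "", "bias": 0.9}]
param = SketchParam(jobs)

while True:
    p = param.next_parameters()
    if p is None:
        break  # no jobs left in the "database"
    try:
        # ... run the simulation with parameter set p here ...
        param.successful(p["id"])
    except Exception:
        param.failed(p["id"])

print([j["status"] for j in jobs])  # ['successful', 'successful']
```

Because each client claims a job by writing "in progress" to the shared table before running it, multiple clients can work through the same table concurrently without duplicating work.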
DAPT also makes it easy to interact with cloud storage providers through the Storage (storage) package. These modules support uploading, downloading, deleting, and renaming files. While not required for core functionality, these modules allow data to be easily uploaded to a shared location, or downloaded for a job or further processing. The example below demonstrates DAPT using a PhysiCell model, which can also be run on nanoHUB [10]. The code for this example can be found in the paper_example.py file in the DAPT example for this paper [11].
When creating a PhysiCell model, diffusion and cell parameters are defined in the C++ code and loaded from an Extensible Markup Language (XML) file. A sample of a PhysiCell settings file is shown in Listing 1. There are several parameters, but the parameters of interest for our example are <attached_worker_migration_bias> and <unattached_worker_migration_bias>, located within the <user_parameters> tag. These parameters range from zero to one. As the bias approaches zero, the cell migration path approaches a random walk, while cell migration paths become more directed and deterministic as the bias approaches one.
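As a rough sketch, the relevant portion of such a settings file looks like the following (the 0.5 values are placeholders, and a real PhysiCell settings file contains many additional elements):

```xml
<PhysiCell_settings>
    <!-- ...domain, overall, and other standard blocks omitted... -->
    <user_parameters>
        <attached_worker_migration_bias>0.5</attached_worker_migration_bias>
        <unattached_worker_migration_bias>0.5</unattached_worker_migration_bias>
    </user_parameters>
</PhysiCell_settings>
```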
In this example, the XML tags will be represented as a path from the root of the file. For instance, the <attached_worker_migration_bias> tag is represented as /user_parameters/attached_worker_migration_bias. Using the dapt.tools.create_XML() function, a dictionary containing these paths can be used to update the XML settings file.
The dictionary's keys are these paths and its values are the parameter values. This approach is beneficial because the code that updates the settings is not hard-coded: another attribute can be added to the database without changing the testing script.
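The idea behind this path-based update can be sketched with Python's standard xml.etree.ElementTree module; this is an illustration of the concept, not the actual dapt.tools.create_XML() implementation, whose behavior may differ in detail:

```python
import xml.etree.ElementTree as ET

def update_settings(xml_text, params):
    """Update XML elements addressed by slash-separated paths from the root.

    Sketch of the path-based update concept; the real dapt.tools.create_XML()
    reads and writes settings files and may behave differently.
    """
    root = ET.fromstring(xml_text)
    for path, value in params.items():
        # Paths such as "/user_parameters/attached_worker_migration_bias"
        # are interpreted relative to the root element.
        element = root.find(path.lstrip("/"))
        if element is not None:
            element.text = str(value)
    return ET.tostring(root, encoding="unicode")

settings = (
    "<PhysiCell_settings><user_parameters>"
    "<attached_worker_migration_bias>0.5</attached_worker_migration_bias>"
    "</user_parameters></PhysiCell_settings>"
)

updated = update_settings(
    settings, {"/user_parameters/attached_worker_migration_bias": 0.9})
print("0.9" in updated)  # True
```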

Listing 1. The skeleton of a PhysiCell settings file. The "attached_worker_migration_bias" is a custom variable which changes the migration bias of workers attached to cargo.

For this example, three jobs will be run as shown in Table 2. As explained earlier, the id and status attributes are required. The start-time, end-time, and comment attributes are optional, but they provide additional information. These parameters are saved in a comma-separated values (CSV) file named parameters.csv. This file is updated as the jobs run, showing the progress that has been made.
The code for this example is shown in Listing 2. The three DAPT modules used are Config, Delimited_file, and Param. The configuration for this example is stored in config.json and has two options: last-test and num-of-runs. The first option stores the current job id, which is needed for DAPT to resume a job if the program crashes or is stopped. The second option specifies how many jobs to run; here, a value of -1 indicates that all jobs should be run. The full contents of the config file should be {"last-test": null, "num-of-runs": -1}, saved as "config.json". The folder structure of this project has the Python script (Listing 2), config.json, and parameters.csv inside the PhysiCell directory. The first two lines of code import the required modules. The os module is used for interacting with the file system, and the platform module is used to detect which operating system is being used. dapt imports all of the DAPT modules needed. Lines four through six instantiate the three DAPT modules needed. The config file is passed to the Param class, enabling the settings to be used.
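Formatted as a JSON file, the configuration reads:

```json
{
    "last-test": null,
    "num-of-runs": -1
}
```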
The next line gets the parameter set using the next_parameters() method. If there are no more parameters to run, and thus no more jobs in the database, then None is returned. The remaining lines carry out the pipeline tasks, which are summarized in Table 3.

Future directions
In the next version of DAPT, we plan to implement logging using the Python logging library.
Logging is useful for keeping track of errors and providing more detail for debugging.
Additionally, we will allow notifications to be sent to users when certain events occur. We also plan to integrate different APIs at a lower level to allow bots (e.g., a Slack bot) to generate notifications and control parameter testing. Furthermore, we plan to allow DAPT to be used as a tool via a command line interface (CLI). The Python scripting capability will not be removed, as having that level of control can be desirable. However, using DAPT directly from a CLI should increase efficiency in developing a testing pipeline.

AVAILABILITY OF SOURCE CODE AND REQUIREMENTS
DAPT (RRID:SCR_021032) is primarily hosted on GitHub [14]. It is licensed under the BSD 3-clause license. All operating systems that support Python versions 3.6 through 3.9.1 (the most recent version at the time of publishing) can run DAPT. The best way to install DAPT is from the Python Package Index (PyPI) using pip version 20.2.4 or newer: run pip install dapt in the terminal (Linux/macOS) or command prompt (Windows). DAPT can also be installed from source. The documentation for DAPT is hosted on Read the Docs [15].

DATA AVAILABILITY
Snapshots of our code and other supporting data are openly available in the GigaScience Repository, GigaDB [16].

ETHICAL APPROVAL
Not applicable.