NiemaGraphGen: A memory-efficient global-scale contact network simulation toolkit

Epidemic simulations require the ability to sample contact networks from various random graph models. Existing methods can simulate city-scale or even country-scale contact networks, but they are unable to feasibly simulate global-scale contact networks due to high memory consumption. NiemaGraphGen (NGG) is a memory-efficient graph generation tool that enables the simulation of global-scale contact networks. NGG avoids storing the entire graph in memory and is instead intended to be used in a data streaming pipeline, resulting in memory consumption that is orders of magnitude smaller than existing tools. NGG provides a massively-scalable solution for simulating social contact networks, enabling global-scale epidemic simulation studies.

simulate the contact network itself. For simulating contact networks, cuPPA [11] and cuPPA-Hash [12] provide GPU-accelerated solutions for massively-parallelized simulation of ultra-large scale-free networks under the Copy Model [13], but they do not support the simulation of contact networks under other graph models, which is a critical feature for epidemiologists hoping to fine-tune simulations to the contact patterns of a given outbreak or population of interest.

IMPLEMENTATION
NiemaGraphGen (NGG) is a memory-efficient undirected graph generation tool that enables the simulation of global-scale contact networks. NGG is intended to be used in data-streaming epidemic simulation pipelines and thus avoids storing the entire contact network in memory, resulting in faster runtime as well as memory consumption that is orders of magnitude smaller than existing tools (Figure 1). NGG is written in C++ and has no dependencies beyond a modern C++ compiler (and optionally the command line make tool for convenience). When NGG is compiled, a separate executable is produced for each model. NGG is also available via a Docker container on DockerHub (niemasd/niemagraphgen). NGG currently supports both stochastic and deterministic graph models, including those described in the "Memory-efficient graph sampling" subsection.
By default, NGG uses 4-byte unsigned integers to represent nodes in the network, which supports networks with up to $2^{32} - 1 \approx 4.3$ billion nodes, but users can use 2-byte (up to $2^{16} - 1 = 65{,}535$ nodes) or 1-byte (up to $2^{8} - 1 = 255$ nodes) unsigned integers to reduce memory consumption, or they can use 8-byte unsigned integers (up to $2^{64} - 1 \approx 18$ quintillion nodes) to support larger networks at the cost of higher memory consumption.
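As a rough illustration of the node-width trade-off described above, the following Python sketch picks the smallest of the four widths whose limit of $2^{8b} - 1$ nodes accommodates a given network size. The function name `smallest_node_width` is hypothetical and is not part of NGG; how NGG itself is configured to use a given width is not shown here.

```python
def smallest_node_width(num_nodes: int) -> int:
    """Return the smallest width b in {1, 2, 4, 8} bytes such that
    num_nodes <= 2**(8*b) - 1, the per-width node limit quoted in the text.

    Hypothetical helper for illustration only; not part of NGG.
    """
    for b in (1, 2, 4, 8):
        if num_nodes <= 2 ** (8 * b) - 1:
            return b
    raise ValueError("too many nodes for an 8-byte unsigned integer")
```

For a global-scale network of roughly 8 billion nodes, this returns 8, since 8 billion exceeds the 4-byte limit of about 4.3 billion nodes.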
By default, NGG outputs networks in the tab-delimited edge list format used by FAVITES [3]. Output files in this format can then be used as input files within FAVITES, which will then be able to simulate a transmission network, viral phylogeny, and sequences along the given contact network. However, for ultra-large simulation studies, plain-text edge list representations of networks may result in extremely large files. To address this, NGG also implements a proprietary compact binary output format that uses exactly 2b|E| + 1 bytes to represent a network with |E| edges in which nodes are represented using b-byte unsigned integers. Both supported output formats are highly structured and can thus be compressed reasonably well using standard compression tools (e.g., gzip). FAVITES does not currently support this compact binary format, so contact networks output in this binary format will not be usable as input files in the current version of FAVITES (v1.2.8), but support for this binary format will be implemented into FAVITES in the near future. Code examples for loading contact networks in NGG's output formats can be found in the NGG GitHub Wiki (https://github.com/niemasd/NiemaGraphGen/wiki).
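To make the 2b|E| + 1 byte count concrete, here is a minimal Python sketch of one layout that has exactly that size. The layout is an assumption made for illustration (one header byte storing b, followed by each edge as two b-byte little-endian unsigned integers); the actual NGG layout is documented in the NGG GitHub Wiki and may differ. The names `compact_size`, `encode_edges`, and `decode_edges` are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]

def compact_size(num_edges: int, b: int) -> int:
    """Size in bytes stated in the text: 2*b*|E| + 1."""
    return 2 * b * num_edges + 1

def encode_edges(edges: Iterable[Edge], b: int = 4) -> bytes:
    """Encode edges under the ASSUMED layout: 1 header byte holding b,
    then u and v of each edge as b-byte little-endian unsigned integers."""
    out = bytearray([b])
    for u, v in edges:
        out += u.to_bytes(b, "little")
        out += v.to_bytes(b, "little")
    return bytes(out)

def decode_edges(data: bytes) -> List[Edge]:
    """Inverse of encode_edges under the same assumed layout."""
    b = data[0]
    body = data[1:]
    return [
        (int.from_bytes(body[i:i + b], "little"),
         int.from_bytes(body[i + b:i + 2 * b], "little"))
        for i in range(0, len(body), 2 * b)
    ]
```

With 4-byte nodes, a three-edge network occupies 2 · 4 · 3 + 1 = 25 bytes under this sketch, regardless of how many digits the node labels would need in a plain-text edge list.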

Memory-efficient graph sampling
In this subsection, we discuss the memory-efficient graph sampling algorithms implemented within NGG. Most models implemented in NGG are sampled in $O(1)$ memory.

Complete graph
The complete graph, in which every node has an edge to every other node, is trivial to sample in $O(1)$ memory (Algorithm 1).
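The streaming idea can be sketched in Python with a generator that emits one edge at a time and never stores the graph. This is an illustrative sketch rather than a transcription of Algorithm 1; the function name `complete_graph` and the 0-based node labels are assumptions.

```python
from typing import Iterator, Tuple

def complete_graph(n: int) -> Iterator[Tuple[int, int]]:
    """Yield every edge (u, v) with u < v of the complete graph on n nodes.

    Only the two loop counters are kept in memory, so memory use does not
    grow with n even though n*(n-1)/2 edges are emitted.
    """
    for u in range(n - 1):
        for v in range(u + 1, n):
            yield (u, v)
```

A downstream consumer can iterate over `complete_graph(n)` and write or process each edge immediately, which is the usage pattern a data-streaming pipeline relies on.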

Path graph
The path graph, in which n nodes are connected in a linear path, is trivial to sample in $O(1)$ memory (Algorithm 2).

Barbell graph
The barbell graph, which consists of two complete graphs with $n_1$ nodes (Algorithm 1) connected by a path graph with $n_2$ nodes (Algorithm 2), can be sampled in $O(1)$ memory (Algorithm 3).
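A possible streaming construction is sketched below in Python. The node labeling is an assumption made for illustration (left clique on nodes 0 to $n_1 - 1$, path nodes next, right clique last), as is the function name `barbell_graph`; Algorithm 3 may order or label the edges differently.

```python
from typing import Iterator, Tuple

def barbell_graph(n1: int, n2: int) -> Iterator[Tuple[int, int]]:
    """Yield the edges of a barbell graph: two n1-node complete graphs
    joined by a path of n2 intermediate nodes.

    Assumed labeling: left clique 0..n1-1, path nodes n1..n1+n2-1,
    right clique n1+n2..2*n1+n2-1.
    """
    # Left complete graph.
    for u in range(n1):
        for v in range(u + 1, n1):
            yield (u, v)
    # Bridge: consecutive edges from the last left-clique node, through the
    # n2 path nodes, to the first right-clique node (n2 + 1 edges in total).
    for u in range(n1 - 1, n1 + n2):
        yield (u, u + 1)
    # Right complete graph, shifted by the number of preceding nodes.
    offset = n1 + n2
    for u in range(n1):
        for v in range(u + 1, n1):
            yield (offset + u, offset + v)
```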

Cycle graph
The cycle graph, which consists of a single n-node cycle, is trivial to sample in $O(1)$ memory: it is simply a path graph (Algorithm 2) with a single additional edge connecting the start and end nodes (Algorithm 4).
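The relationship between the two graphs can be sketched in Python as a pair of generators, where the cycle simply reuses the path and appends the closing edge. The function names and 0-based labels are assumptions for illustration, and the guard for n ≤ 2 is a choice made here to avoid emitting a duplicate edge.

```python
from typing import Iterator, Tuple

def path_graph(n: int) -> Iterator[Tuple[int, int]]:
    """Yield the n-1 edges of a path on nodes 0..n-1."""
    for u in range(n - 1):
        yield (u, u + 1)

def cycle_graph(n: int) -> Iterator[Tuple[int, int]]:
    """Yield the path edges, then one extra edge joining the end node
    back to the start node."""
    yield from path_graph(n)
    if n > 2:  # for n <= 2 the closing edge would duplicate a path edge
        yield (n - 1, 0)
```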

Ring lattice graph
The ring lattice graph, in which every node has an edge to each of its k neighbors (where k must be even), is essentially a generalization of the cycle graph. Specifically, Cycle(n) is equivalent to RingLattice(n, 2). The ring lattice graph can be sampled in $O(1)$ memory (Algorithm 5).
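One way to stream a ring lattice is to let each node emit only the edges to its k/2 clockwise neighbors, so that every undirected edge is produced exactly once. The Python sketch below assumes this convention, 0-based labels, and k < n; the function name `ring_lattice` is hypothetical, and Algorithm 5 may differ in its details.

```python
from typing import Iterator, Tuple

def ring_lattice(n: int, k: int) -> Iterator[Tuple[int, int]]:
    """Yield the n*k/2 edges of a ring lattice on n nodes in which each
    node is joined to its k/2 nearest neighbors on either side."""
    if k % 2 != 0 or not 0 <= k < n:
        raise ValueError("k must be even and satisfy 0 <= k < n")
    for u in range(n):
        # Emit only the "clockwise" half; the other half is emitted by
        # the neighbors themselves, so no edge appears twice.
        for offset in range(1, k // 2 + 1):
            yield (u, (u + offset) % n)
```

With k = 2 the output is the cycle graph on n nodes, matching the equivalence stated above.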

Erdős-Rényi model
The Erdős-Rényi model is a random graph model for generating networks, and it has two parameters: the total number of nodes in the network (n) and the probability that any of the $\binom{n}{2}$ possible edges is included (p). A naive algorithm can be used to sample graphs under the model in $O(1)$ memory (Algorithm 6).
However, the time complexity of the naive algorithm is $O(n^2)$, making it unsuitable for ultra-large networks. Instead, an alternative algorithm can also be implemented in $O(1)$ memory (Algorithm 7), which is faster than the naive algorithm when the expected number of edges ($p\binom{n}{2}$) is relatively low (i.e., the network is relatively sparse) [17], as is the case with social contact networks.
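A well-known way to avoid testing all $\binom{n}{2}$ pairs is geometric skipping: instead of flipping a coin per pair, draw the number of pairs to skip before the next included edge. The Python sketch below implements that technique; whether it matches Algorithm 7 step for step is an assumption, and the function name `erdos_renyi_skip` is hypothetical.

```python
import math
import random
from typing import Iterator, Optional, Tuple

def erdos_renyi_skip(n: int, p: float,
                     rng: Optional[random.Random] = None) -> Iterator[Tuple[int, int]]:
    """Yield edges (w, v) with w < v, each included with probability p,
    in time proportional to n plus the number of edges emitted."""
    rng = rng or random.Random()
    if p <= 0:
        return
    if p >= 1:  # every pair is included; no skipping needed
        for v in range(1, n):
            for w in range(v):
                yield (w, v)
        return
    log_q = math.log(1.0 - p)
    v, w = 1, -1
    while v < n:
        # Geometric number of excluded pairs before the next included one.
        skip = int(math.log(1.0 - rng.random()) / log_q)
        w += 1 + skip
        # Carry the column index over into later rows of the pair ordering.
        while w >= v and v < n:
            w -= v
            v += 1
        if v < n:
            yield (w, v)
```

Only the counters `v` and `w` are kept between edges, so memory use is constant, while the running time scales with the number of edges actually produced rather than with $n^2$.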

Barabási-Albert model
The Barabási-Albert model is a random graph model for generating scale-free networks, and it has two parameters: the total number of nodes in the network (n) and the number of edges to attach from new nodes to existing nodes (m). An algorithm exists to sample graphs

Newman-Watts-Strogatz model
The Newman-Watts-Strogatz model, an extension of the Watts-Strogatz model [18], is a random graph model for generating connected networks with small-world properties.
Unlike the Watts-Strogatz model, which may yield disconnected graphs, the Newman-Watts-Strogatz model is guaranteed to yield connected graphs. The Newman-Watts-Strogatz model begins by sampling RingLattice(n, k), and for each edge (u, v) in the initial ring lattice, a new "shortcut" edge (u, w) is added with probability p. This motivates a naive sampling algorithm (Algorithm 9).
However, the naive algorithm requires all edges of the graph to be stored in memory, which results in prohibitively large memory requirements for ultra-large networks. An alternative memory-efficient algorithm can be devised. There are n nodes, and in the original ring lattice, each node has k edges. Therefore, the initial ring lattice graph has nk∕2 undirected edges, meaning we sample from Bernoulli(p) exactly nk∕2 times. The total number of successful Bernoulli trials is thus a single sample from Binomial(nk∕2, p). Further, each node has n − k − 1 possible new edges that can be added during the "shortcut"-adding step; these edges can be represented by a matrix with n rows (representing u) and n − k − 1 columns (representing w). If (u, v) is selected, then (v, u) cannot be selected because the graph is undirected. Thus, we can disregard the bottom-right portion of the matrix. We can then represent each cell of the matrix with its corresponding index in an array representation. For example, for n = 7 and k = 2, marking each disregarded cell with X leaves n(n − k − 1)∕2 = 14 cells, indexed 0 through 13. With this representation, sampling "shortcut" edges can be reduced to an efficient algorithm: randomly select a collection of Binomial(nk∕2, p) integers from Uniform(0, $n(n-k-1)/2 - 1$) without replacement, then map from the selected integers to their corresponding cells in the matrix, and finally map from cells in the matrix to edges (u, w).

Gigabyte, 2022, DOI: 10.46471/gigabyte.37

Define a "full" row to be a row without any X symbols (i.e., no disregarded cells), and define an "empty" row to be a row that only contains X symbols (i.e., all cells were disregarded). The last column in the first row contains node n − k∕2 − 1, and the last column in the last full row contains node n − 1, so there are (n − 1) − (n − k∕2 − 1) + 1 = k∕2 + 1 full rows: 0 through k∕2. Thus, for rows 0 through k∕2 (i.e., the full rows of the matrix), we can imagine a representation in which cells are filled with the corresponding index of the array representation of the matrix. Row k∕2 + 1 has exactly 1 empty cell, row k∕2 + 2 has exactly 2 empty cells, etc.
Thus, the first row that is completely empty (i.e., n − k − 1 empty cells) is row k∕2 + (n − k − 1) = n − k∕2 − 1. Thus, the remaining portion of the matrix from which "shortcuts" can be sampled can be represented with X denoting "disregarded" cells and Y denoting "not disregarded" cells. This is simply an (n − k − 2)-dimensional square matrix with a triangle in the upper-left.

N. Moshiri

We can now use these findings to define an efficient algorithm that only has to keep the "shortcut" edges in memory, rather than all edges (Algorithm 10).
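The index-to-edge mapping described in this subsection can be sketched in Python as follows. The sketch assumes a row-major array layout in which row u lists the candidate partners u + k/2 + 1, u + k/2 + 2, and so on, which is consistent with the row counts derived above but is not confirmed to be the exact layout of Algorithm 10. The names `index_to_shortcut` and `newman_watts_strogatz` are hypothetical, and the Binomial draw is simulated with repeated Bernoulli trials purely to keep the example dependency-free.

```python
import math
import random
from typing import Iterator, Optional, Tuple

def index_to_shortcut(idx: int, n: int, k: int) -> Tuple[int, int]:
    """Map an array index to a non-lattice node pair (u, w) with u < w."""
    half = k // 2
    cols = n - k - 1                 # candidate shortcut partners per node
    full_cells = (half + 1) * cols   # cells in the full rows 0..k/2
    if idx < full_cells:
        u, j = divmod(idx, cols)
    else:
        # Triangular part: rows of length cols-1, cols-2, ..., 1.
        tri_cells = cols * (cols - 1) // 2
        from_end = tri_cells - 1 - (idx - full_cells)
        t = (math.isqrt(8 * from_end + 1) - 1) // 2  # row counted from the bottom
        u = half + 1 + (cols - 2 - t)
        j = t - (from_end - t * (t + 1) // 2)
    return (u, u + half + 1 + j)

def newman_watts_strogatz(n: int, k: int, p: float,
                          rng: Optional[random.Random] = None) -> Iterator[Tuple[int, int]]:
    """Stream the ring lattice edges, then the sampled shortcut edges.
    Only the sampled shortcut indices are held in memory."""
    rng = rng or random.Random()
    half = k // 2
    for u in range(n):
        for offset in range(1, half + 1):
            yield (u, (u + offset) % n)
    trials = n * k // 2               # one Bernoulli(p) trial per lattice edge
    total = n * (n - k - 1) // 2      # number of non-disregarded cells
    # Binomial(trials, p) via summed Bernoulli trials (O(1) memory, simple but slow).
    count = min(sum(rng.random() < p for _ in range(trials)), total)
    for idx in rng.sample(range(total), count):  # distinct indices, no replacement
        yield index_to_shortcut(idx, n, k)
```

For n = 7 and k = 2, indices 0 through 13 map in order to (0, 2), (0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6) under this assumed layout. A production implementation would replace the summed Bernoulli trials with a direct binomial sampler.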

Benchmarking experiment
To benchmark network generation runtime and memory consumption, we used NetworkX, iGraph, and NGG to simulate 10 replicate networks of various sizes, and we used the GNU time command line tool to measure total runtime and peak memory usage. We chose to explore Complete, Erdős-Rényi, Barabási-Albert, and Newman-Watts-Strogatz graphs. The results of the benchmarking experiment can be found in Figure 1. iGraph was excluded from the Newman-Watts-Strogatz simulations because iGraph does not support sampling from the Newman-Watts-Strogatz model. Furthermore, NetworkX was unable to run to completion on larger network sizes due to memory requirements that exceeded the 8 GB memory of the benchmarking machine.
In all scenarios, NGG was the fastest and least memory-intensive of the three tools. With respect to Complete graphs, NGG is marginally faster than NetworkX and iGraph, and the peak memory usage of NGG is orders of magnitude smaller than that of both NetworkX and iGraph, with the gap widening as network size grows. With respect to Erdős-Rényi graphs, NGG is ∼4× faster than NetworkX and ∼1.5× faster than iGraph, and its peak memory usage is orders of magnitude smaller than that of both tools, with the gap again widening as network size grows. With respect to Barabási-Albert graphs, NGG is ∼4× faster than NetworkX and ∼1.5× faster than iGraph, and its peak memory usage is consistently ∼20× smaller than that of NetworkX and ∼3× smaller than that of iGraph. With respect to Newman-Watts-Strogatz graphs, NGG is ∼3× faster than NetworkX, and its peak memory usage is ∼100× smaller than that of NetworkX, with the gap widening as network size grows.
Importantly, aside from the Barabási-Albert and Newman-Watts-Strogatz models, all network models implemented in NGG have constant memory usage regardless of network size.

CONCLUSIONS
We introduce NiemaGraphGen (NGG), a memory-efficient graph generation tool that enables the simulation of global-scale contact networks. We benchmarked NGG against the two most popular network simulation tools, NetworkX and iGraph, and we showed that NGG was consistently the fastest and had orders of magnitude lower memory consumption than the other tools (typically constant with respect to network size).

DATA AVAILABILITY
The data sets supporting the results of this article, along with all relevant scripts and commands, are available in the following GitHub repository: https://github.com/niemasd/NiemaGraphGen-Paper.
The same data and scripts can be found in the following portable Code Ocean environment (Figure 2) [19]. Snapshots of the code are also available in the GigaScience GigaDB repository [20].

ETHICAL APPROVAL
Not applicable.