aws-s3-integrity-check: an open-source bash tool to verify the integrity of a dataset stored on Amazon S3

Amazon Simple Storage Service (Amazon S3) is a widely used platform for storing large biomedical datasets. Unintended data alterations can occur during data writing and transmission, altering the original content and generating unexpected results. However, no open-source and easy-to-use tool exists to verify end-to-end data integrity. Here, we present aws-s3-integrity-check, a user-friendly, lightweight, and reliable bash tool to verify the integrity of a dataset stored in an Amazon S3 bucket. Using this tool, we only needed ∼114 min to verify the integrity of 1,045 records ranging between 5 bytes and 10 gigabytes and occupying ∼935 gigabytes of the Amazon S3 cloud. Our aws-s3-integrity-check tool also provides file-by-file on-screen and log-file-based information about the status of each integrity check. To our knowledge, this tool is the only open-source one that allows verifying the integrity of a dataset uploaded to the Amazon S3 Storage quickly, reliably, and efficiently. The tool is freely available for download and use at https://github.com/SoniaRuiz/aws-s3-integrity-check and https://hub.docker.com/r/soniaruiz/aws-s3-integrity-check.

data transmission, producing unintended changes to the data and corrupting the transferred files. To identify faulty data transfers in real-time, Amazon Simple Storage Service (Amazon S3) permits using checksum values through the AWS Command Line Interface (AWS CLI) tool. This approach consists of locally calculating the Content-MD5 or the entity tag (ETag) number associated with the contents of a given file; this checksum value is then inserted within the AWS CLI command used to upload the file to an Amazon S3 bucket. If the checksum number assigned by Amazon S3 is identical to the local checksum calculated by the user, then both local and remote file versions are the same: the file's integrity is proven.
However, this method has disadvantages. First, the choice of the checksum number to calculate (i.e., either the Content-MD5 [11] or the ETag number) depends on the characteristics of the file, such as its size or the server-side encryption selected. This requires the user to evaluate the characteristics of each file independently before deciding which checksum value to calculate. Second, the checksum number needs to be included within the AWS CLI command used to upload the file to an Amazon S3 bucket; thus, the user needs to upload each file individually. Finally, this process needs to be repeated for each file transferred to Amazon S3, which, as the number of files forming a dataset increases, can exponentially increase the time required to complete a data transfer.
To overcome these challenges, we developed aws-s3-integrity-check, a bash tool to verify the integrity of a dataset uploaded to Amazon S3. The aws-s3-integrity-check tool offers a user-friendly and easy-to-use front-end that requires one single command with a maximum of three parameters to perform the complete integrity verification of all files contained within a given Amazon S3 bucket, regardless of their size and extension. In addition, the aws-s3-integrity-check tool provides three unique features: (i) it is used after the data has been uploaded, providing the user with the freedom to transfer the data in batches to Amazon S3 without having to manually calculate individual checksum values for each file; (ii) to complete the integrity verification of all files contained within a given dataset, it only requires the submission of one query to the Amazon S3 application programming interface (API), thus not congesting the network; and (iii) it informs the user of the result from each checksum comparison, providing detailed per-file information. Concerning the latter, aws-s3-integrity-check can produce four types of output: (i) if the user does not have read access to the indicated Amazon S3 bucket, the tool produces an error message and stops its execution; (ii) if a given file from the provided folder does not exist within the indicated Amazon S3 bucket, the tool produces a warning message and continues its execution; (iii) if a local file exists within the remote bucket, but its local and remote checksum values do not match, the tool produces a warning message and continues its execution; (iv) if the local file exists within the remote bucket and its local and remote checksum values match, the integrity of the file is marked as proven. All outputs are shown on-screen and stored locally in a log file.
The aws-s3-integrity-check tool is freely available for download and use [12,13], also within a Docker format [14].

Our approach
Our purpose was to enable the automatic integrity verification of a set of files transferred to Amazon S3, regardless of their size and extension. Therefore, we created the aws-s3-integrity-check tool, which: (i) reads the metadata of the totality of files stored within  a given Amazon S3 bucket by querying the Amazon S3 API only once; (ii) calculates the  checksum value associated with every file contained within a local folder by using the same  algorithm applied by Amazon S3; and (iii) compares local and remote checksum values,  informing the user if both numbers are identical and, consequently, if the remote version of  the S3 object coincides with its local version. To identify different file versions, Amazon S3 uses ETag numbers, which remain unalterable unless the file object suffers any change to its contents. Amazon S3 uses different algorithms to calculate an ETag number, depending on the characteristics of the transferred file. More specifically, an ETag number is an MD5 digest of the object data when the file is: (i) uploaded through the AWS Management Console or using the PUT Object, POST Object, or Copy operation; (ii) is plain text; or (iii) is encrypted with Amazon S3-managed keys (SSE-S3). However, if the object has been server-side encrypted with customer-provided keys (SSE-C) or with AWS Key Management Service (AWS KMS) keys (SSE-KMS), the ETag number assigned will not be an MD5 digest. Finally, if the object has been created as part of a Multipart Upload or Party copy operation, the ETag number assigned will not be an MD5 digest, regardless of the encryption method [15]. When an object is larger than a specific file size, it will be automatically uploaded using multipart uploads. The ETag number assigned to it will combine the different MD5 digest numbers assigned to smaller sections of its data.
In order to match the default values published within the guidelines corresponding to the AWS CLI S3 transfer commands [16], 8 MB is the default multipart chunk size and the maximum file size threshold for aws-s3-integrity-check to calculate the ETag number. To automatise the calculation of the ETag value in cases where the file size exceeds the default value of 8 MB, the aws-s3-integrity-check tool uses the s3md5 bash script (version 1.2) [17]. The s3md5 bash script consists of several steps. Using the same algorithm used by Amazon S3, the s3md5 script splits the files larger than 8 MB into smaller parts of that same size and calculates the MD5 digest corresponding to each chunk. Secondly, the s3md5 script concatenates all the bytes from the individual MD5 digest numbers produced, creating a single value and converting it into binary format before calculating its final MD5 digest number. Thirdly, it appends a dash with the total number of parts calculated to the MD5 hash. The resulting number represents the final ETag value assigned to the file. Figure 1 shows a complete overview of the approach followed (please, refer to the Methods section for more details).

TESTING Datasets
To test the aws-s3-integrity-check tool, we used 1,045 files stored across four independent Amazon S3 buckets within a private AWS account in the London region (eu-west-2). The rationale behind the inclusion of these four datasets during the testing phase of the aws-s3-integrity-check tool was the variability of their project nature, the different file types and sizes they contain, and their availability within two different public repositories (i.e., the European Genome-phenome Archive (EGA) [18] and the data repository GigaDB [19]).
These four datasets occupied ∼935 GB of cloud storage space and contained files ranging between 5 Bytes and 10 GB that were individually uploaded to AWS using the AWS CLI sync command (version 1) [20]. No specific server-side encryption was indicated while using the sync command. In addition, all the configuration values available for the aws s3 sync (D) The aws-s3-integrity-check tool compares the local and remote ETag numbers assigned to each local file. The output of each phase of the tool is shown on-screen and logged within a local file. Table 1. Details of the four datasets used during the testing phase of the aws-s3-integrity-check tool. All datasets were independently tested. The log files produced by each independent test are available on GitHub [21]. All processing times were measured using the in-built time Linux tool (version 1.7) [22]. Processing times refer to the time (in minutes and seconds) required for the aws-s3-integrity-check tool to process and evaluate the integrity of the totality of the files within each dataset.

File types
Using the aws-s3-integrity-check tool, we successfully verified the data integrity of multiple file types detailed in Table 2.

Testing procedure
We performed two-sided tests. We used the aws- These three datasets (i.e., one EGA dataset and two GigaDB projects) were then uploaded to three different Amazon S3 buckets, which were respectively named "rnaseq-d" (EGAS00001006380), "mass-spectrometry-imaging" [23], and "tf-prioritizer" [25] (Table 1). In all three cases, the data was uploaded to Amazon S3 by using the following aws s3 command: $ aws s3 sync --profile aws_profile path_local_folder/ s3://bucket_name/ To verify that the data contents of the remote S3 objects were identical to the contents of their local version, we then ran the aws-s3-integrity-check tool by using the following command structure: Next, we used the aws-s3-integrity-check tool to test the integrity of a local dataset downloaded from an S3 bucket. In this case, we used data from the EGA project with accession number EGAS00001003065. Once we obtained access to the EGAS00001003065 repository, we downloaded all its files to a local folder. We then uploaded this local dataset to an S3 bucket named "ukbec-unaligned-fastq" (Table 1). When the data transfer to Amazon S3 finished, we downloaded these remote files to a local folder by using the S3 command sync as follows: $ aws s3 sync --profile aws_profile s3://ukbec-unaligned-fastq/ path_local_folder/.
To test that the local version of the downloaded files had identical data contents as their remote S3 version, we ran the aws-s3-integrity-check tool employing the following command synopsis:

$ bash aws_check_integrity.sh [-l|--local <path_local_folder>] [-b|--bucket <ukbec-unaligned-fastq>] [-p|--profile <aws_profile>].
Finally, we tested whether the aws-s3-integrity-check tool could detect any differences between a given local file that had been manually modified and its S3 remote version and inform the user accordingly. Therefore, we edited the file "readme_102374.txt" from the dataset [23] and changed its data contents by running the following command: $ (echo THIS FILE HAS BEEN LOCALLY MODIFIED; cat readme_102374.txt) > readme_102374.tmp && mv readme_102374.t{mp,xt} We then run the aws-s3-integrity-check tool employing the following command synopsis:

$ bash aws_check_integrity.sh [-l|--local <path_local_folder>] [-b|--bucket <mass-spectrometry-imaging>] [-p|--profile <aws_profile>].
As expected, the aws-s3-integrity-check tool was able to detect the differences in data contents between the local and the S3 remote version of the "readme_102374.txt" file by producing a different checksum number from the one originally provided by Amazon S3. The error message produced was "ERROR: local and remote ETag numbers for the file 'readme_102374.txt' do not match.". The output of this comparison can be checked on the log file "mass-spectrometry-imaging.S3_integrity_log.2023.07. 31-22.59.01.txt" ( Table 1).
The aws-s3-integrity-check tool also demonstrated minimal use of computer resources by displaying an average CPU usage of only 2% across all tests performed.

Testing configuration
The four datasets tested were stored across four Amazon S3 buckets in the AWS London region (eu-west-2) ( Table 1). All four S3 buckets had the file versioning enabled and a server-side SSE-S3 encryption key type selected.
The aws-s3-integrity-check tool is expected to work for files that have been uploaded to Amazon S3 by following these two uploading criteria: 1. Uploaded by command line using any of the aws s3 transfer commands, which include the cp, sync, mv, and rm commands.  [14].

Support
The source code corresponding to the aws-s3-integrity-check tool is hosted on GitHub [13].
Also, from this repository, it is possible to create new issues and submit tested pull review requests. Issues have been configured to choose between the "Bug report" and "Feature request" categories, ultimately facilitating the creation and submission of new triaged and labelled entries.
The aws-s3-integrity-check tool relies on the s3md5 bash script (version 1.2) [17] to function. To ensure the availability and maintenance of the s3md5 bash script to users of the aws-s3-integrity-check tool, the source s3md5 GitHub repository [17] has been forked and made available [30]. Any potential issues emerging on the s3md5 bash script that may affect the core function of the aws-s3-integrity-check tool can be submitted via the Issues tab of the forked s3md5 repository. Any new issue will be triaged, maintained, and fixed on the forked GitHub repository within the "Bug Report" category, before being submitted via a pull request to the project owner.

Limitations
Here, we presented a novel approach for optimising the integrity verification of a dataset transferred to/from the Amazon S3 cloud storage service. However, there are a few caveats to this strategy. First, the user has to have read/write access to an Amazon S3 bucket. Second, this tool requires that the user selects JavaScript Object Notation (JSON) as the preferred text-output format during the AWS authentication process. Third, the aws-s3-integrity-check tool is only expected to work for files that have been uploaded to Amazon S3 using any of the aws s3 transfer commands (i.e., cp, sync, mv, and rm) with all the configuration parameters set to their default values, including multipart_threshold and multipart_chunksize. In particular, it is essential that the file size thresholds for the file multipart upload and the default multipart chunk size remain at the default 8 MB values.
Fourth, the bash version of this tool is only expected to work across Linux distributions.
Finally, the Dockerized version of this tool requires three extra arguments to mount three local folders required by the Docker image, which may increase the complexity of using this tool.

METHODS
A stepwise protocol summarising how to check the integrity of a dataset stored on Amazon S3 is available on protocols.io [31] (Figure 2).

Main script
The main script is formed by a set of sequential steps whose methods are detailed below.
To parse command options and arguments sent to the aws-s3-integrity-check bash tool,  The information contained within each entry corresponds to the metadata of a given S3 object. The aws-s3integrity-check bash tool used the keys "Key" and "ETag" during the integrity verification of each file. the user and suggested different AWS authentication options. For the correct performance of this tool, it is required that the user selects JSON as the preferred text-output format during the AWS authentication process.
To obtain the ETag number assigned to the totality of the files contained within the indicated Amazon S3 bucket, we used the AWS CLI command list-objects (version 1) [34] as follows: $ aws s3api list-objects --bucket "$bucket_name" --profile "$aws_profile"' In this way, we reduced to one the number of queries performed to the AWS S3 API, known as s3api, which considerably reduced the overall network overload. The output of the function list-objects was a JSON object ( Figure 3).
If the local path indicated through the parameter [-l|--local <path_local_folder>] existed, was a directory, and the user had read access over its contents, the tool looped through its files. For each file within the folder, the aws-s3-integrity-check bash tool checked whether the filename was among the entries retrieved within the JSON metadata object and indicated within the "Key" field. If that was the case, it meant that the local file existed on the indicated remote Amazon S3 bucket, and we could proceed to calculate its checksum value. Before calculating the checksum value of the file, the tool evaluated the data content of the file. If it was smaller than 8 MB, the tool calculated its Content-MD5 value by using the  [17] with the command "s3md5 8 path_local_file".
To obtain the ETag value that Amazon S3 assigned to the tested file the moment it was stored on the remote bucket, we filtered the JSON metadata object using the fields "ETag" and "Key" and the function select (jq library, version jq-1.5-1-a5b5cbe, [38]). We then compared the local and remote checksum values; if the two numbers were identical, the integrity of the local file was proven. We repeated this process for each file in the local folder [-l|--local <path_local_folder>].
To inform the user about the outcome of each step, we use on-screen messages and log this information within a local file in a .txt format. Log files are stored within a local folder named "log/" located in the same path in which the main bash script aws-check-integrity.sh is located. If a local "log/" folder does not exist, the script creates it. Figure 1 shows a complete overview of the approach we followed. with the absolute path to the local folder containing the local version of the remote S3 files to be tested. This argument is used to mount the local folder as a local volume to the Docker image, allowing Docker to have read access over the local files to be tested.

Docker image
Important: the local folder should be referenced by using the absolute path.

DATA AVAILABILITY
All datasets used during the testing phase of the aws-s3-integrity-check tool are available within the EGA and GigaDB platforms.
The dataset stored within the Amazon S3 bucket 'mass-spectrometry-imaging' was generated by Guo et al. [28] and is available on the GigaDB platform [23].
The dataset stored within the Amazon S3 bucket 'tf-prioritizer' was generated by Hoffmann et al. [29] and is available on the GigaDB platform [25].
The dataset stored within the Amazon S3 bucket 'rnaseq-pd' was generated by Feleke, Reynolds et al. [24] and is available under request from EGA with accession number EGAS00001006380.
The dataset stored within the Amazon S3 bucket 'ukbec-unaligned-fastq' was a subset of the original dataset generated by Guelfi et al. [26], and is available under request from EGA with accession number EGAS00001003065.
The log files produced during the testing phase of the aws-s3-integrity-check tool are available at https://github.com/SoniaRuiz/aws-s3-integrity-check/tree/master/logs.