Atria: an ultra-fast and accurate trimmer for adapter and quality trimming

With advances in next-generation sequencing, adapters attached to reads and low-quality bases directly and implicitly hinder downstream analysis. For example, they can produce false-positive single nucleotide polymorphisms (SNP), and generate fragmented assemblies. There is a need for a fast trimming algorithm to remove adapters precisely, especially in read tails with relatively low quality. Here, we present Atria, a trimming program that matches the adapters in paired reads and finds possible overlapped regions using a fast and carefully designed byte-based matching algorithm (O (n) time with O (1) space). Atria also implements multi-threading in both sequence processing and file compression and supports single-end reads. Compared with other trimmers, Atria performs favorably in various trimming and runtime benchmarks of both simulated and real data. We also provide a fast and lightweight byte-based matching algorithm, which can be used in various short-sequence matching applications, such as primer search and seed scanning before alignment.

In addition to adapter trimming, Atria (RRID:SCR_021313) integrates trimming and filtering methods, such as consensus calling for overlapped regions, quality trimming, homopolymer trimming, N trimming, hard clipping from both ends, and read complexity filtration.

IMPLEMENTATION
The adapter-finding algorithms used in Atria (RRID:SCR_021313) can be categorized in the following portions: DNA encoding, matching algorithm, matching and scoring, decision rules, consensus calling, quality trimming, and IO optimization ( Figure 2).

DNA ENCODING
The DNA encoding algorithm is developed based on BioSequences, a Julia (RRID:SCR_021666) package from BioJulia [6]. The original BioSequences package encodes DNA bases A, C, G, T as four-bit codes 0001, 0010, 0100, 1000, respectively. Extended codes are also supported, such as N (1111), S (0110), and gap (0000). DNA sequences are encoded It is noticeable that the smallest addressable unit of memory is 1 byte (8 bits) while each DNA is encoded in 4 bits, so only the even indices of sequences can be directly extracted (defining indices start from 0) (Figure 2A). The extraction from odd indices requires extra operations, which is avoidable in many scenarios of a well-designed algorithm.
We denote a UInt64 extracted from the memory position n of sequence a by a n . a n is a 16-mer and represents the subsequence of a indexed from 2n to 2n + 15, which is denoted by a[2n: 2n + 15] (Figure 2A).

Matching algorithm
Given two sequences a and b, we plan to match the 16-base-long head of a to each index of b.
However, only the even indices of b can be extracted from memory without bitwise operations. Therefore, we prepared two UInt64 of a: a 0 and a − . a 0 is the 16-mer UInt64 loaded from the position 0 of a, and a − can be computed from the following bitwise operations: (a 0 ≫ 4|a 1 ≪ 4). In this way, a 0 represents the subsequence of a indexed from 0 to 15 (a[0:15]), and a − represents a [1:16] (Figure 2B).
In this way, the problem of matching the 16-base-long head of a to each index of b is converted to the problem of matching two 16-mers, a 0 , and a − , to each addressable memory position of b. The latter requires less bitwise operations.
The number of mismatches K is computed in the formula: where count_ones counts the number of ones in the binary representation of the UInt64.
Let k denote the user-defined number of mismatches allowed in the 16-mer comparison of UInt64 a n and b n (k = 2 by default). After matching a 0 and a − to each addressable memory position of b, if the minimum number of mismatches is not greater than k, the smallest index of b of the minimum mismatches is reported.
Therefore, the complexity of the matching algorithm is O (n) time with O (1) space, so its speed is extremely fast. One limitation is that when computing the number of mismatches of a n and b n , and if they have ambiguous bases in the same indices, the number of mismatches is underestimated. Another limitation is that the algorithm does not handle indels. Those limitations are compensated in the design of adapter matching, scoring, and decision rules.

Matching and scoring
We implement four pairs of matching to utilize properties of paired-end reads thoroughly: (1) matching adapter 1 head to read 1, (2) matching adapter 2 head to read 2, (3) matching read 1 head to reverse complement of read 2 and (4) matching read 2 head to reverse complement of read 1 ( Figure 2C). If the maximum number of bases matched of (1) and (2) is less than a user-defined cut-off (default is 9), (3) and (4) will be performed with a loosed k (= k original + 1). If the largest number of matched bases of the four matches is greater than the cut-off, and some matches do not meet the requirement, we will re-run those matches with a loosed k (= k original + 3) at the insert size indicated from the best match. If the new number of matched bases is greater than the cut-off, the old match will be discarded.
The scoring system measures the matching reliability of the whole 16-mer rather than each base. The Phred quality score Q of each base is converted to the probability P of that the corresponding base being correct using the formula: Then, the average base quality P of 16-mer sub-sequence a at the memory position n is computed: Gigabyte, 2021, DOI: 10.46471/gigabyte.31

4/18
Notably, if the read quality is too low, it would imply an invalid match. However, in reality, invalid matches are filtered out by the kmer-based algorithm. To solve the discordance, we limit the lower bound of P to 0.75 manually. The matching score S between a n and b m is defined as: where count_ones counts the number of ones in the binary representation of the UInt64. When sequence a is a user-defined adapter, P a = 1 is used. Generally, the matching score S is ranged from 0-16.

Decision rules
This module infers correct adapter positions from the four pairs of matching described in the previous section. It is illustrated in Figure 2D. First, in each read, the adapter and paired-end matches are compared. The one with the higher matching score is chosen. If both matches support the same adapter position, the matching score of the read is the sum of adapter and paired-end matching scores. Then, the matches of the two paired-end reads are compared using the same strategy. If one read finds an ideal adapter (matching score >10 by default) while the other read is too short to check or the average base accuracy of its 16-mer is less than 0.6 (Phred Q < 5), both reads will be trimmed. If the matching score of a given read pair is less than 10 (by default), the read pair will not be trimmed.
Other read pairs will be further examined to reduce false positives, which are usually adapter matches at read tails. A read tail is defined as the last several bases (default is

Consensus calling
In this module, the overlapped base pairs of read 1 and 2 are corrected to the corresponding bases with higher quality scores. It has three steps, prediction, assessment, and correction.
In the prediction step, Atria (RRID:SCR_021313) makes a preliminary estimate of whether a read pair contains an overlapped region. If adapters are trimmed and the remaining lengths of read 1 and 2 are the same, the prediction passes. If no adapter can be trimmed, two additional matching and scoring are required. The head of the reverse complement of read 2 is matched to read 1, and the head of the reverse complement of read 1 is matched to read 2. If the two matches reach a consensus, the prediction passes. Otherwise, the prediction fails and consensus calling is skipped.
In the assessment step, Atria (RRID:SCR_021313) compares the whole overlapped region using a similar matching algorithm, except that ambiguous bases (N, 1111) are converted to gaps (0000) before matching. If the ratio of mismatch is greater than a user-defined value (28% by default), the assessment fails, and consensus calling is skipped.
In the correction step, each base pair in the overlapped region is corrected to the corresponding base with the highest quality score.

Quality trimming
Atria (RRID:SCR_021313) implements a traditional sliding window algorithm to remove the low-quality tail. The sliding window scans from the front of the read and computes the average Phred quality score of the sliding window. If the average quality is less than a given threshold, the read tail is removed. which is then wrapped and encoded to FASTQ objects parallelly using multi-threading. On the contrary, in the writing process, Atria (RRID:SCR_021313) unboxes and decodes FASTQ objects to string vectors in parallel and writes sequentially to files. In addition, pigz (parallel gzip) and pbzip2 (parallel bzip2) are called for compression and decompression when needed [7,8]. Atria (RRID:SCR_021313) also support running with a single thread.

Performance of adapter trimming on a simulated dataset
We simulated 8.9 G bases with 100-bp paired-end reads from the Arabidopsis thaliana reference genome using the Skewer modified ART, a public NGS read simulator to allow adapters in the reads [9,10]. The simulation profile was trained from a 101-bp paired-end public dataset SRR330569, and the 33-bp adapter pair used in read simulation is AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT [11].  [14], SeqPurge v2012_12 [15], Trim Galore (RRID:SCR_011847) v0.6.5 [4] and Trimmomatic (RRID:SCR_011848) v0.39 [5]. Only adapter trimming was used, and other trimming and filtration were disabled. Detailed command line arguments are listed in Table S1, which is available in GigaDB that presents trimming speed on the 8.9 G bases 100 -bp paired-end simulated data [16]. where TP is the number of reads trimmed correctly, TN is the number of untrimmed reads without adapters, FP is the number of over-trimmed reads, and FN is the number of under-trimmed reads [3,10].
To compare speed and efficiency, elapsed time (wall time) and average CPU consumption of each trimmer were recorded in different threading (1-32 threads) for uncompressed and gzip compressed data formats ( Figure 3 and Table S1). Efficiency was defined as the fraction of processing speed to the percent of CPU utilized, so it was a better measurement, especially in CPU-intensive scenarios, such as running on a server with a job scheduling system or trimming multiple samples at the same time. Ktrim and Atria (RRID:SCR_021313) were two of the fastest trimmers in terms of speed and efficiency, from 1-16 threads ( Figure 3 and Table S1). For uncompressed data, Trimmomatic (RRID:SCR_011848) was faster than Atria (RRID:SCR_021313) using 8-32 threads, but its real CPU usage was much greater than Atria (RRID:SCR_021313) ( Figure 3 and Table S1). The speed and efficiency of AdapterRemoval (RRID:SCR_011834) and Skewer (RRID:SCR_001151) were generally 2-4 times less than Atria (RRID:SCR_021313), and Atropos was the slowest one ( Figure 3 and Table S1). SeqPurge did not support the output of uncompressed data, so it was only tested in the compressed benchmark.
When trimming compressed data, the speed of AdapterRemoval (RRID:SCR_011834), Skewer (RRID:SCR_001151), Fastp (RRID:SCR_016962), Atropos and Trimmomatic (RRID:SCR_011848) remained constant when the number of threads increased from four to 32, because they failed to utilize more than four CPU in the IO process, while Atria (RRID:SCR_021313) and Trim Galore (RRID:SCR_011847) did not have the limitation ( Figure 3 and Table S1). Atria (RRID:SCR_021313) was faster than Trim Galore (RRID:SCR_011847), and the efficiency of Atria (RRID:SCR_021313) was constantly two to three times greater than Trim Galore (RRID:SCR_011847) ( Figure 3 and Table S1). SeqPurge showed strange speed curves; when assigning a single thread to SeqPurge, the average CPU usage was 300%, and the speed and average CPU usage dropped when assigning 8-32 threads ( Figure 3 and Table S1). In addition, Ktrim did not support output compressed files, so we ignored it. Atria (RRID:SCR_021313) was usually the fastest trimmer when trimming compressed files.

Detailed statistics of adapter trimming accuracy on a simulated dataset
The previous portion benchmarks on a whole dataset. This section evaluates trimming accuracy regarding different read properties, including adapter presence or absence, base error, and adapter length. To achieve the goal, Atria (RRID:SCR_021313) integrates a benchmarking toolkit for read simulation and trimming analysis.  The read simulation method was inspired by the way in which sequencers read DNA.
First, an original DNA fragment (insert) with a given original insert size is simulated, base by base. Adenine, thymine, cytosine, and guanine are randomly chosen repetitively. Then, the insert and adapter sequences are copied base by base with an error profile, which simulates the procedure of sequencing by synthesis. The error profile defines substitution rate, insertion rate, and deletion rate.
Twenty-one million read pairs were simulated with a uniform read length (100 bp), different error profiles, adapter length, and original insert sizes. The baseline error profile comprises a 0.1% substitution rate, 0.001% insertion rate, and 0.001% deletion rate, inspired by an Illumina error profile analysis [17]. We chose 1×, 2×, 3×, 4×, and 5× baseline error profiles; 16, 20, 24, 28, and 33 adapter lengths; and 66-120 even insert sizes. In this way, the reads with the least insert size have full lengths of adapters. The reads with 66-98 original insert sizes contain adapters, and the reads with 100-120 original insert sizes are free from adapter contamination, except for few reads with a 100-bp insert size containing indels.
Therefore, in each condition combination, 30,000 read pairs were simulated to avoid random errors. Reads were trimmed with the same method described in the last section. The average trimming performance among different conditions is shown in Figure 4A. When adapters are present, Atria (RRID:SCR_021313) trims 99.9% adapters accurately, and SeqPurge, Fastp (RRID:SCR_016962), and Atropos follow closely with an accuracy of 99.7% ( Figure 4A1). When adapters are absent, AdapterRemoval (RRID:SCR_011834), Skewer (RRID:SCR_001151), Trimmomatic (RRID:SCR_011848), Atropos, and Atria (RRID:SCR_021313), successfully leave 100.0% reads intact, and Fastp (RRID:SCR_016962) falls behind with 99.8% accuracy ( Figure 4A2). Figure 4B illustrates the trimming accuracy on different read error profiles. When adapters are present, the accuracy of all trimmers drops as error rates increase ( Figure 4B1). Atria (RRID:SCR_021313) keeps the highest accuracy from 100.0% to 99.9%, and is rarely affected by different error rates ( Figure 4B1) Figure 4B1). With no adapter present in reads, the accuracy is influenced very little by error profiles (Figure 4B2), so the performance is similar to that shown in Figure 4A2. In addition, adapter lengths ranging from 16-33 bp are not relevant to the accuracy of most trimmers, including Atria (RRID:SCR_021313) ( Figure 4C).

Performance of adapter trimming on a real sequencing dataset
RNA-seq paired-end dataset (SRR330569). SRR330569 is a real RNA-seq dataset sequenced from Drosophila simulans with 5.46 G bases and 2 × 101-bp read length. It contains the 37-bp adapter sequences AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG and AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT in read 1 and read 2, respectively. Mapping statistics were collected using SAMTools (RRID:SCR_002105) Stat v1. 10 [20]. Skewer (RRID:SCR_001151) did consensus calling after adapter trimming, and no option was provided to disable it. To achieve benchmark parity, Skewer (RRID:SCR_001151) was compared to Atria (RRID:SCR_021313) with consensus calling enabled, and other trimmers were compared to Atria (RRID:SCR_021313) without consensus calling. Time trimming was recorded in accordance with a common scenario: inputs were gzip-compressed and trimmed with eight threads, and outputs were also gzip-compressed to reduce massive disk use. All tested trimmers worked in the scenario except that Ktrim could not output gzip files (Table 2). Atria (RRID:SCR_021313) was the fastest program to process and output compressed data in terms of wall time (Table 2). It also achieved the highest number of reads mapped and paired, and the highest percentage of properly paired reads with or without quality trimming. Usually, higher base mapping is accompanied by a higher error rate in the mapping process, so it is important to interpret the two metrics together. Atria (RRID:SCR_021313) had the lowest mapping error rate of 8.1833‰ and the forth highest number of bases mapped ( Table 2). The trimmers (AdapterRemoval (RRID:SCR_011834), Fastp (RRID:SCR_016962), and Atropos) with the highest three error rates have the highest number of bases mapped (Table 2). Our program was usually more than 5% better than other trimmers for data without quality trimming (Table 2). Mapping statistics of data without quality trimming were usually worse than with quality trimming, except for Atria (RRID:SCR_021313). Specifically, the properly paired rates of other trimmers without quality trimming were 0.5-4% less than with quality trimming ( Table 2). Quality trimming also increased the number of mapped and paired reads and reduced the number of unmapped reads ( Table 2).  Atria (RRID:SCR_021313) was also the fastest trimmer in the scenario (3 min 3 s) ( Table 1).
SeqPurge and Trim Galore (RRID:SCR_011847) finished the task in more than 4 minutes, while others spent more than 11 minutes (Table 2).
In adapter trimming-only statistics, SeqPurge had the highest mapped and paired reads (54,446,104) and the highest properly paired reads (97.0%) ( The algorithm also has its limitations. It only reports the number of matched bases and does not report the positions of mismatches, so it cannot be used solely for sequence alignment. Besides, the algorithm does not handle inserts and deletions (indels).
However, the average indel rate of an Illumina library is 10 −6 to 10 −5 [17], and the low indel rate is almost negligible in real data analysis. In addition, Atria (RRID:SCR_021313) matches four pairs in different locations to compensate for the limitation. If one match fails because of an indel, other matches will suggest the real adapter positions.
In the runtime benchmark, we compared the performance of trimmers using extremely high CPU cores. Efficiency usually marginally decreased as CPU usage increased owing to

CONCLUSIONS
We introduce not only Atria (RRID:SCR_021313), a cutting-edge trimming software for sequence data, but also the ultra-fast and lightweight byte-based matching algorithm. The algorithm can be used in various short-sequence matching applications, such as primer search and seed scanning before alignment. Atria (RRID:SCR_021313) is implemented in Julia (RRID:SCR_021666), a programming language designed specifically for high performance.

DATA AVAILABILITY
The datasets SRR330569, and ERR4695159 analyzed during the current study are available in the Sequence Read Archive from the National Center for Biotechnology Information [11,26].
The Atria source codes, releases, documents, and benchmark scripts can be downloaded from Atria's Github page [25]. Snapshots of the code are also available in the GigaScience GigaDB repository [16].

ETHICAL APPROVAL
Not applicable.

COMPETING INTERESTS
The authors declare that they have no competing interests.