LT1, an ONT long-read-based assembly scaffolded with Hi-C data and polished with short reads

We present LT1, the first high-quality human reference genome from the Baltic States. LT1 is a female de novo human reference genome assembly, constructed using 57× nanopore long reads and polished using 47× short paired-end reads. We utilized 72 GB of Hi-C chromosomal mapping data for scaffolding, to maximize assembly contiguity and accuracy. The contig assembly of LT1 was 2.73 Gbp in length, comprising 4490 contigs with an NG50 value of 12.0 Mbp. After scaffolding with Hi-C data and manual curation, the final assembly has an NG50 value of 137 Mbp and 4699 scaffolds. Assessment of gene prediction quality using Benchmarking Universal Single-Copy Orthologs (BUSCO) identified 89.3% of the single-copy orthologous genes included in the benchmark. Detailed characterization of LT1 suggests it has 73,744 predicted transcripts, 4.2 million autosomal SNPs, 974,616 short indels, and 12,079 large structural variants. These data may be used as a benchmark for further in-depth genomic analyses of Baltic populations.

research in Lithuanians has mainly utilized single nucleotide polymorphism (SNP) genotyping [5,10–13] or exome sequencing [14,15]. To expand the scope of analyses and increase the possibility of new findings, whole-genome sequencing (WGS) using long-read technologies is an optimal solution; it enables the discovery of novel genomic variations [16], reveals accurate breakpoints of structural variations (SVs), and covers some of the complex repeat regions [17–19]. Consequently, resolving haplotypes is also relevant to high-quality de novo whole-genome assembly and phasing [17].

CONTEXT
Here, we utilized PromethION, the long-read sequencing platform from Oxford Nanopore Technologies (ONT), as a backbone to construct the first Lithuanian reference genome, LT1, using the genome of a healthy female with Lithuanian ancestry. The ONT PromethION long-read-based genome assembly was polished using BGI-500 short-reads and scaffolded by utilizing Hi-C chromatin conformation capture data.
The final assembly had an NG50 value of 138 megabase pairs (Mbp) and 4,699 scaffolds, covering 92.75% of GRCh38 (Genome Reference Consortium Human Build 38) [20]. SV analyses using long-read data identified 12,079 consensus SVs (confirmed by both SVIM [21] and NextSV2 [22]); however, more than half of the SVs (62.27% of 12,079) lacked information in all major databases, such as gnomAD [23], indicating that human SVs are an under-investigated area of population genetics.
Our high-quality assembly is the first step towards increasing the availability of human genome assemblies from the Baltic States and will serve as a valuable resource for further studies in population genomics.

Sample preparation, library construction, and sequencing
A Lithuanian human (NCBI:txid9606) female with three generations of ethnic family history was recruited for sequencing. Standard ethical procedures were applied by the Genome Research Foundation with IRB-REC-20101202-001. The volunteer signed an informed consent agreement, and a 20 mL blood sample was drawn using heparinized needles and collected into anticoagulant-containing tubes (dipotassium ethylenediaminetetraacetic acid). DNA was extracted from the single donor's peripheral blood (5 mL) using a DNeasy Blood & Tissue Kit from QIAGEN, according to the manufacturer's protocol. The quality and concentration of the extracted DNA were evaluated using a NanoDrop™ One/OneC UV-Vis spectrophotometer (ThermoFisher Scientific™). Short-read whole genome sequencing and library construction were conducted by the Beijing Genomics Institute (BGI) on the BGISEQ-500 platform (RRID:SCR_017979) using DNBseq™ 100-basepair (bp) paired-end sequencing.
Sequencing libraries for long reads were prepared using the 1D ligation sequencing kit (SQK-LSK109) (Oxford Nanopore Technologies, UK) following the manufacturer's instructions. The products were quantified using the Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA) and the raw signal data were generated on the PromethION R9.4.5 platform (Oxford Nanopore Technologies, UK). Base-calling from the raw signal data was carried out using a default ONT basecaller MinKNOW v19.05.1 with the Flip-Flop HAC (High Accuracy) model (Oxford Nanopore Technologies, UK).

De novo assembly of the LT1 genome
To generate the de novo LT1 genome assembly, we prepared a bioinformatic pipeline including: a preprocessing step, contig assembly, map assembly, gene prediction, and post-analysis. The processes used in the pipeline are summarized in Figure 1.
A total of 142.09 gigabase pairs (Gbp) of short paired-end genomic raw reads were produced by the BGISEQ-500 sequencer, which resulted in a 47× sequencing depth of coverage. In total, 172.22 Gbp of raw long reads, giving 57× coverage, were produced from PromethION sequencing (Table 1). Ultra-long reads constituted 0.0075% of the long reads.
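As a rough check of the depth figures above, mean depth of coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes a genome size of 3.0 Gbp; this value and the function name are illustrative assumptions, since the text does not state the genome size used for the calculation.

```python
def depth_of_coverage(total_bases_gbp: float, genome_size_gbp: float = 3.0) -> float:
    """Return mean sequencing depth as total bases / genome size (both in Gbp)."""
    if genome_size_gbp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gbp / genome_size_gbp


# Read totals reported in the text; the 3.0 Gbp genome size is an assumption.
short_depth = depth_of_coverage(142.09)  # BGISEQ-500 short reads
long_depth = depth_of_coverage(172.22)   # PromethION long reads

print(f"short reads: {short_depth:.1f}x, long reads: {long_depth:.1f}x")
```

Under this assumption, both values round to the reported 47× and 57×.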
Base-called raw reads with low quality were filtered by the default function of MinKNOW.
The first de novo assembly was performed using wtdbg2 v2.5 (RRID:SCR_017225) [27] with cleaned (filtered and trimmed) long reads. Parameters for the assembly were set as '-x ont -g 3g -L 5000'. For error correction of assembled contigs, we utilized a two-step strategy. In the second step, error correction using Medaka v0.11.5 [29] was performed with a pre-trained model for Flip-Flop. To improve SNP and indel accuracy of the assembly, we polished the consensus with short reads using two rounds of Pilon v1.23 (RRID:SCR_014731) [30]. The second de novo assembly was performed using the Shasta v0.4.0 [31] assembler with the default parameters. For error correction of assembled contigs, MarginPolish v1.3 [32] and HELEN v0.0.1 [33] were used with default options. Using two assemblers allowed us to compare their outcomes and choose the better-performing method for downstream analyses.
Owing to the absence of LT1 parental genome data, a read-based phasing of the assembly was performed using Medaka and WhatsHap v1.0 [39] and shared on the LT1 webpage [40]. Since the variant calling module of Medaka includes variant calling and phasing steps with sequenced reads from ONT using WhatsHap, filtered and trimmed PromethION reads were mapped against the assembled scaffolds, and the assembled scaffolds were phased using Medaka. As a result of read-based phasing, 2,299,025 of 3,901,968 variants were phased, and the number of phased blocks was 8,879 (see Table S1 in GigaDB [41]). Phased genome sequences were extracted using Bcftools v1.9 (RRID:SCR_005227) [42] with the command "bcftools consensus -H 1 -f reference.fasta phased.vcf.gz > haplotype1.fasta".
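The extraction command above can be generalized to both haplotypes. The following is a minimal sketch, not the authors' script: the text only gives the `-H 1` invocation, so the second haplotype (`-H 2`), the output file names, the use of `-o` instead of shell redirection, and the helper name `consensus_command` are assumptions made for illustration. The commands are only printed here; running them requires bcftools and a bgzip-compressed, indexed VCF.

```python
import shlex


def consensus_command(haplotype: int, reference: str, vcf: str, output: str) -> list:
    """Build a `bcftools consensus` argument list for one haplotype (1 or 2)."""
    if haplotype not in (1, 2):
        raise ValueError("haplotype must be 1 or 2")
    return [
        "bcftools", "consensus",
        "-H", str(haplotype),  # apply alleles of the chosen haplotype
        "-f", reference,       # reference FASTA (here, the assembled scaffolds)
        "-o", output,          # write the consensus FASTA to this file
        vcf,
    ]


# File names follow the command quoted in the text; haplotype2.fasta is assumed.
commands = [
    consensus_command(h, "reference.fasta", "phased.vcf.gz", f"haplotype{h}.fasta")
    for h in (1, 2)
]

for cmd in commands:
    print(shlex.join(cmd))
    # To execute where bcftools is installed:
    # import subprocess; subprocess.run(cmd, check=True)
```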

Constructing a genome browser and BLAST database
To construct a genome browser, we first compiled all the data, including predicted gene models and evidence resources. The LT1 browser was built using JBrowse v1.16.9 (RRID:SCR_001004) [54]. A BLAST database for LT1 gene set v1 was built by SequenceServer v1.0.12 [55] and can be accessed via the Lithuanian genome webpage [40].

DATA VALIDATION AND QUALITY CONTROL
LT1 genome assembly statistics
The contig assembly using wtdbg2 resulted in 2.73 Gbp in 4,490 contigs with an NG50 of 12 Mbp. The Shasta assembly had 2.8 Gbp assembled into 11,009 contigs with an NG50 of 6.3 Mbp. Both contig assemblies were corrected using long reads and polished with short reads, as described in the Methods. The wtdbg2 assembly had higher contiguity and quality (total number of contigs, NG50, and quality value (QV) statistics); therefore, it was selected as the main assembly for the LT1 genome and subsequent analyses (Table 2). Hi-C data were used for scaffolding (Figure 2). After scaffolding, we produced 4,700 scaffolds with a total length of 2.73 Gbp and an NG50 of 138 Mbp (Table 3), a count that includes the mtDNA. The number of scaffolds is higher than the original 4,490 contigs because we manually split some misassemblies that were found when we applied the Hi-C data. The longest scaffold was mapped to chromosome 2 and spanned 218 Mbp, covering 92.6% of chromosome 2.
To estimate the quality of the LT1 assembly, we compared it with GRCh38 and 'CHM13 Chromosome X v0.7' from T2T using Dot and QUAST. No significant misassemblies (such as translocations and inversions) were identified in the NUCmer alignments plotted in Dot; notably, the comparison between LT1 and CHM13 chromosome X displayed a higher breadth of coverage (94.50%, Figure 3), supporting the conclusion that LT1 is a relatively high-quality assembly. The LT1 genome covered 92.75% of GRCh38 (excluding alternative contigs and chromosome Y in GRCh38), as shown in Figure 4 and Table 4. This shows that LT1 is significantly shorter than GRCh38 and that there is still room for assembly improvement. QUAST analysis showed that the LT1 genome had 479 and 842 misassemblies against GRCh38 and CHM13v1.1 (without ALT sequences), respectively (Table 4).
[…] GigaDB [41]) and 41,601 transcripts (see Table S3 in GigaDB [41]), respectively. BLAST analysis results are available on the LT1 genome webpage [40].
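For readers unfamiliar with the contiguity metric reported above, NG50 is the length of the shortest sequence in the set of longest sequences that together cover at least half of an estimated genome size (N50 uses the assembly length instead). The sketch below is a generic illustration on toy lengths, not the tool used in this study; the function name and the toy values are assumptions.

```python
def ng50(lengths, genome_size):
    """Return NG50: the length L such that sequences of length >= L
    sum to at least half of genome_size. Returns 0 if the assembly
    covers less than half of the genome size."""
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


toy_contigs = [50, 40, 30, 20, 10]       # toy lengths, arbitrary units
print(ng50(toy_contigs, 200))            # 30: 50 + 40 + 30 = 120 >= 100
print(ng50(toy_contigs, sum(toy_contigs)))  # 40: this is N50, 50 + 40 = 90 >= 75
```

Because NG50 is normalized by genome size rather than assembly length, it allows assemblies of different total length, such as the wtdbg2 and Shasta assemblies, to be compared on the same scale.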
For SVs, we used two SV callers, SVIM and NextSV2, and an SV analysis tool, SURVIVOR. We identified a union of 31,167 SVs, of which 12,079 were shared insertions, deletions, and inversions (Figure 5, Table 7).
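The idea of a consensus call set can be illustrated with a simplified matching rule: a call from one tool is kept if the other tool reports an SV of the same type on the same chromosome within a distance threshold. This is a toy sketch and not SURVIVOR's algorithm; the `SV` record, the `consensus_svs` helper, the 1,000 bp threshold, and the example calls are all assumptions for illustration.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    pos: int      # start breakpoint
    svtype: str   # e.g. "INS", "DEL", "INV"


def consensus_svs(calls_a, calls_b, max_dist=1000):
    """Return calls from A that have a same-type call in B on the same
    chromosome with a start position within max_dist bp."""
    positions_by_key = {}
    for sv in calls_b:
        positions_by_key.setdefault((sv.chrom, sv.svtype), []).append(sv.pos)
    shared = []
    for sv in calls_a:
        candidates = positions_by_key.get((sv.chrom, sv.svtype), [])
        if any(abs(sv.pos - p) <= max_dist for p in candidates):
            shared.append(sv)
    return shared


# Toy calls standing in for the output of two callers.
caller_a = [SV("chr1", 1000, "DEL"), SV("chr1", 50000, "INS"), SV("chr2", 7000, "INV")]
caller_b = [SV("chr1", 1200, "DEL"), SV("chr1", 90000, "INS"),
            SV("chr2", 7100, "INV"), SV("chr3", 10, "DEL")]

shared = consensus_svs(caller_a, caller_b)
print(len(shared))  # 2: the chr1 deletion and the chr2 inversion
```

A production merge would also compare SV length and strand and enforce one-to-one matching, which this sketch omits.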

Variant identification
The total number of deletions called by each tool after QC filtering differed minimally; however, the number of insertions identified by SVIM was two times higher than that identified by NextSV2 (Table 7). This is probably because insertions are usually more difficult to call than deletions [68]. CNVnator detected 95 duplications, which was similar to the 97 detected by NextSV2 when the minimum read support was 10; however, only one duplication was […] (Tables 7 and S6 in GigaDB [41]).
Almost 97% of SVs assigned to a genomic region were located in the introns [41].
Regardless of the SV type, small SVs (30-200 bp) constituted a significant fraction (64.77%), with another peak around 300 bp (Figure 6). We confirmed that 81.92% of the insertions forming this second peak (in the length range 250-350 bp) were Alu sequences. The longest insertion among the consensus SVs was detected on chromosome 7 in the LOC101928283 gene, spanning 1,003 bases. The largest deletion was on chromosome 1 (45,516 bases) and has been annotated in both the Database of Genomic Variants (DGV; RRID:SCR_007000) [68] and gnomAD (RRID:SCR_014964) [23] databases, despite lacking a precise annotation for a location or a clinical phenotype. It is predicted to be benign (Table S7 in GigaDB [41]).
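Size-class fractions such as those above can be computed directly from SV lengths. The sketch below uses made-up lengths, not the LT1 call set, and the helper name is an assumption; absolute values are taken because deletion lengths are conventionally reported as negative SVLEN values in VCF files.

```python
def fraction_in_range(lengths, low, high):
    """Return the fraction of SVs whose absolute length lies in [low, high] bp."""
    if not lengths:
        return 0.0
    hits = sum(1 for n in lengths if low <= abs(n) <= high)
    return hits / len(lengths)


# Toy SV lengths in bp (negative = deletion); illustrative only.
toy_lengths = [35, -60, 120, 310, -295, 1500, 45, 5000]

print(fraction_in_range(toy_lengths, 30, 200))   # 0.5  (35, 60, 120, 45)
print(fraction_in_range(toy_lengths, 250, 350))  # 0.25 (310, 295)
```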

*Two columns were used for this database annotation. The bottom row denotes statistics that are not available with 'full' annotations and were presented based on 'split' type annotations (see Table S7 in GigaDB [41]).

REUSE POTENTIAL
We present the first Lithuanian reference genome, LT1. ONT's PromethION long-read and BGI-500 short-read sequencing technologies were combined with Hi-C chromatin conformation capture to complete the genome assembly. It was built with sufficient sequencing data to cover the genome, and a high-quality assembly was constructed as the first reference genome from the Baltic States. BUSCO assessment revealed that LT1 gene prediction had more fragmented and missing genes against GRCh38 than initially expected. Even though long DNA reads usually provide more accurate SV calling, our SV analyses with long-read data showed that most SVs (62.27%) could not be annotated using currently available public databases. This indicates that SVs are still an under-investigated area of population genetics. More ethnic references and regional genomic variation datasets (variomes) with phenotype association studies are needed to fill these remaining gaps in our knowledge and to completely map and understand the biological features of the human genome structure.
This genome assembly could be used as a genomic reference representative of Lithuanian people in comparative genomics. Owing to its relatively high depth of coverage of long-read sequencing, this genome can be reused as a template to accurately map the autosomal and X chromosome genomic variation (both SNPs and SVs) of samples from the Baltic region. Moreover, here we provide the first long-read SV set for a healthy Lithuanian individual, which could be used in disease studies to filter out SVs in Lithuanian populations that do not cause any critical or early-onset disease phenotypes. Lastly, the unprocessed sequencing data (short reads, long reads, and Hi-C data) can be reused for any population genomics study.

DATA AVAILABILITY
The whole genome sequence analyzed in this study has been deposited at the National Center for Biotechnology Information (NCBI) under BioProject ID PRJNA635750, in the NCBI BioSample database under accession number SAMN15052346, and in the NCBI Sequence Read Archive (SRA) database under accession number PRJNA635750. Hi-C reads are

ETHICAL APPROVAL
This study was approved by the Institutional Review Board of the Genome Research Foundation (reference: IRB-REC-20101202-001). The anonymous sample donor provided informed consent to participate in whole genome sequencing and the following analysis in compliance with the Declaration of Helsinki. Informed consent was recorded by their signing of a written consent form.

CONSENT FOR PUBLICATION
The consent form signed by the anonymous sample donor included a section about data publication, to which the sample donor specifically consented.