American triatomine species occurrences: updates and novelties in the DataTri database

The causative agent of Chagas disease (Trypanosoma cruzi) is transmitted to mammals, including humans, mainly by insect vectors of the subfamily Triatominae (Hemiptera: Reduviidae). Also known as “kissing bugs”, the subfamily currently includes 157 validated species (154 extant and three extinct), in 18 genera and five tribes. Here, we present a subdataset (7852 records) of American triatomine occurrences, an update to the most complete and integrated database available to date at a continental scale. New georeferenced records were obtained from a systematic review of published literature and colleague-provided data. New data correspond to 101 species and 14 genera from 22 American countries between 1935 and 2022. The most important novelties refer to (i) the inclusion of new species, (ii) synonymies and formal transferals of species, and (iii) updates to the temporal and geographical records of species. These data will be a useful contribution to entomological surveillance related to Chagas disease.

This integrated New American dataset is complemented by a dataset on triatomine species present in Argentina (henceforth referred to as the "Argentinean dataset"), also stored in GBIF (Figure 1).
This work is the result of an exhaustive review of public information combined with substantial interinstitutional collaboration, which integrated not only geographical but also ecological data for all American triatomine species spanning 24 American countries. This geodatabase may contribute not only towards improving knowledge of the geographical distributions of every American triatomine species, but also to designing improved strategies for health promotion and vector control. We believe it will be of practical use for both the academic and educational community, as well as for those institutions responsible for public health promotion, prevention and vector control activities.
Subsequently, in 2020, another 5893 records were incorporated (amounting to 27,708 records). Then, it was modified to Darwin Core format [7] and stored in the GBIF platform, split into two complementary datasets: the "American dataset" (11,791 records, spanning 1926-2022) and the "Argentinean dataset" (15, T. sordida (Stål, 1859)) that are included in the Argentinean dataset and recorded as having (or having had) at least part of their geographical distribution within Argentine territory.

New American triatomine occurrence data
To build a new subdataset of American triatomine occurrences (the American subdataset) and integrate it into the New American dataset available at present, a total of 7852 occurrence records were compiled from 2 years (2020-2022) of systematic reviews (Figure 1b, Table 1). The new records (N = 7852) are identified consecutively from "catalogNumber = 27631" to "catalogNumber = 35482" in the New American dataset. Azeredo-Oliveira & Fernandez Madeira, 2020), which was described in 2020 after the last update of the Argentinean dataset (and therefore not included in the latter), occurrence records of all the American triatomine species described to date [2] are included across both datasets (Table 1).
In the New American dataset, Jamaica was included for the first time, with records of the species Nesotriatoma obscura [35]. In summary, more than 70% of the species present in the "American dataset" now have new records incorporated in the "New American dataset" (for more details, see Table 1).
The new compiled information included in the American subdataset spans 1935 to 2022.
Date information was available for 90% of the records, and 30% of the records comprise data from the last 4 years. One noteworthy update to the temporal patterns of some species is that Rhodnius pictipes (Stål, 1872) has been recorded in Trinidad and Tobago since 1985 [36].
The addition of new records (N = 7852) into the New American dataset (reaching 19,600 records), plus those records included in the Argentinean dataset (N = 15,917) equates to a total of 35,517 occurrence records for 142 American triatomine species (Figure 2).
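As a quick consistency check on the figures quoted above, the following minimal Python sketch recomputes the size of the catalogNumber range and the combined total of both datasets. The variable and function names are illustrative only and are not part of the dataset.

```python
# Figures quoted in the text (not read from the dataset itself).
FIRST_NEW_CATALOG_NUMBER = 27631
LAST_NEW_CATALOG_NUMBER = 35482
NEW_RECORDS = 7852
NEW_AMERICAN_TOTAL = 19_600
ARGENTINEAN_TOTAL = 15_917
GRAND_TOTAL = 35_517


def inclusive_count(first: int, last: int) -> int:
    """Number of consecutive identifiers from first to last, both included."""
    return last - first + 1


# The catalogNumber range should contain exactly the new records.
range_matches = (
    inclusive_count(FIRST_NEW_CATALOG_NUMBER, LAST_NEW_CATALOG_NUMBER) == NEW_RECORDS
)
# The two complementary datasets should add up to the reported grand total.
total_matches = NEW_AMERICAN_TOTAL + ARGENTINEAN_TOTAL == GRAND_TOTAL

print(range_matches, total_matches)
```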

Information source types and compilation of triatomine species data
To build the American subdataset, data for each triatomine species were obtained through a detailed and exhaustive review of information. No specific temporal range limits were set, so as to obtain the greatest possible amount of new data from as many American countries as possible. Several public online bibliographic repositories (BioOne, Google Scholar, PLoS, PubMed, Scielo, ScienceDirect, Wiley) were searched using terms such as "Chagas disease", "Triatominae" and "Trypanosoma cruzi" without language restriction.
We also reviewed the public and open access triatomine bibliographic database BibTri [37].
Where published articles mentioned unpublished datasets, we contacted the authors and asked them to provide geographic coordinates, or at least locality data, so that the records could be georeferenced.

Data georeferencing process
To rigorously associate each record to a specific location in the geographical space, data must have information expressed in geographic coordinates (latitude and longitude). If no geographic coordinates were available, the site name was used together with information on administrative divisions to attain an accurate location using Google Earth [38]. If the geographic coordinates were not expressed in decimal degrees, they were converted using a coordinate conversion application [39]. Where only the geographic coordinates were available, the corresponding administrative divisions were completed using GeoLoc [40].
The datum (coordinate system and set of reference points used to locate places on Earth) used for all geographic records was WGS84 (World Geodetic System 1984). The final dataset was built after data quality control.
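The conversion to decimal degrees described above was carried out with an external application [39]. The sketch below only illustrates the underlying arithmetic, assuming coordinates given as degrees, minutes, seconds and a hemisphere letter; the function name and input layout are assumptions made for this illustration.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees.

    Southern latitudes and western longitudes are returned as negative values,
    which is the usual sign convention for decimal-degree coordinates.
    """
    hemisphere = hemisphere.strip().upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")

    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere in {"S", "W"} else value


# 30 degrees 30 minutes South corresponds to -30.5 decimal degrees.
print(dms_to_decimal(30, 30, 0, "S"))
```

Most of the Americas lie in the western hemisphere, so a longitude that comes out positive is usually a sign of a dropped hemisphere letter in the source.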

Description of American subdataset fields
We compiled all relevant and available information associated with each triatomine species and attached the data to each dataset field, including characteristics of the specimens collected and of the sampled sites. To better describe the fields (based on Darwin Core terms [7]) used to systematize the information, they were grouped into the following six categories: (1) identifiers (including fields used to identify each record, e.g. occurrence ID, institution code, language of the resource, associated references, etc.); (2) systematic (including fields used for systematic information, e.g. scientific name, scientific name authorship, taxon rank and taxon remarks); (3) geographical (including fields with information such as administrative divisions, coordinates, georeference sources, etc.); (4) temporal (including fields related to the event date such as year, month and day);

Systematic fields
When appropriate, the "taxonRemarks" field included notes and/or references about synonyms or formal transferals of the species described in the corresponding record.

Geographical fields
The "locality" field refers to the site name nearest to the geographic coordinates, not necessarily the name of the locality where the specimens were collected. If the most precise geographic information available was the municipality name, the coordinates correspond to its centroid.

Temporal fields
When the information on a group of specimens corresponded to a certain time period but specific dates were available, the data were split into separate records.

Sampling fields
The "habitat" field refers to the type of habitat where the triatomines were collected, which was classified into three categories: domicile, peridomicile and sylvatic. When specific habitat information was aggregated, the habitat was expressed as a combination of two or three of those categories (e.g., domicile-peridomicile, domicile-sylvatic, peridomicile-sylvatic or domicile-peridomicile-sylvatic).
For the "SamplingProtocols" field, the available information was classified into two major categories: (i) active search, when the searching involved specialized staff; and (ii) passive collection, when different types of traps (e.g., light or Noireau traps) were used.
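The habitat coding described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure used to build the dataset: the function name is hypothetical, and the fixed ordering of categories is inferred from the combined values listed in the text.

```python
# The three habitat categories, in the order used by the combined values
# quoted in the text (e.g. "domicile-peridomicile-sylvatic").
HABITAT_ORDER = ("domicile", "peridomicile", "sylvatic")


def combine_habitats(observed):
    """Build a single "habitat" value from one or more reported categories."""
    cleaned = {habitat.strip().lower() for habitat in observed}
    if not cleaned:
        raise ValueError("at least one habitat category is required")

    unknown = cleaned - set(HABITAT_ORDER)
    if unknown:
        raise ValueError(f"unknown habitat categories: {sorted(unknown)}")

    # Join the categories in the canonical order so that the same combination
    # is always written the same way, regardless of input order.
    return "-".join(habitat for habitat in HABITAT_ORDER if habitat in cleaned)


print(combine_habitats(["sylvatic", "Domicile"]))
```

Writing every combination in one canonical order avoids treating, for instance, "sylvatic-domicile" and "domicile-sylvatic" as two different habitat values when the field is later summarized.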

DATA VALIDATION AND QUALITY CONTROL
The American subdataset was subjected to exhaustive quality control. First, each datum was extracted by one person and checked by two other people to ensure accuracy and to verify no duplication of records. Subsequently, data were checked to avoid errors (e.g., typing, georeferencing, incorrect locations, synonyms, errors in spelling of administrative divisions) that might have arisen during compilation or data entry. To correct and remove typographical errors and spelling mistakes in the names of administrative divisions, we used OpenRefine software (RRID:SCR_021305) [41], which helps to detect these types of errors in large datasets.
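The duplicate check described above was performed by people reviewing the records. As a hedged illustration of how a first automated pass might flag candidates for such a review, the sketch below groups records by a key built from four Darwin Core terms. The choice of key fields, the function name and the sample records (which are invented for this example) are all assumptions.

```python
from collections import defaultdict

# Assumed key: two records sharing all of these values are flagged for review.
KEY_FIELDS = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def find_duplicate_candidates(records, key_fields=KEY_FIELDS):
    """Return lists of catalogNumber values that share the same key."""
    groups = defaultdict(list)
    for record in records:
        # Normalize case and surrounding whitespace before comparing.
        key = tuple(str(record.get(field, "")).strip().lower() for field in key_fields)
        groups[key].append(record["catalogNumber"])
    return [numbers for numbers in groups.values() if len(numbers) > 1]


# Invented records, for illustration only.
sample = [
    {"catalogNumber": 1, "scientificName": "Triatoma infestans",
     "decimalLatitude": -31.4, "decimalLongitude": -64.2, "eventDate": "2001-05-10"},
    {"catalogNumber": 2, "scientificName": "Rhodnius prolixus",
     "decimalLatitude": 4.6, "decimalLongitude": -74.1, "eventDate": "1998-03-02"},
    {"catalogNumber": 3, "scientificName": "triatoma infestans ",
     "decimalLatitude": -31.4, "decimalLongitude": -64.2, "eventDate": "2001-05-10"},
]

print(find_duplicate_candidates(sample))
```

Flagged groups are only candidates: two genuine collections of the same species at the same site on the same day would share a key, so a manual decision is still needed.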
All geographic coordinates were checked using open GIS software (QGIS, RRID:SCR_018507 [42]) to detect georeferencing errors and incorrect locations, ensuring that each point corresponded to a location on the continent and in the correct country. Any outlier coordinates that were geographically distant from the known distribution of a given species were investigated to ensure correctness. When validating geographic coordinates, we detected that some occurrence data from public sources were located outside the continent or within continental waterbodies. These data may have been erroneously georeferenced by the authors of the original scientific publication; however, when we considered these data sufficiently valuable to be incorporated, we carried out the following procedure: if the "country", "stateprovince", "municipality" and "locality" fields were provided by the authors, we assigned corrected geographic coordinates, taking the locality name provided by the authors as a reference. To detect taxonomic synonym errors, we used the most recent triatomine review of currently valid species [2]. If any species name was suspected to be outdated, we consulted the current literature or requested the expert opinion of colleagues.
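The coordinate checks above were done in QGIS against country boundaries. The sketch below is a much coarser stand-in that only flags points with impossible values or points falling outside a rough rectangle around the Americas. The bounding-box limits and the function name are assumptions chosen for illustration; a rectangle cannot detect points in the wrong country or inside waterbodies.

```python
# Rough, assumed rectangle around the Americas (decimal degrees, WGS84).
AMERICAS_BBOX = {"min_lat": -56.0, "max_lat": 72.0, "min_lon": -170.0, "max_lon": -30.0}


def coordinate_issue(latitude, longitude, bbox=AMERICAS_BBOX):
    """Return a short description of the problem, or None if no issue is found."""
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        return "out of range"
    inside = (
        bbox["min_lat"] <= latitude <= bbox["max_lat"]
        and bbox["min_lon"] <= longitude <= bbox["max_lon"]
    )
    if not inside:
        return "outside bounding box"
    return None


# A point in the Americas passes; the same point with a dropped minus sign
# on the longitude is flagged for manual review.
print(coordinate_issue(-34.92, -57.95))
print(coordinate_issue(-34.92, 57.95))
```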
Finally, we improved the quality of our final dataset using the GBIF data validator [43] to identify and address potential issues prior to the dataset publication through the Integrated Publishing Toolkit (IPT, [44]).

REUSE POTENTIAL
As the information contained within the dataset has been collected using different influenced by the great contribution of colleagues who led a kissing bug community science program [45], and are co-authors of this work (Rachel Curtis-Robles and Sarah A. Hamer).
Three important notes about the data from that program are: (1)  Brazil are also the countries with the largest amount of data collected). An explanation for the latter factor goes beyond the goal of this paper. For habitat sampling, we recognize a potential bias in favor of the domiciliary and peridomiciliary habitats because these are the habitats of major epidemiological importance and the target of vector control campaigns.
Additionally, the paucity of sylvatic habitat data also results from the difficulty of sampling procedures in the large variety of sylvatic habitats used by triatomines. Finally, it is worth noting that about 37% of the records lack available date information; thus, we recommend that any analysis based on this dataset should use methods that take such biases into account.
Despite the information biases described above, the American subdataset described in this paper, and integrated in the New American dataset, plus the complementary Argentinean dataset, constitute a valuable compilation of geographic data on American triatomines, which is as complete, updated and integrated as possible. Thus, all datasets mentioned herein better represent the number of species and countries, and have more accurate geographic coordinates. Since these datasets are hosted in an open and public repository, we hope that they will contribute towards fulfilling national and international goals, such as promoting the exchange of biological information, increasing and improving the accessibility of such information, providing biological data produced and compiled in several countries, and enhancing knowledge of both the biodiversity and epidemiological data related to Chagas disease.

DATA AVAILABILITY
The dataset "Datos de ocurrencia de triatominos americanos del Laboratorio de Triatominos del CEPAVE (CONICET-UNLP)" has been published by Centro de Estudios Parasitológicos y de Vectores (CEPAVE) [46] and is available in the GBIF repository under a CC0 public domain waiver [5].

EDITOR'S NOTE
This paper is part of a series of Data Release articles working with GBIF and supported by the Special Programme for Research and Training in Tropical Diseases (TDR), hosted at the World Health Organization [47].