Mosquito alert: leveraging citizen science to create a GBIF mosquito occurrence dataset

The Mosquito Alert dataset includes occurrence records of adult mosquitoes collected worldwide in 2014–2020 through Mosquito Alert, a citizen science system for investigating and managing disease-carrying mosquitoes. Records are linked to citizen science-submitted photographs and validated by entomologists to determine the presence of five targeted European mosquito vectors: Aedes albopictus, Ae. aegypti, Ae. japonicus, Ae. koreicus, and Culex pipiens. Most records are from Spain, reflecting Spanish national and regional funding, but since autumn 2020 substantial records from other European countries are included, thanks to volunteer entomologists coordinated by the AIM-COST Action, and to technological developments to increase scalability. Among other applications, the Mosquito Alert dataset will help develop citizen science-based early warning systems for mosquito-borne disease risk. It can also be reused for modelling vector exposure risk, or to train machine-learning detection and classification routines on the linked images, to assist with data validation and establishing automated alert systems.


DATA DESCRIPTION

Background
Vector-borne diseases (VBDs) are infections caused by pathogens transmitted by carrier species (vectors), most of which are arthropods. VBDs are a major global health issue, with 80% of the world's population at risk of one or more of these diseases [1]. VBDs account for 17% of the global burden of communicable diseases, with over 1 billion infections and over 700,000 deaths caused by VBDs annually [1]. Many of these diseases, once limited to tropical and subtropical zones, are now increasingly seen in temperate areas [1,2].
Among VBDs, mosquito-borne diseases (MBDs) account for a large share of cases. In 2017 the World Health Organization estimated over 347 million MBD cases and over 447,000 deaths caused by MBDs annually [1]. Of the 3591 known species of mosquitoes (order Diptera; family Culicidae) [3], only a fraction are involved in disease transmission or cause considerable nuisance to human and animal populations. These include invasive species that are spreading throughout Europe owing to globalization and climate change [2,4].
There are five mosquito vectors of primary concern in Europe: four Aedes invasive mosquitoes (AIMs) and the native Culex pipiens (northern house mosquito; NCBI:txid7175). Obtaining field information with traditional mosquito surveillance tools is notoriously costly and time-consuming, and a major drawback of these tools is that they lack scalability.
Costs can be significantly reduced by combining citizen science approaches with traditional ones for targeted surveillance [20,21], and using big data spatial modelling techniques to produce risk maps of vector presence and abundance, human-vector interactions, and disease transmission zones at local or regional scales [22,23]. Citizen science and the use of Entomological Network on a private web-based platform, the digital Entolab. In addition to these species of interest, expert entomologists also identify other species of mosquitoes (not targeted) and even other insect groups. These identifications are also valuable from an educational perspective, as they help citizen scientists understand the differences between targeted and non-targeted mosquitoes/insects. Note that only the five target species of interest are included in the dataset presented here. Since manual inspection of digital images is not a scalable option, the Mosquito Alert database of expert-validated images has been used to train a deep learning model to find Ae. albopictus [30] and the other target species (work in progress). This artificial intelligence system will not only be a helpful pre-selector for the expert validation process but also an automated classifier giving quick feedback to the app participants, which is expected to contribute to long-term motivation.
In this dataset we must differentiate between two periods: 2014 to August 2020, and September 2020 to 2021. In 2014-2020 the project was mainly focused on Spain and funded by various national sources. Therefore, most of the reports are from Spain.
During this period, the system targeted only two invasive species: Ae. albopictus and Ae. Alert was carried out in combination with pan-European harmonized field entomological sampling (AIMSurv campaigns [36]) under the framework of the AIM-COST Action. Data outputs of these activities are presented separately in this special collection.

Study extent
There are no limitations on the geographic areas from which citizens can participate, so data can be sent from anywhere in the world. Nevertheless, Mosquito Alert's main coverage has been in Spain, with increasing coverage in Europe since 2020, mainly in the Netherlands, Italy, and Hungary (Figure 1). The temporal coverage of the dataset is from June 18, 2014 to September 20, 2021, and its temporal distribution is shown in Figure 2.
In the dataset presented here, only the five target species are included: Ae. albopictus, Ae. aegypti, Ae. japonicus, Ae. koreicus, and Cx. pipiens.
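As a minimal sketch of how a reuser might restrict a download to the scope described above, the Python snippet below filters records by target species and by the stated temporal coverage. It assumes Darwin Core-style column names (`scientificName`, `eventDate`, `countryCode`) and ISO 8601 dates; the function name `in_scope` is hypothetical, and the sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from datetime import date

# The five target species named in the text.
TARGET_SPECIES = {
    "Aedes albopictus",
    "Aedes aegypti",
    "Aedes japonicus",
    "Aedes koreicus",
    "Culex pipiens",
}

# Temporal coverage stated in the text.
COVERAGE_START = date(2014, 6, 18)
COVERAGE_END = date(2021, 9, 20)


def in_scope(row):
    """Return True if a record is a target species within the coverage window."""
    # Assumption: scientificName starts with the binomial (authorship may follow).
    binomial = " ".join(row["scientificName"].split()[:2])
    # Assumption: eventDate starts with an ISO 8601 calendar date (YYYY-MM-DD).
    observed = date.fromisoformat(row["eventDate"][:10])
    return binomial in TARGET_SPECIES and COVERAGE_START <= observed <= COVERAGE_END


# Hypothetical sample rows, invented for illustration only.
SAMPLE_CSV = """scientificName,eventDate,countryCode
Aedes albopictus,2019-07-14,ES
Culex pipiens,2021-05-02,NL
Aedes vexans,2020-08-01,HU
Aedes aegypti,2013-01-01,ES
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept = [row for row in rows if in_scope(row)]
for row in kept:
    print(row["scientificName"], row["eventDate"], row["countryCode"])
```

In this sketch the non-target species and the record dated before June 18, 2014 are dropped, leaving two rows. A real download may use different column names or date formats, so the two parsing assumptions should be checked against the actual file.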

Sampling
There is no predetermined sampling frequency: participants can send as much data as they like, wherever and whenever they choose. Data sampling may be more intense in some periods owing to dissemination events (e.g. project appearances on TV, at science fairs, etc.) but is also naturally modulated by mosquito seasonal prevalence and activity patterns.

Method steps
There are typically five steps to build an occurrence record:
1 An anonymous citizen scientist observes an adult mosquito (dead or alive).
4 Photographs attached to the report are evaluated independently by three entomologists. Each expert assigns a label to the report, indicating their degree of certainty as to whether the photographs show the target species. A "not sure" label is used if an expert is not able to classify a report. A report is flagged if, for any reason, the report needs further discussion or should be temporarily omitted from the public view. The final taxonomic classification comes from an average of the three expert validations.
5 The report is released into the public domain after validation by the three entomologists, and is reviewed by a senior entomologist who also checks flagged reports.
Citizen scientists can include several pictures of the same specimen in a single report, so one of the three experts is responsible for choosing the final image released to the public domain (public map), which is the one published in the GBIF dataset. The selection criterion is to choose the mosquito image that best represents the observation, or the one most useful for species determination.

DATA VALIDATION AND QUALITY CONTROL
The Digital Entomology Network comprises several experts, including the so-called   rounding half-down strategy implies a conservative approach in the certainty evaluation: if one of the experts expresses doubt, the overall value is decreased.
The validation procedure allows an expert to label a report as 'not sure' when the pictures contain insufficient information. Those records are not included in the current dataset, since only confirmed or probable mosquito records are valid occurrences. For each record, the expert entomologists who reviewed it are cited by name or by a group label (e.g. institution, team name, etc.). The 'anonymous expert' label is assigned to experts who wish to remain anonymous.
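The averaging and half-down rounding described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the integer scale (1 = probable, 2 = confirmed) and the function names `round_half_down` and `aggregate_certainty` are hypothetical, since the exact scoring scheme is not given in this section. The sketch only shows the rounding rule itself, in which an exact tie at .5 resolves to the lower, more conservative value.

```python
import math
from fractions import Fraction


def round_half_down(value):
    """Round to the nearest integer, sending exact .5 ties to the lower integer."""
    return math.ceil(value - Fraction(1, 2))


def aggregate_certainty(scores):
    """Average expert scores and round half down (ties resolve conservatively).

    Assumption: scores are integers on a hypothetical scale
    (1 = probable, 2 = confirmed); the real scale may differ.
    """
    if not scores:
        raise ValueError("at least one expert score is required")
    mean = Fraction(sum(scores), len(scores))  # exact arithmetic, no float error
    return round_half_down(mean)


print(aggregate_certainty([2, 2, 2]))  # all experts confirm
print(aggregate_certainty([2, 1]))     # exact tie at 1.5 resolves downward
print(aggregate_certainty([1, 1, 2]))  # mean 4/3 rounds to the nearest integer
```

Exact fractions are used so that a tie is detected reliably rather than being blurred by floating-point error. Note that with integer scores, an exact .5 tie can only arise when an even number of scores is averaged, so how the rule interacts with three validations depends on details of the scale not stated here.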

REUSE POTENTIAL
This dataset and the citizen science system that produced it can serve many entomological (vector) surveillance and management objectives. Firstly, owing to its scalability and large networking capability, Mosquito Alert can be used as an early warning system (EWS) to detect invasive species at scales ranging from the city to the continent. At local scales, these types of data can help optimize vector control, as citizen scientists provide information about nuisance and presence of mosquitoes in almost real time. Mosquito reduction campaigns may combine top-down strategies of mosquito (larvae) control (undertaken by public health agencies) with bottom-up strategies promoting social action and behavioral change to reduce the proliferation of domestic and peri-domestic breeding sites. Secondly, if combined with other data sources, these data can be used to make risk assessments, such as the characterization of critical areas and seasonal variability in disease transmission risk. They can also be used for data augmentation and calibration in mosquito distribution models of seasonal and inter-annual patterns, as well as in spatial suitability maps. Thirdly, the associated images contribute to efforts to train machine-learning models for image flow optimization procedures in digital-based EWS and for mosquito detection and classification.

DATA AVAILABILITY
The dataset described here is hosted in the GBIF-Spain repository [38]. The associated multimedia dataset (mosquito pictures) is available on the BioImage Archive repository [39].

EDITOR'S NOTE
This paper is part of a series of Data Release articles working with GBIF and supported by the Special Program for Research and Training in Tropical Diseases (TDR), hosted at the World Health Organization [40].

ETHICAL APPROVAL
This dataset involves human participation through a mobile phone app from which citizen scientists send text and image data. Participants must accept the Mosquito Alert User Agreement [41] in order to use the app, and participation is anonymous.

COMPETING INTERESTS
The authors declare that they have no competing interests.