Introduction
Socioeconomic data play an important role in infectious disease research by providing insights into how socioeconomic factors such as income, education, and occupation affect the spread, prevention, and treatment of diseases. Socioeconomic data can help identify vulnerable populations, reveal patterns of disease transmission, and assess the effectiveness of interventions (Khalatbari-Soltani et al., 2020).
Quality control of socioeconomic data in the context of infectious diseases is crucial for ensuring that such data can be used reliably for research and to generate evidence that can feed into policymaking. Quality control measures should be implemented throughout the entire data lifecycle, from data collection (or generation) to storage, analysis, sharing, and re-use. This requires a comprehensive and systematic approach that addresses, among other things, representativeness, clarity, timeliness, error detection, interoperability, and continuous monitoring.
Quality dimensions
Quality dimension | Description |
---|---|
Accuracy | The extent to which data accurately reflect the real world and are free from observational biases and measurement errors. |
Completeness | The extent to which missing or incomplete data occur. |
Timeliness | The extent to which data are collected in (near) real-time and are available in a timely manner for their use; the degree to which data are up-to-date. |
Validity | The extent to which data meet pre-set criteria (i.e. validation rules). |
Relevance | The extent to which data satisfy users' needs. |
Integrity | The extent to which data can be traced and connected to other data. |
Consistency | The extent to which a dataset aligns or is uniform with other datasets. |
Representativeness | The extent to which data reflect real-world population characteristics and can be generalised to a target population. |
Clarity | The ease with which data consumers can understand the metadata. |
Currency | The extent to which data have real-time value. |
Uniqueness | The extent to which a dataset is free from duplicate data. |
A review of data quality dimensions for Big Data can be found in Ridzuan et al. (2024).
Considerations
Below are considerations that must be taken into account when working with socioeconomic data, along with the quality dimensions to which they relate.
Define the target population
- Quality dimension: Representativeness
- It is essential that the sampling population accurately represents the demographics most affected by the disease. This includes accounting for age, gender, socioeconomic status, geographical location, and other relevant factors, to ensure that the findings can be generalised to the wider population at risk.
Define the collection strategy
- Quality dimension: Clarity and Relevance
- When data are collected using surveys, it is vital to design questions that are relevant and easily understood by participants. The questions should be straightforward, culturally appropriate, and tailored to capture the necessary socioeconomic variables that impact, or are impacted by, the infectious disease in question. Additionally, when harvesting data from existing databases, it is crucial to ensure that the data extracted are pertinent and clearly defined. This involves understanding the database schema, ensuring compatibility with the research objectives, and verifying the accuracy and relevance of the socioeconomic variables included. The data extraction process should be well-documented and consistent, ensuring that the harvested data maintain their integrity and usefulness for the intended analysis.
Timeliness
- Quality dimension: Currency and Relevance
- Socioeconomic data related to infectious diseases must be current and collected promptly to reflect the rapidly changing dynamics of disease spread and its socioeconomic impacts, especially during an outbreak. Prompt data collection enables timely interventions and policy responses.
Missing values
- Quality dimension: Completeness and population coverage
- Strategies should be applied to address any gaps in data, and to ensure coverage of all relevant sub-populations.
Key factors impacting data quality
Technical aspects
- Lack of standardized data and metadata formats
- Absence of technical solutions
- Lack of detailed information for specific searches
- Semantics: terminology variations
- Unstructured data
- Challenges with patient identification and linkage with other data sources
Motivation
- Lack of incentives to use evidence in public health decision-making
- Lack of communication of benefits
Economic aspects, resources
- “Lack of investments in people, infrastructure, and organizational processes for collecting, storing, analyzing, and sharing data”
- Human resources
- Insufficient qualified and motivated workforce
- High workload
- Lack of supervision
Political aspects
- Lack of clear policies/regulations
- Uncertainty about the role of data owners
- Conflicts of interests
Legal, ethical barriers
- Privacy and data protection
- Data minimization
A review of data quality measures in health research is provided by Andrade et al. (2023).
Approaches to assess and improve data quality
Pilot tests
Conduct pilot tests or preliminary surveys with a subset of the target population to ensure that the survey design works effectively. This helps to identify potential issues with the clarity, relevance, and understanding of the questions before full-scale data collection.
When the data are harvested from existing databases, conduct pilot extractions to verify the data retrieval process. This involves testing the extraction protocols on a small scale to ensure that the data pulled from databases are accurate, relevant, and compatible with the study’s objectives. Pilot testing the data extraction helps to identify and address any discrepancies or issues in data format, completeness, and alignment with the research questions.
Ensuring that you extract the correct data from the database(s) in question helps with representativeness, as you can assess whether data collection methods effectively reach all segments of the population. Examining population coverage helps to identify under-served or marginalised groups whose experiences may be underrepresented in the data, thus enabling more equitable decision-making.
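As an illustration of such a coverage check, the sketch below compares the demographic composition of a pilot sample against reference population proportions (e.g. from census data) using a chi-square goodness-of-fit test. The file name, the `age_group` column, and the reference shares are hypothetical placeholders.

```python
# Sketch: compare the demographic make-up of a pilot sample against
# reference population proportions with a chi-square goodness-of-fit test.
# File name, column name, and reference shares are hypothetical.
import pandas as pd
from scipy.stats import chisquare

pilot = pd.read_csv("pilot_survey.csv")  # assumed pilot survey export

# Hypothetical reference shares for an 'age_group' variable (must sum to 1)
reference_shares = {"0-17": 0.20, "18-39": 0.30, "40-64": 0.32, "65+": 0.18}

observed = pilot["age_group"].value_counts()
observed = observed.reindex(reference_shares.keys(), fill_value=0)
expected = pd.Series(reference_shares) * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Sample composition deviates from the reference population; "
          "check coverage of under-represented groups.")
```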
Timeliness
Implement efficient data collection protocols that ensure rapid access to, and sharing of, data. Real-time or near-real-time data collection and processing systems are critical for monitoring the socioeconomic impacts of infectious diseases and for timely decision-making.
Error and inconsistency detection
Perform regular error detection and data cleaning processes. This includes identifying outliers, inconsistencies, and missing data, and then applying corrections accordingly, to improve the overall quality and reliability of the data.
Outliers (i.e. data points that significantly differ from other observations) must be identified. It is key to distinguish between outliers caused by erroneous data (e.g. errors in data entry or measurement) and those caused by real, but perhaps rare, events, as they must be handled differently. Tests such as the Z-score can be used for outlier detection, but the Z-score assumes approximately normally distributed data, so a transformation (such as a log transformation) is often necessary; the interquartile range (IQR) rule is more robust to skewed distributions. To assess whether an outlier reflects a true rare event, it is important to check related variables from the same samples (multivariate outlier detection). Additionally, consulting domain experts and cross-referencing with other datasets can help validate an outlier's authenticity. An overview of outlier detection methods is provided by Boukerche et al. (2020).
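The following minimal sketch illustrates univariate outlier flagging with the Z-score and IQR rules using pandas and NumPy. The file and column names are hypothetical placeholders, and flagged records are meant to be reviewed rather than automatically removed.

```python
# Sketch: flag potential outliers in a numeric socioeconomic variable using
# a Z-score rule (after a log transform) and the IQR rule.
# The file name and the 'household_income' column are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("socioeconomic_data.csv")
income = df["household_income"].dropna()

# Z-score on log-transformed values (the log helps with right-skewed income data)
log_income = np.log1p(income)
z_scores = (log_income - log_income.mean()) / log_income.std()
z_outliers = income[np.abs(z_scores) > 3]

# IQR rule on the raw values
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(f"Z-score flags: {len(z_outliers)}, IQR flags: {len(iqr_outliers)}")
# Flagged records should be reviewed (e.g. against related variables or with
# domain experts) before deciding whether they are errors or rare events.
```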
Some methods/tools for detecting errors and inconsistencies are:
- Data Visualisation Tools:
  - Missing data visualisation: R package naniar
  - Python libraries: matplotlib, seaborn, missingno
- Statistical Analysis Software:
  - Data diagnosis: R package dlookr
  - Data validation: R packages validate and data.validator
  - Identify duplicates: R package dplyr
  - Test missingness patterns: R package naniar
  - Outlier detection: R package outliers; Python libraries pandas, NumPy, pyod
- Develop a Data Quality Assessment Framework (DQAF) or implement an already existing one
- Cross-Validation Techniques:
  - Scikit-learn cross-validation (Python)
An overview of R packages for quality assessment is provided by Mariño et al. (2022).
Data cleaning
Data cleaning procedures are essential for ensuring the accuracy and reliability of socioeconomic data. This involves identifying and rectifying errors, inconsistencies, missing data, and outliers that may arise during data collection or entry. Cleaning procedures may include de-duplication to remove redundant entries, standardisation of formats to ensure uniformity, and imputation techniques to address missing values. Existing software, such as Python libraries (e.g. pandas or dedupe), can be used for data cleaning.
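The sketch below illustrates basic cleaning steps with pandas (de-duplication and standardisation of formats). The file and column names are hypothetical placeholders, and fuzzy de-duplication would require a dedicated tool such as dedupe.

```python
# Sketch of basic cleaning steps with pandas: de-duplication and
# standardisation of formats. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate records (fuzzy duplicates would need e.g. 'dedupe')
df = df.drop_duplicates()

# Standardise categorical codes and string formats
df["employment_status"] = (
    df["employment_status"].str.strip().str.lower()
    .replace({"unemployd": "unemployed"})  # example typo correction
)

# Standardise date formats; invalid entries become NaT for later review
df["interview_date"] = pd.to_datetime(df["interview_date"], errors="coerce")

print(df.isna().sum())  # remaining gaps to be handled by imputation
```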
Some tools for data cleaning:
- Statistical Analysis Software:
  - Remove duplicates: Python libraries pandas, dedupe
  - Handling missing data:
    - Multiple imputation: R package mice (Multivariate Imputation by Chained Equations); a Python analogue is sketched below
    - R package missMethods
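The R package mice listed above performs multiple imputation by chained equations. As a rough Python analogue, the hedged sketch below uses scikit-learn's IterativeImputer, which applies a similar chained-equations idea but returns a single completed dataset by default. The file and column names are hypothetical placeholders.

```python
# Sketch: chained-equations-style imputation in Python with scikit-learn's
# IterativeImputer (a single-imputation analogue of the R 'mice' approach).
# File and column names are hypothetical.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("survey_responses.csv")
numeric_cols = ["household_income", "household_size", "years_of_education"]

imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Note: unlike mice, this returns one completed dataset; for proper multiple
# imputation, repeat with sample_posterior=True and different random states,
# then pool the analysis results.
```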
Interoperability
- Semantic harmonisation: Use standard codes and terminologies for variables to ensure consistency across different data sources and studies (a minimal recoding sketch is shown after this list).
- Syntactic harmonisation: Adopt standardised data formats to facilitate data integration and comparability across various datasets.
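As a minimal illustration of these two steps, the sketch below recodes a local education variable to ISCED 2011 levels (semantic harmonisation) and writes the result to an agreed exchange format (syntactic harmonisation). The local category labels, column names, and file names are hypothetical placeholders.

```python
# Sketch: semantic harmonisation by recoding a local education variable to
# ISCED 2011 levels, plus export to a shared format. Labels and file names
# are hypothetical.
import pandas as pd

local_to_isced = {
    "primary school": "ISCED 1",
    "lower secondary": "ISCED 2",
    "upper secondary": "ISCED 3",
    "bachelor or equivalent": "ISCED 6",
    "master or equivalent": "ISCED 7",
}

df = pd.read_csv("local_dataset.csv")
df["education_isced"] = df["education_local"].map(local_to_isced)

# Unmapped categories are flagged for manual review rather than silently dropped
unmapped = df.loc[df["education_isced"].isna(), "education_local"].unique()
print("Categories needing review:", unmapped)

df.to_csv("harmonised_dataset.csv", index=False)  # agreed exchange format
```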
Statistical tests for data quality
Conduct statistical tests to assess data reliability and validity. Reliability tests assess the consistency of the data over time, ensuring that data trends are dependable and not subject to random fluctuations. Validity tests, on the other hand, examine the accuracy of the data in measuring what they are intended to measure. For instance, in the context of infectious diseases, validity tests might assess whether socioeconomic variables accurately reflect the true impact of the disease on different demographics.
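One common reliability check is internal consistency across related survey items, for example Cronbach's alpha. The sketch below computes it directly with pandas; the item names and file name are hypothetical placeholders, and the usual caveats about interpreting alpha apply.

```python
# Sketch: a simple internal-consistency (Cronbach's alpha) check for a set
# of related survey items, as one possible reliability test.
# Item names and file name are hypothetical.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a DataFrame with one column per item."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

df = pd.read_csv("survey_responses.csv")
deprivation_items = df[["item_housing", "item_food_security", "item_utilities"]]
print(f"Cronbach's alpha: {cronbach_alpha(deprivation_items):.2f}")
# Values around 0.7 or higher are often taken to indicate acceptable consistency.
```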
Check completeness
Completeness is the dimension that takes into account the number of variables in a given data model that have actual values. This involves scrutinising datasets for missing values, ensuring that all relevant variables are captured.
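A per-variable completeness report is straightforward to produce with pandas, as in the sketch below; the file name and the 80% threshold are hypothetical placeholders.

```python
# Sketch: per-variable completeness report with pandas.
# File name and threshold are hypothetical.
import pandas as pd

df = pd.read_csv("socioeconomic_data.csv")
completeness = (1 - df.isna().mean()).sort_values()
print(completeness.round(2))  # share of non-missing values per variable

# Flag variables falling below an agreed completeness threshold (e.g. 80%)
print("Below threshold:", completeness[completeness < 0.80].index.tolist())
```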
Continuous monitoring
Implement ongoing monitoring mechanisms to continuously assess data quality. This includes tracking data collection processes, identifying and addressing data quality issues promptly, and updating data collection protocols as needed. This entails establishing protocols for ongoing assessment of data collection processes, automated checks for anomalies or inconsistencies, and regular audits to verify adherence to quality control procedures.
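A minimal example of such an automated check is sketched below: a small set of rules that could be run on each new data delivery, for instance from a scheduled job. The rules, thresholds, column names, and file name are hypothetical placeholders; in practice, dedicated validation tools (e.g. the validate or data.validator R packages listed above) offer richer functionality.

```python
# Sketch: a minimal automated quality check that could be run on each new
# data delivery. Rules, thresholds, column names, and file name are hypothetical.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    return {
        "no_duplicates": not df.duplicated().any(),
        "income_complete": df["household_income"].notna().mean() >= 0.8,
        "dates_in_range": df["interview_date"].between(
            pd.Timestamp("2020-01-01"), pd.Timestamp("2030-12-31")
        ).all(),
    }

df = pd.read_csv("latest_delivery.csv", parse_dates=["interview_date"])
results = run_quality_checks(df)
failed = [name for name, passed in results.items() if not passed]
if failed:
    print("Failed checks:", failed)
else:
    print("All checks passed.")
```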
Regular communication with data sources
Establishing regular communication channels with data sources fosters a collaborative approach to quality control. This enables swift resolution of any issues or discrepancies identified during quality checks, as stakeholders can promptly provide clarification or rectify errors. Additionally, ongoing dialogue with data sources facilitates mutual understanding of data requirements and collection methodologies, allowing for adjustments to be made in real-time to improve data quality.
Quality labels
The QUANTUM project is an EU-funded project (2024-2026) that aims to create a common label system for Europe that guarantees the quality and utility of datasets for scientific and health innovation purposes. This label system will enable researchers, policymakers, and healthcare professionals to identify high-quality data for research and decision-making. The project's repository will compile data quality management and quality assurance methods, pipelines, open-source tools, and experiences (including case studies), initially from QUANTUM consortium partners but also from the wider community and other related initiatives. Feel free to provide comments and/or relevant content to be added there!
More information
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
data.validator R package | Validate dataset by columns and rows using convenient predicates inspired by the 'assertr' package. | | |
dedupe | Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data. | | Training |
dlookr | A collection of tools that support data diagnosis, exploration, and transformation. | | |
dplyr | dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. | Socioeconomic data | Training |
matplotlib | Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. | Human biomolecular data | Tool info, Training |
mice | Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn (2011). | | |
missingno | missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. | | |
missMethods | Supplies functions for the creation and handling of missing data, as well as tools to evaluate missing data methods. | | |
naniar | The naniar package is used for exploring missing data structures with minimal deviation from the common workflows of ggplot and tidy data (Wickham, 2014; Wickham, 2009). | | |
NumPy | Python library for scientific computing. | Socioeconomic data | Tool info, Training |
outliers | A collection of some tests commonly used for identifying outliers. | | |
pandas | Open source data analysis and manipulation tool, built on top of the Python programming language. | Socioeconomic data | Tool info, Training |
pyod | A comprehensive but easy-to-use Python library for detecting anomalies in multivariate data. | | |
Scikit-learn | Machine learning tools in Python. | Socioeconomic data | Tool info, Training |
seaborn | Python data visualization library that provides a high-level interface for drawing attractive and informative statistical graphics. | | Training |
validate R package | Declare data validation rules and data quality indicators; confront data with them and analyze or visualize the results. | | |