Introduction
Socioeconomic data play an important role in infectious disease research by providing insights into how socioeconomic factors such as income, education, and occupation affect the spread, prevention, and treatment of diseases. Socioeconomic data can help identify vulnerable populations, reveal patterns of disease transmission, and assess the effectiveness of interventions (Khalatbari-Soltani et al., 2020).
Quality control of socioeconomic data in the context of infectious diseases is crucial for ensuring that such data can be used reliably for research and to generate evidence that can feed into policymaking. Quality control measures should be implemented throughout the entire data lifecycle, from data collection (or generation) to storage, analysis, sharing, and re-use. This requires a comprehensive and systematic approach that addresses, among other things, representativeness, clarity, timeliness, error detection, interoperability, and continuous monitoring.
Quality dimensions
Quality dimension | Description |
---|---|
Accuracy | The extent to which data accurately reflect the real world and are free from observational biases and measurement errors. |
Completeness | The extent to which missing or incomplete data occur. |
Timeliness | The extent to which data are collected in (near) real-time and are available in a timely manner for their use; the degree to which data are up-to-date. |
Validity | The extent to which data meet pre-set criteria (i.e. validation rules). |
Relevance | The extent to which data satisfy users' needs. |
Integrity | The extent to which data can be traced and connected to other data. |
Consistency | The extent to which a dataset aligns or is uniform with other datasets. |
Representativeness | The extent to which data reflect real-world population characteristics and can be generalised to a target population. |
Clarity | The ease with which data consumers can understand the metadata. |
Currency | The extent to which data have real-time value. |
Uniqueness | The extent to which a dataset is free from duplicate data. |
A review of data quality dimensions for Big Data can be found in Ridzuan et al. (2024).
Considerations
Below are considerations that must be taken into account when working with socioeconomic data, along with the quality dimensions to which they relate.
Define the target population
- Quality dimension: Representativeness
- It is essential that the sampling population accurately represents the demographics most affected by the disease. This includes accounting for age, gender, socioeconomic status, geographical location, and other relevant factors, to ensure that the findings can be generalised to the wider population at risk.
Define the collection strategy
- Quality dimension: Clarity and Relevance
- When data are collected using surveys, it is vital to design questions that are relevant and easily understood by participants. The questions should be straightforward, culturally appropriate, and tailored to capture the necessary socioeconomic variables that impact, or are impacted by, the infectious disease in question. Additionally, when harvesting data from existing databases, it is crucial to ensure that the data extracted are pertinent and clearly defined. This involves understanding the database schema, ensuring compatibility with the research objectives, and verifying the accuracy and relevance of the socioeconomic variables included. The data extraction process should be well-documented and consistent, ensuring that the harvested data maintain their integrity and usefulness for the intended analysis.
Timeliness
- Quality dimension: Currency and Relevance
- Socioeconomic data related to infectious diseases must be current and collected promptly to reflect the rapidly changing dynamics of disease spread and its socioeconomic impacts, especially during an outbreak. Prompt data collection enables timely interventions and policy responses.
Missing values
- Quality dimension: Completeness and population coverage
- Strategies should be applied to address any gaps in data, and to ensure coverage of all relevant sub-populations.
Key factors impacting data quality
Technical aspects
- Lack of standardized data and metadata formats
- Absence of technical solutions
- Lack of detailed information for specific searches
- Semantics: terminology variations
- Unstructured data
- Challenges with patient identification and linkage with other data sources
Motivation
- Lack of incentives to use evidence in public health decision-making
- Lack of communication of benefits
Economic aspects, resources
- “Lack of investments in people, infrastructure, and organizational processes for collecting, storing, analyzing, and sharing data”
- Human resources
- Insufficient qualified and motivated workforce
- High workload
- Lack of supervision
Political aspects
- Lack of clear policies/regulations
- Uncertainty about the role of data owners
- Conflicts of interests
Legal, ethical barriers
- Privacy and data protection
- Data minimization
A review of data quality measures in health research is provided by Andrade et al. (2023).
Approaches to assess and improve data quality
Pilot tests
Conduct pilot tests or preliminary surveys with a subset of the target population to ensure that the survey design works effectively. This helps to identify potential issues with the clarity, relevance, and understanding of the questions before full-scale data collection.
When the data are harvested from existing databases, conduct pilot extractions to verify the data retrieval process. This involves testing the extraction protocols on a small scale to ensure that the data pulled from databases are accurate, relevant, and compatible with the study’s objectives. Pilot testing the data extraction helps to identify and address any discrepancies or issues in data format, completeness, and alignment with the research questions.
Ensuring that you extract the correct data from the database(s) in question helps with representativeness, as you can assess whether data collection methods effectively reach all segments of the population. Examining population coverage helps to identify under-served or marginalised groups whose experiences may be underrepresented in the data, thus enabling more equitable decision-making.
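As an illustration of such a coverage check, the sketch below compares the demographic composition of a pilot sample against reference population proportions (e.g. from census data) using a chi-square goodness-of-fit test. The file name, the `age_group` column, and the reference shares are hypothetical placeholders.

```python
# Sketch: compare the demographic make-up of a pilot sample against
# reference population proportions with a chi-square goodness-of-fit test.
# File name, column name, and reference shares are hypothetical.
import pandas as pd
from scipy.stats import chisquare

pilot = pd.read_csv("pilot_survey.csv")  # assumed pilot survey export

# Hypothetical reference shares for an 'age_group' variable (must sum to 1)
reference_shares = {"0-17": 0.20, "18-39": 0.30, "40-64": 0.32, "65+": 0.18}

observed = pilot["age_group"].value_counts()
observed = observed.reindex(reference_shares.keys(), fill_value=0)
expected = pd.Series(reference_shares) * observed.sum()

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Sample composition deviates from the reference population; "
          "check coverage of under-represented groups.")
```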
Timeliness
Implement efficient data collection protocols that ensure rapid access to, and sharing of, data. Real-time or near-real-time data collection and processing systems are critical for monitoring the socioeconomic impacts of infectious diseases and for timely decision-making.
Error and inconsistency detection
Perform regular error detection and data cleaning processes. This includes identifying outliers, inconsistencies, and missing data, and then applying corrections accordingly, to improve the overall quality and reliability of the data.
Outliers (i.e. data points that significantly differ from other observations) must be identified. It is key to distinguish between outliers caused by erroneous data (e.g. errors in data entry or measurement) and those caused by real, but perhaps rare, events, as they must be handled differently. Tests such as the Z-score can be used for outlier detection, but the Z-score assumes approximately normally distributed data, so a transformation (such as a log transformation) is often necessary; the interquartile range (IQR) rule is more robust to skewed distributions. To assess whether an outlier reflects a true rare event, it is important to check related variables from the same samples (multivariate outlier detection). Additionally, consulting domain experts and cross-referencing with other datasets can help validate an outlier's authenticity. An overview of outlier detection methods is provided by Boukerche et al. (2020).
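The following minimal sketch illustrates univariate outlier flagging with the Z-score and IQR rules using pandas and NumPy. The file and column names are hypothetical placeholders, and flagged records are meant to be reviewed rather than automatically removed.

```python
# Sketch: flag potential outliers in a numeric socioeconomic variable using
# a Z-score rule (after a log transform) and the IQR rule.
# The file name and the 'household_income' column are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("socioeconomic_data.csv")
income = df["household_income"].dropna()

# Z-score on log-transformed values (the log helps with right-skewed income data)
log_income = np.log1p(income)
z_scores = (log_income - log_income.mean()) / log_income.std()
z_outliers = income[np.abs(z_scores) > 3]

# IQR rule on the raw values
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(f"Z-score flags: {len(z_outliers)}, IQR flags: {len(iqr_outliers)}")
# Flagged records should be reviewed (e.g. against related variables or with
# domain experts) before deciding whether they are errors or rare events.
```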
Some methods/tools for detecting errors and inconsistencies are:
- Data Visualisation Tools:
  - Missing data visualisation: R package naniar
  - Python libraries: matplotlib, seaborn, missingno
- Statistical Analysis Software:
  - Data diagnosis: R package dlookr
  - Data validation: R packages validate and data.validator
  - Identify duplicates: R package dplyr
  - Test missingness patterns: R package naniar
  - Outlier detection: R package outliers; Python libraries pandas, NumPy, pyod
- Develop a Data Quality Assessment Framework (DQAF) or implement an already existing one
- Cross-Validation Techniques:
  - Scikit-learn cross-validation (Python)
An overview of R packages for quality assessment is provided by Mariño et al. (2022).
Data cleaning
Data cleaning procedures are essential for ensuring the accuracy and reliability of socioeconomic data. This involves identifying and rectifying errors, inconsistencies, missing data, and outliers that may arise during data collection or entry. Cleaning procedures may include de-duplication to remove redundant entries, standardisation of formats to ensure uniformity, and imputation techniques to address missing values. Existing software, such as Python libraries (e.g. pandas or dedupe), can be used for data cleaning.
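The sketch below illustrates basic cleaning steps with pandas (de-duplication and standardisation of formats). The file and column names are hypothetical placeholders, and fuzzy de-duplication would require a dedicated tool such as dedupe.

```python
# Sketch of basic cleaning steps with pandas: de-duplication and
# standardisation of formats. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# Remove exact duplicate records (fuzzy duplicates would need e.g. 'dedupe')
df = df.drop_duplicates()

# Standardise categorical codes and string formats
df["employment_status"] = (
    df["employment_status"].str.strip().str.lower()
    .replace({"unemployd": "unemployed"})  # example typo correction
)

# Standardise date formats; invalid entries become NaT for later review
df["interview_date"] = pd.to_datetime(df["interview_date"], errors="coerce")

print(df.isna().sum())  # remaining gaps to be handled by imputation
```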
Some tools for data cleaning:
- Statistical Analysis Software:
  - Remove duplicates: Python libraries pandas, dedupe
  - Handling missing data:
    - Multiple imputation: R package mice (Multivariate Imputation by Chained Equations); a Python analogue is sketched below
    - R package missMethods
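The R package mice listed above performs multiple imputation by chained equations. As a rough Python analogue, the hedged sketch below uses scikit-learn's IterativeImputer, which applies a similar chained-equations idea but returns a single completed dataset by default. The file and column names are hypothetical placeholders.

```python
# Sketch: chained-equations-style imputation in Python with scikit-learn's
# IterativeImputer (a single-imputation analogue of the R 'mice' approach).
# File and column names are hypothetical.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("survey_responses.csv")
numeric_cols = ["household_income", "household_size", "years_of_education"]

imputer = IterativeImputer(max_iter=10, random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Note: unlike mice, this returns one completed dataset; for proper multiple
# imputation, repeat with sample_posterior=True and different random states,
# then pool the analysis results.
```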
Interoperability
- Semantic harmonisation: Use standard codes and terminologies for variables to ensure consistency across different data sources and studies (a minimal recoding sketch is shown after this list).
- Syntactic harmonisation: Adopt standardised data formats to facilitate data integration and comparability across various datasets.
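As a minimal illustration of these two steps, the sketch below recodes a local education variable to ISCED 2011 levels (semantic harmonisation) and writes the result to an agreed exchange format (syntactic harmonisation). The local category labels, column names, and file names are hypothetical placeholders.

```python
# Sketch: semantic harmonisation by recoding a local education variable to
# ISCED 2011 levels, plus export to a shared format. Labels and file names
# are hypothetical.
import pandas as pd

local_to_isced = {
    "primary school": "ISCED 1",
    "lower secondary": "ISCED 2",
    "upper secondary": "ISCED 3",
    "bachelor or equivalent": "ISCED 6",
    "master or equivalent": "ISCED 7",
}

df = pd.read_csv("local_dataset.csv")
df["education_isced"] = df["education_local"].map(local_to_isced)

# Unmapped categories are flagged for manual review rather than silently dropped
unmapped = df.loc[df["education_isced"].isna(), "education_local"].unique()
print("Categories needing review:", unmapped)

df.to_csv("harmonised_dataset.csv", index=False)  # agreed exchange format
```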
Statistical tests for data quality
Conduct statistical tests to assess data reliability and validity. Reliability tests assess the consistency of the data over time, ensuring that data trends are dependable and not subject to random fluctuations. Validity tests, on the other hand, examine the accuracy of the data in measuring what they are intended to measure. For instance, in the context of infectious diseases, validity tests might assess whether socioeconomic variables accurately reflect the true impact of the disease on different demographics.
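One common reliability check is internal consistency across related survey items, for example Cronbach's alpha. The sketch below computes it directly with pandas; the item names and file name are hypothetical placeholders, and the usual caveats about interpreting alpha apply.

```python
# Sketch: a simple internal-consistency (Cronbach's alpha) check for a set
# of related survey items, as one possible reliability test.
# Item names and file name are hypothetical.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a DataFrame with one column per item."""
    items = items.dropna()
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

df = pd.read_csv("survey_responses.csv")
deprivation_items = df[["item_housing", "item_food_security", "item_utilities"]]
print(f"Cronbach's alpha: {cronbach_alpha(deprivation_items):.2f}")
# Values around 0.7 or higher are often taken to indicate acceptable consistency.
```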
Check completeness
Completeness is the dimension that takes into account the number of variables in a given data model that have actual values. This involves scrutinising datasets for missing values, ensuring that all relevant variables are captured.
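A per-variable completeness report is straightforward to produce with pandas, as in the sketch below; the file name and the 80% threshold are hypothetical placeholders.

```python
# Sketch: per-variable completeness report with pandas.
# File name and threshold are hypothetical.
import pandas as pd

df = pd.read_csv("socioeconomic_data.csv")
completeness = (1 - df.isna().mean()).sort_values()
print(completeness.round(2))  # share of non-missing values per variable

# Flag variables falling below an agreed completeness threshold (e.g. 80%)
print("Below threshold:", completeness[completeness < 0.80].index.tolist())
```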
Continuous monitoring
Implement ongoing monitoring mechanisms to continuously assess data quality. This includes tracking data collection processes, identifying and addressing data quality issues promptly, and updating data collection protocols as needed. This entails establishing protocols for ongoing assessment of data collection processes, automated checks for anomalies or inconsistencies, and regular audits to verify adherence to quality control procedures.
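A minimal example of such an automated check is sketched below: a small set of rules that could be run on each new data delivery, for instance from a scheduled job. The rules, thresholds, column names, and file name are hypothetical placeholders; in practice, dedicated validation tools (e.g. the validate or data.validator R packages listed above) offer richer functionality.

```python
# Sketch: a minimal automated quality check that could be run on each new
# data delivery. Rules, thresholds, column names, and file name are hypothetical.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    return {
        "no_duplicates": not df.duplicated().any(),
        "income_complete": df["household_income"].notna().mean() >= 0.8,
        "dates_in_range": df["interview_date"].between(
            pd.Timestamp("2020-01-01"), pd.Timestamp("2030-12-31")
        ).all(),
    }

df = pd.read_csv("latest_delivery.csv", parse_dates=["interview_date"])
results = run_quality_checks(df)
failed = [name for name, passed in results.items() if not passed]
if failed:
    print("Failed checks:", failed)
else:
    print("All checks passed.")
```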
Regular communication with data sources
Establishing regular communication channels with data sources fosters a collaborative approach to quality control. This enables swift resolution of any issues or discrepancies identified during quality checks, as stakeholders can promptly provide clarification or rectify errors. Additionally, ongoing dialogue with data sources facilitates mutual understanding of data requirements and collection methodologies, allowing for adjustments to be made in real-time to improve data quality.
Quality labels
The QUANTUM project is an EU-funded project (2024-2026) that aims to create a common label system for Europe that guarantees the quality and utility of datasets for scientific and health innovation purposes. This label system will enable researchers, policymakers, and healthcare professionals to identify high-quality data for research and decision-making. The project's repository will compile data quality management and quality assurance methods, pipelines, open-source tools, and experiences (including case studies), initially from QUANTUM consortium partners but also from the wider community and other related initiatives. Feel free to provide comments and/or relevant content to be added there!
More information
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
data.validator R package | Validate dataset by columns and rows using convenient predicates inspired by the 'assertr' package. | | |
dedupe | Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data. | | Training |
dlookr | A collection of tools that support data diagnosis, exploration, and transformation. | | |
dplyr | dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. | Socioeconomic data | Training |
matplotlib | Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. | Human biomolecular data | Tool info, Training |
mice | Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn (2011). | | |
missingno | missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. | | |
missMethods | Supplies functions for the creation and handling of missing data, as well as tools to evaluate missing data methods. | | |
naniar | The naniar package is used for exploring missing data structures with minimal deviation from the common workflows of ggplot and tidy data (Wickham, 2014; Wickham, 2009). | | |
NumPy | Python library for scientific computing. | Socioeconomic data | Tool info, Training |
outliers | A collection of some tests commonly used for identifying outliers. | | |
pandas | Open source data analysis and manipulation tool, built on top of the Python programming language. | Socioeconomic data | Tool info, Training |
pyod | A comprehensive but easy-to-use Python library for detecting anomalies in multivariate data. | | |
Scikit-learn | Machine learning tools in Python. | Socioeconomic data | Tool info, Training |
seaborn | Python data visualization library that provides a high-level interface for drawing attractive and informative statistical graphics. | | Training |
validate R package | Declare data validation rules and data quality indicators; confront data with them and analyze or visualize the results. | | |