Introduction
The FAIR principles provide guidelines for making research data (i.e. digital assets) Findable, Accessible, Interoperable, and Reusable (Wilkinson et al., 2016). In infectious diseases research, adhering to these principles is crucial for facilitating data sharing, collaboration, and accelerating progress towards:
- Meta-analyses
- Reliable transmission maps
- Understanding of the spread and evolution of pathogens
- Improved diagnostics, treatments, and vaccines
By ensuring maximal data usability, FAIRness increases the efficiency and impact of your infectious disease research.
Findability
Findability is a crucial aspect of infectious diseases research, as it ensures that relevant data and resources can be easily discovered and located by researchers and other stakeholders.
This is particularly important in the context of infectious diseases, where rapid access to accurate, comprehensive and purpose-specific data is essential for effective outbreak response and disease management.
Moreover, by making infectious disease data more findable, researchers promote the transparency, accountability and reproducibility of infectious diseases research, and avoid duplicating work that has already been done. Ultimately, all of these factors help to build trust among stakeholders and enable better and faster collaboration and knowledge sharing.
Considerations
- Use (globally) unique and persistent identifiers (e.g. biosample:SAMEA6864906) for each of your records, ensuring they are unambiguously resolvable from anywhere in the world.
- Use standard naming conventions for human and disease data (e.g. Brill-Zinsser disease), as well as for taxonomic classifications (e.g. taxonomy:9606 for humans or taxonomy:2697049 for SARS-CoV-2, the virus that causes COVID-19).
- Describe your data with clear variable names, searchable keywords and comprehensive descriptions, choosing field standards where possible. Prioritise the standards of your primary and usual users, but do not forget that metadata may also be used by users new to the field, for whom you can cater with generic and understandable terms. Metadata must be sufficient and appropriate.
- Register your data, as openly and accessibly as possible, in repositories and data portals (e.g. European Genome-phenome Archive (EGA), dbGaP, European Health Information Portal…).
- Make your webpage machine accessible and readable, especially for search engines. You can always check the findability of the data you submitted (e.g. using a new session on a web browser), adjust and correct it if needed.
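As a rough illustration of the identifier advice above, the following Python sketch parses compact identifiers of the form `prefix:accession`, builds a resolvable URL for them and flags duplicates. It assumes that the identifiers can be resolved through the identifiers.org resolver; the function names (`parse_curie`, `resolver_url`, `find_duplicates`) are hypothetical helpers written for this example, not part of any library.

```python
import re

# Assumption: compact identifiers look like "prefix:accession" and are
# resolvable through the identifiers.org resolver.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9._]*):(?P<accession>\S+)$")
RESOLVER_BASE = "https://identifiers.org/"


def parse_curie(curie: str) -> tuple[str, str]:
    """Split a compact identifier into (prefix, accession).

    Raises ValueError when the text is not of the form prefix:accession.
    """
    match = CURIE_PATTERN.match(curie.strip())
    if match is None:
        raise ValueError(f"Not a compact identifier: {curie!r}")
    # Prefixes are compared case-insensitively, accessions are kept as given.
    return match.group("prefix").lower(), match.group("accession")


def resolver_url(curie: str) -> str:
    """Build a URL that a person or a machine can follow to reach the record."""
    prefix, accession = parse_curie(curie)
    return f"{RESOLVER_BASE}{prefix}:{accession}"


def find_duplicates(curies: list[str]) -> set[str]:
    """Return identifiers that occur more than once, which breaks uniqueness."""
    seen, duplicates = set(), set()
    for curie in curies:
        key = ":".join(parse_curie(curie))
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


if __name__ == "__main__":
    print(resolver_url("biosample:SAMEA6864906"))
    print(find_duplicates(["taxonomy:9606", "Taxonomy:9606"]))
```

Running a check like this before submission is a cheap way to catch records that lack a prefix or that reuse an identifier.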
Existing approaches
- It is vital that the data you produce is archived in a permanent archive that allows for controlled distribution, and not only for the years in which your project is active. Examples of archives for human data are the European Genome-phenome Archive (EGA) and dbGaP, whose records are also encompassed by other major frameworks like BioStudies or the COVID-19 Data Portal.
- Take a look at other approaches at the Finding metadata section.
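One common way to make a dataset landing page readable by search engines is to embed a schema.org `Dataset` description as JSON-LD. The sketch below builds such a description in Python; it is only an illustration, and every value in the usage example (name, URL, identifier, keywords) is a hypothetical placeholder rather than a real record.

```python
import json


def dataset_jsonld(name, description, identifier, keywords, url):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        # De-duplicate and sort keywords so the output is stable.
        "keywords": sorted(set(keywords)),
        "url": url,
    }


def as_script_tag(jsonld: dict) -> str:
    """Wrap the description in the script tag that goes into the landing page."""
    payload = json.dumps(jsonld, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    description = dataset_jsonld(
        name="Example outbreak sample metadata",
        description="Sample-level metadata collected during a fictional outbreak study.",
        identifier="https://example.org/datasets/demo-001",
        keywords=["infectious disease", "metadata", "infectious disease"],
        url="https://example.org/datasets/demo-001",
    )
    print(as_script_tag(description))
```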
Accessibility
Accessibility in infectious diseases research is crucial to ensure quick, secure and equitable access to clinical and biomolecular data for all, regardless of the reputation or economic power of researchers and institutions, thus promoting community-driven infectious disease research.
Considerations
- Although the urgency of an infectious disease outbreak may make it tempting to obtain human (meta)data through a back door, privacy and security measures must remain in place when dealing with personal data, and especially with sensitive information:
  - Sensitive (meta)data must be accessible through the authorised (and community approved) procedures of the archives to which they are submitted.
  - You should avoid distributing (meta)data through unsecured channels (e.g. sending metadata of patients in an email to another researcher elsewhere), and instead make use of the procedures already in place at the archives.
- In a fast-moving environment like an infectious disease outbreak, (meta)data may also change rapidly: human errors may occur and changes to (meta)data records may become necessary. Nevertheless, for the sake of traceability, it is crucial that the (meta)data you submit to an archive is not removed, even when the information it contains is technically wrong: alternatives such as record deprecation preserve traceability. It is also important to state when a record has been withdrawn, updated or replaced.
- For all of these aspects, your institution will often do the work for you; you just need to follow its guidance. If no such institutional support exists, turn to national repositories or international disciplinary repositories: they provide efficient support for data deposit.
- Check the accessibility of your data. Your dataset, when submitted to an archive, is likely to have a landing page with several data access services (e.g., download, transcription, contact person, visualisation…). Taking the perspective of someone unfamiliar with the system, try to access the data, see how easy it is, and then adjust and retest. To increase the quality of accessibility, you can also apply the stable W3C Web Content Accessibility Guidelines (WCAG) and use their tools and methods to assess it.
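The advice to deprecate records rather than delete them can be sketched as follows. This is a toy in-memory model, assuming three status values ("active", "deprecated", "withdrawn") chosen for this example; the `Archive` and `Record` classes are hypothetical and do not correspond to the interface of any real archive.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Record:
    identifier: str
    content: dict
    status: str = "active"  # "active", "deprecated" or "withdrawn"
    replaced_by: Optional[str] = None
    history: list = field(default_factory=list)


class Archive:
    """Toy in-memory archive: records are never deleted, only re-labelled."""

    def __init__(self):
        self._records = {}

    def submit(self, identifier, content):
        if identifier in self._records:
            raise ValueError(f"{identifier} already exists; identifiers are never reused")
        record = Record(identifier, dict(content))
        self._log(record, "submitted")
        self._records[identifier] = record
        return record

    def replace(self, old_id, new_id, corrected_content):
        """Deprecate old_id and point it at a corrected record instead of deleting it."""
        old = self._records[old_id]
        new = self.submit(new_id, corrected_content)
        old.status = "deprecated"
        old.replaced_by = new_id
        self._log(old, f"replaced by {new_id}")
        return new

    def withdraw(self, identifier, reason):
        """Mark a record as withdrawn while keeping it, and the reason, on file."""
        record = self._records[identifier]
        record.status = "withdrawn"
        self._log(record, f"withdrawn: {reason}")

    def get(self, identifier):
        return self._records[identifier]

    @staticmethod
    def _log(record, event):
        # Every change is timestamped so that the record's life cycle stays traceable.
        record.history.append((datetime.now(timezone.utc).isoformat(), event))
```

With this design, anyone who cited the old identifier can still retrieve it, see that it is deprecated, and follow `replaced_by` to the corrected record.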
Existing approaches
- There are multiple archives with secure procedures already in place for the distribution of sensitive human information through authentication and granted access. For example, the European Genome-phenome Archive (EGA) has a request and grant method to provide secure ad-hoc access to human datasets.
- Check other use-cases and examples at the Data access section.
Interoperability
Interoperability is essential for infectious diseases research because it enables the integration and analysis of data from different sources (e.g. different hospitals, countries, biological sources, etc.), leading to a more comprehensive understanding of disease transmission, prevention, and treatment.
Without interoperability, data silos may emerge, restricting researchers’ ability to link different datasets and slowing progress in the fight against infectious diseases. Specifically, issues such as synonymy and polysemy can lead to misinterpretation of scientific results and hinder the clear communication of recommendations, which can be crucial, especially during a pandemic.
Considerations
- Provide detailed metadata for infectious disease datasets, including the source, collection date, location, and any performed protocols (e.g. nasal swab being the method of isolation: EFO:0010741). Even when the granularity of the (meta)data varies, you should always use descriptive fields with broadly understandable values.
- Use controlled vocabularies and ontologies to describe human data and infectious diseases (e.g. EFO:0007182 for Brill-Zinsser disease). Furthermore, do not forget contextual data that must meet intercommunity standards, for example: time, temperature, pressure, chemical components…
- Controlled vocabulary refers to a set of terms, standardised by the field community, used to describe and categorise concepts, ensuring consistency and accuracy in data organisation and retrieval. For example, when an infectious disease (e.g., malaria) has multiple names (e.g., Plasmodium infection, jungle fever), it is recommended to use the designated one in the ontologies to minimise redundancy and improve data integration.
- Make use of existing metadata standards, such as the Data Catalog Vocabulary (DCAT) or the Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea (Bowers et al., 2017).
- Structure your data so that it is machine-actionable.
- Your data should include qualified references to other data sources and metadata, which increases the traceability and context of your dataset. This may ultimately be needed for meta-analyses in pandemic situations. References in fields like the source of patient data (e.g. UBERON:0001707), the laboratory that performed the analysis (e.g. including the name of the laboratory, the name of the institution, and its location), or the specific protocol (e.g. sample collection) used, greatly enhance the quality and transparency of the data.
- Verify interoperability by reviewing the definitions of variables and linking them to relevant ontology concepts, so that potential users can assess the compatibility of their data. To achieve this, import the corresponding definitions and consult with your data producers to confirm that the field definitions do not differ significantly from the data you generate. Repeat this check each time you integrate new variables or change variable names or definitions.
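To illustrate why controlled vocabularies help with synonymy, here is a small Python sketch that maps free-text disease names to one preferred label before datasets are combined. The synonym table is hand-written for illustration only (in practice, labels and synonyms would come from the ontology itself), and the only term identifier used is the one quoted in the text above; the identifier for malaria is deliberately left empty rather than guessed.

```python
# Illustrative synonym table; a real one would be derived from an ontology.
SYNONYMS = {
    "malaria": "malaria",
    "plasmodium infection": "malaria",
    "jungle fever": "malaria",
    "brill-zinsser disease": "Brill-Zinsser disease",
}

# Only the identifier quoted in the text is filled in; others must be looked up.
TERM_IDS = {"Brill-Zinsser disease": "EFO:0007182"}


def normalise_disease(free_text: str) -> dict:
    """Map a free-text disease name to its preferred label and, if known, term ID."""
    key = " ".join(free_text.lower().split())  # ignore case and extra whitespace
    if key not in SYNONYMS:
        raise ValueError(f"No controlled term known for {free_text!r}")
    label = SYNONYMS[key]
    return {"label": label, "term_id": TERM_IDS.get(label)}


def group_by_disease(records: list[dict]) -> dict:
    """Group sample names by preferred disease label across data sources."""
    grouped = {}
    for record in records:
        label = normalise_disease(record["disease"])["label"]
        grouped.setdefault(label, []).append(record["sample"])
    return grouped


if __name__ == "__main__":
    # Hypothetical records from two sources that name the same disease differently.
    records = [
        {"sample": "hospital-A-01", "disease": "Jungle fever"},
        {"sample": "hospital-B-07", "disease": "Plasmodium infection"},
    ]
    print(group_by_disease(records))
```

Unknown names raise an error instead of passing through silently, so a new synonym is noticed at integration time and not after the analysis.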
Existing approaches
- For controlled vocabularies and ontologies you can use the Ontology Lookup Service (OLS). This handy service compiles multiple ontologies that you can search at once. Examples of ontologies related to infectious diseases, human data and human diseases are EFO (Experimental Factor Ontology), MONDO (Mondo Disease Ontology), HP (Human Phenotype Ontology), CIDO (Ontology of Coronavirus Infectious Disease), IDO (Infectious Disease Ontology), IDO-COVID-19 (The COVID-19 Infectious Disease Ontology), VIDO (The Virus Infectious Disease Ontology), DOID (Human Disease Ontology), OBI (Ontology for Biomedical Investigations), and VO (Vaccine Ontology).
- You can also share recommendations on how to choose “good” ontologies, which contributes to a better understanding of the well-used and well-recognised terminologies in related fields. Some ideas can be found in: Identifying, naming and interoperating data in a Phenotyping platform network: the good, the bad and the ugly.
- To aid with the taxonomic classification of your samples (human source, xenografts, tissue cultures, viral agents, etc.) you can make use of the NCBI Taxonomy Browser.
- Refer to the RDA COVID-19 recommendations (and others) to help you use the most recognised terminologies for your case: RDA COVID-19 Working Group. (2020). RDA COVID-19 Recommendations and Guidelines on Data Sharing (1.0)
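Ontology lookups can also be scripted. The sketch below assumes that OLS exposes a search endpoint at `https://www.ebi.ac.uk/ols4/api/search` accepting `q`, `ontology` and `rows` parameters, and that results are returned under `response.docs` with `obo_id` and `label` fields; please check the current OLS API documentation before relying on any of these details. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the OLS API documentation.
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"


def build_search_url(query, ontology=None, rows=5):
    """Build the search URL for a free-text query, optionally limited to one ontology."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"


def extract_terms(payload: dict) -> list:
    """Pull (term ID, label) pairs out of a search response (assumed structure)."""
    docs = payload.get("response", {}).get("docs", [])
    return [(doc.get("obo_id"), doc.get("label")) for doc in docs]


def search_terms(query, ontology=None):
    """Query the service over the network and return (term ID, label) pairs."""
    with urlopen(build_search_url(query, ontology), timeout=30) as response:
        return extract_terms(json.load(response))


if __name__ == "__main__":
    # Requires network access.
    for term_id, label in search_terms("Brill-Zinsser disease", ontology="efo"):
        print(term_id, label)
```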
Reusability
Infectious disease research heavily relies on the reusability of human clinical and health data, well-defined usage policies, and metadata with community approved quality check processes and sufficient, appropriate and trustworthy provenance information. Without these elements, conducting effective research in this field can be challenging or even impossible.
Considerations
- Good-quality metadata is needed to judge the quality and reusability of the data. Without the metadata, the data, regardless of its quality, is unusable.
- The provenance of your data must be transparent for effective tracking of infectious diseases. Key provenance fields include, for example: the geographic location of the study or data source, the date and time of data collection, sources of patient demographic information (e.g., age, gender, ethnicity), sample collection and processing methods (e.g., blood draw, tissue biopsy, RNA extraction), and ethical or legal considerations (e.g., informed consent, data ownership, privacy regulations). The content of provenance information is always context-dependent, tailored to the purpose of provenance collection and the nature of the data produced. Efficient and accurate metadata for provenance should also capture all sensor parameters, ensuring replicability even if the original sensor is no longer available.
- Make sure that the clinical data you intend to share complies with the applicable laws regarding privacy (e.g. GDPR Art. 9), and rely on the existing data archives for its distribution. When possible, make sure that your data is not completely blocked from being reused, which would render it unusable when shared. Remember: as safe as it needs to be, but as open for research as the regulations allow.
- Infectious diseases may require an even quicker approach to data discoverability (multi-indexation), distribution and reuse. Within such a timeframe, developing a plan from scratch for data collection, storage, sharing, access, dissemination and reuse is unfeasible. This calls for thorough preparation: you should create protocols for each of these steps in advance, so that when the time comes, you and your team are prepared.
- Check the reusability of your data. Your dataset should have a landing page with several access services for data and other digital objects (e.g. download, transcription, contact person, visualisation…). Adjust and retest with different kinds of users to better understand their needs: if potential users are unsure about the content or quality of your data, they will not reuse it in their own studies. Several checklists exist, for instance PARSEC DDOMP.
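A simple completeness check can catch missing provenance before data is shared. The following Python sketch assumes a flat record with the field names listed in `REQUIRED_PROVENANCE`; those names are hypothetical and only mirror the key provenance fields discussed above, so they should be replaced with the fields of the metadata standard you actually use.

```python
from datetime import datetime

# Hypothetical field names mirroring the provenance items discussed above.
REQUIRED_PROVENANCE = (
    "geographic_location",
    "collection_datetime",
    "demographics_source",
    "collection_method",
    "processing_method",
    "consent",
)


def missing_provenance(record: dict) -> list[str]:
    """Return the required provenance fields that are absent or empty."""
    return [name for name in REQUIRED_PROVENANCE if record.get(name) in (None, "", [])]


def has_valid_datetime(record: dict) -> bool:
    """Check that the collection date and time is ISO 8601 and carries a time zone."""
    try:
        parsed = datetime.fromisoformat(record.get("collection_datetime", ""))
    except (TypeError, ValueError):
        return False
    return parsed.tzinfo is not None


if __name__ == "__main__":
    # Hypothetical record with two provenance fields left out.
    record = {
        "geographic_location": "Example City",
        "collection_datetime": "2023-03-14T09:30:00+00:00",
        "collection_method": "nasal swab",
        "consent": "informed consent obtained",
    }
    print(missing_provenance(record))
    print(has_valid_datetime(record))
```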
Existing approaches
- Drafting and interpreting data reuse policies is a complex and tedious task, especially when time is the main bottleneck of the research. For this reason, standardised data use conditions were created in The Data Use Ontology (DUO) (search for yours at Ontology Lookup Service (OLS)). These allow datasets to be annotated with usage restrictions, enabling:
  - Automatic discovery of the data based on user authorization level or intended use.
  - A quick and easy interpretation, from the perspective of the users, of the conditions to be met for data usage (e.g. use well-known open licences like Creative Commons, and repositories like Zenodo that permit public licences and embargoes).
- Carry out these checks iteratively and publish your metadata.
- Keep track of reuses of your data and, where this information is publicly available, give an overview of what has been done with your dataset.
- Make your dataset citable by uploading it to a well-established data repository that provides a DOI or another stable identifier.
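The idea of discovering datasets automatically from their data use annotations can be sketched in a few lines of Python. This is a deliberately simplified model and not the matching logic used by any archive: it handles only two DUO terms (DUO:0000042, general research use, and DUO:0000007, disease specific research), and the dataset accessions are hypothetical. Any other condition is treated as needing human review.

```python
from dataclasses import dataclass
from typing import Optional

GENERAL_RESEARCH_USE = "DUO:0000042"
DISEASE_SPECIFIC = "DUO:0000007"


@dataclass(frozen=True)
class Dataset:
    identifier: str  # hypothetical accession
    duo_code: str
    disease_term: Optional[str] = None  # required for disease specific datasets


def is_use_permitted(dataset: Dataset, research_disease_term: str) -> bool:
    """Decide whether research on a given disease fits the dataset's annotation."""
    if dataset.duo_code == GENERAL_RESEARCH_USE:
        return True
    if dataset.duo_code == DISEASE_SPECIFIC:
        return (
            dataset.disease_term is not None
            and dataset.disease_term == research_disease_term
        )
    # Conditions this sketch does not understand are left for human review.
    return False


def discoverable(datasets, research_disease_term):
    """List the identifiers of datasets whose annotation permits the intended use."""
    return [d.identifier for d in datasets if is_use_permitted(d, research_disease_term)]


if __name__ == "__main__":
    catalogue = [
        Dataset("demo-001", GENERAL_RESEARCH_USE),
        Dataset("demo-002", DISEASE_SPECIFIC, disease_term="EFO:0007182"),
        Dataset("demo-003", DISEASE_SPECIFIC, disease_term="EFO:0000000"),  # placeholder term
    ]
    print(discoverable(catalogue, "EFO:0007182"))
```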
More information
Links to FAIRsharing
FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
Links to RDMkit
RDMkit is the Research Data Management toolkit for Life Sciences describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable)
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
BioStudies | The BioStudies database holds descriptions of biological studies, links to data from these studies in other databases at EMBL-EBI or outside, as well as data that do not fit in the structured archives at EMBL-EBI. The database can accept a wide range of types of studies described via a simple format. It also enables manuscript authors to submit supplementary information and link to it from the publication. | Linked pathogen and ho... | Tool info Standards/Databases Training |
COVID-19 Data Portal | The COVID-19 Data Portal enables researchers to upload, access and analyse COVID-19 related reference data and specialist datasets. The aim of the COVID-19 Data Portal is to facilitate data sharing and analysis, and to accelerate coronavirus research. The portal includes relevant datasets submitted to EMBL-EBI as well as other major centres for biomedical data. The COVID-19 Data Portal is the primary entry point into the functions of a wider project, the European COVID-19 Data Platform. | Human biomolecular data Human clinical and hea... Socioeconomic data The Swedish Pathogens ... | Tool info Standards/Databases Training |
dbGaP | The Database of Genotypes and Phenotypes (dbGaP) archives and distributes the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. | Human biomolecular data | Tool info Standards/Databases Training |
European Genome-phenome Archive (EGA) | The European Genome-phenome Archive (EGA) is a service for permanent archiving and sharing of personally identifiable genetic, phenotypic, and clinical data generated for the purposes of biomedical research projects or in the context of research-focused healthcare systems. Access to data must be approved by the specified Data Access Committee (DAC). | Human biomolecular data Human clinical and hea... Linked pathogen and ho... | Tool info Standards/Databases Training |
European Health Information Portal | The Health Information Portal provides access to population health and healthcare data across Europe. | Human clinical and hea... | Standards/Databases |
Ontology Lookup Service (OLS) | EMBL-EBI's web portal for finding ontologies | Human clinical and hea... | Tool info Standards/Databases Training |
The Data Use Ontology (DUO) | The Data Use Ontology (DUO) describes data use requirements and limitations. DUO allows datasets to be semantically tagged with restrictions on their usage, making them automatically discoverable based on the authorization level of users or the intended usage. This resource is based on the OBO Foundry principles, and developed using the W3C Web Ontology Language. It is used in production by the European Genome-phenome Archive (EGA) at EMBL-EBI and CRG as well as the Broad Institute for the Data Use Oversight System (DUOS). | Human biomolecular data | Standards/Databases |
Zenodo | Zenodo is a generalist research data repository built and developed by OpenAIRE and CERN. | Human biomolecular data | Standards/Databases Training |