Introduction
This document aims to serve as a comprehensive resource for anyone interested in the data sources and platforms that hold human biomolecular data relevant to infectious diseases. These resources have become increasingly important in recent years as a result of advances in technology and the ever-growing threat of global pandemics.
By providing this overview, the document aims to show users where to find these types of data and how to access them for secondary use, and to facilitate the development of new research initiatives and collaborations in the field.
What is human biomolecular data, and why is it important for infectious diseases research?
Human biomolecular data refers to information obtained from the analysis of biological molecules, such as DNA, RNA, proteins, and metabolites. This type of data is used to study the molecular mechanisms underlying disease and to identify potential drug targets.
Human clinical data, on the other hand, refers to information obtained from the study of patients, including their medical histories, physical exams, and laboratory tests. This type of data is used to diagnose and treat disease, as well as to evaluate the safety and efficacy of new therapies.
While both types of data are important for understanding human health and disease, they are collected and analysed in different ways and for different purposes.
Human biomolecular data is of great importance in infectious disease research because it plays a critical role in understanding the molecular basis of the disease. By analyzing the biomolecular data, researchers can gain a deeper understanding of the disease’s pathogenesis, evolution, and transmission. Personalised medicine is another area where biomolecular data can be applied. By analyzing an individual’s biomolecular data, researchers can develop personalised treatment plans that are tailored to the specific needs of the patient. For example, if a patient has a genetic mutation that predisposes them to a particular disease, this information can be used to develop a personalised treatment plan that takes into account the patient’s genetic profile.
Furthermore, the significance of human biomolecular data extends far beyond disease diagnosis and treatment. For instance, secondary use of human biomolecular data becomes vital at the population level, where it can inform policy responses during future infectious disease outbreaks. Additionally, investigating the cost-effectiveness of interventions based on these datasets supports the allocation of resources to the most impactful healthcare strategies. By analyzing this data, we can make well-informed decisions to protect public health in the long term.
It is important to mention that most analyses of human biomolecular data are complemented by using the clinical data related to them.
Data deposition
Data deposition plays a pivotal role in advancing the understanding of human biomolecular data. Depositing this data in publicly accessible databases creates a valuable resource for scientists and researchers worldwide and enables the integration and analysis of diverse datasets. By sharing and archiving data, researchers can also ensure the reproducibility and transparency of their findings, allowing others to validate and build upon their work and encouraging collaboration. In summary, data deposition not only fuels scientific progress but also empowers the global scientific community to unlock the complexities of human biology, ultimately leading to improved health outcomes and advancements in medical research.
Considerations
- Choose a reliable and established public database or repository for data deposition.
- Ensure that the data is properly organized, documented, and annotated for easy understanding and interpretation.
- Adhere to data sharing and privacy regulations to protect sensitive information and maintain data confidentiality.
- Include metadata, such as experimental protocols, sample characteristics, and data processing methods, to provide context and facilitate reproducibility.
- Use standardized data formats and ontologies to enhance interoperability and enable integration with other datasets.
- Use metadata standards (such as DCAT) to describe datasets in data catalogs. By doing so, publishers increase discoverability and enable applications to easily consume metadata from multiple catalogs. This further enables decentralized publishing of catalogs and facilitates federated dataset search across sites.
- Include appropriate quality control measures to ensure data accuracy and reliability.
- Consider data anonymization or de-identification techniques to protect the privacy of individuals involved in the study.
- Provide sufficient data access and sharing permissions, specifying any restrictions or limitations, while ensuring compliance with legal and ethical requirements.
- Consider long-term data preservation strategies to ensure the accessibility and availability of the deposited data for future researchers.
- Promote open and collaborative practices by encouraging data citation and acknowledgement to recognize the contributions of the original data creators.
Please note that these considerations are general in nature and may vary depending on the specific requirements and guidelines of the chosen data repository or database.
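The de-identification consideration above can be illustrated with a minimal sketch. It assumes a hypothetical sample sheet with `patient_id`, `name` and `date_of_birth` fields (none of these names come from a specific repository) and replaces the identifier with a keyed hash (HMAC-SHA256) while dropping direct identifiers. Note that this is pseudonymisation rather than full anonymisation: whoever holds the key can re-link records, and remaining fields may still be identifying.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated only to keep the example readable


def deidentify(records, secret_key, id_field="patient_id",
               drop_fields=("name", "date_of_birth")):
    """Drop direct identifiers and replace the ID field with a pseudonym."""
    cleaned = []
    for record in records:
        out = {k: v for k, v in record.items() if k not in drop_fields}
        out[id_field] = pseudonymise(str(record[id_field]), secret_key)
        cleaned.append(out)
    return cleaned


# Hypothetical sample sheet; all values are made up.
samples = [
    {"patient_id": "P001", "name": "Jane Doe", "date_of_birth": "1980-01-01",
     "sample_type": "blood"},
    {"patient_id": "P001", "name": "Jane Doe", "date_of_birth": "1980-01-01",
     "sample_type": "nasal swab"},
    {"patient_id": "P002", "name": "John Roe", "date_of_birth": "1975-06-30",
     "sample_type": "blood"},
]

# Assumption: the key is stored securely and never deposited with the data.
KEY = b"replace-with-a-secret-kept-outside-the-dataset"
deidentified = deidentify(samples, KEY)
```

Because the hash is keyed and deterministic, two samples from the same patient still share a pseudonym, which keeps them linkable for analysis without exposing the original identifier.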
Existing approaches
- Public databases: Various publicly accessible databases serve as repositories for human biomolecular data, such as the National Center for Biotechnology Information (NCBI) databases (e.g., GenBank, GEO, SRA) and European Bioinformatics Institute (EBI) databases (e.g., European Nucleotide Archive (ENA), ArrayExpress).
- Controlled access repositories: Some data deposition platforms, like the Database of Genotypes and Phenotypes (dbGaP) and the European Genome-phenome Archive (EGA), adopt a controlled access model to protect sensitive human biomolecular data. Researchers interested in accessing the data need to request permission and comply with specific data usage policies.
- Data integration platforms: Platforms like the Global Alliance for Genomics and Health (GA4GH) provide frameworks and standards for federated data access and integration across multiple repositories. These initiatives aim to facilitate the aggregation and analysis of human biomolecular data from diverse sources while maintaining data privacy and security.
- Data citation and DOI assignment: To acknowledge and promote the contributions of researchers who deposit human biomolecular data, many repositories assign unique digital object identifiers (DOIs) to datasets. This enables proper citation and recognition of the deposited data, enhancing its visibility and impact.
- Data submission portals: Some repositories offer user-friendly web portals or submission systems that guide researchers through the process of depositing human biomolecular data. These portals often provide templates, validation checks, and step-by-step instructions to ensure the completeness and quality of the deposited data.
- Consortium-specific databases: Collaborative research initiatives often establish dedicated databases for sharing and depositing human biomolecular data, such as The Cancer Genome Atlas (TCGA) for cancer genomics data or the Genotype-Tissue Expression (GTEx) project for gene expression data across different tissues.
- Standardized data formats: Commonly used data formats like FASTQ, BAM, and VCF facilitate data deposition and sharing by ensuring compatibility and interoperability between different analysis tools and databases.
- Data publication: Journals and publishers increasingly require researchers to deposit their human biomolecular data in public repositories as a prerequisite for publication. This promotes data sharing, reproducibility, and transparency in scientific research.
- Data sharing platforms: Online platforms like Figshare, Zenodo, and Dryad provide researchers with the means to deposit and share their human biomolecular data, ensuring its long-term accessibility and enabling collaboration.
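To illustrate why a shared format such as VCF helps interoperability, the sketch below parses the eight fixed, tab-separated columns of a single VCF data line using only the Python standard library. The example line is made up, and real pipelines would normally rely on a dedicated parser. The point is only that any tool can read the same fields in the same order.

```python
from dataclasses import dataclass

# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: list
    info: dict


def parse_vcf_line(line: str) -> Variant:
    """Parse one tab-separated VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(VCF_COLUMNS, fields[:8]))
    info = {}
    if row["INFO"] != ".":
        for item in row["INFO"].split(";"):
            key, _, value = item.partition("=")
            # Flag entries such as "DB" carry no value; store them as True.
            info[key] = value if value else True
    return Variant(
        chrom=row["CHROM"],
        pos=int(row["POS"]),
        ref=row["REF"],
        alt=row["ALT"].split(","),
        info=info,
    )


# Made-up example line: two alternate alleles, a depth value and a flag.
example = "1\t12345\trs0000\tA\tG,T\t50\tPASS\tDP=100;DB"
variant = parse_vcf_line(example)
```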
Search and discoverability
Search and discoverability are crucial for finding and accessing relevant information, resources, and data related to a specific topic or area of interest. Infectious diseases can evolve rapidly and have significant impacts on public health, making it necessary to monitor and respond to outbreaks effectively. This requires access to up-to-date information on disease prevalence, transmission patterns, and clinical outcomes, which can come from various sources, such as clinical data, biomolecular data, public health reports, and social media.
Standardised terminology and data formats are also essential for effective search and discoverability in infectious disease surveillance. The use of common disease codes and data structures can facilitate the integration and analysis of data from multiple sources, making it easier to identify trends and patterns. This can improve the ability to identify emerging disease threats and develop effective disease control measures. See the Data harmonisation section.
Access to biomolecular data sources can be facilitated through the use of standardised data formats and data sharing platforms, which in turn supports the development of effective diagnostic and treatment strategies for infectious diseases.
Overall, search and discoverability are essential for effective infectious disease surveillance and response. By ensuring that relevant data sources are easily accessible and usable, public health professionals can more effectively monitor and control the spread of infectious diseases, ultimately protecting public health.
Considerations
Despite the growing amount of infectious disease data stored in various sources, finding and analyzing this data can be challenging for the scientific community. There is a clear need for a user-friendly and efficient way to discover and analyse this data.
- Data sharing platforms: Access to data sharing platforms, such as the COVID-19 Data Portal, can facilitate the discovery and sharing of biomolecular data related to infectious diseases.
- Data privacy and security: Privacy and security protocols must be in place to protect sensitive biomolecular data from unauthorised access.
- National regulations: National regulations and the rules of the General Data Protection Regulation (GDPR) must be taken into account.
- Data quality: High-quality biomolecular data is critical for accurate disease surveillance, diagnosis, and analysis. Efforts should be made to ensure that data quality is maintained throughout the data lifecycle. See the Quality control - Human biomolecular data page.
- Data storage and management: Proper data storage and management practices must be followed to ensure that biomolecular data is organised and easily accessible to relevant parties.
- Metadata: Metadata should be included with biomolecular data to provide context and facilitate search and discoverability.
- Collaboration: Collaboration between data producers, curators, and users can promote effective search and discoverability of biomolecular data related to infectious diseases.
- Data standardization: Standardised data formats and common disease codes are essential for integrating and analyzing biomolecular data from different sources.
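As a sketch of how the metadata consideration above can support discoverability, the example below builds a small catalogue record using the DCAT vocabulary, serialised as JSON-LD. The dataset identifier, title, keywords and access URL are hypothetical placeholders, and only a handful of DCAT and Dublin Core terms are shown. A real catalogue would follow the profile required by the hosting platform.

```python
import json


def build_dcat_dataset(identifier, title, description, keywords, access_url):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": keywords,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:accessURL": {"@id": access_url},
            }
        ],
    }


# All values below are illustrative placeholders, not a real dataset.
record = build_dcat_dataset(
    identifier="https://example.org/dataset/covid19-host-genomics",
    title="Example COVID-19 host genomics dataset",
    description="Synthetic example record used to illustrate DCAT metadata.",
    keywords=["COVID-19", "human biomolecular data", "genomics"],
    access_url="https://example.org/access/covid19-host-genomics",
)

print(json.dumps(record, indent=2))
```

Publishing records of this kind in a shared vocabulary is what allows an application to harvest and search metadata from several catalogues at once.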
Existing approaches
With these considerations in mind, we have compiled some of the main tools, portals, and data sharing platforms for searching and discovering biomolecular data related to infectious diseases from various sources.
- Beacon: Beacon v2 is an API (usually extended with a user interface) that allows for the discovery of phenotypic, clinical, and biomolecular data. Version 2 (v2) of the Beacon protocol was accepted as a GA4GH standard in spring 2022. It includes, among other changes:
  - Query options for biological or technical metadata using filters defined through CURIEs (e.g. phenotypes, disease codes, sex or age).
  - An option to trigger the next step in the data access process (e.g. who to contact or what the data use conditions are).
  - An option to jump to another system where the data can be accessed (e.g. if the Beacon is for internal use within a hospital, to provide the ID of the EHR of the patients who have the mutation of interest).
  - Annotations about the variants found, including the expert's or clinician's conclusion about the pathogenicity of a given mutation in a given individual or its role in producing a given phenotype.
  - Information about cohorts.
Useful links related to Beacon v2:
- Beacon v2 Website
- Beacon v2 Models
- Beacon v2 GitHub
- Beacon v2 GitHub API
- Beacon v2 Reference Implementation paper
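The filter-based queries described above can be sketched as follows. The example only builds the JSON body of a Beacon v2 query and does not contact any server. The body layout reflects the Beacon v2 request format as we understand it and should be checked against the specification. The endpoint URL is a hypothetical placeholder, and the two CURIEs are example filter values that should be verified against their ontologies before use.

```python
import json

ALLOWED_GRANULARITIES = {"boolean", "count", "record"}


def build_beacon_query(filter_curies, granularity="count", skip=0, limit=10):
    """Build the JSON body for a filter-based Beacon v2 query."""
    if granularity not in ALLOWED_GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    for curie in filter_curies:
        prefix, sep, local_id = curie.partition(":")
        if not (prefix and sep and local_id):
            raise ValueError(f"not a CURIE (expected PREFIX:ID): {curie}")
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            "filters": [{"id": curie} for curie in filter_curies],
            "requestedGranularity": granularity,
            "pagination": {"skip": skip, "limit": limit},
        },
    }


# Hypothetical endpoint: a real Beacon publishes its own base URL.
BEACON_URL = "https://beacon.example.org/api/individuals"

# Example filters, intended as "female" and "COVID-19" (assumed meanings).
payload = build_beacon_query(["NCIT:C16576", "MONDO:0100096"])

print("POST", BEACON_URL)
print(json.dumps(payload, indent=2))
```

Requesting `count` or `boolean` granularity rather than `record` keeps the response at the discovery level, which matches the role of a Beacon as a first step before a formal data access request.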
Example of Beacon v2 with synthetic data:
Beacon v2 implementations with COVID-19 data:
Beacons | Viral genomes | Patient basic data | Patient rich data | Patient genomics | Epidemiological data |
---|---|---|---|---|---|
CRG COVID-19 Viral Beacon | X | X | – | – | – |
EGA Beacon | – | X | X | X | – |
COVID-19 NL | – | X | X | – | – |
UNottingham Beacon | – | X | X | – | – |
Bento platform | X | X | X | X | – |
SARS-COV-2 outbreak in Andalucia | X | X | X | – | – |
COVID-19 BEACON | X | X | – | – | X |
Viral AI | X | – | – | – | – |
- BioSamples: BioSamples stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry. For example, it stores sample data related to infectious diseases such as COVID-19.
- COVID-19 Data Portal: The COVID-19 Data Portal facilitates data sharing and analysis in order to accelerate coronavirus research and acts as a data sharing platform. The European COVID-19 Data Platform consists of three connected components:
  - SARS-CoV-2 Data Hubs, which organise the flow of SARS-CoV-2 outbreak sequence data and provide comprehensive open data sharing for the European and global research communities.
  - Federated EGA, which provides secure controlled access sharing of sensitive patient and research subject data sets relating to COVID-19 while complying with stringent national privacy laws.
  - COVID-19 Data Portal, which brings together and continuously updates relevant COVID-19 datasets and tools, will host sequence data sharing and will facilitate access to other SARS-CoV-2 resources.
You can find further information about the COVID-19 Data Portal on RDMkit.
Data access and transfer
Sharing genetic and molecular information between researchers and institutions is essential for gaining a better understanding of human biology and disease. This is especially important when it comes to infectious diseases. By studying the genetic makeup of pathogens and how they interact with human cells, researchers can identify new targets for treatments and vaccines. They can also develop strategies to prevent or contain potential outbreaks before they occur.
Of course, sharing this kind of sensitive information comes with challenges. Privacy, security, and ethical considerations must be taken into account. Researchers need to handle this information responsibly and with respect for individuals’ rights. Legal and regulatory barriers can also impede data sharing and collaboration.
However, the benefits of sharing human biomolecular data outweigh the risks. Access to this data is crucial for scientific progress and medical advancements. It’s important for us to continue finding ways to responsibly and securely share this valuable resource.
Considerations
When looking for solutions to human biomolecular data access, you should consider the following aspects:
- Data Security and Privacy: Prioritize solutions that ensure strict data security and compliance with ethical guidelines, protecting sensitive biomolecular information and adhering to regulatory requirements.
- Interoperability and Data Sharing: Choose solutions that seamlessly integrate with existing biomolecular research platforms and data repositories, facilitating secure data sharing and collaboration among researchers.
- Data Reproducibility and Transparency: Choose solutions that promote data reproducibility by providing transparent methodologies, making it easier for researchers to validate and build upon previous findings.
- Data Quality and Standardization: Verify that the solution provides reliable and accurate data, while supporting data standardization and metadata organization for consistent data exchange and improved research outcomes.
- Data Integration Capabilities: Depending on your study, prioritize solutions that can seamlessly integrate diverse biomolecular data types, such as genomics, proteomics, and metabolomics, for comprehensive analysis and insights.
- Data structure: Choose a database with a defined data structure, enabling homogeneity of the data and facilitating standardized and consistent data storage, retrieval and usability.
- Scalability and Performance: Look for solutions capable of efficiently handling large-scale biomolecular data sets while maintaining optimal performance, supporting advanced analysis tools for meaningful insights.
- User-Friendly Interface: Opt for solutions with intuitive interfaces and flexible access controls, enabling researchers of varying technical backgrounds to access, analyze, and interpret data effectively.
When looking for solutions to data transfer, you can check RDMkit.
Existing approaches
- You can check a list of existing controlled access repositories:
- You can use one of these standards to make your data use conditions publicly available to possible data requesters.
  - The Data Use Ontology (DUO) is an international standard that provides codes to represent data use restrictions for controlled access datasets.
  - The ADA-M provides a standardised way to unambiguously represent the conditions related to data discovery and access.
- If you deposit your data in one of the existing controlled access repositories, the repository will already display the data use conditions (e.g. EGAD00001007777).
- A data access committee (DAC) is a group responsible for reviewing and approving requests for access to sensitive data, such as human biomolecular data. Its role is to ensure that requests comply with relevant laws and regulations, that data is used for legitimate scientific purposes, and that privacy and security are maintained. To learn more about what a DAC is and how to become one, you can check the European Genome-phenome Archive - Data Access Committee website.
You can find further information about sharing human data on RDMkit.
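The idea of machine-readable data use conditions can be sketched with a few DUO codes. The example below tags a hypothetical dataset with a DUO permission and checks whether a stated research purpose is compatible with it. The code-to-label mapping covers only four terms and reflects our reading of the ontology, so it should be verified against DUO itself. The matching rules are deliberately simplified, and a real decision always rests with the data access committee.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed subset of DUO permission codes; verify against the ontology.
DUO_LABELS = {
    "DUO:0000004": "no restriction",
    "DUO:0000042": "general research use",
    "DUO:0000006": "health or medical or biomedical research",
    "DUO:0000007": "disease specific research",
}


@dataclass
class DatasetPolicy:
    dataset_id: str
    permission: str                 # one of the DUO codes above
    disease: Optional[str] = None   # only used for disease specific research


@dataclass
class AccessRequest:
    biomedical: bool                # is the purpose health/biomedical research?
    disease: Optional[str] = None   # disease under study, as a CURIE


def is_compatible(policy: DatasetPolicy, request: AccessRequest) -> bool:
    """Simplified pre-check of a request against a dataset's DUO permission."""
    if policy.permission in ("DUO:0000004", "DUO:0000042"):
        return True
    if policy.permission == "DUO:0000006":
        return request.biomedical
    if policy.permission == "DUO:0000007":
        return request.disease is not None and request.disease == policy.disease
    return False  # unknown code: leave the decision to the DAC


# Hypothetical dataset restricted to research on one disease (example CURIE).
policy = DatasetPolicy("EXAMPLE-DATASET-1", "DUO:0000007", disease="MONDO:0100096")

matching = is_compatible(policy, AccessRequest(biomedical=True, disease="MONDO:0100096"))
other = is_compatible(policy, AccessRequest(biomedical=True, disease="MONDO:0005148"))
```

A pre-check like this can help requesters find datasets they are likely to be eligible for, but it does not replace the formal review of an access request.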
Data harmonisation
To ensure that researchers can effectively utilise data, it is essential that it be collected and stored in a standardised manner. Using standardised formats and databases facilitates data sharing across research groups, enabling more efficient and effective analysis. This approach not only saves time, but also yields more accurate results.
In response to the SARS-CoV-2 outbreak, the scientific community has established standards, schemas, and data models with controlled vocabularies and ontologies, which are explained in detail in the Human Clinical and Health Data section. One example is the paper "Developing a standardized but extendable framework to increase the findability of infectious disease datasets" from February 2023.
Considerations
- Look for an existing standardised metadata schema for human biomolecular data, such as MIABIS or the EGA schemas.
- Incorporate key data elements such as patient demographics, clinical features, and laboratory test results in the metadata schema.
- Ensure interoperability with other existing metadata schemas to facilitate data sharing and integration.
- Include metadata fields for sample collection, processing, and storage information to ensure data quality and reproducibility.
- Implement controlled vocabularies and ontologies for standardised annotation and data integration.
- Enable data harmonisation across different studies to facilitate meta-analyses and systematic reviews.
- Seek input and feedback from stakeholders across the research and public health communities to ensure that the schema meets the needs of diverse users and supports a range of research questions and applications.
- Regularly update and refine the metadata schema to accommodate new data types and emerging research needs.
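The considerations above can be made concrete with a small validation sketch. It checks sample metadata against a hypothetical schema with a few required fields, a controlled vocabulary for the sample type, a CURIE-shaped disease code, and an ISO 8601 collection date. The field names and allowed values are illustrative assumptions and are not taken from MIABIS or the EGA schemas.

```python
from datetime import date

# Hypothetical minimal schema; real schemas define many more fields.
REQUIRED_FIELDS = {"sample_id", "collection_date", "sample_type", "disease_code"}
ALLOWED_SAMPLE_TYPES = {"blood", "saliva", "nasopharyngeal swab"}


def validate_metadata(record: dict) -> list:
    """Return a list of human-readable problems found in one metadata record."""
    errors = []
    for name in sorted(REQUIRED_FIELDS - set(record)):
        errors.append(f"missing required field: {name}")

    sample_type = record.get("sample_type")
    if sample_type is not None and sample_type not in ALLOWED_SAMPLE_TYPES:
        errors.append(f"sample_type not in controlled vocabulary: {sample_type}")

    disease_code = record.get("disease_code")
    if disease_code is not None:
        prefix, sep, local_id = disease_code.partition(":")
        if not (prefix and sep and local_id):
            errors.append(f"disease_code is not a CURIE: {disease_code}")

    collection_date = record.get("collection_date")
    if collection_date is not None:
        try:
            date.fromisoformat(collection_date)
        except ValueError:
            errors.append(f"collection_date is not an ISO 8601 date: {collection_date}")
    return errors


# Made-up records: one valid, one with three problems.
good = {"sample_id": "S1", "collection_date": "2021-03-15",
        "sample_type": "blood", "disease_code": "MONDO:0100096"}
bad = {"sample_id": "S2", "collection_date": "15/03/2021",
       "sample_type": "plasma?"}

good_errors = validate_metadata(good)
bad_errors = validate_metadata(bad)
```

Running such checks before submission catches inconsistencies early, when they are cheapest to fix, and keeps records from different studies comparable.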
Existing approaches
- When looking for solutions to standards, schemas, ontologies and vocabularies, you can check the RDMkit for documentation.
- FAIRsharing is also a good resource to find metadata standards that can be useful for your research.
More information
Links to RDMkit
RDMkit is the Research Data Management toolkit for Life Sciences, describing best practices and guidelines to help you make your data FAIR (Findable, Accessible, Interoperable and Reusable).
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
ACE Cohort | Asymptomatic COVID-19 in Education (ACE) Cohort | | Training |
ADA-M | Responsible sharing of biomedical data and biospecimens via the Automatable Discovery and Access Matrix (ADA-M). The Automatable Discovery and Access Matrix (ADA-M) provides a standardized way to unambiguously represent the conditions related to data discovery and access. By adopting ADA-M, data custodians can generally describe what their data are (the Header section), who can access them (the Permissions section), terms related to their use (the Terms section), and special conditions (the Meta-Conditions). By doing so, data custodians can participate in data sharing and collaboration by making meta information about their data computer-readable and hence directly available for digital communication, searching and automation activities. | | Tool info |
ArrayExpress | ArrayExpress is a database of functional genomics experiments that can be queried and the data downloaded. It includes gene expression data from microarray and high throughput sequencing studies. Data is collected to MIAME and MINSEQE standards. Experiments are submitted directly to ArrayExpress or are imported from the NCBI GEO database. | Linked pathogen and ho... | Tool info Standards/Databases Training |
Beacon v2 | Beacon v2 is a protocol/specification established by the Global Alliance for Genomics and Health initiative (GA4GH) that defines an open standard for federated discovery of genomic data and associated information in biomedical research and clinical applications. | Human clinical and hea... | Tool info Standards/Databases Training |
Bento platform | The Bento platform enables the research community to explore the BQC19 cohort aggregate data. | | |
BioSamples | BioSamples stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry. Samples are either 'reference' samples (e.g. from 1000 Genomes, HipSci, FAANG) or have been used in an assay database such as the European Nucleotide Archive (ENA) or ArrayExpress. It provides links to assays and specific samples, and accepts direct submissions of sample information. | Linked pathogen and ho... | Tool info Standards/Databases Training |
COVID-19 BEACON | The COVID-19 Beacon is a searchable platform for SARS-CoV-2 genomic variants conforming to the Beacon specifications of the Global Alliance for Genomics and Health but adjusted for viral genome searches. | | |
COVID-19 Data Portal | The COVID-19 Data Portal enables researchers to upload, access and analyse COVID-19 related reference data and specialist datasets. The aim of the COVID-19 Data Portal is to facilitate data sharing and analysis, and to accelerate coronavirus research. The portal includes relevant datasets submitted to EMBL-EBI as well as other major centres for biomedical data. The COVID-19 Data Portal is the primary entry point into the functions of a wider project, the European COVID-19 Data Platform. | Human clinical and hea... Socioeconomic data The Swedish Pathogens ... | Tool info Standards/Databases Training |
CRG COVID-19 Viral Beacon | A platform allowing for browsing SARS-CoV-2 variability at the genome, amino acid, structural, and motif levels | An automated SARS-CoV-... | |
dbGaP | The Database of Genotypes and Phenotypes (dbGaP) archives and distributes the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. | | Tool info Standards/Databases Training |
DCAT | An RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. | Human clinical and hea... | Standards/Databases |
Dryad | Dryad is an open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data. | | Standards/Databases |
Dutch COVID-19 Data Portal | The Dutch COVID-19 Data Portal provides researchers with a clear overview of what is available, allows searching for specific data, and makes access to such data easier when the necessary ethical and legal conditions have been met. | | |
EBI | The European Bioinformatics Institute is a bioinformatics research center that is part of the European Molecular Biology Laboratory and is located in Hinxton, England. The institution combines intense research activity with the development and maintenance of a set of bioinformatics lines, services and databases. | | Training |
EGA Beacon | Interface to query on the EGA data through Beacon v2 | | |
Estonian Biobank | The Estonian Biobank has established a population-based biobank of Estonia with a current cohort size of more than 200,000 individuals (genotyped with genome-wide arrays), reflecting the age, sex and geographical distribution of the adult Estonian population. Considering the fact that about 20% of Estonia's adult population has joined the programme, it is indeed a database that is very important for the development of medical science both domestically and internationally. | | |
European Genome-phenome Archive (EGA) | The European Genome-phenome Archive (EGA) is a service for permanent archiving and sharing of personally identifiable genetic, phenotypic, and clinical data generated for the purposes of biomedical research projects or in the context of research-focused healthcare systems. Access to data must be approved by the specified Data Access Committee (DAC). | Human clinical and hea... Linked pathogen and ho... | Tool info Standards/Databases Training |
European Nucleotide Archive (ENA) | Provides a record of the nucleotide sequencing information. It includes raw sequencing data, sequence assembly information and functional annotation. | Pathogen characterisation Human clinical and hea... Pathogen characterisation An automated SARS-CoV-... Using the ENA data sub... SARS-CoV-2 sequencing ... Linked pathogen and ho... | Tool info Standards/Databases Training |
FAIRsharing | FAIRsharing is a FAIR-supporting resource that provides an informative and educational registry on data standards, databases, repositories and policy, alongside search and visualization tools and services that interoperate with other FAIR-enabling resources. FAIRsharing guides consumers to discover, select and use standards, databases, repositories and policy with confidence, and producers to make their resources more discoverable, more widely adopted and cited. Each record in fairsharing is curated in collaboration with the maintainers of the resource themselves, ensuring that the metadata in the fairsharing registry is accurate and timely. | Pathogen characterisation Ethical, Legal, and So... | Standards/Databases Training |
Federated EGA | The Federated EGA is an infrastructure built upon the European Genome-phenome Archive (EGA), an EMBL-EBI and CRG data resource for secure archiving and sharing of human sensitive biomolecular and phenotypic data resulting from biomedical research projects. | Human clinical and hea... | Training |
Figshare | Figshare is a generalist, subject-agnostic repository for many different types of digital objects that can be used without cost to researchers. Data can be submitted to the central figshare repository (described here), or institutional repositories using the figshare software can be installed locally, e.g. by universities and publishers. | | Standards/Databases Training |
GenBank | GenBank is the NIH genetic sequence database of annotated collections of all publicly available DNA sequences. | | Tool info Standards/Databases Training |
GEO | The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. Accepts next generation sequence data that examine quantitative gene expression, gene regulation, epigenomics or other aspects of functional genomics using methods such as RNA-seq, miRNA-seq, ChIP-seq, RIP-seq, HiC-seq, methyl-seq, etc. GEO will process all components of your study, including the samples, project description, processed data files, and will submit the raw data files to the Sequence Read Archive (SRA) on the researcher's behalf. In addition to data storage, a collection of web-based interfaces and applications are available to help users query and download the studies and gene expression patterns stored in GEO. | | Standards/Databases Training |
Global Alliance for Genomics and Health (GA4GH) | The metadata model for GA4GH, an international coalition of both public and private interested parties, formed to enable the sharing of genomic and clinical data. | | Tool info Standards/Databases Training |
GTEx | The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 53 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank. The GTEx Portal provides open access to data including gene expression, QTLs, and histology images. | | Tool info Standards/Databases Training |
MIABIS | MIABIS represents the minimum information required to initiate collaborations between biobanks and to enable the exchange of biological samples and data. The aim is to facilitate the reuse of bio-resources and associated data by harmonizing biobanking and biomedical research. | | Standards/Databases |
National Center for Biotechnology Information (NCBI) | The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. | | Training |
Panther | The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. | | Tool info Standards/Databases Training |
SARS-CoV-2 Data Hubs | Using technology that builds upon existing EMBL-EBI infrastructure, we provide SARS-CoV-2 Data Hubs to those public health agencies and other scientific groups responsible for generating viral sequence data from the outbreak at national or regional levels. | | |
SARS-COV-2 outbreak in Andalucia | SARS-CoV-2 whole genome sequencing circuit of Andalusia | | |
SRA | The SRA is NIH's primary archive of high-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes the NCBI Sequence Read Archive (SRA), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). Data submitted to any of the three organizations are shared among them. SRA accepts data from all kinds of sequencing projects including clinically important studies that involve human subjects or their metagenomes, which may contain human sequences. These data often have controlled access via dbGaP (the database of Genotypes and Phenotypes). | | Tool info Standards/Databases Training |
TCGA | The Cancer Genome Atlas (TCGA) is a comprehensive, collaborative effort led by the National Institutes of Health (NIH) to map the genomic changes associated with specific types of tumors to improve the prevention, diagnosis and treatment of cancer. Its mission is to accelerate the understanding of the molecular basis of cancer through the application of genome analysis and characterization technologies. | | Standards/Databases Training |
The Data Use Ontology (DUO) | The Data Use Ontology (DUO) describes data use requirements and limitations. DUO allows datasets to be semantically tagged with restrictions on their usage, making them automatically discoverable based on the authorization level of users or the intended usage. This resource is based on the OBO Foundry principles, and developed using the W3C Web Ontology Language. It is used in production by the European Genome-phenome Archive (EGA) at EMBL-EBI and CRG as well as the Broad Institute for the Data Use Oversight System (DUOS). | | Standards/Databases |
UNottingham Beacon | Beacon from UNottingham to query a backend OMOP database of synthetic COVID-19 patient EHRs (electronic health records). | | |
Viral AI | A global network for genomic surveillance and infectious disease research | | |
Zenodo | Zenodo is a generalist research data repository built and developed by OpenAIRE and CERN. | | Standards/Databases Training |