Synthetic Data

One limitation in genomics research is that human genomic data are not openly available: access must be controlled according to participant consent agreements and data protection regulations such as GDPR. Obtaining authorization to access such data can take a long time, delaying important research work. Synthetic genomic and phenotype data can help researchers avoid these delays.

Synthetic data are artificially generated datasets, often created with algorithms, which can be used without the need for authorization to test new products and tools, build technical demonstrators, validate data models, and train AI models. The EGA provides access to synthetic cohort datasets, augmented with rich synthetic metadata, which are not subject to the usage restrictions that apply to real data. Whilst synthetic datasets are not included in the general EGA mandate and services, we can consider such submissions and will evaluate them for acceptance based on whether they address use cases not already covered by existing synthetic datasets. Access to synthetic data studies is managed by the EGA Helpdesk Data Access Committee.

Study ID	Title
EGAS00001002472	CINECA synthetic cohort EUROPE UK1 referencing fake samples
EGAS00001005591	Synthetic data - Genome in a Bottle
EGAS00001005042	Test Study for EGA using data from 1000 Genomes Project - Phase 3
EGAS00001005702	Human genomic and phenotypic synthetic data for the study of rare diseases
EGAS50000000190	EOSC4Cancer Synthetic Colorectal Cancer Genomic data