Quality Control Reports
High-throughput sequencing has become the leading method to study, decode and discover the genomic origins of biological phenomena. The EGA provides secure archival of such identifiable genomic data for the purpose of data upcycling, i.e. re-using these data for research. High data-quality standards are essential to ensure the quality and credibility of that research. Moreover, a quality control report tells researchers what to expect from the data before they request access to it, saving time and effort.
The EGA has developed a File Quality Control Report (QC Report) that provides generic quality control information for FASTQ, SAM/BAM/CRAM, and VCF files deposited at the EGA. The QC Report gives users information about the files submitted within a specific dataset. Data requesters can see information such as the quality of reads, mapped reads, number of variants, and other features before starting the request process, which saves time and effort.
Accessing file quality control reports
On each dataset page, the user can explore the files the dataset contains by clicking the "Browse files" button in the top-right corner of the page.
Once on the dataset files page, the user can browse the files and see some information about them in the main table, such as the format or the file location. The user can access the QC report of each file by clicking the button in the "QC Report" column.
The Quality Control report of a file has two sections. The first contains general information about the file, such as the inferred assembly, the number of records, and the dataset or study it comes from. The second contains plots that summarize key properties of the file, for example the site frequency distribution or the variant types.
The description of each plot is accessible by clicking the "i" button at the top-right corner of each plot box.
To analyze FASTQ, SAM/BAM/CRAM and VCF files, the EGA applies a set of tools widely used in the bioinformatics community:
- FASTQ: FastQC, recognized by the community as the gold standard tool.
  - per base sequence quality, per sequence quality scores, per base sequence content, per sequence GC content, sequence duplication levels, etc.
- SAM/BAM/CRAM: samtools, also regarded as a gold standard, which generates plots that give an overall idea of the quality of the file.
  - base coverage distribution, base quality, % of mapped reads, % of both mates mapped, singletons, duplicates, etc.
- VCF: vcftools and bcftools, combined with a custom script to infer the genome assembly.
  - site frequency distribution, Ts/Tv, base changes, indel distribution, etc.
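As an illustration of one of the VCF metrics listed above, the transition/transversion ratio (Ts/Tv) counts A<->G and C<->T substitutions against all other single-base substitutions. The sketch below is not the EGA implementation (the report relies on vcftools and bcftools for this); the function names and the input format, a list of (ref, alt) pairs, are assumptions made for the example.

```python
# Minimal Ts/Tv sketch. Function names and the (ref, alt) input format are
# hypothetical; the EGA report itself relies on vcftools/bcftools.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snvs) -> float:
    """Compute Ts/Tv over an iterable of (ref, alt) allele pairs.

    Records that are not single-base substitutions between A, C, G and T
    (indels, identical alleles, ambiguous bases) are skipped.
    Returns NaN when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Three transitions, one transversion, and one deletion that is ignored.
variants = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(ts_tv_ratio(variants))
```

A markedly low Ts/Tv is commonly read as a hint of false-positive variant calls, which is why the metric is useful in a quality report.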
The pipeline is organized as follows:
A main Bash script calls the relevant commands of these tools. The resulting files are then processed by a Python script, which outputs a JSON file holding the final information. The EGA website parses this JSON to display the information in a user-friendly way and build the final QC report of the file.
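The Python post-processing step could look roughly like the sketch below, which takes the text produced by `samtools stats` and turns its summary-number (`SN`) lines into a small JSON document. This is an assumption-laden illustration, not the EGA script: the function names and the JSON field names (`total_reads`, `mapped_reads`, `percent_mapped`, `duplicates`) are hypothetical, and the real report schema is not described in this page.

```python
import json

# Sketch of a post-processing step: samtools stats text -> JSON.
# Function names and JSON keys are hypothetical, chosen for illustration.


def parse_summary_numbers(stats_text: str) -> dict:
    """Collect the tab-separated 'SN' lines of samtools stats output."""
    summary = {}
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "SN":
            continue  # skip comments and the other sections
        key = fields[1].rstrip(":")
        try:
            value = float(fields[2])
        except ValueError:
            continue  # non-numeric value, ignored in this sketch
        summary[key] = int(value) if value.is_integer() else value
    return summary


def build_report(stats_text: str) -> str:
    """Return a JSON string with a few headline numbers."""
    sn = parse_summary_numbers(stats_text)
    total = sn.get("raw total sequences", 0)
    mapped = sn.get("reads mapped", 0)
    report = {
        "total_reads": total,
        "mapped_reads": mapped,
        "percent_mapped": round(100 * mapped / total, 2) if total else None,
        "duplicates": sn.get("reads duplicated", 0),
    }
    return json.dumps(report, indent=2)


# Made-up numbers in the layout that samtools stats uses for SN lines.
EXAMPLE_STATS = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t2000\n"
    "SN\treads mapped:\t1900\n"
    "SN\treads duplicated:\t100\t# PCR or optical duplicate bit set\n"
)
print(build_report(EXAMPLE_STATS))
```

Keeping the parsing separate from the report-building step makes it straightforward to add equivalent parsers for the FastQC and bcftools outputs that feed the same JSON file.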