More and more next-generation sequencing (NGS) data are made available every day. However, the quality of these data is not always guaranteed. Available quality control tools require profound knowledge to correctly interpret the multiplicity of quality features. Moreover, it is usually difficult to know whether quality features are relevant in all experimental conditions. Therefore, the NGS community would highly benefit from condition-specific, data-driven guidelines derived from many publicly available experiments, which reflect routinely generated NGS data. In this work, we have characterized well-known quality guidelines and related features in big datasets and concluded that they are too limited for accurately assessing the quality of a given NGS file. Therefore, we present new data-driven guidelines derived from the statistical analysis of many public datasets, using quality features calculated by common bioinformatics tools. Thanks to this approach, we confirm the high relevance of genome mapping statistics for assessing data quality, and we demonstrate the limited scope of some quality features that are not relevant in all conditions.

Next-generation sequencing (NGS)–based analyses of regulatory functions of the genome are widely used in clinical and biological applications and have gained a key role in research in recent years. Many different assays have been developed, ranging from classical sequencing to gene expression quantification (RNA-seq), the identification of epigenetic modifications (ChIP-seq, bisulfite sequencing) and the measurement of chromatin accessibility (DNase-seq, MNase-seq and ATAC-seq). NGS experiments require stepwise data analysis to gain information from short reads, which first need to be assembled or aligned to a reference genome. It is crucial to filter out low-quality data as early as possible to prevent a negative impact on downstream analysis (1, 2). Especially in the clinical context, misinterpretation of data due to faulty samples can have dire consequences for patients, such as false diagnoses or wrong therapy approaches. In previous work, we showed that the systematic removal of lower-quality samples within RNA-seq datasets improves the clustering of disease and control samples (3).

A variety of tools can be used to compute features holding information about the quality of NGS data. Classical quality control (QC) tools analyze the raw data exported from the machine performing the assay. The raw data are stored in FastQ files, which contain the sequence of each read and a corresponding quality score Q, encoded in ASCII characters. The score Q is an integer mapping of the probability P that a base call is incorrect (4). Manual interpretation of these scores is not feasible because each base of each read must be taken into account. The most popular tool for quality control of FastQ files is FastQC (5), which can be used to obtain multiple features that hold information on the quality of the raw data; examples are position-dependent biases, sequencing adapter contamination and DNA over-amplification.

Downstream analysis can also provide insight into the quality of the given data. For example, genome mapping statistics include the number of reads mapped to a unique position and the number of unmapped reads, which are significant with respect to the quality of the input data (6, 7, 8, 9). While raw quality scores and mapping statistics can be used with any NGS sequencing data, for some applications additional steps can be taken to complement the quality analysis. For chromatin and protein-DNA interaction assays, such as DNase-seq, ATAC-seq and ChIP-seq, it may be of interest for quality evaluation to use the genomic locations of reads and their distribution near transcription start sites (TSSs), which are of interest in these assays anyway (10, 11, 12, 13). The information about quality held by some features can vary substantially depending on the assay, as shown by the popular experiment guidelines from the ENCODE project (14). For example, these guidelines recommend a minimal number of uniquely mapped reads to evaluate DNA-seq files, while the number of usable fragments should be considered for ChIP-seq, with different thresholds for narrow-peak and broad-peak experiments. Applying all these methods and using their features to determine the quality of a new sample can be complicated.
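To make the relationship between the FastQ quality score Q and the error probability P concrete, here is a minimal Python sketch. It assumes the Phred+33 ASCII encoding used by modern Illumina instruments, where Q = −10·log₁₀(P); the function name is illustrative, not part of any tool mentioned above.

```python
# Decode a FastQ quality string (Phred+33 ASCII encoding, as used by
# modern Illumina machines) into per-base error probabilities.
# The score Q relates to the error probability P via Q = -10 * log10(P),
# so P = 10 ** (-Q / 10).

def phred_to_error_probs(quality_string, offset=33):
    """Return the base-call error probability for each quality character."""
    return [10 ** (-(ord(c) - offset) / 10) for c in quality_string]

# 'I' encodes Q = 40, i.e. an error probability of 0.0001;
# '!' encodes Q = 0, i.e. an error probability of 1.0.
probs = phred_to_error_probs("II!!")
```

This also illustrates why manual interpretation is infeasible: a single sequencing run yields millions of reads, each with one such probability per base.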
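As a sketch of how the genome mapping statistics discussed above can be derived, the following Python snippet counts unmapped and uniquely mapped reads from SAM-format records. The MAPQ ≥ 30 cutoff for "uniquely mapped" is an illustrative assumption; aligners differ in how they encode mapping uniqueness, and the function name is hypothetical.

```python
# Minimal sketch: basic mapping statistics from SAM records.
# In the SAM format, field 2 is the bitwise FLAG (0x4 = segment unmapped,
# 0x100 = secondary alignment) and field 5 is the mapping quality (MAPQ).
# Treating MAPQ >= 30 as "uniquely mapped" is only an illustrative choice.

def mapping_stats(sam_lines, uniq_mapq=30):
    stats = {"total": 0, "unmapped": 0, "uniquely_mapped": 0}
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x100:                # ignore secondary alignments
            continue
        stats["total"] += 1
        if flag & 0x4:
            stats["unmapped"] += 1
        elif mapq >= uniq_mapq:
            stats["uniquely_mapped"] += 1
    return stats

# Toy records: one read mapped with MAPQ 60, one unmapped read (flag 4).
records = [
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
stats = mapping_stats(records)
# stats == {'total': 2, 'unmapped': 1, 'uniquely_mapped': 1}
```

In practice such counts are obtained with established tools (e.g. samtools flagstat) rather than hand-rolled parsers; the sketch only shows where the numbers come from.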