4 Quality check fields

We used a number of tools to collect potentially useful quality-control (QC) measures. Specifically, we used seqtk (Li 2020 (accessed August 18, 2020)), the idxstats subcommand of samtools, the output of STAR, our own megadepth tool, and featureCounts. We examine each in turn, listing the specific QC measures calculated be each.

Monorail runs the seqtk fqchk command on input FASTQ files to collect base-quality and base-composition summaries for all sequencing cycles. We distill these into a few QC measures included with every summarized run in recount3.

min_len: minimum read length
max_len: maximum read length
avg_len: average read length
#distinct_quality_values: number of different quality scores present in the sequence run’s base qualities
#bases: total number of bases across spots (not including both read mates if paired)
%A: percent of bases that are A
%C: percent of bases that are C
%G: percent of bases that are G
%T: percent of bases that are T
%N: percent of bases that are N
avgQ: weighted average over Phred quality scores present in the sequence run, where weights are the Phred quality values themselves
errQ: negatively scaled log of the weighted average Phred quality scores present in the sequence run, where weights are the error probabilities associated with the Phred quality scores

Monorail uses STAR to align RNA-seq reads in a spliced fashion to a reference genome, without using any annotation. Files output by STAR, particularly the Log.out and Log.final.out, report a number of measures that can be used for QC. We compile these into a number of QC measures included with every summarized run in recount3. Note that some of these measures are reported separately for the two ends of a paired-end read. We omit the second-end versions of these QC measures here for space reasons.

From the STAR manual (version 2.7.2b):

Log.final.out: summary mapping statistics after mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

Some of the following definitions include text from the STAR manual/source code, reprinted here for convenience. Please see the STAR manual for more in depth information.

%_of_chimeric_reads: Number of chimeric reads divided by number of input reads
%_of_reads_mapped_to_multiple_loci: Number of reads mapped to multiple loci divided by number of input reads
%_of_reads_mapped_to_too_many_loci: Number of reads mapped to (> 10) loci divided by number of input reads
%_of_reads_unmapped:_other: Reads are unmapped due to no acceptable seed/windows divided by number of input reads
%_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches divided by number of input reads
%_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length divided by number of input reads
all_mapped_reads: Total number of reads aligned
average_input_read_length: Average length of a read
average_mapped_length: Average length of an alignment
deletion_average_length: Average length of a genomic deletion, i.e. genomic gaps
deletion_rate_per_base: Genomic deletions per mapped base
insertion_average_length: Average length of a genomic insertion, i.e. read gaps
insertion_rate_per_base: Genomics insertions per mapped base
mapping_speed_million_of_reads_per_hour: How fast it was to align this sample
mismatch_rate_per_base_%: Mismatches per mapped base
number_of_chimeric_reads: Total number of reads which were fragmented on aligning, e.g. fusion potential reads
number_of_input_reads: Total number of reads input to STAR
number_of_reads_mapped_to_multiple_loci: Number of reads mapped to multiple loci
number_of_reads_mapped_to_too_many_loci: Number of reads mapped to (> 10) loci
number_of_reads_unmapped:_other: Number of reads left unmapped due to no acceptable seed/windows
number_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches
number_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length
number_of_splices:_at/ac: Number of canonical splices of AT-AC (and reverse)
number_of_splices:_annotated_(sjdb): Number of splices found that were also in the annotation database
number_of_splices:_gc/ag: Number of canonical splices of GC-AG (and reverse)
number_of_splices:_gt/ag: Number of canonical splices of GT-AG (and reverse)
number_of_splices:_non-canonical: Number of non-canonical splices, anything not GT-AG, AT-AC, GC-AG, or their reverse complement
number_of_splices:_total: Total number of splices
uniquely_mapped_reads_%: Number of reads which mapped to a single locus divided by number of input reads
uniquely_mapped_reads_number: Number of reads which mapped to a single locus

Monorail runs the samtools idxstats on the BAM file output by STAR to collect statistics about how many reads aligned to each chromosome in the genome assembly. This can be helpful in, for instance, confirming the sex of the individual sequenced based on alignments to sex chromosomes, or measuring effectiveness of ribosomal RNA depletion by considering the fraction of reads aligned to the mitochondrial genome. We compile these into a number of QC measures included with every summarized run in recount3:

aligned_reads%.chrm: Percent of reads aligning to the mitochondrial genome.
aligned_reads%.chrx: Percent of reads aligning to chromosome X.
aligned_reads%.chry: Precent of reads aligning to chromosome Y.

Monorail runs our megadepth tool on the BAM files output by STAR. The chief function is to convert BAM files to bigWig files that are then added to the recount3 archive. As megadepth performs this conversion, it also summarizes the amount of sequencing coverage within the intervals of a provided BED file representing a gene annotation. These quantifications can be useful for quality control, tell us, for example, what fraction of the coverage is within annotated genes.

Fragment length distribution is based on a special read filter only applied for this purpose to be compatible with CSAW’s fragment counting approach (Lun and Smyth 2016), paired reads in a passing fragment must not be secondary, supplementary, have conflicting read order, be unmapped or be mapped on more than one chromosome. The bc_ prefix here refers to the previous name of the Megadepth tool ¹.

bc_auc.all_reads_all_bases: Area under coverage (total depth of coverage evaluated at all bases) for all alignments
bc_auc.all_reads_annotated_bases: Area under coverage for all alignments, but only for bases in annotated exons
bc_auc.unique_reads_all_bases: Area under coverage for uniquely aligned reads
bc_auc.unique_reads_annotated_bases: Area under coverage for uniquely aligned reads, but only for bases in annotated exons
bc_auc.all_%: bc_auc.all_reads_annotated_bases divided by bc_auc.all_reads_all_bases
bc_auc.unique_%: bc_auc.unique_reads_annotated_bases divided by bc_auc.unique_reads_all_bases
bc_frag.count: Total number of read fragments in BAM after filtering
bc_frag.kallisto_count: Number of read fragments (< 1000) bp in length in BAM after filtering
bc_frag.kallisto_mean_length: Mean length of read fragments (< 1000) bp in length in BAM after filtering
bc_frag.mean_length: Mean length of all read fragments in BAM after filtering
bc_frag.mode_length: Mode of the read fragment length of all fragments in BAM after filtering
bc_frag.mode_length_count: Number of read fraqments with the bc_frag.mode_length in BAM after filtering

Finally, Monorail runs featureCounts on the BAM files output by STAR. This provides a second opinion on the quantifications produced by megadepth. While we have not yet found compelling examples where the megadepth and featureCounts outputs disagree, we keep summaries of the featureCounts quantifications as potential QC measures.

exon_fc.all_%: exon_fc_count_all.assigned divided by all_mapped_reads (from STAR)
exon_fc.unique_%: exon_fc_count_unique.assigned divided by uniquely_mapped_reads_number (from STAR)
exon_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts
exon_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to an exon
exon_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts
exon_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to an exon
gene_fc.all_%: gene_fc_count_all.assigned divided by all_mapped_reads (from STAR)
gene_fc.unique_%: gene_fc_count_unique.assigned divided by uniquely_mapped_reads_number (from STAR)
gene_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts
gene_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to a gene
gene_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts
gene_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to a gene

References

Li, H. 2020 (accessed August 18, 2020). Seqtk: Toolkit for Processing Sequences in FASTA/q Formats. https://github.com/lh3/seqtk.

Lun, A. T., and G. K. Smyth. 2016. “csaw: a Bioconductor package for differential binding analysis of ChIP-seq data using sliding windows.” Nucleic Acids Res. 44 (5): e45.

Megadepth used to be called BamCount.↩︎