split nanopore reads by qscore

When basecalling nanopore reads with guppy or some other basecaller, it will split the reads into pass/fail files depending on some given mean qscore value. Mean qscore here is the average of the probabilities associated with the quality scores in the 4th line of your fastq read. It is calculated with the following equation:

Put into code:

def calculate_qscore(qstring):
    '''
    calculate a qscore from a qstring
    '''
    qs = (np.array(qstring, 'c').view(np.uint8) - 33)
    mean_err = np.exp(qs * (-np.log(10) / 10.)).mean()
    score = -10 * np.log10(max(mean_err, 1e-4))
    return score

So you have basecalled some nanopore reads, but now you want to be more or less stringent on your pass/fail filtering. How do you re-split your reads? One way is to use the sequencing_summary.txt file to extract the mean_qscore field, and search the fastq for the @readID. The sequencing_summary.txt file has changed many, many, many, many times over the years, and has the possibility of being removed or altered dramatically soon. So what other way is there?

Introducing, split_qscore.py, a script to split your fastq and sam files by a qscore cutoff.

https://github.com/Psy-Fer/buttery-eel/blob/main/scripts/split_qscore.py

usage: split_qscore.py [-h] [-p PREFIX] [-q QSCORE] [-f {fastq,sam}] input output

split a fastq or sam file by qscore value into pass/fail files

positional arguments:
  input                 fastq, sam file or - for STDIN/pipe
  output                output path to write pass/fail files

options:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        filename prefix to give pass/fail output files, eg -p exampl gives example.pass.fastq & example.fail.fastq (default: reads)
  -q QSCORE, --qscore QSCORE
                        qscore to split on:pass=>qscore(default: 9)
  -f {fastq,sam}, --format {fastq,sam}
                        if using - for stdin, give format as fastq or sam (default: None)

You can now input your fastq to be split by qscore.

python3 split_qscore.py -q 15 test.fastq ./ > stats.txt

>cat stats.txt

~~~ STATS ~~~:

pass_reads: 2670
fail_reads: 1330
total_reads: 4000
pass fraction: 66.75%

You can also split sam/bam files. You can provide a sam file like the fastq file above, and everything will just work. If starting from a bam file, you can stream the data using the following piping pattern where - (dash) is for stdin:

samtools view -h test.bam | python3 split_qscore.py -q 15 -f sam - ./ > stats.txt

Don't forget the -h argument in samtools to also send the header. You can then convert the resulting sam files back to bam. This is useful for unaligned sam/bam output when calling methylation with remora in guppy/dorado.

When calculating the mean qscore of a read, split_qscore.py will first try to see if a tag is present containing the pre-computed score. If your data was basecalled using the slow5 wrapper for guppy, buttery-eel, your reads will have a mean_qscore=# in each read header. Any sam outputs from guppy/buttery-eel, dorado will have the unofficial sam tag qs:i:#. If these tags cannot be found, then the mean qscore will be manually calculated. This, of course, can be a little slower.

I hope this small script is helpful to those doing long-read sequencing and coming across this issue of read splitting.

split nanopore reads by qscore

Recent Posts

Contact
Information

split nanopore reads by qscore

Recent Posts

Contact Information

Contact
Information