High-throughput sequencing data is often presented in the .fastq format. This flat text file format contains both the nucleotide sequences and Phred quality scores (Q scores). Quality scores estimate the accuracy of each nucleotide.
Phred Quality Score | Estimated Accuracy |
---|---|
10 | 90 % |
20 | 99 % |
30 | 99.9 % |
40 | 99.99 % |
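
The table follows from the standard Phred definition, where the error probability is P = 10^(-Q/10). Here is a minimal Python sketch of that conversion. The function names are illustrative, and the decoder assumes the common Phred+33 ASCII encoding, which older files may not use.

```python
def phred_to_error_prob(q):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Estimated accuracy of a base call with Phred score q."""
    return 1.0 - phred_to_error_prob(q)


def decode_quality_string(quality, offset=33):
    """Turn a FASTQ quality line into integer Q scores.

    Assumption: Phred+33 encoding (offset=33). Some older Illumina
    files use an offset of 64 instead.
    """
    return [ord(ch) - offset for ch in quality]


# Reproduce the rows of the table above.
for q in (10, 20, 30, 40):
    print(f"Q{q}: {phred_to_accuracy(q):.2%} accurate")
```

With the assumed offset, `decode_quality_string("I5!")` returns `[40, 20, 0]`.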
Q scores are estimates; the real accuracy of a nucleotide could be lower.
Q scores differ between sequencing platforms: Illumina reports the probability of a substitution error, while Ion Torrent and Roche 454 report the probability of an insertion or deletion.
The relative quality of sequencing platforms is hotly debated (PDF, PDF). For this discussion, we will accept Q scores as reasonable estimates of accuracy.
There are many ways that Q scores can be used to increase the quality of a dataset.
- Trimming.
  - Once a single nucleotide has a low Q score, remove all following nucleotides.
- Filtering.
  - Remove reads with a low average Q score.
- Some combination of trimming and filtering, like
  - Once a series of nucleotides has a low average Q score, remove all following nucleotides.
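
The three strategies above can be sketched in Python as follows. This is an illustration rather than any particular tool's implementation. The function names, the thresholds (Q20, mean Q25), and the window size of 4 are arbitrary assumptions, and the sketch also assumes that the low-quality nucleotide or window is itself removed along with everything after it.

```python
def trim_at_first_low_q(quals, min_q=20):
    """Trimming: number of leading bases to keep, cutting at the
    first base whose Q score is below min_q (assumed threshold)."""
    for i, q in enumerate(quals):
        if q < min_q:
            return i
    return len(quals)


def passes_mean_q_filter(quals, min_mean_q=25):
    """Filtering: keep the read only if its average Q score is at
    least min_mean_q (assumed threshold)."""
    return len(quals) > 0 and sum(quals) / len(quals) >= min_mean_q


def trim_at_low_q_window(quals, window=4, min_mean_q=20):
    """Combination: number of leading bases to keep, cutting at the
    start of the first window whose average Q is below min_mean_q."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return start
    return len(quals)


# A made-up read whose quality decays toward the 3' end.
quals = [38, 37, 36, 35, 30, 22, 18, 12, 8, 2]
print(trim_at_first_low_q(quals))    # bases kept by single-base trimming
print(passes_mean_q_filter(quals))   # does the read survive the filter?
print(trim_at_low_q_window(quals))   # bases kept by windowed trimming
```

Each trimming function returns a length, so the trimmed read is `sequence[:n]` with the matching `quals[:n]`.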
Because common sequencing technologies produce lower quality nucleotides near the end of the reads, trimming is common.
Illumina sequencing produces paired-end reads that can be joined. These joined reads are high quality on both ends, making filtering a better fit.
As discussed in Edgar & Flyvbjerg, 2015, the average Q score of a read is a very poor indicator of quality because a simple average dramatically underestimates the number of errors predicted by cumulative Q scores. Take this example from Edgar, shown below.
Q scores in read | Avg. Q | Expected number of errors |
---|---|---|
140 x Q35 + 10 x Q2 | 33 | 6.4 ! |
150 x Q25 | 25 | 0.5 |
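
The figures in this table can be checked by summing the per-base error probabilities, P = 10^(-Q/10), over each read. The sketch below does that. The helper names are illustrative and are not taken from Edgar & Flyvbjerg's software.

```python
def expected_errors(quals):
    """Expected number of errors in a read: the sum of the per-base
    error probabilities 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in quals)


def mean_q(quals):
    """Simple arithmetic mean of the Q scores."""
    return sum(quals) / len(quals)


# The two reads from the table above.
reads = {
    "140 x Q35 + 10 x Q2": [35] * 140 + [2] * 10,
    "150 x Q25": [25] * 150,
}

for label, quals in reads.items():
    print(f"{label}: avg Q = {mean_q(quals):.0f}, "
          f"expected errors = {expected_errors(quals):.1f}")
```

Q scores are logarithmic, so the ten Q2 bases account for about 6.3 of the first read's expected errors while lowering its average Q by only about two points.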
Coming soon!