colinbrislawn/QualityControl.md

## QualityControl.md

      
    Quality Control

High-throughput sequencing data is often presented in the .fastq format. This flat text file format contains both the nucleotide sequences and their Phred quality scores (Q scores). A Q score estimates the accuracy of a single nucleotide.
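As a minimal sketch of how Q scores are stored, the snippet below decodes the quality line of one FASTQ record. It assumes the common Phred+33 encoding (Sanger and Illumina 1.8+), where each quality character's ASCII code minus 33 is the Q score; older Illumina files used a different offset. The record and the function name `decode_phred33` are made up for illustration.

```python
def decode_phred33(quality_line):
    """Convert a FASTQ quality string to integer Q scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_line]


# A made-up four-line FASTQ record: header, sequence, separator, qualities.
record = """@read_1
GATTACA
+
IIII5+#"""

header, sequence, _separator, quality = record.splitlines()
q_scores = decode_phred33(quality)

# Pair every nucleotide with its decoded Q score.
for base, q in zip(sequence, q_scores):
    print(base, q)
```

For real data, an established parser (for example Biopython's `SeqIO`) is a safer choice than hand-rolled decoding, since it also handles the older encodings.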


| Phred Quality Score | Estimated Accuracy |
|---|---|
| 10 | 90 % |
| 20 | 99 % |
| 30 | 99.9 % |
| 40 | 99.99 % |
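The table follows from the standard Phred definition, $Q = -10 \log_{10}(P)$, where $P$ is the estimated probability that the base call is wrong. A short sketch of the conversion in both directions (the function names are illustrative, not from any library):

```python
import math


def q_to_error_prob(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_q(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the accuracy table.
for q in (10, 20, 30, 40):
    p = q_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f} %")
```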


Q scores are not perfect

Q scores are estimates; the real accuracy of a nucleotide could be lower.
Q scores differ between sequencing platforms: Illumina reports the probability of a substitution error, while Ion Torrent and Roche 454 report the probability of an insertion or deletion.
The relative quality of sequencing platforms is hotly debated (PDF, PDF). For this discussion, we will accept Q scores as reasonable estimates of accuracy.
Quality Filtering with Q scores

There are many ways that Q scores can be used to increase the quality of a dataset.

1. Trimming: once a single nucleotide has a low Q score, remove all following nucleotides.
2. Filtering: remove reads with a low average Q score.
3. Some combination of trimming and filtering, such as: once a series of nucleotides has a low average Q score, remove all following nucleotides.
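The three strategies above can be sketched on a list of Q scores. This is an illustration only: the thresholds (`min_q=20`, `min_avg_q=25`, a window of 4) and the function names are hypothetical choices, and the example read is made up.

```python
def trim_at_first_low_q(q_scores, min_q=20):
    """Trimming: keep nucleotides before the first one with Q < min_q."""
    for i, q in enumerate(q_scores):
        if q < min_q:
            return q_scores[:i]
    return q_scores


def passes_average_q(q_scores, min_avg_q=25):
    """Filtering: keep the read only if its mean Q reaches min_avg_q."""
    return bool(q_scores) and sum(q_scores) / len(q_scores) >= min_avg_q


def trim_sliding_window(q_scores, window=4, min_avg_q=20):
    """Combination: cut at the start of the first window with mean Q < min_avg_q."""
    for start in range(len(q_scores) - window + 1):
        chunk = q_scores[start:start + window]
        if sum(chunk) / window < min_avg_q:
            return q_scores[:start]
    return q_scores


# Made-up read: one isolated low score early, then a low-quality tail.
read = [38, 38, 18, 37, 36, 30, 35, 12, 10, 8]

print(trim_at_first_low_q(read))   # cut at the isolated Q18
print(passes_average_q(read))      # whole-read average check
print(trim_sliding_window(read))   # tolerates the isolated Q18, cuts the tail
```

The single-nucleotide rule discards most of this read because of one isolated low score, while the windowed rule keeps going until quality drops for several bases in a row.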

Because common sequencing technologies produce lower-quality nucleotides near the end of the reads, trimming is common.
Illumina sequencing produces paired-end reads that can be joined. These joined reads are high quality on both ends, making filtering a better fit.
Average Q is a bad idea!

As discussed in Edgar & Flyvbjerg, 2015, the average Q score of a read is a very poor indicator of quality, because a simple average dramatically underestimates the number of errors predicted by the individual Q scores. Take this example from Edgar, shown below.


| Q scores in read | Avg. Q | Expected number of errors |
|---|---|---|
| 140 x Q35 + 10 x Q2 | 33 | 6.4! |
| 150 x Q25 | 25 | 0.5 |
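The numbers in the table can be checked with a few lines of Python. The sketch assumes the standard Phred relation $P = 10^{-Q/10}$ and takes the expected number of errors in a read as the sum of the per-base error probabilities; the function names are illustrative.

```python
def expected_errors(q_scores):
    """Sum of per-base error probabilities, P = 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in q_scores)


def average_q(q_scores):
    """Plain arithmetic mean of the Q scores."""
    return sum(q_scores) / len(q_scores)


# The two reads from the table.
read_a = [35] * 140 + [2] * 10
read_b = [25] * 150

for name, read in (("140 x Q35 + 10 x Q2", read_a), ("150 x Q25", read_b)):
    print(f"{name}: avg Q = {average_q(read):.1f}, "
          f"expected errors = {expected_errors(read):.2f}")
```

The first read has the higher average Q, yet its ten Q2 bases alone contribute more than six expected errors. The mean hides this because Q is a logarithmic scale.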


Expected Error filtering

Coming soon!