Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute.

Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute

QUALITY SCORES

Quality scores The sequencer outputs base calls at each position of a read It also outputs a quality value at each position – This relates to the probability that that base call is incorrect The most common Quality value is the Sanger Q score, or Phred score – Q sanger -10 * log 10 (p) – Where p is the probability that the call is incorrect – If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect – If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect – If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect Using the equation: – p=0.05, Q sanger = 13 – p=0.01, Q sanger = 20 – p=0.001, Q sanger = 30

For the geeks…. In R, you can investigate this: sangerq <- function(x) {return(-10 * log10(x))} sangerq(0.05) sangerq(0.01) sangerq(0.001) plot(seq(0,1,by=0.00001),sangerq(seq(0,1,by=0.00001)), type="l")

The plot

For the geeks…. And the other way round…. qtop <- function(x) {return(10^(x/-10))} qtop(30) qtop(20) qtop(13) plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

The important stuff Q30 – 1 in 1000 chance base is incorrect Q20 – 1 in 100 chance base is incorrect

QUALITY ENCODING

Quality Encoding Bioinformaticians do not like to make your life easy! Q scores of 20, 30 etc take two digits Bioinformaticians would prefer they only took 1 In computers, letters have a corresponding ASCII code: Therefore, to save space, we convert the Q score (two digits) to a single letter using this scheme

The process in full p (probability base is wrong) : 0.01 Q (-10 * log10(p)) : 30 Add 33 : 63 Encode as character : ? PQCode 0.0513. 0.01205 0.00130?

For the geeks…. code2Q <- function(x) { return(utf8ToInt(x)-33) } code2Q(".") code2Q("5") code2Q("?") code2P <- function(x) { return(10^((utf8ToInt(x)-33)/-10)) } code2P(".") code2P("5") code2P("?")

QC OF ILLUMINA DATA

FastQC FastQC is a free piece of software Written by Babraham Bioinformatics group http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Available on Linux, Windows etc Command-line or GUI

Read the documentation Follow the course notes

Per sequence quality One of the most important plots from FastQC Plots a box at each position The box shows the distribution of quality values at that position across all reads

Obvious problems

Less obvious problems

Really bad problems

Other useful plots Per sequence N content – May identify cycles that are unreliable Over-represented sequences – May identify Illumina adapters and primers

Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute.

Similar presentations

Presentation on theme: "Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute.

Similar presentations

Presentation on theme: "Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute."— Presentation transcript:

Similar presentations

About project

Feedback